SIMPLE: Harmonised Semantic Lexicons for the European languages

SIMPLE

Semantic Information for Multifunctional Plurilingual LExicons

An Overview

SIMPLE is a project sponsored by EC DGXIII in the framework of the Language Engineering programme. This project represents the first attempt to develop wide-coverage semantic lexicons for a large number of languages (12), with a harmonised common model that encodes structured "semantic types" and semantic (subcategorisation) frames. Even though SIMPLE is a lexicon building project, it also addresses challenging research issues and provides a framework for testing and evaluating the maturity of the current state of the art in lexical semantics grounded on, and connected to, a syntactic foundation.

Many theoretical approaches are currently tackling different aspects of semantics. However, such approaches have to be tested i) with wide-coverage implementations, and ii) with respect to their actual usefulness and usability in real-world systems both of mono- and multi-lingual nature. The SIMPLE project addresses point i) directly, while providing the necessary platform to allow application projects to address point ii). SIMPLE is coherent with the strategic EC policy that aims at providing a core set of language resources for the EU languages.

SIMPLE should be considered a follow-up to the PAROLE project (see http://www.ilc.pi.cnr.it/) because it adds a semantic layer to a subset of the existing morphological and syntactic layers developed by PAROLE. The semantic lexicons (about 10,000 word meanings) are built in a harmonised way for the 12 PAROLE languages. These lexicons will be partially corpus-based, exploiting the harmonised and representative corpora built within PAROLE. In this way, the semantic encoding will respect actual corpus distinctions. The lexicons are designed with future cross-language linking in mind: they share and are built around the same core ontology and the same set of semantic templates. The "base concepts" identified by EuroWordNet (about 800 senses at a high level in the taxonomy) are used as a common set of senses, so that a cross-language link for all the 12 languages is already provided automatically through their link to the EuroWordNet Interlingual Index (see http://www.let.uva.nl/~ewn).
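The linking through the Interlingual Index can be illustrated with a minimal sketch. It assumes that each word sense is stored as a (language, lemma, ILI identifier) triple; the identifiers "ILI-0001" and "ILI-0002" and the function name are invented for illustration and are not part of the SIMPLE or EuroWordNet specifications.

```python
from collections import defaultdict


def cross_language_links(entries):
    """Group word senses by the (hypothetical) ILI record they point to.

    entries: iterable of (language, lemma, ili_id) triples.
    Returns only the ILI records shared by at least two languages.
    """
    by_ili = defaultdict(dict)
    for language, lemma, ili_id in entries:
        by_ili[ili_id].setdefault(language, []).append(lemma)
    return {ili: langs for ili, langs in by_ili.items() if len(langs) > 1}


# Invented sample data: three languages share one base concept.
senses = [
    ("en", "knife", "ILI-0001"),
    ("it", "coltello", "ILI-0001"),
    ("fr", "couteau", "ILI-0001"),
    ("en", "dog", "ILI-0002"),
]
links = cross_language_links(senses)
```

No pairwise alignment between lexicons is needed in this sketch: two senses count as linked simply because both point to the same index record.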

The model

In the first stage of the project, the formal representation of the "conceptual core" of the lexicons was specified, i.e. the basic structured set of "meaning-types" (the SIMPLE ontology). This constitutes a common starting point on which to base the building of the language specific semantic lexicons. The development of 12 harmonised semantic lexicons requires strong mechanisms for guaranteeing uniformity and consistency. The multilingual aspect translates into the need to identify elements of the semantic vocabulary for structuring word meanings which are both language independent and able to capture linguistically useful generalisations for different NLP tasks.

The SIMPLE model is based on the recommendations of the EAGLES Lexicon/Semantics Working Group (http://www.ilc.pi.cnr.it/EAGLES96/rep2) and on extensions of Generative Lexicon theory. An essential characteristic is its ability to capture the various dimensions of word meaning. The basic vocabulary relies on an extension of "qualia structure" (cf. Pustejovsky 1995) for structuring the semantic/conceptual types as a representational device for expressing the multi-dimensional aspect of word meaning. The model has a high degree of generality in that it provides the same mechanisms for generating broad-coverage and coherent concepts independently of their grammatical/semantic category (entities, events, qualities, etc.).

In order to combine the theoretical framework with the practical lexicographic task of lexicon encoding, we have created a common "library" of language independent templates, which act as "blueprints" for any given type - reflecting the conditions of well-formedness and providing constraints for lexical items belonging to that type. The relevance of this approach for building consistent resources is that types both provide the formal specifications and guide subsequent encoding, thus satisfying theoretical and practical methodological requirements.

The SIMPLE model thus contains three types of formal entities:

SemU - word senses are encoded as Semantic Units (SemUs). Each SemU is assigned a semantic type plus other sorts of information which are intended to identify a word sense, and to discriminate it from the other senses of the same lexical item. SemUs are language specific. SemUs which identify the same sense in different languages will be assigned the same semantic type.
(Semantic) Type - the semantic type which is assigned to SemUs. Each type involves structured information, organized in the four Qualia Roles adopted in the Generative Lexicon framework. The Qualia information is divided into type-defining information and additional information. The former is information which intrinsically defines a semantic type: a SemU cannot be assigned a certain type unless its semantic content includes the information that defines that type. Additional information, on the other hand, specifies further semantic components of a SemU without entering into the characterization of its semantic type.
Template - a schematic structure which the lexicographer uses to encode a given lexical item. The template expresses the semantic type, plus additional information, e.g. domain, semantic class, gloss, predicative representation, argument structure, polysemous classes, etc. Templates are intended to guide, harmonize, and facilitate the lexicographic work. A set of top templates has been prepared during the specification phase, while more specific ones will eventually be elaborated by the different partners as the need arises to encode more specific concepts in a given language.
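The relation between the three entities above can be sketched as a small data model. This is only an illustrative sketch, not the SIMPLE encoding format: the class and function names, the example type "Instrument" and its qualia values are assumptions made for the example. The four role names are those of Generative Lexicon theory (formal, constitutive, telic, agentive).

```python
from dataclasses import dataclass, field

# The four qualia roles of Generative Lexicon theory (Pustejovsky 1995).
QUALIA_ROLES = ("formal", "constitutive", "telic", "agentive")


@dataclass
class SemanticType:
    """A language-independent type; only type-defining qualia are stored here."""
    name: str
    type_defining: dict = field(default_factory=dict)  # qualia role -> value


@dataclass
class SemU:
    """A language-specific word sense (Semantic Unit)."""
    lemma: str
    language: str
    semantic_type: str
    qualia: dict = field(default_factory=dict)  # defining + additional information


def can_assign(qualia: dict, sem_type: SemanticType) -> bool:
    """True if the qualia include everything that defines sem_type."""
    return all(qualia.get(role) == value
               for role, value in sem_type.type_defining.items())


@dataclass
class Template:
    """Blueprint the lexicographer fills in for one semantic type."""
    sem_type: SemanticType

    def encode(self, lemma: str, language: str, **additional) -> SemU:
        unknown = set(additional) - set(QUALIA_ROLES)
        if unknown:
            raise ValueError(f"not a qualia role: {sorted(unknown)}")
        # Type-defining values take precedence over additional information.
        qualia = {**additional, **self.sem_type.type_defining}
        return SemU(lemma, language, self.sem_type.name, qualia)


# Invented example type and values, for illustration only.
INSTRUMENT = SemanticType("Instrument", {"formal": "artifact", "telic": "used_for"})
template = Template(INSTRUMENT)
knife = template.encode("knife", "en", constitutive="has_blade")
coltello = template.encode("coltello", "it")
```

In this sketch the English and Italian SemUs receive the same semantic type because they are encoded from the same template, while the constitutive value on "knife" is additional information that plays no part in the type assignment.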

The SIMPLE model provides the formal specification for the representation and encoding of the following information (the items marked with an asterisk are obligatorily encoded for every word sense):

Semantic type (*)
Domain information (*)
Gloss (*)
Argument structure (*)
Selectional restrictions on the arguments (*)
Event type for verbs (*)
Link of the arguments to the syntactic subcategorization frames, as represented in the PAROLE lexicons (*)
Type hierarchy information
Qualia information, in terms of both features and relations between SemUs
Information about the regular polysemous alternations in which a word sense may enter
Information concerning cross-part of speech relations (e.g. "intelligent" - "intelligence"; "writer" - "to write")
Possible collocations from the corpus
Synonymy relations
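The distinction between obligatory and optional information in the list above lends itself to a simple completeness check. The following is a minimal sketch, assuming that an entry is stored as a dictionary; the key names are invented for illustration, and the event type is required only when the entry is a verb, as stated in the list.

```python
# Hypothetical key names for the items marked with an asterisk.
OBLIGATORY = (
    "semantic_type",
    "domain",
    "gloss",
    "argument_structure",
    "selectional_restrictions",
    "syntactic_links",  # links to the PAROLE subcategorization frames
)
VERB_ONLY_OBLIGATORY = ("event_type",)


def missing_obligatory(entry: dict) -> list:
    """Return the obligatory fields that are absent from an entry.

    A field counts as missing when its key is absent or its value is None;
    an empty list is accepted as an explicit "nothing to encode" value.
    """
    required = list(OBLIGATORY)
    if entry.get("pos") == "verb":
        required.extend(VERB_ONLY_OBLIGATORY)
    return [name for name in required if entry.get(name) is None]


# Invented sample entries.
noun_entry = {
    "pos": "noun", "semantic_type": "Instrument", "domain": "general",
    "argument_structure": [], "selectional_restrictions": [],
    "syntactic_links": [],
}
verb_entry = dict(noun_entry, pos="verb", semantic_type="Act",
                  gloss="to cut something")
```

Optional items such as synonymy relations or collocations are deliberately ignored by the check.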

The semantic types in SIMPLE form a general Ontology, which is structured in such a way as to take into account the principles of orthogonal organization of types, as formalized in the Generative Lexicon. This allows the SIMPLE Ontology to be the initial core of a larger hierarchy of semantic types, which can overcome some of the well-known shortcomings of current ontologies developed for Language Engineering research and applications.

The hierarchy of types has been further subdivided into three layers:

The Core Ontology - it is formed by those types which have been identified as the central and common ones for the construction of the different lexicons in SIMPLE. The types of the Core Ontology have been selected according to the following criteria:

Their central position in the organization of the lexicon;
The fact that they are widely acknowledged in the linguistic and NLP literature as core notions for the semantic characterization of words;
The low level of granularity of the semantic description they provide, which also ensures their multilingual usability. Therefore, the elements of the Core Ontology represent the highest nodes in the hierarchy of types.

Recommended Ontology - this is formed by more specific types (lower nodes in the hierarchy), which provide a more granular organization of the word-senses.
Language Specific types - more detailed types may be created in order to organize a lexicon for language-, domain- or application-specific needs. These types are not provided in the specification phase, and can be added later, provided that their elaboration is consistent with the organization of the rest of the SIMPLE model.
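The three layers can be pictured with a small sketch of a layered type hierarchy. The consistency rule used here (a type may not belong to a more general layer than its parent, and only Core types may be top nodes) is an assumption made for the example, one possible reading of "consistent with the organization of the rest of the model". The type names are likewise invented for illustration.

```python
from enum import IntEnum


class Layer(IntEnum):
    CORE = 0
    RECOMMENDED = 1
    LANGUAGE_SPECIFIC = 2


class Ontology:
    """A type hierarchy in which every type belongs to one of three layers."""

    def __init__(self):
        self._parent = {}
        self._layer = {}

    def add_type(self, name, layer, parent=None):
        if name in self._layer:
            raise ValueError(f"type already defined: {name}")
        if parent is None:
            # Assumed rule: only Core types may sit at the top of the hierarchy.
            if layer != Layer.CORE:
                raise ValueError("top nodes must belong to the Core Ontology")
        else:
            if parent not in self._layer:
                raise ValueError(f"unknown parent type: {parent}")
            # Assumed rule: a child may not be in a more general layer than its parent.
            if layer < self._layer[parent]:
                raise ValueError(f"{name} cannot be placed under {parent}")
        self._parent[name] = parent
        self._layer[name] = layer

    def ancestors(self, name):
        """Return the chain of supertypes, nearest first."""
        chain = []
        current = self._parent[name]
        while current is not None:
            chain.append(current)
            current = self._parent[current]
        return chain


# Invented example hierarchy.
onto = Ontology()
onto.add_type("Entity", Layer.CORE)
onto.add_type("Artifact", Layer.CORE, parent="Entity")
onto.add_type("Instrument", Layer.RECOMMENDED, parent="Artifact")
onto.add_type("Cutlery", Layer.LANGUAGE_SPECIFIC, parent="Instrument")
```

Under these assumptions a partner can extend the hierarchy downwards with language-specific types, while an attempt to attach a Core type beneath a language-specific one is rejected.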