Semantic Information for Multifunctional Plurilingual LExicons
SIMPLE is a project sponsored by EC DGXIII in the framework of the Language Engineering programme. This project represents the first attempt to develop wide-coverage semantic lexicons for a large number of languages (12), with a harmonised common model that encodes structured "semantic types" and semantic (subcategorisation) frames. Even though SIMPLE is a lexicon-building project, it also addresses challenging research issues and provides a framework for testing and evaluating the maturity of the current state of the art in lexical semantics grounded on, and connected to, a syntactic foundation.
Many theoretical approaches are currently tackling different aspects of semantics. However, such approaches have to be tested i) with wide-coverage implementations, and ii) with respect to their actual usefulness and usability in real-world systems, both monolingual and multilingual. The SIMPLE project addresses point i) directly, while providing the necessary platform to allow application projects to address point ii). SIMPLE is coherent with the strategic EC policy that aims at providing a core set of language resources for the EU languages.
SIMPLE should be considered a follow-up to the PAROLE project (see http://www.ilc.pi.cnr.it/) because it adds a semantic layer to a subset of the existing morphological and syntactic layers developed by PAROLE. The semantic lexicons (about 10,000 word meanings) are built in a harmonised way for the 12 PAROLE languages. These lexicons will be partially corpus-based, exploiting the harmonised and representative corpora built within PAROLE. In this way, the semantic encoding will respect actual corpus distinctions. The lexicons are designed with future cross-language linking in mind: they share and are built around the same core ontology and the same set of semantic templates. The "base concepts" identified by EuroWordNet (about 800 senses at a high level in the taxonomy) are used as a common set of senses, so that a cross-language link for all 12 languages is already provided automatically through their link to the EuroWordNet Interlingual Index (see http://www.let.uva.nl/~ewn).
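The linking idea described above, where senses in different languages become connected because each points to the same Interlingual Index record, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record identifiers (`ili-0001`), the sense labels, the three sample languages and the function names are hypothetical and do not reflect the actual SIMPLE or EuroWordNet data formats.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical data: language code -> {word sense label -> index record id}.
# The ids and sense labels are invented for this sketch.
LEXICONS: Dict[str, Dict[str, str]] = {
    "it": {"cane_1": "ili-0001", "libro_1": "ili-0002"},
    "es": {"perro_1": "ili-0001", "libro_1": "ili-0002"},
    "en": {"dog_1": "ili-0001"},
}


def build_index(lexicons: Dict[str, Dict[str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (language, sense) pairs by the index record they point to."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for lang in sorted(lexicons):
        for sense, record_id in sorted(lexicons[lang].items()):
            index[record_id].append((lang, sense))
    return dict(index)


def equivalents(lexicons: Dict[str, Dict[str, str]], lang: str, sense: str) -> List[Tuple[str, str]]:
    """Return the senses in other languages that share the same index record."""
    record_id = lexicons[lang][sense]
    return [(l, s) for (l, s) in build_index(lexicons)[record_id] if l != lang]


print(equivalents(LEXICONS, "it", "cane_1"))
```

No pairwise mapping between languages is stored here: the cross-language link falls out of the shared record, which is why linking each lexicon once to a common set of senses is enough to connect all of them.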
In the first stage of the project, the formal representation of the "conceptual core" of the lexicons was specified, i.e. the basic structured set of "meaning-types" (the SIMPLE ontology). This constitutes a common starting point on which to base the building of the language-specific semantic lexicons. The development of 12 harmonised semantic lexicons requires strong mechanisms for guaranteeing uniformity and consistency. The multilingual aspect translates into the need to identify elements of the semantic vocabulary for structuring word meanings which are both language-independent and able to capture linguistically useful generalisations for different NLP tasks.
The SIMPLE model is based on the recommendations of the EAGLES Lexicon/Semantics Working Group (http://www.ilc.pi.cnr.it/EAGLES96/rep2) and on extensions of Generative Lexicon theory. An essential characteristic is its ability to capture the various dimensions of word meaning. The basic vocabulary relies on an extension of "qualia structure" (cf. Pustejovsky 1995) for structuring the semantic/conceptual types as a representational device for expressing the multi-dimensional aspect of word meaning. The model has a high degree of generality in that it provides the same mechanisms for generating broad-coverage and coherent concepts independently of their grammatical/semantic category (entities, events, qualities, etc.).
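As a rough illustration of how qualia structure lets one entry carry several dimensions of meaning, the sketch below models a word sense with the four qualia roles of Generative Lexicon theory (formal, constitutive, telic, agentive; Pustejovsky 1995). The class name, field names, type label and sample values are assumptions chosen for the example, not the SIMPLE encoding format, and the extensions to qualia structure that SIMPLE introduces are not modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The four qualia roles of Generative Lexicon theory (Pustejovsky 1995).
QUALIA_ROLES = ("formal", "constitutive", "telic", "agentive")


@dataclass
class SemanticUnit:
    """Hypothetical container for one word sense (not the SIMPLE format)."""
    lemma: str
    semantic_type: str
    qualia: Dict[str, List[str]] = field(default_factory=dict)

    def add_relation(self, role: str, target: str) -> None:
        """Attach a value to one qualia role, rejecting unknown roles."""
        if role not in QUALIA_ROLES:
            raise ValueError(f"unknown qualia role: {role}")
        self.qualia.setdefault(role, []).append(target)

    def dimensions(self) -> List[str]:
        """List the roles that are filled, in canonical role order."""
        return [role for role in QUALIA_ROLES if self.qualia.get(role)]


# Illustrative entry: a book is made to be read (telic) and comes into
# being by writing (agentive). The type label is an invented placeholder.
book = SemanticUnit("book", "Information")
book.add_relation("telic", "read")
book.add_relation("agentive", "write")
print(book.dimensions())
```

The same container works for any category of entry, which mirrors the point made above that the mechanisms are independent of whether the item denotes an entity, an event or a quality.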
In order to combine the theoretical framework with the practical lexicographic task of lexicon encoding, we have created a common "library" of language-independent templates (see a sample), which act as "blueprints" for any given type, reflecting the conditions of well-formedness and providing constraints for lexical items belonging to that type. The relevance of this approach for building consistent resources is that types both provide the formal specifications and guide subsequent encoding, thus satisfying theoretical and practical methodological requirements.
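The role of a template as a "blueprint" that constrains entries of a given type can be sketched as a simple well-formedness check. This is only an assumption-laden illustration: the `Template` class, the slot names, the type label `Instrument` and the sample entry are hypothetical and are not taken from the actual SIMPLE template library.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Template:
    """Hypothetical blueprint: a type name plus the slots it makes obligatory."""
    type_name: str
    required_slots: Tuple[str, ...]


def check_entry(template: Template, entry: Dict[str, str]) -> List[str]:
    """Return a list of well-formedness problems; an empty list means the entry conforms."""
    problems: List[str] = []
    if entry.get("semantic_type") != template.type_name:
        problems.append(f"expected type {template.type_name}")
    for slot in template.required_slots:
        if not entry.get(slot):
            problems.append(f"missing obligatory slot: {slot}")
    return problems


# Invented template and entry, for illustration only.
instrument = Template("Instrument", ("formal", "telic"))
entry = {"lemma": "knife", "semantic_type": "Instrument", "formal": "artifact"}
print(check_entry(instrument, entry))
```

Because the same template both states what a well-formed entry looks like and reports what an encoder still has to fill in, it serves as formal specification and as encoding guide at once.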
The SIMPLE model thus contains three types of formal entities:
The SIMPLE model provides the formal specification for the representation and encoding of the following information (the items marked with an asterisk refer to information which is obligatorily encoded for every word sense):
The semantic types in SIMPLE form a general Ontology (see a sample), which is structured in such a way as to take into account the principles of orthogonal organization of types, as formalized in the Generative Lexicon. This allows the SIMPLE Ontology to be the initial core of a larger hierarchy of semantic types, which can overcome some of the well-known shortcomings of current ontologies developed for Language Engineering research and applications.
The hierarchy of types has been further subdivided into three layers: