LREC-2004 workshop: “a Registry of linguistic data categories within an integrated language resources repository area”
Data categories in
Lexical Markup Framework
OR how to lighten a model
Gil FRANCOPOULO AFNOR-INRIA email@example.com
Monte GEORGE ANSI firstname.lastname@example.org
Mandy PET ANSI-ORACLE email@example.com
Lexical Markup Framework (aka LMF) is a work in progress that aims to define an ISO norm
for human-oriented lexical databases and lexica for natural language
processing. The official name for LMF is ISO-24613. The
data categories (aka DC) we are dealing with in this paper describe
linguistic phenomena. These DC are defined and managed under the auspices of
the ISO-12620 revision by Laurent Romary (INRIA). We will see how the DC
ease the definition and use of various norms, and particularly of lexical models.
Traditionally, concerning linguistic constants, one of the following two strategies is applied:
In the first strategy, the lexical model defines the list of all the possible values for a certain type of information. For instance, /gender/ could be /masculine/, /feminine/ or /neuter/.
More precisely, there are two sub-strategies:
· define that /gender/ is /masculine/, /feminine/ or /neuter/ without any more details.
· define that /gender/ is /masculine/ or /feminine/ for French and /masculine/, /feminine/ or /neuter/ for German.
In the second strategy, the values are not listed at all. The model just states that the notion of gender exists.
The first strategy is applied in the GENELEX [Antoni-Lay] and EAGLES models, where the DTD contains all the possible values. The drawback of such an approach is that the DTD is necessarily huge and could be incomplete, especially for languages unknown to the model authors.
The advantage of the second strategy is that the model is simple and nothing is forgotten. Its drawback is that such a model is useless, as we will see in the following paragraphs.
For a lexical model, we can distinguish two criteria:
· The power of representation: what kind of data is the model able to represent? To which languages can the model be applied?
· The power of operation: is it possible to compare two words? How can a pick list be presented to the user of an interactive workstation? Is it possible to merge two LMF-conforming lexica?
The two criteria are somewhat contradictory: the more generic the approach, the more diverse the lexica that have to be merged.
Coming back to the second strategy, which avoids defining the possible values for gender, the power of representation is high but the power of operation is very low. Nothing prevents one lexicon from encoding gender as /m/ and /f/, another as /mas/ and /fem/, or, worse, from using /neuter/ for French. In such a situation, comparing words or merging various lexica are difficult operations and the norm becomes useless.
Let us look more closely at what merging is.
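The difficulty described above can be illustrated with a minimal Python sketch. The two lexica, their entries, their gender codes and the mapping tables are all invented for this example; they are not taken from any real resource or from the LMF specification.

```python
# Two hypothetical French lexica that encode gender with their own ad hoc codes.
lexicon_a = {"table": {"gender": "f"}, "livre": {"gender": "m"}}
lexicon_b = {"table": {"gender": "fem"}, "livre": {"gender": "mas"}}


def naive_conflicts(left, right):
    """Return the common words whose gender values differ as raw strings."""
    return sorted(
        word
        for word in left.keys() & right.keys()
        if left[word]["gender"] != right[word]["gender"]
    )


# Without shared values, a merge needs one hand-written mapping per lexicon.
# These mappings are assumptions made for this example only.
to_shared_a = {"m": "masculine", "f": "feminine"}
to_shared_b = {"mas": "masculine", "fem": "feminine"}


def normalise(lexicon, mapping):
    """Rewrite each entry so that its gender uses the shared value names."""
    return {
        word: {**features, "gender": mapping[features["gender"]]}
        for word, features in lexicon.items()
    }


raw_conflicts = naive_conflicts(lexicon_a, lexicon_b)
shared_conflicts = naive_conflicts(
    normalise(lexicon_a, to_shared_a), normalise(lexicon_b, to_shared_b)
)
print(raw_conflicts)     # every common word looks like a conflict
print(shared_conflicts)  # no conflict once both lexica use the same values
```

Compared as raw strings, the two lexica appear to disagree on every word, although they describe the same facts. The mapping tables are exactly the per-lexicon effort that a shared set of values is meant to avoid.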
Merging can take various forms such as the following use cases:
Situation: Multilingual lexicon in N languages
Goal: Add 1 new language to this lexicon
Situation: Monolingual lexicon in language L
Goal: Add words in language L
Situation: Multilingual lexicon in N languages
Goal: Add missing translations
Let us add that merging is a frequent operation and a heavy burden for the lexicon manager.
The solution is not easy. We must represent existing data, and given the growth of multilingual databases and the variety of formats in use, merging seems to be the most demanding operation.
There is another point to be mentioned. This problem is not specific to lexicon management. The gender definition is shared by other processes such as text annotation and feature structures.
That means that:
· It is not very wise to duplicate the effort in various norms.
· Text annotation, feature structure coding and lexical representation are not independent processes. In parsing, for instance, the information extracted from the lexicon is transferred to annotations or feature structures, so there is a danger of producing different (and therefore incompatible) values.
The solution is to define data categories in a separate norm. These values will then be shared by the lexicon, annotation and feature structure norms. And of course other future norms could fit into this architecture.
The data categories are not only constants, such as /masculine/ being preferred to /m/ or /mas/; they are also defined according to the language processed.
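As a rough illustration of this sharing, the following Python sketch lets a lexicon entry and a text annotation draw their gender values from one common set. The set, the function names and the entry and annotation structures are hypothetical; they only stand in for the separate norm and are not part of ISO-12620 or LMF.

```python
# Hypothetical shared set of gender values, standing in for the separate norm.
SHARED_GENDER_VALUES = frozenset({"masculine", "feminine", "neuter"})


def check_gender(value):
    """Return the value unchanged if it belongs to the shared set, else raise."""
    if value not in SHARED_GENDER_VALUES:
        raise ValueError(f"unknown gender value: {value!r}")
    return value


def lexicon_entry(lemma, gender):
    """Build a minimal lexicon entry (structure invented for this sketch)."""
    return {"lemma": lemma, "gender": check_gender(gender)}


def annotate(token, entry):
    """Copy the gender of a lexicon entry into a token annotation."""
    return {
        "token": token,
        "lemma": entry["lemma"],
        "gender": check_gender(entry["gender"]),
    }


entry = lexicon_entry("Haus", "neuter")
annotation = annotate("Hauses", entry)
print(annotation)
```

Because both functions validate against the same set, a value accepted by the lexicon is by construction acceptable to the annotation, and an ad hoc code such as "n" is rejected at the point where it is introduced.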
More precisely each feature will be defined as a tree. The top node is /gender/ for instance. One level below, we have /french/ and the possible values are /masculine/ and /feminine/. At the same level as /french/, we have /german/ and the possible values are /masculine/, /feminine/ and /neuter/.
For an unknown language, the possible values are the union of all values extracted from all languages.
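The tree and the rule for unknown languages can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary layout, the name `DATA_CATEGORIES`, the function `allowed_values` and the language label "x-unknown" are invented here, and the content is limited to the /gender/ example given in the text.

```python
# Hypothetical fragment of a data category tree: feature -> language -> values.
# A real registry would cover many more features and languages.
DATA_CATEGORIES = {
    "gender": {
        "french": {"masculine", "feminine"},
        "german": {"masculine", "feminine", "neuter"},
    },
}


def allowed_values(feature, language):
    """Return the values allowed for a feature in a given language.

    For a language absent from the tree, fall back to the union of the
    values recorded for all known languages.
    """
    per_language = DATA_CATEGORIES[feature]
    if language in per_language:
        return set(per_language[language])
    return set().union(*per_language.values())


print(sorted(allowed_values("gender", "french")))
print(sorted(allowed_values("gender", "x-unknown")))
```

A pick list for French would thus offer two values, while a lexicon for a language not yet described in the tree would be offered all three.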
As can be seen, the number of values is quite large. A management tool is needed in order to ease data category search and selection. Such a tool is provided by INRIA under the auspices of the Syntax project.
The process used is similar to that of TMF (aka Terminological Markup Framework), which is the ISO norm for thesauri [Romary].
Data categories are located at the lower level of the TC37 family of norms as sketched in the following diagram.
The four norms are all based on data categories, so each norm is light, non-redundant and can interoperate with the others.
Like the other norms of the family, the baseline for LMF is to:
· Concentrate on structuring the elements and linking elements together.
· Relegate language idiosyncrasies to an external and shared norm: ISO-12620.
As we have seen, LMF is part of a more global ISO move in order to define a set of coherent norms based on data categories.
Antoni-Lay M-H., Francopoulo G. and Zaysser L. 1994
A generic model for reusable lexicons: The GENELEX project.
Literary and Linguistic Computing 9(1): 47-54.
Romary L. 2001
Towards an Abstract Representation of Terminological Data Collections – the TMF model.