EUREKA PROJECT GENELEX

Report on the

MORPHOLOGICAL LAYER

GENELEX Consortium

 

 

 

 

Version 3.3

November 2, 1994

 

 

 

 

TABLE OF CONTENTS

 

A. PRELIMINARIES 1

1. Generality of Format 3

2. Principle of Linguistic Realism 3

3. Recommendations or Warnings 4

4. Document for General Usage 4

5. Reader's Guide 5

6. Vocabulary List 5

B. ANALYSIS OF MORPHOLOGICAL UNITS 9

I. Morphological Units 11

1. Definition 11

1.1. General Definition 11

1.2. Criteria for Splitting Morphological Units 15

1.2.1. Grammatical Category 16

1.2.2. Gender 18

1.2.3. Number 19

1.2.4. Meaning 20

1.2.5. Function 21

1.3. Conclusion on the Criteria for Splitting 21

1.3.1. Etymology 21

1.3.2. Derivation 22

1.4. Autonomous vs. Non-autonomous Morphological Units 23

2. Simple Morphological Units 24

3. Affix Morphological Units 24

4. Derivation 25

4.1. Foreword 25

4.2. Definition 26

4.3. Structural Ambiguity - Concurrent Derivations 28

4.4. Trees of N-ary Derivation 29

4.5. Systems of Inflection and Derived Categories 31

Prefixing 31

Suffixing 31

4.6. References to Ums, Umgs/Umps and Radicals 32

4.7. Derivation Levels 34

5. Compound Morphological Units 35

5.1. Foreword 35

5.2. Definition 35

5.3. Description 36

5.4. Observations 38

6. Short Forms 39

7. Contracted Morphological Units 40

II. Phonetic and Graphic Forms 41

1. Introduction 41

2. Graphic Form 42

2.1. Graphic Morphological Units: Free Variants 43

2.2. Free Graphic Variants of a Compound 44

2.3. Graphic Radicals : Combinatory Variants 45

3. Phonetic Form 47

3.1. Phonemic Morphological Units : Free Variants 48

3.2. Free Phonemic Variants of a Compound 48

3.3. Phonemic Radicals : Combinatory Variants 48

4. Structural Symmetry - Graphic System / Phonetic System 49

III. Grammatical Categories 51

1. General Points 51

2. Category: Determiner 51

3. Noun-Adjective Distinction 54

4. The Status of Past Participles 55

5. Grammatical Subcategories 56

IV. Morphological Features 59

1. General Points 59

2. Combination of Morphological Features 59

2.1. General Points 59

2.2. Relation between Combination of Morphological Features and Grammatical Category 60

3. Gender 61

3.1. Gender of Nouns 61

3.1.1. Gender of animate beings 61

3.1.2. Gender of Inanimate Objects 62

3.2. Gender of Adjectives 63

3.3. Gender of Determiners 65

3.4. Gender of Past Participles 65

3.5. Gender of Pronouns 67

3.6. Gender of Possessives (Determiners, Adjectives or Pronouns) 68

4. Number 69

4.1. Number of Nouns 69

4.2. Number of Adjectives 70

4.3. Number of Determiners 70

4.4. Number of Verbs 71

4.5. Number of Past Participles 71

4.6. Number of Pronouns 71

4.7. Number of Possessives (Determiners, Adjectives, or Pronouns) 72

5. Mood 73

6. Tense 73

7. Person 74

7.1. Person of Pronouns 74

7.2. Person of Verbs 74

V. Inflected Forms 75

1. Definition 75

2. Inflectional System of Simple Words 76

2.1. General Points 76

2.2. Inflection of Determiners and Pronouns 77

2.3. Derivation of Inflected Forms 78

2.3.1. Addition of an affix to a radical 78

2.3.2. Removal or addition of characters 79

2.4. Inflectional variants 80

3. System of Inflection of Compounds 82

3.1. CombTM of compound different from CombTM of the head component 83

3.2. CombTM of compound identical to CombTM of the head component 84

3.3. Constant component 85

3.4. Free Variation of a Component 85

3.5. Complex Cases 86

3.6. Prepositional and adverbial components 87

3.7. Contextual variant of a component 88

3.8. Recursion in Morphological Compositions 88

VI. Etymology 91

C. USER'S MANUAL 93

1. Introduction 95

2. Morphological Unit (Um) 96

2.1. Splitting Morphological Units 96

2.2. Relations between Um and Umg 97

2.3. Relations between Um and Ump 98

2.4. Relations between Ums 99

3. Combination of Morphological Features (CombTM) 100

4. Graphic Morphological Unit (Umg) 100

5. System of Written Inflection (Mfg) 101

6. Computation of Written Forms (Cff) 102

6.1. Two types of representation 102

6.2. Relations between Mfg, CombTM and Cff 103

7. Phonemic Morphological Unit (Ump) 104

8. System of Phonemic Inflection (Mfp) 104

9. Computation of Phonemic Inflected Forms (Cff) 105

9.1. Two types of Representation 105

9.2. Relations between Mfp, CombTM, and Cff 106

9.3. Relations between Graphic and Phonemic Forms 106

10. System of Inflection of Compounds (Mfc) 107

D. ENTITY-RELATION DIAGRAMS 109

1. The Morphological Unit and its different subclasses 111

1.1. Simple Morphological Unit 112

1.2. Affix Morphological Unit 113

1.3. Compound Morphological Unit 114

1.4. Contracted Morphological Unit 115

1.5. Derivation 116

2. Examples of Morphological Units: "chaise", "boulanger", "dentiste", "interface", "leitmotiv", "amour", "fiançailles", "chibouk/que", "afur/affure", "arol/arole/arolle" 117

3. Derivations of "dénationalisation" 128

4. Inflected Variants for "concerto" 129

E. DTD SGML 131

1. Translation of the Conceptual Model 133

2. DTD Commented 134

2.1. DTD genelex.dtd 134

2.2. DTD morpho.dtd 136

2.3. Entities morpho.ent 142

2.4. Constraints morpho.ctr 143

3. Examples of marked data 144

3.1. Morphological unit for "chaise" 144

3.2. Morphological unit for "boulanger" 144

3.3. Morphological unit for "dentiste" 145

3.4. Morphological unit for "une interface" 145

3.5. Morphological unit for "leitmotiv" 146

3.6. Morphological unit for "amour" 146

3.7. Morphological unit for "fiançailles" 146

3.8. Morphological unit for "chibouk/que" 147

3.9. Derivation of "dénationalisation" 147

3.10. Inflected variations for "concerto" 148

A. PRELIMINARIES

 

 

This report presents the work accomplished by all partners of the GENELEX France consortium for the definition of a model describing the morphological aspects of an electronic dictionary.

 

1. Generality of Format

The issue involved in the development of GENELEX (GENEric LEXicon) is the generality of its format, which covers the following aspects:

maximum coverage: to take into account, for a given data entry, the maximum amount of non-redundant linguistic information. Information that can be deduced systematically from other linguistic information will not be recorded. In addition, the information to be recorded must not be limited by the needs of specific applications.

maximum portability: to be able to support various types of implementations, the GENELEX model must be a semantic and not a physical model of data. GENELEX pursues exactly the same objectives as the Text Encoding Initiative. The realisation of the GENELEX semantic data model was therefore conceived independently of physical models, permitting our partners to choose different physical models for the implementation of GENELEX dictionaries.

minimum discrimination: this project strives not to divide but to unite. We have therefore sought to remain as independent as possible of any particular linguistic theory. For this reason, the linguistic aspect has been reduced to accommodate the needs of all users, who may then, with formal rules of translation and/or deduction, find the information of interest in a GENELEX dictionary.

2. Principle of Linguistic Realism

All of our partners distinguish between linguistic objects (words), the semantic model, and the physical model of data. We have favored the strategy that seeks the greatest isomorphism -- or the smallest distance -- between linguistic objects and objects of the semantic model, without concern for the physical model. We call this approach the principle of linguistic realism.

3. Recommendations or Warnings

In the description of the morphological layer, we have included both recommendations and warnings so that potential users of the GENELEX format may make the best use of it. The GENELEX format provides quality assurance, especially in the transcription of an editorial dictionary to an electronic version or when merging several electronic dictionaries.

4. Document for General Usage

This document addresses all potential users of the GENELEX format: lexicographers, linguists, computer scientists and decision-makers. We have presented the issues at stake and justified our choices. Certain points may therefore appear elementary and perhaps even useless. We apologise in advance and ask for your patience and understanding, as our document was conceived for general usage.

5. Reader's Guide

Certain conventions have been adopted in this document which should aid the reader's comprehension.

Capitalised words indicate that the concept belongs to the GENELEX model; however, it is often a concept known outside this specific domain. For example, we may refer to a noun in the general sense, or we may refer to a Noun as a potential value for the attribute, Grammatical Category. In some cases the capitalised concepts may be followed by an acronym used in future references: Morphological Unit (Um).

This document presents the GENELEX model as instantiated for the French language; the examples are therefore necessarily in French. They are often accompanied by an annotated translation in English. The annotations refer to concepts, such as gender and number, that may be obvious to a French speaker because of morphological clues but which may escape an English reader who is unfamiliar with French.

In general, the French text appears in italics, and its English translation in reduced normal fonts. Certain morphological aspects may be highlighted in bold.

Ex: Rien n'est facile dans la vie.

Nothing is easy in life.

The examples will be presented as follows: for a Morphological Unit, the lemmatised form is in bold separated from the Inflected Forms by a colon. The Inflected Forms appear in italics and are set off by curly brackets. The Grammatical Category appears in the right-hand margin.

Each French example is followed by its English translation, which may include the semantic translation, the literal translation, or both. The translation appears in a smaller, normal font and is annotated to reflect the gender and number of nouns and adjectives, and the conjugation of verbs.

Ex: cheval: {cheval, chevaux} N

horse: {ms, mp}

6. Vocabulary List

The key words of the GENELEX model, such as the names of relations, entities, fields and their values, are expressed in French. The following is a list of the GENELEX vocabulary as it appears in the original model, or simply in the text, with its English equivalent.

Adj adjective

Adjectif adjective

Adv adverb

Adverbe adverb

Affixe affix

Agglutiné contracted

Archaïque archaic

Argotique slang

Attestation citation

Autonome autonomous

Cff calculation of graphical/phonemic inflected form

Cardinal cardinal

Catgram grammatical category

Catégorie grammaticale grammatical category

Combcpose combination of morphological features for a compound

Combcposant combination of morphological features for a component

CombTM combination of morphological features

Commun common

Comparatif comparative

Comparatif_Egalite comparative of equal degree

Comparatif_Inferiorite comparative of lesser degree

Complétif completive

Composé compound

Composant component

Conditionnel conditional

Conj conjunction

Contexte_var var(iant) context

Coordination coordination

Copule copula

Courant common (frequency)

Datation dating

Défini definite

Démonstratif demonstrative

Dét determiner

Déterminant determiner

Espace space

Etymologie etymology

Exclamatif exclamative

Familier informal

Féminin feminine

Flexion inflection

Fonction function

Forme brève short form

Futur future

Génitif genitive

Genre gender

Graphie graphic form

Graphique graphic, written

Imparfait imperfect

Impersonnel impersonal

Indéfini indefinite

Indicatif indicative

Infinitif infinitive

Infinitive infinitive clause

Interjection interjection

Interrogative interrogative

Libellé spelling (graphic or phonemic)

Littéraire literary

Masculin masculine

Mfc system of inflection for compounds

Mfg system of graphical inflection

Mfp system of phonemic inflection

Mode mood

Mode de flexion system of inflection

Moderne modern

Modifieur modifier

N noun

Neutre neuter

Niveaulgue style (language)

Nom noun

Nombre number

NombrePosseur Number-Possessor

Obligatoire mandatory

Objet object

Objet_direct direct object

Ordinal ordinal

Ordre-linéaire linear order

Participe participle

Particule particle

Partitif partitive

Passé past

Passé simple past historic

Passif passive

Personne person

Personnel personal

Personnel_faible weak personal (pronoun)

Personnel_fort strong personal (pronoun)

Phonie phonetic form

Phonémique phonemic

Pluriel plural

Pluriel_Posseur plural possessor

Populaire vernacular

Possessif possessive

Préfixe prefix

Prép preposition

Préposition preposition

Présent present

Pronominal pronominal

Propre proper

ProRel relative pronoun

Qualificatif qualifier, qualifying (pronoun)

Quantité quantity

Relatif relative

Retrait remove

Retraitg graphical removal (removalg)

Retraitp phonemic removal (removalp)

Savant scholarly

Séparg graphic separator

Séparp phonemic separator

Simple simple

Singulier singular

Singulier_Posseur singular possessor

Sous-categorie subcategory

Sscatmorph morphological subcategory

Sscatgram grammatical subcategory

Statut status

Subjonctif subjunctive

Subordination subordination

Subordonné subordinate

Suffixe suffix

Sujet subject

Superlatif superlative

Superlatif_absolu absolute superlative

Superlatif_inferiorite superlative of lesser degree

Superlatif_superiorite superlative of greater degree

Temps (adverb of) time

Temps tense (of a verb)

Tête head

Tiret hyphen

Trait feature

Typaff affix type

Typebref short form type

Um (Unité Morphologique) morphological unit

Um_Aff affix morphological unit

Um_Agg contracted morphological unit

Um_C compound morphological unit

Umg graphic morphological unit

Ump phonemic morphological unit

Vargeog geographical variant

Vedette preferred (form)

Verbe verb

Vieilli dated

Vulgaire vulgar

 

 

B. ANALYSIS OF MORPHOLOGICAL UNITS

 

I. Morphological Units

1. Definition

1.1. General Definition

A Morphological Unit (Um) is a grouping of words based on morphological properties.

An uninflected Morphological Unit is defined by associating a written form (with its variants) with a Grammatical Category.

Ex: de: {de, d'} Prep

An inflected Morphological Unit is defined by an equivalence class of Inflected Forms and by a Grammatical Category (refer to the paragraph on Inflected Forms).

Ex: {cheval, chevaux} N

horse {ms, mp}

{cuillerée, cuillerées} N

spoonful {fs, fp}

{peau-rouge, peau-rouge, peaux-rouges, peaux-rouges} N

Indian (lit. red skin) {ms, fs, mp, fp}

This Morphological Unit is identified by its graphic and/or phonemic lemma. The lemmatised form is the singular form if there is a variation in number, the masculine form if there is a variation in gender, and the infinitive for all verbs.

Ex: cheval : {cheval, chevaux} N

horse (ms): {horse (ms), horses (mp)}

cuillerée : {cuillerée, cuillerées} N

spoonful (fs): {spoonful (fs), spoonfuls (fp)}

peau-rouge : {peau-rouge, peau-rouge,

peaux-rouges, peaux-rouges} N

red skin (ms): {ms, fs, mp, fp}

The exception, of course, is when the specified form does not exist: certain Nouns or Adjectives are defective in the singular or in the masculine.

Ex: abats : {abats} N

giblets : {mp}

deux : {deux} Adj

two : {mfp}

vaticane : {vaticane} Adj

Vatican : {f}
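The lemma-selection rule above can be sketched in a few lines of Python. This is only an illustration, not part of the GENELEX model: the class names, the field names and the choice to test number before gender are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InflectedForm:
    spelling: str
    gender: str = ""  # "m", "f", "mf", or "" when not specified
    number: str = ""  # "s", "p", or "" when not specified


@dataclass
class MorphologicalUnit:
    category: str               # e.g. "N", "Adj", "V"
    forms: List[InflectedForm]
    infinitive: str = ""        # used only for verbs

    def lemma(self) -> str:
        # Verbs are lemmatised by their infinitive.
        if self.category == "V" and self.infinitive:
            return self.infinitive

        def rank(form: InflectedForm):
            # Singular before plural, then masculine before feminine.
            # Testing number first is an assumption of this sketch.
            return (form.number == "p", form.gender == "f")

        # When the unit is defective (no singular, no masculine),
        # min() simply returns the best form that does exist.
        return min(self.forms, key=rank).spelling


cheval = MorphologicalUnit("N", [InflectedForm("chevaux", "m", "p"),
                                 InflectedForm("cheval", "m", "s")])
abats = MorphologicalUnit("N", [InflectedForm("abats", "m", "p")])
vaticane = MorphologicalUnit("Adj", [InflectedForm("vaticane", "f")])
rire = MorphologicalUnit("V", [InflectedForm("ris"), InflectedForm("rit")],
                         infinitive="rire")

print(cheval.lemma(), abats.lemma(), vaticane.lemma(), rire.lemma())
```

The defective units abats and vaticane need no special case: the ranking only orders the forms that are actually present.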

Based on equivalence classes of Inflected Forms, GENELEX distinguishes:

i. the Inflected Forms of lemmatised forms

Ex: fraiseur : {fraiseur, fraiseuse,

fraiseurs, fraiseuses} N

milling machine operator: {ms, fs, mp, fp}

ii. Inflected Forms that are homographs of the lemmatised form of another equivalence class

Ex: fraiseuse : {fraiseuse, fraiseuses} N

milling machine: {fs, fp}

fraiseur : {fraiseur, fraiseuse,

fraiseurs, fraiseuses} N

milling machine operator: {ms, fs, mp, fp}

favoris : {favoris} N

sideburns : {mp}

favori : {favori, favorite, favoris, favorites} N

favorite: {ms, fs, mp, fp}

iii. the homographs with or without an equivalence class of Inflected Forms

Ex: vers : 0 Adv

towards, about

vers : {vers, vers} N

verse : {ms, mp}

iv. certain homographs of different categories

Ex: rire : {rire, rires} N

laughter : {ms, mp}

rire: {ris, ris, rit, rions, riez, rient, ... } V

to laugh: { I, you , (s)he, we, you (pl), they -

laugh}

végétal: {végétal, végétaux} N

vegetable : {ms, mp}

végétal: {végétal, végétale,

végétaux, végétales} Adj

vegetable: {ms, fs, mp, fp}

v. certain homographs of an identical Category but with different meanings

Ex: débiteur: {débiteur, débiteurs} N

spool, reel : {ms, mp}

Ce projecteur comporte deux débiteurs à vitesse de

rotation constante.

This projector has two reels rotating at a constant speed.

débiteuse: {débiteuse, débiteuses} N

saw : {fs, fp}

Une débiteuse à disque sert à scier un arbre.

A chainsaw is used to cut a tree.

débiteur: {débiteur, débiteuse,

débiteurs, débiteuses} N

woodcutter : {ms, fs, mp, fp}

Paul est débiteur de bois.

Paul is a woodcutter.

débiteur: {débiteur, débitrice,

débiteurs, débitrices} N

debtor : {ms, fs, mp, fp}

Luc est un débiteur de Paul.

Luc is indebted to Paul.

 

danois: {danois} N

Danish (language): {ms}

Il parle danois.

He speaks Danish.

danois: {danois, danois} N

Dane (dog): {ms, mp}

Il promène son danois tous les matins.

He walks his Dane every morning.

Danois: {Danois, Danoise,

Danois, Danoises} N

Dane (person): {ms, fs, mp, fp}

Il a rencontré un Danois.

He met a Dane.

 

pourri: {pourri} N

rot : {ms}

Cette pomme sent le pourri.

This apple smells rotten. (Translated by an adjective in English.)

pourri: {pourri, pourrie,

pourris, pourries} N

swine (fig): {ms, fs, mp, fp}

Ce type est un vrai pourri.

This fellow is a real swine.

 

comique: {comique} N

comic aspect : {ms}

Je ne trouve aucun comique à cette situation.

I find nothing funny about that situation.

comique: {comique, comiques,

comique, comiques} N

clown : {ms, fs, mp, fp}

Ce type est un vrai comique.

This fellow is a real clown.

Two Morphological Units may thus be distinguished although they are homographs or homophones -- in other words, homonyms. On the other hand, we consider a word to be polysemic if it has several meanings and is represented by a single Morphological Unit.

This definition is necessary but insufficient.

Indeed, it is circular in that the notion of equivalence classes of Inflected Forms is based on the notion of lemmas. Numerous homographs are processed in the manner described above, but two cases in particular are problematic:

i. an equivalence class of Inflected Forms properly included in another

Ex: fraiseuse: {fraiseuse, fraiseuses} N

milling machine {fs, fp}

fraiseur: {fraiseur, fraiseuse,

fraiseurs, fraiseuses} N

milling machine operator {ms, fs, mp, fp}

échecs: {échecs} N

chess

échec: {échec, échecs} N

failure, failures (or chess)

 

ii. two equivalence classes of Inflected Forms that are differentiated only by their Grammatical Category.

Ex: autiste: {autiste, autiste, autistes, autistes} N

autistic person { ms, fs, mp, fp}

autiste: {autiste, autiste, autistes, autistes} Adj

autistic {ms, fs, mp, fp}

We must therefore establish formal criteria for splitting Morphological Units.
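The two problematic cases can be detected mechanically by comparing the sets of Inflected Forms. The following Python sketch assumes that each class is given as a plain list of spellings together with its Grammatical Category; the function name and the returned labels are hypothetical.

```python
def classify_pair(forms_a, cat_a, forms_b, cat_b):
    """Compare two classes of Inflected Forms given as lists of spellings."""
    a, b = set(forms_a), set(forms_b)
    if a == b:
        # Case ii: same forms, only the Grammatical Category differs.
        return "category-only difference" if cat_a != cat_b else "identical"
    if a < b or b < a:
        # Case i: one class is properly included in the other.
        return "proper inclusion"
    if a & b:
        return "partial overlap"
    return "disjoint"


print(classify_pair(["fraiseuse", "fraiseuses"], "N",
                    ["fraiseur", "fraiseuse", "fraiseurs", "fraiseuses"], "N"))
print(classify_pair(["autiste", "autistes"], "N",
                    ["autiste", "autistes"], "Adj"))
```

Such a check only flags candidate pairs; deciding whether they are split into separate Morphological Units still requires the criteria that follow.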

1.2. Criteria for Splitting Morphological Units

We have adopted various criteria for splitting Morphological Units, the most important being the criterion of Grammatical Category because of its generality. All other criteria have methods of application dependent on the specific Category.

All of the criteria which will be presented should be considered as recommendations. The lexicographer is free to choose other criteria as long as they are applied to all of the data.

However, all of the criteria presented are sufficient criteria and ensure consistency in data encoding.

1.2.1. Grammatical Category

If two distinct syntactic categories can be associated with a lemmatised form, then we may consider that we have at hand two distinct lemmas.

Ex: autiste: {autiste, autiste, autistes, autistes} N

autistic person: {ms, fs, mp, fp}

autiste: {autiste, autiste, autistes, autistes} Adj

autistic: {ms, fs, mp, fp}

It should be noted, however, that it is difficult to distinguish certain categories, such as:

Noun/Adj: (see section III.3)

Past Participles/Adj, Present Participles/Adj: the past and present participles may be used as adjectives. The opposite is not true.

Ex: Cet étudiant est malmené.

This student was manhandled.

L'étudiant malmené va faire une réapparition aujourd'hui.

The manhandled student will make a reappearance today.

La personne convenant de l'accord doit le signer.

The person making the agreement must sign it.

Il trouve cette couleur tout à fait seyante.

He finds this color absolutely becoming.

phenomena with several derivations, that is, past or present participles used as adjectives, then as Nouns.

Ex: Le malmené va faire une réapparition aujourd'hui.

The manhandled (person) will make a reappearance today.

Le convenant doit signer l'accord.

The (person) making (the agreement) must sign it.

These difficulties result from the often imprecise delimitation between nature and function and from the arbitrariness of the nature of these cases. When do we record as adjectives in the dictionary the present and past participles used as such? The same goes for nominalized adjectives.

Ex: confiant

a confident person

coiffant

hairdressing lotion

convenant

an agreement

amputé

an amputated person

ajouré

openwork

amplifié

something amplified

In GENELEX the coding of grammatical categories is assigned entirely to lexicographers, who have complete control over the status of the categories. Ultimately, they are the ones responsible for the final decision.

All grammatical subcategories concerning Nouns, Determiners and Adjectives are also criteria for splitting lemmas.

distinction between proper noun/common noun

Ex: Il croit en Dieu.

He believes in God.

Il joue comme un dieu.

He plays like a god.

distinction between strong and weak personal pronouns

Ex: Elle lui a fait un cadeau.

She gave him a present.

Elle pense à lui.

She thinks about him.

distinction between indefinite and qualifying adjectives

Ex: Vous acceptez bien certains compromis.

You accept certain (some and not others) compromises.

Il a une certaine confiance en lui.

He has a certain (amount of) confidence in him.

Il a une confiance certaine en lui.

He has complete confidence in him.

(His confidence in him is certain.)

Sa confiance est certaine.

His confidence is certain.

 

1.2.2. Gender

Gender is a criterion for splitting, except when the variation in gender:

reflects uniquely the sex of the named animal/person.

is a free variation on the arbitrary gender, the two genders having the same meaning.

Ex: Un page/une page

a page (boy)/ a page (in a book)

2 Ums, as une page (fs) denotes neither a "female" page (ms) nor simply un page (ms).

 

Ex: Un infirmier/ une infirmière

a (male) nurse/ a nurse

1 Um, because une infirmière has strictly the same meaning as a "female" infirmier (ms).

 

Ex: Un colonel/ une colonelle

a colonel/ a wife of a colonel

2 Ums, because une colonelle is not a "female" colonel, but denotes the wife of a colonel.

 

Ex: Un empereur/ une impératrice

an emperor/ an empress

 

2 Ums, which are as follows:

empereur: {empereur, impératrice, empereurs, impératrices}

emperor : {ms, fs, mp, fp}

Here, impératrice corresponds to a "female" empereur, that is, "reigning empress." It is an Inflected Form which is distinct from its graphic lemma.

impératrice: {impératrice, impératrices}

empress : {fs, fp}

In this case, impératrice designates the "wife of an empereur." It represents the written lemma itself.

 

Ex: Un interface/ une interface

an interface/ an interface

1 Um, because une interface has exactly the same meaning as un interface.
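The gender criterion reduces to a simple decision once the lexicographer has made the two semantic judgements involved. The Python sketch below encodes those judgements as boolean fields; the class, its field names and the function are assumptions introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenderPair:
    masculine: str
    feminine: str
    reflects_sex_only: bool  # feminine is merely the "female" counterpart
    free_variation: bool     # arbitrary gender, both genders mean the same


def needs_split(pair: GenderPair) -> bool:
    # Gender splits the unit unless one of the two exceptions applies.
    return not (pair.reflects_sex_only or pair.free_variation)


PAIRS = [
    GenderPair("page", "page", False, False),
    GenderPair("infirmier", "infirmière", True, False),
    GenderPair("colonel", "colonelle", False, False),
    GenderPair("interface", "interface", False, True),
]

for p in PAIRS:
    print(p.masculine, "/", p.feminine, "->",
          "2 Ums" if needs_split(p) else "1 Um")
```

A case such as empereur/impératrice would need two such records, one for the "reigning empress" reading (no split) and one for the "wife of an emperor" reading (split).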

 

1.2.3. Number

Number, as a system of determination, sometimes strongly alters the meaning of Nouns.

generic, non-specific, and specific determination.

determination of discreteness, mass, and abstraction.

Ex: Il a la raison avec lui.

He has reason on his side.

Les raisons de son geste sont inconsidérées.

The reasons (motives) for his gesture are rash.

Il vend de la viande de boeuf.

He sells beef.

Les viandes de boeuf ont pourri.

The beef has rotted.

Le football est un sport très différent du basketball.

Football is a very different sport from basketball.

Ces pays pratiquent des footballs très différents.

These countries play very different (types of) football (games).

Il a acheté des rillettes du Mans.

He bought some rillettes from Le Mans.

La rillette du Mans est vraiment la meilleure.

Rillettes from Le Mans are truly the best.

Here, we are dealing with the morphology of Nouns independently of the operations that they may undergo. If the difference in meaning is tied only to the determination caused by the number, we do not consider number as a criterion for splitting. In other words, it is a matter of polysemy and not homonymy.

For all other cases, one must refer to meaning as a criterion for splitting. However, we have reservations about that criterion as well, and prefer to consider that number is not, in general, an operative criterion for splitting.

Ex: bretelle, bretelles 1 Um

(shoulder) strap, suspenders (UK braces)

lunette, lunettes 1 Um

telescope, glasses

échec, échecs 1 Um

failure, failures (or chess)

However, with full knowledge of the facts, the lexicographer may choose a different option.

Ex: bretelle, bretelles 2 Ums

(shoulder) strap, suspenders (UK braces)

lunette, lunettes 2 Ums

telescope, glasses

échec, échecs 2 Ums

failure, failures (or chess)

 

1.2.4. Meaning

Meaning may be a criterion for splitting if no relations (etymological, derivational, rhetorical) can be established between two meanings of a word. Unfortunately, judgement may vary as to the existence or non-existence of such a relation. For this reason, we do not possess formal criteria to separate homonymy and polysemy. Here again, the lexicographer has carte blanche in his choice.

Ex: fraise

strawberry, milling machine

poêle

stove, pall

We wish to recall, however, that the only homogeneous processing envisaged is total polysemy (all meanings that show no difference in their variation of form or in their Grammatical Category are grouped under the same Um), and that homogeneous processing of dictionaries improves the performance of the tools used to merge them.
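Total polysemy amounts to grouping senses by the pair (Grammatical Category, sequence of Inflected Forms). The following Python sketch shows the idea under the assumption that each sense is supplied as a (gloss, category, forms) triple; the function name and this input layout are hypothetical.

```python
from collections import defaultdict


def group_total_polysemy(senses):
    """Group senses that share a category and the same inflected forms."""
    units = defaultdict(list)
    for gloss, category, forms in senses:
        # The key ignores meaning entirely: only category and forms count.
        units[(category, tuple(forms))].append(gloss)
    return dict(units)


senses = [
    ("strawberry", "N", ["fraise", "fraises"]),
    ("milling machine", "N", ["fraise", "fraises"]),
    ("Danish (language)", "N", ["danois"]),
    ("Dane (dog)", "N", ["danois", "danois"]),
]

for (category, forms), glosses in group_total_polysemy(senses).items():
    print(category, forms, glosses)
```

With this grouping the two senses of fraise fall under a single Um, while the two nouns danois stay apart because their classes of Inflected Forms differ.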

1.2.5. Function

It is agreed that there are no inflectional cases in modern French. However, the weak personal pronouns and the relative and interrogative pronouns all have a grammatical function.

It would be difficult however to attribute to all of these pronouns one of the cases from among the following: nominative, accusative, dative, genitive, and locative.

Ex: dont complement with de (<> genitive)

en

y complement with à (<> dative + locative)

We prefer instead to use the function of the pronouns as a criterion for splitting (see also V.2.2, Inflection of Determiners and Pronouns). Since function does not appear explicitly at the morphological level, it is not represented in the Ums.

Ex: le object

il subject

We have considered function as a criterion for splitting (refer also to Pronominal Inflections).

1.3. Conclusion on the Criteria for Splitting

Etymology and derivation are not considered operative criteria for splitting.

1.3.1. Etymology

A lemma may have several concurrent etymons, a chain of etymons, or several etymons (composition or lexical derivation).

Examples taken from the Petit Robert:

Ex: mécanisme: from Latin mecanisma

mechanism

élastique: from Latin elasticus, from Greek elasis

elastic

plénipotentiaire: from Latin plenus and potentia

plenipotentiary

We see also that two distinct lemmas may have identical etymons.

Ex: peser: from Latin pensare and pendere.

to weigh

penser: from Latin pensare and pendere.

to think

When two different etymologies may be associated with a single word, it is sometimes difficult to decide whether we have at hand concurrent etymologies or different lemmas.

We therefore believe that etymology is not a sufficient criterion for splitting lemmas.

Ex: poêle (masc): from Latin pallium / pensilis 1 Um

 

1.3.2. Derivation

A lemma may be the root for several Derivatives of the same Category, whether or not it is polysemic or homonymic. Consequently, derivation is not a sufficient criterion for splitting lemmas.

Ex: Verbs from which one derives several Nouns.

coller -> collage

to paste -> collage

-> colleur

-> poster (one who posts bills)

griller -> grillage

to put bars on -> wire netting

griller -> grillade

to grill, to toast -> grill, gridiron

raffiner -> raffinage

to refine (products) -> refining

-> raffinement

to polish, refine (language, manner) -> refinement, sophistication

In GENELEX, we prefer to have three verb entries (coller, griller, raffiner). Derivation is not sufficient to distinguish polysemy from homonymy, the eternal dilemma of lexicographers. (For example, the Dictionnaire Hachette du français handles these examples as follows: coller - one meaning; griller - multiple meanings; raffiner - two homonyms.)

1.4. Autonomous vs. Non-autonomous Morphological Units

Autonomous Morphological Units are units which stand alone in a language, as opposed to those which appear only in other Morphological Units, in idioms, in hyphenated words or in fossilised expressions. They all possess a Grammatical Category, one or more syntactic behaviors and possibly one or more semantic meanings.

Ex: demain, après-demain, aujourd'hui

tomorrow, day after tomorrow, today

matin, cuillerée, après-midi, table

morning, spoonful, afternoon, table

venir, se moquer, se la couler douce

to come, to make fun of, to have an easy time

avec, en compagnie de

with, in the company of

lui, quelqu'un

him, someone

grand, rarissime, avant-coureur

great, (very) rare, forerunner

quand, au moment où

when, at the moment of

ho

ho

Guadeloupe, Île Maurice, La Rochelle

Guadeloupe, Mauritius, La Rochelle

Autonomous units may be Simple or compound, and they may or may not have derivation relations with other units.

Non-autonomous Morphological Units may be either Simple or Affix. They are employed only in Derivatives, phrases, compound words or set expressions.

Non-autonomous Affix Um: they appear only in derivations (see Section 3).

Ex: -tion (a suffix)

anti- (a prefix)

api- (a prefix)

-nome (a suffix)

 

Non-autonomous Simple Um : they are found only in compound Morphological Units (that includes compound words, locutions, and fixed expressions). GENELEX has chosen to mark their non-autonomous status. A Category may possibly be attributed to them on an etymological basis or in an arbitrary manner.

Ex: parce (in parce que, because)

afin (in afin de, in order to)

tohu (in tohu-bohu, jumble, confusion)

bohu (see above)

æquo (in ex æquo, on the same level)

Note: The term lemma is often mistakenly used in place of Morphological Unit, although not all Morphological Units are lemmas (for example, non-autonomous Morphological Units).

2. Simple Morphological Units

A Simple Morphological Unit (Um_S) is associated with a written form (or several in the case of variants) composed of a series of alphabetical characters [a-z, A-Z, ç, é, è, ë, ê, á, à, â, ä, œ, î, ï, ô, ö, ó, ù, û, ü, and the corresponding accented capitals such as É, Ä, Ö, Ü], separators (hyphen, apostrophe, period), and possibly punctuation indicating hyphenation.

A Simple Morphological Unit may have a grammatical category as well as a subcategory. Whether the Um_S is autonomous or not is also indicated (See Section 1.4.).

3. Affix Morphological Units

An Affix Morphological Unit (Um_Aff) may be of one of the following types: prefix, infix, or suffix. It may also have none of these types when it derives its status from the context of its derivation or its composition.

Ex: -tion suffix

re- prefix

gyne no type of affix :

(as in androgynous or gynecology)

 

4. Derivation

4.1. Foreword

Derivation traces back to a deep structure of lexical morphology. One finds in the lexicon basic lemmas and lexical morphemes of derivation. Morphological derivation is very productive in French: it is estimated that some 35,000 lexical morphemes are considered results of derivation in the dictionaries.

Here we must take into account those derivational processes which are productive and regular.

Derivation can be viewed from two perspectives. We may

conceive the processes of lexical formation by derivation as linguistic objects and describe all morphological derivational rules while taking into consideration, for example, the editorial practices of traditional dictionaries.

consider that derivation is essentially a syntactic phenomenon and that it is useless to describe systematically morphological information that can only be redundant. In this case, we appeal to a derivational process in morphology only for those words whose syntax is not well known (such as cerise/cerisaie: cherry/cherry orchard).

The final choice is ultimately dependent on the lexicographer. However, the model must in any case be able to take into account the morphological derivation of a word.

It is of course not advisable to analyse as derived forms lemmas which formally may appear to be derived forms of a basic lemma, between which no real links exist. To illustrate, we have the following examples:

Ex: bouter / boutade

to drive, push/ jest, sally

manger / démanger

to eat/ to itch

The direction of a morphological derivation generally follows the length of the character strings: the longer form is taken to be the derived one.

Ex: culture -> cultiver

cultivation cultivate

aliéner -> aliénation

alienate alienation

loyal -> loyauté

loyal loyalty

But this is not always the case, especially with the verb forms for the 3rd conjugation group or with improper Derivatives.

Improperly derived forms may arbitrarily be associated with one direction of derivation or the other.

Ex: dîner V

to dine

dîner N

dinner

Improper derivation is a particular case of derivation: neither a prefix nor a suffix occurs in the derivation process.

It should be noted that improper derivation cannot be deduced from mere homography between two entries of distinct Categories. The derivation relationship must necessarily be registered in the dictionary.

Ex: boucher V

to stop up

boucher N

butcher

4.2. Definition

Derived Morphological Units are those Simple Morphological Units that have derivation relations with other Morphological Units (Simple or Affix).

These units are nevertheless analysable in that they consist of

0 to N prefixes

a radical (root)

0 to N suffixes

The same step of derivation may combine prefixes and suffixes.

Ex: culture -> acculturation

The derivational process is recursive. At each step, the process constructs a new autonomous Morphological Unit, to which the next step applies. The order of application is therefore very important, all the more so because it does not necessarily correspond to the linear order of the affixes in the derived form.

Ex: dé/national/is/ation

denationalization

1 nation -> national

2 national -> nationaliser

3 nationaliser -> nationalisation

4 nationalisation -> dénationalisation

The complete analysis of a derived form presents the radical + affix combination at each step of derivation, in other words the "unfolding" of the recursive process.

Ex: Dénationaliser

 

Equivalent bracketed notation:

(dé((national)iser))

Note: Because derivation is a recursive process, a radical may be either a Simple or a derived Morphological Unit.
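The recursive structure described in this section can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the record type Derived and the functions bracket and unfold are hypothetical names, not part of the GENELEX model. The sketch rebuilds the bracketed notation given above and lists the steps of the derivation from the innermost base outwards.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Derived:
    """One derivation step (hypothetical record, for illustration only)."""
    form: str                      # the autonomous unit produced by this step
    base: Union[str, "Derived"]    # a Simple unit or an earlier Derivative
    prefixes: List[str] = field(default_factory=list)
    suffixes: List[str] = field(default_factory=list)


def written(unit):
    """Written form of a Simple unit (str) or of a Derivative."""
    return unit if isinstance(unit, str) else unit.form


def bracket(unit):
    """Bracketed notation: prefixes, bracketed base, then suffixes."""
    if isinstance(unit, str):
        return f"({unit})"
    return "(" + "".join(unit.prefixes) + bracket(unit.base) + "".join(unit.suffixes) + ")"


def unfold(unit):
    """List the (base, result) pairs of every step, innermost first."""
    if isinstance(unit, str):
        return []
    return unfold(unit.base) + [(written(unit.base), unit.form)]


nationaliser = Derived("nationaliser", "national", suffixes=["iser"])
denationaliser = Derived("dénationaliser", nationaliser, prefixes=["dé"])

print(bracket(denationaliser))
print(unfold(denationaliser))
```

Because each record stores only its final step and a reference to its base, the earlier steps are recovered by following the base, which is the property the Note above relies on.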

4.3. Structural Ambiguity - Concurrent Derivations

Certain derived forms may present an ambiguity of structure.

Ex: Dénationalisation

Analysis 1

Analysis 2

Equivalent bracketed notation:

Analysis 1 ((dé((national)is))ation)

Analysis 2 (dé(((national)is)ation))

Since the process is recursive, it suffices to represent the final step of derivation. We have only to find the constituting autonomous Morphological Unit to determine the anterior steps.

Ex: dénationaliser

(dé(nationaliser))

dénationalisation

Analysis 1 ((dénationalis)ation)

Analysis 2 (dé(nationalisation))

The model allows us to record concurrent analyses. The lexicographer may therefore decide to favor a particular analysis by only recording that one, or to record all of the different possible analyses, or to reflect the ambiguity of the derivations by recording all of the constituents in a flat format.

4.4. Trees of N-ary Derivation

The trees of derivation are n-ary and not necessarily binary. Indeed, we remind the reader that at each step, the derivation must apply to autonomous bases.

Ex: brum/is/ation

cannot be analysed as

1 brume -> *brumiser

2 *brumiser -> brumisation

but must be analysed as

1 brume -> brumisation

mist misting

where the base word and two suffixes are combined in a single step.

Ex: ac/culture/ation

can neither be analysed as

1 culture -> *acculture

2 *acculture -> acculturation

nor as

1 culture -> *culturation

2 *culturation-> acculturation

but must be analysed as

1 culture -> acculturation

where a base word, a prefix and a suffix are combined in a single step.
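The constraint that every step applies to an autonomous base can be checked mechanically. The sketch below assumes a toy lexicon (AUTONOMOUS is a hypothetical set, standing in for the autonomous units actually recorded in a dictionary) and rejects any chain that passes through an unattested intermediate form.

```python
# Hypothetical toy lexicon of autonomous Morphological Units.
AUTONOMOUS = {
    "brume", "brumisation",
    "culture", "acculturation",
    "nation", "national", "nationaliser", "nationalisation",
}


def first_unattested(chain, lexicon=AUTONOMOUS):
    """Return the first form of the chain missing from the lexicon, or None."""
    for form in chain:
        if form not in lexicon:
            return form
    return None


def chain_is_valid(chain, lexicon=AUTONOMOUS):
    """A chain is valid when every form it goes through is autonomous."""
    return first_unattested(chain, lexicon) is None


print(chain_is_valid(["brume", "brumiser", "brumisation"]))   # binary analysis
print(chain_is_valid(["brume", "brumisation"]))               # single n-ary step
print(first_unattested(["culture", "culturation", "acculturation"]))
```

When the binary chain fails, the lexicographer falls back on a single step combining all the affixes, which is what makes the tree n-ary.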

The derivation tree may in certain cases be unary, indicating either an improper derivation or a partially represented derivation. The lexicographer is free to denote a relationship between two Simple Morphological Units without specifying the mechanisms of prefixing or suffixing brought into play.

Ex: tartre -> tartrage

scale descaling

The derivation components, which have been recorded, are identified by a rank, indicating their order of appearance in the Derivative. In the case of partially represented derivations, no indication of rank is noted. This lack of rank distinguishes them from improper derivations whose rank is indicated by the ordinal number 1.

4.5. Systems of Inflection and Derived Categories

The system of inflection and the Category of the derived form are determined by the affix and the base word.

Prefixing

Prefixes rarely cause a change in class. The Grammatical Category and the system of inflection therefore remain the same as those of the base word.

Ex: dé faire (V) -> défaire (V)

do undo

re faire (V) -> refaire (V)

do redo

There are however several exceptions:

Ex: anti char (N) -> antichar (Adj)

tank antitank

pour boire (V) -> pourboire (N)

lit.:for drink tip

Suffixing

Suffixes may or may not entail a change in class. It is the suffix that determines the necessity of such a change.

Ex: asser rêver (V) -> rêvasser (V)

to dream to daydream

Other suffixes cause a change in Category. They constrain the Category of the base and derived forms.

Ex: age atterrir (V) -> atterrissage (N)

to land landing

 

These suffixes however do not always allow one to deduce unequivocally the Category of the base and of the Derivative. Certain suffixes may be applied to a group of categories, as far as both selection and production are concerned.

Ex: ible submerger (V) -> submersible (N)

to submerge submarine

-> submersible (Adj)

submersible

Ex: isme social (A) -> socialisme (N)

ism

despote (N) -> despotisme (N)

Marx -> Marxisme (N)

In the case in which the suffix produces a noun, the suffix may also determine the gender of the derivative.

Ex: tion définir (V) -> définition (N feminine)

However, in all cases, the affix is the determinant as its value permits us to predict systematically the categories of the base and the Derivative. An affix must therefore carry the following information: the Category selected, the resulting Category, and the Gender of the resulting nouns. Based on the prescribed information, we may provide help for decisions and consistency checks for the derived Morphological Units.
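As an illustration of the consistency checks mentioned above, the following sketch stores the three pieces of information on each affix and compares them with a recorded Derivative. The names Affix, AFFIXES and check are hypothetical, and the category values are simply read off the examples of this section; they are assumptions, not a GENELEX inventory.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Affix:
    form: str
    selects: FrozenSet[str]             # categories accepted for the base
    produces: FrozenSet[str]            # categories possible for the Derivative
    noun_gender: Optional[str] = None   # gender imposed on resulting nouns, if any


# Values taken from the examples in the text (illustrative only).
AFFIXES = {
    "tion": Affix("tion", frozenset({"V"}), frozenset({"N"}), "feminine"),
    "ible": Affix("ible", frozenset({"V"}), frozenset({"N", "Adj"})),
    "isme": Affix("isme", frozenset({"Adj", "N"}), frozenset({"N"})),
}


def check(affix, base_cat, derived_cat, gender=None):
    """Return a list of inconsistencies between an affix and a derivation."""
    problems = []
    if base_cat not in affix.selects:
        problems.append(f"-{affix.form} does not select base category {base_cat}")
    if derived_cat not in affix.produces:
        problems.append(f"-{affix.form} does not produce category {derived_cat}")
    if derived_cat == "N" and affix.noun_gender and gender and gender != affix.noun_gender:
        problems.append(f"-{affix.form} nouns are expected to be {affix.noun_gender}")
    return problems


print(check(AFFIXES["tion"], "V", "N", "feminine"))    # définir -> définition
print(check(AFFIXES["tion"], "N", "N", "masculine"))   # two inconsistencies
```

Such a check only assists the lexicographer: it reports a mismatch but does not forbid the entry.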

The same may be said about the inflectional system associated with a Derivative: it is often deducible from the suffix when the latter possesses an Inflected Form. This is not, however, systematic and therefore we have decided not to make a constraint of it.

Ex: al, ale, aux, als, ales

-> abdominaux (N)

stomach muscles (mp)

-> festival, -als (N)

festival (ms, mp)

-> général, -ale, -aux, -ales (N)

general (ms, fs, mp, fp)

4.6. References to Ums, Umgs/Umps and Radicals

The derivation relation refers to Ums (Simple or Affix) and indicates for each, its linear order of appearance and its status in the derivation.

The derivation is a relation between different Ums and is applied by default to all of their graphic and phonemic Morphological Units (Umgs and Umps).

In certain cases, however, we may specify that a recorded derivation applies only to some of the graphic or phonemic variants of a Um and of its derivation components, and only to some of their graphic or phonemic radicals. The radicals and the variants are referenced by their number.

For each affix, we specify which of its combinatory variants (refer to II.2.3 Radicals and their Combinatory Variants) plays a role in the process of derivation. For this reason we number the radical of the affix.

The prefix in, for example, has various surface forms according to the base word to which it is attached. Here we have a case of regressive assimilation.

Ex: imperméable (impermeable, waterproof)

inconcevable (inconceivable)

irréfléchi (thoughtless)
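The choice of the combinatory variant of the prefix in can be sketched as follows. The radical numbers (0 to 3) and the function names are assumptions made for the illustration, and the rule is a simplified orthographic approximation of the assimilation: im before m, b and p, il before l, ir before r, in elsewhere.

```python
# Hypothetical numbering of the graphic radicals of the prefix "in".
IN_RADICALS = {0: "in", 1: "im", 2: "il", 3: "ir"}


def select_in_radical(base):
    """Pick the radical number of "in" from the first letter of the base."""
    first = base[0].lower()
    if first in "mbp":
        return 1
    if first == "l":
        return 2
    if first == "r":
        return 3
    return 0


def prefix_in(base):
    """Attach the selected radical of "in" to the base."""
    return IN_RADICALS[select_in_radical(base)] + base


for base in ("perméable", "concevable", "réfléchi"):
    print(select_in_radical(base), prefix_in(base))
```

In the model itself the selected radical is recorded by its number on the derivation relation, so no such rule has to be evaluated when the dictionary is consulted.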

The suffix tion also has various forms depending on the base word with which it is associated.

Ex: maturation

inhibition

For each base word, we specify which of its radicals is active during the derivation process.

In most cases, the Umg (graphic Morphological Unit) of the base serves to form the Derivative. The radical is therefore equivalent to the Umg.

Ex: rouge -> rougeoyer Umg = Radg0 : rouge

red to glow red

In other cases, we observe a change in the base during the process of derivation.

i. Retrograde alteration associated with the suffix

Doubling of consonants (associated with denasalization)

Ex: mouton -> moutonner Radg1 : moutonn

ancien -> ancienneté Radg1 : ancienn

Change of accent on the e (associated with vowel fronting)

Ex: célèbre -> célébrité Radg1 : célébr

ii. Historical change

The Derivative takes a form that is not identical to that of the base.

Ex: faux -> falsification

false

4.7. Derivation Levels

In order to describe a derivation process, one must first identify the base in the modern language from which the word was derived.

During the derivation process, phonological and contextual variations on the base word may be observed. Retrograde alterations are characterised by an alteration on the base word by the right context.

Ex: è -> é in célèbre/célébrité

d -> t in attendre/attente

g -> c in agir/action

These variations may be more or less significant, especially concerning verbs from the 3rd group.

Ex: convaincre -> conviction

to convince belief

In addition these variations may have origins that are more or less ancient, which leads us to etymology. As concerns the etymological criteria, lexicography clearly distinguishes two cases:

i. The base and the Derivative have the same etymology. In this case, the base and the Derivative have simply evolved differently from the same etymon.

Ex: faux -> falsifier

ii. The base and the Derivative do not have the same etymology. In this case, we do not recognise a morphological derivation in the modern language.

Thus not all relationships of the types

Noun - Adjective

Verb - Noun

etc.

are based on morphology.

Ex: prison / carcéral

prison / carceral

cheval / équestre

horse / equestrian

In these pairs, the base and the Derivative do not have the same etymon. These adjectival derivations are not represented by a morphological derivation but rather by either a syntactic derivation (relationship between two syntactic units) or by a semantic derivation (relationship between two semantic units).

5. Compound Morphological Units

5.1. Foreword

The definition of a compound is a highly controversial question. We do not claim to have settled these issues.

GENELEX provides a framework that allows the realisation of certain complex expressions from the morphological level. Phrases (afin de - in order to), compounds (garde-malade - home nurse), certain synapses [Benveniste 1974] (fils à papa - daddy's boy) or set expressions (se tourner les pouces - twiddle one's thumbs) may therefore be recorded as Compound Morphological Units (Um_C) by the lexicographer who so desires.

The reader should be forewarned about the terminology concerning complex expressions used in GENELEX. The term, "Compound Morphological Unit" does not cover the traditional notion of "compound word". Formally, it deals only with those complex expressions that a lexicographer has chosen to encode in the morphological layer of the GENELEX model. This choice is conceptually based on a certain number of linguistic criteria that in effect define the very notion of a Um_C.

You will find, however, for purposes of illustration, that we have established criteria for definition in order to validate our model. The users of GENELEX may decide either to utilise or amend the model.

5.2. Definition

By Morphological Compound, we mean the complex expressions that satisfy at least one of the following criteria:

i. one of the components appears only in this complex expression.

Ex: aujourd'hui (today)

au fur et à mesure (as )

ii. morphological particularity (change of gender, number, system of inflection,... during a composition)

Ex: une deux-chevaux (a deux-chevaux (type of car))

un peau rouge (an Indian)

iii. particularity in the graphic form (presence of a graphic separator: hyphen, apostrophe)

Ex: garde-malade (a home nurse)

iv. indivisible compound. Insertion between the elements is disallowed.

Ex: à force de (by dint of)

Typically, mettre en marche (to start up) should be considered a matter at the syntactic level, because we may say mettre le bulldozer en marche.

v. assimilable within a terminal functional category

Ex: En vertu de (in accordance with)

This locution is said to be "prepositional", that is, a series characterised by a Preposition-Noun-Preposition is assimilable to a Preposition.

vi. non-existence of a semantic composition

Ex: a sage-femme (midwife) is not a woman who is sage.

The Nouns referring to compound numbers are not registered, because they define an infinite set.

None of these criteria is necessary. Only (ii) is sufficient.

A Compound Um is defined by its components which, by definition, must already exist. A component may be a Simple, autonomous Um, a Simple, non-autonomous Um, an Affix Um, a contracted Um, or a Compound Um. The components of a given compound may be of different types.

Ex: au fur et à mesure

fur is non-autonomous whereas all other components are autonomous Ums.

5.3. Description

A Compound Morphological Unit is distinguished from other Ums by the fact that it has no graphic or phonemic Morphological Units. However, it has a Grammatical Category, and perhaps sub-categories.

Ex: Um_C peau rouge

catgram : Noun

From a single Compound Morphological Unit branch N Composition relations, N being the number of Components.

Each Composition relation associates with the Compound:

one of its Components

possible restriction on the Units or the Radicals, graphic or phonemic, of the component concerned by the composition

a Compositional System of Inflection which describes the relation between the inflection of the Compound and its Component.

and consists of (from attributes on the relation itself) the following information:

the linear order of the Component in the Compound.

the punctuation mark preceding the Component.

The graphic (and phonetic) form of a Compound Um is deducible from the forms of its components, together with the information on the punctuation marks between them (presence or absence of hyphens, spaces, or apostrophes). Consequently, it is not necessary to note the Umgs of Compound Ums in the model.

A Compound Um is therefore only indirectly identified by a series of alphabetic characters: [a-z, A-Z, ç, é, è, ë, ê, á, à, â, ä, , œ, î, ï, ô, ö, ó, ù, û, ü, É, , , , , , , Ä, , , , Ö, , , , Ü] as well as space marks, hyphens, apostrophes and underlines.

One may wish to consider certain units as Compound units despite the fact that they do not have a punctuation mark indicating hyphenation. The absence of such a graphic separator is indicated in the model by an attribute, "joined", on the composition relation.

Ex: bonhomme (a good-natured fellow)
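The way the graphic form of a Compound follows from its Composition relations can be sketched as below. The record Composition and its field names are hypothetical; they only mirror the attributes listed above (linear order, preceding punctuation mark, and the "joined" attribute), and a space is assumed as the default separator.

```python
from dataclasses import dataclass


@dataclass
class Composition:
    order: int              # linear order of the Component in the Compound
    component: str          # graphic form of the Component
    separator: str = " "    # punctuation mark preceding the Component
    joined: bool = False    # True when no graphic separator is written


def graphic_form(relations):
    """Concatenate the components in linear order with their separators."""
    ordered = sorted(relations, key=lambda rel: rel.order)
    parts = []
    for position, rel in enumerate(ordered):
        if position > 0 and not rel.joined:
            parts.append(rel.separator)
        parts.append(rel.component)
    return "".join(parts)


print(graphic_form([Composition(2, "malade", "-"), Composition(1, "garde")]))
print(graphic_form([Composition(1, "peau"), Composition(2, "rouge")]))
print(graphic_form([Composition(1, "bon"), Composition(2, "homme", joined=True)]))
```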

If a component has graphic variations, then the composition inherits these by default, that is if no restrictions are indicated on the composition relation.

Ex: clef/clé -> porte-clefs/porte-clés

Compounds, as other Morphological Units, are inflected. That is, the graphic and phonemic Inflected Form may be deduced from its system of inflection. (Refer to V.3 System of Inflection for Compounds).

Ex: ms peau rouge

mp peaux rouges

fs peau rouge

fp peaux rouges

 

5.4. Observations

Non-autonomous Compound Ums do not exist.

The number of components is equal to at least two.

A Compound Um may have the same Um as a component more than one time.

Ex: moitié-moitié

(half-and-half)

The components may be Ums of various types: simple words (autonomous or not), Derivatives, affixes, or compound words.

Ex: wagon-lit (sleeping car)

au fur et à mesure (as )

accouchement sans douleur (painless childbirth)

anti-buée (anti-condensation)

virage en épingle à cheveux (a hairpin turn)

6. Short Forms

Short or abbreviated forms are represented in the same way as Morphological Units (Simple or Compound), that is, as a separate entry related to the corresponding developed unit. The relation has attributes in order to indicate if the short form is an abbreviation, a set of initials, or an acronym.

Ex: S.N.C.F. initials: Société Nationale des Chemins de Fer Français

OTAN acronym: Organisation du Traité de l'Atlantique Nord

adj. abbreviation: adjective

cinéma abbreviation: cinématographe

ONU initials and acronym: Organisation des Nations Unies

The different types of abbreviations may be distinguished by the usage attributes (frequency, dating, ...) . For example, certain abbreviations are viewed as simple graphic shortcuts (fam. for familier) whereas others have become a part of the language and replace either entirely or partially the original expanded form (métro for métropolitain).

A Um may have several short forms.

Ex: cinéma abbreviation : cinématographe

ciné abbreviation : cinématographe

Inversely, a Um may in certain cases be the short form of several expanded Ums.

Ex: micro abbreviation : microphone

micro abbreviation : microordinateur

The lexicographer will decide, of course, the importance of these factorisations.

We have decided to make two distinct Morphological Units for the short form and for the expanded form rather than two variations of the same unit for the following reasons:

the short form may sometimes have a different syntactic behaviour from that of the expanded form: one may refer to a petite P.M.E. (semantically, a small business) but never to a petite Petite et Moyenne Entreprise (literally, a small Small and Medium (sized) Enterprise; P.M.E. is a classification of businesses).

as other units, the short form is a graphic/phonemic abstraction.

7. Contracted Morphological Units

The Contracted ("agglutinated") Morphological Unit (Um_Agg) is used to record written contractions of two units.

Ex: du is the agglutination of de le

The Um_Agg does not possess a Grammatical Category. It is represented in the model as having a composition relation with the elements from which it is derived:

the combination of pertinent inflectional features for the "agglutinate" is represented as indicated below and is put in relationship with the combination of inflectional features for the element:

Ex: duquel (ms) has a relationship to de and lequel (ms)

the choice of graphic or phonemic variants is also possible

Ex: au is formed only with the le form of the masculine definite article, and not with its elided variant l'

Unlike the case of the Compound Unit, the spelling of the Um_Agg is not obtained by concatenating the spelling of the components. Therefore, the Um_Agg has its own Umg and Ump.

The lexicographer has the choice of entering the contracted paradigms or not.

Ex: the three contractions may be split or regrouped

au = à le (ms)

aux = à les (mp)

aux = à les (fp)

Note that the form à la (fs) is not included in this paradigm since it is not a contracted form.

The Um_Agg is also characterised by the attribute indicating whether it is mandatory or optional. This feature, as well as the facility for recording paradigms, would be particularly adapted to those languages that have a higher frequency of contraction than in French. For example, in Italian, contraction of the preposition con with the definite article is optional.
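A contracted paradigm of this kind can be pictured as a small table relating each Um_Agg form to its elements, its inflectional features and its mandatory or optional status. The sketch below is only an illustration: the record Contraction, the table CONTRACTIONS and the function contract are hypothetical names, and the "mandatory" values are our assumption for the French and Italian examples.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Contraction:
    form: str                    # graphic form of the Um_Agg
    components: Tuple[str, str]  # the two elements it contracts
    features: str                # combination of inflectional features
    mandatory: bool              # mandatory or optional contraction


CONTRACTIONS = [
    Contraction("au", ("à", "le"), "ms", True),
    Contraction("aux", ("à", "les"), "mp", True),
    Contraction("aux", ("à", "les"), "fp", True),
    Contraction("du", ("de", "le"), "ms", True),
    Contraction("col", ("con", "il"), "ms", False),  # Italian, optional
]


def contract(first, second, features):
    """Return the matching contraction record, or None when none is recorded."""
    for entry in CONTRACTIONS:
        if entry.components == (first, second) and entry.features == features:
            return entry
    return None


print(contract("à", "le", "ms"))
print(contract("à", "la", "fs"))   # not a contracted form: None
print(contract("à", "l'", "ms"))   # elided variant of the article: None
```

Recording the two aux entries separately or as one regrouped paradigm is exactly the choice left to the lexicographer above.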

II. Phonetic and Graphic Forms

1. Introduction

There are various types of information to be processed in the morphological layer: graphical, phonemic, morphological features, grammatical categories, and etymological.

So that the GENELEX morphological layer corresponds to the linguistic notion of morphology, we have not limited morphology to the graphic form, as many application-oriented electronic dictionaries have. This is reflected in our model by two fundamental points: the distinction between homographs (refer to Section I.1 Definition of Morphological Units) and the grouping of graphic variants (refer to Section 2.2 below).

Ex: souris P: 1, N: singular, T: pres, M: indicative Cat: V

I smile

souris P: 2, N: singular, T: pres, M: indicative Cat: V

you smile

souris P: 1, N: singular, T: past hist, M: indicative Cat: V

I smiled

souris P: 2, N: singular, T: past hist, M: indicative Cat: V

you smiled

souris G: feminine, N: singular, Cat: N

mouse

souris G: feminine, N: plural, Cat: N

mice

abattis, abatis: G: masculine, N: singular, Cat: N

abattis, abatis: G: masculine, N: plural, Cat: N

Morphology is not limited to the written form because we must also consider the phonetic form.

A Morphological Unit is an abstraction of graphic and phonemic Morphological Units. This abstract entity serves as a pivot between the morphological, syntactic and semantic layers of our model.

2. Graphic Form

Morphological units are identified by their written or graphic form, that is, by a series of alphabetic characters with or without accent marks, in upper or lower case letters: [a-z, A-Z, ç, é, è, ë, ê, á, à, â, ä, , œ, î, ï, ô, ö, ó, ù, û, ü, É, , , , , , , Ä, , , , Ö, , , , Ü] as well as space marks, hyphens, apostrophes and underlines. The underline is used strictly as a syllabication mark.

Ex: après

after

au_jourd'hui

today

au fur et à me_sure

as

pèse-per_son_ne

scale

Gua_de_loupe

Guadeloupe

île Maurice

Mauritius

grand-duché de Luxem_bourg

Grand Duchy of Luxembourg

La Ro_chelle

La Rochelle

As for characters that carry an accent mark, we conform to the Text Encoding Initiative, which has chosen the ISO 8879 norm. This norm codes accented characters in the form of publicly declared SGML entities, or "character-entities." This coding is formally unambiguous, since the nomenclature of character-entities is part of the ISO norm and the entities are delimited by an opening mark (&) and a closing mark (;).

Ex: jusqu'à ce siècle ;

jusqu'&agrave; ce si&egrave;cle ;

This notation permits us to convert systematically character-entities into accented characters dependent on the operating system or word processing program without running the risk of errors by simple substitution of a chain of characters.
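The conversion between character-entities and accented characters can be sketched with the Python standard library. The sketch relies on the HTML entity tables of the html module, which share names such as agrave and egrave with the ISO 8879 Latin 1 set; treating them as equivalent for this purpose is our assumption, and the function names are hypothetical.

```python
from html import unescape
from html.entities import codepoint2name


def entities_to_text(coded):
    """Replace character-entities such as &agrave; by the accented characters."""
    return unescape(coded)


def text_to_entities(text):
    """Replace every non-ASCII character that has a named entity by &name;."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            out.append(f"&{name};")
        else:
            out.append(ch)
    return "".join(out)


coded = "jusqu'&agrave; ce si&egrave;cle"
print(entities_to_text(coded))
print(text_to_entities("jusqu'à ce siècle"))
```

Because each entity is delimited by & and ;, the replacement works on whole entities rather than on arbitrary character chains, which is the source of the safety claimed above.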

The SGML coding must not only serve to exchange and transfer data but must also remain completely transparent to the user of a GENELEX electronic dictionary. To conserve it as an editing format would make the document illegible and render it a source of errors.

The graphic form of a Compound is computed from the graphic forms of its components.

2.1. Graphic Morphological Units: Free Variants

When, for a single Morphological Unit, various lemmatised forms are concurrent (whatever the context), we have at hand a case of graphic variants. [M. Mathieu-Colas, 1987]

Ex: clé, clef (key) 1 Um, 2 Umgs

As variants, these forms are logically attached to the same lemma rather than split into different lemmas.

For those words which may or may not include a graphic separator, the lexicographer is free to represent this by either a Compound Morphological Unit with variations on the separator or by a Simple unit and its variants.

Ex: réécriture, ré-écriture, récriture (rewriting)

Dating, frequency of usage and style may serve to distinguish the variants.

Ex: abattis, abatis (Québec)

poète, poëte (archaic)

djihad, jihad (rare)

However, such differentiations are not always possible, creating a lexicographic problem of recognition for an independent entry.

Ex: clé, clef

agasse, agace

The notion of "entry" stems from the linear order of traditional, printed dictionaries. As GENELEX is an electronic dictionary, it is not necessary to conserve this notion. All variants may be placed on the same level if we do not assign an entry header. In this case, only the information attached (regionalism, dating, frequency and style) will vary.

 

2.2. Free Graphic Variants of a Compound

The graphic form or forms of a Compound are deduced from the components rather than recorded directly as one or more Compound Umgs.

The graphic variations on the Compound have two origins:

variation on the separator preceding a component (the list of separators preceding the component is indicated in the relation R_Compose).

Ex: col-vert, colvert

separator before vert: hyphen or nothing.

entr'apercevoir, entrapercevoir

separator before apercevoir: apostrophe or nothing

haute-fidélité, haute fidélité

separator before fidélité: hyphen or space

graphic variation of the component

Ex: porte-clef, porte-clé

2 Umgs of a single component: clef and clé

We allow however, the possibility of not systematically applying the graphic variation of the component to the Compound.

Deriving the full set of graphic forms of a Compound consists of listing all combinations of the sources of variation.

In certain cases, the extensive list produced may be very long.
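The enumeration just described amounts to a Cartesian product over the sources of variation. In the sketch below each component is given, by assumption, as a pair (list of possible preceding separators, list of graphic variants); the function name compound_variants is hypothetical.

```python
from itertools import product


def compound_variants(components):
    """List every graphic form of a Compound.

    components: list of (separators, forms) pairs, one per Component,
    in linear order; the first Component normally has the separator "".
    """
    per_component = [
        [sep + form for sep in separators for form in forms]
        for separators, forms in components
    ]
    return ["".join(parts) for parts in product(*per_component)]


print(compound_variants([([""], ["porte"]), (["-"], ["clef", "clé"])]))
print(compound_variants([([""], ["col"]), (["-", ""], ["vert"])]))
print(compound_variants([([""], ["haute"]), (["-", " "], ["fidélité"])]))
```

The number of forms is the product of the numbers of alternatives per component, which is why the extensive list grows quickly; a restriction on the composition relation simply shortens one of the lists.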

 

2.3. Graphic Radicals : Combinatory Variants

We will take "radical" in its traditional sense of a "bare" (base) morpheme (form 0) pertinent for a Umg. Radicals are used in the process of inflection and derivation.

In certain cases, the Umg itself constitutes this form 0 and is noted under Radg0.

Ex: Umg : célèbre,

Radg0: célèbre -> célèbres

Note: by definition Radg0 always has the same value as the corresponding Umg. Referring to Radg0 consequently amounts to referring to the Umg. It is therefore not necessary to implement Radg0.

In other cases, the "bare" morpheme is distinct from the Umg. It is noted under Radg1, Radg2, ..., RadgN

Ex: Umg : chanter (to sing)

Radg1 : chant- -> chantais (I was singing)

Umg : appeler (to call)

Radg1 : appel- -> appelons (we call)

Radg2 : appell- -> appelle (he calls)

Umg : céder (to cede)

Radg1 : céd- -> cédons (we cede)

Radg2 : cèd- -> cède (he cedes)

Umg : aller (to go)

Radg1 : all- -> allons (we go)

Radg2 : ir- -> irai (I will go)

Umg : célèbre (famous, celebrated)

Radg1 : célébr- -> célébrité (celebrity)

Note that the radical is where combinatory variants (as opposed to free variants at the level of the Umg) are expressed.

All variants of a Um that are not free but combinatory are recorded under Radg and not under Umg.
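The role of numbered radicals in building inflected and derived forms can be sketched as follows. The table RADICALS and the functions radical and build are hypothetical; the radicals themselves are those of the examples above. In line with the Note on Radg0, radical 0 is not stored and is read from the Umg itself.

```python
# Graphic radicals from the examples above, keyed by Umg and radical number.
RADICALS = {
    "chanter": {1: "chant"},
    "appeler": {1: "appel", 2: "appell"},
    "céder": {1: "céd", 2: "cèd"},
    "aller": {1: "all", 2: "ir"},
    "célèbre": {1: "célébr"},
}


def radical(umg, number):
    """Radg0 is the Umg itself; other radicals are looked up by number."""
    if number == 0:
        return umg
    return RADICALS[umg][number]


def build(umg, number, ending):
    """Attach an inflectional ending or a suffix to the chosen radical."""
    return radical(umg, number) + ending


print(build("appeler", 1, "ons"))   # Radg1
print(build("appeler", 2, "e"))     # Radg2
print(build("aller", 2, "ai"))      # Radg2
print(build("célèbre", 0, "s"))     # Radg0, i.e. the Umg
print(build("célèbre", 1, "ité"))   # Radg1 used in derivation
```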

combinatory variants of Simple Morphological Units

Ex: Umg : beau ("beautiful," canonical form)

Radg0 : beau

Radg1 : bel

Umg : lorsque ("when," canonical form)

Radg0 : lorsque

Radg1 : lorsqu

Umg : de ("of," canonical form)

Radg0 : de

Radg1 : d

combinatory variants of affixes

Ex: Umg : tion (suffix, canonical form)

Radg0 : tion

Radg1 : ition

Radg2 : ation

impact of combinatory variants on Compound Morphological Units

A Compound may sometimes be formed based on only one of the graphical radicals of its components; in this case, the selected radical must be indicated in the composition relation.

Ex: Umg : militaire (A)

Radg0 : militaire

Radg1 : militaro

The Compound militaro-industriel is formed using the Radg1 form of militaire.

Combinatory variants may have a label in the attribute "contexte_var."

Ex: lorsqu contexte_var : before a vowel, followed by an apostrophe

 

3. Phonetic Form

A phonemic transcription identifies the pronunciation of a Morphological Unit and its variations that are regular, such as the pronunciation or non-pronunciation of the silent e. The phonetic transcriptions that correspond to these variants, essential for applications, are derivable from the phonemic transcription to which all pertinent lexical information is attached.

A phonemic transcription consists of a linear list of phonemes drawn from an alphabet. For French, we give the following example:

[p, t, k, b, d, g, f, s, S, v, z, Z, l, r, m, n, N, J, w, h, i, e, a, y, {, u, o, @, ', +, *, :]

In this alphabet, most of the signs are taken from the International Phonetic Alphabet (IPA). Others have a special interpretation.

Ex: /S/ in chic /Sik/

/Z/ in joie /Zua/

/N/ in ring /riN/

/J/ in signe /siJ/

/{/ in feu /f{/

/h/ in hérisser /herise/ to indicate the aspirated h.

/@/ in pelouse /p@luz/ and /'/ in sonnerie /son'ri/ to represent the two types of silent e's.

/+/ in rouage /ru+aZ/ to indicate that the preceding phoneme may have a syllabic value.

/*/ in sot /sot*/ to indicate that the final consonant is not pronounced.

/:/ in jaune /Zo:n/ to indicate an exceptional [o]

/an/, /en/, /in/, /on/, /yn/ placed before a consonant represent the nasal vowels in sentir, plainte, and conte.

/an*/, /en*/, /in*/, /on*/, /yn*/ placed at the end of a word represent the final nasal vowels in plan, mien, fin, bon and commun.

The mark of syllabication (underline) may be introduced in the phonemic transcription.

Ex: si_la_ba_sion*.
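A transcription can be checked against the alphabet given above with a few lines of Python. The sketch is only an illustration: the names ALPHABET, invalid_symbols and syllables are hypothetical, and the check is purely symbol by symbol, with no knowledge of which sequences are well formed.

```python
# The example alphabet for French given above, one character per sign.
ALPHABET = set("ptkbdgfsSvzZlrmnNJwhieay{uo@'+*:")
SYLLABLE_MARK = "_"


def invalid_symbols(transcription):
    """Return the signs that belong neither to the alphabet nor to the mark."""
    return [c for c in transcription if c not in ALPHABET and c != SYLLABLE_MARK]


def syllables(transcription):
    """Split a transcription on the syllabication mark."""
    return transcription.split(SYLLABLE_MARK)


print(invalid_symbols("Sik"))
print(invalid_symbols("taxi"))
print(syllables("si_la_ba_sion*"))
```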

 

3.1. Phonemic Morphological Units : Free Variants

When, for a single Morphological Unit, there are concurrent pronunciations (whatever the context) with no regular phonetic variations, we attribute to them all of the necessary distinct phonemic transcriptions.

Ex: ananas /anana/, /ananas/ 1 Um, 2 Umps

As variants, these forms are attached to the same Morphological Unit. Dating, frequency and style may distinguish the phonemic variants.

Ex: suspect /syspe/, /syspekt/ (rare)

but such is not always the case:

Ex: razzia /razia/, /radzia/

As for graphic Morphological Units, we do not consider the notion of entry header.

3.2. Free Phonemic Variants of a Compound

Refer to Section 2.2, Free Graphic Variants of a Compound.

3.3. Phonemic Radicals : Combinatory Variants

We treat the term "radical" in its traditional sense of a "bare" morpheme (form 0) pertinent for a Ump. The radicals are used in the process of inflection and derivation.

In certain cases, the Ump itself constitutes the form 0 and is therefore noted under Radp0.

Ex: Ump : /selebr/

Radp0 : /selebr/

In other cases, the "bare" morpheme is distinct from the Ump. Here we note it under Radp1, Radp2,..., RadpN.

Ex: Ump : /ale/

Radp1 : /al/ -> /alonz*/

Radp2 : /ir/ -> /ire/

Note: By definition Radp0 always has the same value as the corresponding Ump. Referring to Radp0 consequently amounts to referring to the Ump. It is therefore not necessary to implement Radp0.

Note also that the radical is where combinatory variants (as opposed to free variants at the level of the Ump) are expressed. For this reason the combinatory variants of affixes are recorded under Radp and not Ump.

Ex: Ump: /sion*/ (canonical form)

Radp0 : /sion*/

Radp1 : /ision*/

Radp2 : /asion*/

4. Structural Symmetry - Graphic System / Phonetic System

For a given Morphological Unit we observe a symmetric axis between the graphic representation of a lemma and its phonemic form. We find on both sides of this axis the graphic and phonemic variations as well as the graphic and phonemic systems of inflection. The combinations of Morphological Features are shared by these two systems.

 

III. Grammatical Categories

1. General Points

The consortium long debated whether GENELEX should include all the traditional parts of speech (nouns, verbs, articles, adjectives, pronouns, prepositions, adverbs, conjunctions, interjections and particles).

The principal debates have concerned the Category "Determiner," the grouping of the Categories "Noun" and "Adjective," and the status of Past Participles.

2. Category: Determiner

Syntactic theories do not agree on the definition (category, function, or both) and range of determiners, although they all appeal to the notion. All agree on the traditional members: articles, adjectives, indefinites, cardinal numbers, possessives, demonstratives and interrogatives. On the other hand, some include and others exclude the forms with de (adverbs, or SN of quantity).

simple forms : adverbs of quantity

Ex: Peu de gens (few people)

Beaucoup de pain (a lot of bread)

complex forms: SN of quantity (N or Pro Head).

Ex: Un peu de pain (a little (bit) of bread)

La moitié des gens (half of the people)

La plupart des gens (most of the people)

Certaines des personnes (some of the people)

We have chosen to introduce Determiner ("Déterminant") in the list of Categories. The traditional possessive adjectives have played a decisive role in our choice. Indeed, the difference in the written forms of notre / nôtre is not a free variation (as we have defined it) in the realization of the adjective. Rather, these are two distinct lemmas distinguished only by Category (refer to Section I.1.2, Criteria for Splitting Morphological Units).

Ex: Notre chien. (our dog)

Il est nôtre. (it is ours)

We have adopted the most common point of view:

i. We define the following Categories: Noun, Adjective, Determiner, Adverb, Verb, Preposition, Pronoun, Interjection, Conjunction and Particle.

ii. As with Adverbs, Determiners may be considered as a category or as a function. This function, however, does not appear in the morphological layer of the dictionary.

iii. The Determiners cover the traditional categories of articles, interrogative and demonstrative adjectives, indefinite adjectives, possessives and cardinal numbers.

Ex: les enfants (the children)

deux enfants (two children)

certains choix (certain choices)

notre chien (our dog)

We exclude the forms introducing de (Adverbs, Pronouns and Nouns), as the analyses of quantitative and partitive expressions are numerous and varied. Note that the behavior of the Adverbs, Nouns and Pronouns which may appear in these expressions is in any case described in the syntactic layer. This information is therefore not lost. Those who wish to consider Adverbs, Nouns and Pronouns of quantity as Determiners may consequently reinsert the pertinent information in an application dictionary and verify that the Ums of these categories have the following syntactic properties:

1- SN[SPEC[#] ... N] or SN[...#... de SN]

Ex: deux enfants or la moitié du pain

two children or half of the bread

2- P[il y en a #]

Ex: il y en a deux, il y en a la moitié

there are two, there is half

Complex Determiners may be processed in GENELEX as all complex forms. In other words, they may be recorded at various levels of the model: at the morphological level (la moitié, la plupart, un peu), at the syntactic level (deux douzaines de kilos de pommes de terre), and at the semantic level (un abîme de plaisirs). As such, the Dét Category behaves like Advs, Preps and Conjs, which may also have simple or complex surface realizations.

Ex: Cette (this) Dét

La moitié (half)

Aussitôt (as soon as) Adv

A brûle-pourpoint (point blank)

Dans (in) Prep

Dans le giron de (in the bosom of)

Quand (when) Conj

Au moment où (at the moment when)

 

3. Noun-Adjective Distinction

We could well have grouped Nouns and adjectives in the same Category (Noun-adj), as several morphological and syntactic reasons support this treatment [Noailly 85]. The two categories carry the marks of gender and number. Adjectives may also be nominalized. Nouns may appear in the same positions as adjectives.

Ex: Les verts ont gagné.

The Greens won.

Un veston très homme d'affaires (cf J-M Marandin)

A very businessman(-like) jacket

In fact, this question revives the ancient debate on the distinction between nature and function. Only nature (absolute property) should be recorded in the lexicon [Milner 90]. Therefore, in the case of Nouns and adjectives, the fact that certain functions are defining criteria for a category does not exclude another category from satisfying the requirements. Although a Noun may fulfill the function of a nominal modifier (epithet) or an object of a copula (attribute), it nevertheless remains fundamentally a Noun.

On the other hand, not all adjectives may be nominalized.

in the case of adjectives having a pronominal equivalent (exact or approximate)

Ex: *Les quelques sont venus. Quelques-uns sont venus.

*The some came. Some came.

*Les certains sont venus. Certains sont venus.

*The certain (of them) came. Certain (of them) came.

in the case of ellipses, as opposed to nominalizations.

Ex: Le rouge est trop grand pour elle.

The red (one) is too big for her.

Les deux sont venus.

The two (of them) came.

We distinguish Nouns from Adjectives in the list of the Grammatical Categories.

The Category recorded in morphology, however, could be identical to (in the majority of cases) or distinct from the Category recorded in syntax. In terms of integrity, this difference is authorized only if there is a gap between the Grammatical Category and the functional Category (verb / adverb, for example, is unlikely).

4. The Status of Past Participles

We could well have processed Past Participles systematically as deverbative adjectives or as Inflected Forms of verbs.

In fact, not all Past Participles can be employed as adjectives. This is the case with participles formed from intransitive verbs (in the sense of Dubois).

Ex: plu (pleased)

ri (laughed)

téléphoné (telephoned)

We are therefore forced to place Past Participles in the Inflected Forms of verbs and indeed to introduce the Morphological Feature of gender for verbs although it may not necessarily be pertinent for any of the other Inflected Forms.

 

5. Grammatical Subcategories

Some of the grammatical categories that we have included in the GENELEX model may be sub-typed; thus we may differentiate, for example, proper nouns from common nouns, co-ordinating conjunctions from subordinating conjunctions, etc. We have decided to take these differences into account by attaching a Grammatical Sub-Category to the Category that it specifies.

We propose the following values for the attribute Grammatical Sub-Category: Proper, Common, Possessive, Demonstrative, Partitive, Definite, Indefinite, Cardinal, Ordinal, Exclamative, Qualifying, Interrogative, Relative, Coordination, Subordination, Weak Personal, Strong Personal, Impersonal.

Of course, not all combinations of Category/Sub-Category are legitimate. The list below gives the authorised Sub-Categories for each Grammatical Category.

Grammatical Category: Valid Sub-Category Values

Verb: none

Noun: Proper, Common

Adjective: Indefinite, Possessive, Interrogative, Cardinal, Ordinal, Exclamative, Qualifying

Adverb: none

Determiner: Possessive, Demonstrative, Partitive, Definite, Indefinite, Cardinal, Ordinal, Exclamative, Interrogative, Relative

Pronoun: Weak Personal, Strong Personal, Impersonal, Indefinite, Relative, Possessive, Demonstrative, Partitive, Exclamative, Interrogative

Preposition: none

Conjunction: Coordination, Subordination

Interjection: none

Particle: none

 

 

IV. Morphological Features

1. General Points

Certain units do not carry any Morphological Features: conjunction, interjection, etc.

The model for the French language possesses the following Morphological Features: gender, number, number-possessor, person, tense, and mood. It is evident that each language defines its own system of Morphological Features.

Morphological features are difficult to define except by consulting an extensive list of their values and of the categories to which they are applicable.

2. Combination of Morphological Features

2.1. General Points

Morphological features can be combined following different patterns that are applicable for a given type of Um:

i. gender, number.

Ex: nouns, adjectives

-> table [feminine, singular]

ii. person, gender, number.

Ex: personal pronouns

-> elle [3rd person, feminine, singular]

she

iii. person, gender, number, number-possessor.

Ex: possessive adjectives

-> mon [1st person, masculine, singular, singular]

my

iv. mood, tense, person, gender, number.

Ex: verbs

-> mangerais [conditional, present, 1st/2nd, -, singular]

I/you would eat

-> mangées [participle, past, -, feminine, plural]

eaten

2.2. Relation between combination of Morphological Features and Grammatical Category

The various grammatical categories of nouns, adjectives, etc, are partially defined by their morphological characteristics. The general characteristics are as follows:

adjectives, determiners, nouns and pronouns carry the mark of gender and number.

verbs carry the mark of mood, tense, person, gender and number.

adverbs (except tout), prepositions, conjunctions and interjections have no Morphological Features.

The combination of Morphological Features and the Category are therefore related. There is not, however, a complete overlap between the Category and the authorized combinations of Morphological Features. To illustrate, the adverb tout carries the mark of gender and number in very specific environments (before a feminine word that begins with a consonant). This particular case may be handled either by an adverb that may be inflected or by an adverb having two contextual variants, tout and toute.

It is nevertheless possible to propose to the lexicographer default combinations of Morphological Features according to the Grammatical Category. We must nonetheless authorize these combinations in order to guarantee the independence of combinations of Morphological Features from their Category.

Noun, Adjective, Determiner, Pronoun :

combination G(ender) N(umber)

Verb:

combination M(ood) T(ense) P(erson) G(ender) N(umber)
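The defaults above can be sketched as a table that the lexicographer may override. This is an illustration under stated assumptions: the names DEFAULT_FEATURES and feature_combination are hypothetical, and the override mechanism is one possible reading of the independence requirement, shown here with the adverb tout.

```python
# Default combinations of Morphological Features per Grammatical Category.
GN = ("gender", "number")
MTPGN = ("mood", "tense", "person", "gender", "number")

DEFAULT_FEATURES: dict[str, tuple[str, ...]] = {
    "Noun": GN,
    "Adjective": GN,
    "Determiner": GN,
    "Pronoun": GN,
    "Verb": MTPGN,
}


def feature_combination(category, override=None):
    """Return the lexicographer's override if given, else the default.

    Categories without a default (adverbs, prepositions, ...) yield an
    empty combination.
    """
    if override is not None:
        return tuple(override)
    return DEFAULT_FEATURES.get(category, ())


# The adverb tout, treated as an inflected adverb, overrides the empty default.
tout_features = feature_combination("Adverb", ["gender", "number"])
```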

 

3. Gender

{masculine, feminine, neuter}

These values depend on the Category to which gender is applied.

3.1. Gender of Nouns

{masculine, feminine}

In French, all nouns have a specific, arbitrary gender.

3.1.1. Gender of animate beings

Gender is not necessarily determined by the sex of the denoted person/animal.

grammatical inversion of gender/sex

Ex: un bas-bleu / une petite frappe

a pedantic woman (m)/ a hooligan (f)

a single grammatical gender for both sexes (epicene nouns)

Ex: Monsieur le Ministre / Madame le Ministre

Mr. Minister (m) / Madam Minister (m)

Un agent

(an agent -m)

Une sentinelle

(a sentinel -f)

Une victime

(a victim -f)

When there is a match between gender and sex (masculine-male, feminine-female), we have a case of natural gender. However, we must also note that natural gender itself covers two cases:

i. no morphological variation on the same Morphological Unit (i.e., two completely separate words) (refer to the definition of a Morphological Unit below). In this case, nothing prevents the choice of the gender from being completely arbitrary.

Ex: un étalon / une jument 2 Ums

(a stallion, a mare)

un pédéraste / une lesbienne 2 Ums

(a gay (man), a lesbian)

ii. morphological variation on a single Morphological Unit. Only in this case is gender determined by the sex of the denoted person/animal.

Ex: un infirmier / une infirmière 1 Um

a (male) nurse/ a (female) nurse

un dentiste / une dentiste 1 Um

a (male) dentist/ a (female) dentist

 

3.1.2. Gender of Inanimate Objects

"The separation of inanimate nouns into feminine and masculine , often interpreted as natural, stems in fact from the metaphorical male/female distinction which influences our perception. In such a manner, certain objects or concepts are conceived, under the influence of an arbitrary grammatical gender, as in essence male or female. In fact, gender often carries a symbolic value. Ex: the opposition between la lune (the moon), symbol of femininity, and le soleil (the sun), symbol of masculinity." [Marina Yaguello 1981].

Ex: Un truc / une chose

a thing (ms) / a thing (fs)

Le fleuve / une rivière

a river (ms) / a river (fs)

The morphological variations in gender of some inanimate nouns may fluctuate arbitrarily.

Ex: Un / une globule 1 Um

a globule (m,f)

Un / une interface 1 Um

an interface (m, f)

We distinguish two types of variations in the gender of nouns. One is a function of the sex of the denoted (animate nouns). The other is a free variation of an arbitrary gender (inanimate nouns).

 

In the model, we may:

decide not to take into account such a distinction in morphology. We may compare variation in morphological gender to the denotative constraint "animate"/"inanimate" in semantics. It is for the user to draw the necessary conclusions. In this case, we do not distinguish in morphology

Ex: dentiste {dentiste, dentiste, dentistes, dentistes}

-> ms, fs, mp, fp 1 Umg

from

interface {interface, interface, interfaces, interfaces}

-> ms, fs, mp, fp 1 Umg

on the other hand, we may decide to take into account the said distinction in morphology. We consider in this case that the variants of gender of inanimate nouns may be processed as orthographic or phonetic variants by associating two Umgs with a Morphological Unit: one for the masculine, and the other for the feminine form. The same goes for two Umps.

dentiste {dentiste, dentiste, dentistes, dentistes}

-> ms, fs, mp, fp 1 Umg

interface 1 Um

{interface, interfaces}

-> ms, mp

{interface, interfaces}

-> fs, fp 2 Umg
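The two modelling options can be compared side by side. The sketch below is illustrative only: the classes Um and Umg and the helper feature_codes are hypothetical names, and the feature codes ms, fs, mp, fp follow the abbreviations used in the examples above.

```python
from dataclasses import dataclass


@dataclass
class Umg:
    """A graphic Morphological Unit: feature code -> graphic form."""
    forms: dict[str, str]


@dataclass
class Um:
    lemma: str
    umgs: list[Umg]


# Option 1: no distinction in morphology, a single Umg with four forms.
interface_flat = Um("interface", [
    Umg({"ms": "interface", "fs": "interface",
         "mp": "interfaces", "fp": "interfaces"}),
])

# Option 2: free gender variation handled like a graphic variant, two Umgs.
interface_split = Um("interface", [
    Umg({"ms": "interface", "mp": "interfaces"}),
    Umg({"fs": "interface", "fp": "interfaces"}),
])


def feature_codes(um: Um) -> set[str]:
    """All feature combinations covered by the Umgs of a Morphological Unit."""
    return {code for umg in um.umgs for code in umg.forms}
```

Both encodings cover the same four feature combinations; they differ only in how many Umgs are attached to the Morphological Unit.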

 

3.2. Gender of Adjectives

{masculine, feminine}

Adjectives receive their gender and number from the noun that they modify.

Ex: un être grand

a big (ms) creature (ms)

une personne grande

a big (fs) person (fs)

 

Epicene adjectives do not undergo any spelling change according to gender.

Ex: un être habile

an adept (ms) creature (ms)

une personne habile

an adept (fs) person (fs)

This is also true for cardinal numbers.

Ex: Les deux garçons

The two (mp) boys (mp)

Les deux filles

The two (fp) girls (fp)

It is also the case for "invariable" adjectives (which nonetheless carry Morphological Features), either because they are words of foreign origin or because they are of "expressive formation" (cf. Grevisse).

Ex: Des chattes angora.

some angora (fp, but no mark) cats (fp)

une fille gnangnan.

a whining (fs, but no mark) girl (fs)

Certain adjectives have only a single gender (cf. Grevisse):

adjectives with limited usage.

Ex: bibliothèque vaticane

(the) Vatican (fs) Library (fs)

adjectives that apply only to animate nouns of a single sex.

Ex: Une femme enceinte

a pregnant (fs) woman (fs)

Un jeune homme benêt

a silly (ms) young man (ms)

Une poule pondeuse

a prolific (fs) hen (fs)

 

3.3. Gender of Determiners

{masculine, feminine}

Determiners receive their gender from the noun that they modify.

Ex: Le conflit

(the (ms) conflict (ms))

La bataille

(the (fs) battle (fs))

Les conflits

(the (mp) conflicts (mp))

Les batailles

(the (fp) battles (fp))

For a given written form, certain determiners have only a single value in gender (mainte, certaine, quelle, ma, le, etc), others have several (les, etc).

3.4. Gender of Past Participles

{masculine, feminine}

(Refer also to the Section III.4 Status of Past Participles.)

The rules of agreement for Past Participles determine the authorized Inflected Form among the four possible Inflected Forms of a participle. This is valid for both gender and number.

Auxiliary verbs (être or avoir) and the syntactic Sub-Category (direct or indirect transitives, essential pronominals, direct or indirect reciprocal pronominals, impersonals) of a verb are all criteria in these rules.

If a lexicographer wants to use this information to code Past Participles in morphology, he must take into account that this information, and in general all information on the formation of complex tenses, is attached to the syntactic layer of the model. In effect, the existence of complex tenses and their auxiliary or auxiliaries is considered a syntactic phenomenon.

The choice is left up to the lexicographer. He must, however, adhere to the constraint of coherence of the model (correspondence of a group of inflected past participial forms to the syntactic Sub-Category) and refrain from reworking the model.

These constraints of coherence are based on:

i. variable Past Participles

direct transitives with être (to be) or avoir (to have),

certain pronominals

indirect transitives with être

intransitives with être

Ex:

(est / a) mangé, (est/ a) mangée, (sont/ a) mangés, (sont/ a) mangées

(is/has) eaten (ms), (is/ has) eaten (fs), (are/has) eaten (mp), (are/has) eaten (fp)

s'est fié, s'est fiée, se sont fiés, se sont fiées

trusted (ms), trusted (fs), trusted (mp), trusted (fp)

le jury s'est concerté, l'assemblée s'est concertée, les jurés se sont concertés, les victimes se sont concertées

the jury (ms) consulted (ms), the assembly (fs) consulted (fs), the jurors (mp) consulted (mp), the victims (fp) consulted (fp)

s'est soigné, s'est soignée, se sont soignés, se sont soignées

look after oneself (ms), oneself (fs), themselves (mp), themselves (fp)

est apparu, est apparue, sont apparus, sont apparues

appeared (ms), (fs), (mp), (fp)

est mort, est morte, sont morts, sont mortes

died (ms), (fs), (mp), (fp)

ii. invariable Past Participles

indirect transitives with avoir

intransitives with avoir

certain pronominals

impersonals

Ex: a appartenu, a fonctionné, se sont plu, a fallu

belonged to, worked, pleased, had to

Note: We are aware that the categories of direct transitives, indirect transitives, and intransitives are very schematic and do not alone suffice to take into account all usage.

The defectives are nevertheless registered.

iii. no Past Participles

Ex: gésir

to lie (down)
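The constraints of coherence above can be sketched as a check between the syntactic Sub-Category and the set of participial forms recorded in morphology. This is a rough illustration: the names PARTICIPLE_BEHAVIOUR, expected_behaviour and is_coherent are hypothetical, the Sub-Category labels are simplified strings, and pronominals are left undecided because the text says only that "certain" pronominals fall in each class.

```python
from typing import Optional, Set

VARIABLE = "variable"
INVARIABLE = "invariable"

# (syntactic Sub-Category, auxiliary) -> behaviour of the Past Participle
PARTICIPLE_BEHAVIOUR = {
    ("direct transitive", "être"): VARIABLE,
    ("direct transitive", "avoir"): VARIABLE,
    ("indirect transitive", "être"): VARIABLE,
    ("intransitive", "être"): VARIABLE,
    ("indirect transitive", "avoir"): INVARIABLE,
    ("intransitive", "avoir"): INVARIABLE,
}


def expected_behaviour(subcategory: str, auxiliary: str) -> Optional[str]:
    """Return the expected behaviour, or None when it is decided verb by verb."""
    if subcategory == "impersonal":
        return INVARIABLE
    return PARTICIPLE_BEHAVIOUR.get((subcategory, auxiliary))


def is_coherent(participle_forms: Set[str], subcategory: str, auxiliary: str) -> bool:
    """Check that the recorded participial forms match the syntactic coding."""
    expected = expected_behaviour(subcategory, auxiliary)
    if expected is None:  # e.g. pronominals: no constraint in this sketch
        return True
    if expected == INVARIABLE:
        return len(participle_forms) == 1
    return len(participle_forms) > 1
```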

 

3.5. Gender of Pronouns

{masculine, feminine, neuter}

When a pronoun refers to a noun (anaphorically or deictically), it takes the gender of the noun.

Ex: La boîte, tu la mettras sur la table.

The box (fs), you will put it (fs) on the table.

When a pronoun refers to an utterance (anaphoric usage), it is masculine. There are those who would consider it neuter in spite of the fact that the masculine form is the neutralizing form in French. Recall that reference is not necessarily registered in the Morphological Feature of a word (pronoun on morphologically singular, but semantically plural). GENELEX offers the neuter gender (cf. demonstratives and quoi) to the lexicographer to use as he sees fit.

Ex:

-La société devrait faire de gros bénéfices cette année, nous le souhaitons vivement.

The company should be very profitable this year, we hope for it fervently.

-Tu vois je te l' avais bien dit.

You see that I told you about it.

-Mais ce que tu veux faire est irréalisable.

But what you want to do cannot be carried out.

-Rien n'est facile dans la vie.

Nothing is easy in life.

When a pronoun refers to an entity being uttered (deictic usage), it is masculine, feminine or neuter (for demonstratives).

Ex:

Cet article ne date pas d'hier, lis plutôt celui-ci.

This is not yesterday's article (ms), read this one (ms) instead.

Cette tarte ne date pas d'hier, prends plutôt celle-ci.

This is not yesterday's tart (fs), take this one (fs) instead.

Ça n'est pas très surprenant. Ecoute plutôt ceci.

This (ns) is not very surprising. Listen to this (ns) instead.

For a given form, certain pronouns have only a single value in gender (il, le, lequel, laquelle, chacun, cela, etc); others may have several (je, les, lui, qui, que, etc).

3.6. Gender of Possessives (Determiners, Adjectives or Pronouns)

{masculine, feminine}

In French, the possessives receive their gender from the word that they modify. The gender of the possessor plays no role, reducing the situation into the following general case:

possessive determiners have the gender of the noun that they modify.

Ex: mon chien, ma chienne

my (ms) dog (ms), my (fs) bitch (fs)

possessive adjectives have the gender of the noun to which they refer.

Ex: ce chien est mien, cette chienne est mienne

this dog (ms) is mine (ms), this bitch (fs) is mine (fs)

possessive pronouns have either the gender of the noun, of the utterance to which they refer (textual anaphora), or of the entity identified deictically.

Ex: le mien, la mienne

mine (ms), mine (fs)

For a given written form, certain possessives may have two values in gender (votre, notre, nôtre, vôtre).

 

4. Number

{singular, plural}

4.1. Number of Nouns

The majority of nouns vary according to number. Some, however, are used only in the singular (le sud -the South), and others, only in the plural (fiançailles - engagement).

The sense of a noun may vary with number, as the type of determination is often linked to number, which often plays the role of an operator (refer to Section I.1.2.3 Number as a Criterion for Splitting). We must therefore be vigilant in the description of possible forms by anticipating all associated inflections and all cases in which this operator is applicable.

Ex: Le travail fatigue.

Work is tiring.

Les travaux sont à rendre.

The work is to be handed in.

 

La délibération a duré deux heures.

The deliberation lasted two hours.

Les délibérations ont satisfait tout le monde.

The deliberations satisfied everyone.

 

Il cultive des orchidacées.

He cultivates orchids.

J'ai acheté une orchidacée.

I bought an orchid (plant).

 

Il a annoncé ses fiançailles.

He announced his engagement (fp in French).

*Il a annoncé sa fiançaille.

He announced his engagement (*fs in French).

Some nouns are identical in the singular and the plural.

Ex: souris (mouse - fs, fp)

4.2. Number of Adjectives

Adjectives (except cardinal numbers) receive their number from the noun that they modify.

Certain Adjectives with limited usage are used only in the plural (cf. Grevisse).

Ex: dépouilles opimes

rich spoils

humeurs peccantes

foul mood

fourches caudines

in the expression passer sous les fourches caudines,

to be forced into shameful conditions

Others have the same spelling in the masculine singular and the masculine plural.

Ex: courtois (courteous - ms, mp)

This is also the case of invariable Adjectives (refer to Section IV.3.2 Gender of Adjectives).

Cardinal numbers except un (one) are always plural.

4.3. Number of Determiners

Determiners carry the number of the noun that they specify (except in the cases of compound determination that we do not consider as determiners in morphology).

Ex: Les oeufs.

The (mp) eggs (mp).

Une douzaine d'oeufs.

A (fs) dozen (fs) of eggs.

Numeral determiners are always plural except for un. The same is true for the indefinite determiners plusieurs, divers, différents. The determiners aucun and chaque are always singular.

 

4.4. Number of Verbs

It is associated with person.

4.5. Number of Past Participles

Refer to Gender of Past Participles.

Some past participles do not present spelling changes between the masculine singular and the masculine plural.

Ex: inclus (included - ms, mp)

4.6. Number of Pronouns

Pronouns take the number of the noun to which they refer.

Ex: Ces romans ne sont pas bien, lis plutôt ceux-là.

These novels (mp) are not good, read those (mp) instead.

When a pronoun refers to an utterance (anaphoric usage), it is singular.

Ex: Je me suis trompé, c'est vrai.

I was wrong, it's true.

When a pronoun refers to one or several entities being identified deictically, it is either singular or plural.

Ex: Ces livres ne sont pas intéressants, lis plutôt ceux-là.

These books (mp) are not interesting, read those (mp) instead.

For a given form, certain pronouns have only a single value in number (autrui, rien, chacun, ils, elles, les, ceux, auxquels, lesquels, les miens, etc), while others may have both values (qui, que, dont, etc).

Some pronouns are always singular: autrui, rien, chacun.

 

4.7. Number of Possessives (Determiners, Adjectives, or Pronouns)

In French, the possessives carry the number of the possessor and that which is possessed.

Possessive determiners therefore carry the number of the noun that they specify as well as the number of the possessor.

Ex: mon chien, mes chiens, notre chien, vos chiens

my (possor s, possed s) dog (s); my (possor s, possed p) dogs (p)

our (possor p, possed s) dog (s); your (possor p, possed p) dogs (p).

The possessive adjectives therefore carry the number of the noun that they modify as well as the number of the possessor.

Ex: ce chien est mien, ces chiens sont miens, ce chien est nôtre

that dog (s) is mine (possor s, possed s),

those dogs (p) are mine (possor s, possed p),

that dog (s) is ours (possor p, possed s)

The possessive pronouns carry the number of the noun that they represent (textual anaphora) or the number of the entity referred to in a statement (deictic use), as well as the number of the possessor.

Ex: le mien, les miens, le nôtre, les nôtres

mine (possor s, possed s),

mine (possor s, possed p),

ours (possor p, possed s),

ours (possor p, possed p)
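The double number carried by possessives can be sketched as two separate fields. The record below is a minimal illustration for a few first-person possessive determiners: the names Possessive, POSSESSIVE_DETERMINERS and forms_for are hypothetical, and listing both genders for a form is an assumption of the example about how forms with two gender values might be encoded.

```python
from typing import NamedTuple, Tuple


class Possessive(NamedTuple):
    person: int
    genders: Tuple[str, ...]   # gender(s) of what is possessed
    number: str                # number of what is possessed
    number_possessor: str      # number of the possessor


BOTH = ("masculine", "feminine")

POSSESSIVE_DETERMINERS = {
    "mon": Possessive(1, ("masculine",), "singular", "singular"),
    "ma": Possessive(1, ("feminine",), "singular", "singular"),
    "mes": Possessive(1, BOTH, "plural", "singular"),
    "notre": Possessive(1, BOTH, "singular", "plural"),
    "nos": Possessive(1, BOTH, "plural", "plural"),
}


def forms_for(number: str, number_possessor: str) -> list[str]:
    """Forms matching a number of the possessed and a number of the possessor."""
    return sorted(
        form
        for form, features in POSSESSIVE_DETERMINERS.items()
        if features.number == number
        and features.number_possessor == number_possessor
    )
```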

 

 

5. Mood

{indicative, subjunctive, conditional, imperative, infinitive, participle}

Lacking a consensus on another model, we have taken mood in the traditional sense. We are aware, however, that the distinction between mood/tense is at times arbitrary, and that it is possible to group together all the pertinent values while prohibiting certain combinations. However, this distinction, besides its consensual, if not traditional, aspect, is actually very useful for constraining morphological behaviour from syntax. In this way, we may distinguish indicative noun clauses from subjunctive noun clauses.

We represent arbitrarily the imperative by mood = imperative, tense = present, the past and present participles respectively by mood = participle, tense = past and by mood = participle, tense = present.

6. Tense

{present, imperfect, past historic, future, past}

Only simple tenses are noted. Complex tenses formed regularly from simple tenses have been excluded.

The Past Participle is considered as a verbal form, and is not handled systematically as an Adjective. In the morphological model, it is the residual trace of complex tenses which are handled in the syntactic part of the model.

Tense is considered as a syntactic property and is represented in the syntactic layer of the model.

7. Person

{1st, 2nd, 3rd}

7.1. Person of Pronouns

Personal pronouns, as the term implies, always express person.

Relative and interrogative pronouns have forms for virtually all persons possible (except lequel, auquel, duquel, quel, which are only in the third person).

7.2. Person of Verbs

Theories diverge on whether an infinitive with a subject carries the mark of person. GENELEX accepts both points of view, as it does not require that person be applied to all moods (cf. participial mood).

Some verbs are expressed only in the third person:

impersonal verbs (pleuvoir, falloir, s'ensuivre, etc)

certain defective verbs (seoir, échoir, etc).

V. Inflected Forms

1. Definition

By "Graphic Inflected Form" we mean a couplet of graphic form-combination of Morphological Features produced by a system of inflection.

Ex: buvait M: indicative, T: imperfect, P: 3, N: singular

boire M: infinitive, T: present

boire G: masculine, N: singular

grandes G: feminine, N: plural

grand G: masculine, N: singular

chevaux G: masculine, N: plural

cheval G: masculine, N: singular

Notice that some Inflected Forms are identical to the graphic Morphological Unit. They are nonetheless Inflected Forms. Our definition of Inflected Forms is therefore not a naive definition of "that which is conjugated," "that which is put in the feminine," "that which is pluralized," because the infinitive of Verbs, the singular, masculine form of Adjectives and the singular of Nouns with a single gender are considered as Inflected Forms as well.
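The couplet defined above can be written down directly. In this minimal sketch, the class InflectedForm and the helper inflected are hypothetical names; the point illustrated is that the same graphic form paired with different feature combinations yields distinct Inflected Forms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InflectedForm:
    """A couplet: graphic form + combination of Morphological Features."""
    graphic_form: str
    features: tuple[tuple[str, str], ...]


def inflected(graphic_form: str, **features: str) -> InflectedForm:
    # Sort the features so that equality does not depend on argument order.
    return InflectedForm(graphic_form, tuple(sorted(features.items())))


boire_verb = inflected("boire", mood="infinitive", tense="present")
boire_noun = inflected("boire", gender="masculine", number="singular")
chevaux = inflected("chevaux", gender="masculine", number="plural")
```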

Only Verbs, Adjectives, adjectival Determiners and Nouns belong in a system of inflection.

Although relative and interrogative pronouns (except all forms of quel) possess Morphological Features, they are not produced by a system of inflection.

Ex: qui G: masculine, N: singular

G: feminine, N: singular

G: masculine, N: plural

G: feminine, N: plural

As for prepositions (except contracted forms), adverbs (except tout in certain conditions), interjections and conjunctions, they do not have Morphological Features and are considered "invariable."

 

2. Inflectional System of Simple Words

2.1. General Points

By inflection we mean the description of an equivalence class of Inflected Forms. Inflection is therefore not conceived as an action (conjugation, feminization, pluralization), but as a state of existence. An equivalence class of Inflected Forms is viewed in a static and not a dynamic manner.

This definition presents two advantages:

1. Information on the graphic form-combination couplet of inflectional features has been completely flattened. Indeed, flattening information can only be advantageous for a generic model.

2. We represent in the same simple model phenomena that are very different and at times very complex.

arbitrary variation

Ex: free variation of gender of nouns

interface interface (G: masculine, N: singular)

interface (G: feminine, N: singular)

interfaces (G: masculine, N: plural)

interfaces (G: feminine, N: plural)

regular variations

Ex: grand, grands, grande, grandes

big - ms, mp, fs, fp

irregular variations

Ex: aller

to go (stem varies -- all- , v- and ir-)

oeil, yeux

eye, eyes

specific variations

Ex: Nouns that have different genders according to number: amours, délices, orgues (these words are masculine in the singular and feminine in the plural).

absence of variation

Nouns without variations in gender are processed in the same manner as nouns with variations: in both cases, the Morphological Feature of gender is pertinent. Only the number of possible values for this feature changes.

Ex: cheval cheval (G: masculine, N: singular)

chevaux (G: masculine, N: plural)

The masculine form is the only possible value for the feature gender.

defectives

Ex: Adjectives or nouns without singular: fiançailles, obsèques, opimes

Adjectives or nouns without plural: sauvette, boire

Adjectives without masculine : vaticane

Adjectives without feminine: aquilin

defective verbs: frire
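The static view described above can be sketched as a plain set of couplets per Morphological Unit, in which free variation, absence of variation and defectiveness differ only in which couplets are present. The names INFLECTION_CLASSES, genders and numbers are hypothetical; the feminine gender given for fiançailles is supplied for the example.

```python
# Each class is a static set of (graphic form, gender, number) couplets.
INFLECTION_CLASSES = {
    # free variation of gender
    "interface": {
        ("interface", "masculine", "singular"),
        ("interface", "feminine", "singular"),
        ("interfaces", "masculine", "plural"),
        ("interfaces", "feminine", "plural"),
    },
    # absence of variation in gender: a single possible value
    "cheval": {
        ("cheval", "masculine", "singular"),
        ("chevaux", "masculine", "plural"),
    },
    # defective: no singular
    "fiançailles": {
        ("fiançailles", "feminine", "plural"),
    },
}


def genders(lemma: str) -> set[str]:
    """Possible values of the gender feature for a Morphological Unit."""
    return {gender for _, gender, _ in INFLECTION_CLASSES[lemma]}


def numbers(lemma: str) -> set[str]:
    """Possible values of the number feature for a Morphological Unit."""
    return {number for _, _, number in INFLECTION_CLASSES[lemma]}
```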

We call the method which permits us to construct an equivalence class of Inflected Forms for a Morphological Unit a system of inflection ("Mode de Flexion").

The Inflected Forms derived from a Umg are more or less numerous according to the Category, the maximum being 51.

As Inflected Forms may be derived from a Umg and since there is a great number of common systems of derivation, we do not represent Inflected Forms in the dictionary. Here we adopt the method of traditional dictionaries.

However, we find it necessary to know the system of inflection of a Morphological Unit as it is a piece of information in itself (regular, marginal, and irregular formation) and because it permits us to generate the Inflected Forms.

2.2. Inflection of Determiners and Pronouns

We remind the reader that in GENELEX, Morphological Units are essentially defined by a Grammatical Category and Sub-Category, and a paradigm of inflection. On this basis, we can group {je, tu, il, elle, nous, vous, ils, elles} {I, you, he, she, we, you, they (m), they (f)} or {ma, ta, sa, mon, ton, son, mes, tes, ses, notre, votre, leur, nos, vos, leurs} {my, your, his/her (possessed fs, possessor s); my, your, his/her (possessed ms, possessor s); my, your, his/her (possessed m/fp, possessor s); our, your, their (possessed m/fs, possessor p); our, your, their (possessed m/fp, possessor p)} under the same Morphological Unit and consider them as various Inflected Forms of the same entry, since only the combination of Morphological Features distinguishes them. Notice also that the names "subject personal pronoun," "strong personal pronoun," and "impersonal pronoun" correspond precisely to paradigms of inflection.

{je, tu, il, elle, nous, vous, ils, elles} subject personal

{I, you, he, she, it, we, you, they} subject personal

{me, te, le, la, nous, vous, les} weak direct object personal

{me, te, lui, nous, vous, leur} weak indirect object personal

{moi, toi, lui, elle, nous, vous, eux, elles} strong personal

{me, you, him, her, it, us, you, them} strong personal

{me, te, se, nous, vous, se} weak reflexive

{soi} strong reflexive

{myself, yourself, himself, herself, itself, ourselves, yourselves, themselves} reflexive

{oneself} reflexive

{il} and {on} impersonal

{it} and {one} impersonal

These entries are identified, as other entries, by their canonical form (il, son). Remember that an electronic dictionary, as opposed to a traditional dictionary, may be consulted by Inflected Forms.

Finally, those who wish to note compounds in their dictionary will see the advantage, indeed necessity, of constructing such groupings.

2.3. Derivation of Inflected Forms

There are two valid methods for the derivation of Inflected Forms:

addition of an affix to a radical

removal or addition of characters for a graphic or phonemic Morphological Unit.

2.3.1. Addition of an affix to a radical

This method of derivation is particularly useful for describing verbal inflections because:

the radical is always different from the Umg.

at times we must select a radical among several choices.

The radicals are identified by a number (n-th rad). For the derivation of an Inflected Form, we select (by its number) the pertinent radical and add a termination corresponding to the indicated combination of mood-tense-person-gender-number.

Ex: chanter (to sing)

indpre30p Radg1: chant ADD: ent -> chantent

finir (to finish)

indpre30p Radg1: finiss ADD: ent -> finissent

aller (to go)

indpre10p Radg1: all ADD: ons -> allons

indpre30p Radg2: v ADD: ont -> vont

devoir (must)

indpre10p Radg1: dev ADD: ons -> devons

indpre30p Radg2: doiv ADD: ent -> doivent

It is possible that the radical is identical to the graphic and phonemic Morphological Unit. In this case the radical must carry the number 0.

Ex: table

fp Radg0: table ADD: s -> tables
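The selection of a numbered radical followed by the addition of a termination can be sketched in a few lines of Python. This is only an illustration: the names RADICALS, RULES and inflect are hypothetical, and the rules are keyed by lemma for brevity, whereas the model attaches them to a system of inflection shared by many words.

```python
# Numbered graphic radicals for each word; radical 0 is the Umg itself.
RADICALS = {
    "chanter": {0: "chanter", 1: "chant"},
    "aller": {0: "aller", 1: "all", 2: "v"},
    "devoir": {0: "devoir", 1: "dev", 2: "doiv"},
    "table": {0: "table"},
}

# (word, feature code) -> (number of the radical, termination to add).
RULES = {
    ("chanter", "indpre30p"): (1, "ent"),
    ("aller", "indpre30p"): (2, "ont"),
    ("devoir", "indpre30p"): (2, "ent"),
    ("table", "fp"): (0, "s"),
}


def inflect(word, features):
    """Select the pertinent radical by its number and add the termination."""
    number, termination = RULES[(word, features)]
    return RADICALS[word][number] + termination


if __name__ == "__main__":
    for word, features in RULES:
        print(word, features, "->", inflect(word, features))
```

Keeping the radicals apart from the rules mirrors the text: the rule only names a radical number, so the same rule can in principle serve every word that has a radical with that number.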

2.3.2. Removal or addition of characters

Inflected Forms are derived from a Umg by removing or adding a series of characters. (Recall that the label for a Umg is the same as its radical 0.) In an equivalence class of Inflected Forms, at least one form corresponds to the Umg. For this form the derivation by removal/addition is not necessary.

Ex: quantum: {quantum, quanta}

-> ms, mp

{(removal: nil, addition: nil); (removal: -um, addition: -a)}

We allow a wildcard ("joker") in the strings of characters to be removed and added. This permits us to group together conjugations, notably those with alternations of accent marks and those with reduplication of consonants.

The value of a joker is found by comparing the written lemma with the string to be removed. A joker, noted as $, may represent one or several characters.

Ex: célébrer, il célèbre (to celebrate, he celebrates)

assécher, il assèche (to drain, he drains)

exécrer, il exècre (to execrate, he execrates)

céder, il cède (to cede, he cedes)

etc.

Remove: é$er Add: è$e
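The behaviour of the joker can be made concrete with a minimal Python sketch. The function name apply_rule and the matching strategy (the string to be removed is matched against the shortest possible ending of the lemma) are assumptions made for the illustration; the report does not specify how the comparison is carried out. The rules are written without the leading hyphen used in the examples above.

```python
import re


def apply_rule(lemma, removal, addition):
    """Derive a form by removing a final string and adding another.

    A '$' in `removal` is a joker that stands for one or more characters;
    each '$' in `addition` is replaced, in order, by the value found.
    """
    pieces = [re.escape(piece) for piece in removal.split("$")]
    pattern = re.compile("(.+?)".join(pieces))
    # Try the shortest ending first, so that the rule applies to the last
    # possible position in the lemma (the second "é" of "célébrer").
    for start in range(len(lemma), -1, -1):
        match = pattern.fullmatch(lemma, start)
        if match:
            values = list(match.groups())
            added = "".join(
                values.pop(0) if char == "$" else char for char in addition
            )
            return lemma[:start] + added
    raise ValueError(f"{removal!r} does not match the end of {lemma!r}")


if __name__ == "__main__":
    for verb in ("célébrer", "assécher", "exécrer", "céder"):
        print(verb, "->", apply_rule(verb, "é$er", "è$e"))
    print("quantum ->", apply_rule("quantum", "um", "a"))
```

A single rule thus covers all four verbs, the joker taking the values br, ch, cr and d respectively.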

Certain paradigms do not adapt well to this method of derivation:

Ex: {empereur, impératrice, empereurs, impératrices}

-> ms, fs, mp, fp

{je, je, tu, tu, il, elle, nous, nous, vous, vous, ils, elles}

-> 1ms, 1fs, 2ms, 2fs, 3ms, 3fs, 1mp, 1fp, 2mp, 2fp, 3mp, 3fp

 

2.4. Inflectional variants

Inflectional variants occur when, for a given Umg, the same combination of inflectional features may have different surface forms.

The inflectional variants may be described in terms of:

i. Variation in the radical. In certain cases, it is possible to use concurrently two different radicals for the same Inflected Form.

Ex: asseoir

indipf10s Radg1: assey ADD: ais -> asseyais

Radg2: assoy ADD: ais -> assoyais

ii. Variation in the suffix

Ex: media

mp Radg0: media ADD: s -> medias

Radg0: media ADD: nil -> media

iii. Combination of two variations

Ex: scénario

mp Radg0: scenario ADD: s -> scenarios

Radg1: scenari ADD: i -> scenarii

pouvoir

indpre10s Radg1: peu ADD: x -> peux

Radg2: pu ADD: is -> puis

Radg2: pu ADD: issé -> puissé

The variation in inflection may influence either the entire paradigm of inflection or only certain Inflected Forms.

i. verbs that have two regular conjugations

conjugation with oi and oy, and with ie and ey

Ex: asseoir (to sit)

conjugation with ouï and conjugation with oi, oy and o

Ex: ouïr (to hear)

ii. verbs that have variants in certain forms only

Ex: payer (to pay)

There are various "labels" that indicate the source of variation, encoded under "contexte_var".

iii. Nouns that have two plurals, a French and a foreign plural

Ex: scenarios / scenarii (French plural / Italian plural)

leitmotifs / leitmotive (French plural / German plural)

iv. Verbs that have variants in certain forms only, variants dictated by the type of sentence.

Ex: peux/ puis/ puissé (affirmative/ interrogative/ exclamatory)

Note: We distinguish between variation in spelling and variation in inflection. Concurrent lemmatized forms are variants of written lemmas. All other forms are inflectional variants.

 

3. System of Inflection of Compounds

As for all other Morphological Units, Inflected Forms of Compounds are not given in full but are derived. The Inflected Forms of a Compound are derived from the Inflected Forms of its Components.

Keep in mind that the Inflected Forms of the Components are not stored in full and that each Component has a system of inflection which permits us, based on a Combination of Morphological Features, to derive the desired Inflected Form. To select an Inflected Form of a Component, it suffices to specify the corresponding Combination of Morphological Features.

Ex: To select the Inflected Form peaux of the Component peau, we have only to specify the CombTM associated with this form, that is to say, feminine / plural.

An Mfc (System of Inflection of Compounds) enters into the relation of Composition (R_Compose) between a Compound and a Component. The Mfc allows us to select, for each CombTM of a Compound, the Inflected Form(s) of the pertinent Components.

Using this system we will elaborate various cases.

 

3.1. CombTM of compound different from CombTM of the head component

Ex: Un peau rouge (Indian - ms)

Des peaux rouges (mp)

Une peau rouge (fs)

Des peaux rouges (fp)

 

 

3.2. CombTM of compound identical to CombTM of the head component

Ex: Une chaise longue

Des chaises longues

 

3.3. Constant component

Ex: Une deux-chevaux

Des deux-chevaux

 

3.4. Free Variation of a Component

Ex: Un tire-fesse(s)

Des tire-fesses

 

 

3.5. Complex Cases

Ex: Un franc-maçon

Des francs-maçons

Une franc-maçonne

Des franc-maçonnes

3.6. Prepositional and adverbial components

Morphological compounds may consist of any type of simple words and be themselves of any Category.

Ex: Sur le vif

 

3.7. Contextual variant of a component

Ex: ramasse-miette(s)

(small utensil used to clear crumbs from the table)

The traditional spelling formed the plural based on the second component: ramasse-miettes, whereas the recent spelling reform proposes the singular form: ramasse-miette.

The attribute contexte_var is used to record this distinction:

contexte_var = former spelling / new spelling

 

3.8. Recursion in Morphological Compositions

Ex: virage en épingle à cheveux

which consists of two Um_S and one Um_C:

virage Um_S

en Um_S

épingle à cheveux Um_C

 

Notice that we have also described the process of derivation for the compound épingle à cheveux.

VI. Etymology

A lemma may have concurrent etymons, a chain of etymons, or several etymons (composition or lexical derivation).

Examples taken from the Petit Robert.

Ex: mécanisme: from Latin mecanisma

élastique: from Latin elasticus, from Greek elasis

plénipotentiaire: from Latin plenus and potentia

 

(Refer also to the Section I.1.3 Conclusion on the Criteria for Splitting)

 

C. USER'S MANUAL

 

1. Introduction

The GENELEX model of morphological data was conceived to represent the variations in form and the grammatical properties of words. In this chapter, we shall present a brief description of the terminology that we have adopted.

The forms of words are grouped into Morphological Units (Um). A Morphological Unit corresponds either to an invariable word (adverb, preposition, etc.) or to a paradigm of forms related by inflectional operations (conjugation, change of gender or number, etc.). For example, all simple forms derived from the conjugation of a verb belong to the same Um. The notion of Um often corresponds to that of "lemma" in traditional dictionaries, in which only one of the forms (for a verb, the infinitive) stands at the head of an entry.

This grouping has, as an indispensable complement, a set of information on the possible inflectional variations for each Um, represented in the following manner: Each Um is linked by its Umg and Ump and their system of inflection to one or more combinations of Morphological Features (CombTM) in which the pertinent features are recorded: gender, number, tense, etc. For example, the Um of the adjective net is related to four combinations generated by the combinatory variations of gender and number.

Um and CombTM are defined without any reference to the spelling or pronunciation of the words. They concern solely the combinatorial aspect of Morphological Features, which is often, but not always, reflected in spelling and phonetic variations. For example, the four CombTMs of the adjective net have a total of four different spellings, but no difference of pronunciation in modern French (except the liaison in the plural). Two parallel systems are therefore put into place to represent respectively the spelling and phonetic variations of words. Let us begin our discussion with the spelling variations.

In the absence of free spelling variations, a single Um corresponding to a simple word is related to one and only one orthographic (graphic) Morphological Unit (Umg), which indicates the spelling of the word or, more precisely, the spelling of its canonical form: the infinitive for a verb, the masculine singular for an adjective, etc. The distinction between Um and Umg serves to represent free variations.

In the case of nouns, Adjectives and verbs, there is a problem of representing the inflections associated with these categories. For these words, each Umg is related to a system of orthographic inflection (Mfg) which characterizes the manner in which the Inflected Forms are derived from the canonical form. For example, each verbal Umg possesses a conjugation that it shares generally with a number of other verbs. This conjugation is represented in the form of an Mfg.

Given an Mfg and a CombTM, we may associate with them an elementary rule of derivation for the Inflected Forms (Cff). For example, if we consider the conjugation of laisser, represented by an Mfg, and the CombTM of the present participle, these two entities are related to a Cff which specifies: remove -er, add -ant.

Each Umg and each Mfg has a phonemic equivalent, a Ump and an Mfp respectively, which carry the same information and derivations but act on the phonemic transcription rather than the graphic form.

2. Morphological Unit (Um)

2.1. Splitting Morphological Units

The separation of Morphological Units was our first challenge. With a view to standardization, we proposed the following rules.

Because Grammatical Category is considered a basic piece of grammatical information, we have established it as an attribute of a Um. Homographs of distinct categories give rise to distinct Ums -- such is the case for the adjective ferme (Luc a été ferme sur ce point - Luc was firm on this point) and for the noun ferme (Cette maison est une ferme - This house is a farm). This rule is more difficult to apply in the case of the numerous noun-adjective homographs in which the noun and the adjective have the same CombTMs and the same forms -- fragile (Luc est fragile. - Luc est un fragile. (Luc is frail. - Luc is a frail person.)). In this type of example, the precise nature of the Grammatical Category depends on syntax; moreover, a morphological description limited to a Um characterized as an adjective is sufficient to generate all CombTMs and all forms. The lexicographer is therefore allowed to ignore the existence of a noun such as fragile.

The notion of Um is founded upon that of inflection. The delimitation of Morphological Units presupposes a distinction between inflection and derivation.

The passage from masculine to feminine is considered a result of inflection, except when it results in a change of meaning, in which case several Ums are involved. For example, the masculine and feminine forms of a participle or of an adjective such as net, and the forms of a normal [+human] noun such as adepte are grouped into their respective and unique Um. On the other hand, the masculine and feminine forms of manche (La poêle a un manche. The skillet has a handle (m). - La chemise a une manche trop longue. The shirt has a sleeve (f) that is too long.) are separated into two distinct Ums. Such is also the case for the masculine noun moteur (motor) and the feminine noun motrice (motive) which designate distinct concrete objects. The masculine and feminine forms of the word interface have the same meaning, and therefore are grouped under a single Um. The processing of grammatical gender here goes further than simple morphology. However, it does not go as far as representing sex or distinguishing between arbitrary and natural gender.

The passage between the singular and the plural is always considered a matter of inflection, for reasons of simplicity. For example, the obligatorily plural noun échecs (chess) is attached to the Um of the noun échec.

The distinction between inflection and derivation poses problems that lexicographers know only too well. These problems manifest themselves when we seek to delimit Ums. Besides the verb agacer, whose Um contains the past participle agacé, should we always include an adjective agacé? We do not propose any particular rules in addition to those followed by lexicographers.

The splitting of Ums is a way to differentiate certain meanings, such as masculine manche (handle) and feminine manche (sleeve), when the two meanings present differences in their variations of form or in grammatical properties. It is not possible to represent all distinctions of usage by distinguishing Ums. For example, the various uses of the feminine manche (La chemise a une manche trop longue. The shirt has a sleeve that is too long. - La partie a eu trois manches. The game was played in three rounds.) are grouped under a single Um.

Two distinct paradigms define two distinct Ums. For example, the verb ressortir has two different conjugations: (Luc ressort de la cuisine. Luc is coming out from the kitchen (again). - Cette affaire ressortit à la compétence de ce tribunal. This matter is under the jurisdiction of the court.) This verb is therefore represented under two Ums. Two paradigms may be differentiated solely by the existence of certain CombTMs that are present in one paradigm but absent in the other. To illustrate, the human noun comique (Luc est un comique - Luc is a clown), variable in gender and in number, does not have the same paradigm as the invariable noun comique (La situation est d'un comique irrésistible - The situation is irresistibly funny.), which has only one CombTM -- the masculine singular. Nevertheless, when distinct uses have minor differences in terms of the existence of certain Inflected Forms, they are grouped into a single Um. For example, certain uses of the verb sentir have a variable past participle (le chien sent les sacs - the dog smells the bags), and others an invariable past participle (Les sacs sentent la sardine - the bags smell like sardines). This minor difference, however, does not justify splitting the word into two Ums and finds its place in syntax. Assessment of the situation is left to the lexicographer.

2.2. Relations between Um and Umg

A Um corresponding to a simple word or to an element of a word is generally related to one and only one Umg, which serves to describe the spelling of the word and of its Inflected Forms. This information has been placed in an entity distinct from the Um because of the existence of spelling variations. In the case of free spelling variations, with or without change in pronunciation, as in clé, which is also written clef, a specific Umg is reserved for each of the variants and a unique Um serves to describe the information independent of the spelling, thus permitting us to express the equivalence of the different spellings. If the word admits an inflection, we distinguish several Umgs only if the canonical form (the infinitive for verbs, the masculine singular for adjectives, etc.) has several spellings, such as clé and clef, or several genders, such as interface. When there are free spelling variations in certain Inflected Forms but not in the canonical form, we represent the word under one Um, with the differences registered in the inflection. Asseoir, for example, has only one infinitive but two Inflected Forms for "he sits" (assoit, assied) (refer to Section 6.1 below).

The spelling of compounds is processed with the help of a relation of composition (see 2.4 Relations between Ums, in this section). A Um which corresponds to a compound is therefore not directly linked to any Umg.
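The role of the Um as the meeting point of free spelling variants can be pictured with a small Python sketch. The class Um, its fields and the function same_word are hypothetical names chosen for the illustration; they are not part of the GENELEX model itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Um:
    """A Morphological Unit, independent of any particular spelling."""
    label: str
    category: str
    umgs: tuple  # one canonical spelling (Umg) per free variant


CLE = Um(label="clé", category="noun", umgs=("clé", "clef"))
TABLE = Um(label="table", category="noun", umgs=("table",))

# Index from each canonical spelling to the Um that it belongs to.
INDEX = {umg: um for um in (CLE, TABLE) for umg in um.umgs}


def same_word(spelling_a, spelling_b):
    """Two spellings are equivalent when they lead to the same Um."""
    if spelling_a not in INDEX or spelling_b not in INDEX:
        return False
    return INDEX[spelling_a] is INDEX[spelling_b]


if __name__ == "__main__":
    print(same_word("clé", "clef"))   # both spellings lead to one Um
    print(same_word("clé", "table"))  # two different Ums
```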

2.3. Relations between Um and Ump

A Um corresponding to a simple word or to an element of a word is generally linked to one and only one phonemic Morphological Unit (Ump), which serves to describe the pronunciation of the word and of its Inflected Forms. This information has been placed in an entity distinct from the Um because of the existence of variations in pronunciation. In the case of free variations in pronunciation, with or without change in spelling, as in razzia, pronounced either [razja] or [radzja], a specific Ump is reserved for each of the variants and a unique Um serves to describe the information independent of the pronunciation, thus permitting us to express the equivalence of the different pronunciations. If a word admits an inflection, we distinguish several Umps only when the canonical form (the infinitive for a verb, the masculine singular for an adjective, etc.) has several pronunciations, as razzia, or several genders, as interface. When there are free phonemic variations in certain Inflected Forms but not in the canonical form, we represent the word under one Ump and the variations are registered in the inflection. The noun lobby, for example, has two pronunciations in the plural -- lobbies: [lobi, lobiz].

Regular phonetic variations are not taken into consideration at the level of Ums and Umps: a silent e which may or may not be pronounced, as in retirer, or a u which may be syllabic or non-syllabic, as in tuer, does not justify the coding of several Umps. These variations do, however, generate several phonetic transcriptions for the same word: in the International Phonetic Alphabet, [tye] and [ty+e]. Given their regularity, they correspond to a single phonemic transcription, for example /ty+e/ for tuer, whereas the variants of razzia, which do not stem from any regular variation in French, justify two distinct phonemic transcriptions -- /razia/ and /radzia/. The lexicographer may therefore classify the variations, according to their regularity, at either a phonemic or a phonetic level. Only phonemic variations are taken into consideration in the Ump. As for regular variations, the categories at the phonetic level and all lexical information that concerns them are encoded in the phonemic representations, and an algorithm is associated with the phonemic system to generate the phonetic variants. The pronunciation of compounds is processed with the help of a relation of composition (see 2.4 below). A Um which corresponds to a compound is therefore not linked directly to any Ump.

2.4. Relations between Ums

Ums represent not only simple words but also elements of a word (prefixes, etc.) and compounds, such as sur le vif. There are therefore relations between Ums:

either derivational, that is, interior to a simple word, as for example condamner and condamnation.

or compositional, that is, exterior to simple words, such as vif and sur le vif.

These relations permit us to encode the decomposition of a Um into a series of Ums. They indicate the order of the element in the series, for example 3 for the relation between vif and sur le vif and 1 for the relation between électro- and électrochoc. They also indicate the presence and the value of the characters of separation.

written separators : spaces in sur le vif, hyphens in sur-le-champ, an apostrophe in aujourd'hui, and an absence of separators in électrochoc.

phonemic separators: introduction of an element of liaison as in mot à mot.
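The way the order of the elements and the written separators allow a compound to be reassembled can be sketched as follows. This is a minimal illustration only: the class Component and its fields are hypothetical, the separator is assumed to be recorded before each component, and the decomposition shown for each word is chosen for the example rather than taken from a GENELEX dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    form: str        # written form of the component
    order: int       # position of the component in the series (from 1)
    sep_before: str  # written separator before the component ("" if none)


def written_compound(components):
    """Rebuild the written compound from its ordered components."""
    ordered = sorted(components, key=lambda component: component.order)
    return "".join(c.sep_before + c.form for c in ordered)


SUR_LE_VIF = [
    Component("vif", 3, " "),  # vif is the third element of sur le vif
    Component("sur", 1, ""),
    Component("le", 2, " "),
]
SUR_LE_CHAMP = [
    Component("sur", 1, ""),
    Component("le", 2, "-"),
    Component("champ", 3, "-"),
]
ELECTROCHOC = [
    Component("électro", 1, ""),
    Component("choc", 2, ""),  # absence of separator
]

if __name__ == "__main__":
    for compound in (SUR_LE_VIF, SUR_LE_CHAMP, ELECTROCHOC):
        print(written_compound(compound))
```

Because each relation carries its own order, the components may be stored in any sequence without affecting the result.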

3. Combination of Morphological Features (CombTM)

The combinations of Morphological Features serve to characterize all Inflected Forms of a Um and to indicate the Morphological Features of these forms. Even when a Um has only one Inflected Form, as fiançailles, the attributes of the unique CombTM indicate the gender and number. For words of a fixed gender, such as lit, inspection of the CombTMs permits us to deduce its gender and the fact that the gender is fixed. There are four CombTMs for nouns and adjectives, 51 for verbs (only simple tenses are represented: the formation of complex tenses is represented in the syntactic layer), and several others for determiners and pronouns. There is also a special "empty" CombTM for those invariable words for which Morphological Features are not relevant (prepositions, adverbs, ...) - all the attributes of this CombTM have a value sans (without).
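As an illustration, the four CombTMs of nouns and adjectives, and the deduction of a fixed gender from the CombTMs of a word, might be sketched in Python as follows. The names CombTM, NOMINAL_COMBTMS and fixed_gender are hypothetical, and only gender and number are modelled here.

```python
from itertools import product
from typing import NamedTuple


class CombTM(NamedTuple):
    gender: str
    number: str


# The four combinations available to nouns and adjectives.
NOMINAL_COMBTMS = [
    CombTM(gender, number)
    for gender, number in product(("masculine", "feminine"),
                                  ("singular", "plural"))
]

# CombTMs of a noun of fixed gender (lit) and of a noun that has
# only one Inflected Form (fiançailles).
LIT = [CombTM("masculine", "singular"), CombTM("masculine", "plural")]
FIANCAILLES = [CombTM("feminine", "plural")]


def fixed_gender(combtms):
    """Return the gender if it is the same in every CombTM, else None."""
    genders = {comb.gender for comb in combtms}
    return genders.pop() if len(genders) == 1 else None


if __name__ == "__main__":
    print(len(NOMINAL_COMBTMS))
    print(fixed_gender(LIT))
    print(fixed_gender(FIANCAILLES))
    print(fixed_gender(NOMINAL_COMBTMS))
```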

4. Graphic Morphological Unit (Umg)

A Umg gives the spelling of a simple word or of an element of a word. In the case of free spelling variation, we have one Umg per variant. Each Umg may therefore carry information on the style (level of language) to which the variant belongs, its frequency of use, the date of its first attestation in the language, or geographical location of its use.

Umgs all possess a system of inflection. The Umgs of words which carry Morphological Features (nouns, adjectives, verbs, and certain grammatical words) are linked to an Mfg which specifies the existence, the number and the spelling of the Inflected Forms, even if there is only one such form. The Umgs of words without Morphological Features are in relation with an empty Mfg.

5. System of Written Inflection (Mfg)

The Mfg concerns words which carry Morphological Features (nouns, adjectives, verbs and certain grammatical words). It specifies the existence, number, and spelling of the Inflected Forms, even if there is only one such form. Two simple words have the same Mfg if and only if they have the same number of Inflected Forms with the same combinations of Morphological Features, and if the spellings of these forms are deducible in the same manner for the two words. It is, for example, the case for the verbs parvenir and prévenir. Only simple tenses are registered in the attributes of Mfgs. The formation of complex tenses (être parvenu, avoir prévenu) and of the pronominal form (se prévenir) are processed in the syntactic layer.

Two Mfgs may differ solely by the existence of certain CombTMs, present in one but absent in the other. For example, the human noun comique (Luc est un comique), variable in gender and number, does not have the same Mfg as the invariable comique (La situation est d'un comique irrésistible), which only has a single CombTM -- masculine, singular. However, when inflectional paradigms have only minor differences in certain Inflected Forms, only one Mfg is attributed. For example, in accordance with tradition, the conjugation of the verb bâtir, with a variable past participle, is not distinguished from that of agir, whose past participle is invariable. The Mfg for bâtir is attributed to agir and the syntactic description indicates if the past participle is variable or not.

The "empty" Mfg is a special one, having only one "empty" CombTM and only one Cff (elementary calculation rule for an inflected form) with a null Removal and a null Addition.

6. Computation of Written Forms (Cff)

6.1. Two types of representation

A Cff is an elementary rule of computation for a graphic Inflected Form. Associated with an Mfg, the Cff is used to calculate the graphic Inflected Form. There are two types of rules:

i. Addition of a termination to a radical

Ex: devoir

Radg 1: dev

Radg 2: doiv

...

Cff => doivent

nieme_radgp : 2

Removal : ""

Addition: "ent"

ii. computation of the form by Removal of a string of characters from radical 0 (the Umg) and Addition of a suffix

Ex: devoir

Cff => doivent

nieme_radgp : 0

Removal : "evoir"

Addition: "oivent"

In the most irregular conjugations, the string to be removed and the string to be added may extend to the entire form, for example étaient as computed from être.
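The two types of rule can be handled by one and the same computation, as the following Python sketch suggests. The class Cff borrows the attribute name nieme_radgp from the examples above; the function compute and the dictionary of radicals are hypothetical and serve only to show that both representations yield the same form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cff:
    nieme_radgp: int  # number of the radical used (0 is the Umg itself)
    removal: str      # string removed from the end of the radical
    addition: str     # string added in its place


def compute(radicals, cff):
    """Compute a graphic Inflected Form from a radical and a Cff."""
    base = radicals[cff.nieme_radgp]
    if not base.endswith(cff.removal):
        raise ValueError(f"{base!r} does not end with {cff.removal!r}")
    return base[:len(base) - len(cff.removal)] + cff.addition


DEVOIR = {0: "devoir", 1: "dev", 2: "doiv"}
ETRE = {0: "être"}

by_radical = Cff(nieme_radgp=2, removal="", addition="ent")
by_removal = Cff(nieme_radgp=0, removal="evoir", addition="oivent")
whole_form = Cff(nieme_radgp=0, removal="être", addition="étaient")

if __name__ == "__main__":
    print(compute(DEVOIR, by_radical))
    print(compute(DEVOIR, by_removal))
    print(compute(ETRE, whole_form))
```

The first representation needs the list of radicals but keeps the rule short; the second needs only the Umg but puts more of the form into the rule.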

 

6.2. Relations between Mfg, CombTM and Cff

With an Mfg and a CombTM may be associated a Cff which permits us to find the Inflected Form of this CombTM for those words which inflect according to the Mfg. For example, if we consider the conjugation of laisser, represented by an Mfg, and the CombTM of the present participle, these two entities are linked to a Cff which specifies in substance: remove -er, add -ant. If free variants exist for certain Inflected Forms, several Cffs are associated with the same Mfg and the same CombTM. For example, the Mfg of asseoir and the CombTM of the present, third person singular are linked to two Cffs:

remove -eoir, add -ied,

remove -eoir, add -oit.
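The association of an Mfg and a CombTM with one or several Cffs can be pictured as a simple table, as in the Python sketch below. The dictionaries, the CombTM labels used as keys and the function inflected_forms are hypothetical; the point illustrated is only that an Mfg is shared by words and that free variants appear as several rules for the same CombTM.

```python
# An Mfg is sketched as a table: CombTM label -> list of (removal, addition).
MFG_LAISSER = {
    "present participle": [("er", "ant")],
}
MFG_ASSEOIR = {
    # Two free variants for the same CombTM.
    "indicative present, third person singular": [
        ("eoir", "ied"),
        ("eoir", "oit"),
    ],
}


def inflected_forms(umg, mfg, combtm):
    """Return every graphic Inflected Form of `umg` for this CombTM."""
    forms = []
    for removal, addition in mfg.get(combtm, []):
        if not umg.endswith(removal):
            raise ValueError(f"{umg!r} does not end with {removal!r}")
        forms.append(umg[:len(umg) - len(removal)] + addition)
    return forms


if __name__ == "__main__":
    print(inflected_forms("laisser", MFG_LAISSER, "present participle"))
    # The same Mfg serves any other verb that inflects in the same manner.
    print(inflected_forms("chanter", MFG_LAISSER, "present participle"))
    print(inflected_forms(
        "asseoir", MFG_ASSEOIR, "indicative present, third person singular"))
```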

7. Phonemic Morphological Unit (Ump)

A Ump gives the phonemic transcription of a simple word or of an element of a word. If there are free phonemic variations, we have one Ump per variant. Each Ump may therefore carry information on the style (level of language) of the variant, its frequency in usage and the date of its first attestation in the language. Remember that the lexicographer may classify the variations in pronunciation, following their regularity, at either a phonemic or phonetic level. Only phonemic variations are taken into account in the Umps. For the phonetic level, refer to the section above.

In the case of free variants, the correspondence between the different Umgs and Umps of the same word is not obvious. It is always explicitly indicated by a direct relationship between these entities.

8. System of Phonemic Inflection (Mfp)

The Mfp concerns words which carry Morphological Features (nouns, adjectives, verbs and certain grammatical words). It specifies the existence, the number, and the phonemic transcription of the Inflected Forms, even if there is only one such form. Two simple words have the same Mfp if and only if they have the same number of Inflected Forms with the same combinations of Morphological Features, and if the transcriptions of these forms are deducible in the same manner for the two words. It is for example the case for the verbs croire and pourvoir. As for Mfgs, only the simple tenses enter into consideration for the attributes of Mfps: the formation of complex tenses and of pronominal forms is processed in the syntactic layer. The definition of Mfps is completely parallel to that of Mfgs: two Mfps may differ solely by the existence of certain CombTMs, present in one but absent in the other. However, when morphological paradigms have only minor differences in the existence of certain Inflected Forms, the same Mfp is attributed to them (see Section 5 on Mfgs above). We see that the Mfgs are in complex relation with the Mfps, because two words may have the same Mfg and yet have distinct Mfps (amer and léger) or vice versa (croire and pourvoir).

 

9. Computation of Phonemic Inflected Forms (Cff)

9.1. Two types of Representation

A Cff (see Section 6 above.) is an elementary rule of derivation of a phonemic Inflected Form. There are two types of rules:

i. addition of a termination to a radical

Ex: devoir /d"vuar/

Radp /d"v/

nieme = 1

Radp /duav/

nieme = 2

...

Cff => /duavt*/

nieme_radgp : 2

Removal: //

Addition: /t*/

ii. computation of the form by Removal of a string of characters from radical 0 (the Ump) and Addition of a suffix

Ex: devoir

Cff => /duavt*/

nieme_radgp : 0

Removal : /"vuar/

Addition: /uavt*/

In the most irregular conjugations, the string to be removed and the string to be added may extend to the entire form, for example sera from être. This computation is carried out on phonemic and not phonetic transcriptions. The regular phonetic variations that appear during inflection do not complicate the computation. For example, in the conjugation of céder, the phonetic alternation between [é] and [è] (in céder [sede] and cède [sed]) does not affect the phonemic transcriptions: céder /sede/, cède /sed/.

9.2. Relations between Mfp, CombTM, and Cff

With an Mfp and a CombTM may be associated a Cff which permits us to find the Inflected Form of the CombTM for those words which are inflected according to the Mfp. For example, if we consider the phonemic inflection of courir, represented by an Mfp, and the CombTM of the third person singular of the future, these two entities are linked to a Cff which specifies in substance: remove /ir/, add /ra/. If there are free variants in certain Inflected Forms, several Cffs may be associated with the same Mfp and the same CombTM. For example, the Mfp of asseoir and the CombTM of the present third person singular are related to two Cffs:

remove /uar/, add /ie/.

remove /r/, add the empty termination.

9.3. Relations between Graphic and Phonemic Forms

The relation "is written / is pronounced" expresses the link between a Cff used to calculate a graphic form and a Cff used to calculate a phonemic form. In the majority of cases, this relation is deducible from the CombTM, as a function of the Umg, the Ump, the Mfg and the Mfp.

Ex: ms [table] /tabl/

mp [tables] /tabl/

ms [djihad] /dZiad/ /Ziad/

mp [djihad] /dZiad/ /Ziad/

This relation is, however, necessary when there are variants in the graphic and phonemic systems. We must be able to link each graphic variant to the correct phonemic variant and vice versa.

Ex: imppre20p [asseyez] /aseje/ [assoyez] /asuaye/

10. System of Inflection of Compounds (Mfc)

An Mfc enters into the relation of composition (R_Compose) between a compound and a component. The Mfc permits us to select for each CombTM of the compound the Inflected Form(s) of the pertinent components.

Um_C points to R_Compose which:

points to a component

gives the place of this component in the compound (linear order)

indicates the possible separators before the component (sepg/p)

points to the Mfc of the compound, given the component.

The Mfc points to pairs of CombTMs (Comb_Comb), which allow us to associate with one CombTM of the compound the pertinent CombTM(s) of the component.

Each of these couplets:

points to one CombTM of the compound

points to one to N CombTMs of the component (for the selection of the pertinent Inflected Form(s)).

specifies, as needed, the contextual variant (contexte_var) to be selected among the Inflected Forms of the compound.
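The selection performed by an Mfc can be sketched as follows in Python. The class CombComb mirrors the attributes of the Comb_Comb element; the CombTM identifiers "MS" and "MP", the function name and the table of forms for soleil are assumptions introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class CombComb:
    combcpose: str          # CombTM of the compound
    combcposant_l: list     # one to N CombTMs of the component
    contexte_var: str = ""  # zones separated by "|", in the order of combcposant_l


def select_component_forms(mfc, compound_combtm, component_forms):
    """Return (inflected form, context label) pairs of the component
    selected for one CombTM of the compound."""
    selected = []
    for couplet in mfc:
        if couplet.combcpose != compound_combtm:
            continue
        zones = []
        if couplet.contexte_var:
            zones = [z.strip() for z in couplet.contexte_var.split("|")]
        for i, combtm in enumerate(couplet.combcposant_l):
            label = zones[i] if i < len(zones) else ""
            selected.append((component_forms[combtm], label))
    return selected


# Hypothetical data for the component soleil of the compound pare-soleil.
SOLEIL = {"MS": "soleil", "MP": "soleils"}
MFC_SOLEIL = [
    CombComb("MS", ["MS"]),
    CombComb("MP", ["MS", "MP"], "old spelling | new spelling"),
]

for form, label in select_component_forms(MFC_SOLEIL, "MP", SOLEIL):
    print("pare-" + form, label)
# pare-soleil old spelling
# pare-soleils new spelling
```

The plural of the compound thus selects two Inflected Forms of the component, each labelled by its contextual variant.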

 

D. ENTITY-RELATION DIAGRAMS


E. DTD SGML

 

 

1. Translation of the Conceptual Model

The conceptual model of GENELEX has been expressed to a great extent in terms of an entity-attribute-relation model (Merise).

Many integrity constraints are expressed in this EAR model: type of objects, type of relations, cardinality of relations, etc. However, because the model was not conceived to express rules (doing so gives rise to extreme complications), certain constraints, such as restrictions on the combination of values, had to be expressed in the accompanying document. It follows that the morphological model of GENELEX is a combination of EAR formalism and comments in natural language (NL).

An SGML DTD (Document Type Definition) is a physical model, in the form of a grammar, which describes the markup of the data.

During the transition from the conceptual model (EAR + NL) to the GENELEX DTD, the EAR models had to be translated in a very systematic way. We have attempted to express formally in the DTD all of the integrity constraints that were expressed in natural language.

Certain rules for translating the EAR formalism to SGML have been applied:

the EAR entities become SGML elements.

the attributes of EAR entities become attributes of SGML elements. If the values of an attribute (EAR) are unique and form a closed vocabulary set, they are represented as an SGML attribute list.

the unmodified EAR relations that are linked to an unshared entity are expressed by the hierarchical structure of the DTD elements. Their cardinality is expressed by the SGML occurrence indicators : ? + *.

the unmodified EAR relations that are linked to a shared entity are expressed by reference links between the DTD elements.

the EAR relations that are modified by attributes are expressed in the form of SGML elements and attributes, and are linked, by either hierarchy or by reference, to those SGML elements representing the EAR entities to which they were attached.
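The translation rules above can be illustrated with a small Python sketch that emits SGML declarations from a simplified description of an EAR entity. The function to_dtd and its input format are hypothetical; this is not the tool used by the consortium, only an illustration of the rules.

```python
def to_dtd(name, children, attributes):
    """Emit ELEMENT and ATTLIST declarations for one EAR entity.

    children:   (child element, occurrence indicator) pairs for unshared
                entities, expressed by hierarchy; indicator is "", "?", "+" or "*".
    attributes: (name, declared value, default) triples; a shared entity is
                reached through an IDREF or IDREFS attribute.
    """
    if children:
        model = "(" + " & ".join(child + occ for child, occ in children) + ")"
    else:
        model = "EMPTY"
    lines = [f"<!ELEMENT {name} - O {model}>"]
    if attributes:
        lines.append(f"<!ATTLIST {name}")
        lines += [f"  {a} {declared} {default}" for a, declared, default in attributes]
        lines[-1] += ">"
    return "\n".join(lines)


# Unshared entity: expressed by hierarchy, cardinality by "?".
print(to_dtd("Etymon", [("Lib", "?")],
             [("id", "ID", "#REQUIRED"), ("langue", "CDATA", "#IMPLIED")]))

# Shared entities: expressed by reference links (IDREFS).
print(to_dtd("Mfc", [],
             [("id", "ID", "#REQUIRED"), ("comb_comb_l", "IDREFS", "#REQUIRED")]))
```

A closed vocabulary would be passed as a declared value such as "(SANS_B|OUI|NON)" with the default "SANS_B".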

A constraint file was developed to facilitate cross-referencing. The constraints expressed therein appear as comments and are, as such, ignored by the SGML parser; they express, in an intuitive syntax, the reference types.

 

2. DTD Commented

2.1. DTD genelex.dtd

<!--Consortium GENELEX @(#) genelex.dtd 3.1@(#) -->

<!-- **************A WORD TO THE USERS *************************

Your remarks concerning the DTD will be studied by the GENELEX consortium.

If changes to the DTD warrant the release of a new version, the consortium assumes responsibility for its diffusion.

************************************************************ -->

<!DOCTYPE Genelex [

<!ENTITY % ISOlat1 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN">

%ISOlat1

<!ENTITY % CustEnt PUBLIC "-//GLX-TEAM//ENTITIES Custom Entity Set//FR">

<!ENTITY % MorpEnt PUBLIC "-//GLX-TEAM//ENTITIES Morphologie Entity Set//FR">

<!ENTITY % SyntEnt PUBLIC "-//GLX-TEAM//ENTITIES Syntaxe Entity Set//FR">

<!ENTITY % SemEnt PUBLIC "-//GLX-TEAM//ENTITIES Semantique Entity Set//FR">

%CustEnt

%MorpEnt

%SyntEnt

%SemEnt

<!--

A Genelex document is made of several parts:

- the morphological description

- the syntactic description

- ...

To select the desired part, specify the appropriate key word (INCLUDE or IGNORE) in the following entity declarations:

-->

<!ENTITY % isMor "INCLUDE" >

<!ENTITY % isSyn "INCLUDE" >

<!ENTITY % isSem "INCLUDE" >

<!ELEMENT Genelex - O ( GenelexMorpho? & GenelexSyntaxe? & GenelexSemant? & CombVE*)>

<!ATTLIST Genelex

nom CDATA #REQUIRED

langue CDATA #REQUIRED

version CDATA #IMPLIED

date_creation1 CDATA #IMPLIED

date_creationglx CDATA #IMPLIED

date_modif CDATA #IMPLIED

propriete CDATA #IMPLIED

copyright CDATA #IMPLIED

integrite (SANS_B|%pBooleen) SANS_B

certification CDATA #IMPLIED>

 

<!-- ****************************************************** -->

<!ENTITY % pGlose

"appellation CDATA #IMPLIED

exemple CDATA #IMPLIED

commentaire CDATA #IMPLIED">

<!-- In general, throughout the file :

- "appellation" (name): a meaningful and, if possible,

unique name for the object

- "exemple" (example): illustrates the use of the object

(such as an author's citation or an example from a

dictionary or a linguist)

- "commentaire" (comment) free field for the user -->

<!-- ******************************************************* -->

<!ELEMENT CombVE - O EMPTY>

<!ATTLIST CombVE

id ID #REQUIRED

datation (SANS_D|%pDatation) SANS_D

niveaulgue (SANS_NL|%pNiveauLgue) SANS_NL

frequence (SANS_F|%pFrequence) SANS_F

vargeog CDATA #IMPLIED>

 

<![ %isMor [

<!ENTITY % GLXmor PUBLIC "-//GLX-TEAM//DTD Description Morphologie//FR">

<!ENTITY % MorpCtr PUBLIC "-//GLX-TEAM//DTD Contraintes Morphologie//FR">

%GLXmor

%MorpCtr

]]>

<![ %isSyn [

<!ENTITY % GLXsyn PUBLIC "-//GLX-TEAM//DTD Description Syntaxe//FR">

<!ENTITY % SyntCtr PUBLIC "-//GLX-TEAM//DTD Contraintes Syntaxe//FR">

%GLXsyn

%SyntCtr

]]>

<![ %isSem [

<!ENTITY % GLXsem PUBLIC "-//GLX-TEAM//DTD Description Semantique//FR">

<!ENTITY % SemaCtr PUBLIC "-//GLX-TEAM//DTD Contraintes Semantique//FR">

%GLXsem

%SemaCtr

]]>

]>

2.2. DTD morpho.dtd

<!--Consortium GENELEX morpho.dtd 3.3-->

<!-- *************A WORD TO THE USERS ************************

Your remarks concerning the DTD will be studied by the GENELEX consortium. If changes to the DTD warrant the release of a new version, the consortium assumes responsibility for its diffusion.

************************************************************ -->

<!ELEMENT GenelexMorpho - O (

(Um_S|Um_C|Um_Agg|Um_Aff)* &

Etymon* &

Mfg* &

Mfp* &

CombTM* &

Mfc* &

Comb_Comb*)>

<!-- *************************************************** -->

<!-- ******* DEFINITION OF MORPHOLOGICAL UNITS ****** -->

<!-- *************************************************** -->

<!ENTITY % pUmAtt

"id ID #REQUIRED

appellation CDATA #IMPLIED

attestation CDATA #IMPLIED

combve IDREF #IMPLIED

etymon_l IDREFS #IMPLIED">

<!ELEMENT Um_S - O ((Umg|Ump)+ & Derivation* & FormeBreve*)>

<!ATTLIST Um_S

%pUmAtt

catgram (SANS_C|%pCatGram) SANS_C

sscatgram (SANS_SC|%pSsCatGram) SANS_SC

autonomie (SANS_B|%pBooleen) SANS_B

usyn_l IDREFS #IMPLIED>

<!-- The content token, (Umg|Ump)+, indicates that a Simple

Morphological Unit must have at least one Graphic Unit or

Phonemic Unit.

The content token, Derivation, indicates the possible

derivations that are associated with a Um_S.

The content token, FormeBreve (Short Form) indicates those

relations that a Um_S may have with other Units that are

abbreviated forms. -->

<!ELEMENT Um_C - O (R_Compose+ & FormeBreve*)>

<!ATTLIST Um_C

%pUmAtt

catgram (SANS_C|%pCatGram) SANS_C

sscatgram (SANS_SC|%pSsCatGram) SANS_SC

usyn_l IDREFS #IMPLIED>

<!-- A Compound Morphological Unit has no Umg or Ump of its own:

these graphic and phonemic forms are deduced from the Units

which make up the Compound Unit.

Each Component that participates in the Um_C is indicated

by an R_Compose relationship. -->

<!ELEMENT Um_Agg - O ((Umg|Ump)+ & R_Compose+)>

<!ATTLIST Um_Agg

%pUmAtt

obligatoire (SANS_B|%pBooleen) SANS_B>

<!-- A Contracted Morphological Unit is associated with the

elements that have been incorporated in the contraction by

way of the R_Compose relationship.

The attribute, obligatoire ("mandatory"), indicates whether

the use of the contraction, as opposed to the corresponding

expanded form, is mandatory or optional. -->

<!ELEMENT Um_Aff - O ((Umg|Ump)+ & CatGram_Select* & CatGram_Result* & Genre_Result*)>

<!ATTLIST Um_Aff

%pUmAtt

typaff (SANS_T|%pTypaff) SANS_T

usem_aff_l IDREFS #IMPLIED>

<!-- The attribute, typaff ("affix type"), records the type of

a Morphological Affix Unit; in the case in which an affix

may be typed only within its derivation context, this

attribute will have the value, SANS_T ("typeless").

The content tokens, CatGram_Select/Result ("grammatical

category selected/resulting") and Genre_Result ("gender

resulting"), indicate for a given Morphological Affix Unit

possible restrictions concerning the grammatical category

and the gender of the Units which result from the

derivation. -->

<!ELEMENT CatGram_Result - O EMPTY>

<!ATTLIST CatGram_Result

catgram (SANS_C|%pCatGram) SANS_C>

<!ELEMENT CatGram_Select - O EMPTY>

<!ATTLIST CatGram_Select

catgram (SANS_C|%pCatGram) SANS_C>

<!ELEMENT Genre_Result - O EMPTY>

<!ATTLIST Genre_Result

genre (SANS_G|%pGenre) SANS_G>

<!-- ************************************************** -->

<!-- ********* GRAPHIC FORM / PHONIC FORM ******* -->

<!-- ************************************************** -->

<!ENTITY % pUmgpAtt

"nieme NUMBER #IMPLIED

vedette (SANS_B|%pBooleen) SANS_B

appellation CDATA #IMPLIED

attestation CDATA #IMPLIED

combve IDREF #IMPLIED

mf IDREF #REQUIRED

corresp_l NUMBERS #IMPLIED">

<!ELEMENT Umg - O (Lib & Radg*)>

<!ELEMENT Ump - O (Lib & Radp*)>

<!ATTLIST (Umg|Ump)

%pUmgpAtt>

 

<!-- In the case in which a Unit has Graphic and/or Phonic

variants, that is, several Umgs and/or Umps, these

Umgs and Umps will have an attribute identifying the rank

of the variant. The relationship between a Umg and a

Ump is established by a list of integers, corresp_l. If

there is a preferred form among the variants, this can be

recorded in the attribute, vedette.

The field, mf, indicates the system of inflection. This is

left blank (null value) if the system of inflection for the

given Umg/p is unknown. In the case of those Units that are

not inflected (prepositions, ...) the mf field contains

an empty value: such as "mf_empty". -->

<!ENTITY % pRadgpAtt

"nieme NUMBER #IMPLIED

contexte_var CDATA #IMPLIED">

<!ELEMENT (Radg|Radp) - O (Lib)>

<!ATTLIST (Radg|Radp)

%pRadgpAtt>

<!-- the radical has two functions:

- it is used by the Mfg/p (graphic/phonic system of

inflection) to calculate the inflected forms.

A radical that is the same as the label, Lib, of the

Umg/p is not recorded as a radical element, but

simply as the label; one can however, refer to it as

the 0th radical.

- it is used in the derivation process -->

<!-- ************************************************** -->

<!-- *********** ETYMOLOGY ************** -->

<!-- ************************************************** -->

<!ELEMENT Etymon - O (Lib?)>

<!ATTLIST Etymon

id ID #REQUIRED

langue CDATA #IMPLIED

sens CDATA #IMPLIED

date CDATA #IMPLIED

appellation CDATA #IMPLIED>

<!-- *********************************************** -->

<!-- ***** GRAPHIC AND PHONIC SYSTEM OF INFLECTION *** -->

<!-- *********************************************** -->

 

<!ENTITY % pMfAtt

"id ID #REQUIRED

%pGlose">

<!ELEMENT (Mfg|Mfp) - O (CombTM_Cff+)>

<!ATTLIST (Mfg|Mfp)

%pMfAtt>

<!ELEMENT CombTM_Cff - O (Cff+)>

<!ATTLIST CombTM_Cff

combtm IDREF #REQUIRED>

 

<!ELEMENT Cff - O (Retrait,Ajout)>

<!ATTLIST Cff

nieme NUMBER #IMPLIED

nieme_radgp NUMBER 0

contexte_var CDATA #IMPLIED

corresp_l NUMBERS #IMPLIED>

<!-- The attributes, nieme ("nth") and corresp_l

("correspondence list"), are used to associate possible

variations in the inflected forms.

Ex: calculation of je peux/je puis (two forms of "I can")

The attribute, nieme_radgp ("nth radical"), indicates which

radical, the nth, is used to form the inflected form. A

value of 0 refers to the label, Lib, of the Umg/p. -->

<!ELEMENT (Lib|Ajout|Retrait) O O (#PCDATA)>

<!-- combination of morphological features -->

<!ELEMENT CombTM - O EMPTY>

<!ATTLIST CombTM

id ID #REQUIRED

mode (SANS_M|%pMode) SANS_M

temps (SANS_T|%pTemps) SANS_T

personne (SANS_P|%pPersonne) SANS_P

genre (SANS_G|%pGenre) SANS_G

nombre (SANS_N|%pNombre) SANS_N

nombreposseur (SANS_NP|

%pNombrePosseur) SANS_NP>

<!-- ********************************************** -->

<!-- ******* MORPHOLOGICAL DERIVATION ********** -->

<!-- ********************************************** -->

<!ELEMENT Derivation - O (RestrictUm* & R_Derive+)>

<!ATTLIST Derivation

appellation CDATA #IMPLIED

commentaire CDATA #IMPLIED>

<!-- The content token, R_Derive, is used to record the

different components of a derivation. Concurrent

derivations are indicated by recording several Derivation

elements on one derived Unit.

The content token, RestrictUm, refers to the

derived unit.-->

<!ELEMENT R_Derive - O (RestrictUm*)>

<!ATTLIST R_Derive

ordre_lineaire NUMBER #IMPLIED

statut (SANS_S|%pStatut) SANS_S

retraitg CDATA #IMPLIED

retraitp CDATA #IMPLIED

um IDREF #REQUIRED>

<!-- The field, um, indicates the component of the derivation.

RestrictUm applies here to that component. -->

 

<!ELEMENT RestrictUm - O EMPTY>

<!ATTLIST RestrictUm

nieme_umg NUMBER #IMPLIED

nieme_radg NUMBER #IMPLIED

nieme_ump NUMBER #IMPLIED

nieme_radp NUMBER #IMPLIED>

<!-- In the context of a Morphological Unit, this element

expresses a restriction on that unit while allowing the

selection of a graphic and/or phonemic variant

(or a radical). -->

<!-- ********************************************** -->

<!-- *************** SHORT FORM *************** -->

<!-- *********************************************** -->

<!ELEMENT FormeBreve - O EMPTY>

<!ATTLIST FormeBreve

typebref (SANS_T|%pTypeBref) SANS_T

um IDREF #REQUIRED>

<!-- The attribute, um ("morphological unit"), indicates the Um

which is the short form of that Um which has the

relationship Forme_Breve ("short_form"). -->

<!-- ********************************************** -->

<!-- ******* MORPHOLOGICAL COMPOSITION ********* -->

<!-- ********************************************** -->

<!ELEMENT R_Compose - O (RestrictUm*)>

<!ATTLIST R_Compose

ordre_lineaire NUMBER #IMPLIED

separg (ATTAQUE_G|%pSeparg) ATTAQUE_G

separp (ATTAQUE_P|%pSeparp) ATTAQUE_P

um IDREF #REQUIRED

mfc IDREF #REQUIRED>

<!-- The attribute, um, indicates a Um_S/Aff/Agg/C (Um sub-

classes) component which participates in the composition.

The attribute, ordre_lineaire ("linear order"), specifies

the position of the component in the composition.

The attributes, separg/p ("graphic/phonemic separators"),

give the list of possible separators which may appear

before the component. -->

<!-- systems of inflection for composed morphological units -->

<!ELEMENT Mfc - O EMPTY>

<!ATTLIST Mfc

id ID #REQUIRED

%pGlose

comb_comb_l IDREFS #REQUIRED>

 

<!ELEMENT Comb_Comb - O EMPTY>

<!ATTLIST Comb_Comb

id ID #REQUIRED

contexte_var CDATA #IMPLIED

combcpose IDREF #REQUIRED

combcposant_l IDREFS #REQUIRED>

<!-- The element, Comb_Comb, establishes a relation between :

- a combination of inflectional features for a Compound

Morphological Unit, and

- one (or more in the case in which a compound allows

variations of inflection) combination of inflectional

features of a component.

The attribute, contexte_var ("context of variant"), labels

the inflection variations for the compounds.

Ex : des pare-soleil(s)

The plural of the compound is formed from

either the singular (old spelling) or the

plural (new spelling) of the component, soleil.

Indications such as "old spelling" or "new spelling" are

noted in the attribute, contexte_var.

One must allow for a separator between the zones of this

CDATA type; the order of the zones must correspond to the

order of the IDREFS in the attribute, combcposant_l

("component combination list"):

"old spelling | new spelling" -->

<!-- ********************************************** -->

<!-- ********** SIMPLIFYING MECHANISMS ********* -->

<!-- ********************************************** -->

<!-- The opening markers for the elements, ajout ("add") and

retrait ("remove"), may be omitted by using the following

SHORTREFs. These elements may appear in the marked file in

the form of two character strings separated by a comma. -->

<!ENTITY e-s-ajout "<ajout>" >

<!SHORTREF s-ajout

"," e-s-ajout >

<!USEMAP s-ajout retrait >

<!-- In case of the configuration : <cff>,s</>, where the element, retrait,

contains no PCDATA, the USEMAP will introduce <retrait><ajout> -->

<!ENTITY e-s-retrait "<retrait><ajout>" >

<!SHORTREF s-retrait

"," e-s-retrait >

<!USEMAP s-retrait Cff >
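The effect of these short references on the content of a Cff can be imitated with the following Python sketch. It does not reproduce the behaviour of an SGML parser; it only shows the expanded markup that the shorthand "retrait,ajout" is meant to stand for. The function name expand_cff_shorthand is an assumption of this illustration.

```python
def expand_cff_shorthand(content: str) -> str:
    """Expand the content 'retrait,ajout' of a Cff into explicit markup.

    Only the first comma acts as the delimiter; an empty string before the
    comma corresponds to the configuration <cff>,s</>.
    """
    retrait, comma, ajout = content.partition(",")
    if not comma:
        raise ValueError("expected two character strings separated by a comma")
    return f"<RETRAIT>{retrait}</><AJOUT>{ajout}</>"


print(expand_cff_shorthand("er,&egrave;re"))  # <RETRAIT>er</><AJOUT>&egrave;re</>
print(expand_cff_shorthand(",s"))             # <RETRAIT></><AJOUT>s</>
```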

 

 

2.3. Entities morpho.ent

For English equivalents for the following terms, please refer to the Vocabulary List (Section A.6).

<!--Consortium GENELEX morpho.ent 3.3-->

<!-- **************A WORD TO THE USERS *************************

Your remarks concerning the DTD will be studied by the GENELEX consortium.

If changes to the DTD warrant the release of a new version, the consortium assumes responsibility for its diffusion.

************************************************************ -->

<!ENTITY % pBooleen "OUI|NON" >

<!ENTITY % pDatation "ARCHAIQUE|VIEILLI|MODERNE" >

<!ENTITY % pNiveauLgue "FAMILIER|VULGAIRE|ARGOTIQUE|POPULAIRE

|LITTERAIRE|SAVANT|STANDARD" >

<!ENTITY % pFrequence "RARE|COURANT" >

<!ENTITY % pCatGram "NOM|ADJECTIF|ADVERBE|VERBE|PREPOSITION

|CONJONCTION|INTERJECTION|DETERMINANT

|PRONOM|PARTICULE" >

<!ENTITY % pSsCatGram "PROPRE|COMMUN|POSSESSIF|DEMONSTRATIF

|PARTITIF|DEFINI|INDEFINI|CARDINAL|ORDINAL

|EXCLAMATIF|QUALIFICATIF|INTERROGATIF

|RELATIF|COORDINATION|SUBORDINATION

|PERSONNEL_FORT|PERSONNEL_FAIBLE|IMPERSONNEL" >

<!ENTITY % pMode "INDICATIF|SUBJONCTIF|CONDITIONNEL|IMPERATIF

|INFINITIF|PARTICIPE" >

<!ENTITY % pTemps

"PRESENT|IMPARFAIT|PASSE_SIMPLE|FUTUR|PASSE" >

<!ENTITY % pPersonne "1|2|3" >

<!ENTITY % pGenre "MASCULIN|FEMININ|NEUTRE" >

<!ENTITY % pNombre "SINGULIER|PLURIEL" >

<!ENTITY % pNombrePosseur

"SINGULIER_POSSEUR|PLURIEL_POSSEUR" >

<!ENTITY % pTypaff "PREFIXE|SUFFIXE|INFIXE" >

<!ENTITY % pStatut "%pTypaff|BASE" >

<!ENTITY % pTypeBref "ABREVIATION|SIGLE|ACRONYME" >

<!ENTITY % pSeparg "TIRET|APOSTROPHE|ESPACE|JOINTURE

|TIRET_ESPACE|TIRET_JOINTURE|TIRET_APOSTROPHE

|TIRET_ESPACE_JOINTURE

|APOSTROPHE_JOINTURE" >

<!ENTITY % pSeparp "LIAISON_t|LIAISON_z|LIAISON_k

|LIAISON_n|LIAISON_r

|FRONTIERE_MOT" >

 

2.4. Constraints morpho.ctr

<!--Consortium GENELEX morpho.ctr 3.2 -->

<!--CONTRAINTE Um_S

combve TYPE CombVE

etymon_l TYPE Etymon

usyn_l TYPE Usyn -->

<!--CONTRAINTE Um_C

combve TYPE CombVE

etymon_l TYPE Etymon

usyn_l TYPE Usyn -->

<!--CONTRAINTE Um_Agg

combve TYPE CombVE

etymon_l TYPE Etymon -->

<!--CONTRAINTE Um_Aff

combve TYPE CombVE

etymon_l TYPE Etymon

usem_aff_l TYPE Usem_Aff -->

<!--CONTRAINTE Umg

combve TYPE CombVE

mf TYPE Mfg -->

<!--CONTRAINTE Ump

combve TYPE CombVE

mf TYPE Mfp -->

<!--CONTRAINTE CombTM_Cff

combtm TYPE CombTM -->

<!--CONTRAINTE R_Derive

um TYPE (Um_S|Um_Agg

|Um_Aff) -->

<!--CONTRAINTE FormeBreve

um TYPE (Um_S|Um_C

|Um_Agg|Um_Aff) -->

<!--CONTRAINTE R_Compose

um TYPE (Um_S|Um_C

|Um_Agg|Um_Aff)

mfc TYPE Mfc -->

<!--CONTRAINTE Mfc

comb_comb_l TYPE Comb_Comb -->

<!--CONTRAINTE Comb_Comb

combcpose TYPE CombTM

combcposant_l TYPE CombTM -->
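Since the SGML parser ignores these comments, the reference types must be checked by a separate tool. The following Python sketch shows one possible way of reading CONTRAINTE comments and checking the type of a referenced element. The function names and the policy of accepting attributes for which no constraint is recorded are assumptions of this illustration, not a description of the consortium's software.

```python
import re

# Two constraints copied from the listing above.
CONSTRAINTS = """
<!--CONTRAINTE Umg
combve TYPE CombVE
mf TYPE Mfg -->
<!--CONTRAINTE R_Compose
um TYPE (Um_S|Um_C
|Um_Agg|Um_Aff)
mfc TYPE Mfc -->
"""


def parse_constraints(text):
    """Map element name -> {attribute: set of allowed target element types}."""
    rules = {}
    for element, body in re.findall(r"<!--CONTRAINTE\s+(\S+)(.*?)-->", text, re.S):
        attrs = {}
        for attr, target in re.findall(r"(\w+)\s+TYPE\s+(\([^)]*\)|\w+)", body):
            attrs[attr] = set(re.findall(r"\w+", target))
        rules[element] = attrs
    return rules


def check_reference(rules, element, attr, target_type):
    """True if `attr` of `element` may refer to an element of `target_type`.
    Attributes without a recorded constraint are accepted (assumed policy)."""
    allowed = rules.get(element, {}).get(attr)
    return allowed is None or target_type in allowed


rules = parse_constraints(CONSTRAINTS)
print(check_reference(rules, "Umg", "mf", "Mfg"))         # True
print(check_reference(rules, "Umg", "mf", "Mfp"))         # False
print(check_reference(rules, "R_Compose", "um", "Um_C"))  # True
```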

 

3. Examples of marked data

3.1. Morphological unit for "chaise"

<UM_S ID="UM11315" CATGRAM="NOM">

<UMG MF="MFG210">chaise</>

<UMP MF="MFP210">Sez</>

</>

<MFG ID="MFG210"

COMMENTAIRE="formation de base des noms ou adjectifs f&eacute;minins"

EXEMPLE="chaise,chaises">

<COMBTM_CFF COMBTM="GN2">

<CFF NIEME_RADGP="0"><RETRAIT></><AJOUT></></>

</>

<COMBTM_CFF COMBTM="GN4">

<CFF NIEME_RADGP="0"><RETRAIT></><AJOUT>s</></>

</>

</>

<COMBTM ID="GN2" GENRE="FEMININ" NOMBRE="SINGULIER">

<COMBTM ID="GN4" GENRE="FEMININ" NOMBRE="PLURIEL">

3.2. Morphological unit for "boulanger"

<UM_S ID="UM8275" CATGRAM="NOM">

<UMG MF="MFG420">boulanger</>

<UMP MF="MFP320">bulanZer*</>

</>

<MFG ID="MFG420"

COMMENTAIRE="noms et adjectifs des deux genres avec pluriel en -s, et masculin

et f&eacute;minin respectivement en -er et -&egrave;re"

EXEMPLE="boulanger,boulang&egrave;re,boulangers,boulang&egrave;res">

<COMBTM_CFF COMBTM="GN1">

<CFF NIEME_RADGP="0"><RETRAIT>er</><AJOUT>er</></>

</>

<COMBTM_CFF COMBTM="GN2">

<CFF NIEME_RADGP="0"><RETRAIT>er</><AJOUT>&egrave;re</></>

</>

<COMBTM_CFF COMBTM="GN3">

<CFF NIEME_RADGP="0"><RETRAIT>er</><AJOUT>ers</></>

</>

<COMBTM_CFF COMBTM="GN4">

<CFF NIEME_RADGP="0"><RETRAIT>er</><AJOUT>&egrave;res</></>

</>

</>

<COMBTM ID="GN1" GENRE="MASCULIN" NOMBRE="SINGULIER">

<COMBTM ID="GN2" GENRE="FEMININ" NOMBRE="SINGULIER">

<COMBTM ID="GN3" GENRE="MASCULIN" NOMBRE="PLURIEL">

<COMBTM ID="GN4" GENRE="FEMININ" NOMBRE="PLURIEL">
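As an illustration, the graphic inflected forms of "boulanger" can be derived from the system MFG420 above with a few lines of Python. The dictionary layout and the function inflect are assumptions of this sketch; only the removal and addition strings are taken from the marked data, with NIEME_RADGP="0" interpreted as the label of the Umg.

```python
# CombTM -> list of (retrait, ajout) pairs, transcribed from MFG420.
MFG420 = {
    "GN1": [("er", "er")],    # masculine singular
    "GN2": [("er", "ère")],   # feminine singular
    "GN3": [("er", "ers")],   # masculine plural
    "GN4": [("er", "ères")],  # feminine plural
}


def inflect(lib, mf):
    """Apply every Cff of an inflection system to the label `lib`.

    Each CombTM maps to a list because several Cffs (free variants)
    may be attached to the same CombTM.
    """
    forms = {}
    for combtm, cffs in mf.items():
        forms[combtm] = [
            lib[: len(lib) - len(retrait)] + ajout
            for retrait, ajout in cffs
            if lib.endswith(retrait)
        ]
    return forms


for combtm, variants in inflect("boulanger", MFG420).items():
    print(combtm, variants)
# GN1 ['boulanger']
# GN2 ['boulangère']
# GN3 ['boulangers']
# GN4 ['boulangères']
```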

 

3.3. Morphological unit for "dentiste"

<UM_S ID="UM18851" CATGRAM="NOM">

<UMG MF="MFG310">dentiste</>

<UMP MF="MFP310">dantist</>

</>

<MFG ID="MFG310"

COMMENTAIRE="noms et adjectifs des deux genres avec pluriel en -s et m&ecirc;me

forme au masculin et au f&eacute;minin"

EXEMPLE="artiste,artistes">

<COMBTM_CFF COMBTM="GN1">

<CFF NIEME_RADGP="0"><RETRAIT></><AJOUT></></>

</>

<COMBTM_CFF COMBTM="GN2">

<CFF NIEME_RADGP="0"><RETRAIT></><AJOUT></></>

</>

<COMBTM_CFF COMBTM="GN3">

<CFF NIEME_RADGP="0"><RETRAIT></><AJOUT>s</></>

</>

<COMBTM_CFF COMBTM="GN4">

<CFF NIEME_RADGP="0"><RETRAIT></><AJOUT>s</></>

</>

</>

<COMBTM ID="GN1" GENRE="MASCULIN" NOMBRE="SINGULIER">

<COMBTM ID="GN2" GENRE="FEMININ" NOMBRE="SINGULIER">

<COMBTM ID="GN3" GENRE="MASCULIN" NOMBRE="PLURIEL">

<COMBTM ID="GN4" GENRE="FEMININ" NOMBRE="PLURIEL">

3.4. Morphological unit for "une interface"

<UM_S ID="UM36658" CATGRAM="NOM">

<UMG MF="MFG210">interface</>

<UMP MF="MFP210">interfas</>

</>

<MFG ID="MFG210"

COMMENTAIRE="formation de base des noms ou adjectifs f&eacute;minins"

EXEMPLE="chaise,chaises">

<COMBTM_CFF COMBTM="GN2">

<CFF NIEME_RADGP="0"><RETRAIT></><AJOUT></></>

</>

<COMBTM_CFF COMBTM="GN4">

<CFF NIEME_RADGP="0"><RETRAIT></><AJOUT>s</></>

</>

</>

<COMBTM ID="GN2" GENRE="FEMININ" NOMBRE="SINGULIER">

<COMBTM ID="GN4" GENRE="FEMININ" NOMBRE="PLURIEL">

 

3.5. Morphological unit for "leitmotiv"

<UM_S ID="UM38852" CATGRAM="NOM">

<UMG MF="MFG3">leitmotiv</>

<UMP NIEME="0" MF="MFP10">lajtmotiv</>

<UMP NIEME="1" MF="MFP10">lejtmotiv</>

<UMP NIEME="2" MF="MFP10">letmotiv</>

</>

<MFG ID="MFG3"

COMMENTAIRE="noms ou adjectifs masculins, pluriel allemand"

EXEMPLE="leitmotiv,leitmotivs,leitmotive">

<COMBTM_CFF COMBTM="GN1">

<CFF NIEME_RADGP="0"><RETRAIT></><AJOUT></></>

</>

<COMBTM_CFF COMBTM="GN3">

<CFF NIEME="0" NIEME_RADGP="0"><RETRAIT></><AJOUT>s</></>

<CFF NIEME="1" NIEME_RADGP="0"><RETRAIT></><AJOUT>e</></>

</>

</>

<COMBTM ID="GN1" GENRE="MASCULIN" NOMBRE="SINGULIER">

<COMBTM ID="GN3" GENRE="MASCULIN" NOMBRE="PLURIEL">

3.6. Morphological unit for "amour"

<UM_S ID="UM2462" CATGRAM="NOM">

<UMG MF="MFG60">amour</>

<UMP MF="MFP60">amur</>

</>

<MFG ID="MFG60"

COMMENTAIRE="noms masculins qui ont aussi un pluriel f&eacute;minin selon la

formation de base"

EXEMPLE="amour,d&eacute;lice,orgue">

<COMBTM_CFF COMBTM="GN1">

<CFF NIEME_RADGP="0"><RETRAIT></><AJOUT></></>

</>

<COMBTM_CFF COMBTM="GN3">

<CFF NIEME_RADGP="0"><RETRAIT></><AJOUT>s</></>

</>

<COMBTM_CFF COMBTM="GN4">

<CFF NIEME_RADGP="0"><RETRAIT></><AJOUT>s</></>

</>

</>

<COMBTM ID="GN1" GENRE="MASCULIN" NOMBRE="SINGULIER">

<COMBTM ID="GN3" GENRE="MASCULIN" NOMBRE="PLURIEL">

<COMBTM ID="GN4" GENRE="FEMININ" NOMBRE="PLURIEL">

3.7. Morphological unit for "fiançailles"

<UM_S ID="UM28147" CATGRAM="NOM">

<UMG MF="MFG222">fian&ccedil;ailles</>

<UMP MF="MFP210">fi+ansaj</>

</>

<MFG ID="MFG222"

COMMENTAIRE="noms ou adjectifs f&eacute;minins sans singulier"

EXEMPLE="fian&ccedil;ailles">

<COMBTM_CFF COMBTM="GN4">

<CFF NIEME_RADGP="0"><RETRAIT></><AJOUT></></>

</>

</>

<COMBTM ID="GN4" GENRE="FEMININ" NOMBRE="PLURIEL">

3.8. Morphological unit for "chibouk/que"

<UM_S ID="UM11954" CATGRAM="NOM">

<UMG NIEME="0" MF="MFG210">chibouk</>

<UMG NIEME="1" MF="MFG10">chibouk</>

<UMG NIEME="2" MF="MFG210">chibouque</>

<UMG NIEME="3" MF="MFG10">chibouque</>

<UMP MF="MFP210">Sibuk</>

</>

3.9. Derivation of "dénationalisation"

<Um_Aff id="uma01" typaff="PREFIXE">

<Umg nieme="1" vedette="NON"><Lib>d&eacute;</Lib>

<Radg nieme="1"><Lib>d&eacute;s</Lib></Radg></Umg>

</Um_Aff>

<Um_Aff id="uma02" typaff="SUFFIXE">

<Umg nieme="1" vedette="NON"><Lib>er</Lib>

<Radg nieme="1"><Lib>iser</Lib></Radg></Umg>

<CatGram_Select catgram="NOM"><CatGram_Result catgram="VERBE">

</Um_Aff>

<Um_Aff id="uma03" typaff="SUFFIXE">

<Umg nieme="1" vedette="NON"><Lib>tion</Lib>

<Radg nieme="1"><Lib>ition</Lib></Radg>

<Radg nieme="2"><Lib>ution</Lib></Radg>

<Radg nieme="3"><Lib>ation</Lib></Radg></Umg>

<CatGram_Select catgram="VERBE"><CatGram_Result catgram="NOM">

<Genre_Result genre="FEMININ">

</Um_Aff>

<Um_S id="ums01" autonomie="OUI" catgram="ADJECTIF" sscatgram="QUALIFICATIF">

<Umg mf="Areg" nieme="0" vedette="NON"><Lib>national</Lib></Umg>

</Um_S>

<Um_S id="ums02" autonomie="OUI" catgram="VERBE" sscatgram="SANS_SC">

<Umg mf="Vgroup1" nieme="0" vedette="NON"><Lib>nationaliser</Lib></Umg>

<Derivation>

<R_Derive ordre_lineaire="1" statut="BASE" um="ums01"></R_Derive>

<R_Derive ordre_lineaire="2" statut="SUFFIXE" um="uma02">

<RestrictUm nieme_radg="1"></R_Derive>

</Derivation></Um_S>

<Um_S id="ums03" autonomie="OUI" catgram="NOM" sscatgram="COMMUN">

<Umg mf="Nfreg" nieme="0" vedette="NON"><Lib>nationalisation</Lib></Umg>

<Derivation>

<R_Derive ordre_lineaire="1" retraitg="er" statut="BASE" um="ums02"></R_Derive>

<R_Derive ordre_lineaire="2" statut="SUFFIXE" um="uma03">

<RestrictUm nieme_radg="3"></R_Derive>

</Derivation></Um_S>

 

<Um_S id="ums04" autonomie="OUI" catgram="VERBE" sscatgram="SANS_SC">

<Umg mf="Vgroup1" nieme="0" vedette="NON"><Lib>d&eacute;nationaliser</Lib></Umg>

<Derivation>

<R_Derive ordre_lineaire="1" statut="PREFIXE" um="uma01"></R_Derive>

<R_Derive ordre_lineaire="2" statut="BASE" um="ums02"></R_Derive>

</Derivation></Um_S>

<Um_S id="ums05" autonomie="OUI" catgram="NOM" sscatgram="COMMUN">

<Umg mf="Nfreg" nieme="0" vedette="NON"><Lib>d&eacute;nationalisation</Lib></Umg>

<Derivation>

<R_Derive ordre_lineaire="1" statut="PREFIXE" um="uma01"></R_Derive>

<R_Derive ordre_lineaire="2" statut="BASE" um="ums03"></R_Derive>

</Derivation>

<Derivation>

<R_Derive ordre_lineaire="1" retraitg="er" statut="BASE" um="ums04"></R_Derive>

<R_Derive ordre_lineaire="2" statut="SUFFIXE" um="uma03">

<RestrictUm nieme_radg="3"></R_Derive>

</Derivation></Um_S>
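The two concurrent derivations recorded for "dénationalisation" can be checked with the following Python sketch, which rebuilds the graphic form from the components of each Derivation. The table UNITS and the function derive are assumptions of this illustration; a missing nieme_radg is interpreted as a reference to the label (the 0th radical).

```python
# Labels and graphic radicals transcribed from the marked data above.
UNITS = {
    "uma01": {"lib": "dé", "radicals": {1: "dés"}},
    "uma03": {"lib": "tion", "radicals": {1: "ition", 2: "ution", 3: "ation"}},
    "ums03": {"lib": "nationalisation", "radicals": {}},
    "ums04": {"lib": "dénationaliser", "radicals": {}},
}


def derive(components):
    """Concatenate the components of one Derivation in linear order.

    components: (um, retraitg, nieme_radg) triples; retraitg is removed
    from the end of the component, nieme_radg selects a radical.
    """
    parts = []
    for um, retraitg, nieme_radg in components:
        unit = UNITS[um]
        form = unit["radicals"][nieme_radg] if nieme_radg else unit["lib"]
        if retraitg:
            if not form.endswith(retraitg):
                raise ValueError(f"{form!r} does not end with {retraitg!r}")
            form = form[: -len(retraitg)]
        parts.append(form)
    return "".join(parts)


# First Derivation: prefix dé + base nationalisation.
print(derive([("uma01", "", None), ("ums03", "", None)]))  # dénationalisation
# Second Derivation: base dénationaliser minus "er" + 3rd radical of the suffix.
print(derive([("ums04", "er", None), ("uma03", "", 3)]))   # dénationalisation
```

Both analyses yield the same graphic form, which is why they are recorded as two Derivation elements on the same derived unit.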

3.10. Inflected variations for "concerto"

<MFG ID="MFG131"

COMMENTAIRE="noms ou adjectifs masculins avec variante flexionelle"

EXEMPLE="solo,solos,soli">

<COMBTM_CFF COMBTM="GN1">

<CFF NIEME_RADGP="0"><RETRAIT>o</><AJOUT>o</></>

</>

<COMBTM_CFF COMBTM="GN3">

<CFF NIEME="0" CORRESP_L="0" NIEME_RADGP="0"><RETRAIT>o</><AJOUT>os</></>

<CFF NIEME="1" CORRESP_L="1" NIEME_RADGP="0"><RETRAIT>o</><AJOUT>i</></>

</>

</>

<MFP ID="MFP131"

COMMENTAIRE="avec variations libres"

EXEMPLE="solo, solos ou soli">

<COMBTM_CFF COMBTM="GN1">

<CFF NIEME_RADGP="0"><RETRAIT>o</><AJOUT>o</></>

</>

<COMBTM_CFF COMBTM="GN3">

<CFF NIEME="0" CORRESP_L="0" NIEME_RADGP="0"><RETRAIT>o</><AJOUT>o</></>

<CFF NIEME="1" CORRESP_L="1" NIEME_RADGP="0"><RETRAIT>o</><AJOUT>i</></>

</>

</>
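The role of NIEME and CORRESP_L in linking graphic and phonemic variants can be sketched in Python as follows. The tuples transcribe the GN3 entries of MFG131 and MFP131; the function pair_variants and the phonemic label "solo" are assumptions made for this illustration.

```python
# (nieme, corresp_l, retrait, ajout) for the CombTM GN3 (masculine plural).
MFG131_GN3 = [(0, [0], "o", "os"), (1, [1], "o", "i")]
MFP131_GN3 = [(0, [0], "o", "o"), (1, [1], "o", "i")]


def pair_variants(lib_g, cffs_g, lib_p, cffs_p):
    """Link each graphic variant to the phonemic variant(s) whose nieme
    appears in the corresp_l of the graphic Cff."""
    phonemic = {
        nieme: lib_p[: len(lib_p) - len(retrait)] + ajout
        for nieme, _, retrait, ajout in cffs_p
    }
    pairs = []
    for _, corresp_l, retrait, ajout in cffs_g:
        graphic = lib_g[: len(lib_g) - len(retrait)] + ajout
        for nieme in corresp_l:
            pairs.append((graphic, phonemic[nieme]))
    return pairs


# "solo" is assumed to be both the graphic and the phonemic label.
print(pair_variants("solo", MFG131_GN3, "solo", MFP131_GN3))
# [('solos', 'solo'), ('soli', 'soli')]
```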