Tagmatica - TagChunker

TagParser

TagParser performs a morphological, syntactic and semantic analysis of a text.

This is an operational software module that processes thousands of documents each day.

Introduction

TagParser proceeds in three phases: chunking, relation linking and semantic computation.

TagParser is a bottom-up parser: starting from the words, structures are grouped together step by step to form the final result, as described in the external specification.

It is a hybrid algorithm that combines automata with statistical processing. The linguistic data that drive parsing are extracted from a downloaded lexicon and from training on an annotated corpus.

The following text presents TagParser with some simple examples.

To get a more complete picture, please consult the UML activity diagram.

1) CHUNKING

After a morphological analysis, the aim is to obtain:

a) Part-of-speech tagging.

For instance:

In the sentence "The table is near the wall", the word "table" is tagged as a noun and is thus distinguished from its use as a verb, as in "They table the results".
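The disambiguation above can be illustrated with a minimal sketch. It is not TagParser's tagger (which combines automata and statistics learned from a corpus); it only assumes a toy lexicon and a hand-written preference based on the previous tag. The names `LEXICON`, `CONTEXT_PREFERENCE` and `tag`, as well as the tag names, are hypothetical.

```python
# Toy lexicon: each word maps to its possible part-of-speech tags.
# Entries and tag names are illustrative, not TagParser's lexicon.
LEXICON = {
    "the": ["DET"],
    "they": ["PRON"],
    "table": ["NOUN", "VERB"],
    "is": ["VERB"],
    "near": ["PREP"],
    "wall": ["NOUN"],
    "results": ["NOUN"],
}

# Preferred tag for an ambiguous word, given the tag of the previous word.
CONTEXT_PREFERENCE = {
    "DET": "NOUN",   # "the table"  -> noun reading
    "PRON": "VERB",  # "they table" -> verb reading
}


def tag(sentence):
    """Return a list of (word, tag) pairs for a whitespace-separated sentence."""
    tagged = []
    previous_tag = None
    for word in sentence.lower().split():
        candidates = LEXICON.get(word, ["UNKNOWN"])
        preferred = CONTEXT_PREFERENCE.get(previous_tag)
        # Use the contextual preference when the lexicon allows it,
        # otherwise fall back to the first tag listed for the word.
        chosen = preferred if preferred in candidates else candidates[0]
        tagged.append((word, chosen))
        previous_tag = chosen
    return tagged


print(tag("The table is near the wall"))
print(tag("They table the results"))
```

In this sketch "table" receives the noun tag after a determiner and the verb tag after a pronoun; a real tagger would weigh many more contextual cues.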

b) Syntactic group delimitation (i.e. chunking).

The goal is to determine the strict boundaries between syntactic groups.

So, in the first example, the following delimitation will be set:

[The table][is][near the wall]

c) Group labelling.

Each group is then labelled with its type. The following result is produced:

GN GV GP
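Delimitation and labelling can be sketched together as a small state machine over tagged words. This is only an illustration under simple assumptions: the input is already tagged, a preposition opens a GP that absorbs the following determiner and noun, and the labels GN, GV and GP are read here as nominal, verbal and prepositional groups. The function name `chunk` and the tag names are hypothetical, and TagParser's own chunker is trained on a corpus rather than written this way.

```python
# Labels for words that form a group on their own (illustrative assumption).
SINGLE_WORD_LABEL = {"VERB": "GV", "PRON": "GN"}


def chunk(tagged):
    """Group (word, tag) pairs into a list of (label, text) chunks."""
    chunks = []      # each entry is [label, list_of_words]
    is_open = False  # True while the last chunk can still absorb DET/NOUN
    for word, pos in tagged:
        if pos in ("DET", "NOUN") and is_open:
            chunks[-1][1].append(word)
            is_open = pos != "NOUN"      # the noun closes the group
        elif pos in ("DET", "NOUN"):
            chunks.append(["GN", [word]])
            is_open = pos == "DET"       # a determiner waits for its noun
        elif pos == "PREP":
            chunks.append(["GP", [word]])
            is_open = True               # a preposition waits for its complement
        else:
            chunks.append([SINGLE_WORD_LABEL.get(pos, "UNKNOWN"), [word]])
            is_open = False
    return [(label, " ".join(words)) for label, words in chunks]


tagged_sentence = [
    ("The", "DET"), ("table", "NOUN"), ("is", "VERB"),
    ("near", "PREP"), ("the", "DET"), ("wall", "NOUN"),
]
for label, text in chunk(tagged_sentence):
    print(label, "[" + text + "]")
```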

2) SYNTACTIC RELATION LABELING

The chunking process produces a unique result made of constituents. The aim now is to link these structures by means of relations such as subject or noun-modifier. Unlike the chunker (which is based on machine learning techniques), the relations are built by hand-written rules. There are 14 types of relations, as specified in the PEAS annotation guidelines and then reused and slightly improved within the ANR Passage project.

You will find more details in the paper presented at TALN-2003, in the workshop "Evaluation of syntactic parsers".
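To give an idea of what hand-written relation rules over chunks can look like, here is a minimal sketch with two rules. It is not TagParser's rule set: the rule conditions, the function names (`subject_rule`, `noun_modifier_rule`, `link`) and the tuple format are assumptions made for illustration, and only the relation names "subject" and "noun-modifier" come from the text above.

```python
def subject_rule(chunks):
    """Link each GV to the closest GN on its left, skipping intervening GPs."""
    relations = []
    for i, (label, text) in enumerate(chunks):
        if label != "GV":
            continue
        for j in range(i - 1, -1, -1):
            left_label, left_text = chunks[j]
            if left_label == "GN":
                relations.append(("subject", left_text, text))
                break
            if left_label != "GP":
                break  # anything other than a GP blocks the search
    return relations


def noun_modifier_rule(chunks):
    """Link a GP to the GN that immediately precedes it."""
    relations = []
    for (left_label, left_text), (label, text) in zip(chunks, chunks[1:]):
        if left_label == "GN" and label == "GP":
            relations.append(("noun-modifier", text, left_text))
    return relations


# Rules are applied one after the other to the same chunk sequence.
RULES = [subject_rule, noun_modifier_rule]


def link(chunks):
    """Apply every rule and collect (relation, source, target) triples."""
    relations = []
    for rule in RULES:
        relations.extend(rule(chunks))
    return relations


print(link([("GN", "The table"), ("GV", "is"), ("GP", "near the wall")]))
print(link([("GN", "The leg"), ("GP", "of the table"), ("GV", "is")]))
```

Keeping each rule as a separate function makes it easy to add or reorder rules, which is one plausible way to organise a hand-written rule base.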

3) SEMANTIC COMPUTATION

Then, a pipeline of four modules is applied:

a) Named entity recognition, to identify person names, organization names and dates.

b) Word sense disambiguation. The goal is mainly to detect whether a word has one or several meanings. In addition, for some words, a domain mark such as "agriculture" is attached to a specific word sense.

c) Coreference resolution, with three subtasks:

i) variant grouping, such as linking "Nicolas Sarkozy" with "Sarko" (nickname),
ii) anaphora resolution, such as "Robert Mitchum ... he",
iii) function name identification, in order to link "Nicolas Sarkozy ... the president".

d) Quotation extraction, in order to extract "who said what".
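The four modules can be pictured as functions applied in sequence to a shared document record. The sketch below is a toy under stated assumptions, not TagParser's implementation: the resources (`PERSONS`, `NICKNAMES`, `SENSES`), the function names and the quotation pattern are all hypothetical, and the coreference stage covers variant grouping only (anaphora resolution and function name identification are left out).

```python
import re

# Toy resources; every entry here is illustrative.
PERSONS = {"Nicolas Sarkozy", "Robert Mitchum"}
NICKNAMES = {"Sarko": "Nicolas Sarkozy"}
SENSES = {"table": ["furniture", "data grid"], "wall": ["structure"]}


def named_entities(doc):
    """Record the known person names that occur in the text."""
    doc["entities"] = sorted(name for name in PERSONS if name in doc["text"])
    return doc


def word_senses(doc):
    """Record the words that have more than one sense in the toy inventory."""
    words = re.findall(r"[a-z]+", doc["text"].lower())
    doc["ambiguous"] = sorted({w for w in words if len(SENSES.get(w, [])) > 1})
    return doc


def coreference(doc):
    """Variant grouping only: map each nickname found to its full name."""
    doc["variants"] = {
        nick: full
        for nick, full in NICKNAMES.items()
        if re.search(rf"\b{re.escape(nick)}\b", doc["text"])
    }
    return doc


def quotations(doc):
    """Extract (speaker, quote) pairs of the form: Name said: "..."."""
    pattern = r'([A-Z]\w+(?: [A-Z]\w+)*) said: "([^"]+)"'
    doc["quotes"] = re.findall(pattern, doc["text"])
    return doc


# The modules run in this fixed order, each one enriching the same record.
PIPELINE = [named_entities, word_senses, coreference, quotations]


def analyse(text):
    doc = {"text": text}
    for module in PIPELINE:
        doc = module(doc)
    return doc


result = analyse('Nicolas Sarkozy said: "The table is ready". Sarko left.')
for key in ("entities", "ambiguous", "variants", "quotes"):
    print(key, "=", result[key])
```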