ExtractAPI

com.ta.TagExtractAPI
Interface ExtractAPI

public interface ExtractAPI

Top-level API for extraction.

The typical workflow is as follows:
- create an Extract instance and set an ExtractAPI interface
- call apply()
- create a Filtering instance and set a FilteringAPI interface
- call apply()

Version: 2.0
See Also: FilteringAPI

Method Summary
`java.lang.String`	`apply(java.lang.String language, boolean withLanguageDetection, java.util.ArrayList<java.lang.String> languagePerimeter, java.util.ArrayList<java.lang.String> corpusFiles, boolean weakTypography)` Runs the extractor on a set of files
`java.util.ArrayList<ExtractSentence>`	`getResult()` Returns the extraction result
`int`	`getSentenceNumber()` Returns the number of sentences in the corpus
`int`	`getWordNumber()` Returns the number of words after lemmatization
`boolean`	`isWeakTypography()` Reports whether weak typography mode is on (weak = uppercase without any accents)
`void`	`writeResult(java.lang.String fileName)` Writes the result to an XML file

Method Detail

apply

java.lang.String apply(java.lang.String language,
                       boolean withLanguageDetection,
                       java.util.ArrayList<java.lang.String> languagePerimeter,
                       java.util.ArrayList<java.lang.String> corpusFiles,
                       boolean weakTypography)

Runs the extractor on a set of files

Parameters:
- language - Language of the content to be extracted. TagExtract cannot process two languages at the same time. Only "fr" and "en" are available.
- withLanguageDetection - When set, language detection is performed.
- languagePerimeter - For multilingual input, the list of languages to be considered. If not specified, the list defaults to da (Danish), de (German), en (English), es (Spanish), fr (French), it (Italian), nl (Dutch), no (Norwegian), po (Polish) and se (Swedish). Language is identified on a per-sentence basis in order to handle multilingual texts. If the perimeter is known, it is better to set it explicitly than to rely on the default list of languages.
- corpusFiles - Files to be processed.
- weakTypography - When set, the text is treated as written in weak typography (uppercase and no accents). In this case, processing takes longer and the results are often noisy.
Returns: ok, or a message in case of error
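
As an illustration of the parameters described for apply, the following is a minimal sketch of a call on a French corpus that may contain English sentences, with the language perimeter set explicitly. Several things here are assumptions and not part of the documented API: the interface is redeclared locally (with only the apply method) so that the example compiles on its own, the class and method names ApplyExample and extractFrenchWithEnglish are hypothetical, and the success value is assumed to be the literal string "ok".

```java
import java.util.ArrayList;
import java.util.Arrays;

// Local copy of the documented apply signature so the sketch compiles on its own.
// In real code, use the ExtractAPI interface shipped with TagExtract instead.
interface ExtractAPI {
    String apply(String language,
                 boolean withLanguageDetection,
                 ArrayList<String> languagePerimeter,
                 ArrayList<String> corpusFiles,
                 boolean weakTypography);
}

// Hypothetical helper class, not part of TagExtract.
class ApplyExample {
    // Runs the extractor on French files, restricting per-sentence language
    // detection to French and English instead of the default list.
    static void extractFrenchWithEnglish(ExtractAPI extractor,
                                         ArrayList<String> corpusFiles) {
        ArrayList<String> perimeter = new ArrayList<>(Arrays.asList("fr", "en"));

        String status = extractor.apply(
                "fr",        // language: only "fr" and "en" are available
                true,        // withLanguageDetection
                perimeter,   // languagePerimeter
                corpusFiles, // corpusFiles
                false);      // weakTypography: off, the text has normal casing and accents

        // Assumption: success is reported as the literal string "ok";
        // anything else is treated as an error message.
        if (!"ok".equals(status)) {
            throw new IllegalStateException("Extraction failed: " + status);
        }
    }
}
```

Checking the returned string matters because apply reports failures through its return value; the sketch turns that message into an exception so that a failed run cannot be silently ignored.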

getResult

java.util.ArrayList<ExtractSentence> getResult()

Returns the extraction result

Returns: a list of ExtractSentence objects, in the same order as the original corpus

writeResult

void writeResult(java.lang.String fileName)

Writes the result to an XML file

getWordNumber

int getWordNumber()

Returns the number of words after lemmatization

getSentenceNumber

int getSentenceNumber()

Returns the number of sentences in the corpus
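
The two counters can be combined into simple corpus statistics. The sketch below computes the average number of lemmatized words per sentence. It is only an illustration: the interface is redeclared locally with just the two documented counter methods so that the example compiles on its own, and CorpusStats and wordsPerSentence are hypothetical names that do not exist in TagExtract.

```java
// Local copy of the two documented counter methods so the sketch compiles on
// its own. In real code, use the ExtractAPI interface shipped with TagExtract.
interface ExtractAPI {
    int getSentenceNumber();
    int getWordNumber();
}

// Hypothetical helper class, not part of TagExtract.
class CorpusStats {
    // Average number of lemmatized words per sentence.
    // Returns 0.0 when the corpus contains no sentences, to avoid dividing by zero.
    static double wordsPerSentence(ExtractAPI extractor) {
        int sentences = extractor.getSentenceNumber();
        if (sentences <= 0) {
            return 0.0;
        }
        // Cast before dividing so the result is not truncated to an integer.
        return (double) extractor.getWordNumber() / sentences;
    }
}
```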

isWeakTypography

boolean isWeakTypography()

Reports whether weak typography mode is on (weak = uppercase without any accents)