Interface ExtractAPI

public interface ExtractAPI

Top-level API for extraction

The overall process is as follows:
- create an Extract instance and reference it through the ExtractAPI interface
- call apply() on it
- create a Filtering instance and reference it through the FilteringAPI interface
- call apply() on it
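The extraction half of the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's own code: the interface is redeclared locally with only two of the documented methods, StubExtract is a hypothetical stand-in for the real Extract class (whose constructor is not documented on this page), and the success value is assumed to be the literal string "ok". The Filtering step is left out because the signature of FilteringAPI.apply() is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;

// Subset of the documented ExtractAPI signatures, redeclared so the sketch compiles on its own.
interface ExtractAPI {
    String apply(String language, boolean withLanguageDetection,
                 ArrayList<String> languagePerimeter,
                 ArrayList<String> corpusFiles, boolean weakTypography);
    void writeResult(String fileName);
}

// Hypothetical stand-in for the real Extract class; it performs no extraction.
class StubExtract implements ExtractAPI {
    String lastOutputFile; // records the file name passed to writeResult()

    public String apply(String language, boolean withLanguageDetection,
                        ArrayList<String> languagePerimeter,
                        ArrayList<String> corpusFiles, boolean weakTypography) {
        // Mirrors the documented contract: "ok", or a message in case of error.
        return corpusFiles.isEmpty() ? "no corpus files given" : "ok";
    }

    public void writeResult(String fileName) {
        lastOutputFile = fileName;
    }
}

class ExtractWorkflowSketch {
    // Runs the extraction step and writes the XML result only if apply() succeeded.
    static String runExtraction(ExtractAPI extract, ArrayList<String> corpusFiles, String outputFile) {
        ArrayList<String> perimeter = new ArrayList<>(Arrays.asList("en"));
        String status = extract.apply("en", false, perimeter, corpusFiles, false);
        if ("ok".equals(status)) { // assumption: success is reported as the string "ok"
            extract.writeResult(outputFile);
        }
        return status;
    }

    public static void main(String[] args) {
        ExtractAPI extract = new StubExtract(); // real code would create an Extract instance here
        ArrayList<String> files = new ArrayList<>(Arrays.asList("corpus.txt"));
        System.out.println(runExtraction(extract, files, "result.xml"));
    }
}
```

Checking the returned status before calling writeResult() avoids writing a result file after a failed run.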

Method Summary
 java.lang.String apply(java.lang.String language, boolean withLanguageDetection, java.util.ArrayList<java.lang.String> languagePerimeter, java.util.ArrayList<java.lang.String> corpusFiles, boolean weakTypography)
          Runs the extractor on a set of files
 java.util.ArrayList<ExtractSentence> getResult()
          Returns the extraction result
 int getSentenceNumber()
          Returns the number of sentences in the corpus
 int getWordNumber()
          Returns the number of words after lemmatization
 boolean isWeakTypography()
          Returns whether weak typography mode is active (weak = uppercase text without accents)
 void writeResult(java.lang.String fileName)
          Writes the result to an XML file

Method Detail


java.lang.String apply(java.lang.String language,
                       boolean withLanguageDetection,
                       java.util.ArrayList<java.lang.String> languagePerimeter,
                       java.util.ArrayList<java.lang.String> corpusFiles,
                       boolean weakTypography)
Runs the extractor on a set of files

Parameters:
language - Language of the content to be extracted. TagExtract cannot process two languages at the same time. Only "fr" and "en" are available.
withLanguageDetection - When set, language detection is performed.
languagePerimeter - For multilingual corpora, the list of languages to consider. If not specified, the list defaults to da (Danish), de (German), en (English), es (Spanish), fr (French), it (Italian), nl (Dutch), no (Norwegian), po (Polish) and se (Swedish). The language is identified sentence by sentence, in order to handle multilingual texts. If the perimeter is known, it is better to set it explicitly than to rely on the default list of languages.
corpusFiles - Files to be processed.
weakTypography - When set, the text is assumed to be written in weak typography (uppercase, no accents). In this case, processing takes longer and is often noisy.
Returns:
ok, or a message in case of error
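As an illustration of the parameters above, the following sketch calls apply() with language detection switched on and an explicit perimeter. It is a sketch under assumptions: only apply() is redeclared locally, extractFrench and isAvailable are hypothetical helper names, and the perimeter fr/en/de is just an example of a corpus whose languages are known in advance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Only apply() is redeclared here, copied from the documented signature, so the sketch compiles alone.
interface ExtractAPI {
    String apply(String language, boolean withLanguageDetection,
                 ArrayList<String> languagePerimeter,
                 ArrayList<String> corpusFiles, boolean weakTypography);
}

class ApplyCallSketch {
    // The two extraction languages the documentation lists as available.
    static final List<String> AVAILABLE = Arrays.asList("fr", "en");

    // Hypothetical pre-check, so that an unsupported code is caught before apply() is called.
    static boolean isAvailable(String language) {
        return AVAILABLE.contains(language);
    }

    // Hypothetical helper: extracts French content from files that may also
    // contain English and German sentences.
    static String extractFrench(ExtractAPI extract, ArrayList<String> corpusFiles) {
        // An explicit perimeter, as the documentation recommends when the languages are known.
        ArrayList<String> perimeter = new ArrayList<>(Arrays.asList("fr", "en", "de"));
        return extract.apply("fr", true, perimeter, corpusFiles, false);
    }
}
```

Because the redeclared interface has a single method, a lambda such as `(l, d, p, f, w) -> "ok"` can stand in for the real implementation when trying the helper out.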


java.util.ArrayList<ExtractSentence> getResult()
Returns the extraction result

Returns:
a list of ExtractSentence objects, in the same order as in the original corpus


void writeResult(java.lang.String fileName)
Writes the result to an XML file


int getWordNumber()
Returns the number of words after lemmatization


int getSentenceNumber()
Returns the number of sentences in the corpus


boolean isWeakTypography()
Returns whether weak typography mode is active (weak = uppercase text without accents)
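
To make the notion of weak typography concrete, the sketch below converts a string to uppercase without accents and tests whether a string is already in that form. It only illustrates the definition given above, using the standard java.text.Normalizer class; the class and method names are hypothetical, and nothing here is claimed to match how TagExtract itself handles such text.

```java
import java.text.Normalizer;
import java.util.Locale;

class WeakTypographySketch {
    // Decomposes accented characters (NFD), drops the combining marks, then uppercases.
    static String toWeakTypography(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toUpperCase(Locale.ROOT);
    }

    // True when the text is unchanged by the conversion, i.e. already uppercase and unaccented.
    static boolean looksWeak(String text) {
        return text.equals(toWeakTypography(text));
    }

    public static void main(String[] args) {
        // "\u00c9l\u00e8ve" is the French word for "pupil", written with escapes to stay ASCII-safe.
        System.out.println(toWeakTypography("\u00c9l\u00e8ve"));
        System.out.println(looksWeak("ELEVE"));
    }
}
```

Text in this form has lost its case and accent information, which is consistent with the note that weak typography makes processing longer and noisier.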