com.ta.TagExtractAPI
Interface ExtractAPI


public interface ExtractAPI

Top-level API for extraction

The typical workflow is as follows:
- create an Extract instance and reference it through the ExtractAPI interface
- call apply() on it
- create a Filtering instance and reference it through the FilteringAPI interface
- call apply() on it
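
The extraction half of this workflow can be sketched as below. This is only an illustration under stated assumptions: the interface is copied locally from the signatures on this page so the sketch compiles without the library, StubExtract is a hypothetical stand-in for the library's Extract class (whose constructor is not documented here), ExtractSentence is a placeholder type, and passing an empty list for languagePerimeter is assumed to mean "not specified". The filtering step is left out because the FilteringAPI signatures do not appear on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;

public class ExtractWorkflowSketch {

    /** Placeholder for the library's ExtractSentence; its real members are not documented here. */
    static class ExtractSentence {
        final String text;
        ExtractSentence(String text) { this.text = text; }
    }

    /** Local copy of the documented interface, so the sketch compiles on its own. */
    interface ExtractAPI {
        String apply(String language, boolean withLanguageDetection,
                     ArrayList<String> languagePerimeter,
                     ArrayList<String> corpusFiles, boolean weakTypography);
        ArrayList<ExtractSentence> getResult();
        int getSentenceNumber();
        int getWordNumber();
        boolean isWeakTypography();
        void writeResult(String fileName);
    }

    /** Hypothetical stand-in for the real Extract class: records one fake sentence per file. */
    static class StubExtract implements ExtractAPI {
        private final ArrayList<ExtractSentence> sentences = new ArrayList<>();
        private boolean weak;

        public String apply(String language, boolean withLanguageDetection,
                            ArrayList<String> languagePerimeter,
                            ArrayList<String> corpusFiles, boolean weakTypography) {
            if (corpusFiles == null || corpusFiles.isEmpty()) {
                return "no corpus files given"; // invented error text, for the stub only
            }
            sentences.clear();
            weak = weakTypography;
            for (String file : corpusFiles) {
                sentences.add(new ExtractSentence("sentence from " + file));
            }
            return "ok";
        }
        public ArrayList<ExtractSentence> getResult() { return sentences; }
        public int getSentenceNumber() { return sentences.size(); }
        public int getWordNumber() { return 0; }            // stub: no lemmatization
        public boolean isWeakTypography() { return weak; }
        public void writeResult(String fileName) { }        // stub: writes nothing
    }

    /** Runs apply() on English files without language detection and returns the sentence count. */
    static int run(ExtractAPI extractor, ArrayList<String> files) {
        String status = extractor.apply("en", false, new ArrayList<>(), files, false);
        if (!"ok".equals(status)) {
            throw new IllegalStateException("extraction failed: " + status);
        }
        return extractor.getSentenceNumber();
    }

    public static void main(String[] args) {
        ArrayList<String> files = new ArrayList<>(Arrays.asList("a.txt", "b.txt"));
        System.out.println(run(new StubExtract(), files));
    }
}
```

With the real library, StubExtract would presumably be replaced by the Extract instance mentioned in the list above, and the local interface copy would be dropped.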

Version:
2.0
See Also:
FilteringAPI

Method Summary
 java.lang.String apply(java.lang.String language, boolean withLanguageDetection, java.util.ArrayList<java.lang.String> languagePerimeter, java.util.ArrayList<java.lang.String> corpusFiles, boolean weakTypography)
          Runs the extractor on a set of files
 java.util.ArrayList<ExtractSentence> getResult()
          Returns the extraction result
 int getSentenceNumber()
          Returns the number of sentences in the corpus
 int getWordNumber()
          Returns the number of words after lemmatization
 boolean isWeakTypography()
          Reports whether weak-typography mode is on (weak = uppercase without any accents)
 void writeResult(java.lang.String fileName)
          Writes the result to an XML file
 

Method Detail

apply

java.lang.String apply(java.lang.String language,
                       boolean withLanguageDetection,
                       java.util.ArrayList<java.lang.String> languagePerimeter,
                       java.util.ArrayList<java.lang.String> corpusFiles,
                       boolean weakTypography)
Runs the extractor on a set of files

Parameters:
language - Language of the content to be extracted. TagExtract cannot process two languages at the same time. Only "fr" and "en" are available.
withLanguageDetection - When true, language detection is performed.
languagePerimeter - For multilingual input, the list of languages to consider. If not specified, the list defaults to da (Danish), de (German), en (English), es (Spanish), fr (French), it (Italian), nl (Dutch), no (Norwegian), po (Polish) and se (Swedish). Language is identified on a per-sentence basis in order to handle multilingual texts. If the perimeter is known, it is better to set it explicitly than to rely on the default list of languages.
corpusFiles - Files to be processed
weakTypography - When true, the text is assumed to be written in weak typography (uppercase, no accents). In this case, processing takes longer and the output is often noisy.
Returns:
ok, or an error message if the call fails
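
Because apply() reports failure through its return value rather than an exception, callers have to inspect the returned string. The helper below is a minimal sketch of that check and of building a language perimeter. The class and method names are hypothetical, and it assumes the success value is exactly the lowercase string "ok", which this page suggests but does not spell out.

```java
import java.util.ArrayList;
import java.util.Arrays;

public class ApplyArguments {

    /** Builds a language perimeter from codes such as "fr" or "en". */
    static ArrayList<String> perimeter(String... codes) {
        return new ArrayList<>(Arrays.asList(codes));
    }

    /** True when the status returned by apply() signals success (assumed to be exactly "ok"). */
    static boolean succeeded(String status) {
        return "ok".equals(status);
    }

    public static void main(String[] args) {
        ArrayList<String> languages = perimeter("fr", "en");
        System.out.println(languages);
        System.out.println(succeeded("ok"));
    }
}
```

Restricting the perimeter to the languages actually present in the corpus follows the advice given for languagePerimeter above.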

getResult

java.util.ArrayList<ExtractSentence> getResult()
Returns the extraction result

Returns:
a list of ExtractSentence objects, in the same order as the original corpus

writeResult

void writeResult(java.lang.String fileName)
Writes the result to an XML file


getWordNumber

int getWordNumber()
Returns the number of words after lemmatization


getSentenceNumber

int getSentenceNumber()
Returns the number of sentences in the corpus


isWeakTypography

boolean isWeakTypography()
Reports whether weak-typography mode is on (weak = uppercase without any accents)
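
To make "uppercase without any accent" concrete, the sketch below tests whether a string has that shape. It is a caller-side heuristic written for illustration, not TagExtract's own detection logic, and the class and method names are hypothetical. It could help decide whether to pass weakTypography = true to apply().

```java
import java.text.Normalizer;

public class WeakTypographyCheck {

    /** True if the text has at least one letter, no lowercase letters and no accents. */
    static boolean looksWeak(String text) {
        // NFD splits an accented letter into a base letter plus a combining mark.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        boolean sawLetter = false;
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                return false; // an accent is present
            }
            if (Character.isLetter(cp)) {
                sawLetter = true;
                if (Character.isLowerCase(cp)) {
                    return false; // mixed or lower case
                }
            }
        }
        return sawLetter;
    }

    public static void main(String[] args) {
        System.out.println(looksWeak("L'ETE A PARIS"));
        System.out.println(looksWeak("L'\u00C9T\u00C9 \u00C0 PARIS"));
    }
}
```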