public interface ExtractAPI
Top-level API for extraction
The overall process is as follows:
- create an Extract instance and set an ExtractAPI interface
- call apply()
- create a Filtering instance and set a FilteringAPI interface
- call apply()
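The extraction half of these steps can be sketched as follows. This is a minimal illustration, not the library's own code: the nested `ExtractAPI` interface only mirrors the signatures documented on this page so the sketch compiles on its own, `ExtractSentence` is an empty placeholder, and `StubExtract` and `runExtraction` are hypothetical names standing in for the real `Extract` class, whose constructor is not documented here. The meaning of the string returned by `apply()` is also not documented, so the sketch simply passes it through.

```java
import java.util.ArrayList;
import java.util.Arrays;

class ExtractWorkflowSketch {

    /** Placeholder for the library's ExtractSentence type, whose members are not documented here. */
    static class ExtractSentence {}

    /** Local mirror of the ExtractAPI signatures documented on this page. */
    interface ExtractAPI {
        String apply(String language, boolean withLanguageDetection,
                     ArrayList<String> languagePerimeter,
                     ArrayList<String> corpusFiles, boolean weakTypography);
        ArrayList<ExtractSentence> getResult();
        int getSentenceNumber();
        int getWordNumber();
        boolean isWeakTypography();
        void writeResult(String fileName);
    }

    /** Hypothetical stand-in for the real Extract class: it records its inputs and extracts nothing. */
    static class StubExtract implements ExtractAPI {
        private boolean weak;
        String writtenTo; // last file name passed to writeResult, kept for inspection

        public String apply(String language, boolean withLanguageDetection,
                            ArrayList<String> languagePerimeter,
                            ArrayList<String> corpusFiles, boolean weakTypography) {
            weak = weakTypography;
            return "stub"; // the meaning of the real return value is not documented
        }
        public ArrayList<ExtractSentence> getResult() { return new ArrayList<>(); }
        public int getSentenceNumber() { return 0; }
        public int getWordNumber() { return 0; }
        public boolean isWeakTypography() { return weak; }
        public void writeResult(String fileName) { writtenTo = fileName; }
    }

    /** Calls apply(), reads the results and counters, then writes the XML file. */
    static String runExtraction(ExtractAPI extract, ArrayList<String> corpusFiles, String xmlFile) {
        // Monolingual French corpus, no language detection, normal typography.
        // Passing an empty perimeter is an assumption; it should be unused without detection.
        String status = extract.apply("fr", false, new ArrayList<>(), corpusFiles, false);
        ArrayList<ExtractSentence> sentences = extract.getResult();
        extract.writeResult(xmlFile);
        return "status=" + status
                + " results=" + sentences.size()
                + " sentences=" + extract.getSentenceNumber()
                + " words=" + extract.getWordNumber();
    }

    public static void main(String[] args) {
        ArrayList<String> files = new ArrayList<>(Arrays.asList("corpus/a.txt", "corpus/b.txt"));
        System.out.println(runExtraction(new StubExtract(), files, "extract.xml"));
    }
}
```

With the real library, the stub would be replaced by an `Extract` instance, and the filtering half would presumably follow the same pattern through `FilteringAPI`.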
See Also: FilteringAPI
Method Summary | |
---|---|
java.lang.String | apply(java.lang.String language, boolean withLanguageDetection, java.util.ArrayList<java.lang.String> languagePerimeter, java.util.ArrayList<java.lang.String> corpusFiles, boolean weakTypography) - Runs the extractor on a set of files |
java.util.ArrayList<ExtractSentence> | getResult() - To obtain the extraction result |
int | getSentenceNumber() - To get the number of sentences in the corpus |
int | getWordNumber() - To get the number of words after lemmatization |
boolean | isWeakTypography() - To consult the mode: weak = uppercase without any accent |
void | writeResult(java.lang.String fileName) - Write the result in an XML file |
Method Detail |
---|
java.lang.String apply(java.lang.String language, boolean withLanguageDetection, java.util.ArrayList<java.lang.String> languagePerimeter, java.util.ArrayList<java.lang.String> corpusFiles, boolean weakTypography)
language - Language of the content to be extracted. TagExtract cannot process two languages at the same time. Only "fr" and "en" are available.
withLanguageDetection - When set, language detection is performed.
languagePerimeter - For multilingual content, the list of languages to be considered. If not specified, the list defaults to da (Danish), de (German), en (English), es (Spanish), fr (French), it (Italian), nl (Dutch), no (Norwegian), po (Polish) and se (Swedish). Language is identified on a per-sentence basis in order to handle multilingual texts. If the perimeter is known, it is better to set it explicitly than to rely on the default list of languages.
corpusFiles - Files to be processed.
weakTypography - When set, the text is assumed to be written in weak typography (uppercase, no accents). In this case, processing takes longer and the results are often noisy.
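As an illustration of preparing the arguments described above, the following sketch checks the extraction language against the two documented values and builds an explicit language perimeter. The class and its helpers (`ApplyArgumentsSketch`, `checkLanguage`, `perimeter`) are hypothetical names introduced here, not part of the library, and removing duplicate codes is a convenience of the sketch rather than a documented requirement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ApplyArgumentsSketch {

    /** The only extraction languages the documentation lists as available. */
    static final List<String> EXTRACTION_LANGUAGES = Arrays.asList("fr", "en");

    /** Rejects any language other than "fr" or "en" before apply() is called. */
    static String checkLanguage(String language) {
        if (!EXTRACTION_LANGUAGES.contains(language)) {
            throw new IllegalArgumentException("Unsupported extraction language: " + language);
        }
        return language;
    }

    /** Builds an explicit languagePerimeter from the given codes, dropping duplicates and keeping order. */
    static ArrayList<String> perimeter(String... codes) {
        ArrayList<String> result = new ArrayList<>();
        for (String code : codes) {
            if (!result.contains(code)) {
                result.add(code);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A corpus known to mix English, French and German sentences.
        String language = checkLanguage("en");
        ArrayList<String> knownLanguages = perimeter("en", "fr", "de");
        System.out.println(language + " " + knownLanguages);
        // With a real implementation these would be passed as:
        // extract.apply(language, true, knownLanguages, corpusFiles, false);
    }
}
```

Restricting the perimeter to the languages actually present follows the advice above: it narrows what the per-sentence detection has to choose between, instead of relying on the ten-language default.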
java.util.ArrayList<ExtractSentence> getResult()
void writeResult(java.lang.String fileName)
int getWordNumber()
int getSentenceNumber()
boolean isWeakTypography()