- file format detector based on file content,
- file reader (processed formats: Text, HTML, SGML, XML),
- language detector (32 recognized languages),
- text segmenter that produces sentences,
- sentence segmenter that produces words (tokenization),
- spell corrector,
- morphological analyzer for simple words and compound words,
- robust syntactic parser (based on a chunker),
- unknown word extractor for simple words and compound words, by means of customizable patterns,
- document indexer,
- search engine that queries the index built by the indexer,
- text mining tool to compare and classify texts, or to summarize a document by means of a small set of terms.
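As an illustration of the two segmentation steps listed above (text into sentences, then sentences into words), here is a minimal sketch that uses only the standard JDK class java.text.BreakIterator. It does not show the API of the tools described here; the class name SegmentationSketch and the methods sentences and words are hypothetical names chosen for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class: not part of the product described above.
public class SegmentationSketch {

    /** Splits a text into sentences with the JDK's locale-aware sentence iterator. */
    static List<String> sentences(String text, Locale locale) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Trailing whitespace belongs to the sentence span, so trim it.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    /** Splits a sentence into word tokens, dropping whitespace and punctuation. */
    static List<String> words(String sentence, Locale locale) {
        List<String> result = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(sentence);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = sentence.substring(start, end);
            // Keep only tokens that begin with a letter or a digit.
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The code runs on Linux. It also runs on Windows.";
        for (String sentence : sentences(text, Locale.ENGLISH)) {
            System.out.println(sentence + " -> " + words(sentence, Locale.ENGLISH));
        }
    }
}
```

A rule-based iterator like this one knows nothing about abbreviations or compound words, which is presumably where a dedicated tokenizer backed by a morphological analyzer does better.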
a) be a library accessed through an API, so that the code can be integrated into a Knowledge Management (KM) application, a text mining application or another application.
b) be a ready-to-use application with an HTML or Swing graphical interface.
The code does not depend on the operating system. At the moment, the
programs run on Windows and Linux.
Complementary work can be carried out at your office or ours.
This could be, for instance, the development of missing functionality,
integration, consulting or training.