
...

document text parser
- Provide a parser component that extracts the plain text from a PDF or HTML document using open source libraries such as PDFBox or NekoHTML.

simple whitespace tokenizer
- Write a simple whitespace tokenizer that extracts tokens from plain text documents in whitespace-separated languages.
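
A minimal sketch of the tokenizing logic, written as plain Java without any UIMA dependency. The class name `WhitespaceTokenizer` and the nested `Token` type are hypothetical and chosen only for illustration. A real annotator would add each span to the CAS as an annotation instead of returning a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: plain Java, not a UIMA annotator.
final class WhitespaceTokenizer {

    /** A token with its character offsets in the source text (end is exclusive). */
    static final class Token {
        final int begin;
        final int end;
        final String text;

        Token(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /** Splits the text at whitespace and records the offsets of every token. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (start >= 0) {
                    // Whitespace closes the token that is currently open.
                    tokens.add(new Token(start, i, text.substring(start, i)));
                    start = -1;
                }
            } else if (start < 0) {
                start = i; // First non-whitespace character opens a new token.
            }
        }
        if (start >= 0) {
            // The text ended while a token was still open.
            tokens.add(new Token(start, text.length(), text.substring(start)));
        }
        return tokens;
    }
}
```

Keeping the begin and end offsets matters because annotations refer to positions in the document text rather than to copies of it.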
language detection annotator
- Write an annotator that detects the language of a document, for example by using simple language-specific word lists.
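
One possible reading of the word-list approach is sketched below in plain Java: count how many words of the document appear in each language's list and pick the language with the most hits. The class name `WordListLanguageDetector`, the three tiny word lists, and the `"unknown"` fallback are assumptions made for illustration, not part of the proposal.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: plain Java, not a UIMA annotator.
final class WordListLanguageDetector {

    // Tiny illustrative lists; real lists would be much larger.
    private static final Map<String, Set<String>> WORD_LISTS = new LinkedHashMap<>();
    static {
        WORD_LISTS.put("en", Set.of("the", "and", "is", "of", "to"));
        WORD_LISTS.put("de", Set.of("der", "die", "und", "ist", "das"));
        WORD_LISTS.put("fr", Set.of("le", "la", "et", "est", "les"));
    }

    /** Returns the language code with the most word-list hits, or "unknown". */
    static String detect(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("\\s+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : WORD_LISTS.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) {
                    hits++;
                }
            }
            // Strict comparison: on a tie, the language listed first wins.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

The sketch splits on whitespace only, so words with attached punctuation do not match. A real annotator would more likely reuse the tokens produced by a tokenizer.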
word list annotator
- Write an annotator that uses a word list to create annotations of a specified type. The word list can be provided either as XML input or in a compiled format.
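
The lookup step could look roughly like the following plain Java sketch. It assumes the word list has already been loaded into a `Set<String>`, so the XML and compiled input formats are not covered. It also assumes case-insensitive matching of whitespace-separated tokens. The names `WordListAnnotator` and `Annotation` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: plain Java, not a UIMA annotator.
final class WordListAnnotator {

    /** A typed span over the document text (end is exclusive). */
    static final class Annotation {
        final String type;
        final int begin;
        final int end;

        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    private static final Pattern TOKEN = Pattern.compile("\\S+");

    private final Set<String> words = new HashSet<>();
    private final String type;

    WordListAnnotator(Set<String> wordList, String type) {
        // Store lower-cased entries so that matching is case-insensitive.
        for (String word : wordList) {
            words.add(word.toLowerCase(Locale.ROOT));
        }
        this.type = type;
    }

    /** Creates one annotation of the configured type per token found in the word list. */
    List<Annotation> annotate(String text) {
        List<Annotation> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (words.contains(m.group().toLowerCase(Locale.ROOT))) {
                result.add(new Annotation(type, m.start(), m.end()));
            }
        }
        return result;
    }
}
```

For example, an annotator built with the words "Berlin" and "Paris" and the type "City" would mark both city names in "From Berlin to paris by train".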

casToXML
- Provide a UIMA CAS consumer that writes the analysed documents to the filesystem in a configurable XML representation. The types to be serialized can be specified in the settings of the CAS consumer.
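
The type filtering described above can be illustrated with a small plain Java sketch that uses the JDK's StAX writer. It is not a CAS consumer: the `Span` type, the element names `document` and `annotation`, and the method `toXml` are assumptions made for this example, and the result is returned as a string rather than written to a file.

```java
import java.io.StringWriter;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch: plain Java, not a UIMA CAS consumer.
final class AnnotationXmlWriter {

    /** A typed span over the document text (end is exclusive). */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Serializes only the spans whose type is listed in typesToWrite. */
    static String toXml(String documentText, List<Span> spans, Set<String> typesToWrite) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("document");
            for (Span span : spans) {
                if (!typesToWrite.contains(span.type)) {
                    continue; // Type was not selected in the settings.
                }
                xml.writeStartElement("annotation");
                xml.writeAttribute("type", span.type);
                xml.writeAttribute("begin", Integer.toString(span.begin));
                xml.writeAttribute("end", Integer.toString(span.end));
                // The writer escapes characters such as '<' and '&'.
                xml.writeCharacters(documentText.substring(span.begin, span.end));
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not serialize annotations", e);
        }
        return out.toString();
    }
}
```

Writing the returned string to the filesystem, and reading the set of types from the consumer's configuration, would be the remaining steps in an actual CAS consumer.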
