...
- document text parser
- provide a parser component that extracts the plain text from a PDF or HTML document, using open source libraries such as PDFBox or NekoHTML.
Annotators
- simple whitespace tokenizer
- write a simple whitespace tokenizer that extracts tokens from a plain-text document written in a whitespace-separated language.
- language detection annotator
- write an annotator that detects the language of a document, for example by using simple language-specific word lists.
- word list annotator
- write an annotator that uses a word list to create annotations of a specified type. The word list can be provided either as XML input or in a compiled format.
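The tokenizer and language detection ideas above could look roughly like the following. This is a minimal sketch in plain Java with no UIMA dependency: the class `WhitespaceAnnotatorSketch`, the `Token` type, and the methods `tokenize` and `detectLanguage` are hypothetical names chosen for illustration, and the word lists are tiny made-up samples. A real annotator would write its results into the CAS instead of returning them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class WhitespaceAnnotatorSketch {

    /** Hypothetical token type (not a UIMA class): character offsets plus covered text. */
    static final class Token {
        final int begin;   // offset of the first character
        final int end;     // offset one past the last character
        final String text;

        Token(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /** Splits the text at whitespace and records the offsets of every token. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int start = -1; // -1 means we are currently not inside a token
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!boundary && start < 0) {
                start = i; // a new token begins here
            } else if (boundary && start >= 0) {
                tokens.add(new Token(start, i, text.substring(start, i)));
                start = -1;
            }
        }
        return tokens;
    }

    /**
     * Returns the language whose word list matches the most tokens,
     * or "unknown" when no list matches any token.
     */
    static String detectLanguage(List<Token> tokens, Map<String, Set<String>> wordLists) {
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : wordLists.entrySet()) {
            int hits = 0;
            for (Token t : tokens) {
                if (entry.getValue().contains(t.text.toLowerCase(Locale.ROOT))) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up sample word lists; real lists would be much larger.
        Map<String, Set<String>> lists = Map.of(
                "en", Set.of("the", "and", "of"),
                "de", Set.of("der", "und", "die"));
        List<Token> tokens = tokenize("der Hund und die Katze");
        for (Token t : tokens) {
            System.out.println(t.begin + "-" + t.end + " " + t.text);
        }
        System.out.println(detectLanguage(tokens, lists));
    }
}
```

Keeping the offsets on each token is a deliberate choice here: annotations in UIMA are expressed as begin and end positions over the document text, so a tokenizer that only returned strings would lose the information an annotator needs.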
Consumer
- casToXML
- provide a UIMA CAS consumer that writes the analysed documents to the filesystem in a configurable XML representation. The types to be serialized can be specified in the settings of the CAS consumer.
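The type filtering described for casToXML could be sketched as follows. This is plain Java without the UIMA API, under stated assumptions: the class `CasToXmlSketch`, the `Annotation` type, the `toXml` method, and the `<document>`/`<annotation>` element names are all hypothetical, and the set of type names stands in for whatever the consumer's settings would provide. A real consumer would read the annotations from the CAS.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;

final class CasToXmlSketch {

    /** Hypothetical stand-in for a CAS annotation: a type name and a text span. */
    static final class Annotation {
        final String type;
        final int begin;
        final int end;

        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Serializes only the annotations whose type is listed in typesToWrite. */
    static String toXml(String text, List<Annotation> annotations, Set<String> typesToWrite) {
        StringBuilder sb = new StringBuilder("<document>\n");
        for (Annotation a : annotations) {
            if (!typesToWrite.contains(a.type)) {
                continue; // this type was not selected in the settings
            }
            sb.append("  <annotation type=\"").append(escape(a.type))
              .append("\" begin=\"").append(a.begin)
              .append("\" end=\"").append(a.end).append("\">")
              .append(escape(text.substring(a.begin, a.end)))
              .append("</annotation>\n");
        }
        return sb.append("</document>\n").toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "AT&T rocks";
        List<Annotation> annotations = List.of(
                new Annotation("Token", 0, 4),
                new Annotation("Token", 5, 10),
                new Annotation("Sentence", 0, 10));
        String xml = toXml(text, annotations, Set.of("Token"));

        // Write to a temporary file so the example leaves the working directory alone.
        Path out = Files.createTempFile("casToXML-", ".xml");
        Files.write(out, xml.getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + out);
    }
}
```

Escaping the covered text matters because document text routinely contains `&` and `<`, which would otherwise produce XML that is not well-formed.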