  • document text parser
    • Provide a parser component that extracts the plain text from a PDF or HTML document using open source libraries such as PDFBox or NekoHTML.

 Annotators

  • simple whitespace tokenizer
    • Write a simple whitespace tokenizer that extracts tokens from a plain-text document written in a whitespace-separated language.
  • language detection annotator
    • Write an annotator that detects the language of a document, for example by using simple language-specific word lists.
  • word list annotator
    • Write an annotator that uses a word list to create annotations of a specified type. The word list can be provided either as XML input or in a compiled format.
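
The three annotator ideas above can be sketched in plain Java. This is only an illustration of the logic, not a UIMA component: it uses no UIMA classes, and the names AnnotatorSketch, Span, tokenize, detectLanguage and annotateWords are hypothetical. It assumes the word lists are already loaded into in-memory sets; reading them from XML or a compiled format is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Plain-Java sketch of the annotator ideas; all names are hypothetical. */
final class AnnotatorSketch {

    /** A typed text span with begin/end character offsets (end is exclusive). */
    static final class Span {
        final String type;
        final int begin;
        final int end;

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Emits one "Token" span per maximal run of non-whitespace characters. */
    static List<Span> tokenize(String text) {
        List<Span> tokens = new ArrayList<>();
        int start = -1; // -1 means "not inside a token"
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!boundary && start < 0) {
                start = i;
            } else if (boundary && start >= 0) {
                tokens.add(new Span("Token", start, i));
                start = -1;
            }
        }
        return tokens;
    }

    /** Returns the language whose word list matches the most tokens, or "unknown" if none match. */
    static String detectLanguage(String text, Map<String, Set<String>> wordLists) {
        List<Span> tokens = tokenize(text);
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : wordLists.entrySet()) {
            int hits = 0;
            for (Span t : tokens) {
                String word = text.substring(t.begin, t.end).toLowerCase(Locale.ROOT);
                if (entry.getValue().contains(word)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** Creates a span of the given type for every token that appears in the word list. */
    static List<Span> annotateWords(String text, Set<String> wordList, String type) {
        List<Span> result = new ArrayList<>();
        for (Span t : tokenize(text)) {
            if (wordList.contains(text.substring(t.begin, t.end))) {
                result.add(new Span(type, t.begin, t.end));
            }
        }
        return result;
    }
}
```

Because tokens are split on whitespace only, punctuation stays attached to the neighbouring word, so "Katze." would not match a list entry "katze". A real annotator would probably need to normalize tokens before the lookup.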

 Consumer

  • casToXML
    • Provide a UIMA CAS consumer that writes the analysed documents to the filesystem in a configurable XML representation. The types to be serialized can be specified in the settings of the CAS consumer.
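
The type filtering described for casToXML might look roughly like the following plain-Java sketch. It is not a UIMA CAS consumer: the class CasToXmlSketch, the Annotation holder, the toXml method and the XML element names are all hypothetical, and the result is returned as a string instead of being written to the filesystem. It only shows how a configured set of type names could decide which annotations are serialized, using the standard javax.xml.stream API.

```java
import java.io.StringWriter;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Plain-Java sketch of type-filtered XML output; all names are hypothetical. */
final class CasToXmlSketch {

    /** Minimal stand-in for an annotation: a type name plus character offsets. */
    static final class Annotation {
        final String type;
        final int begin;
        final int end;

        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Serializes only the annotations whose type is listed in typesToWrite. */
    static String toXml(String text, List<Annotation> annotations, Set<String> typesToWrite) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("document");
            for (Annotation a : annotations) {
                if (!typesToWrite.contains(a.type)) {
                    continue; // type not selected in the settings
                }
                xml.writeStartElement("annotation");
                xml.writeAttribute("type", a.type);
                xml.writeAttribute("begin", Integer.toString(a.begin));
                xml.writeAttribute("end", Integer.toString(a.end));
                // The writer escapes characters such as '&' and '<' in the covered text.
                xml.writeCharacters(text.substring(a.begin, a.end));
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML serialization failed", e);
        }
        return out.toString();
    }
}
```

In an actual consumer the set of type names would come from the component's configuration settings, and the output would go to one file per analysed document.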