On this page we would like to suggest and discuss components and tooling for the UIMA sandbox.

The sandbox was designed to host UIMA analysis components such as annotators, parsers, or consumers, as well as UIMA tooling. The provided components are free to use, and everyone is invited to suggest new components or work on existing ones.

Suggested Analysis Components

 Parser

  • document text parser
    • provide a parser component that extracts the plain text from a PDF or HTML document using open source libraries such as PDFBox or NekoHTML.

 Annotators

  • simple whitespace tokenizer
    • write a simple whitespace tokenizer that extracts tokens from a plain text document for whitespace-separated languages.
  • language detection annotator
    • write an annotator that detects the language of a document using, for example, simple language-specific word lists.
  • word list annotator
    • write an annotator that uses a word list to create annotations of a specified type. The word list can be provided either as XML input or in a compiled format.
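
To make the tokenizer and language detection suggestions more concrete, here is a minimal sketch in plain Java. It deliberately does not use the UIMA API: the class name AnnotatorSketch, the Token class and the "unknown" fallback are assumptions made for illustration, and a real sandbox annotator would create UIMA annotations in the CAS instead of returning a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not part of UIMA or the sandbox.
final class AnnotatorSketch {

    /** A token with character offsets, similar in spirit to an annotation's begin/end. */
    static final class Token {
        final int begin;
        final int end;
        final String text;

        Token(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    /** Splits the text on runs of whitespace and records the offsets of each token. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int start = -1; // -1 means we are currently not inside a token
        for (int i = 0; i <= text.length(); i++) {
            boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!boundary && start < 0) {
                start = i;
            } else if (boundary && start >= 0) {
                tokens.add(new Token(start, i, text.substring(start, i)));
                start = -1;
            }
        }
        return tokens;
    }

    /**
     * Returns the language whose word list matches the most tokens,
     * or "unknown" if no list matches at all. Ties are resolved arbitrarily.
     */
    static String detectLanguage(List<Token> tokens, Map<String, Set<String>> wordLists) {
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : wordLists.entrySet()) {
            int hits = 0;
            for (Token token : tokens) {
                if (entry.getValue().contains(token.text.toLowerCase(Locale.ROOT))) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

The word lists here are plain in-memory sets keyed by a language code. Loading them from XML or a compiled format, as suggested for the word list annotator, is left out of the sketch.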

 Consumer

  • casToXML
    • provide a UIMA CAS consumer that writes the analysed documents in a configurable XML representation to the filesystem. The types that should be serialized can be specified in the settings of the CAS consumer.
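
The type filtering described for casToXML could look roughly like the following sketch. It is written in plain Java with the JDK's StAX writer rather than the UIMA API. The class name CasToXmlSketch, the Annotation class and the element and attribute names ("document", "annotation", "type", "begin", "end") are all assumptions. A real consumer would read its annotations from the CAS and take the list of types from its configuration settings.

```java
import java.io.StringWriter;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch, not part of UIMA or the sandbox.
final class CasToXmlSketch {

    /** Stand-in for an annotation: a type name plus character offsets into the text. */
    static final class Annotation {
        final String type;
        final int begin;
        final int end;

        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Serializes only the annotations whose type is listed in typesToWrite. */
    static String toXml(String text, List<Annotation> annotations, Set<String> typesToWrite)
            throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartDocument();
        xml.writeStartElement("document");
        for (Annotation a : annotations) {
            if (!typesToWrite.contains(a.type)) {
                continue; // type not selected in the (assumed) consumer settings
            }
            xml.writeStartElement("annotation");
            xml.writeAttribute("type", a.type);
            xml.writeAttribute("begin", Integer.toString(a.begin));
            xml.writeAttribute("end", Integer.toString(a.end));
            // The stream writer escapes characters such as & and < in the covered text.
            xml.writeCharacters(text.substring(a.begin, a.end));
            xml.writeEndElement();
        }
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        return out.toString();
    }
}
```

The sketch returns the XML as a string to stay self-contained. Writing one file per analysed document to the filesystem would be a small addition on top of it.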