
...

This project contains several analysis engines (annotators), including:

  • the default sentence detector annotator
  • an alternative sentence detector
  • a tokenizer
  • an annotator that does not update the CAS in any way - useful when an annotator is required but you don't actually want to use one
  • an annotator that creates a single Segment annotation encompassing the entire document text
Info

End-of-line characters are considered end-of-sentence markers. Hyphenated words that appear in the hyphenated words list with frequency values greater than the FreqCutoff will be considered one token. Refer to the Context Dependent Tokenizer.

A sentence detector model is included with this project.

Info

The model for the default sentence detector derives from a combination of GENIA, Penn Treebank (Wall Street Journal) and anonymized clinical data per Safe Harbor HIPAA guidelines. Prior to model building the clinical data was deidentified for patient names to preserve patient confidentiality. Any person name in the model will originate from non-patient data sources.

...

A wrapper around the OpenNLP sentence detector that creates Sentence annotations based on the location of end-of-line characters and on the output of the OpenNLP sentence detector. This annotator considers an end-of-line character as an end-of-sentence marker. Optionally it can skip certain sections of the document. See the section called Running the sentence detector and tokenizer for more details.
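The combination of the two signals described above can be sketched as follows. This is only an illustration in Python, not the cTAKES implementation: `naive_splitter` is a hypothetical stand-in for the trained OpenNLP model, and `detect_sentences` is an assumed name. Each line of the document is treated as a hard boundary, and the splitter is only asked to find further breaks inside a line.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def naive_splitter(text: str) -> List[Span]:
    """Hypothetical stand-in for a trained model: break after '.', '!' or '?'
    when followed by whitespace."""
    spans, start = [], 0
    for match in re.finditer(r"[.!?](?=\s)", text):
        spans.append((start, match.end()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def detect_sentences(doc: str,
                     splitter: Callable[[str], List[Span]] = naive_splitter) -> List[Span]:
    """Return sentence spans, treating every end-of-line as an end-of-sentence."""
    sentences = []
    offset = 0  # offset of the current line within the whole document
    for line in doc.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        for begin, end in splitter(body):
            segment = body[begin:end]
            stripped = segment.strip()
            if stripped:  # skip whitespace-only pieces
                lead = len(segment) - len(segment.lstrip())
                start = offset + begin + lead
                sentences.append((start, start + len(stripped)))
        offset += len(line)
    return sentences


doc = "Pt seen today. No fever\nBP 120/80\n"
for begin, end in detect_sentences(doc):
    print(begin, end, repr(doc[begin:end]))
```

Because the line break is applied first, a sentence can never span two lines in this sketch, which mirrors the behaviour described for this annotator.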

...

Resources
MaxentModelFile - the Maxent model used by the sentence detector.

SentenceDetectorAnnotatorBIO.xml

 

The default sentence detector uses a discriminative classifier on a small set of candidate sentence-splitting characters. However, in clinical data a sentence break could be indicated by something as subtle as a series of spaces. This new model classifies a sequence of characters as the Beginning, Inside, or Outside (BIO) of a sentence, which allows the use of features similar to those of previous systems while allowing arbitrary sentence boundaries. This requires many more classification decisions but avoids major performance penalties by only classifying non-alphanumeric characters.
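To make the BIO scheme concrete, here is a minimal Python sketch of the labeling itself. It does not show the classifier or its features; the function names (`bio_labels`, `spans_from_labels`, `candidate_positions`) are hypothetical, and how the annotator propagates labels to alphanumeric characters is not described here, so that part is left out.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


def bio_labels(text: str, sentence_spans: List[Span]) -> List[str]:
    """Label each character B (sentence start), I (inside) or O (outside)."""
    labels = ["O"] * len(text)
    for begin, end in sentence_spans:
        labels[begin] = "B"
        for i in range(begin + 1, end):
            labels[i] = "I"
    return labels


def spans_from_labels(labels: List[str]) -> List[Span]:
    """Recover sentence spans from a per-character BIO sequence."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


def candidate_positions(text: str) -> List[int]:
    """Positions a classifier would need to decide on if only
    non-alphanumeric characters are classified."""
    return [i for i, ch in enumerate(text) if not ch.isalnum()]


# Two sentences separated only by a run of spaces, with no punctuation.
text = "BP stable   Follow up in 2 wks"
gold = [(0, 9), (12, 30)]
labels = bio_labels(text, gold)
print("".join(labels))
print(spans_from_labels(labels))
print(len(candidate_positions(text)), "of", len(text), "characters classified")
```

The example shows why a fixed set of candidate characters is not enough: the only cue for the boundary is the run of spaces, which the BIO labels mark as Outside.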

Parameters 
All parameters are optional.

SimpleSegmentAnnotator.xml

Creates a single Segment annotation, encompassing the entire document. This annotator is for use prior to annotators that require a Segment annotation, when the pipeline does not contain a different annotator that creates Segment annotations. This annotator is typically used for plain text files, which don't have section (aka segment) tags; but not for CDA documents, as the CdaCasInitializer annotator creates Segment annotations.

...

This is the original cTAKES tokenizer. Hyphenated words that appear in the hyphenated words list (HyphFreqFile) with frequency values greater than the FreqCutoff will be considered one token. See classes edu.mayo.bmi.uima.core.ae.TokenizerAnnotator and edu.mayo.bmi.nlp.tokenizer.Tokenizer for implementation details, and refer to the Context Dependent Tokenizer.

Parameters
SegmentsToSkip - (optional) the list of sections not to create token annotations for.
FreqCutoff - cutoff value for which entries to include from the hyphenated words list (HyphFreqFile)
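The effect of FreqCutoff on hyphenated words can be illustrated with a short Python sketch. This is not the cTAKES tokenizer: the `tokenize` function, the in-memory dictionary standing in for the hyphenated words list, and the example frequencies are all assumptions made for illustration. The only behaviour taken from the description above is that a hyphenated word stays one token when its frequency is greater than the cutoff.

```python
import re
from typing import Dict, List


def tokenize(text: str, hyph_freqs: Dict[str, int], freq_cutoff: int) -> List[str]:
    """Whitespace tokenizer that keeps a hyphenated word whole only if its
    listed frequency is strictly greater than freq_cutoff."""
    tokens = []
    for word in text.split():
        if "-" in word and hyph_freqs.get(word.lower(), 0) > freq_cutoff:
            tokens.append(word)  # frequent enough: one token
        else:
            # Otherwise split on hyphens and keep each hyphen as its own token.
            tokens.extend(part for part in re.split(r"(-)", word) if part)
    return tokens


# Hypothetical frequencies; real values would come from the hyphenated words list.
hyph_freqs = {"follow-up": 120, "x-ray": 3}

print(tokenize("follow-up x-ray today", hyph_freqs, freq_cutoff=10))
```

Raising the cutoff makes the tokenizer split more hyphenated words; lowering it keeps more of them as single tokens.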

...

Tools

Training a sentence detector model

For the default sentence detector, to train a model that recognizes the same set of candidate end-of-sentence characters that the SentenceDetectorAnnotator uses:
java -cp <classpath> edu.mayo.bmi.uima.core.ae.SentenceDetector <sents_file> <model> <iters> <cut>
Where

...

Strictly speaking, it would not be necessary to run the SentenceDetectorAnnotator in order to test the TokenizerAnnotator. The TokenizerAnnotator does not require the presence of Sentence annotations.