...

This project contains several analysis engines (annotators), including:

the default sentence detector annotator
an alternative sentence detector
a tokenizer
an annotator that does not update the CAS in any way - useful when an annotator is required but you don't actually want to use one
an annotator that creates a single Segment annotation encompassing the entire document text

Info
End-of-line characters are treated as end-of-sentence markers. Hyphenated words that appear in the hyphenated words list with frequency values greater than the FreqCutoff will be considered one token. Refer to the Context Dependent Tokenizer.

A sentence detector model is included with this project.

Info
The model for the default sentence detector derives from a combination of GENIA, Penn Treebank (Wall Street Journal) and anonymized clinical data per Safe Harbor HIPAA guidelines. Prior to model building the clinical data was deidentified for patient names to preserve patient confidentiality. Any person name in the model will originate from non-patient data sources.

...

A wrapper around the OpenNLP sentence detector that creates Sentence annotations based on the location of end-of-line characters and on the output of the OpenNLP sentence detector. This annotator considers an end-of-line character as an end-of-sentence marker. Optionally it can skip certain sections of the document. See the section called Running the sentence detector and tokenizer for more details.
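The combination of end-of-line breaks and detector output described above can be sketched as follows. This is a minimal illustration in plain Java, not the cTAKES implementation: the class EolSentenceSketch and the method naiveCuts are hypothetical names, and naiveCuts is only a stand-in for the OpenNLP sentence detector. Spans are reported as [begin, end) character offsets into the document text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: every line break ends a sentence, and a pluggable
// detector may add further boundaries inside each line.
class EolSentenceSketch {

    // inner receives one line and returns offsets (relative to that line)
    // at which a sentence ends. It stands in for a statistical detector.
    static List<int[]> sentences(String text, Function<String, List<Integer>> inner) {
        List<int[]> spans = new ArrayList<>();
        int lineStart = 0;
        for (int i = 0; i <= text.length(); i++) {
            boolean atEnd = i == text.length();
            if (atEnd || text.charAt(i) == '\n' || text.charAt(i) == '\r') {
                addLine(text, lineStart, i, inner, spans);
                lineStart = i + 1;
            }
        }
        return spans;
    }

    private static void addLine(String text, int begin, int end,
                                Function<String, List<Integer>> inner, List<int[]> spans) {
        String line = text.substring(begin, end);
        List<Integer> cuts = new ArrayList<>(inner.apply(line));
        cuts.add(line.length()); // the end of the line always closes a sentence
        int start = 0;
        for (int cut : cuts) {
            addTrimmed(text, begin + start, begin + cut, spans);
            start = cut;
        }
    }

    // Drops surrounding whitespace and skips empty spans.
    private static void addTrimmed(String text, int b, int e, List<int[]> spans) {
        while (b < e && Character.isWhitespace(text.charAt(b))) b++;
        while (e > b && Character.isWhitespace(text.charAt(e - 1))) e--;
        if (b < e) spans.add(new int[] {b, e});
    }

    // Naive stand-in detector: cut after '.', '?' or '!' when a space follows.
    static List<Integer> naiveCuts(String line) {
        List<Integer> cuts = new ArrayList<>();
        for (int i = 0; i + 1 < line.length(); i++) {
            char c = line.charAt(i);
            if ((c == '.' || c == '?' || c == '!') && line.charAt(i + 1) == ' ') {
                cuts.add(i + 1);
            }
        }
        return cuts;
    }
}
```

Calling EolSentenceSketch.sentences(text, EolSentenceSketch::naiveCuts) on a two-line note yields at least one span per non-empty line, even when a line has no terminal punctuation.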

...

Resources
MaxentModelFile - the Maxent model used by the sentence detector.

SentenceDetectorAnnotatorBIO.xml

The default sentence detector uses a discriminative classifier on a small set of candidate sentence-splitting characters. However, in clinical data a sentence break could be indicated by something as subtle as a series of spaces. This new model classifies a sequence of characters as the Beginning, Inside, or Outside (BIO) of a sentence, which allows the use of features similar to those of previous systems while allowing arbitrary sentence boundaries. This requires many more classification decisions but avoids major performance penalties by only classifying non-alphanumeric characters.
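To make the BIO idea concrete, the sketch below shows how per-character B/I/O labels can be turned back into sentence spans, and which character positions would be classification candidates under the "non-alphanumeric only" rule. It is an illustration in plain Java under stated assumptions, not the annotator's code: the class BioDecodeSketch and its methods are hypothetical, the labels are supplied by hand rather than by a trained model, and a stray I label with no open sentence is simply ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of decoding character-level BIO labels.
class BioDecodeSketch {

    // labels has one character per text character: 'B', 'I' or 'O'.
    // Returns [begin, end) offsets of each sentence.
    static List<int[]> decode(String labels) {
        List<int[]> spans = new ArrayList<>();
        int begin = -1; // -1 means no sentence is currently open
        for (int i = 0; i < labels.length(); i++) {
            char tag = labels.charAt(i);
            if (tag == 'I' && begin >= 0) {
                continue; // still inside the open sentence
            }
            if (begin >= 0) {
                spans.add(new int[] {begin, i}); // 'B' or 'O' closes it
                begin = -1;
            }
            if (tag == 'B') {
                begin = i;
            }
        }
        if (begin >= 0) {
            spans.add(new int[] {begin, labels.length()});
        }
        return spans;
    }

    // Positions a classifier would need to examine if only
    // non-alphanumeric characters are classified.
    static List<Integer> candidates(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                positions.add(i);
            }
        }
        return positions;
    }
}
```

For the text "BP ok   HR 80" with the hand-written labels "BIIIIOOOBIIII", decode returns two spans separated by the run of spaces, a boundary that a detector limited to punctuation candidates would not propose.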

Parameters
All parameters are optional.

SimpleSegmentAnnotator.xml

Creates a single Segment annotation encompassing the entire document. This annotator is for use prior to annotators that require a Segment annotation, when the pipeline does not contain a different annotator that creates Segment annotations. It is typically used for plain text files, which don't have section (aka segment) tags, but not for CDA documents, as the CdaCasInitializer annotator creates Segment annotations.
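In terms of offsets, a segment that encompasses the entire document simply runs from 0 to the length of the document text. The following is a minimal plain-Java sketch of that idea; the Segment class here is a hypothetical stand-in with only begin and end offsets, not the cTAKES type system class.

```java
// Hypothetical sketch: one segment covering the whole document text.
class SimpleSegmentSketch {

    // Stand-in for a segment annotation: [begin, end) character offsets.
    static final class Segment {
        final int begin;
        final int end;

        Segment(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    static Segment wholeDocument(String documentText) {
        return new Segment(0, documentText.length());
    }
}
```

Downstream steps that iterate over segments then see exactly one segment whose covered text is the full document.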

...

This is the original cTAKES tokenizer. Hyphenated words that appear in the hyphenated words list (HyphFreqFile) with frequency values greater than the FreqCutoff will be considered one token. See classes edu.mayo.bmi.uima.core.ae.TokenizerAnnotator and edu.mayo.bmi.nlp.tokenizer.Tokenizer for implementation details, and refer to the Context Dependent Tokenizer.

Parameters
SegmentsToSkip - (optional) the list of sections not to create token annotations for.
FreqCutoff - cutoff value for the hyphenated words list (HyphFreqFile); only entries with a frequency greater than this value are treated as a single token.
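The effect of FreqCutoff on hyphenated words can be sketched as below. This is an illustration only, written in plain Java with hypothetical names (HyphenCutoffSketch, tokenize), and it makes several assumptions that the text above does not confirm: the frequency list is modeled as an in-memory map, lookups are lowercased, input is split on whitespace first, and a hyphenated word that does not pass the cutoff is split into its parts with each hyphen as a separate token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a frequency cutoff for hyphenated words.
class HyphenCutoffSketch {

    static List<String> tokenize(String text, Map<String, Integer> hyphFreq, int freqCutoff) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for blank input
            }
            Integer freq = hyphFreq.get(word.toLowerCase(Locale.ROOT));
            boolean keepWhole = !word.contains("-") || (freq != null && freq > freqCutoff);
            if (keepWhole) {
                tokens.add(word);
            } else {
                splitOnHyphens(word, tokens);
            }
        }
        return tokens;
    }

    // Assumption: parts and hyphens each become their own token.
    private static void splitOnHyphens(String word, List<String> tokens) {
        int start = 0;
        for (int i = 0; i < word.length(); i++) {
            if (word.charAt(i) == '-') {
                if (i > start) {
                    tokens.add(word.substring(start, i));
                }
                tokens.add("-");
                start = i + 1;
            }
        }
        if (start < word.length()) {
            tokens.add(word.substring(start));
        }
    }
}
```

With a made-up list in which "follow-up" has frequency 50 and "x-ray" has frequency 3, a cutoff of 5 keeps "follow-up" as one token and splits "x-ray".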

...

Tools

Training a sentence detector model

For the default sentence detector, to train a model that recognizes the same set of candidate end-of-sentence characters that the SentenceDetectorAnnotator uses:
java -cp <classpath> edu.mayo.bmi.uima.core.ae.SentenceDetector <sents_file> <model> <iters> <cut>
Where

...

Strictly speaking, it would not be necessary to run the SentenceDetectorAnnotator in order to test the TokenizerAnnotator. The TokenizerAnnotator does not require the presence of Sentence annotations.
