
Overview of Core

This project contains several analysis engines (annotators), including:

...

Info

The model is derived from a combination of GENIA, Penn Treebank (Wall Street Journal), and clinical data anonymized per the HIPAA Safe Harbor guidelines. Before model building, patient names were removed from the clinical data to preserve patient confidentiality. Any person name in the model therefore originates from non-patient data sources.

Analysis engines (annotators)

AggregateAE.xml

This descriptor is included for testing and is typically not used in a more complete pipeline. Such a pipeline normally includes one or more of the individual analysis engines instead.

CopyAnnotator.xml

This is a utility annotator that copies data from an existing JCas object into a new JCas object.

NullAnnotator.xml

As its name implies, this annotator does nothing. It can be useful if you are using the UIMA CPE GUI and you are required to choose an analysis engine but you don't actually want to use one.

OverlapAnnotator.xml

  • When two annotations overlap, this annotator either modifies one of them (its begin and end offsets) or deletes one or both of them. The action taken depends on the configuration parameters. It can extend an annotation to encompass the annotations that overlap it. It can also be configured to delete annotations of type A that are subsumed by other annotations of type A, so that only the longest annotations of that type are kept.
  • Refer to the Javadoc for edu.mayo.bmi.uima.core.ae.OverlapAnnotator for more details.
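
The "keep only the longest annotations" behavior can be illustrated with a minimal sketch. It is not the OverlapAnnotator implementation and does not use the UIMA API: the class LongestSpanFilter, its nested Span type, and the method keepLongest are hypothetical names introduced here, and a span is assumed to be subsumed when another span of the same type covers its full offset range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LongestSpanFilter {
    /** Hypothetical stand-in for an annotation of one type: begin/end character offsets. */
    public static final class Span {
        public final int begin;
        public final int end;

        public Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        /** True if this span covers the other span and is not identical to it. */
        boolean subsumes(Span other) {
            boolean covers = begin <= other.begin && other.end <= end;
            boolean identical = begin == other.begin && end == other.end;
            return covers && !identical;
        }
    }

    /** Returns, in input order, the spans that no other span in the list subsumes. */
    public static List<Span> keepLongest(List<Span> spans) {
        List<Span> kept = new ArrayList<>();
        for (Span candidate : spans) {
            boolean subsumed = false;
            for (Span other : spans) {
                if (other.subsumes(candidate)) {
                    subsumed = true;
                    break;
                }
            }
            if (!subsumed) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // (2,5) lies inside (0,10) and is dropped; (8,14) only overlaps (0,10) and is kept.
        List<Span> spans = Arrays.asList(new Span(0, 10), new Span(2, 5), new Span(8, 14));
        for (Span s : keepLongest(spans)) {
            System.out.println(s.begin + "-" + s.end);
        }
    }
}
```

In this sketch, spans that merely overlap are left untouched and identical spans are both kept. How the real annotator treats those cases depends on its configuration parameters, so consult the Javadoc for the actual rules.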

SentenceDetectorAnnotator.xml

A wrapper around the OpenNLP sentence detector that creates Sentence annotations based on the location of end-of-line characters and on the output of the OpenNLP sentence detector. This annotator considers an end-of-line character as an end-of-sentence marker. Optionally it can skip certain sections of the document. See the section called Running the sentence detector and tokenizer for more details.

...

Resources
MaxentModelFile - the Maxent model used by the sentence detector.
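
The effect of treating end-of-line characters as sentence boundaries can be sketched as follows. This is an illustration only: the class LineBoundSentenceSketch and its methods are hypothetical, and naiveDetect is a deliberately crude stand-in (split after '.', '?' or '!') for the trained OpenNLP model, which is not used here.

```java
import java.util.ArrayList;
import java.util.List;

public class LineBoundSentenceSketch {
    /** Crude stand-in for the statistical detector: ends a sentence after '.', '?' or '!'. */
    public static List<String> naiveDetect(String line) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '.' || c == '?' || c == '!') {
                String sentence = line.substring(start, i + 1).trim();
                if (!sentence.isEmpty()) {
                    out.add(sentence);
                }
                start = i + 1;
            }
        }
        // Text without closing punctuation still becomes a sentence.
        String tail = line.substring(start).trim();
        if (!tail.isEmpty()) {
            out.add(tail);
        }
        return out;
    }

    /** Every line break is a hard boundary; the detector only runs inside a line. */
    public static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        for (String line : text.split("\\R")) {
            out.addAll(naiveDetect(line));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Chief complaint\nThe boy ran. Did the girl run too?";
        for (String s : sentences(text)) {
            System.out.println(s);
        }
    }
}
```

In this sketch a heading such as "Chief complaint" becomes its own sentence even though it has no closing punctuation, because the line break ends it.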

SimpleSegmentAnnotator.xml

Creates a single Segment annotation that encompasses the entire document. Use it before annotators that require a Segment annotation when the pipeline contains no other annotator that creates Segment annotations. This annotator is used for plain text files, which don't have section (aka segment) tags. It is not used for CDA documents, because the CdaCasInitializer annotator creates the Segment annotations for them.

Parameters
SegmentID - (optional) the identifier to use for the Segment annotation created.

TokenizerAnnotator.xml

Tokenizes text according to Penn Treebank tokenization rules. This is the default tokenizer for cTAKES as of cTAKES 2.0.

Parameters
SegmentsToSkip - (optional) the list of sections for which no token annotations are created.

TokenizerAnnotatorVersion1.xml

This is the original cTAKES tokenizer. Hyphenated words that appear in the hyphenated words list (HyphFreqFile) with frequency values greater than the FreqCutoff will be considered one token. See classes edu.mayo.bmi.uima.core.ae.TokenizerAnnotator and edu.mayo.bmi.nlp.tokenizer.Tokenizer for implementation details.

...

Resources
HyphFreqFile - a file containing a list of hyphenated words and their frequency within some corpus.
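
The frequency cutoff rule can be sketched in a few lines. This is not the code in edu.mayo.bmi.nlp.tokenizer.Tokenizer: the class HyphenCutoffSketch is hypothetical, the frequency values are invented sample data, the lookup is assumed to be an exact string match, and splitting a rejected word into its parts plus separate hyphen tokens is an assumption about what happens when the word is not kept whole.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HyphenCutoffSketch {
    /**
     * Keeps a hyphenated word as one token when its listed frequency is
     * greater than freqCutoff; otherwise splits it at the hyphens (assumed behavior).
     */
    public static List<String> tokenize(String word, Map<String, Integer> hyphFreq, int freqCutoff) {
        if (word.indexOf('-') < 0) {
            return Collections.singletonList(word);
        }
        int freq = hyphFreq.getOrDefault(word, 0); // words missing from the list count as 0
        if (freq > freqCutoff) {
            return Collections.singletonList(word);
        }
        List<String> tokens = new ArrayList<>();
        StringBuilder part = new StringBuilder();
        for (char c : word.toCharArray()) {
            if (c == '-') {
                if (part.length() > 0) {
                    tokens.add(part.toString());
                    part.setLength(0);
                }
                tokens.add("-");
            } else {
                part.append(c);
            }
        }
        if (part.length() > 0) {
            tokens.add(part.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Invented sample frequencies, standing in for the contents of a HyphFreqFile.
        Map<String, Integer> hyphFreq = new HashMap<>();
        hyphFreq.put("x-ray", 120);
        hyphFreq.put("post-op", 3);
        System.out.println(tokenize("x-ray", hyphFreq, 10));
        System.out.println(tokenize("post-op", hyphFreq, 10));
    }
}
```

With a cutoff of 10, "x-ray" (frequency 120) stays a single token, while "post-op" (frequency 3) and any word absent from the list are split.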

Tools

Training a sentence detector model

To train a sentence detector that recognizes the same set of candidate end-of-sentence characters that the SentenceDetectorAnnotator uses:
java -cp <classpath> edu.mayo.bmi.uima.core.ae.SentenceDetector <sents_file> <model> <iters> <cut>
Where

...

Example 4.1. Sentence detector training data file sample
One sentence per line.
The boy ran.
Did the girl run too?
Yes, she did.
Where did she go?

Verify you can train a sentence detector model successfully

The sample model resources/sentdetect/sample_sd_included.mod was trained from data/test/sample_sd_training_sentences.txt, using the default values for "iters" and "cut" (that is, without specifying them on the command line). To verify your setup, train a model from the same data with the same defaults and compare it against the sample model, using a file comparison tool of your choice.

Using OpenNLP directly to train a sentence detector model

You can train a sentence detector directly using the OpenNLP sentence detector (SentenceDetectorME) with the default set of candidate end-of-sentence characters, using:

...

"infile" uses the same format as in Example 4.1, "Sentence detector training data file sample".

Running the sentence detector and tokenizer

This project provides a sentence detector CPE descriptor and a tokenizer CPE descriptor. To run the CPE:

...

TIP Eclipse users may use the "SentenceDetector_annotator" and the "Tokenizer annotator" launches.

How do the CPEs work?

Because the sentence annotator processes the text one section at a time, there must be at least one section (segment) annotation before the SentenceDetectorAnnotator can add Sentence annotations. The first analysis engine is therefore the SimpleSegmentAnnotator, which creates a single Segment annotation that covers the entire text. Next, the SentenceDetectorAnnotator analysis engine adds Sentence annotations. Finally, if you're running the tokenizer, the TokenizerAnnotator analysis engine adds annotations for tokens, such as PunctuationToken, WordToken, and NewlineToken.
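
The ordering constraint described above can be sketched with a toy pipeline. Nothing here uses UIMA or the cTAKES classes: CpeOrderSketch, Doc, and the three stage methods are hypothetical names, sentences are split only at line breaks, and tokens are split only at whitespace, which is far simpler than the Penn Treebank rules the real tokenizer follows.

```java
import java.util.ArrayList;
import java.util.List;

public class CpeOrderSketch {
    /** Toy document state; each list holds {begin, end} offsets and stands in for CAS annotations. */
    public static class Doc {
        public final String text;
        public final List<int[]> segments = new ArrayList<>();
        public final List<int[]> sentences = new ArrayList<>();
        public final List<int[]> tokens = new ArrayList<>();

        public Doc(String text) {
            this.text = text;
        }
    }

    /** Stage 1: one segment covering the whole text. */
    public static void addSingleSegment(Doc d) {
        d.segments.add(new int[] {0, d.text.length()});
    }

    /** Stage 2: works segment by segment, so it adds nothing when no segment exists. */
    public static void addSentences(Doc d) {
        for (int[] seg : d.segments) {
            int start = seg[0];
            for (int i = seg[0]; i < seg[1]; i++) {
                if (d.text.charAt(i) == '\n') {
                    if (i > start) {
                        d.sentences.add(new int[] {start, i});
                    }
                    start = i + 1;
                }
            }
            if (seg[1] > start) {
                d.sentences.add(new int[] {start, seg[1]});
            }
        }
    }

    /** Stage 3: whitespace-delimited tokens inside each sentence. */
    public static void addTokens(Doc d) {
        for (int[] s : d.sentences) {
            int start = -1;
            for (int i = s[0]; i <= s[1]; i++) {
                boolean boundary = i == s[1] || Character.isWhitespace(d.text.charAt(i));
                if (!boundary && start < 0) {
                    start = i;
                }
                if (boundary && start >= 0) {
                    d.tokens.add(new int[] {start, i});
                    start = -1;
                }
            }
        }
    }

    public static void main(String[] args) {
        Doc d = new Doc("The boy ran.\nYes, she did.");
        addSingleSegment(d);
        addSentences(d);
        addTokens(d);
        System.out.println(d.sentences.size() + " sentences, " + d.tokens.size() + " tokens");
    }
}
```

Calling addSentences before addSingleSegment leaves the sentence list empty in this sketch, which mirrors why the segment annotator has to run first.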

...