...

This project contains several analysis engines (annotators), including:

  • a sentence detector annotator
  • a tokenizer
  • an annotator that does not update the CAS in any way
  • an annotator that creates a single Segment annotation encompassing the entire document text
Info

End-of-line characters are considered end-of-sentence markers. Hyphenated words that appear in the hyphenated words list with frequency values greater than the FreqCutoff will be considered one token. Refer to the Context Dependent Tokenizer information on SourceForge.net.

A sentence detector model is included with this project.

...

A wrapper around the OpenNLP sentence detector that creates Sentence annotations based on the location of end-of-line characters and on the output of the OpenNLP sentence detector. This annotator considers an end-of-line character as an end-of-sentence marker. Optionally it can skip certain sections of the document. See the section called Running the sentence detector and tokenizer for more details.

Parameters
SegmentsToSkip
- (optional) the list of sections not to create Sentence annotations for.

Resources
MaxentModelFile
- the Maxent sentence detector model.
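
To illustrate the end-of-line rule described above, here is a minimal, self-contained sketch (not the actual annotator code; the real annotator also consults the OpenNLP Maxent model rather than blindly splitting on '.', '!', and '?'):

```java
import java.util.ArrayList;
import java.util.List;

public class EolSentenceSplitter {
    /** Splits text into sentences, treating end-of-line characters as
     *  hard sentence boundaries. Simplified sketch: '.', '!' and '?' are
     *  treated as unconditional boundaries here, whereas the real
     *  annotator passes such candidates to the Maxent model. */
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n' || c == '\r' || c == '.' || c == '!' || c == '?') {
                // Keep sentence-final punctuation, drop the newline itself.
                boolean isEol = (c == '\n' || c == '\r');
                String s = text.substring(start, i + (isEol ? 0 : 1)).trim();
                if (!s.isEmpty()) sentences.add(s);
                start = i + 1;
            }
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) sentences.add(tail);
        return sentences;
    }

    public static void main(String[] args) {
        // A newline ends the first sentence even without punctuation.
        System.out.println(split("BP stable\nPatient denies pain. Follow up in 2 weeks"));
    }
}
```

Note how "BP stable" becomes its own sentence solely because of the newline, which is the behavior the annotator adds on top of plain OpenNLP.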

...

Creates a single Segment annotation encompassing the entire document. Use it before annotators that require a Segment annotation when the pipeline does not contain another annotator that creates Segment annotations. This annotator is used for plain text files, which do not have section (aka segment) tags, but not for CDA documents, since the CdaCasInitializer annotator creates Segment annotations for those.

Parameters
SegmentID
- (optional) the identifier to use for the Segment annotation created.
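
Conceptually, the annotator's work reduces to producing one annotation whose span is the whole document. A minimal sketch (the Segment class and the "plaintext" identifier below are hypothetical stand-ins; the real type lives in the cTAKES type system):

```java
public class SimpleSegmentSketch {
    /** Hypothetical stand-in for a UIMA Segment annotation. */
    static class Segment {
        final String id;
        final int begin;
        final int end;
        Segment(String id, int begin, int end) {
            this.id = id;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Mirrors what SimpleSegmentAnnotator does for plain text:
     *  one Segment whose span covers the entire document text. */
    static Segment coverDocument(String docText, String segmentId) {
        return new Segment(segmentId, 0, docText.length());
    }

    public static void main(String[] args) {
        Segment seg = coverDocument("Chief complaint: cough.", "plaintext");
        System.out.println(seg.id + " [" + seg.begin + ", " + seg.end + ")");
    }
}
```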

...

Tokenizes text according to Penn Treebank tokenization rules. This is the default tokenizer for cTAKES as of cTAKES 2.0.

Parameters
SegmentsToSkip
- (optional) the list of sections not to create token annotations for.
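
To give a feel for Penn Treebank tokenization, the sketch below splits off the "n't" and possessive "'s" contractions and separates common punctuation. It is a rough illustration only; the real cTAKES PTB tokenizer implements many more rules:

```java
import java.util.Arrays;
import java.util.List;

public class PtbTokenizerSketch {
    /** Rough sketch of Penn Treebank-style tokenization: splits off the
     *  "n't" and "'s" contractions and common punctuation marks. */
    static List<String> tokenize(String text) {
        String spaced = text
                .replaceAll("(?i)n't\\b", " n't")    // doesn't -> does n't
                .replaceAll("(?i)'s\\b", " 's")      // patient's -> patient 's
                .replaceAll("([,.!?;:()])", " $1 "); // separate punctuation
        return Arrays.asList(spaced.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The patient doesn't smoke."));
        // -> [The, patient, does, n't, smoke, .]
    }
}
```

The characteristic PTB behavior is that "doesn't" becomes the two tokens "does" and "n't", and sentence-final punctuation becomes its own token.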

...

This is the original cTAKES tokenizer. Hyphenated words that appear in the hyphenated words list (HyphFreqFile) with frequency values greater than the FreqCutoff will be considered one token. See classes edu.mayo.bmi.uima.core.ae.TokenizerAnnotator and edu.mayo.bmi.nlp.tokenizer.Tokenizer for implementation details.

Parameters
SegmentsToSkip
- (optional) the list of sections not to create token annotations for.
FreqCutoff
- cutoff value determining which entries to include from the hyphenated words list (HyphFreqFile)

Resources
HyphFreqFile
- a file containing a list of hyphenated words and their frequency within some corpus.
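
The hyphenation rule stated above can be sketched as follows. The frequency entries and the case-insensitive lookup are illustrative assumptions; the real list is loaded from the HyphFreqFile resource:

```java
import java.util.Map;

public class HyphenRuleSketch {
    /** Returns true if a hyphenated word should be kept as one token,
     *  i.e. it appears in the hyphenated-words list with a frequency
     *  greater than freqCutoff. Lowercasing the key is an assumption
     *  of this sketch, not a documented detail of the annotator. */
    static boolean keepAsOneToken(String word, Map<String, Integer> hyphFreq, int freqCutoff) {
        Integer freq = hyphFreq.get(word.toLowerCase());
        return freq != null && freq > freqCutoff;
    }

    public static void main(String[] args) {
        // Hypothetical frequency list; real entries come from HyphFreqFile.
        Map<String, Integer> freqs = Map.of("x-ray", 150, "follow-up", 80);
        System.out.println(keepAsOneToken("x-ray", freqs, 100));     // one token
        System.out.println(keepAsOneToken("follow-up", freqs, 100)); // split
    }
}
```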

...

To train a sentence detector that recognizes the same set of candidate end-of-sentence characters that the SentenceDetectorAnnotator uses:
java -cp <classpath> edu.mayo.bmi.uima.core.ae.SentenceDetector <sents_file> <model> <iters> <cut>
Where

<sents_file> is your sentences training data file, one sentence per line; see an example in Example 4.1, "Sentence detector training data file sample".
<model> is the name of the model file to be created.
<iters> (optional) is the number of iterations for training.
<cut> (optional) is the cutoff value.

Tip

Eclipse users may run the "SentenceDetector -- train a new model" launch.

...

You can train a sentence detector directly using the OpenNLP sentence detector (SentenceDetectorME) with the default set of candidate end-of-sentence characters, using:

java -cp <classpath> opennlp.tools.sentdetect.SentenceDetectorME <sents_file> <model> <iters> <cut>
Where

<sents_file> is your sentences training data file, one sentence per line; see an example in Example 4.1, "Sentence detector training data file sample".
<model> is the name of the model file to be created.
<iters> (optional) is the number of iterations for training.
<cut> (optional) is the cutoff value.

The four parameters have the same meanings as for the tool we provided; "infile" uses the same format as in Example 4.1, "Sentence detector training data file sample".

...

We provide a sentence detector CPE descriptor and a tokenizer CPE descriptor in this project. To run a CPE:

java -cp <classpath> org.apache.uima.tools.cpm.CpmFrame
Open

desc/collection_processing_engine/SentenceDetecorCPE.xml to run a sentence detector; or
desc/collection_processing_engine/SentencesAndTokensCPE.xml to run a tokenizer

...