...

Overview of Core

This project contains several analysis engines (annotators), including:

...

End-of-line characters are considered end-of-sentence markers. Hyphenated words that appear in the hyphenated words list with frequency values greater than the FreqCutoff will be considered one token. Refer to the Context Dependent Tokenizer.
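The two rules above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names, the `hyphen_freq` dictionary, and the `freq_cutoff` argument are hypothetical stand-ins for the hyphenated words list and the FreqCutoff setting, and only the end-of-line rule of sentence detection is shown.

```python
def split_on_end_of_line(text):
    """Treat every end-of-line character as an end-of-sentence marker.

    Only the end-of-line rule is sketched here; a real sentence detector
    also splits inside a line.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def is_single_token(word, hyphen_freq, freq_cutoff):
    """Return True if a hyphenated word should be kept as one token.

    hyphen_freq (hypothetical): maps hyphenated words to frequency values.
    freq_cutoff (hypothetical): stands in for the FreqCutoff setting.
    The frequency must be strictly greater than the cutoff.
    """
    return hyphen_freq.get(word, 0) > freq_cutoff


if __name__ == "__main__":
    freqs = {"x-ray": 12, "follow-up": 3}
    print(is_single_token("x-ray", freqs, 5))
    print(is_single_token("follow-up", freqs, 5))
    print(split_on_end_of_line("Chest pain\nNo fever\n"))
```

A word that is missing from the list is given a frequency of 0 in this sketch, so it is not kept as one token for any non-negative cutoff; that default is an assumption, not documented behavior.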

...

The model derives from a combination of GENIA, Penn Treebank (Wall Street Journal), and clinical data anonymized per the HIPAA Safe Harbor guidelines. Before the model was built, patient names were removed from the clinical data to preserve patient confidentiality. Any person name in the model therefore originates from non-patient data sources.

...

<sents_file> is your sentence training data file, with one sentence per line. See Example 4.1, "Sentence detector training data file sample".
<model> is the name of the model file to be created.
<iters> (optional) is the number of iterations for training.
<cut> (optional) is the cutoff value.

...

Eclipse users may run the "SentenceDetector--train_a_new_model" launch.

...

Example 4.1. Sentence detector training data file sample
One sentence per line.
The boy ran.
Did the girl run too?
Yes, she did.
Where did she go?
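A training file in this format could be produced with a small helper like the one below. It is a sketch under stated assumptions: `write_training_file` is a hypothetical name, UTF-8 is an assumed encoding, and skipping blank entries is a choice made here rather than a documented requirement. The one constraint taken from the text above is that each sentence occupies exactly one line.

```python
from pathlib import Path


def write_training_file(sentences, path):
    """Write sentences to `path`, one sentence per line.

    Hypothetical helper: strips surrounding whitespace, skips empty
    entries, and rejects any sentence that contains a line break,
    since a line break would split it into two training sentences.
    Returns the number of sentences written.
    """
    cleaned = []
    for sentence in sentences:
        text = sentence.strip()
        if not text:
            continue  # assumption: blank entries are dropped
        if "\n" in text or "\r" in text:
            raise ValueError(f"sentence spans several lines: {text!r}")
        cleaned.append(text)
    # UTF-8 is an assumption; use whatever encoding the trainer expects.
    Path(path).write_text("".join(s + "\n" for s in cleaned), encoding="utf-8")
    return len(cleaned)


if __name__ == "__main__":
    count = write_training_file(
        ["The boy ran.", "Did the girl run too?", "Yes, she did."],
        "sents_file.txt",
    )
    print(count)
```

Rejecting embedded line breaks up front keeps a malformed sentence from silently becoming two training examples.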

...
