
...

line 2: <?xml-stylesheet type="text/css" href="gpml.css" ?>
line 3: <!DOCTYPE set SYSTEM "gpml.merged.dtd">
line 5: <import resource="GENIAontology.daml" prefix="G"></import>
java -cp <classpath> data.pos.training.GeniaPosTrainingDataExtractor GENIAcorpus4.02.pos.xml <genia-pos-training-data>

...

Creating a model

java -cp <classpath> opennlp.tools.postag.POSTaggerME <training-data> <model-name> <iterations> <cutoff>
Where

<training-data> is an OpenNLP training data file.
<model-name> is the file name of the resulting model. The name should end with either .txt (for a plain text model) or .bin.gz (for a compressed binary model).
<iterations> determines how many training iterations will be performed. The default is 100.
<cutoff> determines the minimum number of times a feature has to be seen to be considered for inclusion in the model. The default cutoff is 5.

...

We have provided a mechanism for creating a tag dictionary. It can be run with the following command:
java -cp <classpath> org.apache.ctakes.postagger.TagDictionaryCreator <training-data> <tag-dictionary> <case-sensitive>
Where

<training-data> is a file containing part-of-speech tagged training data.
<tag-dictionary> is the file name of the resulting tag dictionary.
<case-sensitive> is either 'true' or 'false' depending on whether the tag dictionary should be case sensitive or not.

...

OpenNLP provides a default tag dictionary for the English part-of-speech model called tag.bin.gz, which can be downloaded from http://opennlp.sourceforge.net/models/english/parser/tagdict. You should use this tag dictionary only if you are using the model from http://opennlp.sourceforge.net/models/english/parser/tag.bin.gz.

Tip

The tag dictionary does not lowercase the entries it reads in from the file. When "CaseSensitive" is set to false, it lowercases only the words that are compared against the dictionary. As a result, in case-insensitive mode any entry in the tag dictionary that is not already all lowercase is ignored. Therefore, if you want the tag dictionary to be used in a case-insensitive way, be sure to build the tag dictionary using 'false' as the third argument.
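The behavior described in this tip can be illustrated with a toy model. This is a simplified sketch, not OpenNLP's implementation: the class `TagDictionary`, its `get_tags` method, and the sample entries are hypothetical names and data chosen for illustration only.

```python
class TagDictionary:
    """Toy model of the lookup behavior described in the tip (not OpenNLP code)."""

    def __init__(self, entries, case_sensitive):
        # Entries are stored exactly as read from the file; keys are NOT lowercased.
        self.entries = dict(entries)
        self.case_sensitive = case_sensitive

    def get_tags(self, word):
        # Only the word being looked up is lowercased in case-insensitive mode.
        key = word if self.case_sensitive else word.lower()
        return self.entries.get(key)


# Hypothetical dictionary built in a case-sensitive way (mixed-case key kept).
mixed_case = TagDictionary({"Protein": ["NN"], "binds": ["VBZ"]}, case_sensitive=False)

protein_tags = mixed_case.get_tags("Protein")  # "protein" is not a key, so no match
binds_tags = mixed_case.get_tags("Binds")      # "binds" is a key, so it matches
```

In this sketch the entry "Protein" can never be found in case-insensitive mode, which is why the dictionary should be built with all-lowercase entries in the first place.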

...

If this is the gold standard sentence:
The_DT major_JJ inducible_JJ protein_NN complex_NN that_WDT binds_VBZ ._.

And if this is the output for that sentence:
The_DT major_JJ inducible_NN protein_NN complex_NN that_WDT binds_VBD ._.
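As a rough illustration, the two sentences above can be compared token by token. The sketch below assumes that accuracy means the fraction of gold standard tokens whose tag matches the tagger output, and that both sentences contain the same tokens in the same order. The function names `parse_tagged` and `accuracy` are illustrative and not part of cTAKES or OpenNLP.

```python
def parse_tagged(sentence):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs."""
    # rsplit on the last underscore so that the token "._." yields (".", ".").
    return [tuple(token.rsplit("_", 1)) for token in sentence.split()]


def accuracy(gold_sentence, output_sentence):
    """Fraction of tokens whose output tag equals the gold standard tag."""
    gold = parse_tagged(gold_sentence)
    output = parse_tagged(output_sentence)
    if len(gold) != len(output):
        raise ValueError("token sequences must align one-to-one")
    correct = sum(1 for (_, g), (_, o) in zip(gold, output) if g == o)
    return correct / len(gold)


gold = "The_DT major_JJ inducible_JJ protein_NN complex_NN that_WDT binds_VBZ ._."
output = "The_DT major_JJ inducible_NN protein_NN complex_NN that_WDT binds_VBD ._."

print(accuracy(gold, output))  # 6 of 8 tags match: 0.75
```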

...

Use tokenizer generated tokens
Run the tokenizer and use its output as input to the POS tagger.
In this scenario, we calculate F-measure in the following way:

true positive (TP)
a token that has the correct boundary and part-of-speech label

false positive (FP)
a tagged token that does not have the correct boundary and/or part-of-speech label

false negative (FN)
a token in the gold standard data that was not correctly generated by the tokenizer/POS tagger
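One way to apply these definitions is to represent every token as a (start, end, tag) tuple and compare the gold standard and generated sets. This is a minimal sketch under that assumption: the function `count_matches`, the character offsets, and the sample tokens are hypothetical and are not taken from the cTAKES evaluation code.

```python
def count_matches(gold, predicted):
    """Count TP, FP, FN for iterables of (start, end, tag) tuples."""
    gold_set = set(gold)
    pred_set = set(predicted)
    tp = len(gold_set & pred_set)  # correct boundary and correct tag
    fp = len(pred_set - gold_set)  # generated token with wrong boundary and/or tag
    fn = len(gold_set - pred_set)  # gold token that was not correctly generated
    return tp, fp, fn


# Hypothetical text "IL-2 binds ." where the tokenizer splits "IL-2" into three tokens.
gold_tokens = [(0, 4, "NN"), (5, 10, "VBZ"), (11, 12, ".")]
generated_tokens = [(0, 2, "NN"), (2, 3, ":"), (3, 4, "CD"), (5, 10, "VBZ"), (11, 12, ".")]

tp, fp, fn = count_matches(gold_tokens, generated_tokens)
print(tp, fp, fn)  # 2 3 1
```

Using sets means a single gold token that is split incorrectly produces one false negative and as many false positives as there are wrongly generated pieces.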

An example is given below in "Evaluate a POS tagger using generated tokens".

Evaluate a POS tagger using generated tokens

...

TP = 4, FP = 2, and FN = 3
F-measure = (2 * recall * precision) / (precision + recall) = (2 * TP) / (2 * TP + FP + FN) = (2 * 4) / (2 * 4 + 2 + 3) = 8 / 13 = 0.615
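The arithmetic above can be checked with a short sketch. It assumes the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN); the function name `f_measure` is illustrative.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure computed from true/false positive and false negative counts."""
    if tp == 0:
        return 0.0  # avoids division by zero when nothing was matched
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


score = f_measure(tp=4, fp=2, fn=3)
print(round(score, 3))  # 0.615
```

The result agrees with the closed form (2 * TP) / (2 * TP + FP + FN) = 8 / 13.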

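In fact, if you do the evaluation this way for the "gold standard tokens" evaluation, then you will get the same answer as the accuracy calculation given above. With gold standard tokens every boundary is correct, so each mis-tagged token counts once as a false positive and once as a false negative, and the F-measure reduces to TP / (TP + errors), which is the accuracy.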
