...
line 2: <?xml-stylesheet type="text/css" href="gpml.css" ?>
line 3: <!DOCTYPE set SYSTEM "gpml.merged.dtd">
line 5: <import resource="GENIAontology.daml" prefix="G"></import>
java -cp <classpath> data.pos.training.GeniaPosTrainingDataExtractor GENIAcorpus4.02.pos.xml <genia-pos-training-data>
...
Creating a model
java -cp <classpath> opennlp.tools.postag.POSTaggerME <training-data> <model-name> <iterations> <cutoff>
Where
- <training-data> is an OpenNLP training data file.
- <model-name> is the file name of the resulting model. The name should end with either .txt (for a plain text model) or .bin.gz (for a compressed binary model).
- <iterations> determines how many training iterations will be performed. The default is 100.
- <cutoff> determines the minimum number of times a feature has to be seen to be considered for inclusion in the model. The default cutoff is 5.
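To make the effect of the cutoff concrete, here is a minimal sketch of count-based feature filtering. It is an illustration only, not OpenNLP's code: the function name `apply_cutoff` and the feature strings are hypothetical, and it assumes a feature is kept when it has been seen at least `cutoff` times.

```python
from collections import Counter

def apply_cutoff(feature_events, cutoff=5):
    """Return the set of features seen at least `cutoff` times (assumed rule)."""
    counts = Counter(f for event in feature_events for f in event)
    return {f for f, n in counts.items() if n >= cutoff}

# Hypothetical feature lists, one per training token.
events = [["w=the", "suf=he"]] * 5 + [["w=binds", "suf=ds"]] * 2

kept_default = apply_cutoff(events, cutoff=5)  # only the features seen 5 times
kept_low = apply_cutoff(events, cutoff=2)      # all four features survive
print(sorted(kept_default))
print(sorted(kept_low))
```

A lower cutoff keeps rare features in the model, which matters for small training sets; a higher cutoff yields a smaller model.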
...
We have provided a mechanism for creating a tag dictionary. It can be run with the following command:
java -cp <classpath> org.apache.ctakes.postagger.TagDictionaryCreator <training-data> <tag-dictionary> <case-sensitive>
Where
- <training-data> is a file containing part-of-speech tagged training data.
- <tag-dictionary> is the file name of the resulting tag dictionary.
- <case-sensitive> is either 'true' or 'false' depending on whether the tag dictionary should be case sensitive or not.
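The idea behind a tag dictionary can be sketched in a few lines. This is not the TagDictionaryCreator implementation and does not reproduce its file format; it assumes training data in the word_TAG form used by the example sentences on this page, and `build_tag_dictionary` is a hypothetical helper that keeps the result in memory.

```python
from collections import defaultdict

def build_tag_dictionary(lines, case_sensitive=True):
    """Map each word to the sorted list of tags it was seen with."""
    tags_by_word = defaultdict(set)
    for line in lines:
        for token in line.split():
            # Split on the last underscore so that "._." yields (".", ".").
            word, _, tag = token.rpartition("_")
            if not word:
                continue  # skip tokens that are not in word_TAG form
            key = word if case_sensitive else word.lower()
            tags_by_word[key].add(tag)
    return {word: sorted(tags) for word, tags in tags_by_word.items()}

# Hypothetical two-sentence training sample.
sample = ["The_DT major_JJ complex_NN ._.", "the_DT complex_JJ ._."]
sensitive = build_tag_dictionary(sample, case_sensitive=True)
insensitive = build_tag_dictionary(sample, case_sensitive=False)
print(insensitive["complex"])
```

With `case_sensitive=False` the entries "The" and "the" collapse into one lowercased key, which is the form a case-insensitive lookup needs.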
...
OpenNLP provides a default tag dictionary for its English part-of-speech model, tag.bin.gz. The dictionary can be downloaded from http://opennlp.sourceforge.net/models/english/parser/tagdict. You should use this tag dictionary only if you are using the model from http://opennlp.sourceforge.net/models/english/parser/tag.bin.gz.
Tip |
---|
If you use the tag dictionary in a case-insensitive way, entries in the dictionary file that are not entirely lowercase will be ignored, because the entries read from the file are never lowercased. When "CaseSensitive" is set to false, only the words being compared against the dictionary are lowercased. Therefore, if you want the tag dictionary to be used in a case-insensitive way, be sure to build it using 'false' as the third argument. |
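The pitfall described in the tip can be reproduced with a small sketch. It only mimics the described behavior (lowercase the looked-up word, leave stored entries untouched); `lookup` and both dictionaries are hypothetical and are not taken from the OpenNLP or cTAKES source.

```python
def lookup(tag_dict, word, case_sensitive):
    """Lowercase only the looked-up word, never the stored entries."""
    key = word if case_sensitive else word.lower()
    return tag_dict.get(key)

# Dictionary built with case sensitivity on: contains a capitalized entry.
mixed_case_dict = {"The": ["DT"], "protein": ["NN"]}
# Dictionary built with 'false' as the third argument: all keys lowercase.
lowercased_dict = {"the": ["DT"], "protein": ["NN"]}

missed = lookup(mixed_case_dict, "The", case_sensitive=False)  # entry is unreachable
found = lookup(lowercased_dict, "The", case_sensitive=False)
print(missed, found)
```

The capitalized entry "The" can never match a lowercased query, so it is silently ignored.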
...
If this is the gold standard sentence:
The_DT major_JJ inducible_JJ protein_NN complex_NN that_WDT binds_VBZ ._.
And if this is the output for that sentence:
The_DT major_JJ inducible_NN protein_NN complex_NN that_WDT binds_VBD ._.
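The comparison of these two sentences can be sketched as follows, assuming accuracy is the fraction of gold standard tokens that received the correct tag. The helper names are illustrative and not part of cTAKES.

```python
def tags(sentence):
    """Extract the tag from each word_TAG token."""
    return [token.rpartition("_")[2] for token in sentence.split()]

def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    gold_tags, predicted_tags = tags(gold), tags(predicted)
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("sentences must have the same number of tokens")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

gold = "The_DT major_JJ inducible_JJ protein_NN complex_NN that_WDT binds_VBZ ._."
output = "The_DT major_JJ inducible_NN protein_NN complex_NN that_WDT binds_VBD ._."
score = accuracy(gold, output)
print(score)  # 6 of the 8 tags match: 0.75
```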
...
- Use tokenizer-generated tokens
- Run the tokenizer and use its output as input to the POS tagger.
- In this scenario, we calculate F-measure in the following way:
true positive (TP)
a token that has the correct boundary and part-of-speech label
false positive (FP)
a tagged token that does not have the correct boundary and/or part-of-speech label
false negative (FN)
a token in the gold standard data that was not correctly generated by the tokenizer/POS tagger
An example is given below in "Evaluate a POS tagger using generated tokens".
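The three definitions above can be sketched with set operations, under the assumption that each token is represented as a (start offset, end offset, tag) triple. The text "IL-2 binds ." and its tokenization are a hypothetical illustration and are not the example from this page.

```python
def count_outcomes(gold, predicted):
    """Return (TP, FP, FN) for tokens given as (start, end, tag) triples."""
    gold_set, predicted_set = set(gold), set(predicted)
    tp = len(gold_set & predicted_set)   # correct boundary and correct tag
    fp = len(predicted_set - gold_set)   # wrong boundary and/or wrong tag
    fn = len(gold_set - predicted_set)   # gold token not reproduced
    return tp, fp, fn

# Hypothetical text: "IL-2 binds ."
gold = [(0, 4, "NN"), (5, 10, "VBZ"), (11, 12, ".")]
# The tokenizer wrongly split "IL-2" into three tokens.
predicted = [(0, 2, "NN"), (2, 3, ":"), (3, 4, "CD"), (5, 10, "VBZ"), (11, 12, ".")]

tp, fp, fn = count_outcomes(gold, predicted)
print(tp, fp, fn)  # 2 3 1
```

One tokenization error produces several false positives but only one false negative, which is why precision and recall can differ in this scenario.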
Evaluate a POS tagger using generated tokens
...
TP = 4, FP = 2, and FN = 3
F-measure = (2 * recall * precision) / (precision + recall) = (2 * TP) / (2 * TP + FP + FN) = (2 * 4) / (2 * 4 + 2 + 3) = 8 / 13 ≈ 0.615
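In fact, if you calculate F-measure this way for the "gold standard tokens" evaluation, you will get the same answer as the accuracy calculation given above. Every token boundary is correct in that scenario, so each mis-tagged token counts once as a false positive and once as a false negative, which makes precision, recall, and accuracy equal.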
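The arithmetic can be checked with a short sketch. It assumes nothing beyond the formula above; the function names are illustrative. The second call uses the gold standard tokens example from earlier on this page, where 6 tags are correct and the 2 mis-tagged tokens count as both false positives and false negatives.

```python
def f_measure(tp, fp, fn):
    """F-measure computed directly from the counts: 2*TP / (2*TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def f_from_precision_recall(tp, fp, fn):
    """Same value via precision and recall (all counts assumed non-zero)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * recall * precision / (precision + recall)

generated = f_measure(4, 2, 3)     # 8 / 13
gold_tokens = f_measure(6, 2, 2)   # 12 / 16, equal to the accuracy 6 / 8
print(round(generated, 3), gold_tokens)  # 0.615 0.75
```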