...

line 2: <?xml-stylesheet type="text/css" href="gpml.css" ?>
line 3: <!DOCTYPE set SYSTEM "gpml.merged.dtd">
line 5: <import resource="GENIAontology.daml" prefix="G"></import>
java -cp <classpath> data.pos.training.GeniaPosTrainingDataExtractor GENIAcorpus4.02.pos.xml <genia-pos-training-data>

...

Creating a model

java -cp <classpath>  opennlp.tools.postag.POSTaggerME  <training-data>  <model-name>  iterations  cutoff
Where

  • <training-data> is an OpenNLP training data file.
  • <model-name> is the file name of the resulting model. The name should end with either .txt (for a plain text model) or .bin.gz (for a compressed binary model).
  • <iterations> determines how many training iterations will be performed. The default is 100.
  • <cutoff> determines the minimum number of times a feature has to be seen to be considered for inclusion in the model. The default cutoff is 5.

...

We have provided a mechanism for creating a tag dictionary. It can be run with the following command:
java -cp <classpath> org.apache.ctakes.postagger.TagDictionaryCreator <training-data> <tag-dictionary> case-sensitive
Where

  • <training-data> is a file containing part-of-speech tagged training data.
  • <tag-dictionary> is the file name of the resulting tag dictionary.
  • <case-sensitive> is either 'true' or 'false' depending on whether the tag dictionary should be case sensitive or not.

...

OpenNLP provides a default tag dictionary for the English part-of-speech model called tag.bin.gz, which can be downloaded from http://opennlp.sourceforge.net/models/english/parser/tagdict. You should use this tag dictionary only if you are using the model from http://opennlp.sourceforge.net/models/english/parser/tag.bin.gz.

Tip

If you want the tag dictionary to be used in a case-insensitive way, be sure to build it using 'false' as the third argument. When "CaseSensitive" is set to false, only the words being compared against the dictionary are lowercased; the entries read in from the dictionary file are not. Any entry in the tag dictionary that is not all lowercase will therefore never match and is effectively ignored.
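
The behaviour described in this tip can be illustrated with a toy model. This is only a sketch of the lookup logic as described above, not OpenNLP's or cTAKES's actual code; the `lookup` function and both example dictionaries are hypothetical.

```python
def lookup(tag_dict, word, case_sensitive):
    """Toy model of the lookup described in the tip (not OpenNLP code).

    Only the incoming word is lowercased when case_sensitive is False;
    the dictionary keys are used exactly as they were stored.
    """
    key = word if case_sensitive else word.lower()
    return tag_dict.get(key)


# Hypothetical dictionary built with 'true': keys keep their original casing.
built_case_sensitive = {"The": ["DT"], "protein": ["NN"]}

# Hypothetical dictionary built with 'false': keys were lowercased when built.
built_case_insensitive = {"the": ["DT"], "protein": ["NN"]}

# The mixed-case entry "The" is never matched in case-insensitive mode.
print(lookup(built_case_sensitive, "The", case_sensitive=False))    # None
# The lowercased entry is matched as intended.
print(lookup(built_case_insensitive, "The", case_sensitive=False))  # ['DT']
```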

...

If this is the gold standard sentence:
  The_DT major_JJ inducible_JJ protein_NN complex_NN that_WDT binds_VBZ ._.

And if this is the output for that sentence:
  The_DT major_JJ inducible_NN protein_NN complex_NN that_WDT binds_VBD ._.
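
As a minimal sketch, token-level accuracy for this pair of sentences could be computed as follows. It assumes that accuracy means the number of correctly tagged tokens divided by the total number of tokens, and that both sentences share the same tokenization; the helper names `parse_tagged` and `accuracy` are illustrative and not part of cTAKES.

```python
def parse_tagged(sentence):
    """Split a 'word_TAG word_TAG ...' string into (word, tag) pairs."""
    # rsplit on the last underscore so that tokens like '._.' parse correctly.
    return [tuple(token.rsplit("_", 1)) for token in sentence.split()]


def accuracy(gold, output):
    """Fraction of tokens whose tag matches the gold standard tag."""
    gold_pairs = parse_tagged(gold)
    output_pairs = parse_tagged(output)
    if len(gold_pairs) != len(output_pairs):
        raise ValueError("gold and output must have the same tokenization")
    correct = sum(g == o for g, o in zip(gold_pairs, output_pairs))
    return correct / len(gold_pairs)


gold = "The_DT major_JJ inducible_JJ protein_NN complex_NN that_WDT binds_VBZ ._."
output = "The_DT major_JJ inducible_NN protein_NN complex_NN that_WDT binds_VBD ._."

# Two of the eight tokens (inducible, binds) carry the wrong tag.
print(accuracy(gold, output))  # 0.75
```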

...

  • Use tokenizer generated tokens
  • Run the tokenizer and use this as input to the POS tagger.
  • In this scenario, we calculate F-measure in the following way:

true positive (TP)
  a token that has the correct boundary and part-of-speech label

false positive (FP)
  a tagged token that does not have the correct boundary and/or part-of-speech label

false negative (FN)
  a token in the gold standard data that was not correctly generated by the tokenizer/POS tagger
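
The three definitions above can be sketched in code. This is an illustration under assumptions: each token is represented as a hypothetical (start offset, end offset, tag) tuple, and the sample data is made up for the example rather than taken from the GENIA corpus.

```python
def count_outcomes(gold_tokens, tagged_tokens):
    """Count TP, FP and FN for tokens given as (start, end, tag) tuples."""
    gold_set = set(gold_tokens)
    tagged_set = set(tagged_tokens)
    tp = len(gold_set & tagged_set)   # correct boundary and correct tag
    fp = len(tagged_set - gold_set)   # produced, but boundary and/or tag is wrong
    fn = len(gold_set - tagged_set)   # gold token that was not correctly produced
    return tp, fp, fn


# Made-up example: the tokenizer splits the first gold token into three pieces
# and the tagger mislabels the last token.
gold_tokens = [(0, 4, "NN"), (5, 9, "NN"), (10, 18, "VBZ")]
tagged_tokens = [(0, 2, "NN"), (2, 3, ":"), (3, 4, "CD"), (5, 9, "NN"), (10, 18, "VBD")]

print(count_outcomes(gold_tokens, tagged_tokens))  # (1, 4, 2)
```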

An example is given below in "Evaluate a POS tagger using generated tokens".

Evaluate a POS tagger using generated tokens

...

TP = 4, FP = 2, and FN = 3
F-measure = (2 * recall * precision) / (precision + recall) = (2 * TP) / (2*TP + FP + FN) = (2 * 4) / (2*4 + 2 + 3) = 8 / 13 = .615

In fact, if you do the evaluation this way for the "gold standard tokens" evaluation, then you will get the same answer as the accuracy calculation given above.
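
A short sketch of the F-measure formula makes both points concrete. The function name `f_measure` is illustrative. The second call assumes the eight-token gold standard example from this page, where each of the two mistagged tokens counts as both a false positive and a false negative because the token boundaries are identical.

```python
def f_measure(tp, fp, fn):
    """F-measure computed directly from counts: 2*TP / (2*TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Generated-token example: TP = 4, FP = 2, FN = 3.
print(round(f_measure(4, 2, 3), 3))  # 0.615

# Gold-token example: 6 of 8 tags correct, so TP = 6, FP = 2, FN = 2.
# The result equals the accuracy 6 / 8.
print(f_measure(6, 2, 2))  # 0.75
```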