Comment: Not finished yet.

...

Train and evaluate a baseline model without a dictionary

 

 


...

We start by training and evaluating a model without a Tag Dictionary.

Code Block
languagebash
# Train a POS Model using the train corpus. No dictionary.
 
$ bin/opennlp POSTaggerTrainer.conllx -type perceptron -lang pt -model pos-pt_nodic.model -data portuguese_bosque_train.conll -encoding UTF-8
Indexing events using cutoff of 5
	Computing event counts...  done. 206678 events
	Indexing...  done.
Collecting events... Done indexing.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 206678
	    Number of Outcomes: 22
	  Number of Predicates: 29155
Computing model parameters...
Performing 100 iterations.
  1:  . (191458/206678) 0.9263588770938368
  2:  . (197359/206678) 0.954910537164091
  3:  . (199143/206678) 0.9635423218726715
  4:  . (199932/206678) 0.9673598544595942
  5:  . (200688/206678) 0.9710177183831854
  6:  . (201205/206678) 0.9735191941087101
  7:  . (201657/206678) 0.9757061709519156
  8:  . (201980/206678) 0.9772689884748256
  9:  . (202236/206678) 0.9785076302267295
 10:  . (202493/206678) 0.9797511104229768
 20:  . (203958/206678) 0.9868394313860208
 30:  . (204511/206678) 0.989515091107907
 40:  . (204986/206678) 0.99181335217101
 50:  . (205151/206678) 0.9926116954876668
 60:  . (205344/206678) 0.9935455152459381
 70:  . (205453/206678) 0.994072905679366
 80:  . (205522/206678) 0.9944067583390588
 90:  . (205605/206678) 0.994808349219559
100:  . (205657/206678) 0.9950599483254144
Stats: (204843/206678) 0.9911214546299074
...done.
Writing pos tagger model ... Compressed 29155 parameters to 23457
5067 outcome patterns
done (0,996s)
Wrote pos tagger model to
path: apache-opennlp-1.6.0/pos-pt_nodic.model


# Evaluate the model using the test corpus
$ bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_nodic.model -data portuguese_bosque_test.conll -encoding UTF-8
Loading POS Tagger model ... done (0,297s)
Evaluating ... done
Accuracy: 0.9609681268109767
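
To compare runs later, it can be handy to pull the accuracy figure out of the evaluator output. The following is a minimal sketch that assumes the output has been captured as text and that the figure is on a line starting with `Accuracy:`, as in the log above. The variable names are illustrative only.

```bash
# Sample evaluator output copied from the run above. In practice this would be
# the captured output of the POSTaggerEvaluator command.
eval_output='Loading POS Tagger model ... done (0,297s)
Evaluating ... done
Accuracy: 0.9609681268109767'

# Print the second field of the line that starts with "Accuracy:".
accuracy=$(printf '%s\n' "$eval_output" | awk '/^Accuracy:/ { print $2 }')
echo "Baseline accuracy: $accuracy"
```

The value is kept as a string, so no digits are lost to rounding.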

Train and evaluate with a dictionary extracted from training data

Now we create a model with an embedded XML Tag Dictionary built from the training data. All entries that appear more than twice will be included (tagDictCutoff parameter).
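
The idea of a frequency cutoff can be illustrated outside OpenNLP. The sketch below is not OpenNLP's implementation: it counts (word, tag) pairs in a tiny made-up sample and keeps those seen more than `cutoff` times, following the description above. It assumes tab-separated CoNLL-X columns with the word form in column 2 and the tag in column 5. Whether OpenNLP applies a strict or an inclusive comparison, and any normalization it performs, is not verified here.

```bash
# Hypothetical three-sentence sample in CoNLL-X layout (tab-separated).
sample=$(printf '1\to\to\tart\tart\t_\n2\tgato\tgato\tn\tn\t_\n\n1\to\to\tart\tart\t_\n2\tgato\tgato\tn\tn\t_\n\n1\to\to\tart\tart\t_\n2\trato\trato\tn\tn\t_\n')

cutoff=2

# Count each (form, tag) pair and keep the pairs seen more than $cutoff times.
# Assumption: strict comparison, mirroring the prose description only.
candidates=$(printf '%s\n' "$sample" | LC_ALL=C awk -F'\t' -v cutoff="$cutoff" '
  NF >= 5 { count[$2 "\t" $5]++ }
  END { for (k in count) if (count[k] > cutoff) print k "\t" count[k] }
' | sort)

printf '%s\n' "$candidates"
```

In this sample only the pair `o`/`art` (seen three times) passes the cutoff, while `gato`/`n` (twice) and `rato`/`n` (once) are dropped.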

Code Block
languagebash
# Train a POS Model using the train corpus. Dictionary created from training data (cutoff 2)
 
$ bin/opennlp POSTaggerTrainer.conllx -type perceptron -lang pt -model pos-pt_xmldic.model -data portuguese_bosque_train.conll -encoding UTF-8 -tagDictCutoff 2
Expanding POS Dictionary ...
... finished expanding POS Dictionary. [702ms]
Indexing events using cutoff of 5
  Computing event counts...  done. 206678 events
  Indexing...  done.
Collecting events... Done indexing.
Incorporating indexed data for training...  
done.
  Number of Event Tokens: 206678
      Number of Outcomes: 22
    Number of Predicates: 29155
Computing model parameters...
Performing 100 iterations.
  1:  . (191458/206678) 0.9263588770938368
  2:  . (197359/206678) 0.954910537164091
  3:  . (199143/206678) 0.9635423218726715
  4:  . (199932/206678) 0.9673598544595942
  5:  . (200688/206678) 0.9710177183831854
  6:  . (201205/206678) 0.9735191941087101
  7:  . (201657/206678) 0.9757061709519156
  8:  . (201980/206678) 0.9772689884748256
  9:  . (202236/206678) 0.9785076302267295
 10:  . (202493/206678) 0.9797511104229768
 20:  . (203958/206678) 0.9868394313860208
 30:  . (204511/206678) 0.989515091107907
 40:  . (204986/206678) 0.99181335217101
 50:  . (205151/206678) 0.9926116954876668
 60:  . (205344/206678) 0.9935455152459381
 70:  . (205453/206678) 0.994072905679366
 80:  . (205522/206678) 0.9944067583390588
 90:  . (205605/206678) 0.994808349219559
100:  . (205657/206678) 0.9950599483254144
Stats: (204843/206678) 0.9911214546299074
...done.
Writing pos tagger model ... Compressed 29155 parameters to 23457
5067 outcome patterns
done (1,517s)
Wrote pos tagger model to
path: apache-opennlp-1.6.0/pos-pt_xmldic.model


# Evaluate the model using the test corpus
$ bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_xmldic.model -data portuguese_bosque_test.conll -encoding UTF-8
Loading POS Tagger model ... done (0,712s)
Evaluating ... done
Accuracy: 0.9648883586159878

$ ls -lah pos-pt_nodic.model
-rw-r--r-- 1 colen staff 629K 8 Jul 01:09 pos-pt_nodic.model

Note how the accuracy improved from 96.097% to 96.489% after including the dictionary, a gain of about 0.39 percentage points on this test corpus.
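
The comparison can be reproduced from the two reported accuracies. The sketch below computes the absolute gain in percentage points and, as an extra illustrative figure that the original text does not report, the relative reduction in error rate. `LC_ALL=C` is set because the logs above come from a comma-decimal locale, which could otherwise affect how some awk implementations parse and print numbers.

```bash
# Accuracies reported by the two POSTaggerEvaluator runs above.
baseline=0.9609681268109767
with_dict=0.9648883586159878

# Absolute gain in percentage points.
gain=$(LC_ALL=C awk -v a="$baseline" -v b="$with_dict" 'BEGIN { printf "%.3f", (b - a) * 100 }')

# Relative reduction of the error rate (1 - accuracy), in percent.
error_cut=$(LC_ALL=C awk -v a="$baseline" -v b="$with_dict" 'BEGIN { printf "%.1f", (b - a) / (1 - a) * 100 }')

echo "Absolute gain: ${gain} percentage points"
echo "Relative error reduction: ${error_cut}%"
```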

Create an FSA Dictionary from the XML Tag Dictionary

Now we extract the XML Tag Dictionary from the model and convert it to an FSA Dictionary.

Code Block
languagebash
# Extract the model to get the tag dictionary
 
$ unzip pos-pt_xmldic.model -d pos-pt_xmldic

# Take a look at the file size
$ ls -alh pos-pt_xmldic
total 3464
drwxr-xr-x 5 colen staff 170B 11 Jul 23:23 .
drwxr-xr-x 16 colen staff 544B 11 Jul 23:23 ..
-rw-r--r-- 1 colen staff 306B 11 Jul 21:03 manifest.properties
-rw-r--r-- 1 colen staff 1,1M 11 Jul 21:03 pos.model
-rw-r--r-- 1 colen staff 554K 11 Jul 21:03 tags.tagdict

# Convert the tags.tagdict to an FSA Dictionary


# train and create a tag dictionary from corpus
bin/opennlp POSTaggerTrainer.conllx -type perceptron -params perceptron_0.properties -lang pt -model pos-pt_xmldic.model -data portuguese_bosque_train.conll -encoding UTF-8 -tagDictCutoff 2

...

# extract the tagdict
unzip pos-pt_xmldic.model -d pos-pt_xmldic
more pos-pt_xmldic/tags.tagdict

bin/morfologik-addon XMLDictionaryToTable -inputFile pos-pt_xmldic/tags.tagdict -outputFile pt-morfologik.txt -separator , -encoder prefix -encoding UTF-8
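
As a sanity check on the dictionary that is fed to the converter, the entries in `tags.tagdict` can be counted with standard tools. The sketch below runs on a made-up miniature file. It assumes the XML layout uses one `<entry tags="...">` element per line with space-separated tags in the `tags` attribute. That layout should be confirmed against the real extracted file before relying on the counts.

```bash
# Hypothetical miniature tag dictionary. The real file would be
# pos-pt_xmldic/tags.tagdict from the extracted model.
tmpdir=$(mktemp -d)
cat > "$tmpdir/tags.tagdict" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<dictionary case_sensitive="true">
<entry tags="art">
<token>o</token>
</entry>
<entry tags="n v-fin">
<token>casa</token>
</entry>
</dictionary>
EOF

# Count lines that open an entry, and those whose tags attribute holds several tags.
entries=$(grep -c '<entry ' "$tmpdir/tags.tagdict")
ambiguous=$(grep -c '<entry tags="[^"]* [^"]*"' "$tmpdir/tags.tagdict")

echo "entries: $entries, with more than one tag: $ambiguous"
rm -r "$tmpdir"
```

Because `grep -c` counts matching lines, the numbers are only meaningful if each entry starts on its own line.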

...