Train and evaluate a baseline model without dictionary

We start by training and evaluating a model without a tag dictionary.

Code Block

language	bash

# Train a POS Model using the train corpus. No dictionary.
# Size of the resulting model:
#   $ ls -lah pos-pt_nodic.model
#   -rw-r--r-- 1 colen staff 629K 8 Jul 01:09 pos-pt_nodic.model
 
$ bin/opennlp POSTaggerTrainer.conllx -type perceptron -lang pt -model pos-pt_nodic.model -data portuguese_bosque_train.conll -encoding UTF-8
Indexing events using cutoff of 5
	Computing event counts...  done. 206678 events
	Indexing...  done.
Collecting events... Done indexing.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 206678
	    Number of Outcomes: 22
	  Number of Predicates: 29155
Computing model parameters...
Performing 100 iterations.
  1:  . (191458/206678) 0.9263588770938368
  2:  . (197359/206678) 0.954910537164091
  3:  . (199143/206678) 0.9635423218726715
  4:  . (199932/206678) 0.9673598544595942
  5:  . (200688/206678) 0.9710177183831854
  6:  . (201205/206678) 0.9735191941087101
  7:  . (201657/206678) 0.9757061709519156
  8:  . (201980/206678) 0.9772689884748256
  9:  . (202236/206678) 0.9785076302267295
 10:  . (202493/206678) 0.9797511104229768
 20:  . (203958/206678) 0.9868394313860208
 30:  . (204511/206678) 0.989515091107907
 40:  . (204986/206678) 0.99181335217101
 50:  . (205151/206678) 0.9926116954876668
 60:  . (205344/206678) 0.9935455152459381
 70:  . (205453/206678) 0.994072905679366
 80:  . (205522/206678) 0.9944067583390588
 90:  . (205605/206678) 0.994808349219559
100:  . (205657/206678) 0.9950599483254144
Stats: (204843/206678) 0.9911214546299074
...done.
Writing pos tagger model ... Compressed 29155 parameters to 23457
5067 outcome patterns
done (0,996s)
Wrote pos tagger model to
path: apache-opennlp-1.6.0/pos-pt_nodic.model


# Evaluate the model using the test corpus
$ bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_nodic.model -data portuguese_bosque_test.conll -encoding UTF-8
Loading POS Tagger model ... done (0,297s)
Evaluating ... done
Accuracy: 0.9609681268109767

Train and evaluate with a dictionary extracted from training data

Now we create a model with an embedded XML tag dictionary built from the training data. All entries that appear more than twice will be included (tagDictCutoff parameter).
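To get a feel for what a frequency cutoff does, the following is a minimal sketch that counts word/tag pairs in a CoNLL-X file and keeps those seen more than a given number of times. It is not OpenNLP's implementation. The function name count_word_tags is hypothetical, and it assumes a tab-separated file with the word form in column 2 and the POS tag in column 5.

```shell
# count_word_tags CUTOFF FILE: lists word/tag pairs that occur more than
# CUTOFF times, followed by their count.
# Assumptions (not taken from OpenNLP): tab-separated CoNLL-X input,
# FORM in column 2, POSTAG in column 5.
count_word_tags() {
  awk -F '\t' -v cutoff="$1" '
    NF >= 5 { seen[$2 "\t" $5]++ }   # blank sentence separators are skipped
    END {
      for (pair in seen)
        if (seen[pair] > cutoff) print pair "\t" seen[pair]
    }
  ' "$2" | sort
}

# Usage with the training corpus from this walkthrough, if it is present.
if [ -f portuguese_bosque_train.conll ]; then
  count_word_tags 2 portuguese_bosque_train.conll | head
fi
```

Raising the cutoff keeps only the more frequent pairs, which trades dictionary coverage for a smaller, less noisy dictionary.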

Code Block

language	bash

# Train a POS Model using the train corpus. Dictionary created from training data (cutoff 2)
# Size of the resulting model:
#   $ ls -lah pos-pt_xmldic.model
#   -rw-r--r-- 1 colen staff 839K 8 Jul 01:24 pos-pt_xmldic.model
 
$ bin/opennlp POSTaggerTrainer.conllx -type perceptron -lang pt -model pos-pt_xmldic.model -data portuguese_bosque_train.conll -encoding UTF-8 -tagDictCutoff 2
Expanding POS Dictionary ...
... finished expanding POS Dictionary. [702ms]
Indexing events using cutoff of 5
  Computing event counts...  done. 206678 events
  Indexing...  done.
Collecting events... Done indexing.
Incorporating indexed data for training...  
done.
  Number of Event Tokens: 206678
      Number of Outcomes: 22
    Number of Predicates: 29155
Computing model parameters...
Performing 100 iterations.
  1:  . (191458/206678) 0.9263588770938368
  2:  . (197359/206678) 0.954910537164091
  3:  . (199143/206678) 0.9635423218726715
  4:  . (199932/206678) 0.9673598544595942
  5:  . (200688/206678) 0.9710177183831854
  6:  . (201205/206678) 0.9735191941087101
  7:  . (201657/206678) 0.9757061709519156
  8:  . (201980/206678) 0.9772689884748256
  9:  . (202236/206678) 0.9785076302267295
 10:  . (202493/206678) 0.9797511104229768
 20:  . (203958/206678) 0.9868394313860208
 30:  . (204511/206678) 0.989515091107907
 40:  . (204986/206678) 0.99181335217101
 50:  . (205151/206678) 0.9926116954876668
 60:  . (205344/206678) 0.9935455152459381
 70:  . (205453/206678) 0.994072905679366
 80:  . (205522/206678) 0.9944067583390588
 90:  . (205605/206678) 0.994808349219559
100:  . (205657/206678) 0.9950599483254144
Stats: (204843/206678) 0.9911214546299074
...done.
Writing pos tagger model ... Compressed 29155 parameters to 23457
5067 outcome patterns
done (1,517s)
Wrote pos tagger model to
path: apache-opennlp-1.6.0/pos-pt_xmldic.model


# Evaluate the model using the test corpus
$ bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_xmldic.model -data portuguese_bosque_test.conll -encoding UTF-8
Loading POS Tagger model ... done (0,712s)
Evaluating ... done
Accuracy: 0.9648883586159878

Note how the accuracy improved from 96.097% to 96.489% after including the dictionary, which shows that the tag dictionary helps on this corpus.
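The gain can also be expressed as a reduction of the error rate (1 - accuracy). The sketch below computes both figures from the two accuracies reported above. The function name error_reduction is hypothetical, and the only tool assumed is awk.

```shell
# Accuracies copied from the two evaluation runs above.
baseline=0.9609681268109767    # no dictionary
with_dict=0.9648883586159878   # dictionary from training data (cutoff 2)

# error_reduction BASE NEW: prints the absolute gain in percentage points,
# then the relative reduction of the error rate (1 - accuracy) in percent.
error_reduction() {
  LC_ALL=C awk -v base="$1" -v new="$2" 'BEGIN {
    gain = (new - base) * 100
    rel  = (new - base) / (1 - base) * 100
    printf "%.2f %.2f\n", gain, rel
  }'
}

error_reduction "$baseline" "$with_dict"   # prints: 0.39 10.04
```

In other words, the dictionary adds about 0.39 percentage points of accuracy, which removes roughly 10% of the baseline model's errors.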

Create an FSA Dictionary from the XML Tag Dictionary

Now we extract the XML tag dictionary from the model and convert it to an FSA dictionary.

Code Block

language	bash

# Extract the model to get the tag dictionary
$ unzip pos-pt_xmldic.model -d pos-pt_xmldic
 
# Take a look at the file size
$ ls -alh pos-pt_xmldic
total 3464
drwxr-xr-x 5 colen staff 170B 11 Jul 23:23 .
drwxr-xr-x 16 colen staff 544B 11 Jul 23:23 ..
-rw-r--r-- 1 colen staff 306B 11 Jul 21:03 manifest.properties
-rw-r--r-- 1 colen staff 1,1M 11 Jul 21:03 pos.model
-rw-r--r-- 1 colen staff 554K 11 Jul 21:03 tags.tagdict
 
# Convert the tags.tagdict to a table-like dictionary and .info file to be consumed by MorfologikDictionaryBuilder
$ bin/morfologik-addon XMLDictionaryToTable -inputFile pos-pt_xmldic/tags.tagdict -outputFile pt-morfologik.txt -separator + -encoder prefix -encoding UTF-8
Created dictionary: pt-morfologik.txt
Created metadata: pt-morfologik.txt
 
# Create the FSA Dictionary
$ bin/morfologik-addon MorfologikDictionaryBuilder -inputFile pt-morfologik.txt -encoding UTF-8

Comparing sizes

We can now compare the size of the XML dictionary and the FSA dictionary:

Code Block

language	bash

$ ls -alh pt-morfologik.dict pos-pt_xmldic/tags.tagdict
-rw-r--r--  1 colen  staff   554K 11 Jul 21:03 pos-pt_xmldic/tags.tagdict
-rw-r--r--  1 colen  staff    89K 12 Jul 09:33 pt-morfologik.dict

The FSA dictionary is much smaller than the original XML version (89K versus 554K). It also performs much better at runtime, because the XML dictionary is loaded into a hash map, while the FSA dictionary is loaded as a finite-state automaton.
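From the listing above, the XML dictionary is roughly six times larger than the FSA one (554K versus 89K). A minimal sketch for computing that ratio directly from the files follows. The function name size_ratio is hypothetical, and it relies only on wc and awk.

```shell
# size_ratio BIG SMALL: prints how many times larger BIG is than SMALL,
# using byte counts from wc -c.
size_ratio() {
  big=$(wc -c < "$1")
  small=$(wc -c < "$2")
  LC_ALL=C awk -v b="$big" -v s="$small" 'BEGIN {
    if (s + 0 == 0) { print "n/a"; exit 1 }   # avoid dividing by zero
    printf "%.1f\n", b / s
  }'
}

# Usage with the file names from this walkthrough, if they are present.
if [ -f pos-pt_xmldic/tags.tagdict ] && [ -f pt-morfologik.dict ]; then
  size_ratio pos-pt_xmldic/tags.tagdict pt-morfologik.dict
fi
```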

Train a POS model with the FSA dictionary

We can use MorfologikPOSTaggerFactory to create a POS model with an embedded FSA dictionary:

Code Block

language	bash

$ bin/opennlp-cp POSTaggerTrainer.conllx -type perceptron -lang pt -model pos-pt_fsadic.model -data portuguese_bosque_train.conll -encoding UTF-8 -factory opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory -dict pt-morfologik.dict
Indexing events using cutoff of 5
	Computing event counts...  done. 206678 events
	Indexing...  done.
Collecting events... Done indexing.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 206678
	    Number of Outcomes: 22
	  Number of Predicates: 29155
Computing model parameters...
Performing 100 iterations.
  1:  . (191458/206678) 0.9263588770938368
  2:  . (197359/206678) 0.954910537164091
  3:  . (199143/206678) 0.9635423218726715
  4:  . (199932/206678) 0.9673598544595942
  5:  . (200688/206678) 0.9710177183831854
  6:  . (201205/206678) 0.9735191941087101
  7:  . (201657/206678) 0.9757061709519156
  8:  . (201980/206678) 0.9772689884748256
  9:  . (202236/206678) 0.9785076302267295
 10:  . (202493/206678) 0.9797511104229768
 20:  . (203958/206678) 0.9868394313860208
 30:  . (204511/206678) 0.989515091107907
 40:  . (204986/206678) 0.99181335217101
 50:  . (205151/206678) 0.9926116954876668
 60:  . (205344/206678) 0.9935455152459381
 70:  . (205453/206678) 0.994072905679366
 80:  . (205522/206678) 0.9944067583390588
 90:  . (205605/206678) 0.994808349219559
100:  . (205657/206678) 0.9950599483254144
Stats: (204843/206678) 0.9911214546299074
...done.
Writing pos tagger model ... Compressed 29155 parameters to 23457
5067 outcome patterns
done (1,050s)
Wrote pos tagger model to
path: apache-opennlp-1.6.0/pos-pt_fsadic.model

Evaluate

We can evaluate again and verify that the accuracy did not change.

Code Block

language	bash

$ bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_fsadic.model -data portuguese_bosque_test.conll -encoding UTF-8
Loading POS Tagger model ... done (0,260s)
Evaluating ... done
Accuracy: 0.9648883586159878
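If the evaluator output of each run is saved to a file, the comparison can be scripted instead of read by eye. The sketch below assumes two such logs. The file names eval_xmldic.log and eval_fsadic.log and the helper names accuracy_of and same_accuracy are hypothetical and not part of OpenNLP.

```shell
# accuracy_of FILE: prints the value on the "Accuracy:" line of a saved
# POSTaggerEvaluator log.
accuracy_of() {
  awk '/^Accuracy:/ { print $2 }' "$1"
}

# same_accuracy LOG_A LOG_B: succeeds only if both logs contain an
# "Accuracy:" line and the two reported values are identical.
same_accuracy() {
  a=$(accuracy_of "$1")
  b=$(accuracy_of "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage, assuming the evaluator output was redirected to these
# hypothetical files.
if [ -f eval_xmldic.log ] && [ -f eval_fsadic.log ]; then
  if same_accuracy eval_xmldic.log eval_fsadic.log; then
    echo "accuracy unchanged: $(accuracy_of eval_fsadic.log)"
  else
    echo "accuracy differs"
  fi
fi
```

The comparison is textual on purpose: both runs should print exactly the same digits if the FSA dictionary holds the same entries as the XML one.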
