...

Code Block
languagebash
# Extract the model to get the tag dictionary
 
$ unzip pos-pt_xmldic.model -d pos-pt_xmldic
 
# Take a look at the file size
$ ls -alh pos-pt_xmldic
total 3464
drwxr-xr-x 5 colen staff 170B 11 Jul 23:23 .
drwxr-xr-x 16 colen staff 544B 11 Jul 23:23 ..
-rw-r--r-- 1 colen staff 306B 11 Jul 21:03 manifest.properties
-rw-r--r-- 1 colen staff 1,1M 11 Jul 21:03 pos.model
-rw-r--r-- 1 colen staff 554K 11 Jul 21:03 tags.tagdict
 
# Convert the tags.tagdict to a table-like dictionary and .info file to be consumed by MorfologikDictionaryBuilder
$ bin/morfologik-addon XMLDictionaryToTable -inputFile pos-pt_xmldic/tags.tagdict -outputFile pt-morfologik.txt -separator + -encoder prefix -encoding UTF-8
Created dictionary: pt-morfologik.txt
Created metadata: pt-morfologik.txt
 
# Create the FSA Dictionary
$ bin/morfologik-addon MorfologikDictionaryBuilder -inputFile pt-morfologik.txt -encoding UTF-8
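The two conversion commands above can be chained in a small helper. This is only a sketch: the function name `build_fsa_dict` and the `MORFOLOGIK_ADDON` variable are hypothetical, and the flags are copied unchanged from the commands shown above.

```shell
# build_fsa_dict TAGDICT BASENAME
# Runs the two conversion steps in order and stops if the first one fails.
# MORFOLOGIK_ADDON (hypothetical) overrides the launcher path; it defaults
# to bin/morfologik-addon as used in this page.
build_fsa_dict() {
  tagdict="$1"
  base="$2"
  addon="${MORFOLOGIK_ADDON:-bin/morfologik-addon}"

  # Step 1: XML tag dictionary -> table-like text dictionary
  "$addon" XMLDictionaryToTable -inputFile "$tagdict" \
    -outputFile "$base.txt" -separator + -encoder prefix -encoding UTF-8 || return 1

  # Step 2: text dictionary -> FSA dictionary
  "$addon" MorfologikDictionaryBuilder -inputFile "$base.txt" -encoding UTF-8
}

# Example call (assumes the model was extracted to pos-pt_xmldic):
#   build_fsa_dict pos-pt_xmldic/tags.tagdict pt-morfologik
```

Stopping after a failed first step avoids building an FSA dictionary from a missing or partial text file.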

Comparing sizes

We can now compare the size of the XML dictionary and the FSA dictionary:

Code Block
languagebash
$ ls -alh pt-morfologik.dict pos-pt_xmldic/tags.tagdict
-rw-r--r--  1 colen  staff   554K 11 Jul 21:03 pos-pt_xmldic/tags.tagdict
-rw-r--r--  1 colen  staff    89K 12 Jul 09:33 pt-morfologik.dict

...

The FSA dictionary is much smaller than the original XML version: 89K against 554K, roughly a sixth of the size. It also performs much better at runtime, because the XML dictionary is loaded into a hash table, while the FSA dictionary is loaded as a finite-state automaton.
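To put a number on the size difference, the ratio can be computed directly. This is a minimal sketch: the helper name `size_ratio` is hypothetical, and the file paths are the ones used in this page.

```shell
# size_ratio BIG SMALL: print BIG/SMALL rounded to one decimal place.
size_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

xml=pos-pt_xmldic/tags.tagdict
fsa=pt-morfologik.dict

# Only compare when both files are present; wc -c reports sizes in bytes.
if [ -f "$xml" ] && [ -f "$fsa" ]; then
  echo "XML/FSA size ratio: $(size_ratio "$(wc -c < "$xml")" "$(wc -c < "$fsa")")"
fi
```

With the rounded sizes from the listing, `size_ratio 554 89` prints `6.2`.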

Train a POS model with the FSA dictionary

We can use MorfologikPOSTaggerFactory to create a POS model with an embedded FSA dictionary:

Code Block
languagebash
$ bin/opennlp-cp POSTaggerTrainer.conllx -type perceptron -lang pt -model pos-pt_fsadic.model -data portuguese_bosque_train.conll -encoding UTF-8 -factory opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory -dict pt-morfologik.dict
Indexing events using cutoff of 5
	Computing event counts...  done. 206678 events
	Indexing...  done.
Collecting events... Done indexing.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 206678
	    Number of Outcomes: 22
	  Number of Predicates: 29155
Computing model parameters...
Performing 100 iterations.
  1:  . (191458/206678) 0.9263588770938368
  2:  . (197359/206678) 0.954910537164091
  3:  . (199143/206678) 0.9635423218726715
  4:  . (199932/206678) 0.9673598544595942
  5:  . (200688/206678) 0.9710177183831854
  6:  . (201205/206678) 0.9735191941087101
  7:  . (201657/206678) 0.9757061709519156
  8:  . (201980/206678) 0.9772689884748256
  9:  . (202236/206678) 0.9785076302267295
 10:  . (202493/206678) 0.9797511104229768
 20:  . (203958/206678) 0.9868394313860208
 30:  . (204511/206678) 0.989515091107907
 40:  . (204986/206678) 0.99181335217101
 50:  . (205151/206678) 0.9926116954876668
 60:  . (205344/206678) 0.9935455152459381
 70:  . (205453/206678) 0.994072905679366
 80:  . (205522/206678) 0.9944067583390588
 90:  . (205605/206678) 0.994808349219559
100:  . (205657/206678) 0.9950599483254144
Stats: (204843/206678) 0.9911214546299074
...done.
Writing pos tagger model ... Compressed 29155 parameters to 23457
5067 outcome patterns
done (1,050s)
Wrote pos tagger model to
path: apache-opennlp-1.6.0/pos-pt_fsadic.model

Evaluate

We can evaluate again and verify that the accuracy did not change.

Code Block
languagebash
$ bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_fsadic.model -data portuguese_bosque_test.conll -encoding UTF-8
Loading POS Tagger model ... done (0,260s)
Evaluating ... done
Accuracy: 0.9648883586159878
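The comparison between two evaluation runs can be scripted by pulling the `Accuracy:` line out of the evaluator output. The sketch below assumes the evaluator prints that line in the form shown above. The helper names `extract_accuracy` and `same_accuracy`, and the tolerance argument, are hypothetical.

```shell
# extract_accuracy: read evaluator output on stdin and print the value
# that follows "Accuracy:".
extract_accuracy() {
  awk '$1 == "Accuracy:" { print $2 }'
}

# same_accuracy A B TOL: succeed when |A - B| <= TOL.
# TOL is an illustrative threshold chosen by the caller.
same_accuracy() {
  awk -v a="$1" -v b="$2" -v t="$3" \
    'BEGIN { d = a - b; if (d < 0) d = -d; exit !(d <= t) }'
}

# Example (assumes the evaluator output goes to standard output):
#   acc=$(bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_fsadic.model \
#         -data portuguese_bosque_test.conll -encoding UTF-8 | extract_accuracy)
```

Comparing with a tolerance rather than exact equality leaves room for small differences between training runs.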
