...
Code Block | ||
---|---|---|
| ||
# Extract the model to get the tag dictionary
$ unzip pos-pt_xmldic.model -d pos-pt_xmldic
# Take a look at the file size
$ ls -alh pos-pt_xmldic
total 3464
drwxr-xr-x 5 colen staff 170B 11 Jul 23:23 .
drwxr-xr-x 16 colen staff 544B 11 Jul 23:23 ..
-rw-r--r-- 1 colen staff 306B 11 Jul 21:03 manifest.properties
-rw-r--r-- 1 colen staff 1,1M 11 Jul 21:03 pos.model
-rw-r--r-- 1 colen staff 554K 11 Jul 21:03 tags.tagdict
# Convert the tags.tagdict to a table-like dictionary and .info file to be consumed by MorfologikDictionaryBuilder
$ bin/morfologik-addon XMLDictionaryToTable -inputFile pos-pt_xmldic/tags.tagdict -outputFile pt-morfologik.txt -separator + -encoder prefix -encoding UTF-8
Created dictionary: pt-morfologik.txt
Created metadata: pt-morfologik.info
# Create the FSA Dictionary
$ bin/morfologik-addon MorfologikDictionaryBuilder -inputFile pt-morfologik.txt -encoding UTF-8
Comparing sizes
We can now compare the size of the XML dictionary and the FSA dictionary:
Code Block | ||
---|---|---|
| ||
$ ls -alh pt-morfologik.dict pos-pt_xmldic/tags.tagdict
-rw-r--r-- 1 colen staff 554K 11 Jul 21:03 pos-pt_xmldic/tags.tagdict
-rw-r--r-- 1 colen staff 89K 12 Jul 09:33 pt-morfologik.dict
$ ls -lah pos-pt_xmldic.model
-rw-r--r-- 1 colen staff 839K 8 Jul 01:24 pos-pt_nodic.model
# Evaluate
$ bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_xmldic.model -data portuguese_bosque_test.conll -encoding UTF-8
Accuracy: 0.9676154763933867
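The size comparison can also be expressed as a ratio. The following is a minimal sketch, not part of the addon: `size_ratio` is a hypothetical helper name, and it assumes both files exist and the second one is not empty.

```shell
#!/bin/sh
# Hypothetical helper: print how many times larger the first file is than the second.
# Assumes both files exist and the second file is not empty.
size_ratio() {
    big=$(wc -c < "$1")     # size of the first file in bytes
    small=$(wc -c < "$2")   # size of the second file in bytes
    awk -v b="$big" -v s="$small" 'BEGIN { printf "%.1f\n", b / s }'
}

# Usage with the files from the listing above:
# size_ratio pos-pt_xmldic/tags.tagdict pt-morfologik.dict
```

Working from byte counts avoids parsing the human-readable sizes that `ls -h` prints.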
...
The FSA dictionary is much smaller than the original XML version: 89K versus 554K, roughly a sixth of the size. It also performs better at runtime, because the XML dictionary is loaded into a hash, while the FSA dictionary is loaded as a finite-state automaton.
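As a rough illustration of why an automaton can be more compact than storing every entry separately, the sketch below counts the characters needed to store a word list flat against the number of nodes in a prefix-sharing trie. It is a simplification and not how the addon is implemented: a trie only shares prefixes, while a minimized automaton can also merge common suffixes. The function name `count_trie_nodes` and the sample words are hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration: flat character count versus prefix-sharing trie nodes.
# Reads one word per line on stdin; prints "<total_chars> <trie_nodes>".
count_trie_nodes() {
    awk '
    {
        total += length($0)              # flat storage keeps every character
        for (i = 1; i <= length($0); i++) {
            prefix = substr($0, 1, i)
            if (!(prefix in seen)) {     # each distinct prefix is one trie node
                seen[prefix] = 1
                nodes++
            }
        }
    }
    END { printf "%d %d\n", total, nodes }
    '
}

# Four word forms that share the prefix "cas"
printf 'casa\ncasas\ncaso\ncasos\n' | count_trie_nodes
# prints "18 7"
```

The four words need 18 characters when stored separately, but only 7 nodes once shared prefixes are stored once.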
Train a POS model with the FSA dictionary
We can use MorfologikPOSTaggerFactory to create a POS model with an embedded FSA dictionary:
Code Block | ||
---|---|---|
| ||
$ bin/opennlp-cp POSTaggerTrainer.conllx -type perceptron -lang pt -model pos-pt_fsadic.model -data portuguese_bosque_train.conll -encoding UTF-8 -factory opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory -dict pt-morfologik.dict
Indexing events using cutoff of 5
Computing event counts... done. 206678 events
Indexing... done.
Collecting events... Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 206678
Number of Outcomes: 22
Number of Predicates: 29155
Computing model parameters...
Performing 100 iterations.
1: . (191458/206678) 0.9263588770938368
2: . (197359/206678) 0.954910537164091
3: . (199143/206678) 0.9635423218726715
4: . (199932/206678) 0.9673598544595942
5: . (200688/206678) 0.9710177183831854
6: . (201205/206678) 0.9735191941087101
7: . (201657/206678) 0.9757061709519156
8: . (201980/206678) 0.9772689884748256
9: . (202236/206678) 0.9785076302267295
10: . (202493/206678) 0.9797511104229768
20: . (203958/206678) 0.9868394313860208
30: . (204511/206678) 0.989515091107907
40: . (204986/206678) 0.99181335217101
50: . (205151/206678) 0.9926116954876668
60: . (205344/206678) 0.9935455152459381
70: . (205453/206678) 0.994072905679366
80: . (205522/206678) 0.9944067583390588
90: . (205605/206678) 0.994808349219559
100: . (205657/206678) 0.9950599483254144
Stats: (204843/206678) 0.9911214546299074
...done.
Writing pos tagger model ... Compressed 29155 parameters to 23457
5067 outcome patterns
done (1,050s)
Wrote pos tagger model to path: apache-opennlp-1.6.0/pos-pt_fsadic.model
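To follow how training accuracy evolves, the per-iteration lines can be pulled out of the trainer output. This is a small sketch under assumptions: `iteration_accuracy` is a hypothetical helper, and it assumes each iteration is printed on its own line in the form shown in the log above (iteration number followed by a colon, accuracy as the last field).

```shell
#!/bin/sh
# Hypothetical helper: print "<iteration> <accuracy>" for each iteration line on stdin.
# Assumes lines of the form "10: . (202493/206678) 0.9797511104229768".
iteration_accuracy() {
    awk '$1 ~ /^[0-9]+:$/ && $NF ~ /^[0-9.]+$/ {
        sub(/:$/, "", $1)    # drop the trailing colon from the iteration number
        print $1, $NF        # the last field is the training accuracy
    }'
}

# Example with two lines copied from the training log
printf '1: . (191458/206678) 0.9263588770938368\n100: . (205657/206678) 0.9950599483254144\n' | iteration_accuracy
# prints:
# 1 0.9263588770938368
# 100 0.9950599483254144
```

Lines that do not start with an iteration number, such as the `Stats:` summary, are skipped.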
Evaluate
We can evaluate again and verify that the accuracy did not change.
Code Block | ||
---|---|---|
| ||
$ bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_fsadic.model -data portuguese_bosque_test.conll -encoding UTF-8
Loading POS Tagger model ... done (0,260s)
Evaluating ... done
Accuracy: 0.9648883586159878
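Comparing two evaluation runs by eye is error-prone, so the check can be scripted. The sketch below is an assumption-laden example, not part of OpenNLP: `extract_accuracy` and `accuracy_close` are hypothetical helpers, the evaluator output is assumed to contain a line starting with `Accuracy:`, and the default tolerance of 0.005 is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "Accuracy:" line read from stdin.
extract_accuracy() {
    awk '$1 == "Accuracy:" { print $2 }'
}

# Hypothetical helper: succeed if two accuracies differ by at most a tolerance.
# Usage: accuracy_close <a> <b> [tolerance]   (tolerance defaults to 0.005)
accuracy_close() {
    awk -v a="$1" -v b="$2" -v tol="${3:-0.005}" 'BEGIN {
        d = a - b
        if (d < 0) d = -d      # absolute difference
        if (d <= tol) exit 0
        exit 1
    }'
}

# Example with the evaluator output shown above
printf 'Evaluating ... done\nAccuracy: 0.9648883586159878\n' | extract_accuracy
# prints "0.9648883586159878"
```

In practice the evaluator output of each model would be piped into `extract_accuracy`, and the two values passed to `accuracy_close`.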