...
Train and evaluate a baseline model without a dictionary
We start by training and evaluating a model without a tag dictionary.
    # Train a POS model using the training corpus. No dictionary.
    $ bin/opennlp POSTaggerTrainer.conllx -type perceptron -lang pt -model pos-pt_nodic.model -data portuguese_bosque_train.conll -encoding UTF-8
    Indexing events using cutoff of 5
    Computing event counts... done. 206678 events
    Indexing... done.
    Collecting events... Done indexing.
    Incorporating indexed data for training... done.
    Number of Event Tokens: 206678
    Number of Outcomes: 22
    Number of Predicates: 29155
    Computing model parameters...
    Performing 100 iterations.
    1: . (191458/206678) 0.9263588770938368
    2: . (197359/206678) 0.954910537164091
    3: . (199143/206678) 0.9635423218726715
    4: . (199932/206678) 0.9673598544595942
    5: . (200688/206678) 0.9710177183831854
    6: . (201205/206678) 0.9735191941087101
    7: . (201657/206678) 0.9757061709519156
    8: . (201980/206678) 0.9772689884748256
    9: . (202236/206678) 0.9785076302267295
    10: . (202493/206678) 0.9797511104229768
    20: . (203958/206678) 0.9868394313860208
    30: . (204511/206678) 0.989515091107907
    40: . (204986/206678) 0.99181335217101
    50: . (205151/206678) 0.9926116954876668
    60: . (205344/206678) 0.9935455152459381
    70: . (205453/206678) 0.994072905679366
    80: . (205522/206678) 0.9944067583390588
    90: . (205605/206678) 0.994808349219559
    100: . (205657/206678) 0.9950599483254144
    Stats: (204843/206678) 0.9911214546299074
    ...done.
    Writing pos tagger model ... Compressed 29155 parameters to 23457
    5067 outcome patterns
    done (0,996s)
    Wrote pos tagger model to path: apache-opennlp-1.6.0/pos-pt_nodic.model

    # Check the size of the resulting model
    $ ls -lah pos-pt_nodic.model
    -rw-r--r-- 1 colen staff 629K 8 Jul 01:09 pos-pt_nodic.model

    # Evaluate the model using the test corpus
    $ bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_nodic.model -data portuguese_bosque_test.conll -encoding UTF-8
    Loading POS Tagger model ... done (0,297s)
    Evaluating ... done
    Accuracy: 0.9609681268109767
...
Train and evaluate with a dictionary extracted from training data
Now we create a model with an embedded XML dictionary built from the training data. Entries are included based on how often they appear in the training data, as set by the tagDictCutoff parameter (2 here).
    # Train a POS model using the training corpus. Dictionary created from training data (cutoff 2).
    $ bin/opennlp POSTaggerTrainer.conllx -type perceptron -lang pt -model pos-pt_xmldic.model -data portuguese_bosque_train.conll -encoding UTF-8 -tagDictCutoff 2
    Expanding POS Dictionary ...
    ... finished expanding POS Dictionary. [702ms]
    Indexing events using cutoff of 5
    Computing event counts... done. 206678 events
    Indexing... done.
    Collecting events... Done indexing.
    Incorporating indexed data for training... done.
    Number of Event Tokens: 206678
    Number of Outcomes: 22
    Number of Predicates: 29155
    Computing model parameters...
    Performing 100 iterations.
    1: . (191458/206678) 0.9263588770938368
    2: . (197359/206678) 0.954910537164091
    3: . (199143/206678) 0.9635423218726715
    4: . (199932/206678) 0.9673598544595942
    5: . (200688/206678) 0.9710177183831854
    6: . (201205/206678) 0.9735191941087101
    7: . (201657/206678) 0.9757061709519156
    8: . (201980/206678) 0.9772689884748256
    9: . (202236/206678) 0.9785076302267295
    10: . (202493/206678) 0.9797511104229768
    20: . (203958/206678) 0.9868394313860208
    30: . (204511/206678) 0.989515091107907
    40: . (204986/206678) 0.99181335217101
    50: . (205151/206678) 0.9926116954876668
    60: . (205344/206678) 0.9935455152459381
    70: . (205453/206678) 0.994072905679366
    80: . (205522/206678) 0.9944067583390588
    90: . (205605/206678) 0.994808349219559
    100: . (205657/206678) 0.9950599483254144
    Stats: (204843/206678) 0.9911214546299074
    ...done.
    Writing pos tagger model ... Compressed 29155 parameters to 23457
    5067 outcome patterns
    done (1,517s)
    Wrote pos tagger model to path: apache-opennlp-1.6.0/pos-pt_xmldic.model

    # Check the size of the resulting model
    $ ls -lah pos-pt_xmldic.model
    -rw-r--r-- 1 colen staff 839K 8 Jul 01:24 pos-pt_xmldic.model

    # Evaluate the model using the test corpus
    $ bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_xmldic.model -data portuguese_bosque_test.conll -encoding UTF-8
    Loading POS Tagger model ... done (0,712s)
    Evaluating ... done
    Accuracy: 0.9648883586159878
Accuracy improved from 96.097% to 96.489% after including the dictionary, a gain of about 0.39 percentage points on this test corpus, which shows that the tag dictionary helps.
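The gain can be recomputed from the two accuracy values printed by the evaluator. The snippet below is a minimal sketch, assuming a POSIX shell with awk; the variable names are illustrative, and the relative error reduction is an extra derived figure that the original page does not report.

```shell
# Accuracy values copied from the two evaluator runs above.
baseline=0.9609681268109767
with_dict=0.9648883586159878

# LC_ALL=C forces a dot as the decimal separator regardless of the system locale.
gain=$(LC_ALL=C awk -v a="$baseline" -v b="$with_dict" \
  'BEGIN { printf "%.3f", (b - a) * 100 }')

# Share of the baseline's errors that the dictionary model removes.
error_reduction=$(LC_ALL=C awk -v a="$baseline" -v b="$with_dict" \
  'BEGIN { printf "%.1f", (b - a) / (1 - a) * 100 }')

echo "Accuracy gain: ${gain} percentage points"            # prints 0.392
echo "Relative error reduction: ${error_reduction}%"       # prints 10.0%
```

Read this way, the dictionary removes roughly one in ten of the baseline model's tagging errors on this corpus.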
Create an FSA Dictionary from the XML Tag Dictionary
Now we extract the XML tag dictionary from the model and convert it to an FSA dictionary.
    # Extract the model to get the tag dictionary
    $ unzip pos-pt_xmldic.model -d pos-pt_xmldic

    # Take a look at the file sizes
    $ ls -alh pos-pt_xmldic
    total 3464
    drwxr-xr-x 5 colen staff 170B 11 Jul 23:23 .
    drwxr-xr-x 16 colen staff 544B 11 Jul 23:23 ..
    -rw-r--r-- 1 colen staff 306B 11 Jul 21:03 manifest.properties
    -rw-r--r-- 1 colen staff 1,1M 11 Jul 21:03 pos.model
    -rw-r--r-- 1 colen staff 554K 11 Jul 21:03 tags.tagdict

    # Convert tags.tagdict to a table-like dictionary and .info file to be consumed by MorfologikDictionaryBuilder
    $ bin/morfologik-addon XMLDictionaryToTable -inputFile pos-pt_xmldic/tags.tagdict -outputFile pt-morfologik.txt -separator + -encoder prefix -encoding UTF-8
    Created dictionary: pt-morfologik.txt
    Created metadata: pt-morfologik.txt

    # Create the FSA dictionary
    $ bin/morfologik-addon MorfologikDictionaryBuilder -inputFile pt-morfologik.txt -encoding UTF-8
Comparing sizes
We can now compare the size of the XML dictionary and the FSA dictionary:
    $ ls -alh pt-morfologik.dict pos-pt_xmldic/tags.tagdict
    -rw-r--r-- 1 colen staff 554K 11 Jul 21:03 pos-pt_xmldic/tags.tagdict
    -rw-r--r-- 1 colen staff 89K 12 Jul 09:33 pt-morfologik.dict
The FSA dictionary is roughly six times smaller than the original XML version (89K versus 554K). It also performs better at runtime, because the XML dictionary is loaded into a hash table, while the FSA dictionary is loaded as a finite state automaton.
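To put a number on the size difference rather than reading it off the ls listing, the byte counts can be divided directly. This is a minimal sketch, assuming a POSIX shell with wc and awk; size_ratio is a hypothetical helper name and is not part of OpenNLP or the Morfologik addon.

```shell
# size_ratio BIG SMALL: print how many times larger BIG is than SMALL, by byte count.
size_ratio() {
  # The arithmetic expansion strips the padding some wc implementations emit.
  big_bytes=$(( $(wc -c < "$1") ))
  small_bytes=$(( $(wc -c < "$2") ))
  if [ "$small_bytes" -eq 0 ]; then
    echo "size_ratio: $2 is empty" >&2
    return 1
  fi
  LC_ALL=C awk -v b="$big_bytes" -v s="$small_bytes" 'BEGIN { printf "%.1f\n", b / s }'
}

# Only run the comparison when both dictionaries from the steps above exist.
if [ -f pos-pt_xmldic/tags.tagdict ] && [ -f pt-morfologik.dict ]; then
  size_ratio pos-pt_xmldic/tags.tagdict pt-morfologik.dict
fi
```

With the file sizes listed above, the ratio should come out at around 6.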
Train a POS model with the FSA dictionary
We can use MorfologikPOSTaggerFactory to create a POS model with an embedded FSA dictionary:
    $ bin/opennlp-cp POSTaggerTrainer.conllx -type perceptron -lang pt -model pos-pt_fsadic.model -data portuguese_bosque_train.conll -encoding UTF-8 -factory opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory -dict pt-morfologik.dict
    Indexing events using cutoff of 5
    Computing event counts... done. 206678 events
    Indexing... done.
    Collecting events... Done indexing.
    Incorporating indexed data for training... done.
    Number of Event Tokens: 206678
    Number of Outcomes: 22
    Number of Predicates: 29155
    Computing model parameters...
    Performing 100 iterations.
    1: . (191458/206678) 0.9263588770938368
    2: . (197359/206678) 0.954910537164091
    3: . (199143/206678) 0.9635423218726715
    4: . (199932/206678) 0.9673598544595942
    5: . (200688/206678) 0.9710177183831854
    6: . (201205/206678) 0.9735191941087101
    7: . (201657/206678) 0.9757061709519156
    8: . (201980/206678) 0.9772689884748256
    9: . (202236/206678) 0.9785076302267295
    10: . (202493/206678) 0.9797511104229768
    20: . (203958/206678) 0.9868394313860208
    30: . (204511/206678) 0.989515091107907
    40: . (204986/206678) 0.99181335217101
    50: . (205151/206678) 0.9926116954876668
    60: . (205344/206678) 0.9935455152459381
    70: . (205453/206678) 0.994072905679366
    80: . (205522/206678) 0.9944067583390588
    90: . (205605/206678) 0.994808349219559
    100: . (205657/206678) 0.9950599483254144
    Stats: (204843/206678) 0.9911214546299074
    ...done.
    Writing pos tagger model ... Compressed 29155 parameters to 23457
    5067 outcome patterns
    done (1,050s)
    Wrote pos tagger model to path: apache-opennlp-1.6.0/pos-pt_fsadic.model
Evaluate
We can evaluate again and verify that the accuracy did not change.
    $ bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_fsadic.model -data portuguese_bosque_test.conll -encoding UTF-8
    Loading POS Tagger model ... done (0,260s)
    Evaluating ... done
    Accuracy: 0.9648883586159878
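The check that accuracy did not change can be scripted instead of compared by eye. The sketch below assumes the evaluator prints a line of the form "Accuracy: <number>", as in the outputs on this page; extract_accuracy is a hypothetical helper, and the two sample outputs are copied from the runs above rather than produced by calling the evaluator.

```shell
# extract_accuracy: read evaluator output on stdin, print the number after "Accuracy:".
extract_accuracy() {
  grep -o 'Accuracy: [0-9.]*' | awk '{ print $2 }'
}

# Evaluator output captured from the XML-dictionary and FSA-dictionary runs above.
xml_output='Loading POS Tagger model ... done (0,712s)
Evaluating ... done
Accuracy: 0.9648883586159878'
fsa_output='Loading POS Tagger model ... done (0,260s)
Evaluating ... done
Accuracy: 0.9648883586159878'

xml_acc=$(printf '%s\n' "$xml_output" | extract_accuracy)
fsa_acc=$(printf '%s\n' "$fsa_output" | extract_accuracy)

# Compare as strings so no floating-point rounding is involved.
if [ "$xml_acc" = "$fsa_acc" ]; then
  echo "Accuracy unchanged: $fsa_acc"   # prints: Accuracy unchanged: 0.9648883586159878
else
  echo "Accuracy differs: XML=$xml_acc FSA=$fsa_acc"
fi
```

In a real run, the evaluator command could be piped straight into extract_accuracy in place of the captured strings.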