Note: the morfologik-addon is still under development.
Morfologik provides tools for finite state automata (FSA) construction and for building FSA-based morphological dictionaries.
The Morfologik Addon implements OpenNLP interfaces and extensions that allow the use of Morfologik FSA dictionary tools (a short usage sketch follows the list):
- opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory
- Extends: opennlp.tools.postag.POSTaggerFactory
- Helps create a POSTagger model with an embedded FSA-based TagDictionary
- opennlp.morfologik.tagdict.MorfologikTagDictionary
- Implements: opennlp.tools.postag.TagDictionary
- An FSA-based TagDictionary is much smaller than the default XML-based one and consumes less memory.
- opennlp.morfologik.lemmatizer.MorfologikLemmatizer
- Implements: opennlp.tools.lemmatizer.DictionaryLemmatizer
- A dictionary-based lemmatizer backed by an FSA dictionary.
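To make the role of these classes concrete, the sketch below shows how an embedded tag dictionary is consulted through the standard OpenNLP API. It is a minimal sketch, not part of the addon: it assumes a POS model with an embedded dictionary, such as the pos-pt_fsadic.model built later in this walkthrough, and the lookup word is only an example.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.TagDictionary;

public class TagDictionaryLookup {

    public static void main(String[] args) throws Exception {
        // Load a POS model that carries an embedded tag dictionary
        // (e.g. the pos-pt_fsadic.model built later in this walkthrough).
        try (InputStream in = new FileInputStream("pos-pt_fsadic.model")) {
            POSModel model = new POSModel(in);

            // The factory exposes the embedded dictionary through the generic
            // TagDictionary interface; with MorfologikPOSTaggerFactory this is
            // a MorfologikTagDictionary backed by an FSA.
            TagDictionary dict = model.getFactory().getTagDictionary();

            // "casa" is just an example token; getTags returns the tags the
            // dictionary allows for the word, or null if the word is unknown.
            String[] tags = dict.getTags("casa");
            System.out.println(tags == null ? "not in dictionary"
                                            : String.join(" ", tags));
        }
    }
}

Whether the dictionary behind the interface is the default XML implementation or the Morfologik FSA one is transparent to the caller.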
The addon also provides a command line interface with the following tools:
- MorfologikDictionaryBuilder
- builds a binary POS Dictionary using Morfologik
- XMLDictionaryToTable
- reads an OpenNLP XML tag dictionary and outputs it as a tab-separated file that can be compiled into an FSA dictionary
Addon Installation
Note: the addon is currently not available as a distributable and is not published to any public Maven repository.
The addon should be compiled and the result should be copied on top of an OpenNLP binary distribution.
To create the binary distribution execute:
svn co https://svn.apache.org/repos/asf/opennlp/addons/morfologik-addon
cd morfologik-addon
mvn package
The distribution will be target/apache-opennlp-morfologik-addon-1.0-SNAPSHOT-bin.zip
Example of usage
Embed an FSA-based dictionary in a POSModel
In this example we will use the free CoNLL-X Portuguese data to train a POS tagger with a tag dictionary and embed it as an FSA dictionary.
Download the Corpus
Download the Portuguese data from http://ilk.uvt.nl/conll/free_data.html
Portuguese train: portuguese_bosque_train.conll
Portuguese test: portuguese_bosque_test.conll
Train and evaluate a baseline model without dictionary
We start by training and evaluating a model without a tag dictionary.
# Train a POS Model using the train corpus. No dictionary.
$ bin/opennlp POSTaggerTrainer.conllx -type perceptron -lang pt -model pos-pt_nodic.model -data portuguese_bosque_train.conll -encoding UTF-8
Indexing events using cutoff of 5
Computing event counts...  done. 206678 events
Indexing...  done.
Collecting events... Done indexing.
Incorporating indexed data for training...  done.
Number of Event Tokens: 206678
    Number of Outcomes: 22
  Number of Predicates: 29155
Computing model parameters...
Performing 100 iterations.
  1:  . (191458/206678) 0.9263588770938368
  2:  . (197359/206678) 0.954910537164091
  3:  . (199143/206678) 0.9635423218726715
  4:  . (199932/206678) 0.9673598544595942
  5:  . (200688/206678) 0.9710177183831854
  6:  . (201205/206678) 0.9735191941087101
  7:  . (201657/206678) 0.9757061709519156
  8:  . (201980/206678) 0.9772689884748256
  9:  . (202236/206678) 0.9785076302267295
 10:  . (202493/206678) 0.9797511104229768
 20:  . (203958/206678) 0.9868394313860208
 30:  . (204511/206678) 0.989515091107907
 40:  . (204986/206678) 0.99181335217101
 50:  . (205151/206678) 0.9926116954876668
 60:  . (205344/206678) 0.9935455152459381
 70:  . (205453/206678) 0.994072905679366
 80:  . (205522/206678) 0.9944067583390588
 90:  . (205605/206678) 0.994808349219559
100:  . (205657/206678) 0.9950599483254144
Stats: (204843/206678) 0.9911214546299074
...done.
Writing pos tagger model ... Compressed 29155 parameters to 23457
5067 outcome patterns
done (0,996s)
Wrote pos tagger model to path: apache-opennlp-1.6.0/pos-pt_nodic.model

# Evaluate the model using the test corpus
$ bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_nodic.model -data portuguese_bosque_test.conll -encoding UTF-8
Loading POS Tagger model ... done (0,297s)
Evaluating ... done

Accuracy: 0.9609681268109767
Train and evaluate with a dictionary extracted from training data
Now we create a model with an embedded XML dictionary built from the training data. All entries that appear more than twice will be included (the tagDictCutoff parameter).
# Train a POS Model using the train corpus. Dictionary created from training data (cutoff 2)
$ bin/opennlp POSTaggerTrainer.conllx -type perceptron -lang pt -model pos-pt_xmldic.model -data portuguese_bosque_train.conll -encoding UTF-8 -tagDictCutoff 2
Expanding POS Dictionary ...
... finished expanding POS Dictionary. [702ms]
Indexing events using cutoff of 5
Computing event counts...  done. 206678 events
Indexing...  done.
Collecting events... Done indexing.
Incorporating indexed data for training...  done.
Number of Event Tokens: 206678
    Number of Outcomes: 22
  Number of Predicates: 29155
Computing model parameters...
Performing 100 iterations.
  1:  . (191458/206678) 0.9263588770938368
  2:  . (197359/206678) 0.954910537164091
  3:  . (199143/206678) 0.9635423218726715
  4:  . (199932/206678) 0.9673598544595942
  5:  . (200688/206678) 0.9710177183831854
  6:  . (201205/206678) 0.9735191941087101
  7:  . (201657/206678) 0.9757061709519156
  8:  . (201980/206678) 0.9772689884748256
  9:  . (202236/206678) 0.9785076302267295
 10:  . (202493/206678) 0.9797511104229768
 20:  . (203958/206678) 0.9868394313860208
 30:  . (204511/206678) 0.989515091107907
 40:  . (204986/206678) 0.99181335217101
 50:  . (205151/206678) 0.9926116954876668
 60:  . (205344/206678) 0.9935455152459381
 70:  . (205453/206678) 0.994072905679366
 80:  . (205522/206678) 0.9944067583390588
 90:  . (205605/206678) 0.994808349219559
100:  . (205657/206678) 0.9950599483254144
Stats: (204843/206678) 0.9911214546299074
...done.
Writing pos tagger model ... Compressed 29155 parameters to 23457
5067 outcome patterns
done (1,517s)
Wrote pos tagger model to path: apache-opennlp-1.6.0/pos-pt_xmldic.model

# Evaluate the model using the test corpus
$ bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_xmldic.model -data portuguese_bosque_test.conll -encoding UTF-8
Loading POS Tagger model ... done (0,712s)
Evaluating ... done

Accuracy: 0.9648883586159878
Note how the accuracy improved from 96.097% to 96.489% after including the dictionary, demonstrating its benefit.
Create an FSA Dictionary from the XML Tag Dictionary
Now we extract the XML tag dictionary from the model and convert it to an FSA dictionary.
# Extract the model to get the tag dictionary
$ unzip pos-pt_xmldic.model -d pos-pt_xmldic

# Take a look at the file size
$ ls -alh pos-pt_xmldic
total 3464
drwxr-xr-x   5 colen  staff   170B 11 Jul 23:23 .
drwxr-xr-x  16 colen  staff   544B 11 Jul 23:23 ..
-rw-r--r--   1 colen  staff   306B 11 Jul 21:03 manifest.properties
-rw-r--r--   1 colen  staff   1,1M 11 Jul 21:03 pos.model
-rw-r--r--   1 colen  staff   554K 11 Jul 21:03 tags.tagdict

# Convert the tags.tagdict to a table-like dictionary and .info file to be consumed by MorfologikDictionaryBuilder
$ bin/morfologik-addon XMLDictionaryToTable -inputFile pos-pt_xmldic/tags.tagdict -outputFile pt-morfologik.txt -separator + -encoder prefix -encoding UTF-8
Created dictionary: pt-morfologik.txt
Created metadata: pt-morfologik.txt

# Create the FSA Dictionary
$ bin/morfologik-addon MorfologikDictionaryBuilder -inputFile pt-morfologik.txt -encoding UTF-8
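If you want to sanity-check the generated binary dictionary programmatically, it can be read with the morfologik-stemming library that the addon builds on. The following is a minimal sketch, assuming morfologik-stemming 2.x is on the classpath and that the .info metadata file produced above sits next to pt-morfologik.dict; the lookup word is only an example.

import java.nio.file.Paths;
import java.util.List;

import morfologik.stemming.Dictionary;
import morfologik.stemming.DictionaryLookup;
import morfologik.stemming.WordData;

public class InspectFsaDictionary {

    public static void main(String[] args) throws Exception {
        // Dictionary.read expects the .info metadata file next to the .dict file.
        Dictionary dictionary = Dictionary.read(Paths.get("pt-morfologik.dict"));
        DictionaryLookup lookup = new DictionaryLookup(dictionary);

        // "casa" is just an example token; each WordData entry carries the
        // columns stored for the word (for this tag dictionary, the POS tags).
        List<WordData> entries = lookup.lookup("casa");
        for (WordData entry : entries) {
            System.out.println(entry.getWord() + "\t" + entry.getStem()
                    + "\t" + entry.getTag());
        }
    }
}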
Comparing sizes
We can now compare the size of the XML dictionary and the FSA dictionary:
$ ls -alh pt-morfologik.dict pos-pt_xmldic/tags.tagdict
-rw-r--r--  1 colen  staff   554K 11 Jul 21:03 pos-pt_xmldic/tags.tagdict
-rw-r--r--  1 colen  staff    89K 12 Jul 09:33 pt-morfologik.dict
The FSA dictionary is much smaller than the original XML version. It also behaves better at runtime: the XML dictionary is loaded into a hash map, while the FSA dictionary is kept as a compact finite state automaton, which consumes far less memory.
Train a POS model with the FSA dictionary
We can use MorfologikPOSTaggerFactory to create a POS model with an embedded FSA dictionary:
$ bin/opennlp-cp POSTaggerTrainer.conllx -type perceptron -lang pt -model pos-pt_fsadic.model -data portuguese_bosque_train.conll -encoding UTF-8 -factory opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory -dict pt-morfologik.dict
Indexing events using cutoff of 5
Computing event counts...  done. 206678 events
Indexing...  done.
Collecting events... Done indexing.
Incorporating indexed data for training...  done.
Number of Event Tokens: 206678
    Number of Outcomes: 22
  Number of Predicates: 29155
Computing model parameters...
Performing 100 iterations.
  1:  . (191458/206678) 0.9263588770938368
  2:  . (197359/206678) 0.954910537164091
  3:  . (199143/206678) 0.9635423218726715
  4:  . (199932/206678) 0.9673598544595942
  5:  . (200688/206678) 0.9710177183831854
  6:  . (201205/206678) 0.9735191941087101
  7:  . (201657/206678) 0.9757061709519156
  8:  . (201980/206678) 0.9772689884748256
  9:  . (202236/206678) 0.9785076302267295
 10:  . (202493/206678) 0.9797511104229768
 20:  . (203958/206678) 0.9868394313860208
 30:  . (204511/206678) 0.989515091107907
 40:  . (204986/206678) 0.99181335217101
 50:  . (205151/206678) 0.9926116954876668
 60:  . (205344/206678) 0.9935455152459381
 70:  . (205453/206678) 0.994072905679366
 80:  . (205522/206678) 0.9944067583390588
 90:  . (205605/206678) 0.994808349219559
100:  . (205657/206678) 0.9950599483254144
Stats: (204843/206678) 0.9911214546299074
...done.
Writing pos tagger model ... Compressed 29155 parameters to 23457
5067 outcome patterns
done (1,050s)
Wrote pos tagger model to path: apache-opennlp-1.6.0/pos-pt_fsadic.model
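Once trained, the model is used like any other OpenNLP POS model; the FSA dictionary is loaded transparently. Below is a minimal tagging sketch, assuming the pos-pt_fsadic.model file from the step above is in the working directory; the sample sentence is only an illustration.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

public class TagWithFsaModel {

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("pos-pt_fsadic.model")) {
            POSModel model = new POSModel(in);
            POSTaggerME tagger = new POSTaggerME(model);

            // A small, already tokenized Portuguese sentence used as an example.
            String[] tokens = {"A", "casa", "é", "bonita", "."};
            String[] tags = tagger.tag(tokens);

            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "\t" + tags[i]);
            }
        }
    }
}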
Evaluate
We can evaluate again and verify that the accuracy did not change.
$ bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_fsadic.model -data portuguese_bosque_test.conll -encoding UTF-8
Loading POS Tagger model ... done (0,260s)
Evaluating ... done

Accuracy: 0.9648883586159878
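The identical accuracy is expected: the two models differ only in how the tag dictionary is stored. A quick way to convince yourself is to compare what the embedded dictionaries of both models return for the same word. The sketch below assumes both model files from the previous steps are present; the probe word is only an example.

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.TagDictionary;

public class CompareTagDictionaries {

    static TagDictionary loadDictionary(String modelFile) throws Exception {
        // POSModel reads the whole model in the constructor, so the stream
        // can be closed right after loading.
        try (InputStream in = new FileInputStream(modelFile)) {
            return new POSModel(in).getFactory().getTagDictionary();
        }
    }

    public static void main(String[] args) throws Exception {
        TagDictionary xmlDict = loadDictionary("pos-pt_xmldic.model");
        TagDictionary fsaDict = loadDictionary("pos-pt_fsadic.model");

        // Probe word is only an example; both dictionaries should allow
        // the same set of tags for the words they contain.
        String word = "casa";
        System.out.println("XML: " + Arrays.toString(xmlDict.getTags(word)));
        System.out.println("FSA: " + Arrays.toString(fsaDict.getTags(word)));
    }
}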