
Note: the morfologik-addon is still under development.

Morfologik provides tools for constructing finite state automata (FSA) and for building FSA-backed morphological dictionaries.

The Morfologik Addon implements OpenNLP interfaces and extensions so that the Morfologik FSA dictionary tools can be used from OpenNLP:

  • opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory
    • Extends: opennlp.tools.postag.POSTaggerFactory
    • Helps create a POSTagger model with an embedded FSA-based TagDictionary
  • opennlp.morfologik.tagdict.MorfologikTagDictionary
    • Implements: opennlp.tools.postag.TagDictionary
    • An FSA-based TagDictionary is much smaller than the default XML-based one and consumes less memory.
  • opennlp.morfologik.lemmatizer.MorfologikLemmatizer
    • Implements: opennlp.tools.lemmatizer.DictionaryLemmatizer
    • A dictionary-based lemmatizer that uses an FSA dictionary.

The addon also provides a command line interface with the following tools:

  • MorfologikDictionaryBuilder
    • builds a binary POS dictionary using Morfologik
  • XMLDictionaryToTable
    • reads an OpenNLP XML tag dictionary and writes it to a tab-separated file that can be built into an FSA dictionary

Addon Installation

Note: the addon is currently not available as a distributable package and is not published in any public Maven repository.

The addon must be compiled, and the resulting distribution copied on top of an OpenNLP binary distribution.

To create the binary distribution execute:

svn co https://svn.apache.org/repos/asf/opennlp/addons/morfologik-addon
cd morfologik-addon
mvn package

The distribution is written to target/apache-opennlp-morfologik-addon-1.0-SNAPSHOT-bin.zip
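
The copy step described above can be scripted. The following is only a sketch: `overlay_addon` is a hypothetical helper, and it assumes the zip has already been extracted into a directory whose contents mirror the layout of the OpenNLP distribution (for example `bin/` and `lib/`). The actual archive layout should be checked before relying on it.

```shell
# Hypothetical helper: copy an extracted addon distribution on top of an
# OpenNLP binary distribution. Existing OpenNLP files are left in place
# unless the addon ships a file with the same relative path.
overlay_addon() {
  src_dir=$1   # directory holding the extracted addon files (assumed layout)
  dest_dir=$2  # root directory of the OpenNLP binary distribution

  [ -d "$src_dir" ] || { echo "missing addon directory: $src_dir" >&2; return 1; }
  [ -d "$dest_dir" ] || { echo "missing OpenNLP directory: $dest_dir" >&2; return 1; }

  # "src/." copies the directory contents, not the directory itself.
  cp -R "$src_dir"/. "$dest_dir"/
}
```

A typical call would first extract the zip with `unzip ... -d <some directory>` and then run `overlay_addon <extracted directory> <OpenNLP directory>`. If the archive wraps its files in a top-level folder, that folder is the one to pass as the first argument.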

Example of usage

Embed an FSA-based dictionary in a POSModel

In this example we will use the free CoNLL-X Portuguese data to train a POS tagger, build a tag dictionary from the corpus, and embed it in the model as an FSA dictionary.

Download the Corpus

Download the Portuguese data from http://ilk.uvt.nl/conll/free_data.html

Portuguese train: portuguese_bosque_train.conll
Portuguese test: portuguese_bosque_test.conll
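
Before training, it can be useful to confirm that the files downloaded completely. A minimal sketch, assuming the files follow the CoNLL-X layout (one token per line, sentences separated by a blank line); `conll_stats` is a hypothetical helper name, not part of OpenNLP:

```shell
# Print "<sentences> <tokens>" for CoNLL-X style input (files or stdin).
# A blank line ends the current sentence; every non-blank line is one token.
conll_stats() {
  awk '
    NF == 0 { if (in_sentence) { sentences++; in_sentence = 0 }; next }
            { tokens++; in_sentence = 1 }
    END     { if (in_sentence) sentences++; print sentences + 0, tokens + 0 }
  ' "$@"
}
```

Running `conll_stats portuguese_bosque_train.conll` and `conll_stats portuguese_bosque_test.conll` gives a quick sanity check of both files; a count of `0 0` suggests an empty or truncated download.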

Train and evaluate a baseline model without a dictionary

TODO: format this:

# evaluate without dictionary

## train
bin/opennlp POSTaggerTrainer.conllx -type perceptron -params perceptron_0.properties -lang pt -model pos-pt_nodic.model -data portuguese_bosque_train.conll -encoding UTF-8

ls -lah pos-pt_nodic.model
-rw-r--r-- 1 colen staff 629K 8 Jul 01:09 pos-pt_nodic.model

## evaluate
bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_nodic.model -data portuguese_bosque_test.conll -encoding UTF-8

Accuracy: 0.5980910175558207


# train and create a tag dictionary from corpus
bin/opennlp POSTaggerTrainer.conllx -type perceptron -params perceptron_0.properties -lang pt -model pos-pt_xmldic.model -data portuguese_bosque_train.conll -encoding UTF-8 -tagDictCutoff 2

ls -lah pos-pt_xmldic.model
-rw-r--r-- 1 colen staff 839K 8 Jul 01:24 pos-pt_xmldic.model

## evaluate
bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_xmldic.model -data portuguese_bosque_test.conll -encoding UTF-8

Accuracy: 0.9676154763933867
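
To compare runs without copying numbers by hand, the accuracy line can be pulled out of the evaluator output. A sketch, assuming the evaluator prints a line of the form `Accuracy: <value>` as shown above; the function names are illustrative and not part of OpenNLP:

```shell
# Print the value from the last "Accuracy: <value>" line in the input
# (files given as arguments, or stdin when no file is given).
accuracy_of() {
  awk '$1 == "Accuracy:" { value = $2 } END { if (value != "") print value }' "$@"
}

# Print the difference (second minus first) between two accuracy values,
# rounded to four decimal places.
accuracy_gain() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.4f\n", b - a }'
}
```

For example, after saving each evaluator run to a log file, `accuracy_gain "$(accuracy_of <baseline log>)" "$(accuracy_of <dictionary log>)"` reports how much the tag dictionary changed the accuracy.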

 

# convert TAGDICT

-- extract the tagdict
unzip pos-pt_xmldic.model -d pos-pt_xmldic
more pos-pt_xmldic/tags.tagdict

bin/morfologik-addon XMLDictionaryToTable -inputFile pos-pt_xmldic/tags.tagdict -outputFile pt-morfologik.txt -separator , -encoder prefix -encoding UTF-8

bin/morfologik-addon MorfologikDictionaryBuilder -inputFile pt-morfologik.txt -encoding UTF-8

ls -lah pt-morfologik.dict
-rw-r--r-- 1 colen staff 268K 8 Jul 10:37 pt-morfologik.dict
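
To put the extracted XML tag dictionary and the FSA dictionary side by side, their sizes in bytes can be listed together. A small sketch; `size_report` is a hypothetical helper, and it only reports sizes on disk, not memory use at runtime:

```shell
# Print "<bytes><TAB><file>" for each regular file given as an argument.
# Arguments that are not regular files are reported on stderr and skipped.
size_report() {
  for f in "$@"; do
    [ -f "$f" ] || { echo "not a file: $f" >&2; continue; }
    # tr strips the padding some wc implementations add around the count.
    printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
  done
}
```

With the files produced in the steps above, this would be called as `size_report pos-pt_xmldic/tags.tagdict pt-morfologik.dict`.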


-- train

bin/opennlp POSTaggerTrainer.conllx -type perceptron -params perceptron_0.properties -lang pt -model pos-pt_morfologik.model -data portuguese_bosque_train.conll -encoding UTF-8 -factory opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory -dict pt-morfologik.dict

-- evaluate

bin/opennlp POSTaggerEvaluator.conllx -model pos-pt_morfologik.model -data portuguese_bosque_test.conll -encoding UTF-8
