...
The 1.5.0 SourceForge models must be fully compatible with the 1.5.3
release. In this test all the English models are checked for compatibility
on the English 300K-sentence Leipzig corpus (Which file to download??). The test verifies that
the output produced with the same model by both versions has the same MD5 hash.
Component | Model | Perf 1.5.2 | Perf 1.5.3 | Tester | Passed | Comment
---|---|---|---|---|---|---
Sentence Detector | en-sent.bin | 44870.8 sent/s | 42733.8 sent/s | William | yes |
Tokenizer | en-token.bin | 2824.2 sent/s | 2833.3 sent/s | William | yes |
Name Finder | en-ner-person.bin | 781.3 sent/s | 761.6 sent/s | William | yes |
POS Tagger | en-pos-maxent.bin | 773.3 sent/s | 816.2 sent/s | William | yes |
POS Tagger | en-pos-perceptron.bin | 1138.6 sent/s | 1117.1 sent/s | William | yes |
Chunker | en-chunker.bin | 183.7 sent/s | 181.1 sent/s | William | yes |
Parser | en-parser-chunking.bin | 16.0 sent/s | 16.3 sent/s | William | yes |
Note: The test was done on a MacBook Pro 15", 2 GHz Core i7, 16 GB RAM, 500 GB HD, running OS X 10.7.5
and Java 1.5.0_30. The performance figures vary slightly because lightweight tasks were running in the background during testing.
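The sent/s figures above can be derived from the wall-clock time reported by `time` and the number of sentences processed. A minimal sketch (the helper name and the default corpus size of 300K sentences are illustrative assumptions):

```shell
# throughput ELAPSED_SECONDS [SENTENCES]
# Converts an elapsed wall-clock time into a sentences-per-second figure.
# SENTENCES defaults to 300000, the size of the Leipzig corpus used above.
throughput() {
  elapsed_s="$1"
  sentences="${2:-300000}"
  # awk performs the floating-point division that plain sh lacks
  awk -v s="$sentences" -v t="$elapsed_s" 'BEGIN { printf "%.1f sent/s\n", s / t }'
}
```

For example, a run over the full corpus that takes 150 seconds of wall-clock time corresponds to `throughput 150`, i.e. 2000.0 sent/s.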
...
Package | File or Test | Tester | Passed | Comment
---|---|---|---|---
Binary | LICENSE | | | AL 2.0 and BSD for JWNL
Binary | NOTICE | | | standard notice, dates are correct, JWNL is mentioned
Binary | README | | | file was reviewed on the dev list
Binary | RELEASE_NOTES.html | | | issue list is generated correctly
Binary | Test signatures: .md5, .sha1, .asc | | | rc4
Binary | JIRA issue list created | | |
Binary | Contains maxent, tools, uima and jwnl jars | | |
Source | LICENSE | | | standard AL 2.0 file
Source | NOTICE | | | standard notice, dates are correct
Source | Test signatures: .md5, .sha1, .asc | | | rc4
Source | Can build from source? | | | Test should be done without jwnl and opennlp in the local m2 repo.
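The build-from-source check requires that no previously installed opennlp or jwnl artifacts are cached locally. A minimal sketch of clearing them, assuming the default Maven local-repository layout (the exact group paths, particularly for jwnl, are assumptions and may differ on your machine):

```shell
# clean_m2 REPO_PATH
# Removes cached opennlp and jwnl artifacts from a Maven local repository
# so that a subsequent "mvn clean install" must resolve everything from source.
clean_m2() {
  repo="$1"   # e.g. "$HOME/.m2/repository"
  rm -rf "${repo}/org/apache/opennlp" "${repo}/net/sf/jwnl"
}
# Usage: clean_m2 "$HOME/.m2/repository" && mvn clean install
```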
Notes about testing
Compatibility tests
The following commands can be used to reproduce the compatibility tests with the Leipzig corpus.
```shell
# Corpus preparation: the following command creates documents from the corpus.
# sed is used to remove the language prefix.
sh bin/opennlp DoccatConverter leipzig -data ../eng_news_2010_300K-text/eng_news_2010_300K-sentences.txt -encoding UTF-8 -lang en | sed -E 's/^en[[:space:]]//g' > ../out-tokenized-documents.test

# Corpus preparation: this forces the detokenization of the documents.
sh bin/opennlp SentenceDetectorConverter namefinder -data ../out-tokenized-documents.test -encoding UTF-8 -detokenizer trunk/opennlp-tools/lang/en/tokenizer/en-detokenizer.xml > ../out-documents.test

# Now the actual tests. Execute them for the previous release and for the current RC,
# then compare the outputs using diff:
time sh bin/opennlp SentenceDetector ../models/en-sent.bin < ../out-documents.test > ../out-sentences_1.5.2.test
time sh bin/opennlp TokenizerME ../models/en-token.bin < ../out-sentences_1.5.2.test > ../out-toks_1.5.2.test
time sh bin/opennlp TokenNameFinder ../models/en-ner-person.bin < ../out-toks_1.5.2.test > ../out-ner_1.5.2.test
time sh bin/opennlp POSTagger ../models/en-pos-maxent.bin < ../out-toks_1.5.2.test > ../out-pos_maxent_1.5.2.test
time sh bin/opennlp POSTagger ../models/en-pos-perceptron.bin < ../out-toks_1.5.2.test > ../out-pos_pers_1.5.2.test
time sh bin/opennlp ChunkerME ../models/en-chunker.bin < ../out-pos_pers_1.5.2.test > ../out-chk_1.5.2.test
time sh bin/opennlp Parser ../models/en-parser-chunking.bin < ../out-toks_1.5.2.test > ../out-parse_1.5.2.test
```
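The final equality check can also be scripted. A minimal sketch, assuming output files named as in the commands above (the helper name is illustrative; `cmp` is used because byte-identical files necessarily have identical MD5 hashes):

```shell
# compare_outputs OLD_FILE NEW_FILE
# Prints "identical" when the two pipeline outputs are byte-for-byte equal
# (the compatibility criterion), and "DIFFERS" otherwise.
compare_outputs() {
  if cmp -s "$1" "$2"; then
    echo "identical"
  else
    echo "DIFFERS"
  fi
}
# Usage, per component:
# compare_outputs ../out-sentences_1.5.2.test ../out-sentences_1.5.3.test
```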