
...

The 1.5.0 SourceForge models must be fully compatible with the 1.5.3
release. In this test all the English models are checked for compatibility
against the English 300K-sentence Leipzig news corpus (eng_news_2010_300K-sentences.txt;
see the commands under "Notes about testing" below). For each component it is verified that
the output produced with the same model by both versions has the same MD5 hash.
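
For example, once both versions have produced their output files, the hash check can be done as follows. This is only a sketch: the file names are illustrative and follow the naming used in the commands under "Notes about testing" below.

Code Block

# Hypothetical check: the two hashes must be identical.
# md5 -q prints only the hash on OS X; on Linux, use md5sum instead.
md5 -q ../out-sentences_1.5.2.test
md5 -q ../out-sentences_1.5.3.test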

Component | Model | Perf 1.5.2 | Perf 1.5.3 | Tester | Passed | Comment
Sentence Detector | en-sent.bin | 44870.8 sent/s | 42733.8 sent/s | William | yes |
Tokenizer | en-token.bin | 2824.2 sent/s | 2833.3 sent/s | William | yes |
Name Finder | en-ner-person.bin | 781.3 sent/s | 761.6 sent/s | William | yes |
POS Tagger | en-pos-maxent.bin | 773.3 sent/s | 816.2 sent/s | William | yes |
POS Tagger | en-pos-perceptron.bin | 1138.6 sent/s | 1117.1 sent/s | William | yes |
Chunker | en-chunker.bin | 183.7 sent/s | 181.1 sent/s | William | yes |
Parser | en-parser-chunking.bin | 16.0 sent/s | 16.3 sent/s | William | yes |

Note: The tests were run on a MacBook Pro 15", 2 GHz Core i7, 16 GB RAM, 500 GB HD, running OS X 10.7.5
and Java 1.5.0_30. The numbers vary slightly because lightweight tasks were running in the background during testing.

...

Package | File or Test | Tester | Passed | Comment
Binary | LICENSE | | | AL 2.0 and BSD for JWNL
Binary | NOTICE | | | Standard notice, dates are correct. JWNL is mentioned.
Binary | README | | | File was reviewed on the dev list.
Binary | RELEASE_NOTES.html | | | Issue list is generated correctly.
Binary | Test signatures: .md5, .sha1, .asc | | | rc4
Binary | JIRA issue list created | | |
Binary | Contains maxent, tools, uima and jwnl jars | | |
Source | LICENSE | | | Standard AL 2.0 file
Source | NOTICE | | | Standard notice, dates are correct.
Source | Test signatures: .md5, .sha1, .asc | | | rc4
Source | Can build from source? | | | Test should be done without jwnl and opennlp in the local m2 repo. Test was done on Ubuntu 10.10 (see the sketch below).
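
A minimal sketch of the build-from-source check referenced in the last row above. The local repository paths and archive names are assumptions, not part of the original test notes.

Code Block

# Assumed paths: remove cached opennlp and jwnl artifacts from the local Maven repository
# so the build cannot pick them up (groupIds may differ in your setup).
rm -rf ~/.m2/repository/org/apache/opennlp ~/.m2/repository/jwnl

# Unpack the source distribution (file name is illustrative) and build it.
tar xzf apache-opennlp-1.5.3-src.tar.gz
cd apache-opennlp-1.5.3-src
mvn clean install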

Notes about testing

Compatibility tests

The following commands can be used to reproduce the compatibility tests with the Leipzig corpus.

Code Block
 
# Corpus preparation: the following command creates documents from the corpus; sed is used to remove the language prefix

sh bin/opennlp DoccatConverter leipzig -data ../eng_news_2010_300K-text/eng_news_2010_300K-sentences.txt -encoding UTF-8 -lang en | sed -E 's/^en[[:space:]]//g' > ../out-tokenized-documents.test

# Corpus preparation: this forces the detokenization of the documents

sh bin/opennlp SentenceDetectorConverter namefinder -data ../out-tokenized-documents.test -encoding UTF-8 -detokenizer trunk/opennlp-tools/lang/en/tokenizer/en-detokenizer.xml > ../out-documents.test

# Now the actual tests. Run them for both the previous release and the current RC, then compare the outputs using diff:

time sh bin/opennlp SentenceDetector ../models/en-sent.bin < ../out-documents.test > ../out-sentences_1.5.2.test

time sh bin/opennlp TokenizerME ../models/en-token.bin < ../out-sentences_1.5.2.test > ../out-toks_1.5.2.test

time sh bin/opennlp TokenNameFinder ../models/en-ner-person.bin < ../out-toks_1.5.2.test > ../out-ner_1.5.2.test

time sh bin/opennlp POSTagger ../models/en-pos-maxent.bin < ../out-toks_1.5.2.test > ../out-pos_maxent_1.5.2.test

time sh bin/opennlp POSTagger ../models/en-pos-perceptron.bin < ../out-toks_1.5.2.test > ../out-pos_pers_1.5.2.test

time sh bin/opennlp ChunkerME ../models/en-chunker.bin < ../out-pos_pers_1.5.2.test > ../out-chk_1.5.2.test

time sh bin/opennlp Parser ../models/en-parser-chunking.bin < ../out-toks_1.5.2.test > ../out-parse_1.5.2.test
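
The commands above produce the 1.5.2 outputs; after running the same pipeline with the 1.5.3 candidate (writing, e.g., out-sentences_1.5.3.test and so on), the two runs can be compared. A minimal sketch, assuming the 1.5.3 file names mirror the 1.5.2 names used above:

Code Block

# Hypothetical comparison: every pair of outputs must be byte-identical.
for f in sentences toks ner pos_maxent pos_pers chk parse; do
  diff -q ../out-${f}_1.5.2.test ../out-${f}_1.5.3.test || echo "MISMATCH: $f"
done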