...
The 1.5.0 SourceForge models must be fully compatible with the 1.5.3
release. In this test all the English models are checked for compatibility
on the English 300K-sentence Leipzig corpus (Which file to download??). The test verifies that
the output produced with the same model by both versions has the same MD5 hash.
Component | Model | Perf 1.5.2 | Perf 1.5.3 | Tester | Passed | Comment
---|---|---|---|---|---|---
Sentence Detector | en-sent.bin | 44870.8 sent/s | 42733.8 sent/s | William | yes |
Tokenizer | en-token.bin | 2824.2 sent/s | 2833.3 sent/s | William | yes |
Name Finder | en-ner-person.bin | 781.3 sent/s | 761.6 sent/s | William | yes |
POS Tagger | en-pos-maxent.bin | 773.3 sent/s | 816.2 sent/s | William | yes |
POS Tagger | en-pos-perceptron.bin | 1138.6 sent/s | 1117.1 sent/s | William | yes |
Chunker | en-chunker.bin | 183.7 sent/s | 181.1 sent/s | William | yes |
Parser | en-parser-chunking.bin | 16.0 sent/s | 16.3 sent/s | William | yes |
Note: The test was done on a MacBook Pro 15", 2 GHz Core i7, 16 GB RAM, 500 GB HD, running OS X 10.7.5
and Java 1.5.0_30. The performance figures vary slightly because lightweight tasks were running in the background during testing.
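The sent/s figures above can be derived from the wall-clock time reported by `time` and the number of sentences processed. A minimal sketch (the helper name and the default corpus size of 300K sentences are illustrative assumptions):

```shell
# throughput ELAPSED_SECONDS [SENTENCES]
# Converts an elapsed wall-clock time into a sentences-per-second figure.
# SENTENCES defaults to 300000, the size of the Leipzig corpus used above.
throughput() {
  elapsed_s="$1"
  sentences="${2:-300000}"
  # awk performs the floating-point division that plain sh lacks
  awk -v s="$sentences" -v t="$elapsed_s" 'BEGIN { printf "%.1f sent/s\n", s / t }'
}
```

For example, a run over the full corpus that takes 150 seconds of wall-clock time corresponds to `throughput 150`, i.e. 2000.0 sent/s.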
...
Package | File or Test | Tester | Passed | Comment
---|---|---|---|---
Binary | LICENSE | | | AL 2.0 and BSD for JWNL
Binary | NOTICE | | | standard notice, dates are correct, JWNL is mentioned
Binary | README | | | file was reviewed on the dev list
Binary | RELEASE_NOTES.html | | | issue list is generated correctly
Binary | Test signatures: .md5, .sha1, .asc | | | rc4
Binary | JIRA issue list created | | |
Binary | Contains maxent, tools, uima and jwnl jars | | |
Source | LICENSE | | | standard AL 2.0 file
Source | NOTICE | | | standard notice, dates are correct
Source | Test signatures: .md5, .sha1, .asc | | | rc4
Source | Can build from source? | | | Test should be done without jwnl and opennlp in the local m2 repo.
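The build-from-source check requires that no previously installed opennlp or jwnl artifacts are cached locally. A minimal sketch of clearing them, assuming the default Maven local-repository layout (the exact group paths, particularly for jwnl, are assumptions and may differ on your machine):

```shell
# clean_m2 REPO_PATH
# Removes cached opennlp and jwnl artifacts from a Maven local repository
# so that a subsequent "mvn clean install" must resolve everything from source.
clean_m2() {
  repo="$1"   # e.g. "$HOME/.m2/repository"
  rm -rf "${repo}/org/apache/opennlp" "${repo}/net/sf/jwnl"
}
# Usage: clean_m2 "$HOME/.m2/repository" && mvn clean install
```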
Notes about testing
Compatibility tests
The following commands can be used to reproduce the compatibility tests with the Leipzig corpus.
```shell
# Corpus preparation: the following command creates documents from the corpus.
# sed is used to remove the language prefix.
sh bin/opennlp DoccatConverter leipzig -data ../eng_news_2010_300K-text/eng_news_2010_300K-sentences.txt -encoding UTF-8 -lang en | sed -E 's/^en[[:space:]]//g' > ../out-tokenized-documents.test

# Corpus preparation: this forces the detokenization of the documents.
sh bin/opennlp SentenceDetectorConverter namefinder -data ../out-tokenized-documents.test -encoding UTF-8 -detokenizer trunk/opennlp-tools/lang/en/tokenizer/en-detokenizer.xml > ../out-documents.test

# Now the actual tests. Execute them for the previous release and for the current RC,
# then compare the outputs using diff:
time sh bin/opennlp SentenceDetector ../models/en-sent.bin < ../out-documents.test > ../out-sentences_1.5.2.test
time sh bin/opennlp TokenizerME ../models/en-token.bin < ../out-sentences_1.5.2.test > ../out-toks_1.5.2.test
time sh bin/opennlp TokenNameFinder ../models/en-ner-person.bin < ../out-toks_1.5.2.test > ../out-ner_1.5.2.test
time sh bin/opennlp POSTagger ../models/en-pos-maxent.bin < ../out-toks_1.5.2.test > ../out-pos_maxent_1.5.2.test
time sh bin/opennlp POSTagger ../models/en-pos-perceptron.bin < ../out-toks_1.5.2.test > ../out-pos_pers_1.5.2.test
time sh bin/opennlp ChunkerME ../models/en-chunker.bin < ../out-pos_pers_1.5.2.test > ../out-chk_1.5.2.test
time sh bin/opennlp Parser ../models/en-parser-chunking.bin < ../out-toks_1.5.2.test > ../out-parse_1.5.2.test
```
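The final equality check can also be scripted. A minimal sketch, assuming output files named as in the commands above (the helper name is illustrative; `cmp` is used because byte-identical files necessarily have identical MD5 hashes):

```shell
# compare_outputs OLD_FILE NEW_FILE
# Prints "identical" when the two pipeline outputs are byte-for-byte equal
# (the compatibility criterion), and "DIFFERS" otherwise.
compare_outputs() {
  if cmp -s "$1" "$2"; then
    echo "identical"
  else
    echo "DIFFERS"
  fi
}
# Usage, per component:
# compare_outputs ../out-sentences_1.5.2.test ../out-sentences_1.5.3.test
```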