...
The 1.5.0 SourceForge models must be fully compatible with the 1.5.3
release. In this test, all the English models are tested for compatibility
on the English 300k-sentence Leipzig Corpus (which file to download is not stated here; the commands under "Notes about testing" use eng_news_2010_300K-sentences.txt). The test checks that
the output produced with the same model by both versions has the same MD5 hash.
Component | Model | Perf 1.5.2 | Perf 1.5.3 | Tester | Passed | Comment |
---|---|---|---|---|---|---|
Sentence Detector | en-sent.bin | 44870.8 sent/s | 42733.8 sent/s | William | yes | |
Tokenizer | en-token.bin | 2824.2 sent/s | 2833.3 sent/s | William | yes | |
Name Finder | en-ner-person.bin | 781.3 sent/s | 761.6 sent/s | William | yes | |
POS Tagger | en-pos-maxent.bin | 773.3 sent/s | 816.2 sent/s | William | yes | |
POS Tagger | en-pos-perceptron.bin | 1138.6 sent/s | 1117.1 sent/s | William | yes | |
Chunker | en-chunker.bin | 183.7 sent/s | 181.1 sent/s | William | yes | |
Parser | en-parser-chunking.bin | 16.0 sent/s | 16.3 sent/s | William | yes | |
Note: The test was done on a MacBook Pro 15", 2 GHz Core i7, 16 GB RAM, 500 GB HD, running OS X 10.7.5
and Java 1.5.0_30. The performance varies because lightweight tasks were running in the background while testing.
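The MD5 comparison described above can be sketched as a small shell helper. This is only an illustration, not the script the testers used: the function names `md5_of` and `same_md5` are hypothetical, and the helper assumes either `md5sum` (GNU coreutils) or `md5 -q` (BSD / OS X) is installed.

```sh
# md5_of FILE: print the MD5 hex digest of FILE.
# Assumption: md5sum (Linux) or md5 (BSD / OS X) is available.
md5_of() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | cut -d ' ' -f 1
  else
    md5 -q "$1"
  fi
}

# same_md5 FILE_A FILE_B: succeed only if both files hash to the same value.
same_md5() {
  [ "$(md5_of "$1")" = "$(md5_of "$2")" ]
}
```

A call such as `same_md5 ../out-sentences_1.5.2.test ../out-sentences_1.5.3.test && echo passed` would then report a pass for one component; the `_1.5.3` file name is an assumed naming for the output of the release candidate.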
...
To pass the test the event hash and the model output must be identical.
Component | Model | Training Time 1.5.2 | Training Time 1.5.3 | Tester | Passed | Comment |
---|---|---|---|---|---|---|
Sentence Detector | en-sent.bin | | | Jörn | yes | |
Tokenizer | en-token.bin | | | Jörn | yes | |
POS Tagger | en-pos-maxent.bin | | | Jörn | yes | |
POS Tagger | en-pos-perceptron.bin | | | Jörn | yes | |
Parser | en-parser-chunking.bin | | | Jörn | yes | Tested on 10k sentences |
Note: Time was measured with the time command; the value reported is the "real" time.
...
Component | Data | Tester | Tagging Perf 1.5.2 | Tagging Perf 1.5.3 | Comment |
---|---|---|---|---|---|
Sentence Detector | | | | | |
Tokenizer | | | | | |
Name Finder | CONLL 2002 Dutch Person ned.testa | jkosin | Precision: 0.7552941176470588 | Precision: 0.7552941176470588 | |
Name Finder | CONLL 2002 Dutch Person ned.testb | jkosin | Precision: 0.8505025125628141 | Precision: 0.8505025125628141 | |
Name Finder | CONLL 2002 Dutch Organization ned.testa | jkosin | Precision: 0.8561872909698997 | Precision: 0.8561872909698997 | |
Name Finder | CONLL 2002 Dutch Organization ned.testb | jkosin | Precision: 0.7830374753451677 | Precision: 0.7830374753451677 | |
Name Finder | CONLL 2002 Dutch Location ned.testa | jkosin | Precision: 0.8458333333333333 | Precision: 0.8458333333333333 | |
Name Finder | CONLL 2002 Dutch Location ned.testb | jkosin | Precision: 0.8816326530612245 | Precision: 0.8816326530612245 | |
Name Finder | CONLL 2002 Dutch Misc ned.testa | jkosin | Precision: 0.8354114713216958 | Precision: 0.8354114713216958 | |
Name Finder | CONLL 2002 Dutch Misc ned.testb | jkosin | Precision: 0.8264984227129337 | Precision: 0.8264984227129337 | |
Name Finder | CONLL 2002 Dutch Combined ned.testa | jkosin | Precision: 0.6509695290858726 | Precision: 0.664424218440839 | 1000 iterations |
Name Finder | CONLL 2002 Dutch Combined ned.testb | jkosin | Precision: 0.6869929337869668 | Precision: 0.7006019366657943 | 1000 iterations |
Name Finder | CONLL 2002 Spanish Person esp.testa | jkosin | Precision: 0.9010695187165776 | Precision: 0.9010695187165776 | |
Name Finder | CONLL 2002 Spanish Person esp.testb | jkosin | Precision: 0.9195205479452054 | Precision: 0.9195205479452054 | |
Name Finder | CONLL 2002 Spanish Organization esp.testa | jkosin | Precision: 0.8288942695722357 | Precision: 0.8288942695722357 | |
Name Finder | CONLL 2002 Spanish Organization esp.testb | jkosin | Precision: 0.8036277602523659 | Precision: 0.8036277602523659 | |
Name Finder | CONLL 2002 Spanish Location esp.testa | jkosin | Precision: 0.7743016759776536 | Precision: 0.7743016759776536 | |
Name Finder | CONLL 2002 Spanish Location esp.testb | jkosin | Precision: 0.8301886792452831 | Precision: 0.8301886792452831 | |
Name Finder | CONLL 2002 Spanish Misc esp.testa | jkosin | Precision: 0.6492890995260664 | Precision: 0.6492890995260664 | |
Name Finder | CONLL 2002 Spanish Misc esp.testb | jkosin | Precision: 0.686046511627907 | Precision: 0.686046511627907 | |
Name Finder | CONLL 2002 Spanish Combined esp.testa | jkosin | Precision: 0.7005423249233671 | Precision: 0.7047866069323273 | 1000 iterations |
Name Finder | CONLL 2002 Spanish Combined esp.testb | jkosin | Precision: 0.756635931824532 | Precision: 0.7588711930706902 | 1000 iterations |
Name Finder | CONLL 2003 English Person eng.testa | jkosin | Precision: 0.9523195876288659 | Precision: 0.9523195876288659 | |
Name Finder | CONLL 2003 English Person eng.testb | jkosin | Precision: 0.9391727493917275 | Precision: 0.9391727493917275 | |
Name Finder | CONLL 2003 English Organization eng.testa | jkosin | Precision: 0.8768046198267565 | Precision: 0.8768046198267565 | |
Name Finder | CONLL 2003 English Organization eng.testb | jkosin | Precision: 0.8435980551053485 | Precision: 0.8435980551053485 | |
Name Finder | CONLL 2003 English Location eng.testa | jkosin | Precision: 0.9361421988150099 | Precision: 0.9361421988150099 | |
Name Finder | CONLL 2003 English Location eng.testb | jkosin | Precision: 0.9206349206349206 | Precision: 0.9206349206349206 | |
Name Finder | CONLL 2003 English Misc eng.testa | jkosin | Precision: 0.9027982326951399 | Precision: 0.9027982326951399 | |
Name Finder | CONLL 2003 English Misc eng.testb | jkosin | Precision: 0.8592436974789915 | Precision: 0.8592436974789915 | |
Name Finder | CONLL 2003 English Combined eng.testa | jkosin | Precision: 0.861812521618817 | Precision: 0.8640608785887236 | 1000 iterations |
Name Finder | CONLL 2003 English Combined eng.testb | jkosin | Precision: 0.8041311831853597 | Precision: 0.8064866823699945 | 1000 iterations |
Name Finder | CONLL 2003 German Person deu.testa | jkosin | Precision: 0.9132653061224489 | Precision: 0.9132653061224489 | |
Name Finder | CONLL 2003 German Person deu.testb | jkosin | Precision: 0.8732106339468303 | Precision: 0.8732106339468303 | |
Name Finder | CONLL 2003 German Organization deu.testa | jkosin | Precision: 0.8407224958949097 | Precision: 0.8407224958949097 | |
Name Finder | CONLL 2003 German Organization deu.testb | jkosin | Precision: 0.8014705882352942 | Precision: 0.8014705882352942 | |
Name Finder | CONLL 2003 German Location deu.testa | jkosin | Precision: 0.7816326530612245 | Precision: 0.7816326530612245 | |
Name Finder | CONLL 2003 German Location deu.testb | jkosin | Precision: 0.8033826638477801 | Precision: 0.8033826638477801 | |
Name Finder | CONLL 2003 German Misc deu.testa | jkosin | Precision: 0.7055555555555556 | Precision: 0.7055555555555556 | |
Name Finder | CONLL 2003 German Misc deu.testb | jkosin | Precision: 0.6601307189542484 | Precision: 0.6601307189542484 | |
Name Finder | CONLL 2003 German Combined deu.testa | jkosin | Precision: 0.7718859429714857 | Precision: 0.7783891945972986 | OPENNLP-417 |
Name Finder | CONLL 2003 German Combined deu.testb | jkosin | Precision: 0.7467566165023353 | Precision: 0.749351323300467 | OPENNLP-417 |
POS Tagger | CONLL 2006 Danish | Jörn / ? | Accuracy: 0.9511278195488722 | Accuracy: 0.9512987012987013 | Jörn: Same result as other tester |
POS Tagger | CONLL 2006 Dutch | Jörn | Accuracy: 0.9324977618621307 | Accuracy: 0.9324977618621307 | |
POS Tagger | CONLL 2006 Portuguese | Jörn / ? | Accuracy: 0.9659110277825124 | Accuracy: 0.9659110277825124 | Jörn: Same result as other tester |
POS Tagger | CONLL 2006 Swedish | Jörn | Accuracy: 0.9275106082036775 | Accuracy: 0.9275106082036775 | |
Chunker | CONLL 2000 | William | Precision: 0.9257575757575758 | Precision: 0.9257575757575758 | |
Sentence Detector | Arvores Deitadas | William | | Precision: 0.9891491491491492 | PERCEPTRON Cutoff 0 |
Tokenizer | Arvores Deitadas | William | | Precision: 0.9995231988260895 | PERCEPTRON Cutoff 0 |
Chunker | Arvores Deitadas | William | Precision: 0.9404684925220583 | Precision: 0.9562405864042575 | OPENNLP-541, OPENNLP-423 |
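To spot the rows where the two releases disagree, the results table can be filtered with a few lines of awk. This is a minimal sketch, not part of the original test procedure: `changed_rows` is a hypothetical helper, and it assumes the table is saved as plain text with `|` separators, the 1.5.2 value in column 4, and the 1.5.3 value in column 5.

```sh
# changed_rows: read a pipe-delimited results table on stdin and print every
# row whose 1.5.2 and 1.5.3 cells are both filled in and differ.
# Assumption: cells look like "Precision: <n>" or "Accuracy: <n>".
changed_rows() {
  awk -F'|' '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    NF >= 5 {
      a = trim($4); b = trim($5)
      if (a ~ /^(Precision|Accuracy): / && b ~ /^(Precision|Accuracy): / && a != b)
        print trim($1) " | " trim($2) " | " a " -> " b
    }'
}
```

Running `changed_rows < results.txt` (the file name is an assumption) lists only the rows that changed, which makes it easier to confirm that each difference has a JIRA issue or another explanation in the Comment column.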
Test UIMA Integration
The test ensures that the Analysis Engine can run and does not
crash through simple runtime code errors. We need to add
more sophisticated testing in the next releases.
Analysis Engine | Tester | Passed | Comment |
---|---|---|---|
Sentence Detector | | | |
Sentence Detector Trainer | | | |
Tokenizer ME | | | |
Tokenizer Trainer | | | |
Name Finder | | | |
Name Finder Trainer | | | |
Chunker | | | |
Chunker Trainer | | | |
POS Tagger | | | |
POS Tagger Trainer | | | |
Parser | | | |
createPear.sh | Jörn | yes | |
Sample PEAR | Jörn | yes | |
Distribution Review
Please ensure that the listed files below are included in the distributions
and are in a good state.
Package | File or Test | Tester | Passed | Comment |
---|---|---|---|---|
Binary | LICENSE | Jörn | Yes | AL 2.0 and BSD for JWNL |
Binary | NOTICE | Jörn | Yes | standard notice, dates are correct. JWNL is mentioned |
Binary | README | Jörn | Yes | |
Binary | RELEASE_NOTES.html | Jörn | Yes | |
Binary | Test signatures: .md5, .sha1, .asc | Jörn | Yes | tested for rc3 |
Binary | JIRA issue list created | William | Yes | Minor issue: the project.version was not filled. |
Binary | Contains maxent, tools, uima and jwnl jars | Jörn | Yes | |
Source | LICENSE | Jörn | Yes | standard AL 2.0 file |
Source | NOTICE | Jörn | Yes | standard notice, dates are correct |
Source | Test signatures: .md5, .sha1, .asc | Jörn | | tested for rc3 |
Source | Can build from source? | Jörn | Yes | Test should be done without jwnl and opennlp in local m2 repo. |
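The checksum part of the "Test signatures" rows could be scripted roughly as follows. This is a sketch under assumptions: `digest_of` and `check_sidecar` are hypothetical helper names, and the sidecar file (`FILE.md5` or `FILE.sha1`) is assumed to hold the hex digest as its first whitespace-delimited token, which may not match the layout of every release artifact.

```sh
# digest_of ALGO FILE: print the hex digest of FILE; ALGO is md5 or sha1.
# Assumption: md5sum/sha1sum (GNU coreutils) or openssl is installed.
digest_of() {
  if command -v "${1}sum" >/dev/null 2>&1; then
    "${1}sum" "$2" | cut -d ' ' -f 1
  else
    openssl dgst "-$1" -r "$2" | cut -d ' ' -f 1
  fi
}

# check_sidecar ALGO FILE: succeed if FILE's digest equals the first token
# stored in FILE.ALGO (compared in lower case).
check_sidecar() {
  expected=$(tr 'A-F' 'a-f' < "$2.$1" | awk 'NR == 1 { print $1 }')
  actual=$(digest_of "$1" "$2")
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

The `.asc` signature is a separate check; with GnuPG it is typically verified with `gpg --verify FILE.asc FILE` after importing the release manager's public key.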
Notes about testing
Compatibility tests
The following commands can be used to reproduce the compatibility tests with the Leipzig corpus.
Code Block |
---|
# Corpus preparation: the following command will create documents from the corpus. Sed is used to remove the language prefix
sh bin/opennlp DoccatConverter leipzig -data ../eng_news_2010_300K-text/eng_news_2010_300K-sentences.txt -encoding UTF-8 -lang en | sed -E 's/^en[[:space:]]//g' > ../out-tokenized-documents.test
# Corpus preparation: this forces the detokenization of the documents
sh bin/opennlp SentenceDetectorConverter namefinder -data ../out-tokenized-documents.test -encoding UTF-8 -detokenizer trunk/opennlp-tools/lang/en/tokenizer/en-detokenizer.xml > ../out-documents.test
# Now the actual tests. Execute them for the previous release and for the current RC, then compare the output using diff:
time sh bin/opennlp SentenceDetector ../models/en-sent.bin < ../out-documents.test > ../out-sentences_1.5.2.test
time sh bin/opennlp TokenizerME ../models/en-token.bin < ../out-sentences_1.5.2.test > ../out-toks_1.5.2.test
time sh bin/opennlp TokenNameFinder ../models/en-ner-person.bin < ../out-toks_1.5.2.test > ../out-ner_1.5.2.test
time sh bin/opennlp POSTagger ../models/en-pos-maxent.bin < ../out-toks_1.5.2.test > ../out-pos_maxent_1.5.2.test
time sh bin/opennlp POSTagger ../models/en-pos-perceptron.bin < ../out-toks_1.5.2.test > ../out-pos_pers_1.5.2.test
time sh bin/opennlp ChunkerME ../models/en-chunker.bin < ../out-pos_pers_1.5.2.test > ../out-chk_1.5.2.test
time sh bin/opennlp Parser ../models/en-parser-chunking.bin < ../out-toks_1.5.2.test > ../out-parse_1.5.2.test
|
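The final comparison step could be automated along these lines. This is a hedged sketch rather than the procedure the testers followed: `compare_runs` is a hypothetical helper, and it assumes the output of the release candidate is written next to the 1.5.2 output with the same file names but a different version suffix (for example `out-toks_1.5.3.test`).

```sh
# compare_runs OLD NEW [DIR]: for every DIR/out-*_OLD.test file, diff it
# against the matching DIR/out-*_NEW.test file and report the result.
# Returns non-zero if any pair differs or a NEW file is missing.
compare_runs() {
  old=$1; new=$2; dir=${3:-..}; rc=0
  for f in "$dir"/out-*_"$old".test; do
    [ -e "$f" ] || continue              # glob matched nothing
    g=${f%_"$old".test}_"$new".test      # swap the version suffix
    if [ ! -e "$g" ]; then
      echo "MISSING $g"; rc=1
    elif diff "$f" "$g" >/dev/null; then
      echo "SAME    $f"
    else
      echo "DIFFER  $f"; rc=1
    fi
  done
  return $rc
}
```

For example, `compare_runs 1.5.2 1.5.3 ..` would print one line per component output and exit with a failure status if any output changed between the two versions.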