Instructions to train and run a simple part-of-speech (PoS) tagger program. The instructions are written for Unix but can be adapted for Windows.
Unless otherwise specified, save downloads to $HOME/archives.
- Download and install Java.
- Download and install Maven.
- Download OpenNLP.
- Download a PoS Treebank training set into $HOME/archives/pos:
  - ERG from DELPH-IN - $0
  - Maxent from OpenNLP 1.5 - $0
  - Perceptron from OpenNLP 1.5 - $0
  - NLTK Files from NLTK - $0
  - CDT Files from Copenhagen Treebank - $0
  - Penn Treebank 3 from LDC - $3000
- Create development, library, and data directories:
mkdir -p $HOME/dev/java/nlp/lib/
mkdir -p $HOME/dev/java/nlp/data/
- Change to development directory:
cd $HOME/dev/java/nlp/
- Extract files:
tar zxf $HOME/archives/apache-opennlp-*-incubating-src.tar.gz
- Rename directory:
mv apache-opennlp-*-incubating-src opennlp
- Build Java Archive (JAR) files (5 to 10 minutes, depending on the machine):
cd opennlp/opennlp
mvn install > build.log
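Because the build output is redirected to build.log, a failed build is easy to miss. A minimal sketch for checking the log afterwards, assuming Maven reports a successful build with a line containing "BUILD SUCCESS" (Maven 2 prints "BUILD SUCCESSFUL", which the same pattern matches). The function name build_ok is hypothetical:

```shell
# Hypothetical helper: succeed only if the given Maven log reports a successful build.
build_ok() {
    grep -q "BUILD SUCCESS" "$1"
}

# Usage sketch: report the result for build.log, if it exists in the current directory.
if [ -f build.log ]; then
    if build_ok build.log; then
        echo "build succeeded"
    else
        echo "build failed; see build.log"
    fi
fi
```

If the check fails, the error messages near the end of build.log are usually the place to start.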
- Change to OpenNLP development directory:
cd $HOME/dev/java/nlp/opennlp/
- Move library files to library directory:
mv opennlp-uima/target/dependency/* ../lib/.
- Move the downloaded Maxent file (en-pos-maxent.bin) to the data directory:
mv $HOME/archives/pos/en-pos-maxent.bin $HOME/dev/java/nlp/data/.
- Change to development directory:
cd $HOME/dev/java/nlp/
- Copy the HelloWorld source code to $HOME/dev/java/nlp/HelloWorld.java.
- Compile HelloWorld.java:
javac -cp $(echo lib/*.jar | tr ' ' ':') HelloWorld.java
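The `$(echo lib/*.jar | tr ' ' ':')` expression builds the classpath: the shell expands lib/*.jar to a space-separated list of JAR files, and tr replaces each space with the colon that Java expects between classpath entries on Unix. A small sketch in a scratch directory, using two hypothetical file names (a.jar and b.jar) and assuming that no path contains a space:

```shell
# Create a scratch directory holding two empty, hypothetical JAR files.
demo=$(mktemp -d)
mkdir "$demo/lib"
touch "$demo/lib/a.jar" "$demo/lib/b.jar"

# Same expression as in the compile step, evaluated inside the scratch directory.
classpath=$(cd "$demo" && echo lib/*.jar | tr ' ' ':')
echo "$classpath"    # prints lib/a.jar:lib/b.jar

# Clean up the scratch directory.
rm -r "$demo"
```

A JAR path containing a space would be split at that space, so this approach is only safe for simple file names like the ones in lib/.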
- Run HelloWorld.java:
java -cp .:$(echo lib/*.jar | tr ' ' ':') HelloWorld data/en-pos-maxent.bin "Earlier today, we compiled a program."
Output:
Earlier => JJR @ 0.2182545923597446
today, => NN @ 0.666361706870189
we => PRP @ 0.8324059729613176
compiled => VBN @ 0.028125261823754893
a => DT @ 0.9145975161653905
program. => NN @ 0.8841759649076423
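Each result has the form `token => TAG @ probability`, so low-confidence tags are easy to pick out with a filter. A minimal sketch, assuming the program prints one result per line. The function name low_confidence and the 0.5 threshold are hypothetical choices for illustration:

```shell
# Hypothetical helper: print token, tag, and probability for every result
# whose probability is below the threshold given as the first argument.
# Expects input lines of the form: token => TAG @ probability
low_confidence() {
    awk -v t="$1" '$2 == "=>" && $4 == "@" && $5 + 0 < t + 0 { print $1, $3, $5 }'
}

# Usage sketch with two results copied from the output above.
printf '%s\n' \
    'compiled => VBN @ 0.028125261823754893' \
    'a => DT @ 0.9145975161653905' | low_confidence 0.5
# prints: compiled VBN 0.028125261823754893
```

The same filter could be appended to the run command with a pipe, for example `... "Earlier today, we compiled a program." | low_confidence 0.5`.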