Complete, concise instructions to build, train, and run a simple Natural Language Parsing parts-of-speech (PoS) tagger program. Instructions are for Unix, but are easily adaptable for Windows.

Unless otherwise specified, save downloads to $HOME/archives.

  1. Download and install Java.
  2. Download and install Maven.
  3. Download OpenNLP.
  4. Download a PoS Treebank training set into $HOME/archives/pos.
  5. Create development, library, and data directories:
    mkdir -p $HOME/dev/java/nlp/lib/
    mkdir -p $HOME/dev/java/nlp/data/
  6. Change to development directory:
    cd $HOME/dev/java/nlp/
  7. Extract files:
    tar zxf $HOME/archives/apache-opennlp-*-incubating-src.tar.gz
  8. Rename directory:
    mv apache-opennlp-*-incubating-src opennlp
  9. Build Java Archive (JAR) files (about 5 to 10 minutes, depending on your system):
    cd opennlp/opennlp
    mvn install > build.log
  10. Change to OpenNLP development directory:
    cd $HOME/dev/java/nlp/opennlp/
  11. Move library files to library directory:
    mv opennlp-uima/target/dependency/* ../lib/.
  12. Move training data to data directory:
    mv $HOME/archives/pos/en-pos-maxent.bin $HOME/dev/java/nlp/data/.
  13. Change to development directory:
    cd $HOME/dev/java/nlp/
  14. Copy HelloWorld Source Code to $HOME/dev/java/nlp/HelloWorld.java.
  15. Compile HelloWorld.java:
    javac -cp $(echo lib/*.jar | tr ' ' ':') HelloWorld.java
  16. Run HelloWorld.java:
    java -cp .:$(echo lib/*.jar | tr ' ' ':') HelloWorld data/en-pos-maxent.bin "Earlier today, we compiled a program."
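Steps 15 and 16 build the classpath with echo lib/*.jar | tr ' ' ':', which lets the shell expand the glob into a space-separated list and then turns the spaces into colons. The sketch below shows the same idea as a small function that also copes with an empty directory and with paths that contain spaces. The function name build_classpath and the scratch directory with placeholder files are illustrative assumptions, not part of OpenNLP or of the original instructions.

```shell
#!/bin/sh
# Hypothetical helper: join every .jar file in a directory into a
# colon-separated classpath string.
build_classpath() {
    joined=""
    for jar in "$1"/*.jar; do
        # An unmatched glob is left as the literal pattern; skip it.
        [ -e "$jar" ] || continue
        joined="${joined:+$joined:}$jar"
    done
    printf '%s\n' "$joined"
}

# Demonstration using empty placeholder files in a scratch directory.
demo=$(mktemp -d)
mkdir "$demo/lib"
touch "$demo/lib/a.jar" "$demo/lib/b.jar"

classpath=$(cd "$demo" && build_classpath lib)
echo ".:$classpath"

rm -rf "$demo"
```

With such a function defined, step 16 could pass ".:$(build_classpath lib)" to -cp. Java 6 and later also accept a quoted directory wildcard such as -cp ".:lib/*", which avoids the shell expansion entirely.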

Output of step 16:

Earlier => JJR @ 0.2182545923597446
today, => NN @ 0.666361706870189
we => PRP @ 0.8324059729613176
compiled => VBN @ 0.028125261823754893
a => DT @ 0.9145975161653905
program. => NN @ 0.8841759649076423
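Each output line pairs a token with its tag and the tagger's probability for that tag, so low values (such as the one for "compiled") mark the tags the model is least sure about. As a rough sketch, the lines can be filtered with awk to list only those uncertain tags. The function name low_confidence and the 0.5 threshold are arbitrary assumptions chosen for illustration, and the line format is assumed to be exactly TOKEN => TAG @ PROBABILITY as shown above.

```shell
#!/bin/sh
# Hypothetical helper: print tagger lines whose probability is below a limit.
# Assumes each input line has the form: TOKEN => TAG @ PROBABILITY
low_confidence() {
    awk -v limit="$1" '
        $2 == "=>" && $4 == "@" && ($5 + 0) < (limit + 0) { print $1, $3, $5 }
    '
}

# Sample input copied from the output shown above.
flagged=$(low_confidence 0.5 <<'EOF'
Earlier => JJR @ 0.2182545923597446
today, => NN @ 0.666361706870189
we => PRP @ 0.8324059729613176
compiled => VBN @ 0.028125261823754893
a => DT @ 0.9145975161653905
program. => NN @ 0.8841759649076423
EOF
)
printf '%s\n' "$flagged"
```

In practice the function would read the tagger's output directly, for example by piping the java command from step 16 into low_confidence 0.5.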