...

java -cp <classpath> data.chunk.genia.Genia2PTB <genia-trees> <ptb-trees> <genia-ptb-name-mapping>
Where
*<genia-trees>* is the directory which holds the GENIA corpus files;
*<ptb-trees>* is the directory where the converted PTB trees will be written;
*<genia-ptb-name-mapping>* is a file that will be created by Genia2PTB to save the file name mappings.
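For example, assuming the GENIA Treebank files live under /data/genia/treebank and the converted trees should go to /data/genia/ptb-trees (these paths and the mapping file name are illustrative, not part of the original instructions), the call might look like:

java -cp <classpath> data.chunk.genia.Genia2PTB /data/genia/treebank /data/genia/ptb-trees /data/genia/genia-ptb-mapping.txt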

Tip

There are a number of problematic sentences in the second set of 300 treebanked abstracts (in <ptb-trees> after processing by data.chunk.genia.Genia2PTB) that caused the chunklink script to fail. We removed them when building our model. The original GENIA file names are listed below for your reference. You need to remove these lines from the output of Genia2PTB. To find the converted file names, please look at <genia-ptb-name-mapping>.
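One way to locate the converted counterpart of a problematic GENIA file, assuming the mapping file records the original and converted names on the same line (an assumption about its format), is to search it directly:

grep '<original-genia-file-name>' <genia-ptb-name-mapping>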

...

perl chunklink_2-2-2000_for_conll.pl -NHhftc <ptb-trees>/wsj_????.mrg > <chunklink-chunks>
Where
*<chunklink-chunks>* is the redirected standard output from chunklink.
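Continuing the illustrative paths from above (adjust them to your own layout), the command might look like:

perl chunklink_2-2-2000_for_conll.pl -NHhftc /data/genia/ptb-trees/wsj_????.mrg > /data/genia/genia.chunks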

...

java -cp <classpath> data.chunk.Chunklink2OpenNLP <chunklink-chunks> <training-data>
Where
*<chunklink-chunks>* is the output of chunklink from the previous step.
*<training-data>* is the resulting training data file.
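For example, with the illustrative file names used so far:

java -cp <classpath> data.chunk.Chunklink2OpenNLP /data/genia/genia.chunks /data/genia/genia-chunker.train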

...

Preparing Penn Treebank data is similar to preparing GENIA data, as described in the section called Prepare GENIA training data in the cTAKES documentation on SourceForge, except that the first step is not necessary.

  • Run chunklink:


perl chunklink_2-2-2000_for_conll.pl -NHhftc <ptb-corpus>/wsj_????.mrg > <chunklink-chunks>

Where

*<ptb-corpus>* is your Penn Treebank corpus directory.
*<chunklink-chunks>* is the redirected standard output.
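For example, Penn Treebank releases usually group the merged .mrg files by section, so you may need to run chunklink once per section directory and concatenate the outputs; with an illustrative layout this might look like:

perl chunklink_2-2-2000_for_conll.pl -NHhftc /data/ptb/wsj/00/wsj_????.mrg > /data/ptb/ptb-00.chunks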

...

java -cp <classpath> data.chunk.Chunklink2OpenNLP <chunklink-chunks> <training-data>

Where
*<chunklink-chunks>* is the output of chunklink from the previous step.
*<training-data>* is the resulting training data file.


Build a model from your training data
Building a chunker model is much easier than preparing the training
data. After you have obtained training data, run the OpenNLP tool:


java -cp <classpath> opennlp.tools.chunker.ChunkerME <training-data> <model-name> <iterations> <cutoff>
Where
*<training-data>* is an OpenNLP training data file.
*<model-name>* is the file name of the resulting model. The name should end with either .txt (for a plain text model) or .bin.gz (for a compressed binary model).
*<iterations>* determines how many training iterations will be performed. The default is 100.
*<cutoff>* determines the minimum number of times a feature has to be seen to be considered for inclusion in the model. The default cutoff is 5.
The iterations and cutoff arguments are optional, but they must be supplied together: provide both or neither.
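For example, to train a compressed binary model while spelling out the default iteration and cutoff values explicitly (the file names are illustrative):

java -cp <classpath> opennlp.tools.chunker.ChunkerME /data/genia/genia-chunker.train genia-chunker-model.bin.gz 100 5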

...

The file cTAKESdesc/chunkerdesc/analysis_engine/ChunkerAggregate.xml provides a descriptor that defines a pipeline for shallow parsing so that all the necessary inputs (e.g. tokens, sentences, and POS tags) have been added to the CAS. It inherits two parameters from Chunker.xml and three from POSTagger.xml.

  • Start UIMA CPE GUI.

java -cp <classpath> org.apache.uima.tools.cpm.CpmFrame

...