Overview of Chunker

In cTAKES, when we refer to a "chunker" we often mean a shallow parser, i.e. a component that tags noun phrases, verb phrases, etc.

...

A chunker model is included with this project.

...

...

The model derives from a combination of GENIA, Penn Treebank (Wall Street Journal), and clinical data anonymized per the HIPAA Safe Harbor guidelines. Before the model was built, patient names were removed from the clinical data to preserve patient confidentiality. Any person name in the model therefore originates from non-patient data sources.

...

Building a model - Prepare GENIA training data

You need to download a copy of GENIA's Treebank corpus from tokyo.ac.jp/~genia/topics/Corpus/GTB.html. The version we used is called "beta". It is distributed as a set of two files: one dated Sept. 22, 2004, with 200 "abstracts", and the other dated July 11, 2005, with 300 "abstracts". Please download both. After extraction, place all the .tree files from the two downloads into one directory, which we'll refer to as <genia-trees>.

...

Next, we'll use data.chunk.genia.Genia2PTB to convert the GENIA Treebank corpus to Penn Treebank II format, then use chunklink to convert the result to chunk data, and finally use data.chunk.Chunklink2OpenNLP to convert the chunk data to OpenNLP format.
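As a rough illustration of the end product of this pipeline, the sketch below checks a small sample of chunker training data. It assumes the CoNLL-2000 style layout commonly used for OpenNLP chunker training (one token per line as word, POS tag, and chunk tag, with a blank line between sentences); the exact output of data.chunk.Chunklink2OpenNLP is not shown here, so treat that layout as an assumption. The class name ChunkDataCheck and the sample tokens are hypothetical and are not part of cTAKES or OpenNLP.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sanity check for chunker training data; not part of cTAKES or OpenNLP. */
class ChunkDataCheck {

    /** Assumed format: a token line has exactly three fields (word, POS tag, chunk tag). */
    static boolean isTokenLine(String line) {
        String trimmed = line.trim();
        return !trimmed.isEmpty() && trimmed.split("\\s+").length == 3;
    }

    /** Counts sentences, treating one or more blank lines as a sentence boundary. */
    static int countSentences(List<String> lines) {
        int sentences = 0;
        boolean inSentence = false;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                inSentence = false;
                continue;
            }
            if (!isTokenLine(line)) {
                throw new IllegalArgumentException("Malformed line: " + line);
            }
            if (!inSentence) {
                sentences++;
                inSentence = true;
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Made-up sample data in the assumed word / POS / chunk-tag layout.
        List<String> sample = Arrays.asList(
            "He PRP B-NP",
            "reckons VBZ B-VP",
            "",
            "The DT B-NP",
            "deficit NN I-NP");
        System.out.println("Sentences: " + countSentences(sample));
    }
}
```

A check like this can be run on the converted file before training to catch lines that were damaged during conversion.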

...

...

This Java class (a) renames the .tree files to names that look like wsj_0001.mrg, places them in the directory structure expected by chunklink, and creates a mapping from the new names to the original names; (b) changes the way POS tags are formatted; and (c) adds an extra set of parentheses to each line of the data.
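Steps (a) and (c) can be sketched in a few lines of Java. This is only an illustration, not the actual Genia2PTB code: the class name GeniaRenameSketch is hypothetical, the section subdirectory (taken from the first two digits of the file number, as in the Penn Treebank WSJ layout) is an assumption about what chunklink expects, and the exact spacing of the added parentheses is also assumed. Step (b) is omitted because the POS tag reformatting is not specified here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of the renaming and wrapping steps; not the real Genia2PTB. */
class GeniaRenameSketch {

    /** Builds a WSJ-style file name such as wsj_0001.mrg from a 1-based index. */
    static String ptbName(int index) {
        return String.format(Locale.ROOT, "wsj_%04d.mrg", index);
    }

    /** Assumption: files live under a two-digit section directory, e.g. 01/wsj_0100.mrg. */
    static String relativePath(int index) {
        String name = ptbName(index);
        String section = name.substring(4, 6); // first two digits of the file number
        return section + "/" + name;
    }

    /** Adds one extra set of parentheses around a tree line (spacing is an assumption). */
    static String wrap(String treeLine) {
        return "( " + treeLine.trim() + " )";
    }

    /** Maps each new relative path to the original file name, in input order. */
    static Map<String, String> buildMapping(List<String> originalNames) {
        Map<String, String> mapping = new LinkedHashMap<>();
        int index = 1;
        for (String original : originalNames) {
            mapping.put(relativePath(index++), original);
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Placeholder file names; real GENIA .tree files are named differently.
        Map<String, String> mapping = buildMapping(Arrays.asList("a.tree", "b.tree"));
        mapping.forEach((renamed, original) -> System.out.println(renamed + " <- " + original));
        System.out.println(wrap("(S (NP (NN x)))"));
    }
}
```

Keeping the mapping keyed by the new name makes it easy to go from a converted file back to the original GENIA file.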

...

  • Run data.chunk.genia.Genia2PTB:

...

There are a number of problematic sentences in the second set of 300 treebanked abstracts (in <ptb-trees> after processing by data.chunk.genia.Genia2PTB) that caused the chunklink script to fail. We removed them when building our model. The original GENIA file names are listed below for your reference. You need to remove the corresponding lines from the output of Genia2PTB. To find the converted file names, look in <genia-ptb-name-mapping>.

...

The chunklink script doesn't seem to work on Windows, but we did manage to run it in a Cygwin session.

...

  • Prepare Penn Treebank training data

...

Preparing Penn Treebank data is similar to preparing GENIA data, as described in the section "Prepare GENIA training data" above, except that the first step is not necessary.

...


java -cp <classpath> opennlp.tools.chunker.ChunkerME <training-data> <model-name> <iterations> <cutoff>

Where

  • <training-data> is an OpenNLP training data file.
  • <model-name> is the file name of the resulting model. The name should end with either .txt (for a plain text model) or .bin.gz (for a compressed binary model).
  • <iterations> determines how many training iterations will be performed. The default is 100.
  • <cutoff> determines the minimum number of times a feature has to be seen to be considered for inclusion in the model. The default cutoff is 5.

The iterations and cutoff arguments are optional, but only as a pair: provide both or provide neither.
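When the training command is launched from a script or build tool, the argument rules above are easy to get wrong. The sketch below assembles the program arguments (everything after the class name) and rejects invalid combinations before the trainer is started. The class ChunkerTrainArgs and the sample file names are hypothetical; the sketch does not call OpenNLP itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that validates arguments for the training command; not part of OpenNLP. */
class ChunkerTrainArgs {

    /**
     * Returns the program arguments in the documented order.
     * Pass null for both iterations and cutoff to use the defaults (100 and 5).
     */
    static List<String> build(String trainingData, String modelName,
                              Integer iterations, Integer cutoff) {
        if (!modelName.endsWith(".txt") && !modelName.endsWith(".bin.gz")) {
            throw new IllegalArgumentException("Model name must end with .txt or .bin.gz");
        }
        // Both-or-neither rule: exactly one missing value is an error.
        if ((iterations == null) != (cutoff == null)) {
            throw new IllegalArgumentException("Provide both iterations and cutoff, or neither");
        }
        List<String> args = new ArrayList<>();
        args.add(trainingData);
        args.add(modelName);
        if (iterations != null) {
            args.add(String.valueOf(iterations));
            args.add(String.valueOf(cutoff));
        }
        return args;
    }

    public static void main(String[] args) {
        // Placeholder file names, chosen only for illustration.
        System.out.println(build("chunk-train.txt", "chunker.bin.gz", null, null));
        System.out.println(build("chunk-train.txt", "chunker.bin.gz", 100, 5));
    }
}
```

The returned list can be appended to the java -cp <classpath> opennlp.tools.chunker.ChunkerME prefix, for example through ProcessBuilder.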

Analysis engines (annotators)

Chunker.xml

The file cTAKESdesc/chunkerdesc/analysis_engine/Chunker.xml provides a descriptor for the Chunker analysis engine, the UIMA component we wrote to wrap the OpenNLP chunker. It calls edu.mayo.bmi.uima.chunker.Chunker, whose Javadoc provides information on how to customize this descriptor.

Parameters
ModelFile - the file that contains the chunker tagging model
ChunkerCreatorClass - the full class name of an implementation of the interface edu.mayo.bmi.uima.chunker.ChunkerCreator

ChunkerAggregate.xml

The file cTAKESdesc/chunkerdesc/analysis_engine/ChunkerAggregate.xml provides a descriptor that defines a pipeline for shallow parsing so that all the necessary inputs (e.g. tokens, sentences, and POS tags) have been added to the CAS. It inherits two parameters from Chunker.xml and three from POSTagger.xml.

...