Overview of Chunker

In cTAKES, when we refer to a "chunker" we often mean a shallow parser, i.e. a component that tags noun phrases, verb phrases, and so on.

...

Info

The model derives from a combination of GENIA, Penn Treebank (Wall Street Journal) and clinical data anonymized per the Safe Harbor HIPAA guidelines. Before the model was built, patient names were removed from the clinical data to preserve patient confidentiality. Any person name in the model originates from non-patient data sources.

Building a model - Prepare GENIA training data

Download a copy of the GENIA Treebank corpus from tokyo.ac.jp/~genia/topics/Corpus/GTB.html. The version we used is called "beta". It is distributed as a set of two files: one dated Sept. 22, 2004, with 200 "abstracts", and the other dated July 11, 2005, with 300 "abstracts". Please download both. After extraction, place all the .tree files from the two downloads into one directory, which we'll refer to as <genia-trees>.
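
The gathering step above can be scripted. The following is a minimal sketch, assuming the two archives have already been extracted into separate directories; the function name and the directory names in the usage note are hypothetical and not part of cTAKES or GENIA.

```python
import shutil
from pathlib import Path


def collect_tree_files(extracted_dirs, genia_trees):
    """Copy every .tree file found under the extracted download directories
    into one flat directory and return the copied file names, sorted."""
    target = Path(genia_trees)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for source_dir in extracted_dirs:
        # sorted() materializes the file list before any copying starts.
        for tree_file in sorted(Path(source_dir).rglob("*.tree")):
            destination = target / tree_file.name
            if destination.exists():
                # Refuse to silently overwrite a file from the other download.
                raise FileExistsError(f"duplicate file name: {tree_file.name}")
            shutil.copy2(tree_file, destination)
            copied.append(tree_file.name)
    return sorted(copied)
```

For example, collect_tree_files(["download-2004", "download-2005"], "genia-trees") would fill the directory referred to as <genia-trees>, provided the target directory lies outside both source directories.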

...

Where
<chunklink-chunks> is the output of chunklink from the previous step.
<training-data> is the resulting training data file.
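
Before training, it can help to sanity-check the resulting file. The sketch below assumes that <training-data> uses the CoNLL-2000-style layout commonly used for OpenNLP chunker training: one token per line as "word POS-tag chunk-tag", with a blank line between sentences. That layout is an assumption here, and the function name is hypothetical.

```python
def check_chunker_training_lines(lines):
    """Return a list of (line_number, message) problems found in the data."""
    problems = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            continue  # a blank line marks a sentence boundary
        fields = stripped.split()
        if len(fields) != 3:
            problems.append((number, f"expected 3 fields, found {len(fields)}"))
            continue
        chunk_tag = fields[2]
        # Assumed tag scheme: O, or B-/I- followed by a phrase type.
        if chunk_tag != "O" and not chunk_tag.startswith(("B-", "I-")):
            problems.append((number, f"unexpected chunk tag {chunk_tag!r}"))
    return problems
```

An empty result means every non-blank line has three fields and a chunk tag of the assumed form; it does not verify that the tags themselves are linguistically correct.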

Build a model from your training data

Building a chunker model is much easier than preparing the training data. After you have obtained training data, run the OpenNLP tool:

java -cp <classpath> opennlp.tools.chunker.ChunkerME <training-data> <model-name> iterations cutoff

Where
<training-data> is an OpenNLP training data file.
<model-name> is the file name of the resulting model. The name should end with either .txt (for a plain text model) or .bin.gz (for a compressed binary model).
iterations determines how many training iterations will be performed. The default is 100.
cutoff determines the minimum number of times a feature has to be seen to be considered for inclusion in the model. The default cutoff is 5.
The iterations and cutoff arguments are optional as a pair: provide both or provide neither.
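
The argument rules above can be captured in a small helper that assembles the command line. This is only a sketch: the function name is hypothetical, and the classpath and file names used with it are placeholders, not values shipped with cTAKES.

```python
def build_training_command(classpath, training_data, model_name,
                           iterations=None, cutoff=None):
    """Assemble the argument list for the OpenNLP chunker training run."""
    # iterations and cutoff are optional only as a pair.
    if (iterations is None) != (cutoff is None):
        raise ValueError("provide both iterations and cutoff, or neither")
    # The model name selects plain text or compressed binary output.
    if not model_name.endswith((".txt", ".bin.gz")):
        raise ValueError("model name should end with .txt or .bin.gz")
    command = ["java", "-cp", classpath, "opennlp.tools.chunker.ChunkerME",
               training_data, model_name]
    if iterations is not None:
        command += [str(iterations), str(cutoff)]
    return command
```

The returned list can be handed to subprocess.run; omitting both optional arguments leaves the tool on its defaults of 100 iterations and a cutoff of 5.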

Analysis engines (annotators)

Chunker.xml

The file cTAKESdesc/chunkerdesc/analysis_engine/Chunker.xml provides a descriptor for the Chunker analysis engine, the UIMA component we wrote to wrap the OpenNLP chunker. It calls edu.mayo.bmi.uima.chunker.Chunker, whose Javadoc provides information on how to customize this descriptor.

Parameters
ModelFile
the file that contains the chunker tagging model
ChunkerCreatorClass
the full class name of an implementation of the interface edu.mayo.bmi.uima.chunker.ChunkerCreator
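
To check which values a descriptor actually sets for these parameters, the settings can be read with a few lines of code. The sketch below assumes the standard UIMA descriptor layout (nameValuePair elements under configurationParameterSettings, in the UIMA resourceSpecifier namespace). The embedded sample is illustrative only: its model path and creator class name are hypothetical, not the contents of the real Chunker.xml.

```python
import xml.etree.ElementTree as ET

UIMA_NS = {"u": "http://uima.apache.org/resourceSpecifier"}

# Hypothetical, trimmed-down descriptor used only for illustration.
SAMPLE_DESCRIPTOR = """
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <analysisEngineMetaData>
    <configurationParameterSettings>
      <nameValuePair>
        <name>ModelFile</name>
        <value><string>path/to/chunker-model.bin.gz</string></value>
      </nameValuePair>
      <nameValuePair>
        <name>ChunkerCreatorClass</name>
        <value><string>org.example.MyChunkerCreator</string></value>
      </nameValuePair>
    </configurationParameterSettings>
  </analysisEngineMetaData>
</analysisEngineDescription>
"""


def read_string_parameters(descriptor_xml):
    """Return a dict of string-valued parameter settings in a descriptor."""
    root = ET.fromstring(descriptor_xml)
    settings = {}
    pairs = root.iterfind(
        ".//u:configurationParameterSettings/u:nameValuePair", UIMA_NS)
    for pair in pairs:
        name = pair.findtext("u:name", namespaces=UIMA_NS)
        value = pair.findtext("u:value/u:string", namespaces=UIMA_NS)
        if name is not None and value is not None:
            settings[name.strip()] = value.strip()
    return settings
```

Calling read_string_parameters on the text of a real descriptor file would list its ModelFile and ChunkerCreatorClass settings, as long as both are stored as string values.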

ChunkerAggregate.xml

The file cTAKESdesc/chunkerdesc/analysis_engine/ChunkerAggregate.xml provides a descriptor that defines a pipeline for shallow parsing so that all the necessary inputs (e.g. tokens, sentences, and POS tags) have been added to the CAS. It inherits two parameters from Chunker.xml and three from POSTagger.xml.

...