Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migration of unmigrated content due to installation of a new plugin

...

...

Overview of Chunker

In cTAKES when we refer to a "chunker" we often mean a shallow parser, i.e. a component that tags noun phrases, verb phrases, etc.

...

A chunker model is included with this project.

...

...

The model derives from a combination of GENIA, Penn Treebank (Wall Street Journal) and anonymized clinical data per Safe Harbor HIPAA guidelines. Prior to model building the clinical data was deidentified for patient names to preserve patient confidentiality. Any person name in the model will originate from non-patient data sources.

...

Building a model - Prepare GENIA training data

...

This Java class a) renames the .tree files to files that look like wsj_0001.mrg and puts them in a directory structure expected by chunklink and creates a mapping of the original new names to the old names; b) reformats the way pos tags are formatted; c) adds an extra set of parentheses to each line of the data.

...

There are a number of problematic sentences in the second set of 300 treebanked abstracts (in <ptb-trees> after processing by data.chunk.genia.Genia2PTB) that caused the chunklink script to fail. We removed them when building our model. The original GENIA file names are listed below for your reference. You need to remove the lines from the output of Genia2PTB. To find out the converted file names, please look at <genia-ptb-name-mapping>.

...

perl chunklink_2-2-2000_for_conll.pl -NHhftc <ptb-trees> /wsj????.mrg> <chunklink-chunks>
Where
<chunklink-chunks> is the redirected standard output from chunklink.

...


...

The chunklink script doesn't seem to work on Windows. But we did manage to run it in a Cygwin session.

...

  • Run data.chunk.Chunklink2OpenNLP

...