
...

...

Overview of Chunker

In cTAKES, when we refer to a "chunker" we usually mean a shallow parser, i.e., a component that tags noun phrases, verb phrases, and so on.
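
As a rough illustration of what a shallow parse looks like, the sketch below groups per-token chunk tags into phrases. It assumes the B-/I-/O tagging scheme commonly used for chunking (for example B-NP, I-NP). The function name `group_chunks` and the sample sentence are made up for this example and are not part of cTAKES.

```python
def group_chunks(tokens, tags):
    """Group tokens into (label, text) phrases from B-/I-/O chunk tags."""
    chunks = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            # Close the phrase in progress, then open a new one.
            if current_tokens:
                chunks.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the phrase in progress.
            current_tokens.append(token)
        else:
            # "O": the token is outside any phrase.
            if current_tokens:
                chunks.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        chunks.append((current_label, " ".join(current_tokens)))
    return chunks


# Hypothetical sample sentence with hand-assigned tags.
tokens = ["The", "patient", "denies", "chest", "pain", "."]
tags = ["B-NP", "I-NP", "B-VP", "B-NP", "I-NP", "O"]
print(group_chunks(tokens, tags))
```

Here the two noun phrases and the verb phrase come out as separate entries, and the final period, tagged O, belongs to no phrase.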

...

The model was built from a combination of GENIA, Penn Treebank (Wall Street Journal), and clinical data anonymized per the HIPAA Safe Harbor guidelines. Before model building, patient names were removed from the clinical data to preserve patient confidentiality. Any person name in the model therefore originates from non-patient data sources.

...

Next, we'll use data.chunk.genia.Genia2PTB to convert the GENIA Treebank corpus to Penn Treebank II format, then use chunklink to convert the result to chunk data, and finally use data.chunk.Chunklink2OpenNLP to convert the chunk data to OpenNLP format.

...

...

This Java class a) renames the .tree files to names that look like wsj_0001.mrg, puts them in the directory structure expected by chunklink, and creates a mapping from the new names to the original names; b) changes the format of the POS tags; c) adds an extra set of parentheses to each line of the data.
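
To make steps a) and c) more concrete, here is a minimal Python sketch. It is not the Genia2PTB implementation. The helper names and the sample file names are hypothetical, and the directory layout (a two-digit section directory taken from the first two digits of the file number, as in the Penn Treebank WSJ corpus) is our assumption. Step b) is not shown because the exact POS tag formats are not described here.

```python
def ptb_style_path(index):
    """Return a PTB-like relative path such as '00/wsj_0001.mrg'.

    Assumption: the section directory is the first two digits of the
    four-digit file number, as in the Penn Treebank WSJ layout.
    """
    number = f"{index:04d}"
    return f"{number[:2]}/wsj_{number}.mrg"


def wrap_tree_line(line):
    """Add one extra pair of parentheses around a bracketed tree line."""
    return f"( {line.strip()} )"


def build_name_mapping(original_names):
    """Map each new PTB-style path to the original file name."""
    return {
        ptb_style_path(i): name
        for i, name in enumerate(sorted(original_names), start=1)
    }


# Hypothetical input file names, for illustration only.
mapping = build_name_mapping(["b.tree", "a.tree"])
print(mapping)
print(wrap_tree_line("(S (NP (NN protein)) (VP (VBZ binds)))"))
```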

...

  • Run data.chunk.genia.Genia2PTB:

...

There are a number of problematic sentences in the second set of 300 treebanked abstracts (in <ptb-trees> after processing by data.chunk.genia.Genia2PTB) that cause the chunklink script to fail. We removed them when building our model. The original GENIA file names are listed below for your reference. You need to remove the corresponding lines from the output of Genia2PTB. To find the converted file names, look at <genia-ptb-name-mapping>.
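
Looking up the converted names can be scripted. The sketch below assumes, purely for illustration, that the mapping file holds one tab-separated pair per line in the order converted name, original name. Check the actual contents of <genia-ptb-name-mapping> before relying on this. The function names and sample names are hypothetical.

```python
def parse_name_mapping(text):
    """Parse 'converted<TAB>original' lines into {original: converted}.

    Assumption: this line format is hypothetical; adapt it to the real
    mapping file.
    """
    lookup = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        converted, original = line.split("\t", 1)
        lookup[original] = converted
    return lookup


def converted_names(mapping_text, original_names):
    """Return the converted name for each original; None if not found."""
    lookup = parse_name_mapping(mapping_text)
    return {name: lookup.get(name) for name in original_names}


# Hypothetical mapping contents and file names.
sample = "wsj_0001.mrg\tfirst.tree\nwsj_0002.mrg\tsecond.tree\n"
print(converted_names(sample, ["second.tree", "missing.tree"]))
```

Names that come back as None were not found in the mapping and should be checked by hand.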

...

The chunklink script doesn't seem to work on Windows, but we did manage to run it in a Cygwin session.

...

  • Prepare Penn Treebank training data

Please refer to the section called "Obtaining training data" in the cTAKES documentation on SourceForge for how to obtain the Penn Treebank corpus.

Preparing Penn Treebank data is similar to preparing GENIA data, as described above in the section called "Prepare GENIA training data" (see also the cTAKES documentation on SourceForge), except that the first step is not necessary.

...