...

Overview of Dependency Parser

Dependency parsers provide syntactic information about sentences. Unlike deep parsers, they do not explicitly find phrases (e.g., NP or VP); rather, they find the dependencies between words. For example, "hormone replacement therapy" would have deep structure:

...

The implementation in cTAKES v1.1 is based on revision 75.

...

Dependency parsers often assume lemmas (normalized word forms) and POS tags as input. This cTAKES component obtains lemmas and POS tags from the upstream LVG and POS tagger components.

Overview of Semantic Role Labeler

The semantic role labeler assigns the predicate-argument structure of the sentence (who did what to whom, when, and where).

Analysis Engines and other Descriptors

analysis_engine/ClearParserAE.xml

This analysis engine wraps the dependency parser's prediction function (i.e., finding dependency trees from text). It takes lemmas and POS tags from the normalizedForm and partOfSpeech attributes of BaseTokens that have been found in cTAKES (i.e., are in the CAS). This is the analysis engine that should be dropped into any new pipelines to get dependency parses.

...

MorphDictionaryDirectory
Leave empty to use POS tags from cTAKES. Enter en_dict to use the Clear Morphological Analyzer.

analysis_engine/ClearParserPlaintextAggregate.xml

An aggregate engine appropriate for use with CVD or other tools that act directly on plain text.

analysis_engine/ClearParserTokenizedAggregate.xml and ClearParserTokenizedInfPosAggregate.xml

Aggregate engines appropriate for use in CPEs. The first of these assumes that POS tags have been given in an upstream component or directly from data. The second infers POS tags using the cTAKES POS tagger.

analysis_engine/ClearTrainerAE.xml and ClearTrainerAggregate.xml

These analysis engines train models for use in ClearParserAE. See the "Training data" section below for further details.

analysis_engine/LemAssigner.xml, LvgBaseTokenAnnotator.xml, and PosAssigner.xml

These analysis engines are upstream components that complement ClearTrainerAE. See the "Training data" section below for further details.

collection_reader/DependencyFileCollectionReader.xml

Reads in a single file with dependency data in the formats described in the "Data Format" section below. The file is treated as a single document with many sentences that are separated by blank lines.

cas_consumer/DependencyNodeWriter.xml

Writes ConllDependencyNode objects (the internal form used for dependency parses) to the .dep format. Refer to the "Data Format" section below.

Resources and Models

clinques.mod

The main ClearParser model packaged with cTAKES v1.1. This is trained on a corpus of 1600 clinical questions.

lexicon*/

A directory of additional files for a ClearParser model. For example, "deprel.txt" contains the set of dependency labels; "pos.txt" contains the set of POS tags.

...

...

When doing training within this project, the specified lexicon directory must first be created separately.

en_dict/

A directory used by the Clear Morphological Analyzer to create lemmas. This analyzer is an alternative to using LVG output from cTAKES. Descriptor files in the project are set up to use the Clear Morphological Analyzer if a valid location is passed as the "Morph Dictionary Directory" parameter to the Analysis Engines.

feature.xml

Tells the dependency parser what features to base its dependency decisions on.

config_en.xml

This file is not used when cTAKES is running ClearParser. You can follow the manual on the ClearParser website to run tests or training from the command line, which would make use of this file.

en_clinques.headrules

In order to convert from standard phrase structure trees, head rules tell you which child in a tree is the head (loosely, the most important). These are used by clear.engine.PhraseToDep.
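To make the idea of head rules concrete, here is a minimal sketch of head-driven conversion from a phrase structure tree to word-to-word dependencies. It is not the algorithm or file format used by clear.engine.PhraseToDep or en_clinques.headrules: the rule table, the tree encoding, and the function names are all hypothetical, and the flat NP used for "hormone replacement therapy" is a simplification.

```python
# Hypothetical head rules: phrase label -> (search direction, preferred child labels).
# These are illustrative only and are NOT taken from en_clinques.headrules.
HEAD_RULES = {
    "NP": ("right", ["NN", "NNS", "NP"]),
    "VP": ("left", ["VB", "VBZ", "VBD", "VP"]),
    "S": ("left", ["VP", "S"]),
}


def is_leaf(node):
    """A leaf is (POS tag, word); an internal node is (label, [children])."""
    return isinstance(node[1], str)


def head_child_index(label, children):
    """Pick the index of the head child according to the rule for this label."""
    direction, preferred = HEAD_RULES.get(label, ("left", []))
    order = list(range(len(children)))
    if direction == "right":
        order.reverse()
    for wanted in preferred:
        for i in order:
            if children[i][0] == wanted:
                return i
    # No preferred label matched: fall back to the first child in search order.
    return order[0]


def to_dependencies(node, deps):
    """Return the head word of `node`; append (dependent, head) pairs to `deps`.

    Words are used directly as identifiers, so repeated words in one
    sentence would need token indices instead.
    """
    if is_leaf(node):
        return node[1]
    label, children = node
    heads = [to_dependencies(child, deps) for child in children]
    h = head_child_index(label, children)
    for i, word in enumerate(heads):
        if i != h:
            deps.append((word, heads[h]))
    return heads[h]


# Simplified, flat tree for the phrase used earlier in this page.
tree = ("NP", [("NN", "hormone"), ("NN", "replacement"), ("NN", "therapy")])
deps = []
root = to_dependencies(tree, deps)
```

With these made-up rules, the rightmost noun is chosen as the head of the NP, so both other words attach to "therapy".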

Data Format

The format of training data into DependencyFileCollectionReader and of output data from DependencyNodeWriter is the same. Files should have one word per line alongside several other tab-delimited attributes. Sentences are separated by a blank line.
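The layout described above (one token per line, tab-delimited attributes, blank lines between sentences) can be read with a few lines of code. The following is a minimal sketch, not part of cTAKES: the function name is hypothetical, the column order ID, FORM, LEMMA, POS, HEAD, DEPREL is an assumption, and the sample rows are invented for illustration rather than taken from data/sample.dep.

```python
from typing import Dict, Iterable, Iterator, List

# Assumed column order for the .dep format (an assumption, not confirmed here).
DEP_COLUMNS = ["ID", "FORM", "LEMMA", "POS", "HEAD", "DEPREL"]


def read_sentences(
    lines: Iterable[str], columns: List[str] = DEP_COLUMNS
) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by column name."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != len(columns):
            raise ValueError(
                f"expected {len(columns)} tab-delimited fields, got {len(fields)}"
            )
        sentence.append(dict(zip(columns, fields)))
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence


# Invented rows for illustration only.
sample = (
    "1\thormone\thormone\tNN\t3\tNMOD\n"
    "2\treplacement\treplacement\tNN\t3\tNMOD\n"
    "3\ttherapy\ttherapy\tNN\t0\tROOT\n"
    "\n"
    "1\tStop\tstop\tVB\t0\tROOT\n"
)
sentences = list(read_sentences(sample.splitlines()))
```

Passing a different `columns` list lets the same reader handle files with a different column layout.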

An example snippet of dependency data from data/sample.dep is repeated here.

Example: Dependency parser data: .dep format (the first line is for reference).

...

The popular CONLL format is also supported for input into cTAKES; this requires several extra columns. However, not all of those columns will be used for ClearParser parsing.

Derivative Formats

The parser can also use formats derived from this .dep format.

Alternative formats

  • .mlem: ID, FORM, LEMMA, HEAD, DEPREL
  • .mpos: ID, FORM, POS, HEAD, DEPREL
  • .min: ID, FORM, HEAD, DEPREL
  • .tok: ID, FORM

Of these, .tok is typically used for testing the actual parses, and the rest are used for training new models.
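Since each alternative format keeps a subset of the columns, a derived file can be produced by dropping columns from a full .dep file. The sketch below illustrates this under the assumption that the full .dep column order is ID, FORM, LEMMA, POS, HEAD, DEPREL; the function name is hypothetical and this is not a tool shipped with cTAKES or ClearParser.

```python
# Assumed column order of the full .dep format (an assumption).
DEP_COLUMNS = ["ID", "FORM", "LEMMA", "POS", "HEAD", "DEPREL"]

# Column sets of the alternative formats, as listed above.
DERIVED_FORMATS = {
    ".mlem": ["ID", "FORM", "LEMMA", "HEAD", "DEPREL"],
    ".mpos": ["ID", "FORM", "POS", "HEAD", "DEPREL"],
    ".min": ["ID", "FORM", "HEAD", "DEPREL"],
    ".tok": ["ID", "FORM"],
}


def project_line(line, target, source_columns=DEP_COLUMNS):
    """Convert one .dep line to the `target` format by keeping only its columns."""
    if not line.strip():
        # Preserve blank lines, since they separate sentences.
        return ""
    row = dict(zip(source_columns, line.split("\t")))
    return "\t".join(row[name] for name in DERIVED_FORMATS[target])


# Invented row for illustration only.
full = "3\ttherapy\ttherapy\tNN\t0\tROOT"
tok_line = project_line(full, ".tok")
min_line = project_line(full, ".min")
```

Applying `project_line` to every line of a .dep file, and writing the results out unchanged in order, yields a file in the chosen derived format.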

Conversion between formats

Data from resources such as WSJ or Genia typically come as trees that were originally used for deep parsers (as in the example above). The dependency parser project comes with a tool to convert between those trees and the .dep format.

...

  • using Data > Import > From Text File,
  • selecting the .dep file,
  • identifying tabs as the delimiter,
  • deleting unwanted columns,
  • using File > Export to save in a tab-delimited format.

Training data

The packaged model is one trained on Clinical Questions, in part because it is small enough to package with cTAKES. If this is not your domain, you may want to train other models. At the time of cTAKES release 1.1, however, there are no clinical document Treebanks to train on.

...

The Penn Treebank project annotates naturally occurring text for linguistic structure. Obtaining a copy of Release 2 is non-trivial; please read the Penn Treebank page on the University of Pennsylvania website.

Training a model in Eclipse

There are many types of models that can be trained, based on how much you want to rely on cTAKES components and how much you want to rely on other components.

...