Overview of Dependency Parser

Dependency parsers provide syntactic information about sentences. Unlike deep parsers, they do not explicitly find phrases (e.g., NP or VP); rather, they find the dependencies between words. For example, "hormone replacement therapy" would have deep structure:

    (NP (NML (NN hormone) (NN replacement)) (NN therapy))

but its dependency structure would show that "hormone" depends on "replacement" and "replacement" in turn depends on "therapy". Below, the first column of numbers indicates the ID of the word, and the second number indicates the ID of the word it is dependent on.

    23  hormone      hormone      NN  24  NMOD
    24  replacement  replacement  NN  25  NMOD
    25  therapy      therapy      NN  22  PMOD

Dependency parses can be labeled as well, e.g., we could specify that "hormone" is in a noun-modifier (i.e., NMOD) relationship with "replacement" in the example above (the last column).
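
To make the head pointers concrete, here is a minimal Python sketch that walks the HEAD column of the three example rows. The function name `head_chain` and the tuple layout are illustrative assumptions, not part of cTAKES or ClearParser.

```python
# Rows copied from the example above: (ID, FORM, LEMMA, POS, HEAD, DEPREL).
ROWS = [
    (23, "hormone", "hormone", "NN", 24, "NMOD"),
    (24, "replacement", "replacement", "NN", 25, "NMOD"),
    (25, "therapy", "therapy", "NN", 22, "PMOD"),
]


def head_chain(rows, start_id):
    """Follow HEAD pointers from start_id until the head is not among rows.

    Hypothetical helper for illustration; returns (form, deprel, head_id) triples.
    """
    by_id = {row[0]: row for row in rows}
    chain = []
    seen = set()
    current = start_id
    while current in by_id and current not in seen:
        seen.add(current)  # guard against accidental cycles in malformed data
        _, form, _, _, head, deprel = by_id[current]
        chain.append((form, deprel, head))
        current = head
    return chain


if __name__ == "__main__":
    for form, deprel, head in head_chain(ROWS, 23):
        print(f"{form} --{deprel}--> word {head}")
```

Starting from word 23, the chain passes through "replacement" and "therapy" and stops at word 22, which lies outside the excerpt.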

This project provides a UIMA wrapper and some utilities for ClearParser, a transition-based dependency parser that achieves state-of-the-art accuracy and speed.

Info: The implementation in cTAKES v1.1 is based on revision 75.

ClearParser is described in:

"K-best, Locally Pruned, Transition-based Dependency Parsing Using Robust Risk Minimization." Jinho D. Choi, Nicolas Nicolov, Collections of Recent Advances in Natural Language Processing V, 205-216, John Benjamins, Amsterdam & Philadelphia, 2009.

Dependency parses often assume lemmas (normalized word forms) and POS tags as input. This cTAKES component infers lemmas and POS tags from upstream LVG and POS tagger components.

Analysis Engines and other Descriptors

analysis_engine/ClearParserAE.xml

This analysis engine wraps the dependency parser's prediction function (i.e., finding dependency trees from text). It takes lemmas and POS tags from the normalizedForm and partOfSpeech attributes of BaseTokens that have been found in cTAKES (i.e., are in the CAS). This is the analysis engine that should be dropped into any new pipelines to get dependency parses.

Parameters

DependencyModelFile
the file that contains the ClearParser transition model
the file that contains the features (leave this as feature.xml unless
you want to modify the internals of the dependency parser)

LexiconDirectory
the directory that contains ClearParser tagsets, etc.

MorphDictionaryDirectory
leave empty to use POS tags from cTAKES. Enter en_dict to use the Clear Morphological Analyzer.

analysis_engine/ClearParserPlaintextAggregate.xml

An aggregate engine appropriate for use with CVD or other tools that act directly on plain text.

analysis_engine/ClearParserTokenizedAggregate.xml and ClearParserTokenizedInfPosAggregate.xml

Aggregate engines appropriate for use in CPEs. The first of these assumes that POS tags have been given in an upstream component or directly from data. The second infers POS tags using the cTAKES POS tagger.

analysis_engine/ClearTrainerAE.xml and ClearTrainerAggregate.xml

These analysis engines train models for use in ClearParserAE. See the "Training a model" section below for further details.

analysis_engine/LemAssigner.xml, LvgBaseTokenAnnotator.xml, and PosAssigner.xml

These analysis engines are upstream components that complement ClearTrainerAE. See the "Training a model" section below for further details.

collection_reader/DependencyFileCollectionReader.xml

Reads in a single file with dependency data in the formats described in the "Data Format" section below. The file is treated as a single document with many sentences that are separated by blank lines.

cas_consumer/DependencyNodeWriter.xml

Writes ConllDependencyNode objects (the internal form used for dependency parses) to the .dep format. Refer to the "Data Format" section below.

Resources and Models

clinques.mod

The main ClearParser model packaged with cTAKES v1.1. This is trained on a corpus of 1600 clinical questions.

lexicon*/

A directory of additional files for a ClearParser model. For example, "deprel.txt" contains the set of dependency labels; "pos.txt" contains the set of POS tags.

Important: When doing training within this project, the specified lexicon directory must first be created separately.

en_dict/

A directory used by the Clear Morphological Analyzer to create lemmas. This analyzer is an alternative to using LVG output from cTAKES. Descriptor files in the project are set up to use the Clear Morphological Analyzer if this valid location is passed as the "Morph Dictionary Directory" parameter to Analysis Engines.

feature.xml

Tells the dependency parser what features to base its dependency decisions on.

config_en.xml

This file is not used when cTAKES is running ClearParser. You can follow the manual on the ClearParser website to run tests or training from the command line, which would make use of this file.

en_clinques.headrules

When converting from standard phrase structure trees to dependencies, head rules specify which child of each node in the tree is the head (loosely, the most important child). These are used by clear.engine.PhraseToDep.
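
The idea behind head rules can be sketched in a few lines of Python. This is a toy illustration only: the `HEAD_RULES` table below is hypothetical and does not reproduce the syntax or contents of en_clinques.headrules, and the functions are not the clear.engine.PhraseToDep implementation. Words are identified by their form here, whereas a real converter would use word IDs.

```python
import re

# Hypothetical head rules (assumption for illustration): for each phrase
# label, a search direction and a priority list of child labels.
HEAD_RULES = {
    "NP": ("right", ["NN", "NML"]),
    "NML": ("right", ["NN"]),
}


def parse_tree(text):
    """Parse a bracketed tree into (label, children) or (label, word)."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        label = tokens[pos + 1]          # tokens[pos] is "("
        pos += 2
        if tokens[pos] != "(":           # preterminal: (POS word)
            word = tokens[pos]
            pos += 2                     # skip the word and ")"
            return (label, word)
        children = []
        while tokens[pos] == "(":
            children.append(node())
        pos += 1                         # skip ")"
        return (label, children)

    return node()


def head_child(label, children):
    """Pick the head child of a phrase according to HEAD_RULES."""
    direction, priorities = HEAD_RULES.get(label, ("right", []))
    ordered = children[::-1] if direction == "right" else children
    for wanted in priorities:
        for child in ordered:
            if child[0] == wanted:
                return child
    return ordered[0]                    # fallback: first child in search order


def head_word(tree):
    """Return the word that heads a subtree."""
    label, rest = tree
    return rest if isinstance(rest, str) else head_word(head_child(label, rest))


def dependencies(tree, deps=None):
    """Collect (dependent, head) pairs: the head word of every non-head
    child depends on the head word of its parent phrase."""
    deps = [] if deps is None else deps
    label, rest = tree
    if isinstance(rest, str):
        return deps
    chosen = head_child(label, rest)
    for child in rest:
        if child is not chosen:
            deps.append((head_word(child), head_word(chosen)))
        dependencies(child, deps)
    return deps


if __name__ == "__main__":
    tree = parse_tree("(NP (NML (NN hormone) (NN replacement)) (NN therapy))")
    for dependent, head in dependencies(tree):
        print(f"{dependent} -> {head}")
```

With these toy rules, the "hormone replacement therapy" tree from the overview yields "replacement" depending on "therapy" and "hormone" depending on "replacement".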

Data Format

The format of training data into DependencyFileCollectionReader and of output data from DependencyNodeWriter is the same. Files should have one word per line alongside several other tab-delimited attributes. Sentences are separated by a blank line.

An example snippet of dependency data from data/sample.dep is shown below.

Example: Dependency parser data: .dep format (the first line is for reference).

    ID  FORM          LEMMA         POS  HEAD  DEPREL
    1   The           the           DT   3     NMOD
    2   study         study         NN   3     NMOD
    3   physician     physician     NN   4     SBJ
    4   called        call          VBD  0     ROOT
    5   a             a             DT   6     NMOD
    6   hematologist  hematologist  NN   4     OBJ

Description of .dep format fields:
ID
The word number within a sentence. There is an implied ROOT word that has an ID of 0.

FORM
The word itself.

LEMMA
A normalized form of the word, stripping suffixes, etc.

POS
The part-of-speech associated with the word.

HEAD
The ID of the word that the word on the current line is dependent on.

DEPREL
The relationship between the current word and its head. This may be syntactic or semantic.
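
As a rough illustration of the field layout, the following Python sketch reads .dep text into sentences and checks that every HEAD refers to a word in the same sentence or to the implied ROOT (0). The function names are hypothetical, and the sketch assumes the file itself contains only data rows (no header line) with exactly six tab-delimited fields.

```python
from typing import Dict, List

FIELDS = ["ID", "FORM", "LEMMA", "POS", "HEAD", "DEPREL"]


def parse_dep(text: str) -> List[List[Dict[str, str]]]:
    """Split .dep text into sentences; each word is a dict keyed by FIELDS."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():             # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} tab-delimited fields: {line!r}")
        current.append(dict(zip(FIELDS, values)))
    if current:
        sentences.append(current)
    return sentences


def check_heads(sentence: List[Dict[str, str]]) -> None:
    """Raise if a HEAD is neither 0 (implied ROOT) nor an ID in the sentence."""
    ids = {word["ID"] for word in sentence}
    for word in sentence:
        if word["HEAD"] != "0" and word["HEAD"] not in ids:
            raise ValueError(f"word {word['ID']} has unknown head {word['HEAD']}")


# The example sentence from data/sample.dep shown above.
SAMPLE = "\n".join([
    "1\tThe\tthe\tDT\t3\tNMOD",
    "2\tstudy\tstudy\tNN\t3\tNMOD",
    "3\tphysician\tphysician\tNN\t4\tSBJ",
    "4\tcalled\tcall\tVBD\t0\tROOT",
    "5\ta\ta\tDT\t6\tNMOD",
    "6\thematologist\thematologist\tNN\t4\tOBJ",
])

if __name__ == "__main__":
    for sentence in parse_dep(SAMPLE):
        check_heads(sentence)
        for word in sentence:
            print(word["FORM"], "<-", word["DEPREL"], "- head", word["HEAD"])
```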

The popular CONLL format is also supported for input into cTAKES; this requires several extra columns. However, not all of those columns will be used for ClearParser parsing.
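
For orientation, the sketch below pads one .dep line out to ten columns. It assumes the CoNLL-X shared-task layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with "_" for unused columns; this page does not confirm which CoNLL variant the cTAKES reader expects, and `dep_to_conllx` is a hypothetical helper.

```python
# Assumed CoNLL-X column order; not confirmed by this page for cTAKES.
CONLLX_COLUMNS = [
    "ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
    "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL",
]


def dep_to_conllx(dep_line: str) -> str:
    """Map one 6-column .dep line onto ten columns; blank lines stay blank."""
    if not dep_line.strip():
        return ""
    word_id, form, lemma, pos, head, deprel = dep_line.split("\t")
    row = {
        "ID": word_id, "FORM": form, "LEMMA": lemma,
        "CPOSTAG": pos, "POSTAG": pos,   # the single .dep POS fills both tag columns
        "HEAD": head, "DEPREL": deprel,
    }
    return "\t".join(row.get(column, "_") for column in CONLLX_COLUMNS)


if __name__ == "__main__":
    print(dep_to_conllx("4\tcalled\tcall\tVBD\t0\tROOT"))
```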

Derivative Formats

The parser will use formats derived from this .dep format, as well.

Alternative formats

    .mlem   ID, FORM, LEMMA, HEAD, DEPREL
    .mpos   ID, FORM, POS, HEAD, DEPREL
    .min    ID, FORM, HEAD, DEPREL
    .tok    ID, FORM

Of these, .tok is typically used for testing the actual parses, and the rest are used for training new models.

Conversion between formats

Data from resources such as WSJ or Genia typically come as trees that were originally used for deep parsers (as in the example above). The dependency parser project comes with a tool to convert between those trees and the .dep format.

    java -cp lib/args4j-2.0.12.jar:bin clear.engine.PhraseToDep -h resources/en_clinques.headrules -i <input-file> -m resources/en_dict <output-file>

Where
<input-file> - A file in bracketed tree format, e.g., data/sample.tree
<output-file> - A file to be written out in .dep format

To down-convert from .dep to the other formats, use cut in Linux:

    cut -f 1,2,3,5,6 data/sample.dep > data/sample.mlem
    cut -f 1,2,4,5,6 data/sample.dep > data/sample.mpos
    cut -f 1,2,5,6 data/sample.dep > data/sample.min
    cut -f 1,2 data/sample.dep > data/sample.tok

In Windows, this operation can be done in spreadsheet applications like Excel by:

  • using Data > Import > From Text File,
  • selecting the .dep file,
  • identifying tabs as the delimiter,
  • deleting unwanted columns,
  • using File > Export in a tab-delimited format.
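
As a platform-independent alternative, the same column selection can be sketched in Python. The `KEEP` table mirrors the cut field lists above; the function names are illustrative assumptions and are not part of the dependency parser project.

```python
from pathlib import Path

# 1-based .dep column numbers kept by each derivative format,
# matching the cut -f lists above.
KEEP = {
    ".mlem": (1, 2, 3, 5, 6),
    ".mpos": (1, 2, 4, 5, 6),
    ".min": (1, 2, 5, 6),
    ".tok": (1, 2),
}


def down_convert(dep_text: str, extension: str) -> str:
    """Return dep_text reduced to the columns of the given derivative format."""
    columns = KEEP[extension]
    out = []
    for line in dep_text.split("\n"):
        if not line.strip():
            out.append("")               # keep blank sentence separators
            continue
        fields = line.split("\t")
        out.append("\t".join(fields[c - 1] for c in columns))
    return "\n".join(out)


def convert_file(dep_path: str, extension: str) -> Path:
    """Write e.g. sample.min next to sample.dep and return the new path."""
    source = Path(dep_path)
    target = source.with_suffix(extension)
    target.write_text(
        down_convert(source.read_text(encoding="utf-8"), extension),
        encoding="utf-8",
    )
    return target


if __name__ == "__main__":
    example = "4\tcalled\tcall\tVBD\t0\tROOT\n"
    print(down_convert(example, ".min"), end="")
```

For example, `convert_file("data/sample.dep", ".min")` would write data/sample.min, assuming data/sample.dep exists and is UTF-8 encoded.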

Training a model

Training data

The packaged model is one trained on Clinical Questions, in part because it is small enough to package with cTAKES. If this is not your domain, you may want to train other models. At the time of cTAKES release 1.1, however, there are no clinical document Treebanks to train on. The other options are:

  • GENIA

GENIA is a literature mining project in molecular biology from the University of Tokyo. Its corpus, a collection of biomedical literature, has been annotated with phrase-structure trees. You can download a copy from the GENIA project at the University of Tokyo.

  • Penn Treebank

The Penn Treebank project annotates naturally-occurring text for linguistic structure. Obtaining a copy of Release 2 is non-trivial; please read the Penn Treebank page on the University of Pennsylvania website.

Training a model in Eclipse

There are many types of models that can be trained, based on how much you want to rely on cTAKES components and how much you want to rely on other components.

1. Download and install the C++ version of liblinear from National Taiwan University; this requires much less memory than the default Java version.
2. Train a model

  • To create a model using cTAKES POS tags and lemmas with Eclipse:

    1. Create a <your-data>.min file from <your-data>.dep (see the section called "Conversion between formats").
    2. Use the UIMA_CPE_GUI---dependency parser launch.
    3. Load desc/collection_processing_engine/ClearTrainerPosLemTestCPE.xml
    4. Put your filename under "Dependency File".
    5. Make sure "Training Mode" is checked.
    6. Rename the "Dependency Model File" and "Lexicon Directory" according to what you want.
    7. Make sure "Trainer Path" is a valid relative path from <cTAKES_HOME>/dependency parser to a valid liblinear binary train file.

  • To create a model using gold standard POS tags and Clear Morphological Analyzer lemmas:

    java -cp lib/hppc-0.3.1.jar:bin:resources clear.engine.DepTrain -t <training_data> -c <configuration_file>

Where
-t <training_data> - an OpenNLP training data file.
-c <configuration_file> - the file name of the resulting model. The name should end with either .txt (for a plain text model) or .bin.gz (for a compressed binary model).