Warning
This page is under construction.

Dictionaries

The dictionaries and models used during annotation are the cornerstone of the quality of your results. The install instructions show you how to get the separately downloadable ctakes-resources archive (which is not itself released by the Apache Software Foundation) that you need to run cTAKES. Those resources include:

An RxNorm_index database (a Lucene index): Contains drug names from RxNorm.
The OrangeBook: If you are not using the Drug NER pipeline, the Orange Book is used to filter the RxNorm matches so that only terms found in both RxNorm and the Orange Book are annotated. If you use Drug NER, Orange Book filtering is bypassed.
A UMLS database (using two hsqldb tables): Contains terms for anatomical sites, procedures, signs/symptoms, and disorders/diseases from SNOMED-CT, NCI Thesaurus, MeSH, and ICD-9 (umls_ms_2011ab) which have been tokenized by cTAKES.
The full LVG: From the lexical tools provided by the NLM for word normalization. Used to match similar words, for example the plural and singular forms of a word.
cTAKES models: Statistical models for tasks such as detecting sentence endings and assigning part-of-speech tags, chunk tags, and dependency parses. They are derived from a combination of clinical and non-clinical text.
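The Orange Book rule in the list above amounts to a set intersection that is skipped when Drug NER is in use. The following is a minimal sketch of that rule only; the class name, method name, and sample drug strings are hypothetical and are not taken from the cTAKES code base.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration of the Orange Book rule described above; not cTAKES code.
final class OrangeBookFilterSketch {

    /**
     * Returns the RxNorm hits that would be annotated.
     * Without Drug NER, only hits that also appear in the Orange Book are kept.
     * With Drug NER, the Orange Book filter is bypassed and all hits are kept.
     */
    static Set<String> filter(Set<String> rxNormHits, Set<String> orangeBook, boolean drugNerEnabled) {
        Set<String> result = new LinkedHashSet<>(rxNormHits);
        if (!drugNerEnabled) {
            result.retainAll(orangeBook); // keep only terms present in both sources
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample data invented for the illustration.
        Set<String> hits = new LinkedHashSet<>(Arrays.asList("aspirin", "sugar pill"));
        Set<String> orangeBook = new LinkedHashSet<>(Arrays.asList("aspirin"));
        System.out.println("Without Drug NER: " + filter(hits, orangeBook, false));
        System.out.println("With Drug NER:    " + filter(hits, orangeBook, true));
    }
}
```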

You may not need any dictionaries or models other than those provided in the separately downloadable ctakes-resources archive. However, the models in that archive were trained on a specific set of text data (a corpus), which may not match the characteristics of your text well. If you want to build or train your own models, please read the cTAKES 3.0 Component Use Guide, particularly:

Training a sentence detector model
Training a Part of Speech (POS) tagger model (Building a model - Obtaining training data)
Creating a Part of Speech (POS) tag dictionary (Building a tag dictionary)
Training a chunker model (Building a model - Prepare GENIA training data)
Training a dependency parser (Dependency Parser)

To use the UMLS resources, you must have a UMLS username and password, and an Internet connection.

...

Step

...

Example

3. UMLS user ID and password.

...

...

Note

If you plan to use the UMLS resources, set/export these environment variables:
export ctakes.umlsuser=[username], ctakes.umlspw=[password]
or add the system properties to the Java arguments:
-Dctakes.umlsuser=<username> -Dctakes.umlspw=<password>
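As a rough illustration of how the two mechanisms in the note relate, the sketch below looks a key up first as a Java system property and then as an environment variable. The class name, the method name, and the precedence order are assumptions made for this example; the note does not say how cTAKES itself resolves these values.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical illustration; this is not how cTAKES is documented to read its credentials.
final class UmlsCredentialLookupSketch {
    static final String USER_KEY = "ctakes.umlsuser";
    static final String PASSWORD_KEY = "ctakes.umlspw";

    /**
     * Looks the key up as a system property (-Dkey=value) first, then as an
     * environment variable. Returns null when neither source has a non-empty value.
     * The precedence order is an assumption of this sketch.
     */
    static String resolve(String key, Properties systemProperties, Map<String, String> environment) {
        String value = systemProperties.getProperty(key);
        if (value == null || value.isEmpty()) {
            value = environment.get(key);
        }
        return (value == null || value.isEmpty()) ? null : value;
    }

    public static void main(String[] args) {
        String user = resolve(USER_KEY, System.getProperties(), System.getenv());
        String password = resolve(PASSWORD_KEY, System.getProperties(), System.getenv());
        // Report only whether values were found; never print the password itself.
        System.out.println("UMLS user configured: " + (user != null));
        System.out.println("UMLS password configured: " + (password != null));
    }
}
```

Running it with -Dctakes.umlsuser=<username> -Dctakes.umlspw=<password> on the java command line is a quick way to check that the properties actually reach the JVM.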


Building Your Own Dictionaries

The UMLS dictionaries within the ctakes-resources archive might not match your underlying data completely. You might require other local terms.

In order to integrate the dictionaries you will need to do two things:
(1) Replace the UMLSUser and UMLSPW <nameValuePair> strings in these descriptor files with your UMLS username and password.

Dictionary Lookup: <cTAKES_HOME>/desc/ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml
(optional) Drug NER: <cTAKES_HOME>/desc/ctakes-drug-ner/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml

The following shows where in the files you would make the changes. (Do not change the <configurationParameters> of the same name.)

      <nameValuePair>
        <name>ctakes.umlsuser</name>
        <value>
          <string>YOUR_UMLS_USERNAME_HERE</string>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>ctakes.umlspw</name>
        <value>
          <string>YOUR_UMLS_PASSWORD_HERE</string>
        </value>
      </nameValuePair>

(2) Include the DictionaryLookupAnnotatorUMLS.xml Analysis Engine within your aggregate Analysis Engine, or switch to the aggregates provided by cTAKES. cTAKES provides duplicates of the shipped Analysis Engine descriptors, with UMLS in the file name and DictionaryLookupAnnotatorUMLS.xml already placed within them, for these components:

Dictionary Lookup
Clinical Documents pipeline
Drug NER
Side Effect

You then simply need to switch to those descriptors. For example, if you were using AggregateCdaProcessor.xml in the Clinical Documents pipeline, switch to AggregateCdaUMLSProcessor.xml instead to hook into the complete dictionaries.

You can, of course, modify your own aggregate Analysis Engine files and place the DictionaryLookupAnnotatorUMLS.xml Analysis Engine within them.
Since this is an in-memory database implementation, please be patient during the initial load as it could take approximately 20-30 seconds for the database to initialize.
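If you would rather script step (1) than edit the descriptors by hand, the sketch below shows one possible approach using the JDK's built-in DOM parser. The class and method names are hypothetical, and the sample XML in main is a cut-down stand-in for a real descriptor. The code only touches <nameValuePair> elements, so <configurationParameters> entries of the same name are left alone.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of cTAKES.
final class DescriptorCredentialSketch {

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse descriptor XML", e);
        }
    }

    /**
     * Sets the <string> value of every <nameValuePair> whose <name> equals pairName.
     * Returns the number of pairs that were updated.
     */
    static int setPairValue(Document doc, String pairName, String newValue) {
        int updated = 0;
        NodeList pairs = doc.getElementsByTagName("nameValuePair");
        for (int i = 0; i < pairs.getLength(); i++) {
            Element pair = (Element) pairs.item(i);
            NodeList names = pair.getElementsByTagName("name");
            NodeList strings = pair.getElementsByTagName("string");
            if (names.getLength() > 0 && strings.getLength() > 0
                    && pairName.equals(names.item(0).getTextContent().trim())) {
                strings.item(0).setTextContent(newValue);
                updated++;
            }
        }
        return updated;
    }

    /** Serializes the document back to a string without an XML declaration. */
    static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize descriptor XML", e);
        }
    }

    public static void main(String[] args) {
        // Cut-down stand-in for a descriptor; a real file contains much more.
        String xml = "<settings><nameValuePair><name>ctakes.umlsuser</name>"
                + "<value><string>YOUR_UMLS_USERNAME_HERE</string></value>"
                + "</nameValuePair></settings>";
        Document doc = parse(xml);
        int changed = setPairValue(doc, "ctakes.umlsuser", "my-umls-username");
        System.out.println("Pairs updated: " + changed);
        System.out.println(serialize(doc));
    }
}
```

To apply this to an actual descriptor, you would read the file contents into a string, call setPairValue once per credential, and write the serialized result back. Keep a backup of the original descriptor, since the serializer may change whitespace.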

Building Your Own Dictionaries

The UMLS dictionaries are unlikely to match your underlying data completely; other local terms may be required. To install customized dictionaries for RxNorm, SNOMED-CT, or other vocabularies that are available through the UMLS, see the following posts on the cTAKES forums:

https://cabig-kc.nci.nih.gov/Vocab/forums/viewtopic.php?f=28&t=423
https://cabig-kc.nci.nih.gov/Vocab/forums/viewtopic.php?f=28&t=80&start=20#p1459
Warning: cTAKES developers need to check whether those forum posts still apply to cTAKES 3.0.
