Warning

This page is under construction.


Dictionaries

The dictionaries and models used during annotation are the cornerstone of the quality of your results. The install instructions show you how to get the resources that you need to run cTAKES. Those resources include:

  • An RxNorm_index database (a Lucene index): Contains drug names from RxNorm.
  • The Orange Book: If you are not using the drug NER pipeline, the Orange Book is used to filter the RxNorm matches so that only terms found in both RxNorm and the Orange Book are annotated. If you use the drug NER pipeline, Orange Book filtering is bypassed.
  • A UMLS database (using two hsqldb tables): Contains anatomical sites, procedures, signs/symptoms, and disorders/diseases from SNOMED-CT (umls_ms_2011ab).
  • The full LVG: From the lexical tools provided by the NLM for word normalization. Used to match similar words, for example the plural and singular forms of a word.
  • cTAKES models: Statistical models for assigning sentence boundaries, part-of-speech tags, chunk tags, and dependency parses. They are derived from a combination of clinical and non-clinical text.
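
The Orange Book filtering rule described in the list above amounts to a set intersection with a bypass. The following is a minimal Python sketch of that rule only, not the cTAKES implementation: the function name `filter_drug_mentions`, the `use_drug_ner` flag, the case-insensitive comparison, and the sample terms are all assumptions made for illustration.

```python
def filter_drug_mentions(rxnorm_hits, orange_book_terms, use_drug_ner=False):
    """Return the RxNorm hits that would be annotated (hypothetical helper)."""
    if use_drug_ner:
        # Drug NER pipeline: Orange Book filtering is bypassed.
        return list(rxnorm_hits)
    # Case-insensitive comparison is an assumption; real matching may differ.
    allowed = {term.lower() for term in orange_book_terms}
    # Keep only hits that appear in both RxNorm and the Orange Book.
    return [hit for hit in rxnorm_hits if hit.lower() in allowed]


rxnorm_hits = ["aspirin", "ibuprofen", "examplomycin"]  # made-up sample data
orange_book = {"Aspirin", "Ibuprofen"}                  # made-up sample data

print(filter_drug_mentions(rxnorm_hits, orange_book))
# -> ['aspirin', 'ibuprofen']
print(filter_drug_mentions(rxnorm_hits, orange_book, use_drug_ner=True))
# -> ['aspirin', 'ibuprofen', 'examplomycin']
```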

Use the instructions below to change the default dictionaries and models. Why would you want to? cTAKES ships with a simple, very limited dictionary so that its functions work, since annotation depends on having at least one dictionary. cTAKES does not distribute the UMLS dictionaries (such as SNOMED-CT and RxNorm). In addition, the models made available by cTAKES were trained on a specific set of text data (a corpus), which may not match your text well.

Dictionaries

To make it easy to obtain common dictionaries, cTAKES maintains a SourceForge project where you can download a file with the following dictionaries:

...

To use them, you must have a UMLS username and password, and an Internet connection.

1. Download the dictionary resources.
Go to http://sourceforge.net/projects/ctakesresources/files/ and download the latest ZIP file from the ctakesresources project.

2. Put the resources in the proper place.
Unzip the files into a temporary location, for example, C:\stuff or /tmp.
Copy the org directory (contents and subdirectories, do not replace) to <cTAKES_HOME>/resources.

Windows:

xcopy /E /I C:\stuff\ctakes-resources-3.1.0\resources\org C:\apache-ctakes-3.0.0-incubating\resources\org

Linux:

cp -R /tmp/ctakes-resources-3.1.0/resources/org /usr/local/apache-ctakes-3.0.0-incubating/resources/
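
If you would rather script the copy step, the following Python sketch merges one directory tree into another. It reads "do not replace" as "keep any file that already exists at the destination", which is an interpretation rather than documented behavior. The helper name `merge_copy` is hypothetical, and the paths are the ones from the Linux example above.

```python
import shutil
from pathlib import Path


def merge_copy(src, dst):
    """Copy the tree under src into dst, keeping files that already exist."""
    src, dst = Path(src), Path(dst)
    copied = []
    for path in sorted(src.rglob("*")):
        target = dst / path.relative_to(src)
        if path.is_dir():
            # Recreate the directory structure, including empty directories.
            target.mkdir(parents=True, exist_ok=True)
        elif not target.exists():
            # Only copy files that are missing at the destination.
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(target)
    return copied


if __name__ == "__main__":
    # Example paths; adjust them to your own installation.
    src = Path("/tmp/ctakes-resources-3.1.0/resources/org")
    dst = Path("/usr/local/apache-ctakes-3.0.0-incubating/resources/org")
    if src.is_dir():
        for copied_file in merge_copy(src, dst):
            print("copied", copied_file)
```

The function returns the list of files it actually copied, so you can check which resources were new and which were left untouched.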

3. UMLS user ID and password.

Note: If you do not have a UMLS username and password, you may request one at UMLS Terminology Services.

Note: If you plan to use the UMLS resources, set/export the environment variables
export ctakes.umlsuser=[username], ctakes.umlspw=[password]
or add the system properties to the Java arguments
-Dctakes.umlsuser=<username> -Dctakes.umlspw=<password>
Many shells, including bash, do not accept variable names that contain dots. In that case, use the system properties.
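
One way to avoid typing the credentials on the command line is to keep them in ordinary environment variables and build the `-D` arguments from them in a small launcher script. The Python sketch below shows only that step. The variable names `CTAKES_UMLS_USER` and `CTAKES_UMLS_PW` and the helper `umls_jvm_args` are hypothetical names chosen for this example; only the `ctakes.umlsuser` and `ctakes.umlspw` property names come from the note above.

```python
import os


def umls_jvm_args(env=None):
    """Build the -D arguments carrying the UMLS credentials.

    Reads CTAKES_UMLS_USER and CTAKES_UMLS_PW (hypothetical names)
    from env, which defaults to the process environment.
    """
    env = os.environ if env is None else env
    missing = [k for k in ("CTAKES_UMLS_USER", "CTAKES_UMLS_PW") if k not in env]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return [
        "-Dctakes.umlsuser=" + env["CTAKES_UMLS_USER"],
        "-Dctakes.umlspw=" + env["CTAKES_UMLS_PW"],
    ]


# Demonstration with sample values instead of real credentials.
sample_env = {"CTAKES_UMLS_USER": "alice", "CTAKES_UMLS_PW": "secret"}
print(umls_jvm_args(sample_env))
# -> ['-Dctakes.umlsuser=alice', '-Dctakes.umlspw=secret']
```

The returned list can be placed right after `java` in whatever command you normally use to start cTAKES, for example as part of the argument list passed to `subprocess.run`.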

...