The dictionaries and models used during annotation are the cornerstone of the quality of your results. Use the instructions below to change the default dictionaries and models. Why change them? cTAKES ships with only a small, very limited dictionary, because annotation requires at least one dictionary to function. cTAKES does not distribute the UMLS dictionaries (such as SNOMED-CT and RxNorm). The models distributed with cTAKES were trained on a specific set of text data (a corpus), which may not match your text well.

Dictionaries

To make common dictionaries easy to obtain, cTAKES maintains a SourceForge project from which you can download a file containing the following dictionaries:

  • The OrangeBook
  • An rxnorm_index database (a Lucene index) containing drug names from RxNorm
  • A UMLS database (using two hsqldb tables) containing anatomical sites, procedures, signs/symptoms, and disorders/diseases from SNOMED-CT (umls_ms_2011ab)

To use them, you must have a UMLS username and password, and an Internet connection.

1. Download the dictionary resources.

Go to http://sourceforge.net/projects/ctakesresources/files/ and download the latest ZIP file from the ctakesresources project.

2. Put the resources in the proper place.

Unzip the files into a temporary location, for example, C:\stuff or /tmp.
Copy the org directory, including its contents and subdirectories, into <cTAKES_HOME>/resources. Merge it with the existing org directory; do not replace that directory.

Windows:

xcopy /E /I C:\stuff\ctakes-resources-3.1.0\resources\org C:\apache-ctakes-3.0.0-incubating\resources\org

Linux:

cp -R /tmp/ctakes-resources-3.1.0/resources/org /usr/local/apache-ctakes-3.0.0-incubating/resources/
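Steps 1 and 2 can also be scripted. The following is a minimal Python sketch (Python 3.8 or later), not an official cTAKES tool. The function name install_resources is hypothetical, and the sketch assumes the downloaded ZIP contains exactly one resources/org directory, as in the paths shown above. Files that already exist under the same name are overwritten, as with the copy commands above; everything else in the existing org directory is kept.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def install_resources(zip_path, ctakes_home):
    """Unzip the download and merge its 'org' tree into <ctakes_home>/resources/org."""
    target = Path(ctakes_home) / "resources" / "org"
    with tempfile.TemporaryDirectory() as tmp:
        # Unzip into a temporary location.
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        # Assumption: the archive holds exactly one '.../resources/org' directory.
        candidates = [
            p for p in Path(tmp).rglob("org")
            if p.is_dir() and p.parent.name == "resources"
        ]
        if len(candidates) != 1:
            raise ValueError(f"expected one resources/org directory, found {len(candidates)}")
        # Merge into the existing directory instead of replacing it.
        shutil.copytree(candidates[0], target, dirs_exist_ok=True)
    return target
```

For example, install_resources("ctakes-resources-3.1.0.zip", "/usr/local/apache-ctakes-3.0.0-incubating") would merge the downloaded dictionaries into that installation; both arguments are placeholders for your own paths.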

3. UMLS user ID and password.

Note: If you do not have a UMLS username and password, you may request one at UMLS Terminology Services.

If you plan to use the UMLS resources, set and export the environment variables
export ctakes.umlsuser=[username] ctakes.umlspw=[password]
or add the system properties to the Java arguments
-Dctakes.umlsuser=[username] -Dctakes.umlspw=[password]
Some shells, including bash, reject variable names that contain dots; in that case, use the system properties.
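If you launch cTAKES from a script, the system properties above can be assembled programmatically. The following Python sketch is only an illustration: the function name umls_jvm_args and the environment variable names CTAKES_UMLS_USER and CTAKES_UMLS_PW are hypothetical and are not read by cTAKES itself. Only the two -D property names come from this page.

```python
import os


def umls_jvm_args(env=None):
    """Build the -D system properties for the UMLS credentials.

    Reads CTAKES_UMLS_USER and CTAKES_UMLS_PW (hypothetical names chosen for
    this sketch) and raises KeyError if either one is missing.
    """
    env = os.environ if env is None else env
    user = env["CTAKES_UMLS_USER"]
    password = env["CTAKES_UMLS_PW"]
    return [f"-Dctakes.umlsuser={user}", f"-Dctakes.umlspw={password}"]
```

The returned list can be placed directly after "java" in the argument list you pass to subprocess.run; the rest of the command line depends on how you start cTAKES.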

To integrate the dictionaries, you need to do two things:
(1) Replace the UMLSUser and UMLSPW <nameValuePair> strings in these descriptor files with your UMLS username and password.

  • Dictionary Lookup: <cTAKES_HOME>/desc/ctakes-dictionary-lookup/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml
  • (optional) Drug NER: <cTAKES_HOME>/desc/ctakes-drug-ner/desc/analysis_engine/DictionaryLookupAnnotatorUMLS.xml

The following shows where in the files to make the changes. (Do not change the entries of the same name under <configurationParameters>.)

<nameValuePair>
<name>UMLSUser</name>
<value>
<string>YOUR_UMLS_USERNAME_HERE</string>
</value>
</nameValuePair>
<nameValuePair>
<name>UMLSPW</name>
<value>
<string>YOUR_UMLS_PASSWORD_HERE</string>
</value>
</nameValuePair>
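If you need to make this change in several descriptor files, it can be scripted. The following Python sketch uses only the standard library; the function names are hypothetical. It matches elements by local name so that it works whether or not the descriptor declares an XML namespace, and it touches only <nameValuePair> elements, leaving the configuration parameter declarations alone. Note that re-serializing with ElementTree may change namespace prefixes and drops XML comments, so keep a backup of the original file.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def set_umls_credentials(xml_text, user, password):
    """Return xml_text with the UMLSUser and UMLSPW <nameValuePair> strings replaced."""
    new_values = {"UMLSUser": user, "UMLSPW": password}
    root = ET.fromstring(xml_text)
    for pair in root.iter():
        # Only <nameValuePair> elements are edited; anything else is skipped.
        if local_name(pair.tag) != "nameValuePair":
            continue
        name = next((c for c in pair if local_name(c.tag) == "name"), None)
        key = (name.text or "").strip() if name is not None else ""
        if key not in new_values:
            continue
        # Replace the text of the <string> element inside <value>.
        for node in pair.iter():
            if local_name(node.tag) == "string":
                node.text = new_values[key]
    return ET.tostring(root, encoding="unicode")
```

To apply it, read the contents of DictionaryLookupAnnotatorUMLS.xml, pass them to set_umls_credentials together with your username and password, and write the result back to the file.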

(2) Include the DictionaryLookupAnnotatorUMLS.xml Analysis Engine in your aggregate Analysis Engine, or switch to the aggregates provided by cTAKES. cTAKES ships duplicates of its Analysis Engine descriptors that have UMLS in their names and already include DictionaryLookupAnnotatorUMLS.xml, for these components:

  • Dictionary Lookup
  • Clinical Documents pipeline
  • Drug NER
  • Side Effect

You then simply need to switch to those descriptors. For example, if you were using AggregateCdaProcessor.xml in the Clinical Documents pipeline, switch to AggregateCdaUMLSProcessor.xml to use the complete dictionaries.

You can, of course, modify your own aggregate Analysis Engine files and place the DictionaryLookupAnnotatorUMLS.xml Analysis Engine within them.
Since this is an in-memory database implementation, please be patient during the initial load as it could take approximately 20-30 seconds for the database to initialize.

If you would like to go back to using the small sample dictionaries, which do not require a UMLS username, use the DictionaryLookupAnnotator.xml Analysis Engine descriptor (without UMLS in the file name) in your aggregate. Merely removing your password from the DictionaryLookupAnnotatorUMLS.xml files will not switch you back to the small sample dictionaries.

LVG

We have successfully tested the 2008 release of the full LVG data. In order to use this release of the full LVG data you should:

  1. Download either the full version or the lite version from NIH Lexical Tools
  2. Extract the TGZ file that you downloaded to a temporary directory with a tool such as 7-Zip. On some operating systems, such as Windows, this may take two steps: first uncompress the file (gzip), then unpack the resulting TAR archive.
  3. Replace the directory <cTAKES_HOME>/resources/org/apache/ctakes/lvg/data/HSqlDb with data/HSqlDb from your extracted download. Replacing the entire directory is appropriate.
  4. In the future, you can upgrade to later versions of LVG by editing the <cTAKES_HOME>/resources/org/apache/ctakes/lvg/data/config/lvg.properties file, replacing "lvg2008" with the name of the new release.
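The LVG steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a tested upgrade procedure: the function names are hypothetical, the extracted archive is assumed to contain a data/HSqlDb directory as described in step 3, and the release name is replaced wherever it appears in lvg.properties.

```python
import shutil
import tarfile
from pathlib import Path


def extract_lvg(archive, workdir):
    """Uncompress and unpack the downloaded TGZ file in a single step."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(workdir)


def replace_hsqldb(extracted_hsqldb, ctakes_home):
    """Replace the entire cTAKES HSqlDb directory with the extracted one."""
    target = Path(ctakes_home) / "resources/org/apache/ctakes/lvg/data/HSqlDb"
    if target.exists():
        shutil.rmtree(target)
    shutil.copytree(extracted_hsqldb, target)
    return target


def rename_release(properties_file, old_name, new_name):
    """Replace the release name in lvg.properties; return the number of replacements."""
    path = Path(properties_file)
    text = path.read_text()
    count = text.count(old_name)
    path.write_text(text.replace(old_name, new_name))
    return count
```

For a later release you would call rename_release on <cTAKES_HOME>/resources/org/apache/ctakes/lvg/data/config/lvg.properties with "lvg2008" as the old name and the name of the new release as the new name. A return value of 0 means the old name was not found and the file content is unchanged.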

Building Your Own Dictionaries

The UMLS dictionaries are unlikely to match your underlying data completely; for example, you may need to add local terms. To install customized dictionaries for RxNorm, SNOMED-CT, or other vocabularies that are available through the UMLS, see the following posts on the cTAKES forums:

Models

Some models included in cTAKES may not represent your data distribution well. If you want to build or train your own models, please read the cTAKES 3.0 Component Use Guide, particularly:

  • Training a sentence detector model
  • Training a Part of Speech (POS) tagger model (Building a model - Obtaining training data)
  • Creating a Part of Speech (POS) tag dictionary (Building a tag dictionary)
  • Training a chunker model (Building a model - Prepare GENIA training data)
  • Training a dependency parser (Dependency Parser)