You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 13 Next »

This page is under construction.

Dictionaries

The dictionaries and models used during annotation indeed are the cornerstone of quality for your results. The install instructions show you how to get the resource that you need to run cTAKES. Those resources include:

  • An RxNorm_index database (a Lucene index): Contains drug names from RxNorm.
  • The OrangeBook: If you are not using the drug NER pipeline, the orange book is used to filter out what it found in RxNorm so that only things in both RxNorm and orange book are annotated. If you use Drug NER, orange book filtering is bypassed.
  • A UMLS database (using two hsqldb tables): Contains anatomical sites, procedures, signs/symptoms, and disorders/diseases from SNOMED-CT (umls_ms_2011ab).
  • The full LVG: From the lexical tools provided by the NLM for word normalization. Used to match similar words, for example the plural and singular forms of a word.
  • cTAKES models: Statistical models for assigning things like sentence endings, part of speech tags, chunk tags, dependency parses. They are derived from a combination of clinical and non clinical text.

You may not need to use any other dictionaries than those provided in these resources. However, the models made available by cTAKES have been trained on a specific set of text data (a corpus) which may not match well with your text. If you want to build or train your own models, please read the cTAKES 3.0 Component Use Guide, particularly:

  • Training a sentence detector model
  • Training a Part of Speech (POS) tagger model (Building a model Obtaining training data)
  • Creating a Part of Speech (POS) tag dictionary (Building a tag dictionary)
  • Training a chunker model (Building a model - Prepare GENIA training data)
  • Training a dependency parser (Dependency Parser)

Building Your Own Dictionaries

It is not likely that the UMLS dictionaries will match to your underlying data completely. Other local terms may be required, etc. To install customized dictionaries for RxNorm, SNOMED-CT, or other vocabularies that are available through the UMLS, see the following posts on the cTAKES forums:

  • No labels