...

Overview of Smoking status

The "smoking status" pipeline processes flat files or CDA (Clinical Document Architecture) documents to classify patient records into five predetermined categories: past smoker (P), current smoker (C), smoker (S), non-smoker (N), and unknown (U). Past and current smokers are distinguished by temporal expressions in the patient's medical records.
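
As a minimal sketch, the five category codes can be modelled as a Java enum. The enum name, constant names, and the fromCode helper are illustrative assumptions and are not taken from the pipeline's source; only the one-letter codes come from the description above.

```java
import java.util.Optional;

/** Illustrative enum (hypothetical name); only the codes P, C, S, N, U come from the text. */
enum SmokingStatus {
    PAST_SMOKER("P"),
    CURRENT_SMOKER("C"),
    SMOKER("S"),
    NON_SMOKER("N"),
    UNKNOWN("U");

    private final String code;

    SmokingStatus(String code) {
        this.code = code;
    }

    String code() {
        return code;
    }

    /** Looks up a category by its one-letter code, ignoring case and surrounding whitespace. */
    static Optional<SmokingStatus> fromCode(String code) {
        if (code == null) {
            return Optional.empty();
        }
        String normalized = code.trim();
        for (SmokingStatus status : values()) {
            if (status.code.equalsIgnoreCase(normalized)) {
                return Optional.of(status);
            }
        }
        return Optional.empty();
    }
}
```

Returning an Optional keeps unrecognized codes from being silently mapped to "unknown", which is itself a meaningful category here.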

Analysis engines (annotator)

SimulatedProdSmokingTAE.xml

The file desc/analysis_engine/SimulatedProdSmokingTAE.xml provides a working example of the smoking status pipeline built from the aggregate TAEs. This aggregate includes Token, Sentence, SentenceAdjuster, and ClassifiableEntries (which in turn invokes the ProductionPostSentenceAggregate annotators internally).

...

  • ExternalBaseAggregateTAE
  • SentenceAdjuster
  • ClassifiableEntriesAnnotator

...

...

SimulatedProdSmokingTAE_CDA.xml is also provided to process CDA documents. Its aggregate flow contains ExternalBaseAggregateTAE_CDA.xml, the annotator version that processes the document as a Clinical Document Architecture (CDA) file.

...

ProductionPostSentenceAggregate_step1.xml

The aggregate TAE in desc/analysis_engine/ProductionPostSentenceAggregate_step1.xml runs the first classification step via the KuRuleBasedClassifierAnnotator.

  • TokenizerAnnotator (core project)
  • KuRuleBasedClassifierAnnotator

...

...

This annotator is not contained in the aggregate flow, but is introduced via the resource settings of the ClassifiableEntriesAnnotator (see the method initialize() in this class). UIMAFramework.produceAnalysisEngine(taeSpecifierStep1, ResMgr, null) instantiates the AE, and CasCreationUtils.createCas(taeStep1.getAnalysisEngineMetaData()).getJCas() retrieves the CAS.

...

ProductionPostSentenceAggregate_step2_libsvm.xml

The file desc/analysis_engine/ProductionPostSentenceAggregate_step2_libsvm.xml is the Aggregate TAE used to run the second classification stage via the libSVM training module. Shipped with this annotator:

  • PcsClassifierAnnotator_libsvm,
  • ArtificialSentenceAnnotator,
  • SentenceAdjuster,
  • SmokingStatusDictionaryLookupAnnotator,
  • NegationAnnotator.

...

...

This annotator is not contained in the aggregate flow, but is introduced via the resource settings of the ClassifiableEntriesAnnotator (see the method initialize() in this class). UIMAFramework.produceAnalysisEngine(taeSpecifierStep2, ResMgr, null) instantiates the AE, and the process method of the ClassifiableEntriesAnnotator invokes it if the smoking status is known.

...

ExternalBaseAggregateTAE.xml

The file desc/analysis_engine/ExternalBaseAggregateTAE.xml provides an aggregate flow for the external annotators: SimpleSegmentAnnotator, TokenizerAnnotator, SentenceDetectorAnnotator, and LvgAnnotator. Shipped with this annotator:

  • SimpleSegmentAnnotator,
  • TokenizerAnnotator (core project),
  • SentDetectorAnnotator (core project),
  • LvgAnnotation (LVG project).

...

...

ExternalBaseAggregateTAE_CDA.xml is also provided to process CDA documents. Its aggregate flow contains the specialized class CdaCasInitializer (replacing the SimpleSegmentAnnotator used by the flat file/non-CDA version), which processes the document as a Clinical Document Architecture (CDA) file. This annotator is contained in the SimulatedProdSmokingTAE_CDA aggregate. Red text indicates components shipped with this annotator.

...

SentenceAdjuster.xml

The file desc/analysis_engine/SentenceAdjuster.xml drives the Java class edu.mayo.bmi.smoking.ae.SentenceAdjuster, an annotator that uses patterns, and rules about those patterns, to adjust certain annotations. This annotator was extended to handle sentence boundaries for the smoking status classification.

...

WordsInPattern <String/Multi-valued/Required>
(Default Value = 'no none never quit smoked ;') The list of words ("none", "no", etc.) used in the pattern.

ClassifiableEntriesAnnotator.xml

The file desc/analysis_engine/ClassifiableEntriesAnnotator.xml drives the Java class edu.mayo.bmi.smoking.ae.ClassifiableEntries. It converts Sentences to ClassifiableEntries (required by the SmokingStatus pipeline) and ultimately to RecordSentence.

...

UimaDescriptorStep2
(Default Value = '$main_root/desc/analysis_engine/ProductionPostSentenceAggregate_step2_libsvm.xml')
Annotator responsible for second classification step.

...

...

UimaDescriptorStep1 and UimaDescriptorStep2 are introduced as resources by the ClassifiableEntriesAnnotator during initialization. This allows the specified aggregates to be instantiated and their analysis to run on a separate asynchronous thread. It improves overall performance because the output of the ProductionPostSentenceAggregates is ready for the process method without requiring a synchronized data flow (i.e. an explicit aggregate flow in the component descriptor).
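
The general idea of running a nested step on a separate thread can be sketched with the standard java.util.concurrent classes. This is not the ClassifiableEntriesAnnotator code: the class AsyncStepRunner and its methods are hypothetical, and the nested analysis step is stood in for by a plain Function from a sentence to a label.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical wrapper that runs a nested analysis step off the caller's thread. */
final class AsyncStepRunner implements AutoCloseable {
    // One dedicated worker thread, created once (comparable to doing setup in initialize()).
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Function<String, String> step;

    AsyncStepRunner(Function<String, String> step) {
        this.step = step;
    }

    /** Submits a sentence and returns immediately; the label is delivered through the future. */
    CompletableFuture<String> submit(String sentence) {
        return CompletableFuture.supplyAsync(() -> step.apply(sentence), worker);
    }

    /** Stops accepting new work; already submitted tasks still finish. */
    @Override
    public void close() {
        worker.shutdown();
    }
}
```

A caller would submit sentences as they arrive and call join() on the returned future only when the label is actually needed.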

...

KuRuleBasedClassifierAnnotator.xml

The file desc/analysis_engine/KuRuleBasedClassifierAnnotator.xml drives the Java class edu.mayo.bmi.smoking.ae.KuRuleBasedClassifierAnnotator, a known vs. unknown classifier that uses smoking-related keywords.

...

UnknownWordsFile <String/Single-valued/Required>
(Default Value = 'ss/data/KU/unknown_words.txt') If this word/phrase appears, treat the sentence as UNKNOWN.
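
A keyword rule of this kind might look like the sketch below. It is an assumption-laden illustration, not the KuRuleBasedClassifierAnnotator logic: the class name, the case-insensitive substring matching, and the precedence of unknown phrases over smoking keywords are all hypothetical, and the word lists are supplied by the caller rather than read from unknown_words.txt.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical known-vs-unknown rule; matching strategy and precedence are assumptions. */
final class KnownUnknownRule {
    private final List<String> smokingKeywords;
    private final List<String> unknownPhrases;

    KnownUnknownRule(List<String> smokingKeywords, List<String> unknownPhrases) {
        this.smokingKeywords = smokingKeywords;
        this.unknownPhrases = unknownPhrases;
    }

    /** Returns "UNKNOWN" if an unknown phrase appears or no smoking keyword is found. */
    String classify(String sentence) {
        String text = sentence.toLowerCase(Locale.ROOT);
        // An unknown phrase overrides any smoking keyword in the same sentence.
        for (String phrase : unknownPhrases) {
            if (text.contains(phrase.toLowerCase(Locale.ROOT))) {
                return "UNKNOWN";
            }
        }
        for (String keyword : smokingKeywords) {
            if (text.contains(keyword.toLowerCase(Locale.ROOT))) {
                return "KNOWN";
            }
        }
        return "UNKNOWN";
    }
}
```

With the made-up lists ["smok", "tobacco"] and ["smoke detector"], a sentence about a smoke detector is treated as UNKNOWN even though it contains a smoking keyword.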

PcsClassifierAnnotator_libsvm.xml

The file desc/analysis_engine/PcsClassifierAnnotator.xml is the smoking status classifier using libsvm. This annotator plays the same role as PcsBOWFeatureAnnotator.xml, PcsClassifierAnnotator.xml, and BOWFeatureRemovalAnnotator.xml, which use libsvm.

...

PathOfModel
(Default Value = 'file:ss/data/PCS/pcs_libsvm-2.91.model')
Resource file that provides trained model for smoking status classification.

ArtificialSentenceAnnotator.xml

The file desc/analysis_engine/ArtificialSentenceAnnotator.xml drives the Java class edu.mayo.bmi.uima.core.ae.CopyAnnotator. It artificially creates a new SentenceAnnotation object by treating the entire document as a sentence. The offset values from the DocumentAnnotation object are transferred to the new SentenceAnnotation object.

...

dataBindMap <String/Multi-valued/Required>
(Default Value = 'false')
Binds data from source to destination.
Each entry pairs the getter method name of the source with the setter method name of the destination, separated by '|', e.g. getMyValue|setMyValue
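
To make the getter|setter entry format concrete, here is a small sketch that splits such entries into a map. The class DataBindMapParser is hypothetical and is not how CopyAnnotator reads the parameter; the strict validation (exactly one '|', both sides non-empty) is also an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for entries of the form getterName|setterName. */
final class DataBindMapParser {
    private DataBindMapParser() {
    }

    /** Maps each source getter name to its destination setter name, keeping entry order. */
    static Map<String, String> parse(String[] entries) {
        Map<String, String> bindings = new LinkedHashMap<>();
        for (String entry : entries) {
            int bar = entry.indexOf('|');
            boolean malformed = bar <= 0
                    || bar == entry.length() - 1
                    || entry.indexOf('|', bar + 1) >= 0;
            if (malformed) {
                throw new IllegalArgumentException("Expected getter|setter but got: " + entry);
            }
            bindings.put(entry.substring(0, bar).trim(), entry.substring(bar + 1).trim());
        }
        return bindings;
    }
}
```

For example, parsing the illustrative entries getBegin|setBegin and getEnd|setEnd yields a two-entry map, matching the offset transfer described above.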

SmokingStatusDictionaryLookupAnnotator.xml

The file desc/analysis_engine/SmokingStatusDictionaryLookupAnnotator.xml drives the Java class edu.mayo.bmi.uima.lookup.ae.DictionaryLookupAnnotator. It performs dictionary lookup and stores the hits as NamedEntityAnnotation objects.

...

NonSmokerDictionary
(Default Value = 'file:ss/data/nonsmoker.dictionary')
Resource file that provides terms used as non-smoking words, e.g. '"non-smoker"'.

NegationAnnotator.xml

The file desc/analysis_engine/NegationAnnotator.xml drives the Java class edu.mayo.bmi.uima.context.ContextAnnotator. The boundary tokens have been moved to an external resource, ss/data/context/boundaryData.txt.

Resources
BoundaryData
(Default Value = 'file:ss/data/context/boundaryData.txt')
Resource file that provides terms used as sentence boundaries, e.g. '"nevertheless" "how" ";" "."'.

...

...

The parameters provided act the same way as those of the core's version of the 'NegationAnnotator'. Because the boundary stop words are different for the smoking status pipeline, a separate implementation was necessary. However, the current release of 'NegationAnnotator' does not use this resource.

...

CAS consumers - RecordResolutionCasConsumer.xml

The CAS consumer provided in /desc/cas_consumer/RecordResolutionCasConsumer.xml drives the Java class edu.mayo.bmi.smoking.cc.RecordResolutionCasConsumer, which iterates over all sentences (each CAS equals one sentence) for a record and resolves the final classification value for the record. Output is saved to a delimited file. Optionally, it also provides the overall patient-level classification based on the record-level classifications.

...

Resources
libsvm-2.91.jar
The support vector machine (SVM) classification tool, provided at /lib/libsvm-2.91.jar, is used to train the smoking status model.

How to Create your own smoking status classifier model

  • Create sentence-level smoking status data in the format sentence|class_label, where class_label is P, C, or S.

...
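
The sentence|class_label format from the first step can be read with a few lines of Java. This is a sketch under stated assumptions: the class TrainingLine is hypothetical, the split is made on the last '|' so that a sentence may itself contain that character, and labels other than P, C, and S are rejected.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical holder for one training line of the form sentence|class_label. */
final class TrainingLine {
    private static final Set<String> ALLOWED_LABELS = new HashSet<>(Arrays.asList("P", "C", "S"));

    final String sentence;
    final String label;

    private TrainingLine(String sentence, String label) {
        this.sentence = sentence;
        this.label = label;
    }

    /** Splits on the last '|' and validates that the label is P, C, or S. */
    static TrainingLine parse(String line) {
        int bar = line.lastIndexOf('|');
        if (bar < 0) {
            throw new IllegalArgumentException("Missing '|' separator: " + line);
        }
        String sentence = line.substring(0, bar).trim();
        String label = line.substring(bar + 1).trim();
        if (sentence.isEmpty()) {
            throw new IllegalArgumentException("Empty sentence: " + line);
        }
        if (!ALLOWED_LABELS.contains(label)) {
            throw new IllegalArgumentException("Label must be P, C, or S: " + line);
        }
        return new TrainingLine(sentence, label);
    }
}
```

Validating each line up front makes malformed rows fail early, before any feature extraction or libsvm training is attempted.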