
Introduction

YTEX provides a generalizable framework for computing semantic similarity measures (path-finding, corpus information content based, and intrinsic information content based) from any domain ontology. This page describes the usage and configuration of the YTEX Semantic Similarity Tools. For a high-level overview, refer to our paper: Semantic similarity in the biomedical domain: an evaluation across knowledge sources.

Semantic similarity measures include path-finding measures, which rely purely on path distances, and information content (IC) based measures, which combine taxonomic relationships with the IC of concepts, a measure of concept frequency. Both kinds of measure use a concept graph in which vertices represent concepts and edges represent taxonomical relationships. The similarity between two concepts is computed from the length of the path between the concepts and their nearest common ‘parent’. Previous studies that used a large, annotated medical corpus to estimate concept frequencies showed that IC based measures of semantic similarity outperform path-finding measures. Unfortunately, large annotated corpora are not available for many applications. To overcome this limitation, methods that estimate IC from the structure of the concept graph have been developed, and their accuracy has been shown to rival that of corpus-based measures.
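As a rough illustration of the idea of estimating IC from graph structure, the sketch below computes an intrinsic IC from descendant counts on a toy taxonomy and plugs it into Lin's measure. It uses the descendant-count formula of Seco et al., which is not necessarily the formula YTEX implements; the concept names, the PARENTS table, and all function names are made up for the example.

```python
import math

# Hypothetical toy taxonomy: concept -> list of parents (is-a links).
PARENTS = {
    "disease": [],
    "heart_disease": ["disease"],
    "lung_disease": ["disease"],
    "myocarditis": ["heart_disease"],
    "pneumonia": ["lung_disease"],
}

def ancestors(concept):
    """Return the concept itself plus every concept reachable via parent links."""
    seen = {concept}
    stack = [concept]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def descendant_count(concept):
    """Number of concepts, excluding the concept itself, that it subsumes."""
    return sum(1 for c in PARENTS if c != concept and concept in ancestors(c))

def intrinsic_ic(concept):
    """Seco-style intrinsic IC: 1 - log(descendants + 1) / log(total concepts)."""
    return 1.0 - math.log(descendant_count(concept) + 1) / math.log(len(PARENTS))

def lcs(a, b):
    """Least common subsumer: the shared ancestor with the highest IC."""
    return max(ancestors(a) & ancestors(b), key=intrinsic_ic)

def lin(a, b):
    """Lin similarity: 2 * IC(lcs) / (IC(a) + IC(b)), a value in [0, 1]."""
    denom = intrinsic_ic(a) + intrinsic_ic(b)
    return 2.0 * intrinsic_ic(lcs(a, b)) / denom if denom else 0.0
```

In this toy graph the root subsumes everything and so carries no information (IC of 0), while leaves have the maximum IC of 1. Two leaves whose only shared ancestor is the root therefore get a Lin similarity of 0.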

Usage

YTEX provides a web application client, web services interface, RESTful interface, and command-line interface to compute similarity measures. The demo similarity web app is available at http://informatics.med.yale.edu/ytex.web; if you plan to use this application extensively, please install YTEX locally. Please refer to Sanchez & Batet for an excellent overview of similarity measures in general, and intrinsic information content (IC) based measures in particular. We scale all measures to the unit interval; see YTEX Semantic Similarity Measures for details.

YTEX allows the declarative definition of concept graphs, in which nodes represent concepts and edges represent taxonomical relationships, and can compute the similarity between nodes in these graphs. YTEX comes with the following concept graphs derived from the UMLS (version 2013AA):

  • sct-rxnorm: concepts from SNOMED-CT and RXNORM.  This is the default.
  • sct-msh-csp-aod: concepts from the SNOMED-CT, MeSH, CRISP, and Alcohol and Drug thesaurus
  • umls: concepts from all restriction free (level 0) UMLS source vocabularies and SNOMED-CT

You can configure additional concept graphs (see below).

Note that you must perform the additional YTEX installation tasks to use this component.  You must install the UMLS if you want to create your own concept graphs.

Similarity Web App

The similarity web app allows you to select

...

The similarity web application has two pages:

Similarity Single

Compute similarities for a single concept pair. In addition to the similarity values, this page outputs the path between concepts. You can enter the text of the concept, and the application will attempt to find the corresponding concept id (CUI). Alternatively, you can simply enter the concept id.

Similarity Multiple

Compute the similarity between multiple pairs of concepts. Enter each concept pair on a different line, and separate concepts by a comma or whitespace. The output can be exported to a CSV file or Excel spreadsheet.

Similarity Web/RESTful Services

As with the web application, you can specify the concept graph, concept pairs, and measures for which similarities should be computed. Both methods accept a list of measures; these are:

  • Path-Finding Measures:
    • WUPALMER: Wu & Palmer
    • LCH: Leacock & Chodorow
    • PATH: Path
    • RADA: Rada
  • Corpus IC Based Measures:
    • LIN: Lin
  • Intrinsic IC Based Measures:
    • INTRINSIC_LIN: Intrinsic IC based Lin
    • INTRINSIC_LCH: Intrinsic IC based Leacock & Chodorow
    • INTRINSIC_PATH: Intrinsic IC based Path, identical to Jiang & Conrath
    • INTRINSIC_RADA: Intrinsic IC based Rada
    • JACCARD: Intrinsic IC based Jaccard
    • SOKAL: Intrinsic IC based Sokal & Sneath

RESTful interface

To get the similarity between a pair of concepts using the umls concept graph and the LCH and Intrinsic LCH measures: http://informatics.med.yale.edu/ytex.web/services/rest/similarity?conceptGraph=umls&concept1=C0018787&concept2=C0024109&metrics=LCH,INTRINSIC_LCH&lcs=true

...

To get the 'default' concept graph: http://informatics.med.yale.edu/ytex.web/services/rest/getDefaultConceptGraph
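The similarity request above can be assembled programmatically. The sketch below builds the same URL from its parts using only the Python standard library; the function name similarity_url and its parameter names are made up for this example, and it assumes that lcs=false is an acceptable value, since only lcs=true appears in the example above.

```python
from urllib.parse import urlencode

# Base URL of the demo service quoted above; replace it for a local install.
BASE_URL = "http://informatics.med.yale.edu/ytex.web/services/rest"

def similarity_url(concept_graph, concept1, concept2, metrics, lcs=True):
    """Build a GET URL for the similarity resource from its query parameters."""
    params = {
        "conceptGraph": concept_graph,
        "concept1": concept1,
        "concept2": concept2,
        "metrics": ",".join(metrics),
        # Assumption: "false" is accepted; the documented example only shows "true".
        "lcs": "true" if lcs else "false",
    }
    # Keep commas literal so the metrics list reads as in the example URL.
    return BASE_URL + "/similarity?" + urlencode(params, safe=",")

url = similarity_url("umls", "C0018787", "C0024109", ["LCH", "INTRINSIC_LCH"])
print(url)
```

The resulting URL can be fetched with any HTTP client, for example urllib.request.urlopen(url). The response format is not described on this page, so the sketch stops at building the request.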

Web Services interface

The Web Services interface is analogous to the RESTful interface, but allows the computation of similarities for multiple concept pairs. See http://informatics.med.yale.edu/ytex.web/services/conceptSimilarityWebService?wsdl

Command-Line Interface

The ConceptSimilarityServiceImpl java program accepts a list of concept pairs, and outputs their similarities in a tab-delimited format. It accepts the following arguments:

Java Command Line Arguments:

  • -Dytex.conceptGraphName: to override the default concept graph name (defined in ytex.properties, by default sct-rxnorm). e.g. -Dytex.conceptGraphName=umls to use the umls concept graph.
  • -Xmx<memory>: sets the Java heap size. The amount of memory needed depends on the concept graph: 256 MB is sufficient for the sct-rxnorm graph (-Xmx256m), while the larger umls concept graph requires 1200 MB (-Xmx1200m).

ConceptSimilarityServiceImpl Command Line Arguments:

  • -metrics: required; comma-separated list of metrics (see above for valid values).
  • -out: optional; file to write the output to. If not specified, output goes to standard output.
  • -lcs: whether to output the least common subsumer and paths for each concept pair.
  • -concepts: a list of concept pairs, or a file with concept pairs. For a file place each concept pair on a separate line, separate concepts by whitespace or commas. For a list of concept pairs, separate each concept by a comma, each pair by a semicolon:
 
cd CTAKES_HOME
bin\setenv.bat
java -cp %CLASSPATH% -Dlog4j.configuration=file:/%CTAKES_HOME%/config/log4j.xml -Xmx256m org.apache.ctakes.ytex.kernel.metric.ConceptSimilarityServiceImpl -concepts C0018787,C0024109;C0034069,C0242379 -metrics LCH,INTRINSIC_LCH
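When the concept pairs come from another program, the -concepts value and the pairs file can be generated rather than typed. The following sketch formats pairs according to the rules described above (comma between concepts, semicolon between pairs, one pair per line in a file); the helper names are invented for this example and are not part of YTEX.

```python
def format_concept_pairs(pairs):
    """Join (cui1, cui2) tuples as 'cui1,cui2;cui3,cui4' for the -concepts option."""
    return ";".join(f"{a},{b}" for a, b in pairs)

def pairs_file_text(pairs):
    """Render one comma-separated concept pair per line, for a file passed to -concepts."""
    return "".join(f"{a},{b}\n" for a, b in pairs)

def similarity_args(pairs, metrics, out=None):
    """Assemble the program arguments that follow the class name on the java command line."""
    args = ["-concepts", format_concept_pairs(pairs), "-metrics", ",".join(metrics)]
    if out is not None:
        args += ["-out", out]  # optional output file; standard output otherwise
    return args

pairs = [("C0018787", "C0024109"), ("C0034069", "C0242379")]
print(" ".join(similarity_args(pairs, ["LCH", "INTRINSIC_LCH"])))
```

Passing the arguments as a list (for instance to subprocess.run) avoids shell quoting problems with the semicolons that separate the pairs.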

The concept graph that will be used is set by the ytex.conceptGraphName key in <YTEX_HOME>/resources/org/apache/ctakes/ytex/ytex.properties (default: sct-rxnorm); alternatively, you can specify the concept graph with the java option -Dytex.conceptGraphName=<concept graph>. The amount of memory needed depends on the concept graph, as described for the -Xmx option above.

Web Interface

To start the Semantic Similarity Web Application, run CTAKES_HOME\bin\ytexweb.bat (Windows) or CTAKES_HOME/bin/ytexweb.sh (Linux), then open http://localhost:8080/semanticSim.jsf.

Configuration

Creating a Concept Graph

To create a concept graph, you create a properties file that contains a query that retrieves all the edges from a taxonomy. The ConceptDaoImpl does the following:

...

2) Run the ConceptDaoImpl: (modify the memory option -Xmx1g to as much memory as you can spare)

cd CTAKES_HOME
bin\setenv.bat
java -cp %CLASSPATH% -Dlog4j.configuration=file:/%CTAKES_HOME%/config/log4j.xml -Xmx1g org.apache.ctakes.ytex.kernel.dao.ConceptDaoImpl -name <concept graph name>

You will get warnings about removing cycles. The concept graph will be stored in the CTAKES_HOME/resources/org/apache/ctakes/ytex/conceptGraph directory.
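One way to picture why cycle warnings appear: taxonomy sources can contain is-a loops, and a graph used for depth and common-ancestor computations needs to be acyclic. The sketch below is an assumption about the general technique, not a description of how ConceptDaoImpl works; it adds (parent, child) edges one at a time and drops any edge that would close a cycle. The function name and the edge representation are invented for the example.

```python
def remove_cycles(edges):
    """Return (kept, dropped). Edges are added in order, and any
    (parent, child) edge that would close a cycle is dropped."""
    children = {}
    kept, dropped = [], []

    def reachable(start, target):
        """True if `target` can be reached from `start` via kept edges."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(children.get(node, []))
        return False

    for parent, child in edges:
        # Adding parent -> child closes a cycle if parent already lies below child.
        if reachable(child, parent):
            dropped.append((parent, child))
        else:
            children.setdefault(parent, []).append(child)
            kept.append((parent, child))
    return kept, dropped
```

Which edge of a cycle gets dropped depends on the order in which edges arrive, so a real implementation would likely apply a more deliberate rule.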

Corpus Information Content

We compute the intrinsic information content (intrinsic IC) when creating the concept graph. The InfoContentEvaluatorImpl class computes the corpus information content (corpus IC) for a given concept graph and corpus. This class takes as input a properties file that contains a query used to retrieve concept frequencies from the database; it then computes the information content of each node in the concept graph; finally it stores this in the feature_eval and feature_rank ytex database tables.
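The computation can be sketched as follows, assuming the usual corpus-based definition in which a concept's probability pools its own frequency with that of all its descendants and IC is the negative log of that probability. This is an illustration under that assumption, not the InfoContentEvaluatorImpl code; the function name, the input dictionaries, and the example counts are all hypothetical.

```python
import math

def corpus_ic(parents, freq):
    """parents: concept -> list of parent concepts.
    freq: concept -> raw corpus count.
    Returns concept -> IC = -log(p), where p pools the counts of the
    concept and all of its descendants. Concepts with no counts are omitted."""
    def ancestors(concept):
        seen, stack = {concept}, [concept]
        while stack:
            for parent in parents.get(stack.pop(), []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    cumulative = {}
    for concept, count in freq.items():
        # Credit each count once to the concept and to every distinct ancestor,
        # which stays correct when a concept has several parents.
        for a in ancestors(concept):
            cumulative[a] = cumulative.get(a, 0) + count
    total = sum(freq.values())
    return {c: -math.log(n / total) for c, n in cumulative.items() if n > 0}

parents = {
    "disease": [],
    "heart_disease": ["disease"],
    "myocarditis": ["heart_disease"],
    "pneumonia": ["disease"],
}
ic = corpus_ic(parents, {"heart_disease": 2, "myocarditis": 3, "pneumonia": 5})
```

Frequent, general concepts end up with low IC and rare, specific ones with high IC, which is the property the corpus IC based measures rely on.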

...