Introduction

Using the TikaCmdLineMetExtractor is easy. Just follow the steps below.

What is it, though? It is an extractor that leverages Apache Tika to extract mime-type-specific metadata automatically for you. In other words, it will use Tika to extract metadata from 1000+ supported file types, automatically!

Usage

There are several ways to use this extractor, depending on your situation.

Criteria for choosing:

  • Use crawler_launcher if you just want to invoke the TikaCmdLineMetExtractor with a single command whenever the crawler runs.
  • Use AutoDetectProductCrawler if you want to use the Tika extractor for some files and other extractors for others, depending on the name pattern of the file.
  • Use Stand-Alone if you want to use the Tika extractor for testing or for embedding in scripts.

crawler_launcher

Specify the --metExtractor and the --metExtractorConfig arguments when invoking the crawler_launcher script. Example below:

Code Block
./crawler_launcher \
    --filemgrUrl $FILEMGR_URL \
    --operation --launchMetCrawler \
    --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory \
    --productPath $OODT_HOME/data/staging \
    --metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
    --metExtractorConfig $OODT_HOME/data/met/tika.conf

Contents of tika.conf:

Code Block
ProductType=GenericFile

Note

To use the full power of Tika's extractors, make sure to include tika-app.jar as a dependency in your crawler's pom.xml. Doing so pulls in about 30MB of external parsers supporting 1000+ mime-types. Without it, only Tika's basic parsers are used.
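
The config file for this extractor is a single property, so it is easy to generate from a script before launching the crawler. The following is a minimal sketch, not part of OODT itself: MET_DIR is a hypothetical variable, and the fallback directory is an assumption used only so the snippet runs when OODT_HOME is not set.

```shell
#!/bin/sh
# Sketch: write the one-line tika.conf used by --metExtractorConfig.
# MET_DIR is a hypothetical variable; the text above keeps the file
# under $OODT_HOME/data/met. The fallback path is an assumption.
MET_DIR="${OODT_HOME:-${TMPDIR:-/tmp}/oodt-demo}/data/met"
mkdir -p "$MET_DIR"

# The only key shown in the example config is ProductType.
printf 'ProductType=GenericFile\n' > "$MET_DIR/tika.conf"

# Show what was written so it can be checked before the crawler runs.
cat "$MET_DIR/tika.conf"
```

Pass the resulting path to --metExtractorConfig as in the command above.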

AutoDetectProductCrawler

Specify the --mimeExtractorRepo argument, pointing at your mime extractor map, when invoking the crawler_launcher script with the --launchAutoCrawler operation. Example below:

Code Block
./crawler_launcher --operation --launchAutoCrawler \
--filemgrUrl $FILEMGR_URL \
--clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
--productPath $OODT_HOME/data/staging \
--mimeExtractorRepo ../policy/mime-extractor-map.xml

Next, make sure to add the Tika extractor to the mime extractor map for the files it should handle, based on your file's signature (here, a glob pattern on the file name):

Code Block
<mime type="product/customprod">
    <glob pattern="IMG_SYNC*.jpg"/>
    <extractor class="org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor">
        <config file="../etc/img_sync_config.properties"/>
        <preCondComparators>
            <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
        </preCondComparators>
    </extractor>
</mime>
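
Before running the crawler, it can help to check which staged file names a glob such as IMG_SYNC*.jpg would select. The sketch below uses the shell's own pattern matching as an approximation; it assumes the crawler interprets this simple pattern the same way a shell does, which is not confirmed here. The matches_glob function and the sample file names are hypothetical.

```shell
#!/bin/sh
# Sketch: approximate the glob from the mapping with shell pattern matching.
# matches_glob is a hypothetical helper, not an OODT command.
matches_glob() {
    # $1 = file name (no directory part), $2 = glob pattern
    case "$1" in
        $2) return 0 ;;   # unquoted so the shell treats $2 as a pattern
        *)  return 1 ;;
    esac
}

# Sample names are made up for illustration.
for name in IMG_SYNC_001.jpg IMG_0001.jpg IMG_SYNC_002.png; do
    if matches_glob "$name" 'IMG_SYNC*.jpg'; then
        echo "$name: TikaCmdLineMetExtractor"
    else
        echo "$name: no match"
    fi
done
```

Only the first sample name matches; the other two fail on the prefix and the extension respectively.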

An example for img_sync_config.properties:

Code Block
ProductType=GenericFile

Note

To use the full power of Tika's extractors, make sure to include tika-app.jar as a dependency in your crawler's pom.xml. Doing so pulls in about 30MB of external parsers supporting 1000+ mime-types. Without it, only Tika's basic parsers are used.

Stand-Alone

You can also use the TikaCmdLineMetExtractor stand-alone, via:

Code Block
java -cp cas-metadata-<OODT_VERSION>.jar org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
# Usage: org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor <file> <configfile>

Note

To use the full power of Tika's extractors (including external image, video, and other media extractors), include tika-app.jar in the classpath above!
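
For scripting, the stand-alone call can be wrapped in a small function that puts both jars on the classpath and passes the file and config file arguments from the usage line. This is a sketch under assumptions: CAS_METADATA_JAR, TIKA_APP_JAR, and extract_met are hypothetical names, the default jar file names are placeholders, and your installation may need further dependency jars on the classpath.

```shell
#!/bin/sh
# Sketch: run the extractor stand-alone with tika-app.jar on the classpath.
# CAS_METADATA_JAR and TIKA_APP_JAR are hypothetical variables with
# placeholder defaults; point them at the jars in your installation.
CAS_METADATA_JAR="${CAS_METADATA_JAR:-cas-metadata.jar}"
TIKA_APP_JAR="${TIKA_APP_JAR:-tika-app.jar}"

# ':' is the classpath separator on Unix-like systems.
CLASSPATH_ARG="$CAS_METADATA_JAR:$TIKA_APP_JAR"

extract_met() {
    # $1 = file to extract metadata from, $2 = extractor config file
    java -cp "$CLASSPATH_ARG" \
        org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
        "$1" "$2"
}

# Example call (not executed here):
#   extract_met /path/to/file.jpg /path/to/tika.conf
```

Defining the function does not start Java, so the snippet can be sourced from a larger script and called once per file.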