Introduction

Using the TikaCmdLineMetExtractor is easy. Just follow the steps below.

What is it, though? It is an extractor that leverages Apache Tika to extract mime-type-specific metadata for you automatically. In other words, it uses Tika to extract metadata from more than 1000 supported file types.

Usage

There are several ways to use this extractor, depending on your situation.

Criteria for choosing:

  • Use crawler_launcher if you just want to invoke the TikaCmdLineMetExtractor with a single command whenever the crawler runs.
  • Use AutoDetectProductCrawler if you want to use the Tika extractor for some files and other extractors for others, depending on the name pattern of the file.
  • Use Stand-Alone if you want to use the Tika extractor for testing or for embedding in scripts.

crawler_launcher

Specify the --metExtractor and the --metExtractorConfig arguments when invoking the crawler_launcher script. Example below:

./crawler_launcher \
    --filemgrUrl $FILEMGR_URL \
    --operation --launchMetCrawler \
    --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory \
    --productPath $OODT_HOME/data/staging \
    --metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
    --metExtractorConfig $OODT_HOME/data/met/tika.conf

Contents of tika.conf:

ProductType=GenericFile

To use the full power of Tika's extractors, make sure to include tika-app.jar as a dependency in your crawler's pom.xml. Doing so adds about 30 MB of external parsers supporting more than 1000 mime-types. Without it, only Tika's basic parsers are used.
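The config file in this example holds a single key, so it is easy to generate from a script before the crawler starts. Below is a minimal sketch, assuming a POSIX shell; the helper name write_tika_conf is hypothetical and not part of OODT, and it writes only the ProductType key shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of OODT): write a minimal extractor config
# containing only the ProductType key used in the example above.
#   $1 = path of the config file to create
#   $2 = product type (optional, defaults to GenericFile)
write_tika_conf() {
    conf_path=$1
    product_type=${2:-GenericFile}
    # Create the parent directory if it does not exist yet.
    mkdir -p "$(dirname "$conf_path")" || return 1
    printf 'ProductType=%s\n' "$product_type" > "$conf_path"
}
```

For instance, calling write_tika_conf "$OODT_HOME/data/met/tika.conf" before running crawler_launcher would produce the file passed to --metExtractorConfig in the command above.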

AutoDetectProductCrawler

Specify the --mimeExtractorRepo argument when invoking the crawler_launcher script, pointing it at your mime-extractor map. Example below:

./crawler_launcher \
    --operation --launchAutoCrawler \
    --filemgrUrl $FILEMGR_URL \
    --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
    --productPath $OODT_HOME/data/staging \
    --mimeExtractorRepo ../policy/mime-extractor-map.xml

Next, make sure to add the Tika extractor to mime-extractor-map.xml, keyed on your file's signature (here, a file name glob pattern):

<mime type="product/customprod">
    <glob pattern="IMG_SYNC*.jpg"/>
    <extractor class="org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor">
        <config file="../etc/img_sync_config.properties"/>
        <preCondComparators>
            <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
        </preCondComparators>
    </extractor>
</mime>

Example contents of img_sync_config.properties:

ProductType=GenericFile
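The glob pattern in the mime entry above decides which staged files are handed to the Tika extractor. To preview which file names such a pattern would select, a rough check can be sketched in shell. This is only an approximation under the assumption that the pattern behaves like a shell-style glob on the file's base name; the crawler does its own matching, and the function name matches_img_sync is hypothetical.

```shell
#!/bin/sh
# Hypothetical preview helper (not part of OODT): succeed if the base name of
# the given path matches the IMG_SYNC*.jpg glob from the mime entry above.
# Assumption: the pattern is treated like a shell-style glob.
matches_img_sync() {
    case $(basename "$1") in
        IMG_SYNC*.jpg) return 0 ;;
        *)             return 1 ;;
    esac
}
```

Looping over the files in the staging directory and calling matches_img_sync on each one gives a quick list of candidates before the crawler is launched.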

To use the full power of Tika's extractors, make sure to include tika-app.jar as a dependency in your crawler's pom.xml. Doing so adds about 30 MB of external parsers supporting more than 1000 mime-types. Without it, only Tika's basic parsers are used.

Stand-Alone

You can also use the TikaCmdLineMetExtractor stand-alone, via:

java -cp cas-metadata-<OODT_VERSION>.jar org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
# Usage: org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor <file> <configfile>

To use the full power of Tika's extractors (including external image, video, and other media extractors), include tika-app.jar in the classpath above.
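For use in scripts, the stand-alone invocation can be wrapped in a small function that builds the classpath and checks its inputs first. The following is a sketch only, assuming a POSIX shell and a java executable on the PATH; the function names join_classpath and run_tika_extractor are hypothetical, and the jar paths are whatever the caller supplies.

```shell
#!/bin/sh
# Hypothetical helper: join the given jar paths with ':' to form a classpath.
join_classpath() {
    cp=""
    for jar in "$@"; do
        if [ -z "$cp" ]; then
            cp=$jar
        else
            cp="$cp:$jar"
        fi
    done
    printf '%s\n' "$cp"
}

# Hypothetical wrapper around the stand-alone extractor.
#   $1 = data file, $2 = extractor config file, remaining args = jars
run_tika_extractor() {
    data_file=$1
    config_file=$2
    shift 2
    # Fail early, before starting the JVM, if either input is missing.
    [ -f "$data_file" ] || { echo "missing data file: $data_file" >&2; return 1; }
    [ -f "$config_file" ] || { echo "missing config file: $config_file" >&2; return 1; }
    java -cp "$(join_classpath "$@")" \
        org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
        "$data_file" "$config_file"
}
```

A call would pass the data file, the config file, and then the jars, for example the cas-metadata jar followed by tika-app.jar. Any further jars your installation needs on the classpath are appended the same way.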
