Introduction

Using the TikaCmdLineMetExtractor is easy; just follow the steps below.

What is it, though? It is an extractor that leverages Apache Tika to extract mime-type-specific metadata automatically. In other words, it uses Tika to extract metadata from more than 1000 supported file types.

Usage

There are several ways to use this extractor, depending on your situation.

Choosing criteria:

Use crawler_launcher if you just want to invoke the TikaCmdLineMetExtractor with a single command whenever the crawler runs.
Use AutoDetectProductCrawler if you want to use the Tika extractor for some files and other extractors for others, depending on the name pattern of the file.
Use Stand-Alone if you want to use the Tika extractor for testing or for embedding in scripts.

crawler_launcher

Specify the --metExtractor and the --metExtractorConfig arguments when invoking the crawler_launcher script. Example below:

./crawler_launcher \
    --filemgrUrl $FILEMGR_URL \
    --operation --launchMetCrawler \
    --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory \
    --productPath $OODT_HOME/data/staging \
    --metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
    --metExtractorConfig $OODT_HOME/data/met/tika.conf

Contents of tika.conf:

ProductType=GenericFile
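
If you would rather create this file from a script, a minimal shell sketch follows. The function name write_tika_conf is hypothetical (it is not part of OODT); the ProductType value and the example path are the ones used above.

```shell
# Hypothetical helper (not part of OODT): writes a minimal extractor config.
# Argument: path of the config file to create.
write_tika_conf() {
    conf_path=$1
    # Create the parent directory if it does not exist yet.
    mkdir -p "$(dirname "$conf_path")" || return 1
    # Single key, taken from the tika.conf example above.
    printf 'ProductType=GenericFile\n' > "$conf_path"
}

# Example: write_tika_conf "$OODT_HOME/data/met/tika.conf"
```

The resulting path is what you would pass to --metExtractorConfig.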

To use the full power of Tika's extractors, make sure to include tika-app.jar as a dependency in your crawler's pom.xml. Doing so pulls in about 30MB of external parsers supporting more than 1000 mime-types. Without it, only Tika's basic parsers are used.

AutoDetectProductCrawler

Specify the --mimeExtractorRepo argument when invoking the crawler_launcher script. Example below:

./crawler_launcher --operation --launchAutoCrawler \
    --filemgrUrl $FILEMGR_URL \
    --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
    --productPath $OODT_HOME/data/staging \
    --mimeExtractorRepo ../policy/mime-extractor-map.xml

Next, register the Tika extractor in mime-extractor-map.xml for the files you want it to handle, matched by mime-type and file name pattern:

<mime type="product/customprod">
    <glob pattern="IMG_SYNC*.jpg"/>
    <extractor class="org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor">
        <config file="../etc/img_sync_config.properties"/>
        <preCondComparators>
            <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
        </preCondComparators>
    </extractor>
</mime>

An example for img_sync_config.properties:

ProductType=GenericFile

To use the full power of Tika's extractors, make sure to include tika-app.jar as a dependency in your crawler's pom.xml. Doing so pulls in about 30MB of external parsers supporting more than 1000 mime-types. Without it, only Tika's basic parsers are used.

Stand-Alone

You can also use the TikaCmdLineMetExtractor stand-alone, via:

java -cp cas-metadata-<OODT_VERSION>.jar org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
# Usage: org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor <file> <configfile>

To use the full power of Tika's extractors (including external image, video, and other media extractors), include tika-app.jar in the classpath above.
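
As a rough sketch of a stand-alone invocation with tika-app.jar on the classpath: the function names join_classpath and run_tika_extractor, the variables CAS_METADATA_JAR and TIKA_APP_JAR, and their placeholder defaults are assumptions made for illustration. Only the class name and the <file> <configfile> arguments come from the usage message above, and your installation may need further dependency jars on the classpath.

```shell
# Placeholder jar locations (assumptions); override them via the environment
# to point at the real jars in your installation.
CAS_METADATA_JAR="${CAS_METADATA_JAR:-cas-metadata.jar}"
TIKA_APP_JAR="${TIKA_APP_JAR:-tika-app.jar}"

# Join all arguments with ':' (the classpath separator on Unix-like systems).
# The body runs in a subshell so the IFS change does not leak to the caller.
join_classpath() (
    IFS=:
    printf '%s\n' "$*"
)

# Hypothetical wrapper: run the extractor on one file.
# Arguments: <file> <configfile>, as in the usage message.
run_tika_extractor() {
    java -cp "$(join_classpath "$CAS_METADATA_JAR" "$TIKA_APP_JAR")" \
        org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
        "$1" "$2"
}

# Example: run_tika_extractor /path/to/file.jpg /path/to/tika.conf
```

Additional jars can be passed to join_classpath in the same way if the extractor reports missing classes.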
