Introduction
Using the TikaCmdLineMetExtractor is easy. Just follow the steps below.
What is it, though? It is an extractor that leverages Apache Tika to extract mime-type-specific metadata automatically for you. In other words, it uses Tika to extract metadata from more than 1000 supported file types, automatically!
Usage
There are several ways to use this extractor, depending on your situation.
Choosing criteria:
- Use crawler_launcher if you'd like to just invoke the TikaCmdLineMetExtractor using a single command whenever the crawler runs.
- Use AutoDetectProductCrawler if you want to use the Tika extractor for some files and other extractors for others, depending on the name pattern of the file.
- Use Stand-Alone if you want to use the Tika extractor for testing or for embedding in scripts.
crawler_launcher
Specify the --metExtractor and the --metExtractorConfig arguments when invoking the crawler_launcher script. Example below:
Code Block |
---|
./crawler_launcher \
  --filemgrUrl $FILEMGR_URL \
  --operation --launchMetCrawler \
  --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory \
  --productPath $OODT_HOME/data/staging \
  --metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
  --metExtractorConfig $OODT_HOME/data/met/tika.conf |
Contents of tika.conf:
Code Block |
---|
ProductType=GenericFile |
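
The one-line tika.conf shown above can also be created from a script before the crawler runs. The following is a minimal sketch, not part of OODT itself: the fallback to a scratch directory when OODT_HOME is unset, and the variable names MET_DIR and TIKA_CONF, are assumptions made for illustration.

```shell
#!/bin/sh
# Assumption: fall back to a scratch directory when OODT_HOME is not set,
# so the sketch can be tried without an OODT install.
OODT_HOME="${OODT_HOME:-$(mktemp -d)}"

# Hypothetical helper variables; the path matches the --metExtractorConfig
# value used in the crawler_launcher example.
MET_DIR="$OODT_HOME/data/met"
TIKA_CONF="$MET_DIR/tika.conf"

# Create the directory and write the single ProductType setting.
mkdir -p "$MET_DIR"
printf 'ProductType=GenericFile\n' > "$TIKA_CONF"

# Show what was written.
echo "Wrote $TIKA_CONF:"
cat "$TIKA_CONF"
```

With a real install, export OODT_HOME first so the file lands where --metExtractorConfig expects it.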
Note |
---|
To use the full power of Tika's extractors, make sure to include tika-app.jar in your crawler's pom.xml. Doing so will include about 30MB of external parsers supporting 1000+ mime-types. Not doing so will leverage only Tika's basic parsers. |
AutoDetectProductCrawler
Specify the --launchAutoCrawler operation and the --mimeExtractorRepo argument when invoking the crawler_launcher script. Example below:
Code Block |
---|
./crawler_launcher --operation --launchAutoCrawler \
  --filemgrUrl $FILEMGR_URL \
  --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
  --productPath $OODT_HOME/data/staging \
  --mimeExtractorRepo ../policy/mime-extractor-map.xml |
Next, make sure to add the Tika extractor to mime-extractor-map.xml for the files you want it to handle, based on your file's signature:
Code Block |
---|
<mime type="product/customprod">
  <glob pattern="IMG_SYNC*.jpg"/>
  <extractor class="org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor">
    <config file="../etc/img_sync_config.properties"/>
    <preCondComparators>
      <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
    </preCondComparators>
  </extractor>
</mime> |
An example for img_sync_config.properties:
Code Block |
---|
ProductType=GenericFile |
Note |
---|
To use the full power of Tika's extractors, make sure to include tika-app.jar in your crawler's pom.xml. Doing so will include about 30MB of external parsers supporting 1000+ mime-types. Not doing so will leverage only Tika's basic parsers. |
Stand-Alone
You can also use the TikaCmdLineMetExtractor stand-alone, via:
Code Block |
---|
java -cp cas-metadata-<OODT_VERSION>.jar org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor
# Usage: org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor <file> <configfile> |
Note |
---|
To use the full power of Tika's extractors (including external image, video, and other media extractors), include tika-app.jar in the class path above! |
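
Putting the note above into practice, the stand-alone command with tika-app.jar on the class path might look like the sketch below. It only prints the command rather than running it. The jar locations, the build_tika_cmd helper, and the sample file names are hypothetical, and the `:` class path separator assumes a Unix-like system.

```shell
#!/bin/sh
# Hypothetical jar locations; replace <OODT_VERSION> and the paths
# with the ones from your own install.
CAS_METADATA_JAR='cas-metadata-<OODT_VERSION>.jar'
TIKA_APP_JAR='tika-app.jar'

# Class path entries are joined with ':' on Unix-like systems.
TIKA_CLASSPATH="$CAS_METADATA_JAR:$TIKA_APP_JAR"

# Hypothetical helper: print the java command for one file.
#   $1 = file to extract metadata from
#   $2 = extractor config file (for example tika.conf)
build_tika_cmd() {
  printf 'java -cp %s %s %s %s\n' \
    "$TIKA_CLASSPATH" \
    org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
    "$1" "$2"
}

# Dry run with placeholder file names.
build_tika_cmd sample.jpg tika.conf
```

Printing the command first makes it easy to check the class path before embedding the call in a larger script. Depending on your install, further dependency jars may need to be added to the class path as well.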