OODT Crawler Help

If, like me, you are trying to get a crawler working for the first time, here are a few simple guidelines that will hopefully be of use to you (this page is by no means a complete tutorial or official user guide).

Please note, this page was updated in response to this email.

Here's how to get some useful feedback about crawler configurations:

./crawler_launcher --printSupportedActions
./crawler_launcher --printSupportedCrawlerActions
./crawler_launcher --printSupportedPreconditions

There were two crawlers that I was particularly interested in using: the MetExtractorProductCrawler and the AutoDetectProductCrawler (the StdProductCrawler does not support metadata extraction).

So, now you want to know more about how to get these crawlers up and running? Ask the crawler!

./crawler_launcher --operation --launchStdCrawler
./crawler_launcher --operation --launchMetCrawler
./crawler_launcher --operation --launchAutoCrawler

As you can see, the command line options that need to be specified are listed after running the command. My approach was to add the command line options iteratively. The simplest command that gives you some useful feedback is one that specifies only the crawler.

MetExtractorProductCrawler

To get the metadata extractor product crawler working I ran:

./crawler_launcher --operation --launchMetCrawler

The crawler then failed, since there was a command line option that needed to be specified. So I added that option and ran the command again to see where it failed next.

This is the complete met extractor command that I eventually ran:

./crawler_launcher \
  --filemgrUrl http://localhost:9000 \
  --operation --launchMetCrawler \
  --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
  --productPath /usr/local/meerkat/data/staging/products/hdf5 \
  --metExtractor org.apache.oodt.cas.metadata.extractors.ExternMetExtractor \
  --metExtractorConfig /usr/local/meerkat/extractors/katextractor/katextractor.config

Here is the katextractor.config configuration file that I used for my metadata extractor:

katextractor.config

<?xml version="1.0" encoding="UTF-8"?>
<cas:externextractor xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
  <exec workingDir="">
    <extractorBinPath envReplace="true">/usr/local/meerkat/extractors/katextractor/kat_met_extractor.py</extractorBinPath>
    <args>
      <arg>-o</arg>
      <arg>-f</arg>
      <arg isDataFile="true"></arg>
    </args>
  </exec>
</cas:externextractor>
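
The configuration above only describes how the crawler invokes the script; what the script has to produce is not shown on this page. The following is a minimal sketch, assuming that ExternMetExtractor expects the script to leave a CAS metadata XML file named after the data file plus a .met extension (this should be checked against your OODT version). The function name write_met_file and the sample keys are hypothetical and are not taken from kat_met_extractor.py.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Same namespace URI as the one used in katextractor.config.
CAS_NS = "http://oodt.jpl.nasa.gov/1.0/cas"


def write_met_file(data_file, metadata, met_extension="met"):
    """Write <data_file>.<met_extension> holding the given key/value pairs.

    Assumption: the crawler picks up a CAS metadata XML file of this shape.
    `metadata` maps each key to a list of string values.
    """
    ET.register_namespace("cas", CAS_NS)
    root = ET.Element("{%s}metadata" % CAS_NS)
    for key, values in metadata.items():
        keyval = ET.SubElement(root, "keyval")
        ET.SubElement(keyval, "key").text = key
        for value in values:
            ET.SubElement(keyval, "val").text = value
    met_path = "%s.%s" % (data_file, met_extension)
    ET.ElementTree(root).write(met_path, encoding="UTF-8", xml_declaration=True)
    return met_path


if __name__ == "__main__":
    # Hypothetical data file and keys, for illustration only.
    with tempfile.TemporaryDirectory() as tmp:
        data_file = os.path.join(tmp, "observation.h5")
        open(data_file, "wb").close()
        met = write_met_file(data_file, {"ProductType": ["GenericFile"]})
        with open(met) as handle:
            print(handle.read())
```

A real extractor would read the values from the HDF5 file instead of hard-coding them, and would parse whatever arguments the config passes to it (here -o, -f and the data file path).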

There was some optional functionality that I wanted to use. I added these options to the end of the command line:

--actionIds DeleteDataFile DeleteMetadataFile MoveDataFileToFailureDir Unique
--failureDir /tmp
--metFileExtension met
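
Since the command grows one option at a time, it can help to keep the options in a list and regenerate the command line from it. This is only a convenience sketch: the helper build_crawler_command and the list layout are hypothetical, while the flags and values are the ones used on this page.

```python
import shlex

LAUNCHER = "./crawler_launcher"


def build_crawler_command(options):
    """Turn (flag, values) pairs into an argv list for crawler_launcher.

    `values` is a list; an empty list means the flag takes no argument.
    """
    argv = [LAUNCHER]
    for flag, values in options:
        argv.append("--" + flag)
        argv.extend(values)
    return argv


# Options copied from the commands on this page, in the order they were added.
options = [
    ("filemgrUrl", ["http://localhost:9000"]),
    ("operation", []),
    ("launchMetCrawler", []),
    ("clientTransferer",
     ["org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory"]),
    ("productPath", ["/usr/local/meerkat/data/staging/products/hdf5"]),
    ("metExtractor",
     ["org.apache.oodt.cas.metadata.extractors.ExternMetExtractor"]),
    ("metExtractorConfig",
     ["/usr/local/meerkat/extractors/katextractor/katextractor.config"]),
    ("actionIds",
     ["DeleteDataFile", "DeleteMetadataFile", "MoveDataFileToFailureDir", "Unique"]),
    ("failureDir", ["/tmp"]),
    ("metFileExtension", ["met"]),
]

if __name__ == "__main__":
    # Print a copy-pasteable command line; nothing is executed here.
    print(" ".join(shlex.quote(arg) for arg in build_crawler_command(options)))
```

Commenting entries in and out of the list mirrors the iterative approach described above: run, read what the crawler complains about, add the next option.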

Notes:

I had a file manager listening on http://localhost:9000.
I used an external metadata extractor (written in Python) to extract data from HDF5 files.
An example configuration for the MetExtractorProductCrawler, which specifies how the crawler will run your extractor, can be found in the source: https://svn.apache.org/repos/asf/oodt/trunk/metadata/src/main/resources/examples/extern-config.xml

MetExtractorProductCrawler, using the TikaCmdLineMetExtractor (an easier approach)

NOTE: This extractor is only available in 0.7-SNAPSHOT and later.

Without having to create your own custom MetExtractor, you can leverage OODT's Tika extractor to automatically extract as much metadata as it can gather for you. All you need to do is provide a configuration file and specify which ProductType you want your products ingested as. Below are examples of the steps you could perform:

Invocation command:
./crawler_launcher \
  --filemgrUrl http://localhost:9000 \
  --operation --launchMetCrawler \
  --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
  --productPath /usr/local/meerkat/data/staging/products/hdf5 \
  --metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
  --metExtractorConfig /usr/local/meerkat/extractors/tikaextractor/tikaextractor.config

Associated configuration file:

tikaextractor.config

ProductType=MyCustomProductType

AutoDetectProductCrawler

To get the auto detect product crawler working I ran:

./crawler_launcher --operation --launchAutoCrawler

I followed a similar approach to the one I used for the MetExtractorProductCrawler. For completeness, here is my full command line:

./crawler_launcher \
  --operation --launchAutoCrawler \
  --filemgrUrl http://localhost:9000 \
  --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
  --productPath /usr/local/meerkat/data/staging/products/hdf5 \
  --mimeExtractorRepo ../policy/mime-extractor-map.xml

There is a bit of excitement when you hit the --mimeExtractorRepo command line option: this is where you configure the library of extractors that you will use. The option expects a mime-extractor-map.xml file. Here is how my file looks:

mime-extractor-map.xml

<?xml version="1.0" encoding="UTF-8"?>
<cas:mimetypemap xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas" magic="true or false" mimeRepo="/usr/local/meerkat/cas-crawler/policy/mimetypes.xml">
  <mime type="product/hdf5">
    <extractor class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
      <config file="/usr/local/meerkat/extractors/katextractor/katextractor.config"/>
      <preCondComparators>
        <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
      </preCondComparators>
    </extractor>
  </mime>
</cas:mimetypemap>

In the mime-extractor-map.xml file I needed to configure:

magic, set to either "true" or "false". I am not sure what it does yet.
a mime type. I used product/hdf5.
a mimeRepo file. I called mine /usr/local/meerkat/cas-crawler/policy/mimetypes.xml.
a preCondComparator. I used CheckThatDataFileSizeIsGreaterThanZero.
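
Before launching the crawler it can be useful to check that the map file says what you think it says. The sketch below reads a mime-extractor-map.xml document with Python's standard library and summarises which extractor, config file and preconditions are attached to each mime type. It is an inspection aid only and does not reproduce how OODT itself loads the file; the function name summarize_extractor_map is hypothetical, and magic="false" in the sample is an arbitrary choice made for the example.

```python
import xml.etree.ElementTree as ET

# Sample document mirroring the file above; magic="false" is an arbitrary pick.
SAMPLE = """<cas:mimetypemap xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas" magic="false" mimeRepo="/usr/local/meerkat/cas-crawler/policy/mimetypes.xml">
  <mime type="product/hdf5">
    <extractor class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
      <config file="/usr/local/meerkat/extractors/katextractor/katextractor.config"/>
      <preCondComparators>
        <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
      </preCondComparators>
    </extractor>
  </mime>
</cas:mimetypemap>"""


def summarize_extractor_map(xml_text):
    """Return {mime type: [(extractor class, config file, [precondition ids])]}."""
    root = ET.fromstring(xml_text)
    summary = {}
    for mime in root.findall("mime"):
        entries = []
        for extractor in mime.findall("extractor"):
            config = extractor.find("config")
            config_file = config.get("file") if config is not None else None
            preconds = [
                comparator.get("id")
                for comparator in extractor.findall(
                    "preCondComparators/preCondComparator")
            ]
            entries.append((extractor.get("class"), config_file, preconds))
        summary[mime.get("type")] = entries
    return summary


if __name__ == "__main__":
    for mime_type, entries in summarize_extractor_map(SAMPLE).items():
        for extractor_class, config_file, preconds in entries:
            print(mime_type, extractor_class, config_file, preconds)
```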

My mimetypes.xml file looks like this:

mimetypes.xml

<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <mime-type type="product/hdf5">
    <glob pattern="*.h5"/>
  </mime-type>
</mime-info>
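
To illustrate what the glob entry does, here is a small sketch that maps file names to mime types using the patterns from a mimetypes.xml document. It uses Python's fnmatch as a stand-in for the real detection, which is handled by the crawler's MIME library and may behave differently in edge cases; the function names load_globs and mime_type_for are hypothetical.

```python
import fnmatch
import os
import xml.etree.ElementTree as ET

MIMETYPES_XML = """<mime-info>
  <mime-type type="product/hdf5">
    <glob pattern="*.h5"/>
  </mime-type>
</mime-info>"""


def load_globs(xml_text):
    """Return (glob pattern, mime type) pairs from a mimetypes.xml document."""
    root = ET.fromstring(xml_text)
    return [
        (glob.get("pattern"), mime.get("type"))
        for mime in root.findall("mime-type")
        for glob in mime.findall("glob")
    ]


def mime_type_for(path, globs):
    """Return the first mime type whose glob matches the file name, else None."""
    name = os.path.basename(path)
    for pattern, mime_type in globs:
        if fnmatch.fnmatchcase(name, pattern):
            return mime_type
    return None


if __name__ == "__main__":
    globs = load_globs(MIMETYPES_XML)
    # Hypothetical file names, for illustration only.
    for candidate in ["/data/staging/observation.h5", "/data/staging/notes.txt"]:
        print(candidate, mime_type_for(candidate, globs))
```

Files that match no pattern get no mime type here, so with this mimetypes.xml only *.h5 files would be routed to the extractor configured for product/hdf5.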

Notes:

I had a file manager listening on http://localhost:9000.
I used an external metadata extractor (written in Python) to extract data from HDF5 files.
AutoDetectProductCrawler example configuration can be found in the source. The configuration:
- Uses the same metadata extractor specification file (you will have one of these for each mime-type).
- Allows you to define your mime-types, that is, to give a mime-type for a given filename pattern (such as *.h5 above).
- Maps your mime-types to extractors.
