...
./crawler_launcher \
  --filemgrUrl http://localhost:9000 \
  --operation --launchMetCrawler \
  --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
  --productPath /usr/local/meerkat/data/staging/products/hdf5 \
  --metExtractor org.apache.oodt.cas.metadata.extractors.ExternMetExtractor \
  --metExtractorConfig /usr/local/meerkat/extractors/katextractor/katextractor.config
...
- I had a file manager listening on http://localhost:9000.
- I used an external metadata extractor (written in Python) to extract metadata from HDF5 files.
- An example configuration for the MetExtractorProductCrawler, which lets you specify how the crawler will run your extractor, can be found in the source: https://svn.apache.org/repos/asf/oodt/trunk/metadata/src/main/resources/examples/extern-config.xml
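As a rough illustration of what such an external Python extractor might look like, the sketch below writes a metadata file next to the product. It assumes the crawler picks up a `<product>.met` file in the CAS metadata XML layout (a `cas:metadata` root with `keyval` entries). The namespace URI, the `.met` suffix, the key names, and the function names are assumptions for illustration, so check them against extern-config.xml and your OODT version. Reading the real HDF5 attributes is left as a placeholder.

```python
import os
import xml.etree.ElementTree as ET

# Assumed namespace for CAS metadata files; verify against your OODT release.
CAS_NS = "http://oodt.jpl.nasa.gov/1.0/cas"


def build_met_xml(metadata):
    """Return a cas:metadata XML string for a dict of key -> value (or list of values)."""
    ET.register_namespace("cas", CAS_NS)
    root = ET.Element("{%s}metadata" % CAS_NS)
    for key in sorted(metadata):
        values = metadata[key]
        if not isinstance(values, (list, tuple)):
            values = [values]
        keyval = ET.SubElement(root, "keyval")
        ET.SubElement(keyval, "key").text = key
        for value in values:
            ET.SubElement(keyval, "val").text = str(value)
    return ET.tostring(root, encoding="unicode")


def extract(product_path):
    """Collect metadata for one product (placeholder values only)."""
    # A real extractor would open the HDF5 file here and read its attributes.
    return {
        "Filename": os.path.basename(product_path),
        "FileLocation": os.path.dirname(os.path.abspath(product_path)),
        "ProductType": "MyCustomProductType",  # hypothetical product type
    }


def write_met_file(product_path):
    """Write <product_path>.met (assumed naming convention) and return its path."""
    met_path = product_path + ".met"
    with open(met_path, "w") as handle:
        handle.write(build_met_xml(extract(product_path)))
    return met_path
```

A small command-line wrapper that calls `write_met_file(sys.argv[1])` would turn this into a script the extractor configuration could point at.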
...
MetExtractorProductCrawler, using the TikaCmdLineMetExtractor (an easier approach)
...
NOTE: This extractor is only available in 0.7-SNAPSHOT and later.
Without having to create your own custom MetExtractor, you can leverage OODT's Tika extractor to automatically extract as much metadata as it can gather for you. All you need to do is provide a configuration file and specify which ProductType you want your products ingested as. Below are examples of the steps you could perform:
Invocation command:
./crawler_launcher \
  --filemgrUrl http://localhost:9000 \
  --operation --launchMetCrawler \
  --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
  --productPath /usr/local/meerkat/data/staging/products/hdf5 \
  --metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
  --metExtractorConfig /usr/local/meerkat/extractors/tikaextractor/tikaextractor.config
Associated configuration file:
ProductType=MyCustomProductType
AutoDetectProductCrawler
To get the auto detect product crawler working I ran:
./crawler_launcher --operation --launchAutoCrawler
I followed an approach similar to the one I used to get the MetExtractorProductCrawler working. For completeness, here is my full command line:
./crawler_launcher \
  --operation --launchAutoCrawler \
  --filemgrUrl http://localhost:9000 \
  --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
  --productPath /usr/local/meerkat/data/staging/products/hdf5 \
  --mimeExtractorRepo ../policy/mime-extractor-map.xml
...
- I had a file manager listening on http://localhost:9000.
- I used an external metadata extractor (written in Python) to extract metadata from HDF5 files.
- AutoDetectProductCrawler example configuration can be found in the source:
- Uses the same metadata extractor specification file (you will have one of these for each mime-type).
- Allows you to define your mime-types – that is, give a mime-type for a given filename regular expression.
- Maps your mime-types to extractors.
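Conceptually, the last two files act as two lookup tables: filename pattern to mime-type, then mime-type to extractor. The Python sketch below mimics that two-step lookup purely for illustration. It is not how OODT implements it, and the filename pattern, the `application/x-hdf5` mime-type, and the table and function names are assumptions. The extractor class and config path are the ones from the MetExtractorProductCrawler command on this page.

```python
import re

# Hypothetical table 1: filename regular expression -> mime-type.
MIME_PATTERNS = [
    (re.compile(r".*\.(h5|hdf5)$"), "application/x-hdf5"),
]

# Hypothetical table 2: mime-type -> (extractor class, extractor config file).
MIME_EXTRACTORS = {
    "application/x-hdf5": (
        "org.apache.oodt.cas.metadata.extractors.ExternMetExtractor",
        "/usr/local/meerkat/extractors/katextractor/katextractor.config",
    ),
}


def detect_mime_type(filename):
    """Return the mime-type of the first pattern matching the filename, else None."""
    for pattern, mime_type in MIME_PATTERNS:
        if pattern.match(filename):
            return mime_type
    return None


def pick_extractor(filename):
    """Return (extractor class, config path) for a file, or None if nothing maps."""
    mime_type = detect_mime_type(filename)
    return MIME_EXTRACTORS.get(mime_type)
```

Files whose names match no pattern, or whose mime-type has no extractor entry, yield `None` in this sketch. How the real crawler treats such files is not covered here.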