...
./crawler_launcher \
--filemgrUrl http://localhost:9000 \
--operation --launchMetCrawler \
--clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
--productPath /usr/local/meerkat/data/staging/products/hdf5 \
--metExtractor org.apache.oodt.cas.metadata.extractors.ExternMetExtractor \
--metExtractorConfig /usr/local/meerkat/extractors/katextractor/katextractor.config
...
- I had a file manager listening on http://localhost:9000.
- I used an external metadata extractor (written in Python) to extract metadata from HDF5 files.
- An example configuration for the MetExtractorProductCrawler, which lets you specify how the crawler runs your extractor, can be found in the source: https://svn.apache.org/repos/asf/oodt/trunk/metadata/src/main/resources/examples/extern-config.xml
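Since the crawler can only ingest when the file manager is reachable, a quick pre-flight check before launching may save some debugging. The following is a minimal sketch, not part of OODT: the helper names `parse_host_port` and `port_is_open` are hypothetical, the fallback to port 80 is an assumption, and the connection test relies on bash's `/dev/tcp` pseudo-device.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; these are not part of OODT.

# Split a URL such as http://localhost:9000 into "host port".
parse_host_port() {
  local url="$1"
  local hostport="${url#*://}"   # drop the scheme, if any
  hostport="${hostport%%/*}"     # drop any path component
  local host="${hostport%%:*}"
  local port="${hostport##*:}"
  # No explicit port in the URL: fall back to 80 (an assumption for plain http).
  if [ "$host" = "$hostport" ]; then
    port=80
  fi
  echo "$host $port"
}

# Return 0 if something accepts TCP connections on host:port.
# Uses bash's /dev/tcp pseudo-device, so this function requires bash.
port_is_open() {
  local host="$1" port="$2"
  (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null
}

# Example use (commented out so that sourcing this file opens no connection):
# set -- $(parse_host_port http://localhost:9000)
# port_is_open "$1" "$2" || echo "file manager not reachable on $1:$2" >&2
```

The check only confirms that a process is listening on the port; it does not verify that the listener is actually a file manager.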
MetExtractorProductCrawler, using the TikaCmdLineMetExtractor (an easier approach)
...
Invocation command:
./crawler_launcher \
--filemgrUrl http://localhost:9000 \
--operation --launchMetCrawler \
--clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
--productPath /usr/local/meerkat/data/staging/products/hdf5 \
--metExtractor org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
--metExtractorConfig /usr/local/meerkat/extractors/tikaextractor/tikaextractor.config
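The two MetExtractorProductCrawler invocations differ only in the values passed to --metExtractor and --metExtractorConfig, so a small wrapper can keep the shared options in one place. This is a sketch under assumptions: the function name `run_met_crawler` and the `CRAWLER_LAUNCHER` override variable are hypothetical and introduced only for illustration; the flags and values are the ones shown in the commands.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper; only the flags and their values come from the commands in the text.

# Run a met-extractor crawl.
#   $1: extractor class name
#   $2: path to that extractor's configuration file
# CRAWLER_LAUNCHER (hypothetical) can point at a different launcher path.
run_met_crawler() {
  local extractor="$1"
  local config="$2"
  "${CRAWLER_LAUNCHER:-./crawler_launcher}" \
    --filemgrUrl http://localhost:9000 \
    --operation --launchMetCrawler \
    --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
    --productPath /usr/local/meerkat/data/staging/products/hdf5 \
    --metExtractor "$extractor" \
    --metExtractorConfig "$config"
}

# Example (commented out so that sourcing this file does not start a crawl):
# run_met_crawler \
#   org.apache.oodt.cas.metadata.extractors.TikaCmdLineMetExtractor \
#   /usr/local/meerkat/extractors/tikaextractor/tikaextractor.config
```

Setting `CRAWLER_LAUNCHER=echo` prints the assembled command line instead of running it, which is a cheap way to check the options before a real crawl.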
...
./crawler_launcher --operation --launchAutoCrawler
I followed an approach similar to the one I used to get the MetExtractorProductCrawler working. For completeness, here is my full command line:
./crawler_launcher \
--operation --launchAutoCrawler \
--filemgrUrl http://localhost:9000 \
--clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
--productPath /usr/local/meerkat/data/staging/products/hdf5 \
--mimeExtractorRepo ../policy/mime-extractor-map.xml
...
- I had a file manager listening on http://localhost:9000.
- I used an external metadata extractor (written in Python) to extract metadata from HDF5 files.
- AutoDetectProductCrawler example configuration can be found in the source:
- Uses the same metadata extractor specification file (you will have one of these for each mime-type).
- Allows you to define your mime-types, that is, to assign a mime-type to every filename that matches a given regular expression.
- Maps your mime-types to extractors.
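Conceptually, the configuration pieces listed above amount to two lookups: filename to mime-type, then mime-type to extractor specification file. The sketch below illustrates that idea in bash only; OODT itself reads these mappings from policy files rather than from shell code. The function names, the filename pattern, and the mime-type `application/x-hdf5` are assumptions made up for this example.

```shell
#!/usr/bin/env bash
# Illustration only: OODT keeps these mappings in policy files, not in shell code.

# Lookup 1: filename regular expression -> mime-type.
# The pattern and the mime-type names are assumptions for this example.
mime_type_for() {
  local name="$1"
  local hdf5_re='\.(h5|hdf5)$'
  if [[ "$name" =~ $hdf5_re ]]; then
    echo "application/x-hdf5"
  else
    echo "application/octet-stream"
  fi
}

# Lookup 2: mime-type -> extractor specification file (one per mime-type).
extractor_config_for() {
  case "$1" in
    application/x-hdf5)
      echo "/usr/local/meerkat/extractors/katextractor/katextractor.config"
      ;;
    *)
      return 1   # no extractor registered for this mime-type
      ;;
  esac
}

# Example use (commented out; the filename is hypothetical):
# mime="$(mime_type_for observation.h5)"
# extractor_config_for "$mime" || echo "no extractor for $mime" >&2
```

Keeping the two lookups separate mirrors the split described in the list: several filename patterns can share one mime-type, and each mime-type has its own extractor specification.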