Introduction
The role of a metadata extractor is to extract metadata from one or more product types. To do so, the extractor must understand the product type format, parse the product, and return the metadata to be associated with the product. CAS-Curator, for example, uses metadata extractors to generate metadata for products in its staging area, both as a preview for the curator and during data ingestion.
Java API
The CAS-Metadata project contains an interface, org.apache.oodt.cas.metadata.MetExtractor. The API consists of two primary methods, extractMetadata and setConfigFile, each with several overloaded signatures. The interface is shown below:
/**
 * @author mattmann
 * @version $Revision$
 *
 * <p>
 * An interface for {@link Metadata} extraction. This interface expects the
 * definition of the following two parameters:
 *
 * <ul>
 * <li><b>file</b> - the file to extract {@link Metadata} from.</li>
 * <li><b>config file</b> - a pointer to the config file for this MetExtractor</li>
 * </ul>
 * </p>
 *
 */
public interface MetExtractor {

    /**
     * Extracts {@link Metadata} from a given {@link File}.
     *
     * @param f
     *            File object to extract Metadata from.
     * @return Extracted {@link Metadata} from the given {@link File}.
     * @throws MetExtractionException
     *             If any error occurs.
     */
    Metadata extractMetadata(File f) throws MetExtractionException;

    /**
     * Extracts {@link Metadata} from a given <code>/path/to/some/file</code>.
     *
     * @param filePath
     *            Path to a given file to extract Metadata from.
     * @return Extracted {@link Metadata} from the given <code>filePath</code>.
     * @throws MetExtractionException
     *             If any error occurs.
     */
    Metadata extractMetadata(String filePath) throws MetExtractionException;

    /**
     * Extracts {@link Metadata} from a given {@link URL} pointer to a
     * {@link File}.
     *
     * @param fileUrl
     *            The URL pointer to a File.
     * @return Extracted {@link Metadata} from the given File {@link URL}.
     * @throws MetExtractionException
     *             If any error occurs.
     */
    Metadata extractMetadata(URL fileUrl) throws MetExtractionException;

    /**
     * Sets the config file for this MetExtractor to the specified {@link File}
     * <code>f</code>.
     *
     * @param f
     *            The config file for this MetExtractor.
     * @throws MetExtractionException
     */
    void setConfigFile(File f) throws MetExtractionException;

    /**
     * Sets the config file for this MetExtractor to the specified {@link File}
     * identified by <code>filePath</code>.
     *
     * @param filePath
     *            The config file path for this MetExtractor.
     * @throws MetExtractionException
     */
    void setConfigFile(String filePath) throws MetExtractionException;

    /**
     * Sets the MetExtractorConfig for the MetExtractor
     *
     * @param config
     *            The MetExtractorConfig
     */
    void setConfigFile(MetExtractorConfig config);

    /**
     * Extracts {@link Metadata} from the given {@link File} using the specified
     * config file.
     *
     * @param f
     *            The File to extract Metadata from.
     * @param configFile
     *            The config file for this MetExtractor.
     * @return Extracted {@link Metadata} from the given {@link File} using the
     *         specified config file.
     * @throws MetExtractionException
     *             If any error occurs.
     */
    Metadata extractMetadata(File f, File configFile) throws MetExtractionException;

    /**
     * Extracts {@link Metadata} from the given {@link File} using the specified
     * config file path.
     *
     * @param f
     *            The File to extract Metadata from.
     * @param configFilePath
     *            The path to the config file for this MetExtractor.
     * @return Extracted {@link Metadata} from the given {@link File} using the
     *         specified config file path.
     * @throws MetExtractionException
     *             If any error occurs.
     */
    Metadata extractMetadata(File f, String configFilePath) throws MetExtractionException;

    /**
     * Extracts {@link Metadata} from the given {@link File} using the specified
     * {@link MetExtractorConfig}.
     *
     * @param f
     *            The {@link File} from which {@link Metadata} will be extracted
     * @param config
     *            The config file for the extractor
     * @return {@link Metadata} extracted from the {@link File}
     * @throws MetExtractionException
     *             If any error occurs
     */
    Metadata extractMetadata(File f, MetExtractorConfig config) throws MetExtractionException;

    /**
     * Extracts {@link Metadata} from the given {@link URL} using the specified
     * {@link MetExtractorConfig}.
     *
     * @param fileUrl
     *            The {@link URL} from which {@link Metadata} will be extracted
     * @param config
     *            The config file for the extractor
     * @return {@link Metadata} extracted from the {@link URL}
     * @throws MetExtractionException
     *             If any error occurs
     */
    Metadata extractMetadata(URL fileUrl, MetExtractorConfig config) throws MetExtractionException;
}
Existing Implementations
A few existing implementations are documented below. Additional implementations also exist beyond those covered here, so check them out.
External Metadata Extractor
<?xml version="1.0" encoding="UTF-8"?>
<cas:externextractor xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
  <exec workingDir="">
    <extractorBinPath envReplace="true">[PWD]/extractor</extractorBinPath>
    <args>
      <arg isDataFile="true"/>
      <arg isPath="true">/usr/local/etc/testExtractor.config</arg>
    </args>
  </exec>
</cas:externextractor>

<args>
  <arg>arg1</arg>
  <arg>arg2</arg>
  <arg>arg3</arg>
</args>
- isDataFile="true" - This attribute passes the full path to the product from which metadata is to be extracted as the argument.
- isPath="true" - This attribute passes the argument encoded as a properly formed path (no char-set replacement, etc).
- envReplace="true" - This attribute replaces any part of the argument value that is enclosed in brackets ([ and ]) with the value of the environment variable whose name matches the text inside the brackets, if such an environment variable exists.
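The bracket substitution described for envReplace can be sketched as follows. This is a minimal illustration, not the OODT implementation: the class and method names are hypothetical, and leaving an unmatched [NAME] untouched is an assumption based on the "if such an environment variable exists" wording above.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: mimics the described envReplace behaviour; not OODT code. */
final class EnvReplaceSketch {

    // Matches a bracketed name such as [PWD]; the name is captured in group 1.
    private static final Pattern BRACKETED = Pattern.compile("\\[([^\\[\\]]+)\\]");

    /**
     * Replaces each [NAME] in value with env.get(NAME).
     * Assumption: a [NAME] with no matching variable is left unchanged.
     */
    static String replaceEnvVariables(String value, Map<String, String> env) {
        Matcher m = BRACKETED.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = env.get(m.group(1));
            if (replacement == null) {
                replacement = m.group(0); // keep the original bracketed text
            }
            // quoteReplacement guards against '$' and '\' in variable values
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Resolve the extractorBinPath value from the example configuration
        // against the real process environment.
        System.out.println(replaceEnvVariables("[PWD]/extractor", System.getenv()));
    }
}
```

Passing the environment in as a map, rather than calling System.getenv() inside the method, keeps the substitution easy to exercise with known values.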
The Filename Token Metadata Extractor
In many cases, the names of products to be ingested contain metadata that should be extracted from the product name and cataloged upon ingest. For this situation, we have developed the FilenameTokenMetExtractor. This extractor uses a configuration file that specifies, for each metadata element, the start position of its value in the product name and its length in characters.
Below is an example configuration file used by the FilenameTokenMetExtractor. It assumes a product name formatted as follows:
MissionName_Date_StartOrbitNumber_StopOrbitNumber.txt
<input>
  <group name="SubstringOffsetGroup">
    <vector name="MissionName">
      <element>1</element>
      <element>11</element>
    </vector>
    <vector name="Date">
      <element>13</element>
      <element>4</element>
    </vector>
    <vector name="StartOrbitNumber">
      <element>18</element>
      <element>16</element>
    </vector>
    <vector name="StopOrbitNumber">
      <element>35</element>
      <element>15</element>
    </vector>
  </group>
  <group name="CommonMetadata">
    <scalar name="DataVersion">1.0</scalar>
    <scalar name="CollectionName">Test</scalar>
    <scalar name="DataProvider">OODT</scalar>
  </group>
</input>
In this configuration, the FilenameTokenMetExtractor will produce four metadata elements from the product name: MissionName, Date, StartOrbitNumber, and StopOrbitNumber. The first element of each vector is the start index (this assumes 1-indexed strings). The second element is the substring length.
This configuration also specifies that the metadata for all products contains three static common metadata elements: DataVersion, CollectionName, and DataProvider.
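The offset arithmetic can be checked with a small standalone sketch. It does not use the FilenameTokenMetExtractor class or parse the XML configuration. The class and method names are hypothetical, and the offsets and static values are hard-coded copies of the example configuration above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: reproduces the substring arithmetic, not the OODT extractor. */
final class FilenameTokenSketch {

    /** Returns the substring at a 1-indexed start position with the given length. */
    static String token(String name, int startIndex, int length) {
        int begin = startIndex - 1; // convert to Java's 0-indexed strings
        return name.substring(begin, begin + length);
    }

    /** Applies the offsets and static values from the example configuration. */
    static Map<String, String> extract(String productName) {
        Map<String, String> met = new LinkedHashMap<>();
        // SubstringOffsetGroup: (start index, length) pairs
        met.put("MissionName", token(productName, 1, 11));
        met.put("Date", token(productName, 13, 4));
        met.put("StartOrbitNumber", token(productName, 18, 16));
        met.put("StopOrbitNumber", token(productName, 35, 15));
        // CommonMetadata: static values added to every product
        met.put("DataVersion", "1.0");
        met.put("CollectionName", "Test");
        met.put("DataProvider", "OODT");
        return met;
    }

    public static void main(String[] args) {
        String name = "MissionName_Date_StartOrbitNumber_StopOrbitNumber.txt";
        extract(name).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

With the placeholder product name from the example, each extracted value equals its own element name (for instance, the Date element is the literal string Date), which confirms that the offsets line up with the underscore-separated tokens.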
Metadata Reader Extractor
The MetReaderExtractor, part of the OODT CAS-Metadata project, assumes that a metadata file with the naming convention "<Product Name>.met" is present in the same directory as the product. This extractor further assumes that the metadata is in the format specified in this document.
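The naming convention can be sketched as a path computation. This is only an illustration with hypothetical class and method names. It assumes that "<Product Name>" means the product's full file name, including any extension, which the text above does not state explicitly.

```java
import java.io.File;

/** Illustrative only: computes where a "<Product Name>.met" file would be expected. */
final class MetFileLocatorSketch {

    /**
     * Returns the expected metadata file for a product.
     * Assumption: the product name is the full file name, extension included.
     */
    static File metFileFor(File product) {
        return new File(product.getParentFile(), product.getName() + ".met");
    }

    public static void main(String[] args) {
        // Hypothetical staging path, used only for demonstration.
        File product = new File("/data/staging/granule.txt");
        File met = metFileFor(product);
        System.out.println(met.getPath() + " exists: " + met.exists());
    }
}
```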
Copy And Rewrite Extractor
The CopyAndRewriteExtractor is a metadata extractor that, like the MetReaderExtractor, assumes that a metadata file exists for the product from which metadata is to be extracted. This extractor reads in the original metadata file and replaces particular metadata values in it.
The CopyAndRewriteExtractor takes a configuration file in Java properties format with the following properties defined:
- numRewriteFields - The number of fields to rewrite within the original metadata file.
- rewriteFieldN - The name(s) of the fields to rewrite in the original metadata file.
- orig.met.file.path - The original path to the metadata file from which to draw the original metadata fields.
- fieldN.pattern - The string specification that details which fields to replace and to use in building the new field value.
An example of the configuration file is given below:
numRewriteFields=2
rewriteField1=ProductType
rewriteField2=FileLocation
orig.met.file.path=./src/resources/examples/samplemet.xml
ProductType.pattern=NewProductType[ProductType]
FileLocation.pattern=/new/loc/[FileLocation]
In this example configuration, two metadata elements will be rewritten: ProductType and FileLocation. The original metadata file is located at ./src/resources/examples/samplemet.xml. ProductType will be rewritten as NewProductType<original ProductType value>, and FileLocation will be rewritten as /new/loc/<original FileLocation value>.
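The rewrite rules can be sketched in plain Java. This is not the CopyAndRewriteExtractor itself. The class and method names are hypothetical, the original metadata is a simple map with made-up values instead of a parsed metadata file, and leaving an unknown [Field] reference untouched is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: applies the described rewrite patterns to a map of metadata. */
final class CopyAndRewriteSketch {

    // Matches a bracketed field reference such as [ProductType].
    private static final Pattern FIELD_REF = Pattern.compile("\\[([^\\[\\]]+)\\]");

    /** Substitutes each [Field] in pattern with that field's original value. */
    static String expand(String pattern, Map<String, String> original) {
        Matcher m = FIELD_REF.matcher(pattern);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = original.get(m.group(1));
            // Assumption: references to unknown fields are left as written.
            m.appendReplacement(out, Matcher.quoteReplacement(value != null ? value : m.group(0)));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Copies the original metadata and rewrites the fields named in the config. */
    static Map<String, String> rewrite(Map<String, String> original, Properties config) {
        Map<String, String> result = new LinkedHashMap<>(original);
        int count = Integer.parseInt(config.getProperty("numRewriteFields", "0"));
        for (int i = 1; i <= count; i++) {
            String field = config.getProperty("rewriteField" + i);
            if (field == null) {
                continue;
            }
            String pattern = config.getProperty(field + ".pattern");
            if (pattern != null) {
                result.put(field, expand(pattern, original));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Properties config = new Properties();
        config.load(new StringReader(String.join("\n",
                "numRewriteFields=2",
                "rewriteField1=ProductType",
                "rewriteField2=FileLocation",
                "ProductType.pattern=NewProductType[ProductType]",
                "FileLocation.pattern=/new/loc/[FileLocation]")));

        // Hypothetical original values standing in for the contents of samplemet.xml.
        Map<String, String> original = new LinkedHashMap<>();
        original.put("ProductType", "GenericFile");
        original.put("FileLocation", "archive/2024");

        rewrite(original, config).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

With these made-up original values, ProductType becomes NewProductTypeGenericFile and FileLocation becomes /new/loc/archive/2024. Fields not listed in the configuration are copied through unchanged.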