External Metadata Extractor

There are many situations in which developers are interested in using a metadata extractor that is not written in Java. Perhaps there is an existing extractor written in a different programming language whose source you do not have access to, or perhaps there are functional or non-functional requirements that make a different language more appropriate. We have developed the ExternMetExtractor as part of the CAS Metadata project to address this issue. The ExternMetExtractor uses a configuration file to specify the extractor working directory, the path to the executable, and any command-line arguments. An example of this configuration file is given below:
<?xml version="1.0" encoding="UTF-8"?>
<cas:externextractor xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
  <exec workingDir="">
    <extractorBinPath envReplace="true">[PWD]/extractor</extractorBinPath>
    <args>
      <arg isDataFile="true"/>
      <arg isPath="true">/usr/local/etc/testExtractor.config</arg>
    </args>
  </exec>
</cas:externextractor>
There are a number of important elements in the external metadata extractor configuration file, including the working directory (the workingDir attribute on the exec tag), the path to the executable extractor (the value of the extractorBinPath tag), and any arguments required by the extractor (the values of the arg tags).

If the workingDir attribute is left empty (a null value), the working directory, which is the directory in which the metadata file is to be generated, is assumed to be the directory in which the extractor is run.

Command-line arguments are delivered to the external extractor in the order they are listed in the configuration file. In other words,
 <args> <arg>arg1</arg> <arg>arg2</arg> <arg>arg3</arg> </args> 
would be passed to the extractor as arg1 arg2 arg3.

Additionally, there are a number of specializations of the arg tag that can be set with tag attributes. Specifically:
  • isDataFile="true" - This attribute passes the full path to the product from which metadata is to be extracted as the argument.
  • isPath="true" - This attribute passes the argument encoded as a properly formed path (no char-set replacement, etc).
  • envReplace="true" - This attribute replaces any part of the value of the argument that is inside brackets ([ and ]) with the value of the environment variable matching the text inside the brackets, if such an environment variable exists.
For an example of the use of this type of metadata extractor, see our CAS-Curator Basic User Guide.
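
The argument handling described above can be sketched in a few lines of Python. This is an illustration only, not the ExternMetExtractor implementation: the helper names (env_replace, build_command) and the product path are hypothetical, isPath arguments are simply passed through unchanged, and a bracketed name with no matching environment variable is assumed to be left as written.

```python
import os
import re
import xml.etree.ElementTree as ET

# The example configuration from this section.
CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<cas:externextractor xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
  <exec workingDir="">
    <extractorBinPath envReplace="true">[PWD]/extractor</extractorBinPath>
    <args>
      <arg isDataFile="true"/>
      <arg isPath="true">/usr/local/etc/testExtractor.config</arg>
    </args>
  </exec>
</cas:externextractor>
"""


def env_replace(value, env):
    """Replace each [NAME] with env[NAME]; keep [NAME] as-is if NAME is not set (assumption)."""
    return re.sub(r"\[([^\[\]]+)\]",
                  lambda m: env.get(m.group(1), m.group(0)),
                  value)


def is_true(elem, attr):
    """True when the XML attribute is present and equal to "true"."""
    return elem.get(attr, "false").lower() == "true"


def build_command(config_xml, product_path, env=None):
    """Return (argument list, working directory or None) for the external extractor."""
    env = os.environ if env is None else env
    root = ET.fromstring(config_xml.encode("utf-8"))
    exec_elem = root.find("exec")

    bin_elem = exec_elem.find("extractorBinPath")
    binary = (bin_elem.text or "").strip()
    if is_true(bin_elem, "envReplace"):
        binary = env_replace(binary, env)

    command = [binary]
    # Arguments are appended in the order they appear in the file.
    for arg in exec_elem.findall("args/arg"):
        if is_true(arg, "isDataFile"):
            value = product_path  # full path to the product being extracted
        else:
            value = (arg.text or "").strip()  # isPath values pass through unchanged (assumption)
            if is_true(arg, "envReplace"):
                value = env_replace(value, env)
        command.append(value)

    # An empty workingDir means "run where the extractor is run".
    working_dir = exec_elem.get("workingDir") or None
    return command, working_dir


command, working_dir = build_command(
    CONFIG, "/data/product.txt", env={"PWD": "/opt/oodt"})
print(command, working_dir)
```

The resulting list could then be handed to subprocess.run(command, cwd=working_dir), where a cwd of None leaves the process in the current directory.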

The Filename Token Metadata Extractor

In many cases, products that are to be ingested are named with metadata that should be extracted from the product name and cataloged upon ingest. For this type of situation, we have developed the FilenameTokenMetExtractor. This extractor uses a configuration file that specifies, for each metadata element, the index of its start position in the name and its character length.

Below is an example configuration file used by the FilenameTokenMetExtractor. It assumes a product name formatted as follows:

MissionName_Date_StartOrbitNumber_StopOrbitNumber.txt

<input>
  <group name="SubstringOffsetGroup">
    <vector name="MissionName">
      <element>1</element>
      <element>11</element>
    </vector>
    <vector name="Date">
      <element>13</element>
      <element>4</element>
    </vector>
    <vector name="StartOrbitNumber">
      <element>18</element>
      <element>16</element>
    </vector>
    <vector name="StopOrbitNumber">
      <element>35</element>
      <element>15</element>
    </vector>
  </group>
  <group name="CommonMetadata">
    <scalar name="DataVersion">1.0</scalar>
    <scalar name="CollectionName">Test</scalar>
    <scalar name="DataProvider">OODT</scalar>
  </group>
</input>

In this configuration, the FilenameTokenMetExtractor will produce four metadata elements from the product name: MissionName, Date, StartOrbitNumber, and StopOrbitNumber. The first element of each vector is the start index (this assumes 1-indexed strings). The second element is the substring length.

Additionally, this configuration specifies that the metadata for all products contains three static common metadata elements: DataVersion, CollectionName, and DataProvider.
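
As a rough illustration of the offset arithmetic, the Python sketch below applies the configuration above to the literal product name pattern. It is not the FilenameTokenMetExtractor code; the function name extract_filename_metadata is hypothetical. The offsets in the example happen to match the lengths of the placeholder names themselves, so each extracted value equals its own field name.

```python
import xml.etree.ElementTree as ET

# The example configuration from this section.
CONFIG = """<input>
  <group name="SubstringOffsetGroup">
    <vector name="MissionName"><element>1</element><element>11</element></vector>
    <vector name="Date"><element>13</element><element>4</element></vector>
    <vector name="StartOrbitNumber"><element>18</element><element>16</element></vector>
    <vector name="StopOrbitNumber"><element>35</element><element>15</element></vector>
  </group>
  <group name="CommonMetadata">
    <scalar name="DataVersion">1.0</scalar>
    <scalar name="CollectionName">Test</scalar>
    <scalar name="DataProvider">OODT</scalar>
  </group>
</input>"""


def extract_filename_metadata(config_xml, product_name):
    """Cut substrings out of product_name and add the static common metadata."""
    root = ET.fromstring(config_xml)
    metadata = {}
    for group in root.findall("group"):
        if group.get("name") == "SubstringOffsetGroup":
            for vector in group.findall("vector"):
                # First element: 1-indexed start position. Second: length.
                start, length = (int(e.text) for e in vector.findall("element"))
                begin = start - 1  # convert to Python's 0-indexed slicing
                metadata[vector.get("name")] = product_name[begin:begin + length]
        elif group.get("name") == "CommonMetadata":
            for scalar in group.findall("scalar"):
                metadata[scalar.get("name")] = scalar.text
    return metadata


name = "MissionName_Date_StartOrbitNumber_StopOrbitNumber.txt"
for key, value in extract_filename_metadata(CONFIG, name).items():
    print(key, "=", value)
```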

Metadata Reader Extractor

The MetReaderExtractor, part of the OODT CAS-Metadata project, assumes that a metadata file with the naming convention "<Product Name>.met" is present in the same directory as the product. This extractor further assumes that the metadata is in the format specified in this document.

Copy And Rewrite Extractor

The CopyAndRewriteExtractor is a metadata extractor that, like the MetReaderExtractor, assumes that a metadata file exists for the product from which metadata is to be extracted. This extractor reads in the original metadata file and replaces particular metadata values in that metadata file.

The CopyAndRewriteExtractor takes in a configuration file in Java properties format with the following properties defined:

  • numRewriteFields - The number of fields to rewrite within the original metadata file.
  • rewriteFieldN - The name(s) of the fields to rewrite in the original metadata file.
  • orig.met.file.path - The original path to the metadata file from which to draw the original metadata fields.
  • fieldN.pattern - The string specification that details which fields to replace and to use in building the new field value.

An example of the configuration file is given below:

numRewriteFields=2
rewriteField1=ProductType
rewriteField2=FileLocation
orig.met.file.path=./src/resources/examples/samplemet.xml
ProductType.pattern=NewProductType[ProductType]
FileLocation.pattern=/new/loc/[FileLocation]

In this example configuration, two metadata elements will be rewritten: ProductType and FileLocation. The original metadata file is located at ./src/resources/examples/samplemet.xml. The ProductType will be rewritten as NewProductType<original ProductType value>. The FileLocation will be rewritten as /new/loc/<original FileLocation value>.
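
The pattern substitution can be sketched in Python as follows. This is a simplified illustration rather than the CopyAndRewriteExtractor code: the helper names are hypothetical, the original metadata is supplied as a made-up dictionary instead of being read from samplemet.xml, and a bracketed name with no matching original field is assumed to be left unchanged.

```python
import re

# The example configuration from this section.
CONFIG = """\
numRewriteFields=2
rewriteField1=ProductType
rewriteField2=FileLocation
orig.met.file.path=./src/resources/examples/samplemet.xml
ProductType.pattern=NewProductType[ProductType]
FileLocation.pattern=/new/loc/[FileLocation]
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith(("#", "!")):
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def rewrite_metadata(props, original):
    """Return a copy of the original metadata with the configured fields rewritten."""
    rewritten = dict(original)
    for i in range(1, int(props["numRewriteFields"]) + 1):
        field = props[f"rewriteField{i}"]
        pattern = props[f"{field}.pattern"]
        # Substitute each [FieldName] with that field's original value.
        rewritten[field] = re.sub(
            r"\[([^\[\]]+)\]",
            lambda m: original.get(m.group(1), m.group(0)),
            pattern)
    return rewritten


# Hypothetical original metadata values, for illustration only.
original = {"ProductType": "GenericFile", "FileLocation": "archive", "Filename": "a.txt"}
print(rewrite_metadata(parse_properties(CONFIG), original))
```

Fields that are not listed in the configuration, such as the hypothetical Filename above, are carried over unchanged.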