Double-clicking on "mp3", we can see that the staging area path in the top left is now /mp3 and that Bach-SuiteNo2.mp3 appears in the main left staging pane. For the time being, no metadata is detected (as reported in the main right staging pane). In the next section, we will set up a basic command-line metadata extractor in order to show how extractors are integrated into CAS-Curator.

Extractor Setup

The CAS-Curator uses ancillary programs called metadata extractors to produce the metadata that it associates with products. More information about metadata extractors can be found in the Extractor Basics User's Guide.
As with the staging area, we first need to set up an area in the file system for metadata extractors. We will call this directory extractors:

Code Block
mkdir /usr/local/extractors

In order to register the metadata extractor path with the CAS-Curator, we will need to add another parameter to the web application's context file. Add the following parameter:

Code Block

<Parameter name="org.apache.oodt.cas.curator.metExtractorConf.uploadPath"
                value="/usr/local/extractors" />    

We are going to make a metadata extractor that will extract ID3 tag metadata, such as author, title, resource type, etc., from mp3s. As a first step, we will create a directory for the new extractor. The name of this directory is important, because CAS-Curator will use the directory name to register the extractor. We will name this directory mp3extractor:

Code Block

mkdir /usr/local/extractors/mp3extractor

While we could write a custom extractor in Java for the CAS-Curator, there are multiple existing software packages that read mp3 ID3 tags. For these situations, where an external, command-line extractor exists, we have developed the ExternMetExtractor class in the CAS-Metadata project.
For this example, we are going to leverage Apache Tika, an existing open source mime-type detector with text and metadata parsing capabilities. Tika parses a number of common data formats, including audio formats like mp3. I'll leave it to the reader of this guide to download and install Tika. We will assume that the latest release of the tika-app jar is in the mp3extractor directory.
We have a little work to do to convert the output of Tika into a metadata file compatible with CAS-Curator. By default, Tika produces metadata in a "key: value" format as shown in the command-line session below:

Code Block

java -jar tika-app-0.5-SNAPSHOT.jar -m \
    /usr/local/staging/products/mp3/Bach-SuiteNo2.mp3
Author: Johann Sebastian Bach
Content-Type: audio/mpeg
resourceName: Bach-SuiteNo2.mp3
title: Bach Cello Suite No 2  

With a little AWK magic, we can convert this output to the CAS-Metadata XML format:

Code Block

java -jar tika-app-0.5-SNAPSHOT.jar -m \
  /usr/local/staging/products/mp3/Bach-SuiteNo2.mp3 | awk -F:\
  'BEGIN \
  {print "<cas:metadata xmlns:cas=\"http://oodt.jpl.nasa.gov/1.0/cas\">"}\
  {print "<keyval><key>"$1"</key><val>"substr($2,2)"</val></keyval>"}\
  END {print "</cas:metadata>"}'
<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
<keyval><key>Author</key><val>Johann Sebastian Bach</val></keyval>
<keyval><key>Content-Type</key><val>audio/mpeg</val></keyval>
<keyval><key>resourceName</key><val>Bach-SuiteNo2.mp3</val></keyval>
<keyval><key>title</key><val>Bach Cello Suite No 2</val></keyval>
</cas:metadata>            
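
The same translation can be sketched in Python, the language used for the extractor script later in this guide. This is a minimal illustration and not part of CAS-Curator or Tika: the function name tika_to_cas_xml is hypothetical. Unlike the AWK one-liner, which keeps only the text between the first and second colon, this sketch splits on the first colon only and escapes XML special characters, so a value such as "Suite: No 2" or "A & B" would survive intact.

```python
from xml.sax.saxutils import escape

CAS_NS = "http://oodt.jpl.nasa.gov/1.0/cas"


def tika_to_cas_xml(tika_output):
    """Convert Tika's "key: value" lines into CAS-Metadata XML text.

    Hypothetical helper for illustration; not an OODT or Tika API.
    """
    lines = ['<cas:metadata xmlns:cas="%s">' % CAS_NS]
    for line in tika_output.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, so values containing ':' stay intact.
        key, value = line.split(":", 1)
        lines.append(
            "<keyval><key>%s</key><val>%s</val></keyval>"
            % (escape(key.strip()), escape(value.strip()))
        )
    lines.append("</cas:metadata>")
    return "\n".join(lines)


sample = "Author: Johann Sebastian Bach\nContent-Type: audio/mpeg\n"
print(tika_to_cas_xml(sample))
```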

Cool as a one-line format translator is, we are actually going to have to do a little more work to create an extractor capable of producing metadata for CAS-Curator. A requirement for metadata extractors that are to be integrated with CAS-Curator is that they produce three pieces of metadata:
ProductType
FileLocation
Filename
We should note that this is NOT a general requirement of all metadata extractors, but a ramification of the current implementation of CAS-Curator. In order to produce this extra metadata, we will develop a small Python script:

Code Block

#!/usr/bin/python

import os
import sys

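# The data file to extract from is passed as the first argument.
fullPath = sys.argv[1]
pathElements = fullPath.split("/")
fileName = pathElements[len(pathElements)-1]
fileLocation = fullPath[:(len(fullPath)-len(fileName))]
productType = "MP3"

# Run Tika and convert its "key: value" output to CAS-Metadata XML.
# The closing </cas:metadata> tag is appended further down.
cmd = "java -jar /usr/local/extractors/mp3extractor/"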
cmd += "tika-app-0.5-SNAPSHOT.jar -m "+fullPath+" | awk -F:"
cmd += " 'BEGIN {print \"<cas:metadata xmlns:cas="
cmd += "\\\"http://oodt.jpl.nasa.gov/1.0/cas\\\">\"}"
cmd += " {print \"<keyval><key>\"$1\"</key><val>\"substr($2,2)\""
cmd += "</val></keyval>\"}' > "+fileName+".met"

os.system(cmd)

f = open(fileName+".met", 'a')
f.write('<keyval><key>ProductType</key><val>'+productType)
f.write('</val></keyval>\n<keyval><key>Filename</key><val>')
f.write(fileName+'</val></keyval>\n<keyval><key>FileLocation')
f.write('</key><val>'+fileLocation+'</val></keyval>\n')
f.write('</cas:metadata>')
f.close()

We'll assume that you have Python installed at /usr/bin/python and you have named this script mp3PythonExtractor.py and placed it in /usr/local/extractors/mp3extractor. We'll need to make sure it is executable from the command-line:

Code Block

cd /usr/local/extractors/mp3extractor
chmod +x mp3PythonExtractor.py
./mp3PythonExtractor.py \
 /usr/local/staging/products/mp3/Bach-SuiteNo2.mp3
<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
<keyval><key>Author</key><val>Johann Sebastian Bach</val></keyval>
<keyval><key>Content-Type</key><val>audio/mpeg</val></keyval>
<keyval><key>resourceName</key><val>Bach-SuiteNo2.mp3</val></keyval>
<keyval><key>title</key><val>Bach Cello Suite No 2</val></keyval>
<keyval><key>ProductType</key><val>MP3</val></keyval>
<keyval><key>Filename</key><val>Bach-SuiteNo2.mp3</val></keyval>
<keyval><key>FileLocation</key><val>/usr/local/staging/products/mp3
</val></keyval>
</cas:metadata>
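
As a quick sanity check on the three keys that CAS-Curator requires, the sketch below parses a CAS-Metadata document and reports any of ProductType, FileLocation, or Filename that are missing. It is an illustration only: the function name missing_required_keys is hypothetical, and the check assumes the flat keyval layout shown in this guide.

```python
import xml.etree.ElementTree as ET

REQUIRED_KEYS = ("ProductType", "FileLocation", "Filename")


def missing_required_keys(met_xml):
    """Return the required keys that are absent from a CAS-Metadata document.

    Hypothetical helper for illustration; not an OODT API.
    """
    root = ET.fromstring(met_xml)
    # The <key> elements carry no namespace, so a plain tag name matches them.
    present = {key.text.strip() for key in root.iter("key") if key.text}
    return [name for name in REQUIRED_KEYS if name not in present]


example = (
    '<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">'
    "<keyval><key>ProductType</key><val>MP3</val></keyval>"
    "<keyval><key>Filename</key><val>Bach-SuiteNo2.mp3</val></keyval>"
    "</cas:metadata>"
)
print(missing_required_keys(example))  # prints ['FileLocation']
```

An empty list means all three required keys are present.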

Now that we have a metadata extractor that meets our requirements (it's callable from the command-line, it produces CAS-Metadata compatible XML, and it extracts ProductType, Filename, and FileLocation), the next step is to create an ExternMetExtractor configuration file. This file will configure CAS-Metadata's ExternMetExtractor to call the mp3PythonExtractor.py script correctly.
There is more information about ExternMetExtractor configuration available in CAS-Metadata's Extractor Basics User's Guide. For the purposes of this guide, we will assume that the reader is familiar with configuration of this extractor, so we will just present the configuration below (we assume that you name this file mp3PythonExtractor.config):

Code Block
    
<?xml version="1.0" encoding="UTF-8"?>
<cas:externextractor xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
   <exec workingDir="">
      <extractorBinPath>
/usr/local/extractors/mp3extractor/mp3PythonExtractor.py
      </extractorBinPath>
      <args>
         <arg isDataFile="true"/>
      </args>
   </exec>
</cas:externextractor>
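
Because a typo in this file is easy to miss, it can help to confirm that the configuration is well-formed XML and that the script path reads back as expected. The sketch below is one way to do that with Python's standard library. The function name extractor_bin_path is hypothetical, and this is not how ExternMetExtractor itself reads the file.

```python
import xml.etree.ElementTree as ET


def extractor_bin_path(config_xml):
    """Return the extractorBinPath text from an ExternMetExtractor config.

    Hypothetical helper for illustration; not an OODT API.
    """
    root = ET.fromstring(config_xml)  # raises ParseError if not well-formed
    node = root.find("exec/extractorBinPath")
    if node is None or not (node.text or "").strip():
        raise ValueError("config has no extractorBinPath")
    # The path sits on its own line in the file above, so trim whitespace.
    return node.text.strip()


config = """<cas:externextractor xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
   <exec workingDir="">
      <extractorBinPath>
/usr/local/extractors/mp3extractor/mp3PythonExtractor.py
      </extractorBinPath>
      <args>
         <arg isDataFile="true"/>
      </args>
   </exec>
</cas:externextractor>"""
print(extractor_bin_path(config))
```

For a file on disk, ET.parse(path).getroot() can be used in place of ET.fromstring.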

The last step in configuring our mp3 metadata extractor is to provide a properties file for CAS-Curator so that it knows how to call the ExternMetExtractor. Each extractor used by CAS-Curator needs a config.properties file. This file sets two properties:

* extractor.classname
* extractor.config.files

Create a config.properties file (this name is important for CAS-Curator to pick up the configuration) in the /usr/local/extractors/mp3extractor directory. This file should consist of the following parameters:

Code Block

extractor.classname=org.apache.oodt.cas.metadata.extractors.ExternMetExtractor
extractor.config.files=/usr/local/extractors/mp3extractor/mp3PythonExtractor.config
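
To double-check that both properties are set, a small script along the following lines could be used. It is a sketch under stated assumptions: the helper names parse_properties and missing_properties are hypothetical, and the parser handles only plain key=value lines and # comments, not the full Java properties syntax.

```python
REQUIRED = ("extractor.classname", "extractor.config.files")


def parse_properties(text):
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def missing_properties(text):
    """Return the required property names that are absent or empty."""
    props = parse_properties(text)
    return [name for name in REQUIRED if not props.get(name)]


sample = (
    "extractor.classname="
    "org.apache.oodt.cas.metadata.extractors.ExternMetExtractor\n"
    "extractor.config.files="
    "/usr/local/extractors/mp3extractor/mp3PythonExtractor.config\n"
)
print(missing_properties(sample))  # prints []
```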

To recap, we first created a Python script that calls Apache Tika to extract metadata from mp3 files. Then we created a configuration file that configures CAS-Metadata's ExternMetExtractor to call this Python script. Finally, we created a properties file that tells the CAS-Curator how to call the ExternMetExtractor. To confirm the configuration of this extractor, we can long list the extractor directory:

Code Block

cd /usr/local/extractors/mp3extractor
ls -l
total 51448
-rw-r--r--  1 -  -       167 Nov 27 13:50 config.properties
-rw-r--r--  1 -  -       328 Nov 27 13:49 mp3PythonExtractor.config
-rwxr-xr-x  1 -  -       702 Nov 27 13:49 mp3PythonExtractor.py
-rw-r--r--  1 -  -  26325155 Nov 27 13:46 tika-app-0.5-SNAPSHOT.jar

Once you restart Tomcat, the change you have made to the context file will be used. The extractor area will now be set to /usr/local/extractors.

In the above screenshot, we see that, upon clicking on the mp3 file, metadata produced by the mp3extractor is shown in the main right staging pane. Now staging and extraction are set up. In the next section, we will set up a CAS-Filemgr instance and show how CAS-Curator can be used to ingest products.