Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

This document serves as a basic user's guide for the CAS-Curator project. The goal of the document is to allow users to check out, build, and install the base version of the CAS-Curator, as well as perform basic configuration tasks. For advanced topics, such as customizing the look and feel of the CAS-Curator for your project, please see our Advanced Guide.
The remainder of this guide is separated into the following sections:
Download and Build
Tomcat Deployment
Staging Area Setup
Extractor Setup
File Manager Configuration

...

While we could write a custom extractor in Java for the Cas-Curator, there are multiple existing software packages that read mp3 ID3 tags. For these situations, where an external, command-line extractor exists, we have developed the ExternMetExtractor class in the CAS-Metadata project.
For this example, we are going to leaverage leverage an existing, open source mime-type detector with text and metadata parsing capabilities called Apache Tika. Tika parses a number of different common data formats, including a number of audio formats like mp3. I'll leave it to the reader of this guide to download and install Tika. We will assume that the latest release of the tika-app jar is in the mp3extractor directory.
We have a little work to do to convert the output of Tika into a metadata file compatible with CAS-Curator. By default, Tika produces metadata in a "key: value" format as shown in the command-line session below:

Code Block
java -jar tika-app-01.5-SNAPSHOT4.jar -m \
    /usr/local/staging/products/mp3/Bach-SuiteNo2.mp3
Author: Johann Sebastian Bach
Content-Type: audio/mpeg
resourceName: Bach-SuiteNo2.mp3
title: Bach Cello Suite No 2  

...

Code Block
java -jar tika-app-01.5-SNAPSHOT4.jar -m \
  /usr/local/staging/products/mp3/Bach-SuiteNo2.mp3 | awk -F:\
  'BEGIN \
  {print "<cas:metadata xmlns:cas=\"http://oodt.jpl.nasa.gov/1.0/cas\">"}\
  {print "<keyval><key>"$1"</key><val>"substr($2,21)"</val></keyval>"}\
  END {print "</cas:metadata>"}'
<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
<keyval><key>Author</key><val>Johann Sebastian Bach</val></keyval>
<keyval><key>Content-Type</key><val>audio/mpeg</val></keyval>
<keyval><key>resourceName</key><val>Bach-SuiteNo2.mp3</val></keyval>
<keyval><key>title</key><val>Bach Cello Suite No 2</val></keyval>
</cas:metadata>            

Cool as a one line format translater translator is, we are actually going to have to do a little more work to create an extractor capable of producing metadata for CAS-Curator. A requirement for metadata extractors that are to be integrated with CAS-Curator is that they product three pieces of metadata:
ProductType
FileLocation
Filename
We should note that this is NOT a general requirement of all metadata extractors, but a ramification of the current implementation of CAS-Curator. In order to product this extra metadata, we will develop a small Python script:

Code Block
#!/usr/bin/python

import os
import sys

fullPath = sys.argv[1]
pathElements = fullPath.split("/");
fileName = pathElements[len(pathElements)-1]
fileLocation = fullPath[:(len(fullPath)-len(fileName))]
productType = "MP3"

cmd = "java -jar /Users/woollard/Desktop/extractors/mp3extractor/"
cmd += "tika-app-01.5-SNAPSHOT4.jar -m "+fullPath+" | awk -F:"
cmd += " 'BEGIN {print \"<cas:metadata xmlns:cas="
cmd += "\\\"http://oodt.jpl.nasa.gov/1.0/cas\\\">\"}"
cmd += " {print \"<keyval><key>\"$1\"</key><val>\"substr($2,21)\""
cmd += "</val></keyval>\"}' > "+fileName+".met"

os.system(cmd)

f = open(fileName+".met", 'a')
f.write('<keyval><key>ProductType</key><val>'+productType)
f.write('</val></keyval>\n<keyval><key>Filename</key><val>')
f.write(fileName+'</val></keyval>\n<keyval><key>FileLocation')
f.write('</key><val>'+fileLocation+'</val></keyval>\n')
f.write('</cas:metadata>')
f.close()

...