...
This document serves as a basic user's guide for the CAS-Curator project. The goal of the document is to allow users to check out, build, and install the base version of the CAS-Curator, as well as perform basic configuration tasks. For advanced topics, such as customizing the look and feel of the CAS-Curator for your project, please see our Advanced Guide.
The remainder of this guide is separated into the following sections:
Download and Build
Tomcat Deployment
Staging Area Setup
Extractor Setup
File Manager Configuration
...
While we could write a custom extractor in Java for the CAS-Curator, there are multiple existing software packages that read mp3 ID3 tags. For these situations, where an external, command-line extractor exists, we have developed the ExternMetExtractor class in the CAS-Metadata project.
For this example, we are going to leverage an existing, open source mime-type detector with text and metadata parsing capabilities called Apache Tika. Tika parses a number of different common data formats, including a number of audio formats like mp3. I'll leave it to the reader of this guide to download and install Tika. We will assume that the latest release of the tika-app jar is in the mp3extractor directory.
We have a little work to do to convert the output of Tika into a metadata file compatible with CAS-Curator. By default, Tika produces metadata in a "key: value" format as shown in the command-line session below:
```
java -jar tika-app-1.4.jar -m \
  /usr/local/staging/products/mp3/Bach-SuiteNo2.mp3
Author: Johann Sebastian Bach
Content-Type: audio/mpeg
resourceName: Bach-SuiteNo2.mp3
title: Bach Cello Suite No 2
```
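To make the "key: value" format concrete, here is a minimal Python sketch that reads output like the session above into a dictionary. The function name `parse_tika_metadata` is hypothetical and not part of Tika or CAS-Curator, and the sketch assumes one key/value pair per line.

```python
def parse_tika_metadata(text):
    """Parse Tika-style "key: value" lines into a dict of strings."""
    metadata = {}
    for line in text.splitlines():
        # Split on the first ": " only, so values that contain colons survive.
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # skip blank lines and lines without a "key: value" shape
        metadata[key.strip()] = value.strip()
    return metadata


# Sample text copied from the command-line session above.
sample = """Author: Johann Sebastian Bach
Content-Type: audio/mpeg
resourceName: Bach-SuiteNo2.mp3
title: Bach Cello Suite No 2"""

parsed = parse_tika_metadata(sample)
```

Splitting on the first colon followed by a space, rather than on every colon, is a design choice that keeps values such as URLs or timestamps intact.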
...
```
java -jar tika-app-1.4.jar -m \
  /usr/local/staging/products/mp3/Bach-SuiteNo2.mp3 | awk -F: '
  BEGIN {print "<cas:metadata xmlns:cas=\"http://oodt.jpl.nasa.gov/1.0/cas\">"}
  {print "<keyval><key>"$1"</key><val>"substr($2,2)"</val></keyval>"}
  END {print "</cas:metadata>"}'
<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
<keyval><key>Author</key><val>Johann Sebastian Bach</val></keyval>
<keyval><key>Content-Type</key><val>audio/mpeg</val></keyval>
<keyval><key>resourceName</key><val>Bach-SuiteNo2.mp3</val></keyval>
<keyval><key>title</key><val>Bach Cello Suite No 2</val></keyval>
</cas:metadata>
```
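The awk command above copies keys and values into the XML verbatim, so a value containing `&` or `<` would yield a malformed file. As a hedged alternative, the sketch below does the same translation in Python and escapes each value with the standard library's `xml.sax.saxutils.escape`. The function name `to_cas_metadata` is hypothetical; the element names and namespace are taken from the output shown above.

```python
from xml.sax.saxutils import escape

# Namespace URI as it appears in the awk output above.
CAS_NS = "http://oodt.jpl.nasa.gov/1.0/cas"


def to_cas_metadata(pairs):
    """Render (key, value) pairs as a cas:metadata XML string."""
    lines = ['<cas:metadata xmlns:cas="%s">' % CAS_NS]
    for key, value in pairs:
        # escape() replaces &, < and > so the document stays well-formed.
        lines.append("<keyval><key>%s</key><val>%s</val></keyval>"
                     % (escape(key), escape(value)))
    lines.append("</cas:metadata>")
    return "\n".join(lines)


xml_text = to_cas_metadata([
    ("Author", "Johann Sebastian Bach"),
    ("title", "Suites 1 & 2"),  # illustrative value containing an ampersand
])
```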
Cool as a one-line format translator is, we are actually going to have to do a little more work to create an extractor capable of producing metadata for CAS-Curator. A requirement for metadata extractors that are to be integrated with CAS-Curator is that they produce three pieces of metadata:
ProductType
FileLocation
Filename
We should note that this is NOT a general requirement of all metadata extractors, but a consequence of the current implementation of CAS-Curator. To produce this extra metadata, we will develop a small Python script:
```python
#!/usr/bin/python
import os
import sys

# Full path of the product file, passed as the first command-line argument.
fullPath = sys.argv[1]
pathElements = fullPath.split("/")
fileName = pathElements[-1]
fileLocation = fullPath[:(len(fullPath) - len(fileName))]
productType = "MP3"

# Run Tika and translate its "key: value" output into keyval elements.
# The closing </cas:metadata> tag is left off here and appended below.
cmd = "java -jar /Users/woollard/Desktop/extractors/mp3extractor/"
cmd += "tika-app-1.4.jar -m " + fullPath + " | awk -F:"
cmd += " 'BEGIN {print \"<cas:metadata xmlns:cas="
cmd += "\\\"http://oodt.jpl.nasa.gov/1.0/cas\\\">\"}"
cmd += " {print \"<keyval><key>\"$1\"</key><val>\"substr($2,2)\""
cmd += "</val></keyval>\"}' > " + fileName + ".met"
os.system(cmd)

# Append the three keys CAS-Curator requires, then close the root element.
f = open(fileName + ".met", 'a')
f.write('<keyval><key>ProductType</key><val>' + productType)
f.write('</val></keyval>\n<keyval><key>Filename</key><val>')
f.write(fileName + '</val></keyval>\n<keyval><key>FileLocation')
f.write('</key><val>' + fileLocation + '</val></keyval>\n')
f.write('</cas:metadata>')
f.close()
```
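The path handling in the script above can also be written with `os.path`, which some readers may find easier to follow. The sketch below is only an illustration: the function name `required_curator_metadata` is hypothetical, and the trailing separator on `FileLocation` is kept on the assumption that it should match what the script's slicing produces.

```python
import os


def required_curator_metadata(full_path, product_type="MP3"):
    """Return the three keys CAS-Curator expects, derived from a file path."""
    file_location, file_name = os.path.split(full_path)
    return {
        "ProductType": product_type,
        # Joining with "" adds a trailing separator to a non-empty directory,
        # matching the fileLocation value computed by the script above.
        "FileLocation": os.path.join(file_location, ""),
        "Filename": file_name,
    }


meta = required_curator_metadata(
    "/usr/local/staging/products/mp3/Bach-SuiteNo2.mp3")
```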
...