The content of this page is reproduced from the tutorial written by Rishi Verma for OODT-217. This wiki is meant as a temporary holding place until the documentation makes its way into the official OODT source.
The following guide serves as a hands-on learning exercise for explaining the basics of how a CAS-PGE project can be set up and used. For this exercise, it will be necessary to download a sample project. Please obtain this sample project from the following link: fileconcatenator-pge.tar.
Example Overview
The example project detailed in this exercise is called FileConcatenatorPGE. This CAS-PGE project performs two functions. First, it collects two input files and concatenates them into a single output file. Second, it generates metadata from the generated product and ingests this metadata into a cas-filemanager instance.
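The concatenation step is small enough to try by hand. The following is a minimal shell sketch of it, using throwaway files in a temporary directory; the file names and contents are made up for illustration and are not the sample project's own inputs.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)

# Two stand-in input files (contents are illustrative only).
printf 'first input\n'  > "$workdir/input1.txt"
printf 'second input\n' > "$workdir/input2.txt"

# Copy the first input to the output, then append the second.
cp "$workdir/input1.txt" "$workdir/output.txt"
cat "$workdir/input2.txt" >> "$workdir/output.txt"

# Show the concatenated result.
cat "$workdir/output.txt"
```

The metadata generation and ingestion half of the project is handled by CAS-PGE and the file manager, so it is not reproduced in this sketch.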
Requirements
- A deployed CAS-Workflow instance. See Workflow Basic User Guide for instructions on how to set this component up
- A deployed CAS-Filemanager instance. See File Manager Basic User Guide for instructions on how to set this component up. Also see OODT Filemgr User Guide
- A deployed CAS-Crawler. See Crawler User Guide for instructions on how to set this component up. Also see OODT Crawler Help
- Maven 2
- Environment variables
  - CRAWLER_HOME
  - WORKFLOW_HOME
  - WORKFLOW_URL
  - FILEMGR_URL
  - PGE_ROOT = /usr/local/pge
    - The directory in which the PGE scripts and configuration files will reside.
- Ensure $WORKFLOW_HOME/lib contains at least the following:
  - cas-crawler-<VERSION>.jar
  - cas-filemgr-<VERSION>.jar
  - cas-pge-<VERSION>.jar
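Before continuing, it can help to confirm that the environment variables and JAR files listed above are in place. The following is a rough preflight sketch, not part of the sample project; the function names missing_vars and missing_jars are hypothetical helpers introduced here.

```shell
# Print the name of each variable from the list above that is unset or empty.
missing_vars() {
  for name in CRAWLER_HOME WORKFLOW_HOME WORKFLOW_URL FILEMGR_URL PGE_ROOT; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
    fi
  done
}

# Print each required JAR family that has no match in the given lib directory.
missing_jars() {
  libdir=$1
  for prefix in cas-crawler cas-filemgr cas-pge; do
    # With default shell settings an unmatched glob stays literal,
    # so the -e test below fails when no such JAR exists.
    set -- "$libdir"/"$prefix"-*.jar
    if [ ! -e "$1" ]; then
      echo "$prefix-<VERSION>.jar"
    fi
  done
}

missing_vars
missing_jars "${WORKFLOW_HOME:-/nonexistent}/lib"
```

If both functions print nothing, the variables are set and at least one JAR of each kind was found; the sketch does not check versions.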
1. Setting up CAS-PGE relevant directories.
There are a number of components associated with typical CAS-PGE deployments. These can include CAS-PGE configuration files, external scripts, input files, output files, etc. The steps below guide you through setting up a configuration directory for the FileConcatenator PGE project as well as a deployment directory for running your PGE. Note that the deployment directory could be located anywhere; for a production project, it is assumed that this directory could be shared among multiple PGE services.
- Create CAS-PGE configuration directory
cd /usr/local
mkdir -p $PGE_ROOT/file_concatenator/pge-configs
- Create CAS-PGE deployment directory
mkdir -p $PGE_ROOT/file_concatenator/output/jobs
- Create CAS-PGE input files directory
mkdir -p $PGE_ROOT/file_concatenator/files
- Create CAS-PGE extractors directory
mkdir -p $PGE_ROOT/file_concatenator/extractors/metlistwriter
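The four directories from this step can also be created in one pass. The following sketch does so in a loop; as an assumption made only for this example, it falls back to a scratch directory when PGE_ROOT is not set, so it can be tried without writing to /usr/local.

```shell
# Assumption: use a scratch directory if PGE_ROOT is not already set.
PGE_ROOT=${PGE_ROOT:-$(mktemp -d)}
base="$PGE_ROOT/file_concatenator"

# The configuration, deployment, input and extractor directories from this step.
for sub in pge-configs output/jobs files extractors/metlistwriter; do
  mkdir -p "$base/$sub"
done

# List the directories that now exist under the project root.
find "$base" -type d | sort
```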
2. Download the FileConcatenatorPGE project
The FileConcatenatorPGE project is a Java project that uses the Maven build system for producing a run-time CAS-PGE library. Please follow the below instructions to download and extract the project.
- Download FileConcatenatorPGE
- Extract project
tar xf fileconcatenator-pge.tar -C /usr/local/src
3. Customize and deploy the CAS-PGE configuration file
The CAS-PGE configuration file, which identifies the steps involved in executing the PGE, is located at fileconcatenator-pge/src/main/resources/config/PGEConfig.xml.
The PGEConfig.xml file performs the following functions:
- Describes how to run the PGE (i.e., what external programs to call and in which order)
- Defines custom metadata used within the execution of the CAS-PGE
- Describes how to build metadata files generated as a result of the execution of the CAS-PGE and what to do with these files
Below is the sample PGEConfig.xml file used within the fileconcatenator-pge project:
<?xml version="1.0" encoding="UTF-8"?>
<pgeConfig>

  <!-- How to run the PGE -->
  <exe dir="[JobDir]" shell="/bin/bash">
    <!-- cd to PGE root -->
    <cmd>cd [PGE_ROOT]/file_concatenator</cmd>
    <cmd>cp [InputFile1] [OutputFile]</cmd>
    <cmd>cat [InputFile2] >> [OutputFile]</cmd>
  </exe>

  <!-- Files to ingest -->
  <output>
    <!-- one or more of these -->
    <dir path="[JobDir]" createBeforeExe="false">
      <!-- one or more of these ** regExp or name can be used-->
      <files regExp=".*\.txt"
             metFileWriterClass="org.apache.oodt.pge.examples.fileconcatenator.writers.ConcactenatingFilenameExtractorWriter"
             args="[PGE_ROOT]/file_concatenator/extractors/concatenatingfilename.extractor.config.xml"/>
      <files regExp=".*\.txt"
             metFileWriterClass="org.apache.oodt.cas.pge.writers.metlist.MetadataListPcsMetFileWriter"
             args="[PGE_ROOT]/file_concatenator/extractors/metlistwriter/metout.xml"/>
    </dir>
  </output>

  <!-- Custom metadata to add to output files -->
  <customMetadata>
    <!-- helpful keys -->
    <metadata key="LessThan" val="&lt;"/>
    <metadata key="LessThanOrEqualTo" val="[LessThan]="/>
    <metadata key="GreaterThan" val=">"/>
    <metadata key="GreaterThanOrEqualTo" val="[GreaterThan]="/>
    <metadata key="Exclamation" val="!"/>
    <metadata key="Ampersand" val="&amp;"/>
    <metadata key="NotEqualTo" val="[Ampersand]="/>
    <metadata key="LogicalAnd" val="[Ampersand][Ampersand]"/>
    <metadata key="CshPipeToStdOutAndError" val="[GreaterThan][Ampersand][Exclamation]"/>

    <metadata key="ProductionDateTime" val="[DATE.UTC]"/>
    <metadata key="JobDir" val="[PGE_ROOT]/file_concatenator/output/jobs/job-[ProductionDateTime]"/>
    <metadata key="InputFile1" val="[PGE_ROOT]/file_concatenator/files/concatenatingInputFile1.txt"/>
    <metadata key="InputFile2" val="[PGE_ROOT]/file_concatenator/files/concatenatingInputFile2.txt"/>
    <metadata key="OutputFile" val="[JobDir]/concatenatedOutputFile-[ProductionDateTime].txt"/>
  </customMetadata>

</pgeConfig>
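The [Key] references in the configuration above are filled in from metadata values, and keys such as JobDir and OutputFile are themselves built from other keys. The following shell sketch imitates that expansion with sed to show what the two paths resolve to. It is only an illustration and not how CAS-PGE performs the replacement; the resolve function and the fixed timestamp are assumptions made for this example.

```shell
# Hypothetical stand-ins for values CAS-PGE would supply at run time.
pge_root=/usr/local/pge
production_date_time=2011-08-05T23:42:51.178Z
job_dir=

# resolve TEMPLATE: replace the three [Key] references used in this sketch.
resolve() {
  printf '%s\n' "$1" | sed \
    -e "s|\[PGE_ROOT\]|$pge_root|g" \
    -e "s|\[ProductionDateTime\]|$production_date_time|g" \
    -e "s|\[JobDir\]|$job_dir|g"
}

# JobDir is built from other keys, so it is resolved before OutputFile.
job_dir=$(resolve '[PGE_ROOT]/file_concatenator/output/jobs/job-[ProductionDateTime]')
output_file=$(resolve '[JobDir]/concatenatedOutputFile-[ProductionDateTime].txt')

echo "JobDir:     $job_dir"
echo "OutputFile: $output_file"
```

With these stand-in values, the job directory and the output file both carry the same production timestamp, which is why each run writes into its own directory.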
4. Build and deploy FileConcatenatorPGE
Deploy the fileconcatenator-pge JAR package
cd /usr/local/src/fileconcatenator-pge
mvn package
mv target/fileconcatenator-pge-*.jar $WORKFLOW_HOME/lib
Deploy fileconcatenator-pge resources
- PGEConfig.xml
cp /usr/local/src/fileconcatenator-pge/src/main/resources/config/PGEConfig.xml $PGE_ROOT/file_concatenator/pge-configs
- Sample files
cp /usr/local/src/fileconcatenator-pge/src/main/resources/files/concatenatingInputFile*.txt $PGE_ROOT/file_concatenator/files
- Extractor configuration file
cp /usr/local/src/fileconcatenator-pge/src/main/resources/extractors/concatenatingfilename.extractor.config.xml $PGE_ROOT/file_concatenator/extractors
- Met-list writer configuration file
cp /usr/local/src/fileconcatenator-pge/src/main/resources/extractors/metlistwriter/metout.xml $PGE_ROOT/file_concatenator/extractors/metlistwriter
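After copying, a quick check that every resource landed where PGEConfig.xml expects it can save a failed run later. The following is a small sketch of such a check; missing_resources is a hypothetical helper name, and the file list is taken from the copy commands and the configuration shown earlier.

```shell
# Print every expected resource that is missing under the given PGE root.
missing_resources() {
  root=$1/file_concatenator
  for f in \
    pge-configs/PGEConfig.xml \
    files/concatenatingInputFile1.txt \
    files/concatenatingInputFile2.txt \
    extractors/concatenatingfilename.extractor.config.xml \
    extractors/metlistwriter/metout.xml
  do
    [ -f "$root/$f" ] || echo "$f"
  done
}

# No output means all five files are present.
missing_resources "${PGE_ROOT:-/usr/local/pge}"
```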
5. Configure deployed CAS-Workflow for running FileConcatenatorPGE
- Navigate to your deployed CAS-Workflow’s policy directory
cd $WORKFLOW_HOME/policy
- Modify events.xml
Add the following entry to this file:
<event name="fileconcatenator-pge">
  <workflow id="urn:oodt:FileConcatenatorWorkflow"/>
</event>
- Create a new policy file titled: fileconcatenator-pge.workflow.xml.
Add the following entries to this file:
<cas:workflow xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas"
              name="FileConcatenatorWorkflow"
              id="urn:oodt:FileConcatenatorWorkflow">
  <tasks>
    <task id="urn:oodt:FileConcatenator"/>
  </tasks>
</cas:workflow>
- Modify tasks.xml
Add the following entries to this file:
<task id="urn:oodt:FileConcatenator" name="FileConcatenator"
      class="org.apache.oodt.pge.examples.fileconcatenator.FileConcatenatorPGETask">
  <conditions/>
  <configuration>
    <property name="PGETask_Name" value="FileConcatenator"/>
    <property name="PGETask_ConfigFilePath" value="[PGE_ROOT]/file_concatenator/pge-configs/PGEConfig.xml" envReplace="true"/>
    <property name="PGETask_DumpMetadata" value="true"/>
    <property name="PCS_WorkflowManagerUrl" value="[WORKFLOW_URL]" envReplace="true" />
    <property name="PCS_FileManagerUrl" value="[FILEMGR_URL]" envReplace="true"/>
    <property name="PCS_MetFileExtension" value="met"/>
    <property name="PCS_ClientTransferServiceFactory" value="org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory"/>
    <property name="PCS_ActionRepoFile" value="file:[CRAWLER_HOME]/policy/crawler-config.xml" envReplace="true"/>
  </configuration>
  <requiredMetFields>
    <metfield name="RunID"/>
  </requiredMetFields>
</task>
- Modify workflow-lifecycles.xml
Add the following entries to this file (if not already present):
<stage name="pge_setup_build_config_file">
  <status>BUILDING CONFIG FILE</status>
</stage>
<stage name="pge_staging_input">
  <status>STAGING INPUT</status>
</stage>
<stage name="pge_exec">
  <status>PGE EXEC</status>
</stage>
<stage name="pcs_crawl">
  <status>CRAWLING</status>
</stage>
- Modify workflow-instance-met.xml
Add the following entry to this file:
<workflow id="urn:oodt:FileConcatenatorWorkflow">
  <field name="RunID"/>
</workflow>
- Restart CAS-Workflow
cd $WORKFLOW_HOME/bin
./wmgr restart
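Because this step touches five policy files, it is easy to miss one. The following sketch does a coarse text search for one identifying string per file and reports the files where it is absent. The helper name unconfigured_policy_files is hypothetical, and a match only shows that the string appears somewhere in the file, not that the XML is valid or correctly placed.

```shell
# Print each policy file from this step that lacks its expected identifier.
# Argument: the policy directory.
unconfigured_policy_files() {
  dir=$1
  # Each line below pairs a file name with a string the steps above add to it.
  while read -r file needle; do
    grep -qF -- "$needle" "$dir/$file" 2>/dev/null || echo "$file"
  done <<'EOF'
events.xml fileconcatenator-pge
fileconcatenator-pge.workflow.xml urn:oodt:FileConcatenatorWorkflow
tasks.xml urn:oodt:FileConcatenator
workflow-lifecycles.xml pge_exec
workflow-instance-met.xml urn:oodt:FileConcatenatorWorkflow
EOF
}

# No output means every file mentions its expected string.
unconfigured_policy_files "${WORKFLOW_HOME:-.}/policy"
```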
6. Run File Concatenator PGE
- Navigate to CAS-Workflow home binary directory
cd $WORKFLOW_HOME/bin
- Invoke the File Concatenator PGE with the wmgr-client command-line tool
./wmgr-client --url http://localhost:9001 --operation --sendEvent --eventName fileconcatenator-pge --metaData --key RunID testNumber1
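When running the PGE repeatedly, it may be convenient to vary the RunID and take the URL from WORKFLOW_URL rather than typing it each time. The following sketch only prints the command it would run, using the same options as the invocation above; the function name wmgr_event_command and the value testNumber2 are hypothetical.

```shell
# Build the wmgr-client invocation for a given workflow URL and RunID.
# This prints the command instead of executing it (a dry run).
wmgr_event_command() {
  url=$1
  run_id=$2
  printf '%s\n' "./wmgr-client --url $url --operation --sendEvent --eventName fileconcatenator-pge --metaData --key RunID $run_id"
}

# Assumption: fall back to the URL used in this guide if WORKFLOW_URL is unset.
wmgr_event_command "${WORKFLOW_URL:-http://localhost:9001}" testNumber2
```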
7. Verify output of PGE execution
After invoking the wmgr-client script as directed above, you should see an entry like the following:
INFO: Successfully ingested product: [/usr/local/pge/file_concatenator/output/jobs/job-2011-08-05T23:42:51.178Z/concatenatedOutputFile-2011-08-05T23:42:51.178Z.txt]: product id: a2d6d5ff-bfbc-11e0-8531-dff90856f73a
Additionally, you should see the following two files in the generated job directory:
- Generated product file: $PGE_ROOT/file_concatenator/output/jobs/job-2011-08-05T23\:42\:51.178Z/concatenatedOutputFile-2011-08-05T23\:42\:51.178Z.txt
- Generated met file: $PGE_ROOT/file_concatenator/output/jobs/job-2011-08-05T23\:42\:51.178Z/concatenatedOutputFile-2011-08-05T23\:42\:51.178Z.txt.met
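The same check can be scripted across all job directories. The following sketch looks for product files under the jobs directory and reports whether each one has a met file beside it; check_job_outputs is a hypothetical helper, and the file name pattern is taken from the OutputFile key in PGEConfig.xml.

```shell
# For each product under the jobs directory, report whether its .met file
# exists alongside it. Argument: the jobs directory.
check_job_outputs() {
  jobs_dir=$1
  find "$jobs_dir" -type f -name 'concatenatedOutputFile-*.txt' | sort |
  while IFS= read -r product; do
    if [ -f "$product.met" ]; then
      echo "ok: $product"
    else
      echo "missing met file: $product"
    fi
  done
}

# Only run the check if the jobs directory from this guide exists.
jobs="${PGE_ROOT:-/usr/local/pge}/file_concatenator/output/jobs"
if [ -d "$jobs" ]; then
  check_job_outputs "$jobs"
fi
```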