
The Big Picture

It's exceedingly frustrating at the moment to understand the flow of metadata in Apache OODT pipeline processing. Take it from someone who was one of the core creators of the current CAS and of nearly every piece of OODT code that is in pipeline processing production nowadays. I was recently trying to get OODT working on the DARPA XDATA project and was pulling my frickin' hair out when metadata wasn't correctly being passed from one CAS-PGE task to another, from the Crawler to the File Manager, from Crawler events to the Workflow Manager, etc.

The answer was simple. OODT was doing exactly what I told it to do. I was just telling it the wrong things (besides one known bug that I did uncover, which has been the bane of my existence for the last 3 years). What? It's doing what I told it to do? It's doing the right thing? Huh? How can this be? I'll explain.

The Flow of Metadata: the Basics

OODT uses CAS metadata objects to represent just about everything in its system. CAS-Metadata is one of the core constructs I developed to represent arbitrary information flow in the system. The CAS-Metadata object is a key->multi-valued container of information. CAS and OODT used to have a Metadata object, but it wasn't key->multi-valued. Yes, that really matters. The multiple values allow us to do metadata value transformations not otherwise possible, to represent more information, and to handle, e.g., NoSQL use cases before the term NoSQL even existed.
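To make the key->multi-valued behavior concrete, here is a minimal sketch against the org.apache.oodt.cas.metadata.Metadata class; the key names and values are invented for illustration:

import java.util.List;

import org.apache.oodt.cas.metadata.Metadata;

public class MetadataBasics {
  public static void main(String[] args) {
    Metadata met = new Metadata();

    // A key can hold one value...
    met.addMetadata("ProductType", "GenericFile");

    // ...or several values under the same key.
    met.addMetadata("InputFiles", "file1.dat");
    met.addMetadata("InputFiles", "file2.dat");

    // getMetadata returns the first value for a key.
    System.out.println(met.getMetadata("ProductType")); // GenericFile

    // getAllMetadata returns every value for a key.
    List<String> inputs = met.getAllMetadata("InputFiles");
    System.out.println(inputs); // [file1.dat, file2.dat]

    // replaceMetadata swaps out all existing values for a key.
    met.replaceMetadata("ProductType", "LocalFile");
  }
}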

The File Manager uses Metadata to represent information about a file, like FileSize, Filename, FileLocation, and ProductType, for example. The Workflow Manager uses Metadata to represent the dynamic WorkflowContext information shared between WorkflowTasks, e.g., WorkflowManagerUrl, WorkflowInstanceId, JobId, etc. The Resource Manager even uses Metadata, to represent information about Jobs, such as ResourceNode, StartTime, etc. Since we're focused on pipeline processing, we'll focus in on the Workflow Metadata to begin.
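As a hedged illustration, the same Metadata container carries each component's keys side by side; the key names below are the ones just mentioned, while the values are made up:

import org.apache.oodt.cas.metadata.Metadata;

public class ComponentMetadata {
  public static void main(String[] args) {
    Metadata met = new Metadata();

    // File Manager-style per-file metadata.
    met.addMetadata("Filename", "granule-001.h5");
    met.addMetadata("FileLocation", "/data/archive/granules");
    met.addMetadata("FileSize", "1048576");
    met.addMetadata("ProductType", "GenericFile");

    // Workflow Manager-style dynamic WorkflowContext metadata.
    met.addMetadata("WorkflowManagerUrl", "http://localhost:9001");
    met.addMetadata("WorkflowInstanceId", "0f7a2f60-example");

    // Downstream consumers read it all through the same interface.
    System.out.println(met.getMetadata("Filename") + " is a "
        + met.getMetadata("ProductType"));
  }
}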

Workflows have Tasks, and Tasks are the basic unit of computation in the Workflow Manager. Tasks have a static and a dynamic set of information passed to them when they execute. The static information can be thought of as, e.g., the environment variables of a Linux program: it is set for configuration purposes, and not changed frequently during large batch runs of Tasks. The dynamic information can be thought of as, e.g., the command line arguments of a Linux program: it is constantly changing per run, being updated, tweaked, etc., during large batch runs of Tasks. As the sketch below shows, both sets arrive in a Task's run method.
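Here is a minimal Task sketch, assuming the org.apache.oodt.cas.workflow.structs task API; the property and key names are invented for illustration:

import org.apache.oodt.cas.metadata.Metadata;
import org.apache.oodt.cas.workflow.structs.WorkflowTaskConfiguration;
import org.apache.oodt.cas.workflow.structs.WorkflowTaskInstance;
import org.apache.oodt.cas.workflow.structs.exceptions.WorkflowTaskInstanceException;

public class HelloTask implements WorkflowTaskInstance {

  @Override
  public void run(Metadata metadata, WorkflowTaskConfiguration config)
      throws WorkflowTaskInstanceException {
    // Static information: set in the task configuration, rarely changed,
    // like environment variables.
    String outDir = config.getProperty("OutputDir");

    // Dynamic information: the shared WorkflowContext Metadata, changing
    // per run, like command line arguments.
    String instanceId = metadata.getMetadata("WorkflowInstanceId");

    // Anything added here flows downstream to the next Task.
    metadata.addMetadata("HelloTaskRan", "true");

    System.out.println("Instance " + instanceId + " writing to " + outDir);
  }
}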

 
