The Big Picture

It's exceedingly frustrating in Apache OODT at the moment to understand the flow of metadata in pipeline processing. Take it from someone who was one of the core creators of the current CAS and nearly every piece of OODT code that is in pipeline processing production nowadays. I was recently trying to get OODT working on the DARPA XDATA project and was pulling my frickin' hair out when metadata wasn't correctly being passed from one CAS-PGE task to another, from the Crawler to the File Manager, from the Crawler events to the Workflow Manager, etc.

The answer was simple. OODT was doing exactly what I told it to do. I was just telling it the wrong things (besides one known bug that I did uncover, which has been the bane of my existence for the last 3 years). Wait, it's doing what I told it to do? It's doing the right thing? Huh? How can this be? I'll explain.

The Flow of Metadata: the Basics

OODT uses CAS metadata objects to represent just about everything in its system. CAS-Metadata is one of the core constructs I developed to represent arbitrary information flow in the system. The CAS-Metadata object is a key->multi-valued container of information. CAS and OODT used to have a Metadata object, but it wasn't key->multi-valued. Yes, that really matters. The multiple values allow us to do metadata value transformations not otherwise possible, to represent more information, and to handle, e.g., NoSQL use cases before the term NoSQL even existed.
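If you've never touched it, the container itself is tiny. Here's a quick sketch of the key->multi-valued API (the key and file names are invented for illustration):

    import java.util.List;
    import org.apache.oodt.cas.metadata.Metadata;

    public class MultiValuedDemo {
      public static void main(String[] args) {
        Metadata met = new Metadata();
        met.addMetadata("InputFiles", "granule-001.dat");
        met.addMetadata("InputFiles", "granule-002.dat"); // same key, second value
        String first = met.getMetadata("InputFiles");        // first value only
        List<String> all = met.getAllMetadata("InputFiles"); // both values
        System.out.println(first + " / " + all);
      }
    }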

The File Manager uses Metadata to represent information about a file, like FileSize, Filename, FileLocation, and ProductType, for example. The Workflow Manager uses Metadata to represent the dynamic WorkflowContext information shared between WorkflowTasks, e.g., WorkflowManagerUrl, WorkflowInstanceId, JobId, etc. The Resource Manager even uses Metadata to represent information about Jobs, such as ResourceNode, StartTime, etc. Since we're focused on pipeline processing, we'll begin by focusing on the Workflow Metadata.

Workflows have Tasks, and Tasks are the basic unit of computation in the Workflow Manager. Tasks are handed both a static and a dynamic set of information when they execute. The static information can be thought of as the environment variables of a Linux program: it is set for configuration purposes and not changed frequently during large batch runs of Tasks. The dynamic information can be thought of as the command line arguments of a Linux program: it is constantly changing per run, being updated, tweaked, etc., during large batch runs of Tasks.
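To make the split concrete, here's roughly what it looks like from inside a task. This is a minimal sketch against the WorkflowTaskInstance interface; the class, property, and key names are made up for illustration:

    import org.apache.oodt.cas.metadata.Metadata;
    import org.apache.oodt.cas.workflow.structs.WorkflowTaskConfiguration;
    import org.apache.oodt.cas.workflow.structs.WorkflowTaskInstance;
    import org.apache.oodt.cas.workflow.structs.exceptions.WorkflowTaskInstanceException;

    public class MyTask implements WorkflowTaskInstance {
      @Override
      public void run(Metadata dynamicMet, WorkflowTaskConfiguration staticConfig)
          throws WorkflowTaskInstanceException {
        // Static config: the "environment variables", set once in your task config.
        String outDir = staticConfig.getProperty("OutputDir");
        // Dynamic metadata: the "command line arguments", changing per run;
        // anything you write here flows to downstream Tasks in the Workflow.
        String filename = dynamicMet.getMetadata("Filename");
        dynamicMet.replaceMetadata("LastTaskRun", "MyTask");
      }
    }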

CAS-PGE is a specialized WorkflowTask that makes it easy to set up and execute an algorithm and then capture its output provenance (files, metadata, whatnot). CAS-PGE takes the Workflow Metadata (both static configuration and dynamic context metadata) and adds a third wrinkle: CAS-PGE metadata, or custom metadata. This is information that is local only to the CAS-PGE execution, and that isn't written back to the Workflow context or static task config (unless otherwise asked). You can use it to define keys that you eventually want to flow into file metadata; to pass along a signal to a downstream task; to compute a new derived metadata field; to use in a SQL query for files/products; etc.
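To make the three scopes concrete, here's a purely illustrative sketch using plain Metadata objects. CAS-PGE does this bookkeeping for you internally; the key names and the explicit copy at the end are my own, just to show that custom metadata stays local unless you opt in:

    import org.apache.oodt.cas.metadata.Metadata;

    public class PgeScopesSketch {
      public static void main(String[] args) {
        Metadata staticConfig = new Metadata();   // task config ("env vars")
        Metadata dynamicContext = new Metadata(); // shared workflow context ("args")
        Metadata customMet = new Metadata();      // CAS-PGE local/custom metadata

        // Derive a field locally; only this CAS-PGE execution sees it.
        customMet.addMetadata("GranuleCount", "12");

        // Nothing writes back to the workflow context unless you ask:
        dynamicContext.addMetadata("GranuleCount",
            customMet.getMetadata("GranuleCount"));
      }
    }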

Pipeline Metadata: Why So Hard?

It's so hard to understand where and why Metadata flows through OODT b/c there are a ton of places it flows through. Take the diagram at the top of this page and read it left to right. Let's imagine you are doing some pipeline processing, and you start out by doing a metadata extraction (e.g., via CAS-Crawler, its MetExtractorProductCrawler, and your own custom MetadataExtractor). That's what you see in the first bubble. In the next bubble (upper left), you may have CAS-Crawler notify the Workflow Manager that you ingested a file and its Metadata, in order to kick off an Event, like a Workflow. You can use the trusty TriggerPostIngestWorkflow action (formerly called UpdateWorkflowStatusToIngest, and it will be called that again when this issue is committed) to automatically send, on postIngestSuccess, a notification message to the Workflow Manager that includes your product's extracted Metadata as the initial dynamic Workflow Metadata. This is really powerful. All your Filename, FileLocation, ProductType, etc., can automatically be seeded into the Workflow context for use in firing off tasks, looking up files, etc. That's what you see happening in stage 2.
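Under the hood, that notification amounts to sending an event to the Workflow Manager with the product's Metadata attached. Here's a hand-rolled sketch of the same idea using the XML-RPC workflow client; the URL, event name, and key values are placeholders (the crawler action does all of this for you):

    import java.net.URL;
    import org.apache.oodt.cas.metadata.Metadata;
    import org.apache.oodt.cas.workflow.system.XmlRpcWorkflowManagerClient;

    public class NotifyIngest {
      public static void main(String[] args) throws Exception {
        XmlRpcWorkflowManagerClient wm =
            new XmlRpcWorkflowManagerClient(new URL("http://localhost:9001"));
        Metadata productMet = new Metadata();
        productMet.addMetadata("Filename", "granule-001.dat");
        productMet.addMetadata("ProductType", "GenericFile");
        // The extracted product Metadata becomes the initial dynamic
        // Workflow Metadata for whatever the event kicks off.
        wm.sendEvent("startPipelineEvent", productMet);
      }
    }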

One thing that's tricky about this, though: what if your entire workflow is driven by CAS-Crawler postIngestSuccess actions? This can be a nice way to trigger ad-hoc Workflows that branch/bound, form dynamic graphs, and aren't sequential-only, at least until OODT-491 is finished. In doing so, the problem you will run into quickly, especially if you are using e.g. MetadataWriters in CAS-PGE (and the only way you are doing that at this point is if you are using the 0.3 CAS-PGE, but that's an entirely different blog post/wiki page, and I am working on fixing it in OODT-667), is that certain commonly used keys like ProductionDateTime, JobId, InputFiles, etc., start to appear in duplicate. That is, you have 2x, 3x the same values in there for them. Why? The short answers are in OODT-725 and in OODT-728.

The Longer Answer: Understanding Pipeline Met Extraction and CAS-PGE Crawler config

The issue lies in what most people using OODT pipelines typically do. Say you are using CAS-PGE and you would like to pull metadata out of the workflow context and save it to your met file, so that it can get cataloged and archived by the File Manager (as shown in the bottom right bubble in the above Figure). So, you set up a PcsMetListFileWriter, add JobId, ProductionDateTime, and InputFiles to it, and have those fields sucked out of the Workflow Metadata dynamic context (which also happily scours the CAS-PGE CustomMetadata for them). Those fields are then captured for your upstream product. They look right the first time. Consider, though, that you have another CAS-PGE task that you need to run after this one. So, you want to fire it off using the TriggerPostIngestWorkflow action. You set that up and have it fired off after your CAS-PGE ingests are successful. That task completes. You have the same PcsMetListFileWriter config, try to suck out and catalog JobId, etc., and then record that information as metadata in the File Manager. Except this time, there are multiple values for JobId and the rest. Two, to be exact, and the same value twice. Why?

It has to do with the way that CAS-PGE sets up its Crawler. It creates a StdProductCrawler (as of 0.3, and will again in 0.7, despite currently generating an AutoDetectCrawler). It then configures it, and there is one magic piece of configuration it applies at the end: it sets the Crawler's Global Metadata field to the value of your Workflow Dynamic Metadata Context. What does that mean? It means that all the metadata from your current workflow's context is going to be passed along as "global" or "default" metadata that goes along with your Products when you ingest them. The best part? Metadata that you extract (say, from your PcsMetListFileWriter) doesn't take precedence over this Global metadata. It is actually appended after it. Yikes.
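You can see the append-instead-of-override behavior with nothing more than the Metadata API itself. A tiny demo (values invented):

    import org.apache.oodt.cas.metadata.Metadata;

    public class DuplicateKeyDemo {
      public static void main(String[] args) {
        Metadata productMet = new Metadata();
        // 1) The Crawler's "global"/default metadata, seeded from the
        //    workflow dynamic context, goes in first:
        productMet.addMetadata("JobId", "42");
        // 2) The metadata extracted from your met file (e.g., via
        //    PcsMetListFileWriter) is appended after it, not merged:
        productMet.addMetadata("JobId", "42");
        System.out.println(productMet.getAllMetadata("JobId")); // [42, 42]
      }
    }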

Some Workarounds

I'm working on a fix for the above in OODT-728 that will change the Crawler's behavior to give the extracted and/or provided Metadata precedence over the Global metadata, but until that happens, there are some workarounds.

  1. I created a reusable Workflow Task called FilterTask (a rough sketch of the idea follows this list), which you can use to:
    1. Rename metadata keys from upstream contexts/Tasks. You can use this to make sure that FileLocation, Filename, etc., aren't duplicated and don't clobber your existing tasks' values.
    2. Remove metadata keys from the current contexts/Tasks. You can thus remove offending met keys before they end up, in duplicate, in your File Manager or CAS-PGE Metadata.
  2. You can stop recording the metadata explicitly in your CAS-PGE writers and extractors. The metadata will get passed along anyway via the Crawler global Metadata > Dynamic Workflow Metadata bridge in CAS-PGE, and this ensures it is only passed along 1x instead of 1x, 2x, 3x with each successive call.
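For flavor, here's a rough sketch of the filter idea as a WorkflowTaskInstance. This is not the actual FilterTask source, just an illustration of the rename/remove moves; the key names are examples:

    import org.apache.oodt.cas.metadata.Metadata;
    import org.apache.oodt.cas.workflow.structs.WorkflowTaskConfiguration;
    import org.apache.oodt.cas.workflow.structs.WorkflowTaskInstance;
    import org.apache.oodt.cas.workflow.structs.exceptions.WorkflowTaskInstanceException;

    public class MetFilterSketch implements WorkflowTaskInstance {
      @Override
      public void run(Metadata dynamicMet, WorkflowTaskConfiguration config)
          throws WorkflowTaskInstanceException {
        // Rename: stash upstream values under a new key, then drop the old
        // one so it can't clobber this task's own Filename.
        if (dynamicMet.containsKey("Filename")) {
          for (String val : dynamicMet.getAllMetadata("Filename")) {
            dynamicMet.addMetadata("UpstreamFilename", val);
          }
          dynamicMet.removeMetadata("Filename");
        }
        // Remove: strip offending keys outright before they duplicate.
        dynamicMet.removeMetadata("JobId");
        dynamicMet.removeMetadata("ProductionDateTime");
      }
    }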

Conclusion

Study the above text and diagram. It may save you the two days of my own time I wasted figuring all of this stuff out again, and some of it for the first time. Updates and patches to this documentation are welcome!
