[Diagram: how Metadata flows through the OODT pipeline components, left to right]

The Big Picture

...

Workflows have Tasks, and Tasks are the basic unit of computation in the Workflow Manager. Each Task receives two sets of information when it executes: static and dynamic. The static information is analogous to the environment variables of a Linux program: it is set for configuration purposes and is not changed frequently during large batch runs of Tasks. The dynamic information is analogous to the command-line arguments of a Linux program: it changes constantly per run, being updated and tweaked during large batch runs of Tasks.
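As a rough sketch of that analogy (this is an illustrative model, not the actual OODT WorkflowTask API; the function and key names are made up), static config plays the role of environment variables and dynamic metadata plays the role of command-line arguments:

```python
# Illustrative model only -- not the real OODT WorkflowTask interface.
# Static config: set once, stable across a batch run (like env vars).
# Dynamic metadata: changes per run (like command-line arguments).

def run_task(static_config: dict, dynamic_metadata: dict) -> dict:
    """Combine both inputs; per-run dynamic values override static defaults."""
    merged = dict(static_config)
    merged.update(dynamic_metadata)
    return merged

# Hypothetical keys, for illustration only.
static_config = {"OutputDir": "/data/archive", "NumThreads": "4"}
dynamic_metadata = {"InputFiles": "granule_001.dat", "JobId": "42"}

result = run_task(static_config, dynamic_metadata)
```

The point of the model: the static side stays fixed across the batch, while the dynamic side is different on every invocation.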

CAS-PGE is a specialized WorkflowTask that makes it very easy to set up and execute an algorithm, and then capture its output provenance (files, metadata, and so on). CAS-PGE takes the Workflow Metadata, including static configuration and dynamic context metadata, and adds a third wrinkle: CAS-PGE metadata, or custom metadata. This information is local to the CAS-PGE execution only, and it is not written back to the Workflow context or the static task config (unless you explicitly ask). You can use it to define keys that you eventually want to flow into file metadata, to pass a signal along to a downstream task, to compute a new derived metadata field, to use in a SQL query for files/products, and so on.
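A small sketch of those three metadata scopes (an illustrative model in Python; the class, methods, and keys here are hypothetical, not the real CAS-PGE API): custom metadata is visible while the PGE runs, but the workflow context itself is left untouched.

```python
# Illustrative model of CAS-PGE metadata scoping -- not real OODT code.
class PgeRun:
    def __init__(self, static_config: dict, workflow_context: dict):
        self.static = static_config       # task config (rarely changes)
        self.context = workflow_context   # dynamic workflow metadata
        self.custom = {}                  # local to this PGE execution only

    def set_custom(self, key: str, value: str) -> None:
        # Custom metadata is recorded locally, not written back to the context.
        self.custom[key] = value

    def visible_metadata(self) -> dict:
        # What the running PGE sees: static + context + custom.
        return {**self.static, **self.context, **self.custom}

run = PgeRun({"Mode": "batch"}, {"JobId": "42"})
run.set_custom("DerivedField", "JobId-42-batch")

seen = run.visible_metadata()
context_after = run.context  # unchanged by the custom key
```

The model shows the one-way visibility: the PGE can read everything, but `DerivedField` never leaks back into the workflow context unless explicitly propagated.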

Pipeline Metadata: Why So Hard?

It's hard to understand where and why Metadata flows through OODT because there are so many places it passes through. Take the Diagram at the top of this page and read it left to right. Imagine you are doing some pipeline processing, and you start out with a metadata extraction (e.g., via CAS-Crawler, its MetExtractorProductCrawler, and your own custom MetadataExtractor). That's what you see in the first bubble. In the next bubble (upper left), you may have CAS-Crawler notify the Workflow Manager that you ingested a file and its Metadata, in order to kick off an Event, like a Workflow. You can use the trusty TriggerPostIngestWorkflow action (formerly called UpdateWorkflowStatusToIngest, and when this issue is committed it will be called that again) to automatically trigger, on postIngestSuccess, a notification message to the Workflow Manager that includes your product's extracted Metadata as the initial dynamic Workflow Metadata. This is really powerful. All your Filename, FileLocation, ProductType, etc., can automatically be seeded into the Workflow context for use in firing off tasks, looking up files, and so on. That's what you see happening in stage 2.
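Conceptually, the handoff in stage 2 amounts to the following (a simplified model, not the actual TriggerPostIngestWorkflow implementation): the product's extracted metadata becomes the initial dynamic Workflow Metadata.

```python
# Simplified model of the crawler -> Workflow Manager handoff.
def trigger_post_ingest_workflow(product_metadata: dict) -> dict:
    """The extracted product metadata seeds the workflow's dynamic context."""
    workflow_context = dict(product_metadata)
    return workflow_context

# Keys extracted at ingest time (illustrative values).
extracted = {
    "Filename": "granule_001.dat",
    "FileLocation": "/data/staging",
    "ProductType": "GenericFile",
}
context = trigger_post_ingest_workflow(extracted)
```

Every downstream task in the triggered workflow now starts with `Filename`, `FileLocation`, and `ProductType` already in its dynamic context, which is exactly what makes this handoff so convenient (and, as described below, what later causes duplication).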

One thing that's tricky about this, though: what if your entire workflow is driven by CAS-Crawler postIngestSuccess actions? This can be a nice way to trigger ad-hoc Workflows that branch and bound, form dynamic graphs, and are not sequential-only, at least until OODT-491 is finished. In doing so, the problem you will quickly run into, especially if you are using, e.g., MetadataWriters in CAS-PGE (which, at this point, you are only doing if you are using 0.3 CAS-PGE, but that's an entirely different blog post/wiki page, and I am working on fixing it in OODT-667), is that certain commonly used keys like ProductionDateTime, JobId, InputFiles, etc., start to appear in duplicate. That is, you have 2x, 3x the same values in there for them. Why? The short answers are in OODT-725 and OODT-728.

The Longer Answer: Understanding Pipeline Met Extraction and CAS-PGE Crawler config

The issue lies in what most people using OODT pipelines typically do. Suppose you are using CAS-PGE and you would like to pull metadata out of the workflow context and save it to your met file so that it can get cataloged and archived by the File Manager (as shown in the bottom right bubble in the above Figure). So, you set up a PcsMetListFileWriter, add JobId, ProductionDateTime, and InputFiles to it, and have those fields pulled out of the dynamic Workflow Metadata context (which also happily scours the CAS-PGE custom metadata for them). Those fields are then captured for your upstream product, and they look right the first time. Now consider that you have another CAS-PGE task to run after this one, so you want to fire it off using the TriggerPostIngestWorkflow action. You set that up and have it fired after your CAS-PGE ingests succeed. That task completes. You have the same PcsMetListFileWriter config, extract JobId and the rest again, and record that information as metadata in the File Manager. Except this time, there are multiple values for JobId and the others. Two, to be exact, and the same value, twice. Why?

It has to do with the way CAS-PGE sets up its Crawler. It creates a StdProductCrawler (as of 0.3, and it will again in 0.7, despite currently generating an AutoDetectCrawler). It then configures it, and there is one magic bit of configuration it does at the end: it sets the Crawler's Global Metadata field to the value of your dynamic Workflow Metadata context. What does that mean? It means that all the metadata from your current workflow's context is passed along as "global" or "default" metadata that travels with your Products when you ingest them. The best part? Metadata that you extract (say, from your PcsMetListFileWriter) doesn't take precedence over this Global metadata. It is actually appended after it. Yikes.
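To make the failure mode concrete, here is a small model of that merge (illustrative Python, not the real crawler code; OODT metadata keys are multi-valued, which is what makes the append visible): global metadata from the workflow context goes in first, and extracted metadata is appended after it rather than replacing it.

```python
# Illustrative model of multi-valued metadata merging in the crawler.
from collections import defaultdict

def ingest(global_met: dict, extracted_met: dict) -> dict:
    """Model of the merge: global values first, extracted values appended after."""
    catalog = defaultdict(list)
    for key, value in global_met.items():
        catalog[key].append(value)
    for key, value in extracted_met.items():
        catalog[key].append(value)  # appended, NOT replacing -- the problem
    return dict(catalog)

# The workflow context (seeded from the upstream ingest) already holds JobId...
global_met = {"JobId": "42", "ProductionDateTime": "2013-01-01T00:00:00Z"}
# ...and the met writer extracts the very same keys from that same context.
extracted_met = {"JobId": "42", "ProductionDateTime": "2013-01-01T00:00:00Z"}

catalog = ingest(global_met, extracted_met)
# catalog["JobId"] now holds the same value twice.
```

Each successive ingest-and-trigger hop repeats the same append, which is why the values show up 2x, then 3x, down the pipeline.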

Some workarounds

I'm working on a fix for the above in OODT-728 that will change the Crawler's behavior so that extracted and/or provided Metadata takes precedence over the Global metadata, but until that happens, there are some workarounds.

  1. I created a reusable Workflow Task called FilterTask, which you can use to
    1. Rename metadata keys from upstream contexts/Tasks. You can use this to make sure that FileLocation, Filename, etc., aren't duplicated and don't clobber your existing tasks.
    2. Remove metadata keys from the current contexts/Tasks. You can thus remove offending met keys before they end up duplicated in your Metadata in the File Manager or CAS-PGE.
  2. You can stop recording the metadata explicitly in your CAS-PGE writers and extractors. The metadata will get passed along anyway via the Crawler Global Metadata > Dynamic Workflow Metadata bridge in CAS-PGE, and this ensures that it is passed along once, instead of twice or three times with each successive call.
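The two FilterTask operations in workaround 1 can be sketched like this (an illustrative model only; the real FilterTask is a Workflow Task configured through task metadata, not called as Python functions):

```python
# Illustrative model of FilterTask-style rename/remove on workflow metadata.
def rename_keys(context: dict, renames: dict) -> dict:
    """Rename upstream keys so they don't clobber downstream tasks."""
    return {renames.get(key, key): value for key, value in context.items()}

def remove_keys(context: dict, blocked: set) -> dict:
    """Drop offending keys before they reach the File Manager in duplicate."""
    return {key: value for key, value in context.items() if key not in blocked}

# Hypothetical context seeded by an upstream ingest.
context = {"Filename": "granule_001.dat", "JobId": "42", "Debug": "true"}

renamed = rename_keys(context, {"Filename": "UpstreamFilename"})
cleaned = remove_keys(renamed, {"Debug"})
```

Renaming preserves the upstream value under a new, non-colliding key; removing simply drops it from the context before the next CAS-PGE task or File Manager ingest sees it.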

Conclusion

Study the above text and diagram. It may save you the two days of my own time that I wasted figuring all of this out again (and some of it for the first time). Updates and patches to this documentation are welcome!