CAS-PGE Metadata: The Basics

I'm writing this guide to share what I've been learning and remembering about CAS-PGE metadata. CAS-PGE metadata, or "pgeMetadata", is a combined metadata object that brings together static task configuration, dynamic metadata context information, and local, custom PGE metadata in one unified object. It was invented by Brian Foster and is quite an advanced object. You can query pgeMetadata for only the dynamic or only the static information, combine metadata recursively using PathUtils replacement (likely the subject of another guide!), and replace and filter metadata depending on its scope. CAS-PGE enforces a particular metadata scope that can be somewhat confusing, so I'll try to explain.

Order of Precedence

CAS-PGE has an order of precedence based on the two methods of access that it provides to its Metadata: either the whole metadata object, combining all three constituents (static, dynamic and custom), or the values for a particular key present in the object, e.g., give me the dynamic metadata's values for FileLocation. This precedence order is documented below:

Combination Order refers to the way that CAS-PGE combines the metadata when a user or API caller requests the entire schmoz, and CAS-PGE returns a single combined Metadata object representing it. Since CAS-PGE stores several Metadata objects, it must combine them in an intelligent way, enforcing a particular scope and order of precedence. For example, if the dynamic metadata defines a metadata field Filename set to foo.tsv, the local pgeMetadata contains a Filename set to foo.json, and the static metadata (aka Workflow task config) contains a Filename set to foo2.json, what should it provide back to the user asking for a combined view?

You might think that CAS-PGE should provide a single Metadata object with key Filename and values foo.tsv, foo.json, and foo2.json. The problem with this is that multiple values for a single Metadata key can confuse OODT downstream. For example, in File versioning, which value should we use for a filePathSpec of "/[FileLocation]/[Filename]" if there are 3 values for Filename present in the Metadata? What value should CAS-PGE use if someone does a SQL(FORMAT='$FileLocation/$Filename',SORT=ASC){SELECT Filename FROM GenericFile} query? You can see where I'm going. So, CAS-PGE does its best to use the order of priority and importance of the metadata to come up with a single value per Metadata key to give you back when you request a combined version. Reading the above layer diagram from bottom to top, the lowest priority is given to custom PGE metadata. Its values are the initial ones that come in; for any overlapping keys they are replaced immediately by the values from dynamic metadata, which are in turn replaced by the static task configuration values. So, the result would be a combined Metadata object with key Filename and value foo2.json.
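To make the combination order concrete, here is a minimal sketch that models the three metadata layers as plain Python dictionaries. This is not the CAS-PGE API: the function name `combine` and the FileLocation value `/data/in` are assumptions made up for illustration; only the Filename values come from the example above.

```python
def combine(custom, dynamic, static):
    """Merge three key -> value dicts into one.

    Layers are applied from lowest to highest precedence, so a key in a
    later layer overwrites the same key from an earlier layer.
    """
    combined = {}
    for layer in (custom, dynamic, static):  # custom < dynamic < static
        combined.update(layer)
    return combined


# Filename values mirror the example in the text; FileLocation is hypothetical.
custom = {"Filename": "foo.json"}
dynamic = {"Filename": "foo.tsv", "FileLocation": "/data/in"}
static = {"Filename": "foo2.json"}

combined = combine(custom, dynamic, static)
print(combined["Filename"])      # foo2.json
print(combined["FileLocation"])  # /data/in
```

Each key ends up with exactly one value, which is the point: the static value wins for Filename, while FileLocation, defined only in the dynamic layer, passes through unchanged.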

And this sort of makes sense, right? If you are putting a key into static task config, you don't expect it to change often and would like it to take precedence. Dynamic metadata is runtime information that shouldn't necessarily overlap with static information, but should definitely take precedence over CAS-PGE local-only metadata (e.g., metadata used to compute queries, derive met fields for CAS-PGE, and store temporary computation results).

Query Order defines the order in which metadata values are selected from the aggregate pgeMetadata object, and it mirrors the precedence used in the combination order. Values are first looked up in static task metadata; if the key is not there, CAS-PGE heads to dynamic workflow met, and if it is not there either, it heads to CAS-PGE local metadata.
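The query order can be sketched the same way, again with plain dictionaries standing in for the three metadata layers. The function name `lookup` and the sample keys and values are hypothetical and are not part of CAS-PGE.

```python
def lookup(key, static, dynamic, local):
    """Return the value for key from the first layer that defines it.

    Layers are searched in query order: static task metadata, then
    dynamic workflow metadata, then CAS-PGE local metadata.
    """
    for layer in (static, dynamic, local):
        if key in layer:
            return layer[key]
    return None  # key is not defined in any layer


# Hypothetical sample layers.
static_met = {"ProductType": "GenericFile"}
dynamic_met = {"ProductType": "OtherType", "Filename": "foo.tsv"}
local_met = {"Filename": "foo.json", "TempResult": "42"}

print(lookup("ProductType", static_met, dynamic_met, local_met))  # GenericFile
print(lookup("Filename", static_met, dynamic_met, local_met))     # foo.tsv
print(lookup("TempResult", static_met, dynamic_met, local_met))   # 42
```

Looking up a single key this way gives the same answer as combining all three layers first and then reading the key, which is why the two orders line up.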

Implications on PcsMetListFileWriter and CAS-Crawler During Pipeline Processing

The above has some practical implications. Here are a few helpful hints that I ran into over the past few days that may save you some time.

If you are using PcsMetListFileWriter, CAS-Crawler and CAS-PGE:

  1. If you have conflicting keys in your dynamic metadata (e.g., your workflow was kicked off by a TriggerPostIngestWorkflow):
    1. Key overrides only take effect when defined in static task metadata. So, if you have ProductType in your dynamic metadata and you want to override it, define it in the Task configuration for that particular CAS-PGE task.
    2. Note that if you have a more dynamic property like Filename that doesn't make sense to define in static task config, the PcsMetListFileWriter gives you a few auto-defined properties that are valid for the created output files that it captures with its Regex:
      1. Filename
      2. FileLocation
      3. FileSize
    3. You can reference the above keys in your CAS-PGE local metadata and expect them to flow through to your product cataloging. This works since OODT-728, which forces the metadata extracted per product to take precedence over the dynamic workflow metadata passed in by CAS-PGE. So if you emit Filename, FileLocation and FileSize in your metout.xml file with PcsMetListFileWriter, the PcsMetListFileWriter defines these keys for you based on the file it is currently extracting and flows them through into your met file. The cataloged product met values then take precedence over the dynamic workflow metadata, which would normally override values defined in CAS-PGE metadata.
    4. Note that if a prior workflow task kicked off this task via TriggerPostIngestWorkflow, you will want to override ProductName, which otherwise carries the previous product's value and causes funkiness in OPSUI and downstream. The best way to avoid this is to define a ProductName met field in your metout.xml and set its value to [Filename], which resolves to the correct dynamically generated filename for the file that PcsMetListFileWriter is currently extracting from.
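The hints above can be illustrated with a small, self-contained model. It is only a sketch of the behavior described in this guide, not the PcsMetListFileWriter or CAS-Crawler implementation: the function names `per_file_keys`, `expand` and `product_met`, the file name `out_001.json`, and the sample workflow values are all hypothetical, and the `[Key]` substitution is a simplified stand-in for real PathUtils replacement.

```python
import os
import re
import tempfile


def per_file_keys(path):
    """Model the three per-file keys the text lists for each output file."""
    return {
        "Filename": os.path.basename(path),
        "FileLocation": os.path.dirname(path),
        "FileSize": str(os.path.getsize(path)),
    }


def expand(template, met):
    """Replace each [Key] with met[Key]; unknown keys are left untouched."""
    return re.sub(r"\[(\w+)\]", lambda m: met.get(m.group(1), m.group(0)), template)


def product_met(metout_fields, file_keys, workflow_met):
    """Build the metadata for one product.

    Fields declared for the met file are expanded against the per-file keys,
    then laid over the workflow metadata so the extracted values win, which
    is the precedence the text attributes to OODT-728.
    """
    extracted = {key: expand(value, file_keys) for key, value in metout_fields.items()}
    merged = dict(workflow_met)
    merged.update(extracted)
    return merged


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out_001.json")  # hypothetical output file
    with open(path, "w") as fh:
        fh.write("{}")
    keys = per_file_keys(path)
    met = product_met(
        {"ProductName": "[Filename]", "Filename": "[Filename]"},
        keys,
        # Stale values carried over from the task that triggered this one.
        {"ProductName": "previous_product.tsv", "ProductType": "GenericFile"},
    )

print(met["ProductName"])  # out_001.json
print(met["ProductType"])  # GenericFile
```

The stale ProductName from the triggering workflow is replaced by the current file's name, while ProductType, which the met file does not redefine, is carried over from the workflow metadata.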

 
