Contributors (Alphabetical): Vandana Ayyalasomayajula, Francis Liu, Olga Natkovich, Andreas Neumann

Problem:

A business models its data as a dimension store. In this data model, there are two types of tables: fact and dimension. A fact table contains atomic information about an event, while a dimension table further describes the event for a given context. In this business, the fact table is a click stream and one of the dimensions is Ad Campaign information.

...

Code Block
A = LOAD 'click_data' AS (clickId: chararray, campaignId: chararray, timestamp: long);
-- Filter
B = FILTER A BY (timestamp > LB) AND (timestamp < UB);
-- Skewed join
C = FOREACH B GENERATE clickId, timestamp,
        org.apache.hcatalog.hbase.boundedCeilLookup(CONCAT(campaignId,':'),CONCAT(campaignId,CONCAT(':',(chararray)timestamp)),
            tableName:chararray, campaignId:chararray, pricePerClick:double, effectiveTime:long);
D = FILTER C BY NOT isEmpty(campaignId);

Snapshots

...

Code Block
A = LOAD 'click_data1' AS (clickId: chararray, campaignId: chararray, timestamp: long);
B = LOAD 'click_data1' AS (clickId: chararray, campaignId: chararray, timestamp: long);
-- Skewed join; a snapshot will be taken and stored as 'my_snapshot'
C = FOREACH A GENERATE clickId, timestamp,
        org.apache.hcatalog.hbase.boundedCeilLookup('my_snapshot',CONCAT(campaignId,':'),CONCAT(campaignId,CONCAT(':',(chararray)timestamp)),
            tableName:chararray, campaignId:chararray, pricePerClick:double, effectiveTime:long);
D = FILTER C BY NOT isEmpty(campaignId);
-- Skewed join; the previous snapshot 'my_snapshot' will be reused
E = FOREACH B GENERATE clickId, timestamp,
        org.apache.hcatalog.hbase.boundedCeilLookup('my_snapshot',CONCAT(campaignId,':'),CONCAT(campaignId,CONCAT(':',(chararray)timestamp)),
            tableName:chararray, campaignId:chararray, pricePerClick:double, effectiveTime:long);
F = FILTER E BY NOT isEmpty(campaignId);