Unlike other files on HDFS that a job might join against, HBase tables are mutable. Random reads using precisely the same row key may therefore return different results, for example when one job is updating a table while another is reading from it. HCatalog's integration with HBase introduces the notion of snapshots, which guarantee consistent reads over an HBase table during the lifetime of an MR job. Snapshots can also be shared, guaranteeing consistency over a DAG of MR jobs. In the context of this problem, snapshots guarantee that retroactive updates do not affect jobs that are running concurrently.
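To make the consistency guarantee concrete, here is a minimal Java sketch of snapshot reads over a versioned table. It is an illustration under stated assumptions, not HCatalog's implementation: the class `VersionedTable`, its methods, and the idea of modeling a snapshot as a cutoff timestamp are all hypothetical and are not taken from the HCatalog or HBase APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of a mutable, versioned table.
// Assumption: a snapshot is represented by a cutoff timestamp.
class VersionedTable {
    // rowKey -> (write timestamp -> value)
    private final Map<String, NavigableMap<Long, String>> rows = new HashMap<>();

    // Record a new version of the row at the given write timestamp.
    void put(String rowKey, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(timestamp, value);
    }

    // Unprotected read: returns whatever version is newest right now.
    String readLatest(String rowKey) {
        NavigableMap<Long, String> versions = rows.get(rowKey);
        return versions == null ? null : versions.lastEntry().getValue();
    }

    // Snapshot read: returns the newest version written at or before
    // the snapshot timestamp, ignoring any later writes.
    String readAt(String rowKey, long snapshotTimestamp) {
        NavigableMap<Long, String> versions = rows.get(rowKey);
        if (versions == null) {
            return null;
        }
        Map.Entry<Long, String> entry = versions.floorEntry(snapshotTimestamp);
        return entry == null ? null : entry.getValue();
    }
}
```

In this model, a job that fixes its snapshot at timestamp 150 keeps seeing the value written at 100 even after another job writes a new value at 200, while `readLatest` would return the newer value.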

More complicated Pig jobs may need to reuse the same snapshot a number of times. The HBase StorageHandler exposes Java APIs for creating and reusing snapshots. We can mimic similar functionality in Pig.

We can extend the previous lookup UDF to support specifying a snapshotName:

Code Block

org.apache.hcatalog.hbase.pig.BoundedCeilLookup(snapshotName:chararray, lbKey:chararray, ubKey:chararray, tableName:chararray, selected_columns....)

Once invoked, the UDF searches for the named snapshot. If none is found, a snapshot is created and stored in the job's temporary work directory, where other UDFs can reuse it by specifying the same snapshotName.
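The find-or-create behavior described above can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the proposed UDF's implementation: the class `NamedSnapshots`, the `.snapshot` file suffix, the use of a local directory in place of the job's temporary work directory, and the representation of a snapshot as a single `long` are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: looks up a named snapshot in a work directory,
// creating it from the candidate value if it does not exist yet.
class NamedSnapshots {
    static long getOrCreate(Path workDir, String snapshotName, long candidate) {
        try {
            Files.createDirectories(workDir);
            Path file = workDir.resolve(snapshotName + ".snapshot");
            try {
                // CREATE_NEW fails if the file already exists, so only the
                // first caller for a given name stores its candidate value.
                Files.write(file,
                        Long.toString(candidate).getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE_NEW);
                return candidate;
            } catch (FileAlreadyExistsException alreadyCreated) {
                // A snapshot with this name was stored earlier: reuse it.
                // Note: this sketch does not guard against reading a file
                // that another process is still in the middle of writing.
                String stored = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                return Long.parseLong(stored.trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this sketch, the first call for `'my_snapshot'` stores its candidate value, and every later call with the same name returns that stored value regardless of the candidate it passes in.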

Sample usage:
Code Block

A = LOAD 'click_data1' AS (clickId: chararray, campaignId: chararray, timestamp: long);
B = LOAD 'click_data1' AS (clickId: chararray, campaignId: chararray, timestamp: long);
-- Skewed join; a snapshot will be taken and stored as 'my_snapshot'
C = FOREACH A GENERATE clickId, timestamp,
        org.apache.hcatalog.hbase.pig.BoundedCeilLookup('my_snapshot',CONCAT(campaignId,':'),CONCAT(campaignId,CONCAT(':',(chararray)timestamp)),
            tableName:chararray, campaignId:chararray, pricePerClick:double, effectiveTime:long);
D = FILTER C BY NOT IsEmpty(campaignId);
-- Skewed join; the previous snapshot 'my_snapshot' will be reused
E = FOREACH B GENERATE clickId, timestamp,
        org.apache.hcatalog.hbase.pig.BoundedCeilLookup('my_snapshot',CONCAT(campaignId,':'),CONCAT(campaignId,CONCAT(':',(chararray)timestamp)),
            tableName:chararray, campaignId:chararray, pricePerClick:double, effectiveTime:long);
F = FILTER E BY NOT IsEmpty(campaignId);