Contributors (alphabetical): Vandana Ayyalasomayajula, Francis Liu, Olga Natkovich, Andreas Neumann

...

Unlike other files on HDFS that a job might join against, HBase tables are mutable. Hence random reads using precisely the same row key may return different results (e.g., when one job is updating a table while another is reading from it). HCatalog's integration with HBase introduces the notion of snapshots, which guarantee consistent reads over an HBase table during the lifetime of a MapReduce (MR) job. Snapshots can also be shared, guaranteeing consistency over a DAG of MR jobs. In the context of this problem, snapshots guarantee that retroactive updates do not affect jobs that are running concurrently.
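The effect of reading through a snapshot can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: a snapshot is modeled as an upper bound on a per-table revision number, and the class and method names (`SnapshotReadSketch`, `put`, `takeSnapshot`, `get`) are hypothetical. It is not the HCatalog or HBase API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy versioned table: each row keeps every written value, ordered by revision. */
public class SnapshotReadSketch {
    private final Map<String, NavigableMap<Long, String>> rows = new HashMap<>();
    private long latestRevision = 0;

    /** Writes a new version of the row and returns its revision number. */
    public synchronized long put(String rowKey, String value) {
        long revision = ++latestRevision;
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(revision, value);
        return revision;
    }

    /** Assumption: a snapshot is just the latest revision visible when it is taken. */
    public synchronized long takeSnapshot() {
        return latestRevision;
    }

    /** Returns the newest value written at or before the snapshot, or null if none. */
    public synchronized String get(String rowKey, long snapshot) {
        NavigableMap<Long, String> versions = rows.get(rowKey);
        if (versions == null) {
            return null;
        }
        Map.Entry<Long, String> entry = versions.floorEntry(snapshot);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        SnapshotReadSketch table = new SnapshotReadSketch();
        table.put("row1", "original");
        long snapshot = table.takeSnapshot();   // a reading job starts here
        table.put("row1", "updated");           // a concurrent writer updates the row
        System.out.println(table.get("row1", snapshot));             // value seen by the reading job
        System.out.println(table.get("row1", table.takeSnapshot())); // value seen by a later job
    }
}
```

A job that keeps using the same snapshot value sees the same row contents on every read, regardless of writes that happen after the snapshot was taken.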

More complicated Pig jobs may need to reuse the same snapshot a number of times. The HBase StorageHandler exposes Java APIs for creating and reusing snapshots. We can mimic similar functionality in Pig.

We can extend the previous lookup UDF to support specifying a snapshotName:

Code Block

org.apache.hcatalog.hbase.pig.BoundedCeilLookup(snapshotName:chararray, lbKey:chararray, ubKey:chararray, tableName:chararray, selected_columns....)

Once invoked, the UDF searches for the named snapshot. If none is found, a snapshot is created and stored in the job's temporary work directory, where other UDFs can reuse it by specifying the same snapshotName.
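The find-or-create behavior described above could look roughly like the following. This is a minimal sketch, not the UDF's actual implementation: the class `SnapshotLookupSketch`, the method `findOrCreate`, the `.snapshot` file suffix, and the use of a local directory and a string in place of the job's work directory and a real serialized snapshot are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Supplier;

public class SnapshotLookupSketch {

    /**
     * Returns the serialized snapshot stored under snapshotName in workDir.
     * If no such file exists, asks newSnapshot for one, stores it, and returns it.
     */
    public static String findOrCreate(Path workDir, String snapshotName,
                                      Supplier<String> newSnapshot) {
        // Hypothetical naming convention: one file per named snapshot.
        Path file = workDir.resolve(snapshotName + ".snapshot");
        try {
            if (Files.exists(file)) {
                // A snapshot with this name already exists: reuse it.
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            }
            // No snapshot yet: create one and persist it for later lookups.
            String serialized = newSnapshot.get();
            Files.createDirectories(workDir);
            Files.write(file, serialized.getBytes(StandardCharsets.UTF_8));
            return serialized;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The sketch does not guard against two callers creating the same snapshot at the same moment. A real implementation would need some form of atomic create or locking so that every UDF instance ends up reading the same stored snapshot.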

Sample usage:

...