Note: This document is work in progress.

Contributors: Andreas Neumann, Francis Liu, Vandana Ayyalasomayajula

Objective

The objective of providing revision management for HBase tables is to preserve the functional programming paradigm on the grid. The map/reduce paradigm has proven efficient for big data, and it would be very useful if map-reduce programs could take data from HBase tables as their input. Presently that is not possible, because the data in HBase is not suitable for repeatable reads: HBase assigns its own timestamps to the revisions of user data that it stores. A centralized component (the revision manager) could replace these timestamps with more meaningful IDs that end users can refer to in order to achieve repeatable reads.

...

  • It assigns a unique monotonically increasing revision number (scope: table) for every write transaction.
  • It keeps track of the currently running and aborted transactions of HBase tables.
  • It provides APIs for users to take a snapshot of an HBase table, either at the latest revision or at a given valid revision number.
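The responsibilities listed above can be illustrated with a small in-memory model. This is only a sketch, not the actual revision manager: the names `RevisionManager`, `Snapshot`, `begin_write`, `commit`, `abort` and `create_snapshot` are hypothetical, and it assumes that a "valid" revision number is any number already assigned for the table and that a snapshot records the running and aborted revisions it should skip.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical snapshot: a revision cut-off plus the revisions to skip."""
    table: str
    revision: int
    excluded: frozenset  # running or aborted revisions at or below `revision`


@dataclass
class TableState:
    last_revision: int = 0
    running: set = field(default_factory=set)
    aborted: set = field(default_factory=set)


class RevisionManager:
    """Toy in-memory model; every name here is illustrative."""

    def __init__(self):
        self._tables = {}

    def _state(self, table):
        return self._tables.setdefault(table, TableState())

    def begin_write(self, table):
        # Revision numbers are scoped to a table and only ever increase.
        state = self._state(table)
        state.last_revision += 1
        state.running.add(state.last_revision)
        return state.last_revision

    def commit(self, table, revision):
        self._state(table).running.remove(revision)

    def abort(self, table, revision):
        state = self._state(table)
        state.running.remove(revision)
        state.aborted.add(revision)

    def create_snapshot(self, table, revision=None):
        # No revision given: snapshot at the latest assigned revision.
        state = self._state(table)
        if revision is None:
            revision = state.last_revision
        elif not 0 <= revision <= state.last_revision:
            raise ValueError(f"invalid revision {revision} for table {table}")
        excluded = frozenset(
            r for r in state.running | state.aborted if r <= revision
        )
        return Snapshot(table, revision, excluded)
```

In this model, two tables number their write transactions independently, and a snapshot taken while a write is still running lists that revision as excluded.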

...

Integration with Map-Reduce

Users can explicitly specify a snapshot to use when reading a given HBase table in their map-reduce job specification. If none is specified, the HBase input storage driver takes the latest snapshot of the table and uses it as input for the job. The output data of a map-reduce program is assigned a revision by the revision manager. If the program completes successfully, the revision information is available in the "OutputJobInfo". To capture the state of the table after the MR job completes, users can ask the revision manager to take a snapshot. They can also use the revision information in the "OutputJobInfo" to create a snapshot and use it later.
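The following sketch illustrates why reading through a snapshot gives a job repeatable input. It is not the input storage driver itself: the helpers `resolve_input_snapshot` and `read_cell` are hypothetical, a snapshot is modelled as a plain `(revision, excluded_revisions)` pair, and the visibility rule (newest version at or below the snapshot revision that is not excluded) is an assumption made for illustration.

```python
def resolve_input_snapshot(explicit_snapshot, take_latest_snapshot):
    """Use the user's snapshot if one was given, otherwise take the latest."""
    if explicit_snapshot is not None:
        return explicit_snapshot
    return take_latest_snapshot()


def read_cell(versions, snapshot_revision, excluded=frozenset()):
    """Return the newest value visible under the snapshot, or None.

    `versions` maps revision number -> value for a single cell.
    """
    visible = [
        rev for rev in versions
        if rev <= snapshot_revision and rev not in excluded
    ]
    return versions[max(visible)] if visible else None


# One cell written by three jobs; revision 2 belongs to an aborted job.
cell = {1: "a", 2: "b", 3: "c"}

# No snapshot was specified, so the (stand-in) latest snapshot is used.
revision, excluded = resolve_input_snapshot(None, lambda: (3, frozenset({2})))
first_read = read_cell(cell, revision, excluded)

# A later job writes revision 4; re-reading with the same snapshot
# still resolves to the same version of the cell.
cell[4] = "d"
second_read = read_cell(cell, revision, excluded)
```

Because the snapshot fixes both the revision cut-off and the set of skipped revisions, later writes to the table do not change what the job reads.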

Integration with Pig

Pig users can use the default behavior (the latest snapshot is created every time an MR job is launched). Features such as explicitly specifying snapshots are future work.

Revision Assignment

...