Comment: Update for implementation as service.

...

Contributors (alphabetical): Vandana Ayyalasomayajula, Francis Liu, Andreas Neumann, Thomas Weise

Objective

The objective of adding revision management to HBase tables is to preserve the functional programming paradigm on the grid. The map/reduce paradigm has proven efficient for big data, and it would be very useful if map/reduce programs could take data from HBase tables as their input. Presently that is not possible, because data in HBase is not suitable for repeatable reads: HBase assigns its own timestamps when it stores revisions of user data. A centralized component (the revision manager) could therefore replace these timestamps with more meaningful IDs that end users can refer to in order to achieve repeatable reads.

...

When a job (or user) asks the revision manager for the latest snapshot, the list of currently running transactions is consulted. The snapshot uses the lowest revision number in that list, minus 1.
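
The snapshot rule can be illustrated with a minimal Java sketch. The class and method names are hypothetical and not taken from the actual code base. The text does not say what happens when no transaction is running, so the fallback to the latest allocated revision is an assumption.

```java
import java.util.Collection;
import java.util.Collections;

/** Hypothetical helper illustrating the snapshot rule; not the actual implementation. */
final class SnapshotRule {
    private SnapshotRule() {}

    /**
     * @param runningRevisions revision numbers of the currently running transactions
     * @param latestAllocated  highest revision number handed out so far
     *                         (ASSUMPTION: used only when nothing is running)
     * @return the revision number to record in the snapshot
     */
    static long latestSnapshotRevision(Collection<Long> runningRevisions, long latestAllocated) {
        if (runningRevisions.isEmpty()) {
            // Assumed fallback: with no running transactions, everything written so far is visible.
            return latestAllocated;
        }
        // Rule from the text: lowest running revision number, minus 1.
        return Collections.min(runningRevisions) - 1;
    }
}
```

With running revisions 5, 7 and 9, the snapshot revision is 4, so a reader pinned to that snapshot never sees data from a transaction that was still in progress when the snapshot was taken.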

...

Revision Manager as Service

Design Goals

  • Keep revision manager independent of HCatalog (e.g. Pig or another component could use RM outside HCatalog to access HBase data)
  • HCatalog meta store server (through Hive meta hook) will eventually interact with revision manager for authorization checks and to synchronize related state (on drop table etc., currently this is handled client side)
  • Revision manager as an active component: it needs to manage transaction expiration (“active” does not imply any particular implementation choice, e.g. thread usage)
  • Security: Access control in revision manager will be delegated to HBase (use ACLs of corresponding tables and standard Hadoop Kerberos or delegation token authentication).

With ZooKeeper based revision manager

  • HBase source for authorization (logically revision data ACL should be same as table ACL)
  • ZooKeeper ACLs will restrict access to revision data to the RM service principal
  • RM service runs everything as service principal, not as authorized client

Transaction Expiration

  • Potentially many current/expired transactions
  • Integrate expiration inline/lazily as part of begin/abort/snapshot
  • We decided not to use a timer-based thread; at this time the benefits don’t justify the added complexity
  • beginTransaction performs the actual expiration together with the revision data storage update (interval ~30s), since it has to read, modify, and write the open revisions at that point anyway
  • createSnapshot skips expired transactions in the active list (cleanup only happens as a side effect of subsequent beginTransaction calls)
  • Limitation: initially only revision data is handled; the corresponding data is not deleted from HBase (no “rollback”)
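
The lazy expiration scheme in the bullets above can be sketched as follows. This is a simplified in-memory illustration with hypothetical names. It assumes a fixed per-transaction timeout, takes the current time as a parameter, and does not model the ~30s interval, the persistent revision data storage, or a separate list of aborted revisions.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory sketch of inline/lazy transaction expiration. */
final class LazyExpiringRevisionManager {
    private final long timeoutMillis;
    // Open revisions: revision number -> expiry time in millis, sorted by revision number.
    private final TreeMap<Long, Long> openRevisions = new TreeMap<>();
    private long lastRevision = 0;

    LazyExpiringRevisionManager(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Purges expired entries as a side effect, then opens a new revision. */
    synchronized long beginTransaction(long nowMillis) {
        // The open revisions are read, modified and written here anyway,
        // so the actual cleanup is folded into this update.
        openRevisions.values().removeIf(expiry -> expiry <= nowMillis);
        long revision = ++lastRevision;
        openRevisions.put(revision, nowMillis + timeoutMillis);
        return revision;
    }

    /** Skips expired entries without removing them; returns lowest live revision minus 1. */
    synchronized long createSnapshot(long nowMillis) {
        for (Map.Entry<Long, Long> entry : openRevisions.entrySet()) {
            if (entry.getValue() > nowMillis) {
                return entry.getKey() - 1;
            }
        }
        // ASSUMPTION: with no live transaction, the latest allocated revision is used.
        return lastRevision;
    }

    /** Number of entries still stored, including expired ones not yet purged. */
    synchronized int storedOpenCount() {
        return openRevisions.size();
    }
}
```

In this sketch createSnapshot is read-only, which matches the decision to let cleanup happen only as a side effect of later beginTransaction calls. An expired entry therefore stays in storage until the next transaction begins.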

Service implementation

  • Implement service as HBase coprocessor endpoint
  • Provides access to HBase ACL and leverages HBase container for security and high availability
  • Authorization is easy within a coprocessor: it has access to HBase ACL info and can use it to authorize client access without relying on ACLs in ZooKeeper (which are difficult to maintain).
  • Consider using HBase instead of ZK for revision metadata storage
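
The idea of delegating access control to the table ACLs could look roughly like the sketch below. All types here are hypothetical placeholders and do not use the real HBase coprocessor or access-control API. The mapping of beginTransaction to write permission and createSnapshot to read permission is an assumption derived from the statement that the revision data ACL should logically match the table ACL.

```java
/** Hypothetical table permissions; stands in for whatever HBase exposes. */
enum TableAction { READ, WRITE }

/** Hypothetical view of the table ACLs available inside the service. */
interface TableAclSource {
    boolean isAllowed(String user, String table, TableAction action);
}

/** Hypothetical authorizer: revision operations inherit the table's ACL. */
final class RevisionManagerAuthorizer {
    private final TableAclSource acls;

    RevisionManagerAuthorizer(TableAclSource acls) {
        this.acls = acls;
    }

    // ASSUMPTION: starting a write transaction requires write access to the table.
    void checkBeginTransaction(String user, String table) {
        require(user, table, TableAction.WRITE);
    }

    // ASSUMPTION: taking a snapshot requires read access to the table.
    void checkCreateSnapshot(String user, String table) {
        require(user, table, TableAction.READ);
    }

    private void require(String user, String table, TableAction action) {
        if (!acls.isAllowed(user, table, action)) {
            throw new SecurityException(user + " lacks " + action + " on " + table);
        }
    }
}
```

Because the check consults the table ACL directly, no separate per-user ACLs need to be maintained on the revision data itself.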

UML diagram for ZooKeeper based implementation

The following picture shows the class diagram for the ZooKeeper-based revision manager.

...