...

HBaseDirectOutputStorageDriver uses the HBase client (HTable) to perform a random write for each record that needs to be written. This driver will validate the fundamental changes and enhancements required to support HBase via HCatalog before we move on to the more complex implementation. It would also be useful to determine whether this driver is sufficient for some use cases.

HBaseBulkOutputStorageDriver uses HBase's bulk import facility, in which generated HFiles (HBase's internal storage format) are loaded onto their respective region servers. This is the solution we are aiming for and the one we will mainly use in production environments.

...

One of the requirements for batch loading data into HBase is that all revisions in a batch must be written with the same revision number, so that each batch update is uniquely identified. We therefore have to add a new field to OutputJobInfo that enables us to pass implementation-specific parameters to the underlying storage driver.
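
To make the idea of the new field concrete, here is a minimal sketch. It assumes the field is a java.util.Properties bag; the class OutputJobInfoSketch, the key name hbase.output.revision and the helper revisionOf are hypothetical and only illustrate how a driver could read a shared revision number. It is not the actual OutputJobInfo.

```java
import java.util.Properties;

// Hypothetical stand-in for OutputJobInfo; only the new field is modelled.
final class OutputJobInfoSketch {
    // Assumed key name, chosen for illustration only.
    static final String REVISION_KEY = "hbase.output.revision";

    private final String tableName;
    // The new field: implementation-specific parameters for the storage driver.
    private final Properties properties = new Properties();

    OutputJobInfoSketch(String tableName) {
        this.tableName = tableName;
    }

    String getTableName() {
        return tableName;
    }

    Properties getProperties() {
        return properties;
    }

    // Every task reads the same value, so all writes of one batch share a revision.
    static long revisionOf(OutputJobInfoSketch info) {
        String value = info.getProperties().getProperty(REVISION_KEY);
        if (value == null) {
            throw new IllegalStateException("no revision set for " + info.getTableName());
        }
        return Long.parseLong(value);
    }
}
```

The job client would set the revision once when the job is configured, and each storage driver instance would read it back before writing.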

...

HBaseDirectStorageDriver itself is a fairly straightforward implementation. HBaseDirectOutputFormat either decorates HBase's TableOutputFormat, or we implement one ourselves that controls the client directly, which gives us more flexibility for tuning (e.g. disabling the WAL for higher write rates). This OutputFormat's key is not used, and the value can be either an HBase Put or Delete.

...

One of the main HCat changes needed at this stage concerns its assumption that tables are always stored as files. HCatOutputFormat and HCatOutputCommitter make such assumptions, for example checking the existence of a path to verify the existence of a partition.

...

ImportSequenceFile is the MR job that does the actual bulk import. Its tasks involve sorting and partitioning the data correctly and finally loading the partitions onto their respective region servers. A good reference implementation for this class is HBase's ImportTsv, which does bulk imports of TSV files. An instance of this job is triggered by the MetaStore.
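
The partitioning step can be illustrated with a small sketch. It assumes that rows are routed by comparing the row key against the sorted region start keys, which is how total-order partitioning for HBase bulk loads generally works; the class RegionPartitionSketch and its methods are hypothetical and are not taken from ImportSequenceFile.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper: maps a row key to the region (partition) that should hold it.
final class RegionPartitionSketch {

    // Unsigned lexicographic byte comparison, the ordering HBase uses for row keys.
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // startKeys must be sorted and begin with the empty key (the first region).
    // Returns the index of the last region whose start key is <= rowKey.
    static int partitionFor(byte[] rowKey, List<byte[]> startKeys) {
        int lo = 0;
        int hi = startKeys.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (compareKeys(startKeys.get(mid), rowKey) <= 0) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

With one reducer per region, each reducer receives exactly the key range of its region and can emit its rows in sorted order, which is what the generated HFiles need.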

HBaseBulkOutputCommitter is used to update the MetaStore, via the Thrift client, with the final status of an HBaseBulkOutputFormat write. This essentially ends in either a run of ImportSequenceFile on success or invalidation of the revision on failure. This commit task is done only after the baseCommitTask has completed.
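
The ordering described here can be sketched as follows. The interface RevisionStatusClient, its two methods and the class BulkCommitSketch are hypothetical stand-ins for the Thrift calls to the MetaStore; the sketch only shows the control flow, under the assumption that the MetaStore reacts to the reported status.

```java
// Hypothetical sketch of the committer's control flow, not the actual implementation.
final class BulkCommitSketch {

    // Assumed shape of the MetaStore calls made through the Thrift client.
    interface RevisionStatusClient {
        // On success the MetaStore would trigger an ImportSequenceFile run.
        void writeSucceeded(long revision);

        // On failure the MetaStore would invalidate the revision.
        void writeFailed(long revision);
    }

    // Runs the base commit first, then reports the final status of the write.
    static void finish(Runnable baseCommitTask, boolean writeOk,
                       RevisionStatusClient client, long revision) {
        baseCommitTask.run();
        if (writeOk) {
            client.writeSucceeded(revision);
        } else {
            client.writeFailed(revision);
        }
    }
}
```

Reporting only after the base commit has run keeps the MetaStore from starting an import on output that has not yet been finalised.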

...