Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Wiki Markup
h1. %SPACEOUT{%TOPIC%}%HBase Output Storage Driver - Design

h2. Description

This twiki outlines the design of the HCatOutputStorageDriver for HBase. As an iterative approach the two drivers will be created: HBaseDirectOutputStorageDriver and HBaseBulkOutputStorageDriver. 

HBaseDirectOutputStorageDriver makes use of the HBase client (HTable) to make random writes for each record that needs to be written. This driver will validate the fundamental changes and enhancements required to support HBase via HCatalog before moving on to the more complex implementation. For smaller loads this driver should be sufficient. 

HBaseBulkOutputStorageDriver makes use of HBase's bulk import facility wherein generated HFile's (HBase's internal storage format) are loaded onto their respective region servers. This is the solution we wish to achieve and will be mainly using in production environments.

h2. HBaseDirectOutputStorageDriver

h3. Diagram


<img src="%ATTACHURLPATH%/HCaOutputStorageDriver.jpg" alt="HCaOutputStorageDriver.jpg" width='791' height='961' />

h3. Discussion

As one of the requirements for batch loading data onto HBase all revision must be written with the same revision number to uniquely identify each batch update. Thus we have to add a new field to HCatTableInfo:

<PRE>
  /** version number to be used for systems which support versions/timestamp ie HBase */
  private long outputVersion;

   /**
     * @return outputVersion to be used if any, -1 means none
     */
    public long getOutputVersion() {
        return outputVersion;
    }

    /**
     * @param outputVersion set the outputVersion to be used
     */
    public void setOutputVersion(long outputVersion) {
        this.outputVersion = outputVersion;
    }
</PRE>

HBaseDirectStorageDriver itself is a pretty straightforward implementation. HBaseDirectOutputFormat decorates HBase's TableOutputFormat or we can implement one ourselves controlling the client directly enabling use better flexibility with tuning ie disabling WAL for higher write rates. This OutputFormat's key is not used and the Value can be either a HBase Put or Delete.

<PRE>
public class HBaseDirectOutputFormat<KEY> extends OutputFormat<KEY,Writable> {
....
}
</PRE> 

One of the main HCat changes that's needed to be made at this stage is it's assumption that table are always stored as files HCatOutputFormat and HCatOutputCommitter makes such assumptions such as checking the existence of a path to verify the existence of a partition.

ResultConverter is a utility class which can be shared by both the input and output storage driver converting data from HCat to native HBase forms. We can initially make use of Hive's HBaseSerDe to power this class and later on move to our own implementation. Though it is not clear whether HCat developers are expected to use Hive SerDes to maintain interoperability with Hive.  

h2. HBaseBulkOutputStorageDriver

h3. Diagram

        <img src="%ATTACHURLPATH%/hbaseBulkOutputStorageDriver.jpg" alt="hbaseBulkOutputStorageDriver.jpg" width='757' height='762' />

h3. Discussion

With this driver data is first stored as a sequencefile using HBaseBulkOutputFormat and a second job is run to sort and write data as HBase region server files (HFileOutputFormat) and finally instruct region servers to "bulkImport" their respective regions.

HBaseBulkOutputFormat also does not use the key field and the value can be either a Put or a Delete. These classes implements Writable so no extra work is needed to serialize them.

<PRE>
public class HBaseBulkOutputFormat<KEY> extends OutputFormat<KEY,Writable> {
...
}
</PRE>

ImportSequenceFile is the MR job which does the actual bulk import. It's tasks involves sorting and partitioning the data correctly and finally loading the partition onto their respective region servers. A good reference implementation for this Class is HBase's ImportTSV which does bulk imports on TSV files. An instance of this job is triggered by the MetaStore.

HBaseBulkOutputCommitter is used to inform the MetaStore the final status of a HBaseBulkOutputFormat write via the thrift client. Essentially ending in either a run of ImportSequenceFile on success or invalidation of the revision on failure. This commit task is done only after the baseCommitTask has completed.

<PRE>
public class HBaseBulkOutputCommiter extends OutputCommitter {
    private OutputCommitter baseOutputCommiter;

    public HBaseBulkOutputCommiter (OutputCommitter outputCommitter) {
        this.baseOutputCommiter = outputCommitter;
    }
...
}
</PRE>