Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3

HBase Output Storage Driver - Design

Description

This twiki outlines the design of the HCatOutputStorageDriver for HBase. As an iterative approach the two drivers will be created: HBaseDirectOutputStorageDriver and HBaseBulkOutputStorageDriver.

HBaseDirectOutputStorageDriver makes use of the HBase client (HTable) to make random writes for each record that needs to be written. This driver will validate the fundamental changes and enhancements required to support HBase via HCatalog before moving on to the more complex implementation. Also it would be nice determine if this driver would be sufficient for some use cases.

HBaseBulkOutputStorageDriver makes use of HBase's bulk import facility wherein generated HFile's (HBase's internal storage format) are loaded onto their respective region servers. This is the solution we wish to achieve and will be mainly using in production environments.

HBaseDirectOutputStorageDriver

Diagram Image Added

Discussion

As one of the requirements for batch loading data onto HBase all revisions must be written with the same revision number to uniquely identify each batch update. Thus we have to add a new field to
OutputJobInfo which enables us to pass implementation specific parameters to the underlying storage driver. This method of passing application specific information is a non-invasive step which we will reevaluate once we have some deliverables running.

Code Block
titleOutputJobInfo.java

  private Map<String,String> properties;

  /**
   * Set/Get Property information to be passed down to *StorageDriver implementation
   * put implementation specific storage driver configurations here
   * @return
   */
  public Map<String,String> getProperties() {
    return properties;
  }

Not Depicted in the diagram is a Constants class for storing the property keys relevant to this storage driver:

Code Block
titleHBaseConstants.java

public class HBaseConstants {

  public static final String CONF_OUTPUT_VERSION_KEY = HCatConstants.HCAT_DEFAULT_TOPIC_PREFIX+".hbase.outputVersion";

   ....

}

HBaseDirectStorageDriver itself is a pretty straightforward implementation. HBaseDirectOutputFormat decorates HBase's TableOutputFormat or we can implement one ourselves controlling the client directly enabling us better flexibility with tuning ie disabling WAL for higher write rates. This OutputFormat's key is not used and the Value can only be an HBase Put.

Code Block
titleHCatDirectOutputFormat.java

public class HBaseDirectOutputFormat extends OutputFormat<WritableComparable<?>,Writable> implements Configurable
Wiki Markup
---+ %SPACEOUT{%TOPIC%}%

---++ Description

This twiki outlines the design of the HCatOutputStorageDriver for HBase. As an iterative approach the two drivers will be created: HBaseDirectOutputStorageDriver and HBaseBulkOutputStorageDriver. 

HBaseDirectOutputStorageDriver makes use of the HBase client (HTable) to make random writes for each record that needs to be written. This driver will validate the fundamental changes and enhancements required to support HBase via HCatalog before moving on to the more complex implementation. For smaller loads this driver should be sufficient. 

HBaseBulkOutputStorageDriver makes use of HBase's bulk import facility wherein generated HFile's (HBase's internal storage format) are loaded onto their respective region servers. This is the solution we wish to achieve and will be mainly using in production environments.

---++ HBaseDirectOutputStorageDriver

---+++ Diagram


<img src="%ATTACHURLPATH%/HCaOutputStorageDriver.jpg" alt="HCaOutputStorageDriver.jpg" width='791' height='961' />

---+++ Discussion

As one of the requirements for batch loading data onto HBase all revision must be written with the same revision number to uniquely identify each batch update. Thus we have to add a new field to HCatTableInfo:

<PRE>
  /** version number to be used for systems which support versions/timestamp ie HBase */
  private long outputVersion;

   /**
     * @return outputVersion to be used if any, -1 means none
     */
    public long getOutputVersion() {
        return outputVersion;
    }

    /**
     * @param outputVersion set the outputVersion to be used
     */
    public void setOutputVersion(long outputVersion) {
        this.outputVersion = outputVersion;
    }
</PRE>

HBaseDirectStorageDriver itself is a pretty straightforward implementation. HBaseDirectOutputFormat decorates HBase's TableOutputFormat or we can implement one ourselves controlling the client directly enabling use better flexibility with tuning ie disabling WAL for higher write rates. This OutputFormat's key is not used and the Value can be either a HBase Put or Delete.

<PRE>
public class HBaseDirectOutputFormat<KEY> extends OutputFormat<KEY,Writable> {
....
}
</PRE> 

One of the main HCat changes 

One of the main HCat changes that's

...

needed

...

to

...

be

...

made

...

at

...

this

...

stage

...

is

...

it's

...

"assumption

...

" that

...

table

...

are

...

always

...

stored

...

as

...

files

...

HCatOutputFormat

...

and

...

HCatOutputCommitter

...

makes

...

such

...

assumptions

...

such

...

as

...

checking

...

the

...

existence

...

of

...

a

...

path

...

to

...

verify

...

the

...

existence

...

of

...

a

...

partition.

...

ResultConverter

...

is

...

a

...

utility

...

class

...

which

...

can

...

be

...

shared

...

by

...

both

...

the

...

input

...

and

...

output

...

storage

...

driver

...

converting

...

data

...

from

...

HCat

...

to

...

native

...

HBase

...

forms.

...

We

...

can

...

initially

...

make

...

use

...

of

...

Hive's

...

HBaseSerDe

...

to

...

power

...

this

...

class

...

and

...

later

...

on

...

move

...

to

...

our

...

own

...

implementation.

...

Though

...

it

...

is

...

not

...

clear

...

whether

...

HCat

...

driver developers

...

are

...

expected

...

to

...

use

...

Hive

...

SerDes

...

to

...

maintain

...

interoperability

...

with

...

Hive.

HBaseBulkOutputStorageDriver

Diagram

Image Added

Discussion

With this driver data is first stored as a sequencefile using HBaseBulkOutputFormat and a second job is run to sort and write data as HBase region server files (HFileOutputFormat) and finally instruct region servers to "bulkImport" their respective regions.

HBaseBulkOutputFormat also does not use the key field and the value can only be a Put. These classes implements Writable so no extra work is needed to serialize them.

Code Block
titleHBaseBulkOutputFormat.java

public class HBaseBulkOutputFormat extends OutputFormat<WritableComparable<?>,Writable> implements Configurable  

---++ HBaseBulkOutputStorageDriver

---+++ Diagram

        <img src="%ATTACHURLPATH%/hbaseBulkOutputStorageDriver.jpg" alt="hbaseBulkOutputStorageDriver.jpg" width='757' height='762' />

---+++ Discussion

With this driver data is first stored as a sequencefile using HBaseBulkOutputFormat and a second job is run to sort and write data as HBase region server files (HFileOutputFormat) and finally instruct region servers to "bulkImport" their respective regions.

HBaseBulkOutputFormat also does not use the key field and the value can be either a Put or a Delete. These classes implements Writable so no extra work is needed to serialize them.

<PRE>
public class HBaseBulkOutputFormat<KEY> extends OutputFormat<KEY,Writable> {
...
}
</PRE>

ImportSequenceFile

...

is

...

the

...

MR

...

job

...

which

...

does

...

the

...

actual

...

bulk

...

import.

...

It's

...

tasks

...

involves

...

sorting

...

and

...

partitioning

...

the

...

data

...

correctly

...

and

...

finally

...

loading

...

the

...

partition

...

onto

...

their

...

respective

...

region

...

servers.

...

A

...

good

...

reference

...

implementation

...

for

...

this

...

Class

...

is

...

HBase's

...

ImportTSV

...

which

...

does

...

bulk

...

imports

...

on

...

TSV

...

files.

HBaseBulkOutputCommitter is used to update the MetaStore the final status of a HBaseBulkOutputFormat write via the thrift client. Essentially ending in either a run of ImportSequenceFile on success or invalidation of the revision on failure. This commit task is done only after the baseCommitTask has completed.

Code Block
titleHBaseBulkOutputCommiter
 An instance of this job is triggered by the MetaStore.

HBaseBulkOutputCommitter is used to inform the MetaStore the final status of a HBaseBulkOutputFormat write via the thrift client. Essentially ending in either a run of ImportSequenceFile on success or invalidation of the revision on failure. This commit task is done only after the baseCommitTask has completed.

<PRE>
public class HBaseBulkOutputCommiter extends OutputCommitter {
    private OutputCommitter baseOutputCommiter;

    public HBaseBulkOutputCommiter (OutputCommitter outputCommitter) {
        this.baseOutputCommiter = outputCommitter;
    }
...
}
</PRE>