Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

HBase

...

Output

...

Storage

...

Driver

...

-

...

Design

...

Description

This twiki outlines the design of the HCatOutputStorageDriver for HBase. As an iterative approach the two drivers will be created: HBaseDirectOutputStorageDriver and HBaseBulkOutputStorageDriver.

HBaseDirectOutputStorageDriver makes use of the HBase client (HTable) to make random writes for each record that needs to be written. This driver will validate the fundamental changes and enhancements required to support HBase via HCatalog before moving on to the more complex implementation. For smaller loads this driver should be sufficient.

HBaseBulkOutputStorageDriver makes use of HBase's bulk import facility wherein generated HFile's (HBase's internal storage format) are loaded onto their respective region servers. This is the solution we wish to achieve and will be mainly using in production environments.

HBaseDirectOutputStorageDriver

Diagram

<img src="%ATTACHURLPATH%/HCaOutputStorageDriver.jpg"

...

alt="HCaOutputStorageDriver.jpg"

...

width='791'

...

height='961'

...

/>

Discussion

As one of the requirements for batch loading data onto HBase all revision must be written with the same revision number to uniquely identify each batch update. Thus we have to add a new field to
OutputJobInfo which allows us to pass implementation specific parameters to the underlying storage driver.

Code Block
titleOutputJobInfo.java
  
  private Map<String,String> properties;



h3. Discussion

As one of the requirements for batch loading data onto HBase all revision must be written with the same revision number to uniquely identify each batch update. Thus we have to add a new field to HCatTableInfo:

<PRE>
  /** version number to be used for systems which support versions/timestamp ie HBase */
  private long outputVersion;

   /**
   * Set/Get *Property @returninformation outputVersion to be usedpassed ifdown any, -1 means none
  to *StorageDriver implementation
   */
 put implementation specific publicstorage longdriver getOutputVersion() {
        return outputVersion;configurations here
    }

    /**
     * @param outputVersion set the outputVersion to be used @return
     */
    public voidMap<String,String> setOutputVersiongetProperties(long outputVersion) {
        this.outputVersion = outputVersionreturn properties;
    }
</PRE>

HBaseDirectStorageDriver

...

itself

...

is

...

a

...

pretty

...

straightforward

...

implementation.

...

HBaseDirectOutputFormat

...

decorates

...

HBase's

...

TableOutputFormat

...

or

...

we

...

can

...

implement

...

one

...

ourselves

...

controlling

...

the

...

client

...

directly

...

enabling

...

use

...

better

...

flexibility

...

with

...

tuning

...

ie

...

disabling

...

WAL

...

for

...

higher

...

write

...

rates.

...

This

...

OutputFormat's

...

key

...

is

...

not

...

used

...

and

...

the

...

Value

...

can

...

be

...

either

...

a

...

HBase

...

Put

...

or

...

Delete.

Code Block
titleHCatDirectOutputFormat.java



<PRE>
public class HBaseDirectOutputFormat<KEY>HBaseDirectOutputFormat extends OutputFormat<KEYOutputFormat<WritableComparable<?>,Writable> implements Configurable {
....
}
</PRE> 

One of the main HCat changes 

One of the main HCat changes that's

...

needed

...

to

...

be

...

made

...

at

...

this

...

stage

...

is

...

it's

...

assumption

...

that

...

table

...

are

...

always

...

stored

...

as

...

files

...

HCatOutputFormat

...

and

...

HCatOutputCommitter

...

makes

...

such

...

assumptions

...

such

...

as

...

checking

...

the

...

existence

...

of

...

a

...

path

...

to

...

verify

...

the

...

existence

...

of

...

a

...

partition.

...

ResultConverter

...

is

...

a

...

utility

...

class

...

which

...

can

...

be

...

shared

...

by

...

both

...

the

...

input

...

and

...

output

...

storage

...

driver

...

converting

...

data

...

from

...

HCat

...

to

...

native

...

HBase

...

forms.

...

We

...

can

...

initially

...

make

...

use

...

of

...

Hive's

...

HBaseSerDe

...

to

...

power

...

this

...

class

...

and

...

later

...

on

...

move

...

to

...

our

...

own

...

implementation.

...

Though

...

it

...

is

...

not

...

clear

...

whether

...

HCat

...

driver developers

...

are

...

expected

...

to

...

use

...

Hive

...

SerDes

...

to

...

maintain

...

interoperability

...

with

...

Hive.

...

HBaseBulkOutputStorageDriver

Diagram

<img src="%ATTACHURLPATH%/hbaseBulkOutputStorageDriver.jpg"

...

alt="hbaseBulkOutputStorageDriver.jpg"

...

width='757'

...

height='762'

...

/>

Discussion

With this driver data is first stored as a sequencefile using HBaseBulkOutputFormat and a second job is run to sort and write data as HBase region server files (HFileOutputFormat) and finally instruct region servers to "bulkImport" their respective regions.

HBaseBulkOutputFormat also does not use the key field and the value can be either a Put or a Delete. These classes implements Writable so no extra work is needed to serialize them.

Code Block
titleHBaseBulkOutputFormat.java

public class HBaseBulkOutputFormat extends OutputFormat<WritableComparable<?>,Writable> implements Configurable 

h3. Discussion

With this driver data is first stored as a sequencefile using HBaseBulkOutputFormat and a second job is run to sort and write data as HBase region server files (HFileOutputFormat) and finally instruct region servers to "bulkImport" their respective regions.

HBaseBulkOutputFormat also does not use the key field and the value can be either a Put or a Delete. These classes implements Writable so no extra work is needed to serialize them.

<PRE>
public class HBaseBulkOutputFormat<KEY> extends OutputFormat<KEY,Writable> {
...
}
</PRE>

ImportSequenceFile

...

is

...

the

...

MR

...

job

...

which

...

does

...

the

...

actual

...

bulk

...

import.

...

It's

...

tasks

...

involves

...

sorting

...

and

...

partitioning

...

the

...

data

...

correctly

...

and

...

finally

...

loading

...

the

...

partition

...

onto

...

their

...

respective

...

region

...

servers.

...

A

...

good

...

reference

...

implementation

...

for

...

this

...

Class

...

is

...

HBase's

...

ImportTSV

...

which

...

does

...

bulk

...

imports

...

on

...

TSV

...

files.

...

An

...

instance

...

of

...

this

...

job

...

is

...

triggered

...

by

...

the

...

MetaStore.

...

HBaseBulkOutputCommitter

...

is

...

used

...

to

...

inform

...

the

...

MetaStore

...

the

...

final

...

status

...

of

...

a

...

HBaseBulkOutputFormat

...

write

...

via

...

the

...

thrift

...

client.

...

Essentially

...

ending

...

in

...

either

...

a

...

run

...

of

...

ImportSequenceFile

...

on

...

success

...

or

...

invalidation

...

of

...

the

...

revision

...

on

...

failure.

...

This

...

commit

...

task

...

is

...

done

...

only

...

after

...

the

...

baseCommitTask

...

has

...

completed.

Code Block
titleHBaseBulkOutputCommiter



<PRE>
public class HBaseBulkOutputCommiter extends OutputCommitter {
    private OutputCommitter baseOutputCommiter;

    public HBaseBulkOutputCommiter (OutputCommitter outputCommitter) {
        this.baseOutputCommiter = outputCommitter;
    }
...
}
</PRE>