Hive HBase Integration

Table of Contents

Info

title	Version information

As of Hive 0.9.0 the HBase integration requires at least HBase 0.92, earlier versions of Hive were working with HBase 0.89/0.90

Introduction

This page documents the Hive/HBase integration support originally introduced in HIVE-705. This feature allows Hive QL statements to access HBase tables for both read (SELECT) and write (INSERT). It is even possible to combine access to HBase tables with native Hive tables via joins and unions.

...

The storage handler is built as an independent module, hive-hbase-handler-x.y.z.jar, which must be available on the Hive client auxpath, along with HBase, Guava and Zookeeper ZooKeeper jars. It also requires the correct configuration property to be set in order to connect to the right HBase master. See the HBase documentation for how to set up an HBase cluster.

Here's an example using CLI from a source build environment, targeting a single-node HBase server. (Note that the jar locations and names have changed in Hive 0.79.0, so for earlier releases, some changes are needed.)

No Format

$HIVE_SRC/build/dist/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-0.79.0.jar,$HIVE_SRC/build/dist/lib/hbase-0.8992.0-SNAPSHOT.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.3.1.4.jar,$HIVE_SRC/build/dist/lib/guava-r09.jar -hiveconf hbase.master=hbase.yoyodyne.com:60000

...

No Format

$HIVE_SRC/build/dist/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-0.79.0.jar,$HIVE_SRC/build/dist/lib/hbase-0.8992.0-SNAPSHOT.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.3.1.4.jar,$HIVE_SRC/build/dist/lib/guava-r09.jar -hiveconf hbase.zookeeper.quorum=zk1.yoyodyne.com,zk2.yoyodyne.com,zk3.yoyodyne.com

The handler requires Hadoop 0.20 or higher, and has only been tested with dependency versions hadoop-0.20.x, hbase-0.8992.0 and zookeeper-3.3.14. If you are not using hbase-0.8992.0, you will need to rebuild the handler with the HBase jar matching your version, and change the --auxpath above accordingly. Failure to use matching versions will lead to misleading connection failures such as MasterNotRunningException since the HBase RPC protocol changes often.

In order to create a new HBase table which is to be managed by Hive, use the STORED BY clause on CREATE TABLE:

No Format

CREATE TABLE hbase_table_1(key int, value string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");

...

Inserting large amounts of data may be slow due to WAL overhead; if you would like to disable this, make sure you have HIVE-1383 (as of Hive 0.6), and then issue this command before the INSERT:

...

Again, hbase.columns.mapping is required (and will be validated against the existing HBase table's column families), whereas hbase.table.name is optional.

Column Mapping

There are two SERDEOPTIONS that control the mapping of HBase columns to Hive:

hbase.columns.mapping
hbase.table.default.storage.type: Can have a value of either string (the default) or binary, this option is only available as of Hive 0.9 and the string behavior is the only one available in earlier versions

The column mapping support currently available is somewhat cumbersome and restrictive:

for each Hive column, the table creator must specify a corresponding entry in the comma-delimited hbase.columns.mapping string (so for a Hive table with n columns, the string should have n entries); whitespace should not be used in between entries since these will be interperted as part of the column name, which is almost certainly not what you want

Wiki Markup
a mapping entry must be either {{:key}} or of the form {{column-family-name:\[column-name]}}

there must be exactly one :key mapping (we don't support compound keys yet)
(note that before HIVE-1228 in Hive 0.6, :key was not supported, and the first Hive column implicitly mapped to the key; as of HIVE-1228Hive 0.6, it is now strongly recommended that you always specify the key explictly; we will drop support for implicit key mapping in the future)
if no column-name is given, then the Hive column will map to all columns in the corresponding HBase column family, and the Hive MAP datatype must be used to allow access to these (possibly sparse) columns
there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp.
since Since HBase does not associate datatype information with columns, the serde converts everything to string representation before storing it in HBase; there is currently no way to plug in a custom serde per column
it is not necessary to reference every HBase column family, but those that are not mapped will be inaccessible via the Hive table; it's possible to map multiple Hive tables to the same HBase table

...

Space shortcuts

Child pages

Versions Compared

Old Version 10

New Version 11

Key

Hive HBase Integration

Introduction

Column Mapping

Space shortcuts

Child pages

Page History

Versions Compared

Old Version 10

New Version 11

Key

Hive HBase Integration

Introduction

Column Mapping