Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Update to reflect status as of Hive 0.9 (part 1, will continue tomorrow)

Hive HBase Integration

Table of Contents
Info
titleVersion information

As of Hive 0.9.0 the HBase integration requires at least HBase 0.92, earlier versions of Hive were working with HBase 0.89/0.90

Introduction

This page documents the Hive/HBase integration support originally introduced in HIVE-705. This feature allows Hive QL statements to access HBase tables for both read (SELECT) and write (INSERT). It is even possible to combine access to HBase tables with native Hive tables via joins and unions.

...

The storage handler is built as an independent module, hive-hbase-handler-x.y.z.jar, which must be available on the Hive client auxpath, along with HBase, Guava and Zookeeper ZooKeeper jars. It also requires the correct configuration property to be set in order to connect to the right HBase master. See the HBase documentation for how to set up an HBase cluster.

Here's an example using CLI from a source build environment, targeting a single-node HBase server. (Note that the jar locations and names have changed in Hive 0.79.0, so for earlier releases, some changes are needed.)

No Format
$HIVE_SRC/build/dist/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-0.79.0.jar,$HIVE_SRC/build/dist/lib/hbase-0.8992.0-SNAPSHOT.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.3.1.4.jar,$HIVE_SRC/build/dist/lib/guava-r09.jar -hiveconf hbase.master=hbase.yoyodyne.com:60000

...

No Format
$HIVE_SRC/build/dist/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-0.79.0.jar,$HIVE_SRC/build/dist/lib/hbase-0.8992.0-SNAPSHOT.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.3.1.4.jar,$HIVE_SRC/build/dist/lib/guava-r09.jar -hiveconf hbase.zookeeper.quorum=zk1.yoyodyne.com,zk2.yoyodyne.com,zk3.yoyodyne.com

The handler requires Hadoop 0.20 or higher, and has only been tested with dependency versions hadoop-0.20.x, hbase-0.8992.0 and zookeeper-3.3.14. If you are not using hbase-0.8992.0, you will need to rebuild the handler with the HBase jar matching your version, and change the --auxpath above accordingly. Failure to use matching versions will lead to misleading connection failures such as MasterNotRunningException since the HBase RPC protocol changes often.

In order to create a new HBase table which is to be managed by Hive, use the STORED BY clause on CREATE TABLE:

No Format
CREATE TABLE hbase_table_1(key int, value string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");

...

Inserting large amounts of data may be slow due to WAL overhead; if you would like to disable this, make sure you have HIVE-1383 (as of Hive 0.6), and then issue this command before the INSERT:

...

Again, hbase.columns.mapping is required (and will be validated against the existing HBase table's column families), whereas hbase.table.name is optional.

Column Mapping

There are two SERDEOPTIONS that control the mapping of HBase columns to Hive:

  • hbase.columns.mapping
  • hbase.table.default.storage.type: Can have a value of either string (the default) or binary, this option is only available as of Hive 0.9 and the string behavior is the only one available in earlier versions

The column mapping support currently available is somewhat cumbersome and restrictive:

  • for each Hive column, the table creator must specify a corresponding entry in the comma-delimited hbase.columns.mapping string (so for a Hive table with n columns, the string should have n entries); whitespace should not be used in between entries since these will be interperted as part of the column name, which is almost certainly not what you want
  • Wiki Markup
    a mapping entry must be either {{:key}} or of the form {{column-family-name:\[column-name]}}
  • there must be exactly one :key mapping (we don't support compound keys yet)
  • (note that before HIVE-1228 in Hive 0.6, :key was not supported, and the first Hive column implicitly mapped to the key; as of HIVE-1228Hive 0.6, it is now strongly recommended that you always specify the key explictly; we will drop support for implicit key mapping in the future)
  • if no column-name is given, then the Hive column will map to all columns in the corresponding HBase column family, and the Hive MAP datatype must be used to allow access to these (possibly sparse) columns
  • there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp.
  • since Since HBase does not associate datatype information with columns, the serde converts everything to string representation before storing it in HBase; there is currently no way to plug in a custom serde per column
  • it is not necessary to reference every HBase column family, but those that are not mapped will be inaccessible via the Hive table; it's possible to map multiple Hive tables to the same HBase table

...