
Hive HBase Integration

...

Table of Contents

Introduction

This page documents the Hive/HBase integration support originally
introduced in HIVE-705. This feature allows Hive QL statements to
access HBase tables for both read (SELECT) and write (INSERT). It is
even possible to combine access to HBase tables with native Hive
tables via joins and unions.
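
For instance, assuming the hbase_table_1 and pokes tables used in the
examples later on this page, such a join might look like this sketch:

Code Block
SELECT h.value1, p.bar
FROM hbase_table_1 h JOIN pokes p ON (h.key = p.foo);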

...

This feature is a work in progress, and suggestions for its
improvement are very welcome.

...

Before proceeding, please read Hive-StorageHandlers for an overview
of the generic storage handler framework on which HBase integration depends.

...

The storage handler is built as an independent module,
hive-hbase-handler-x.y.z.jar, which must be available on the Hive
client auxpath, along with the HBase and ZooKeeper jars. It also
requires the hbase.master configuration property to be set so that the
handler connects to the right HBase master. See the HBase documentation for how to set up an HBase cluster.
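
For example, a client session might be launched along these lines (a
sketch only; the jar paths, version numbers, and host name are
placeholders, not tested values):

Code Block
hive --auxpath /path/to/hive-hbase-handler-x.y.z.jar,/path/to/hbase-x.y.z.jar,/path/to/zookeeper-x.y.z.jar \
  -hiveconf hbase.master=hbase.example.com:60000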

...

The handler requires Hadoop 0.20 or higher, and has only been tested
with dependency versions hadoop-0.20.x, hbase-0.89.0, and
zookeeper-3.3.1. If you are not using hbase-0.89.0, you will need to
rebuild the handler against the HBase jar matching your version and
change the --auxpath above accordingly. Failure to use matching
versions will lead to misleading connection failures such as
MasterNotRunningException, since the HBase RPC protocol changes often.

In order to create a new HBase table which is to be managed by Hive,
use the STORED BY clause on CREATE TABLE:
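
A minimal sketch of such a statement (the value column and the cf1:val
mapping are illustrative placeholders; the table names and properties
match the explanation below):

Code Block
CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");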

...

The hbase.columns.mapping property is required and will be
explained in the next section. The hbase.table.name property
is optional; it controls the name of the table as known by HBase, and
allows the Hive table to have a different name. In this example, the
table is known as hbase_table_1 within Hive, and as xyz
within HBase. If not specified, then the Hive and HBase table names
will be identical.

After executing the command above, you should be able to see the new (empty) table in the HBase shell:
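
For instance (a sketch; the shell output is abbreviated and its exact
formatting varies by HBase version):

Code Block
$ hbase shell
hbase(main):001:0> list
xyz
1 row(s)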

...

If you want to give Hive access to an existing HBase table,
use CREATE EXTERNAL TABLE:
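
A sketch, assuming an existing HBase table named some_existing_table
with a column family cf1:

Code Block
CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "some_existing_table");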

...

Again, hbase.columns.mapping is required (and will be
validated against the existing HBase table's column families), whereas
hbase.table.name is optional.

...

The column mapping support currently available is somewhat
cumbersome and restrictive:

...

Here's an example with four Hive columns and two HBase column
families. The first Hive column (key) maps to the HBase row key; two
of the remaining columns (value1 and value2) map to one column family
(a, with HBase column names b and c); and the last column (value3)
maps to a single column (e) in its own column family (d).

Code Block
CREATE TABLE hbase_table_1(key int, value1 string, value2 int, value3 int) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,a:b,a:c,d:e"
);
INSERT OVERWRITE TABLE hbase_table_1 SELECT foo, bar, foo+1, foo+2 
FROM pokes WHERE foo=98 OR foo=100;

...

Here's how a Hive MAP datatype can be used to access an entire column
family. Each row can have a different set of columns, where the
column names correspond to the map keys and the column values
correspond to the map values.
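
A sketch, assuming a column family named cf whose values are to be
read as int (note the trailing colon in cf:, which maps the whole
family rather than a single column):

Code Block
CREATE TABLE hbase_table_map(value map<string,int>, row_key int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "cf:,:key"
);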

...

Note that the key of the MAP must have datatype string, since it is
used for naming the HBase column, so the following table definition will fail:
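
For example, a definition along these lines would be rejected, since
the map key type is int rather than string:

Code Block
CREATE TABLE hbase_table_bad(key int, value map<int,int>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,cf:"
);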

...

Potential Enhancements

  • more flexible column mapping (HIVE-806, HIVE-1245)
  • default column mapping in cases where no mapping spec is given
  • filter pushdown and indexing (see Hive-FilterPushdownDev and Hive-IndexDev)
  • expose timestamp attribute, possibly also with support for treating it as a partition key
  • allow per-table hbase.master configuration
  • run profiler and minimize any per-row overhead in column mapping
  • user defined routines for lookups and data loads via HBase client API (HIVE-758 and HIVE-791)
  • logging is very noisy, with a lot of spurious exceptions; investigate these and either fix their cause or squelch them

Build

Code for the storage handler is located under
hive/trunk/hbase-handler.
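
Assuming the Ant build Hive used at the time, building from the trunk
root along these lines should produce the handler jar (the exact
target may vary by release):

Code Block
cd hive/trunk
ant package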

...

Class-level unit tests are provided under
hbase-handler/src/test/org/apache/hadoop/hive/hbase.

Positive QL tests are under hbase-handler/src/test/queries.
These use an HBase+ZooKeeper mini-cluster to host the fixture tables
in-process, so no free-standing HBase installation is needed in order
to run them. To avoid failures due to port conflicts, don't run these
tests on a machine where a real HBase master or ZooKeeper server is
already running.
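
Assuming the usual Hive CLI test driver conventions, the QL tests can
likely be run with something like the following (the testcase and
qfile names are assumptions, not verified values):

Code Block
ant test -Dtestcase=TestHBaseCliDriver -Dqfile=hbase_queries.q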

...

  • For information on how to bulk load data from Hive into HBase, see Hive-HBaseBulkLoad.
  • For another project which adds SQL-like query language support on top of HBase, see HBQL (unrelated to Hive).

Acknowledgements

  • Primary credit for this feature goes to Samuel Guo, who did most of the development work in the early drafts of the patch.