Hive HBase Integration
Table of Contents |
---|
Info | ||
---|---|---|
| ||
As of Hive 0.9.0 the HBase integration requires at least HBase 0.92, earlier versions of Hive were working with HBase 0.89/0.90 |
Introduction
This page documents the Hive/HBase integration support originally introduced in HIVE-705. This feature allows Hive QL statements to access HBase tables for both read (SELECT) and write (INSERT). It is even possible to combine access to HBase tables with native Hive tables via joins and unions.
...
The storage handler is built as an independent module, hive-hbase-handler-x.y.z.jar
, which must be available on the Hive client auxpath, along with HBase, Guava and Zookeeper ZooKeeper jars. It also requires the correct configuration property to be set in order to connect to the right HBase master. See the HBase documentation for how to set up an HBase cluster.
Here's an example using CLI from a source build environment, targeting a single-node HBase server. (Note that the jar locations and names have changed in Hive 0.79.0, so for earlier releases, some changes are needed.)
No Format |
---|
$HIVE_SRC/build/dist/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-0.79.0.jar,$HIVE_SRC/build/dist/lib/hbase-0.8992.0-SNAPSHOT.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.3.1.4.jar,$HIVE_SRC/build/dist/lib/guava-r09.jar -hiveconf hbase.master=hbase.yoyodyne.com:60000 |
...
No Format |
---|
$HIVE_SRC/build/dist/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-0.79.0.jar,$HIVE_SRC/build/dist/lib/hbase-0.8992.0-SNAPSHOT.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.3.1.4.jar,$HIVE_SRC/build/dist/lib/guava-r09.jar -hiveconf hbase.zookeeper.quorum=zk1.yoyodyne.com,zk2.yoyodyne.com,zk3.yoyodyne.com |
The handler requires Hadoop 0.20 or higher, and has only been tested with dependency versions hadoop-0.20.x, hbase-0.8992.0 and zookeeper-3.3.14. If you are not using hbase-0.8992.0, you will need to rebuild the handler with the HBase jar matching your version, and change the --auxpath
above accordingly. Failure to use matching versions will lead to misleading connection failures such as MasterNotRunningException since the HBase RPC protocol changes often.
In order to create a new HBase table which is to be managed by Hive, use the STORED BY
clause on CREATE TABLE
:
No Format |
---|
CREATE TABLE hbase_table_1(key int, value string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") TBLPROPERTIES ("hbase.table.name" = "xyz"); |
...
Inserting large amounts of data may be slow due to WAL overhead; if you would like to disable this, make sure you have HIVE-1383 (as of Hive 0.6), and then issue this command before the INSERT:
...
Again, hbase.columns.mapping
is required (and will be validated against the existing HBase table's column families), whereas hbase.table.name
is optional.
Column Mapping
There are two SERDEOPTIONS
that control the mapping of HBase columns to Hive:
hbase.columns.mapping
hbase.table.default.storage.type
: Can have a value of eitherstring
(the default) orbinary
, this option is only available as of Hive 0.9 and thestring
behavior is the only one available in earlier versions
The column mapping support currently available is somewhat cumbersome and restrictive:
- for each Hive column, the table creator must specify a corresponding entry in the comma-delimited
hbase.columns.mapping
string (so for a Hive table with n columns, the string should have n entries); whitespace should not be used in between entries since these will be interperted as part of the column name, which is almost certainly not what you want Wiki Markup a mapping entry must be either {{:key}} or of the form {{column-family-name:\[column-name]}}
- there must be exactly one
:key
mapping (we don't support compound keys yet) - (note that before HIVE-1228 in Hive 0.6,
:key
was not supported, and the first Hive column implicitly mapped to the key; as of HIVE-1228Hive 0.6, it is now strongly recommended that you always specify the key explictly; we will drop support for implicit key mapping in the future) - if no column-name is given, then the Hive column will map to all columns in the corresponding HBase column family, and the Hive MAP datatype must be used to allow access to these (possibly sparse) columns
- there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp.
- since Since HBase does not associate datatype information with columns, the serde converts everything to string representation before storing it in HBase; there is currently no way to plug in a custom serde per column
- it is not necessary to reference every HBase column family, but those that are not mapped will be inaccessible via the Hive table; it's possible to map multiple Hive tables to the same HBase table
...