...
Here's an example using CLI from a source build environment, targeting a single-node HBase server. (Note that the jar locations and names have changed in Hive 0.7, so for earlier releases, some changes are needed.)
Code Blocknoformat |
---|
$HIVE_SRC/build/dist/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-0.7.0.jar,$HIVE_SRC/build/dist/lib/hbase-0.89.0-SNAPSHOT.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.3.1.jar -hiveconf hbase.master=hbase.yoyodyne.com:60000 |
Here's an example which instead targets a distributed HBase cluster where a quorum of 3 zookeepers is used to elect the HBase master:
Code Blocknoformat |
---|
$HIVE_SRC/build/dist/bin/hive --auxpath $HIVE_SRC/build/dist/lib/hive-hbase-handler-0.7.0.jar,$HIVE_SRC/build/dist/lib/hbase-0.89.0-SNAPSHOT.jar,$HIVE_SRC/build/dist/lib/zookeeper-3.3.1.jar -hiveconf hbase.zookeeper.quorum=zk1.yoyodyne.com,zk2.yoyodyne.com,zk3.yoyodyne.com |
...
In order to create a new HBase table which is to be managed by Hive, use the STORED BY clause on CREATE TABLE:
Code Blocknoformat |
---|
CREATE TABLE hbase_table_1(key int, value string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") TBLPROPERTIES ("hbase.table.name" = "xyz"); |
...
After executing the command above, you should be able to see the new (empty) table in the HBase shell:
Code Blocknoformat |
---|
$ hbase shell HBase Shell; enter 'help<RETURN>' for list of supported commands. Version: 0.20.3, r902334, Mon Jan 25 13:13:08 PST 2010 hbase(main):001:0> list xyz 1 row(s) in 0.0530 seconds hbase(main):002:0> describe "xyz" DESCRIPTION ENABLED {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', COMPRESSION => 'NONE', VE true RSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]} 1 row(s) in 0.0220 seconds hbase(main):003:0> scan "xyz" ROW COLUMN+CELL 0 row(s) in 0.0060 seconds |
...
Here's how to move data from Hive into the HBase table (see GettingStarted for how to create the example table pokes
in Hive first):
Code Blocknoformat |
---|
INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98; |
Use HBase shell to verify that the data actually got loaded:
Code Blocknoformat |
---|
hbase(main):009:0> scan "xyz" ROW COLUMN+CELL 98 column=cf1:val, timestamp=1267737987733, value=val_98 1 row(s) in 0.0110 seconds |
And then query it back via Hive:
Code Blocknoformat |
---|
hive> select * from hbase_table_1; Total MapReduce jobs = 1 Launching Job 1 out of 1 ... OK 98 val_98 Time taken: 4.582 seconds |
Inserting large amounts of data may be slow due to WAL overhead; if you would like to disable this, make sure you have HIVE-1383, and then issue this command before the INSERT:
Code Blocknoformat |
---|
set hive.hbase.wal.enabled=false; |
...
If you want to give Hive access to an existing HBase table, use CREATE EXTERNAL TABLE:
Code Blocknoformat |
---|
CREATE EXTERNAL TABLE hbase_table_2(key int, value string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val") TBLPROPERTIES("hbase.table.name" = "some_existing_table"); |
...
Here's an example with three Hive columns and two HBase column families, with two of the Hive columns (value1
and value2
) corresponding to one of the column families (a
, with HBase column names b
and c
), and the other Hive column corresponding to a single column (e
) in its own column family (d
).
Code Blocknoformat |
---|
CREATE TABLE hbase_table_1(key int, value1 string, value2 int, value3 int) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = ":key,a:b,a:c,d:e" ); INSERT OVERWRITE TABLE hbase_table_1 SELECT foo, bar, foo+1, foo+2 FROM pokes WHERE foo=98 OR foo=100; |
Here's how this looks in HBase:
Code Blocknoformat |
---|
hbase(main):014:0> describe "hbase_table_1" DESCRIPTION ENABLED {NAME => 'hbase_table_1', FAMILIES => [{NAME => 'a', COMPRESSION => 'N true ONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_M EMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'd', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN _MEMORY => 'false', BLOCKCACHE => 'true'}]} 1 row(s) in 0.0170 seconds hbase(main):015:0> scan "hbase_table_1" ROW COLUMN+CELL 100 column=a:b, timestamp=1267740457648, value=val_100 100 column=a:c, timestamp=1267740457648, value=101 100 column=d:e, timestamp=1267740457648, value=102 98 column=a:b, timestamp=1267740457648, value=val_98 98 column=a:c, timestamp=1267740457648, value=99 98 column=d:e, timestamp=1267740457648, value=100 2 row(s) in 0.0240 seconds |
And when queried back into Hive:
Code Blocknoformat |
---|
hive> select * from hbase_table_1; Total MapReduce jobs = 1 Launching Job 1 out of 1 ... OK 100 val_100 101 102 98 val_98 99 100 Time taken: 4.054 seconds |
...
Here's how a Hive MAP datatype can be used to access an entire column family. Each row can have a different set of columns, where the column names correspond to the map keys and the column values correspond to the map values.
Code Blocknoformat |
---|
CREATE TABLE hbase_table_1(value map<string,int>, row_key int) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = "cf:,:key" ); INSERT OVERWRITE TABLE hbase_table_1 SELECT map(bar, foo), foo FROM pokes WHERE foo=98 OR foo=100; |
...
Here's how this looks in HBase (with different column names in different rows):
Code Blocknoformat |
---|
hbase(main):012:0> scan "hbase_table_1" ROW COLUMN+CELL 100 column=cf:val_100, timestamp=1267739509194, value=100 98 column=cf:val_98, timestamp=1267739509194, value=98 2 row(s) in 0.0080 seconds |
And when queried back into Hive:
Code Blocknoformat |
---|
hive> select * from hbase_table_1; Total MapReduce jobs = 1 Launching Job 1 out of 1 ... OK {"val_100":100} 100 {"val_98":98} 98 Time taken: 3.808 seconds |
Note that the key of the MAP must have datatype string, since it is used for naming the HBase column, so the following table definition will fail:
Code Blocknoformat |
---|
CREATE TABLE hbase_table_1(key int, value map<int,int>) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = ":key,cf:" ); FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.hbase.HBaseSerDe: hbase column family 'cf:' should be mapped to map<string,?> but is mapped to map<int,int>) |
...
Table definitions such as the following are illegal because a
Hive column mapped to an entire column family must have MAP type:
Code Blocknoformat |
---|
CREATE TABLE hbase_table_1(key int, value string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = ":key,cf:" ); FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.hbase.HBaseSerDe: hbase column family 'cf:' should be mapped to map<string,?> but is mapped to string) |
...
For example, the pokes table contains rows with duplicate keys. If it is copied into another Hive table, the duplicates are preserved:
Code Blocknoformat |
---|
CREATE TABLE pokes2(foo INT, bar STRING); INSERT OVERWRITE TABLE pokes2 SELECT * FROM pokes; -- this will return 3 SELECT COUNT(1) FROM POKES WHERE foo=498; -- this will also return 3 SELECT COUNT(1) FROM pokes2 WHERE foo=498; |
But in HBase, the duplicates are silently eliminated:
Code Blocknoformat |
---|
CREATE TABLE pokes3(foo INT, bar STRING) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = ":key,cf:bar" ); INSERT OVERWRITE TABLE pokes3 SELECT * FROM pokes; -- this will return 1 instead of 3 SELECT COUNT(1) FROM pokes3 WHERE foo=498; |
...
The QL tests can be executed via ant like this:
Code Blocknoformat |
---|
ant test -Dtestcase=TestHBaseCliDriver -Dqfile=hbase_queries.q |
...