Versions Compared

Comment: Documenting HIVE-1634

...

  • for each Hive column, the table creator must specify a corresponding entry in the comma-delimited hbase.columns.mapping string (so for a Hive table with n columns, the string should have n entries); whitespace should not be used between entries, since it will be interpreted as part of the column name, which is almost certainly not what you want
  • a mapping entry must be either :key or of the form column-family-name:[column-name][#(binary|string)]
  • (the type specification, delimited by #, was added in Hive 0.9.0, see HIVE-1634 at https://issues.apache.org/jira/browse/HIVE-1634; earlier versions interpreted everything as strings)
    • If no type specification is given, the value of hbase.table.default.storage.type is used
    • Any prefix of a valid value is also valid (e.g. #b instead of #binary)
    • If you specify a column as binary, the bytes in the corresponding HBase cells are expected to be of the form that HBase's Bytes class yields
  • there must be exactly one :key mapping (we don't support compound keys yet)
  • (note that before HIVE-1228 in Hive 0.6, :key was not supported, and the first Hive column implicitly mapped to the key; as of Hive 0.6, it is now strongly recommended that you always specify the key explicitly; we will drop support for implicit key mapping in the future)
  • if no column-name is given, then the Hive column will map to all columns in the corresponding HBase column family, and the Hive MAP datatype must be used to allow access to these (possibly sparse) columns
  • there is currently no way to access the HBase timestamp attribute, and queries always access data with the latest timestamp.
  • Since HBase does not associate datatype information with columns, the serde converts everything to string representation before storing it in HBase; there is currently no way to plug in a custom serde per column
  • it is not necessary to reference every HBase column family, but those that are not mapped will be inaccessible via the Hive table; it's possible to map multiple Hive tables to the same HBase table
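
The mapping rules above can be made concrete with a small validator. The following Python sketch is only an illustration and is not Hive's own parsing code: the names parse_mapping, resolve_type, and ENTRY_RE are hypothetical, and the check that a family-only entry is paired with a MAP column is a simple string test on the Hive type name.

```python
import re

# Hypothetical helper names; this is a sketch, not Hive's actual parser.
STORAGE_TYPES = ("binary", "string")

# An entry is ":key" or "family:[qualifier]", optionally followed by "#type".
# Whitespace is deliberately not allowed anywhere in an entry.
ENTRY_RE = re.compile(
    r"^(?:(?P<key>:key)|(?P<family>[^:#,\s]+):(?P<qualifier>[^:#,\s]*))"
    r"(?:#(?P<type>[a-z]+))?$"
)


def resolve_type(spec, default="string"):
    """Expand a storage-type prefix such as 'b' to 'binary'."""
    if spec is None:
        return default
    matches = [t for t in STORAGE_TYPES if t.startswith(spec)]
    if len(matches) != 1:
        raise ValueError(f"unknown storage type: #{spec}")
    return matches[0]


def parse_mapping(mapping, hive_types, default="string"):
    """Check a mapping string against the rules listed above.

    hive_types lists the Hive column types in declaration order.
    Returns (family, qualifier, storage_type) tuples; the key entry
    is reported as (":key", None, storage_type).
    """
    entries = mapping.split(",")
    if len(entries) != len(hive_types):
        raise ValueError(
            f"{len(hive_types)} Hive columns but {len(entries)} mapping entries"
        )
    parsed = []
    for entry, hive_type in zip(entries, hive_types):
        m = ENTRY_RE.match(entry)
        if m is None:
            raise ValueError(f"malformed entry (stray whitespace?): {entry!r}")
        storage = resolve_type(m.group("type"), default)
        if m.group("key"):
            parsed.append((":key", None, storage))
            continue
        qualifier = m.group("qualifier")
        # A family-only entry ("cf:") must be backed by a Hive MAP column.
        if qualifier == "" and not hive_type.lower().startswith("map<"):
            raise ValueError(f"family-only entry {entry!r} needs a Hive MAP column")
        parsed.append((m.group("family"), qualifier or None, storage))
    key_count = sum(1 for family, _, _ in parsed if family == ":key")
    if key_count != 1:
        raise ValueError(f"expected exactly one :key entry, found {key_count}")
    return parsed
```

For example, parse_mapping(":key#b,cf:keyval,cf:foo#b", ["int", "string", "double"]) reports the key and cf:foo as binary and cf:keyval as string, while ":key, cf:val" is rejected because of the space after the comma.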

The next few sections provide detailed examples of the kinds of column mappings currently possible.

Multiple Columns and Families

Here's an example with three Hive columns and two HBase column families, with two of the Hive columns (value1 and value2) corresponding to one of the column families (a, with HBase column names b and c), and the other Hive column corresponding to a single column (e) in its own column family (d).

...

No Format
hbase(main):012:0> scan "hbase_table_1"
ROW                          COLUMN+CELL                                                                      
 100                         column=cf:val_100, timestamp=1267739509194, value=100                            
 98                          column=cf:val_98, timestamp=1267739509194, value=98                              
2 row(s) in 0.0080 seconds

And when queried back into Hive:

No Format

hive> select * from hbase_table_1;
Total MapReduce jobs = 1
Launching Job 1 out of 1
...
OK
{"val_100":100}	100
{"val_98":98}	98
Time taken: 3.808 seconds

Note that the key of the MAP must have datatype string, since it is used for naming the HBase column, so the following table definition will fail:

No Format
CREATE TABLE hbase_table_1(key int, value map<int,int>) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,cf:"
);
FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.hbase.HBaseSerDe: hbase column family 'cf:' should be mapped to map<string,?> but is mapped to map<int,int>)

Illegal: Hive Primitive to HBase Column Family

Table definitions such as the following are illegal because a
Hive column mapped to an entire column family must have MAP type:

No Format
CREATE TABLE hbase_table_1(key int, value string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,cf:"
);
FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.hbase.HBaseSerDe: hbase column family 'cf:' should be mapped to map<string,?> but is mapped to string)

Example with binary columns

Relying on the default value of hbase.table.default.storage.type:

No Format
CREATE TABLE hbase_table_1 (key int, value string, foobar double) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key#b,cf:keyval,cf:foo#b"
);

Specifying hbase.table.default.storage.type:

No Format

CREATE TABLE hbase_table_1 (key int, value string, foobar double)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,cf:val#s,cf:foo",
"hbase.table.default.storage.type" = "binary"
);
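
To see what the choice between binary and string storage means for the bytes in an HBase cell, the following Python sketch reproduces the two encodings for an int and a double. It assumes the layouts produced by HBase's Bytes.toBytes(int) and Bytes.toBytes(double) (big-endian, 4 and 8 bytes respectively) and UTF-8 text for string storage; the function names are hypothetical and the code does not talk to HBase.

```python
import struct


def hbase_bytes_int(value):
    """Layout of Bytes.toBytes(int): 4 bytes, big-endian, two's complement."""
    return struct.pack(">i", value)


def hbase_bytes_double(value):
    """Layout of Bytes.toBytes(double): 8-byte IEEE 754 bit pattern, big-endian."""
    return struct.pack(">d", value)


def string_storage(value):
    """String storage: the UTF-8 bytes of the value's text form."""
    return str(value).encode("utf-8")


if __name__ == "__main__":
    # The same Hive int 100 under the two storage types.
    print(hbase_bytes_int(100).hex())  # binary storage: 00000064
    print(string_storage(100))         # string storage: b'100'
    print(hbase_bytes_double(1.5).hex())
```

A cell written by another application with Bytes.toBytes(100) holds the four bytes 00 00 00 64, so it is only readable through a column mapped with #b; a column left at string storage expects the three characters "100" instead.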

Key Uniqueness

One subtle difference between HBase tables and Hive tables is that HBase tables have a unique key, whereas Hive tables do not. When multiple rows with the same key are inserted into HBase, only one of them is stored (the choice is arbitrary, so do not rely on HBase to pick the right one). This is in contrast to Hive, which is happy to store multiple rows with the same key and different values.
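
The difference can be modelled in a few lines of Python. This is only a toy model with made-up rows: a list stands in for a Hive table and a dict keyed by row key stands in for an HBase table. In the sketch the last write for a key wins, which is an artefact of the loop order; as noted above, which row survives in a real insert is arbitrary.

```python
def load_hive_like(rows):
    """A Hive table keeps every inserted row, duplicates included."""
    return list(rows)


def load_hbase_like(rows):
    """An HBase table keeps one entry per row key.

    Here later rows overwrite earlier ones; do not rely on that order
    when inserting into a real HBase table.
    """
    table = {}
    for key, value in rows:
        table[key] = value
    return table


if __name__ == "__main__":
    # Hypothetical sample data with a repeated key.
    rows = [(498, "first"), (498, "second"), (100, "other")]
    print(len(load_hive_like(rows)))   # 3 rows stored
    print(len(load_hbase_like(rows)))  # 2 rows stored
```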

...