...

In the above statement, normal Hive column name and type pairs are provided, as is the case with normal CREATE TABLE statements. The full AccumuloStorageHandler class name is provided to inform Hive that Accumulo will back this Hive table. A number of properties can be provided to configure the AccumuloStorageHandler via SERDEPROPERTIES or TBLPROPERTIES. The most important property is "accumulo.columns.mapping", which controls how the Hive columns map to Accumulo columns.

...

  1. A single column
    1. This places the value for the Hive column into the Accumulo value with the given column family and column qualifier.
  2. A column qualifier map
    1. A column family is provided and a column qualifier prefix of any length is allowed, followed by an asterisk.
    2. The Hive column type is expected to be a Map. The key of the Hive map is appended to the column qualifier prefix.
    3. The value of the Hive map is placed in the Accumulo value.
  3. The rowid
    1. Controls which Hive column is used as the Accumulo rowid.
    2. Exactly one ":rowid" element must exist in each column mapping.
    3. ":rowid" is case insensitive (":rowID" is equivalent to ":rowId").
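
As an illustration of the three element kinds listed above, the following sketch combines them in one mapping. The table name, column names, column families, and the "t_" qualifier prefix are hypothetical and chosen only for this example.

```sql
-- Hypothetical table: names, families, and the "t_" prefix are illustrative only.
CREATE TABLE person_attrs(id string, age int, tags map<string,string>)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES (
  -- :rowid      -> the Hive column "id" becomes the Accumulo rowid
  -- person:age  -> single column: family "person", qualifier "age"
  -- tag:t_*     -> qualifier map: each key of "tags" is appended to the prefix "t_"
  "accumulo.columns.mapping" = ":rowid,person:age,tag:t_*"
);
```

The mapping elements are positional: the first element applies to the first Hive column, the second to the second, and so on, as in the examples later on this page.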

Additionally, a serialization option can be provided to each element in the column mapping which will control how the value is serialized. Currently, the options are:

...

These are set by including a pound sign ('#') after the column mapping element with either the long or short serialization value. The default serialization is 'string'. For example, for the value 10, "person:age#s" is synonymous with "person:age" and would serialize the value as the literal string "10". If "person:age#b" were used instead, the value would be serialized as four bytes: \x00\x00\x00\x0A.
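
A minimal sketch of per-element serialization options in a table definition follows. The table and the "height" column are hypothetical. Only the short forms "#s" and "#b" mentioned above are used.

```sql
-- Hypothetical table used only to illustrate the '#' serialization suffix.
CREATE TABLE person_sizes(id string, age int, height int)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES (
  -- person:age#s    -> written as a string (same as omitting the suffix)
  -- person:height#b -> written as a binary integer
  "accumulo.columns.mapping" = ":rowid,person:age#s,person:height#b"
);
```

Binary serialization can be more compact for numeric values, but the stored bytes are no longer human-readable when the table is scanned directly in Accumulo.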

Other options

The following options can also be set in SERDEPROPERTIES or TBLPROPERTIES for further control over the actions of the AccumuloStorageHandler:

 

| Option Name | Description |
|---|---|
| accumulo.iterator.pushdown | Should filter predicates be satisfied within Accumulo using Iterators (default: true) |
| accumulo.default.storage | The default storage serialization method for values (default: string) |
| accumulo.visibility.label | A static ColumnVisibility string to use when writing any records to Accumulo (default: empty string) |
| accumulo.authorizations | A comma-separated list of authorizations to use when scanning Accumulo (default: no authorizations). Note that the Accumulo user provided to connect to Accumulo must have all authorizations provided. |
| accumulo.composite.rowid.factory | Extension point which allows a custom class to be provided when constructing LazyObjects from the rowid without changing the ObjectInspector for the rowid column. |
| accumulo.composite.rowid | Extension point which allows for custom parsing of the rowid column into a LazyObject. |
| accumulo.table.name | Controls what Accumulo table name is used (default: the Hive table name) |
| accumulo.mock.instance | Use a MockAccumulo instance instead of connecting to a real instance (default: false). Useful for testing. |
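
As a sketch of how several of these options might be combined, the statement below sets a visibility label, scan authorizations, and disables iterator pushdown. The table name, column family, and the label and authorization strings ("internal", "audit") are assumptions made for illustration. The options are placed in SERDEPROPERTIES here, which is one of the two locations described above.

```sql
-- Hypothetical table; the label and authorization values are placeholders.
CREATE TABLE secure_events(id string, payload string)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES (
  "accumulo.columns.mapping" = ":rowid,data:payload",
  -- ColumnVisibility applied to every record written through this table
  "accumulo.visibility.label" = "internal",
  -- Authorizations used when scanning; the connecting Accumulo user must hold all of them
  "accumulo.authorizations" = "internal,audit",
  -- Evaluate filter predicates in Hive rather than in Accumulo iterators
  "accumulo.iterator.pushdown" = "false"
);
```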

Examples

Create a user table, consisting of some unique key for a user, a user ID, and a username. The Accumulo row ID comes from the Hive column "key", the "userid" column is written to the "f" column family and "userid" column qualifier, and the "username" column is written to the "f" column family and the "nickname" column qualifier. Instead of using the default "users" Accumulo table, the table name is overridden in the TBLPROPERTIES to use the Accumulo table "hive_users".

CREATE TABLE users(key int, userid int, username string) 
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES ("accumulo.columns.mapping" = ":rowID,f:userid,f:nickname")
TBLPROPERTIES ("accumulo.table.name" = "hive_users");

 

Using an asterisk in the column mapping string, a single Hive map column can be expanded into multiple Accumulo Key-Value pairs, one per map entry. The Hive Map is a parameterized type: in the case below, the key is a string and the value is an integer. The default serialization is overridden from 'string' to 'binary', which means that the integers in the value of the Hive map will be stored as a series of bytes instead of the UTF-8 string representation.

CREATE TABLE hive_map(key int, value map<string,int>) 
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES (
"accumulo.columns.mapping" = ":rowID,cf:*",
"accumulo.default.storage" = "binary"
);
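
Once such a table exists, individual map entries can be read back with ordinary Hive map syntax. The query below is a small sketch against the hive_map table defined above. The map key 'a' is a hypothetical example, and the comments restate the layout described in the mapping rules rather than verified scan output.

```sql
-- Each entry of the Hive map is stored under column family "cf",
-- with the map key used as the column qualifier (the prefix is empty here).
-- value['a'] therefore reads the Accumulo entry with qualifier "a", if present.
SELECT key, value['a'] AS a_value
FROM hive_map
WHERE value['a'] IS NOT NULL;
```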

 

Creating the Hive table with the EXTERNAL keyword decouples the lifecycle of the Accumulo table from that of the Hive table. Creating this table assumes that the Accumulo table "countries" already exists. This is a useful way to let Hive work with tables that are created and populated by some external tool (e.g. a MapReduce job). When the Hive table "countries" is dropped, the Accumulo table will not be deleted. Additionally, the EXTERNAL keyword can be useful when creating multiple Hive tables with different options that operate on the same underlying Accumulo table.

CREATE EXTERNAL TABLE countries(key string, name string, country string, country_id int)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES ("accumulo.columns.mapping" = ":rowID,info:name,info:country,info:country_id");
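
To illustrate several Hive tables sharing one Accumulo table, a second external table could expose only part of the same data. This is a sketch: the Hive table name "country_names" is hypothetical, and it relies on "accumulo.table.name" to point at the existing "countries" Accumulo table.

```sql
-- Hypothetical second view over the same Accumulo table "countries".
CREATE EXTERNAL TABLE country_names(key string, name string)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES ("accumulo.columns.mapping" = ":rowID,info:name")
TBLPROPERTIES ("accumulo.table.name" = "countries");
```

Because both Hive tables are external, dropping either one leaves the Accumulo table and the other Hive table untouched.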

 

 

Acknowledgements

I would be remiss to not mention the efforts made by Brian Femiano that were the basis for this storage handler. His initial prototype for Accumulo-Hive integration was the base for this work.