...


Overview

Apache Accumulo is a sorted, distributed key-value store based on the Google BigTable paper. Accumulo's API is expressed in terms of Keys and Values, which offers the greatest flexibility for reading and writing data; higher-level query abstractions, however, are typically left to the user. Using Apache Hive as a SQL interface to Accumulo complements Accumulo's existing high-throughput batch access and low-latency random lookups.

...

| Option Name | Description |
|---|---|
| accumulo.iterator.pushdown | Should filter predicates be satisfied within Accumulo using Iterators (default: true) |
| accumulo.default.storage | The default storage serialization method for values (default: string) |
| accumulo.visibility.label | A static ColumnVisibility string to use when writing any records to Accumulo (default: empty string) |
| accumulo.authorizations | A comma-separated list of authorizations to use when scanning Accumulo (default: no authorizations). Note that the Accumulo user provided to connect to Accumulo must have all authorizations provided. |
| accumulo.composite.rowid.factory | Extension point which allows a custom class to be provided when constructing LazyObjects from the rowid without changing the ObjectInspector for the rowid column. |
| accumulo.composite.rowid | Extension point which allows for custom parsing of the rowid column into a LazyObject. |
| accumulo.table.name | Controls what Accumulo table name is used (default: the Hive table name) |
| accumulo.mock.instance | Use a MockAccumulo instance instead of connecting to a real instance (default: false). Useful for testing. |
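As a rough illustration of how the visibility and authorization options might be combined, the sketch below creates a table whose records are written with a static visibility label and scanned with a matching set of authorizations. The table name, column names, column family, and the labels "internal" and "public" are hypothetical. Setting these options in SERDEPROPERTIES is an assumption, made by analogy with how accumulo.default.storage is set in the examples further down.

```sql
-- Sketch only: secured_events, payload, cf, "internal" and "public" are hypothetical names.
-- accumulo.visibility.label is applied to every record written through this Hive table.
-- accumulo.authorizations is used when scanning; the connecting Accumulo user
-- must already hold every authorization listed here.
CREATE TABLE secured_events(key string, payload string)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES (
"accumulo.columns.mapping" = ":rowID,cf:payload",
"accumulo.visibility.label" = "internal",
"accumulo.authorizations" = "internal,public"
);
```

Because the label is static, every record written through this table carries the same visibility.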

Examples

Override the Accumulo table name

Create a users table with a unique key for each user, a user ID, and a username. The Accumulo row ID comes from the Hive key column, the userid column is written to the "f" column family with the "userid" column qualifier, and the username column is written to the "f" column family with the "nickname" column qualifier. The default Accumulo table name, "users", is overridden in TBLPROPERTIES so that the data is stored in the Accumulo table "hive_users".

No Format
CREATE TABLE users(key int, userid int, username string) 
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES ("accumulo.columns.mapping" = ":rowID,f:userid,f:nickname")
TBLPROPERTIES ("accumulo.table.name" = "hive_users");
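To make the column mapping concrete, the sketch below reads the table back and notes, in comments, where one row would end up in Accumulo. The sample values (42, 1001, 'alice') are hypothetical and are not data from this page.

```sql
-- Sketch only: assumes the users table defined above; sample values are hypothetical.
-- For a Hive row (key=42, userid=1001, username='alice'), the mapping
-- ":rowID,f:userid,f:nickname" corresponds to two entries in the Accumulo
-- table hive_users, both under row ID 42:
--   column family f, column qualifier userid   -> 1001
--   column family f, column qualifier nickname -> alice
SELECT userid, username
FROM users
WHERE key = 42;
```

The Hive column is still called username in queries; only the Accumulo column qualifier is "nickname".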

...

Store a Hive map with binary serialization

By using an asterisk in the column mapping string, a Hive map can be expanded from a single Accumulo Key-Value pair to multiple Key-Value pairs. The Hive map is a parameterized type: in the example below, the key is a string and the value is an integer. The default serialization is overridden from 'string' to 'binary', which means that the integers in the values of the Hive map are stored as a series of bytes instead of their UTF-8 string representation.

No Format
CREATE TABLE hive_map(key int, value map<string,int>) 
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES (
"accumulo.columns.mapping" = ":rowID,cf:*",
"accumulo.default.storage" = "binary"
);
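A short sketch of how such a map column might be queried follows. The map keys 'apples' and 'pears' are hypothetical. With the "cf:*" mapping, each map entry corresponds to its own Accumulo Key-Value pair, with the map key used as the column qualifier.

```sql
-- Sketch only: assumes the hive_map table defined above; map keys are hypothetical.
-- A Hive row with value = {'apples': 3, 'pears': 5} corresponds to two Accumulo
-- entries under the same row ID: cf:apples and cf:pears.

-- Look up a single map entry by key.
SELECT key, value['apples']
FROM hive_map;

-- List the map keys (the Accumulo column qualifiers under cf) and count them.
SELECT key, map_keys(value), size(value)
FROM hive_map;
```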

...

Register an external table

Creating the Hive table with the EXTERNAL keyword decouples the lifecycle of the Accumulo table from that of the Hive table. Creating this table assumes that the Accumulo table "countries" already exists. This lets Hive query tables that are created and populated by an external tool (e.g. a MapReduce job). When the Hive table countries is dropped, the Accumulo table is not deleted. The EXTERNAL keyword is also useful for creating multiple Hive tables, each with different options, that operate on the same underlying Accumulo table.
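The following sketch shows one way two external Hive tables could share a single Accumulo table. The Hive column names, the "info" column family, and the second table name country_names are hypothetical; only the Accumulo table name "countries" comes from the text above.

```sql
-- Sketch only: assumes an Accumulo table named "countries" already exists.
-- Column names and the "info" column family are hypothetical.
CREATE EXTERNAL TABLE countries(key string, name string, country_id int)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES ("accumulo.columns.mapping" = ":rowID,info:name,info:country_id");

-- A second Hive table over the same Accumulo table, exposing fewer columns.
-- accumulo.table.name points it at "countries" instead of "country_names".
CREATE EXTERNAL TABLE country_names(key string, name string)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES ("accumulo.columns.mapping" = ":rowID,info:name")
TBLPROPERTIES ("accumulo.table.name" = "countries");

-- Dropping an external Hive table removes only the Hive metadata;
-- the Accumulo table "countries" and its data are left in place.
DROP TABLE country_names;
```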

...