Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

To access Accumulo tables, a Hive table must be created using the CREATE command with the STORED BY clause. If the EXTERNAL keyword is omitted from the CREATE call, the lifecycle of the Accumulo table is tied to the lifetime of the Hive table: if the Hive table is deleted, so is the Accumulo table. This is the default case. Providing the EXTERNAL keyword will create a Hive table that references an Accumulo table but will not remove the underlying Accumulo table if the Hive table is dropped.

Each Hive row maps to a set of Accumulo keys with the same row ID. One column in the Hive row is designated as a "special" column which is used as the Accumulo row ID. All other Hive columns in the row have some mapping to Accumulo column (column family and qualifier) where the Hive column value is placed in the Accumulo value.

No Format
CREATE TABLE accumulo_table(rowrowid STRING, name STRING, age INT, weight DOUBLE, height INT)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES('accumulo.columns.mapping' = ':rowid,person:name,person:age,person:weight,person:height');

In the above statement, normal Hive column name and type pairs are provided as is the case with normal create table statements. The full AccumuloStorageHandler class name is provided to inform Hive that Accumulo will back this Hive table. A number of properties can be provided to configure the AccumuloStorageHandler via SERDEPROPERTIES or TBLPROPERTIES. The most important property is "accumulo.columns.mapping" which controls how the Hive columns map to Accumulo columns. In this case, the "row" Hive column is used to populate the Accumulo row ID component of the Accumulo Key, while the other Hive columns (name, age, weight and height) are all columns within the Accumulo row.

For the above schema in the "accumulo_table", we could envision a single row in the table:

No Format
hive> select * from accumulo_table;
row1	Steve	32	200	72

The above record would be serialized into Accumulo Key-Value pairs in the following manner given the declared accumulo.columns.mapping:

No Format
user@accumulo accumulo_table> scan
row1	person:age []	32
row1	person:height []	72
row1	person:name []	Steve
row1	person:weight []	200

The power of the column mapping is that multiple Hive tables with differing column mappings can interact with the same Accumulo table and produce different results. When columns are excluded, the performance of Hive queries can be improved through the use of Accumulo locality groups to filter out unwanted data at the server-side.

Column Mapping

The column mapping string is comma-separated list of encoded values whose offset corresponds to the Hive schema for the table. For those familiar with Accumulo, each element in the column mapping string resembles a column_family:column_qualifier; however, there are a few different variants that allow for different control.

...