...

This page documents the storage handler support being added to Hive as
part of work on HBaseIntegration. The motivation is to allow Hive to
access data stored and managed by other systems in a modular,
extensible fashion.

...

Hive storage handler support builds on existing extensibility features in
both Hadoop and Hive:

  • input formats
  • output formats
  • serialization/deserialization libraries

Besides bundling these together, a storage handler can also implement
a new metadata hook interface, allowing Hive DDL to be used for
managing object definitions in both the Hive metastore and the
other system's catalog simultaneously and consistently.

...

Before storage handlers, Hive already had a concept of managed vs
external tables. A managed table is one for which the definition
is primarily managed in Hive's metastore, and for whose data
storage Hive is responsible. An external table is one whose
definition is managed in some external catalog, and whose data Hive
does not own (i.e. it will not be deleted when the table is dropped).

Storage handlers introduce a distinction between native and
non-native tables. A native table is one which Hive knows how to
manage and access without a storage handler; a non-native table is one
which requires a storage handler.

These two distinctions (managed vs. external and native vs.
non-native) are orthogonal. Hence, there are four possibilities for
base tables:

  • managed native: what you get by default with CREATE TABLE
  • external native: what you get with CREATE EXTERNAL TABLE when no STORED BY clause is specified
  • managed non-native: what you get with CREATE TABLE when a STORED BY clause is specified; Hive stores the definition in its metastore, but does not create any files itself; instead, it calls the storage handler with a request to create a corresponding object structure
  • external non-native: what you get with CREATE EXTERNAL TABLE when a STORED BY clause is specified; Hive registers the definition in its metastore and calls the storage handler to check that it matches the primary definition in the other system

Note that we avoid the term file-based in these definitions, since
the form of storage used by the other system is irrelevant.

...

Storage handlers are associated with a table when it is created via
the new STORED BY clause, an alternative to the existing ROW FORMAT
and STORED AS clauses:

Code Block
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [col_comment], col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS]
  [
   [ROW FORMAT row_format] [STORED AS file_format]
   | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
  ]
  [LOCATION hdfs_path]
  [AS select_statement]

When STORED BY is specified, row_format (DELIMITED or SERDE) and
STORED AS cannot be specified. Optional SERDEPROPERTIES can be
specified as part of the STORED BY clause; they are passed through to
the serde provided by the storage handler.
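As a concrete illustration, the HBase integration this work grew out of uses STORED BY roughly as follows; the handler class name is the one from HBaseIntegration, while the specific column mapping and table name shown here are examples only:

```sql
CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
```

Because no EXTERNAL keyword is present, this creates a managed non-native table: Hive records the definition in its metastore and asks the storage handler to create the corresponding HBase table.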

...

DROP TABLE works as usual, but ALTER TABLE is not yet supported for
non-native tables.

Storage Handler Interface

The Java interface which must be implemented by a storage handler is
reproduced below; for details, see the Javadoc in the code:

...

The HiveMetaHook is optional, and described in the next section.
If getMetaHook returns non-null, the returned object's methods
will be invoked as part of metastore modification operations.

The configureTableJobProperties method is called as part of
planning a job for execution by Hadoop. It is the responsibility of
the storage handler to examine the table definition and set corresponding
attributes on jobProperties. At execution time, only these jobProperties
will be available to the input format, output format, and serde.
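The pattern this describes can be sketched as follows. This is an illustrative sketch only: the real method receives a Hive TableDesc and is defined on the storage handler interface, but here the table definition is modeled as a plain Map (and the property names are hypothetical) so the example is self-contained:

```java
import java.util.HashMap;
import java.util.Map;

public class JobPropertiesSketch {

    // Mimics a storage handler copying the table-level settings it needs
    // into jobProperties; at execution time only jobProperties reach the
    // input format, output format, and serde.
    static void configureTableJobProperties(Map<String, String> tableProperties,
                                            Map<String, String> jobProperties) {
        // Forward only the keys this (hypothetical) handler cares about.
        for (String key : new String[] {"example.remote.table", "example.column.mapping"}) {
            String value = tableProperties.get(key);
            if (value != null) {
                jobProperties.put(key, value);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> tableProperties = new HashMap<>();
        tableProperties.put("example.remote.table", "xyz");
        tableProperties.put("unrelated.property", "ignored");

        Map<String, String> jobProperties = new HashMap<>();
        configureTableJobProperties(tableProperties, jobProperties);
        System.out.println(jobProperties);
    }
}
```

The key design point is that the handler must decide at planning time which table attributes the execution-time components will need, since nothing else from the table definition is carried over.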

...

The HiveMetaHook interface is reproduced below; for details, see
the Javadoc in the code:

Code Block
package org.apache.hadoop.hive.metastore;

import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.hadoop.hive.metastore.api.Table;

public interface HiveMetaHook {
  public void preCreateTable(Table table)
    throws MetaException;
  public void rollbackCreateTable(Table table)
    throws MetaException;
  public void commitCreateTable(Table table)
    throws MetaException;
  public void preDropTable(Table table)
    throws MetaException;
  public void rollbackDropTable(Table table)
    throws MetaException;
  public void commitDropTable(Table table, boolean deleteData)
    throws MetaException;
}

Note that regardless of whether a remote Thrift metastore process is
used in the Hive configuration, meta hook calls are always made from
the Hive client JVM (never from the Thrift metastore server). This
means that the jar containing the storage handler class needs to be
available on the client, but not on the Thrift server.

Also note that there is no facility for two-phase commit in metadata
transactions against the Hive metastore and the storage handler. As a
result, there is a small window in which a crash during DDL can lead
to the two systems getting out of sync.
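The intended call sequencing around a CREATE, and where the crash window sits, can be sketched with a toy driver. This is not Hive code: the hook is reduced to the three create-related methods with the Table argument simplified to a string, and the metastore write is simulated, so the example is self-contained:

```java
import java.util.ArrayList;
import java.util.List;

public class MetaHookLifecycle {

    // Simplified stand-in for the create-related HiveMetaHook methods.
    interface SimpleHook {
        void preCreateTable(String table);
        void rollbackCreateTable(String table);
        void commitCreateTable(String table);
    }

    // Runs the create sequence: preCreateTable, then the metastore write,
    // then commitCreateTable on success or rollbackCreateTable on failure.
    static void createTable(String table, SimpleHook hook,
                            boolean metastoreFails, List<String> log) {
        hook.preCreateTable(table);
        try {
            if (metastoreFails) {
                throw new RuntimeException("simulated metastore failure");
            }
            log.add("metastore: created " + table);
            // A crash right here, after the metastore write but before the
            // commit call, is the window in which the two systems can end
            // up out of sync, since there is no two-phase commit.
            hook.commitCreateTable(table);
        } catch (RuntimeException e) {
            // Best-effort cleanup in the handler's system.
            hook.rollbackCreateTable(table);
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        SimpleHook hook = new SimpleHook() {
            public void preCreateTable(String t) { log.add("pre " + t); }
            public void rollbackCreateTable(String t) { log.add("rollback " + t); }
            public void commitCreateTable(String t) { log.add("commit " + t); }
        };

        createTable("t1", hook, false, log);  // success path
        createTable("t2", hook, true, log);   // failure path
        System.out.println(log);
    }
}
```

The rollback methods make the failure path explicit, but as noted above they can only narrow the inconsistency window, not eliminate it.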

...