Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 4.0

Motivation

Today, the warehouse infrastructure provided by Hive allows for the setup of a single shared warehouse and the authorization model allows for access control within this warehouse if needed. Growth beyond a single warehouse (or) when datacenter capacity limits are reached) OR separation of capacity usage and allocation requires the creation of multiple physical warehouses, i.e., separate Hive instanceswarehouses with each warehouse mapping to it's own Hive metastore. Let's define the term physical warehouse to map to a single Hive metastore, the Hadoop cluster it maps to and the data in it.

In organizations with a large number of teams needing warehousesa warehouse, there is a need to be able to:

  • Maximize sharing of physical clusters to keep operational costs low
  • Clearly identify and track capacity usage by teams in the data warehouse

One way to do this is to use a single shared warehouse as we do today, but this has the below issues:

  • When the warehouse reaches datacenter capacity limits, it is hard to identify self-contained pieces that can be migrated out.
  • Capacity tracking and management becomes an issue.

An alternative is to create a new physical warehouse per team (1:1 mapping), but this is not optimal since the physical resources are not shared across teams, and the operational cost is high. Further, data may not be cleanly partition-able and end up being replicated in multiple physical warehouses.

To provide context, in Facebook, we expect to have 20+ partitions of the warehouse, so operating each in their own physical warehouse will be impractical from an operational perspective.

Another alternative is to use a single shared warehouse, but this has the below issues:

...

.

Requirements

Introduce the notion of a virtual warehouse (namespace) in Hive with the below key properties:

...

  • Modeling namespaces as databases. No explicit accounting/tracking of tables/partitions/views that belong to a namespace is needed since a database provides that already.
  • Prevent access using two part name syntax (Y.T) if namespaces feature is “on” in a Hive instance. This ensures the database is self-contained.
  • Modeling table/partition imports across namespaces using a new concept called Links in Hive. There will be commands to create Links to tables in other databases, alter and drop them. Links do not make copies of the table/partition and hence avoid data duplication in the same physical warehouse.

...

  • Implement links as a first-class concept in Hive, and use a Facebook hook to disable Y.T access unless there is a link to the table Y.T.
  • Implement links as a first-class concept, and introduce a new syntax T@Y to access linked content. Disable Use a Facebook hook to disable Y.T access only in a Facebook hook.
  • Implement links as a first-class concept, and introduce a new syntax T@Y to access linked content + disable Y.T access all in OpenSource Hive (when 'namespaces' feature is turned on). This allows the Open Source Community to also use databases in Hive for creating self-contained namespaces. Disable cross database access using a new privilege. All these changes will be in Open Source Hive.

Links to JIRAS for these features:

...