Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Motivation

...

Today, the warehouse infrastructure provided by Hive allows for the setup of a single shared warehouse and the authorization model allows for access control within this warehouse if needed. Growth beyond a single warehouse (or) separation of capacity usage and allocation requires the creation of multiple physical warehouses, i.e., separate Hive instances.

...

  • When the warehouse reaches datacenter capacity limits, it is hard to identify self-contained pieces that can be migrated out.
  • Capacity tracking and management becomes an issue. =

Requirements

...

Introduce the notion of a virtual warehouse (namespace) in Hive with the below key properties:

  • Can be housed in the same physical warehouse with other virtual warehouses (multi-tenancy).
  • Portable (so it can be moved from one physical warehouse to another). Being self-contained is a necessary condition for portability (all queries on this namespace operate only on data available in the namespace).
  • Unit of capacity tracking and capacity allocation. This is a nice side effect of creating self-contained namespaces and allows capacity planning based on the virtual warehouse growth.

Mapping many namespaces to 1 physical warehouse keeps the operational cost low. If a physical warehouse reaches capacity limits, portability will allow seamless migration of the namespace to another physical warehouse.

Note that users can operate on multiple namespaces simultaneously although they are likely to most often operate within one namespace. So namespaces are not trying to solve the problem of ensuring that users only have access to a subset of data in the warehouse.

...

  • Provide metadata to identify tables and queries that belong to one namespace.
  • Provide controls to prevent operating on tables outside the namespace.
  • Provide commands to explicitly request that tables/partitions in namespace1 be made available in namespace2 (since some tables/partitions may be needed across multiple namespaces). Avoid making copies of tables/partitions for this.

...

Design

...

The design that is proposed is:

  • Modeling namespaces as databases. No explicit accounting/tracking of tables/partitions/views that belong to a namespace is needed since a database provides that already.
  • Prevent access using two part name syntax (Y.T) if namespaces feature is “on” in a Hive instance. This ensures the database is self-contained.
  • Modeling table/partition imports across namespaces using a new concept called Links in Hive. There will be commands to create Links to tables in other databases, alter and drop them. Links do not make copies of the table/partition and hence avoid data duplication in the same physical warehouse.

Let’s take a concrete example:

  • Namespace A resides in database A, namespace B in database B.
  • Access across these namespace using A.T or B.T syntax is disabled in ‘namespace’ mode.
  • The user is importing table T1 from B into A .
  • The user issues a CREATE LINK command, which creates metadata in the target namespace A for the table + metadata to indicate which object it is linked to.
  • The ALTER LINK ADD PARTITION command is used to add partitions to the link.
  • These partitions are modeled by replicating partition-level metadata in the target database A for the accessible partitions.
  • The Link can be dynamic, which means it is kept updated as the source table gets new partitions or drops partitions.

...

A basic tenet of our design is that a Hive instance does not operate across physical warehouses. We are building a namespace service external to Hive that has metadata on namespace location across the Hive instances, and allows importing data across Hive instances using replication.=

Alternate design options

...

...

...

...

...