...

  • When the warehouse reaches datacenter capacity limits, it is hard to identify self-contained pieces that can be migrated out.
  • Capacity tracking and management become an issue.

    Requirements

    Introduce the notion of a virtual warehouse (namespace) in Hive with the following key properties:
  • Can be housed in the same physical warehouse with other virtual warehouses (multi-tenancy).
  • Portable (so it can be moved from one physical warehouse to another). Being self-contained is a necessary condition for portability (all queries on this namespace operate only on data available in the namespace).
  • Unit of capacity tracking and capacity allocation. This is a nice side effect of creating self-contained namespaces and allows capacity planning based on the virtual warehouse growth.
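Because each virtual warehouse is self-contained, its footprint can in principle be computed by summing over the tables it owns. The Python sketch below illustrates that idea only. The function names, the size map, and the per-namespace quota are hypothetical and are not part of Hive. How linked tables should be counted is not specified here, so the sketch counts owned tables only.

```python
from collections import defaultdict


def usage_by_namespace(table_sizes):
    """Sum table sizes per namespace.

    table_sizes maps (namespace, table) -> size in bytes (hypothetical input).
    """
    totals = defaultdict(int)
    for (namespace, _table), size in table_sizes.items():
        totals[namespace] += size
    return dict(totals)


def over_quota(table_sizes, quotas):
    """Return the namespaces whose usage exceeds their allocated quota.

    Namespaces without an entry in quotas are treated as unlimited
    (an assumption made for this sketch).
    """
    usage = usage_by_namespace(table_sizes)
    return sorted(
        ns for ns, used in usage.items()
        if used > quotas.get(ns, float("inf"))
    )
```

Tracking growth per namespace over time is what would feed capacity planning. The sketch only shows the aggregation step.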

...

  • Provide metadata to identify tables and queries that belong to one namespace.
  • Provide controls to prevent operating on tables outside the namespace.
  • Provide commands to explicitly request that tables/partitions in namespace1 be made available in namespace2 (since some tables/partitions may be needed across multiple namespaces). Avoid making copies of tables/partitions for this.

    Design

    The proposed design is:
  • Modeling namespaces as databases. No explicit accounting/tracking of tables/partitions/views that belong to a namespace is needed since a database provides that already.
  • Prevent access using two-part name syntax (Y.T) if the namespaces feature is “on” in a Hive instance. This ensures the database is self-contained.
  • Modeling table/partition imports across namespaces using a new concept called Links in Hive. There will be commands to create, alter, and drop Links to tables in other databases. Links do not make copies of the table/partition and hence avoid data duplication in the same physical warehouse.
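As a rough illustration of how the three design points fit together, the Python sketch below models a metastore in which each database is a namespace, two-part names are rejected when the namespaces feature is on, and a Link exposes a table from another database without copying it. All names here (`Metastore`, `create_link`, `resolve`, `NamespaceError`) are hypothetical and are not Hive APIs. The sketch also assumes a Link takes the same name as its source table, which the design above does not state.

```python
class NamespaceError(Exception):
    """Raised when a name would reach outside the current namespace."""


class Metastore:
    def __init__(self, namespaces_on=True):
        self.namespaces_on = namespaces_on
        self.tables = {}  # (database, table) -> storage location
        self.links = {}   # (database, link name) -> (source database, source table)

    def create_table(self, db, table, location):
        self.tables[(db, table)] = location

    def create_link(self, db, source_db, source_table):
        # A Link records a reference only; no data is copied.
        if (source_db, source_table) not in self.tables:
            raise KeyError(f"{source_db}.{source_table} does not exist")
        self.links[(db, source_table)] = (source_db, source_table)

    def drop_link(self, db, name):
        del self.links[(db, name)]

    def resolve(self, current_db, name):
        """Return the storage location that `name` refers to in current_db."""
        if "." in name:
            # Two-part syntax (Y.T) is blocked when namespaces are on.
            if self.namespaces_on:
                raise NamespaceError(f"two-part name not allowed: {name}")
            db, table = name.split(".", 1)
            return self.tables[(db, table)]
        if (current_db, name) in self.tables:
            return self.tables[(current_db, name)]
        if (current_db, name) in self.links:
            # Follow the Link to the table owned by another database.
            return self.tables[self.links[(current_db, name)]]
        raise KeyError(f"{name} is not visible in {current_db}")


ms = Metastore()
ms.create_table("ads", "clicks", "/warehouse/ads.db/clicks")
ms.create_link("search", "ads", "clicks")
location = ms.resolve("search", "clicks")  # same location as the table in ads
```

In this model the only way for a query in `search` to reach `ads.clicks` is through an explicit Link, which is what keeps each database self-contained while avoiding duplicate data.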

...

Links to JIRAs for these features:

A basic tenet of our design is that a Hive instance does not operate across physical warehouses. We are building a namespace service external to Hive that has metadata on namespace location across the Hive instances, and allows importing data across Hive instances using replication.

...