Use Cases

Inside facebook, we are running out of power inside a data center (physical cluster), and we have a need to have a bigger cluster.
We can divide the cluster into multiple clusters - multiple hive instances, multiple mr and multiple dfs. This will put a burden on
the user - he needs to know which cluster to use. It will be very difficult to support joins across tables in different clusters, and
will lead to a lot of duplication of data in the long run. To get around these problems, we are planning to extend hive to span
multiple data centers, and make the existence of multiple clusters transparent to the end users in the long run. Note that, even
today, different partitions/tables can span multiple dfs's, and hive does not enforce any restrictions. Those dfs's can be in different
data centers also. However, a single table/partition can only have a single location. We need to enhance this. We will not be able to
partition our warehouse cluster into multiple disjoint clusters, and therefore some tables/partitions will be present in multiple clusters.

Requirements

In order to do so, we need to make some changes in hive, and this document primarily talks about those. The changes should be generic
enough, so that they can be used by others (outside Facebook) also, if they have such a requirement. The following restrictions have
been imposed to simplify the problem:

The same idea can be extended for partitioned tables.

The ClusterStorageDescriptor contains the following:
ClusterName
Location

location will be removed from the StorageDescriptor.

In order to support the above, hive metastore needs to be enhanced to have the concept of a cluster. The following parameters will
be added:

The existing thrift API's will continue to work as if the user is trying to access the default cluster.
New APIs will be added which take the cluster as a new parameter. Almost all the existing APIs will be
enhanced to support this. The behavior will be the same as if, the user issued the command 'USE CLUSTER <CLUSTERNAME>

If the user does not intend of use multiple clusters, there should be no change in the behavior of hive commands, and all the
existing thrift APIs should continue to work.

An upgrade script will be provided to upgrade the metastore with the patch.