Use Cases
Inside facebook, we are running out of power inside a data center (physical cluster), and we have a need to have a bigger cluster.
We can divide the cluster into multiple clusters - multiple hive instances, multiple mr and multiple dfs. This will put a burden on
the user - he needs to know which cluster to use. It will be very difficult to support joins across tables in different clusters, and
will lead to a lot of duplication of data in the long run. To get around these problems, we are planning to extend hive to span
multiple data centers, and make the existence of multiple clusters transparent for the end users in the long term. Note that, even
today, different partitions/tables can span multiple dfs's, and hive does not enforce any restrictions. Those dfs's can be in different
data centers also. However, a single table/partition can only have a single location. We need to enhance this. We will not be able to
partition our warehouse cluster into multiple disjoint clusters, and therefore some tables/partitions will be present in multiple clusters.
In order to do so, we need to make some changes in hive, and this document primarily talks about those. The changes should be generic
enough, so that they can be used by others (outside Facebook) also, if they have such a requirement. The following restrictions have
been imposed to simplify the problem:
- There will be a single hive instance. possibly spanning multiple clusters (both dfs and mr)
- There will be a single hive metastore to keep track of the table/partition locations across different clusters.
- A table/partition can exist in more than one cluster. However, a single table will have a primary cluster, and can have multiple
secondary clusters. A table T1's primary cluster is C1 meaning :1) C1 contains
all data that is available in all other clusters. 2) write is only
allowed in this cluster for table C1. but need to allow exceptions
here 3) new partitions are only allowed to be created in C1. 4) all
data changes to T1 happened in the primary cluster should be
replicated to other clusters if there are any secondary clusters. but
there should be a conf to disable it as there are some exception
situations.
We are planning to make hive run across multiple data centers (physical clusters). We prefer to use hive metastore to provide a
unified namespace.
Tables/partitions can exist in more than one cluster. And one cluster
is defined as a primary cluster. A primary cluster is a table level
property. A table T1's primary cluster is C1 meaning :1) C1 contains
all data that is available in all other clusters. 2) write is only
allowed in this cluster for table C1. but need to allow exceptions
here 3) new partitions are only allowed to be created in C1. 4) all
data changes to T1 happened in the primary cluster should be
replicated to other clusters if there are any secondary clusters. but
there should be a conf to disable it as there are some exception
situations.
The first thing that needs to be done is to make hive metastore have a
concept of cluster. And that also means all thrift communication calls
to metastore need to provide a cluster parameter. So we have there
options here:
1) add a cluster parameter to existing thrift interfaces
or
2) add new interfaces which do exactly the same set of functionalities
as old ones but using a different name (use _on_cluster suffifx
maybe?) and have a cluster parameter
or
3) overwrite database name for the purpose of cluster name. And allow
a table co-exist in multiple databases. But that require to promote
table to top level citizen, and degrade database. For example, "show
tables" used to scan all tables in current db, but now need to scan
all tables in all databases.
We would like to get more ideas about which one to choose, and we are
definitely open to other alternatives that we missed here.