Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Inside facebook, we are running out of power inside a data center (physical cluster), and we have a need to have a bigger cluster.
We can divide the cluster into multiple clusters - multiple hive instances, multiple mr and multiple dfs. This will put a burden on
the user - he needs to know which cluster to use. It will be very difficult to support joins across tables in different clusters, and
will lead to a lot of duplication of data in the long run. To get around these problems, we are planning to extend hive to span
multiple data centers, and make the existence of multiple clusters transparent for the end users in the long term. Note that, even
today, different partitions/tables can span multiple dfs's, and hive does not enforce any restrictions. Those dfs's can be in different
data centers also. However, a single table/partition can only have a single location. We need to enhance this. We will not be able to
partition our warehouse cluster into multiple disjoint clusters, and therefore some tables/partitions will be present in multiple clusters.

In order to do so, we need to make some changes in hive, and this document primarily talks about those. The changes should be generic
enough, so that they can be used by others (outside Facebook) also, if they have such a requirement. The following restrictions have
been imposed to simplify the problem:

  • There will be a single hive instance. possibly spanning multiple clusters (both dfs and mr)
  • There will be a single hive metastore to keep track of the table/partition locations across different clusters.
  • A table/partition can exist in more than one cluster. However, a single

We are planning to make hive run across multiple data centers (physical clusters). We prefer to use hive metastore to provide a
unified namespace.

...