Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

= Hive across multiple data centers (physical clusters) =

Table of Contents

Use Cases

Inside facebook, we are running out of power inside a data center (physical cluster), and we have a need to have a bigger cluster. We can divide the cluster into multiple clusters - multiple hive instances, multiple mr and multiple dfs. This will put a burden on the user - he needs to know which cluster to use. It will be very difficult to support joins across tables in different clusters, and will lead to a lot of duplication of data in the long run. To get around these problems,
We are planning to make hive run across multiple data centers
(physical clusters). We prefer to use hive metastore to provide a
unified namespace.

Tables/partitions can exist in more than one cluster. And one cluster
is defined as a primary cluster. A primary cluster is a table level
property. A table T1's primary cluster is C1 meaning :1) C1 contains
all data that is available in all other clusters. 2) write is only
allowed in this cluster for table C1. but need to allow exceptions
here 3) new partitions are only allowed to be created in C1. 4) all
data changes to T1 happened in the primary cluster should be
replicated to other clusters if there are any secondary clusters. but
there should be a conf to disable it as there are some exception
situations.

...