...

There will be a single Hive instance, possibly spanning multiple clusters (both DFS and MR).
There will be a single Hive metastore to keep track of the table/partition locations across different clusters.
A table/partition can exist in more than one cluster. A table will have a single primary cluster, and can have multiple
secondary clusters.
Table and partition metadata will be extended to record multiple clusters/locations for the same table.
All the data for a table is available in the primary cluster, but a subset can be available in the secondary cluster.
However, an object (unpartitioned table/partition) is either fully present or not present at all in the secondary cluster.
It is not possible to have partial data of a partition in the secondary cluster.
The user can only update the table (or its partition) in the primary cluster.
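The placement rules above can be sketched as a small data model. This is only an illustration, not the metastore schema: the class `StoredObject`, its field names, and the cluster names and paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """An unpartitioned table or a single partition (hypothetical model)."""
    name: str
    primary_cluster: str
    # cluster name -> location of a COMPLETE copy of this object.
    # A cluster either holds the whole object or does not appear here at all.
    locations: dict = field(default_factory=dict)

    def add_copy(self, cluster: str, location: str) -> None:
        """Record that a full copy of the object exists on `cluster`."""
        self.locations[cluster] = location

    def is_available_on(self, cluster: str) -> bool:
        return cluster in self.locations

    def can_update_on(self, cluster: str) -> bool:
        # Updates are only allowed in the primary cluster.
        return cluster == self.primary_cluster


# Hypothetical partition with its primary copy on clusterA.
part = StoredObject(
    name="sales/ds=2012-01-01",
    primary_cluster="clusterA",
    locations={"clusterA": "hdfs://clusterA/warehouse/sales/ds=2012-01-01"},
)
part.add_copy("clusterB", "hdfs://clusterB/warehouse/sales/ds=2012-01-01")
```

After `add_copy`, the partition is readable on `clusterB` but still only updatable on `clusterA`.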
The following mapping will be added: Cluster -> JobTracker.
By default, the user will not specify any cluster for the session, and the behavior will be as follows:
- The query will be processed in a single cluster, and will use the JobTracker for that cluster.
- If the primary cluster of any output table is different from the query processing cluster, an error is thrown.
  So, a multi-table insert with tables belonging to different primary clusters will always fail.
- If an input table's primary cluster is different from the query processing cluster, the query will only succeed
  if all the partitions for that input table are also present on the query processing cluster.
- If an output is specified, the primary cluster for that output will be used.
- If the output specified is a new table, the output is not used in determining the query processing cluster.
- If no output is specified (or the output is a new table), and there are multiple inputs for the query, the input tables'
  primary clusters are tried one by one until a valid cluster is found.
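
The default behavior described above can be sketched roughly as follows. This is one reading of the rules, not Hive code: the dictionary layout, the `is_new` flag, the function names, and the JobTracker addresses are assumptions made for illustration. A query with no inputs and no existing output is not covered by the rules above and is treated as an error here.

```python
class ClusterResolutionError(Exception):
    """Raised when no single cluster can process the query."""


# Hypothetical Cluster -> JobTracker mapping.
JOB_TRACKERS = {"A": "jt-a.example.com:8021", "B": "jt-b.example.com:8021"}


def input_readable_on(table, cluster):
    """An input is readable on its primary cluster, or on a cluster
    that holds a full copy of every one of its partitions."""
    if table["primary"] == cluster:
        return True
    # Assumption: an unpartitioned table is listed as a single entry.
    copies = list(table["partitions"].values())
    return bool(copies) and all(cluster in c for c in copies)


def choose_cluster(inputs, outputs):
    """Pick the query processing cluster when the session names none."""
    # New tables do not take part in choosing the cluster.
    existing = [t for t in outputs if not t.get("is_new", False)]
    if existing:
        primaries = {t["primary"] for t in existing}
        if len(primaries) > 1:
            raise ClusterResolutionError("outputs have different primary clusters")
        candidates = [primaries.pop()]
    else:
        # Try each input's primary cluster, in order, without repeats.
        candidates = []
        for t in inputs:
            if t["primary"] not in candidates:
                candidates.append(t["primary"])
    for cluster in candidates:
        if all(input_readable_on(t, cluster) for t in inputs):
            return cluster
    raise ClusterResolutionError("no single cluster holds all inputs")


# t1 lives primarily on A but is fully copied to B; t2 exists only on B.
t1 = {"primary": "A", "partitions": {"p1": {"A", "B"}, "p2": {"A", "B"}}}
t2 = {"primary": "B", "partitions": {"p1": {"B"}}}
chosen = choose_cluster(inputs=[t1, t2], outputs=[])
job_tracker = JOB_TRACKERS[chosen]
```

In this example cluster `A` is tried first and rejected because `t2` has no copy there, so the query runs on `B`. Writing the same join into an existing table whose primary cluster is `A` would raise `ClusterResolutionError`.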

...
