...

  • There will be a single hive instance, possibly spanning multiple clusters (both dfs and mr).
  • There will be a single hive metastore to keep track of the table/partition locations across different clusters.
  • A table/partition can exist in more than one cluster. A table will have a single primary cluster, and can have multiple
    secondary clusters.
  • Table/Partition's metadata will be enhanced to support multiple clusters/locations of the table.
  • All the data for a table is available in the primary cluster, but a subset can be available in the secondary cluster.
    However, an object (unpartitioned table/partition) is either fully present or not present at all in the secondary cluster.
    It is not possible to have partial data of a partition in the secondary cluster.
  • The user can only update the table (or its partition) in the primary cluster.
  • The following mapping will be added: Cluster -> JobTracker
  • By default, the user will not specify any cluster for the session, and the behavior will be as follows:
    • The query will be processed in a single cluster, and use the jobtracker for that cluster.
    • If the primary cluster of any output table is different from the query processing cluster, an error is thrown.
      So, a multi-table insert with tables belonging to different primary clusters will always fail.
    • If the primary cluster of the input and output tables is the same, the jobtracker corresponding to that primary cluster
      will be used to run the query. The output will be created in the primary cluster of the table.
    • If the primary clusters of the input table and the output table don't match, the primary cluster of the output table
      will be used to process the query.
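
The default rules above can be sketched as a small selection function. This is only an illustration in Python, not hive code: `Table`, `ClusterMismatchError`, `pick_processing_cluster` and the jobtracker addresses are hypothetical names. Falling back to a session default when a query has no output table is an assumption, since the rules above do not cover that case. Under these rules the primary clusters of the input tables do not affect the choice, so the function only looks at the output tables.

```python
from dataclasses import dataclass


class ClusterMismatchError(Exception):
    """Raised when the output tables do not share one primary cluster."""


@dataclass(frozen=True)
class Table:
    # Hypothetical stand-in for a table's metadata entry.
    name: str
    primary_cluster: str


def pick_processing_cluster(outputs, job_trackers, session_default=None):
    """Return (cluster, jobtracker) for a query writing to `outputs`.

    job_trackers is the Cluster -> JobTracker mapping.
    """
    output_clusters = {t.primary_cluster for t in outputs}

    # A multi-table insert across different primary clusters always fails.
    if len(output_clusters) > 1:
        raise ClusterMismatchError(
            "output tables span clusters: %s" % sorted(output_clusters)
        )

    if output_clusters:
        # The primary cluster of the output table processes the query,
        # whether or not the input tables share that primary cluster.
        cluster = next(iter(output_clusters))
    elif session_default is not None:
        # Assumption: with no output table, use the session default.
        cluster = session_default
    else:
        raise ValueError("no output table and no default cluster")

    return cluster, job_trackers[cluster]
```

For example, with a mapping such as `{"east": "jt-east:8021", "west": "jt-west:8021"}`, a query whose only output table has primary cluster `west` would be routed to the `west` jobtracker.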

There will be a default cluster for the session (a configuration parameter). Commands will be added to change the cluster.

    • Use cluster <ClusterName>
  • Eventually, hive will provide utilities to copy a table/partition from the primary cluster to the secondary clusters.
    In the first cut, the user needs to perform this operation outside hive. One simple way to do so is to distcp the partition
    from the primary to the secondary cluster, and then update the metadata directly via the thrift api.
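
To make the metadata changes more concrete, the following Python sketch models one object (an unpartitioned table or a single partition) with a primary cluster and optional secondary copies. It is not the metastore schema or the thrift api: `StoredObject`, its fields and its methods are hypothetical, and the paths are made up. `add_secondary` stands for the metadata update the user performs after copying the data out of band, for instance with distcp.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """Hypothetical metadata for an unpartitioned table or one partition."""

    name: str
    primary_cluster: str
    primary_location: str
    # cluster name -> location of a complete copy in that cluster
    secondary_locations: dict = field(default_factory=dict)

    def add_secondary(self, cluster, location):
        """Record a full copy; call only after the whole object is copied."""
        if cluster == self.primary_cluster:
            raise ValueError("cluster is already the primary cluster")
        self.secondary_locations[cluster] = location

    def location_in(self, cluster):
        """Return the object's location in `cluster`, or None if absent.

        An object is either fully present or not present at all, so one
        location per cluster is enough; there is no partial state.
        """
        if cluster == self.primary_cluster:
            return self.primary_location
        return self.secondary_locations.get(cluster)

    def check_writable_from(self, cluster):
        """Updates are only allowed in the primary cluster."""
        if cluster != self.primary_cluster:
            raise PermissionError(
                "%s can only be updated in %s" % (self.name, self.primary_cluster)
            )
```

A partition registered this way answers `location_in("west")` with `None` until the copy to `west` has been recorded, which mirrors the rule that a secondary cluster never holds partial data for a partition.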

...