Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Wiki Markup
This page documents Correlation Optimizer. It was originally introduced by [HIVE-2206|https://issues.apache.org/jira/browse/HIVE-2206] and based on the idea of YSmart \[1\].

1. Overview

In Hadoop environments, an SQL query submitted to Hive will be evaluated in distributed systems. Thus, after generating a query operator tree representing the submitted SQL query, Hive needs to determine what operations can be executed in a task which will be evalauted in a single node. Also, since a MapReduce job can shuffle data data once, Hive also needs to cut the tree to multiple MapReduce jobs. It is important to cut an operator tree to multiple MapReduce in a good way, so the generated plan can evaluate the query efficiently.

...

Before we integrating Correlation Optimizer into Hive, Hive has ReduceSink Deduplication Optimizer which can figure out if we need to shuffle data for chained operators. However, to support more complex operator trees, we need a more general-purpose optimizer and a mechanism to correctly execute optimized plan. Thus, we have designed and implemented Correlation Optimizer and two operators for evaluating optimized plans. It is worth noting that it is better to use ReduceSink Deduplication Optimizer to handle simple cases first and then use Correlation Optimizer to handle more complex cases.

2. Examples

At first, let's take a look at three examples. For every query, we show the original operator tree generated by Hive and the optimized operator tree. To be concise, we only show the following operators, which are FileSinkOperator (FS), GroupByOperator (AGG), HashTableSinkOperator (HT), JoinOperator (JOIN), MapJoinOperator (MJ), and ReduceSinkOperator (RS). Also, in every query, we add comments (e.g. /*JOIN1*/) to indicate the node in the operator tree that an operation belongs to.

...


Anchor
figure6
figure6

Figure 6: The optimized operator tree of Example 3

3. Intra-query Correlations

In Hive, a submitted SQL query needs to be evaluated in a distributed system. When evaluating a query, data may need to shuffled sometimes. Based on the nature of different data operations, operators in Hive can be divided to two categories.

...

  1. Input Correlation: A input table is used by multiple MapReduce tasks in the original operator tree.
  2. Job Flow Correlation: Two dependent MapReduce tasks shuffle the data in the same way.

4. Correlation Detection

5. Operator Tree Transformation

6. Executing Optimized Operator Tree in the Reduce Phase

Currently, blocking operators in the reduce phase operator tree share the same keys. Other cases will be supported in future work.

6.1 Executing Operator Tree with Same Keys

7. Related Jiras

The umbrella jira is HIVE-3667.

...

8. References

  1. Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He, Xiaodong Zhang. YSmart: Yet another SQL-to-MapReduce Translator, ICDCS, 2011