Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 4.0

Wiki MarkupThis page documents Correlation Optimizer. It was originally introduced by [HIVE-2206|https://issues.apache.org/jira/browse/HIVE-2206] and based on the idea of YSmart \ [1\].

1. Overview

In Hadoop environments, an SQL query submitted to Hive will be evaluated in distributed systems. Thus, after generating a query operator tree representing the submitted SQL query, Hive needs to determine what operations can be executed in a task which will be evalauted in a single node. Also, since a MapReduce job can shuffle data data once, Hive also needs to cut the tree to multiple MapReduce jobs. It is important to cut an operator tree to multiple MapReduce in a good way, so the generated plan can evaluate the query efficiently.

...

For example, in Figure 1 (we also show it below), the process of correlation detection is described as follows.

  1. Wiki MarkupThe tree walker visits {{RS4}}. We set {{topReduceSinkOperators=\[RS4\]}}.
  2. From RS4, we track sorting columns and partitioning columns of RS4 backward until we reach RS2 (because tmp1.key is from the left table of JOIN1).
    1. Check if RS4 and RS2 are using the same sorting columns, sorting orders, and same partitioning columns. Also, we check if RS4 and RS2 do not have any conflict on the number of reducers. In this example, all of these checks pass.
    2. Because RS4 and RS2 are correlated and the child of RS2 is a JoinOperator, we analyze if we can consider RS3 as a correlated ReduceSinkOperator of RS4. In this example, JOIN1 is an inner join operation. So, RS4 and RS3 are also correlated. Because both parents of the JOIN1 are correlated ReduceSinkOperators, we can continue to search ReduceSinkOperators from both RS2 and RS3.
  3. From RS2, we track sorting columns and partitioning columns of RS2 backward until we reach RS1.
    1. Check if RS2 and RS1 are using the same sorting columns, sorting orders, and same partitioning columns. Also, we check if RS2 and RS1 do not have any conflict on the number of reducers. In this example, all of these checks pass. So, RS2 and RS1 are correlated.
  4. Because there is no ReduceSinkOperator we can track backward from RS1, we add RS1 to bottomReduceSinkOperators.
  5. Because there is no ReduceSinkOperator we can track backward from RS3, we add RS3 to bottomReduceSinkOperators.
  6. Wiki MarkupWe have {{topReduceSinkOperators=\[RS4\]}} and {{bottomReduceSinkOperators=\[RS1, RS3\]}}. Because {{topReduceSinkOperators}} and {{bottomReduceSinkOperators}} are not the same, we have found a sub-tree with correlations. This sub-tree starts from {{RS1}} and {{RS3}}, and all {{ReduceSinkOperators}} in this sub-tree are {{RS1}}, {{RS2}}, {{RS3}}, and {{RS4}}.
  7. There is no ReduceSinkOperator which needs to be visited. The process of correlation detection stops.

...