This page documents Correlation Optimizer. It was originally introduced by [HIVE-2206|https://issues.apache.org/jira/browse/HIVE-2206] and based on the idea of YSmart \ [1\]. Wiki Markup
1. Overview
In Hadoop environments, an SQL query submitted to Hive will be evaluated in distributed systems. Thus, after generating a query operator tree representing the submitted SQL query, Hive needs to determine what operations can be executed in a task which will be evalauted in a single node. Also, since a MapReduce job can shuffle data data once, Hive also needs to cut the tree to multiple MapReduce jobs. It is important to cut an operator tree to multiple MapReduce in a good way, so the generated plan can evaluate the query efficiently.
...
For example, in Figure 1 (we also show it below), the process of correlation detection is described as follows.
The tree walker visits {{Wiki Markup RS4
}}. We set {{topReduceSinkOperators=
\[RS4
\]
}}.- From
RS4
, we track sorting columns and partitioning columns ofRS4
backward until we reachRS2
(becausetmp1.key
is from the left table ofJOIN1
).- Check if
RS4
andRS2
are using the same sorting columns, sorting orders, and same partitioning columns. Also, we check ifRS4
andRS2
do not have any conflict on the number of reducers. In this example, all of these checks pass. - Because
RS4
andRS2
are correlated and the child ofRS2
is a JoinOperator, we analyze if we can considerRS3
as a correlatedReduceSinkOperator
ofRS4
. In this example,JOIN1
is an inner join operation. So,RS4
andRS3
are also correlated. Because both parents of theJOIN1
are correlatedReduceSinkOperators
, we can continue to searchReduceSinkOperators
from bothRS2
andRS3
.
- Check if
- From
RS2
, we track sorting columns and partitioning columns ofRS2
backward until we reachRS1
.- Check if
RS2
andRS1
are using the same sorting columns, sorting orders, and same partitioning columns. Also, we check ifRS2
andRS1
do not have any conflict on the number of reducers. In this example, all of these checks pass. So,RS2
andRS1
are correlated.
- Check if
- Because there is no
ReduceSinkOperator
we can track backward fromRS1
, we addRS1
tobottomReduceSinkOperators
. - Because there is no
ReduceSinkOperator
we can track backward fromRS3
, we addRS3
tobottomReduceSinkOperators
. We have {{Wiki Markup topReduceSinkOperators=
\[RS4
\]
}} and {{bottomReduceSinkOperators=
\[RS1,
RS3
\]
}}. Because {{topReduceSinkOperators
}} and {{bottomReduceSinkOperators
}} are not the same, we have found a sub-tree with correlations. This sub-tree starts from {{RS1
}} and {{RS3
}}, and all {{ReduceSinkOperators
}} in this sub-tree are {{RS1
}}, {{RS2
}}, {{RS3
}}, and {{RS4
}}.- There is no
ReduceSinkOperator
which needs to be visited. The process of correlation detection stops.
...