...
    SELECT count(distinct ws1.ws_order_number) as order_count,
           sum(ws1.ws_ext_ship_cost) as total_shipping_cost,
           sum(ws1.ws_net_profit) as total_net_profit
    FROM web_sales ws1
    JOIN /*MJ1*/ customer_address ca ON (ws1.ws_ship_addr_sk = ca.ca_address_sk)
    JOIN /*MJ2*/ web_site s ON (ws1.ws_web_site_sk = s.web_site_sk)
    JOIN /*MJ3*/ date_dim d ON (ws1.ws_ship_date_sk = d.d_date_sk)
    LEFT SEMI JOIN /*JOIN4*/ (
      SELECT ws2.ws_order_number as ws_order_number
      FROM web_sales ws2
      JOIN /*JOIN1*/ web_sales ws3 ON (ws2.ws_order_number = ws3.ws_order_number)
      WHERE ws2.ws_warehouse_sk <> ws3.ws_warehouse_sk
    ) ws_wh1 ON (ws1.ws_order_number = ws_wh1.ws_order_number)
    LEFT SEMI JOIN /*JOIN4*/ (
      SELECT wr_order_number
      FROM web_returns wr
      JOIN /*JOIN3*/ (
        SELECT ws4.ws_order_number as ws_order_number
        FROM web_sales ws4
        JOIN /*JOIN2*/ web_sales ws5 ON (ws4.ws_order_number = ws5.ws_order_number)
        WHERE ws4.ws_warehouse_sk <> ws5.ws_warehouse_sk
      ) ws_wh2 ON (wr.wr_order_number = ws_wh2.ws_order_number)
    ) tmp1 ON (ws1.ws_order_number = tmp1.wr_order_number)
    WHERE d.d_date >= '2001-05-01'
      and d.d_date <= '2001-06-30'
      and ca.ca_state = 'NC'
      and s.web_company_name = 'pri';
...
- Input Correlation: An input table is used by multiple MapReduce tasks in the original operator tree.
- Job Flow Correlation: Two dependent MapReduce tasks shuffle the data in the same way.
4. Correlation Detection
On the optimization side, the Correlation Optimizer is implemented in the class CorrelationOptimizer, which is part of the package org.apache.hadoop.hive.ql.optimizer.correlation. It works on the operator tree before the tree is cut into multiple MapReduce tasks. The optimizer detects correlations and transforms the operator tree accordingly. This section covers correlation detection; the next section describes how an operator tree is transformed based on the detected correlations.
To detect correlations, we walk the tree starting from the FileSinkOperator (using DefaultGraphWalker) and stop at every ReduceSinkOperator. From this ReduceSinkOperator and its peer ReduceSinkOperators (in the case of a JoinOperator), we search for correlated ReduceSinkOperators in the upstream direction (the direction of parent operators), layer by layer. The ReduceSinkOperators from which the search starts are called topReduceSinkOperators. The search returns all ReduceSinkOperators at the lowest layers we can reach, as a list called bottomReduceSinkOperators. The optimizer then compares topReduceSinkOperators with bottomReduceSinkOperators to decide whether it has found a sub-tree with correlations: if the two lists are not the same, we consider that we have found job flow correlations. In that case, we mark the ReduceSinkOperators belonging to the sub-tree, so the tree walker will not visit them again. The optimizer then continues to walk the tree.
It is worth noting that if hive.auto.convert.join=true, we first check whether any JoinOperator will later be automatically converted to a MapJoinOperator by CommonJoinResolver. During correlation detection, we stop searching a branch when we reach such a JoinOperator.
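The upstream search can be illustrated with a minimal Python sketch. It is not the Hive implementation: the Op class, the will_be_map_join flag, and the helper names are hypothetical, and the sketch assumes that every ReduceSinkOperator it reaches already passes the pairwise correlation checks and the join-type analysis. The example tree arranges its ReduceSinkOperators like the example that follows; the intermediate TS and GBY operators are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Hypothetical stand-in for an operator in the operator tree."""
    name: str
    kind: str                       # "RS", "JOIN", "GBY", "TS", ...
    parents: list = field(default_factory=list)
    will_be_map_join: bool = False  # assumed flag: join will be converted later


def nearest_upstream_rs(op):
    """Return the closest ReduceSinkOperators above op.

    Returns None if the branch reaches a join that will become a map join,
    which means the search must stop on this branch.
    """
    found = []
    for parent in op.parents:
        if parent.kind == "JOIN" and parent.will_be_map_join:
            return None
        if parent.kind == "RS":
            found.append(parent)
        else:
            upper = nearest_upstream_rs(parent)
            if upper is None:
                return None
            found.extend(upper)
    return found


def find_bottom_reduce_sinks(top_rs_list):
    """Search upstream layer by layer and collect the lowest reachable RS operators."""
    bottom = []
    frontier = list(top_rs_list)
    while frontier:
        next_layer = []
        for rs in frontier:
            upstream = nearest_upstream_rs(rs)
            if upstream:
                next_layer.extend(upstream)
            else:
                bottom.append(rs)   # nothing further to track backward from here
        frontier = next_layer
    return bottom


def has_job_flow_correlation(top_rs_list):
    """Correlations are found when the top and bottom lists differ."""
    bottom = find_bottom_reduce_sinks(top_rs_list)
    return {op.name for op in bottom} != {op.name for op in top_rs_list}


def build_example_tree(join_becomes_map_join=False):
    """Build a small hypothetical tree: RS1 -> GBY1 -> RS2 and RS3 feed JOIN1 -> RS4."""
    ts1 = Op("TS1", "TS")
    rs1 = Op("RS1", "RS", [ts1])
    gby1 = Op("GBY1", "GBY", [rs1])
    rs2 = Op("RS2", "RS", [gby1])
    ts2 = Op("TS2", "TS")
    rs3 = Op("RS3", "RS", [ts2])
    join1 = Op("JOIN1", "JOIN", [rs2, rs3], will_be_map_join=join_becomes_map_join)
    rs4 = Op("RS4", "RS", [join1])
    return {op.name: op for op in (ts1, rs1, gby1, rs2, ts2, rs3, join1, rs4)}
```

Calling find_bottom_reduce_sinks on RS4 of this tree collects RS1 and RS3, so the top and bottom lists differ. If JOIN1 is flagged as a future map join, the search stops immediately and RS4 is its own bottom, so no correlation is reported.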
For example, in Figure 1 (we also show it below), the process of correlation detection is described as follows.
- The tree walker visits RS4. We set topReduceSinkOperators=[RS4].
- From RS4, we track sorting columns and partitioning columns of RS4 backward until we reach RS2 (because tmp1.key is from the left table of JOIN1).
  - Check if RS4 and RS2 are using the same sorting columns, sorting orders, and same partitioning columns. Also, we check if RS4 and RS2 do not have any conflict on the number of reducers. In this example, all of these checks pass.
  - Because RS4 and RS2 are correlated and the child of RS2 is a JoinOperator, we analyze if we can consider RS3 as a correlated ReduceSinkOperator of RS4. In this example, JOIN1 is an inner join operation. So, RS4 and RS3 are also correlated. Because both parents of JOIN1 are correlated ReduceSinkOperators, we can continue to search ReduceSinkOperators from both RS2 and RS3.
- From RS2, we track sorting columns and partitioning columns of RS2 backward until we reach RS1.
  - Check if RS2 and RS1 are using the same sorting columns, sorting orders, and same partitioning columns. Also, we check if RS2 and RS1 do not have any conflict on the number of reducers. In this example, all of these checks pass. So, RS2 and RS1 are correlated.
- Because there is no ReduceSinkOperator we can track backward from RS1, we add RS1 to bottomReduceSinkOperators.
- Because there is no ReduceSinkOperator we can track backward from RS3, we add RS3 to bottomReduceSinkOperators.
- We have topReduceSinkOperators=[RS4] and bottomReduceSinkOperators=[RS1, RS3]. Because topReduceSinkOperators and bottomReduceSinkOperators are not the same, we have found a sub-tree with correlations. This sub-tree starts from RS1 and RS3, and all ReduceSinkOperators in this sub-tree are RS1, RS2, RS3, and RS4.
- There is no ReduceSinkOperator which needs to be visited. The process of correlation detection stops.
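The pairwise check applied in the steps above (for example, between RS4 and RS2) can be sketched in Python as follows. This is a simplified illustration, not Hive's code: ReduceSinkInfo, the "+"/"-" encoding of sorting orders, and the use of None for an unset reducer count are all assumptions, and the columns are assumed to have already been tracked backward so that they can be compared directly.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReduceSinkInfo:
    """Hypothetical summary of the properties compared between two RS operators."""
    sort_columns: Tuple[str, ...]
    sort_orders: str                     # assumed encoding: one "+" or "-" per sort column
    partition_columns: Tuple[str, ...]
    num_reducers: Optional[int] = None   # None: no fixed number of reducers (assumption)


def reducers_compatible(a, b):
    """Assume there is no conflict when either side is unset or both agree."""
    return a is None or b is None or a == b


def is_correlated(child, parent):
    """Return True when both RS operators shuffle the data in the same way."""
    return (
        child.sort_columns == parent.sort_columns
        and child.sort_orders == parent.sort_orders
        and child.partition_columns == parent.partition_columns
        and reducers_compatible(child.num_reducers, parent.num_reducers)
    )


# Illustrative values only; the column name "key" mirrors tmp1.key in the example.
rs4 = ReduceSinkInfo(("key",), "+", ("key",))
rs2 = ReduceSinkInfo(("key",), "+", ("key",), num_reducers=4)
rs_desc = ReduceSinkInfo(("key",), "-", ("key",))
rs_eight = ReduceSinkInfo(("key",), "+", ("key",), num_reducers=8)
```

With these values, rs4 and rs2 are correlated, while rs_desc fails on the sorting order and rs_eight conflicts with rs2 on the number of reducers.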
While searching for correlated ReduceSinkOperators, if the child of a correlated ReduceSinkOperator is a JoinOperator, we analyze whether the other parent ReduceSinkOperators of this JoinOperator can also be considered correlated, as follows. A JoinOperator can have multiple join conditions (joinConds), and each join condition has a left table and a right table. If a correlated ReduceSinkOperator is the left table of a join condition, we consider the ReduceSinkOperator corresponding to the right table to be correlated as well when the join type is INNER_JOIN, LEFT_OUTER_JOIN, or LEFT_SEMI_JOIN. If a correlated ReduceSinkOperator is the right table of a join condition, we consider the ReduceSinkOperator corresponding to the left table to be correlated as well when the join type is INNER_JOIN or RIGHT_OUTER_JOIN. Because a JoinOperator can have multiple join conditions, we recursively search the join conditions until we have either searched all of them or found no more correlated ReduceSinkOperators. After this analysis, if all parent ReduceSinkOperators of the JoinOperator are correlated, we continue to search for ReduceSinkOperators along this branch. Otherwise, we stop searching this branch and consider none of the parent ReduceSinkOperators of the JoinOperator to be correlated.
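The join-condition analysis described above can be sketched in Python. The sketch is illustrative only: JoinCond and the function names are hypothetical, parent ReduceSinkOperators are identified by their position among the join's inputs, and an iterative fixed-point loop is used in place of the recursive search.

```python
from dataclasses import dataclass

INNER_JOIN = "INNER_JOIN"
LEFT_OUTER_JOIN = "LEFT_OUTER_JOIN"
RIGHT_OUTER_JOIN = "RIGHT_OUTER_JOIN"
LEFT_SEMI_JOIN = "LEFT_SEMI_JOIN"
FULL_OUTER_JOIN = "FULL_OUTER_JOIN"

# Join types through which correlation spreads from the left table to the right one,
# and from the right table to the left one, as stated in the text.
LEFT_TO_RIGHT = {INNER_JOIN, LEFT_OUTER_JOIN, LEFT_SEMI_JOIN}
RIGHT_TO_LEFT = {INNER_JOIN, RIGHT_OUTER_JOIN}


@dataclass(frozen=True)
class JoinCond:
    """Hypothetical join condition: positions of the left and right inputs plus the join type."""
    left: int
    right: int
    join_type: str


def correlated_inputs(join_conds, start):
    """Return the positions of all join inputs reachable as correlated from start."""
    correlated = {start}
    changed = True
    while changed:                      # repeat until no new input is marked
        changed = False
        for cond in join_conds:
            if (cond.left in correlated and cond.right not in correlated
                    and cond.join_type in LEFT_TO_RIGHT):
                correlated.add(cond.right)
                changed = True
            if (cond.right in correlated and cond.left not in correlated
                    and cond.join_type in RIGHT_TO_LEFT):
                correlated.add(cond.left)
                changed = True
    return correlated


def all_parents_correlated(join_conds, num_parents, start):
    """The search continues past the join only if every parent is correlated."""
    return correlated_inputs(join_conds, start) == set(range(num_parents))
```

For instance, with a single LEFT_OUTER_JOIN condition, starting from the left input marks both inputs as correlated, while starting from the right input marks only the right one, so the search stops at that join.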
Right now, the process of correlation detection has a few limitations. We should improve these in our future work.
- The conditions for checking whether two ReduceSinkOperators are correlated are very restrictive. Two ReduceSinkOperators are considered correlated if they have the same sorting columns, sorting orders, and partitioning columns, and they do not conflict on the number of reducers.
- Input correlations are not explicitly detected. Right now, we only explicitly detect job flow correlations. If a sub-tree has job flow correlations, input correlations in this sub-tree are exploited automatically, because we use a single MapReduce job to evaluate the sub-tree. However, cases that have only input correlations are currently not optimized.
- If the input operator tree has multiple FileSinkOperators, we do not optimize this tree.
- If the input operator tree already has a MapJoinOperator, we do not optimize this tree.
- In the process of searching ReduceSinkOperators, if we find a GroupByOperator with grouping sets or a PTFOperator in a branch, we stop searching this branch.
5. Operator Tree Transformation
...