Subquery De-correlation

The rule InlineSubplanInputForNestedTupleSourceRule removes SubplanOperators whose nested plans contain a DataScan, InnerJoin, LeftOuterJoin, UnionAll, or Distinct operator. Given a qualifying SubplanOperator called S1, let's call its input operator O1.

General Cases

We have the following rewritings for general cases:

R1. Replace all NestedTupleSourceOperators in S1 with deep copies of the query plan rooted at O1;

R2. Add a LeftOuterJoinOperator (let's call it LJ) between O1 and the input of the SubplanOperator's root operator (let's call it SO1), where O1 is the left branch and SO1 is the right branch;

R3. The deep copy of the primary key variables in O1 should be preserved from an inlined NestedTupleSourceOperator to SO1. The join condition of LJ is the equality between the primary key variables in O1 and their deep-copied versions at SO1;

R4. A variable v indicating non-match tuples is assigned to TRUE between LJ and SO1;

R5. On top of LJ, add a GroupByOperator whose nested plan consists of S1's root operator, i.e., an aggregate operator. Below the aggregate, there is a not-null filter on variable v.
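
The five rewritings can be pictured on a toy operator tree. The following is a minimal Python sketch, not the AsterixDB implementation: the `Op` class, the operator labels, and the functions `inline_nts` and `rewrite_general` are hypothetical names invented for illustration, and the variable renaming that a real deep copy performs is omitted.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Op:
    """Hypothetical toy operator: a label, its inputs, and an optional nested plan."""
    name: str
    inputs: List["Op"] = field(default_factory=list)
    nested: Optional["Op"] = None


def inline_nts(op: Op, o1: Op) -> Op:
    """R1: replace every nested-tuple-source leaf with a fresh deep copy of O1."""
    if op.name == "nested-tuple-source":
        return copy.deepcopy(o1)
    return Op(op.name, [inline_nts(child, o1) for child in op.inputs], op.nested)


def rewrite_general(subplan: Op) -> Op:
    """Apply R1 to R5 to a subplan whose nested root is an aggregate."""
    o1 = subplan.inputs[0]        # O1: the subplan's input operator
    aggregate = subplan.nested    # S1's root operator
    so1 = aggregate.inputs[0]     # SO1: the input of S1's root operator

    right = inline_nts(so1, o1)                               # R1 (copy side of R3)
    right = Op("assign v=TRUE", [right])                      # R4
    lj = Op("left-outer-join pk(O1)=pk(copy)", [o1, right])   # R2 and R3
    not_null = Op("select not(is-null(v))", [Op("nested-tuple-source")])
    new_aggregate = Op(aggregate.name, [not_null])            # R5: aggregate over the filter
    return Op("group-by pk(O1)", [lj], nested=new_aggregate)  # R5: group-by on top of LJ


# Toy "before" plan: a subplan over InputOp whose nested plan reads the nested tuple source.
before = Op(
    "subplan",
    inputs=[Op("InputOp")],
    nested=Op("aggregate", [Op("NestedOp", [Op("nested-tuple-source")])]),
)
after = rewrite_general(before)
```

In this sketch the right branch of the join ends in a separate copy of `InputOp`, while the left branch keeps the original object, which mirrors the distinction between O1 and its deep copy.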

This is an abstract example for the rewriting mechanism described above:

Before rewriting:

--Op1
  --Subplan{
      --AggregateOp
        --NestedOp
          .....
            --Nested-Tuple-Source
    }
    --InputOp
      .....

After rewriting:

--Op1
  --GroupBy v_lc_1, ..., v_lc_n Decor v_l1, ..., v_ln {
      --AggregateOp
        --Select v_new!=NULL
          --Nested-Tuple-Source
    }
    --LeftOuterJoin (v_lc_1=v_rc_1 AND .... AND v_lc_n=v_rc_n)
      (left branch)
        --InputOp
          .....
      (right branch)
        --Assign v_new=TRUE
          --NestedOp
            .....
              --Deepcopy_The_Plan_Rooted_At_InputOp_With_New_Variables(InputOp)

In the plan, v_lc_1, ..., v_lc_n are live "covering" variables at InputOp, while v_rc_1, ..., v_rc_n are their corresponding variables populated from the deep copy of InputOp. ("Covering" variables form a set of variables that can imply all live variables.) v_l1, ..., v_ln in the decoration part of the added group-by operator are all live variables at InputOp except the covering variables v_lc_1, ..., v_lc_n.
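
The split between group-by keys and decoration variables amounts to a set difference. The sketch below is illustrative only: `split_group_by_variables` is a hypothetical helper, and the variable names are taken from the concrete example on this page, assuming that `$$14` (the Customers primary key) is the only covering variable at InputOp.

```python
from typing import List, Tuple


def split_group_by_variables(live: List[str], covering: List[str]) -> Tuple[List[str], List[str]]:
    """Return (group-by keys, decor variables) for the added group-by operator.

    The keys are the covering variables; the decor variables are the
    remaining live variables, kept in their original order.
    """
    missing = [v for v in covering if v not in live]
    if missing:
        raise ValueError(f"covering variables not live at InputOp: {missing}")
    decor = [v for v in live if v not in covering]
    return list(covering), decor


# Live variables at InputOp in the concrete example: $$14, $$0 and $$18.
keys, decor = split_group_by_variables(live=["$$14", "$$0", "$$18"], covering=["$$14"])
print(keys, decor)
```

The result corresponds to the `group by ([$$24 := %0->$$14]) decor ([%0->$$0; %0->$$18])` operator in the concrete after plan.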

Here is a concrete example of the general case rewriting.

Before plan:

distribute result [%0->$$13] -- |UNPARTITIONED|
  project ([$$13]) -- |UNPARTITIONED|
    assign [$$13] <- [function-call: asterix:open-record-constructor, Args:[AString: {cust}, %0->$$0, AString: {orders}, %0->$$12]] -- |UNPARTITIONED|
      subplan {
                aggregate [$$12] <- [function-call: asterix:listify, Args:[%0->$$1]] -- |UNPARTITIONED|
                  join (function-call: algebricks:eq, Args:[%0->$$16, %0->$$14]) -- |UNPARTITIONED|
                    select (function-call: algebricks:eq, Args:[%0->$$18, AInt64: {5}]) -- |UNPARTITIONED|
                      nested tuple source -- |UNPARTITIONED|
                    assign [$$16] <- [function-call: asterix:field-access-by-name, Args:[%0->$$19, AString: {o_custkey}]] -- |UNPARTITIONED|
                      assign [$$19] <- [function-call: asterix:field-access-by-name, Args:[%0->$$1, AString: {o_$o}]] -- |UNPARTITIONED|
                        data-scan []<-[$$15, $$1] <- tpch:Orders -- |UNPARTITIONED|
                          empty-tuple-source -- |UNPARTITIONED|
             } -- |UNPARTITIONED|
        assign [$$18] <- [function-call: asterix:field-access-by-index, Args:[%0->$$0, AInt32: {3}]] -- |UNPARTITIONED|
          data-scan []<-[$$14, $$0] <- tpch:Customers -- |UNPARTITIONED|
            empty-tuple-source -- |UNPARTITIONED|

After plan:

distribute result [%0->$$13] -- |UNPARTITIONED|
  project ([$$13]) -- |UNPARTITIONED|
    assign [$$13] <- [function-call: asterix:open-record-constructor, Args:[AString: {cust}, %0->$$0, AString: {orders}, %0->$$12]] -- |UNPARTITIONED|
      group by ([$$24 := %0->$$14]) decor ([%0->$$0; %0->$$18]) {
                aggregate [$$12] <- [function-call: asterix:listify, Args:[%0->$$1]] -- |UNPARTITIONED|
                  select (function-call: algebricks:not, Args:[function-call: algebricks:is-null, Args:[%0->$$23]]) -- |UNPARTITIONED|
                    nested tuple source -- |UNPARTITIONED|
             } -- |UNPARTITIONED|
        left outer join (function-call: algebricks:eq, Args:[%0->$$14, %0->$$22]) -- |UNPARTITIONED|
          assign [$$18] <- [function-call: asterix:field-access-by-index, Args:[%0->$$0, AInt32: {3}]] -- |UNPARTITIONED|
            data-scan []<-[$$14, $$0] <- tpch:Customers -- |UNPARTITIONED|
              empty-tuple-source -- |UNPARTITIONED|
          assign [$$23] <- [TRUE] -- |UNPARTITIONED|
            join (function-call: algebricks:eq, Args:[%0->$$16, %0->$$22]) -- |UNPARTITIONED|
              select (function-call: algebricks:eq, Args:[%0->$$20, AInt64: {5}]) -- |UNPARTITIONED|
                assign [$$20] <- [function-call: asterix:field-access-by-index, Args:[%0->$$21, AInt32: {3}]] -- |UNPARTITIONED|
                  data-scan []<-[$$22, $$21] <- tpch:Customers -- |UNPARTITIONED|
                    empty-tuple-source -- |UNPARTITIONED|
              assign [$$16] <- [function-call: asterix:field-access-by-name, Args:[%0->$$19, AString: {o_custkey}]] -- |UNPARTITIONED|
                assign [$$19] <- [function-call: asterix:field-access-by-name, Args:[%0->$$1, AString: {o_$o}]] -- |UNPARTITIONED|
                  data-scan []<-[$$15, $$1] <- tpch:Orders -- |UNPARTITIONED|
                    empty-tuple-source -- |UNPARTITIONED|
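
To see why the two plans return the same answer, both can be simulated over a few in-memory tuples. This is a sketch under stated assumptions rather than anything executed by AsterixDB: the data is invented, `c_field3` is a hypothetical name for the Customers field that the plan reads by index 3, and the functions `correlated` and `decorrelated` are illustrative.

```python
from typing import Dict, List

# Invented toy data; "c_field3" stands in for the field accessed by index 3.
customers = [
    {"c_custkey": 1, "c_field3": 5},
    {"c_custkey": 2, "c_field3": 5},
    {"c_custkey": 3, "c_field3": 7},
]
orders = [
    {"o_orderkey": 10, "o_custkey": 1},
    {"o_orderkey": 11, "o_custkey": 1},
    {"o_orderkey": 12, "o_custkey": 3},
]


def correlated(customers: List[Dict], orders: List[Dict]) -> List[Dict]:
    """Before plan: evaluate the nested plan once per customer tuple."""
    result = []
    for c in customers:
        nested = [o for o in orders
                  if c["c_field3"] == 5 and o["o_custkey"] == c["c_custkey"]]
        result.append({"cust": c, "orders": nested})
    return result


def decorrelated(customers: List[Dict], orders: List[Dict]) -> List[Dict]:
    """After plan: left outer join against a copy of the input, then group by key."""
    # Right branch: a second scan of Customers, filtered, joined with Orders, flagged TRUE.
    right = [{"pk_copy": c2["c_custkey"], "order": o, "flag": True}
             for c2 in customers if c2["c_field3"] == 5
             for o in orders if o["o_custkey"] == c2["c_custkey"]]
    # Left outer join on primary key = copied primary key; non-matches get NULLs.
    joined = []
    for c in customers:
        matches = [r for r in right if r["pk_copy"] == c["c_custkey"]]
        if not matches:
            matches = [{"pk_copy": None, "order": None, "flag": None}]
        joined.extend({"cust": c, **r} for r in matches)
    # Group by the primary key; the not-null filter on the flag drops padded rows.
    groups: Dict[int, Dict] = {}
    for row in joined:
        group = groups.setdefault(row["cust"]["c_custkey"],
                                  {"cust": row["cust"], "orders": []})
        if row["flag"] is not None:
            group["orders"].append(row["order"])
    return list(groups.values())
```

Customers 2 and 3 have no qualifying orders, so the left outer join pads them with NULLs. Without the not-null filter on the flag, their `orders` lists would contain a NULL entry instead of being empty, and the two functions would disagree.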

Special Cases

For special cases where:

a. there is a join (let's call it J1) in the nested plan,

b. one input pipeline of J1 has a NestedTupleSource descendant (let's call it N1),

c. no tuples are dropped on the path from N1 to J1,

rewriting R2 is not necessary because all tuples from N1 are preserved up to J1. However, rewritings R1' to R5' are needed:

R1'. Replace N1 with O1 (no additional deep copy);

R2'. All inner joins on the path from N1 to J1 (including J1) are rewritten to left-outer joins with the same join conditions;

R3'. If N1 resides in the right branch of a join (let's call it J2) on the path from N1 to J1, switch the left and right branches of J2;

R4'. On top of J1, a GroupByOperator G1 is added where the group-by key is the primary key of the subplan input operator and the nested query plan for aggregation is the nested pipeline on top of J1 (with a not-null filter added);

R5'. All other NestedTupleSourceOperators in the subplan are inlined with deep copies of the query plan rooted at O1.
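
Rewritings R1' to R3' can be sketched on a toy operator tree. The code below is a hypothetical illustration, not the rule's implementation: `Op`, the operator labels, and `rewrite_to_j1` are invented names, the nested plan is assumed to contain exactly one NestedTupleSource (N1), and the group-by of R4' and the extra copies of R5' are left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Op:
    """Hypothetical toy operator: a label plus its input operators."""
    name: str
    inputs: List["Op"] = field(default_factory=list)


def rewrite_to_j1(op: Op, o1: Op) -> Tuple[Op, bool]:
    """Rewrite the pipeline up to and including J1.

    Returns the rewritten operator and whether N1 lies underneath it.
    """
    if op.name == "nested-tuple-source":
        return o1, True                       # R1': reuse O1 itself, no deep copy
    rewritten = [rewrite_to_j1(child, o1) for child in op.inputs]
    inputs = [child for child, _ in rewritten]
    on_path = [hit for _, hit in rewritten]
    name = op.name
    if name == "inner-join" and any(on_path):
        if on_path[1]:
            inputs.reverse()                  # R3': move the N1 side to the left branch
        name = "left-outer-join"              # R2': keep every tuple coming from N1
    return Op(name, inputs), any(on_path)


# J1 with N1 in its right branch, below an assign that drops no tuples.
j1 = Op("inner-join", [Op("data-scan Orders"),
                       Op("assign", [Op("nested-tuple-source")])])
new_j1, _ = rewrite_to_j1(j1, Op("InputOp"))
```

After the call, `new_j1` is a left-outer join whose left branch is the assign over `InputOp` and whose right branch is the Orders scan, so every input tuple survives the join and the group-by of R4' can be placed directly on top.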
