1. Motivation
To efficiently support the sort-dependent operators based on the Hyracks running framework, a multiple stages-based subdivision is used to dynamically construct the rangemap, cache the frames as well as parallelize the operations. Since the down-streaming operator could be connected with the sort-dependent operators in one stage, which means both those down-streaming operators and the sort-dependent operators can execute in pipeline. We should cache the frames from the down-streaming operators to the sort-dependent operators until the target rangemap is completely figured out. Based on a range partitioner, the caching frames can be forwarded onto their targets correspondingly.
2. RangeMap Generation Based on Aggregation
Combining a replicate operator, the aggregation-based RangeMap is generated based on a streaming algorithm to dynamically construct the histogram.
use dataverse tpch; let $rg := rg( for $d in dataset Lineitem return $d.l_extendedprice ) return $rg |
---|
3. Parallel Sort
In general, the parallel sort is divided into five stages, i.e., replicate, local aggregation, global aggregation, forward, sort and merge, to scale up the sort based on Hyracks.
A parallel sort template can be given as:
use dataverse tpch; for $d in dataset Lineitem order by $d.l_extendedprice return $d |
---|
3.1 Four stages of parallel sort
4. Binary In-Equal Join
use dataverse tpch; for $d in dataset Lineitem for $t in dataset Orders where 4 * $d.l_extendedprice - $t.o_totalprice > -2 and 4 * $d.l_extendedprice - $t.o_totalprice < 2 return {"ok": $d.l_orderkey, "ln": $d.l_linenumber, "ep": $l.l_extendedprice, "tp": $t.o_totalprice} |
---|
4.1 Five stages of parallel sort-based binary join
4.2 Histogram merging based on the densities of the both sides.