This optimization intersects multiple secondary indexes when the select conditions make more than one of them applicable. Previously, the IntroduceSelectAccessMethodRule picked the first applicable index to contribute to the access path whenever several indexes were available. Because we lack statistical information, the first index may not be the best choice. Moreover, even if we chose the index with the lowest selectivity (the one returning the fewest records), it still might not be the best solution, because we can reduce the selectivity further by intersecting it with the other secondary indexes. Introducing the intersection into the plan avoids the worst access path. Furthermore, once statistical information becomes available, we can keep refining the decision of whether or not to introduce the intersection.

Optimization Rule

The logical changes are in the IntroduceSelectAccessMethodRule. After analyzing the interesting functions and indexes, we pair them up in a one-to-one mapping (e.g., BTreeAccessMethod -> BTreeIndex on Salary).

...

Physically, each ETS node introduces a thread, so the intersection operator must synchronize its upstream input threads to generate the correct result. To keep the operation pipelined, the intersection is implemented in a sort-merge manner, which requires each input to be sorted. The synchronization is handled by the thread of input 0: only thread 0 calls the writer.open/nextFrame/close functions. If arbitrary threads were allowed to push data forward, the downstream operators would be confused, especially when synchronizing their locks. The core intersection logic is as follows:

do
1. find the max input: max = the maximum of the current records across all inputs
2. match = 0; for each input i
  1. while record_i < max, keep popping
  2. if record_i == max, match++; continue
  3. if record_i > max, break
3. if match == inputArity
  1. output the max record
while no input is closed.
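
The loop above can be sketched in Python as follows. This is a minimal, single-threaded illustration rather than the operator's actual code: it assumes each input is an iterable of comparable keys in strictly increasing order, and the name `intersect_sorted` is hypothetical. It ignores frames, threads, and the writer interface entirely.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def intersect_sorted(inputs: List[Iterable[T]]) -> Iterator[T]:
    """Yield the keys present in every sorted input (hypothetical helper)."""
    iterators = [iter(source) for source in inputs]
    if not iterators:
        return
    try:
        # Current (head) record of every input.
        heads = [next(it) for it in iterators]
    except StopIteration:
        return  # An empty input means the intersection is empty.

    while True:
        # Step 1: the largest head is the only candidate that can still match.
        max_record = max(heads)
        match = 0
        # Step 2: advance every input up to the candidate.
        for i, it in enumerate(iterators):
            while heads[i] < max_record:
                try:
                    heads[i] = next(it)
                except StopIteration:
                    return  # An input is closed: no further matches.
            if heads[i] == max_record:
                match += 1
            else:
                break  # This head is larger: restart with a new maximum.
        # Step 3: every input agrees on the candidate, so emit it.
        if match == len(iterators):
            yield max_record
            for i, it in enumerate(iterators):
                try:
                    heads[i] = next(it)
                except StopIteration:
                    return


if __name__ == "__main__":
    print(list(intersect_sorted([[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 9, 10]])))
```

Because every input only moves forward, each record is read once, which is what lets the intersection run in a pipelined fashion over sorted inputs.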

...

Code Block

use dataverse twitter; 
let $ts_start := datetime("2015-11-23T17:$min_start:00.000Z") 
let $ts_end := datetime("2015-11-23T17:$min_end:03.000Z") 
let $ms_start := date("2010-$month_start-01") 
let $ms_end := date("2010-$month_end-28") 
let $result := for $t in dataset ds_tweets 
               where $t.user.create_at >= $ms_start and $t.user.create_at < $ms_end 
               and $t.create_at >= $ts_start and $t.create_at < $ts_end 
               and $t.place = "Unite State"
               return $t
return count($result)

...

Each query is run ten times, and we report the average time of the last five runs. Times are in milliseconds.
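
The measurement procedure can be sketched as a small harness. This is an illustration of the procedure only, not the script used for the experiments: `run_query` is a hypothetical callable standing in for whatever submits the query, and the function names are assumptions.

```python
import time
from statistics import mean
from typing import Callable, List


def average_of_last(timings_ms: List[float], kept_runs: int = 5) -> float:
    """Average only the final `kept_runs` timings, discarding the earlier warm-up runs."""
    return mean(timings_ms[-kept_runs:])


def measure(run_query: Callable[[], None], total_runs: int = 10, kept_runs: int = 5) -> float:
    """Run the query `total_runs` times and report the average of the last `kept_runs` in ms."""
    timings_ms = []
    for _ in range(total_runs):
        start = time.perf_counter()
        run_query()  # hypothetical: submit the query and wait for the result
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return average_of_last(timings_ms, kept_runs)
```

Discarding the first five runs keeps start-up and cold-cache effects out of the reported average.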

Table 1. User.create_at fixed at $month_start = 01, $month_end = 02, with increasing Tweets.create_at selectivity

...

This is because we have only one disk. First, the Tweet.create_at path has to wait for the User.create_at path to finish a frame before the intersection can proceed, and the two index searches compete for disk reads. Second, although the intersection itself can finish as soon as one of its inputs is exhausted, our push model does not let us stop the other index scan. Hence, the primary index search also competes for the disk with the two secondary index searches.

...

Intersect Unclustered Secondary Indexes

As shown in the previous result, the index on Tweet.create_at is a clustered secondary index, which is a special case among secondary indexes. To test a more general case, we create an RTree on Tweet.place.bounding_box, which is a rectangular area. We define a circular region around LA County and increase its radius to increase the selectivity of the RTree. The query is as follows:

Code Block

use dataverse twitter; 
let $ms_start := date("2010-$month_start-01") 
let $ms_end := date("2010-$month_end-28") 
let $region := create-circle(create-point(-118.125,33.939), $radius)
let $result := for $t in dataset ds_tweets 
               where $t.user.create_at >= $ms_start and $t.user.create_at < $ms_end 
               and spatial-intersect($t.place.bounding_box, $region)
               and $t.place = "Unite State"
               return $t
return count($result)

...

Time in ms (avg of last 5 runs); "-" marks a cell with no value

result  month   radius  Scan  User time index  Rtree index  Intersection  Speedup
1390    01--02  0.01    -     111087           106159       9293          11.4235446
1551    01--02  0.02    -     111306           107127       10012         10.69986017
1575    01--02  0.03    -     112024           108143       10278         10.52179412
6171    01--02  0.04    -     -                111264       31850         3.493375196
6193    01--02  0.05    -     -                112916       32001         3.528514734
6689    01--02  0.06    -     -                111673       33952         3.289143497
6900    01--02  0.07    -     -                111012       34946         3.176672581
6900    01--02  0.08    -     -                111570       34937         3.193462518
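
The speedup values appear to be the Rtree index time divided by the intersection time. The page does not state this formula, so it is an inference from the numbers. A short check on two rows, with a hypothetical helper name:

```python
def speedup(rtree_index_ms: float, intersection_ms: float) -> float:
    """Assumed definition: time with the Rtree index alone over time with the intersection."""
    return rtree_index_ms / intersection_ms


if __name__ == "__main__":
    # (radius, Rtree index time, intersection time) taken from the table rows above.
    for radius, rtree_ms, intersect_ms in [(0.01, 106159, 9293), (0.04, 111264, 31850)]:
        print(radius, round(speedup(rtree_ms, intersect_ms), 4))
```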

The experiment is slow. Stay tuned.

...
