Page History

...

Physically, each ETS node will introduce a thread. Thus, the intersection operator must synchronize the upstream input threads in order to generate the correct result. In order to have a pipeline operateoperation, the intersection is implemented as in a sort-merge manner. Therefore, each input is required to be sorted. The synchronization is handled by the thread of input zeroNo.0, which means the thread 0 will call the writer.open/nextFrame/close functions. If we authorize arbitrary threads to do such taskpush forward, the downstream operator will be confused to synchronize on , especially in synchronizing their locks. The core logical intersection function is as below:

do
1. find the max input: max
2. for each input i
  1. keep popping until it matches max. then match++;
  2. if > max, break
3. If match == inputArity
  1. output max
while all inputs haven't been no input is closed.

If any of the input is fully consumed, the operator is closed.

...

Each test will run three times:

1st. with BTree index on TweetUser.create_at only,
2nd. with BTree index on UserTweet.create_at onlyat only,
3rd. with both indexes presents, consequently, the intersection is introduced.

...

			Scan	UserCreateAtIndex	TweetCreateAt Index	Intersection	ReductionSpeedUp
result	month	minutes	Time (Avg last 5)
431	01--02	00-09	929	142	132	52	602.61%54
928	01--02	00-19	929	143	226	56	602.84%55
1458	01--02	00-29	934	144	328	73	491.31%97
1997	01--02	00-39	932	143	427	80	441.06%79
2504	01--02	00-49	930	142	528	92	351.21%54
2989	01--02	00-59	933	141	631	109	221.70%29

Table 2. Fix the Tweet.create_at $min_start = 00, $min_end = 09, increasing the User.create_at selectivity

			Scan	UserCreateAtIndex	TweetCreateAt Index	Intersection	Reduction
result	month	minutes	Time (Avg last 5)
431	01--02	00-09	929	142	132	52	602.61%54
670	01--03	00-09	932	189	129	48	622.79%69
929	01--04	00-09	928	240	124	61	502.81%03
1140	01--05	00-09	931	291	128	69	461.09%86
1471	01--06	00-09	933	367	126	65	481.41%94
1859	01--07	00-09	932	449	125	84	321.80%49
2166	01--08	00-09	931	525	126	87	301.95%45
2438	01--09	00-09	932	580	127	94	251.98%35
2682	01--10	00-09	939	648	127	104	181.11%22
3011	01--11	00-09	934	710	125	110	121.00%14
3346	01--12	00-09	933	781	127	120	51.51%06

Table 3. Both Tweet.create_at and User.create_at increasing the selectivity

			Scan	UserCreateAtIndex	TweetCreateAt Index	Intersection	Reduction
result	month	minutes	Time (Avg last 5)
431	01--02	00-09	929	142	132	52	602.61%54
1429	01--03	00-19	933	190	228	67	642.74%84
1945	01--04	00-19	933	239	226	67	703.35%37
3686	01--05	00-29	934	294	322	83	713.77%54
4816	01--06	00-29	936	370	324	102	683.52%18
8320	01--07	00-39	931	453	425	123	713.06%46
12202	01--08	00-49	932	522	529	146	723.03%58
13791	01--09	00-49	934	582	527	157	703.21%36
18489	01--10	00-59	937	644	630	191	693.68%30

We can see that intersection is the best one choice under current above settings. The total time reduction is from 5% to 70%speed up to the fast single index path is up to 3.5 times. If the selectivities of two indexes are vary indexes vary a lot in the selectivity, then the benefit of intersection may not that much. If the two indexes are of the similar selectivity the intersection can achieve 60% ~ 70% total time reductiontwo to three times faster.

On disk case

The test dataset is changing to the 8.2G dataset. In order to flush the cache, we load the same dataset to another ds_copy dataset. Every time when we run the selection, we scan this 8.2G ds_copy once to invalidate the cache pages. Due to the slowness of the on disk case, we warm up the query only once and record the average time of the next three times.

...

Though the two access methods have very different execution time, the intersection tends to catch with the fastest one. The overhead of intersection compares to the fastest path is from 15% to 78%. While , its speedup compares to the slowest one is about 5~10 times faster.

Why the Tweet.creat_at access path is so fast?

The answer is that the order of primary key (Tweet.id) is consistent with the order of Tweet.create_at. We speculates speculate that the Tweet.id was generated by the Tweet.create_at. Thus, this secondary index is actually clustered as the primary index. As the consequence, the IO time to fetch each record is clustered. The general performance of the secondary index lookup should be as slow as the User.create_at access path.

...

Because we only have one disk. First, the Tweet.create_at path has to wait for the User.create_at to finish a frame to operate the intersection. These two index search is battling the disk read. Second, although the intersection itself can be finished as long as one of the input is done, we can not stop the other index scan based on our push model. Hence, the primary search is also competing competing on the disk resource with the two index searches.

General secondary index comparison

We know the As shown in the previous result, the index on Tweet.create_at is a primary-like key. Then we replace that access path with an RTree path which is build clustered secondary index, which is a special case for the secondary index. To test a more general case, we create an RTree on the Tweet.place.bounding_box..boudingbox which is a rectangle area. We create a circle area around LA county. By increasing the radius, we can increase the selectivity of that RTree. The query is as below

Code Block

use dataverse twitter; 
let $ms_start := date("2010-$month_start-01") 
let $ms_end := date("2010-$month_end-28") 
let $region := create-circle(create-point(-118.125,33.939), $radius)
let $result := for $t in dataset ds_tweets 
               where $t.user.create_at >= $ms_start and $t.user.create_at < $ms_end 
               and spatial-intersect($t.place.bounding_box, $region)
               and $t.place = "Unite State" return $t return count($result)
               return $t
return count($result)

Table 6. Fix User.create_at condition for one month and increase the $radius range.

			Scan	user time Index	Rtree Index	intersection	speedup
result	month	hour	Time (Avg last 5)
1390	01--02	0.01			106159	9293	11.4235446
1551	01--02	0.02			107127	10012
1575	01--02	0.03			108143	10278
6171	01--02	0.04			111264	31850
6193	01--02	0.05			112916	32001
6689	01--02	0.06			111673	33952
6900	01--02	0.07			111012	34946
6900	01--02	0.08			111570	34937

The experiment is slow. Stay tuned.

...

Page tree

Versions Compared

Old Version 5

New Version 6

Key

On disk case

General secondary index comparison