...
Physically, each ETS node introduces a thread, so the intersection operator must synchronize its upstream input threads to produce the correct result. To keep the operation pipelined, the intersection is implemented as a sort-merge, which requires every input to be sorted. Synchronization is handled by the thread of input 0, meaning that thread 0 is the one that calls the writer.open/nextFrame/close functions. If arbitrary threads were allowed to push frames forward, the downstream operators would be confused, especially when synchronizing their locks. The core intersection logic is as follows:
- do
  - find the max: the maximum among the current records of all inputs (maxInput is the id of the input that holds it)
  - for each input i
    - if record < max, keep popping until the record is no longer smaller than max
    - if record == max, then match++; continue
    - if record > max, break
  - if match == inputArity
    - output the max record
- while no input is closed
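The loop above can be sketched in a single thread as follows. This is only an illustration under stated assumptions, not the operator's actual code: the inputs are modeled as in-memory sorted lists of integer keys, each key is assumed to appear at most once per input, and the names `intersect_sorted`, `pos`, and `max_rec` are hypothetical. Thread synchronization and frame handling are left out.

```python
from typing import Iterator, Sequence


def intersect_sorted(inputs: Sequence[Sequence[int]]) -> Iterator[int]:
    """Yield the keys that occur in every sorted input (sort-merge style)."""
    arity = len(inputs)
    if arity == 0:
        return
    # One cursor per input; advancing a cursor "pops" a record.
    pos = [0] * arity
    # Keep going while no input is exhausted ("closed").
    while all(pos[i] < len(inputs[i]) for i in range(arity)):
        # Find the maximum among the current records of all inputs.
        max_rec = max(inputs[i][pos[i]] for i in range(arity))
        match = 0
        for i in range(arity):
            # Records smaller than max cannot be in the intersection: pop them.
            while pos[i] < len(inputs[i]) and inputs[i][pos[i]] < max_rec:
                pos[i] += 1
            # Input ran out or overshot max: no match in this round.
            if pos[i] == len(inputs[i]) or inputs[i][pos[i]] > max_rec:
                break
            match += 1
        if match == arity:
            yield max_rec
            # Assumption: move every input past the emitted record.
            for i in range(arity):
                pos[i] += 1


if __name__ == "__main__":
    print(list(intersect_sorted([[1, 3, 5, 7, 9], [3, 4, 5, 9, 10], [0, 3, 9]])))
    # prints [3, 9]
```

Because each input is only ever read forward, the sketch keeps the streaming property that motivates the sort-merge design: no input needs to be buffered in full before results are produced.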
...
Code Block |
---|
use dataverse twitter; let $ts_start := datetime("2015-11-23T17:$min_start:00.000Z") let $ts_end := datetime("2015-11-23T17:$min_end:03.000Z") let $ms_start := date("2010-$month_start-01") let $ms_end := date("2010-$month_end-28") let $result := for $t in dataset ds_tweets where $t.user.create_at >= $ms_start and $t.user.create_at < $ms_end and $t.create_at >= $ts_start and $t.create_at < $ts_end and $t.place = "Unite State" return $t return count($result) |
...
Each query is run ten times. We record the average time of the last five runs. The time unit is milliseconds.
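The measurement procedure described above could be scripted roughly as follows. This is a hedged sketch: `run_query` is a hypothetical callable standing in for however the query is actually submitted, and the helper names are illustrative only.

```python
import time
from statistics import mean
from typing import Callable, List, Sequence


def mean_of_last(timings_ms: Sequence[float], keep: int = 5) -> float:
    """Average only the last `keep` timings, discarding the earlier runs."""
    return mean(timings_ms[-keep:])


def average_last_runs(run_query: Callable[[], object],
                      runs: int = 10, keep: int = 5) -> float:
    """Run the query `runs` times; return the mean of the last `keep` runs in ms."""
    timings_ms: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # hypothetical: submits the query and waits for the result
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return mean_of_last(timings_ms, keep)
```

Dropping the first five runs keeps cold-start effects out of the reported number.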
Table 1. User.create_at is fixed at $month_start = 01, $month_end = 02, while the Tweets.create_at selectivity increases
...
Code Block |
---|
use dataverse twitter; let $ms_start := date("2010-$month_start-01") let $ms_end := date("2010-$month_end-28") let $region := create-circle(create-point(-118.125,33.939), $radius) let $result := for $t in dataset ds_tweets where $t.user.create_at >= $ms_start and $t.user.create_at < $ms_end and spatial-intersect($t.place.bounding_box, $region) and $t.place = "Unite State" return $t return count($result) |
...
Scan | user time Index | Rtree Index | intersection | speedup | |||
result | month | radius | Time (Avg last 5) | ||||
1390 | 01--02 | 0.01 | 111087 | 106159 | 9293 | 11.4235446 | |
1551 | 01--02 | 0.02 | 111306 | 107127 | 10012 | 10.69986017 | |
1575 | 01--02 | 0.03 | 112024 | 108143 | 10278 | 10.52179412 | |
6171 | 01--02 | 0.04 | 111264 | 31850 | 3.493375196 | ||
6193 | 01--02 | 0.05 | 112916 | 32001 | 3.528514734 | ||
6689 | 01--02 | 0.06 | 111673 | 33952 | 3.289143497 | ||
6900 | 01--02 | 0.07 | 111012 | 34946 | 3.176672581 | ||
6900 | 01--02 | 0.08 | 111570 | 34937 | 3.193462518 |
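The speedup values in the rows above are consistent with dividing the time in the column just before the intersection time by the intersection time itself. The short check below reproduces two of them. Treating that preceding column as the baseline is an inference from the numbers, not something the table states, and the function name `speedup` is illustrative.

```python
def speedup(baseline_ms: float, intersection_ms: float) -> float:
    """Ratio of the baseline time to the intersection time."""
    return baseline_ms / intersection_ms


print(round(speedup(106159, 9293), 4))   # radius 0.01 row, prints 11.4235
print(round(speedup(111264, 31850), 4))  # radius 0.04 row, prints 3.4934
```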
The experiment is slow. Stay tuned.
...