ID | IEP-24 |
Author | Vladimir Ozerov Ozerov |
Sponsor | Vladimir Ozerov Ozerov |
Created | 19 Jun 2018 |
Status | DRAFT |
SQL query may scan arbitrary set of cache values. In general case interested values may reside on every cluster node, so broadcast is needed. What is worse, when distributed joins are enabled, another broadcast for every broadcasted message may be needed. One of the most important optimization techniques is so-called "partition pruning", allowing to calculate a set of partitions and nodes which will return empty result without actual query execution. Query parts are not executed on these nodes, increasing overall system throughput and providing nearly linear scaling.
Recent researches in distributed query processing assume that CPU is relatively cheap resource, while network IO and disk IO are equally costly operations. Partition pruning not only saves network requests, but also avoids unnecessary executionы of local query parts, preventing page cache trashing and additional disk reads (mostly random).
Thus, partition pruning is a critical optimization technique for both distributed and local queries and should be treated as high priority task for Ignite SQL engine.
Partition pruning is already implemented in Apache Ignite in very simplified form [1]. Only WHERE condition with equality is considered and only for SQL queries without joins. We should expand it further.
[1] https://issues.apache.org/jira/browse/IGNITE-4509
This technique is used excessively by both distributed databases and classical RDBMS vendors (as a part of their partitioning/sharding solutions). First, query is analyzed and optionally re-written to allow for better partition pruning. Second, information of interesting partitions are extracted from query for every table (cache) and subquery (for complex multi-stage distributed queries).
WHERE conditions and JOIN-s are relatively simple targets for extract partition extraction. The main difficulty is subqueries, as they potentially introduce unlimited nesting what complicates analysis a lot. So the main goal of query rewrite is to minimize number of subqueries. Main techniques are JOIN conversion, correlation and materialization. Interestingly, all these techniques not only simplify partition extraction, but also allow optimizer to choose better execution plan for local execution, as JOINs are much easier to be analyzed and reordered than subqueries. As we shall see below, efforts are usually focused on non-correlated subqueries as they are the most costly for straightforward execution.
The goal is to convert a subquery to join if possible. This is possible in most cases for subqueries in SELECT (aka "derived table") and FROM clauses. Also, it is possible for some IN-clauses.
Example 1: JOIN conversion for derived table
SELECT emp.name, (SELECT dept.name FROM dept WHERE emp.dept_id=dept.id) FROM emp WHERE emp.salary>1000
SELECT emp.name, dept.name FROM emp, dept WHERE emp.salary>1000 AND emp.dept_id=dept.id
Example 2: JOIN conversion for FROM clause
SELECT emp.name, dept_subquery.name FROM emp, (SELECT * FROM dept WHERE state='CA') dept_subquery WHERE emp.salary>1000 AND emp.dept_id=dept_subquery.id
SELECT emp.name, dept.name FROM emp, dept WHERE emp.salary>1000 AND emp.dept_id=dept.id AND dept.state='CA'
TODO
TODO
TOOD
TODO