ID | IEP-24 |
Author | Vladimir Ozerov Ozerov |
Sponsor | Vladimir Ozerov Ozerov |
Created | 19 Jun 2018 |
Status | ACTIVE |
The goal of this IEP is to avoid execution of SQL queries on partitions which do no contain relevant data.
SQL query may scan arbitrary set of cache values. In general case interested values may reside on every cluster node, so broadcast is needed. Widely adopted optimization technique is so-called "partition pruning":
When implemented it will provide the following benefits:
Partition pruning is already implemented in Apache Ignite in very simplified form [1]. Only WHERE condition with equality is considered and only for SQL queries without joins. We should expand it further.
[1] https://issues.apache.org/jira/browse/IGNITE-4509
In the following sections we first explain how partitions could be extracted from SQL parts, and how certain query rewrite techniques could help us with it. Then we will describe how extracted partition info is assembled in a form of tree. Then we discuss that partition extraction should be performed two times - before split for the whole query, and after split for query parts. Finally, we explain how partition info will be passed to thin clients, and how users will be able to control and monitor partition pruning.
Suppose that for every table we know it's affinity column. It is either PK or explicitly defined affinity column. Then we can analyze WHERE expressions related to the given tables to extract partition info. For JOINs we can compare affinity functions of two tables. If they are compatible, then we can "pass" partition information from one table to another.
Apache Ignite supports only hash-based sharding, so partition could be extracted only from equality conditions.
In further examples affinity function is denoted as '{...}'. Extracted partition is either concrete number, or query parameter index which will be converted to concrete number later. We will denote first type as "Pn" (e.g. P1, P2), and second as ":INDEX" (e.g. :1, :2). If partition cannot be extracted from condition, we will denote it as "ALL". Empty partition set is denoted as "EMPTY".
For equality we simply apply affinity function to the value.
SELECT * FROM emp WHERE emp.id = 100 => {id=100} => P1
SELECT * FROM emp WHERE emp.id = :1 => {id=:1} => :1
SELECT * FROM emp WHERE emp.id != 100 => {name!=100} => (ALL)
SELECT * FROM emp WHERE emp.name = :1 => {name=:1} => (ALL)
IN condition with list of values results in a merged list of affected partitions. IN condition with nested SELECT statement will not be supported for now.
SELECT * FROM emp WHERE emp.id IN (100, :1) => ({name=100}, {name=:1}) => (P1, :1)
Range conditions could be converted to IN statements if column is of integer type.
SELECT * FROM emp WHERE emp.id BETWEEN 100 AND 102 => {id BETWEEN 100 AND 102} => {100, 101, 102} => (P1, P2, P3)
SELECT * FROM emp WHERE emp.id > 100 AND emp.id <= 102 => {id > 100 AND id <= 102} => {101, 102} => (P1, P2)
Every WHERE expression can be represented as sequence conjunctive expressions separated by disjunctions. For two OR expressions we return disjunctive set. For AND expressions we return conjunctive set. Concrete partitions can be merged together. Partition placeholders can only be merged with ALL or EMPTY on the other side.
For AND condition there is a special rule. If two sides contain parameters, we cannot simplify them, because final result depend on resolved partitions. If left side contain parameters and right side contain only concrete values, then we can remove concrete values from the left side which are not present on the right side. And vice versa.
(P1) AND (P2) => () (P1, P2) AND (P3, P4) => () (P1, P2) AND (P2, P3) => (P2) (P1) AND (ALL) => (P1) (P1, P2) AND (ALL) => (P1, P2) (P1, P2) AND () => () (:1) AND (:2) => (:1) AND (:2) (:1) AND (ALL) => (:1) (:1) AND () => () (P1) AND (:2) => (P1) AND (:2) (P1, :1) AND (P2) => (:1) AND (P2) (P1, :1) AND (P2, :2) => (P1, :1) AND (P2, :2)
(P1) OR (P2) => (P1, P2) (P1) OR (ALL) => (ALL) (P1) OR () => (P1) (P1, P2) OR (P2, P3) => (P1, P2, P3) (:1) OR (:2) => (:1, :2) (P1, :1) OR (P2, :2) => (P1, P2, :1, :2)
Joins are very common, so it is crucial to support partition extraction for them as well. General solution might be extremely complex, so we need to define reasonable bounds where could operate, and improve them iteratively in future. We start with query AST obtained from parser. Proposed flow to extract partitions is explained below. Some of explained steps could be merged to improve performance.
REPLICATED
and do not have node filter, then mark it as "ANY" and remove from the tree, as it doesn't affect JOIN outcome. Otherwise - exit, no need to bother with custom filters.RendezvousAffintiyFunction
is deterministic, while FairAffinityFunction
is not)PARTITIONED
caches, only equi-joins, where every table is assigned a single co-location group.REPLICATED
caches with "ANY" policy should be eliminated as follows:
(P1, :2) AND (ANY) => (P1, :2) (P1, :2) OR (ANY) => (P1, :2)
If partition tree contain rules from different co-location groups, then exit.
It is not easy to extract partitions from subqueries. But we can rewrite certain subqueries to joins with a technique called "join conversion". Important prerequisite is that number of resulting rows is not changed.
Example 1: JOIN conversion for derived table
SELECT emp.name, (SELECT dept.name FROM dept WHERE emp.dept_id=dept.id) FROM emp WHERE emp.salary>1000
SELECT emp.name, dept.name FROM emp, dept WHERE emp.salary>1000 AND emp.dept_id=dept.id
Example 2: JOIN conversion for FROM clause
SELECT emp.name, dept_subquery.name FROM emp, (SELECT * FROM dept WHERE state='CA') dept_subquery WHERE emp.salary>1000 AND emp.dept_id=dept_subquery.id
SELECT emp.name, dept.name FROM emp, dept WHERE emp.salary>1000 AND emp.dept_id=dept.id AND dept.state='CA'
Also, it is possible for some IN-clauses. MariaDB calls it "table pullout optimization" [1]
Example 3: Table pullout optimization
SELECT emp.name FROM emp WHERE emp.salary>1000 AND emp.dept_id IN (SELECT id FROM dept WHERE state='CA')
SELECT emp.name FROM emp, dept WHERE emp.salary>1000 AND emp.dept_id=dept.id AND dept.state='CA'
[1] https://mariadb.com/kb/en/library/table-pullout-optimization/
[2] https://dev.mysql.com/doc/refman/8.0/en/subquery-optimization-with-exists.html
Partitions could be calculated not only on the server side, but on clients as well. This way thin clients could be able to send requests directly to affected nodes, reducing latency. To achieve this we first define serializable partition tree model:
interface PartitionNode { Collection<Integer> apply(Object[] args); } class PartitionGroup implements PartitionNode { Collection<Object> parts; // Concrete partitions, arguments or both. } class PartitionExpression implements PartitionNode { PartitionNode left; PartitioNode right; }
Partition tree is enriched with {{AffinityTopologyVersion}} it was built on, and affinity function descriptor. Descriptor can only be defined for well-known affinity functions, such as {{RendezvousAffinityFunction}}, and defines the logic on how to convert an object to partition.
class PartitionInfo { PartitionNode tree; AffintiyTopologyVersion affTopVer; AffinityFunctionDecriptor affFunc; }
When client executes request for the first time, the following sequence of actions happen:
Client saves partition tree and re-use it for further requests as follows:
If partition tree is extracted form the query successfully, then two types of optimizations are possible:
It is very important to let user know if partition pruning is applicable to query for performance tuning. For every cached two-step query we may expose the following information:
Also we need to let user disable this optimization. Otherwise a bug in implementation may lead to incorrect results with no workarounds. System on configuration property could be used for that.