ID	IEP-24
Author	Vladimir Ozerov
Sponsor	Vladimir Ozerov
Created	19 Jun 2018
Status	ACTIVE

Motivation

The goal of this IEP is to avoid executing SQL queries on partitions which do not contain relevant data.

An SQL query may scan an arbitrary set of cache values. In the general case the values of interest may reside on every cluster node, so the query has to be broadcast. A widely adopted optimization technique is so-called "partition pruning":

Try to extract information about target partitions from the SQL query
If extraction succeeds, execute the query only over these partitions

When implemented it will provide the following benefits:

Improved query latency, as we will be able to skip many more partitions than now (currently only backup partitions are skipped)
Improved thin client latency - it will be possible to send requests directly to the target node, thus saving one network hop
Decreased page cache pressure - less data to read, less data to evict, fewer page locks
Improved system throughput, as fewer total CPU and IO operations will be required to execute an optimized query

Partition pruning is already implemented in Apache Ignite in a very simplified form [1]. Only WHERE conditions with equality are considered, and only for SQL queries without joins. We should expand it further.

[1] https://issues.apache.org/jira/browse/IGNITE-4509

Design

In the following sections we first explain how partitions can be extracted from parts of an SQL query, and how certain query rewrite techniques can help with this. Then we describe how the extracted partition info is assembled in the form of a tree. Then we discuss why partition extraction should be performed twice: before the split for the whole query, and after the split for the query parts. Finally, we explain how partition info will be passed to thin clients, and how users will be able to control and monitor partition pruning.

Extracting Partitions

Suppose that for every table we know its affinity column. It is either the PK or an explicitly defined affinity column. Then we can analyze WHERE expressions related to the given tables to extract partition info. For JOINs we can compare the affinity functions of the two tables. If they are compatible, then we can "pass" partition information from one table to the other.

Apache Ignite supports only hash-based sharding, so partitions can be extracted only from equality conditions.

In the examples below the affinity function is denoted as '{...}'. An extracted partition is either a concrete number, or a query parameter index which will be converted to a concrete number later. We denote the first type as "Pn" (e.g. P1, P2), and the second as ":INDEX" (e.g. :1, :2). If a partition cannot be extracted from a condition, we denote it as "ALL". An empty partition set is denoted as "EMPTY".

Equality

For equality we simply apply affinity function to the value.

Equality with constant on affinity column

SELECT * FROM emp WHERE emp.id = 100
=> {id=100} => P1

Equality with parameter on affinity column

SELECT * FROM emp WHERE emp.id = :1
=> {id=:1} => :1

Non-equality on affinity column

SELECT * FROM emp WHERE emp.id != 100
=> {id!=100} => (ALL)

Equality on non-affinity column

SELECT * FROM emp WHERE emp.name = :1
=> {name=:1} => (ALL)
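
The four cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `PARTITIONS`, `Param`, `affinity`, `extract_equality` and `resolve` are hypothetical names introduced here, and the modulo-based `affinity` is a toy stand-in for the real affinity function, not the one Ignite uses.

```python
from dataclasses import dataclass

PARTITIONS = 1024  # assumed partition count, for illustration only
ALL = "ALL"        # marker: no partition could be extracted


@dataclass(frozen=True)
class Param:
    """Placeholder for a query parameter, e.g. Param(1) stands for :1."""
    index: int


def affinity(value: int) -> int:
    # Toy stand-in for the affinity function denoted '{...}' in the text.
    return value % PARTITIONS


def extract_equality(column, op, operand, affinity_column):
    """Return a concrete partition, a Param placeholder, or ALL."""
    if column != affinity_column or op != "=":
        return ALL  # non-equality, or equality on a non-affinity column
    if isinstance(operand, Param):
        return operand  # resolved later, once query arguments are known
    return affinity(operand)


def resolve(partition, args):
    """Turn a placeholder into a concrete partition using 1-based arguments."""
    if isinstance(partition, Param):
        return affinity(args[partition.index - 1])
    return partition
```

With this sketch, `extract_equality("id", "=", Param(1), "id")` yields the placeholder, and `resolve` converts it to a concrete partition when the query is executed with actual arguments.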

IN, BETWEEN, Ranges

IN condition with list of values results in a merged list of affected partitions. IN condition with nested SELECT statement will not be supported for now.

Extracting partition from IN

SELECT * FROM emp WHERE emp.id IN (100, :1)
=> ({id=100}, {id=:1}) => (P1, :1)

Range conditions could be converted to IN statements if column is of integer type.

BETWEEN on affinity column

SELECT * FROM emp WHERE emp.id BETWEEN 100 AND 102
=> {id BETWEEN 100 AND 102} => {100, 101, 102} => (P1, P2, P3)

Range on integer affinity column

SELECT * FROM emp WHERE emp.id > 100 AND emp.id <= 102
=> {id > 100 AND id <= 102} => {101, 102} => (P1, P2)
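
A minimal sketch of the range-to-IN conversion for integer columns follows. The function name `range_to_values` and the `max_values` cut-off are assumptions made for this example; the proposal itself does not say how wide a range may be before falling back to ALL.

```python
def range_to_values(low, high, low_inclusive=True, high_inclusive=True,
                    max_values=16):
    """Expand an integer range into an explicit value list.

    Returns a list of values, or None when the range is wider than
    max_values (hypothetical threshold) and should be treated as ALL.
    """
    start = low if low_inclusive else low + 1
    end = high if high_inclusive else high - 1
    if end < start:
        return []  # empty range, hence an EMPTY partition set
    if end - start + 1 > max_values:
        return None  # too many values to enumerate
    return list(range(start, end + 1))
```

Each value in the resulting list would then be passed through the affinity function, exactly as for an IN list, and duplicate partitions merged.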

Composite expressions (AND, OR)

Every WHERE expression can be represented as a sequence of conjunctive expressions separated by disjunctions. For two OR expressions we return a disjunctive set. For AND expressions we return a conjunctive set. Concrete partitions can be merged together. Partition placeholders can only be merged with ALL or EMPTY on the other side.

For the AND condition there is a special rule. If both sides contain parameters, we cannot simplify them, because the final result depends on the resolved partitions. If the left side contains parameters and the right side contains only concrete values, then we can remove from the left side those concrete values which are not present on the right side. And vice versa.

AND algebra

(P1) AND (P2) => ()
(P1, P2) AND (P3, P4) => ()
(P1, P2) AND (P2, P3) => (P2)
(P1) AND (ALL) => (P1)
(P1, P2) AND (ALL) => (P1, P2)
(P1, P2) AND () => ()

(:1) AND (:2) => (:1) AND (:2)
(:1) AND (ALL) => (:1)
(:1) AND () => ()

(P1) AND (:2) => (P1) AND (:2)
(P1, :1) AND (P2) => (:1) AND (P2)
(P1, :1) AND (P2, :2) => (P1, :1) AND (P2, :2)

OR algebra

(P1) OR (P2) => (P1, P2)
(P1) OR (ALL) => (ALL)
(P1) OR () => (P1)
(P1, P2) OR (P2, P3) => (P1, P2, P3)


(:1) OR (:2) => (:1, :2)
(P1, :1) OR (P2, :2) => (P1, P2, :1, :2)
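
The two algebras above can be sketched as follows. This is an illustrative model only: a partition set is a `frozenset` of integers (concrete partitions) and `Param` objects (placeholders), `ALL` is a sentinel string, and an AND that cannot be simplified is returned as a hypothetical `("AND", left, right)` tuple. The sketch handles flat sets only, not nested unsimplified nodes.

```python
from dataclasses import dataclass

ALL = "ALL"


@dataclass(frozen=True)
class Param:
    index: int  # Param(1) stands for :1


def has_params(parts):
    return any(isinstance(p, Param) for p in parts)


def and_(left, right):
    if left == ALL:
        return right
    if right == ALL:
        return left
    if not left or not right:
        return frozenset()  # anything AND EMPTY is EMPTY
    l_par, r_par = has_params(left), has_params(right)
    if not l_par and not r_par:
        return left & right  # concrete partitions: plain intersection
    if l_par and r_par:
        return ("AND", left, right)  # depends on resolved parameters
    # One side is concrete only: drop concrete values from the
    # parameterized side that are absent on the concrete side.
    if l_par:
        left = frozenset(p for p in left if isinstance(p, Param) or p in right)
    else:
        right = frozenset(p for p in right if isinstance(p, Param) or p in left)
    return ("AND", left, right)


def or_(left, right):
    if left == ALL or right == ALL:
        return ALL
    return left | right  # union of concrete partitions and placeholders
```

For example, `and_(frozenset({1, Param(1)}), frozenset({2}))` drops P1 from the left side and returns the unsimplified pair `(:1) AND (P2)`, matching the corresponding row of the AND algebra.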

Joins

Joins are very common, so it is crucial to support partition extraction for them as well. A general solution might be extremely complex, so we need to define reasonable bounds within which we can operate, and widen them iteratively in the future. We start with the query AST obtained from the parser. The proposed flow to extract partitions is explained below. Some of the steps could be merged to improve performance.

Look for non-equality JOIN conditions. When one is found, exit. This way join type space is reduced to equijoins.
Build co-location tree, which is another tree showing how PARTITIONED tables are joined together
1. Copy current JOIN AST into separate tree
2. If a table is REPLICATED and does not have a node filter, then mark it as "ANY" and remove it from the tree, as it doesn't affect the JOIN outcome. If a REPLICATED table does have a node filter - exit, no need to bother with custom filters.
3. If CROSS JOIN is found, then exit (might be improved in future)
4. If tables are joined on their affinity columns and have equal affinity functions, then mark them as belonging to the same co-location group. Otherwise - assign them to different co-location groups. Repeat this for all tables and joins in the tree. Two affinity functions are considered equal if and only if all of the following are true:
  1. Affinity function is deterministic (e.g. RendezvousAffinityFunction is deterministic, while FairAffinityFunction is not)
  2. Both affinity functions are equal
  3. There are no custom node filters
  4. There are no custom affinity key mappers
5. Every subquery is assigned its own co-location group unconditionally (may be improved in future)
6. At this point we have a co-location tree with only PARTITIONED caches, only equi-joins, where every table is assigned a single co-location group.
Extract partitions from expression tree with two additional rules:
1. Every partition group is assigned respective co-location group from co-location tree
2. REPLICATED caches with "ANY" policy should be eliminated as follows:
  ANY algebra
```
(P1, :2) AND (ANY) => (P1, :2)
(P1, :2) OR (ANY) => (P1, :2)
```
3. If the partition tree contains rules from different co-location groups, then exit.
At this point we have a partition tree over a single co-location group. All outstanding arguments could be passed through the same affinity function to get target partitions.
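
The co-location part of this flow can be sketched as a small union-find over the joined tables. Everything below is an assumption made for illustration: the `Table` fields, the `compatible` check, and the representation of joins as `(left_table, left_column, right_table, right_column)` tuples are hypothetical and do not mirror Ignite internals. Non-equality joins, CROSS JOINs and subqueries are assumed to have been handled before this step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str
    affinity_column: str = "id"
    affinity_fn: str = "rendezvous"  # hypothetical key: function + its config
    deterministic: bool = True
    replicated: bool = False
    node_filter: bool = False
    key_mapper: bool = False         # custom affinity key mapper present


def compatible(a, b):
    """Equality of affinity functions as defined in step 4."""
    return (a.deterministic and b.deterministic
            and a.affinity_fn == b.affinity_fn
            and not (a.node_filter or b.node_filter)
            and not (a.key_mapper or b.key_mapper))


def colocation_groups(tables, joins):
    """Map each PARTITIONED table to a group id, or return None to exit."""
    by_name = {t.name: t for t in tables}
    if any(t.replicated and t.node_filter for t in tables):
        return None  # REPLICATED table with a custom filter: give up
    # REPLICATED tables are "ANY" and are left out of the tree.
    parent = {t.name: t.name for t in tables if not t.replicated}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for lt, lc, rt, rc in joins:
        if lt not in parent or rt not in parent:
            continue  # join with an "ANY" table does not affect groups
        a, b = by_name[lt], by_name[rt]
        if lc == a.affinity_column and rc == b.affinity_column and compatible(a, b):
            parent[find(lt)] = find(rt)
    return {name: find(name) for name in parent}


def single_group(groups):
    """True when pruning may proceed: all tables share one group."""
    return groups is not None and len(set(groups.values())) <= 1
```

When `single_group` returns False, the partition tree would mix rules from different co-location groups, which corresponds to the "exit" case described above.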

Subquery rewrite

It is not easy to extract partitions from subqueries. But we can rewrite certain subqueries to joins with a technique called "join conversion". Important prerequisite is that number of resulting rows is not changed.

Example 1: JOIN conversion for derived table

Before

SELECT emp.name, (SELECT dept.name FROM dept WHERE emp.dept_id=dept.id)
FROM emp
WHERE emp.salary>1000

After

SELECT emp.name, dept.name
FROM emp, dept
WHERE emp.salary>1000 AND emp.dept_id=dept.id

Example 2: JOIN conversion for FROM clause

Before

SELECT emp.name, dept_subquery.name
FROM emp, (SELECT * FROM dept WHERE state='CA') dept_subquery
WHERE emp.salary>1000 AND emp.dept_id=dept_subquery.id

After

SELECT emp.name, dept.name
FROM emp, dept
WHERE emp.salary>1000 AND emp.dept_id=dept.id AND dept.state='CA'

The same conversion is also possible for some IN clauses. MariaDB calls it "table pullout optimization" [1].

Example 3: Table pullout optimization

Before

SELECT emp.name
FROM emp
WHERE emp.salary>1000 AND emp.dept_id IN (SELECT id FROM dept WHERE state='CA')

After

SELECT emp.name
FROM emp, dept
WHERE emp.salary>1000 AND emp.dept_id=dept.id AND dept.state='CA'

[1] https://mariadb.com/kb/en/library/table-pullout-optimization/
[2] https://dev.mysql.com/doc/refman/8.0/en/subquery-optimization-with-exists.html

Partition Pruning on Thin Clients

TODO

Tickets

