...

ID

IEP-24

Author

Vladimir Ozerov Ozerov

Sponsor

Vladimir Ozerov Ozerov

Created

19 Jun 2018

Status

colour

Grey

title
Green

DRAFT

ACTIVE

Table of Contents

Motivation

The goal of this IEP is to avoid execution of SQL queries on partitions which do no contain relevant data.

SQL query may scan arbitrary set of cache values. In general case interested values may reside on every cluster node, so broadcast is needed. What is worse, when distributed joins are enabled, another broadcast for every broadcasted message may be needed. One of the most important optimization techniques Widely adopted optimization technique is so-called "partition pruning", allowing to calculate a set of partitions and nodes which will return empty result without actual query execution. Query parts are not executed on these nodes, increasing overall system throughput and providing nearly linear scaling.

Recent researches in distributed query processing assume that CPU is relatively cheap resource, while network IO and disk IO are equally costly operations. Partition pruning not only saves network requests, but also avoids unnecessary executionы of local query parts, preventing page cache trashing and additional disk reads (mostly random).

...

:

Try extracting information about target partitions from SQL query
If succeeded - execute query only over these partitions

When implemented it will provide the following benefits:

Improved query latency, as we will be able to skip much more partitions than now (only backup partitions are skipped for now)
Improved thin client latency - it will be possible to send requests to target node, thus saving one network hop.
Decreased page cache pressure - less data to read, less data to evict, less number of page locks
Improved system throughput, as less total CPU and IO operations will be required to execute optimized query

Partition pruning is already implemented in Apache Ignite in very simplified form [1]. Only WHERE condition with equality is considered and only for SQL queries without joins. We should expand it further.

[1] https://issues.apache.org/jira/browse/IGNITE-4509

Theory

This technique is used excessively by both distributed databases and classical RDBMS vendors (as a part of their partitioning/sharding solutions). First, query is analyzed and optionally re-written to allow for better partition pruning. Second, information of interesting partitions are extracted from query for every table (cache) and subquery (for complex multi-stage distributed queries).

Query Rewrite

WHERE conditions and JOIN-s are relatively simple targets for extract partition extraction. The main difficulty is subqueries, as they potentially introduce unlimited nesting what complicates analysis a lot. So the main goal of query rewrite is to minimize number of subqueries. Main techniques are JOIN conversion, correlation and materialization. Interestingly, all these techniques not only simplify partition extraction, but also allow optimizer to choose better execution plan for local execution, as JOINs are much easier to be analyzed and reordered than subqueries. As we shall see below, efforts are usually focused on non-correlated subqueries as they are the most costly for straightforward execution.

JOIN conversion

Design

In the following sections we first explain how partitions could be extracted from SQL parts, and how certain query rewrite techniques could help us with it. Then we will describe how extracted partition info is assembled in a form of tree. Finally, we explain how partition info will be passed to thin clients, and how users will be able to control and monitor partition pruning.

Extracting Partitions

Suppose that for every table we know it's affinity column. It is either PK or explicitly defined affinity column. Then we can analyze WHERE expressions related to the given tables to extract partition info. For JOINs we can compare affinity functions of two tables. If they are compatible, then we can "pass" partition information from one table to another.

Apache Ignite supports only hash-based sharding, so partition could be extracted only from equality conditions.

In further examples affinity function is denoted as '{...}'. Extracted partition is either concrete number, or query parameter index which will be converted to concrete number later. We will denote first type as "Pn" (e.g. P1, P2), and second as ":INDEX" (e.g. :1, :2). If partition cannot be extracted from condition, we will denote it as "ALL". Empty partition set is denoted as "EMPTY".

Equality

For equality we simply apply affinity function to the value.

Code Block

language	sql
title	Equality with constant on affinity column
linenumbers	true

SELECT * FROM emp WHERE emp.id = 100
=> {id=100} => P1

Code Block

language	sql
title	Equality with parameter on affinity column
linenumbers	true

SELECT * FROM emp WHERE emp.id = :1
=> {id=:1} => :1

Code Block

language	sql
title	Non-equality on affinity column
linenumbers	true

SELECT * FROM emp WHERE emp.id != 100
=> {name!=100} => (ALL)

Code Block

language	sql
title	Equality on non-affinity column
linenumbers	true

SELECT * FROM emp WHERE emp.name = :1
=> {name=:1} => (ALL)

IN, BETWEEN, Ranges

IN condition with list of values results in a merged list of affected partitions. IN condition with nested SELECT statement will not be supported for now.

Code Block

language	sql
title	Extracting partition from IN
linenumbers	true

SELECT * FROM emp WHERE emp.id IN (100, :1)
=> ({name=100}, {name=:1}) => (P1, :1)

Range conditions could be converted to IN statements if column is of integer type.

Code Block

language	sql
title	BETWEEN on affinity column
linenumbers	true

SELECT * FROM emp WHERE emp.id BETWEEN 100 AND 102
=> {id BETWEEN 100 AND 102} => {100, 101, 102} => (P1, P2, P3)

Code Block

language	sql
title	Range on integer affinity column
linenumbers	true

SELECT * FROM emp WHERE emp.id > 100 AND emp.id <= 102
=> {id > 100 AND id <= 102} => {101, 102} => (P1, P2)

Composite expressions (AND, OR)

Every WHERE expression can be represented as sequence conjunctive expressions separated by disjunctions. For two OR expressions we return disjunctive set. For AND expressions we return conjunctive set. Concrete partitions can be merged together. Partition placeholders can only be merged with ALL or EMPTY on the other side.

For AND condition there is a special rule. If two sides contain parameters, we cannot simplify them, because final result depend on resolved partitions. If left side contain parameters and right side contain only concrete values, then we can remove concrete values from the left side which are not present on the right side. And vice versa.

Code Block

language	sql
title	AND algebra
linenumbers	true

(P1) AND (P2) => ()
(P1, P2) AND (P3, P4) => ()
(P1, P2) AND (P2, P3) => (P2)
(P1) AND (ALL) => (P1)
(P1, P2) AND (ALL) => (P1, P2)
(P1, P2) AND () => ()

(:1) AND (:2) => (:1) AND (:2)
(:1) AND (ALL) => (:1)
(:1) AND () => ()

(P1) AND (:2) => (P1) AND (:2)
(P1, :1) AND (P2) => (:1) AND (P2)
(P1, :1) AND (P2, :2) => (P1, :1) AND (P2, :2)

Code Block

language	sql
title	OR algebra
linenumbers	true

(P1) OR (P2) => (P1, P2)
(P1) OR (ALL) => (ALL)
(P1) OR () => (P1)
(P1, P2) OR (P2, P3) => (P1, P2, P3)

(:1) OR (:2) => (:1, :2)
(P1, :1) OR (P2, :2) => (P1, P2, :1, :2)

Joins

Joins are very common, so it is crucial to support partition extraction for them as well. General solution might be extremely complex, so we need to define reasonable bounds where optimization is applicable, and improve them iteratively in future. We start with query AST obtained from parser. Proposed flow to extract partitions is explained below. Some of these steps could be merged to improve performance.

Look for non-equality JOIN conditions. When one is found, exit. This way join type space is reduced to equijoins.
Build co-location tree, which is another tree explaining how PARTITIONED tables are joined together
1. Copy current JOIN AST into separate tree
2. If table is REPLICATED and do not have node filter, then mark it as "ANY" and remove from the tree, as it doesn't affect JOIN outcome. Otherwise - exit, no need to bother with custom filters.
3. If CROSS JOIN is found, then exit (might be improved in future)
4. If tables are joined on their affinity columns and has equal affinity functions, then mark them as belonging to the same co-location group. Otherwise - assign them to different co-location groups. Repeat this for all tables and joins in the tree. Functions are equal if and only if the following is true:
  1. Affinity function is deterministic (e.g. RendezvousAffintiyFunction is deterministic, while FairAffinityFunction is not)
  2. Both affinity functions are equal
  3. There are no custom node filters
  4. There are no custom affinity key mappers
5. Every subquery is assigned it's own co-location group unconditionally (may be improved in future)
6. At this point we have a co-location tree with only PARTITIONED caches, only equi-joins, where every table is assigned to a single co-location group.
Extract partitions from expression tree with two additional rules:
1. Every group of partitions is assigned respective co-location group from co-location tree
2. REPLICATED caches with "ANY" policy should be eliminated as follows:
  Code Block
  language sql
  title ANY algebra
  linenumbers true
  (P1, :2) AND (ANY) => (P1, :2) (P1, :2) OR (ANY) => (P1, :2)
3. If partition tree contain rules from different co-location groups, then exit.
At this point we have partition tree over a single co-location group. All outstanding arguments could be passed through the same affinity function to get target partitions.

Subquery rewrite

It is not easy to extract partitions from subqueries. But we can rewrite certain subqueries to joins with a technique called "join conversion". Important prerequisite is that number of resulting rows is not changed.The goal is to convert a subquery to join if possible. This is possible in most cases for subqueries in SELECT (aka "derived table") and FROM clauses. Also, it is possible for some IN-clauses.

Example 1: JOIN conversion for derived table

...

Code Block

language	sql
title	After
linenumbers	true

SELECT emp.name, dept.name
FROM emp, dept
WHERE emp.salary>1000 AND emp.dept_id=dept.id

Example 2: JOIN conversion for FROM clause

...

Code Block

language	sql
title	After
linenumbers	true

SELECT emp.name, dept.name
FROM emp, dept
WHERE emp.salary>1000 AND emp.dept_id=dept.id AND dept.state='CA'

Correlation

TODO

Materialization

TODO

Partition Extraction

TOOD

Proposed Changes

Also, it is possible for some IN-clauses. MariaDB calls it "table pullout optimization" [1]

Example 3: Table pullout optimization

Code Block

language	sql
title	Before
linenumbers	true

SELECT emp.name
FROM emp
WHERE emp.salary>1000 AND emp.dept_id IN (SELECT id FROM dept WHERE state='CA')

Code Block

language	sql
title	After
linenumbers	true

SELECT emp.name
FROM emp, dept
WHERE emp.salary>1000 AND emp.dept_id=dept.id AND dept.state='CA'

[1] https://mariadb.com/kb/en/library/table-pullout-optimization/
[2] https://dev.mysql.com/doc/refman/8.0/en/subquery-optimization-with-exists.html

Thin Clients

Partitions could be calculated not only on the server side, but on clients as well. This way thin clients could be able to send requests directly to affected nodes, reducing latency. To achieve this we first define serializable partition tree model:

Code Block

language	java

interface PartitionNode {
    Collection<Integer> apply(Object[] args);
}

class PartitionGroup implements PartitionNode {
    Collection<Object> parts; // Concrete partitions, arguments or both.
}

class PartitionExpression implements PartitionNode {
    PartitionNode left;
    PartitionNode right;
}

Partition tree is enriched with AffinityTopologyVersion it was built on, and affinity function descriptor. Descriptor can only be defined for well-known affinity functions, such as RendezvousAffinityFunction.

Code Block

language	java

class PartitionInfo {
    PartitionNode tree;
    AffintiyTopologyVersion affTopVer;
    AffinityFunctionDecriptor affFunc;
}

When client executes request for the first time, the following sequence of actions happen:

Query request is received by the server
Partition tree is built using specific affinity topology version.
Query is executed as usual
If partition tree exists and is built with well-known affinity function, then it is attached to query response.

Client saves partition tree and re-use it for further requests as follows:

Query arguments are applied on the client.
Target node is determined from the list of partitions. We assume that partition distribution for the given affinity topology versions has been requested in advance similarly how we do that for C++ thin client.
If only one node is resolved, send request to it. If several nodes are resolved - send request to random node from the list.
Request is executed on the server and current affinity topology version is attached to the response. If it differs from the one received from the client, new partition tree is built and attached.
Client checks if current affinity topology version differs. If yes - old partition tree is invalidated.

Optimizations

If partition tree is extracted form the query successfully, then two types of optimizations are possible:

If tree evaluation returned empty partition set, return empty result set immediately without actual query execution
If tree evaluation returned one partition, then all data reside on a single node. Convert query to "local" and execute it on target node without two-phase flow
If tree evaluation returned several partitions, and all of them appear to be on the same node, then try to execute query speculatively on a single node, provided that partitions are still on that node. Fallback to normal execution mode in case of concurrent eviction.

Management and Monitoring

It is very important to let user know if partition pruning is applicable to query for performance tuning. For every cached two-step query we may expose the following information:

Whether partition pruning is applicable
Formatted partition tree
Affinity topology version of the plan
If not applicable - explain why (e.g. non-equijoin, incompatible affinity functions, etc.)

Also we need to let user disable this optimization. Otherwise a bug in implementation may lead to incorrect results with no workarounds. System on configuration property could be used for that.TODO

Tickets

Jira

server	ASF JIRA
columns	key,summary,type,created,updated,due,assignee,reporter,priority,status,resolution
maximumIssues	20
jqlQuery	project = Ignite AND labels IN (iep-24) ORDER BY status
serverId	5aa69414-a9e9-3523-82ec-879b028fb15b

Page tree

Versions Compared

Old Version 7

New Version Current

Key

Motivation

Theory

Query Rewrite

JOIN conversion

Design

Extracting Partitions

Equality

IN, BETWEEN, Ranges

Composite expressions (AND, OR)

Joins

Subquery rewrite

Correlation

Materialization

Partition Extraction

Proposed Changes

Thin Clients

Optimizations

Management and Monitoring

Tickets

Page tree

Page History

Versions Compared

Old Version 7

New Version Current

Key

Motivation

Theory

Query Rewrite

JOIN conversion

Design

Extracting Partitions

Equality

IN, BETWEEN, Ranges

Composite expressions (AND, OR)

Joins

Subquery rewrite

Correlation

Materialization

Partition Extraction

Proposed Changes

Thin Clients

Optimizations

Management and Monitoring

Tickets