...

  1. Improved query latency, as we will be able to skip many more partitions than now (currently only backup partitions are skipped)
  2. Improved thin client latency - it will be possible to send requests to target node, thus saving one network hop.
  3. Decreased page cache pressure - less data to read, less data to evict, fewer page locks
  4. Improved system throughput, as fewer total CPU and IO operations will be required to execute an optimized query


Partition pruning is already implemented in Apache Ignite in a very simplified form [1]. Only WHERE conditions with equality are considered, and only for SQL queries without joins. We should expand it further.

[1] https://issues.apache.org/jira/browse/IGNITE-4509

...

In the following sections we first explain how partitions could be extracted from SQL parts, and how certain query rewrite techniques could help us with it. Then we describe how the extracted partition info is assembled in the form of a tree. Next we discuss why partition extraction should be performed twice - before the split for the whole query, and after the split for query parts. Finally, we explain how partition info will be passed to thin clients, and how users will be able to control and monitor partition pruning.

...

Code Block (SQL): OR algebra
(P1) OR (P2) => (P1, P2)
(P1) OR (ALL) => (ALL)
(P1) OR () => (P1)
(P1, P2) OR (P2, P3) => (P1, P2, P3)


(:1) OR (:2) => (:1, :2)
(P1, :1) OR (P2, :2) => (P1, P2, :1, :2)
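
The OR rules above amount to a set union in which ALL absorbs everything and the empty set is neutral. Below is a minimal sketch of that idea; the `PartSet` class and its string-based items are assumptions made for illustration and are not part of Ignite.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical model of an extracted partition set; not an Ignite class. */
final class PartSet {
    /** True if every partition may be hit (the "ALL" case). */
    final boolean all;

    /** Concrete partitions ("P1") and unresolved arguments (":1"), kept as strings for brevity. */
    final Set<String> items;

    private PartSet(boolean all, Set<String> items) {
        this.all = all;
        this.items = items;
    }

    static PartSet all() {
        return new PartSet(true, new LinkedHashSet<>());
    }

    static PartSet of(String... items) {
        Set<String> set = new LinkedHashSet<>();
        for (String item : items)
            set.add(item);
        return new PartSet(false, set);
    }

    /** OR is a union: ALL absorbs the other side, the empty set changes nothing. */
    PartSet or(PartSet other) {
        if (all || other.all)
            return all();

        Set<String> res = new LinkedHashSet<>(items);
        res.addAll(other.items);
        return new PartSet(false, res);
    }
}
```

For example, `PartSet.of("P1", ":1").or(PartSet.of("P2", ":2"))` keeps both concrete partitions and both arguments, which are resolved only when query arguments become known.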

...

Joins are very common, so it is crucial to support partition extraction for them as well. A general solution might be extremely complex, so we need to define reasonable bounds where the optimization is applicable, and improve them iteratively in the future. We start with the query AST obtained from the parser. The proposed flow to extract partitions is explained below. Some of these steps could be merged to improve performance.

  1. Look for non-equality JOIN conditions. When one is found, exit. This way join type space is reduced to equijoins.
  2. Build co-location tree, which is another tree explaining how PARTITIONED tables are joined together
    1. Copy current JOIN AST into separate tree
    2. If a table is REPLICATED and does not have a node filter, then mark it as "ANY" and remove it from the tree, as it doesn't affect the JOIN outcome. Otherwise - exit, no need to bother with custom filters.
    3. If CROSS JOIN is found, then exit (might be improved in future)
    4. If tables are joined on their affinity columns and have equal affinity functions, then mark them as belonging to the same co-location group. Otherwise - assign them to different co-location groups. Repeat this for all tables and joins in the tree. Functions are defined equal if and only if the following is true:
      1. Affinity function is deterministic (e.g. RendezvousAffinityFunction is deterministic, while FairAffinityFunction is not)
      2. Both affinity functions are equal
      3. There are no custom node filters
      4. There are no custom affinity key mappers
    5. Every subquery is assigned its own co-location group unconditionally (may be improved in future)
    6. At this point we have a co-location tree with only PARTITIONED caches, only equi-joins, where every table is assigned to a single co-location group.
  3. Extract partitions from expression tree with two additional rules:
    1. Every group of partitions is assigned the respective co-location group from the co-location tree
    2. REPLICATED caches with "ANY" policy should be eliminated as follows:

      Code Block (SQL): ANY algebra
      (P1, :2) AND (ANY) => (P1, :2)
      (P1, :2) OR (ANY) => (P1, :2)


    3. If the partition tree contains rules from different co-location groups, then exit.

  4. At this point we have a partition tree over a single co-location group. All outstanding arguments could be passed through the same affinity function to get target partitions.
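
The co-location check from step 2 can be sketched as a simple predicate over two table descriptors. This is only an illustration: `AffinityDesc` and `CoLocation` are hypothetical names, and "equal affinity functions" is modelled here as the same function class with the same partition count, which is an assumption rather than a definition taken from Ignite.

```java
/** Hypothetical description of a table's affinity configuration; not an Ignite class. */
final class AffinityDesc {
    final String funcClass;          // e.g. "RendezvousAffinityFunction"
    final int parts;                 // number of partitions
    final boolean deterministic;     // false for functions such as FairAffinityFunction
    final boolean customNodeFilter;
    final boolean customKeyMapper;

    AffinityDesc(String funcClass, int parts, boolean deterministic,
        boolean customNodeFilter, boolean customKeyMapper) {
        this.funcClass = funcClass;
        this.parts = parts;
        this.deterministic = deterministic;
        this.customNodeFilter = customNodeFilter;
        this.customKeyMapper = customKeyMapper;
    }
}

final class CoLocation {
    /** True if two joined tables may share one co-location group. */
    static boolean sameGroup(AffinityDesc a, AffinityDesc b, boolean joinedOnAffinityCols) {
        if (!joinedOnAffinityCols)
            return false;

        // Conditions 1, 3 and 4 from the list above must hold for both tables.
        if (!usable(a) || !usable(b))
            return false;

        // Condition 2: assumed here to mean same function class and partition count.
        return a.funcClass.equals(b.funcClass) && a.parts == b.parts;
    }

    private static boolean usable(AffinityDesc d) {
        return d.deterministic && !d.customNodeFilter && !d.customKeyMapper;
    }
}
```

Any table pair that fails the predicate ends up in different co-location groups, and a partition tree that mixes groups is rejected in step 3.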

...

Code Block (Java)
interface PartitionNode {
    Collection<Integer> apply(Object[] args);
}


class PartitionGroup implements PartitionNode {
    Collection<Object> parts; // Concrete partitions, arguments or both.
}


class PartitionExpression implements PartitionNode {
    PartitionNode left;
    PartitionNode right;
}

The partition tree is enriched with the {{AffinityTopologyVersion}} it was built on, and an affinity function descriptor. The descriptor can only be defined for well-known affinity functions, such as {{RendezvousAffinityFunction}}, and defines the logic for converting an object to a partition.

Code Block (Java)
class PartitionInfo {
    PartitionNode tree;
    AffinityTopologyVersion affTopVer;
    AffinityFunctionDescriptor affFunc;
}
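
To make the evaluation of such a tree concrete, here is a minimal runnable sketch of the interfaces above. Several parts are assumptions added for illustration only: the `Arg` marker for query arguments, the `and` flag on `PartitionExpression` (the outline above does not show how the operator is stored), and the `keyToPart` function, which stands in for the affinity function descriptor. AND is treated as an intersection of resolved partitions, and the ALL case is left out for brevity.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.ToIntFunction;

interface PartitionNode {
    Collection<Integer> apply(Object[] args);
}

/** Hypothetical marker for a query argument referenced by index, e.g. ":1". */
final class Arg {
    final int idx;

    Arg(int idx) {
        this.idx = idx;
    }
}

final class PartitionGroup implements PartitionNode {
    final List<Object> parts;              // Integer = concrete partition, Arg = query argument
    final ToIntFunction<Object> keyToPart; // assumed stand-in for the affinity function descriptor

    PartitionGroup(List<Object> parts, ToIntFunction<Object> keyToPart) {
        this.parts = parts;
        this.keyToPart = keyToPart;
    }

    @Override public Collection<Integer> apply(Object[] args) {
        Set<Integer> res = new TreeSet<>();

        for (Object part : parts) {
            if (part instanceof Arg)
                res.add(keyToPart.applyAsInt(args[((Arg)part).idx])); // resolve argument to partition
            else
                res.add((Integer)part);
        }

        return res;
    }
}

final class PartitionExpression implements PartitionNode {
    final PartitionNode left;
    final PartitionNode right;
    final boolean and; // assumption: true = AND (intersection), false = OR (union)

    PartitionExpression(PartitionNode left, PartitionNode right, boolean and) {
        this.left = left;
        this.right = right;
        this.and = and;
    }

    @Override public Collection<Integer> apply(Object[] args) {
        Set<Integer> res = new TreeSet<>(left.apply(args));
        Collection<Integer> rightParts = right.apply(args);

        if (and)
            res.retainAll(rightParts);
        else
            res.addAll(rightParts);

        return res;
    }
}
```

The tree is built once per cached query plan, while `apply` is called for every execution with the actual arguments, so the same tree can serve many requests until the topology version changes.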

...

  1. Query arguments are applied on the client. 
  2. Target node is determined from the list of partitions. We assume that the partition distribution for the given affinity topology version has been requested in advance, similarly to how we do that for the C++ thin client.
  3. If only one node is resolved, send request to it. If several nodes are resolved - send request to random node from the list. 
  4. Request is executed on the server and current affinity topology version is attached to the response. If it differs from the one received from the client, new partition tree is built and attached.
  5. Client checks if current affinity topology version differs. If yes - old partition tree is invalidated.
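
Steps 2 and 3 of the client flow above can be sketched as follows. `ThinClientRouter`, the string node identifiers and the `partToNode` map are hypothetical; the map stands for the partition distribution the client is assumed to have requested in advance.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeSet;

/** Hypothetical client-side routing helper; not part of any Ignite thin client. */
final class ThinClientRouter {
    /**
     * Picks the node to send a request to.
     *
     * @return Target node id, or null when no partition mapping is known,
     *         in which case the caller would use its default connection.
     */
    static String pickNode(Collection<Integer> parts, Map<Integer, String> partToNode, Random rnd) {
        TreeSet<String> nodes = new TreeSet<>();

        for (Integer part : parts) {
            String node = partToNode.get(part);

            if (node != null)
                nodes.add(node);
        }

        if (nodes.isEmpty())
            return null;

        // One node: it is returned directly. Several nodes: a random one is chosen.
        List<String> candidates = new ArrayList<>(nodes);

        return candidates.get(rnd.nextInt(candidates.size()));
    }
}
```

When the response carries a different affinity topology version, the client would drop both the cached partition tree and this partition-to-node map before routing further requests.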

Optimizations

If the partition tree is extracted from the query successfully, then the following optimizations are possible:

  1. If tree evaluation returned empty partition set, return empty result set immediately without actual query execution
  2. If tree evaluation returned one partition, then all data reside on a single node. Convert query to "local" and execute it on target node without two-phase flow
  3. If tree evaluation returned several partitions, and all of them appear to be on the same node, then try to execute the query speculatively on a single node, provided that the partitions are still on that node. Fall back to normal execution mode in case of concurrent eviction.
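
The choice between these cases could look roughly like the sketch below. The `ExecMode` values and the `PruningPlanner` class are illustrative names only, and the partition-to-node map is assumed to reflect the topology version the tree was built on.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical execution modes; the names are illustrative, not Ignite API. */
enum ExecMode { EMPTY_RESULT, LOCAL_SINGLE_NODE, SPECULATIVE_SINGLE_NODE, TWO_PHASE }

final class PruningPlanner {
    /**
     * @param parts Distinct partitions returned by tree evaluation.
     * @param partToNode Assumed partition-to-node mapping for the current topology.
     */
    static ExecMode choose(Collection<Integer> parts, Map<Integer, String> partToNode) {
        if (parts.isEmpty())
            return ExecMode.EMPTY_RESULT; // nothing can match, skip execution

        if (parts.size() == 1)
            return ExecMode.LOCAL_SINGLE_NODE; // all data resides on one node

        Set<String> nodes = new HashSet<>();

        for (Integer part : parts)
            nodes.add(partToNode.get(part));

        // Several partitions on one known node: try the speculative single-node path.
        if (nodes.size() == 1 && !nodes.contains(null))
            return ExecMode.SPECULATIVE_SINGLE_NODE;

        return ExecMode.TWO_PHASE; // normal two-phase execution
    }
}
```

The speculative mode still needs a runtime check that the partitions were not evicted while the query was running; that check is outside this sketch.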

Management and Monitoring

For performance tuning, it is very important to let the user know whether partition pruning is applicable to a query. For every cached two-step query we may expose the following information:

  1. Whether partition pruning is applicable
  2. Formatted partition tree
  3. Affinity topology version of the plan
  4. If not applicable - explain why (e.g. non-equijoin, incompatible affinity functions, etc.)

We also need to let the user disable this optimization. Otherwise a bug in the implementation may lead to incorrect results with no workarounds. A system or configuration property could be used for that.

Tickets

ASF JIRA query: project = Ignite AND labels IN (iep-24) ORDER BY status