Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Current SQL engine has a number of critical limitations:

  • Query execution is hard-coded to a primitive map-reduce flow (SQL query is split into a 'map query' and 'reduce query'). This flow is fundamentally incomplete as it is impossible to execute the following queries:
    • SELECT t1.a, t2.b FROM t1, t2 WHERE t1.id = t2.id - silently returns wrong results in case of t1 and t2 are not co-located
    • SELECT * FROM person p WHERE p.salary > (SELECT avg(salary) from person) - doesn’t work at all because it cannot be executed in two steps
  • No data co-location control, i.e. arbitrary data can be returned silently
  • Low control on how query executes internally, as a result we have limited possibility to implement improvements/fixes.
  • Limited execution modes: either two-phase execution (default) or "distributed joins" which adds one more phase. This leads to the impossibility "by design" of executing  some queries, see 
    • Jira
      serverASF JIRA
      columnskey,summary,type,created,updated,due,assignee,reporter,priority,status,resolution
      serverId5aa69414-a9e9-3523-82ec-879b028fb15b
      keyIGNITE-11448
  • Query execution is done using H2 query engine which itself has several issues
    • Low control on how query executes internally, as a result, we have limited possibility to implement improvements/fixes (H2 is outside of ASF)
    • H2 is a local database and has no notion of distributed query execution
    • H2 lacks the
    Lack of
    • proper planner which will take in count both data distribution and data statistics
    • H2 optimizer is very primitive. It can do only predicates push down, join order choosing and also some minor optimizations. It lacks
    of
    • many useful optimizations like
    this one 
    • the following 
      Jira
      serverASF JIRA
      columnskey,summary,type,created,updated,due,assignee,reporter,priority,status,resolution
      serverId5aa69414-a9e9-3523-82ec-879b028fb15b
      keyIGNITE-6085

Classical database query execution is done using roughly the following steps [1], [2]

  1. SQL text is parsed and validated, which produces an Abstract Syntax Tree (AST) representing the query
  2. Query rewrite (output: rewritten SQL AST) - need to (for example) turn DML queries into selects, returning set of rows to change (queries like 'INSERT INTO table1 SELECT * FROM table 2').
  3. AST is transformed into Query Plan (output: query execution graph, which is identical to the original query but operates in terms of relational algebra)
  4. Query execution graph is optimized (output: optimized query execution graph with is semantically identical to the original query)
  5. Query execution (output: resulting cursor)

The key point in the aforementioned plan is to have a relational algebra execution plan which can undergo arbitrary equivalence transformations by the rules of relational algebra. There is a well-studied optimization approach used in many production systems [3] for optimizing query execution plans. 

Description

The approach to solve the limitations implies more complex execution flow that brings a new abstraction: idea of this IEP is to introduce all missing intermediate steps into the query execution flow and operate on query execution graph.

This graph consists of all execution steps (relational operators) like join, filter, sort, etc. The execution graph may be transformed into another one saving query semantic during query optimisation process using relational algebra and transformation rules. After transformation the graph is split into a set of dependent subgraphs where an output of dependent subgraph is an aggregated input of depending one (in other words we have more than one map-reduce steps).

Optimized query execution graph (or query plan) used to build final execution tasks.

So, the query flow gets a number of additional steps and will consist of next steps:

...

.

...

...

Example:


Initial query:

SELECT t1.name, t2.name as projectName FROM Persons t1, Projects t2 where t1.id == t2.responsiblePersonId

...

http://apache-ignite-developers.2346864.n4.nabble.com/New-SQL-execution-engine-td43724.html

Reference Links

[1] 

[2] https://www.youtube.com/playlist?list=PLSE8ODhjZXja7K1hjZ01UTVDnGQdx5v5U

[3] https://cs.uwaterloo.ca/~david/cs848/volcano.pdf

https://calcite.apache.org/

...