

Background

Project Name & Issue to Solve

Apache IoTDB: https://iotdb.apache.org/

JIRA: https://issues.apache.org/jira/browse/COMDEV-409

Synopsis

IoTDB supports many types of queries: raw data queries, function queries (including UDF queries), and so on. However, there is currently no easy way to combine the results of multiple queries. Therefore, we hope that IoTDB can support complex arithmetic operations and nested operations in the SELECT clause, which would greatly improve its analysis capabilities.

This proposal describes the implementation of arithmetic operations and nested operations based on the IoTDB UDF runtime framework. The content covers the generation of logical operators, physical plans, and result data sets. In addition, this proposal points out several parts of the query code that need to be refactored and proposes several solutions for improving UDF execution performance.

Feature Description

Applied to: raw time series, literal numbers, and function outputs.

Applicable data types: all types except TIMESTAMP and TEXT.

Applicable operators: at least five binary operators (+, -, *, /, %) and two unary operators (+, -).

Usage Examples

  1. raw queries
    SELECT -a FROM root.sg.d;
    SELECT a, b, c, b * b - 4 * a * c FROM root.sg.d WHERE b > 0;
    SELECT a, b, -(bool_value * (a - b)) FROM root.sg.d;
    SELECT -3.14 + a / 15 + 926 FROM root.sg.d;
    SELECT +a % 3.14 FROM root.sg.d WHERE a < 0;
  2. function queries
    SELECT a + abs(a), sin(a) * cos(a) FROM root.sg.d;
    SELECT a, b, sqrt(a) * sqrt(b) / (a * b) FROM root.sg.d WHERE a < 0;
  3. nested queries
    SELECT a, b, a + b + udf(sin(a) * sin(b), cos(a) * cos(b)) FROM root.sg.d;
    SELECT a, a + a, sin(sin(sin(a + a))) FROM root.sg.d WHERE a < 0;


Non-functional Requirements


  • For performance reasons, it's better to perform as few disk read operations as possible.


Example:

SELECT a, sin(a + a) FROM root.sg.d WHERE a < 0;

The series root.sg.d.a should be read only once during the query.


  • For performance reasons, it's better to reuse intermediate calculation results as much as possible.


Example:

SELECT a + a, sin(a + a) FROM root.sg.d WHERE a < 0;

The intermediate calculation result a + a should only be evaluated once during the query.
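
As described in the Physical Plan section below, this can be achieved by deduplicating the evaluation node list, so that a + a corresponds to a single evaluation node shared by both output columns.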


  • Need to consider memory-constrained scenarios.



Key Design 

In order to implement expected features, the implementations of the following components need to be considered:

ANTLR grammar: Definition of the SQL syntax.

Logical Operator: The data structure used to store information parsed from SQL. The information includes the selected series, the functions or arithmetic operations on the selected series, and the priority of each operation.

Physical Plan: Converted from a logical operator; it saves all the information needed to execute the SQL. The physical plan is eventually submitted to the plan executor.

Transformer: Constructed by the plan executor, the Transformer is the most basic unit of function execution. It accepts several series as input and outputs a result series. A complete arithmetic operation or nested operation can be decomposed into several Transformer-based operations.

Result Dataset: Saves the context information of the query so that the client can identify each actual output column by its column name and obtain the query results in batches as needed.



ANTLR grammar

The original ANTLR grammar is not very elegant, so I hope to take this opportunity to refactor it. The following is only a draft, but it is already much clearer than the original.

selectClause
   : SELECT LAST? complexOperationAsClause (COMMA complexOperationAsClause)*
   ;

complexOperationAsClause
   : complexOperationClause (AS ID)?
   ;

complexOperationClause
   : numberLiteral
   | suffixPath
   | functionClause
   | '(' complexOperationClause ')'
   | ('+' | '-') complexOperationClause
   | complexOperationClause ('*' | '/' | '%') complexOperationClause
   | complexOperationClause ('+' | '-') complexOperationClause
   ;

functionClause
   : functionName=ID LR_BRACKET complexOperationClause functionAttribute* RR_BRACKET
   ;

functionAttribute
   : COMMA functionAttributeKey=stringLiteral OPERATOR_EQ functionAttributeValue=stringLiteral
   ;
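
As a usage illustration (not part of the grammar itself), a call such as udf(a + b, 'key'='value') would be parsed under this draft as a functionClause whose argument is a complexOperationClause and whose trailing key-value pair is a functionAttribute; udf here is a hypothetical function name used only for the example.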

Logical Operator

A logical operator is a data structure used to store information parsed from SQL. The information includes the selected series, the functions or arithmetic operations on the selected series, and the priority of each operator.

In order to support complex arithmetic operations and nested operations, we need to modify the existing SelectOperator so that it records the priority of the operators involved in arithmetic operations and nested operations.

The priority of operators in arithmetic operations and nested operations is defined by the ANTLR grammar, and the ANTLR visitor parses each operator and its operands in turn according to that priority. We can construct an execution tree for each output column according to the order in which the ANTLR visitor visits each operator; the execution tree describes how the output column is calculated. Each node of the execution tree represents one kind of query: a raw query, a function query, or an arithmetic operation.

Here is an example execution tree constructed by the ANTLR visitor:

Input: SELECT a * b - c / sin(sin(d - e * f)) FROM ...
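
Following the operator precedence defined by the draft grammar, the visitor would build a tree along the following lines (an illustrative sketch; each internal node is an arithmetic or function operation and each leaf is a raw series):

-                     root: (a * b) - (c / sin(sin(d - e * f)))
├── *
│   ├── a
│   └── b
└── /
    ├── c
    └── sin
        └── sin
            └── -
                ├── d
                └── *
                    ├── e
                    └── f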

In addition, for code readability and maintainability, we also need to redesign the fields in SelectOperator. Here are all the fields currently in SelectOperator:

private final ZoneId zoneId;
private List<PartialPath> suffixList;
private List<String> aggregations;
private List<UDFContext> udfList;
private boolean lastQuery;
private boolean udfQuery;
private boolean hasBuiltinAggregation;

The following are the fields in SelectOperator after refactoring:

private boolean isLastQuery;
private List<OutputColumn> outputColumns;

Among them, the outputColumns field records the alias of each output column and the corresponding execution tree.
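
Below is a minimal sketch of what OutputColumn and its execution tree could look like; except for the OutputColumn name itself, all class, field, and method names (ExpressionNode, alias, executionTree) are assumptions for illustration rather than the final design.

import java.util.Collections;
import java.util.List;

public class OutputColumn {

    // Each node of the execution tree represents a raw series, a literal,
    // a function call, or an arithmetic operation.
    public static class ExpressionNode {
        private final String operator;                // e.g. "+", "sin", or a series / literal name
        private final List<ExpressionNode> children;  // operands; empty for leaf nodes

        public ExpressionNode(String operator, List<ExpressionNode> children) {
            this.operator = operator;
            this.children = children;
        }

        public ExpressionNode(String operator) {
            this(operator, Collections.emptyList());
        }
    }

    private final String alias;                 // alias from the optional AS clause, may be null
    private final ExpressionNode executionTree; // root of the execution tree for this output column

    public OutputColumn(String alias, ExpressionNode executionTree) {
        this.alias = alias;
        this.executionTree = executionTree;
    }
}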

Refactoring SelectOperator in this way reduces the amount of code needed for logical optimization and also makes the conversion from logical operators to physical plans more concise.

Physical Plan

A physical plan is converted from a logical operator and saves all the information needed to execute the SQL. The physical plan is eventually submitted to the plan executor.

The physical plan converts each node in the execution tree into a series reader and records the tree-shaped evaluation dependencies of the result columns as lists. Note that this enables the evaluation to reuse intermediate results and minimizes the number of disk reads.

Below, I will illustrate how to store the tree-shaped calculation dependency as a list.

Example:

SELECT sin(sin(a, b) - c, b), cos(sin(a, b) + b, b), sin(a, b), b, d FROM ...

Evaluation node list (deduplicated):

  1. sin(sin(a, b) - c, b) 
  2. sin(a, b) - c 
  3. b
  4. sin(a, b)
  5. c
  6. a
  7. cos(sin(a, b) + b, b)
  8. sin(a, b) + b
  9. d


Input column indexes list:

  1. (2, 3)
  2. (4, 5)
  3. ()
  4. (6, 3)
  5. ()
  6. ()
  7. (8, 3)
  8. (4, 3)
  9. ()

The meaning of the elements in this list: 

Taking the first element sin(sin(a, b)-c, b) in the evaluation node list as an example, sin(sin(a, b)-c, b) will take the 2nd and 3rd nodes in the list as input, so the first element in the input column indexes list is (2, 3).

Dataset output column list:

(1, 7, 4, 3, 9)

because:

sin(sin(a, b) - c, b) -> 1
cos(sin(a, b) + b, b) -> 7
sin(a, b) -> 4
b -> 3
d -> 9

We can extend UDTFPlan to construct a physical plan that supports complex arithmetic operations and nested operations.
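
As a rough illustration, the extended plan could hold the three lists from the example above roughly as follows; the class and field names here are hypothetical and used only for this sketch.

import java.util.List;

public class NestedOperationsPlanSketch {

    // Deduplicated evaluation nodes, e.g. "sin(sin(a, b) - c, b)", "sin(a, b) - c", "b", ...
    private final List<String> evaluationNodes;

    // For each evaluation node, the indexes (into evaluationNodes) of its input nodes;
    // leaf nodes such as raw series have an empty index list.
    private final List<List<Integer>> inputColumnIndexes;

    // For each output column of the dataset, the index of the evaluation node that produces it.
    private final List<Integer> datasetOutputColumns;

    public NestedOperationsPlanSketch(List<String> evaluationNodes,
                                      List<List<Integer>> inputColumnIndexes,
                                      List<Integer> datasetOutputColumns) {
        this.evaluationNodes = evaluationNodes;
        this.inputColumnIndexes = inputColumnIndexes;
        this.datasetOutputColumns = datasetOutputColumns;
    }
}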

Transformer Layer

The transformer is constructed by the plan executor according to the physical plan and is the most basic unit of calculation. Its interface looks like a series reader, but it actually accepts several input series and produces an output series. At the same time, it can provide memory control capabilities (which can be implemented based on the existing data structures of IoTDB).
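
A reader-like transformer interface could look roughly like the sketch below; the interface and method names are assumptions for illustration and not the actual IoTDB API.

public interface Transformer {

    // Whether the output series has more points to produce.
    boolean hasNext() throws Exception;

    // Advances to the next output point, consuming the input series as needed.
    void next() throws Exception;

    // Timestamp of the current output point.
    long currentTime();

    // Value of the current output point; the concrete type depends on the output data type.
    Object currentValue();
}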

There are two types of transformers: raw series readers and function transformers. A function transformer can handle both arithmetic operations and UDF calculations. The transformer layer generates the same number of transformers as there are evaluation nodes in the physical plan. In the above example, the 3rd, 5th, 6th, and 9th nodes in the evaluation node list are converted into raw series readers, and the remaining nodes are converted into function transformers.

The transformer layer that supports complex arithmetic operations and nested operations can be refactored from the UDF InputLayer. A transformer layer generated in this way can support all currently supported function operation features while reducing the UDF evaluation overhead at runtime. Another advantage is that the design can be easily adapted to the cluster mode.

Result Dataset

The result dataset saves the context information of the query so that the client can identify each actual output column by its column name and obtain the query results in batches as needed.

The result dataset for complex arithmetic queries and nested queries can be implemented directly by modifying UDFDataSet. We only need to modify a few fields related to the Transformer Layer described above. Since the modification is relatively simple, it is not described further here.

In addition, I would like to propose a refactoring point here: the mapping relationship between the output columns seen by the client and the output columns of the result dataset. In the current implementation, the mapping relationships for raw queries, aggregation queries, and UDF queries are all different. After introducing complex arithmetic queries and nested queries, the mapping would become even more complicated. I suggest unifying it as Map<OutputColumnName, DatasetOutputIndex>.
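
For illustration only, the unified mapping could be populated as in the sketch below, where String and Integer stand in for OutputColumnName and DatasetOutputIndex and the indexes follow the earlier physical plan example (sin(sin(a, b) - c, b) is the first dataset output column, b is the fourth).

import java.util.HashMap;
import java.util.Map;

public class ColumnMappingSketch {
    public static void main(String[] args) {
        // Maps the column name shown to the client to the index of the
        // corresponding output column in the result dataset.
        Map<String, Integer> outputColumnNameToDatasetIndex = new HashMap<>();
        outputColumnNameToDatasetIndex.put("sin(sin(a, b) - c, b)", 0);
        outputColumnNameToDatasetIndex.put("b", 3);
        System.out.println(outputColumnNameToDatasetIndex);
    }
}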
