Apache Kylin : Analytical Data Warehouse for Big Data
...
Optimization of query engine
Cache Calcite physical plan
In Kylin 4, each SQL statement is parsed, optimized, and run through code generation in Calcite, which takes about 150ms for some queries. Kylin 4 supports PreparedStatementCache to cache the Calcite plan. With this optimization, those queries save about 150ms.
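The idea behind a plan cache can be sketched as follows. This is a minimal illustration in Python, not Kylin's PreparedStatementCache implementation: `compile_plan` is a hypothetical stand-in for the Calcite parse, optimize, and code-generation steps, and the exact-text cache key and the capacity are assumptions.

```python
from collections import OrderedDict


class PlanCache:
    """LRU cache mapping SQL text to an already-compiled plan."""

    def __init__(self, compile_plan, capacity=100):
        self._compile_plan = compile_plan  # hypothetical expensive planner call
        self._capacity = capacity          # assumed limit on cached plans
        self._plans = OrderedDict()

    def get(self, sql):
        if sql in self._plans:
            self._plans.move_to_end(sql)   # mark as most recently used
            return self._plans[sql]
        plan = self._compile_plan(sql)     # cache miss: pay the planning cost once
        self._plans[sql] = plan
        if len(self._plans) > self._capacity:
            self._plans.popitem(last=False)  # evict the least recently used plan
        return plan


# Usage: the second lookup of the same SQL text skips the planner entirely.
planner_calls = []


def fake_compile(sql):
    planner_calls.append(sql)
    return ("PLAN", sql)


cache = PlanCache(fake_compile, capacity=2)
first = cache.get("SELECT 1")
second = cache.get("SELECT 1")
```

Keying on the exact SQL text keeps the sketch simple; a prepared statement with bound parameters would let queries that differ only in literal values share one entry.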
Tuning Spark configuration
Kylin 4 uses Spark as its query engine. Because Spark is a distributed engine designed for massive data processing, it inevitably loses some performance on small queries. We have done some tuning to bring small-query latency close to that of Kylin 3.
Our first optimization is to make more processing finish in memory. The key is to avoid data spills during aggregation, shuffle, and sort. Tuning the following configuration helps.
- set "spark.sql.objectHashAggregate.sortBased.fallbackThreshold" to a larger value to keep hash aggregation from falling back to sort-based aggregation, which severely hurts performance when it happens.
- set "spark.shuffle.spill.initialMemoryThreshold" to a large value to avoid too many spills during shuffle.
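As a rough sketch, the two settings above could be collected and rendered as `--conf` arguments for `spark-submit`. The values shown are illustrative assumptions only, not values recommended by this page, and `to_submit_args` is a hypothetical helper.

```python
# Illustrative values only (assumptions); tune them for your own workload.
SPILL_AVOIDANCE_CONF = {
    # Number of hash map entries allowed before falling back to sort-based aggregation.
    "spark.sql.objectHashAggregate.sortBased.fallbackThreshold": "10000",
    # Initial in-memory threshold, in bytes, before shuffle data may spill.
    "spark.shuffle.spill.initialMemoryThreshold": str(256 * 1024 * 1024),
}


def to_submit_args(conf):
    """Render a config dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


submit_args = to_submit_args(SPILL_AVOIDANCE_CONF)
```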
Secondly, we route small queries to a Query Server that runs Spark in local mode, because the overhead of task scheduling, shuffle reads, and variable broadcast is proportionally larger for small queries in YARN/Standalone mode.
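A routing decision of this kind might look like the sketch below. The text does not say how Kylin classifies a query as small, so the criterion here (estimated scan size under a threshold, with no shuffle join) is purely an assumption, as are the names `QueryEstimate`, `choose_master`, and the threshold value. The strings `local[*]` and `yarn` are standard Spark master URLs.

```python
from dataclasses import dataclass


@dataclass
class QueryEstimate:
    scan_bytes: int           # estimated bytes to scan (hypothetical input)
    needs_shuffle_join: bool  # whether the plan contains a shuffle join


SMALL_QUERY_MAX_BYTES = 64 * 1024 * 1024  # assumed cut-off, not from the source


def choose_master(estimate, max_bytes=SMALL_QUERY_MAX_BYTES):
    """Pick a Spark master URL: local mode for small queries, the cluster otherwise."""
    if estimate.scan_bytes <= max_bytes and not estimate.needs_shuffle_join:
        return "local[*]"  # no cluster scheduling or network shuffle overhead
    return "yarn"


small = choose_master(QueryEstimate(scan_bytes=4 * 1024 * 1024, needs_shuffle_join=False))
large = choose_master(QueryEstimate(scan_bytes=10 * 1024 ** 3, needs_shuffle_join=True))
```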
Thirdly, we use a RAM disk to improve shuffle performance: mount a RAM disk as tmpfs and set spark.local.dir to a directory on it.
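One way to check that the directory chosen for spark.local.dir really sits on tmpfs is to look it up in the mount table. The sketch below assumes a Linux host and parses text in the `/proc/mounts` format; the mount point `/mnt/ramdisk`, the directory name, and the sample mount table are hypothetical.

```python
import posixpath


def filesystem_type(path, mounts_text):
    """Return the fstype of the mount containing path, given /proc/mounts-style text."""
    path = posixpath.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # The longest mount point that is a path prefix is the containing mount.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


# Hypothetical mount table; on a real host, read the text from /proc/mounts.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
tmpfs /mnt/ramdisk tmpfs rw,size=8g 0 0
"""

local_dir = "/mnt/ramdisk/spark-local"  # hypothetical value for spark.local.dir
on_ram_disk = filesystem_type(local_dir, SAMPLE_MOUNTS) == "tmpfs"
```

Note that, per the Spark configuration documentation, spark.local.dir is overridden by the SPARK_LOCAL_DIRS (Standalone) or LOCAL_DIRS (YARN) environment variables when the cluster manager sets them.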
Lastly, we disabled Spark's whole-stage code generation for small queries. Whole-stage code generation costs about 100ms to 200ms, which is unnecessary for a small query that is just a simple projection.
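In Spark SQL, whole-stage code generation is controlled by `spark.sql.codegen.wholeStage`. A per-query override could be sketched as below; how a query is flagged as small is left to the caller, and `effective_conf` is a hypothetical helper, not a Kylin or Spark API.

```python
def effective_conf(base_conf, is_small_query):
    """Return the session config for one query, disabling codegen for small queries."""
    conf = dict(base_conf)  # copy so the shared base config is not mutated
    if is_small_query:
        conf["spark.sql.codegen.wholeStage"] = "false"
    return conf


base = {"spark.sql.codegen.wholeStage": "true"}
small_query_conf = effective_conf(base, is_small_query=True)
large_query_conf = effective_conf(base, is_small_query=False)
```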
Parquet optimization
... Please try to complete this part. Shengjun Zheng
Dynamic elimination of partitioning dimensions
...
In our tests, response time dropped from 20s to 6s and from 10s to 3s in some situations.
Partition pruning under complex filter conditions
Users probably do not need to understand the details of partition filtering; this should be a must-have feature. Shengjun Zheng
Optimization of build engine
...