...

Tuning Spark configuration

Kylin4 uses Spark as its query engine. Because Spark is a distributed engine designed for massive data processing, it inevitably loses some performance on small queries. We have done some tuning to bring small-query latency closer to that of Kylin 3.

...

Lastly, we disabled Spark's whole-stage code generation for small queries. Code generation costs about 100 ms to 200 ms, which is unnecessary for a small query that is just a simple projection.
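As a rough illustration of the idea above, the sketch below picks a per-query override for Spark's `spark.sql.codegen.wholeStage` setting. The setting name is a real Spark SQL option; the row-count cutoff and the function name are hypothetical, since this page does not describe how Kylin4 decides that a query is small.

```python
# Real Spark SQL setting that toggles whole-stage code generation.
WHOLE_STAGE_CODEGEN_KEY = "spark.sql.codegen.wholeStage"

# Hypothetical cutoff: the page does not say how a "small query" is detected.
SMALL_QUERY_MAX_SCAN_ROWS = 100_000


def spark_conf_overrides(estimated_scan_rows: int) -> dict:
    """Return per-query Spark SQL settings for a query of the given size."""
    is_small = estimated_scan_rows <= SMALL_QUERY_MAX_SCAN_ROWS
    # Small queries skip codegen to avoid its fixed compile cost;
    # large queries keep it because the cost is amortized over many rows.
    return {WHOLE_STAGE_CODEGEN_KEY: "false" if is_small else "true"}


if __name__ == "__main__":
    print(spark_conf_overrides(5_000))
    print(spark_conf_overrides(50_000_000))
```

With a live PySpark session, each returned entry could be applied through `spark.conf.set(key, value)` before the query runs.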

Parquet optimization

Optimizing the Parquet layout is also important for query performance.

The first principle is to always include the shard-by column in the filter condition. Parquet files are sharded by the shard-by column, so filtering on that column reduces the number of data files to read.
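The effect of filtering on the shard-by column can be sketched with a toy model. This is only an illustration: the shard count, the CRC32 hash, and the function names are assumptions, not Kylin's actual sharding scheme.

```python
import zlib
from typing import List, Optional

NUM_SHARDS = 4  # hypothetical shard count for the example


def shard_of(value: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a shard-by column value to a shard id (illustrative hash only)."""
    return zlib.crc32(value.encode("utf-8")) % num_shards


def shards_to_read(shard_value: Optional[str],
                   num_shards: int = NUM_SHARDS) -> List[int]:
    """Return the shard ids a query must scan.

    Without an equality filter on the shard-by column every shard is read;
    with one, only the shard that can hold the value is read.
    """
    if shard_value is None:
        return list(range(num_shards))
    return [shard_of(shard_value, num_shards)]


if __name__ == "__main__":
    print(len(shards_to_read(None)))         # all shards
    print(len(shards_to_read("seller_42")))  # a single shard
```

In this model a query without the filter touches every shard file, while a query with an equality filter on the shard-by column touches exactly one.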

Within a Parquet file, data is sorted by the rowkey columns, so prefix matching in a query is as important as it is for Kylin on HBase. When a query condition satisfies a prefix match, row groups can be filtered using each column's min/max index. Furthermore, we can reduce the row group size to make the index granularity finer, but be aware that the compression ratio will be lower with a smaller row group size.
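The min/max filtering described above can be modeled in a few lines. This is a simplified sketch, not Parquet's reader: `RowGroup`, `build_row_groups`, and `groups_to_scan` are hypothetical names, and row groups are sized by row count here rather than by bytes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    min_key: int  # smallest rowkey value in the group
    max_key: int  # largest rowkey value in the group
    rows: int     # number of rows in the group


def build_row_groups(sorted_keys: List[int], rows_per_group: int) -> List[RowGroup]:
    """Split already-sorted keys into row groups and record min/max per group."""
    groups = []
    for start in range(0, len(sorted_keys), rows_per_group):
        chunk = sorted_keys[start:start + rows_per_group]
        groups.append(RowGroup(chunk[0], chunk[-1], len(chunk)))
    return groups


def groups_to_scan(groups: List[RowGroup], lo: int, hi: int) -> List[RowGroup]:
    """Keep only the groups whose [min, max] range overlaps the filter [lo, hi]."""
    return [g for g in groups if g.max_key >= lo and g.min_key <= hi]


if __name__ == "__main__":
    keys = list(range(1000))  # data sorted by the leading rowkey column
    for size in (250, 50):
        hit = groups_to_scan(build_row_groups(keys, size), 300, 320)
        print(size, len(hit), sum(g.rows for g in hit))
```

Because the keys are sorted, a range filter overlaps only a few groups, and smaller groups mean fewer rows are read for the same filter. In a real Parquet writer the row group size is configured in bytes (the Hadoop property `parquet.block.size`), with the compression trade-off noted above.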

Dynamic elimination of partitioning dimensions

...
