Apache Kylin : Analytical Data Warehouse for Big Data

Background

    As you may know, Kylin 4 stores pre-calculated cuboid data in parquet files. Before saving this data to HDFS/Object Storage, Kylin repartitions it. This article explains how the pre-calculated cuboid data is repartitioned and how this affects query performance.

  • Setting a shard by column can improve query performance by reducing the number of parquet files scanned.
  • We suggest choosing a shard by column that has high cardinality and is used in the where clause. Once a shard by column is set, the build engine repartitions the cuboid by that column, and files that cannot contain the filtered values are filtered out.
  • Shard pruning supports Equality/In/IsNull/And/Or. For example, if seller_id is set as the shard by column, the efficiency of the following SQL will be improved: select count(*) from kylin_sales where seller_id = '1001'
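
The idea behind repartitioning by a shard by column can be sketched in a few lines of Python. This is only an illustration, not Kylin's implementation: the shard count, the CRC32 hash, the NULL handling, and the function names (shard_of, repartition, files_to_scan_for_equality) are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count, chosen only for this sketch


def shard_of(value, num_shards=NUM_SHARDS):
    """Map a shard by value to a shard id (stand-in hash, not Kylin's)."""
    if value is None:
        return 0  # assumption: NULLs are kept together in one shard
    return zlib.crc32(str(value).encode("utf-8")) % num_shards


def repartition(rows, column, num_shards=NUM_SHARDS):
    """Group rows so that equal values of `column` land in the same shard."""
    shards = defaultdict(list)
    for row in rows:
        shards[shard_of(row[column], num_shards)].append(row)
    return dict(shards)


def files_to_scan_for_equality(value, num_shards=NUM_SHARDS):
    """An equality filter only needs the single shard its value hashes to."""
    return {shard_of(value, num_shards)}


rows = [{"seller_id": str(1000 + i % 50), "price": i} for i in range(200)]
shards = repartition(rows, "seller_id")
to_scan = files_to_scan_for_equality("1001")
print(f"scan {len(to_scan)} of {NUM_SHARDS} shards for seller_id = '1001'")
```

Because every row with the same seller_id ends up in the same shard, a filter such as seller_id = '1001' can skip all other shards, which is why a high-cardinality column used in the where clause benefits most.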

For example, suppose seller_id is a high-cardinality column and the application's queries filter on it. Some sample SQL statements:

Example
select count(*) from kylin_sales left join kylin_order where SELLER_ID = '10000233'
select count(*) from kylin_sales left join kylin_order where SELLER_ID in (10000233,10000234,10000235)
select count(*) from kylin_sales left join kylin_order where SELLER_ID is NULL
select count(*) from kylin_sales left join kylin_order where SELLER_ID in (10000233,10000234,10000235) and SELLER_ID = 10000233 
select count(*) from kylin_sales left join kylin_order where SELLER_ID = 10000233 or SELLER_ID = 1 
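
How the supported predicates (Equality/In/IsNull/And/Or) could narrow down the shards to scan can be sketched as follows. This is a minimal illustration under stated assumptions, not Kylin's pruning code: the tuple-based predicate encoding, the shard count, the CRC32 hash, and the names shard_of and candidate_shards are hypothetical.

```python
import zlib

NUM_SHARDS = 8  # hypothetical shard count
ALL_SHARDS = frozenset(range(NUM_SHARDS))


def shard_of(value):
    """Stand-in hash for the shard by column (an assumption, not Kylin's)."""
    if value is None:
        return 0  # assumption: NULLs live in one fixed shard
    return zlib.crc32(str(value).encode("utf-8")) % NUM_SHARDS


def candidate_shards(pred):
    """Return the shard ids that may hold rows matching `pred`.

    `pred` is a tuple: ("eq", v), ("in", [v, ...]), ("is_null",),
    ("and", p, q) or ("or", p, q). Anything else cannot be pruned.
    """
    op = pred[0]
    if op == "eq":
        return frozenset({shard_of(pred[1])})
    if op == "in":
        return frozenset(shard_of(v) for v in pred[1])
    if op == "is_null":
        return frozenset({shard_of(None)})
    if op == "and":  # both sides must match: intersect the candidates
        return candidate_shards(pred[1]) & candidate_shards(pred[2])
    if op == "or":  # either side may match: union the candidates
        return candidate_shards(pred[1]) | candidate_shards(pred[2])
    return ALL_SHARDS  # unsupported predicate: scan everything


# SELLER_ID in (10000233,10000234,10000235) and SELLER_ID = 10000233
in_and_eq = ("and", ("in", [10000233, 10000234, 10000235]), ("eq", 10000233))
# SELLER_ID = 10000233 or SELLER_ID = 1
eq_or_eq = ("or", ("eq", 10000233), ("eq", 1))
print(sorted(candidate_shards(in_and_eq)), sorted(candidate_shards(eq_or_eq)))
```

In this sketch And intersects and Or unions the candidate sets, so a predicate that cannot be analyzed falls back to scanning all shards rather than risking a missed row.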


How to use

Model design

Edit the cube and add the dimension seller_id. Note that the dimension type must be normal, not derived.

Cube Design

In Cube Designer → Advanced Setting → Rowkeys, find the column seller_id and set shard by to true. Only one shard by column is currently supported, so only one column should have shard by set to true.


Advanced configuration



FAQ

Before configuring a shard by column, note the following:

  • Only one shard by column is currently supported, so we suggest choosing a column with high cardinality.
  • The shard by column must not be a derived column. See more about derived column.
