Apache Kylin : Analytical Data Warehouse for Big Data

Background

Kylin uses bitmaps to accelerate precise count distinct queries. To use a bitmap, column values must first be encoded as int values using a dictionary.
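As a rough illustration of why the dictionary is needed, the sketch below encodes string values to dense int ids and uses a plain Python int as a stand-in for a bitmap. The function names and the 0-based id assignment are assumptions made for this example and are not Kylin's implementation.

```python
def build_dictionary(values):
    """Assign each distinct value a dense int id (0-based; an arbitrary choice for this sketch)."""
    dictionary = {}
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
    return dictionary


def to_bitmap(values, dictionary):
    """Set one bit per encoded id. A Python int stands in for a real compressed bitmap."""
    bitmap = 0
    for v in values:
        bitmap |= 1 << dictionary[v]
    return bitmap


def cardinality(bitmap):
    """Count the set bits, i.e. the number of distinct encoded values."""
    return bin(bitmap).count("1")


users_day1 = ["alice", "bob", "alice"]
users_day2 = ["bob", "carol"]
dictionary = build_dictionary(users_day1 + users_day2)
bm1 = to_bitmap(users_day1, dictionary)
bm2 = to_bitmap(users_day2, dictionary)

# Precise count distinct across both days is the cardinality of the OR-ed bitmaps.
distinct_both_days = cardinality(bm1 | bm2)
```

Because every value maps to exactly one id, merging bitmaps never double counts a value, which is what makes the count precise.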

In KYLIN4, each dictionary is hashed into several buckets, and the column data is repartitioned with the same hash algorithm. Each encoding task then only needs to load a single dictionary bucket to do the encoding step.
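The bucketed scheme can be sketched in plain Python as follows. This is a simplified model: the CRC32 hash, the bucket count, and the (bucket, local id) encoding are assumptions chosen for the example, and how KYLIN4 derives its final int ids from the buckets is not covered here.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count


def bucket_of(value, num_buckets=NUM_BUCKETS):
    """Deterministic hash shared by the dictionary and the data repartition."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def build_bucketed_dictionary(distinct_values, num_buckets=NUM_BUCKETS):
    """Split the dictionary into buckets; ids are local to each bucket."""
    buckets = defaultdict(dict)
    for v in sorted(distinct_values):
        b = bucket_of(v, num_buckets)
        buckets[b][v] = len(buckets[b])
    return buckets


def repartition(rows, num_buckets=NUM_BUCKETS):
    """Send every row to the partition that matches its dictionary bucket."""
    partitions = defaultdict(list)
    for v in rows:
        partitions[bucket_of(v, num_buckets)].append(v)
    return partitions


def encode_partition(bucket_id, rows, buckets):
    """An encoding task loads only the one bucket that matches its partition."""
    local = buckets[bucket_id]
    return [(bucket_id, local[v]) for v in rows]


rows = ["alice", "bob", "alice", "carol", "dave"]
buckets = build_bucketed_dictionary(set(rows))
partitions = repartition(rows)
encoded = {b: encode_partition(b, part, buckets) for b, part in partitions.items()}
```

Since the data and the dictionary use the same hash, every value in a partition is guaranteed to be present in that partition's bucket, so no task has to hold the full dictionary in memory.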

Recently this improvement has caused us trouble when the data is skewed: all rows that share a hot value are hashed to the same partition, so a single encoding task receives most of the data. In some of our cases the encoding/repartition step cannot finish at all. The same data works fine in KYLIN3, where each Spark task loads the whole dictionary of a column and encodes the column values to int values, so no repartition step is needed.
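A small sketch of the effect, using made-up data in which one value accounts for 90 percent of the rows. The hash function and the partition count are assumptions for illustration only.

```python
import zlib
from collections import Counter


def bucket_of(value, num_buckets):
    """Deterministic hash partitioning, standing in for the real repartition step."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


# Synthetic skewed column: one hot value plus 1000 distinct ordinary values.
rows = ["hot_key"] * 9000 + [f"user_{i}" for i in range(1000)]

# Number of rows each of the 10 partitions receives.
sizes = Counter(bucket_of(v, 10) for v in rows)
largest = max(sizes.values())
```

Whatever the hash function, the partition that owns the hot value holds at least 9000 of the 10000 rows, and adding more partitions does not help.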

Solutions

  1. Sample the source data.

  2. Identify skewed values with some algorithm, e.g. treat a value that accounts for more than 10 percent of the total as skewed.

  3. Create a small dictionary for the skewed values and broadcast it.

  4. Customize the repartition function: skewed values are sent to random partitions, while all other values are still partitioned by the dictionary hash.

  5. Encode the repartitioned data using both the broadcast dictionary and the dictionary bucket that corresponds to each partition.
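The five steps above can be sketched end to end in plain Python. This is a minimal model under stated assumptions rather than the code used in our build: the helper names, the CRC32 hash, the bucket count, the sample size, and the (bucket, local id) encoding are all hypothetical, and a real job would run on Spark with a broadcast variable and a custom partitioner.

```python
import random
import zlib
from collections import Counter, defaultdict

NUM_BUCKETS = 8        # hypothetical bucket / partition count
SKEW_THRESHOLD = 0.10  # the "more than 10 percent" rule from step 2


def bucket_of(value):
    return zlib.crc32(value.encode("utf-8")) % NUM_BUCKETS


def build_bucketed_dictionary(distinct_values):
    """Bucketed dictionary; each code is a (bucket, local id) pair in this sketch."""
    buckets = defaultdict(dict)
    for v in sorted(distinct_values):
        b = bucket_of(v)
        buckets[b][v] = (b, len(buckets[b]))
    return buckets


def find_skewed(rows, sample_size, rng):
    """Steps 1 and 2: sample the data and keep values above the threshold."""
    sample = rng.sample(rows, min(sample_size, len(rows)))
    counts = Counter(sample)
    return {v for v, c in counts.items() if c / len(sample) > SKEW_THRESHOLD}


def build_broadcast_dictionary(skewed, buckets):
    """Step 3: a small dictionary holding only the skewed values."""
    return {v: buckets[bucket_of(v)][v] for v in skewed}


def repartition(rows, skewed, rng):
    """Step 4: skewed values go to random partitions, the rest follow the hash."""
    partitions = defaultdict(list)
    for v in rows:
        p = rng.randrange(NUM_BUCKETS) if v in skewed else bucket_of(v)
        partitions[p].append(v)
    return partitions


def encode(partitions, broadcast, buckets):
    """Step 5: try the broadcast dictionary first, then the partition's own bucket."""
    encoded = {}
    for p, part in partitions.items():
        local = buckets[p]
        encoded[p] = [broadcast[v] if v in broadcast else local[v] for v in part]
    return encoded


rng = random.Random(42)
rows = ["hot_key"] * 9000 + [f"user_{i}" for i in range(1000)]
buckets = build_bucketed_dictionary(set(rows))
skewed = find_skewed(rows, sample_size=1000, rng=rng)
broadcast = build_broadcast_dictionary(skewed, buckets)
partitions = repartition(rows, skewed, rng)
encoded = encode(partitions, broadcast, buckets)
largest = max(len(part) for part in partitions.values())
```

The broadcast dictionary stays small because only values above the threshold enter it, and a skewed value receives the same code in whichever partition it lands, so the resulting bitmaps are unaffected by the random placement.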

We ran a simple test of this approach on our case, and the build job succeeded in 30 minutes. There may be a better way to solve this problem, and we welcome further advice.
