Apache Kylin : Analytical Data Warehouse for Big Data
Background
Kylin uses bitmaps to accelerate precise count-distinct queries. To use a bitmap, column values must first be encoded as int values using a dictionary.
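To make the role of the dictionary concrete, here is a minimal sketch in plain Python. It is not Kylin's implementation: `build_dictionary`, `to_bitmap` and `count_distinct` are hypothetical helpers, and a Python int stands in for a real bitmap structure.

```python
def build_dictionary(values):
    """Assign each distinct value a dense int id (illustrative helper)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def to_bitmap(ids):
    """Set one bit per id; a Python int is used here as a toy bitmap."""
    bitmap = 0
    for i in ids:
        bitmap |= 1 << i
    return bitmap


def count_distinct(bitmap):
    """The number of set bits equals the number of distinct encoded values."""
    return bin(bitmap).count("1")


day1 = ["u1", "u2", "u2", "u3"]
day2 = ["u2", "u4"]

# One shared dictionary, so the same value always maps to the same bit.
dictionary = build_dictionary(day1 + day2)
bm1 = to_bitmap(dictionary[v] for v in day1)
bm2 = to_bitmap(dictionary[v] for v in day2)

# Bitmaps from different segments can be merged with OR without double counting.
print(count_distinct(bm1), count_distinct(bm2), count_distinct(bm1 | bm2))  # prints 3 2 4
```

The point of the sketch is that the count stays exact only if every value gets one stable int id, which is why the encoding step matters.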
In Kylin 4, each dictionary is hashed into several buckets, and the column data is repartitioned with the same hash algorithm. Each encoding task then only needs to load the one dictionary bucket that matches its partition.
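The bucketed scheme can be sketched as follows, in plain Python rather than Spark. Everything here is an assumption made for illustration: the bucket count, the use of CRC32 as the hash, and the id layout (local index * bucket count + bucket number) are not taken from Kylin's code.

```python
import zlib

NUM_BUCKETS = 4  # assumed bucket count, for illustration only


def bucket_of(value):
    """Stable hash shared by the dictionary and the data repartition."""
    return zlib.crc32(value.encode("utf-8")) % NUM_BUCKETS


def build_bucketed_dictionary(distinct_values):
    """Split the dictionary into buckets; ids stay unique across buckets."""
    buckets = [dict() for _ in range(NUM_BUCKETS)]
    for v in sorted(distinct_values):
        b = bucket_of(v)
        buckets[b][v] = len(buckets[b]) * NUM_BUCKETS + b
    return buckets


def repartition(rows):
    """Route each row to the partition that owns its dictionary bucket."""
    parts = [[] for _ in range(NUM_BUCKETS)]
    for v in rows:
        parts[bucket_of(v)].append(v)
    return parts


def encode_partition(part_rows, bucket):
    """An encoding task only needs the bucket that matches its partition."""
    return [bucket[v] for v in part_rows]


rows = ["u1", "u2", "u2", "u3", "u4", "u1"]
buckets = build_bucketed_dictionary(set(rows))
parts = repartition(rows)

# Collect the ids seen for each value across all partitions.
ids_by_value = {}
for p, part in enumerate(parts):
    for v, i in zip(part, encode_partition(part, buckets[p])):
        ids_by_value.setdefault(v, set()).add(i)
print(ids_by_value)
```

Because the rows and the dictionary entries use the same hash, a lookup never misses, and no task has to hold the whole dictionary in memory.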
Recently this improvement has caused us trouble when the data is skewed. In some of our cases, the encoding/repartition step cannot finish at all. The same data works fine in Kylin 3, because each Spark task loads the whole dictionary of a column and encodes the column values to int values, so no repartition step is needed.
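The skew problem can be reproduced with a small plain-Python sketch. The data set, the CRC32 hash and the partition counts are assumptions chosen for illustration: one hypothetical value, "hot", makes up 90 percent of the rows.

```python
import zlib
from collections import Counter


def bucket_of(value, num_buckets):
    """Stable hash partitioner (CRC32 is an arbitrary illustrative choice)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def partition_sizes(rows, num_buckets):
    """Count how many rows land in each partition after hash repartitioning."""
    sizes = Counter(bucket_of(v, num_buckets) for v in rows)
    return [sizes.get(p, 0) for p in range(num_buckets)]


# 9000 rows share one value; the remaining 1000 rows are all distinct.
rows = ["hot"] * 9000 + [f"user{i}" for i in range(1000)]

sizes_10 = partition_sizes(rows, 10)
sizes_100 = partition_sizes(rows, 100)

# All "hot" rows hash to the same partition, so adding partitions does not help.
print(max(sizes_10), max(sizes_100))
```

However many partitions are used, one task still receives at least the 9000 "hot" rows, which is the situation in which the repartition step struggles to finish.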
Solutions
Sample from Source Data
Identify the skewed values with some algorithm, e.g. treat a value that accounts for more than 10 percent of the total as skewed.
Create a small dictionary for the skewed values and broadcast it.
Customize the repartition function: send rows with skewed values to random partitions.
Encode the repartitioned data within each partition using both the broadcast dictionary and the corresponding dictionary bucket.
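The steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions, not the code from the actual change: the helper names, the bucket count, the sample size, the CRC32 hash and the id layout are all hypothetical, and ordinary lists and dicts stand in for Spark partitions and broadcast variables.

```python
import random
import zlib
from collections import Counter

NUM_BUCKETS = 4        # assumed bucket/partition count
SKEW_THRESHOLD = 0.10  # "more than 10 percent of the total"


def bucket_of(value):
    return zlib.crc32(value.encode("utf-8")) % NUM_BUCKETS


def find_skewed(rows, sample_size, rng):
    """Sample the source data and return values above the skew threshold."""
    sample = rng.sample(rows, min(sample_size, len(rows)))
    counts = Counter(sample)
    return {v for v, c in counts.items() if c / len(sample) > SKEW_THRESHOLD}


def build_buckets(distinct_values):
    """Bucketed dictionary; id = local index * NUM_BUCKETS + bucket number."""
    buckets = [dict() for _ in range(NUM_BUCKETS)]
    for v in sorted(distinct_values):
        b = bucket_of(v)
        buckets[b][v] = len(buckets[b]) * NUM_BUCKETS + b
    return buckets


def repartition(rows, skewed, rng):
    """Skewed values go to a random partition, all others to their own bucket."""
    parts = [[] for _ in range(NUM_BUCKETS)]
    for v in rows:
        p = rng.randrange(NUM_BUCKETS) if v in skewed else bucket_of(v)
        parts[p].append(v)
    return parts


def encode_partition(part_rows, bucket, broadcast_dict):
    """Look up the small broadcast dictionary first, then the local bucket."""
    return [broadcast_dict[v] if v in broadcast_dict else bucket[v]
            for v in part_rows]


rng = random.Random(42)
rows = ["hot"] * 9000 + [f"user{i}" for i in range(1000)]

skewed = find_skewed(rows, 1000, rng)
buckets = build_buckets(set(rows))
# The small dictionary reuses the ids already assigned in the bucketed dictionary.
broadcast_dict = {v: buckets[bucket_of(v)][v] for v in skewed}

parts = repartition(rows, skewed, rng)
encoded = [encode_partition(parts[p], buckets[p], broadcast_dict)
           for p in range(NUM_BUCKETS)]
print([len(p) for p in parts])
```

The small dictionary reuses the ids from the bucketed dictionary, so a skewed value is encoded identically in whichever partition it lands, while non-skewed values still only need their own bucket.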
We ran a simple test of this approach on our case, and the build job succeeded in 30 minutes. There may be a better way to solve this problem, and we welcome more advice.
1 Comment
Xiaoxiang Yu
See: https://github.com/apache/kylin/pull/1662