Background

Kylin uses bitmaps to accelerate precise count distinct queries. To use a bitmap, column values must first be encoded as int values through a dictionary.
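To make the idea concrete, here is a minimal sketch of dictionary encoding feeding a bitmap. It is not Kylin's implementation: a plain Python int stands in for a real compressed bitmap, and the function names (build_dictionary, encode_to_bitmap, count_distinct) are hypothetical.

```python
def build_dictionary(values):
    """Assign each distinct value a dense int id (sorted order, for determinism)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

def encode_to_bitmap(values, dictionary):
    """Set one bit per encoded id; a Python int stands in for a real bitmap."""
    bitmap = 0
    for v in values:
        bitmap |= 1 << dictionary[v]
    return bitmap

def count_distinct(bitmap):
    """The precise distinct count is the number of set bits."""
    return bin(bitmap).count("1")

day1 = ["alice", "bob", "alice"]
day2 = ["bob", "carol"]

# Both days must share one dictionary so that equal values map to equal bits.
dictionary = build_dictionary(day1 + day2)
bm1 = encode_to_bitmap(day1, dictionary)
bm2 = encode_to_bitmap(day2, dictionary)

# Merging two bitmaps is a bitwise OR, so duplicates across days collapse.
total = count_distinct(bm1 | bm2)
```

Because the merge is a bitwise OR, the distinct count over several segments stays exact as long as every segment was encoded with the same dictionary.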

In Kylin 4, each dictionary is hashed into several buckets, and the column data is repartitioned with the same hash algorithm. Each encoding task then only needs to load a single dictionary bucket to do the encoding step.
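The bucketed scheme can be sketched roughly as follows, which also shows why a hot value is a problem. This is an illustration under assumptions, not Kylin code: zlib.crc32 stands in for whatever hash is actually used, NUM_BUCKETS is an arbitrary value, and all function names are hypothetical.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count

def bucket_of(value, num_buckets=NUM_BUCKETS):
    """Stand-in hash: the same function routes dictionary entries and rows."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets

def build_bucketed_dictionary(distinct_values, num_buckets=NUM_BUCKETS):
    """Assign global ids, then store each entry in the bucket its hash selects."""
    buckets = [dict() for _ in range(num_buckets)]
    for next_id, v in enumerate(sorted(distinct_values)):
        buckets[bucket_of(v, num_buckets)][v] = next_id
    return buckets

def repartition(rows, num_buckets=NUM_BUCKETS):
    """Send every row to the partition that matches its dictionary bucket."""
    parts = [[] for _ in range(num_buckets)]
    for v in rows:
        parts[bucket_of(v, num_buckets)].append(v)
    return parts

def encode_partition(rows, bucket_dict):
    """Each task needs only its own bucket of the dictionary."""
    return [bucket_dict[v] for v in rows]

# 90 percent of the rows carry the same value, so one partition gets them all.
rows = ["hot"] * 90 + [f"user{i}" for i in range(10)]
buckets = build_bucketed_dictionary(set(rows))
partitions = repartition(rows)
encoded = [encode_partition(p, buckets[i]) for i, p in enumerate(partitions)]
partition_sizes = [len(p) for p in partitions]
```

Every task stays small in terms of dictionary memory, but the partition that owns the hot value receives at least 90 of the 100 rows regardless of how many buckets are configured.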

Recently this improvement has caused us trouble when data skew occurs. In some of our cases, the encoding/repartition step cannot finish at all. The same builds work fine in Kylin 3, where each Spark task loads the entire dictionary of a column and encodes the column values to int values, so no repartition step is needed.

Solutions

Sample from the source data.
Identify skewed values by some algorithm, e.g. a value that accounts for more than 10 percent of the total.
Create a small dictionary for the skewed values and broadcast it.
Customize the repartition function: skewed values are sent to random partitions, while other values keep the hash partitioning.
Encode the repartitioned data within each partition, using both the broadcast dictionary and the corresponding dictionary bucket.
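The steps above can be sketched in plain Python as follows. This is a simplified, single-process illustration under assumptions, not the actual patch: the 10 percent threshold comes from the example above, while the sample fraction, the partition count, the crc32 stand-in hash, and every function name are hypothetical. In a real Spark job the small dictionary would be a broadcast variable and the routing function would be a custom partitioner.

```python
import random
import zlib
from collections import Counter

SKEW_THRESHOLD = 0.10   # "more than 10 percent of the total"
NUM_PARTITIONS = 4      # hypothetical partition/bucket count

def bucket_of(value, n=NUM_PARTITIONS):
    return zlib.crc32(value.encode("utf-8")) % n

def find_skewed_values(rows, sample_fraction, rng, threshold=SKEW_THRESHOLD):
    """Steps 1-2: sample the source and keep values above the threshold."""
    sample = [v for v in rows if rng.random() < sample_fraction]
    counts = Counter(sample)
    return {v for v, c in counts.items() if c > threshold * len(sample)}

def build_dictionaries(rows, skewed, n=NUM_PARTITIONS):
    """Step 3: bucketed dictionary plus a small dictionary of skewed values."""
    global_ids = {v: i for i, v in enumerate(sorted(set(rows)))}
    buckets = [dict() for _ in range(n)]
    for v, i in global_ids.items():
        buckets[bucket_of(v, n)][v] = i
    broadcast = {v: global_ids[v] for v in skewed}
    return buckets, broadcast

def skew_aware_partition(value, skewed, rng, n=NUM_PARTITIONS):
    """Step 4: skewed values go to a random partition, the rest follow the hash."""
    return rng.randrange(n) if value in skewed else bucket_of(value, n)

def encode_partition(rows, bucket_dict, broadcast):
    """Step 5: try the broadcast dictionary first, then the local bucket."""
    return [broadcast[v] if v in broadcast else bucket_dict[v] for v in rows]

rng = random.Random(42)
rows = ["hot"] * 9000 + [f"user{i}" for i in range(1000)]

skewed = find_skewed_values(rows, sample_fraction=0.1, rng=rng)
buckets, broadcast = build_dictionaries(rows, skewed)

parts = [[] for _ in range(NUM_PARTITIONS)]
for v in rows:
    parts[skew_aware_partition(v, skewed, rng)].append(v)

encoded = [encode_partition(p, buckets[i], broadcast) for i, p in enumerate(parts)]
```

Routing a skewed value to a random partition is safe only because its id is available everywhere through the broadcast dictionary. Non-skewed values still land in the partition whose dictionary bucket contains them, so each task keeps loading just one bucket plus the small broadcast dictionary.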

We ran a simple test of this approach on our case, and the build job succeeded in 30 minutes. There may be a better way to solve this problem; further advice is welcome.

KIP-8 Improve dict encoding's performance when DataSkew happens
