
Apache Kylin : Analytical Data Warehouse for Big Data


Some properties are not listed here because they are not intended for Kylin users.

Basic configuration

Property | Default | Description | Since
kylin.snapshot.parallel-build-enabled | | |
kylin.snapshot.parallel-build-timeout-seconds | | |
kylin.snapshot.shard-size-mb | | |
kylin.storage.columnar.shard-size-mb | | |
kylin.storage.columnar.shard-rowcount | | |
kylin.storage.columnar.shard-countdistinct-rowcount | | |
kylin.storage.columnar.repartition-threshold-size-mb | | |
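As a minimal kylin.properties sketch of the snapshot- and shard-related properties above. The values shown are the 4.0.0-alpha defaults listed in the legacy table at the end of this page, not confirmed 4.0.0 defaults, so verify them against your Kylin version before relying on them:

  # Build table snapshots in parallel, allowing up to 3600 seconds per build
  kylin.snapshot.parallel-build-enabled=true
  kylin.snapshot.parallel-build-timeout-seconds=3600
  # Cap the size and row count of each pre-calculated cuboid Parquet shard
  kylin.storage.columnar.shard-size-mb=128
  kylin.storage.columnar.shard-rowcount=2500000
  kylin.storage.columnar.shard-countdistinct-rowcount=1000000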

Advanced configuration

kylin.engine.submit-hadoop-conf-dir
Default: null

kylin.engine.spark.cache-parent-dataset-storage-level
Default: NONE
Since: 4.0.0

kylin.engine.spark.cache-parent-dataset-count
Default: 1
Since: 4.0.0

kylin.engine.build-base-cuboid-enabled
Default: true
Since: 4.0.0

kylin.engine.spark.repartition.dataset.after.encode-enabled
Default: false
Since: 4.0.0
A global dictionary is split into several buckets. To encode a column to int values more efficiently, the source dataset is repartitioned by the to-be-encoded column into the same number of partitions as the dictionary's bucket size. This sometimes has a side effect: repartitioning by a single column is more likely to cause serious data skew, so one task can take the majority of the time in the first layer's cuboid building. When faced with this case, you can try repartitioning the encoded dataset by all RowKey columns to avoid the skew. The repartition size defaults to the max bucket size of all dictionaries, but you can also set it to another value via the option 'kylin.engine.spark.dataset.repartition.num.after.encoding' (see the example after this table).

kylin.engine.spark.repartition.dataset.after.encode.num
Default: 0
Since: 4.0.0
See above.
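A minimal kylin.properties sketch of the repartition-by-RowKey option described above. Note that the description and the property list spell the row-count key differently ('kylin.engine.spark.dataset.repartition.num.after.encoding' vs. 'kylin.engine.spark.repartition.dataset.after.encode.num'); the snippet uses the spelling from the property list, and the value 500 is purely illustrative, so check the key name and pick a value that fits your data:

  # Repartition the encoded dataset by all RowKey columns to avoid skew on the encoded column
  kylin.engine.spark.repartition.dataset.after.encode-enabled=true
  # Optional: override the repartition number (0 keeps the default, i.e. the max bucket size of all dictionaries)
  kylin.engine.spark.repartition.dataset.after.encode.num=500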


Spark resources automatic adjustment strategy

kylin.spark-conf.auto.prior
Default: true
Since: 4.0.0
For a CubeBuildJob and CubeMergeJob, it is important to allocate enough and proper resources (CPU/memory), mainly via the following config entries:

  • spark.driver.memory
  • spark.executor.memory
  • spark.executor.cores
  • spark.executor.memoryOverhead
  • spark.executor.instances
  • spark.sql.shuffle.partitions

When `kylin.spark-conf.auto.prior` is set to true, Kylin will try to adjust the above config entries according to:

  • the count of cuboids to be built
  • the (max) size of the fact table
  • the available resources in the current resource manager's queue

Users can still override individual settings via `kylin.engine.spark-conf.XXX` at Cube level (see the example after this table).
Check the details at How to improve cube building and query performance.

kylin.engine.spark-conf.XXX
Default: null
Since: 4.0.0
Users can set the Spark conf of the Cube/Merge job at Cube level.

kylin.engine.driver-memory-base
Default: 1024
Since: 4.0.0
Driver memory (spark.driver.memory) is auto-adjusted based on the cuboid count and this configuration.

kylin.engine.driver-memory-strategy decides the level. For example, "2,20,100" translates into four cuboid-count ranges, from low to high:

  • Level 1: (0, 2)
  • Level 2: (2, 20)
  • Level 3: (20, 100)
  • Level 4: (100, +)

So a proper level can be found for a specific cuboid count: 12 falls into level 2, and 230 falls into level 4.

Driver memory is then calculated by the following formula (a worked example follows after this table):

  min(kylin.engine.driver-memory-base * level, kylin.engine.driver-memory-maximum)

kylin.engine.driver-memory-maximum
Default: 4096
Since: 4.0.0
See above.

kylin.engine.driver-memory-strategy
Default: 2,20,100
Since: 4.0.0
See above.

kylin.engine.base-executor-instance
Default: 5
Since: 4.0.0

kylin.engine.spark.required-cores
Default: 1
Since: 4.0.0

kylin.engine.executor-instance-strategy
Default: 100,2,500,3,1000,4
Since: 4.0.0

kylin.engine.retry-memory-gradient
Default:
Since: 4.0.0
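As a sketch of the Cube-level override mentioned above: with auto adjustment left on, individual Spark settings can still be pinned through the kylin.engine.spark-conf.XXX prefix. The extraJavaOptions line mirrors the example given for kylin.engine.spark-conf.XXX later on this page; the memory and core values are illustrative only:

  # Let Kylin size the build/merge jobs adaptively
  kylin.spark-conf.auto.prior=true
  # Pin individual Spark settings; these override the auto-adjusted values
  kylin.engine.spark-conf.spark.executor.memory=4g
  kylin.engine.spark-conf.spark.executor.cores=4
  kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=current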

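A worked example of the driver-memory formula with the defaults above (base 1024, strategy "2,20,100", maximum 4096):

  12 cuboids  -> level 2 -> min(1024 * 2, 4096) = 2048 MB for spark.driver.memory
  230 cuboids -> level 4 -> min(1024 * 4, 4096) = 4096 MB for spark.driver.memory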

Global dictionary

Property | Default | Description | Since
kylin.dictionary.detect-data-skew-sample-enabled | | |
kylin.dictionary.detect-data-skew-sample-rate | | |
kylin.dictionary.detect-data-skew-percentage-threshold | | |

Please remove the following:

kylin.engine.spark.build-class-name
Default: org.apache.kylin.engine.spark.job.CubeBuildJob
Version: 4.0.0-alpha
For developers only. The class name used in spark-submit.

kylin.engine.spark.cluster-info-fetcher-class-name
Default: org.apache.kylin.cluster.YarnInfoFetcher
Version: 4.0.0-alpha
For developers only. Fetches the YARN information of the Spark job.

kylin.engine.spark-conf.XXX
Version: 4.0.0-alpha
  1. Before Kylin submits a cubing job, some major properties (cores and memory) are adjusted adaptively (if kylin.spark-conf.auto.prior is set to true).
  2. After the auto adjustment, the Spark conf is overwritten by this property. For example, to set spark.driver.extraJavaOptions=-Dhdp.version=current, add the following line to kylin.properties:
     kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=current

kylin.storage.provider
Default: org.apache.kylin.common.storage.DefaultStorageProvider
Version: 4.0.0-alpha
The content summary objects returned by different cloud vendors are not the same, so targeted implementations need to be provided. You can refer to org.apache.kylin.common.storage.IStorageProvider to learn more.

kylin.engine.spark.merge-class-name
Default: org.apache.kylin.engine.spark.job.CubeMergeJob
Version: 4.0.0-alpha
For developers only. The class name used in spark-submit.

kylin.engine.spark.task-impact-instance-enabled
Default: true
Version: 4.0.0-alpha
(Description: Updating)

kylin.engine.spark.task-core-factor
Default: 3
Version: 4.0.0-alpha
(Description: Updating)

kylin.engine.driver-memory-base
Default: 1024
Version: 4.0.0-alpha
Auto-adjust spark.driver.memory for the Build Engine if kylin.engine.spark-conf.spark.driver.memory is not set.

kylin.engine.driver-memory-strategy
Default: {"2", "20", "100"}
Version: 4.0.0-alpha
(Description: Updating)

kylin.engine.driver-memory-maximum
Default: 4096
Version: 4.0.0-alpha
(Description: Updating)

kylin.engine.persist-flattable-threshold
Default: 1
Version: 4.0.0-alpha
If the number of cuboids to be built from the flat table is bigger than this threshold, the flat table will be persisted into $HDFS_WORKING_DIR/job_tmp/flat_table to save memory.

kylin.snapshot.parallel-build-timeout-seconds
Default: 3600
Version: 4.0.0-alpha
To improve the speed of snapshot building.

kylin.snapshot.parallel-build-enabled
Default: true
(Description: Updating)

kylin.spark-conf.auto.prior
Default: true
Version: 4.0.0-alpha
Enable adjusting Spark parameters adaptively.

kylin.engine.submit-hadoop-conf-dir
Default: /etc/hadoop/conf
Version: 4.0.0-alpha
Set HADOOP_CONF_DIR for spark-submit.

kylin.storage.columnar.shard-size-mb
Default: 128
Version: 4.0.0-alpha
The max size of a pre-calculated cuboid Parquet file.

kylin.storage.columnar.shard-rowcount
Default: 2500000
Version: 4.0.0-alpha
The max row count of a pre-calculated cuboid Parquet file.

kylin.storage.columnar.shard-countdistinct-rowcount
Default: 1000000
Version: 4.0.0-alpha
The max row count of a pre-calculated cuboid Parquet file when the cuboid has a bitmap measure (such cuboids are large).

kylin.query.spark-engine.join-memory-fraction
Default: 0.3
Version: 4.0.0-alpha
Limit the memory used by broadcast joins in Sparder (broadcast joins can cause instability).