Apache Kylin : Analytical Data Warehouse for Big Data
Welcome to Kylin Wiki.
Basic configuration
Property | Default | Description | Since |
---|---|---|---|
kylin.snapshot.parallel-build-enabled | true | Whether to build snapshot tables in parallel, to speed up snapshot building. ||
kylin.snapshot.parallel-build-timeout-seconds | 3600 | Timeout for parallel snapshot building, in seconds. ||
kylin.snapshot.shard-size-mb | |||
kylin.storage.columnar.shard-size-mb | 128 | The max size (MB) of a pre-calculated cuboid parquet file. | 4.0.0-ALPHA |
kylin.storage.columnar.shard-rowcount | 2500000 | The max row count of a pre-calculated cuboid parquet file. | 4.0.0-ALPHA |
kylin.storage.columnar.shard-countdistinct-rowcount | 1000000 | The max row count of a pre-calculated cuboid parquet file when the cuboid has a bitmap measure (such cuboids are large). | 4.0.0-ALPHA |
kylin.storage.columnar.repartition-threshold-size-mb | |||
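As a quick reference, the snapshot and shard properties above can be tuned together in `kylin.properties`; the values below are illustrative, matching the defaults documented on this page:

```properties
# Build snapshot tables in parallel, with a timeout (seconds)
kylin.snapshot.parallel-build-enabled=true
kylin.snapshot.parallel-build-timeout-seconds=3600
# Max size (MB) of a pre-calculated cuboid parquet file
kylin.storage.columnar.shard-size-mb=128
# Max row count per cuboid parquet file
kylin.storage.columnar.shard-rowcount=2500000
# Tighter row limit when the cuboid contains a bitmap (count distinct) measure
kylin.storage.columnar.shard-countdistinct-rowcount=1000000
```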
Advanced configuration
Property | Default | Description | Since |
---|---|---|---|
kylin.engine.submit-hadoop-conf-dir | null | The HADOOP_CONF_DIR used for spark-submit. ||
kylin.engine.spark.cache-parent-dataset-storage-level | NONE | | 4.0.0 |
kylin.engine.spark.cache-parent-dataset-count | 1 | | 4.0.0 |
kylin.engine.build-base-cuboid-enabled | true | | 4.0.0 |
kylin.engine.spark.repartition.dataset.after.encode-enabled | false | The global dictionary is split into several buckets. To encode a column to int values more efficiently, the source dataset is repartitioned by the to-be-encoded column into the same number of partitions as the dictionary's bucket size. This sometimes has a side effect: repartitioning by a single column is more likely to cause serious data skew, so one task can take the majority of the time in the first layer's cuboid building. In that case, you can try repartitioning the encoded dataset by all RowKey columns to avoid the skew. The repartition size defaults to the max bucket size of all dictionaries, but you can override it with 'kylin.engine.spark.repartition.dataset.after.encode.num'. | 4.0.0 |
kylin.engine.spark.repartition.dataset.after.encode.num | 0 | see above | 4.0.0 |
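If the encode step suffers from data skew as described above, the two knobs can be combined; a sketch in `kylin.properties` (the repartition number is an illustrative value, not a recommendation):

```properties
# Repartition the encoded dataset by all RowKey columns to avoid data skew
kylin.engine.spark.repartition.dataset.after.encode-enabled=true
# Optional override of the repartition number (0 = use the max dictionary bucket size)
kylin.engine.spark.repartition.dataset.after.encode.num=200
```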
Spark resources automatic adjustment strategy
Property | Default | Description | Since |
---|---|---|---|
kylin.spark-conf.auto.prior | true | For a CubeBuildJob or CubeMergeJob, it is important to allocate enough and proper resources (CPU/memory), mainly via the following config entries: spark.driver.memory, spark.executor.memory, spark.executor.cores, spark.executor.memoryOverhead, spark.executor.instances, spark.sql.shuffle.partitions. When `kylin.spark-conf.auto.prior` is set to true, Kylin will try to adjust these entries according to: the count of cuboids to be built; the (max) size of the fact table; the available resources in the current resource manager's queue. Users can still override individual settings via `kylin.engine.spark-conf.XXX` at Cube level. See "How to improve cube building and query performance" for details. | 4.0.0 |
kylin.engine.spark-conf.XXX | null | Users can set the Spark conf of the Cube/Merge job at Cube level. | 4.0.0 |
kylin.engine.driver-memory-base | 1024 | Driver memory (spark.driver.memory) is auto-adjusted according to the cuboid count. kylin.engine.driver-memory-strategy decides the level: for example, "2,20,100" translates into four cuboid-count ranges, from low to high: (0, 2], (2, 20], (20, 100], (100, +∞), mapping to levels 1 to 4. So we can find the proper level for a specific cuboid count, and driver memory is calculated by the formula: min(kylin.engine.driver-memory-base * level, kylin.engine.driver-memory-maximum) | 4.0.0 |
kylin.engine.driver-memory-maximum | 4096 | See above. | 4.0.0 |
kylin.engine.driver-memory-strategy | 2,20,100 | See above. | 4.0.0 |
kylin.engine.base-executor-instance | 5 | | 4.0.0 |
kylin.engine.spark.required-cores | 1 | | 4.0.0 |
kylin.engine.executor-instance-strategy | 100,2,500,3,1000,4 | | 4.0.0 |
kylin.engine.retry-memory-gradient | | | 4.0.0 |
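The driver-memory formula above can be sketched as follows. This is an illustrative reimplementation, not Kylin's actual code; the range-to-level mapping is an assumption consistent with the description of `kylin.engine.driver-memory-strategy`:

```python
def driver_memory_mb(cuboid_count, base_mb=1024, max_mb=4096, strategy=(2, 20, 100)):
    """Sketch of min(driver-memory-base * level, driver-memory-maximum).

    The strategy "2,20,100" yields the ranges (0,2], (2,20], (20,100],
    (100,+inf), which map to levels 1..4.
    """
    # Level = 1 + the number of strategy thresholds the cuboid count exceeds.
    level = 1 + sum(1 for threshold in strategy if cuboid_count > threshold)
    return min(base_mb * level, max_mb)
```

For example, 15 cuboids fall into the range (2, 20], i.e. level 2, so the driver gets min(1024 * 2, 4096) = 2048 MB.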
Global dictionary
Property | Default | Description | Since |
---|---|---|---|
kylin.dictionary.detect-data-skew-sample-enabled | |||
kylin.dictionary.detect-data-skew-sample-rate | |||
kylin.dictionary.detect-data-skew-percentage-threshold | |||
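No defaults or descriptions are documented for these properties here; a hypothetical `kylin.properties` fragment (all values below are assumptions for illustration only) might look like:

```properties
# Sample the source data to detect skew before building the global dictionary
# (hypothetical values -- check your Kylin version's defaults)
kylin.dictionary.detect-data-skew-sample-enabled=true
kylin.dictionary.detect-data-skew-sample-rate=0.1
kylin.dictionary.detect-data-skew-percentage-threshold=0.2
```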
The following entries are from an earlier revision of this page and should be removed:
Property | Default | Description | Version |
---|---|---|---|
kylin.engine.spark.build-class-name | org.apache.kylin.engine.spark.job.CubeBuildJob | For developers only. The class name used in spark-submit. | 4.0.0-ALPHA |
kylin.engine.spark.cluster-info-fetcher-class-name | org.apache.kylin.cluster.YarnInfoFetcher | For developers only. Fetches YARN information for the Spark job. | 4.0.0-ALPHA |
kylin.engine.spark-conf.XXX | | Pass any Spark configuration to spark-submit, e.g. kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=current | 4.0.0-ALPHA |
kylin.storage.provider | org.apache.kylin.common.storage.DefaultStorageProvider | The content summary objects returned by different cloud vendors differ, so a targeted implementation is needed. See org.apache.kylin.common.storage.IStorageProvider for details. | 4.0.0-ALPHA |
kylin.engine.spark.merge-class-name | org.apache.kylin.engine.spark.job.CubeMergeJob | For developers only. The class name used in spark-submit. | 4.0.0-ALPHA |
kylin.engine.spark.task-impact-instance-enabled | true | UPDATING | 4.0.0-ALPHA |
kylin.engine.spark.task-core-factor | 3 | UPDATING | 4.0.0-ALPHA |
kylin.engine.driver-memory-base | 1024 | Auto adjust spark.driver.memory for the build engine if kylin.engine.spark-conf.spark.driver.memory is not set. | |
kylin.engine.driver-memory-strategy | {"2", "20", "100"} | UPDATING | 4.0.0-ALPHA |
kylin.engine.driver-memory-maximum | 4096 | UPDATING | 4.0.0-ALPHA |
kylin.engine.persist-flattable-threshold | 1 | If the number of cuboids to be built from the flat table is bigger than this threshold, the flat table is persisted into $HDFS_WORKING_DIR/job_tmp/flat_table to save memory. | 4.0.0-ALPHA |
kylin.snapshot.parallel-build-timeout-seconds | 3600 | Timeout for parallel snapshot building, to improve the speed of snapshot builds. | |
kylin.snapshot.parallel-build-enabled | true | UPDATING | |
kylin.spark-conf.auto.prior | true | Enable adaptive adjustment of Spark parameters. | 4.0.0-ALPHA |
kylin.engine.submit-hadoop-conf-dir | /etc/hadoop/conf | Set HADOOP_CONF_DIR for spark-submit. | 4.0.0-ALPHA |
kylin.storage.columnar.shard-size-mb | 128 | The max size of a pre-calculated cuboid parquet file. | 4.0.0-ALPHA |
kylin.storage.columnar.shard-rowcount | 2500000 | The max row count of a pre-calculated cuboid parquet file. | 4.0.0-ALPHA |
kylin.storage.columnar.shard-countdistinct-rowcount | 1000000 | The max row count of a pre-calculated cuboid parquet file when the cuboid has a bitmap measure (such cuboids are large). | 4.0.0-ALPHA |
kylin.query.spark-engine.join-memory-fraction | 0.3 | Limit the memory used by Sparder's broadcast joins (broadcast joins can cause instability). | 4.0.0-ALPHA |
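As an example of overriding the automatic resource adjustment described earlier, Spark settings can be pinned via the `kylin.engine.spark-conf.` prefix; the values below are illustrative only, not recommendations:

```properties
# Override auto-adjusted Spark resources for build/merge jobs (example values)
kylin.engine.spark-conf.spark.executor.instances=10
kylin.engine.spark-conf.spark.executor.memory=4g
kylin.engine.spark-conf.spark.executor.cores=2
kylin.engine.spark-conf.spark.sql.shuffle.partitions=200
```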