
Apache Kylin : Analytical Data Warehouse for Big Data

...

Spark resources automatic adjustment strategy (experimental feature)


Property: kylin.spark-conf.auto.prior
Default: true
Description:

For a CubeBuildJob or CubeMergeJob, it is important to allocate enough and proper resources (CPU/memory), mainly via the following config entries:
  • spark.driver.memory
  • spark.executor.memory
  • spark.executor.cores
  • spark.executor.memoryOverhead
  • spark.executor.instances
  • spark.sql.shuffle.partitions

When `kylin.spark-conf.auto.prior` is set to true, Kylin will try to adjust the above config entries according to:
  • the count of cuboids to be built
  • the max size of the fact table
  • the available resources in the current resource manager's queue

However, the user can still override individual configs at the Cube level in the form `kylin.engine.spark-conf.<key> = <value>`; a value configured by the user overwrites the automatically adjusted value (see the sketch below). For details, see How to improve cube building and query performance.
Since: 4.0.0
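
For example, a minimal sketch of overriding two of the auto-adjusted entries through Cube-level configuration overwrites (the values here are illustrative only, not recommendations):

  kylin.engine.spark-conf.spark.executor.memory=8g
  kylin.engine.spark-conf.spark.executor.cores=4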

Property: kylin.engine.spark-conf.spark.master
Default: yarn
Description: The cluster manager to connect to. Kylin supports setting it to yarn, local, or standalone.

Property: kylin.engine.spark-conf.spark.submit.deployMode
Default: client
Description: The deploy mode of the Spark driver program, either "client" or "cluster", which means launching the driver program locally ("client") or remotely ("cluster") on one of the nodes inside the cluster.


Property: kylin.engine.spark-conf.spark.yarn.queue
Default: default

Property: kylin.engine.spark-conf.spark.shuffle.service.enabled
Default: false
Description: Enables the external shuffle service. This service preserves the shuffle files written by executors so the executors can be safely removed. The external shuffle service must be set up in order to enable it (see the sketch below).
Since: 4.0.0
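
The external shuffle service itself runs inside each YARN NodeManager as an auxiliary service. A minimal sketch of the standard Spark-on-YARN setup in yarn-site.xml, per Spark's own documentation (keep whatever aux-services your cluster already lists, e.g. mapreduce_shuffle):

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>

The spark-<version>-yarn-shuffle.jar must also be placed on the NodeManager classpath, and the NodeManagers restarted afterwards.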

Property: kylin.engine.spark-conf.spark.eventLog.enabled
Default: true
Description: Whether to log Spark events, useful for reconstructing the Web UI after the application has finished.

Property: kylin.engine.spark-conf.spark.eventLog.dir
Default: hdfs:///kylin/spark-history
Description: Base directory in which Spark events are logged, if spark.eventLog.enabled is true.

Property: kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled
Default: false

Property: kylin.engine.spark-conf.spark.executor.extraJavaOptions
Default:
  -Dfile.encoding=UTF-8
  -Dhdp.version=current
  -Dlog4j.configuration=spark-executor-log4j.properties
  -Dlog4j.debug
  -Dkylin.hdfs.working.dir=${hdfs.working.dir}
  -Dkylin.metadata.identifier=${kylin.metadata.url.identifier}
  -Dkylin.spark.category=job
  -Dkylin.spark.project=${job.project}
  -Dkylin.spark.identifier=${job.id}
  -Dkylin.spark.jobName=${job.stepId}
  -Duser.timezone=${user.timezone}

Property: kylin.engine.spark-conf.spark.yarn.jars
Default: hdfs://localhost:9000/spark2_jars/*
Description: Manually upload the spark-assembly jars to HDFS and then set this property to avoid repeatedly uploading jars at runtime (see the sketch below).
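
A sketch of preparing that directory, assuming Spark's jars sit under $SPARK_HOME/jars and HDFS is reachable at hdfs://localhost:9000 as in the default value above:

  hadoop fs -mkdir -p hdfs://localhost:9000/spark2_jars
  hadoop fs -put $SPARK_HOME/jars/* hdfs://localhost:9000/spark2_jars/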

Property:
Default: null
Description: The user can choose to set the Spark conf of a Cube/Merge job at the Cube level.
Since: 4.0.0


Property: kylin.engine.driver-memory-base
Default: 1024
Description:

Driver memory (spark.driver.memory) is auto-adjusted by cuboid count and configuration.

kylin.engine.driver-memory-strategy decides the level. For example, "2,20,100" translates into four cuboid-count ranges, from low to high:
  • Level 1: (0, 2)
  • Level 2: (2, 20)
  • Level 3: (20, 100)
  • Level 4: (100, +∞)

So we can find the proper level for a specific cuboid count: 12 falls into level 2, and 230 falls into level 4.

Driver memory is then calculated by the following formula:

  min(kylin.engine.driver-memory-base * level, kylin.engine.driver-memory-maximum)

Since: 4.0.0
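
For example, with the defaults (base 1024, maximum 4096, strategy "2,20,100"): a job building 12 cuboids is level 2, so spark.driver.memory = min(1024 * 2, 4096) = 2048 MB; a job building 230 cuboids is level 4, so min(1024 * 4, 4096) = 4096 MB.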

Property: kylin.engine.driver-memory-maximum
Default: 4096
Description: See above.
Since: 4.0.0

Property: kylin.engine.driver-memory-strategy
Default: 2,20,100
Description: See above.
Since: 4.0.0

Property: kylin.engine.base-executor-instance
Default: 5
Since: 4.0.0

Property: kylin.engine.spark.required-cores
Default: 1
Since: 4.0.0

Property: kylin.engine.executor-instance-strategy
Default: 100,2,500,3,1000,4
Since: 4.0.0

Property: kylin.engine.retry-memory-gradient
Default:
Since: 4.0.0

...

The following files live under WORKING-DIR/$PROJECT/job_tmp/${JOB_ID}/share and are produced in the first step of a BuildJob. They serve the Spark resources automatic adjustment strategy (source code: ResourceDetectBeforeCubingJob).
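
To see which detect files a job produced, one can list that directory, keeping the placeholders from the path above (fill them in for a concrete project and job):

  hadoop fs -ls WORKING-DIR/$PROJECT/job_tmp/${JOB_ID}/share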

File: count_distinct.json
Data Type: Boolean
Format: Binary
Description: Whether the Cube contains a COUNT_DISTINCT(bitmap) measure.
Sample:
  true

File: ${JOB_ID}_resource_path.json
Data Type: Map<String, List<String>>
Format: Binary
Description: Key is the cuboid ID, and value is the list of the cuboid's parent dataset's partition paths. -1 means the flat table.
Sample:
  {
     "-1" : ["hdfs://cdh-master:8020/user/hive/warehouse/tpch_flat_orc_10.db/lineitem",
             "hdfs://cdh-master:8020/user/hive/warehouse/tpch_flat_orc_10.db/part"]
  }


File: ${JOB_ID}_cubing_detect_items.json
Data Type: Map<String, Integer>
Format: Binary
Description: Key is the cuboid ID, and value is the cuboid's parent dataset's partition count.
Sample:
  {
    "-1": 32
  }

Global dictionary

Property: kylin.dictionary.detect-data-skew-sample-enabled

Property: kylin.dictionary.detect-data-skew-sample-rate

Property: kylin.dictionary.detect-data-skew-percentage-threshold

...