...

    In Kylin 4, a cube building job has two steps: the first step detects how many source files will be built into cube data, and the second step builds the snapshot tables (if needed), generates the global dictionaries (if needed), and builds the cube data as Parquet files. In the second step, all calculations are relatively heavy operations, so besides using Joint and Hierarchy on dimensions to reduce the number of cuboids (refer to the section 'Reduce combinations' in http://kylin.apache.org/docs/tutorial/cube_build_performance.html), it is also very important to use proper Spark resources and configurations to build the cube data. This section covers 3 key points for improving cube building performance.
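
    As a rough illustration of what the first step produces, the sketch below scans a flat-table directory on HDFS and reports the largest source file, which is the figure the allocation rules in the next section key off. The helper name and the path are hypothetical; this is not Kylin's actual detection code.

        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileSystem, Path}

        // Hypothetical helper: find the largest file under a build's flat-table directory,
        // the value that later drives executor memory/core allocation.
        object MaxSourceFileSize {
          def apply(flatTableDir: String): Long = {
            val fs = FileSystem.get(new Configuration())
            val files = fs.listStatus(new Path(flatTableDir)).filter(_.isFile)
            if (files.isEmpty) 0L else files.map(_.getLen).max
          }

          def main(args: Array[String]): Unit = {
            // The path is a placeholder; the real layout depends on your Kylin working directory.
            val maxBytes = MaxSourceFileSize("/kylin/flat_table/example_cube")
            println(s"max source file size: ${maxBytes / (1024 * 1024)} MB")
          }
        }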

...

    If you don't know how to set these configurations properly, Kylin 4 will use the allocation rules below to set the Spark resources and configurations automatically. All of them are derived from the maximum size of the source files and from whether the cube has an accurate count distinct measure, which is why the first step needs to detect how many source files will be built. You can see these allocation rules in the class 'SparkConfHelper':

...

        If ${the maximum file size} >= 1G or ${exist accurate count distinct}, then set 'spark.executor.memory' to 4G;

        Otherwise set 'spark.executor.memory' to 1G.

  • ExecutorCoreRule

        If ${the maximum file size} >= 1G or ${exist accurate count distinct}, then set 'spark.executor.cores' to 5;

        Otherwise set 'spark.executor.cores' to 1.

  • ExecutorOverheadRule

...

  1. Get the value of the required cores; the default value is 1;
  2. Get the value of the configuration 'kylin.engine.base-executor-instance' as the basic executor instances; the default value is 5;
  3. According to the number of cuboids, calculate the required number of executor instances: ${calculateExecutorInsByCuboidSize}. The calculation strategy is configured by 'kylin.engine.executor-instance-strategy', whose default value is '100,2,500,3,1000,4': if the number of cuboids is greater than or equal to 100, the factor is 2, so the number of executor instances is ${basic executor instances} * ${factor} = 10; if it is greater than or equal to 500, the factor is 3; and so on.
  4. Get the available memory and core count of the default pool from yarn: ${availableMem} and ${availableCore};
  5. Get the total memory after applying 'ExecutorOverheadRule' and 'ExecutorMemoryRule': ${executorMem} = ${spark.executor.memory} + ${spark.executor.memoryOverhead};
  6. Get the core count after applying 'ExecutorCoreRule': ${executorCore};
  7. According to ${availableMem}, ${availableCore}, ${executorCore} and ${executorMem}, calculate the maximum number of executor instances that can be requested from yarn: ${queueAvailableInstance} = Math.min(${availableMem} / ${executorMem}, ${availableCore} / ${executorCore}). The purpose of this step is to avoid requesting more than the resources available on yarn.
  8. Get the final executor instance count: ${executorInstance} = Math.max(Math.min(${calculateExecutorInsByCuboidSize}, ${queueAvailableInstance}), ${kylin.engine.base-executor-instance});
  9. Set 'spark.executor.instances' to ${executorInstance} (a sketch of this calculation follows the list).
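
    Putting these steps together, the sketch below reconstructs the calculation in simplified Scala. It is an illustration of the logic described above rather than the actual SparkConfHelper code, and the example resource figures are made up.

        // Simplified reconstruction of the executor-instances calculation described above.
        object ExecutorInstancesSketch {

          // Parse 'kylin.engine.executor-instance-strategy' ("100,2,500,3,1000,4") into
          // (cuboid threshold, factor) pairs and pick the factor for the given cuboid count.
          def factorForCuboids(strategy: String, cuboidCount: Int): Int = {
            val pairs = strategy.split(",").map(_.trim.toInt).grouped(2)
              .collect { case Array(threshold, factor) => (threshold, factor) }.toSeq
            pairs.collect { case (threshold, factor) if cuboidCount >= threshold => factor }
              .foldLeft(1)((a, b) => math.max(a, b))
          }

          def executorInstances(cuboidCount: Int,
                                baseExecutorInstances: Int, // kylin.engine.base-executor-instance, default 5
                                executorMemMb: Long,        // spark.executor.memory + spark.executor.memoryOverhead
                                executorCores: Int,         // spark.executor.cores after ExecutorCoreRule
                                availableMemMb: Long,       // memory available in the yarn default pool
                                availableCores: Int): Int = {
            // step 3: required instances derived from the cuboid count
            val byCuboidSize = baseExecutorInstances * factorForCuboids("100,2,500,3,1000,4", cuboidCount)
            // step 7: cap by what the queue can actually provide
            val queueAvailable = math.min(availableMemMb / executorMemMb, (availableCores / executorCores).toLong)
            // step 8: never go below the configured base instances
            math.max(math.min(byCuboidSize.toLong, queueAvailable), baseExecutorInstances.toLong).toInt
          }

          def main(args: Array[String]): Unit = {
            // e.g. 600 cuboids, 4G + 1G executors with 5 cores, a queue with 100 GB and 80 cores
            println(executorInstances(600, 5, 5120L, 5, 102400L, 80)) // -> 15
          }
        }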

...

    In Kylin 4.0, the query engine (called SparderContext) also uses Spark as the calculation engine. It is a truly distributed query engine, and especially for complex queries its performance is better than Calcite's. However, there are still many key performance points that need to be optimized. In addition to setting proper calculation resources as mentioned above, this includes reducing small or unevenly sized files, setting a proper number of partitions, and pruning as many Parquet files as possible. Kylin 4.0 and Spark provide some optimization strategies to improve query performance.
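
    For the file-pruning part, the short sketch below is a generic Spark illustration (not Kylin's Sparder code) of why the shard-by layout matters: when the cuboid data is sharded on a column, a filter on that column lets Spark skip whole Parquet files through pushed-down predicates and min/max statistics. The path and column names are placeholders.

        import org.apache.spark.sql.SparkSession

        object PruningSketch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder().appName("pruning-sketch").master("local[*]").getOrCreate()

            // Placeholder path for a cuboid's Parquet directory.
            val cuboid = spark.read.parquet("/kylin/parquet/example_cube/cuboid_1111")

            // With the data sharded on SELLER_ID, this filter only needs the matching files.
            val result = cuboid.filter("SELLER_ID = 10000002").groupBy("TRANS_DATE").sum("PRICE")

            result.explain(true) // inspect PushedFilters and how many files are actually read
            spark.stop()
          }
        }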

...

    According to the log messages, you can see that the final number of partitions is too large, which hurts both building and query performance. After increasing the value of the configuration 'kylin.storage.columnar.shard-rowcount' or 'kylin.storage.columnar.shard-countdistinct-rowcount' and rebuilding, the log messages are shown below:

...
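
    As a rough illustration of how these two settings drive the partition count, the sketch below mirrors the idea (not Kylin's exact formula): the number of partitions grows with the cuboid's row count and shrinks as the per-shard row budget grows, so raising 'kylin.storage.columnar.shard-rowcount' (or 'kylin.storage.columnar.shard-countdistinct-rowcount') produces fewer, larger files. The row counts in the example are made up.

        object ShardCountSketch {
          // partitions ~= ceil(row count / per-shard row budget), never less than 1
          def partitionsFor(rowCount: Long, shardRowCount: Long): Int =
            math.max(1L, math.ceil(rowCount.toDouble / shardRowCount).toLong).toInt

          def main(args: Array[String]): Unit = {
            val rows = 25000000L
            println(partitionsFor(rows, 2500000L))  // smaller budget -> 10 partitions
            println(partitionsFor(rows, 12500000L)) // larger budget  ->  2 partitions
          }
        }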