
Apache Kylin : Analytical Data Warehouse for Big Data


...

Part II. How Kylin on Parquet Works

Code design diagram and analysis

Interface (Cube building)


  • SparkCubingJob

           Extends CubingJob to create the batch job steps for Spark cubing, including two steps: resource detection and cubing (see the sketch after this list). It must extend CubingJob so that JobMonitor can collect job information and show it on the front end.

  • NSparkExecutable

           Submits the Spark job to a local or cluster deployment.

  • SparkApplication

           The instance that is actually executed on Spark.

  • ResourceDetectStep
    • Dump Kylin metadata to the working FS
    • Specify the class name of the Spark task execution
  • SparkCubingStep
    • Dump Kylin metadata to the working FS
    • Specify the class name of the Spark task execution
    • Update metadata after the building job is done
  • ResourceDetectBeforeCubingJob
    • Collect and dump source table info
    • Adaptively adjust Spark parameters
    • Create the flat table and build the global dictionary (if needed)
  • CubeBuildJob
    • Build cuboids by layer (illustrated under Cube build below)
    • Save cuboids to the FS in Parquet format
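As a sketch of how these pieces fit together (only the class names come from the list above; the constructors and addTask() helper are assumptions for illustration, not Kylin's exact API):

Code Block
languagejava
// Hedged sketch: how SparkCubingJob might compose its two steps.
// addTask() and the constructors are assumed signatures for illustration.
public class SparkCubingJob extends CubingJob {
    public static SparkCubingJob create(CubeSegment segment) {
        SparkCubingJob job = new SparkCubingJob();
        // Step 1: dump metadata, then run ResourceDetectBeforeCubingJob on Spark
        job.addTask(new ResourceDetectStep(segment));
        // Step 2: run CubeBuildJob on Spark and update metadata when it finishes
        job.addTask(new SparkCubingStep(segment));
        return job;
    }
}

Both steps are NSparkExecutable instances, so each is submitted to Spark (local or cluster) as a SparkApplication.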

Interface (Merge)


  • SparkMergingJob

            Extends CubingJob to create the batch job steps for Spark merging, including three steps: resource detection, merging, and cleaning up temp files.

Cubing steps and analysis

Resource detection

Collect and dump the following three kinds of source info:

...

Code Block
languagetext
kylin.engine.spark-conf.spark.executor.instances
kylin.engine.spark-conf.spark.executor.cores
kylin.engine.spark-conf.spark.executor.memory
kylin.engine.spark-conf.spark.executor.memoryOverhead
kylin.engine.spark-conf.spark.sql.shuffle.partitions
kylin.engine.spark-conf.spark.driver.memory
kylin.engine.spark-conf.spark.driver.memoryOverhead
kylin.engine.spark-conf.spark.driver.cores


The driver memory base is 1024 MB and is adjusted according to the number of cuboids. The adjustment strategy is defined in KylinConfigBase:

Code Block
languagejava
public int[] getSparkEngineDriverMemoryStrategy() {
    // Cuboid-count thresholds; crossing each one scales the driver memory up from the 1024 MB base
    String[] dft = { "2", "20", "100" };
    return getOptionalIntArray("kylin.engine.driver-memory-strategy", dft);
}
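
For illustration only, here is one hypothetical way such thresholds could translate a cuboid count into a driver memory setting. This is a sketch under the assumption of one 1024 MB step per crossed threshold, not Kylin's exact formula:

Code Block
languagejava
// Hypothetical sketch, not Kylin's exact formula: add one 1024 MB step for
// each threshold in the strategy that the cuboid count exceeds.
static int computeDriverMemoryMB(int cuboidCount, int[] strategy) {
    int base = 1024;
    int steps = 0;
    for (int threshold : strategy) {
        if (cuboidCount > threshold) {
            steps++;
        }
    }
    return base * (steps + 1);
}

Under this sketch, the default strategy { 2, 20, 100 } gives 1024 MB for 1 cuboid, 3072 MB for 50 cuboids, and 4096 MB for 500 cuboids.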

  

Create flat table and Global Dictionary
      Improvements
  • Distributed encoding
  • Uses Roaring64NavigableMap to support cardinality higher than Integer.MAX_VALUE
      Build process
  • Group by on the flat table RDD, then distinct
  • Repartition the RDD, using DictionaryBuilderHelper.calculateBucketSize()
  • MapPartition over the RDD, using DictHelper.genDict()
  • Save the encoded dict file to the FS, using NGlobalDictHDFSStore.writeBucketDict()
      Bucket concept
  • Buckets are used to store dictionaries. The number of buckets equals the number of RDD partitions (task parallelism). A bucket has two important member variables -- relativeDictMap and absoluteDictMap.
  • During one segment building job, dictionaries are encoded in parallel and stored in the relative dictionary; after the segment building job is done, the dictionaries are re-encoded with bucket offsets. This global dictionary is then saved to the FS and tagged as one version (if no global dictionary was built before, the version is 0). See the sketch after this list.
  • When the next segment job starts, it gets the latest version of the dictionary, loads it into the buckets, and adds new distinct values to the buckets.
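
To make the relative/absolute distinction concrete, here is a minimal, self-contained sketch of the two-phase bucket encoding. It simplifies relativeDictMap and absoluteDictMap into plain maps and is not Kylin's actual implementation:

Code Block
languagejava
import java.util.*;

// Simplified sketch of two-phase bucket encoding (illustrative only).
public class BucketDictSketch {
    public static void main(String[] args) {
        List<String> distinctValues = Arrays.asList("a", "b", "c", "d", "e");
        int bucketCount = 2; // equals the RDD partition count in the real build

        // Phase 1: each bucket assigns local ("relative") ids independently,
        // which is what lets the encoding run in parallel per partition.
        List<Map<String, Long>> relativeDicts = new ArrayList<>();
        for (int i = 0; i < bucketCount; i++) {
            relativeDicts.add(new LinkedHashMap<>());
        }
        for (String v : distinctValues) {
            Map<String, Long> bucket = relativeDicts.get(Math.abs(v.hashCode()) % bucketCount);
            bucket.putIfAbsent(v, (long) bucket.size());
        }

        // Phase 2: after the build, re-encode with bucket offsets so every
        // value gets a globally unique ("absolute") id.
        Map<String, Long> absoluteDict = new LinkedHashMap<>();
        long offset = 0;
        for (Map<String, Long> bucket : relativeDicts) {
            for (Map.Entry<String, Long> e : bucket.entrySet()) {
                absoluteDict.put(e.getKey(), offset + e.getValue());
            }
            offset += bucket.size();
        }
        System.out.println(absoluteDict); // {b=0, d=1, a=2, c=3, e=4}
    }
}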


Cube build

  • Reduced build steps
    • From ten to twenty steps down to only two steps
  • Build Engine
    • Simple and clear architecture
    • Spark as the only build engine
    • All builds are done via spark
    • Adaptively adjust spark parameters
    • Dictionary of dimensions no longer needed
    • Supported measures
      • Sum
      • Count
      • Min
      • Max
      • TopN
      • Count Distinct (Bitmap, HyperLogLog)
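
The layer-by-layer building done by CubeBuildJob can be pictured with plain Spark aggregations. The table, column names, and measures below are illustrative only, not Kylin's generated code:

Code Block
languagejava
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Illustrative sketch of layer-by-layer cuboid building.
// The base cuboid aggregates the flat table over all dimensions.
Dataset<Row> base = flatTable.groupBy("id", "name", "age")
        .agg(count(lit(1)).as("CNT"), sum("price").as("SUM_PRICE"));
// A child cuboid drops one dimension and re-aggregates from its parent,
// reading far less data than recomputing from the flat table.
Dataset<Row> child = base.groupBy("id", "name")
        .agg(sum("CNT").as("CNT"), sum("SUM_PRICE").as("SUM_PRICE"));
child.write().parquet(childCuboidPath); // each cuboid lands in its own Parquet path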

...

Cuboid Storage

The following is the tree of the Parquet storage directory in the FS. As we can see, cuboids are saved under paths spliced together from the cube name, segment name, and cuboid ID; this is handled by PathManager.java.

...

Columns [id, name, age] correspond to dimensions [2, 1, 0]; measures [COUNT, SUM] correspond to [3, 4].
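
As a hypothetical illustration of that path splicing (the real logic lives in PathManager.java, and the exact layout may differ):

Code Block
languagejava
// Hypothetical sketch of splicing a cuboid path; PathManager.java holds the
// real implementation and the actual layout may differ.
static String cuboidPath(String workingDir, String cubeName, String segmentName, long cuboidId) {
    return workingDir + "/parquet/" + cubeName + "/" + segmentName + "/" + cuboidId;
}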

Part III. Reference