Apache Kylin : Analytical Data Warehouse for Big Data

...

Currently, the Global Dictionary has to be built in a single process/JVM, which can take a lot of time and memory for ultra-high-cardinality (UHC) columns. This feature uses MapReduce (MR) to build the Global Dictionary and Hive to store it.

This is the technical article for Hive Global Dictionary version 2.

Benefit

  1. The Global Dictionary is built in a distributed way, so the build job takes less time.
  2. The Job Server does less work and is therefore more stable.
  3. OneID: because the Hive Global Dictionary is readable outside of Kylin, anyone can reuse the dictionary (a Hive table) in other scenarios across the company.

...

Release Date | Release version | JIRA issue | Comment
2019-10 | v3.0.0 | KYLIN-3841 | Introduce Hive global dictionary (first version).
2020-06 | v3.1.0 | KYLIN-4342 | Use MapReduce rather than HQL in some steps to improve performance (version 2).

Configuration

Conf key | Explanation | Example
kylin.dictionary.mr-hive.database | Database in which the Hive Global Dictionary tables are stored. | default
kylin.dictionary.mr-hive.columns | List of all columns that need a Hive Global Dictionary, in the {TABLE_NAME}_{COLUMN_NAME} pattern. | KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID
kylin.dictionary.mr-hive.table.suffix | Suffix for the Segment Dictionary Table and the Global Dictionary Table. | _global_dict
kylin.dictionary.mr-hive.intermediate.table.suffix | Suffix for the Distinct Value Table. | _group_by
kylin.dictionary.mr-hive.columns.reduce.num | Key/value map in which each key is {TABLE_NAME}_{COLUMN_NAME} and each value is the expected number of reducers in Build Segment Level Dictionary (the MR job Parallel Part Build). | KYLIN_SALES_SALES_ID:3,KYLIN_SALES_BUYER_ID:2
kylin.source.hive.databasedir | Location of the Hive tables in HDFS. | /user/hive/warehouse/lacus.db
kylin.dictionary.mr-hive.ref.columns | To reuse existing global dictionaries, specify a list of columns here that refer to global dictionaries already built by another cube. | KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID
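
To make the value formats above concrete, here is a minimal Python sketch that parses the example values for kylin.dictionary.mr-hive.columns and kylin.dictionary.mr-hive.columns.reduce.num. The helper names parse_columns and parse_reduce_num are hypothetical and are not part of Kylin, and the fallback to one reducer for columns without an entry is an assumption of this sketch, not documented behaviour.

```python
from typing import Dict, List


def parse_columns(value: str) -> List[str]:
    """Split a comma-separated list such as kylin.dictionary.mr-hive.columns."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_reduce_num(value: str) -> Dict[str, int]:
    """Parse COLUMN:COUNT pairs such as kylin.dictionary.mr-hive.columns.reduce.num."""
    result: Dict[str, int] = {}
    for pair in parse_columns(value):
        # Split on the last colon so the column name is kept whole.
        column, _, count = pair.rpartition(":")
        result[column] = int(count)
    return result


columns = parse_columns("KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID")
reducers = parse_reduce_num("KYLIN_SALES_SALES_ID:3,KYLIN_SALES_BUYER_ID:2")

# Assumption of this sketch: a column without an explicit entry gets one reducer.
plan = {column: reducers.get(column, 1) for column in columns}
print(plan)
```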

...

Table | Name Pattern | Explanation
Distinct Value Table | ${FLAT_TABLE}_${kylin.dictionary.mr-hive.intermediate.table.suffix} | Temporary Hive table that stores the literal values extracted from the flat table. It has one normal column, dict_key, which holds every distinct literal value for each column in kylin.dictionary.mr-hive.columns (duplicate literal values are kept only once). It also has a partition column, dict_column, with one partition per column. Note that literal values which have already been encoded are removed.
Segment Dictionary Table | ${FLAT_TABLE}_${kylin.dictionary.mr-hive.table.suffix} | Temporary Hive table that stores the literal values extracted from the flat table together with their encoded integers. It has two normal columns: dict_key, which holds every distinct literal value for each column in kylin.dictionary.mr-hive.columns (duplicate literal values are kept only once), and dict_value, which holds the encoded integer for the corresponding literal value. It also has a partition column, dict_column, with one partition per column.
Global Dictionary Table | ${CUBE_NAME}_${kylin.dictionary.mr-hive.table.suffix} | This table is the Global Dictionary. It has the same schema as the Segment Dictionary Table.
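
The name patterns above can be illustrated with a short Python sketch. The function dictionary_table_names and the sample names flat_table_demo and cube_demo are hypothetical; the default suffixes are taken from the Example column of the configuration table.

```python
from typing import Dict


def dictionary_table_names(
    flat_table: str,
    cube_name: str,
    table_suffix: str = "_global_dict",       # kylin.dictionary.mr-hive.table.suffix
    intermediate_suffix: str = "_group_by",   # kylin.dictionary.mr-hive.intermediate.table.suffix
) -> Dict[str, str]:
    """Derive the three Hive table names from the documented patterns."""
    return {
        "distinct_value": flat_table + intermediate_suffix,
        "segment_dictionary": flat_table + table_suffix,
        "global_dictionary": cube_name + table_suffix,
    }


# Hypothetical flat table and cube names, used only for illustration.
names = dictionary_table_names("flat_table_demo", "cube_demo")
for role, table in names.items():
    print(role, "->", table)
```

Note that only the Global Dictionary Table is named after the cube; the two temporary tables follow the flat table name.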

Newly added steps

Compared to Hive Global Dictionary version 1

Serial No | Step Name | Input | Output
1 | Create hive dictionary table | N/A | Three Hive tables
2 | Extract distinct value into Distinct Value Table | Flat table | Distinct Value Table
3 | Build Segment Level Dictionary (Parallel Part Build) | Distinct Value Table (file path is determined by kylin.source.hive.databasedir) | Intermediate dict file (literal values are encoded at partition level, so each reducer encodes its literals starting from zero)
4 | Build Segment Level Dictionary (Parallel Total Build) | Intermediate dict file | Segment Level Dictionary
5 | Merge Segment Level Dictionary into Global Dictionary Table | Segment Level Dictionary and old Global Dictionary Table | New Global Dictionary Table
6 | Replace/encode Flat Table | Flat table | New flat table (literal values are replaced with encoded integers)
7 | Cleanup temp table & data | All temporary Hive tables | Nothing; they are removed.
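
Steps 2 to 6 can be pictured with a small in-memory Python sketch for a single column. It is not Kylin's implementation: the function names are hypothetical, crc32 stands in for the MR partitioner, and the way the total build shifts each reducer's zero-based codes by a running offset that starts after the largest existing code is an assumption made to show one way partition-level codes can become unique.

```python
import zlib
from typing import Dict, List


def extract_new_values(flat_values: List[str], global_dict: Dict[str, int]) -> List[str]:
    """Step 2: distinct literals from the flat table that are not encoded yet."""
    return sorted(set(flat_values) - set(global_dict))


def part_build(values: List[str], num_reducers: int) -> List[Dict[str, int]]:
    """Step 3: every reducer encodes its own partition, starting from zero."""
    partitions: List[List[str]] = [[] for _ in range(num_reducers)]
    for value in values:
        # crc32 stands in for the MR partitioner (assumption of this sketch).
        partitions[zlib.crc32(value.encode("utf-8")) % num_reducers].append(value)
    return [{value: i for i, value in enumerate(part)} for part in partitions]


def total_build(parts: List[Dict[str, int]], first_id: int) -> Dict[str, int]:
    """Step 4: shift local codes by a running offset so they are unique (assumed scheme)."""
    segment_dict: Dict[str, int] = {}
    offset = first_id
    for part in parts:
        for value, local_id in part.items():
            segment_dict[value] = offset + local_id
        offset += len(part)
    return segment_dict


def merge(global_dict: Dict[str, int], segment_dict: Dict[str, int]) -> Dict[str, int]:
    """Step 5: old Global Dictionary plus the Segment Level Dictionary."""
    merged = dict(global_dict)
    merged.update(segment_dict)
    return merged


def encode_flat(flat_values: List[str], global_dict: Dict[str, int]) -> List[int]:
    """Step 6: replace each literal in the flat table column with its code."""
    return [global_dict[value] for value in flat_values]


# Hypothetical data: an existing dictionary and one flat table column.
old_global = {"alice": 1, "bob": 2}
flat_column = ["bob", "carol", "dave", "carol", "erin"]

new_values = extract_new_values(flat_column, old_global)
parts = part_build(new_values, num_reducers=2)
segment_dict = total_build(parts, first_id=max(old_global.values(), default=0) + 1)
new_global = merge(old_global, segment_dict)
encoded_column = encode_flat(flat_column, new_global)
print(new_global)
print(encoded_column)
```

In this sketch, literals that already have a code keep it, which matches the note that already encoded literal values are removed from the Distinct Value Table before the segment build.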

...

Screenshots

MapReduce Job Diagram


HQL Analysis

...

Step 3. Build new segment. 


...

Part III  Performance

Hadoop Env

Hadoop : CDH 5.7

YARN memory : 102GB

YARN cores : 18

Comparison

...

TODO

Comparison

...

Comparison

Per-job columns (Job-1, Job-2, Job-3): Duration, Data size

Step Name | Duration EST | Data size
Create Intermediate Flat Hive Table | |
Build Hive Global Dict - extract distinct value | |
Redistribute Flat Hive Table | |
Build Hive Global Dict - parallel part build | |
Build Hive Global Dict - parallel total build | |
Build Hive Global Dict - merge to dict table | |
Build Hive Global Dict - replace intermediate table | |
Extract Fact Table Distinct Columns | |
Build Dimension Dictionary | |
Extract Dictionary from Global Dictionary (when shrunken dictionary is enabled) | |
Build Base Cuboid | |
- | |
Total | |
Comment | |

Part IV Reference 

https://issues.apache.org/jira/browse/KYLIN-4342

...