Apache Kylin : Analytical Data Warehouse for Big Data

Part I   What is Hive Global Dictionary

Background



Benefit

  1. The Global Dictionary is built in a distributed way.
  2. The Job Server does less work and is therefore more stable.
  3. One ID everywhere: the dictionary can be reused across the whole ETL pipeline of the company.

Part II  How to use

Configuration

Conf key | Explanation | Example
kylin.dictionary.mr-hive.database | Database in which the Hive Global Dictionary tables are stored | default
kylin.dictionary.mr-hive.columns | List of all columns that need a Hive Global Dictionary, each in {CUBE_NAME}_{COLUMN_NAME} format | KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID
kylin.dictionary.mr-hive.table.suffix | Suffix for the Segment Dictionary Table and the Global Dictionary Table | _dict_table
kylin.dictionary.mr-hive.intermediate.table.suffix | Suffix for the Distinct Value Table | _distinct_value
kylin.dictionary.mr-hive.columns.reduce.num | Key/value pairs in which the key is {CUBE_NAME}_{COLUMN_NAME} and the value is the expected number of reducers | KYLIN_SALES_SALES_ID:3,KYLIN_SALES_BUYER_ID:2
kylin.source.hive.databasedir | Where Kylin can find the files of Hive tables | /user/hive/warehouse/lacus.db
kylin.dictionary.mr-hive.ref.columns | To reuse one or more global dictionaries, specify here a list of existing global dictionaries built by another cube | KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID
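
The list and key/value formats above can be illustrated with a small sketch. The property values are copied from the Example column of the table; the parsing helpers (`parse_properties`, `parse_column_list`, `parse_reduce_num`) are hypothetical names introduced only for illustration and are not part of Kylin, which reads these settings through its own configuration code.

```python
# Sample settings; values are taken from the Example column of the table.
SAMPLE_PROPERTIES = """
kylin.dictionary.mr-hive.database=default
kylin.dictionary.mr-hive.columns=KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID
kylin.dictionary.mr-hive.table.suffix=_dict_table
kylin.dictionary.mr-hive.intermediate.table.suffix=_distinct_value
kylin.dictionary.mr-hive.columns.reduce.num=KYLIN_SALES_SALES_ID:3,KYLIN_SALES_BUYER_ID:2
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def parse_column_list(value):
    """Split a comma-separated list such as the mr-hive.columns value."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_reduce_num(value):
    """Turn 'COL_A:3,COL_B:2' into {'COL_A': 3, 'COL_B': 2}."""
    result = {}
    for entry in parse_column_list(value):
        column, _, count = entry.partition(":")
        result[column.strip()] = int(count)
    return result


props = parse_properties(SAMPLE_PROPERTIES)
columns = parse_column_list(props["kylin.dictionary.mr-hive.columns"])
reducers = parse_reduce_num(props["kylin.dictionary.mr-hive.columns.reduce.num"])
```

In this sketch every column named in the reducer map also appears in the column list; whether Kylin requires that is not stated on this page.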

Hive Table

Table | Name Pattern | Explanation
Distinct Value Table | ${FLAT_TABLE}_${kylin.dictionary.mr-hive.intermediate.table.suffix} |
Segment Dictionary Table | ${FLAT_TABLE}_${kylin.dictionary.mr-hive.table.suffix} |
Global Dictionary Table | ${CUBE_NAME}_${kylin.dictionary.mr-hive.table.suffix} |

Newly added steps

Serial No | Step Name | Explanation
1 | Create hive dictionary table |
2 | Extract distinct value into Distinct Value Table |
3 | Build Segment Level Dictionary (MR job-1) |
4 | Build Segment Level Dictionary (MR job-2) |
5 | Merge Segment Level Dictionary into Global Dictionary Table |
6 | Replace/encode Flat Table |
7 | Cleanup temp table & data |
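
The data flow of steps 2 to 6 can be sketched as a toy, in-memory example. This is only an illustration under stated assumptions: in Kylin these steps run as Hive and MapReduce jobs against the tables listed above, and the ID assignment used here (new values receive consecutive IDs after the current maximum) is an assumption, not something this page specifies. All function names and the sample data are hypothetical.

```python
def extract_distinct_values(flat_rows, column):
    """Step 2: collect the distinct values of one column in the segment."""
    return {row[column] for row in flat_rows}


def build_segment_dictionary(distinct_values, global_dict):
    """Steps 3-4: assign IDs to values not yet in the global dictionary.

    Assumption: new IDs continue after the largest existing ID.
    """
    new_values = sorted(v for v in distinct_values if v not in global_dict)
    start = max(global_dict.values(), default=0)
    return {value: start + offset
            for offset, value in enumerate(new_values, start=1)}


def merge_into_global(segment_dict, global_dict):
    """Step 5: merge the segment-level entries into the global dictionary."""
    merged = dict(global_dict)
    merged.update(segment_dict)
    return merged


def encode_flat_table(flat_rows, column, global_dict):
    """Step 6: replace the raw column values with their dictionary IDs."""
    return [{**row, column: global_dict[row[column]]} for row in flat_rows]


# Hypothetical existing dictionary and one new segment of flat-table rows.
global_dict = {"alice": 1, "bob": 2}
segment_rows = [
    {"BUYER_ID": "bob", "PRICE": 10},
    {"BUYER_ID": "carol", "PRICE": 7},
    {"BUYER_ID": "dave", "PRICE": 3},
]

distinct = extract_distinct_values(segment_rows, "BUYER_ID")
segment_dict = build_segment_dictionary(distinct, global_dict)
global_dict = merge_into_global(segment_dict, global_dict)
encoded = encode_flat_table(segment_rows, "BUYER_ID", global_dict)
```

Existing values keep their IDs across segments, which is what allows the same dictionary to be reused by later builds and by other consumers.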

Part III  Performance Comparison






