
Apache Kylin : Analytical Data Warehouse for Big Data




Part I   What is Hive Global Dictionary

Background

Count distinct (bitmap) is an important measure in many scenarios, such as page view statistics, and Kylin has supported it since version 1.5.3.
Apache Kylin implements the precise count distinct measure with bitmaps and uses a Global Dictionary to encode string values into integers.
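
To illustrate why the dictionary has to be shared by all segments, here is a minimal Python sketch. It is not Kylin's implementation: a plain dict stands in for the Global Dictionary, a Python integer stands in for a bitmap, and the names encode, to_bitmap, count_distinct and the sample user names are hypothetical.

```python
# Illustrative only: a dict stands in for the Global Dictionary and a
# Python int stands in for a bitmap (bit i set <=> integer ID i was seen).

global_dict = {}  # literal string -> integer ID, shared by all segments


def encode(value):
    """Return the ID of a literal, assigning the next free ID if it is new."""
    if value not in global_dict:
        global_dict[value] = len(global_dict)
    return global_dict[value]


def to_bitmap(values):
    """Build a bitmap with one bit set per distinct encoded value."""
    bitmap = 0
    for value in values:
        bitmap |= 1 << encode(value)
    return bitmap


def count_distinct(bitmap):
    """The exact distinct count is the number of set bits."""
    return bin(bitmap).count("1")


# Two segments (for example two days of page views) with overlapping users.
day1 = to_bitmap(["alice", "bob", "alice"])
day2 = to_bitmap(["bob", "carol"])

print(count_distinct(day1))         # 2 distinct users on day 1
print(count_distinct(day1 | day2))  # 3 distinct users across both days
```

Because "bob" maps to the same integer in both segments, OR-ing the two bitmaps counts him once. If each segment used its own dictionary, the merged count could be wrong.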

Currently the Global Dictionary has to be built in a single process/JVM, which may take a lot of time and memory for ultra-high-cardinality (UHC) columns. With this feature, Kylin uses MapReduce (MR) to build the Global Dictionary and Hive to store it.

Benefit

  1. The Global Dictionary is built in a distributed way, so the build job takes less time.
  2. The Job Server does less work and is therefore more stable.
  3. OneID: the dictionary is stored in Hive, so other teams across the company can reuse it in other scenarios.

Part II  How to use

Configuration

Conf key | Explanation | Example
kylin.dictionary.mr-hive.database | The Hive database that holds the Hive Global Dictionary tables. | default
kylin.dictionary.mr-hive.columns | A list of all columns that need a Hive Global Dictionary, each in {CUBE_NAME}_{COLUMN_NAME} format. | KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID
kylin.dictionary.mr-hive.table.suffix | Suffix for the Segment Dictionary Table and the Global Dictionary Table. | _dict_table
kylin.dictionary.mr-hive.intermediate.table.suffix | Suffix for the Distinct Value Table. | _distinct_value
kylin.dictionary.mr-hive.columns.reduce.num | Key/value pairs in which the key is {CUBE_NAME}_{COLUMN_NAME} and the value is the expected number of reducers in Build Segment Level Dictionary (MR job-1). | KYLIN_SALES_SALES_ID:3,KYLIN_SALES_BUYER_ID:2
kylin.source.hive.databasedir | The location of the Hive table in HDFS. | /user/hive/warehouse/lacus.db
kylin.dictionary.mr-hive.ref.columns | To reuse existing global dictionaries, list them here; each entry refers to a global dictionary already built by another cube. | KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID
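
The two list-valued settings use different formats, which is easy to get wrong. The Python sketch below only illustrates those formats with the example values from the table above. It is not how Kylin reads its configuration, and the helpers parse_list and parse_reduce_num are hypothetical.

```python
# Example values copied from the configuration table above.
settings = {
    "kylin.dictionary.mr-hive.columns":
        "KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID",
    "kylin.dictionary.mr-hive.columns.reduce.num":
        "KYLIN_SALES_SALES_ID:3,KYLIN_SALES_BUYER_ID:2",
}


def parse_list(value):
    """Split a comma-separated value into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_reduce_num(value):
    """Parse 'NAME:N,NAME:N' into a {name: reducer_count} dict."""
    result = {}
    for item in parse_list(value):
        name, _, count = item.rpartition(":")
        result[name] = int(count)
    return result


columns = parse_list(settings["kylin.dictionary.mr-hive.columns"])
reducers = parse_reduce_num(
    settings["kylin.dictionary.mr-hive.columns.reduce.num"])

# A reducer count for a column that is not in the columns list is
# probably a typo, so report such entries.
unknown = sorted(set(reducers) - set(columns))

print(columns)
print(reducers)
print(unknown)
```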

Hive Table

Table | Name Pattern | Explanation
Distinct Value Table | ${FLAT_TABLE}_${kylin.dictionary.mr-hive.intermediate.table.suffix} | A temporary Hive table that stores the literal values to be encoded. It has one regular column, dict_key, which holds the distinct literal values of each column listed in kylin.dictionary.mr-hive.columns (duplicate values are kept only once). It also has a partition column, dict_column, with one partition per column.
Segment Dictionary Table | ${FLAT_TABLE}_${kylin.dictionary.mr-hive.table.suffix} |
Global Dictionary Table | ${CUBE_NAME}_${kylin.dictionary.mr-hive.table.suffix} |
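
The layout of the Distinct Value Table can be pictured with a small in-memory Python sketch. This is only an illustration under stated assumptions: the real table is a partitioned Hive table, the sample rows are made up, and the function distinct_value_table is hypothetical. Values are sorted here only to make the output deterministic.

```python
# Made-up flat table rows; the keys follow the column names used in
# kylin.dictionary.mr-hive.columns.
flat_table_rows = [
    {"KYLIN_SALES_SALES_ID": "s1", "KYLIN_SALES_BUYER_ID": "b1"},
    {"KYLIN_SALES_SALES_ID": "s2", "KYLIN_SALES_BUYER_ID": "b1"},
    {"KYLIN_SALES_SALES_ID": "s2", "KYLIN_SALES_BUYER_ID": "b2"},
]

dict_columns = ["KYLIN_SALES_SALES_ID", "KYLIN_SALES_BUYER_ID"]


def distinct_value_table(rows, columns):
    """Return {dict_column: [dict_key, ...]}: one partition per column,
    each holding that column's literal values with duplicates removed."""
    partitions = {}
    for column in columns:
        partitions[column] = sorted({row[column] for row in rows})
    return partitions


table = distinct_value_table(flat_table_rows, dict_columns)
for dict_column, dict_keys in table.items():
    print(dict_column, dict_keys)
```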

Newly added steps

Serial No | Step Name | Input | Output
1 | Create Hive dictionary tables | N/A | Three Hive tables are created
2 | Extract distinct values into Distinct Value Table | Flat table | Distinct Value Table
3 | Build Segment Level Dictionary (Parallel Part) | Distinct Value Table | Intermediate dict file (literal values are encoded at partition level, so each reducer encodes its literals starting from zero)
4 | Build Segment Level Dictionary (Parallel Total) | Intermediate dict file | Segment Level Dictionary
5 | Merge Segment Level Dictionary into Global Dictionary Table | Segment Level Dictionary and old Global Dictionary Table | New Global Dictionary Table
6 | Replace/encode Flat Table | Flat table | New flat table
7 | Clean up temporary tables & data | All temporary Hive tables | Nothing; they are removed
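
Steps 3 to 6 can be sketched in memory with Python. This is a simplified illustration, not the actual MR jobs. It assumes that step 4 turns partition-level IDs into segment-level IDs by adding a per-partition offset, and that step 5 keeps existing global IDs and appends unseen values after the current maximum. All function names and sample values are hypothetical.

```python
def encode_partitions(distinct_values, num_reducers):
    """Step 3: split values among reducers; each reducer numbers from zero."""
    partitions = [distinct_values[i::num_reducers]
                  for i in range(num_reducers)]
    return [{value: local_id for local_id, value in enumerate(part)}
            for part in partitions]


def build_segment_dictionary(partition_dicts):
    """Step 4 (assumed): shift each partition by the total size of the
    partitions before it, so IDs are unique within the segment."""
    segment_dict = {}
    offset = 0
    for part in partition_dicts:
        for value, local_id in part.items():
            segment_dict[value] = offset + local_id
        offset += len(part)
    return segment_dict


def merge_into_global(segment_dict, old_global):
    """Step 5 (assumed): keep existing IDs and append unseen values
    after the current maximum ID."""
    new_global = dict(old_global)
    next_id = max(old_global.values(), default=-1) + 1
    for value in sorted(segment_dict, key=segment_dict.get):
        if value not in new_global:
            new_global[value] = next_id
            next_id += 1
    return new_global


def encode_flat_table(rows, column, global_dict):
    """Step 6: replace each literal in `column` with its global ID."""
    return [{**row, column: global_dict[row[column]]} for row in rows]


old_global = {"b-1": 0, "b-2": 1}
distinct = ["b-2", "b-3", "b-4", "b-5"]

parts = encode_partitions(distinct, num_reducers=2)
segment = build_segment_dictionary(parts)
new_global = merge_into_global(segment, old_global)
encoded = encode_flat_table(
    [{"BUYER_ID": "b-3", "PRICE": 10}, {"BUYER_ID": "b-1", "PRICE": 5}],
    "BUYER_ID", new_global)

print(parts)
print(segment)
print(new_global)
print(encoded)
```

In this sketch, IDs that are already in the old Global Dictionary never change, so flat tables encoded for earlier segments stay consistent with later ones.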

Part III  Performance Comparison


Step Name | Duration | Data Size




Part IV Reference 

https://issues.apache.org/jira/browse/KYLIN-4342




