Apache Kylin : Analytical Data Warehouse for Big Data
Part I What is Hive Global Dictionary
Background
The count distinct (bitmap) measure is important in many scenarios, such as page view statistics, and Kylin has supported count distinct since version 1.5.3.
Apache Kylin implements precise count distinct based on bitmaps and uses a Global Dictionary to encode string values into integers.
Currently the Global Dictionary has to be built in a single process/JVM, which may take a lot of time and memory for ultra-high-cardinality (UHC) columns. With this feature, Kylin uses MapReduce (MR) to build the Global Dictionary and Hive to store it.
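To illustrate why string values must be encoded before they can go into a bitmap, here is a minimal sketch. It is not Kylin's implementation: a plain Python dict stands in for the Global Dictionary, a Python int stands in for the bitmap, and the names `dictionary`, `encode` and `count_distinct` are hypothetical.

```python
# Illustrative only: a dict stands in for the Global Dictionary and a
# Python int stands in for the bitmap. Neither is Kylin's real structure.
dictionary = {}  # literal value -> integer id


def encode(value):
    """Return the id for value, assigning the next free id on first sight."""
    if value not in dictionary:
        dictionary[value] = len(dictionary) + 1
    return dictionary[value]


def count_distinct(values):
    """Set one bit per encoded id, then count the set bits."""
    bitmap = 0
    for v in values:
        bitmap |= 1 << encode(v)
    return bin(bitmap).count("1")


page_views = ["user_a", "user_b", "user_a", "user_c", "user_b"]
distinct_users = count_distinct(page_views)
```

Because every occurrence of the same string maps to the same integer, repeated values set the same bit and are counted once. The dictionary must be global (shared by all segments) for bitmaps from different segments to be combined safely.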
Benefit
- The Global Dictionary is built in a distributed way, so the build job takes less time.
- The Job Server does less work, so it is more stable.
- OneID: the dictionary can be reused in other scenarios across the company.
Configuration
Conf key | Explanation | Example |
---|---|---|
kylin.dictionary.mr-hive.database | The Hive database in which the Hive Global Dictionary tables are stored. | default |
kylin.dictionary.mr-hive.columns | A comma-separated list of all columns that need a Hive Global Dictionary, each in {CUBE_NAME}_{COLUMN_NAME} format. | KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID |
kylin.dictionary.mr-hive.table.suffix | Suffix for the Segment Dictionary Table and the Global Dictionary Table. | _dict_table |
kylin.dictionary.mr-hive.intermediate.table.suffix | Suffix for the Distinct Value Table. | _distinct_value |
kylin.dictionary.mr-hive.columns.reduce.num | A list of key:value pairs, where the key is {CUBE_NAME}_{COLUMN_NAME} and the value is the expected number of reducers in Build Segment Level Dictionary (MR job-1). | KYLIN_SALES_SALES_ID:3,KYLIN_SALES_BUYER_ID:2 |
kylin.source.hive.databasedir | The location of the Hive tables in HDFS. | /user/hive/warehouse/lacus.db |
kylin.dictionary.mr-hive.ref.columns | To reuse existing global dictionaries, specify a list here that refers to global dictionaries already built by another cube. | KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID |
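The list and key/value formats in the table can be made concrete with a small sketch. The values below are copied from the Example column; the helper names `parse_columns` and `parse_reduce_num` are hypothetical and only illustrate how such values could be read, not how Kylin parses them.

```python
def parse_columns(value):
    """Split a comma-separated list such as kylin.dictionary.mr-hive.columns."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_reduce_num(value):
    """Parse 'KEY:N,KEY:N' pairs into a dict of reducer counts."""
    result = {}
    for pair in parse_columns(value):
        key, _, num = pair.rpartition(":")
        result[key] = int(num)
    return result


# Example values copied from the configuration table.
overrides = {
    "kylin.dictionary.mr-hive.database": "default",
    "kylin.dictionary.mr-hive.columns": "KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID",
    "kylin.dictionary.mr-hive.table.suffix": "_dict_table",
    "kylin.dictionary.mr-hive.intermediate.table.suffix": "_distinct_value",
    "kylin.dictionary.mr-hive.columns.reduce.num": "KYLIN_SALES_SALES_ID:3,KYLIN_SALES_BUYER_ID:2",
}

columns = parse_columns(overrides["kylin.dictionary.mr-hive.columns"])
reducers = parse_reduce_num(overrides["kylin.dictionary.mr-hive.columns.reduce.num"])

# Sanity check for this sketch only: columns that have no reducer count.
missing = [c for c in columns if c not in reducers]
```

Whether Kylin requires a reducer count for every listed column is not stated here; the `missing` check is just a convenient way to spot typos in the two lists.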
Hive Table created for
Table | Name Pattern | Explanation |
---|---|---|
Distinct Value Table | ${FLAT_TABLE}_${kylin.dictionary.mr-hive.intermediate.table.suffix} | A temporary Hive table that stores the literal values to be encoded. It contains one normal column, dict_key, which holds all distinct literal values for each column in kylin.dictionary.mr-hive.columns (duplicated literal values are kept only once). It also contains a partition column, dict_column, with one partition per column. |
Segment Dictionary Table | ${FLAT_TABLE}_${kylin.dictionary.mr-hive.table.suffix} | |
Global Dictionary Table | ${CUBE_NAME}_${kylin.dictionary.mr-hive.table.suffix} | |
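As a rough illustration of the name patterns above, the following sketch derives the three table names. It assumes the suffix is appended directly to the flat table or cube name (the example suffixes already start with an underscore); the function name and the flat table and cube names are hypothetical.

```python
def hive_dict_table_names(flat_table, cube_name,
                          table_suffix="_dict_table",
                          intermediate_suffix="_distinct_value"):
    """Build the three Hive table names from the configured suffixes.

    Assumption: names are formed by plain concatenation, with the
    underscore coming from the suffix itself.
    """
    return {
        "distinct_value": flat_table + intermediate_suffix,
        "segment_dictionary": flat_table + table_suffix,
        "global_dictionary": cube_name + table_suffix,
    }


# Hypothetical flat table and cube names, for illustration only.
names = hive_dict_table_names("demo_flat_table", "demo_cube")
```

Note that only the Global Dictionary Table is keyed by cube name, which fits its role as the long-lived table; the other two follow the flat table of the segment being built.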
Newly added steps
Serial No | Step Name | Input | Output |
---|---|---|---|
1 | Create Hive dictionary tables | N/A | Three Hive tables are created |
2 | Extract distinct values into Distinct Value Table | Flat table | Distinct Value Table |
3 | Build Segment Level Dictionary (Parallel Part) | Distinct Value Table | Intermediate dict file (literal values are encoded at partition level, so each reducer encodes its literals starting from zero). |
4 | Build Segment Level Dictionary (Parallel Total) | Intermediate dict file | Segment Level Dictionary |
5 | Merge Segment Level Dictionary into Global Dictionary Table | Segment Level Dictionary and old Global Dictionary Table | New Global Dictionary Table |
6 | Replace/encode Flat Table | Flat table | New flat table (with literal values replaced by encoded integers) |
7 | Clean up temporary tables and data | All temporary Hive tables | Nothing; the temporary tables are removed. |
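Steps 3 to 6 can be sketched in memory as follows. This is a simplified model under stated assumptions, not the MR jobs themselves: partitions are plain Python sets, offsets are computed from partition sizes, and new values receive ids after the largest id in the old Global Dictionary. All function names are hypothetical, and the real id assignment may differ.

```python
def encode_partition(values):
    """Step 3: one reducer encodes its own values starting from zero."""
    return {v: i for i, v in enumerate(sorted(values))}


def build_segment_dictionary(partitions):
    """Step 4: shift each partition by the sizes of the partitions before it."""
    segment_dict, offset = {}, 0
    for part in partitions:
        for value, local_id in part.items():
            segment_dict[value] = local_id + offset
        offset += len(part)
    return segment_dict


def merge_into_global(segment_dict, global_dict):
    """Step 5: keep old ids, append ids for values not seen before."""
    merged = dict(global_dict)
    next_id = max(merged.values(), default=0) + 1
    for value in sorted(segment_dict, key=segment_dict.get):
        if value not in merged:
            merged[value] = next_id
            next_id += 1
    return merged


def encode_flat_column(rows, global_dict):
    """Step 6: replace each literal value with its encoded integer."""
    return [global_dict[v] for v in rows]


# Two reducers, each holding a disjoint share of the distinct values.
partitions = [encode_partition({"u3", "u1"}), encode_partition({"u2", "u4"})]
segment_dict = build_segment_dictionary(partitions)
old_global = {"u1": 1, "u9": 2}
new_global = merge_into_global(segment_dict, old_global)
encoded = encode_flat_column(["u3", "u1", "u3", "u4"], new_global)
```

The point of splitting the work this way is that step 3 needs no coordination between reducers; only the cheap offset pass in step 4 and the merge in step 5 need a global view.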
Part II How to use
Step 1. Create a cube that contains a COUNT_DISTINCT (bitmap) measure.
Step 2. Add the properties in the configuration overwrite step.
Step 3. Build a new segment.
Part III Performance Comparison
Hadoop Env
Hadoop : CDH 5.7
Yarn memory : 102GB
Yarn cores : 18
Step Cost
Data Size
Step Name | Duration | Data Size
Part IV Reference
https://issues.apache.org/jira/browse/KYLIN-4342