Apache Kylin : Analytical Data Warehouse for Big Data
...
The count distinct (bitmap) measure is important in many scenarios, such as PageView statistics, and Kylin has supported it since v1.5.3.
Apache Kylin implements the precise count distinct measure based on bitmaps and uses a Global Dictionary to encode string values into integers.
Currently the Global Dictionary has to be built in a single process/JVM, which may take a lot of time and memory for ultra-high-cardinality (UHC) columns. With this feature, Kylin uses MapReduce (MR) to build the Global Dictionary and Hive to store it.
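To make the role of the Global Dictionary concrete, here is a minimal sketch of dictionary encoding combined with a bitmap. It is a toy model, not Kylin's implementation: the `GlobalDictionary` class and the helper functions are hypothetical, and a plain Python integer stands in for a compressed bitmap.

```python
# Illustrative only: a toy global dictionary and bitmap, not Kylin's code.

class GlobalDictionary:
    """Maps each distinct string to a stable, dense integer id."""

    def __init__(self):
        self._ids = {}

    def encode(self, value):
        # Assign the next free id the first time a value is seen.
        if value not in self._ids:
            self._ids[value] = len(self._ids) + 1
        return self._ids[value]


def to_bitmap(ids):
    """Set bit i for every id i; a Python int acts as an uncompressed bitmap."""
    bitmap = 0
    for i in ids:
        bitmap |= 1 << i
    return bitmap


def cardinality(bitmap):
    """Count the set bits, i.e. the number of distinct ids."""
    return bin(bitmap).count("1")


global_dict = GlobalDictionary()
segment_1 = ["alice", "bob", "alice"]
segment_2 = ["bob", "carol"]

bitmap_1 = to_bitmap(global_dict.encode(v) for v in segment_1)
bitmap_2 = to_bitmap(global_dict.encode(v) for v in segment_2)

# Both segments share one dictionary, so OR-ing the bitmaps
# deduplicates values that appear in more than one segment.
distinct_users = cardinality(bitmap_1 | bitmap_2)
print(distinct_users)
```

The point of the sketch is that ids must be consistent across segments: if each segment encoded "bob" differently, merging the bitmaps would overcount.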
...
- The Global Dictionary is built in a distributed way, so the build job takes less time.
- The Job Server does less work and is therefore more stable.
- OneID: because the Hive Global Dictionary is human-readable outside of Kylin, anyone can reuse the dictionary (a Hive table) in other scenarios across the company.
...
Release Date | Release version | JIRA issue | Comment |
---|---|---|---|
2019-10 | v3.0.0 | | Introduce Hive global dictionary (first version). |
2020-06 | v3.1.0 | | Use MapReduce rather than HQL in some steps to improve performance (second version). |
Configuration
Conf key | Explanation | Example |
---|---|---|
kylin.dictionary.mr-hive.database | The Hive database that holds the Hive Global Dictionary. | default |
kylin.dictionary.mr-hive.columns | A comma-separated list of all columns that need a Hive Global Dictionary, in the {TABLE_NAME}_{COLUMN_NAME} pattern. | KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID |
kylin.dictionary.mr-hive.table.suffix | Suffix for the Segment Dictionary Table and the Global Dictionary Table. | _global_dict |
kylin.dictionary.mr-hive.intermediate.table.suffix | Suffix for the Distinct Value Table. | _group_by |
kylin.dictionary.mr-hive.columns.reduce.num | A key/value map in which each key is {TABLE_NAME}_{COLUMN_NAME} and each value is the expected number of reducers for that column in the Build Segment Level Dictionary step (MR job Parallel Part Build). | KYLIN_SALES_SALES_ID:3,KYLIN_SALES_BUYER_ID:2 |
kylin.source.hive.databasedir | The location of the Hive tables in HDFS. | /user/hive/warehouse/lacus.db |
kylin.dictionary.mr-hive.ref.columns | To reuse existing global dictionaries, specify a list of columns here that refer to global dictionaries already built by another cube. | KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID |
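The list and map formats in the table can be checked before a build with a few lines of Python. This is only a sketch: the helper names are hypothetical, the values are the examples from the table, and the rule that every reducer entry must name a configured column is an assumption rather than documented Kylin behaviour.

```python
# Illustrative parsing of the example values above; not Kylin's own parser.

def parse_column_list(value):
    """Split 'A,B' into ['A', 'B'], ignoring blanks and stray spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_reduce_num(value):
    """Split 'A:3,B:2' into {'A': 3, 'B': 2}."""
    result = {}
    for entry in parse_column_list(value):
        # rpartition splits on the last ':' so the column name is kept whole.
        column, _, count = entry.rpartition(":")
        result[column] = int(count)
    return result


properties = {
    "kylin.dictionary.mr-hive.database": "default",
    "kylin.dictionary.mr-hive.columns": "KYLIN_SALES_SALES_ID,KYLIN_SALES_BUYER_ID",
    "kylin.dictionary.mr-hive.columns.reduce.num": "KYLIN_SALES_SALES_ID:3,KYLIN_SALES_BUYER_ID:2",
}

columns = parse_column_list(properties["kylin.dictionary.mr-hive.columns"])
reducers = parse_reduce_num(properties["kylin.dictionary.mr-hive.columns.reduce.num"])

# Assumed sanity check: a reducer count for an unknown column is likely a typo.
unknown = sorted(set(reducers) - set(columns))
if unknown:
    raise ValueError("reduce.num refers to columns without a dictionary: %s" % unknown)

print(columns)
print(reducers)
```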
...
Step Name | Duration EST | Data size |
---|---|---|
Create Intermediate Flat Hive Table | ||
Build Hive Global Dict - extract distinct value | ||
Redistribute Flat Hive Table | ||
Build Hive Global Dict - parallel part build | ||
Build Hive Global Dict - parallel total build | ||
Build Hive Global Dict - merge to dict table | ||
Build Hive Global Dict - replace intermediate table | ||
Extract Fact Table Distinct Columns | ||
Build Dimension Dictionary | ||
Extract Dictionary from Global Dictionary (when shrunken dictionary is enabled) | ||
Build Base Cuboid | ||
Total | ||
Comment |
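The four "Build Hive Global Dict" steps suggest a two-phase id assignment. The sketch below is one plausible reading of the step names, not code taken from Kylin: all function names are hypothetical, the CRC32 partitioning is an assumption, and the real jobs run as MR tasks over Hive tables rather than in-memory lists.

```python
import zlib

# Illustrative two-phase id assignment; the real MR jobs may work differently.

def extract_distinct(values):
    """'extract distinct value': deduplicate the column values of a segment."""
    return sorted(set(values))


def part_build(distinct_values, existing_dict, num_reducers):
    """'parallel part build': each reducer numbers its new values from 1."""
    partitions = [[] for _ in range(num_reducers)]
    for value in distinct_values:
        if value in existing_dict:
            continue  # already encoded by an earlier build
        reducer = zlib.crc32(value.encode("utf-8")) % num_reducers
        partitions[reducer].append(value)
    return [{v: i for i, v in enumerate(sorted(p), start=1)} for p in partitions]


def total_build(parts, max_existing_id):
    """'parallel total build': shift each reducer's local ids by a running offset."""
    assigned = {}
    offset = max_existing_id
    for part in parts:
        for value, local_id in part.items():
            assigned[value] = offset + local_id
        offset += len(part)
    return assigned


def merge(existing_dict, new_entries):
    """'merge to dict table': old ids stay unchanged, new ones are appended."""
    merged = dict(existing_dict)
    merged.update(new_entries)
    return merged


existing = {"alice": 1, "bob": 2}
segment_values = ["bob", "carol", "dave", "alice", "erin", "carol"]

distinct = extract_distinct(segment_values)
parts = part_build(distinct, existing, num_reducers=2)
new_entries = total_build(parts, max_existing_id=max(existing.values(), default=0))
merged = merge(existing, new_entries)
print(sorted(merged.values()))
```

Under these assumptions, the number of reducers per column only changes how the new values are split up; the resulting ids remain unique and contiguous after the offsets are applied.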
Part IV Reference
...