Apache Kylin : Analytical Data Warehouse for Big Data
Page History
...
Table | Name Pattern | Explanation |
---|---|---|
Distinct Value Table | ${FLAT_TABLE}_${kylin.dictionary.mr-hive.intermediate.table.suffix} | This table is a temporary hive table for storing literal value which be extracted from flat table. It contain one normal column, dict_key, that is all distinct literal value for each kylin.dictionary.mr-hive.columns(duplicated literal value are only remain once). This table also contain a partition column, its name is dict_column, means one partition for one column. Please note, literal value which has been encoded previously will be removed. |
Segment Dictionary Table | ${FLAT_TABLE}_${kylin.dictionary.mr-hive.table.suffix} | This table is a temporary hive table for storing literal value and its encoded integer which be extracted from flat table. It contain two normal column: dict_key, that is all distinct literal value for each kylin.dictionary.mr-hive.columns(duplicated literal value are only remain once); the second column, dict_value, contains the encoded integer for corresponding literal value. This table also contain a partition column, its name is dict_column, means one partition for one column. |
Global Dictionary Table | ${CUBE_NAME}_${kylin.dictionary.mr-hive.table.suffix} | This table is the Global Dictionary. It has the same schema as Segment Dictionary Table . |
New added steps
Compared to hive original global dictionary version1by Meituan.
Serial No | Step Name | Input | Output |
---|---|---|---|
1 | Create hive dictionary table | N/A | Three hive table |
2 | Extract distinct value into Distinct Value Table | Flat table | Distinct Value Table |
3 | Build Segment Level Dictionary (Parallel Part Build) | Distinct Value Table(File path is determined by kylin.source.hive.databasedir) | Intermediate dict file(Literal value encoded in partition-level, so each reducer will encode literal from zero). |
4 | Build Segment Level Dictionary (Parallel Total Build) | Intermediate dict file | Segment Level Dictionary |
5 | Merge Segment Level Dictionary into Global Dictionary Table | Segment Level Dictionary and old Global Dictionary Table | New Global Dictionary Table |
6 | Replace/encode Flat Table | Flat table | New flat table (but literal value will be replaced with encoded integer) |
7 | Cleanup temp table & data | All temporary hive tables | Nothing, they will be removed. |
...
Code Block | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
set mapreduce.job.name=Build Hive Global Dict - extract distinct value; USE NightlyBuild; set hive.exec.compress.output=false; set hive.mapred.mode=unstrict; -- create hive global dictionary table CREATE TABLE IF NOT EXISTS NightlyBuild.ValidationCube_global_dict ( dict_key STRING COMMENT '', dict_val INT COMMENT '' ) COMMENT 'Hive Global Dictionary' PARTITIONED BY (dict_column string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE; -- create distinct value table DROP TABLE IF EXISTS kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f__group_by; CREATE TABLE IF NOT EXISTS kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f__group_by ( dict_key STRING COMMENT '' ) COMMENT '' PARTITIONED BY (dict_column string) STORED AS TEXTFILE ; -- create segment level dictionary table DROP TABLE IF EXISTS kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f_global_dict; CREATE TABLE IF NOT EXISTS kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f_global_dict ( dict_key STRING COMMENT '' , dict_val STRING COMMENT '' ) COMMENT '' PARTITIONED BY (dict_column string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE ; -- insert data into distinct value table INSERT OVERWRITE TABLE kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f__group_by PARTITION (dict_column = 'USERACTIONLOGSAMPLE_PV_ID') SELECT a.DICT_KEY FROM ( SELECT USERACTIONLOGSAMPLE_PV_ID as DICT_KEY FROM kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f GROUP BY USERACTIONLOGSAMPLE_PV_ID) a LEFT JOIN (SELECT DICT_KEY FROM NightlyBuild.ValidationCube_global_dict WHERE DICT_COLUMN = 'USERACTIONLOGSAMPLE_PV_ID' ) b ON a.DICT_KEY = b.DICT_KEY WHERE b.DICT_KEY IS NULL ; INSERT OVERWRITE TABLE kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f__group_by PARTITION (dict_column = 'USERACTIONLOGSAMPLE_PLAY_ID') SELECT a.DICT_KEY FROM ( SELECT USERACTIONLOGSAMPLE_PLAY_ID as DICT_KEY FROM kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f GROUP BY USERACTIONLOGSAMPLE_PLAY_ID) a LEFT JOIN (SELECT DICT_KEY FROM NightlyBuild.ValidationCube_global_dict WHERE DICT_COLUMN = 'USERACTIONLOGSAMPLE_PLAY_ID' ) b ON a.DICT_KEY = b.DICT_KEY WHERE b.DICT_KEY IS NULL ; -- calculate max dict id INSERT OVERWRITE TABLE kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f__group_by PARTITION (DICT_COLUMN = 'KYLIN_MAX_DISTINCT_COUNT') SELECT CONCAT_WS(',', tc.dict_column, cast(tc.total_distinct_val AS String), if(tm.max_dict_val is null, '0', cast(max_dict_val as string))) FROM ( SELECT dict_column, count(1) total_distinct_val FROM NightlyBuild.kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f__group_by WHERE DICT_COLUMN != 'KYLIN_MAX_DISTINCT_COUNT' GROUP BY dict_column) tc LEFT JOIN ( SELECT dict_column, if(max(dict_val) is null, 0, max(dict_val)) as max_dict_val FROM NightlyBuild.ValidationCube_global_dict GROUP BY dict_column) tm ON tc.dict_column = tm.dict_column; |
...