THIS IS A TEST INSTANCE. ALL YOUR CHANGES WILL BE LOST!!!!

Apache Kylin : Analytical Data Warehouse for Big Data

Page tree

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

TableName PatternExplanation
Distinct Value Table${FLAT_TABLE}_${kylin.dictionary.mr-hive.intermediate.table.suffix}

This table is a temporary hive table for storing literal value which be extracted from flat table.

It contain one normal column, dict_key, that is all distinct literal value for each kylin.dictionary.mr-hive.columns(duplicated literal value are only remain once).

This table also contain a partition column, its name is dict_column, means one partition for one column.

Please note, literal value which has been encoded previously will be removed.

Segment Dictionary Table${FLAT_TABLE}_${kylin.dictionary.mr-hive.table.suffix}

This table is a temporary hive table for storing literal value and its encoded integer which be extracted from flat table.

It contain two normal column: dict_key, that is all distinct literal value for each kylin.dictionary.mr-hive.columns(duplicated literal value are only remain once); the second column, dict_value, contains the encoded integer for corresponding literal value.

This table also contain a partition column, its name is dict_column, means one partition for one column.

Global Dictionary Table${CUBE_NAME}_${kylin.dictionary.mr-hive.table.suffix}This table is the Global Dictionary. It has the same schema as Segment Dictionary Table .

New added steps 

Compared to hive original global dictionary version1by Meituan.

Serial NoStep NameInputOutput
1Create hive dictionary tableN/AThree hive table
2Extract distinct value into Distinct Value TableFlat tableDistinct Value Table
3Build Segment Level Dictionary (Parallel Part Build)Distinct Value Table(File path is determined by kylin.source.hive.databasedir)Intermediate dict file(Literal value encoded in partition-level, so each reducer will encode literal from zero).
4Build Segment Level Dictionary (Parallel Total Build)Intermediate dict fileSegment Level Dictionary
5Merge Segment Level Dictionary into Global Dictionary TableSegment Level Dictionary and old Global Dictionary Table  New Global Dictionary Table
6Replace/encode Flat TableFlat tableNew flat table (but literal value will be replaced with encoded integer)
7Cleanup temp table & dataAll temporary hive tables

Nothing, they will be removed.

...

Code Block
languagesql
themeMidnight
titleCreate hive dict table & Extract value into distinct value table
linenumberstrue
set mapreduce.job.name=Build Hive Global Dict - extract distinct value;
USE NightlyBuild;
set hive.exec.compress.output=false;
set hive.mapred.mode=unstrict;

-- create hive global dictionary table
CREATE TABLE IF NOT EXISTS NightlyBuild.ValidationCube_global_dict
( dict_key STRING COMMENT '', 
   dict_val INT COMMENT '' 
) 
COMMENT 'Hive Global Dictionary' 
PARTITIONED BY (dict_column string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
STORED AS TEXTFILE; 

-- create distinct value table
DROP TABLE IF EXISTS kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f__group_by; 
CREATE TABLE IF NOT EXISTS kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f__group_by 
( 
   dict_key STRING COMMENT '' 
) 
COMMENT '' 
PARTITIONED BY (dict_column string) 
STORED AS TEXTFILE 
;

-- create segment level dictionary table
DROP TABLE IF EXISTS kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f_global_dict; 
CREATE TABLE IF NOT EXISTS kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f_global_dict 
( 
   dict_key STRING COMMENT '' , 
   dict_val STRING COMMENT '' 
) 
COMMENT '' 
PARTITIONED BY (dict_column string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
STORED AS TEXTFILE 
;

-- insert data into distinct value table
INSERT OVERWRITE TABLE kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f__group_by 
PARTITION (dict_column = 'USERACTIONLOGSAMPLE_PV_ID') 
SELECT a.DICT_KEY FROM (
  SELECT 
   USERACTIONLOGSAMPLE_PV_ID as DICT_KEY 
  FROM kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f
  GROUP BY USERACTIONLOGSAMPLE_PV_ID) a 
    LEFT JOIN 
  (SELECT DICT_KEY FROM NightlyBuild.ValidationCube_global_dict    
   WHERE DICT_COLUMN = 'USERACTIONLOGSAMPLE_PV_ID' ) b 
ON a.DICT_KEY = b.DICT_KEY 
WHERE b.DICT_KEY IS NULL 
;
INSERT OVERWRITE TABLE kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f__group_by 
PARTITION (dict_column = 'USERACTIONLOGSAMPLE_PLAY_ID') 
SELECT a.DICT_KEY FROM (
  SELECT 
   USERACTIONLOGSAMPLE_PLAY_ID as DICT_KEY 
  FROM kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f
  GROUP BY USERACTIONLOGSAMPLE_PLAY_ID) a 
    LEFT JOIN 
  (SELECT DICT_KEY FROM NightlyBuild.ValidationCube_global_dict    
   WHERE DICT_COLUMN = 'USERACTIONLOGSAMPLE_PLAY_ID' ) b 
ON a.DICT_KEY = b.DICT_KEY 
WHERE b.DICT_KEY IS NULL 
;

-- calculate max dict id 
INSERT OVERWRITE TABLE  kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f__group_by PARTITION (DICT_COLUMN = 'KYLIN_MAX_DISTINCT_COUNT') 
SELECT CONCAT_WS(',', tc.dict_column, cast(tc.total_distinct_val AS String), if(tm.max_dict_val is null, '0', cast(max_dict_val as string))) 
FROM (
    SELECT dict_column, count(1) total_distinct_val
    FROM NightlyBuild.kylin_intermediate_validationcube_bb2091dc_82ba_f77f_86b9_6f83a1af639f__group_by
    WHERE DICT_COLUMN != 'KYLIN_MAX_DISTINCT_COUNT'
    GROUP BY dict_column) tc 
LEFT JOIN (
    SELECT dict_column, if(max(dict_val) is null, 0, max(dict_val)) as max_dict_val 
    FROM NightlyBuild.ValidationCube_global_dict
    GROUP BY dict_column) tm 
ON tc.dict_column = tm.dict_column;

...