Apache Kylin : Analytical Data Warehouse for Big Data
Five Hive tables record System Cube data: 3 tables record query-related metrics and 2 tables record job-related metrics. When Cube Planner is enabled, a project named KYLIN_SYSTEM appears in Kylin containing 5 Cubes, one per Hive table.

To understand the meaning of the 3 query-related Hive tables, it helps to know how a query is processed in Kylin:

1. A single SQL statement may hit multiple Cubes, for example when it contains nested subqueries.
2. A Cube hit by a query may contain multiple Segments, so data in multiple Segments must be scanned.
3. The data of a single Segment may be stored on multiple machines, so a query may need to scan data on multiple machines; each such machine is called an RPC target server. In the Kylin web UI, expand the Cube information and check Region Count under the Storage tab to see how many RPC target servers a Segment corresponds to.
Hive Table Descriptions

| Table Name | Type | Description | Notes |
|---|---|---|---|
| hive_metrics_query_qa | Query | query metrics at the highest level | One record is generated for each Cube hit by a query. |
| hive_metrics_query_cube_qa | Query | query metrics collected at the Cube level; the most important are the cuboid-related metrics, which serve Cube Planner | When a query hits a Cube, one record is generated for each Segment scanned. |
| hive_metrics_query_rpc_qa | Query | query metrics collected at the lowest level. For a query, the related aggregation and filtering can be pushed down to each RPC target server; the health of the RPC target servers is the foundation of good query performance | When a query scans a Segment, one record is generated for each RPC target server the Segment involves. For example, if a Segment spans 3 RPC target servers, scanning it generates 3 records. |
| hive_metrics_job_qa | Job | metrics of successful jobs | One record is generated for each successful job. |
| hive_metrics_job_exception_qa | Job | metrics of failed jobs | One record is generated for each failed job. |
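To make the row-generation rules concrete, here is a small Python sketch (the query shape and all counts are hypothetical) that computes how many rows a single query would add to each of the three query tables:

```python
# Hypothetical query: it hits 2 Cubes; for each Cube we list, per scanned
# Segment, the number of RPC target servers that Segment spans.
# Cube A scans 2 Segments (3 and 1 RPC target servers),
# Cube B scans 1 Segment (2 RPC target servers).
segments_per_cube = {
    "cube_a": [3, 1],  # RPC target servers per scanned Segment
    "cube_b": [2],
}

# hive_metrics_query_qa: one row per Cube hit by the query
rows_query = len(segments_per_cube)

# hive_metrics_query_cube_qa: one row per Segment scanned
rows_cube = sum(len(segs) for segs in segments_per_cube.values())

# hive_metrics_query_rpc_qa: one row per RPC target server per Segment
rows_rpc = sum(sum(segs) for segs in segments_per_cube.values())

print(rows_query, rows_cube, rows_rpc)  # 2 3 6
```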
Relationship Between Hive Table Columns and Kylin Cube Information
hive_metrics_query_qa
| Column | Type | Description | Sample (comma-separated examples) | D/M | Measure Function | Notes |
|---|---|---|---|---|---|---|
| query_hash_code | bigint | query unique id | 7708685990456150000 | M | COUNT_DISTINCT | Each SQL statement corresponds to one unique query_hash_code; running the same SQL again does not generate a new one |
| host | string | the host of the server for the query engine | cdh-client:10.1.3.91 | D | | |
| kuser | string | user name | ADMIN | D | | |
| project | string | project name | LEARN_KYLIN | D | | |
| realization | string | cube name | kylin_sales_cube_SIMPLE | D | | |
| realization_type | int | | 2 | D | | |
| query_type | string | CACHE, OLAP, LOOKUP_TABLE, HIVE (users can query different data sources) | OLAP, CACHE | D | | |
| exception | string | classifies the exception types that may occur during a query | NULL, java.lang.NumberFormatException | D | | Do all exceptions have a classification? |
| query_time_cost | bigint | the time cost of the whole query | 1392 | M | MIN/SUM/MAX/PERCENTILE_APPROX | |
| calcite_count_return | bigint | the row count of the result Calcite returns | 3 | M | SUM/MAX | The row count Calcite returns to Kylin, denoted n1 below |
| storage_count_return | bigint | the row count of the input to Calcite | 3 | M | SUM/MAX | The row count the storage layer returns to Calcite, denoted n2 below |
| calcite_count_aggregate_filter | bigint | the row count Calcite aggregates and filters | 0 | M | SUM/MAX | The rows filtered out or aggregated away inside Calcite, i.e. n2 - n1 |
| ktimestamp | bigint | query begin time (timestamp) | 1600938970920 | | | |
| kyear_begin_date | string | query begin time (year) | 2020/1/1 | D | | |
| kmonth_begin_date | string | query begin time (month) | 2020/9/1 | D | | |
| kweek_begin_date | string | query begin time (week, beginning on Sunday) | 2020/9/20 | D | | |
| kday_time | string | query begin time (time of day) | 17:16:10 | D | | |
| ktime_hour | int | query begin time (hour) | 17 | D | | |
| ktime_minute | int | query begin time (minute) | 16 | D | | |
| ktime_second | int | query begin time (second) | 10 | | | |
| kday_date | string | query begin time (day) | 2020/9/24 | | | Hive table partition column |
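The relationship between the three row counts above (n1, n2, and the aggregated/filtered difference) can be sketched as follows, with made-up numbers:

```python
# Rows the storage layer returns to Calcite (n2) vs. rows Calcite finally
# returns to Kylin (n1); the difference is what Calcite aggregated or filtered.
storage_count_return = 100   # n2, input rows to Calcite (hypothetical)
calcite_count_return = 40    # n1, result rows Calcite returns (hypothetical)

calcite_count_aggregate_filter = storage_count_return - calcite_count_return
print(calcite_count_aggregate_filter)  # 60
```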
hive_metrics_query_cube_qa
| Column | Type | Description | Sample | D/M | Measure Function | My question |
|---|---|---|---|---|---|---|
| host | string | the host of the server for the query engine | cdh-client:10.1.3.91 | | | |
| project | string | project name | PEARVIDEOAPP | | | |
| cube_name | string | cube name | UserActionPhaseOneCube | D | | |
| segment_name | string | segment name | 20201011000000_20201012000000 | D | | |
| cuboid_source | bigint | source cuboid parsed from the query and the Cube design | 12582912 | D | | The cuboid that best matches the query; it may not actually have been precalculated |
| cuboid_target | bigint | target cuboid already precalculated and serving the source cuboid | 13041664 | D | | The cuboid actually used; answering the query may require further post-aggregation |
| if_match | boolean | whether source cuboid and target cuboid are equal | FALSE | D | | |
| filter_mask | bigint | | 4194304 | D | | |
| if_success | boolean | whether a query on this Cube is successful or not | TRUE | D | | |
| weight_per_hit | double | the reciprocal of the number of Cubes hit by a single query | 1 | M | SUM | |
| storage_call_count | bigint | the number of rpc calls for a query hit on this Cube | 1 | M | SUM/MAX | |
| storage_call_time_sum | bigint | sum of the time cost of the rpc calls of a query | 268 | M | SUM/MAX | |
| storage_call_time_max | bigint | max of the time cost among the rpc calls of a query | 268 | M | SUM/MAX | |
| storage_count_skip | bigint | the sum of the row count skipped by the related rpc calls | 0 | M | SUM/MAX | |
| storage_count_scan | bigint | the sum of the row count scanned by the related rpc calls | 929 | M | SUM/MAX | |
| storage_count_return | bigint | the sum of the row count returned by the related rpc calls | 45 | M | SUM/MAX | |
| storage_count_aggregate_filter | bigint | the sum of the row count aggregated and filtered by the related rpc calls, = STORAGE_COUNT_SCAN - STORAGE_COUNT_RETURN | 884 | M | SUM/MAX | |
| storage_count_aggregate | bigint | the sum of the row count aggregated by the related rpc calls | 36 | M | SUM/MAX | |
| ktimestamp | bigint | query begin time (timestamp) | 1603462676906 | | | |
| kyear_begin_date | string | query begin time (year) | 2020/1/1 | D | | |
| kmonth_begin_date | string | query begin time (month) | 2020/10/1 | D | | |
| kweek_begin_date | string | query begin time (week, beginning on Sunday) | 2020/10/18 | D | | |
| kday_time | string | query begin time (time of day) | 22:17:56 | D | | |
| ktime_hour | int | query begin time (hour) | 22 | D | | |
| ktime_minute | int | query begin time (minute) | 17 | D | | |
| ktime_second | int | query begin time (second) | 56 | | | |
| kday_date | string | query begin time (day) | 2020/10/23 | | | Hive table partition column |
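Two of the columns above are derived directly from the others; a minimal sketch, using the sample values from the table plus a hypothetical Cube-hit count:

```python
# weight_per_hit: reciprocal of the number of Cubes a single query hits,
# so that summing it over the rows of one query weights each query equally.
cubes_hit_by_query = 4                   # hypothetical
weight_per_hit = 1 / cubes_hit_by_query  # 0.25

# storage_count_aggregate_filter = STORAGE_COUNT_SCAN - STORAGE_COUNT_RETURN,
# using the sample values 929 and 45 from the table.
storage_count_scan = 929
storage_count_return = 45
storage_count_aggregate_filter = storage_count_scan - storage_count_return
print(weight_per_hit, storage_count_aggregate_filter)  # 0.25 884
```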
hive_metrics_query_rpc_qa
| Column | Type | Description | Sample | D/M | Measure Function | My question |
|---|---|---|---|---|---|---|
| host | string | the host of the server for the query engine | cdh-client:10.1.3.91 | D | | |
| project | string | project name | LEARN_KYLIN | D | | |
| realization | string | cube name | kylin_sales_cube_SIMPLE | D | | |
| rpc_server | string | the related rpc target server | cdh-worker-2 | D | | |
| exception | string | the exception of an rpc call; if there is no exception, "NULL" is used | NULL | D | | |
| call_time | bigint | the time cost of an rpc call | 60 | M | SUM/MAX/PERCENTILE_APPROX | |
| count_return | bigint | the row count actually returned | 3 | M | SUM/MAX | |
| count_scan | bigint | the row count actually scanned | 3 | M | SUM/MAX | |
| count_skip | bigint | the row count skipped based on fuzzy filters or similar mechanisms | 0 | M | SUM/MAX | |
| count_aggregate_filter | bigint | the row count actually aggregated and filtered, = COUNT_SCAN - COUNT_RETURN | 0 | M | SUM/MAX | |
| count_aggregate | bigint | the row count actually aggregated | 0 | M | SUM/MAX | |
| ktimestamp | bigint | query begin time (timestamp) | 1600938970918 | | | |
| kyear_begin_date | string | query begin time (year) | 2020/1/1 | D | | |
| kmonth_begin_date | string | query begin time (month) | 2020/9/1 | D | | |
| kweek_begin_date | string | query begin time (week, beginning on Sunday) | 2020/9/20 | D | | |
| kday_time | string | query begin time (time of day) | 17:16:10 | D | | |
| ktime_hour | int | query begin time (hour) | 17 | D | | |
| ktime_minute | int | query begin time (minute) | 16 | D | | |
| ktime_second | int | query begin time (second) | 10 | | | |
| kday_date | string | query begin time (day) | 2020/9/24 | | | Hive table partition column |
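Since each row of hive_metrics_query_rpc_qa describes one rpc call, per-server health can be summarized with the listed measure functions such as SUM and MAX over call_time. A sketch with made-up servers and call times:

```python
from collections import defaultdict

# (rpc_server, call_time in ms) pairs, one per rpc call -- hypothetical data
calls = [
    ("cdh-worker-1", 60),
    ("cdh-worker-2", 30),
    ("cdh-worker-1", 90),
    ("cdh-worker-2", 45),
]

# SUM and MAX of call_time per rpc target server
call_time_sum = defaultdict(int)
call_time_max = defaultdict(int)
for server, t in calls:
    call_time_sum[server] += t
    call_time_max[server] = max(call_time_max[server], t)

print(dict(call_time_sum))  # {'cdh-worker-1': 150, 'cdh-worker-2': 75}
print(dict(call_time_max))  # {'cdh-worker-1': 90, 'cdh-worker-2': 45}
```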
hive_metrics_job_qa
| Column | Type | Description | Sample | D/M | Measure Function | My question |
|---|---|---|---|---|---|---|
| job_id | string | job id | 51b40173-1f6c-7e55-e0ca-fbc84d242ac0 | | | |
| host | string | the host of the server for the job engine | cdh-client:10.1.3.91 | | | |
| kuser | string | user name | ADMIN | D | | |
| project | string | project name | LEARN_KYLIN | D | | |
| cube_name | string | cube name | kylin_sales_cube_poi | D | | |
| job_type | string | job type: build, merge, or optimize | BUILD | D | | Are these the only three types? |
| cubing_type | string | in Kylin, there are two cubing algorithms: Layered and Fast (In-Memory) | NULL | D | | I don't understand this field |
| duration | bigint | the duration from job start to finish | 945001 | M | SUM/MAX/MIN/PERCENTILE_APPROX | |
| table_size | bigint | the size of the data source in bytes | 227964845 | M | SUM/MAX/MIN | |
| cube_size | bigint | the size of the created Cube segment in bytes | 35693596 | M | SUM/MAX/MIN | |
| per_bytes_time_cost | double | DURATION / TABLE_SIZE | 0.00414538 | M | SUM/MAX/MIN | |
| wait_resource_time | bigint | a job may include several MR (MapReduce) jobs, which may wait due to a lack of Hadoop resources | 158146 | M | SUM/MAX/MIN | |
| step_duration_distinct_columns | bigint | duration of the distinct-columns extraction step | 138586 | M | SUM/MAX | I know these are durations of individual steps, but I don't know how the steps are divided; an actual job has more than 4 steps |
| step_duration_dictionary | bigint | duration of the dictionary building step | 5311 | M | SUM/MAX | |
| step_duration_inmem_cubing | bigint | duration of the in-memory cubing step | 89 | M | SUM/MAX | |
| step_duration_hfile_convert | bigint | duration of the HFile conversion step | 75382 | M | SUM/MAX | |
| ktimestamp | bigint | job begin time (timestamp) | 1600938458385 | | | |
| kyear_begin_date | string | job begin time (year) | 2020/1/1 | D | | |
| kmonth_begin_date | string | job begin time (month) | 2020/9/1 | D | | |
| kweek_begin_date | string | job begin time (week, beginning on Sunday) | 2020/9/20 | D | | |
| kday_time | string | job begin time (time of day) | 17:07:38 | D | | |
| ktime_hour | int | job begin time (hour) | 17 | D | | |
| ktime_minute | int | job begin time (minute) | 7 | D | | |
| ktime_second | int | job begin time (second) | 38 | | | |
| kday_date | string | job begin time (day) | 2020/9/24 | | | Hive table partition column |
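per_bytes_time_cost is simply the build duration divided by the source size. With the sample values from the table, this reproduces the sample 0.00414538:

```python
duration = 945001       # ms, sample value from the table
table_size = 227964845  # bytes, sample value from the table

per_bytes_time_cost = duration / table_size
print(round(per_bytes_time_cost, 8))  # 0.00414538
```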
hive_metrics_job_exception_qa
| Column | Type | Description | Sample | D/M | Measure Function | My question |
|---|---|---|---|---|---|---|
| job_id | string | job id | a333a36d-8e33-811f-9326-f04579cc2464 | | | |
| host | string | the host of the server for the job engine | cdh-client:10.1.3.91 | | | |
| kuser | string | user name | ADMIN | D | | |
| project | string | project name | LEARN_KYLIN | D | | |
| cube_name | string | cube name | kylin_sales_cube | D | | |
| job_type | string | job type: build, merge, or optimize | BUILD | D | | |
| cubing_type | string | in Kylin, there are two cubing algorithms: Layered and Fast (In-Memory) | LAYER | D | | |
| exception | string | exceptions may occur while running a job; this classifies the different exception types | org.apache.kylin.job.exception.ExecuteException | D | | |
| ktimestamp | bigint | job begin time (timestamp) | 1600936844611 | | | |
| kyear_begin_date | string | job begin time (year) | 2020/1/1 | D | | |
| kmonth_begin_date | string | job begin time (month) | 2020/9/1 | D | | |
| kweek_begin_date | string | job begin time (week, beginning on Sunday) | 2020/9/20 | D | | |
| kday_time | string | job begin time (time of day) | 16:40:44 | D | | |
| ktime_hour | int | job begin time (hour) | 16 | D | | |
| ktime_minute | int | job begin time (minute) | 40 | D | | |
| ktime_second | int | job begin time (second) | 44 | D | | |
| kday_date | string | job begin time (day) | 2020/9/24 | | | Hive table partition column |