Apache Kylin : Analytical Data Warehouse for Big Data
Welcome to Kylin Wiki.
There are 5 Hive tables recording System Cube data: 3 tables record query-related metrics and 2 tables record job-related metrics. When Cube Planner is enabled, a project named KYLIN_SYSTEM appears in Kylin; it contains 5 Cubes, one for each of these Hive tables.
Hive table descriptions
Table Name | Type | Description | Row Granularity | My Question
---|---|---|---|---
hive_metrics_query_qa | Query | query metrics at the highest level | one row per query |
hive_metrics_query_cube_qa | Query | collects query metrics at the Cube level; most important are the cuboid-related metrics, which serve Cube Planner | one row per query | Is a row only generated when the query hits a cube? If a query is pushed down, is it recorded here?
hive_metrics_query_rpc_qa | Query | collects query metrics at the lowest level. For one query, the related aggregation and filter can be pushed down to each rpc target server; the robustness of the rpc target servers is the foundation of good query performance. | at least one row per query | How should "RPC" be understood here?
hive_metrics_job_qa | Job | collects metrics of successful jobs | one row per successful job |
hive_metrics_job_exception_qa | Job | collects metrics of failed jobs | one row per failed job |
Relationship between Hive table columns and Kylin Cube information
hive_metrics_query_qa
Column | Type | Description | Sample (comma-separated samples) | D/M | Measure Function | My Question |
---|---|---|---|---|---|---|
query_hash_code | bigint | query unique id | 7708685990456150000 | M | COUNT_DISTINCT | Is there one hash_code per query? If exactly the same query is run twice, is that one hash_code or two? |
host | string | the host of server for query engine | cdh-client:10.1.3.91 | D | ||
kuser | string | user name | ADMIN | D | ||
project | string | project name | LEARN_KYLIN | D | ||
realization | string | Cube or Hybrid of Cubes (in Kylin, there are two OLAP realizations) | kylin_sales_cube_SIMPLE | D | | Is this really the cube name? Other tables have a cube_name column; why is this table different? |
realization_type | int | | 2 | D | | I don't understand this field. |
query_type | string | CACHE,OLAP,LOOKUP_TABLE,HIVE (users can query on different data sources) | OLAP, CACHE | D | ||
exception | string | It's for classifying different exception types (when doing a query, exceptions may happen) | NULL, java.lang.NumberFormatException | D | | Does every exception fall into a category? |
query_time_cost | bigint | the time cost for the whole query | 1392 | M | MIN/SUM/MAX/PERCENTILE_APPROX | |
calcite_count_return | bigint | the row count of the result Calcite returns | 3 | M | SUM/MAX | I don't understand this. |
storage_count_return | bigint | the row count of the input to Calcite | 3 | M | SUM/MAX | I don't understand this. |
calcite_count_aggregate_filter | bigint | the row count Calcite aggregates and filters | 0 | M | SUM/MAX | I don't understand this, mainly because I don't understand how Calcite executes. |
ktimestamp | bigint | query begin time (timestamp) | 1600938970920 | |||
kyear_begin_date | string | query begin time (year) | 2020/1/1 | D | ||
kmonth_begin_date | string | query begin time (month) | 2020/9/1 | D | ||
kweek_begin_date | string | query begin time (week, begins on Sunday) | 2020/9/20 | D ||
kday_time | string | query begin time (time) | 17:16:10 | D | ||
ktime_hour | int | query begin time (hour) | 17 | D | ||
ktime_minute | int | query begin time (minute) | 16 | D | ||
ktime_second | int | query begin time (second) | 10 | |||
kday_date | string | query begin time (day) | 2020/9/24 | Hive table partition column |
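All of the k*-prefixed time columns above are derived from the single `ktimestamp` value. A minimal sketch of that derivation, assuming the samples were rendered in the Kylin server's local time zone (UTC+8 matches the samples; the offset is an assumption, not documented behavior):

```python
from datetime import datetime, timedelta, timezone

def derive_time_columns(ktimestamp_ms, tz_offset_hours=8):
    """Derive the k* time columns from a millisecond timestamp.

    Assumption: samples were rendered in the server's local zone
    (UTC+8 here); adjust tz_offset_hours for your deployment.
    """
    tz = timezone(timedelta(hours=tz_offset_hours))
    t = datetime.fromtimestamp(ktimestamp_ms / 1000, tz)

    def fmt(d):
        # Samples use unpadded year/month/day, e.g. "2020/9/1".
        return f"{d.year}/{d.month}/{d.day}"

    # Week begins on Sunday: step back (weekday() + 1) % 7 days
    # (Python's weekday() is 0 for Monday, 6 for Sunday).
    week_begin = t - timedelta(days=(t.weekday() + 1) % 7)
    return {
        "kyear_begin_date": fmt(t.replace(month=1, day=1)),
        "kmonth_begin_date": fmt(t.replace(day=1)),
        "kweek_begin_date": fmt(week_begin),
        "kday_date": fmt(t),
        "kday_time": t.strftime("%H:%M:%S"),
        "ktime_hour": t.hour,
        "ktime_minute": t.minute,
        "ktime_second": t.second,
    }

# Sample ktimestamp from the table above.
print(derive_time_columns(1600938970920)["kweek_begin_date"])  # 2020/9/20
```

Feeding in the sample `ktimestamp` 1600938970920 reproduces the sample values of the other time columns in this table.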
hive_metrics_query_cube_qa
Column | Type | Description | Sample | D/M | Measure Function | My Question |
---|---|---|---|---|---|---|
host | string | the host of server for query engine | cdh-client:10.1.3.91 | |||
project | string | project name | PEARVIDEOAPP | |||
cube_name | string | cube name | UserActionPhaseOneCube | D | ||
segment_name | string | segment name | 20201011000000_20201012000000 | D | ||
cuboid_source | bigint | source cuboid parsed based on query and Cube design | 12582912 | D | | Which cuboid is the source cuboid? |
cuboid_target | bigint | target cuboid already precalculated and served for source cuboid | 13041664 | D | | Which cuboid is the target cuboid? |
if_match | boolean | whether source cuboid and target cuboid are equal | FALSE | D ||
filter_mask | bigint | | 4194304 | D ||
if_success | boolean | whether a query on this Cube is successful or not | TRUE | D | | I don't understand this. |
weight_per_hit | double | | 1 | M | SUM | If a query is pushed down, is it recorded here? |
storage_call_count | bigint | the number of rpc calls for a query hit on this Cube | 1 | M | SUM/MAX | |
storage_call_time_sum | bigint | sum of time cost for the rpc calls of a query | 268 | M | SUM/MAX | |
storage_call_time_max | bigint | max of time cost among the rpc calls of a query | 268 | M | SUM/MAX ||
storage_count_skip | bigint | the sum of row count skipped for the related rpc calls | 0 | M | SUM/MAX | |
storage_count_scan | bigint | the sum of row count scanned for the related rpc calls | 929 | M | SUM/MAX | |
storage_count_return | bigint | the sum of row count returned for the related rpc calls | 45 | M | SUM/MAX | |
storage_count_aggregate_filter | bigint | the sum of row count aggregated and filtered for the related rpc calls, = STORAGE_COUNT_SCAN - STORAGE_COUNT_RETURN | 884 | M | SUM/MAX ||
storage_count_aggregate | bigint | the sum of row count aggregated for the related rpc calls | 36 | M | SUM/MAX | |
ktimestamp | bigint | query begin time (timestamp) | 1603462676906 | |||
kyear_begin_date | string | query begin time (year) | 2020/1/1 | D | ||
kmonth_begin_date | string | query begin time (month) | 2020/10/1 | D | ||
kweek_begin_date | string | query begin time (week, begins on Sunday) | 2020/10/18 | D ||
kday_time | string | query begin time (time) | 22:17:56 | D | ||
ktime_hour | int | query begin time (hour) | 22 | D | ||
ktime_minute | int | query begin time (minute) | 17 | D | ||
ktime_second | int | query begin time (second) | 56 | |||
kday_date | string | query begin time (day) | 2020/10/23 | Hive table partition column |
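One way to read `cuboid_source` and `cuboid_target` is as bitmasks: each bit position marks whether one cube dimension is present in the cuboid, and the precalculated target cuboid must cover every dimension of the source cuboid. A sketch using the sample values from the table above (the bit-to-dimension mapping itself is cube-specific, so the positions are illustrative only):

```python
def dims(cuboid_id):
    # Bit positions set in the cuboid id; each position maps to one
    # dimension of the cube (which one depends on the cube design).
    return {i for i in range(cuboid_id.bit_length()) if (cuboid_id >> i) & 1}

cuboid_source, cuboid_target = 12582912, 13041664  # sample row above

print(sorted(dims(cuboid_source)))                 # [22, 23]
print(sorted(dims(cuboid_target)))                 # [16, 17, 18, 22, 23]
print(dims(cuboid_source) <= dims(cuboid_target))  # True: target covers source
print(cuboid_source == cuboid_target)              # False -> if_match = FALSE
```

Under this reading, the sample row describes a query needing a 2-dimension cuboid that was answered from a precalculated 5-dimension cuboid, so `if_match` is FALSE.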
hive_metrics_query_rpc_qa
Column | Type | Description | Sample | D/M | Measure Function | My Question |
---|---|---|---|---|---|---|
host | string | the host of server for query engine | cdh-client:10.1.3.91 | D | ||
project | string | project name | LEARN_KYLIN | D | ||
realization | string | cube name | kylin_sales_cube_SIMPLE | D | ||
rpc_server | string | the rpc related target server | cdh-worker-2 | D | ||
exception | string | the exception of a rpc call. If no exception, "NULL" is used | NULL | D | ||
call_time | bigint | the time cost of an rpc call | 60 | M | SUM/MAX/PERCENTILE_APPROX ||
count_return | bigint | the row count actually return | 3 | M | SUM/MAX | |
count_scan | bigint | the row count actually scanned | 3 | M | SUM/MAX | |
count_skip | bigint | a few rows may be skipped based on fuzzy filters and the like; this is the skipped row count | 0 | M | SUM/MAX ||
count_aggregate_filter | bigint | the row count actually aggregated and filtered, = COUNT_SCAN - COUNT_RETURN | 0 | M | SUM/MAX ||
count_aggregate | bigint | the row count actually aggregated | 0 | M | SUM/MAX | |
ktimestamp | bigint | query begin time (timestamp) | 1600938970918 | |||
kyear_begin_date | string | query begin time (year) | 2020/1/1 | D | ||
kmonth_begin_date | string | query begin time (month) | 2020/9/1 | D | ||
kweek_begin_date | string | query begin time (week, begins on Sunday) | 2020/9/20 | D ||
kday_time | string | query begin time (time) | 17:16:10 | D | ||
ktime_hour | int | query begin time (hour) | 17 | D | ||
ktime_minute | int | query begin time (minute) | 16 | D | ||
ktime_second | int | query begin time (second) | 10 | |||
kday_date | string | query begin time (day) | 2020/9/24 | Hive table partition column |
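The per-call counters above obey the stated bookkeeping rule `count_aggregate_filter = COUNT_SCAN - COUNT_RETURN`, and the Measure Function column says how the Cube rolls them up. A small sanity-check sketch over hypothetical rpc rows (the first tuple reuses the sample counts from the cube-level table; the other two are made up):

```python
# Hypothetical rpc metric rows: (count_scan, count_return, count_aggregate_filter)
rows = [(929, 45, 884), (3, 3, 0), (100, 10, 90)]

# Per-row invariant from the column description above.
for scan, ret, agg_filter in rows:
    assert agg_filter == scan - ret

# SUM/MAX aggregations, the way the Cube measures would compute them.
sum_scan = sum(scan for scan, _, _ in rows)
max_scan = max(scan for scan, _, _ in rows)
print(sum_scan, max_scan)  # 1032 929
```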
hive_metrics_job_qa
Column | Type | Description | Sample | D/M | Measure Function | My Question |
---|---|---|---|---|---|---|
job_id | string | job id | 51b40173-1f6c-7e55-e0ca-fbc84d242ac0 | |||
host | string | the host of server for job engine | cdh-client:10.1.3.91 | |||
kuser | string | user name | ADMIN | D | ||
project | string | project name | LEARN_KYLIN | D | ||
cube_name | string | cube name | kylin_sales_cube_poi | D | ||
job_type | string | job type: build, merge, optimize | BUILD | D | | Are these the only three types? |
cubing_type | string | in Kylin, there are two cubing algorithms: Layered & Fast (In-Memory) | NULL | D | | I don't understand this. |
duration | bigint | the duration from a job start to finish | 945001 | M | SUM/MAX/MIN/PERCENTILE_APPROX | |
table_size | bigint | the size of data source in bytes | 227964845 | M | SUM/MAX/MIN | |
cube_size | bigint | the size of created Cube segment in bytes | 35693596 | M | SUM/MAX/MIN | |
per_bytes_time_cost | double | DURATION / TABLE_SIZE | 0.00414538 | M | SUM/MAX/MIN | |
wait_resource_time | bigint | a job may include several MR (MapReduce) jobs; those MR jobs may wait because of a lack of Hadoop resources | 158146 | M | SUM/MAX/MIN ||
step_duration_distinct_columns | bigint | | 138586 | M | SUM/MAX | I know these are the durations of different steps, but I don't know how the steps are divided; an actual job has more than 4 steps. |
step_duration_dictionary | bigint | | 5311 | M | SUM/MAX ||
step_duration_inmem_cubing | bigint | | 89 | M | SUM/MAX ||
step_duration_hfile_convert | bigint | | 75382 | M | SUM/MAX ||
ktimestamp | bigint | job begin time (timestamp) | 1600938458385 |||
kyear_begin_date | string | job begin time (year) | 2020/1/1 | D ||
kmonth_begin_date | string | job begin time (month) | 2020/9/1 | D ||
kweek_begin_date | string | job begin time (week, begins on Sunday) | 2020/9/20 | D ||
kday_time | string | job begin time (time) | 17:07:38 | D ||
ktime_hour | int | job begin time (hour) | 17 | D ||
ktime_minute | int | job begin time (minute) | 7 | D ||
ktime_second | int | job begin time (second) | 38 |||
kday_date | string | job begin time (day) | 2020/9/24 | Hive table partition column |
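`per_bytes_time_cost` is just the ratio of two measures in this table. Recomputing it from the sample row reproduces the sample value, assuming `duration` is in milliseconds and `table_size` in bytes as the column descriptions say:

```python
duration = 945001        # ms, sample row above
table_size = 227964845   # bytes, sample row above

# PER_BYTES_TIME_COST = DURATION / TABLE_SIZE, per the column description.
per_bytes_time_cost = duration / table_size
print(round(per_bytes_time_cost, 8))  # 0.00414538
```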
hive_metrics_job_exception_qa
Column | Type | Description | Sample | D/M | Measure Function | My Question |
---|---|---|---|---|---|---|
job_id | string | job id | a333a36d-8e33-811f-9326-f04579cc2464 | |||
host | string | the host of server for job engine | cdh-client:10.1.3.91 | |||
kuser | string | user name | ADMIN | D | ||
project | string | project name | LEARN_KYLIN | D | ||
cube_name | string | cube name | kylin_sales_cube | D | ||
job_type | string | job type: build, merge, optimize | BUILD | D | ||
cubing_type | string | in Kylin, there are two cubing algorithms: Layered & Fast (In-Memory) | LAYER | D ||
exception | string | exceptions may happen while running a job; this is for classifying different exception types | org.apache.kylin.job.exception.ExecuteException | D ||
ktimestamp | bigint | job begin time (timestamp) | 1600936844611 |||
kyear_begin_date | string | job begin time (year) | 2020/1/1 | D ||
kmonth_begin_date | string | job begin time (month) | 2020/9/1 | D ||
kweek_begin_date | string | job begin time (week, begins on Sunday) | 2020/9/20 | D ||
kday_time | string | job begin time (time) | 16:40:44 | D ||
ktime_hour | int | job begin time (hour) | 16 | D ||
ktime_minute | int | job begin time (minute) | 40 | D ||
ktime_second | int | job begin time (second) | 44 | D ||
kday_date | string | job begin time (day) | 2020/9/24 | Hive table partition column |