Apache Kylin : Analytical Data Warehouse for Big Data
Welcome to Kylin Wiki.
For the official documentation, see http://kylin.apache.org/cn/docs/tutorial/setup_systemcube.html
1. Background
Why do we need the System Cube, and what does it do?
Kylin provides the System Cube to better monitor the Kylin system itself. The System Cube records query- and job-related metrics, which helps with system operations and maintenance. The System Cube serves Cube Planner and the Dashboard.
2. The Hive Tables for System Cube
While the System Cube serves Cube Planner and the Dashboard, five Hive tables in turn serve the System Cube. These five tables record query- and job-related metrics, such as query response time and the number of succeeded/failed jobs, and they are the foundation of the System Cube. How to create these Hive tables is covered later; this section explains the relationship between the Hive tables and the System Cube.
2.1 Hive Tables Overview
2.1.1 Prerequisite: how a query is executed
Your operations on Kylin, such as queries or builds, are recorded in the Hive tables. Understanding how a query is executed in Kylin therefore helps in understanding what the Hive tables mean.
- In Kylin, a single query you submit may hit multiple Cubes, for example when a single SQL statement contains nested subqueries.
- When a Cube contains multiple Segments, multiple Segments may need to be scanned to answer the query. For example, UserActionPhaseTwoCube contains 2 Segments.
- For better parallelism, the data of a single Segment may be stored on multiple RPC target servers, so multiple RPC servers may need to be scanned to answer the query. Expanding the Cube information and checking Region Count under the Storage tab shows how many RPC target servers a Segment's data is stored on.
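The fan-out described above determines how many metric rows each level records. A minimal sketch, with hypothetical counts (this is plain Python for illustration, not Kylin code):

```python
# Sketch of how one query fans out into metric rows, with hypothetical
# counts: the query hits some Cubes, each Cube scans some Segments, and
# each Segment's data spans some RPC target servers.
def metric_rows(cubes_hit, segments_per_cube, rpc_servers_per_segment):
    """Return the row counts generated in the three query metric tables."""
    query_rows = cubes_hit                   # hive_metrics_query_qa: 1 row per Cube hit
    cube_rows = sum(segments_per_cube)       # hive_metrics_query_cube_qa: 1 row per Segment scanned
    rpc_rows = sum(rpc_servers_per_segment)  # hive_metrics_query_rpc_qa: 1 row per RPC target server
    return query_rows, cube_rows, rpc_rows

# One query hits 2 Cubes; the Cubes scan 2 and 1 Segments; those three
# Segments span 3, 1, and 1 RPC target servers respectively.
print(metric_rows(2, [2, 1], [3, 1, 1]))  # (2, 3, 5)
```

The numbers are arbitrary; only the counting rule (one row per Cube hit, per Segment scanned, per RPC server involved) comes from the table descriptions below.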
2.1.2 Introduction to the Hive Tables
Assuming you have already created the 5 Hive tables via the script, here is a brief introduction to them:
Table Name | Type | Description | Notes |
---|---|---|---|
hive_metrics_query_qa | Collects query-related information. | Query metrics at the highest level. | 1 row is generated for each Cube a single query hits. For example, a query that hits 2 Cubes generates 2 rows. |
hive_metrics_query_cube_qa | Collects query-related information. | Collects query metrics at the Cube level. The most important part is the cuboid-related metrics, which serve Cube Planner. | When a query hits a Cube, 1 row is generated for each Segment scanned. For example, when a query hits UserActionPhaseTwoCube without a time filter, 2 rows are generated. |
hive_metrics_query_rpc_qa | Collects query-related information. | Collects query metrics at the lowest level. For one query, the related aggregation and filter can be pushed down to each RPC target server. The robustness of the RPC target servers is the basis of good query performance. | When a query scans a Segment, 1 row is generated for each RPC target server the Segment spans. For example, if a Segment spans 3 RPC target servers, a query scanning that Segment generates 3 rows. |
hive_metrics_job_qa | Collects job-related information. | Collects information about succeeded jobs. | 1 row is generated per succeeded job. |
hive_metrics_job_exception_qa | Collects job-related information. | Collects information about failed jobs. | 1 row is generated per failed job. |
2.1.3 Notes on verification
If you want to verify that queries or build jobs in Kylin are recorded in the Hive tables, note that data generally arrives in the Hive tables with a delay: the system typically inserts records in one batch "after a certain time" or "once a fixed batch size is reached". For quick verification, you can try the following two methods:
- Modify the configuration items maxReportSize or maxReportTime in $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/classes/kylinMetrics.xml (unverified)
- Restart Kylin, which should immediately flush all pending records (unverified)
2.2 The Relationship Between Hive Tables And System Cube
After the System Cube is enabled, a project named KYLIN_SYSTEM appears in Kylin. It contains 5 Cubes; each Cube has one fact table (no lookup tables), and each fact table corresponds to one Hive table.
We will describe the following fields:
- Column: the name of the Hive table column
- Type: the type of this column
- Description: the description of this column
- Sample: sample data in Hive
- D/M: whether this column is set as a Dimension or a Measure in the System Cube
- Measure Function: if it is set as a Measure, the function of the measure
- Notes: other supplementary information
2.2.1 hive_metrics_query_qa
Column | Type | Description | Sample | D/M | Measure Function | Notes |
---|---|---|---|---|---|---|
query_hash_code | bigint | query unique id | 7708685990456150000 | M | COUNT_DISTINCT | Each SQL statement corresponds to one unique query_hash_code; querying the same SQL again does not generate a new query_hash_code |
host | string | the host of server for query engine | cdh-client:10.1.3.91 | D | ||
kuser | string | user name | ADMIN | D | ||
project | string | project name | LEARN_KYLIN | D | ||
realization | string | cube name | kylin_sales_cube_SIMPLE | D | ||
realization_type | int | the storage type | 2 | D | ||
query_type | string | CACHE,OLAP,LOOKUP_TABLE,HIVE (users can query on different data sources) | OLAP, CACHE | D | ||
exception | string | It's for classifying different exception types (when doing query, exceptions may happen) | NULL, java.lang.NumberFormatException | D | ||
query_time_cost | bigint | the time cost for the whole query | 1392 | M | MIN/SUM/MAX/PERCENTILE_APPROX | |
calcite_count_return | bigint | the row count of the result Calcite returns | 3 | M | SUM/MAX | Row count Calcite returns to Kylin; call it n1 |
storage_count_return | bigint | the row count of the input to Calcite | 3 | M | SUM/MAX | Row count the storage layer returns to Calcite; call it n2 |
calcite_count_aggregate_filter | bigint | the row count Calcite aggregates and filters | 0 | M | SUM/MAX | Row count filtered or aggregated within Calcite, i.e. n2 - n1 |
ktimestamp | bigint | query begin time (timestamp) | 1600938970920 | |||
kyear_begin_date | string | query begin time (year) | 2020/1/1 | D | ||
kmonth_begin_date | string | query begin time (month) | 2020/9/1 | D | ||
kweek_begin_date | string | query begin time (week, begins on Sunday) | 2020/9/20 | D | ||
kday_time | string | query begin time (time) | 17:16:10 | D | ||
ktime_hour | int | query begin time (hour) | 17 | D | ||
ktime_minute | int | query begin time (minute) | 16 | D | ||
ktime_second | int | query begin time (second) | 10 | |||
kday_date | string | query begin time (day) | 2020/9/24 | | | Hive table partition column |
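The relationship between the three row-count measures above (n1 and n2 in the Notes) can be checked against the table's own Sample values. A minimal sketch (plain Python for illustration, not Kylin code):

```python
# Row-count relationship in hive_metrics_query_qa, using the Sample
# values from the table above.
storage_count_return = 3   # n2: rows the storage layer returns to Calcite
calcite_count_return = 3   # n1: rows Calcite returns to Kylin
# Rows filtered or aggregated away inside Calcite: n2 - n1
calcite_count_aggregate_filter = storage_count_return - calcite_count_return
print(calcite_count_aggregate_filter)  # 0, matching the Sample column
```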
2.2.2 hive_metrics_query_cube_qa
Column | Type | Description | Sample | D/M | Measure Function | Open Questions |
---|---|---|---|---|---|---|
host | string | the host of server for query engine | cdh-client:10.1.3.91 | |||
project | string | project name | PEARVIDEOAPP | |||
cube_name | string | cube name | UserActionPhaseOneCube | D | ||
segment_name | string | segment name | 20201011000000_20201012000000 | D | ||
cuboid_source | bigint | source cuboid parsed based on query and Cube design | 12582912 | D | | The cuboid that best matches the query; it may not actually exist |
cuboid_target | bigint | target cuboid already precalculated and served for source cuboid | 13041664 | D | | The cuboid actually used; post-aggregation may be needed to answer the query |
if_match | boolean | whether source cuboid and target cuboid are equal | FALSE | D | ||
filter_mask | bigint | | 4194304 | D | | |
if_success | boolean | whether a query on this Cube is successful or not | TRUE | D | ||
weight_per_hit | double | reciprocal of the number of Cubes hit by a single query | 1 | M | SUM | |
storage_call_count | bigint | the number of rpc calls for a query hit on this Cube | 1 | M | SUM/MAX | |
storage_call_time_sum | bigint | sum of time cost for the rpc calls of a query | 268 | M | SUM/MAX | |
storage_call_time_max | bigint | max of time cost among the rpc calls of a query | 268 | M | SUM/MAX | |
storage_count_skip | bigint | the sum of row count skipped for the related rpc calls | 0 | M | SUM/MAX | |
storage_count_scan | bigint | the sum of row count scanned for the related rpc calls | 929 | M | SUM/MAX | |
storage_count_return | bigint | the sum of row count returned for the related rpc calls | 45 | M | SUM/MAX | |
storage_count_aggregate_filter | bigint | the sum of row count aggregated and filtered for the related rpc calls, = STORAGE_COUNT_SCAN - STORAGE_COUNT_RETURN | 884 | M | SUM/MAX | |
storage_count_aggregate | bigint | the sum of row count aggregated for the related rpc calls | 36 | M | SUM/MAX | |
ktimestamp | bigint | query begin time (timestamp) | 1603462676906 | |||
kyear_begin_date | string | query begin time (year) | 2020/1/1 | D | ||
kmonth_begin_date | string | query begin time (month) | 2020/10/1 | D | ||
kweek_begin_date | string | query begin time (week, begins on Sunday) | 2020/10/18 | D | ||
kday_time | string | query begin time (time) | 22:17:56 | D | ||
ktime_hour | int | query begin time (hour) | 22 | D | ||
ktime_minute | int | query begin time (minute) | 17 | D | ||
ktime_second | int | query begin time (second) | 56 | |||
kday_date | string | query begin time (day) | 2020/10/23 | | | Hive table partition column |
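Two of the measures above can be illustrated numerically. A minimal sketch using hypothetical numbers plus the Sample values from the table (plain Python for illustration, not Kylin code):

```python
# Two derived measures in hive_metrics_query_cube_qa.

# weight_per_hit: a query hitting N Cubes writes N rows, each weighted
# 1/N, so SUM(weight_per_hit) per Cube counts queries fairly.
# (N = 2 here is a hypothetical value; the table's Sample shows 1.)
cubes_hit = 2
weight_per_hit = 1 / cubes_hit          # 0.5 in each of the 2 rows

# storage_count_aggregate_filter = STORAGE_COUNT_SCAN - STORAGE_COUNT_RETURN:
# rows scanned by the RPC calls but aggregated/filtered away before return.
# These are the Sample values from the table above.
storage_count_scan = 929
storage_count_return = 45
storage_count_aggregate_filter = storage_count_scan - storage_count_return
print(weight_per_hit, storage_count_aggregate_filter)  # 0.5 884
```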
2.2.3 hive_metrics_query_rpc_qa
This table is not yet used in Kylin 4.0 Beta.
Column | Type | Description | Sample | D/M | Measure Function |
---|---|---|---|---|---|
host | string | the host of server for query engine | cdh-client:10.1.3.91 | D | |
project | string | project name | LEARN_KYLIN | D | |
realization | string | cube name | kylin_sales_cube_SIMPLE | D | |
rpc_server | string | the rpc related target server | cdh-worker-2 | D | |
exception | string | the exception of a rpc call. If no exception, "NULL" is used | NULL | D | |
call_time | bigint | the time cost of a rpc call | 60 | M | SUM/MAX/PERCENTILE_APPROX |
count_return | bigint | the row count actually return | 3 | M | SUM/MAX |
count_scan | bigint | the row count actually scanned | 3 | M | SUM/MAX |
count_skip | bigint | based on fuzzy filters or similar, some rows are skipped; this indicates the skipped row count | 0 | M | SUM/MAX |
count_aggregate_filter | bigint | the row count actually aggregated and filtered, = COUNT_SCAN - COUNT_RETURN | 0 | M | SUM/MAX |
count_aggregate | bigint | the row count actually aggregated | 0 | M | SUM/MAX |
ktimestamp | bigint | query begin time (timestamp) | 1600938970918 | ||
kyear_begin_date | string | query begin time (year) | 2020/1/1 | D | |
kmonth_begin_date | string | query begin time (month) | 2020/9/1 | D | |
kweek_begin_date | string | query begin time (week, begins on Sunday) | 2020/9/20 | D | |
kday_time | string | query begin time (time) | 17:16:10 | D | |
ktime_hour | int | query begin time (hour) | 17 | D | |
ktime_minute | int | query begin time (minute) | 16 | D | |
ktime_second | int | query begin time (second) | 10 | ||
kday_date | string | query begin time (day) | 2020/9/24 | Hive table partition column |
2.2.4 hive_metrics_job_qa
This table is not yet used in Kylin 4.0 Beta.
Column | Type | Description | Sample | D/M | Measure Function | Open Questions |
---|---|---|---|---|---|---|
job_id | string | job id | 51b40173-1f6c-7e55-e0ca-fbc84d242ac0 | |||
host | string | the host of server for job engine | cdh-client:10.1.3.91 | |||
kuser | string | user name | ADMIN | D | ||
project | string | project name | LEARN_KYLIN | D | ||
cube_name | string | cube name | kylin_sales_cube_poi | D | ||
job_type | string | job type: build, merge, optimize | BUILD | D | | Are these the only three types? |
cubing_type | string | in Kylin, there are two cubing algorithms: Layered & Fast (In-Memory) | NULL | D | | Unclear to the author |
duration | bigint | the duration from a job start to finish | 945001 | M | SUM/MAX/MIN/PERCENTILE_APPROX | |
table_size | bigint | the size of data source in bytes | 227964845 | M | SUM/MAX/MIN | |
cube_size | bigint | the size of created Cube segment in bytes | 35693596 | M | SUM/MAX/MIN | |
per_bytes_time_cost | double | DURATION / TABLE_SIZE | 0.00414538 | M | SUM/MAX/MIN | |
wait_resource_time | bigint | a job may include several MR (MapReduce) jobs, which may wait because of a lack of Hadoop resources | 158146 | M | SUM/MAX/MIN | |
step_duration_distinct_columns | bigint | | 138586 | M | SUM/MAX | These are the durations of different steps, but it is unclear how the steps are divided; an actual job has more than 4 steps |
step_duration_dictionary | bigint | | 5311 | M | SUM/MAX | |
step_duration_inmem_cubing | bigint | | 89 | M | SUM/MAX | |
step_duration_hfile_convert | bigint | | 75382 | M | SUM/MAX | |
ktimestamp | bigint | query begin time (timestamp) | 1600938458385 | |||
kyear_begin_date | string | query begin time (year) | 2020/1/1 | D | ||
kmonth_begin_date | string | query begin time (month) | 2020/9/1 | D | ||
kweek_begin_date | string | query begin time (week, begins on Sunday) | 2020/9/20 | D | ||
kday_time | string | query begin time (time) | 17:07:38 | D | ||
ktime_hour | int | query begin time (hour) | 17 | D | ||
ktime_minute | int | query begin time (minute) | 7 | D | ||
ktime_second | int | query begin time (second) | 38 | |||
kday_date | string | query begin time (day) | 2020/9/24 | | | Hive table partition column |
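The per_bytes_time_cost measure above is simply DURATION / TABLE_SIZE, which can be verified with the Sample values from the table (plain Python for illustration, not Kylin code):

```python
# per_bytes_time_cost = DURATION / TABLE_SIZE, computed from the Sample
# values in the table above (duration in ms, table size in bytes).
duration = 945001         # job duration, from the Sample column
table_size = 227964845    # data source size in bytes, from the Sample column
per_bytes_time_cost = duration / table_size
print(round(per_bytes_time_cost, 8))  # 0.00414538, matching the Sample column
```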
2.2.5 hive_metrics_job_exception_qa
This table is not yet used in Kylin 4.0 Beta.
Column | Type | Description | Sample | D/M | Measure Function |
---|---|---|---|---|---|
job_id | string | job id | a333a36d-8e33-811f-9326-f04579cc2464 | ||
host | string | the host of server for job engine | cdh-client:10.1.3.91 | ||
kuser | string | user name | ADMIN | D | |
project | string | project name | LEARN_KYLIN | D | |
cube_name | string | cube name | kylin_sales_cube | D | |
job_type | string | job type: build, merge, optimize | BUILD | D | |
cubing_type | string | in Kylin, there are two cubing algorithms: Layered & Fast (In-Memory) | LAYER | D | |
exception | string | When running a job, exceptions may happen. This is for classifying different exception types | org.apache.kylin.job.exception.ExecuteException | D | |
ktimestamp | bigint | query begin time (timestamp) | 1600936844611 | ||
kyear_begin_date | string | query begin time (year) | 2020/1/1 | D | |
kmonth_begin_date | string | query begin time (month) | 2020/9/1 | D | |
kweek_begin_date | string | query begin time (week, begins on Sunday) | 2020/9/20 | D | |
kday_time | string | query begin time (time) | 16:40:44 | D | |
ktime_hour | int | query begin time (hour) | 16 | D | |
ktime_minute | int | query begin time (minute) | 40 | D | |
ktime_second | int | query begin time (second) | 44 | D | |
kday_date | string | query begin time (day) | 2020/9/24 | Hive table partition column |
How To Enable System Cube
How do you enable the System Cube, and how does it take effect?
See the official documentation http://kylin.apache.org/cn/docs/tutorial/setup_systemcube.html (to be updated).
The setup generally has 3 parts:
- Create the Hive tables and the System Cube
- Build the System Cube
- Add a scheduled build job for the System Cube
How To Use System Cube
The System Cube records the system's metric data and provides the data foundation for the Dashboard and Cube Planner.