Apache Kylin : Analytical Data Warehouse for Big Data

There are 5 Hive tables that record System Cube data: 3 record query-related metrics and 2 record job-related metrics. When Cube Planner is enabled, a project named KYLIN_SYSTEM appears in Kylin. It contains 5 Cubes, one for each of these Hive tables.

Three of the Hive tables record query information. The following background helps explain what they contain.

In Kylin, a query follows this logic:

1. A single SQL statement may hit multiple Cubes, for example when it contains nested subqueries.

2. The Cube hit by a query may contain multiple Segments, so data from several Segments must be scanned.

3. The data of a single Segment may be stored on multiple machines, so a query may need to scan data on several machines; each such machine is called an RPC target server. Expanding the Cube information and checking Region Count under the Storage tab shows how many RPC target servers a Segment corresponds to.
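The three levels above determine how many rows a single query contributes to each metrics table. A minimal sketch (not Kylin code; the function name and the numbers are hypothetical):

```python
# Sketch: row counts one query adds to the three query-metrics tables,
# under the query -> Cube -> Segment -> RPC target server hierarchy above.

def metrics_rows(cubes_hit):
    """cubes_hit: list of hit Cubes; each Cube is a list of scanned Segments,
    each Segment given as its number of RPC target servers (Region Count)."""
    query_rows = 1                                    # hive_metrics_query_qa: one row per query
    cube_rows = sum(len(segs) for segs in cubes_hit)  # hive_metrics_query_cube_qa: one row per scanned Segment
    rpc_rows = sum(sum(segs) for segs in cubes_hit)   # hive_metrics_query_rpc_qa: one row per RPC target server
    return query_rows, cube_rows, rpc_rows

# A query hitting 2 Cubes: the first scans 2 Segments with 3 and 1 RPC target
# servers, the second scans 1 Segment with 2 RPC target servers.
print(metrics_rows([[3, 1], [2]]))  # (1, 3, 6)
```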

Hive table descriptions

| Table Name | Type | Description | Rows generated |
| --- | --- | --- | --- |
| hive_metrics_query_qa | Query | Query metrics at the highest level | One row per query |
| hive_metrics_query_cube_qa | Query | Query metrics collected at the Cube level. The most important are the cuboid-related metrics, which serve Cube Planner | One row per Segment scanned when a query hits a Cube. (Open question: is a pushdown query recorded here?) |
| hive_metrics_query_rpc_qa | Query | Query metrics collected at the lowest level. For a query, the related aggregations and filters can be pushed down to each RPC target server; robust RPC target servers are the basis of good query performance | One row per RPC target server of each Segment the query scans; e.g. if a Segment has 3 RPC target servers, scanning it generates 3 rows |
| hive_metrics_job_qa | Job | Metrics of successful jobs | One row per successful job |
| hive_metrics_job_exception_qa | Job | Metrics of failed jobs | One row per failed job |


Relationship between Hive table columns and Kylin Cube information

hive_metrics_query_qa

| Column | Type | Description | Sample (comma separates different samples) | D/M | Measure function | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| query_hash_code | bigint | query unique id | 7708685990456150000 | M | COUNT_DISTINCT | Each SQL statement maps to one unique query_hash_code; running the same SQL again does not generate a new one |
| host | string | the host of the query engine server | cdh-client:10.1.3.91 | D | | |
| kuser | string | user name | ADMIN | D | | |
| project | string | project name | LEARN_KYLIN | D | | |
| realization | string | Cube or Hybrid of Cubes (in Kylin there are two OLAP realizations); here the cube name | kylin_sales_cube_SIMPLE | D | | (Open question: is this really the cube name? Other tables have a cube_name column; why is this one different?) |
| realization_type | int | | 2 | D | | (Open question: meaning unclear) |
| query_type | string | CACHE, OLAP, LOOKUP_TABLE, HIVE (users can query different data sources) | OLAP, CACHE | D | | |
| exception | string | classifies the exception types that may occur during a query | NULL, java.lang.NumberFormatException | D | | (Open question: does every exception get a classification?) |
| query_time_cost | bigint | the time cost of the whole query | 1392 | M | MIN/SUM/MAX/PERCENTILE_APPROX | |
| calcite_count_return | bigint | the row count of the result Calcite returns | 3 | M | SUM/MAX | the rows Calcite returns to Kylin; call it n1 |
| storage_count_return | bigint | the row count of the input to Calcite | 3 | M | SUM/MAX | the rows the storage layer returns to Calcite; call it n2 |
| calcite_count_aggregate_filter | bigint | the row count Calcite aggregates and filters | 0 | M | SUM/MAX | the rows filtered or aggregated inside Calcite, i.e. n2 - n1 |
| ktimestamp | bigint | query begin time (timestamp) | 1600938970920 | | | |
| kyear_begin_date | string | query begin time (year) | 2020/1/1 | D | | |
| kmonth_begin_date | string | query begin time (month) | 2020/9/1 | D | | |
| kweek_begin_date | string | query begin time (week, beginning on Sunday) | 2020/9/20 | D | | |
| kday_time | string | query begin time (time of day) | 17:16:10 | D | | |
| ktime_hour | int | query begin time (hour) | 17 | D | | |
| ktime_minute | int | query begin time (minute) | 16 | D | | |
| ktime_second | int | query begin time (second) | 10 | | | |
| kday_date | string | query begin time (day) | 2020/9/24 | | | Hive table partition column |
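The three row-count columns are related by a simple identity: n2 rows enter Calcite, n1 rows come out, and the difference is what Calcite filtered or aggregated. A minimal sketch of that relation (the function name is illustrative, not part of Kylin):

```python
# Sketch: calcite_count_aggregate_filter in terms of the other two counts.
# n2 = storage_count_return (rows entering Calcite),
# n1 = calcite_count_return (rows Calcite returns to Kylin).

def calcite_count_aggregate_filter(storage_count_return, calcite_count_return):
    # rows Calcite filtered or aggregated away
    return storage_count_return - calcite_count_return  # n2 - n1

# With the sample row (both counts are 3), nothing was filtered or aggregated:
print(calcite_count_aggregate_filter(3, 3))  # 0
```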

hive_metrics_query_cube_qa

| Column | Type | Description | Sample | D/M | Measure function | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| host | string | the host of the query engine server | cdh-client:10.1.3.91 | | | |
| project | string | project name | PEARVIDEOAPP | | | |
| cube_name | string | cube name | UserActionPhaseOneCube | D | | |
| segment_name | string | segment name | 20201011000000_20201012000000 | D | | |
| cuboid_source | bigint | source cuboid, parsed from the query and the Cube design | 12582912 | D | | the cuboid that best matches the query; it may not actually exist |
| cuboid_target | bigint | target cuboid, already precalculated and serving the source cuboid | 13041664 | D | | the cuboid actually used; answering the query may require post-calculation on it |
| if_match | boolean | whether the source cuboid and the target cuboid are equal | FALSE | D | | |
| filter_mask | bigint | | 4194304 | D | | |
| if_success | boolean | whether the query on this Cube was successful | TRUE | D | | (Open question: meaning unclear) |
| weight_per_hit | double | reciprocal of the number of Cubes hit by a single query | 1 | M | SUM | (Open question: is a pushdown query recorded here?) |
| storage_call_count | bigint | the number of RPC calls for a query that hits this Cube | 1 | M | SUM/MAX | |
| storage_call_time_sum | bigint | sum of the time cost of the RPC calls of a query | 268 | M | SUM/MAX | |
| storage_call_time_max | bigint | max time cost among the RPC calls of a query | 268 | M | SUM/MAX | |
| storage_count_skip | bigint | total row count skipped by the related RPC calls | 0 | M | SUM/MAX | |
| storage_count_scan | bigint | total row count scanned by the related RPC calls | 929 | M | SUM/MAX | |
| storage_count_return | bigint | total row count returned by the related RPC calls | 45 | M | SUM/MAX | |
| storage_count_aggregate_filter | bigint | total row count aggregated and filtered by the related RPC calls, = STORAGE_COUNT_SCAN - STORAGE_COUNT_RETURN | 884 | M | SUM/MAX | |
| storage_count_aggregate | bigint | total row count aggregated by the related RPC calls | 36 | M | SUM/MAX | |
| ktimestamp | bigint | query begin time (timestamp) | 1603462676906 | | | |
| kyear_begin_date | string | query begin time (year) | 2020/1/1 | D | | |
| kmonth_begin_date | string | query begin time (month) | 2020/10/1 | D | | |
| kweek_begin_date | string | query begin time (week, beginning on Sunday) | 2020/10/18 | D | | |
| kday_time | string | query begin time (time of day) | 22:17:56 | D | | |
| ktime_hour | int | query begin time (hour) | 22 | D | | |
| ktime_minute | int | query begin time (minute) | 17 | D | | |
| ktime_second | int | query begin time (second) | 56 | | | |
| kday_date | string | query begin time (day) | 2020/10/23 | | | Hive table partition column |
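Two of the derived columns above follow directly from their definitions: weight_per_hit is the reciprocal of the number of Cubes a query hits, and storage_count_aggregate_filter is scan minus return. A minimal sketch using the table's own sample values (function names are illustrative):

```python
# Sketch: two derived columns of hive_metrics_query_cube_qa.

def weight_per_hit(cubes_hit_by_query):
    # reciprocal of the number of Cubes a single query hits
    return 1.0 / cubes_hit_by_query

def storage_count_aggregate_filter(storage_count_scan, storage_count_return):
    # rows the RPC calls scanned but did not return
    return storage_count_scan - storage_count_return

print(weight_per_hit(1))                        # 1.0
print(storage_count_aggregate_filter(929, 45))  # 884 (matches the sample row)
```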

hive_metrics_query_rpc_qa

| Column | Type | Description | Sample | D/M | Measure function | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| host | string | the host of the query engine server | cdh-client:10.1.3.91 | D | | |
| project | string | project name | LEARN_KYLIN | D | | |
| realization | string | cube name | kylin_sales_cube_SIMPLE | D | | |
| rpc_server | string | the related RPC target server | cdh-worker-2 | D | | |
| exception | string | the exception of an RPC call; "NULL" if there is none | NULL | D | | |
| call_time | bigint | the time cost of an RPC call | 60 | M | SUM/MAX/PERCENTILE_APPROX | |
| count_return | bigint | the row count actually returned | 3 | M | SUM/MAX | |
| count_scan | bigint | the row count actually scanned | 3 | M | SUM/MAX | |
| count_skip | bigint | the row count skipped, e.g. because of fuzzy filters | 0 | M | SUM/MAX | |
| count_aggregate_filter | bigint | the row count actually aggregated and filtered, = COUNT_SCAN - COUNT_RETURN | 0 | M | SUM/MAX | |
| count_aggregate | bigint | the row count actually aggregated | 0 | M | SUM/MAX | |
| ktimestamp | bigint | query begin time (timestamp) | 1600938970918 | | | |
| kyear_begin_date | string | query begin time (year) | 2020/1/1 | D | | |
| kmonth_begin_date | string | query begin time (month) | 2020/9/1 | D | | |
| kweek_begin_date | string | query begin time (week, beginning on Sunday) | 2020/9/20 | D | | |
| kday_time | string | query begin time (time of day) | 17:16:10 | D | | |
| ktime_hour | int | query begin time (hour) | 17 | D | | |
| ktime_minute | int | query begin time (minute) | 16 | D | | |
| ktime_second | int | query begin time (second) | 10 | | | |
| kday_date | string | query begin time (day) | 2020/9/24 | | | Hive table partition column |
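Per the column descriptions, the per-RPC rows in this table roll up into the storage_call_* columns of hive_metrics_query_cube_qa: the count of calls, and the sum and max of their call_time. A minimal sketch under that assumption (the input list of call_time values is hypothetical):

```python
# Sketch: rolling up per-RPC call_time values into the Cube-level
# storage_call_* columns described for hive_metrics_query_cube_qa.

def rollup(call_times):
    """call_times: call_time of every RPC call a query issued against one Cube."""
    return {
        "storage_call_count": len(call_times),     # number of RPC calls
        "storage_call_time_sum": sum(call_times),  # total RPC time
        "storage_call_time_max": max(call_times),  # slowest RPC call
    }

print(rollup([60, 268, 120]))
# {'storage_call_count': 3, 'storage_call_time_sum': 448, 'storage_call_time_max': 268}
```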

hive_metrics_job_qa

| Column | Type | Description | Sample | D/M | Measure function | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| job_id | string | job id | 51b40173-1f6c-7e55-e0ca-fbc84d242ac0 | | | |
| host | string | the host of the job engine server | cdh-client:10.1.3.91 | | | |
| kuser | string | user name | ADMIN | D | | |
| project | string | project name | LEARN_KYLIN | D | | |
| cube_name | string | cube name | kylin_sales_cube_poi | D | | |
| job_type | string | job type: build, merge, optimize | BUILD | D | | (Open question: are these the only three types?) |
| cubing_type | string | in Kylin there are two cubing algorithms: Layered and Fast (In-Memory) | NULL | D | | (Open question: meaning unclear) |
| duration | bigint | the duration from job start to finish | 945001 | M | SUM/MAX/MIN/PERCENTILE_APPROX | |
| table_size | bigint | the size of the data source in bytes | 227964845 | M | SUM/MAX/MIN | |
| cube_size | bigint | the size of the created Cube segment in bytes | 35693596 | M | SUM/MAX/MIN | |
| per_bytes_time_cost | double | DURATION / TABLE_SIZE | 0.00414538 | M | SUM/MAX/MIN | |
| wait_resource_time | bigint | a job may include several MR (MapReduce) jobs, which may wait because of a lack of Hadoop resources | 158146 | M | SUM/MAX/MIN | |
| step_duration_distinct_columns | bigint | | 138586 | M | SUM/MAX | (Open question: these are per-step durations, but how are the steps divided? An actual job has more than 4 steps) |
| step_duration_dictionary | bigint | | 5311 | M | SUM/MAX | |
| step_duration_inmem_cubing | bigint | | 89 | M | SUM/MAX | |
| step_duration_hfile_convert | bigint | | 75382 | M | SUM/MAX | |
| ktimestamp | bigint | job begin time (timestamp) | 1600938458385 | | | |
| kyear_begin_date | string | job begin time (year) | 2020/1/1 | D | | |
| kmonth_begin_date | string | job begin time (month) | 2020/9/1 | D | | |
| kweek_begin_date | string | job begin time (week, beginning on Sunday) | 2020/9/20 | D | | |
| kday_time | string | job begin time (time of day) | 17:07:38 | D | | |
| ktime_hour | int | job begin time (hour) | 17 | D | | |
| ktime_minute | int | job begin time (minute) | 7 | D | | |
| ktime_second | int | job begin time (second) | 38 | | | |
| kday_date | string | job begin time (day) | 2020/9/24 | | | Hive table partition column |
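The per_bytes_time_cost column is defined above as DURATION / TABLE_SIZE; plugging in the sample row's values reproduces its sample value. A minimal sketch (the function name is illustrative):

```python
# Sketch: per_bytes_time_cost = DURATION / TABLE_SIZE,
# checked against the sample row of hive_metrics_job_qa.

def per_bytes_time_cost(duration, table_size):
    return duration / table_size

print(round(per_bytes_time_cost(945001, 227964845), 8))  # 0.00414538
```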

hive_metrics_job_exception_qa

| Column | Type | Description | Sample | D/M | Measure function | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| job_id | string | job id | a333a36d-8e33-811f-9326-f04579cc2464 | | | |
| host | string | the host of the job engine server | cdh-client:10.1.3.91 | | | |
| kuser | string | user name | ADMIN | D | | |
| project | string | project name | LEARN_KYLIN | D | | |
| cube_name | string | cube name | kylin_sales_cube | D | | |
| job_type | string | job type: build, merge, optimize | BUILD | D | | |
| cubing_type | string | in Kylin there are two cubing algorithms: Layered and Fast (In-Memory) | LAYER | D | | |
| exception | string | classifies the exception types that may occur while running a job | org.apache.kylin.job.exception.ExecuteException | D | | |
| ktimestamp | bigint | job begin time (timestamp) | 1600936844611 | | | |
| kyear_begin_date | string | job begin time (year) | 2020/1/1 | D | | |
| kmonth_begin_date | string | job begin time (month) | 2020/9/1 | D | | |
| kweek_begin_date | string | job begin time (week, beginning on Sunday) | 2020/9/20 | D | | |
| kday_time | string | job begin time (time of day) | 16:40:44 | D | | |
| ktime_hour | int | job begin time (hour) | 16 | D | | |
| ktime_minute | int | job begin time (minute) | 40 | D | | |
| ktime_second | int | job begin time (second) | 44 | D | | |
| kday_date | string | job begin time (day) | 2020/9/24 | | | Hive table partition column |