Apache Kylin : Analytical Data Warehouse for Big Data

...

Item                        Value
Instance Type               m5.4xlarge
Node Memory                 64 GB
Node vCPU                   16
Node Disk                   400 * 2; SSD
Network Bandwidth           Up to 10 Gbps
Node Count                  A master node and four worker nodes
Allocated Memory on YARN    202 GB
Allocated Cores on YARN     52
Kylin Version               3.1.2 & 4.0.0
EMR Version                 5.31
Hadoop Version              2.10.0
HBase Version               1.4.13
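
To put the YARN allocation in context, the sketch below relates it to the raw capacity of the cluster. It assumes that YARN containers run only on the four worker nodes and that the allocated totals are cluster-wide; the source states neither, so treat the resulting shares as a rough estimate. All variable names are illustrative.

```python
# Figures copied from the environment table.
WORKER_NODES = 4        # "A master node and four worker nodes"
NODE_MEMORY_GB = 64
NODE_VCPU = 16
YARN_MEMORY_GB = 202
YARN_CORES = 52

# Assumption (not stated in the source): only worker nodes host YARN containers.
total_memory_gb = WORKER_NODES * NODE_MEMORY_GB
total_vcpu = WORKER_NODES * NODE_VCPU

# Share of the workers' raw capacity that is handed to YARN.
memory_share = YARN_MEMORY_GB / total_memory_gb
core_share = YARN_CORES / total_vcpu

print(f"Memory allocated to YARN: {memory_share:.1%} of {total_memory_gb} GB")
print(f"vCPUs allocated to YARN:  {core_share:.1%} of {total_vcpu}")
```

Under these assumptions, roughly four fifths of the workers' memory and vCPUs are available to the cubing and query jobs.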


Benchmark Results

Cubing duration and Storage size

Response Time

Figure-1 : Cubing duration of TPC-H (sf = 10)


Figure-2 : Storage size of TPC-H (sf = 10)


Figure-3 : Avg response time of SSB Query (sf=10)


Figure-4 : Avg response time of TPC-H Query (sf=10)


Conclusions

Cubing duration and cube size.

Compared with Kylin 3's MR cube engine, Kylin 4 reduces the cubing duration by 62.6%. Two factors contribute: higher resource utilization, and the removal of the steps that convert cuboids into a storage-specific data format (HFile).
In Kylin 3, cuboid files are stored in two different formats, whereas Kylin 4 uses Parquet. Parquet offers more efficient encoding and a higher compression ratio, so the disk space used by the same cube drops by 72.56%.
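
A percentage reduction is easy to misread as a speed-up factor. The following minimal sketch converts the two reductions quoted above into "times faster" and "times smaller" factors. The function names are illustrative, and the factors are derived purely from the percentages in the text, not from additional measurements.

```python
def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of the original value left after a reduction of reduction_pct percent."""
    return 1.0 - reduction_pct / 100.0


def improvement_factor(reduction_pct: float) -> float:
    """How many times smaller the new value is than the original."""
    return 1.0 / remaining_fraction(reduction_pct)


# Percentages taken from the conclusions in the text.
cubing_factor = improvement_factor(62.6)     # cubing duration
storage_factor = improvement_factor(72.56)   # cube storage size

print(f"Cubing is about {cubing_factor:.2f}x faster")
print(f"The cube is about {storage_factor:.2f}x smaller")
```

In other words, a 62.6% shorter build corresponds to roughly a 2.7x speed-up, and a 72.56% smaller cube corresponds to roughly a 3.6x reduction in disk space.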

Figure-5 : Kylin 3 (MR engine) has lower resource utilization


Figure-6 : Kylin 4 (New Spark Engine) has a higher and stable resource utilization

Query performance.

In big query scenarios (queries that scan a large number of partitions/files and perform complex on-the-fly calculations on them), Kylin 3 is difficult to optimize: both the HBase RegionServer and the Kylin Query Server need repeated tuning. In stress tests, the query node is unstable because it has to do post-calculation on large data sets, and query latency gets worse over time. Kylin 4 removes the Query Server as a single bottleneck; both response time and QPS improve markedly, and performance stays stable during the stress test. On the TPC-H query set, Kylin 4's response time is 5-7 times better, and its concurrency is 4 times higher.

Figure-7 : P95 response time of TPC-H Query under different concurrency
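
For readers unfamiliar with the metric, P95 is the response time that 95% of queries stay at or below. The source does not say how the benchmark computed it; the sketch below uses the nearest-rank method as one common choice. The function name is illustrative, and the latency samples are invented for the example, not benchmark data.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest 1-based rank that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Invented response times in milliseconds (illustration only).
latencies_ms = [120, 95, 210, 180, 150, 3000, 130, 110, 160, 140,
                170, 100, 190, 105, 125, 135, 145, 155, 165, 175]

p95 = percentile_nearest_rank(latencies_ms, 95)
mean = sum(latencies_ms) / len(latencies_ms)
print(f"P95 = {p95} ms, mean = {mean:.1f} ms")
```

A single slow outlier pulls the mean up sharply while P95 still reflects what most queries experience, which is why tail percentiles are a common way to report latency under concurrent load.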

In the point query scenario (queries that scan a small number of partitions/files and need little on-the-fly calculation), Kylin 4 can meet the sub-second query latency requirement after some simple parameter adjustments, and its performance is close to that of Kylin 3 (to be specific, only slightly worse).

...