Apache Kylin : Analytical Data Warehouse for Big Data
...
Item | Value |
Instance Type | m5.4xlarge |
Node Memory | 64 GB |
Node vCPU | 16 |
Node Disk | 400 * 2; SSD |
Network Bandwidth | Up to 10 Gbps |
Node Count | A master node and four worker nodes |
Allocated Memory on Yarn | 202 GB |
Allocated Cores on Yarn | 52 |
Kylin Version | 3.1.2 & 4.0.0 |
EMR Version | 5.31 |
Hadoop Version | 2.10.0 |
HBase Version | 1.4.13 |
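As a rough cross-check of the table above, the YARN allocation can be compared with the raw capacity of the nodes. The sketch below assumes that only the four worker nodes contribute resources to YARN, which the table does not state; the function and variable names are illustrative.

```python
def yarn_share(allocated: float, nodes: int, per_node: float) -> float:
    """Fraction of raw node capacity that is allocated to YARN."""
    return allocated / (nodes * per_node)


# Assumption: only the four worker nodes host YARN containers.
WORKER_NODES = 4

mem_share = yarn_share(allocated=202, nodes=WORKER_NODES, per_node=64)   # GB
core_share = yarn_share(allocated=52, nodes=WORKER_NODES, per_node=16)   # vCPU

print(f"memory allocated to YARN: {mem_share:.1%}")
print(f"cores allocated to YARN:  {core_share:.1%}")
```

Under that assumption, roughly four fifths of the worker memory and vCPUs are available to YARN, with the remainder left for the operating system and other services.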
Benchmark Results
Cubing duration and Storage size
Figure-1 : Cubing duration of TPC-H (sf = 10)
Figure-2 : Storage size of TPC-H (sf = 10)
Response Time
Figure-3 : Avg response time of SSB Query (sf = 10)
Figure-4 : Avg response time of TPC-H Query (sf = 10)
Conclusions
Cubing duration and cube size.
Compared with the MR cube engine in Kylin 3, Kylin 4 reduces cubing duration by 62.6%, thanks to higher resource utilization and the removal of the steps that convert cuboids into a specific data format (HFile).
In Kylin 3, the cuboid files are stored in two different formats, whereas Kylin 4 uses Parquet. Parquet offers more efficient encoding and a higher compression ratio, so the disk space used by the same cube is reduced by 72.56%.
Figure-5 : Kylin 3 (MR engine) has lower resource utilization
Figure-6 : Kylin 4 (new Spark engine) has higher and more stable resource utilization
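The percentage reductions reported above can also be read as ratios. A minimal sketch of the conversion, using only the two figures given in this section; the function name is made up for illustration.

```python
def reduction_to_factor(reduction_pct: float) -> float:
    """Convert 'reduced by X%' into 'the old value is N times the new value'."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction must be in [0, 100)")
    remaining = 1 - reduction_pct / 100   # fraction of the original that is left
    return 1 / remaining


cubing_factor = reduction_to_factor(62.6)     # cubing duration, Kylin 3 vs Kylin 4
storage_factor = reduction_to_factor(72.56)   # cube storage size, Kylin 3 vs Kylin 4

print(f"cubing duration: Kylin 3 takes about {cubing_factor:.2f}x as long as Kylin 4")
print(f"storage size:    Kylin 3 uses about {storage_factor:.2f}x the space of Kylin 4")
```

With these inputs, the Kylin 3 build takes roughly 2.7 times as long as the Kylin 4 build, and the Kylin 3 cube occupies roughly 3.6 times the disk space.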
Query performance.
In big query scenarios (queries that scan a large number of partitions/files and perform complex on-site calculations over them), Kylin 3 is difficult to optimize: the HBase RegionServer and the Kylin Query Server have to be tuned repeatedly. Under stress testing, the query node is unstable because it has to do post-calculation on a large data set, and query latency degrades over time. Kylin 4 removes the single bottleneck of the Query Server; both response time and QPS improve markedly, and performance stays stable during the stress test. On the TPC-H query set, Kylin 4 improves response time by 5-7 times and concurrency by 4 times.
Figure-7 : P95 response time of TPC-H Query under different concurrency
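For readers reproducing the measurement behind Figure-7, one common way to derive a P95 value from raw latencies is the nearest-rank method, sketched below. The benchmark's own tooling and percentile definition are not described on this page, so this is an assumption, and the sample latencies are invented placeholders rather than benchmark data.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)   # 1-based rank
    return ordered[rank - 1]


# Placeholder latencies in seconds (hypothetical, not taken from the benchmark).
latencies = [0.8, 1.1, 0.9, 4.2, 1.0, 1.3, 0.7, 2.5, 1.2, 0.95]

print("P95 response time:", percentile(latencies, 95), "s")
```

P95 is more sensitive to slow outliers than the average, which is why it is a useful complement to the average response times in Figure-3 and Figure-4 when judging stability under concurrency.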
In point query scenarios (queries that scan a small number of partitions/files and need little on-site calculation), Kylin 4 can meet the sub-second query latency requirement after some simple parameter adjustments, and its performance is close to that of Kylin 3 (to be specific, only slightly worse).
...