
Apache Kylin : Analytical Data Warehouse for Big Data



At the QCon Global Software Developers Conference held on May 29, 2021, Zheng Shengjun, head of Youzan's data infrastructure platform, shared Youzan's experience with Kylin 4.0 and its optimization practices in the session on open source big data frameworks and applications. For the many users of Kylin 2/3 (Kylin on HBase), this is also a chance to learn how and why to upgrade to Kylin 4. The talk is divided into the following parts:


The reason for choosing Kylin 4

History of Kylin in Youzan


First, I would like to share why Youzan chose to upgrade to Kylin 4, starting with a brief review of the history of Youzan's OLAP infrastructure.

In the early days of Youzan, in order to iterate quickly, we chose pre-computation plus MySQL. In 2018, Druid was introduced for its query flexibility and development efficiency, but it had problems such as a low degree of pre-aggregation and no support for precise count distinct measures. In this context, Youzan introduced Apache Kylin and ClickHouse. Kylin supports a high degree of aggregation, precise count distinct measures, and the lowest response time (RT), while ClickHouse is quite flexible to use (ad hoc queries).

Youzan introduced Kylin in 2018 and has now used it for more than three years. With the continuous enrichment of business scenarios and accumulation of data, Youzan currently has 6 million existing merchants, its GMV in 2020 was 107.3 billion, and the daily build data volume is 10 billion+. At present, Kylin covers basically all of Youzan's business scenarios.

The challenges of Kylin 3

With Youzan's rapid development and in-depth use of Kylin, we also encountered some challenges:

  • First, the build performance of Kylin on HBase does not meet our expectations, and build performance affects failure recovery time and the stability that users experience;
  • Second, onboarding more large merchants (tens of millions of members in a single store, and hundreds of thousands of goods per store) brings great challenges to our OLAP system. Kylin on HBase is limited by the single-point Query Server and cannot support these complex scenarios well;
  • Finally, because HBase is not a cloud-native system, it is difficult to scale up and down flexibly. As data volume keeps growing and business load has peaks and valleys, the average resource utilization rate is not high enough.


Faced with these challenges, Youzan chose to upgrade to the more cloud-native Apache Kylin 4.

Introduction to Kylin 4


First, let's introduce the main advantages of Kylin 4. Apache Kylin 4 depends entirely on Spark for cubing jobs and queries. It can make full use of Spark's parallelization, vectorization, and whole-stage code generation to improve the efficiency of large queries.
The following is a brief introduction to the principles of Kylin 4, covering its storage engine, build engine, and query engine.

Storage engine

First, let's look at the new storage engine by comparing Kylin on HBase with Kylin on Parquet. In Kylin on HBase, cuboid data is stored in HBase tables: each segment corresponds to one HBase table, and aggregation is pushed down to the HBase coprocessor.

However, HBase is not a true columnar store, and its throughput is not enough for an OLAP system. Kylin 4 replaces HBase with Parquet, so all data is stored in files. Each segment has a corresponding HDFS directory, and all queries and cubing jobs read and write files without HBase. Although simple queries lose some performance, the improvement for complex queries is more significant and worthwhile.

Build engine

The second is the new build engine. In our test, the build time with Kylin on Parquet dropped from 82 minutes to 15 minutes. There are several reasons:

  • Kylin 4 removes dimension encoding, eliminating the encoding build step;
  • The HBase file generation step is removed;
  • All cubing steps in Kylin 4 are implemented in Spark;
  • Kylin on Parquet changes the granularity of cubing to the cuboid level, which further improves the parallelism of cubing jobs;
  • The global dictionary implementation is enhanced. In the new algorithm, the dictionary and the source data are hashed into the same buckets, so only the matching piece of the dictionary needs to be loaded to encode each bucket of source data.
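The bucketed global dictionary idea in the last bullet can be illustrated with a minimal sketch. This is not Kylin's implementation: the bucket count, the CRC32 hash, and every function name below are assumptions chosen for illustration. The point it shows is that when dictionary entries and source values share one hash function, each partition of source data can be encoded with only its own slice of the dictionary.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen for illustration


def bucket_of(value, num_buckets=NUM_BUCKETS):
    """Stable hash, so a value always lands in the same bucket."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def build_bucketed_dictionary(values, num_buckets=NUM_BUCKETS):
    """Give each distinct value a globally unique integer ID, stored per bucket."""
    buckets = defaultdict(dict)
    for next_id, value in enumerate(sorted(set(values))):
        buckets[bucket_of(value, num_buckets)][value] = next_id
    return dict(buckets)


def encode_partition(rows, bucket_id, dictionary):
    """Encode one partition using only that bucket's piece of the dictionary."""
    piece = dictionary.get(bucket_id, {})
    return [piece[value] for value in rows]


def encode(values, dictionary, num_buckets=NUM_BUCKETS):
    """Partition source values by bucket, then encode each partition independently."""
    partitions = defaultdict(list)
    for value in values:
        partitions[bucket_of(value, num_buckets)].append(value)
    codes = {}
    for bucket_id, rows in partitions.items():
        encoded = encode_partition(rows, bucket_id, dictionary)
        codes.update(zip(rows, encoded))
    return [codes[value] for value in values]
```

In a distributed job, each call to `encode_partition` would run in its own task and load one dictionary piece instead of the whole dictionary, which is where the memory saving would come from.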

As you can see on the right, after upgrading to Kylin 4 the cubing job goes from ten steps to two, and the improvement in build performance is very obvious.

Query engine

Next is the new query engine of Kylin 4. In Kylin on HBase, computation depends entirely on the HBase coprocessor and the query server process. When data is read from HBase into the query server for aggregation, sorting, and so on, the single query server becomes the bottleneck. Kylin 4 instead uses a fully distributed query mechanism based on Spark.

How to optimize performance of Kylin 4


Next, I'd like to share some performance optimizations made by Youzan in Kylin 4.

Optimization of query engine

Dynamic elimination of partitioning dimensions

Kylin 4 has a new capability that older versions lack, which can reduce data reading and computation by dozens of times for some large queries. It is often the case that the partition column is used to filter data but not as a group-by dimension. In those cases Kylin previously always chose a cuboid containing the partition column, but it can now use different cuboids within the same query to reduce I/O and computation.

The key to this optimization is splitting a query into two parts. One part covers the segments whose data is used in full, so the cuboid does not need to include the partition column. The other part covers the segments whose data is only partly used, so it chooses a cuboid with the partition dimension in order to filter the data.
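The split described here can be sketched as follows. This is a minimal illustration under stated assumptions, not Kylin's planner: segments are modelled as half-open integer ranges over the partition column, and `Segment`, `split_by_coverage`, `plan_cuboids`, and the partition column name `dt` are all hypothetical.

```python
from dataclasses import dataclass

PARTITION_COL = "dt"  # hypothetical partition column name


@dataclass(frozen=True)
class Segment:
    name: str
    start: int  # inclusive lower bound on the partition column
    end: int    # exclusive upper bound on the partition column


def split_by_coverage(segments, q_start, q_end):
    """Split segments into those fully inside the filter and those partly inside."""
    fully, partially = [], []
    for seg in segments:
        if seg.end <= q_start or seg.start >= q_end:
            continue  # no overlap with the filter: the segment is pruned
        if q_start <= seg.start and seg.end <= q_end:
            fully.append(seg)
        else:
            partially.append(seg)
    return fully, partially


def plan_cuboids(segments, q_start, q_end, group_dims):
    """Pick the dimension set each segment's cuboid needs for this query."""
    fully, partially = split_by_coverage(segments, q_start, q_end)
    small = tuple(sorted(group_dims))                        # no partition column
    large = tuple(sorted(set(group_dims) | {PARTITION_COL}))  # keeps the filter column
    return [(s.name, small) for s in fully] + [(s.name, large) for s in partially]
```

With segments covering [0, 10), [10, 20), [20, 30), and [30, 40) and a filter of [5, 30) grouped by a hypothetical `shop_id`, the second and third segments are read through the smaller `shop_id` cuboid, only the first needs the `dt` + `shop_id` cuboid, and the fourth is skipped.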

In our tests, response time dropped in some situations from 20s to 6s and from 10s to 3s.

Partition cropping under complex filtering conditions

Users may not need to know the details of partition filtering; this should be a must-have feature.  Shengjun Zheng

Cache Calcite physical plan

The applicable scenarios are limited, so this does not seem suitable for the official website. Shengjun Zheng

Adjust spark configuration


Parquet optimization

... Please try to complete this part. Shengjun Zheng

Optimization of build engine

... Please try to complete this part if you have enough time. Shengjun Zheng

Practice of Kylin 4 in Youzan

After introducing Youzan's performance optimization experience, let's share the results: Kylin 4's practice at Youzan, including the upgrade process and the performance of the online system.

Upgrade metadata to adapt to Kylin 4

For Kylin 3 metadata, which is stored in HBase, we have developed a tool for seamless metadata upgrades. We export the metadata from HBase into local files, then use the tool to transform it and write the new metadata into MySQL. We also updated the operation documents and general principles in the official Apache Kylin wiki. For more details, see: https://wiki.apache.org/confluence/display/KYLIN/How+to+migrate+metadata+to+Kylin+4 .

Here is a general overview of compatibility across the whole process. The project metadata, table metadata, permission-related metadata, and model metadata do not need to be modified. What needs to be modified is the cube metadata, specifically the storage type and the query type used by the cube. After updating these two fields, you need to recalculate the cube signature. Kylin uses this signature internally to avoid problems caused by changes to a cube after it has been defined.
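The cube metadata change can be pictured with a small sketch. Everything in it is an assumption made for illustration: the field names `storage_type`, `engine_type`, and `signature`, the type values passed in, and especially the signature function, which is a stand-in hash and not the algorithm Kylin uses. For a real migration, use the tool described on the wiki page linked above.

```python
import base64
import copy
import hashlib
import json


def compute_signature(cube_desc):
    """Hypothetical stand-in: hash every field except the signature itself."""
    body = {k: v for k, v in cube_desc.items() if k != "signature"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


def migrate_cube_desc(cube_desc, new_storage_type, new_engine_type):
    """Return a copy with the two type fields updated and the signature recomputed."""
    migrated = copy.deepcopy(cube_desc)
    migrated["storage_type"] = new_storage_type  # assumed field name
    migrated["engine_type"] = new_engine_type    # assumed field name
    # The signature must be recalculated last, after all other fields are final.
    migrated["signature"] = compute_signature(migrated)
    return migrated
```

The ordering is the part worth noting: because the signature is derived from the other fields, it has to be recomputed after the two type fields change, otherwise the stored signature no longer matches the cube definition.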

Performance of Kylin 4 on Youzan online system

After migrating the metadata to Kylin 4, let's share the qualitative changes and substantial performance improvements seen in some scenarios. First, in a scenario like Commodity Insight, there are large stores with hundreds of thousands of commodities, and we have to analyze their transactions, traffic, and so on. A single cube contains more than a dozen precise count distinct measures. Precise count distinct is very inefficient unless it is optimized through pre-calculation and bitmaps, and Kylin currently uses bitmaps to support it. In a scenario that requires complex queries sorting hundreds of thousands of commodities by various UV metrics (precise count distinct measures), the RT of Kylin 2 is 27 seconds, while the RT of Kylin 4 is less than 2 seconds.
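To see why bitmaps make precise count distinct compatible with pre-calculation, consider the minimal sketch below. It uses plain Python integers as uncompressed bitmaps and assumes visitor IDs have already been dictionary-encoded to small integers; the function names are hypothetical, and a production system would use a compressed bitmap format instead.

```python
from functools import reduce


def to_bitmap(ids):
    """Set one bit per dictionary-encoded ID."""
    bitmap = 0
    for i in ids:
        bitmap |= 1 << i
    return bitmap


def merge(bitmaps):
    """Union pre-aggregated bitmaps; an ID present in several rows stays one bit."""
    return reduce(lambda a, b: a | b, bitmaps, 0)


def cardinality(bitmap):
    """Precise distinct count is the number of set bits."""
    return bin(bitmap).count("1")


# Two pre-aggregated rows for one commodity: visitors on day 1 and on day 2.
day1 = to_bitmap([0, 1, 2])
day2 = to_bitmap([2, 3])
uv_two_days = cardinality(merge([day1, day2]))  # visitor 2 is counted once
```

Summing the per-day counts would give 5, while merging the bitmaps gives the correct UV of 4. That is why the measure stores a bitmap per pre-aggregated row, not a number.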

What appeals to me most about Kylin 4 is that it is like a car with a manual transmission, where concurrency can be tuned by hand, whereas Kylin on HBase is like an automatic, because its concurrency is tied entirely to the number of regions.

Plan for Kylin 4 in Youzan

... Please try to complete this part. Shengjun Zheng

Reference

