Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.

For Hudi data lake source integration:
- Let Kylin read its source data from Hudi-format datasets in an enterprise's raw or curated Data Lake layers
For Kylin cube rebuild & merge optimization (TBD):
- Enable Hudi as a storage format for Kylin's cuboid files
- Speed up Kylin's cube rebuild by using Hudi's incremental view query to extract only the source data that changed since the timestamp of the last cube build
- Speed up Kylin's cube merge by using Hudi's native compaction on the incremental (delta) cuboid files, or by using Hudi's upsert feature to merge multiple cuboid files into one, much like upserting rows selected from several sources into a single base MOR table

Q2. What problem is this proposal NOT designed to solve?

Other data source types (e.g. Kafka) that do not support Hudi are out of scope
The streaming cube engine is out of scope

Q3. How is it done today, and what are the limits of current practice?

Currently, Kylin connects directly to the Hive source through the Beeline JDBC mechanism, regardless of whether the input format is Hudi;
The current implementation cannot leverage Hudi's native, advanced functionality such as incremental queries and optimized view queries, from which Kylin could benefit through smaller incremental cuboid merges and faster source data extraction

Q4. What is new in your approach and why do you think it will be successful?

For Hudi source integration:
- New approach
  - Accelerate Kylin's cube building process using Hudi's native optimized view query on MOR tables
- Why it will be successful
  - Hudi is released and mature in the big data domain and tech stack, and many companies already use it in the raw and curated layers of their Data Lake
  - The Hudi library is already integrated with Spark DataFrames and Spark SQL, which enables Kylin's Spark engine to query a Hudi source
  - Hudi's Parquet base files, Avro redo logs, and index metadata can be exposed through a Hive external table and input format definition, which Kylin can use to extract the data
For Hudi cuboid storage (TBD):
- New approach
  - Optimize Kylin's cube rebuild process using Hudi's native incremental view query to capture only the changed data, then recalculate and update only the affected cuboid files
  - Optimize Kylin's cube merge process using Hudi's upsert feature to update the cuboid files, rather than the former join & shuffle
- Why it will be successful
  - Hudi supports upsert based on a record's primary key (PK), and each cuboid's dimension key-id can serve as that PK
  - During rebuild and merge operations, Kylin can therefore update the existing cuboid files directly, or merge multiple cuboid files by PK and compact them into base Parquet files

Q5. Who cares? If you are successful, what difference will it make?

Data scientists doing data mining, exploration, reporting, etc. will get shorter cube build times once the new integration feature is enabled in Kylin
Data engineers developing the data model of the DW/DM layer will greatly reduce the implementation and delivery effort for unit tests and performance tests on the cube

Q6. What are the risks?

There are no other risks, since this is only an alternative configuration option for the Hudi source type; Kylin's other components and pipelines will not be affected

Q7. How long will it take?

N/A

Q8. How does it work?

Overall architectural design's logic diagram is as follows:

For Hudi source integration:
- Add a new config item in kylin.properties for the Hudi source type (e.g. isHudiSource=true, HudiType=MOR)
- Add a new ISource interface and implementation using the Hudi native client API
- Use the Hudi client API's optimized view query on top of the Hive external table to extract the source Hudi dataset
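
The source-side steps above can be sketched roughly as follows. This is a minimal illustration, not Kylin's actual configuration loader: the property names are hypothetical (modeled on the examples in the list), and the mapping from table type to query type is an assumption. The only Hudi-specific name used is the Spark datasource option `hoodie.datasource.query.type`.

```python
from typing import Dict

# Hypothetical property names, modeled on the examples in the proposal.
HUDI_SOURCE_FLAG = "isHudiSource"
HUDI_TABLE_TYPE = "HudiType"


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def hudi_read_options(props: Dict[str, str]) -> Dict[str, str]:
    """Return Spark reader options for a Hudi source, or {} if it is disabled."""
    if props.get(HUDI_SOURCE_FLAG, "false").lower() != "true":
        return {}
    table_type = props.get(HUDI_TABLE_TYPE, "COW").upper()
    # Assumption: MOR tables are read through the read-optimized query,
    # every other table type through the default snapshot query.
    query_type = "read_optimized" if table_type == "MOR" else "snapshot"
    return {"hoodie.datasource.query.type": query_type}


# Possible use inside a Spark job (requires Spark with the Hudi bundle):
#   opts = hudi_read_options(parse_properties(open("kylin.properties").read()))
#   df = spark.read.format("hudi").options(**opts).load(base_path)
```

Returning an empty dict when the flag is off keeps the existing Hive path as the default, which matches the intent that the Hudi source is only an alternative option.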
For Hudi cuboid storage(TBD):
- Add a new config item in kylin.properties for the Hudi cuboid storage type (e.g. isHudiCuboidStorage=true)
- Add a new ITarget interface and implementation using the Hudi write API for internal storage of, and operations on, cuboid files
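
As a rough sketch of what the storage-side writer might pass to Hudi's Spark datasource, the helper below builds the write options for an upsert into a MERGE_ON_READ table. The option keys are Hudi's Spark datasource write options; the function name and the example table and field names are hypothetical, and how Kylin would actually lay out cuboid rows is left open by the proposal.

```python
from typing import Dict


def hudi_cuboid_write_options(
    table_name: str, record_key_field: str, precombine_field: str
) -> Dict[str, str]:
    """Build Spark writer options for upserting cuboid rows into a MOR table."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "upsert",
        # The cuboid's dimension key plays the role of the record key (PK).
        "hoodie.datasource.write.recordkey.field": record_key_field,
        # Field used to pick the latest record when keys collide in a batch.
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


# Possible use inside a Spark job (requires Spark with the Hudi bundle);
# the table and column names here are illustrative only:
#   opts = hudi_cuboid_write_options("cuboid_store", "dim_key", "build_ts")
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```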
For cube rebuild with new Hudi source type(TBD):
- Use Hudi's incremental query API to extract only the data changed since the timestamp of the last cube segment
- Use Hudi's upsert API to merge the changed data with the cuboid's existing historical data
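
The incremental extraction step could look something like the sketch below, which turns the last segment's build time into a Hudi begin instant and builds the matching reader options. The helper names are hypothetical, and the sketch assumes the segment timestamp is stored as epoch milliseconds and that the Hudi timeline uses UTC with `yyyyMMddHHmmss` instants; a real deployment would need to match the table's configured timeline timezone and instant precision.

```python
from datetime import datetime, timezone
from typing import Dict


def to_hudi_instant(epoch_millis: int) -> str:
    """Format epoch milliseconds as a yyyyMMddHHmmss instant (UTC assumed)."""
    dt = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return dt.strftime("%Y%m%d%H%M%S")


def incremental_read_options(last_build_millis: int) -> Dict[str, str]:
    """Reader options that return only records committed after the last build."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": to_hudi_instant(last_build_millis),
    }


# Possible use inside a Spark job (requires Spark with the Hudi bundle):
#   opts = incremental_read_options(segment_last_build_time)
#   changed = spark.read.format("hudi").options(**opts).load(base_path)
```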
For cube merge with new Hudi cuboid storage type(TBD):
- Use Hudi's upsert API to merge the two cuboid files
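
To make the merge idea concrete, here is a small in-memory model of a key-based upsert over cuboid rows. It does not call Hudi; it only illustrates the semantics the proposal relies on, with the dimension key acting as the record key. The `combine` parameter is a hypothetical hook: a plain upsert replaces the stored row for a key, so additive measures would likely need custom merge logic, which `combine` stands in for here.

```python
from typing import Callable, Dict, Iterable, Tuple

DimKey = Tuple[str, ...]      # dimension values identifying one cuboid row
Row = Tuple[DimKey, float]    # (dimension key, measure value)


def upsert_cuboid(
    base: Dict[DimKey, float],
    delta: Iterable[Row],
    combine: Callable[[float, float], float] = lambda old, new: new,
) -> Dict[DimKey, float]:
    """Merge delta rows into a copy of base, keyed on the dimension key."""
    merged = dict(base)
    for key, measure in delta:
        if key in merged:
            # Existing key: update in place (overwrite by default).
            merged[key] = combine(merged[key], measure)
        else:
            # New key: insert.
            merged[key] = measure
    return merged


segment_a = {("2020", "US"): 10.0}
segment_b = [(("2020", "US"), 5.0), (("2020", "CN"), 7.0)]

overwritten = upsert_cuboid(segment_a, segment_b)
summed = upsert_cuboid(segment_a, segment_b, combine=lambda old, new: old + new)
```

With the default overwrite behavior the `("2020", "US")` row ends up at 5.0, while the summing variant yields 15.0; which of the two a cube merge needs depends on the measure type.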

Reference

Hudi framework: https://hudi.apache.org/docs/

hive/spark integration support for Hudi: https://hudi.apache.org/docs/querying_data.html
