Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.

For Hudi data lake source integration:
- Let Kylin read its source data from Hudi-format datasets in an enterprise's raw or curated data lake layers
For Kylin cube rebuild and merge optimization (TBD):
- Enable Hudi as a storage format for Kylin's cuboids
- Speed up Kylin's cube rebuild process by using Hudi's incremental view query to extract only the source data that changed since the timestamp of the last cube build
- Speed up Kylin's cube merge process by using Hudi's native compaction functionality on the incremental (delta) cuboid files, or by using Hudi's upsert feature to merge multiple cuboid files into one, similar to upserting into one base MOR table with rows selected from multiple sources

Q2. What problem is this proposal NOT designed to solve?

Other source dataset types (e.g. Kafka) that are not Hudi-enabled are out of scope.
The streaming cube engine, as opposed to the batch cube engine, is out of scope.

Q3. How is it done today, and what are the limits of current practice?

Currently Kylin connects directly to the Hive source through the beeline JDBC mechanism, regardless of whether the input format is Hudi.
This practice cannot leverage Hudi's native, advanced functionality such as incremental queries and read-optimized view queries, from which Kylin could benefit through smaller incremental cuboid merges and faster source data extraction.

Q4. What is new in your approach and why do you think it will be successful?

For Hudi source integration:
- New approach
  - Speed up Kylin's cube building process by using Hudi's native read-optimized view query on MOR tables
- Why it will be successful
  - Hudi is released and mature in the big data ecosystem, and many companies already use it in the raw and curated layers of their data lakes
  - The Hudi library is already integrated with Spark DataFrames and Spark SQL, which lets Kylin's Spark engine query Hudi sources
  - Hudi's Parquet base files, Avro redo logs, index metadata, and so on can be accessed through Hive external tables and input format definitions, which Kylin can leverage for the extraction
For Hudi cuboid storage (TBD):
- New approach
  - Optimize Kylin's cube rebuild process by using Hudi's native incremental view query to capture only the changed data, then recalculate and update only the necessary cuboid files
  - Optimize Kylin's cube merge process by using Hudi's upsert feature to manipulate the cuboid files, instead of the previous join and shuffle
- Why it will be successful
  - Hudi supports upserts based on a record's primary key (PK), and each cuboid's dimension key ID can be treated as that PK
  - Rebuild and merge operations can therefore update the existing cuboid files directly, or merge multiple cuboid files by PK and compact them into base Parquet files
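
The key-based merge described above can be pictured with a small in-memory model. This is only a conceptual sketch, not Hudi's API: the row layout, the `upsert` function, and the sample keys are all hypothetical. It also assumes additive measures (SUM/COUNT); a plain upsert replaces the record that has the same key, so combining measures would need custom merge logic on the Hudi side.

```python
from typing import Dict, Iterable, Tuple

# A cuboid row is modeled as (dimension_key, measure). Both names are
# illustrative and are not taken from Kylin or Hudi.
Row = Tuple[str, int]


def upsert(base: Dict[str, int], delta: Iterable[Row], additive: bool = True) -> Dict[str, int]:
    """Apply delta rows to a copy of base, keyed by dimension key.

    additive=True sums the measure of an existing key (what a cube merge
    needs for SUM/COUNT measures); additive=False overwrites it, which is
    the plain last-write-wins upsert behavior.
    """
    merged = dict(base)
    for key, measure in delta:
        if additive and key in merged:
            merged[key] += measure
        else:
            merged[key] = measure
    return merged


base_cuboid = {"2020-01|US": 10, "2020-01|CN": 7}
delta_cuboid = [("2020-01|US", 5), ("2020-02|US", 3)]

merged = upsert(base_cuboid, delta_cuboid)
overwritten = upsert(base_cuboid, delta_cuboid, additive=False)
```

In this model only the keys present in the delta are touched, which is the property the proposal relies on to avoid a full join and shuffle.
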

Q5. Who cares? If you are successful, what difference will it make?

Data scientists doing data mining, exploration, reporting, and so on will get shorter cube build times once the new integration feature is enabled in Kylin.
Data engineers developing data models for the DW/DM layer will spend much less implementation and delivery effort on unit tests and performance tests of the cube.

Q6. What are the risks?

No significant risk is expected: the Hudi source type is only an optional, configurable alternative, so Kylin's other components and pipelines are not affected.

Q7. How long will it take?

N/A

Q8. How does it work?

The logic diagram of the overall architectural design is as follows:

For Hudi source integration:
- Add new config items in kylin.properties for the Hudi source type (e.g. isHudiSource=true, HudiType=MOR)
- Add a new ISource interface implementation using the Hudi native client API
- Use the Hudi client API's read-optimized view query on top of the Hive external table to extract the source Hudi dataset
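
As a rough illustration of the steps above, the sketch below parses the proposed config keys and maps them to Hudi read options for Spark. The key names follow the examples in the list (spelled here as `isHudiSource` and `HudiType`); the helper functions, the default of `COW`, and the choice of a snapshot query for non-MOR tables are assumptions made for this example. `hoodie.datasource.query.type` is the Hudi Spark datasource option for selecting the query type, and `read_hudi_source` assumes an existing SparkSession with the Hudi bundle on its classpath.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def hudi_read_options(props: Dict[str, str]) -> Dict[str, str]:
    """Return Spark read options for a Hudi source, or {} if it is disabled."""
    if props.get("isHudiSource", "false").lower() != "true":
        return {}
    # Assumption: MOR tables use the read-optimized query (base files only);
    # any other table type falls back to a snapshot query.
    table_type = props.get("HudiType", "COW").upper()
    query_type = "read_optimized" if table_type == "MOR" else "snapshot"
    return {"hoodie.datasource.query.type": query_type}


def read_hudi_source(spark, base_path: str, options: Dict[str, str]):
    """Load the Hudi table at base_path as a DataFrame (needs Spark + Hudi)."""
    return spark.read.format("hudi").options(**options).load(base_path)


# Hypothetical kylin.properties fragment using the proposed keys.
sample = """
# Hudi source settings (proposed, not existing Kylin keys)
isHudiSource=true
HudiType=MOR
"""
options = hudi_read_options(parse_properties(sample))
```

Keeping the option mapping separate from the Spark call makes the config handling easy to unit test without a cluster.
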
For Hudi cuboid storage (TBD):
- Add a new config item in kylin.properties for the Hudi cuboid storage type (e.g. isHudiCuboidStorage=true)
- Add a new ITarget interface and implementation using the Hudi write API for intermediate storage of, and operations on, cuboid files
For cube rebuild with the new Hudi source type (TBD):
- Use Hudi's incremental query API to extract only the data that changed since the timestamp of the last cube segment
- Use Hudi's upsert API to merge the changed data with the existing historical cuboid data
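
The incremental extraction step could be configured roughly as follows. This is a sketch under stated assumptions: the helper names are hypothetical, the last-build timestamp is assumed to be available as epoch milliseconds, and Hudi commit instants are assumed to be comparable as `yyyyMMddHHmmss` strings in UTC (the actual timeline time zone and precision depend on the Hudi version and table config). The two option keys are Hudi Spark datasource read options; Hudi returns records written by commits later than the begin instant.

```python
from datetime import datetime, timezone
from typing import Dict


def to_hudi_instant(epoch_millis: int) -> str:
    """Format epoch milliseconds as a yyyyMMddHHmmss string (UTC assumed)."""
    moment = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return moment.strftime("%Y%m%d%H%M%S")


def incremental_read_options(last_build_millis: int) -> Dict[str, str]:
    """Build read options that pull only commits after the last cube build."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": to_hudi_instant(last_build_millis),
    }


# 1590969600000 ms is 2020-06-01 00:00:00 UTC.
options = incremental_read_options(1590969600000)
```

The resulting dictionary can be passed to `spark.read.format("hudi").options(**options).load(base_path)` in an environment where Spark and Hudi are available.
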
For cube merge with the new Hudi cuboid storage type (TBD):
- Use Hudi's upsert API to merge the two cuboid files
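
One possible shape for the upsert write, sketched with Hudi's Spark datasource options. The function names, the table name, and the field names `dim_key` and `build_ts` are hypothetical; partitioning and key generator settings are left out and would need to be decided for real cuboid data. `upsert_cuboid` assumes a Spark DataFrame and the Hudi bundle on the classpath.

```python
from typing import Dict


def cuboid_upsert_options(table_name: str, key_field: str, precombine_field: str) -> Dict[str, str]:
    """Build Hudi write options for upserting cuboid rows into a MOR table."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        # Record key: the cuboid's dimension key, treated as the PK.
        "hoodie.datasource.write.recordkey.field": key_field,
        # Precombine field: picks the latest row when keys collide in a batch.
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


def upsert_cuboid(df, base_path: str, options: Dict[str, str]) -> None:
    """Upsert the DataFrame's rows into the Hudi table at base_path."""
    df.write.format("hudi").options(**options).mode("append").save(base_path)


# Hypothetical table and field names, for illustration only.
options = cuboid_upsert_options("kylin_cuboid", "dim_key", "build_ts")
```

Merging two cuboid files would then amount to writing the second file's rows into the table that already holds the first, and letting Hudi compaction fold the delta logs into base Parquet files.
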

Reference

Hudi framework: https://hudi.apache.org/docs/

Hive/Spark integration support for Hudi: https://hudi.apache.org/docs/querying_data.html
