
...

  •  For Hudi Data Lake source type integration:
    • Integrate Kylin's source input with Hudi-format datasets in an enterprise's raw or curated Data Lake layers
  •  For Kylin cube rebuild & merge optimization (TBD):
    • Enable Hudi as a storage format type for Kylin's cuboids
    • Accelerate and optimize Kylin's cube rebuild process using Hudi's incremental view query to extract only the source data changed since the timestamp of the last cube build (see the sketch after this list)
    • Accelerate and optimize Kylin's cube merge process using Hudi's native compaction functionality for the delta incremental cuboid files, or use Hudi's upsert feature to merge multiple cuboid files into one, like upserting one base MOR table with multiple selected row sets
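
The incremental extraction idea above can be sketched as follows; this is a minimal, illustrative Java/Spark snippet assuming Hudi's Spark DataSource options ("hoodie.datasource.query.type", "hoodie.datasource.read.begin.instanttime"), and the path and timestamp are placeholders, not Kylin code:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class IncrementalSourceExtract {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("kylin-hudi-incremental-read")
                    .getOrCreate();

            // Placeholder: instant of the last successful cube build
            String lastBuildInstant = "20200601000000";

            // Incremental query: only commits after lastBuildInstant are scanned,
            // instead of re-reading the whole source table.
            Dataset<Row> changed = spark.read().format("hudi")
                    .option("hoodie.datasource.query.type", "incremental")
                    .option("hoodie.datasource.read.begin.instanttime", lastBuildInstant)
                    .load("hdfs:///data/lake/curated/orders"); // placeholder path

            changed.show();
        }
    }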

Q2. What problem is this proposal NOT designed to solve?

  •  Any other type of source dataset (e.g., Kafka) without Hudi enablement is not within this scope
  •  The streaming CubeEngine, as opposed to the batch CubeEngine, is not within this scope

Q3. How is it done today, and what are the limits of current practice?

  •  Currently, Kylin uses the Beeline JDBC mechanism to connect directly to the Hive source, regardless of whether the input format is Hudi or not (illustrated in the sketch after this list)
  •  This practice can't leverage Hudi's native & advanced functionality, such as incremental queries, read-optimized view queries, etc., from which Kylin could benefit through smaller incremental cuboid merges and faster source data extraction
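
For contrast, the current format-agnostic access path looks roughly like the following Java sketch: Kylin reaches the source through HiveServer2 (Beeline rides on the same JDBC driver), so whether the underlying table is Hudi or plain Hive is invisible at this layer. Host, database, and table names are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcRead {
        public static void main(String[] args) throws Exception {
            // Same HiveServer2 endpoint that Beeline talks to (placeholder host/db);
            // requires the hive-jdbc driver on the classpath.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hiveserver2:10000/default", "kylin", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT * FROM curated_orders")) {
                // Full-table read: no notion of "rows changed since the last build".
                while (rs.next()) {
                    // consume flat-table rows ...
                }
            }
        }
    }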

...

  •  For Hudi source integration:
    • New Approach
      • Accelerate Kylin's cube building process using Hudi's native read-optimized view query on MOR tables (see the sketch after this list)
    • Why it will be successful
      • Hudi has been released and is mature in the big data market & tech stack; many companies already use it in the Data Lake/raw/curated data layers
      • The Hudi library is already integrated with Spark DataFrames/Spark SQL, which enables Kylin's Spark engine to query the Hudi source
      • Hudi's Parquet base files and Avro redo logs, as well as its index metadata etc., can be connected via Hive's external table and input format definition, which Kylin can leverage to successfully perform the extraction
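
A minimal sketch of the proposed read path, assuming Hudi's Spark DataSource options: a read-optimized query against a MOR table serves only the compacted base files, trading some data freshness for scan speed during cube build (the path is a placeholder):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ReadOptimizedSourceExtract {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("kylin-hudi-ro-read")
                    .getOrCreate();

            // Read-optimized query on a MOR table: only compacted base files,
            // no merge with delta logs at read time.
            Dataset<Row> source = spark.read().format("hudi")
                    .option("hoodie.datasource.query.type", "read_optimized")
                    .load("hdfs:///data/lake/curated/orders"); // placeholder path

            source.createOrReplaceTempView("kylin_flat_source");
            spark.sql("SELECT COUNT(*) FROM kylin_flat_source").show();
        }
    }
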
  •  For Hudi cuboid storage (TBD):
    • New Approach
      • Optimize Kylin's cube rebuild process using Hudi's native incremental view query to capture only the changed data and re-calculate & update only the necessary cuboid files
      • Optimize Kylin's cube merge process using Hudi's upsert feature to manipulate the cuboid files, rather than the former join & shuffle
    • Why it will be successful
      • Hudi supports upsert based on the PK of a record, and each cuboid's dimension key-id can be treated as the PK (see the sketch after this list)
      • So during rebuild & merge operations, Kylin can directly update former cuboid files, or merge multiple cuboid files based on the PK and compact them into base Parquet files
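
The PK idea can be sketched as below; this is a hedged illustration only, and the column names (cuboid_dim_key, build_ts) and paths are assumptions, not Kylin's actual cuboid schema:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class CuboidUpsert {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("kylin-hudi-cuboid-upsert").getOrCreate();

            // Placeholder: freshly calculated cuboid rows keyed by dimension values.
            Dataset<Row> cuboidRows = spark.read().parquet("hdfs:///kylin/cuboid_delta");

            cuboidRows.write().format("hudi")
                    .option("hoodie.table.name", "kylin_cuboid_store")
                    .option("hoodie.datasource.write.operation", "upsert")
                    // The concatenated dimension key acts as the record PK,
                    // so re-written cuboid rows update in place.
                    .option("hoodie.datasource.write.recordkey.field", "cuboid_dim_key")
                    .option("hoodie.datasource.write.precombine.field", "build_ts")
                    .mode(SaveMode.Append)
                    .save("hdfs:///kylin/hudi/cuboid_store"); // placeholder path
        }
    }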

...

  •  Data scientists doing data mining/exploration/reporting etc. will get shorter cube building times if the new integration feature is enabled in Kylin
  •  Data engineers developing the data models of the DW/DM layer will greatly reduce the implementation & delivery effort for unit tests/performance tests on the cube

Q6. What are the risks?

There is no additional risk, as this is just an alternative configuration option for the Hudi source type; Kylin's other components & pipelines won't be affected.

Q7. How long will it take?

N/A

Q8. How does it work?

The overall architectural design's logic is as follows:

  •  For Hudi source integration:
    • Add a new config item in kylin.properties for the Hudi source type (e.g., isHudiSource=true, HudiType=MOR)
    • Add a new ISource interface and implementation using the Hudi native client API (see the sketch after this list)
    • Use the Hudi client API's read-optimized view query on top of the Hive external table to extract the source Hudi dataset
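
The wiring could look roughly like this purely hypothetical Java skeleton; the interface and method names are illustrative stand-ins rather than Kylin's actual ISource contract:

    import java.util.Properties;

    public class HudiSourceFactory {

        // Hypothetical, simplified stand-in for Kylin's source SPI.
        interface SourceAdapter {
            String extractionQueryType();
        }

        static class HudiSource implements SourceAdapter {
            private final String hudiType; // e.g. "MOR", from the proposed HudiType property

            HudiSource(String hudiType) { this.hudiType = hudiType; }

            @Override
            public String extractionQueryType() {
                // MOR sources would be read via Hudi's read-optimized query type.
                return "MOR".equalsIgnoreCase(hudiType) ? "read_optimized" : "snapshot";
            }
        }

        static SourceAdapter fromConfig(Properties kylinProps) {
            if (Boolean.parseBoolean(kylinProps.getProperty("isHudiSource", "false"))) {
                return new HudiSource(kylinProps.getProperty("HudiType", "MOR"));
            }
            throw new IllegalStateException("Non-Hudi sources keep the existing Hive path");
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("isHudiSource", "true"); // proposed kylin.properties flag
            props.setProperty("HudiType", "MOR");
            System.out.println(fromConfig(props).extractionQueryType()); // read_optimized
        }
    }
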
  •  For Hudi cuboid storage (TBD):
    • Add a new config item in kylin.properties for the Hudi cuboid storage type (e.g., isHudiCuboidStorage=true)
    • Add a new ITarget interface and implementation using Hudi's write API for the internal storage and operations of cuboid files (see the sketch after this list)
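
Extending the earlier upsert sketch, the cuboid store writer could enable Hudi's inline compaction so that delta log files accumulated by incremental cuboid updates are periodically folded into base Parquet files; table type, key columns, commit threshold, and paths are illustrative assumptions:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class CuboidStoreWriter {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("kylin-hudi-cuboid-store").getOrCreate();

            // Placeholder: cuboid delta produced by a build step.
            Dataset<Row> delta = spark.read().parquet("hdfs:///kylin/cuboid_delta");

            delta.write().format("hudi")
                    .option("hoodie.table.name", "kylin_cuboid_store")
                    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
                    .option("hoodie.datasource.write.operation", "upsert")
                    .option("hoodie.datasource.write.recordkey.field", "cuboid_dim_key")
                    .option("hoodie.datasource.write.precombine.field", "build_ts")
                    // Fold delta logs into base Parquet files every 5 delta commits.
                    .option("hoodie.compact.inline", "true")
                    .option("hoodie.compact.inline.max.delta.commits", "5")
                    .mode(SaveMode.Append)
                    .save("hdfs:///kylin/hudi/cuboid_store"); // placeholder path
        }
    }
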
  •  For cube rebuild with the new Hudi source type (TBD):
    • Use Hudi's incremental query API to extract only the data changed since the cube segment's last build timestamp
    • Use Hudi's upsert API to merge the changed data with the cuboid's former history data (see the sketch after this list)
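
Putting the two steps together, a hedged end-to-end sketch of the rebuild flow: incrementally read the source rows committed after the segment's last build instant, re-aggregate them, and upsert only the affected cuboid rows. The dimension columns, key format, and paths are all assumptions, and merging additive measures would additionally need a custom merge strategy, which this sketch glosses over:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.concat_ws;
    import static org.apache.spark.sql.functions.lit;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class IncrementalCubeRebuild {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("kylin-hudi-rebuild").getOrCreate();

            String lastSegmentInstant = "20200601000000"; // placeholder segment timestamp

            // 1. Extract only source rows committed after the segment's last build.
            Dataset<Row> changed = spark.read().format("hudi")
                    .option("hoodie.datasource.query.type", "incremental")
                    .option("hoodie.datasource.read.begin.instanttime", lastSegmentInstant)
                    .load("hdfs:///data/lake/curated/orders"); // placeholder path

            // 2. Re-aggregate the changed rows into cuboid rows (illustrative dims).
            Dataset<Row> cuboidDelta = changed.groupBy("dim_a", "dim_b").count()
                    .withColumn("cuboid_dim_key", concat_ws("|", col("dim_a"), col("dim_b")))
                    .withColumn("build_ts", lit(System.currentTimeMillis()));

            // 3. Upsert only the affected cuboid rows, keyed by the dimension key.
            cuboidDelta.write().format("hudi")
                    .option("hoodie.table.name", "kylin_cuboid_store")
                    .option("hoodie.datasource.write.operation", "upsert")
                    .option("hoodie.datasource.write.recordkey.field", "cuboid_dim_key")
                    .option("hoodie.datasource.write.precombine.field", "build_ts")
                    .mode(SaveMode.Append)
                    .save("hdfs:///kylin/hudi/cuboid_store"); // placeholder path
        }
    }
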
  •  For cube merge with the new Hudi cuboid storage type (TBD):
    • Use Hudi's upsert API to merge the two cuboid files (see the sketch below)
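
A hedged sketch of merge-by-upsert: read the second segment's cuboid rows and upsert them into the first segment's table, letting rows that share a cuboid_dim_key collapse via the precombine field. Summing measures across segments (rather than last-write-wins) would need a custom Hudi payload, which this sketch glosses over; all paths and names are placeholders:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class CuboidSegmentMerge {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("kylin-hudi-segment-merge").getOrCreate();

            // Read the younger segment's cuboid rows (placeholder path).
            Dataset<Row> segment2 = spark.read().format("hudi")
                    .load("hdfs:///kylin/hudi/cuboid_segment_2");

            // Upsert into the older segment's table: rows with the same
            // cuboid_dim_key collapse instead of going through a join & shuffle.
            segment2.write().format("hudi")
                    .option("hoodie.table.name", "kylin_cuboid_segment_1")
                    .option("hoodie.datasource.write.operation", "upsert")
                    .option("hoodie.datasource.write.recordkey.field", "cuboid_dim_key")
                    .option("hoodie.datasource.write.precombine.field", "build_ts")
                    .mode(SaveMode.Append)
                    .save("hdfs:///kylin/hudi/cuboid_segment_1"); // placeholder path
        }
    }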

...