
...

  •  For Hudi Data Lake source type integration:
    • Integrate Kylin's source input with Hudi-format datasets in an enterprise's raw or curated Data Lake layers
  •  For Kylin cube rebuild & merge optimization (TBD):
    • Enable Hudi as a storage format type for Kylin's cuboids
    • Accelerate and optimize Kylin's cube rebuild process using Hudi's incremental view query to extract only the source data changed since the timestamp of the last cube build (see the sketch after this list)
    • Accelerate and optimize Kylin's cube merge process using Hudi's native compaction functionality for the delta incremental cuboid files, or use Hudi's upsert feature to merge multiple cuboid files into one, like upserting one base MOR table with multiple selected row sets
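
The incremental extraction idea above can be sketched as follows; this is a minimal, illustrative Java/Spark snippet assuming Hudi's Spark DataSource options ("hoodie.datasource.query.type", "hoodie.datasource.read.begin.instanttime"), and the path and timestamp are placeholders, not Kylin code:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class IncrementalSourceExtract {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("kylin-hudi-incremental-read")
                    .getOrCreate();

            // Placeholder: instant of the last successful cube build
            String lastBuildInstant = "20200601000000";

            // Incremental query: only commits after lastBuildInstant are scanned,
            // instead of re-reading the whole source table.
            Dataset<Row> changed = spark.read().format("hudi")
                    .option("hoodie.datasource.query.type", "incremental")
                    .option("hoodie.datasource.read.begin.instanttime", lastBuildInstant)
                    .load("hdfs:///data/lake/curated/orders"); // placeholder path

            changed.show();
        }
    }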

Q2. What problem is this proposal NOT designed to solve?

  •  Any other type of source dataset (e.g., Kafka) without Hudi enablement is not within this scope
  •  The streaming CubeEngine, as opposed to the batch CubeEngine, is not within this scope

Q3. How is it done today, and what are the limits of current practice?

  •  Currently, Kylin uses the Beeline JDBC mechanism to connect directly to the Hive source, regardless of whether the input format is Hudi or not (illustrated in the sketch after this list)
  •  This practice can't leverage Hudi's native & advanced functionality, such as incremental queries, read-optimized view queries, etc., from which Kylin could benefit through smaller incremental cuboid merges and faster source data extraction
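
For contrast, the current format-agnostic access path looks roughly like the following Java sketch: Kylin reaches the source through HiveServer2 (Beeline rides on the same JDBC driver), so whether the underlying table is Hudi or plain Hive is invisible at this layer. Host, database, and table names are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcRead {
        public static void main(String[] args) throws Exception {
            // Same HiveServer2 endpoint that Beeline talks to (placeholder host/db);
            // requires the hive-jdbc driver on the classpath.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hiveserver2:10000/default", "kylin", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT * FROM curated_orders")) {
                // Full-table read: no notion of "rows changed since the last build".
                while (rs.next()) {
                    // consume flat-table rows ...
                }
            }
        }
    }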

...

  •  For Hudi source integration:
    • New Approach
      • Accelerate Kylin's cube building process using Hudi's native read-optimized view query on MOR tables (see the sketch after this list)
    • Why it will be successful
      • Hudi has been released and is mature in the big data market & tech stack; many companies already use it in the Data Lake/raw/curated data layers
      • The Hudi library is already integrated with Spark DataFrames/Spark SQL, which enables Kylin's Spark engine to query the Hudi source
      • Hudi's Parquet base files and Avro redo logs, as well as its index metadata etc., can be connected via Hive's external table and input format definition, which Kylin can leverage to successfully perform the extraction
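
A minimal sketch of the proposed read path, assuming Hudi's Spark DataSource options: a read-optimized query against a MOR table serves only the compacted base files, trading some data freshness for scan speed during cube build (the path is a placeholder):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ReadOptimizedSourceExtract {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("kylin-hudi-ro-read")
                    .getOrCreate();

            // Read-optimized query on a MOR table: only compacted base files,
            // no merge with delta logs at read time.
            Dataset<Row> source = spark.read().format("hudi")
                    .option("hoodie.datasource.query.type", "read_optimized")
                    .load("hdfs:///data/lake/curated/orders"); // placeholder path

            source.createOrReplaceTempView("kylin_flat_source");
            spark.sql("SELECT COUNT(*) FROM kylin_flat_source").show();
        }
    }
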
  •  For Hudi cuboid storage (TBD):
    • New Approach
      • Optimize Kylin's cube rebuild process using Hudi's native incremental view query to capture only the changed data and re-calculate & update only the necessary cuboid files
      • Optimize Kylin's cube merge process using Hudi's upsert feature to manipulate the cuboid files, rather than the former join & shuffle
    • Why it will be successful
      • Hudi supports upsert based on the PK of a record, and each cuboid's dimension key-id can be treated as the PK (see the sketch after this list)
      • So during rebuild & merge operations, Kylin can directly update former cuboid files, or merge multiple cuboid files based on the PK and compact them into base Parquet files
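
The PK idea can be sketched as below; this is a hedged illustration only, and the column names (cuboid_dim_key, build_ts) and paths are assumptions, not Kylin's actual cuboid schema:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class CuboidUpsert {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("kylin-hudi-cuboid-upsert").getOrCreate();

            // Placeholder: freshly calculated cuboid rows keyed by dimension values.
            Dataset<Row> cuboidRows = spark.read().parquet("hdfs:///kylin/cuboid_delta");

            cuboidRows.write().format("hudi")
                    .option("hoodie.table.name", "kylin_cuboid_store")
                    .option("hoodie.datasource.write.operation", "upsert")
                    // The concatenated dimension key acts as the record PK,
                    // so re-written cuboid rows update in place.
                    .option("hoodie.datasource.write.recordkey.field", "cuboid_dim_key")
                    .option("hoodie.datasource.write.precombine.field", "build_ts")
                    .mode(SaveMode.Append)
                    .save("hdfs:///kylin/hudi/cuboid_store"); // placeholder path
        }
    }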

...

  •  Data scientists doing data mining/exploration/reporting etc. will get shorter cube building times if the new integration feature is enabled in Kylin
  •  Data engineers developing the data models of the DW/DM layer will greatly reduce the implementation & delivery effort for unit tests/performance tests on the cube

Q6. What are the risks?

There is no additional risk, as this is just an alternative configuration option for the Hudi source type; Kylin's other components & pipelines won't be affected.

Q7. How long will it take?

N/A

Q8. How does it work?

The overall architectural design's logic is as follows:

  •  For Hudi source integration:
    • Add a new config item in kylin.properties for the Hudi source type (e.g., isHudiSource=true, HudiType=MOR)
    • Add a new ISource interface and implementation using the Hudi native client API (see the sketch after this list)
    • Use the Hudi client API's read-optimized view query on top of the Hive external table to extract the source Hudi dataset
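
The wiring could look roughly like this purely hypothetical Java skeleton; the interface and method names are illustrative stand-ins rather than Kylin's actual ISource contract:

    import java.util.Properties;

    public class HudiSourceFactory {

        // Hypothetical, simplified stand-in for Kylin's source SPI.
        interface SourceAdapter {
            String extractionQueryType();
        }

        static class HudiSource implements SourceAdapter {
            private final String hudiType; // e.g. "MOR", from the proposed HudiType property

            HudiSource(String hudiType) { this.hudiType = hudiType; }

            @Override
            public String extractionQueryType() {
                // MOR sources would be read via Hudi's read-optimized query type.
                return "MOR".equalsIgnoreCase(hudiType) ? "read_optimized" : "snapshot";
            }
        }

        static SourceAdapter fromConfig(Properties kylinProps) {
            if (Boolean.parseBoolean(kylinProps.getProperty("isHudiSource", "false"))) {
                return new HudiSource(kylinProps.getProperty("HudiType", "MOR"));
            }
            throw new IllegalStateException("Non-Hudi sources keep the existing Hive path");
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("isHudiSource", "true"); // proposed kylin.properties flag
            props.setProperty("HudiType", "MOR");
            System.out.println(fromConfig(props).extractionQueryType()); // read_optimized
        }
    }
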
  •  For Hudi cuboid storage (TBD):
    • Add a new config item in kylin.properties for the Hudi cuboid storage type (e.g., isHudiCuboidStorage=true)
    • Add a new ITarget interface and implementation using Hudi's write API for the internal storage and operations of cuboid files (see the sketch after this list)
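
Extending the earlier upsert sketch, the cuboid store writer could enable Hudi's inline compaction so that delta log files accumulated by incremental cuboid updates are periodically folded into base Parquet files; table type, key columns, commit threshold, and paths are illustrative assumptions:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class CuboidStoreWriter {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("kylin-hudi-cuboid-store").getOrCreate();

            // Placeholder: cuboid delta produced by a build step.
            Dataset<Row> delta = spark.read().parquet("hdfs:///kylin/cuboid_delta");

            delta.write().format("hudi")
                    .option("hoodie.table.name", "kylin_cuboid_store")
                    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
                    .option("hoodie.datasource.write.operation", "upsert")
                    .option("hoodie.datasource.write.recordkey.field", "cuboid_dim_key")
                    .option("hoodie.datasource.write.precombine.field", "build_ts")
                    // Fold delta logs into base Parquet files every 5 delta commits.
                    .option("hoodie.compact.inline", "true")
                    .option("hoodie.compact.inline.max.delta.commits", "5")
                    .mode(SaveMode.Append)
                    .save("hdfs:///kylin/hudi/cuboid_store"); // placeholder path
        }
    }
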
  •  For cube rebuild with the new Hudi source type (TBD):
    • Use Hudi's incremental query API to extract only the data changed since the cube segment's last build timestamp
    • Use Hudi's upsert API to merge the changed data with the cuboid's former history data (see the sketch after this list)
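
Putting the two steps together, a hedged end-to-end sketch of the rebuild flow: incrementally read the source rows committed after the segment's last build instant, re-aggregate them, and upsert only the affected cuboid rows. The dimension columns, key format, and paths are all assumptions, and merging additive measures would additionally need a custom merge strategy, which this sketch glosses over:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.concat_ws;
    import static org.apache.spark.sql.functions.lit;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class IncrementalCubeRebuild {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("kylin-hudi-rebuild").getOrCreate();

            String lastSegmentInstant = "20200601000000"; // placeholder segment timestamp

            // 1. Extract only source rows committed after the segment's last build.
            Dataset<Row> changed = spark.read().format("hudi")
                    .option("hoodie.datasource.query.type", "incremental")
                    .option("hoodie.datasource.read.begin.instanttime", lastSegmentInstant)
                    .load("hdfs:///data/lake/curated/orders"); // placeholder path

            // 2. Re-aggregate the changed rows into cuboid rows (illustrative dims).
            Dataset<Row> cuboidDelta = changed.groupBy("dim_a", "dim_b").count()
                    .withColumn("cuboid_dim_key", concat_ws("|", col("dim_a"), col("dim_b")))
                    .withColumn("build_ts", lit(System.currentTimeMillis()));

            // 3. Upsert only the affected cuboid rows, keyed by the dimension key.
            cuboidDelta.write().format("hudi")
                    .option("hoodie.table.name", "kylin_cuboid_store")
                    .option("hoodie.datasource.write.operation", "upsert")
                    .option("hoodie.datasource.write.recordkey.field", "cuboid_dim_key")
                    .option("hoodie.datasource.write.precombine.field", "build_ts")
                    .mode(SaveMode.Append)
                    .save("hdfs:///kylin/hudi/cuboid_store"); // placeholder path
        }
    }
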
  •  For cube merge with the new Hudi cuboid storage type (TBD):
    • Use Hudi's upsert API to merge the two cuboid files (see the sketch below)
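
A hedged sketch of merge-by-upsert: read the second segment's cuboid rows and upsert them into the first segment's table, letting rows that share a cuboid_dim_key collapse via the precombine field. Summing measures across segments (rather than last-write-wins) would need a custom Hudi payload, which this sketch glosses over; all paths and names are placeholders:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class CuboidSegmentMerge {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("kylin-hudi-segment-merge").getOrCreate();

            // Read the younger segment's cuboid rows (placeholder path).
            Dataset<Row> segment2 = spark.read().format("hudi")
                    .load("hdfs:///kylin/hudi/cuboid_segment_2");

            // Upsert into the older segment's table: rows with the same
            // cuboid_dim_key collapse instead of going through a join & shuffle.
            segment2.write().format("hudi")
                    .option("hoodie.table.name", "kylin_cuboid_segment_1")
                    .option("hoodie.datasource.write.operation", "upsert")
                    .option("hoodie.datasource.write.recordkey.field", "cuboid_dim_key")
                    .option("hoodie.datasource.write.precombine.field", "build_ts")
                    .mode(SaveMode.Append)
                    .save("hdfs:///kylin/hudi/cuboid_segment_1"); // placeholder path
        }
    }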

...