
Apache Kylin : Analytical Data Warehouse for Big Data


...

  •  For Hudi Data Lake source type integration:
    • Integrate Kylin's source input with datasets stored in Hudi format in an enterprise's raw or curated Data Lake layers (see the sketch after this list)
  •  For Kylin cube rebuild & merge optimization (TBD, out of this scope):
    • Enable a Hudi-based storage format for Kylin's cuboids, and accelerate Kylin's cube rebuild process by using Hudi's upsert and incremental view query to extract only the source data that changed since the timestamp of the last cube build
    • Accelerate Kylin's cube merge process by using Hudi's native compaction on the delta incremental cuboid files, or by using Hudi's upsert feature to merge multiple cuboid files into one, similar to upserting several sets of selected rows into a single MOR table
    • However, a cube rebuild still needs to recompute measures (sum, count, etc.) over the whole raw flat table, so incrementally updating the raw flat table alone cannot bring much performance uplift to the overall rebuild & merge process, and it would also mean large architectural changes to the cuboid storage
    • Therefore this part of the work is out of the scope of this KIP
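
As a rough illustration of the in-scope part only, the spark-shell style sketch below shows how Kylin's Spark engine could load a Hudi dataset as its flat-table source. The base path and table contents are placeholders; it assumes the shell was started with the Hudi Spark bundle on the classpath and with spark.serializer set to KryoSerializer, and the short format name "hudi" assumes a reasonably recent Hudi release (older releases use "org.apache.hudi").

    // Snapshot query: read the latest view of a Hudi dataset as a plain DataFrame,
    // which Kylin's Spark build engine could then consume as the flat-table input.
    val hudiBasePath = "hdfs:///datalake/curated/orders"   // placeholder Data Lake path
    val flatTableDf = spark.read.format("hudi").load(hudiBasePath)

    // From here on, the flat table can be registered and aggregated as usual.
    flatTableDf.createOrReplaceTempView("kylin_flat_table")
    spark.sql("SELECT COUNT(*) FROM kylin_flat_table").show()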

Q2. What problem is this proposal NOT designed to solve?

...

  •  Currently, Kylin uses the Beeline JDBC mechanism to connect directly to the Hive source, regardless of whether the underlying input format is Hudi or not;
  •  Today's implementation cannot leverage Hudi's native & advanced functionality, such as incremental queries and read-optimized view queries, from which Kylin could benefit through smaller incremental cuboid merges and faster source data extraction;
  •  A customer's raw/curated data may be written to Hudi in several ways (e.g. via Spark DataFrames or Spark SQL), so Hive does not necessarily know about the Hudi source format of the raw/curated data when Kylin uses it to extract the source dataset (see the sketch below)
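
For the existing Hive/Beeline path to read a Hudi dataset at all, the Hudi table normally has to be registered in the Hive metastore with Hudi's Hive input format (Hudi's Hive sync tool usually generates this DDL). A simplified sketch of such a registration, issued through Spark SQL with a hypothetical database, columns and location, might look as follows; the exact DDL depends on the Hudi and Hive versions in use.

    // Register (a simplified version of) the read-optimized view of a Hudi table in the
    // Hive metastore, so that Beeline/Hive queries can scan its Parquet base files.
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS datalake.orders_hudi_ro (
        order_id STRING,
        amount   DOUBLE,
        ts       BIGINT
      )
      ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
      STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
      LOCATION 'hdfs:///datalake/curated/orders'
    """)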

Q4. What is new in your approach and why do you think it will be successful?

  •  For Hudi Source integration:
    • New approach
      • Accelerate Kylin's cube building process by using Hudi's native read-optimized view query against MOR tables (see the sketch below)
    • Why it will be successful
      • Hudi is already released and mature in the big-data domain & tech stack; many companies are using it in their Data Lake / raw / curated data layers
      • The Hudi library is already integrated with Spark DataFrames / Spark SQL, which enables Kylin's Spark engine to query the Hudi source
      • Hudi's Parquet base files and Avro delta logs, as well as the index metadata etc., can be exposed via a Hive external table and input format definition, which Kylin can leverage to extract the source data
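
A minimal sketch of the read-optimized query against a MOR table from Kylin's Spark engine (spark-shell style, path assumed as before; the option key hoodie.datasource.query.type is the name used by current Hudi releases, older ones call it hoodie.datasource.view.type):

    // Read-optimized query on a Merge-on-Read table: only the compacted Parquet
    // base files are scanned, skipping the Avro delta logs, which keeps the
    // flat-table extraction for cube building fast.
    val readOptimizedDf = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "read_optimized")
      .load("hdfs:///datalake/curated/orders")

    readOptimizedDf.createOrReplaceTempView("kylin_flat_table_ro")
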
  •  For Hudi Cuboid storage (TBD):
    • New approach
      • Optimize Kylin's cube rebuild process by using Hudi's native incremental view query to capture only the changed data and to re-calculate & update only the affected cuboid files
      • Optimize Kylin's cube merge process by using Hudi's upsert feature to manipulate the cuboid files, instead of the current join & shuffle approach
    • Why it will be successful
      • Hudi supports upsert based on a record's primary key, and each cuboid's dimension key can be treated as that primary key (see the sketch below)
      • During rebuild & merge operations, Hudi can therefore directly update the existing cuboid files, or merge multiple cuboid files based on the primary key and compact them into base Parquet files
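
To make the primary-key idea concrete, the sketch below upserts a DataFrame of cuboid rows into a Hudi MOR table, using the cuboid dimension key as the Hudi record key. Table name, path and column names (dimension_key, build_ts) are hypothetical, and the helper itself only illustrates how a Hudi-backed cuboid store could behave.

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Hypothetical helper: store (or update) one cuboid's rows in a Hudi MOR table,
    // using the cuboid dimension key as Hudi's record key so that later writes
    // overwrite the matching rows instead of requiring a join & shuffle.
    def upsertCuboid(cuboidRows: DataFrame, cuboidPath: String): Unit = {
      cuboidRows.write.format("hudi")
        .option("hoodie.table.name", "kylin_cuboid_demo")                    // placeholder table name
        .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "dimension_key")  // cuboid dimension key as PK
        .option("hoodie.datasource.write.precombine.field", "build_ts")      // latest build wins on key conflict
        .mode(SaveMode.Append)
        .save(cuboidPath)
    }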

Q5. Who cares? If you are successful, what difference will it make?

  •  Data scientists, who do data mining/exploration/reporting etc., will get shorter cube building times if the new integration feature is enabled in Kylin
  •  Data engineers, who develop the data models of the DW/DM layer, will greatly reduce the implementation & delivery effort for unit tests and performance tests on the cube

Q6. What are the risks?

There is no other risk, as this is just an alternative configuration option for the Hudi source type; Kylin's other components & pipelines won't be affected.

...

  •  For Hudi source integration:
    • Add a new config item in kylin.properties for the Hudi source type (e.g. isHudiSource=true, HudiType=MOR)
    • Add a new ISource interface and implementation using the Hudi native client API
    • Use the Hudi client API's read-optimized view query on top of the Hive external table to extract the source Hudi dataset (see the sketch below)
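
A rough sketch of how the proposed config items could drive the source extraction; the property names isHudiSource and HudiType come from the example above, while the lookup mechanism and the fallback Hive table are purely illustrative (a real implementation would read the values from KylinConfig).

    import org.apache.spark.sql.DataFrame

    // Illustrative mapping of the proposed kylin.properties flags onto Hudi read options.
    val isHudiSource: Boolean = true      // e.g. isHudiSource=true
    val hudiTableType: String = "MOR"     // e.g. HudiType=MOR

    def readSource(basePath: String): DataFrame = {
      if (isHudiSource) {
        val reader = spark.read.format("hudi")
        // For MOR tables, prefer the read-optimized query so that cube building
        // scans only the compacted Parquet base files.
        val tuned =
          if (hudiTableType == "MOR") reader.option("hoodie.datasource.query.type", "read_optimized")
          else reader
        tuned.load(basePath)
      } else {
        // Fall back to the existing Hive-based extraction path.
        spark.table("default.kylin_intermediate_flat_table")   // placeholder Hive table
      }
    }
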
  •  For Hudi cuboid storage (TBD, out of this scope):
    • Add a new config item in kylin.properties for the Hudi storage type for cuboids (e.g. isHudiCuboidStorage=true)
    • Add a new ITarget interface and implementation using the Hudi write API for the internal storage and operations of cuboid files (see the sketch below)
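
As a sketch of what such an ITarget implementation might do when it first materializes a cuboid (names hypothetical, reusing the conventions of the upsertCuboid sketch above), the initial build could bulk-insert the cuboid rows into a Hudi MOR table, with later builds going through upserts:

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Illustrative initial materialization of a cuboid as a Hudi MOR table;
    // subsequent (incremental) builds would reuse the upsert path shown earlier.
    def bulkInsertCuboid(cuboidRows: DataFrame, cuboidPath: String): Unit = {
      cuboidRows.write.format("hudi")
        .option("hoodie.table.name", "kylin_cuboid_demo")
        .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
        .option("hoodie.datasource.write.operation", "bulk_insert")          // fast initial load
        .option("hoodie.datasource.write.recordkey.field", "dimension_key")  // cuboid dimension key as PK
        .option("hoodie.datasource.write.precombine.field", "build_ts")
        .mode(SaveMode.Overwrite)
        .save(cuboidPath)
    }
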
  •  For cube rebuild with the new Hudi source type (TBD, out of this scope):
    • Use Hudi's incremental query API to extract only the data changed since the timestamp of the last cube segment (see the sketch below)
    • Use Hudi's upsert API to merge the changed data with the cuboid's former history data
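
A spark-shell style sketch of the incremental extraction, assuming the last segment's build corresponds to a Hudi commit instant held in a hypothetical lastSegmentCommitTime value:

    // Incremental query: pull only the records committed to the Hudi source table
    // after the commit time recorded for the last built cube segment.
    val lastSegmentCommitTime = "20240101000000"   // placeholder Hudi commit instant

    val changedRows = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", lastSegmentCommitTime)
      .load("hdfs:///datalake/curated/orders")

    // Only these changed rows would then be re-aggregated and upserted into the
    // affected cuboid files (see the upsertCuboid sketch above).
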
  •  For cube merge with the new Hudi cuboid storage type (TBD, out of this scope):
    • Use Hudi's upsert API to merge the two cuboid files (see the sketch below)
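
Reusing the hypothetical upsertCuboid helper from above, merging two cuboid segments could then amount to upserting the rows of the second segment into the Hudi table backing the first (paths are placeholders); Hudi's compaction would later fold the resulting delta logs back into Parquet base files.

    // Merge sketch: read the second segment's cuboid rows and upsert them into
    // the Hudi table that backs the first segment.
    val segment2Rows = spark.read.format("hudi").load("hdfs:///kylin/cuboids/segment_2")
    upsertCuboid(segment2Rows, "hdfs:///kylin/cuboids/segment_1")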

...