Apache Kylin : Analytical Data Warehouse for Big Data

Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.

  • Integrate Hudi as a data lake source type:
    • Let Kylin read its source input from Hudi-format datasets in an enterprise's raw or curated data lake layers
  • Optimize Kylin cube rebuild and merge (TBD):
    • Support Hudi as a storage format for Kylin cuboids
    • Speed up cube rebuilds by using Hudi's incremental view query to extract only the source data that changed since the timestamp of the last cube build
    • Speed up cube merges by using Hudi's native compaction on the incremental (delta) cuboid files, or by using Hudi's upsert feature to merge multiple cuboid files into one, much like upserting rows selected from several sources into a single base MOR (Merge-on-Read) table

Q2. What problem is this proposal NOT designed to solve?

  • Source datasets of other types (e.g. Kafka) that are not Hudi-enabled are out of scope
  • The streaming cube engine is out of scope; only the batch cube engine is covered

Q3. How is it done today, and what are the limits of current practice?

  • Kylin currently connects to the Hive source directly through Beeline/JDBC, whether or not the table's input format is Hudi
  • This approach cannot leverage Hudi's native, more advanced features, such as incremental queries and read-optimized view queries, from which Kylin could gain smaller incremental cuboid merges and faster source data extraction
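
The Hudi query modes mentioned above are selected through Spark datasource read options. The following is a minimal sketch, not Kylin code: the option keys are Hudi's documented Spark read options, while the helper name `hudi_read_options` and the commented usage are hypothetical and assume a SparkSession with the Hudi Spark bundle on the classpath.

```python
# Hudi Spark datasource option keys (from the Hudi querying docs).
QUERY_TYPE_KEY = "hoodie.datasource.query.type"
BEGIN_INSTANT_KEY = "hoodie.datasource.read.begin.instanttime"

ALLOWED_QUERY_TYPES = {"snapshot", "read_optimized", "incremental"}


def hudi_read_options(query_type, begin_instant=None):
    """Build the read options for one Hudi query mode (hypothetical helper)."""
    if query_type not in ALLOWED_QUERY_TYPES:
        raise ValueError("unknown Hudi query type: %s" % query_type)
    options = {QUERY_TYPE_KEY: query_type}
    if query_type == "incremental":
        # Incremental queries return only records committed after this instant.
        if begin_instant is None:
            raise ValueError("incremental queries need a begin instant time")
        options[BEGIN_INSTANT_KEY] = begin_instant
    return options


# Illustrative usage, not executed here (base_path is a placeholder):
# df = spark.read.format("hudi") \
#     .options(**hudi_read_options("read_optimized")) \
#     .load(base_path)
```

A plain Beeline/JDBC connection to Hive has no equivalent place to pass these per-query options, which is the limitation described above.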

Q4. What is new in your approach and why do you think it will be successful?

  • For Hudi source integration:
    • New approach
      • Speed up Kylin's cube build by using Hudi's native read-optimized view query on MOR tables
    • Why it will be successful
      • Hudi is released and mature in the big data market and technology stack, and many companies already use it in their data lake, raw, and curated data layers
      • The Hudi library is already integrated with Spark DataFrames and Spark SQL, which allows Kylin's Spark engine to query a Hudi source
      • Hudi's Parquet base files, Avro redo logs, index metadata, etc. can be exposed through a Hive external table and input format definition, which Kylin can use to perform the extraction
  • For Hudi cuboid storage (TBD):
    • New approach
      • Optimize Kylin's cube rebuild by using Hudi's native incremental view query to capture only the changed data, then recalculate and update only the necessary cuboid files
      • Optimize Kylin's cube merge by using Hudi's upsert feature to manipulate the cuboid files, instead of the former join and shuffle
    • Why it will be successful
      • Hudi supports upserts based on a record's primary key (PK), and each cuboid's dimension key ID can serve as that PK
      • During rebuild and merge operations, Hudi can therefore update the existing cuboid files directly, or merge multiple cuboid files by PK and compact them into base Parquet files
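
The upsert-by-key idea above can be illustrated without Hudi at all. The sketch below is a plain-Python model, not Hudi or Kylin code: the row layout, the sample values, and the `upsert` helper are hypothetical, and it assumes overwrite semantics, that is, each incoming row already carries the recalculated measure for its dimension key.

```python
def upsert(base_rows, incoming_rows):
    """Merge incoming rows into base rows by record key; incoming wins on conflict."""
    merged = dict(base_rows)      # copy so the base snapshot is left untouched
    merged.update(incoming_rows)  # replace matching keys, insert new ones
    return merged


# Hypothetical cuboid rows: dimension key (date, country) -> aggregated measure.
base_cuboid = {("2020-01-01", "CN"): 100, ("2020-01-01", "US"): 80}
delta_cuboid = {("2020-01-01", "US"): 95, ("2020-01-02", "CN"): 40}

merged_cuboid = upsert(base_cuboid, delta_cuboid)
# The US row for 2020-01-01 is replaced, the 2020-01-02 row is added,
# and the untouched CN row is carried over as-is.
```

Only rows whose key appears in the delta are rewritten, which is why no join or shuffle over the full history is needed in this model.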

Q5. Who cares? If you are successful, what difference will it make?

  • Data scientists doing data mining, exploration, reporting, etc. will get shorter cube build times once the new integration feature is enabled in Kylin
  • Data engineers developing data models for the DW/DM layer will spend much less implementation and delivery effort on unit tests and performance tests of the cube

Q6. What are the risks?

No additional risk is expected: the Hudi source type is only an optional, configuration-driven alternative, so other Kylin components and pipelines are not affected.

Q7. How long will it take?

N/A

Q8. How it works?

The overall architectural design logic is as follows:

  • For Hudi source integration:
    • Add new config items in kylin.properties for the Hudi source type (e.g. isHudiSource=true, HudiType=MOR)
    • Add a new ISource interface and implementation using the Hudi native client API
    • Use the Hudi client API's read-optimized view query on top of the Hive external table to extract the source Hudi dataset
  • For Hudi cuboid storage (TBD):
    • Add a new config item in kylin.properties for the Hudi cuboid storage type (e.g. isHudiCuboidStorage=true)
    • Add a new ITarget interface and implementation using the Hudi write API for intermediate storage of, and operations on, cuboid files
  • For cube rebuild with the new Hudi source type (TBD):
    • Use Hudi's incremental query API to extract only the data changed since the timestamp of the last cube segment
    • Use Hudi's upsert API to merge the changed data with the existing historical cuboid data
  • For cube merge with the new Hudi cuboid storage type (TBD):
    • Use Hudi's upsert API to merge the two cuboid files
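
The rebuild flow above (incremental extraction since the last segment, then upsert into cuboid storage) could be wired up roughly as follows. This is a sketch under stated assumptions: the option keys are Hudi's documented Spark datasource options, but the helper names, the table and field names in the usage comments, the UTC time zone, and the 14-digit `yyyyMMddHHmmss` instant format are assumptions (newer Hudi releases also use millisecond-precision instants, and the timeline time zone is configurable).

```python
from datetime import datetime, timezone


def to_hudi_instant(epoch_millis):
    """Format epoch milliseconds as a yyyyMMddHHmmss instant time (UTC assumed)."""
    moment = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return moment.strftime("%Y%m%d%H%M%S")


def rebuild_options(last_segment_end_millis, cuboid_table, key_field, ordering_field):
    """Return (read_options, write_options) for one incremental rebuild (hypothetical helper)."""
    read_options = {
        # Pull only records committed after the last cube segment was built.
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": to_hudi_instant(last_segment_end_millis),
    }
    write_options = {
        # Upsert the recalculated cuboid rows, keyed by the dimension key.
        "hoodie.table.name": cuboid_table,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.recordkey.field": key_field,
        "hoodie.datasource.write.precombine.field": ordering_field,
    }
    return read_options, write_options


# Illustrative usage, not executed here (paths and names are placeholders):
# read_opts, write_opts = rebuild_options(1577836800000, "cuboid_table", "dim_key", "build_ts")
# changed = spark.read.format("hudi").options(**read_opts).load(source_path)
# ... recalculate the affected cuboid rows from `changed` ...
# cuboid_df.write.format("hudi").options(**write_opts).mode("append").save(cuboid_path)
```

Keeping option construction separate from the Spark calls makes the timestamp conversion easy to unit test without a cluster.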

Reference

Hudi framework : https://hudi.apache.org/docs/

Hive/Spark integration support for Hudi : https://hudi.apache.org/docs/querying_data.html
