
Apache Kylin : Analytical Data Warehouse for Big Data


...

  •  For Hudi Data Lake source type integration:
    • Integrate Kylin's source input with Hudi-format datasets in an enterprise's raw or curated data layers in the Data Lake
  •  For Kylin cube rebuild & merge optimization (TBD):
    • Enable Hudi as the storage format for Kylin's cuboid files
    • Accelerate and optimize Kylin's cube rebuild process by using Hudi's incremental view query to extract only the source data that changed since the timestamp of the last cube build
    • Accelerate and optimize Kylin's cube merge process by using Hudi's native compaction on the delta incremental cuboid files, or by using Hudi's upsert feature to merge multiple cuboid files into one, much like upserting a single MOR base table with multiple sets of selected rows (see the sketch after this list)
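The incremental extraction idea above can be sketched with Hudi's Spark DataSource API. This is a minimal illustration, not Kylin code: the table path, commit time, and application name below are hypothetical, and the only claim is that a Hudi incremental query returns just the rows committed after a given instant.

```scala
import org.apache.spark.sql.SparkSession

object IncrementalSourcePull {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kylin-hudi-incremental-pull")        // hypothetical app name
      .getOrCreate()

    val hudiBasePath = "hdfs:///data/lake/sales_fact" // hypothetical Hudi table path
    val lastBuildInstant = "20230101000000"           // commit time of the last cube build (assumed tracked by Kylin)

    // Hudi incremental query: only commits after lastBuildInstant are returned,
    // instead of a full scan of the Hive source table.
    val changedRows = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", lastBuildInstant)
      .load(hudiBasePath)

    // The changed rows would then feed the flat-table / cuboid rebuild step.
    changedRows.createOrReplaceTempView("changed_source_rows")
    spark.sql("SELECT COUNT(*) FROM changed_source_rows").show()
  }
}
```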

Q2. What problem is this proposal NOT designed to solve?

  •  Other types of data source datasets (e.g. Kafka) that do not support Hudi are not within this scope
  •  The streaming CubeEngine (as opposed to the batch CubeEngine) is not within this scope

Q3. How is it done today, and what are the limits of current practice?

  •  Currently, Kylin uses the Beeline JDBC mechanism to connect directly to the Hive source, regardless of whether the underlying input format is Hudi or not (a simplified sketch of this path follows this list)
  •  Today's implementation cannot leverage Hudi's native and advanced functionality, such as incremental queries and read-optimized view queries, from which Kylin could benefit through smaller incremental cuboid merges and faster source data extraction
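For illustration only (not Kylin's actual code path): the current extraction amounts to a plain Hive query over the same JDBC interface that Beeline uses, so every build re-reads the whole source table. Hostname, credentials, and table name below are hypothetical.

```scala
import java.sql.DriverManager

object CurrentHiveJdbcExtraction {
  def main(args: Array[String]): Unit = {
    // Hypothetical HiveServer2 endpoint and source table.
    val url  = "jdbc:hive2://hiveserver2-host:10000/default"
    val conn = DriverManager.getConnection(url, "kylin", "")
    try {
      val stmt = conn.createStatement()
      // Full scan of the source table: Hudi-specific metadata (commit timeline,
      // incremental pulls) is invisible at this layer.
      val rs = stmt.executeQuery("SELECT * FROM sales_fact")
      var rows = 0L
      while (rs.next()) { rows += 1 }
      println(s"Scanned $rows rows (full scan on every build)")
    } finally {
      conn.close()
    }
  }
}
```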

...

  •  For Hudi Source integration:
    • New Approach
      • Accelerate Kylin's cube building process using Hudi's native read-optimized view query on MOR tables
    • Why it will be successful
      • Hudi has been released and is mature in the big data domain and tech stack; many companies already use it in the Data Lake raw/curated data layers
      • The Hudi library is already integrated with Spark DataFrames/Spark SQL, which enables Kylin's Spark engine to query the Hudi source (see the sketch after this list)
      • Hudi's parquet base files and Avro redo logs, as well as its index metadata, can be exposed through Hive's external table and input format definitions, which Kylin can leverage to perform the extraction successfully
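A minimal sketch of how Kylin's Spark engine could read a Hudi MOR source, assuming a table at a hypothetical path with hypothetical column names: the read-optimized query scans only the compacted parquet base files, while the default snapshot query also merges in the Avro delta logs.

```scala
import org.apache.spark.sql.SparkSession

object HudiSourceRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kylin-hudi-source-read")              // hypothetical app name
      .getOrCreate()

    val hudiBasePath = "hdfs:///data/lake/sales_fact"  // hypothetical Hudi MOR table path

    // Read-optimized query: fastest, reads only compacted parquet base files.
    val roView = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "read_optimized")
      .load(hudiBasePath)

    // Snapshot query (the default): merges base files with the delta logs for freshness.
    val snapshot = spark.read.format("hudi")
      .load(hudiBasePath)

    // Flat-table style aggregation over the Hudi-backed view (hypothetical columns).
    roView.createOrReplaceTempView("sales_fact_ro")
    spark.sql("SELECT seller_id, SUM(price) FROM sales_fact_ro GROUP BY seller_id").show()
  }
}
```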
  •  For Hudi cuboid storage (TBD)
    • New Approach
      • Optimize Kylin's cube rebuild process using Hudi's native incremental view query to capture only the changed data and recalculate & update only the necessary cuboid files
      • Optimize Kylin's cube merge process using Hudi's upsert feature to manipulate the cuboid files, rather than the former join & shuffle
    • Why it will be successful
      • Hudi supports upsert based on the primary key of each record, and each cuboid's dimension key ID can serve as that primary key
      • so that during rebuild & merge operations Kylin can directly update existing cuboid files, or merge multiple cuboid files based on the primary key and compact them into base parquet files (see the sketch after this list)
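A minimal sketch of the cuboid-as-Hudi-table idea, assuming each cuboid row carries a composite dimension key, its measures, and a build timestamp; the table name, path, and schema below are hypothetical. Under this assumption the merge step becomes a Hudi upsert keyed by the cuboid dimension key, with Hudi compaction later folding the delta logs back into base parquet files, instead of the former join & shuffle.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object CuboidUpsertMerge {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kylin-cuboid-hudi-upsert")             // hypothetical app name
      .getOrCreate()
    import spark.implicits._

    // Freshly built delta cuboid rows (hypothetical schema: key, measure, build timestamp).
    val deltaCuboids = Seq(
      ("c1|dimA=1|dimB=3", 42L, 20230102L),
      ("c1|dimA=2|dimB=7", 17L, 20230102L)
    ).toDF("cuboid_key", "measure_sum", "build_ts")

    deltaCuboids.write.format("hudi")
      .option("hoodie.table.name", "kylin_cuboids")                              // hypothetical table name
      .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
      .option("hoodie.datasource.write.recordkey.field", "cuboid_key")           // cuboid dimension key as the PK
      .option("hoodie.datasource.write.precombine.field", "build_ts")            // newer build wins on conflict
      .option("hoodie.datasource.write.operation", "upsert")                     // merge == upsert by PK
      .option("hoodie.datasource.write.keygenerator.class",
              "org.apache.hudi.keygen.NonpartitionedKeyGenerator")               // non-partitioned for simplicity
      .mode(SaveMode.Append)
      .save("hdfs:///kylin/cuboid_store/kylin_cuboids")                          // hypothetical cuboid storage path

    spark.stop()
  }
}
```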

...