
Apache Kylin : Analytical Data Warehouse for Big Data



Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.

  • Integrate and enable Kylin to source input from Hudi-format raw or curated datasets in an enterprise "Data Lake"
  • Accelerate Kylin's cube building process using Hudi's native read-optimized view query on MOR tables
  • Optimize Kylin's cube rebuild process using Hudi's native incremental view query, so that only the changed data is captured and only the necessary cuboids are re-calculated and updated
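The read-optimized view mentioned above comes from Hudi's Hive sync, which registers two views of a MOR table (suffixed `_ro` and `_rt`). A minimal sketch, assuming a hypothetical table `lake_db.orders` that has been synced this way:

```sql
-- Hypothetical MOR table "lake_db.orders" synced to Hive by Hudi's hive-sync,
-- which registers two views of the same data:
--   orders_ro : read-optimized query (compacted Parquet base files only, fastest)
--   orders_rt : real-time/snapshot query (base files merged with Avro log files)

-- Cube building can read the read-optimized view for speed:
SELECT order_id, amount, _hoodie_commit_time FROM lake_db.orders_ro;

-- Or the real-time view when the freshest data is required:
SELECT order_id, amount, _hoodie_commit_time FROM lake_db.orders_rt;
```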

Q2. What problem is this proposal NOT designed to solve?

  • Other source dataset types (e.g. Kafka) without Hudi enablement are not within this scope
  • The streaming cube engine (as opposed to the batch cube engine) is not within this scope

Q3. How is it done today, and what are the limits of current practice?

  • Currently Kylin uses the beeline JDBC mechanism to connect directly to the Hive source, regardless of whether the input format is Hudi or not
  • This practice cannot leverage Hudi's native and advanced functionality, such as incremental queries and read-optimized view queries, from which Kylin could benefit through smaller incremental cuboid merges and faster source data extraction
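The incremental query that plain JDBC access leaves on the table can be sketched with Hudi's Hive incremental-pull session properties (`hoodie.<table>.consume.*`); the table name `orders` and timestamp below are hypothetical:

```sql
-- Switch the hypothetical "orders" table into incremental consume mode,
-- starting after a given commit timestamp and capping the commits read.
set hoodie.orders.consume.mode=INCREMENTAL;
set hoodie.orders.consume.start.timestamp=20200301000000;
set hoodie.orders.consume.max.commits=3;

-- Only rows committed after the start timestamp are returned, so a cube
-- rebuild would only re-process the changed data:
SELECT order_id, amount, _hoodie_commit_time
FROM orders
WHERE _hoodie_commit_time > '20200301000000';
```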

Q4. What is new in your approach and why do you think it will be successful?

  • Hudi has been released and is mature in the big-data market and tech stack; many companies already use it in their Data Lake / raw / curated data layers
  • The Hudi library is already integrated with Spark DataFrames / Spark SQL, which enables Kylin's Spark engine to query Hudi sources
  • Hudi's Parquet base files, Avro redo logs, index metadata, etc. can be exposed via Hive external tables and input format definitions, which Kylin can leverage to successfully do the extraction
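The Hive external-table route in the last bullet could look like the following sketch; the table name, columns, and location are hypothetical, while the SerDe and input/output format class names are Hudi's standard Hive integration classes for a copy-on-write table:

```sql
-- Hypothetical Hudi copy-on-write table exposed to Hive (and thus to Kylin)
-- via Hudi's input format definition.
CREATE EXTERNAL TABLE orders_hudi (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs:///data/lake/orders_hudi';
```

For a MOR table's real-time view, Hudi provides `HoodieParquetRealtimeInputFormat` in place of the input format above.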

Q5. Who cares? If you are successful, what difference will it make?

  • Data scientists doing data mining, exploration, reporting, etc. will get faster cube building times once the new integration feature is enabled in Kylin
  • Data engineers developing the data modeling of the DW/DM layers will greatly reduce the implementation and delivery effort for unit tests and performance tests on the cube

Q6. What are the risks?

There is no other risk, as this is just an alternative configuration option for the Hudi source type; Kylin's other components and pipelines won't be affected.

Q7. How long will it take?

N/A

Q8. How does it work?

The logic diagram of the overall architectural design is as follows:

References

  • Hudi framework: https://hudi.apache.org/docs/
  • Hive/Spark integration support for Hudi: https://hudi.apache.org/docs/querying_data.html
