This wiki space hosts 

If you are looking for documentation on using Apache Hudi (Incubating), please visit the project site or engage with our community.

Technical documentation

How-to blogs

  1. How to manually register Hudi tables into Hive via Beeline? 
  2. Ingesting Database changes via Sqoop/Hudi
  3. De-Duping Kafka Events With Hudi DeltaStreamer

Design documents/RFCs

RFCs are the way to propose large changes to Hudi, and the RFC Process details how to drive one from proposal to completion. Anyone can initiate an RFC. If you are unsure whether a feature already exists, or whether there is already a plan to implement a similar one, start a discussion thread on the dev mailing list before initiating an RFC. This gives everyone the right context and saves everyone's time.

Below is a list of RFCs 

Community Management

Roadmap

Below is a depiction of what's to come and how it's sequenced.

This is a rough roadmap (a non-exhaustive list) of what's to come in each area of Hudi.

Writing data & Indexing 

  • Improving indexing speed for time-ordered keys/small updates
    • Leveraging Parquet record indexes
    • Serving bloom filters/ranges from the timeline server/consolidated metadata
    • Indexing the log file, moving closer to scalable 1-min ingests
  • Improving indexing speed for uuid-keys/large update spreads
    • Global/hash-based index for faster point-in-time lookup
  • Incrementalize & standardize all metadata operations, e.g. cleaning based on timeline metadata
  • Auto tuning 
    • Auto tune bloom filter entries based on records
    • Partitioning based on historical workload trend
    • Determination of compression ratio
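
As an illustration of the "auto tune bloom filter entries based on records" item above, the sketch below sizes a bloom filter from an expected record count using the standard textbook formulas (bits m = -n ln p / (ln 2)^2, hash functions k = (m/n) ln 2). This is not Hudi's implementation; the function name `bloom_filter_size` and its parameters are hypothetical, chosen only for this example.

```python
import math
from typing import Tuple


def bloom_filter_size(num_records: int, false_positive_rate: float) -> Tuple[int, int]:
    """Return (number of bits, number of hash functions) for a bloom filter.

    Hypothetical helper for illustration only; it applies the standard
    bloom filter sizing formulas and is not taken from Hudi's code base.
    """
    if num_records <= 0:
        raise ValueError("num_records must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be in (0, 1)")

    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits
    num_bits = math.ceil(
        -num_records * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    # k = (m / n) * ln 2, with at least one hash function
    num_hashes = max(1, round(num_bits / num_records * math.log(2)))
    return num_bits, num_hashes


if __name__ == "__main__":
    bits, hashes = bloom_filter_size(1_000_000, 0.01)
    print(bits, hashes)
```

For one million records at a 1% false-positive rate this comes to roughly 9.6 million bits and 7 hash functions. An auto-tuner could feed the record count observed in earlier writes into such a function instead of relying on a fixed setting.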

Reading data

  • Incremental Pull natively via Spark Datasource
  • Real-time view support on Presto
  • Hardening incremental pull via Realtime view
  • Realtime view performance/memory footprint reduction
  • Support for streaming-style batch programs via Beam/Structured Streaming integration

Storage 

  • ORC Support
  • Support for collapsing and splitting file groups 
  • Custom strategies for data clustering
  • Columnar stats collection to power better query planning
  • Object storage

Usability 

  • Painless migration of historical data, with safe experimentation
  • Hudi on Flink
  • Hudi for ML/Feature stores

Metadata Management

  • Standalone timeline server to handle DFS listings, timeline requests
  • Consolidated filesystem metadata for query planning 
    • The Hudi timeline is a log; compacting it yields a snapshot of the table
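
The idea that a compacted log yields a snapshot can be sketched as a fold over log entries. The model below is a deliberate simplification and an assumption of this example: the `Instant` class, its `added`/`removed` fields, and the `compact` function are hypothetical and do not reflect Hudi's actual timeline format or APIs.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Instant:
    """Hypothetical timeline entry: files added and removed at one instant."""
    time: str
    added: FrozenSet[str] = frozenset()
    removed: FrozenSet[str] = frozenset()


def compact(timeline: Iterable[Instant]) -> Set[str]:
    """Replay the log in instant-time order and return the live file set."""
    live: Set[str] = set()
    for instant in sorted(timeline, key=lambda i: i.time):
        live |= instant.added    # files written by this instant
        live -= instant.removed  # files replaced or cleaned by this instant
    return live


if __name__ == "__main__":
    log = [
        Instant("001", added=frozenset({"f1"})),
        Instant("002", added=frozenset({"f2"}), removed=frozenset({"f1"})),
        Instant("003", added=frozenset({"f3"})),
    ]
    print(sorted(compact(log)))
```

In this toy model, a query planner could consult the compacted file set instead of listing the file system, which is the motivation the roadmap item gives for consolidated metadata.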