If you are looking for documentation on using Hudi, please visit the project site or engage with our community.

Technical documentation

How-to blogs

  1. How to manually register Hudi tables into Hive via Beeline? 
  2. Ingesting Database changes via Sqoop/Hudi
  3. De-Duping Kafka Events With Hudi DeltaStreamer

Design documents/HIPs

RFCs are the way to propose large changes to Hudi, and the RFC Process details how to drive one from proposal to completion.

...

  1. RFC-1: CSV Source Support for Delta Streamer
  2. RFC-2: Orc Storage in Hudi
  3. RFC-3: Timeline Service with Incremental File System View Syncing
  4. RFC-4: Faster Hive incremental pull queries
  5. RFC-5: HUI (Hudi WebUI)

Community Management

Roadmap

Below is a depiction of what's to come and how it's sequenced.

...

This is a rough roadmap (a non-exhaustive list) of what's to come in each of the areas for Hudi.

Writing data & Indexing 

  • Improving indexing speed for time-ordered keys/small updates
    • Leveraging Parquet record indexes
    • Serving bloom filters/ranges from the timeline server and consolidating metadata
    • Indexing the log file, moving closer to scalable 1-minute ingests
  • Improving indexing speed for UUID keys/large update spreads
    • Global/hash-based index for faster point-in-time lookups
  • Incrementalizing and standardizing all metadata operations, e.g., cleaning based on timeline metadata
  • Auto-tuning
    • Auto-tuning bloom filter entries based on the number of records
    • Partitioning based on historical workload trends
    • Determining the compression ratio
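
To make the bloom filter auto-tuning item above concrete, here is a minimal sketch of how a filter could be sized from a record count. It uses the standard textbook Bloom filter sizing formulas, not Hudi's actual implementation; the function name `bloom_filter_size` and its parameters are hypothetical and chosen only for illustration.

```python
import math


def bloom_filter_size(num_records: int, false_positive_rate: float) -> tuple[int, int]:
    """Return (num_bits, num_hashes) for a classic Bloom filter.

    Hypothetical helper, not a Hudi API. Uses the standard formulas:
        bits   m = -n * ln(p) / (ln 2)^2
        hashes k = (m / n) * ln 2
    """
    if num_records <= 0:
        raise ValueError("num_records must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be in (0, 1)")

    ln2 = math.log(2)
    # Total number of bits needed to hold n records at the target error rate.
    num_bits = math.ceil(-num_records * math.log(false_positive_rate) / (ln2 ** 2))
    # Number of hash functions that minimizes false positives for that size.
    num_hashes = max(1, round(num_bits / num_records * ln2))
    return num_bits, num_hashes


if __name__ == "__main__":
    # Size a filter for a file that is expected to hold 1,000 records.
    bits, hashes = bloom_filter_size(1_000, 0.01)
    print(f"bits={bits}, hashes={hashes}")
```

Sizing from the observed record count, rather than from a fixed configured entry count, keeps the false positive rate roughly constant as file sizes vary, which is the motivation behind this roadmap item.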

Reading data

  • Incremental pull natively via the Spark Datasource
  • Real-time view support on Presto
  • Hardening incremental pull via the real-time view
  • Reducing the performance overhead and memory footprint of the real-time view
  • Support for streaming-style batch programs via Beam/Structured Streaming integration

Storage 

  • ORC Support
  • Support for collapsing and splitting file groups 
  • Custom strategies for data clustering
  • Columnar stats collection to power better query planning

Usability 

  • Painless migration of historical data, with safe experimentation
  • Hudi on Flink
  • Hudi for ML/Feature stores

Metadata Management

  • Standalone timeline server to handle DFS listings and timeline requests
  • Consolidated filesystem metadata for query planning
    • The Hudi timeline is a log; compacting it yields a snapshot of the table
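
The idea that compacting a log yields a snapshot can be sketched as follows. This is only an illustration under simplifying assumptions: the `TimelineEvent` record, its `add`/`remove` actions, and the `compact_timeline` function are hypothetical and do not reflect Hudi's actual timeline format or APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimelineEvent:
    """Hypothetical, simplified timeline entry (not Hudi's real format)."""
    instant: str    # sortable instant time, e.g. "001"
    action: str     # assumed to be either "add" or "remove"
    file_path: str  # data file affected by this instant


def compact_timeline(events):
    """Replay the log in instant order and return the set of live files."""
    live_files = set()
    for event in sorted(events, key=lambda e: e.instant):
        if event.action == "add":
            live_files.add(event.file_path)
        elif event.action == "remove":
            live_files.discard(event.file_path)
        else:
            raise ValueError(f"unknown action: {event.action}")
    return live_files


if __name__ == "__main__":
    log = [
        TimelineEvent("001", "add", "p1/f1.parquet"),
        TimelineEvent("002", "add", "p1/f2.parquet"),
        TimelineEvent("003", "remove", "p1/f1.parquet"),
    ]
    # The snapshot answers "which files make up the table now?" without listing DFS.
    print(compact_timeline(log))
```

A query planner that reads such a snapshot only needs the compacted result, not the full history, which is why consolidated metadata can replace repeated filesystem listings.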