Apache Hudi (Incubating)

This wiki space hosts

If you are looking for documentation on using Hudi, please visit the project site or engage with our community.

Technical documentation

RFCs are the way to propose large changes to Hudi, and the RFC Process details how to drive one from proposal to completion.

List below

This is a rough roadmap (a non-exhaustive list) of what's to come in each area of Hudi.

Improving indexing speed for time-ordered keys/small updates
- Leveraging Parquet record indexes
- Serving bloom filters/ranges from timeline server/consolidate metadata
- Indexing the log file, moving closer to scalable 1-min ingests
Improving indexing speed for UUID keys/large update spreads
- Global/hash-based index for faster point-in-time lookups
Incrementalize & standardize all metadata operations, e.g. cleaning based on timeline metadata
Auto tuning
- Auto tune bloom filter entries based on records
- Partitioning based on historical workload trend
- Determination of compression ratio
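
As an illustration of what auto tuning bloom filter entries based on record counts could involve, here is a minimal sketch using the textbook bloom filter sizing formulas. It is not Hudi's implementation; the function name `bloom_filter_params` and its parameters are hypothetical and chosen only for this example.

```python
import math


def bloom_filter_params(num_records: int, false_positive_rate: float) -> tuple[int, int]:
    """Return (num_bits, num_hashes) for a classic bloom filter.

    Uses the standard formulas:
      m = -n * ln(p) / (ln 2)^2   (bits in the filter)
      k = (m / n) * ln 2          (number of hash functions)
    Hypothetical helper for illustration; not part of Hudi.
    """
    if num_records <= 0:
        raise ValueError("num_records must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be in (0, 1)")

    num_bits = math.ceil(
        -num_records * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    # At least one hash function is always needed.
    num_hashes = max(1, round(num_bits / num_records * math.log(2)))
    return num_bits, num_hashes


# Size a filter for 1,000 records at a 1% target false positive rate.
bits, hashes = bloom_filter_params(1_000, 0.01)
```

An auto tuner along these lines would take the observed record count per file as `num_records` rather than relying on a fixed, manually configured entry count, so that small files do not carry oversized filters and large files do not suffer elevated false positive rates.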

Incremental pull natively via Spark Datasource
Realtime view support on Presto
Hardening incremental pull via Realtime view
Realtime view performance/memory footprint reduction
Support for streaming-style batch programs via Beam/Structured Streaming integration

Standalone timeline server to handle DFS listings, timeline requests
Consolidated filesystem metadata for query planning
- The Hudi timeline is a log; if we compact it, we get a snapshot of the table
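
The idea that compacting a log yields a snapshot can be sketched as follows. This is a simplified model, not Hudi's actual timeline format: the `Instant` type, its fields, and the file names are hypothetical, and each instant is assumed to record only which files it added or removed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instant:
    """One hypothetical timeline entry: an action and the files it touched."""
    timestamp: str
    action: str
    added: tuple = ()
    removed: tuple = ()


def compact_timeline(instants):
    """Replay the log in timestamp order and return the set of live files."""
    live = set()
    for instant in sorted(instants, key=lambda i: i.timestamp):
        live.difference_update(instant.removed)
        live.update(instant.added)
    return live


timeline = [
    Instant("001", "commit", added=("p1/f1_001.parquet", "p2/f2_001.parquet")),
    Instant("002", "commit", added=("p1/f1_002.parquet",)),
    Instant("003", "clean", removed=("p1/f1_001.parquet",)),
]

# The snapshot lists the files a query planner would need to consider,
# without listing the filesystem or re-reading every instant.
snapshot = compact_timeline(timeline)
```

Under this model, query planning can consult the compacted snapshot directly, and only instants newer than the snapshot need to be replayed on top of it.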