This wiki space hosts
If you are looking for documentation on using Hudi, please visit the project site or engage with our community.
Technical documentation
How-to blogs
- How to manually register Hudi tables into Hive via Beeline?
- Ingesting Database changes via Sqoop/Hudi
- De-Duping Kafka Events With Hudi DeltaStreamer
Design documents/HIPs
RFCs are the way to propose large changes to Hudi, and the RFC Process details how to drive one from proposal to completion.
The RFCs are listed below:
- RFC-1: CSV Source Support for Delta Streamer
- RFC-2: ORC Storage in Hudi
- RFC-3: Timeline Service with Incremental File System View Syncing
- RFC-4: Faster Hive incremental pull queries
- RFC-5: HUI (Hudi WebUI)
Roadmap
This is a rough, non-exhaustive roadmap of what's to come in each area of Hudi.
Writing data & Indexing
- Improving indexing speed for time-ordered keys/small updates
  - Leverage Parquet record indexes
  - Serve bloom filters/ranges from the timeline server/consolidated metadata
- Indexing the log file, moving closer to scalable 1-min ingests
- Improving indexing speed for UUID keys/large update spreads
  - Global/hash-based index for faster point-in-time lookups
- Incrementalize & standardize all metadata operations, e.g., cleaning based on timeline metadata
- Auto tuning
  - Auto-tune bloom filter entries based on records
  - Partitioning based on historical workload trends
  - Determination of compression ratio
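
To make the "auto-tune bloom filter entries based on records" item concrete, here is a minimal sketch of the textbook Bloom filter sizing formulas: given an expected record count and a target false-positive rate, it derives the number of bits and hash functions. The function name and parameters are hypothetical and chosen for illustration; this is not Hudi's implementation or configuration API.

```python
import math


def bloom_filter_size(num_records: int, false_positive_rate: float) -> tuple:
    """Return (num_bits, num_hashes) for a standard Bloom filter.

    Uses the textbook formulas:
        m = -n * ln(p) / (ln 2)^2
        k = (m / n) * ln 2
    Hypothetical helper for illustration only; not part of Hudi.
    """
    if num_records <= 0:
        raise ValueError("num_records must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be in (0, 1)")

    num_bits = math.ceil(
        -num_records * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    # At least one hash function is always needed.
    num_hashes = max(1, round(num_bits / num_records * math.log(2)))
    return num_bits, num_hashes


if __name__ == "__main__":
    # Size a filter for the record count observed in a file.
    bits, hashes = bloom_filter_size(num_records=1000, false_positive_rate=0.01)
    print(bits, hashes)
```

An auto-tuner along these lines would feed in the record counts observed in recent writes rather than a fixed, hand-configured entry count.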
Reading data
- Incremental Pull natively via Spark Datasource
- Realtime view support on Presto
- Hardening incremental pull via Realtime view
- Realtime view performance/memory footprint reduction
- Support for streaming-style batch programs via Beam/Structured Streaming integration
Storage
- ORC Support
- Support for collapsing and splitting file groups
- Custom strategies for data clustering
- Columnar stats collection to power better query planning
Usability
- Painless migration of historical data, with safe experimentation
- Hudi on Flink
- Hudi for ML/Feature stores
Metadata Management
- Standalone timeline server to handle DFS listings, timeline requests
- Consolidated filesystem metadata for query planning
- The Hudi timeline is a log; compacting it yields a snapshot of the table
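
The idea that compacting the timeline yields a table snapshot can be sketched with a deliberately simplified model. Everything here is an assumption made for illustration: the `Instant` class, the `"commit"`/`"delete"` actions, and the file-group-to-file mapping are hypothetical and do not reflect Hudi's actual timeline format.

```python
from dataclasses import dataclass, field


@dataclass
class Instant:
    """One hypothetical timeline entry (not Hudi's real instant format)."""
    timestamp: str                # sortable instant time, e.g. "20200101120000"
    action: str                   # "commit" or "delete" in this toy model
    files: dict = field(default_factory=dict)  # file group id -> file name


def compact_timeline(instants):
    """Fold the log in timestamp order into a snapshot.

    The snapshot maps each file group id to its latest file name.
    """
    snapshot = {}
    for instant in sorted(instants, key=lambda i: i.timestamp):
        if instant.action == "commit":
            # Later commits overwrite earlier file versions per file group.
            snapshot.update(instant.files)
        elif instant.action == "delete":
            for file_group in instant.files:
                snapshot.pop(file_group, None)
    return snapshot


if __name__ == "__main__":
    timeline = [
        Instant("001", "commit", {"fg1": "fg1_001.parquet", "fg2": "fg2_001.parquet"}),
        Instant("002", "commit", {"fg1": "fg1_002.parquet"}),
        Instant("003", "delete", {"fg2": "fg2_001.parquet"}),
    ]
    print(compact_timeline(timeline))
```

Replaying the whole log on every query is what consolidated metadata would avoid: the folded snapshot can be stored and updated incrementally as new instants arrive.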