...

If you are looking for documentation on using Apache Hudi, please visit the project site or engage with our community.

Technical documentation

How-to blogs

...

Design documents/RFCs

RFCs are the way to propose large changes to Hudi, and the RFC Process details how to drive one from proposal to completion. Anyone can initiate an RFC. If you are unsure whether a feature already exists, or whether there is already a plan to implement a similar one, always start a discussion thread on the dev mailing list before initiating an RFC. This gives everyone the right context and makes the best use of everyone's time.

Below is a list of RFCs

RFC Process

Community Management

Apache Hudi - Release Guide (Pre Graduation)
Apache Hudi Community Bi-Weekly Sync
Committer On-boarding Guide
Community Support

Roadmap

Below is a tentative roadmap for 2021 (in no particular order, since ordering is determined by the Release Management process)

Integrations

Spark SQL with Merge/Delete statement support (RFC-25: Spark SQL Extension For Hudi)
Trino integration with support for querying/writing Hudi tables using SQL statements
Kinesis/Pulsar integrations with DeltaStreamer
Kafka Connect Sink for Hudi
Dremio integration
Interops with other table formats
ORC Support

Writing

Indexing
- MetadataIndex implementation that serves bloom filters/key ranges from the metadata table, to speed up the bloom index on cloud storage.
- Addition of record level indexes for fast CDC (RFC-08 Record level indexing mechanisms for Hudi datasets)
- Range index to maintain column/field value ranges, to enable file skipping and improve query performance
- Addition of more auxiliary indexing structures - bitmaps, ...
- Global/hash-based index for faster point-in-time lookups
Concurrency Control
- Addition of optimistic concurrency control, with pluggable locking services.
- Clustering implementation that does not block updates
- Multi-writer support with fully non-blocking, log-based concurrency control.
- Multi table transactions
Performance
- Integrate row writer with all Hudi writer operations
Self Managing
- Clustering based on historical workload trend
- On-the-fly data locality at write time (HUDI-1628)
- Automatic determination of compression ratio
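
To illustrate the range index item in the Indexing list above, here is a minimal sketch of how per-file column value ranges could let a query skip files. It is not Hudi's implementation: the `FileRange` structure, the `files_to_scan` helper, and the file names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRange:
    """Min/max of one column within one data file (hypothetical structure)."""
    file_name: str
    min_value: int
    max_value: int


def files_to_scan(ranges, lower, upper):
    """Return files whose [min, max] range overlaps the query range [lower, upper]."""
    return [
        r.file_name
        for r in ranges
        if r.max_value >= lower and r.min_value <= upper
    ]


# Illustrative ranges; in practice these would be read from an index.
ranges = [
    FileRange("file_a.parquet", 0, 99),
    FileRange("file_b.parquet", 100, 199),
    FileRange("file_c.parquet", 200, 299),
]

# A predicate such as 120 <= col <= 210 cannot match any row in file_a,
# so only the remaining two files need to be read.
selected = files_to_scan(ranges, 120, 210)
```

The design point is that the overlap check uses only the stored min/max values, so files can be pruned without opening them.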

Querying

...

Performance

Complete integration with metadata table.
Realtime view performance/memory footprint reduction.

...

Incremental Query support on Presto

...

Storage handler to leverage metadata table for partition pruning

...

Hardening incremental pull via Realtime view
Spark Datasource redesign around metadata table
Streaming ETL via Structured Streaming

...

Support

...

Mutable, Columnar Cache Service

File group level caching to enable real-time analytics (backed by Arrow/AresDB)

Metadata Management

...
