If you are looking for documentation on using Apache Hudi, please visit the project site or engage with our community.

Technical documentation

How-to blogs

...

Design documents/RFCs

RFCs are the way to propose large changes to Hudi, and the RFC Process details how to drive one from proposal to completion. Anyone can initiate an RFC. If you are unsure whether a feature already exists, or whether there is already a plan to implement a similar one, start a discussion thread on the dev mailing list before initiating an RFC. This gives everyone the right context and saves everyone time.

The RFCs are listed as child pages of the RFC Process page.

Community Management

Roadmap

Below is a tentative roadmap for 2021 (in no particular order, since ordering is determined by the Release Management process).

Integrations 

  1. Spark SQL with Merge/Delete statement support (RFC-25: Spark SQL Extension For Hudi)

  2. Trino integration with support for querying/writing Hudi tables using SQL statements

  3. Kinesis/Pulsar integrations with DeltaStreamer

  4. Kafka Connect Sink for Hudi

  5. Dremio integration 
  6. Interoperability with other table formats

  7. ORC Support

Writing 

  • Indexing 

    • MetadataIndex implementation that serves bloom filters/key ranges from the metadata table, to speed up the bloom index on cloud storage.

    • Addition of record level indexes for fast CDC (RFC-08 Record level indexing mechanisms for Hudi datasets)

    • Range index to maintain column/field value ranges, enabling file skipping for better query performance

    • Addition of more auxiliary indexing structures (bitmaps, etc.)

    • Global/hash-based index for faster point-in-time lookup

  • Concurrency Control

    • Addition of optimistic concurrency control, with pluggable locking services.
    • Non-blocking clustering implementation with respect to updates

    • Multi-writer support with fully non-blocking, log-based concurrency control.
    • Multi table transactions
  • Performance
    • Integrate row writer with all Hudi writer operations
  • Self Managing 

    • Clustering based on historical workload trends

    • On-the-fly data locality during write time (HUDI-1628)
    • Automatic determination of compression ratio
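
To illustrate the range index item in the Indexing list above, here is a minimal sketch of file skipping based on per-file column value ranges. It is not Hudi's implementation or API: the `FileRange` class, the `files_to_scan` function, and the file names are hypothetical and exist only to show the idea of pruning files whose min/max range cannot match a query predicate.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileRange:
    """Hypothetical index entry: min/max of one column within one data file."""
    file_name: str
    min_value: int
    max_value: int


def files_to_scan(index: List[FileRange], lo: int, hi: int) -> List[str]:
    """Return the files whose [min, max] range overlaps the query range [lo, hi].

    Files with no overlap cannot contain matching rows and are skipped.
    """
    return [f.file_name for f in index if f.max_value >= lo and f.min_value <= hi]


# Illustrative index contents (made-up file names and ranges).
index = [
    FileRange("file_a.parquet", 0, 99),
    FileRange("file_b.parquet", 100, 199),
    FileRange("file_c.parquet", 200, 299),
]

# A query for values between 150 and 210 only needs file_b and file_c.
candidates = files_to_scan(index, 150, 210)
```

In this sketch the index is a plain in-memory list. The roadmap item does not specify where the ranges would be stored or how they would be maintained, so those choices are left open here.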

Querying

...

Performance

  • Complete integration with metadata table.
  • Realtime view performance/memory footprint reduction.

...

  • Incremental Query support on Presto

...

  • Storage handler to leverage metadata table for partition pruning

...

  • Hardening incremental pull via Realtime view

  • Spark Datasource redesign around metadata table
  • Streaming ETL via Structured Streaming

...

Mutable, Columnar Cache Service

  • File group level caching to enable real-time analytics (backed by Arrow/AresDB)

Metadata Management

...