Quick Links 

Table of Contents

Issue Management

Actual issue tracking is in Apache JIRA! We use this page to ground ourselves.

...

  • (Vinoth) Identify & land all critical outstanding PRs (that solve critical issues, take us forward in our 1.0 path)
    •  Vinoth to identify.
    •  [Sagar] Move master to 1.0.0
  • (Ethan & Vinoth & Danny) Land storage format 1.0 (Complete)
    •  [Vinoth] Put up a 1.0 tech specs doc
    •  Standardization of serialization - log blocks, timeline meta files.
    •  Change Timeline/FileSystemView to support snapshot, incremental, CDC, time-travel queries correctly.
    •  Changes to make multiple base file formats within each file group.
    •  No Java classes
    •  [Danny] Introduce transition time into the active timeline
    •  [Danny] Land LSM Timeline in well-tested, performant shape (HUDI-309, HUDI-6626, this needs an epic ASAP???)
  • Design:
    •  [Sagar] Multi-table transactions? (VC: we have a strawman. but needs an RFC to validate correctness across phantom reads, self-joins, nested queries, and isolation levels)
    •  [Lin] Keys: UUIDs vs. what we do today.
    •  [Danny???] Time-Travel Read (+Write) (resolve HUDI-4500, HUDI-4677 and similar, address branch/merge use-cases)
    •  [Ethan???] Logical partitioning/Index Functions API (Java, Native) and its integration into Spark/Presto/Trino. (HUDI-512)
    •  [Sagar + ???] Schema Evolution and version tracking in MT.
    •  [Vinoth] Lance file format + storing blobs/images.
  • Implementation
    •  [Sagar] RFC-46/RecordMerger API, is this our final choice? cross-platform? only for hoodie.merge.mode=custom? (complete HUDI-3217)
    •  [Sagar] Async indexer is in final shape (complete HUDI-2488)
    •  [Lin] Land Parquet keyed lookup code (???)
    •  [???] Parquet Rewriting at Page Level for Spark Rows (Writer perf) (HUDI-4790)
    •  [Ethan] Implement MoR snapshot query (positional/key based updates, deletes), partial updates, custom merges on new File Format code path.
    •  [Ethan] Implement writers for positional updates, deletes, partial updates, ordering field based merging.
    •  Existing Optimistic Concurrency Control is in final shape (complete HUDI-1456)
  • (Sagar) Open/Risk Items:
    •  _hoodie_operation metafield. Spark/Flink interop.
    •  Are we happy with DT <> MT sync mechanism? does this need to be revisited? (HUDI-2461 + other issues)
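
The ordering-field based merging called out in the implementation items above (RecordMerger API, writer-side merging) can be sketched roughly as follows. This is an illustrative toy, not the actual HoodieRecordMerger interface: the record shape and method names here are hypothetical, and the real API deals in engine-specific record representations.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of ordering-field based merging: for each key, the record
// with the higher ordering value (e.g. an event timestamp) wins.
public class OrderingMergeSketch {
    // Hypothetical record: key, ordering value, payload.
    record Rec(String key, long ordering, String payload) {}

    // Merge the currently stored record with an incoming one.
    static Rec merge(Rec older, Rec newer) {
        return newer.ordering() >= older.ordering() ? newer : older;
    }

    public static void main(String[] args) {
        Map<String, Rec> store = new HashMap<>();
        Rec[] incoming = {
            new Rec("k1", 10, "v1"),
            new Rec("k1", 5,  "stale"),  // lower ordering value: ignored
            new Rec("k1", 20, "v2"),
        };
        for (Rec r : incoming) {
            // Map.merge invokes merge(oldValue, newValue) on key collision.
            store.merge(r.key(), r, OrderingMergeSketch::merge);
        }
        System.out.println(store.get("k1").payload()); // v2
    }
}
```

A custom merge mode would swap the `merge` function for user logic (e.g. partial field updates) while the surrounding key-collision plumbing stays the same.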

Execution Phase 2 (Sept 15-Oct 30)

...

    •  FileGroup APIs in Java

...

...

...

      •  Take HoodieData abstraction to completion and end-to-end row writing for Spark? All write operations work with rows end-to-end (HUDI-4857)

...

...

...

  •  Design

...

    •  General purpose, global timeline (no active vs archived distinction)

...

    •  Non-blocking concurrency control/clustering + updates, inserts + inserts for Spark + Flink.

...

    •  Spark SQL statements to complete DB vision. (vinoth has a list. ???)

...

  •  Implementation

...

    •  Multi-table transaction

...

    •  Implement non-blocking CC for Spark...

...

    •  Secondary indexes (Bloom, RLI, VectorIndex, ..) on Spark read/write path. (HUDI-3907, HUDI-4128)
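
The pruning idea behind a bloom-style secondary index on the read path can be sketched like this (illustrative only; Hudi's metadata-table bloom index is far more involved, and the class and method names here are made up). A small per-file filter answers "might this file contain the key?" so only candidate files get a full lookup.

```java
import java.util.*;

// Toy bloom-filter pruning: one small bit set per "file"; a key lookup
// scans only files whose filter reports a possible match.
public class BloomPruneSketch {
    static final int BITS = 1 << 12;

    // Build a filter by setting two hash-derived bits per key.
    static BitSet build(Collection<String> keys) {
        BitSet bits = new BitSet(BITS);
        for (String k : keys) {
            bits.set(Math.floorMod(k.hashCode(), BITS));
            bits.set(Math.floorMod((k + "#").hashCode(), BITS));
        }
        return bits;
    }

    // False positives are possible; false negatives are not.
    static boolean mightContain(BitSet bits, String key) {
        return bits.get(Math.floorMod(key.hashCode(), BITS))
            && bits.get(Math.floorMod((key + "#").hashCode(), BITS));
    }

    public static void main(String[] args) {
        Map<String, BitSet> fileFilters = new LinkedHashMap<>();
        fileFilters.put("file-1", build(List.of("a", "b")));
        fileFilters.put("file-2", build(List.of("c", "d")));

        // Only matching files remain candidates for the actual key lookup.
        List<String> candidates = new ArrayList<>();
        for (var e : fileFilters.entrySet()) {
            if (mightContain(e.getValue(), "c")) candidates.add(e.getKey());
        }
        System.out.println(candidates);
    }
}
```

A record-level index (RLI) replaces the probabilistic filter with an exact key-to-file mapping, trading index size for zero false positives.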

...

...

...

...

...

    •  Meta Sync to Glue/HMS with reduced storage/API overhead (HUDI-2519, HUDI-5108, HUDI-6488), seamless inc query, cdc query, ro/rt experience

...

    •  Broader Performance improvements (HUDI-3249)

...

    •  Encoding updates as deletes + inserts. (HUDI-6490)
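
The idea in HUDI-6490 of encoding an update as a delete of the old record plus an insert of the new one can be sketched as a simple log replay (record shape here is hypothetical): a delete tombstone cancels the prior version, and the trailing insert supplies the new value.

```java
import java.util.*;

// Toy log replay: updates appear in the log as delete + insert pairs.
public class UpdateAsDeleteInsert {
    record LogEntry(String key, String value, boolean delete) {}

    // Replay the log in order to materialize the current table state.
    static Map<String, String> replay(List<LogEntry> log) {
        Map<String, String> state = new LinkedHashMap<>();
        for (LogEntry e : log) {
            if (e.delete()) state.remove(e.key());
            else state.put(e.key(), e.value());
        }
        return state;
    }

    public static void main(String[] args) {
        List<LogEntry> log = List.of(
            new LogEntry("k1", "v1", false),  // original insert
            new LogEntry("k1", null, true),   // update step 1: delete old version
            new LogEntry("k1", "v2", false)   // update step 2: insert new version
        );
        System.out.println(replay(log)); // {k1=v2}
    }
}
```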

...

    •  SQL experience for timeline, metadata. (HUDI-6498)

...

    •  Introduce TrueTime API or equivalent, to explain the foundations more clearly. (reuse HUDI-3057)
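
A hedged sketch of what a TrueTime-style API gives the timeline (names are hypothetical, not an actual Hudi API): instead of a single instant, `now()` returns an uncertainty interval, and a commit timestamp chosen after waiting out that uncertainty is guaranteed to be in the past for every process, which yields strictly increasing, externally consistent instants.

```java
// Toy TrueTime-like clock with an assumed bound on cross-process clock skew.
public class TrueTimeSketch {
    static final long EPSILON_MS = 7; // assumed max clock error, in millis

    record Interval(long earliest, long latest) {}

    static Interval now() {
        long t = System.currentTimeMillis();
        return new Interval(t - EPSILON_MS, t + EPSILON_MS);
    }

    // Pick a commit timestamp, then "commit wait" until it has certainly elapsed.
    static long commitTime() throws InterruptedException {
        Interval i = now();
        long ts = i.latest();
        while (now().earliest() < ts) {
            Thread.sleep(1);
        }
        return ts;
    }

    public static void main(String[] args) throws InterruptedException {
        long t1 = commitTime();
        long t2 = commitTime();
        System.out.println(t2 > t1); // successive commit times strictly increase
    }
}
```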

...

    •  Introduce HudiStorage APIs to abstract out Hadoop FileSystem. (HUDI-6497)
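
A sketch of the abstraction direction in HUDI-6497 (interface and method names here are hypothetical, not the shipped API): engine code programs against a minimal storage contract instead of Hadoop's FileSystem, so backends (HDFS, object stores, local disk) can be swapped behind one interface.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Minimal storage contract the engine would code against.
interface StorageSketch {
    byte[] readAll(String path) throws IOException;
    void write(String path, byte[] data) throws IOException;
    List<String> list(String dir) throws IOException;
}

// One possible backend: java.nio local files, handy for tests.
class LocalStorageSketch implements StorageSketch {
    public byte[] readAll(String path) throws IOException {
        return Files.readAllBytes(Paths.get(path));
    }
    public void write(String path, byte[] data) throws IOException {
        Files.createDirectories(Paths.get(path).getParent());
        Files.write(Paths.get(path), data);
    }
    public List<String> list(String dir) throws IOException {
        try (Stream<Path> s = Files.list(Paths.get(dir))) {
            return s.map(Path::toString).sorted().collect(Collectors.toList());
        }
    }
}

public class StorageDemo {
    public static void main(String[] args) throws IOException {
        StorageSketch storage = new LocalStorageSketch();
        String dir = Files.createTempDirectory("hudi-sketch").toString();
        storage.write(dir + "/00001.commit", "metadata".getBytes());
        System.out.println(storage.list(dir).size()); // 1
        System.out.println(new String(storage.readAll(dir + "/00001.commit")));
    }
}
```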

Packaging Phase (Nov 1 - Nov 15) (Marked 1.1.0 for now)

  •  Release (if still pending!)

...

  •  Docs

...

  •  Examples

...

...

  •  Site updates

...

  •  Deprecate/Cleanup cWiki

Below the line (Marked 1.1.0 for now)

  •  Unstructured Hudi table.

...

  •  Native HFile reader/writer in Hudi. (VC: This was punted since we'd default to Parquet based MDT)

...

  •  Streaming Performance: optimize the current upsert DAG on MetadataIndex (hybrid of RLI, Bloom Index, ...)

...

  •  Column family use-case (sparse rows on wide tables??)

...

  •  Cool new indexes

...

    •  Spatial Index

...

    •  Search/Lucene Index

...

    •  Bitmap Index

...

  •  Hive Storage Handler

...

  •  Demos

...

...

  •  Dev Hygiene

...

...

  •  Tests

...

...

...

...