Issue Management
Actual issue tracking is in Apache JIRA! We use this page to ground ourselves.
...
- (Vinoth) Identify & land all critical outstanding PRs (that solve critical issues, take us forward in our 1.0 path)
- Vinoth to identify.
- [Sagar] Move master to 1.0.0
- (Ethan & Vinoth & Danny) Land storage format 1.0 (Complete)
- [Vinoth] Put up a 1.0 tech specs doc
- Make all format changes described here. https://issues.apache.org/jira/browse/HUDI-6242
- Standardization of serialization - log blocks, timeline meta files.
- Change Timeline/FileSystemView to support snapshot, incremental, CDC, time-travel queries correctly.
- Changes to make multiple base file formats within each file group.
- No Java classes show up in table properties. HUDI-5761
- [Danny] Introduce transition time into the active timeline
- Design:
- [Sagar] Multi-table transactions? (VC: we have a strawman, but it needs an RFC to validate correctness across phantom reads, self-joins, nested queries, and isolation levels)
- [Lin] Keys: UUIDs vs. what we do today.
- [Danny???] Time-Travel Read (+Write) (resolve HUDI-4500, HUDI-4677 and similar, address branch/merge use-cases)
- [Ethan???] Logical partitioning/Index Functions API (Java, Native) and its integration into Spark/Presto/Trino. (HUDI-512)
- [Shawn] Cloud native storage layout design (Udit's RFC-60)
- [Sagar + ???] Schema Evolution and version tracking in MT.
- [Vinoth] Lance file format + storing blobs/images.
- Implementation
- [Sagar] RFC-46/RecordMerger API, is this our final choice? cross-platform? only for hoodie.merge.mode=custom? (complete HUDI-3217)
- [Sagar] Async indexer is in final shape (complete HUDI-2488)
- [Lin] Land Parquet keyed lookup code (???)
- [???] Parquet Rewriting at Page Level for Spark Rows (Writer perf) (HUDI-4790)
- [Ethan] Implement MoR snapshot query (positional/key based updates, deletes), partial updates, custom merges on new File Format code path.
- [Ethan] Implement writers for positional updates, deletes, partial updates, ordering field based merging.
- Existing Optimistic Concurrency Control is in final shape (complete HUDI-1456)
- Implement a uniform way to fetch incremental data files based on new timeline (https://issues.apache.org/jira/browse/HUDI-2750)
- (Sagar) Open/Risk Items:
- _hoodie_operation metafield. Spark/Flink interop.
- Are we happy with DT <> MT sync mechanism? does this need to be revisited? (HUDI-2461 + other issues)
- Are we happy with how log compaction is implemented? (https://issues.apache.org/jira/browse/HUDI-3580)
- Should we retain virtual keys support? https://issues.apache.org/jira/browse/HUDI-2235
Execution Phase 2 (Sept 15-Oct 30)
...
- FileGroup APIs in Java
...
- Rust/C++ APIs for Timeline, Metadata, FileGroup Read/Write (https://issues.apache.org/jira/browse/HUDI-6486)
...
- Internal APIs/Abstractions/Code Refactoring (https://issues.apache.org/jira/browse/HUDI-6243)
...
- Take HoodieData abstraction to completion and end-end row writing for Spark? All write operations work with rows end-end (HUDI-4857)
...
...
- HoodieSchema ? https://issues.apache.org/jira/browse/HUDI-6499
...
- Design
...
- General purpose, global timeline (no active vs archived distinction)
...
- Non-blocking concurrency control/clustering + updates, inserts + inserts for Spark + Flink.
...
- Spark SQL statements to complete DB vision. (Vinoth has a list. ???)
...
- Implementation
...
- Multi-table transaction
...
- Implement Non blocking CC for Spark...
...
...
...
- Presto : Snapshot, Incremental, Time Travel, CDC queries (on MT) (https://issues.apache.org/jira/browse/HUDI-3210)
...
- Trino: (repeat above https://issues.apache.org/jira/browse/HUDI-2687)
...
- Minimize configs and cleanup defaults (https://issues.apache.org/jira/browse/HUDI-1239)
...
...
- Broader Performance improvements (HUDI-3249)
...
- Encoding updates as deletes + inserts. (HUDI-6490)
...
- SQL experience for timeline, metadata. (HUDI-6498)
...
- Introduce TrueTime API or equivalent, to explain the foundations more clearly. (reuse HUDI-3057)
...
- Introduce HudiStorage APIs to abstract out Hadoop FileSystem. (HUDI-6497)
Packaging Phase (Nov 1 - Nov 15) (Marked 1.1.0 for now)
- Release (if still pending!)
...
- Docs
...
- Examples
...
- Bundles & Packages (HUDI-3529)
...
- Site updates
...
- Deprecate/Cleanup cWiki
Below the line (Marked 1.1.0 for now)
- Unstructured Hudi table.
...
- Native HFile reader/writer in Hudi. (VC: This was punted since we'd default to Parquet based MDT)
...
- Streaming Performance: optimize the current upsert DAG on MetadataIndex (hybrid of RLI, Bloom Index, ....)
...
- Column family use-case (sparse rows on wide tables??)
...
- Cool new indexes
...
- Spatial Index
...
- Search/Lucene Index
...
- Bitmap Index
...
- Hive Storage Handler
...
- Demos
...
- Killer dbt demo (https://issues.apache.org/jira/browse/HUDI-6586)
...
- Dev Hygiene
...
...
- Tests
...
- Reduce test runtime. HUDI-1574
...
...
...