...
- Report status, planned next steps, call out any blockers/discussion items (1 min each max)
- Update this execution planner, see if we need to change course, adjust plans
- Discuss blockers; live jams to resolve issues within the bounds of the meeting.
Execution Phase 1 (Aug 15-
...
Oct 31)
Focus: Spark, Flink (for NB Concurrency Control)
- Status legend: in progress/on track, blocked, in progress/slipping, not started
- (Vinoth) Identify & land all critical outstanding PRs (that solve critical issues, take us forward in our 1.0 path)
- (Vinoth) to identify. https://github.com/apache/hudi/pulls?q=is%3Apr+is%3Aopen+label%3Arelease-1.0.0
- Land all relevant PRs
- (Sagar) Move master to 1.0.0
- (Sagar & Vinoth & Danny) Land storage format 1.0
- (Sagar) Make all format changes described here. (Vinoth) Put up a 1.0 tech specs doc. Scope this epic tight. https://issues.apache.org/jira/browse/HUDI-6242
- Standardization: make all the agreed-upon format changes described here. (HUDI-6776) (Sagar)
- (Ethan) Standardization of serialization - log blocks, timeline meta files.
- (Sagar) Change Timeline/FileSystemView to support snapshot, incremental, CDC, time-travel queries correctly.
- (Danny) Introduce TrueTime API or equivalent, to explain the foundations more clearly. (reuse HUDI-3057)
- (Sagar) Changes to support multiple base file formats within each file group. (HUDI-6824, HUDI-6825, HUDI-6826, HUDI-6850)
- (Sagar) Base file format can be different within file groups (HUDI-6821)
- (Sagar) No Java classes show up in table properties. (HUDI-5761, HUDI-6780)
- Introduce transition time into the active timeline (HUDI-1623, HUDI-6775) (Danny)
- (Danny) Land LSM Timeline in well-tested, performant shape (HUDI-309, HUDI-6626, HUDI-6698)
- Remove log block append for multiple commits (HUDI-6742)
- (Danny) Introduce new completion-time-based file slicing (HUDI-6642, HUDI-6743)
- Design:
- (Sagar) Multi-table transactions? (HUDI-6709)
- (Lin) Keys: UUIDs vs. what we do today. (HUDI-6701)
- (Vinoth) Put up a 1.0 tech specs doc (HUDI-6706)
- (Vinoth???) OCC/Time-Travel Read (+Write) (address branch/merge use-cases) (HUDI-4677)
- (Vinoth/Danny) Time-Travel read on NB CC & finalize NB CC design
- (Danny) TrueTime API implementation for Hudi (wait based, or filesystem/stateless based)
- (Vinoth/Shawn) Cloud native storage layout design (Udit's RFC-60)
- (Ethan???) (Sagar/Vinoth) Logical partitioning/Index Functions API (Java, Native) and its integration into Spark/Presto/Trino. (HUDI-512)
- (Sagar + ???) Schema Evolution and version tracking in MT.
- (Vinoth) Are we happy with the DT <> MT sync mechanism? Does this need to be revisited? (HUDI-2461 + other issues with Flink OCC)
- Implementation
- Finalize RFC-46/RecordMerger API, cross-platform support, only invoked for hoodie.merge.mode=custom? (complete HUDI-3217) (HUDI-6702, HUDI-6765, HUDI-6784, HUDI-5249, HUDI-5807, HUDI-6767) (Lin)
- (Ethan) Implement MoR snapshot query (positional/key based updates, deletes), partial updates, custom merges on new File Format code path. (HUDI-6796, HUDI-6797, HUDI-6798, HUDI-6801)
- (Danny) Implement Non-blocking CC for Spark. Parity with what Flink has.
- (Lin) Implement a uniform way to fetch/read incremental data files based on the new timeline (https://issues.apache.org/jira/browse/HUDI-2750)
- Implement writers for positional updates, deletes, partial updates, ordering field-based merging. (HUDI-6653, HUDI-6795, HUDI-6800) (Ethan)
- Implement engine-agnostic FileGroup Read APIs across Spark/Hive (Ethan)
- (Vinoth) Implement DataFrame based write path; Take HoodieData abstraction to completion and end-end row writing for Spark? All write operations work with rows end-end (HUDI-4857)
- (Sagar) Async Hive (HUDI-6785)
- (Ethan/Lin?) Implement different query types in new FileGroup reader for Spark (HUDI-6786, HUDI-6789, HUDI-6790, HUDI-6792, HUDI-6793, HUDI-6794, HUDI-6802)
- (Sagar) Async indexer is in final shape (complete HUDI-2488)
- (Sagar) Existing Optimistic Concurrency Control is in final shape (complete HUDI-1456)
- (Lin) Land Parquet keyed lookup code (???)
- (Danny) Land LSM Timeline in well-tested, performant shape (HUDI-309)
- Flink/Non-blocking CC (HUDI-6640, HUDI-6495) (Danny)
- [???] Parquet Rewriting at Page Level for Spark Rows (Writer perf) (HUDI-4790) <what other code refactoring is there to burn down?> (HUDI-2261, HUDI-6243, HUDI-3614, HUDI-4444, HUDI-4756)
- (Sagar) Open/Risk Items:
- (Ethan/Danny) _hoodie_operation metafield. Spark/Flink interop.
- (Vinoth) Are we happy with the DT <> MT sync mechanism? Does this need to be revisited? (HUDI-2461 + other issues with Flink OCC)
- (Vinoth) Are we happy with how log compaction is implemented? (https://issues.apache.org/jira/browse/HUDI-3580)
- (Vinoth) Should we retain virtual keys support? https://issues.apache.org/jira/browse/HUDI-2235
Execution Phase 2 (Nov 1-Nov 30)
- Pre-work
- (Vinoth/Balaji) Land all relevant PRs
- APIs: (https://issues.apache.org/jira/browse/HUDI-4141)
- FileGroup APIs
- External APIs in Java for metadata, timeline, file groups r/w
- Rust/C++ APIs for Timeline, Metadata, FileGroup Read/Write (https://issues.apache.org/jira/browse/HUDI-6486)
- Internal APIs/Abstractions/Code Refactoring (https://issues.apache.org/jira/browse/HUDI-6243)
- Design
- (Vinoth) General purpose, global timeline (no active vs archived distinction) (HUDI-309, HUDI-6698)
- (Vinoth) Non-blocking concurrency control/clustering + updates, inserts + inserts for Spark + Flink.
- (Vinoth) Spark SQL statements to complete DB vision. (Vinoth has a list. ???)
- (Vinoth) Lance file format + storing blobs/images. (Needs an epic)
- (Vinoth) Redesign Hudi MT as an internal partition of the data table, exposing "files" metadata alone outside (HUDI-2461 etc)
- (Vinoth) Backwards compatibility testing. 1.0 reader can read 0.x format? reader/writer/table version?
- Multi-table transaction
- (Sagar/Jon) Schema Evolution and version tracking in MT. (HUDI-6778)
- (Sagar/Jon) Schema on read support
- (??) MT <> DT redesign
- (Lin) Land Parquet keyed lookup code (???)
- MT/RLI on Parquet base files
- (???) Introduce TrueTime API or equivalent, to explain the foundations more clearly. (reuse HUDI-3057)
- (Danny) Follow-ups on LSM Timeline. (HUDI-6698)
- (Vinoth) Implement DataFrame based write path; Take HoodieData abstraction to completion and end-end row writing for Spark? All write operations work with rows end-end (HUDI-4857)
- (Danny) Change Timeline/FileSystemView to support snapshot, incremental, CDC, time-travel queries correctly based on completion time
- (Sagar) Secondary indexes (Bloom, RLI, VectorIndex, ..) on Spark read/write path. (HUDI-3907, HUDI-4128)
- Minimize configs and cleanup defaults (https://issues.apache.org/jira/browse/HUDI-1239)
- (Sagar) Meta Sync to Glue/HMS with reduced storage/API overhead (HUDI-2519, HUDI-5108, HUDI-6488), seamless inc query, cdc query, ro/rt experience
- Broader Performance improvements (HUDI-3249)
- Encoding updates as deletes + inserts. (HUDI-6490)
- (Lin) SQL experience for timeline, metadata. (HUDI-6498)
- Introduce HudiStorage APIs to abstract out Hadoop FileSystem. (HUDI-6497)
- [Rajesh???] Parquet Rewriting at Page Level for Spark Rows (Writer perf) (HUDI-4790)
- MT integration across Presto, Trino (HUDI-4552, HUDI-4394)
- Presto: Snapshot, Incremental, Time Travel, CDC queries (on MT) (https://issues.apache.org/jira/browse/HUDI-3210)
- Trino: (repeat above https://issues.apache.org/jira/browse/HUDI-2687)
- Minimize configs and cleanup defaults (https://issues.apache.org/jira/browse/HUDI-1239)
- Implementation
- Open/Risk Items:
- (Ethan/Danny) _hoodie_operation metafield. Spark/Flink interop.
- (Sagar) Are we happy with how log compaction is implemented? (https://issues.apache.org/jira/browse/HUDI-3580)
- (Vinoth) Should we retain virtual keys support? https://issues.apache.org/jira/browse/HUDI-2235
...
GA Phase (
...
Dec 1-
...
Dec 31)(Marked 1.1.0 for now)
- Release (if still pending!)
- Docs
- Examples
- Bundles & Packages (HUDI-3529)
- Site updates
- Deprecate/Cleanup cWiki
...
- Unstructured Hudi table.
- Implement Non blocking CC for Spark...
- Rust/C++ APIs for Timeline, Metadata, FileGroup Read/Write (https://issues.apache.org/jira/browse/HUDI-6486)
- Multi-table transaction
- Broader Performance improvements (HUDI-3249).
- Encoding updates as deletes + inserts. (HUDI-6490)
- Native HFile reader/writer in Hudi. (VC: This was punted since we'd default to Parquet based MDT)
- Streaming Performance: optimize the current upsert DAG on MetadataIndex (hybrid of RLI, Bloom Index, ....)
- Column family use-case (sparse rows on wide tables??)
- Cool new indexes
- Spatial Index
- Search/Lucene Index
- Bitmap Index
- Hive Storage Handler
- MT integration across Presto, Trino (HUDI-4552, HUDI-4394)
- Presto : Snapshot, Incremental, Time Travel, CDC queries (on MT) (https://issues.apache.org/jira/browse/HUDI-3210)
- Trino: (repeat above https://issues.apache.org/jira/browse/HUDI-2687)
- Demos
- Killer dbt demo (https://issues.apache.org/jira/browse/HUDI-6586)
- Dev Hygiene
- Tests