Issue Management
...
- Any issue you file, please file as "Issue" and not as "Sub-task" (sub-tasks cannot be added to Epics).
- Please attach issues to an Epic as much as possible, so they do not scatter around (see 1.0 Epics).
- Keep issues unassigned, unless you are about to begin working on it.
- Issues must be tagged with Fix Version/s: 1.0.0 to show up on the board.
- If you have a PR up, please ensure the JIRA is in the "Review" state and mark the "Reviewers" field with whom your review is blocked on.
- Vinoth Chandar will move issues from 1.0.0 to 1.1.0 if they do not seem important.
- Pending project management tasks:
- (Vinoth) Create a "roadmap" in JIRA
- (Vinoth) Go into each Epic deeply and clean up the tasks themselves.
- (Vinoth) Scout for R.M.
...
Roadmap to visualize which epics are in what phase.
Sync Meeting Format
Daily 7pm PST, ping Vinoth Chandar to be added
- Report status and planned next steps; call out any blockers/discussion items (1 min each max)
- Update this execution planner; see if we need to change course and adjust plans
- Discuss blockers; live jams to resolve issues within the bounds of the meeting.
Execution Phase 1 (Aug 15 - Oct 31)
Focus: Spark, Flink (for NB Concurrency Control)
Legend: In progress/on track | Blocked | In progress/slipping | Not started
- (Vinoth) Identify & land all critical outstanding PRs (those that solve critical issues and take us forward on our 1.0 path)
- https://github.com/apache/hudi/pulls?q=is%3Apr+is%3Aopen+label%3Arelease-1.0.0 (Vinoth) to identify.
- (Sagar) Move master to 1.0.0
- (Ethan, Sagar, Vinoth & Danny) Land storage format 1.0 (Complete)
- (Vinoth) Put up a 1.0 tech specs doc; scope this epic tight. (https://issues.apache.org/jira/browse/HUDI-6242)
- (Sagar) Make all the agreed-upon format changes described here.
(HUDI-6776)
- (Ethan) Standardization of serialization - log blocks, timeline meta files.
- Change Timeline/FileSystemView to support snapshot, incremental, CDC, time-travel queries correctly.
- Changes to make multiple base file formats work within each file group.
(HUDI-6824, HUDI-6825, HUDI-6826, HUDI-6850)
- (Sagar) Base file format can be different within file groups
(HUDI-6821)
- (Sagar) No Java classes show up in table properties. (HUDI-5761)
(HUDI-6780)
- (Danny) Introduce transition time into the active timeline
- (Danny) Land LSM Timeline in well-tested, performant shape (HUDI-309, HUDI-6626 - this needs an epic ASAP???)
(HUDI-1623, HUDI-6775)
- (Danny) Remove log block append for multiple commits
(HUDI-6742)
- (Danny) Introduce new completion-time-based file slicing
(HUDI-6642, HUDI-6743)
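One way to picture the completion-time-based file slicing item above: each log file is grouped with the latest base file whose completion time is at or before the log file's own completion time. The sketch below is a simplified, hypothetical model of that grouping rule, not Hudi's actual implementation; all names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of completion-time-based file slicing: a log file
// belongs to the slice of the newest base file completed no later than it.
public class SlicingSketch {
    public static Map<Long, List<Long>> slice(
            List<Long> baseCompletionTimes, List<Long> logCompletionTimes) {
        TreeMap<Long, List<Long>> slices = new TreeMap<>();
        for (long b : baseCompletionTimes) {
            slices.put(b, new ArrayList<>()); // one slice per base file
        }
        for (long log : logCompletionTimes) {
            // floorEntry finds the latest base completion time <= log time.
            Map.Entry<Long, List<Long>> owner = slices.floorEntry(log);
            if (owner != null) {
                owner.getValue().add(log); // attach log file to owning slice
            }
        }
        return slices;
    }

    public static void main(String[] args) {
        // Base files completed at t=100 and t=200; logs completed at 150, 250.
        Map<Long, List<Long>> s = slice(List.of(100L, 200L), List.of(150L, 250L));
        System.out.println(s); // {100=[150], 200=[250]}
    }
}
```

Slicing by completion time (rather than begin time) is what lets concurrent writers append logs without blocking: a slice's membership is decided only once instants actually complete.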
- Design:
- (Sagar) Multi-table transactions? (HUDI-6709) (VC: we have a strawman, but it needs an RFC to validate correctness across phantom reads, self-joins, nested queries, and isolation levels)
- (Lin) Keys: UUIDs vs. what we do today.
(HUDI-6701)
- (Vinoth) Put up a 1.0 tech specs doc
(HUDI-6706)
- (Vinoth/Danny???) OCC/Time-Travel Read (+Write) (resolve HUDI-4500, HUDI-4677 and similar; address branch/merge use-cases)
(HUDI-4677)
- (Vinoth/Danny) Time-Travel read on NB CC & finalize NB CC design
- (Danny) TrueTime API implementation for Hudi (wait based, or filesystem/stateless based)
- (Vinoth/Shawn) Cloud native storage layout design (Udit's RFC-60)
- (Sagar/Vinoth/Ethan???) Logical partitioning/Index Functions API (Java, Native) and its integration into Spark/Presto/Trino. (HUDI-512)
- (Sagar + ???) Schema Evolution and version tracking in MT.
- (Vinoth) Lance file format + storing blobs/images.
- (Vinoth) Are we happy with the DT <> MT sync mechanism? Does this need to be revisited? (HUDI-2461 + other issues with Flink OCC)
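The wait-based option for the TrueTime API item above can be sketched roughly as follows: take a timestamp, then wait out the maximum assumed clock skew before using it, so that once the call returns no correct clock in the cluster can still read an earlier value. Class and method names here are hypothetical, not Hudi's actual API, and the skew bound is an assumed configuration.

```java
// Illustrative wait-based TrueTime-style timestamp generator (names
// hypothetical). After generateTime() returns, every clock within the
// assumed skew bound reads a value >= the returned timestamp.
public class TrueTimeSketch {
    private final long maxClockSkewMs; // assumed cluster-wide skew bound

    public TrueTimeSketch(long maxClockSkewMs) {
        this.maxClockSkewMs = maxClockSkewMs;
    }

    /** Returns a timestamp guaranteed to be in the past once this returns. */
    public long generateTime() {
        long t = System.currentTimeMillis();
        try {
            // Wait out the uncertainty window before handing out t.
            Thread.sleep(maxClockSkewMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return t;
    }

    public static void main(String[] args) {
        TrueTimeSketch tt = new TrueTimeSketch(50);
        long first = tt.generateTime();
        long second = tt.generateTime();
        System.out.println("monotonic: " + (second >= first));
    }
}
```

The filesystem/stateless alternative mentioned in the item would derive the ordering from storage-side timestamps instead of waiting; either way the goal is the same monotonicity guarantee for instant completion times.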
- Implementation
- (Lin) Finalize RFC
- (Sagar) RFC-46/RecordMerger API: is this our final choice? Cross-platform support? Only invoked for hoodie.merge.mode=custom? (complete HUDI-3217)
- (Sagar) Async indexer is in final shape (complete HUDI-2488)
- (Lin) Land Parquet keyed lookup code (???)
- (Danny) Flink/Non-blocking CC (HUDI-5672, HUDI-6640, HUDI-6495 )
- [???] Parquet Rewriting at Page Level for Spark Rows (Writer perf) (HUDI-4790)
(HUDI-6702, HUDI-6765, HUDI-6784, HUDI-5249, HUDI-5807, HUDI-6767)
- (Ethan) Implement MoR snapshot query (positional/key-based updates, deletes), partial updates, custom merges on the new File Format code path.
- (Ethan) Implement writers for positional updates, deletes, partial updates, ordering-field-based merging.
- Existing Optimistic Concurrency Control is in final shape (complete HUDI-1456)
(HUDI-6796, HUDI-6797, HUDI-6798, HUDI-6801)
- (Danny) Implement non-blocking CC for Spark; parity with what Flink has.
- (Lin) Implement a uniform way to fetch incremental data files for reads based on the new timeline (https://issues.apache.org/jira/browse/HUDI-2750)
- <what are some other code refactorings to burn down?> (HUDI-2261, HUDI-6243, HUDI-3614, HUDI-4444, HUDI-4756)
- (Sagar) Open/Risk Items:
- _hoodie_operation metafield; Spark/Flink interop.
- Are we happy with the DT <> MT sync mechanism? Does this need to be revisited? (HUDI-2461 + other issues)
- Are we happy with how log compaction is implemented? (https://issues.apache.org/jira/browse/HUDI-3580)
- Should we retain virtual keys support? https://issues.apache.org/jira/browse/HUDI-2235
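The RFC-46/RecordMerger item in the implementation list above revolves around one pluggable decision: given a stored record and an incoming record for the same key, which survives (or should the key be deleted)? The sketch below illustrates that shape with a default ordering-field merge of the kind a hoodie.merge.mode=custom implementation would override; the interface and record layout are hypothetical simplifications, not Hudi's exact RecordMerger signature.

```java
import java.util.Optional;

// Illustrative sketch of a pluggable record-merge step like the RFC-46
// RecordMerger discussed above. Interface and record shape are hypothetical.
public class MergerSketch {
    record Rec(String key, long orderingField, String payload) {}

    interface RecordMerger {
        // Returns the record that should survive, or empty to delete the key.
        Optional<Rec> merge(Rec older, Rec newer);
    }

    // Default ordering-field based merge: the record with the larger
    // ordering value wins, so late-arriving stale updates are dropped.
    static final RecordMerger ORDERING_FIELD_MERGER =
        (older, newer) -> Optional.of(
            newer.orderingField() >= older.orderingField() ? newer : older);

    public static void main(String[] args) {
        Rec stored   = new Rec("k1", 5, "v1");
        Rec incoming = new Rec("k1", 3, "v2"); // late-arriving, lower ordering value
        Rec winner = ORDERING_FIELD_MERGER.merge(stored, incoming).orElseThrow();
        System.out.println(winner.payload()); // stored record wins: prints "v1"
    }
}
```

The cross-platform question in the item is exactly about this boundary: a merger expressed only as JVM code cannot be invoked from non-Java readers, which is why restricting custom mergers to an explicit merge mode matters.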
Execution Phase 2 (Sept 15-Oct 30)
- (Ethan) Implement writers for positional updates, deletes, partial updates, ordering field-based merging.
(HUDI-6653, HUDI-6795, HUDI-6800)
- (Ethan) Implement engine-agnostic FileGroup Read APIs across Spark/Hive
(HUDI-6785)
- (Ethan/Lin?) Implement different query types in the new FileGroup reader for Spark
(HUDI-6786, HUDI-6789, HUDI-6790, HUDI-6792, HUDI-6793, HUDI-6794, HUDI-6802)
- (Sagar) Async indexer is in final shape (complete HUDI-2488)
- (Sagar) Existing Optimistic Concurrency Control is in final shape (complete HUDI-1456)
- (Danny) Land LSM Timeline in well-tested, performant shape (HUDI-309)
- (Danny) Flink/Non-blocking CC (HUDI-6640, HUDI-6495 )
- (Ethan) Implement writers for positional updates, deletes, partial updates, ordering field-based merging.
Execution Phase 3 (Nov 1 - Nov 30)
- Pre-work
- (Vinoth/Balaji) Land all relevant PRs (https://issues.apache.org/jira/browse/HUDI-4141)
- External FileGroup APIs in Java, for metadata, timeline, file group read/write
- Rust/C++ APIs for Timeline, Metadata, FileGroup Read/Write (https://issues.apache.org/jira/browse/HUDI-6486)
- Internal APIs/Abstractions/Code Refactoring (https://issues.apache.org/jira/browse/HUDI-6243)
- Take the HoodieData abstraction to completion and end-to-end row writing for Spark? All write operations work with rows end-to-end (HUDI-4857) HUDI-43
- HoodieSchema? (https://issues.apache.org/jira/browse/HUDI-6499)
- <what are some other code refactorings to burn down?> (HUDI-2261, HUDI-6243, HUDI-3614, HUDI-4444, HUDI-4756)
- Introduce HudiStorage APIs to abstract out Hadoop FileSystem. (HUDI-6497)
- APIs:
- (Vinoth) General-purpose, global timeline (no active vs. archived distinction) (HUDI-309, HUDI-6698)
- (Vinoth) Non-blocking concurrency control/clustering + updates, inserts + inserts for Spark + Flink.
- (Vinoth) Spark SQL statements to complete the DB vision. (Vinoth has a list ???)
- Multi-table transaction
- Implement Non blocking CC for Spark...
- Secondary indexes (Bloom, RLI, VectorIndex, ..) on Spark read/write path. (HUDI-3907, HUDI-4128)
- MT integration across Presto, Trino (HUDI-4552, HUDI-4394)
- Presto : Snapshot, Incremental, Time Travel, CDC queries (on MT) (https://issues.apache.org/jira/browse/HUDI-3210)
- Trino: (repeat above https://issues.apache.org/jira/browse/HUDI-2687)
- Minimize configs and cleanup defaults (https://issues.apache.org/jira/browse/HUDI-1239)
- (Vinoth) Lance file format + storing blobs/images. (Needs an epic)
- (Vinoth) Redesign Hudi MT as an internal partition of the data table, exposing "files" metadata alone outside (HUDI-2461 etc)
- (Vinoth) Backwards compatibility testing. 1.0 reader can read 0.x format? reader/writer/table version?
- Design
- (Vinoth) General-purpose, global timeline (no active vs. archived distinction) (HUDI-309, HUDI-6698)
- Implementation
- (Sagar/Jon) Schema Evolution and version tracking in MT.
(HUDI-6778)
- (Sagar/Jon) Schema-on-read support
- (??) MT <> DT redesign
- (Lin) Land Parquet keyed lookup code (???)
- MT/RLI on Parquet base files
- (???) Introduce TrueTime API or equivalent, to explain the foundations more clearly. (reuse HUDI-3057)
- (Danny) Follow ups on LSM Timeline.
(HUDI-6698)
- (Vinoth) Implement DataFrame-based write path; take the HoodieData abstraction to completion and end-to-end row writing for Spark. All write operations work with rows end-to-end (HUDI-4857)
- (Danny) Change Timeline/FileSystemView to support snapshot, incremental, CDC, time-travel queries correctly based on completion time
- (Sagar) Secondary indexes (Bloom, RLI, VectorIndex, ..) on Spark read/write path. (HUDI-3907, HUDI-4128)
- (Sagar) Meta Sync to Glue/HMS with reduced storage/API overhead (HUDI-2519, HUDI-5108, HUDI-6488), seamless inc query, cdc query, ro/rt experience
- Broader Performance improvements (HUDI-3249)
- Encoding updates as deletes + inserts. (HUDI-6490)
- (Lin) SQL experience for timeline, metadata. (HUDI-6498)
- Introduce TrueTime API or equivalent, to explain the foundations more clearly. (reuse HUDI-3057)
- Introduce HudiStorage APIs to abstract out Hadoop FileSystem. (HUDI-6497)
- (Sagar/Jon) Schema Evolution and version tracking in MT.
...
- [Rajesh???] Parquet Rewriting at Page Level for Spark Rows (Writer perf) (HUDI-4790)
- Minimize configs and cleanup defaults (https://issues.apache.org/jira/browse/HUDI-1239)
- Open/Risk Items:
- (Ethan/Danny) _hoodie_operation metafield; Spark/Flink interop.
- (Sagar) Are we happy with how log compaction is implemented? (https://issues.apache.org/jira/browse/HUDI-3580)
- (Vinoth) Should we retain virtual keys support? https://issues.apache.org/jira/browse/HUDI-2235
GA Phase (Dec 1 - Dec 31) (Marked 1.1.0 for now)
- Release (if still pending!)
- Docs
- Examples
- Bundles & Packages (HUDI-3529)
- Site updates
- Deprecate/Cleanup cWiki
Below the line (Marked 1.1.0 for now)
- Unstructured Hudi table.
- Rust/C++ APIs for Timeline, Metadata, FileGroup Read/Write (https://issues.apache.org/jira/browse/HUDI-6486)
- Multi-table transaction
- Broader Performance improvements (HUDI-3249).
- Encoding updates as deletes + inserts. (HUDI-6490)
- Native HFile reader/writer in Hudi. (VC: This was punted since we'd default to Parquet based MDT)
- Streaming Performance: optimize the current upsert DAG on MetadataIndex (hybrid of RLI, Bloom Index, ....)
- Column family use-case (sparse rows on wide tables??)
- Cool new indexes
- Spatial Index
- Search/Lucene Index
- Bitmap Index
- Hive Storage Handler
- MT integration across Presto, Trino (HUDI-4552, HUDI-4394)
- Presto : Snapshot, Incremental, Time Travel, CDC queries (on MT) (https://issues.apache.org/jira/browse/HUDI-3210)
- Trino: (repeat above https://issues.apache.org/jira/browse/HUDI-2687)
- Demos
- Killer dbt demo (https://issues.apache.org/jira/browse/HUDI-6586)
- Dev Hygiene
- Tests