In this RFC, we propose to implement Option 1. In the future, we will add support for Option 2 as well as Option 3 in the future.

An Example of Concurrency control with Priority for clustering/compactions

Dependency/Follow up

With multiple writers, there is no guarantee on ordered completion of commits (for conflict-free multiple writers). Incremental reads depend on monotonically increasing timestamps to ensure no insert/update is missed. This guarantee will be violated with multiple writers. To be able to use multiple writers, we will introduce a new notion to the hoodie timeline, END_COMMIT_TIME along with the START_COMMIT_TIME that is already used for performing commits. Incremental Reads will depend on END_COMMIT_TIME to tail data and also publish them to downstream systems to be able to checkpoint them.

An Example of Concurrency control with Priority for clustering/compactions

Clustering is scheduled (files
Clustering is scheduled (files f1,f2,f3 -> g1)
Clustering moves inflight
Ingestion writes scheduled
Ingestion writes moved to inflight
Ingestion has updates for f1
Clustering finished after taking a lock and checking no other commit has succeeded, put the REPLACE file with mapping f1,f2,f3 -> g1 or put this information in the consolidated metadata
Ingestion tries to finish acquires a lock
Finds clustering has finished in the meantime, finds file level conflict, overlapping f1
If a higher priority was given to the new writer over clustering, one possible implementation is as follows:Write a new REPLACE metadata to reverse the mapping of f1,f2,f3 -> g1 that was done before.
NOTE that the entire mapping needs to be reversed since records can go from M file groups to N file groups
Need to ensure that the previous version (before clustering) is not cleaned
During the merge of consolidated metadata or the REPLACE timeline, this scenario of "revert" needs to be handled.
Side effect

Redundant clustering operation, previous one’s work is not used, another one needs to be scheduled
Queries will ping pong back and forth between different number of files, layout

Rollout/Adoption Plan

Once the proposed solution is implemented, users will be able to run jobs in parallel by simply launching multiple writers. Upgrading to the latest Hudi release providing this feature will automatically upgrade your timeline layout.

Failures & Build rollback

TBD

Turning on Scheduling Operations

By default, the scheduling of operations will be enabled for any job for backwards compatibility for current users. The users need to ensure they turn off scheduling and then turn it on only for the dedicated job.

Test Plan

Unit tests and Test suite integration

Dependency/Follow up

)
Clustering moves inflight
Ingestion writes scheduled
Ingestion writes moved to inflight
Ingestion has updates for f1
Clustering finished after taking a lock and checking no other commit has succeeded, put the REPLACE file with mapping f1,f2,f3 -> g1 or put this information in the consolidated metadata
Ingestion tries to finish acquires a lock
Finds clustering has finished in the meantime, finds file level conflict, overlapping f1
If a higher priority was given to the new writer over clustering, one possible implementation is as follows:
1. Write a new REPLACE metadata to reverse the mapping of f1,f2,f3 -> g1 that was done before.
2. NOTE that the entire mapping needs to be reversed since records can go from M file groups to N file groups
3. Need to ensure that the previous version (before clustering) is not cleaned
During the merge of consolidated metadata or the REPLACE timeline, this scenario of "revert" needs to be handled.
Side effect

Redundant clustering operation, previous one’s work is not used, another one needs to be scheduled
Queries will ping pong back and forth between different number of files, layout

Rollout/Adoption Plan

Once the proposed solution is implemented, users will be able to run jobs in parallel by simply launching multiple writers. Upgrading to the latest Hudi release providing this feature will automatically upgrade your timeline layout.

Failures & Build rollback

TBD

Turning on Scheduling Operations

By default, the scheduling of operations will be enabled for any job for backwards compatibility for current users. The users need to ensure they turn off scheduling and then turn it on only for the dedicated job.

Test Plan

Unit tests and Test suite integrationWith multiple writers, there is no guarantee on ordered completion of commits (for conflict-free multiple writers). Incremental reads depend on monotonically increasing timestamps to ensure no insert/update is missed. This guarantee will be violated with multiple writers. To be able to use multiple writers, we will introduce a new notion to the hoodie timeline, END_COMMIT_TIME along with the START_COMMIT_TIME that is already used for performing commits. Incremental Reads will depend on END_COMMIT_TIME to tail data and also publish them to downstream systems to be able to checkpoint them.

Space shortcuts

Page tree

Versions Compared

Old Version 16

New Version 17

Key

An Example of Concurrency control with Priority for clustering/compactions

Dependency/Follow up

An Example of Concurrency control with Priority for clustering/compactions

Rollout/Adoption Plan

Failures & Build rollback

Turning on Scheduling Operations

Test Plan

Dependency/Follow up

Rollout/Adoption Plan

Failures & Build rollback

Turning on Scheduling Operations

Test Plan

Space shortcuts

Page tree

Page History

Versions Compared

Old Version 16

New Version 17

Key

An Example of Concurrency control with Priority for clustering/compactions

Dependency/Follow up

An Example of Concurrency control with Priority for clustering/compactions

Rollout/Adoption Plan

Failures & Build rollback

Turning on Scheduling Operations

Test Plan

Dependency/Follow up

Rollout/Adoption Plan

Failures & Build rollback

Turning on Scheduling Operations

Test Plan