...
All changes are backwards compatible. This is an add-on feature, and no changes to the existing checkpoint operations are required.
Test Plan
WIP
Analysis
A key part of the core Startpoint feature is that each task instance must fetch the appropriate Startpoint for an SSP even when the Startpoint is stored in the Metadata Store without any association to a specific task. Two approaches, fan-out and intent-ACK, have been explored; the analysis of each is detailed in the following subsections.
Fan-out
Loading Startpoints
Committing Startpoints
Pros
- Follows the natural progression of the JobCoordinator calculating the job model and then applying the info in the job model to fan out the SSP-only keyed startpoints to SSP+TaskName keyed startpoints.
- Cleaner and simpler bookkeeping of startpoints. SSP-only keyed startpoints are deleted after fan-out, and SSP+TaskName keyed startpoints are deleted upon checkpoint commit.
Cons
- May not work with a custom grouper combined with the Passthrough Job Coordinator. The default grouper for Passthrough does not support broadcast streams, so the SSP fan-out to tasks across containers is mutually exclusive, but that may not hold for custom groupers.
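The fan-out bookkeeping described above can be illustrated with a minimal in-memory sketch. Everything here is hypothetical: the class `FanOutSketch`, its method names, the `|` key separator, and the string-valued startpoints are illustrative stand-ins, not Samza's actual API or metadata store layout. The job model is reduced to a map of task name to assigned SSPs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of the fan-out flow; not Samza's actual API. */
class FanOutSketch {
    // Stand-in for the metadata store: key -> serialized startpoint.
    private final Map<String, String> store = new HashMap<>();

    // SSP-only key when taskName is null, otherwise an SSP+TaskName key.
    private static String key(String ssp, String taskName) {
        return taskName == null ? ssp : ssp + "|" + taskName;
    }

    /** A user writes a startpoint for an SSP without naming a task. */
    void writeStartpoint(String ssp, String startpoint) {
        store.put(key(ssp, null), startpoint);
    }

    /** Job coordinator step, run after the job model (task -> SSPs) is known. */
    void fanOut(Map<String, List<String>> taskToSsps) {
        // Pass 1: copy each SSP-only startpoint to every task assigned that SSP.
        for (Map.Entry<String, List<String>> entry : taskToSsps.entrySet()) {
            for (String ssp : entry.getValue()) {
                String startpoint = store.get(key(ssp, null));
                if (startpoint != null) {
                    store.put(key(ssp, entry.getKey()), startpoint);
                }
            }
        }
        // Pass 2: delete the SSP-only entries that were fanned out.
        for (List<String> ssps : taskToSsps.values()) {
            for (String ssp : ssps) {
                store.remove(key(ssp, null));
            }
        }
    }

    /** A task loads the startpoint addressed to it, or null if there is none. */
    String load(String ssp, String taskName) {
        return store.get(key(ssp, taskName));
    }

    /** On checkpoint commit the task's startpoints are consumed and deleted. */
    void onCheckpointCommit(String taskName, List<String> ssps) {
        for (String ssp : ssps) {
            store.remove(key(ssp, taskName));
        }
    }

    boolean hasSspOnlyStartpoint(String ssp) {
        return store.containsKey(key(ssp, null));
    }
}
```

The deletion runs as a second pass so that an SSP assigned to more than one task is copied to all of them before its SSP-only entry disappears. After a checkpoint commit, nothing remains in the store for that task, which is the bookkeeping advantage listed under Pros.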
Intent-ACK
Loading Startpoints
Committing Startpoints
Pros
- Does not rely on the Job Coordinator for consumption of startpoints.
Cons
- Following the consumption of startpoints, the metadata store will be polluted with SSP-only keyed startpoints and per-task ACKs. The current implementation of the metadata store will bootstrap all the startpoints and ACKs until they are cleaned up.
- Task count may change during runtime due to an increase in the input stream partition count. New tasks will therefore pick up the Startpoint unless all tasks that existed before the increase have written their ACKs and the cleanup process has run prior to the task-to-partition mapping. Startpoints should only apply to the state of the job upon startup and not to any changes during runtime.
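The intent-ACK trade-offs above can be made concrete with a minimal in-memory sketch. It assumes, as one plausible reading of this section, that the SSP-only startpoint stays in the store, that each task records an ACK once it has consumed the startpoint, and that a cleanup step removes both only after every expected task has ACKed. The class `IntentAckSketch` and all of its methods are hypothetical and do not reflect Samza's actual API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory model of the intent-ACK flow; not Samza's actual API. */
class IntentAckSketch {
    // SSP -> startpoint written by the user (the "intent").
    private final Map<String, String> startpoints = new HashMap<>();
    // SSP -> names of the tasks that have acknowledged consuming the startpoint.
    private final Map<String, Set<String>> acks = new HashMap<>();

    void writeStartpoint(String ssp, String startpoint) {
        startpoints.put(ssp, startpoint);
    }

    /** Returns the startpoint unless this task has already ACKed it. */
    String load(String ssp, String taskName) {
        Set<String> acked = acks.get(ssp);
        if (acked != null && acked.contains(taskName)) {
            return null;
        }
        return startpoints.get(ssp);
    }

    /** A task records that it has consumed the startpoint for this SSP. */
    void ack(String ssp, String taskName) {
        if (startpoints.containsKey(ssp)) {
            acks.computeIfAbsent(ssp, k -> new HashSet<>()).add(taskName);
        }
    }

    /** Removes the startpoint and its ACKs once all expected tasks have ACKed. */
    boolean cleanUp(String ssp, Set<String> expectedTasks) {
        Set<String> acked = acks.get(ssp);
        if (!startpoints.containsKey(ssp) || acked == null || !acked.containsAll(expectedTasks)) {
            return false; // entries stay in the store and keep being bootstrapped
        }
        startpoints.remove(ssp);
        acks.remove(ssp);
        return true;
    }
}
```

In this model, a task that appears before cleanup has run, and has therefore written no ACK, still receives the startpoint from `load`. That is the runtime task-count concern listed under Cons. Until `cleanUp` succeeds, both the startpoint and every ACK remain in the store.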
Rejected Alternatives
Previously explored solutions involved modifying the checkpoint offsets directly. Operationally, the proposed solution is safer: checkpoints are used for fault tolerance and, to prevent human error, should not be open to user intervention. The ability to set starting offsets out-of-band provides that safety layer.