Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

All changes are backwards compatible. This is an add on feature and no changes to the existing checkpoint operations are required.

Test Plan

WIP

Analysis

A key part of the core Startpoint feature is for individual task instances to fetch the appropriate Startpoint for an SSP when a Startpoint is stored in the Metadata Store without any association to a specific task. The fan-out and intent-ACK approaches have been explored with the analysis detailed in the following subsections.

Fan-out

Loading Startpoints

Image Added

Committing Startpoints

Image Added

Pros

  • Follows the natural progression of the JobCoordinator calculating the job model and then applying the info in the job model to fan out the SSP to SSP-Task startpoints.
  • Cleaner and simpler book keeping of startpoints. SSP-only keyed startpoints are deleted after fan out and SSP+TaskName keyed startpoints are deleted upon checkpoint commit.

Cons

  • May not work with a custom grouper combined with the Passthrough Job Coordinator. The default grouper for Passthrough does not support broadcast streams, so the SSP fanout to tasks across containers will be mutually exclusive, but that may not apply to custom groupers.

Intent-ACK

Loading Startpoints

Image Added

Committing Startpoints

Image Added

Pros

  • Does not rely on Job Coordinator for consumption of startpoints

Cons

  • Following the consumption of startpoints, metadata store will be polluted with SSP-only keyed startpoints and per-task-ACKs. Current implementation of the metadata store will bootstrap all the startpoints and ACKs until they are cleaned up.
  • Task count may change during runtime due to the increase in the input stream partition count. Therefore, new tasks will pick up the Startpoint unless all tasks prior to the increase in task count have written their ACKs and the cleanup process ran prior to the task-to-partition mapping. Startpoints should only apply to the state of the job upon startup and not to any changes during runtime. 


Rejected Alternatives

Previous explored solutions involved modifying the checkpoint offsets directly. Operationally, the proposed solution provides more safety because checkpoints are used for fault-tolerance and should not allow user intervention to prevent human error. Having the ability to set starting offsets out-of-band provides the safety layer.