Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Option1 is preferable because we have the max number of allowed pending checkpoints in most cases. The system can be blocked forever before pending checkpoints are aborted.

Compatibility, Deprecation, and Migration Plan

Test Plan

...

5. Checkpoint Coordinator

Currently, all pending checkpoints are aborted after the job state transits from Running to Restarting; a single task in the job restarts will cause a job state transition. Notifications of abort is sent to TaskExecutor#abortCheckpoint -> StreamTask#notifyCheckpointAbortAsync. The notification is not guaranteed to succeed. It should be fine even if the notification is unstable, as long as an incomplete checkpoint does not fail a job.

The Checkpoint coordinator remains the same.

6. State Restore

DefaultScheduler#restartTasks calls SchedulerBase#restoreState to restore states and abort pending checkpoints. With unaligned checkpoint enabled, during recovery failing over tasks have to:

  1. drop all of the in-flight data (easier),
  2. drop at least all of the partial records from restored input/output in-flight data,

This is because the restarted task will not receive other remaining buffers to complete the partial records.  

No special handling needed when unaligned checkpoints are disabled. In the very first version, ignoring state recovery can also be an option.

7. Frequent Failed Task

There will be no changes in this regard and we will keep the existing behavior.