...

The main issue is missing checkpoint barriers. After recovery, a checkpoint barrier may be lost. The checkpoint of the failed task (sink) would then wait indefinitely for the lost barrier and never complete. Upstream tasks of the sink are not affected by the missing events, because the events go missing at the failed task itself. (Note that the upstream subpartition has to reset isBlockedByCheckpoint if the failed task was blocked by checkpoint alignment before the failure.)
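The stall described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Flink code: `AlignmentTracker`, `Subpartition`, `on_consumer_failure`, and `simulate` are hypothetical names introduced for illustration, and only the flag name mirrors `isBlockedByCheckpoint` from the text. The model assumes a task completes alignment for a checkpoint only once a barrier has arrived on every input channel.

```python
from dataclasses import dataclass, field


@dataclass
class AlignmentTracker:
    """Toy model of barrier alignment for one task (hypothetical, not Flink's API)."""
    num_channels: int
    # checkpoint id -> set of input channels whose barrier has arrived
    arrived: dict = field(default_factory=dict)

    def on_barrier(self, checkpoint_id: int, channel: int) -> bool:
        """Record a barrier; return True once all input channels have delivered it."""
        seen = self.arrived.setdefault(checkpoint_id, set())
        seen.add(channel)
        return len(seen) == self.num_channels


@dataclass
class Subpartition:
    """Toy upstream subpartition (hypothetical); the flag mirrors isBlockedByCheckpoint."""
    is_blocked_by_checkpoint: bool = False

    def on_consumer_failure(self) -> None:
        # Assumption: the consumer was blocked by alignment when it failed,
        # so the block is cleared to let the restarted consumer read again.
        self.is_blocked_by_checkpoint = False


def simulate(delivered_channels, num_channels=2, checkpoint_id=1) -> bool:
    """Return True if the checkpoint aligns given the channels that delivered a barrier."""
    tracker = AlignmentTracker(num_channels)
    completed = False
    for channel in delivered_channels:
        completed = tracker.on_barrier(checkpoint_id, channel)
    return completed
```

In this model, `simulate([0, 1])` completes alignment, while `simulate([0])` (the barrier on channel 1 was lost during recovery) never does, which corresponds to the sink waiting indefinitely.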

Tasks downstream of the failed task may miss barriers as well, but we postpone that discussion until later.

The proposed solution is to

...