Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Milestone One: sink failure. Here a sink task stands for no consumers reading data from it. In this scenario, if a sink vertex fails, the sink is restarted from the last successfully completed checkpoint and data loss is expected. If a non-sink vertex fails, a regional failover strategy takes place. In milestone one, we focus on issues related to task failure handling and upstream reconnection.
  • Milestone Two: downstream failure -- a failed task leads all tasks rooted from the failed tasks to restart. In milestone two, we focus on issues related to missing checkpoint barriers.
  • Milestone Three: single task failure -- a failed task leads itself to restart. In milestone three, we focus on issues related to downstream reconnection and different types of failure detection/handling (TM, JM, network, and e.t.c).

Milestone One

This is a design doc for the first milestone in approximate task-local recovery. More specifically,

  1. The sink task is restarted if an exception is caught from the task. The rest of the tasks keep running.
  2. After restarting, the sink task restores its state and reconnects to its upstream tasks and continues to process data.
  3. The amount of data leading the failed sink from the last completed checkpoint to the state when the sink fails is lost.

Proposed Changes

Compatibility, Deprecation, and Migration Plan

...