
Status

Current state: Under Discussion

Discussion thread: 

JIRA: 

Released:

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Flink was born as a pure streaming engine, but it has since been extended to fit many different scenarios: batch, AI, event-driven applications, etc. Approximate task-local recovery is one of the attempts to serve these diversified scenarios by trading data consistency for fast failure recovery. More specifically, if a task fails, only the failed task restarts, without affecting the rest of the job. Approximate task-local recovery is similar to RestartPipelinedRegionFailoverStrategy, with two major differences (a sketch contrasting the two follows this list):

  • Instead of restarting a connected region (FLIP-1), approximate task-local recovery restarts only the failed task(s). In a streaming job, where tasks are connected with the pipelined result partition type, a connected region is equivalent to the entire job.
  • RestartPipelinedRegionFailoverStrategy preserves exactly-once semantics, while approximate task-local recovery accepts data loss, and possibly a small amount of data duplication when sources fail.
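
To make the contrast concrete, here is a minimal sketch of what an "individual task" strategy could look like against Flink's internal FailoverStrategy interface. The interface shape, package names, and the class name RestartIndividualTaskFailoverStrategy are assumptions for illustration (they follow the flip1 failover code that RestartPipelinedRegionFailoverStrategy lives in and may differ across versions); this is not the proposed implementation.

```java
import java.util.Collections;
import java.util.Set;

import org.apache.flink.runtime.executiongraph.failover.flip1.FailoverStrategy;
import org.apache.flink.runtime.scheduler.strategy.ExecutionVertexID;

/**
 * Illustrative sketch only: restart nothing but the failed task itself, in
 * contrast to RestartPipelinedRegionFailoverStrategy, which returns the whole
 * connected region (for a pipelined streaming job, effectively the entire
 * job graph).
 */
public class RestartIndividualTaskFailoverStrategy implements FailoverStrategy {

    @Override
    public Set<ExecutionVertexID> getTasksNeedingRestart(
            ExecutionVertexID executionVertexId, Throwable cause) {
        // Approximate task-local recovery: only the failed execution vertex
        // restarts; upstream and downstream tasks keep running, accepting some
        // data loss (and possible duplication when a source fails).
        return Collections.singleton(executionVertexId);
    }
}
```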


Approximate task-local recovery is useful in scenarios where a certain amount of data loss is tolerable, but a full pipeline restart is not affordable. A typical use case is online training. Online training jobs are usually complex, with all-to-all task connections, so a single task failure can trigger a restart of the entire pipeline. In addition, task initialization (loading training models, starting Python subprocesses, and so on) is time-consuming and may take minutes on average.

Goal

To ease the discussion, we divide the problem of approximate task-local recovery into three milestones, each focusing on a distinct set of problems, as shown in the graph below.


  • Milestone One: sink failure. Here a sink task means a task with no consumers reading data from it. In this scenario, if a sink vertex fails, the sink is restarted from the last successfully completed checkpoint and data loss is expected. If a non-sink vertex fails, the regional failover strategy applies. In milestone one, we focus on issues related to task failure handling and upstream reconnection.
  • Milestone Two: downstream failure -- a failed task causes all tasks rooted at the failed task (itself plus its downstream tasks) to restart. In milestone two, we focus on issues related to missing checkpoint barriers.
  • Milestone Three: single task failure -- only the failed task itself restarts. In milestone three, we focus on issues related to downstream reconnection and different types of failure detection and handling (TM, JM, network, etc.). The difference in restart scope between milestones two and three is illustrated in the sketch after this list.
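
The restart scopes of milestones two and three can be illustrated with a small toy model (all class, method, and task names below are hypothetical and unrelated to Flink's APIs): milestone two restarts the failed task plus everything downstream of it, while milestone three restarts the failed task alone.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Toy model (hypothetical names) of the restart scope per milestone. */
public final class RestartScopeExample {

    /** Milestone Two: the failed task plus every task reachable downstream from it. */
    static Set<String> downstreamScope(Map<String, Set<String>> downstreamEdges, String failedTask) {
        Set<String> toRestart = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(failedTask);
        while (!stack.isEmpty()) {
            String task = stack.pop();
            if (toRestart.add(task)) {
                downstreamEdges.getOrDefault(task, Set.of()).forEach(stack::push);
            }
        }
        return toRestart;
    }

    /** Milestone Three: only the failed task itself. */
    static Set<String> singleTaskScope(String failedTask) {
        return Set.of(failedTask);
    }

    public static void main(String[] args) {
        // source -> map -> sink, with all-to-all edges collapsed to task names for brevity.
        Map<String, Set<String>> edges = Map.of(
                "source", Set.of("map"),
                "map", Set.of("sink"),
                "sink", Set.of());

        // Milestone One: "sink" has no downstream tasks, so both scopes are just [sink].
        System.out.println(downstreamScope(edges, "map")); // [map, sink]  (Milestone Two)
        System.out.println(singleTaskScope("map"));        // [map]        (Milestone Three)
    }
}
```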

Proposed Changes

Compatibility, Deprecation, and Migration Plan

  • What impact (if any) will there be on existing users?
  • If we are changing behavior, how will we phase out the older behavior?
  • If we need special migration tools, describe them here.
  • When will we remove the existing behavior?

Test Plan

Describe in a few sentences how the FLIP will be tested. We are mostly interested in system tests (since unit tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.
