Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Describe the problems you are trying to solve.

Public Interfaces

Briefly list any new interfaces that will be introduced as part of this proposal or any existing interfaces that will be removed or changed. The purpose of this section is to concisely call out the public contract that will come along with this feature.

A public interface is any change to the following:

~~Binary log format~~
~~The network protocol and api behavior~~
~~Any class in the public packages under clientsConfiguration, especially client configuration~~
- ~~org/apache/kafka/common/serialization~~
- ~~org/apache/kafka/common~~
- ~~org/apache/kafka/common/errors~~
- ~~org/apache/kafka/clients/producer~~
- ~~org/apache/kafka/clients/consumer (eventually, once stable)~~
~~Monitoring~~
~~Command line tools and arguments~~
~~Anything else that will likely break existing users in some way when they upgrade~~

Proposed Changes

...

Flink is no longer a pure streaming engine as it was born and has been extended to fit into many different scenarios over time: batch, AI, event-driven applications, e.t.c. Approximate task-local recovery is one of the attempts to fulfill these diversified scenarios and trade data consistency for fast failure recovery. More specifically, if a task fails, only the failed task restarts without affecting the rest of the job. Approximate task-local recovery is similar to RestartPipelinedRegionFailoverStrategy with two major differences:

Instead of restarting a connected region[flip1], approximate task-local recovery restarts only the failed task(s). In the setup of a streaming job (tasks connected with pipelined result partition type), a connected region is equal to the entire job.
RestartPipelinedRegionFailoverStrategy is exactly-once, while approximate task-local recovery expects data loss and a bit of data duplication when sources fail.

Approximate task-local recovery is useful in scenarios where a certain amount of data loss is tolerable, but a full pipeline restart is not affordable. A typical use case is online training. Online training jobs are usually complicated with all-to-all task connections, so a single task failure may result in a complete restart of the whole pipeline. Besides, the initialization is time-consuming, including the procedure of loading training models and starting Python subprocesses, etc. The initialization may take minutes to complete on average.

Goal

To ease the discussion, we divide the problem of approximate task-local recovery into three parts with each part only focusing on addressing a set of problems, as shown in the graph below.

Milestone One: sink failure. Here a sink task stands for no consumers reading data from it. In this scenario, if a sink vertex fails, the sink is restarted from the last successfully completed checkpoint and data loss is expected. If a non-sink vertex fails, a regional failover strategy takes place. In milestone one, we focus on issues related to task failure handling and upstream reconnection.
Milestone Two: downstream failure -- a failed task leads all tasks rooted from the failed tasks to restart. In milestone two, we focus on issues related to missing checkpoint barriers.
Milestone Three: single task failure -- a failed task leads itself to restart. In milestone three, we focus on issues related to downstream reconnection and different types of failure detection/handling (TM, JM, network, and e.t.c).

Proposed Changes

Compatibility, Deprecation, and Migration Plan

...

Page tree

Versions Compared

Old Version 2

New Version 3

Key

Motivation

Public Interfaces

Proposed Changes

Goal

Proposed Changes

Compatibility, Deprecation, and Migration Plan

Page tree

Page History

Versions Compared

Old Version 2

New Version 3

Key

Motivation

Public Interfaces

Proposed Changes

Goal

Proposed Changes

Compatibility, Deprecation, and Migration Plan