This is an unfinished paper. I am writing. After finish I will send e-mail to dev@flink.apache.org.

Status

Current state: Under Discussion

Discussion thread: here (<- link to https://mail-archives.apache.org/mod_mbox/flink-dev/)

JIRA: Unable to render Jira issues macro, execution error.

Released: <Flink Version>

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

In Flink batch job, the job is usually divided into multiple parallel tasks executed cross many nodes in the cluster. It is common to encounter the performance degradation on some nodes due to hardware problems, or accident I/O busy, or high CPU load. This kind of degradation can probably cause the running tasks on the node to be quite slow, that is so called long tail tasks. Although the long tail tasks will not fail, they can severely affect the total job running time. Flink task scheduling does not take this long tail problem into account currently.

Here we propose the speculative execution strategy [FLINK-10644] to handle the problem. The basic idea is to run a copy of task on another node when the original task is identified to be long tail. The speculative task is executed in parallel with the original one and share the same failure retry mechanism. Once either task complete, the scheduler admits its output as the final result and cancel the other running one. I will introduce a blacklist module to schedule the long tail task on different machine from the original task. And modify FileOutputFormat.java to adapter speculative execution mechanism.

The preliminary experiments have demonstrated the effectiveness in our product cluster.

Proposed Changes

General design

Detection of Long Tail Tasks

A task will be classified as a long tail task when it meets the following three rules.

Finished Tasks Percentage

In one ExecutionJobVertex, when a configurable percentage of executions has been finished, the speculative execution thread begin really work.

Long Running Time

In speculative execution thread all executions' interval between the current time and it first create/deploying time before it failover in one ExecutionJobVertex are calculated. when the running time of a execution is greater than a configurable multiple of the median of the running time of other finished executions, this execution is defined as long tail execution.

Slow Processing Throughput

The primary characteristic of long tail tasks is that their processing throughput are much slower than the expected or than other normal tasks.

todo..

Scheduling of Speculative Executions

Because of the introduction of speculative execution, the ExecutionVertex can't default to only one execution is running at the same time. So change the currentExecution in ExecutionVertex to an arrayList named executionList(The purpose of using list is to increase the code extensibility, in case that multiple speculative executions will run at the same time in the future), which means that there can be multiple executions in an ExecutionVertex at the same time.

The scheduler should scheduling an execution(original or speculative execution) according to the index of this execution in executionList instead of default to scheduling the currentExecution in current code.

In each ExecutionJobVertex there is a speculative execution thread used for detect the long tail execution in this ExecutionJobVertex periodically based on above three rules.

todo..

Manage input and output of each ExecutionVertex

Manage InputSplit

Manage sink files

Manage middle ResultPartition

Limitations

Configuration

Compatibility, Deprecation, and Migration Plan

What impact (if any) will there be on existing users?
If we are changing behavior how will we phase out the older behavior?
If we need special migration tools, describe them here.
When will we remove the existing behavior?

Test Plan

Describe in few sentences how the FLIP will be tested. We are mostly interested in system tests (since unit-tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.

Page tree

FLIP-168: Speculative execution for Batch Job