Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

In Flink batch job, the job is usually divided into multiple parallel tasks executed cross many nodes in the cluster. It is common to encounter the performance degradation on some nodes due to hardware problems, or accident I/O busy, or high CPU load. This kind of degradation can probably cause the running tasks on the node to be quite slow, that is so called long tail tasks. Although the long tail tasks will not fail, they can severely affect the total job running time. Flink task scheduling does not take this long tail problem into account currently.

Here we propose the speculative execution strategy [FLINK-10644] to handle the problem. The basic idea is to run a copy of task on another node when the original task is identified to be long tail. In more details, the speculative task will be triggered when the speculative scheduler detects that the running time of a task is greater than a configurable multiple of the median of the running time of other finished executions and the data processing throughput of the task is much slower than others. The speculative task is executed in parallel with the original one and share the same failure retry mechanism. Once either task complete, the scheduler admits its output as the final result and cancel the other running one. I will introduce a blacklist module to schedule the long tail task on different machine from the original task. And modify FileOutputFormat.java to adapter speculative execution mechanism.

The preliminary experiments have demonstrated the effectiveness in our product cluster. 

Proposed Changes

General design

Detection of Long Tail Tasks

A task will be classified as a long tail task when it meets the following three rules.

Finished Tasks Percentage

In one ExecutionJobVertex, when a configurable percentage of executions has been finished, the speculative execution thread begin really work.

Long Running Time

All executions' interval between the current time and it first create/deploying time before it failover in one ExecutionJobVertex are calculated. when the running time of a execution is greater than a configurable multiple of the median of the running time of other finished executions, this execution is defined as long tail execution.

Slow Processing Throughput

todo..

The primary characteristic of long tail tasks is that their processing throughput are much slower than the expected or than other normal tasks.


Scheduling of Speculative Executions

...