...

Here we propose a speculative execution strategy [FLINK-10644] to handle the problem. The basic idea is to run a copy of a task on another node when the original task is identified as a long-tail task. In more detail, the speculative scheduler triggers a speculative task when it detects that the running time of a task is greater than a configurable multiple of the median running time of the other finished executions, and that the data processing throughput of the task is much lower than that of the others. The speculative task is executed in parallel with the original one and shares the same failure retry mechanism. Once either task completes, the scheduler admits its output as the final result and cancels the other running one. Preliminary experiments have demonstrated the effectiveness of this approach.
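The detection rule described above can be illustrated with a minimal Python sketch. It is not the Flink implementation. The function name `is_long_tail`, the parameters `time_multiplier` and `throughput_ratio`, and their default values are hypothetical. The proposal only says that throughput is "much slower than others", so comparing against a fraction of the median peer throughput is an assumption made here for illustration.

```python
from statistics import median


def is_long_tail(running_time, finished_times, throughput, peer_throughputs,
                 time_multiplier=1.5, throughput_ratio=0.5):
    """Return True if a running task looks like a long-tail task.

    Hypothetical sketch: both conditions from the proposal must hold.
    The default thresholds are illustrative, not values from the proposal.
    """
    # Without finished executions or peer throughput there is no baseline.
    if not finished_times or not peer_throughputs:
        return False

    # Condition 1: running time exceeds a configurable multiple of the
    # median running time of the other finished executions.
    too_long = running_time > time_multiplier * median(finished_times)

    # Condition 2 (assumed form): throughput is below a configurable
    # fraction of the median throughput of the other tasks.
    too_slow = throughput < throughput_ratio * median(peer_throughputs)

    return too_long and too_slow
```

For example, under these assumed defaults, a task that has run for 100 seconds while its finished peers took about 11 seconds, and that processes 5 records per second against a peer median of 55, would be flagged. A task that is merely long-running but keeps up with peer throughput would not.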

Proposed Changes

General design

Detection of Long Tail Tasks

Finished Tasks Percentage


Long Running Time


Slow Processing Throughput


Scheduling of Speculative Executions


Limitations


Configuration


Compatibility, Deprecation, and Migration Plan

...