Flink jobs typically run in a distributed way. In a large cluster, it is very common for cluster nodes to encounter the following issues that affect running jobs:

  1. Unrecoverable problems, such as insufficient disk space, bad hardware, or network abnormalities. These problems ... will result in continuous job failures. Currently, Flink ... users need to take ... the problematic node offline to solve this problem. However, taking a node offline can be a heavy process. Users may need to contact cluster administrators to do this. The operation can even be dangerous and may not be allowed during some important business events.
  2. Recoverable problems, such as temporary node hotspots. These problems can slow down the jobs running on the node, but the node can recover after a period of time. In this case, users may just want to limit the load on the node rather than kill all the processes on it. Unfortunately, neither Flink itself nor external resource management systems can currently do this.

To solve the above problems

...

Currently, Flink users need to manually identify the problematic node and take it offline to solve this problem. However, this approach has the following disadvantages:

  1. Taking a node offline can be a heavy process. Users may need to contact cluster administrators to do this. The operation can even be dangerous and may not be allowed during some important business events.
  2. Identifying and solving this kind of problem manually is slow and wastes human resources.

To solve this problem, we propose to introduce a blocklist mechanism for Flink to filter out problematic resources. Two granularities of blocked resources will be supported, task managers and nodes, and two block actions will be introduced:

  1. MARK_BLOCKED: Just mark a task manager or node as blocked. Future slots should not be allocated from the blocked task manager or node. But slots that are already allocated will not be affected.
  2. MARK_BLOCKED_AND_EVACUATE_TASKS: Mark the task manager or node as blocked, and evacuate all tasks on it. Evacuated tasks will be restarted on non-blocked task managers.
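
The difference between the two block actions can be illustrated with a minimal Java sketch. Only the action names MARK_BLOCKED and MARK_BLOCKED_AND_EVACUATE_TASKS come from this proposal; the class `BlocklistSketch`, the `TaskManager` stand-in, and all method names are hypothetical and are not part of Flink's API. The sketch assumes that a task manager counts as blocked when either it or its node is on the blocklist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not Flink's actual implementation. */
class BlocklistSketch {

    /** The two block actions named in the proposal. */
    enum BlockAction { MARK_BLOCKED, MARK_BLOCKED_AND_EVACUATE_TASKS }

    /** Hypothetical stand-in for a task manager: its ID, its node, and its running tasks. */
    static final class TaskManager {
        final String id;
        final String nodeId;
        final List<String> runningTasks = new ArrayList<>();

        TaskManager(String id, String nodeId, String... tasks) {
            this.id = id;
            this.nodeId = nodeId;
            this.runningTasks.addAll(Arrays.asList(tasks));
        }
    }

    // Blocked resources at both granularities, keyed by ID.
    private final Map<String, BlockAction> blockedTaskManagers = new HashMap<>();
    private final Map<String, BlockAction> blockedNodes = new HashMap<>();

    void blockTaskManager(String taskManagerId, BlockAction action) {
        blockedTaskManagers.put(taskManagerId, action);
    }

    void blockNode(String nodeId, BlockAction action) {
        blockedNodes.put(nodeId, action);
    }

    /** Assumption: a task manager is blocked if it, or the node it runs on, is blocked. */
    boolean isBlocked(TaskManager tm) {
        return blockedTaskManagers.containsKey(tm.id) || blockedNodes.containsKey(tm.nodeId);
    }

    /** Both actions stop new slot allocation; only the second also evacuates tasks. */
    boolean shouldEvacuate(TaskManager tm) {
        return blockedTaskManagers.get(tm.id) == BlockAction.MARK_BLOCKED_AND_EVACUATE_TASKS
                || blockedNodes.get(tm.nodeId) == BlockAction.MARK_BLOCKED_AND_EVACUATE_TASKS;
    }

    /** Task managers from which new slots may still be allocated. */
    List<TaskManager> allocatable(List<TaskManager> all) {
        List<TaskManager> result = new ArrayList<>();
        for (TaskManager tm : all) {
            if (!isBlocked(tm)) {
                result.add(tm);
            }
        }
        return result;
    }

    /** Tasks that would be restarted on non-blocked task managers. */
    List<String> tasksToEvacuate(List<TaskManager> all) {
        List<String> result = new ArrayList<>();
        for (TaskManager tm : all) {
            if (shouldEvacuate(tm)) {
                result.addAll(tm.runningTasks);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<TaskManager> all = Arrays.asList(
                new TaskManager("tm-1", "node-a", "task-1"),
                new TaskManager("tm-2", "node-b", "task-2"),
                new TaskManager("tm-3", "node-c", "task-3"));

        BlocklistSketch blocklist = new BlocklistSketch();
        blocklist.blockTaskManager("tm-1", BlockAction.MARK_BLOCKED);
        blocklist.blockNode("node-b", BlockAction.MARK_BLOCKED_AND_EVACUATE_TASKS);

        for (TaskManager tm : blocklist.allocatable(all)) {
            System.out.println("allocatable: " + tm.id);
        }
        System.out.println("evacuate: " + blocklist.tasksToEvacuate(all));
    }
}
```

In this sketch, a task manager blocked with MARK_BLOCKED keeps its running tasks but offers no new slots, while blocking a node with MARK_BLOCKED_AND_EVACUATE_TASKS additionally marks the tasks of every task manager on that node for restart elsewhere.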

This design only supports manually specifying blocked resources via the REST API; auto-detection may be introduced in the future.

Public Interfaces

We propose to introduce the following configuration options for the blocklist:

...