...

GET: http://{jm_rest_address:port}/blocklist

Request

Request body: {}

Response

Response code: 200(OK)

Response body:

Response Example
{
  /** This group only contains directly blocked task managers */
  "blockedTaskManagers": [
      {
          "id" : "container1",
          "action" : "MARK_BLOCKED",
          "startTimestamp" : "1652313600000",
          "endTimestamp" : "1652317200000",
          "cause" : "Hot machine"
      },
      {
          "id" : "container2",
          "action" : "MARK_BLOCKED_AND_EVACUATE_TASKS",
          "startTimestamp" : "1652315400000",
          "endTimestamp" : "1652319000000",
          "cause" : "No space left on device"
      }, 
      ...
  ],
  "blockedNodes": [
      {
          "id" : "node1",
          "action" : "MARK_BLOCKED",
          "startTimestamp" : "1652313600000",
          "endTimestamp" : "1652317200000",
          "cause" : "Hot machine",
          /** The task managers on this blocked node */
          "taskManagers" : ["container3", "container4"]
      },
      ...
  ]
} 
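Because blockedTaskManagers lists only directly blocked task managers, a client that wants every affected task manager also has to read the taskManagers field of each blocked node. A minimal sketch of that step, assuming the response has already been fetched and has the shape of the example above; the function name and the sample payload are illustrative only and are not part of the proposal:

```python
import json

# Sample payload shaped like the response example (comments and "..." removed).
SAMPLE_RESPONSE = json.loads("""
{
  "blockedTaskManagers": [
    {"id": "container1", "action": "MARK_BLOCKED",
     "startTimestamp": "1652313600000", "endTimestamp": "1652317200000",
     "cause": "Hot machine"},
    {"id": "container2", "action": "MARK_BLOCKED_AND_EVACUATE_TASKS",
     "startTimestamp": "1652315400000", "endTimestamp": "1652319000000",
     "cause": "No space left on device"}
  ],
  "blockedNodes": [
    {"id": "node1", "action": "MARK_BLOCKED",
     "startTimestamp": "1652313600000", "endTimestamp": "1652317200000",
     "cause": "Hot machine",
     "taskManagers": ["container3", "container4"]}
  ]
}
""")


def all_blocked_task_managers(response):
    """Return the ids of task managers blocked directly or via a blocked node."""
    # Directly blocked task managers.
    blocked = {tm["id"] for tm in response.get("blockedTaskManagers", [])}
    # Task managers that are blocked because their node is blocked.
    for node in response.get("blockedNodes", []):
        blocked.update(node.get("taskManagers", []))
    return blocked


if __name__ == "__main__":
    print(sorted(all_blocked_task_managers(SAMPLE_RESPONSE)))
```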

...

POST: http://{jm_rest_address:port}/blocklist/taskmanagers

Request

Request body:

Request Example
{
    [
        {
            "id" : "node1/container1",
            "action" : "MARK_BLOCKED",
            "endTimestamp" : "1652317200000",
            "cause" : "Hot machine",
            "mergeOnConflict" : "true"
        },
        {
            "id" : "node2/container2",
            "action" : "MARK_BLOCKED_AND_EVACUATE_TASKS",
            "timeout" : "3600000",
            "cause" : "No space left on device"
        }, 
        ...
    ]
}

...

When adding a task manager or node that already exists in the blocklist, we propose two processing behaviors, selected by the mergeOnConflict field:

  1. If mergeOnConflict is false, return an error.
  2. If mergeOnConflict is true, the newly added item and the existing item are merged into one. The three fields are merged as follows:
    1. For action, merge(MARK_BLOCKED, MARK_BLOCKED_AND_EVACUATE_TASKS) = MARK_BLOCKED_AND_EVACUATE_TASKS
    2. For endTimestamp, merge(endTimestampA, endTimestampB) = max(endTimestampA, endTimestampB)
    3. For cause, all causes are combined: merge("causeA", "causeB") = "causeA,causeB"

Response

  1. If there is no conflict, the response code is 201(CREATED) and the response body is empty.
  2. If a conflict occurs:
    1. If mergeOnConflict is false, the response code is 409(CONFLICT) and an error is returned.
    2. If mergeOnConflict is true, the response code is 202(ACCEPTED) and the response body is the merged result.
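
The merge rules and response codes above can be sketched as follows. This is a minimal illustration and not the proposed implementation: BlockedItem, merge_items, add_item, and the in-memory dict standing in for the blocklist are all hypothetical names, and the error body returned with 409 is a placeholder.

```python
from dataclasses import dataclass

# Actions ordered from weaker to stronger; a merge keeps the stronger action.
ACTION_RANK = {"MARK_BLOCKED": 0, "MARK_BLOCKED_AND_EVACUATE_TASKS": 1}


@dataclass(frozen=True)
class BlockedItem:
    """Hypothetical in-memory form of one blocklist entry."""
    id: str
    action: str
    end_timestamp: int  # epoch milliseconds
    cause: str


def merge_items(existing, added):
    """Merge two items for the same id using the three field rules."""
    return BlockedItem(
        id=existing.id,
        # action: the stronger of the two actions wins.
        action=max(existing.action, added.action, key=ACTION_RANK.__getitem__),
        # endTimestamp: the later expiry wins.
        end_timestamp=max(existing.end_timestamp, added.end_timestamp),
        # cause: causes are joined with a comma.
        cause=f"{existing.cause},{added.cause}",
    )


def add_item(blocklist, added, merge_on_conflict):
    """Add an item to blocklist (dict: id -> BlockedItem).

    Returns (status_code, body) following the response rules above.
    """
    existing = blocklist.get(added.id)
    if existing is None:
        blocklist[added.id] = added
        return 201, None  # CREATED, empty body
    if not merge_on_conflict:
        # CONFLICT: the existing item is left untouched.
        return 409, f"{added.id} is already in the blocklist"
    merged = merge_items(existing, added)
    blocklist[added.id] = merged
    return 202, merged  # ACCEPTED, body is the merged result
```

With these rules, adding a MARK_BLOCKED_AND_EVACUATE_TASKS item on top of an existing MARK_BLOCKED item for the same id yields an item that evacuates tasks, expires at the later of the two end timestamps, and carries both causes.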

Remove

DELETE: http://{jm_rest_address:port}/blocklist/node/<id>

DELETE: http://{jm_rest_address:port}/blocklist/taskmanager/<id>

Request

Request body: {}

Response

Response code: 200(OK)

Response body: {}
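
As a rough client-side illustration, the two DELETE endpoints could be addressed as sketched below. The base URL, the helper name build_remove_request, and the decision to percent-encode the id are assumptions made for this example; the proposal does not specify how ids containing characters such as "/" should be encoded. The sketch only constructs the request and does not send it.

```python
from urllib.parse import quote
from urllib.request import Request

# The two resource kinds that appear in the DELETE paths above.
RESOURCE_KINDS = ("node", "taskmanager")


def build_remove_request(base_url, kind, item_id):
    """Build (but do not send) a DELETE request for one blocklist entry."""
    if kind not in RESOURCE_KINDS:
        raise ValueError(f"unknown resource kind: {kind!r}")
    # Assumption: the id is percent-encoded so it stays a single path segment.
    url = f"{base_url}/blocklist/{kind}/{quote(item_id, safe='')}"
    return Request(
        url,
        data=b"{}",  # the request body is an empty JSON object
        headers={"Content-Type": "application/json"},
        method="DELETE",
    )


if __name__ == "__main__":
    # "localhost:8081" is a placeholder for {jm_rest_address:port}.
    req = build_remove_request("http://localhost:8081", "taskmanager", "container1")
    print(req.get_method(), req.full_url)
```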

Proposed Changes

In this design, blocked resources are supported at two granularities: task managers and nodes. A record of blocklist information is called a blocked item; it is generally generated by the scheduler according to task exceptions. Blocked items are recorded in a special component and affect resource allocation in Flink clusters. However, blocked items are not permanent: each one has a timeout. Once an item times out, it is removed and the resource becomes available again. The overall structure of the blocklist mechanism is shown in the figure below.
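
The timeout behavior described above can be illustrated with a small sketch. It is not the component proposed here: ExpiringBlocklist and its methods are hypothetical names, and the current time is passed in explicitly so that the example stays deterministic.

```python
class ExpiringBlocklist:
    """Hypothetical holder of blocked items that drops them once they time out."""

    def __init__(self):
        # resource id (task manager or node) -> end timestamp in epoch milliseconds
        self._end_by_id = {}

    def block(self, resource_id, end_timestamp_ms):
        """Record a blocked item that expires at end_timestamp_ms."""
        self._end_by_id[resource_id] = end_timestamp_ms

    def remove_timed_out(self, now_ms):
        """Remove every item whose end timestamp has passed; return their ids."""
        expired = [rid for rid, end in self._end_by_id.items() if end <= now_ms]
        for rid in expired:
            del self._end_by_id[rid]
        return expired

    def is_blocked(self, resource_id, now_ms):
        """A resource is blocked only while its item has not timed out."""
        self.remove_timed_out(now_ms)
        return resource_id in self._end_by_id


if __name__ == "__main__":
    blocklist = ExpiringBlocklist()
    blocklist.block("container1", end_timestamp_ms=1652317200000)
    print(blocklist.is_blocked("container1", now_ms=1652313600000))
    print(blocklist.is_blocked("container1", now_ms=1652317200000))
```

In this sketch an item is treated as timed out once the current time reaches its end timestamp; whether the boundary is inclusive is an assumption of the example.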

...