Status

Discussion thread
Vote thread
JIRA

Release

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

When a user submits a job and all of its subtasks stay in the SCHEDULED status (as in the screenshot below), it is hard to find out why.

Proposed Changes

The most common reason for this problem is that a vertex has requested more resources than the cluster has. Exposing the pending slot requests would help users check which vertex or subtask is blocked.

Frontend Design

Add a pending status to the vertex node that shows the reason the vertex is pending.

REST API Design

  • add the ScheduledUnit to the SlotRequest.
    • in SchedulerImpl.internalAllocateSlot, after the allocationFuture is created, call setPendingScheduledUnit(slotRequestId, scheduledUnit).
    • add requestPendingSlotRequests to the scheduler.

/**
 * Requests the pending slot requests.
 *
 * @param timeout for the operation
 * @return the list of pending slot requests.
 */
CompletableFuture<Collection<PendingSlotRequest>> requestPendingSlotRequests(@RpcTimeout Time timeout);
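The bookkeeping behind requestPendingSlotRequests could look roughly like the sketch below. It is a minimal illustration under assumptions, not the proposed implementation: the class PendingSlotRequestTracker, the method onSlotRequestCompleted, the use of plain strings for ids, and the fields of the simplified PendingSlotRequest are all hypothetical and not part of Flink.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: remembers which scheduled unit each unfulfilled slot request belongs to. */
final class PendingSlotRequestTracker {

    /** Simplified stand-in for the proposed PendingSlotRequest; all fields are assumptions. */
    static final class PendingSlotRequest {
        final String slotRequestId;
        final String vertexId;
        final String taskName;
        final long startTime;

        PendingSlotRequest(String slotRequestId, String vertexId, String taskName, long startTime) {
            this.slotRequestId = slotRequestId;
            this.vertexId = vertexId;
            this.taskName = taskName;
            this.startTime = startTime;
        }
    }

    // Insertion-ordered so the oldest pending request is reported first.
    private final Map<String, PendingSlotRequest> pending = new LinkedHashMap<>();

    /** Records the scheduled unit for a slot request that has not been fulfilled yet. */
    synchronized void setPendingScheduledUnit(
            String slotRequestId, String vertexId, String taskName, long startTime) {
        pending.put(slotRequestId, new PendingSlotRequest(slotRequestId, vertexId, taskName, startTime));
    }

    /** Forgets a request once its slot is allocated or the request fails (hypothetical hook). */
    synchronized void onSlotRequestCompleted(String slotRequestId) {
        pending.remove(slotRequestId);
    }

    /** Returns a snapshot copy so callers never observe later mutations. */
    synchronized Collection<PendingSlotRequest> requestPendingSlotRequests() {
        return new ArrayList<>(pending.values());
    }
}
```

In a real scheduler the removal would most likely be attached to the completion of the allocation future, and the snapshot would be wrapped in a CompletableFuture to match the signature above.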


  • add requestPendingSlotRequestDetails in JobMasterGateway.

/**
 * Request the details of pending slot requests of the current job.
 *
 * @param timeout for the rpc call
 * @return the list of pending slot requests.
 */
CompletableFuture<Collection<JobPendingSlotRequestDetail>> requestPendingSlotRequestDetails(@RpcTimeout Time timeout);
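One way the JobMaster could turn the scheduler's flat list of pending requests into per-vertex details is to group them by vertex id. The sketch below only illustrates that grouping step; PendingSlotRequestAggregator, SlotRequest, and VertexDetail are hypothetical stand-ins whose fields loosely mirror vertex_id, task_name, and slots from the response format, not actual Flink classes.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: groups pending slot requests by the vertex that issued them. */
final class PendingSlotRequestAggregator {

    /** Simplified pending request; fields are assumptions. */
    static final class SlotRequest {
        final String id;
        final String vertexId;
        final String taskName;
        final long startTime;

        SlotRequest(String id, String vertexId, String taskName, long startTime) {
            this.id = id;
            this.vertexId = vertexId;
            this.taskName = taskName;
            this.startTime = startTime;
        }
    }

    /** Simplified stand-in for JobPendingSlotRequestDetail. */
    static final class VertexDetail {
        final String vertexId;
        final String taskName;
        final List<SlotRequest> slots = new ArrayList<>();

        VertexDetail(String vertexId, String taskName) {
            this.vertexId = vertexId;
            this.taskName = taskName;
        }
    }

    /** Builds one detail entry per vertex, keeping the order in which vertices first appear. */
    static Collection<VertexDetail> groupByVertex(Collection<SlotRequest> requests) {
        Map<String, VertexDetail> byVertex = new LinkedHashMap<>();
        for (SlotRequest request : requests) {
            byVertex
                    .computeIfAbsent(request.vertexId, id -> new VertexDetail(id, request.taskName))
                    .slots
                    .add(request);
        }
        return new ArrayList<>(byVertex.values());
    }
}
```

Grouping on the JobMaster side keeps the scheduler API flat and lets the REST handler serialize the result without further processing.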


  • add JobPendingSlotRequestsHandler for the REST API.
    • url: /jobs/:jobid/pendingslotrequest
    • response:


{
  "pending-slot-requests" : {
    "type" : "array",
    "items" : {
      "type" : "object",
      "id" : "urn:jsonschema:org:apache:flink:runtime:rest:messages:job:JobPendingSlotRequestDetail",
      "properties" : {
        "vertex_id" : {
          "type" : "string"
        },
        "task_name" : {
          "type" : "string"
        },
        "slots" : {
          "type" : "array",
          "items" : {
            "type" : "object",
            "id" : "urn:jsonschema:org:apache:flink:runtime:rest:messages:job:JobPendingSlotRequestDetail:SlotInfo",
            "properties" : {
              "id" : {
                "type" : "string"
              },
              "start_time" : {
                "type" : "long"
              },
              "co-location_id" : {
                "type" : "string"
              },
              "sharing_id" : {
                "type" : "string"
              }
            }
          }
        }
      }
    }
  },
  "total" : {
    "type" : "integer"
  }
}
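
For illustration, a client could query the proposed endpoint as sketched below. This assumes Java 11 or later for java.net.http; the class PendingSlotRequestClient and its methods are hypothetical, and the REST base URL (for example http://localhost:8081) has to be supplied by the caller. The response body is returned as a raw string and not parsed.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical sketch of a client for the proposed /jobs/:jobid/pendingslotrequest endpoint. */
final class PendingSlotRequestClient {

    /** Builds the endpoint URI for a job, tolerating a trailing slash on the base URL. */
    static URI pendingSlotRequestUri(String restBaseUrl, String jobId) {
        String base = restBaseUrl.endsWith("/")
                ? restBaseUrl.substring(0, restBaseUrl.length() - 1)
                : restBaseUrl;
        return URI.create(base + "/jobs/" + jobId + "/pendingslotrequest");
    }

    /** Sends a GET request and returns the JSON body, or throws if the status is not 200. */
    static String fetchPendingSlotRequests(String restBaseUrl, String jobId)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request =
                HttpRequest.newBuilder(pendingSlotRequestUri(restBaseUrl, jobId)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

The returned JSON would follow the response format above, with one entry per blocked vertex under pending-slot-requests.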


Test Plan

Covered by unit tests.