...

Why is the waiting mechanism needed?

  • Whenever a TM registers, all of its slots are allocated immediately, which is a greedy strategy.
  • Without a global view of all slots, it is difficult to achieve a globally optimal mapping from slots to TMs.

Where: The waiting mechanism would be introduced in SlotPool

  • The mapping between slots and TMs is done by the SlotPool, which must wait for all slots to arrive before it has the global view needed to assign them.
  • As a result of scheduling, at most one TM will still have available slots.

How

  • We must respect two conditions before scheduling slot requests onto the physical slots of TMs:
    • Condition1: All slot requests have been sent from JobMaster to SlotPool
    • Condition2: All requested slots have been offered from TaskManager to SlotPool
  • Why is condition1 needed?
    • Suppose we start assigning slots to TMs after only part of the slot requests have been sent from JobMaster to SlotPool.
    • Suppose also that the first batch of slots carries a small number of tasks and the second batch carries a larger number.
    • All slots of the first batch will then be assigned to the first batch of TMs, and all slots of the second batch to the second batch of TMs.
    • As a result, the first batch of TMs runs a small number of tasks while the second batch of TMs runs a large number of tasks.
    • So it is better to start assigning slots to TMs only after all slot requests have been sent from JobMaster to SlotPool.
  • Why is condition2 needed?
    • Whenever a TM registers, all of its slots are allocated immediately, which is a greedy strategy.
    • Without a global view, it is difficult to achieve a globally optimal mapping from slots to TMs.
    • Also, streaming jobs cannot run well if resources are insufficient, so we can start scheduling only after all requested slots have been offered from TaskManager to SlotPool:
      • 1. For a streaming job with shuffle, if resources are insufficient, the job cannot run even if we schedule it.
      • 2. For a streaming job without shuffle, if resources are insufficient, some of the tasks (regions) can run. However, checkpoints cannot be triggered, and the job will fail after `slot.request.timeout`.
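The imbalance described under condition1 can be illustrated with a small sketch. It compares filling TMs in request arrival order against assigning with a view of all requests. The class and method names are hypothetical, and the "heaviest request first, onto the least-loaded TM with a free slot" heuristic is only an assumption chosen for illustration; it is not the assignment algorithm of this proposal or of Flink's SlotPool.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not Flink code. */
final class SlotBatchingIllustration {

    /** Fills TMs strictly in request arrival order and returns the task count per TM. */
    static int[] assignInArrivalOrder(int[] tasksPerSlotRequest, int slotsPerTm) {
        int tmCount = (tasksPerSlotRequest.length + slotsPerTm - 1) / slotsPerTm;
        int[] load = new int[tmCount];
        for (int i = 0; i < tasksPerSlotRequest.length; i++) {
            load[i / slotsPerTm] += tasksPerSlotRequest[i];
        }
        return load;
    }

    /**
     * Assumed heuristic: once all requests are known, place the heaviest request
     * first, each time onto the least-loaded TM that still has a free slot.
     */
    static int[] assignWithGlobalView(int[] tasksPerSlotRequest, int slotsPerTm) {
        int tmCount = (tasksPerSlotRequest.length + slotsPerTm - 1) / slotsPerTm;
        int[] load = new int[tmCount];
        int[] usedSlots = new int[tmCount];
        int[] sorted = tasksPerSlotRequest.clone();
        Arrays.sort(sorted); // ascending, so iterate from the end for heaviest first
        for (int i = sorted.length - 1; i >= 0; i--) {
            int best = -1;
            for (int tm = 0; tm < tmCount; tm++) {
                if (usedSlots[tm] < slotsPerTm && (best < 0 || load[tm] < load[best])) {
                    best = tm;
                }
            }
            load[best] += sorted[i];
            usedSlots[best]++;
        }
        return load;
    }

    public static void main(String[] args) {
        // First batch: two slot requests with 1 task each; second batch: two with 3 tasks each.
        int[] requests = {1, 1, 3, 3};
        System.out.println(Arrays.toString(assignInArrivalOrder(requests, 2))); // [2, 6]
        System.out.println(Arrays.toString(assignWithGlobalView(requests, 2))); // [4, 4]
    }
}
```

With two slots per TM, assigning in arrival order leaves one TM with 2 tasks and the other with 6, while waiting for all four requests allows 4 tasks on each TM.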

  • How to respect condition2:
    • Start assigning slots to TMs once the number of available slots is greater than or equal to the number of slot requests.
  • How to respect condition1:
    • The root cause of condition1 not being met is that slot requests are sent from JobMaster to SlotPool one by one instead of as one whole batch.
    • Condition1 is met in most scenarios, because requesting slots is faster than registering slots to the slot pool. (Registering slots to the slot pool requires requesting and starting TMs.)
    • But there are two special cases:
      • This is not true if a job is re-scheduled after a task failure. (Slots are ready in advance.)
      • It is also not true for batch jobs. (A single task can run.)
    • Solution1: Introduce an additional waiting mechanism: `slot.request.max-interval`
      • It is the maximum time interval between consecutive slot requests when the JM requests multiple slots from the SlotPool.
      • We consider all slot requests to have been sent from JobMaster to SlotPool when `current time > arrival time of the last slotRequest + slot.request.max-interval`.
      • The `slot.request.max-interval` can be very small, perhaps around 30ms (the exact value is not settled), because the call stack is a plain Java method call: it does not cross processes and is not an RPC.
    • Solution2: Change or extend the interface from Scheduler to SlotPool.
      • 2.1 Change the interface from a single slot request to a batch slot request. This doesn't work well with PipelinedRegionSchedulingStrategy.
      • 2.2 Add an interface that lets the slot pool know all requests have arrived. (If so, the slot pool doesn't need any timeout or waiting mechanism.)
    • We choose solution1 because solution2 may complicate the scheduling process a lot. (From this discussion).
  • When `current time > arrival time of the last slotRequest + slot.request.max-interval` and the number of cached available slots is greater than or equal to the number of pending slot requests, slot selection and allocation can be carried out.
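The combined check for condition1 and condition2 could be sketched roughly as follows. This is a minimal sketch under stated assumptions: `SlotAssignmentGate` and its methods are hypothetical names rather than part of Flink's SlotPool, timestamps are passed in explicitly to keep the example deterministic, and the 30 ms interval is only the tentative value mentioned above.

```java
import java.time.Duration;

/** Hypothetical helper illustrating the waiting check; not Flink's SlotPool code. */
final class SlotAssignmentGate {
    private final long maxIntervalMillis;   // stands in for slot.request.max-interval
    private long lastRequestArrivalMillis;  // arrival time of the last slot request
    private int pendingRequests;            // slot requests received from the JobMaster
    private int availableSlots;             // slots offered by TaskManagers and cached

    SlotAssignmentGate(Duration maxInterval) {
        this.maxIntervalMillis = maxInterval.toMillis();
    }

    /** Records one slot request arriving from the JobMaster at the given time. */
    void onSlotRequest(long nowMillis) {
        pendingRequests++;
        lastRequestArrivalMillis = nowMillis;
    }

    /** Records one slot offered by a TaskManager. */
    void onSlotOffered() {
        availableSlots++;
    }

    /** Returns true once both conditions hold and assignment may start. */
    boolean canAssign(long nowMillis) {
        if (pendingRequests == 0) {
            return false; // nothing to assign yet
        }
        // Condition1: no new request has arrived within the max interval.
        boolean allRequestsArrived = nowMillis - lastRequestArrivalMillis > maxIntervalMillis;
        // Condition2: enough slots have been offered to serve every pending request.
        boolean enoughSlots = availableSlots >= pendingRequests;
        return allRequestsArrived && enoughSlots;
    }

    public static void main(String[] args) {
        SlotAssignmentGate gate = new SlotAssignmentGate(Duration.ofMillis(30));
        gate.onSlotRequest(0);
        gate.onSlotRequest(10);
        gate.onSlotOffered();
        System.out.println(gate.canAssign(100)); // false: only 1 slot for 2 requests
        gate.onSlotOffered();
        System.out.println(gate.canAssign(20));  // false: last request arrived 10 ms ago
        System.out.println(gate.canAssign(41));  // true: 31 ms elapsed and 2 slots for 2 requests
    }
}
```

A real implementation would more likely re-evaluate this check from a scheduled callback and on every slot offer rather than being polled by the caller; that wiring is left out here.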

...