Status
...
Page properties | |
---|---|
|
...
JIRA: _
...
|
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
...
Currently, the SlotManager supports failing unfulfillable slot requests by calling ResourceActions.notifyAllocationFailure. A slot request is unfulfillable if the SlotManager has no slots available for it and cannot allocate a new slot from the ResourceManager. This works because we have individual slot requests which are identified by the AllocationID. With declarative resource management, we cannot fail individual slot requests. However, we can let the JobMaster know if we cannot fulfill the resource requirements for a job after resourcemanager.standalone.start-up-time has passed. In order to send this notification, we have to introduce a new RPC, JobMaster.notifyNotEnoughResourcesAvailable(Collection<ResourceRequirement> acquiredResources), where acquiredResources is the collection of resources acquired for the job. This signal is sent whenever the SlotManager tried to fulfill the requirements for a job but failed to do so.
...
```java
interface SlotManager {
    /**
     * Process the given resource requirements. The resource requirements define the
     * required resources for the specified job. The SlotManager will try to fulfill
     * these requirements.
     *
     * @param resourceRequirements resourceRequirements defines the resource requirements for a job
     */
    void processResourceRequirements(ResourceRequirements resourceRequirements);
}
```
In order to enable the SlotManager to notify the JobMaster that not enough resources are available, we need to extend the JobMasterGateway with an additional method:
```java
interface JobMasterGateway {
    /**
     * Notifies that not enough resources are available to fulfill the resource requirements of a job.
     *
     * @param acquiredResources the resources that have been acquired for the job
     */
    void notifyNotEnoughResourcesAvailable(Collection<ResourceRequirement> acquiredResources);
}
```
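The notification flow described above can be sketched roughly as follows. This is a minimal illustration, not Flink's implementation: the `ResourceRequirement` class here is a simplified stand-in keyed by a profile name, `checkRequirements` and the `startUpPeriodOver` flag are hypothetical, and the gateway is modelled as a plain local interface rather than an RPC endpoint.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; none of these types are Flink's actual classes. */
final class NotEnoughResourcesSketch {

    /** Hypothetical stand-in for a (resource profile, slot count) pair. */
    static final class ResourceRequirement {
        final String profile;
        final int numberOfSlots;

        ResourceRequirement(String profile, int numberOfSlots) {
            this.profile = profile;
            this.numberOfSlots = numberOfSlots;
        }
    }

    /** Mirrors the proposed gateway method as a local interface (assumption: no RPC layer). */
    interface JobMasterGateway {
        void notifyNotEnoughResourcesAvailable(Collection<ResourceRequirement> acquiredResources);
    }

    /**
     * Compares required against acquired slots per profile and notifies the JobMaster
     * if any requirement is unmet. Returns true if a notification was sent.
     */
    static boolean checkRequirements(
            Collection<ResourceRequirement> required,
            Map<String, Integer> acquiredSlotsByProfile,
            boolean startUpPeriodOver,
            JobMasterGateway jobMaster) {
        if (!startUpPeriodOver) {
            // During the start-up period more TaskExecutors may still register.
            return false;
        }
        boolean missing = false;
        for (ResourceRequirement requirement : required) {
            int acquired = acquiredSlotsByProfile.getOrDefault(requirement.profile, 0);
            if (acquired < requirement.numberOfSlots) {
                missing = true;
            }
        }
        if (!missing) {
            return false;
        }
        // Report what has been acquired so far, not what is missing.
        List<ResourceRequirement> acquiredResources = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : acquiredSlotsByProfile.entrySet()) {
            acquiredResources.add(new ResourceRequirement(entry.getKey(), entry.getValue()));
        }
        jobMaster.notifyNotEnoughResourcesAvailable(acquiredResources);
        return true;
    }
}
```

Passing the acquired resources rather than the missing ones leaves it to the JobMaster to decide how to react, for example by failing the job or by continuing with what it has.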
Accepting resources
On the JobMaster side, the SlotPool is responsible for accepting offered slots, and matching these against the requirements of the job. It has to follow the same logic for matching slots as the SlotManager.
...
If the SlotPool is provided with more slots than are currently required, then it will return these slots after the idle slot timeout has passed. This serves as a sort of grace period, potentially allowing us to make use of excess slots later on without having to do another round-trip to the ResourceManager.
Note: Depending on the scheduling requirements it might make sense to reuse slots which have been freed on the JobMaster because it reduces latency or to return them and to ask for properly sized slots because it improves resource utilization (assuming different resource requirements). At the moment, we assume that reusing slots is possible. In the future we might have to make this behaviour configurable.
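The grace-period behaviour can be sketched as below. This is a simplified model under stated assumptions: all slots are treated as interchangeable (a single resource profile), slots are identified by plain strings, and the class and method names (`IdleSlotTimeoutSketch`, `offerSlot`, `returnIdleExcessSlots`) are hypothetical rather than taken from the SlotPool.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the actual SlotPool implementation. */
final class IdleSlotTimeoutSketch {
    private final long idleSlotTimeoutMillis;
    private int requiredSlots;
    // Slot id -> timestamp at which the slot was offered (hypothetical bookkeeping).
    private final Map<String, Long> idleSince = new LinkedHashMap<>();

    IdleSlotTimeoutSketch(long idleSlotTimeoutMillis, int requiredSlots) {
        this.idleSlotTimeoutMillis = idleSlotTimeoutMillis;
        this.requiredSlots = requiredSlots;
    }

    void setRequiredSlots(int requiredSlots) {
        this.requiredSlots = requiredSlots;
    }

    /** Accepts every offered slot, even if it exceeds the current requirements. */
    void offerSlot(String slotId, long nowMillis) {
        idleSince.put(slotId, nowMillis);
    }

    int heldSlots() {
        return idleSince.size();
    }

    /** Returns (and forgets) excess slots whose idle time has reached the timeout. */
    List<String> returnIdleExcessSlots(long nowMillis) {
        List<String> returned = new ArrayList<>();
        int excess = idleSince.size() - requiredSlots;
        Iterator<Map.Entry<String, Long>> iterator = idleSince.entrySet().iterator();
        while (excess > 0 && iterator.hasNext()) {
            Map.Entry<String, Long> entry = iterator.next();
            String slotId = entry.getKey();
            long idleStart = entry.getValue();
            if (nowMillis - idleStart >= idleSlotTimeoutMillis) {
                iterator.remove();
                returned.add(slotId);
                excess--;
            }
        }
        return returned;
    }
}
```

In this model an excess slot that has not yet timed out stays in the pool, so a later increase in requirements can be served from it without contacting the ResourceManager.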
Releasing resources
Resources/Slots are released by the JobMaster by calling TaskExecutorGateway.freeSlot() and by updating the required resources by calling ResourceManagerGateway.declareRequiredResources with the updated resource requirements.
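A rough sketch of the release path follows. The two gateway interfaces are simplified stand-ins with hypothetical signatures (the real Flink gateways are asynchronous RPC interfaces with additional parameters), requirements are modelled as a profile-name-to-count map, and the order of the two calls is an assumption of this sketch rather than something the text prescribes.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; gateway signatures are simplified assumptions. */
final class ReleaseResourcesSketch {

    /** Simplified stand-in for TaskExecutorGateway. */
    interface TaskExecutorGateway {
        void freeSlot(String slotId);
    }

    /** Simplified stand-in for ResourceManagerGateway. */
    interface ResourceManagerGateway {
        void declareRequiredResources(Map<String, Integer> requiredSlotsByProfile);
    }

    private final TaskExecutorGateway taskExecutor;
    private final ResourceManagerGateway resourceManager;
    private final Map<String, Integer> requiredSlotsByProfile = new HashMap<>();

    ReleaseResourcesSketch(TaskExecutorGateway taskExecutor, ResourceManagerGateway resourceManager) {
        this.taskExecutor = taskExecutor;
        this.resourceManager = resourceManager;
    }

    /** Records a local requirement; does not contact the ResourceManager in this sketch. */
    void setRequirement(String profile, int numberOfSlots) {
        requiredSlotsByProfile.put(profile, numberOfSlots);
    }

    /** Lowers the requirement for the profile by one, declares it, then frees the slot. */
    void releaseSlot(String slotId, String profile) {
        Integer current = requiredSlotsByProfile.get(profile);
        if (current == null) {
            throw new IllegalStateException("No requirement declared for profile " + profile);
        }
        if (current <= 1) {
            requiredSlotsByProfile.remove(profile);
        } else {
            requiredSlotsByProfile.put(profile, current - 1);
        }
        // Assumption: declare the reduced requirements before freeing the slot.
        resourceManager.declareRequiredResources(new HashMap<>(requiredSlotsByProfile));
        taskExecutor.freeSlot(slotId);
    }
}
```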
...
The slotmanager.request-timeout option will no longer have an effect.
Follow ups
Removing the AllocationID
Once the old SlotPool implementations are removed, it might be possible to remove the AllocationID and to identify slots via the SlotID.