...

On the ResourceManager side, Flink requires that the environment in which task executors run has the required GPU resources and that those GPU resources are accessible to the task executors.

Regarding the amount of GPU resources:

...

We introduce the configuration options “external-resource.gpu.param.discovery-script” and “external-resource.gpu.param.discovery-script.args”, which define the path of the discovery script and its arguments. The discovery script should provide two functions: allocate and deallocateAll. GPUDriver executes the allocate function and reads the available GPU resources from its output when it is opened, and executes the deallocateAll function when it is closed.
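For illustration, such a configuration could look as follows (the script path below is hypothetical; “--check-dead” is an option of the default script described later in this document):

    external-resource.gpu.amount: 2
    external-resource.gpu.param.discovery-script: /opt/flink/gpu-discovery.py
    external-resource.gpu.param.discovery-script.args: --check-dead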

In the allocate function, the discovery script should:

  • Return a list of the available GPU indexes, split by a comma.
  • Exit with a non-zero code if the output does not meet expectations. GPUDriver will throw an exception in that case, causing TaskExecutor initialization to fail.
  • Flink passes the keyword “allocate” and the amount (external-resource.gpu.amount) as the first two arguments into the script. The user-defined arguments are appended after them, as in the sketch after this list.
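A minimal Python sketch of this allocate contract (the discovery logic is a placeholder; a real script would query the actual devices):

    #!/usr/bin/env python3
    # Illustrative sketch of the "allocate" function contract; not the
    # actual default script shipped with Flink.
    import sys

    def discover_gpu_indexes(user_args):
        # Placeholder for site-specific discovery logic; a real script
        # might parse "nvidia-smi" output here.
        return [0, 1]

    def allocate(amount, user_args):
        indexes = discover_gpu_indexes(user_args)
        if len(indexes) < amount:
            # Exit with non-zero: GPUDriver throws an exception and
            # TaskExecutor initialization fails.
            sys.stderr.write("not enough GPUs discovered\n")
            sys.exit(1)
        # Print the allocated GPU indexes, split by a comma.
        print(",".join(str(i) for i in indexes[:amount]))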

In the deallocateAll function, the discovery script should:

  • Clean up all the state and files it produced in the allocate function.
  • Exit with a non-zero code on failure. GPUDriver will throw an exception and log the error.
  • Flink passes the keyword “deallocateAll” as the first argument into the script. The user-defined arguments are appended after it, as in the dispatch sketch after this list.
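Continuing the sketch above, the entry point dispatches on the first argument; note that for “deallocateAll” the user-defined arguments follow the keyword directly:

    def deallocate_all(user_args):
        # Clean up all state and files produced by the allocate function;
        # exit with non-zero on failure so GPUDriver throws and logs it.
        pass  # site-specific cleanup goes here

    if __name__ == "__main__":
        keyword = sys.argv[1]
        if keyword == "allocate":
            # "allocate" is followed by the amount, then user-defined args.
            allocate(int(sys.argv[2]), sys.argv[3:])
        elif keyword == "deallocateAll":
            # "deallocateAll" is followed directly by user-defined args.
            deallocate_all(sys.argv[2:])
        else:
            sys.exit(1)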

...

  • The script first gets all the indexes of visible GPU resources, using the “nvidia-smi”/“rocm-smi” command.
  • It returns a list of indexes of the discovered GPUs, split by a comma.
  • The number of GPU indexes returned from the default script should always match the amount configured through “external-resource.gpu.amount” (see the sketch after this list).
    • If more GPUs are discovered than configured, the script returns only as many indexes as the configured amount.
    • If not enough GPUs are discovered, the script fails and exits with a non-zero code.
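This index-selection logic could be sketched as follows, assuming the NVIDIA case (the “nvidia-smi --query-gpu=index --format=csv,noheader” invocation is real; the surrounding code is illustrative):

    import subprocess
    import sys

    def visible_gpu_indexes():
        # Query the indexes of all GPUs visible to this process;
        # "rocm-smi" would be queried analogously for AMD GPUs.
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            text=True)
        return [line.strip() for line in out.splitlines() if line.strip()]

    def select_indexes(amount):
        indexes = visible_gpu_indexes()
        if len(indexes) < amount:
            # Not enough GPUs discovered: fail with a non-zero exit code.
            sys.stderr.write(
                "expected %d GPUs, found %d\n" % (amount, len(indexes)))
            sys.exit(1)
        # More GPUs discovered than configured: return only the
        # configured amount.
        return indexes[:amount]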

In its deallocateAll function, the script does nothing and exits with zero.

...

After that, the script leverages the “cgroups” mechanism [5], with which we can set the visibility of each GPU resource for a certain control group (the task executor and its child processes). The script creates a dedicated cgroup for the task executor, then adds the indexes of the GPU resources the task executor must not use to the device blacklist of this cgroup. From then on, the task executor and its child processes only see the GPU resources allocated to them.
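A sketch of this cgroup step, assuming the cgroup v1 devices controller and NVIDIA’s character-device major number 195 (the cgroup name is illustrative):

    import os

    CGROUP_ROOT = "/sys/fs/cgroup/devices"  # cgroup v1 devices controller
    NVIDIA_MAJOR = 195                      # /dev/nvidia<N> char devices

    def isolate_gpus(denied_indexes, task_executor_pid):
        # Create a dedicated cgroup for the task executor.
        group = os.path.join(CGROUP_ROOT, "flink-te-%d" % task_executor_pid)
        os.makedirs(group, exist_ok=True)
        # Blacklist the GPU devices the task executor must not use.
        for idx in denied_indexes:
            with open(os.path.join(group, "devices.deny"), "w") as f:
                f.write("c %d:%s rwm" % (NVIDIA_MAJOR, idx))
        # Move the task executor into the cgroup; processes spawned
        # afterwards inherit the membership.
        with open(os.path.join(group, "cgroup.procs"), "w") as f:
            f.write(str(task_executor_pid))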

The deallocateAll function will look up the GPUs used by the task executor through its PID and remove those records from the assignment file.

  • If the task executor fails unexpectedly, the deallocateAll function may not be triggered. In this scenario, task executors started afterwards will read the stale records, which may cause a task executor to mistakenly report that there are not enough GPU resources. To address this issue, the script provides a “--check-dead” option. If it is added to “external-resource.gpu.param.discovery-script.args”, then whenever there are not enough unrecorded GPUs, the allocate function will check whether the processes associated with the existing records are still alive and take over those GPUs whose associated processes are already dead (see the sketch below).
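A sketch of the “--check-dead” logic, assuming the assignment file holds one “<gpu index>,<pid>” record per line (this record format is an assumption for illustration):

    import os

    def is_alive(pid):
        # Signal 0 checks process existence without sending a signal.
        try:
            os.kill(pid, 0)
            return True
        except ProcessLookupError:
            return False
        except PermissionError:
            return True  # the process exists but belongs to another user

    def reclaimable_indexes(assignment_file):
        # Return the GPU indexes whose recorded owner process is dead,
        # so the allocate function can take them over.
        dead = []
        with open(assignment_file) as f:
            for line in f:
                index, pid = line.strip().split(",")
                if not is_alive(int(pid)):
                    dead.append(index)
        return dead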

...