Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Proposed Changes

  • We introduce the ExternalResourceDriver external resource framework for external resource allocation and management.
  • User sets the “external-resource.gpu.amount”, “external-resource.gpu.driver-factory.amount”  class” and specifies the “external-resource.gpu.param.discovery-script.[path|args]” if needed.
  • For Yarn/Kubernetes mode, Flink maps the “external-resource.gpu.amount” to the corresponding field of resource requests to the external resource manager.
  • Introduce a GPUDriver, which will execute the discovery script and get the available GPU resources from the output.
  • Operators and functions get the GPU resource information from GPUDriver

...

On the ResourceManager side, Flink requires the environment in which task executors run has required GPU resources and the GPU resources are accessible to task executors.

reRegarding Regarding the amount of GPU resources:

...

For standalone mode, multiple task executors may be co-located on the same device, and each GPU is visible to all the task executors. To achieve worker-level isolation in such scenarios, we need first decide which task executor uses which GPU in a cooperative way. We provide a privilege mode for this in the default script.

...