Versions Compared

...

  • external-resource.{resourceName}.amount. Define the amount of external resources in a task executor.
  • external-resource.{resourceName}.driver.class. Define the class name of ExternalResourceDriver.
  • external-resource.{resourceName}.kubernetes.key. Define the configuration key of that external resource in Kubernetes. Only valid for Kubernetes mode.
  • external-resource.{resourceName}.yarn.key. Define the configuration key of that external resource in Yarn. Only valid for Yarn mode.
  • external-resource.{resourceName}.param.{params}. Each ExternalResourceDriver could define their specific configs following this pattern.

For GPU resource, we introduce the following configuration options:

  • “external-resource.gpu.amount”: Define how many GPUs in a task executor. The default value should be 0.
  • “external-resource.gpu.param.discovery-script.path”: Define the path of the discovery script. See Discovery Script Section.
  • “external-resource.gpu.param.discovery-script.args”: Define the arguments passed to the discovery script. See Discovery Script Section.
  • external-resource.{resourceName}.kubernetes.key. Define the configuration key of GPU in Kubernetes. The default value is “nvidia.com/gpu”[3]. Only valid for Kubernetes mode. If using an AMD GPU, the user could set it to "amd.com/gpu".
  • external-resource.{resourceName}.yarn.key. Define the configuration key of GPU in Yarn. The default value is "yarn.io/gpu".
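The defaults listed above can be illustrated with a minimal sketch. It assumes the options arrive as a flat key/value map and that the GPU resource name is "gpu"; the class GpuOptionsSketch and its methods are hypothetical helpers for illustration, not part of Flink.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not a Flink class): resolves the GPU options from a flat key/value map. */
final class GpuOptionsSketch {

    // Assumption: the GPU resource is registered under the resource name "gpu".
    static final String PREFIX = "external-resource.gpu.";

    /** Number of GPUs per task executor; falls back to the documented default of 0. */
    static long amount(Map<String, String> conf) {
        return Long.parseLong(conf.getOrDefault(PREFIX + "amount", "0").trim());
    }

    /** Configuration key of the GPU resource in Kubernetes; documented default "nvidia.com/gpu". */
    static String kubernetesKey(Map<String, String> conf) {
        return conf.getOrDefault(PREFIX + "kubernetes.key", "nvidia.com/gpu");
    }

    /** Configuration key of the GPU resource in Yarn; documented default "yarn.io/gpu". */
    static String yarnKey(Map<String, String> conf) {
        return conf.getOrDefault(PREFIX + "yarn.key", "yarn.io/gpu");
    }

    /** Collects driver-specific options by stripping the "external-resource.gpu.param." prefix. */
    static Map<String, String> driverParams(Map<String, String> conf) {
        String paramPrefix = PREFIX + "param.";
        Map<String, String> params = new HashMap<>();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            if (entry.getKey().startsWith(paramPrefix)) {
                params.put(entry.getKey().substring(paramPrefix.length()), entry.getValue());
            }
        }
        return params;
    }
}
```

With only “external-resource.gpu.amount” set, the Kubernetes and Yarn keys in this sketch fall back to the defaults named above, so a user on NVIDIA hardware would not need to configure them.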

Introduce the ExternalResourceInfo class, which contains the information of the external resources. Operators and functions could get that information from the RuntimeContext.

...

  • We introduce the ExternalResourceDriver framework for external resource allocation and management.
  • User sets the “taskmanager.resource.gpu.amount” and specifies the “external-resource.gpu.param.discovery-script.[path|args]” if needed.
  • For Yarn/Kubernetes mode, Flink maps the “taskmanager.resource.gpu.amount” to the corresponding field of resource requests to the external resource manager.
  • Introduce a GPUManager, which will execute the discovery script and get the available GPU resources from the output.
  • Operators and functions get the GPU resource information from GPUManager.

...

To provide extensibility and decouple the TaskExecutor/ResourceManager from external resource management and allocation (following the separation of concerns rule), we introduce the ExternalResourceDriver framework for external resource allocation and management. This class could be extended by third parties for other external resources they want to leverage.

The ExternalResourceDriver framework drives the end-to-end workflow of external resource allocation and management.

On the ResourceManager side, the user defines the amount of the external resource. The ExternalResourceDriver framework takes the responsibility to allocate resources from external resource managers (Yarn/Kubernetes). The user needs to specify the configuration key of that external resource on Yarn/Kubernetes. Then, the Yarn/KubernetesResourceManager forwards this external resource request to the external resource manager.

  • For Yarn, the YarnResourceManager adds the external resource to the ContainerRequest.
  • For Kubernetes, the KubernetesResourceManager adds the external resource to the pod for TaskExecutor.
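The forwarding step can be sketched as follows. This is only an illustration: the container or pod request is modeled as a plain map from resource key to amount rather than with the real Yarn or Kubernetes client types, and the class ExternalResourceRequestSketch is hypothetical. Skipping resources that have no configured key or a non-positive amount is also an assumption of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: models a container/pod request as a map from resource key to amount. */
final class ExternalResourceRequestSketch {

    private static final String PREFIX = "external-resource.";
    private static final String AMOUNT_SUFFIX = ".amount";

    /**
     * For every "external-resource.{resourceName}.amount" entry, looks up
     * "external-resource.{resourceName}.{deployment}.key" (deployment is "yarn" or "kubernetes")
     * and records the amount under that key.
     */
    static Map<String, Long> externalResourceRequests(Map<String, String> conf, String deployment) {
        Map<String, Long> request = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            String key = entry.getKey();
            boolean isAmountKey = key.startsWith(PREFIX)
                    && key.endsWith(AMOUNT_SUFFIX)
                    && key.length() > PREFIX.length() + AMOUNT_SUFFIX.length();
            if (!isAmountKey) {
                continue;
            }
            String resourceName =
                    key.substring(PREFIX.length(), key.length() - AMOUNT_SUFFIX.length());
            if (resourceName.contains(".")) {
                continue; // e.g. a driver parameter that happens to be named "amount"
            }
            long amount = Long.parseLong(entry.getValue().trim());
            String externalKey = conf.get(PREFIX + resourceName + "." + deployment + ".key");
            // Assumption: without a configured key or a positive amount nothing is forwarded.
            if (amount > 0 && externalKey != null) {
                request.put(externalKey, amount);
            }
        }
        return request;
    }
}
```

The point of the sketch is that the ResourceManager only needs the amount and the user-supplied key; it does not need any resource-specific logic to build the request.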

On the TaskExecutor side, ExternalResourceDriver takes the responsibility to detect and provide information of external resources. TaskExecutor does not need to manage a specific external resource by itself. Operators and functions would get the ExternalResourceInfo from RuntimeContext.

...

  • external-resource.{resourceName}.amount. Define the amount of external resources in a task executor.
  • external-resource.{resourceName}.driver.class. Define the class name of ExternalResourceDriver.
  • external-resource.{resourceName}.kubernetes.key. Define the configuration key of that external resource in Kubernetes. Only valid for Kubernetes mode.
  • external-resource.{resourceName}.yarn.key. Define the configuration key of that external resource in Yarn. Only valid for Yarn mode.
  • external-resource.{resourceName}.param.{params}. Each ExternalResourceDriver could define their specific configs following this pattern.

The definition of ExternalResourceDriver and ExternalResourceInfo is:


Code Block
languagejava
titleExternalResourceDriver
import java.util.List;
// ResourceProfile is Flink's org.apache.flink.runtime.clusterframework.types.ResourceProfile.

public abstract class ExternalResourceDriver {

    /**
     * Retrieve the information of the external resources according to the resourceProfile.
     */
    public abstract List<ExternalResourceInfo> retrieveResourceInfo(ResourceProfile resourceProfile);
}

public abstract class ExternalResourceInfo {

    // Return the name of that external resource.
    public abstract String getName();

}
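
To illustrate how a third party might extend the driver, here is a minimal sketch. The types ResourceProfile, ExternalResourceInfo and ExternalResourceDriver below are simplified stand-ins that mirror the definitions above so the example compiles without Flink on the classpath. FpgaInfo, FpgaDriver and the constructor-supplied device count are hypothetical and only serve as an example.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins mirroring the proposed definitions; the real ResourceProfile carries resource specs.
final class ResourceProfile {}

abstract class ExternalResourceInfo {
    abstract String getName();
}

abstract class ExternalResourceDriver {
    abstract List<ExternalResourceInfo> retrieveResourceInfo(ResourceProfile resourceProfile);
}

/** Hypothetical resource info for an FPGA device, identified by its index. */
final class FpgaInfo extends ExternalResourceInfo {
    private final int index;

    FpgaInfo(int index) {
        this.index = index;
    }

    @Override
    String getName() {
        return "fpga";
    }

    int getIndex() {
        return index;
    }
}

/** Hypothetical third-party driver that reports a fixed number of devices. */
final class FpgaDriver extends ExternalResourceDriver {
    private final int deviceCount;

    // Assumption: the device count is handed in directly; a real driver would detect it.
    FpgaDriver(int deviceCount) {
        this.deviceCount = deviceCount;
    }

    @Override
    List<ExternalResourceInfo> retrieveResourceInfo(ResourceProfile resourceProfile) {
        List<ExternalResourceInfo> infos = new ArrayList<>();
        for (int i = 0; i < deviceCount; i++) {
            infos.add(new FpgaInfo(i));
        }
        return infos;
    }
}
```

An operator would then only see the returned ExternalResourceInfo objects and would not need to know how the driver discovered the devices.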

We introduce the GPUDriver for the GPU resources.

On the ResourceManager side, Flink requires that the environment in which task executors run has the required GPU resources and that those GPU resources are accessible to task executors.

...

  • We introduce the configuration option “external-resource.gpu.amount”, which defines how many GPUs a task executor should have. Notice that this value will be passed to the discovery script as the first argument. See Discovery Script Section.
  • For standalone mode, the user must guarantee that the required GPU resources exist in the environment where task executors run.
  • For Yarn/Kubernetes mode, the external resource manager guarantees that the container has the required amount of GPU resources if Flink sets the corresponding field in the request.
    • For Yarn, Flink will set the corresponding field (external-resource.gpu.yarn.key) of container requests.
    • For Kubernetes, Flink will set the corresponding field (external-resource.gpu.kubernetes.key) of pod requests.

Regarding the accessibility of the GPU resources:

...

Note: To make GPU resources accessible, certain setup and preparation steps are needed depending on your environment. See External Requirements Section.

...

Once the required GPU resources are accessible to task executors, GPUDriver needs to discover GPU resources and provide the GPU resource information to operators.
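
As a rough sketch of the first step of discovery, the command line for the discovery script could be assembled as shown below. Passing the GPU amount as the first argument follows the text above. Splitting the configured arguments on whitespace is an assumption of this sketch, and DiscoveryCommandSketch is a hypothetical helper. The output format of the script is not covered here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: assembles the command line used to invoke the discovery script. */
final class DiscoveryCommandSketch {

    /**
     * @param scriptPath value of "external-resource.gpu.param.discovery-script.path"
     * @param gpuAmount  value of "external-resource.gpu.amount", passed as the first argument
     * @param extraArgs  value of "external-resource.gpu.param.discovery-script.args", may be null
     */
    static List<String> buildCommand(String scriptPath, long gpuAmount, String extraArgs) {
        List<String> command = new ArrayList<>();
        command.add(scriptPath);
        command.add(Long.toString(gpuAmount));
        // Assumption: user-provided arguments are separated by whitespace.
        if (extraArgs != null && !extraArgs.trim().isEmpty()) {
            for (String arg : extraArgs.trim().split("\\s+")) {
                command.add(arg);
            }
        }
        return command;
    }
}
```

The resulting list could be handed to java.lang.ProcessBuilder to run the script and read its output.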

...