...

Page properties

...


Discussion thread: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-108-Add-GPU-support-in-Flink-td38286.html

...


Vote thread
JIRA: FLINK-17044

...

Release: 1.11


Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

Introduce the external resource framework for external resource allocation and management. The pattern of configuration options is:

  • external-resources. Define the {resourceName} list of enabled external resources, split by the delimiter ",".
  • external-resource.{resourceName}.amount. Define the amount of the external resource in a task executor.
  • external-resource.{resourceName}.driver-factory.class. Define the class name of the ExternalResourceDriverFactory.
  • external-resource.{resourceName}.kubernetes.key. Optional config which defines the configuration key of that external resource in Kubernetes. If you want Flink to request the external resource from Kubernetes (through its Device Plugin mechanism[3]), you need to set this key explicitly. Only valid for Kubernetes mode.
  • external-resource.{resourceName}.yarn.key. Optional config which defines the configuration key of that external resource in Yarn. If you want Flink to request the external resource from Yarn, you need to set this key explicitly. Only valid for Yarn mode.
  • external-resource.{resourceName}.param.{params}. Each ExternalResourceDriver can define its own specific configs following this pattern.
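To illustrate the key pattern above, here is a minimal sketch that assembles the options for a resource named gpu in a plain map. It uses no Flink classes. The helper name ExternalResourceKeys, the factory class name com.example.GpuDriverFactory, the Kubernetes key value, and the script path are all assumptions made for the example, not values defined by this proposal. The list key is taken to be "external-resources".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Flink: builds keys of the form
// external-resource.{resourceName}.{suffix}.
final class ExternalResourceKeys {
    private static final String PREFIX = "external-resource.";

    static String key(String resourceName, String suffix) {
        return PREFIX + resourceName + "." + suffix;
    }

    // Example option set for a resource named "gpu"; every value is illustrative.
    static Map<String, String> gpuExample() {
        Map<String, String> conf = new LinkedHashMap<>();
        // Assumed name of the list option that enables the resource.
        conf.put("external-resources", "gpu");
        conf.put(key("gpu", "amount"), "2");
        // Assumed class name; substitute the real factory implementation.
        conf.put(key("gpu", "driver-factory.class"), "com.example.GpuDriverFactory");
        // Only needed in Kubernetes mode; the value is an assumed device plugin key.
        conf.put(key("gpu", "kubernetes.key"), "nvidia.com/gpu");
        // Driver-specific parameter following the param.{params} pattern.
        conf.put(key("gpu", "param.discovery-script.path"), "/opt/flink/gpu-discovery.sh");
        return conf;
    }

    public static void main(String[] args) {
        // Print the options in a "key: value" layout.
        gpuExample().forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

Only the amount and the driver factory class are common to every resource; the kubernetes.key and yarn.key entries matter only in the respective deployment mode.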

...

Code Block
languagejava
titleExternalResourceDriver
public interface ExternalResourceDriverFactory {
    /**
     * Construct the ExternalResourceDriver from configuration.
     */
    ExternalResourceDriver createExternalResourceDriver(Configuration config);
}

public interface ExternalResourceDriver {
    /**
     * Retrieve the information of the external resources according to the amount.
     */
    Set<? extends ExternalResourceInfo> retrieveResourceInfo(long amount);
}

...

Code Block
languagejava
titleRuntimeContext
public interface RuntimeContext {
    /**
     * Get the specific external resource information by the resourceName.
     */
    Set<ExternalResourceInfo> getExternalResourceInfos(String resourceName);
}
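The following sketch shows how a user function might read this information. It is self-contained, so it declares stand-ins for RuntimeContext and ExternalResourceInfo that contain only the methods proposed in this document. The helper GpuIndexReader, the resource name "gpu", and the property key "index" are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Stand-in for the proposed ExternalResourceInfo interface.
interface ExternalResourceInfo {
    String getProperty(String key);
    Collection<String> getKeys();
}

// Stand-in containing only the method proposed above.
interface RuntimeContext {
    Set<ExternalResourceInfo> getExternalResourceInfos(String resourceName);
}

// Hypothetical helper a user function could call, e.g. from its open() method.
final class GpuIndexReader {
    // Returns the sorted values of the assumed "index" property for resource "gpu".
    static List<String> gpuIndexes(RuntimeContext ctx) {
        List<String> indexes = new ArrayList<>();
        for (ExternalResourceInfo info : ctx.getExternalResourceInfos("gpu")) {
            String index = info.getProperty("index");
            if (index != null) {
                indexes.add(index);
            }
        }
        Collections.sort(indexes);
        return indexes;
    }
}
```

The function never talks to the driver directly; it only needs the resource name and the property keys that the driver documents.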

For GPU resource, we introduce the following configuration options:

...

Proposed Changes

  • We introduce the external resource framework for external resource allocation and management.
  • The user sets “external-resource.gpu.amount” and “external-resource.gpu.driver-factory.class”, and specifies “external-resource.gpu.param.discovery-script.[path|args]” if needed.
  • For Yarn/Kubernetes mode, Flink maps “external-resource.gpu.amount” to the corresponding field of the resource requests sent to the external resource manager.
  • Introduce a GPUDriver, which will execute the discovery script and get the available GPU resources from the output.
  • Operators and functions get the GPU resource information from GPUDriver.

...

  • For Yarn, the YarnResourceManager adds the external resource to the ContainerRequest.
  • For Kubernetes, the KubernetesResourceManager adds the external resource to the pod for the TaskExecutor (leveraging the Device Plugin mechanism[3]).

On the TaskExecutor side, we introduce ExternalResourceDriver, which is responsible for detecting external resources and providing information about them. The TaskExecutor does not need to manage a specific external resource itself. Operators and functions get the ExternalResourceInfo from RuntimeContext.

Regarding the configuration, the common config keys are the amount of the external resource and the class name of the ExternalResourceDriverFactory. In addition, each driver can define its own configs following a specific pattern. In summary:

  • external-resources. Define the {resourceName} list of enabled external resources with delimiter ",". If configured, ResourceManager and TaskExecutor check that the relevant configs exist for the resources in this list. ResourceManager forwards the request to the underlying external resource manager. TaskExecutor launches the corresponding ExternalResourceDriver.
  • external-resource.{resourceName}.amount. Define the amount of the external resource in a task executor.
  • external-resource.{resourceName}.driver-factory.class. Define the class name of the ExternalResourceDriverFactory.
  • external-resource.{resourceName}.kubernetes.key. Optional config which defines the configuration key of that external resource in Kubernetes. If you want Flink to request the external resource from Kubernetes, you need to set this key explicitly. Only valid for Kubernetes mode.
  • external-resource.{resourceName}.yarn.key. Optional config which defines the configuration key of that external resource in Yarn. If you want Flink to request the external resource from Yarn, you need to set this key explicitly. Only valid for Yarn mode.
  • external-resource.{resourceName}.param.{params}. Each ExternalResourceDriver can define its own specific configs following this pattern.

...

Code Block
languagejava
titleExternalResourceDriver
public interface ExternalResourceDriverFactory {
    /**
     * Construct the ExternalResourceDriver from configuration.
     */
    ExternalResourceDriver createExternalResourceDriver(Configuration config);
}

public interface ExternalResourceDriver {
    /**
     * Retrieve the information of the external resources according to the amount.
     */
    Set<? extends ExternalResourceInfo> retrieveResourceInfo(long amount);
}

public interface ExternalResourceInfo {
    String getProperty(String key);
    Collection<String> getKeys();
}
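As a rough illustration of how a driver could implement these interfaces, the sketch below returns one info object per requested unit, numbered from 0. It is not the GPUDriver of this proposal: SequentialIndexDriver, IndexedResourceInfo, and the property key "index" are hypothetical names, and the two interfaces are redeclared as stand-ins so the example compiles on its own. The factory is omitted because it depends on Flink's Configuration class.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Stand-ins mirroring the interfaces above so that the sketch is self-contained.
interface ExternalResourceInfo {
    String getProperty(String key);
    Collection<String> getKeys();
}

interface ExternalResourceDriver {
    Set<? extends ExternalResourceInfo> retrieveResourceInfo(long amount);
}

// Hypothetical info type that exposes a single property named "index".
final class IndexedResourceInfo implements ExternalResourceInfo {
    private final String index;

    IndexedResourceInfo(long index) {
        this.index = String.valueOf(index);
    }

    @Override
    public String getProperty(String key) {
        // Unknown keys yield null in this sketch.
        return "index".equals(key) ? index : null;
    }

    @Override
    public Collection<String> getKeys() {
        return Collections.singleton("index");
    }
}

// Hypothetical driver: reports devices 0 .. amount-1 without probing any hardware.
final class SequentialIndexDriver implements ExternalResourceDriver {
    @Override
    public Set<? extends ExternalResourceInfo> retrieveResourceInfo(long amount) {
        if (amount < 0) {
            throw new IllegalArgumentException("amount must be non-negative: " + amount);
        }
        Set<IndexedResourceInfo> infos = new LinkedHashSet<>();
        for (long i = 0; i < amount; i++) {
            infos.add(new IndexedResourceInfo(i));
        }
        return infos;
    }
}
```

A real driver would replace the loop with actual discovery, for example by running a discovery script and parsing its output.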

On the ResourceManager side, Flink requires that the environment in which task executors run has the required GPU resources and that these GPU resources are accessible to the task executors.

Regarding the amount of GPU resources:

...

For standalone mode, multiple task executors may be co-located on the same device, and each GPU is visible to all of them. To achieve worker-level isolation in such scenarios, the task executors first need to decide cooperatively which of them uses which GPU. We provide a privilege mode for this in the default script.

...