Status

Current state: "Under Discussion"

Discussion thread: here (<- link to https://mail-archives.apache.org/mod_mbox/flink-dev/)

JIRA: Under Discussion

Released: <Flink Version>TODO


Motivation


Flink runs as one JobManager process together with multiple TaskManager processes. The JobManager process contains three core components: the Dispatcher, the ResourceManager, and the RestEndpoint. Each component has its own responsibilities, as follows:

  • Dispatcher: receives the user's JobGraph, parses it, and creates a thread-level JobManagerRunner for each job. The JobManagerRunner creates a JobMaster. The JobMaster builds the ExecutionGraph, handles checkpoint coordination, shuffle, and scheduling, and then sends resource requests to the ResourceManager.
  • RestEndpoint: receives JobGraphs submitted by the client or by a WebUI user. It maintains a small web server internally and uses different handlers to route requests to the Dispatcher.
  • ResourceManager: responsible for resource allocation and management. Each environment or resource management platform (such as Standalone deployment, YARN, or K8S) has its own implementation for allocating the concrete resources.

...

At present, when the Dispatcher receives a job (JobGraph) submitted by the user through the client, it creates a JobMaster thread for that job, and the thread runs inside the Dispatcher. Each job requests the Slot resources its topology needs from the ResourceManager and communicates with the TaskManagers. The overall process is shown in the figure below.


// TODO



The process is as follows:

...

This proposal describes a way to split the current process, together with a new process implementation that stays compatible with the original process model: the JobMaster component is split out of the JobManager, and a parameter controls whether the new process is enabled. When the user enables the split scheme, the JobMaster runs as an independent service, which reduces the workload of the JobManager. A new Dispatcher communicates with one or more split-out JobMasters to manage jobs.


Public Interfaces

...


  • Add a new configuration option 'jobmanager.jobmaster.single.enable' for enabling this feature.

  • Add a new bash script for manually starting JobMaster processes.

  • Add a new configuration file 'JOB_MASTERS' for specifying the JobMaster processes to be started when using 'start-cluster.sh'.
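
As an illustration of the first item, the option would presumably be set with a line such as jobmanager.jobmaster.single.enable: true in the Flink configuration. The following is a minimal sketch of how such a flag could be read, not the proposed implementation. The class and method names are hypothetical, the default of false is an assumption, and plain java.util.Properties stands in for Flink's own configuration classes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper: decides whether JobMaster separation is switched on. */
final class JobMasterSeparationConfig {

    // Option name taken from the proposal; the default (false) is an assumption.
    static final String KEY = "jobmanager.jobmaster.single.enable";

    static boolean isSeparationEnabled(String confText) {
        Properties props = new Properties();
        try {
            // java.util.Properties accepts ':' as a key/value separator,
            // which is enough for flat "key: value" lines in this sketch.
            props.load(new StringReader(confText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Boolean.parseBoolean(props.getProperty(KEY, "false").trim());
    }

    public static void main(String[] args) {
        String conf = "jobmanager.rpc.address: localhost\n" + KEY + ": true\n";
        System.out.println(isSeparationEnabled(conf)); // prints "true"
    }
}
```

With the option absent, the sketch falls back to the existing single-process behavior, which matches the proposal's goal of staying compatible with the original process model.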

Proposed Changes


The current Flink runtime consists of two types of processes: one JobManager and one or more TaskManagers. The RestEndpoint in the JobManager receives client commands such as job submission, cancellation, and Savepoint operations, and forwards them to the Dispatcher for processing.

...

When the user configuration option jobmanager.jobmaster.single.enable is set to true, JobManager process separation is enabled. With separation enabled, the JobManager uses the JobMasterDispatcher implementation, runs the JobMasterPrimaryDispatcher internally, receives the job requests submitted by the client, and forwards them to the registered JobMasterContainerRunner. When the JobMasterContainerRunner receives these requests, it creates multiple JobMaster threads in its own process and runs them as they would run without separation. The separated process model is as follows:


//TODO Picture
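
The forwarding path described above can be pictured with a small toy model. This is only a sketch under stated assumptions: PrimaryDispatcher and ContainerRunner are hypothetical stand-ins for JobMasterPrimaryDispatcher and JobMasterContainerRunner, the runner is an in-memory object rather than a separate process, and round-robin selection is an assumption because the proposal does not say how a runner is chosen.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the separated process; all type names here are hypothetical. */
final class SeparatedDispatchSketch {

    /** Stands in for a separate JobMaster process hosting one JobMaster thread per job. */
    static final class ContainerRunner {
        final String name;
        final List<String> jobs = new ArrayList<>();

        ContainerRunner(String name) {
            this.name = name;
        }

        void submit(String jobId) {
            jobs.add(jobId);
            // One thread per job, mirroring the thread-level JobMaster of the original model.
            Thread jobMaster = new Thread(() -> { /* job scheduling would happen here */ },
                    "jobmaster-" + jobId);
            jobMaster.start();
        }
    }

    /** Stands in for the dispatcher inside the JobManager: it only forwards requests. */
    static final class PrimaryDispatcher {
        private final List<ContainerRunner> runners = new ArrayList<>();
        private int next = 0;

        void register(ContainerRunner runner) {
            runners.add(runner);
        }

        ContainerRunner submitJob(String jobId) {
            if (runners.isEmpty()) {
                throw new IllegalStateException("no runner registered");
            }
            // Round-robin selection is an assumption made for this sketch.
            ContainerRunner target = runners.get(next);
            next = (next + 1) % runners.size();
            target.submit(jobId);
            return target;
        }
    }

    public static void main(String[] args) {
        PrimaryDispatcher dispatcher = new PrimaryDispatcher();
        dispatcher.register(new ContainerRunner("runner-a"));
        dispatcher.register(new ContainerRunner("runner-b"));
        for (String jobId : List.of("job-1", "job-2", "job-3")) {
            System.out.println(jobId + " -> " + dispatcher.submitJob(jobId).name);
        }
    }
}
```

The point of the sketch is the division of labor: the dispatcher in the JobManager no longer hosts JobMaster threads itself and only decides where a job goes, while each registered runner creates and owns the JobMaster threads for the jobs it receives.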


After the separation parameter (jobmanager.jobmaster.single.enable) is enabled, the job submission process looks like this:

...