Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Detection and restart of the process(es) that execute the JobManager and ResourceManager

  • Recovery of the job’s JobGraph and libraries


...

 

Architecture with Cluster Managers


 

YARN

Compared to the state in Flink 1.1, the new Flink-on-YARN architecture offers the following benefits:

  • The client directly starts the Job in YARN, rather than bootstrapping a cluster and after that submitting the job to that cluster. The client can hence disconnect immediately after the job was submitted

  • All user code libraries and config files are directly in the Application Classpath, rather than in the dynamic user code class loader

  • Containers are requested as needed and will be released when not used any more

  • The “as needed” allocation of containers allows for different profiles of containers (CPU / memory) to be used for different operators


Without Dispatcher

Image Added


With Dispatcher

Image Added


Yarn-specific Fault Tolerance Aspects

ResourceManager and JobManager run inside the ApplicationMaster process. Failure detection and restart of that process is done by YARN.

JobGraph and libraries are always part of the working directory from which the ApplicationMaster processes is spawned. Internally, YARN stores them in a private HDFS directory.


Mesos

Mesos based setups are similar to YARN with a dispatcher. A dispatcher is strictly required for Mesos, because it is the only way to have the Mesos-specific ResourceManager run inside the Mesos cluster.

Image Added


Mesos-specific Fault Tolerance Aspects

ResourceManager and JobManager run inside a regular Mesos container. The Dispatcher is responsible for monitoring and restarting those containers in case they fail. The Dispatcher itself must be made highly available by a Mesos service like Marathon, as in this picture:


Image Added







JobGraph and libraries need to be stored by the dispatcher in a persistent storage, typically the same storage where the checkpoints are stored.

Public Interfaces

Briefly list any new interfaces that will be introduced as part of this proposal or any existing interfaces that will be removed or changed. The purpose of this section is to concisely call out the public contract that will come along with this feature.

...