...
Detection and restart of the process(es) that execute the JobManager and ResourceManager
Recovery of the job’s JobGraph and libraries
...
Architecture with Cluster Managers
YARN
Compared to the state in Flink 1.1, the new Flink-on-YARN architecture offers the following benefits:
The client directly starts the Job in YARN, rather than bootstrapping a cluster and after that submitting the job to that cluster. The client can hence disconnect immediately after the job was submitted
All user code libraries and config files are directly in the Application Classpath, rather than in the dynamic user code class loader
Containers are requested as needed and will be released when not used any more
The “as needed” allocation of containers allows for different profiles of containers (CPU / memory) to be used for different operators
Without Dispatcher
With Dispatcher
Yarn-specific Fault Tolerance Aspects
ResourceManager and JobManager run inside the ApplicationMaster process. Failure detection and restart of that process is done by YARN.
JobGraph and libraries are always part of the working directory from which the ApplicationMaster processes is spawned. Internally, YARN stores them in a private HDFS directory.
Mesos
Mesos based setups are similar to YARN with a dispatcher. A dispatcher is strictly required for Mesos, because it is the only way to have the Mesos-specific ResourceManager run inside the Mesos cluster.
Mesos-specific Fault Tolerance Aspects
ResourceManager and JobManager run inside a regular Mesos container. The Dispatcher is responsible for monitoring and restarting those containers in case they fail. The Dispatcher itself must be made highly available by a Mesos service like Marathon, as in this picture:
JobGraph and libraries need to be stored by the dispatcher in a persistent storage, typically the same storage where the checkpoints are stored.
Public Interfaces
Briefly list any new interfaces that will be introduced as part of this proposal or any existing interfaces that will be removed or changed. The purpose of this section is to concisely call out the public contract that will come along with this feature.
...