Status
Current state: Under Discussion
Discussion thread:
JIRA:
Released:
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Streaming jobs which run for several days or longer usually experience changes in their workload during their lifetime. These changes can originate from seasonal spikes, such as day vs. night, weekdays vs. weekend or holidays vs. non-holidays, sudden events or simply growing popularity of your product. Some of these changes are more predictable than others but what all have in common is that they change the resource demand of your job if you want to keep the same service quality for your customers.
Even if you can estimate an upper bound for the maximum resource requirements, it is almost always prohibitively expensive to run your job from the very beginning with max resources. Consequently, it would be great if Flink could make use of TaskManagers which are.
Similarly, if the underlying resource management system decides that some of the currently assigned resources are needed elsewhere, Flink should not fail but scale the job down once the resources are revoked. Such a behaviour would make Flink a nicer citizen of the respective resource manager.
In contrast to the active mode, the reactive mode is oblivious to its underlying runtime system. Due to this aspect, Flink can only react to newly available or removed resources. The goal of this mode would be to use available TaskManagers by scaling the job up/down.
Elastic streaming pipelines
Flink is often used as the analytics component in a streaming pipeline with various upstream systems. In order to support fully elastic end-to-end streaming pipelines all pipeline components need to be able to react to changes in resource requirements. This means that Flink needs to be able to make use of newly available TaskManagers. Ideally, Flink itself is able to realize how many resources it needs and actively requests them.
Changing availability of resources
When Flink runs on top of a resource manager it can happen that Flink won’t obtain all requested resources in a timely manner. Instead of waiting potentially for a very long time, it would be best if Flink could start executing the job with the currently available set of resources. If more resources become available later, Flink should be able to scale up in order to make use of them.
Application mode (aka Flink-as-a-library)
Flink’s application mode (aka Flink-as-a-library) requires that Flink is able to make use of newly available TaskManagers. The idea of the application mode is that the user can start Flink processes where one process becomes the master and the other processes become the workers. The workers will spawn a TaskManager process and offer their resources to the master which runs the JobManager. Since not all application processes might be started at the same time because it might happen on-demand, it is important that Flink is able to react to newly available TaskManagers by scaling the application up or down.
Execution modes
As some of the described examples showed, different use cases have different requirements for Flink. In the case of elastic streaming pipelines it would be best if Flink could actively allocate new resources in order to not become the bottleneck of the pipeline. In case of the application mode, new processes are usually started by an external system and Flink only reacts to external decisions. These behavioral differences hint towards different execution modes of Flink.