Page History

...

Support autoscaling Flink jobs deployed on Kubernetes via the Flink Kubernetes operator
Provide a default Flink job metric collector via the Flink metrics API
Provide a robust default scaling algorithm
1. Ensure scaling yields effective usage of assigned task slots, i.e. prevents backpressure but does not overprovision
2. Ramp up in case of any backlog to ensure it gets processed in a timely manner
3. Minimize the number of scaling decisions to prevent costly rescale operation
Allow plugging in a custom scaling algorithm and metric collection code

Non-Goals (for now)

Vertical scaling via adjusting container resources or changing the number of task slots per TM
Autoscaling in standalone Flink session cluster deployment mode
Integrate with the k8s scheduler to reserve required resources for rescaling

...

In this diagram, we show an example dataflow which reads from two different sources. Log 1 has a backlog growth rate of 100 records per time unit. Similarly, Log 2 has a backlog growth of 500. This means that without any processing, the backlog grows by the 100 or 500 records, respectively. Source 1 is able to read 10 records per time unit, Source 2 reads 50 records per time unit. The downstream operators process the records from the source and produce new records for their downstream operators. For simplicity, the number of inflowing and outflowing records for an operator is always the same here. The actual algorithm doesn't make that assumption.

Image RemovedImage Added

According to the reference paper, in the worst case it takes up to three attempts to reach this state. However, it can already converge after one scaling attempt. Note that the algorithm handles up- and downscaling. In the example, one of the vertices is scaled by parallelism 1 which won't change the parallelism, but it is also possible for vertices to be scaled down if the scaling factor is lower than 1.

...

Page tree

Versions Compared

Old Version 24

New Version Current

Key

Non-Goals (for now)