...

The important points are:

...

A. When new slots are available:

  • Flink should rescale immediately only if the last rescale was done more than scaling-interval.min ago.
  • Otherwise it should schedule a rescale at the (now + scaling-interval.min) point in time. This is equivalent to resetting the cooldown period when new slots arrive during a cooldown period. Since we decided to lower the scaling-interval.min default value to be more reactive (cf compatibility part), resetting the period protects against too frequent rescales.
  • The rescale is done like this: 
    • if minimum scaling requirements are met (AdaptiveScheduler#shouldRescale), the job is restarted with new parallelism (as before)
    • if minimum scaling requirements are not met but last rescale was done more than scaling-interval.max ago, a rescale is forced.

                     => except when scaling-interval.max is exceeded, AdaptiveScheduler#shouldRescale is always called upon a rescale to avoid unnecessary restarts.
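
The decision described in case A can be sketched as follows. This is a minimal illustration, not the AdaptiveScheduler implementation: the class CooldownSketch, its Action enum and its method names are hypothetical, and the requirementsMet flag stands in for the result of AdaptiveScheduler#shouldRescale.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

/** Hypothetical sketch of the cooldown decision for case A; not Flink's actual code. */
final class CooldownSketch {

    enum Action { RESTART_WITH_NEW_PARALLELISM, FORCED_RESCALE, NO_RESCALE, SCHEDULE_LATER }

    /** Outcome of a decision; scheduledAt is only set for SCHEDULE_LATER. */
    static final class Decision {
        final Action action;
        final Instant scheduledAt;

        Decision(Action action, Instant scheduledAt) {
            this.action = action;
            this.scheduledAt = scheduledAt;
        }
    }

    /**
     * Called when new slots become available.
     *
     * @param min scaling-interval.min
     * @param max scaling-interval.max, empty when forced rescaling is disabled
     * @param requirementsMet assumed result of AdaptiveScheduler#shouldRescale
     */
    static Decision onNewSlots(Instant now, Instant lastRescale, Duration min,
                               Optional<Duration> max, boolean requirementsMet) {
        Duration sinceLastRescale = Duration.between(lastRescale, now);

        // Still inside the cooldown period: defer to now + min, which
        // effectively restarts the cooldown from the current time.
        if (sinceLastRescale.compareTo(min) <= 0) {
            return new Decision(Action.SCHEDULE_LATER, now.plus(min));
        }
        return rescale(sinceLastRescale, max, requirementsMet);
    }

    /** The rescale step itself: restart, force, or do nothing. */
    static Decision rescale(Duration sinceLastRescale, Optional<Duration> max,
                            boolean requirementsMet) {
        if (requirementsMet) {
            return new Decision(Action.RESTART_WITH_NEW_PARALLELISM, null);
        }
        // Requirements not met: only force a rescale once max has been exceeded.
        boolean maxExceeded = max.isPresent() && sinceLastRescale.compareTo(max.get()) > 0;
        return new Decision(maxExceeded ? Action.FORCED_RESCALE : Action.NO_RESCALE, null);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Duration min = Duration.ofSeconds(30);
        Decision d = onNewSlots(now, now.minusSeconds(10), min, Optional.empty(), true);
        System.out.println(d.action + " at " + d.scheduledAt);
    }
}
```

In this sketch, slots arriving 10s after the last rescale with a 30s minimum interval yield SCHEDULE_LATER, even if the scaling requirements are met. A deferred rescale would go through the same rescale step once it fires.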

B. When slots are lost (most of the time after a TaskManager failure), there will be no change compared to the current behavior:

    1. the job transitions to Restarting state (cf FLIP-160)
    2. then it transitions to Waiting for Resources state (cf FLIP-160) in which the job will not be rescaled before stable resources timeout. This will protect against subsequent scaling operations (slot losses due to more TaskManager failures or slot offerings) during this timeout period (configurable via existing jobmanager.adaptive-scheduler.resource-stabilization-timeout).


The cooldown period will be tied to the Executing state (cf FLIP-160). As a consequence, if the job or the JobManager fails, the current state of the cooldown period is reset.

...

But we could also consider that setting the default jobmanager.adaptive-scheduler.scaling-interval.min to a value higher than 0 would not really break users but rather give them protection against too frequent scale changes.

So this FLIP proposes default values of jobmanager.adaptive-scheduler.scaling-interval.min = 30s and no jobmanager.adaptive-scheduler.scaling-interval.max (forced scaling disabled). We prefer to favor lower values (for a smooth rescale experience) and consider higher values as exceptions set by users when they have high workloads.
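
As an illustration, a user with a heavy workload who wants a longer cooldown and forced rescales might override these defaults in flink-conf.yaml as sketched below. The option names are the ones proposed in this FLIP and are assumed to be adopted as-is; the 300s and 600s values are arbitrary examples, not recommendations.

```yaml
# Hypothetical override of the proposed defaults (values are examples only).
# Minimum time between two rescale operations (the cooldown period).
jobmanager.adaptive-scheduler.scaling-interval.min: 300s
# Unset by default; once set, a rescale is forced after this interval
# even if the minimum scaling requirements are not met.
jobmanager.adaptive-scheduler.scaling-interval.max: 600s
```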

Test Plan

The new cooldown period feature should be covered by comprehensive tests. They should test rescaling under various time conditions: scaling-interval.min exceeded and not exceeded, scaling-interval.max enabled and disabled, scaling-interval.max exceeded and not exceeded. These tests can be added to the existing ExecutingTest.

...