...

In the context of reactive mode, we would like to introduce a cooldown period after each scaling action, during which no further scaling actions are performed. The goal is to avoid overly frequent scaling operations, whether scaling up or scaling down.

...

The important points are:

  • When new slots become available, Flink should rescale immediately only if the last rescale happened more than scaling-interval.min ago. Otherwise, it should schedule a rescale at time (last-rescale + scaling-interval.min).
  • When slots are lost (most often after a TaskManager failure), nothing changes compared to the current behavior:
    1. The pipeline transitions to the Restarting state (cf. FLIP-160).
    2. It then transitions to the Waiting for Resources state (cf. FLIP-160), in which the pipeline is not rescaled before the resource stabilization timeout expires. This protects against subsequent scaling operations (slot losses due to further TaskManager failures, or new slot offerings) during the timeout period, which is configurable via the existing jobmanager.adaptive-scheduler.resource-stabilization-timeout option.
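
The rule for newly available slots can be illustrated with a minimal sketch. This is not the adaptive scheduler's actual code: the class CooldownSketch and the method delayUntilRescale are hypothetical names introduced here, and the sketch only assumes that the scheduler knows the time of the last rescale and the configured scaling-interval.min. A returned delay of zero means "rescale immediately"; any other value is the time to wait until (last-rescale + scaling-interval.min).

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of the cooldown rule; not Flink's implementation.
public class CooldownSketch {

    /**
     * Returns how long to wait before rescaling once new slots are available.
     * Duration.ZERO means the cooldown has elapsed and the rescale can happen now.
     */
    public static Duration delayUntilRescale(
            Instant lastRescale, Instant now, Duration scalingIntervalMin) {
        // Earliest point in time at which the next rescale is allowed.
        Instant earliestAllowed = lastRescale.plus(scalingIntervalMin);
        if (!now.isBefore(earliestAllowed)) {
            return Duration.ZERO;
        }
        // Still inside the cooldown: schedule the rescale for earliestAllowed.
        return Duration.between(now, earliestAllowed);
    }

    public static void main(String[] args) {
        Instant lastRescale = Instant.parse("2023-01-01T00:00:00Z");
        Duration min = Duration.ofSeconds(300);

        // 100s after the last rescale: wait the remaining 200s.
        System.out.println(delayUntilRescale(lastRescale, lastRescale.plusSeconds(100), min)); // PT3M20S
        // 400s after the last rescale: the cooldown has elapsed.
        System.out.println(delayUntilRescale(lastRescale, lastRescale.plusSeconds(400), min)); // PT0S
    }
}
```

With scaling-interval.min set to 0s, the sketch always returns a zero delay, which matches a scheduler that has no cooldown period.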

...

Reactive mode and the adaptive scheduler are already released, but the current behavior has no cooldown period. The current behavior is therefore equivalent to setting jobmanager.adaptive-scheduler.scaling-interval.min to 0s with no jobmanager.adaptive-scheduler.scaling-interval.max set. Such default values will have no impact on users.

However, we could also argue that a default jobmanager.adaptive-scheduler.scaling-interval.min of 300s would not really break anything for users, but would instead protect them against overly frequent scale changes.

...

Rejected Alternatives

The option of adding a queue for scaling requests was rejected.