...

Page properties


Discussion thread: here (thread)
Vote thread: here (thread)
JIRA: here (FLINK-21883)
Release: <Flink Version>


...

In the context of reactive mode, we would like to introduce a cooldown period after each scaling action, during which no further scaling actions are performed. The goal is to avoid too frequent scaling operations, whether scaling up or scaling down.

...

  • jobmanager.adaptive-scheduler.scaling-interval.min: allows the user to configure the minimum time between two scaling operations.
  • jobmanager.adaptive-scheduler.scaling-interval.max: optional parameter allowing the user to configure the time after which a scaling operation is triggered regardless of whether the requirements (AdaptiveScheduler#shouldRescale()) are met. If not set, rescaling is never forced.

Proposed Changes

The important points are these:

...

This FLIP proposes the following changes:

A. When new slots are available:

  • Flink should rescale immediately only if the last rescale was done more than scaling-interval.min ago.
  • Otherwise it should schedule a rescale at the (now + scaling-interval.min) point in time. This is equivalent to resetting the cooldown period when new slots arrive during a cooldown period. Because we decided to lower the default scaling-interval.min to be more reactive (cf. the compatibility part), resetting the period protects against too frequent rescales.
  • The rescale is done like this:
    • if minimum scaling requirements are met (AdaptiveScheduler#shouldRescale), the job is restarted with new parallelism (as before)
    • if minimum scaling requirements are not met
      • if the last rescale was done more than scaling-interval.max ago, a rescale is forced.
      • otherwise, a forced rescale is scheduled in scaling-interval.max

=> When a rescale is forced, it is done as long as the parallelism has changed. Otherwise, to avoid unnecessary restarts, the rescale is done only when the added resources are above the configured minimum.
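The decision flow of case A can be sketched roughly as follows. This is only an illustration under stated assumptions, not the scheduler's actual code: the class CooldownSketch, the Decision enum and the onNewSlots method are hypothetical names, minRequirementsMet stands in for the result of AdaptiveScheduler#shouldRescale, and an unset scaling-interval.max is modeled as an empty Optional.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical sketch of the case A decision flow; none of these names exist in Flink.
final class CooldownSketch {

    enum Decision {
        RESCALE_NOW,               // restart the job with the new parallelism
        SCHEDULE_AFTER_MIN,        // retry at now + scaling-interval.min
        FORCE_RESCALE,             // forced rescale because scaling-interval.max elapsed
        SCHEDULE_FORCED_AFTER_MAX, // schedule a forced rescale in scaling-interval.max
        NO_ACTION                  // nothing to do
    }

    static Decision onNewSlots(
            Instant now,
            Instant lastRescale,
            Duration scalingIntervalMin,
            Optional<Duration> scalingIntervalMax,
            boolean minRequirementsMet, // stand-in for AdaptiveScheduler#shouldRescale
            boolean parallelismChanged) {

        Duration elapsed = Duration.between(lastRescale, now);

        // Still within the cooldown period: postpone instead of rescaling.
        if (elapsed.compareTo(scalingIntervalMin) <= 0) {
            return Decision.SCHEDULE_AFTER_MIN;
        }
        // Cooldown is over and the minimum requirements are met: rescale as before.
        if (minRequirementsMet) {
            return Decision.RESCALE_NOW;
        }
        // Requirements not met and forcing disabled (scaling-interval.max unset).
        if (!scalingIntervalMax.isPresent()) {
            return Decision.NO_ACTION;
        }
        // Requirements not met: force only once scaling-interval.max has elapsed,
        // and only if the parallelism would actually change.
        if (elapsed.compareTo(scalingIntervalMax.get()) > 0) {
            return parallelismChanged ? Decision.FORCE_RESCALE : Decision.NO_ACTION;
        }
        return Decision.SCHEDULE_FORCED_AFTER_MAX;
    }
}
```

For instance, with scaling-interval.min = 30s and no scaling-interval.max, slots arriving 10s after the last rescale yield SCHEDULE_AFTER_MIN, while slots arriving 60s after it yield RESCALE_NOW if the minimum requirements are met and NO_ACTION otherwise.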

B. When slots are lost (most of the time after a TaskManager failure), there will be no change compared to the current behavior:

    1. the job transitions to Restarting state (cf FLIP-160)
    2. then it transitions to Waiting for Resources state (cf FLIP-160), in which the job will not be rescaled before the stable resources timeout. This will protect against subsequent scaling operations (slot losses due to more TaskManager failures or slot offerings) during this timeout period (configurable via the existing jobmanager.adaptive-scheduler.resource-stabilization-timeout).


The cooldown period will be tied to the Executing state (cf FLIP-160). As a consequence, if the job or the JobManager fails, the current state of the cooldown period is reset.

...

Reactive mode and the adaptive scheduler are already released, but the current behavior has no cooldown period. So the current behavior is equivalent to setting jobmanager.adaptive-scheduler.scaling-interval.min to 0s with no jobmanager.adaptive-scheduler.scaling-interval.max set. Such default values would have no impact on users.

But we could also consider that setting the default jobmanager.adaptive-scheduler.scaling-interval.min to a value higher than 0 would not really break users but rather give them protection against too frequent scale changes.

So this FLIP proposes setting the default values to jobmanager.adaptive-scheduler.scaling-interval.min = 30s and no jobmanager.adaptive-scheduler.scaling-interval.max (force scaling disabled). By default we prefer a lower scaling-interval.min (for more reactive rescaling) and let users increase the value when they have high workloads.

Test Plan

The new cooldown period feature should be covered by end-to-end tests. The current set of related end-to-end tests covers only resuming a pipeline with various configuration combinations (file/rocksDb, sync/async, parallelism change/no parallelism change ...). So we need to add some E2E tests covering the use cases described above, measuring the time between scaling operations in various situations. We should be able to use the same DataStreamAllroundTestProgram in these tests. They should test rescaling in various time conditions: scaling-interval.min exceeded and not exceeded, scaling-interval.max enabled and disabled, scaling-interval.max exceeded and not exceeded. These tests can be added to the existing ExecutingTest.
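As an illustration of "measuring the time between scaling operations", a test could record the timestamp of each rescale and check the gaps afterwards. The helper below is a minimal sketch under assumptions: RescaleIntervalCheck and respectsCooldown are hypothetical names (not part of Flink or its test utilities), the timestamps are assumed to be sorted in ascending order, and a gap exactly equal to scaling-interval.min is treated as acceptable.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical test helper; not part of Flink.
final class RescaleIntervalCheck {

    /**
     * Returns true if no two consecutive rescales are closer together than
     * scalingIntervalMin. Expects rescaleTimes sorted in ascending order.
     */
    static boolean respectsCooldown(List<Instant> rescaleTimes, Duration scalingIntervalMin) {
        for (int i = 1; i < rescaleTimes.size(); i++) {
            Duration gap = Duration.between(rescaleTimes.get(i - 1), rescaleTimes.get(i));
            if (gap.compareTo(scalingIntervalMin) < 0) {
                return false; // two rescales happened within the cooldown period
            }
        }
        return true; // zero or one rescale trivially respects the cooldown
    }
}
```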

Rejected Alternatives

The option of adding a queue for scaling requests was rejected.