...

  • jobmanager.adaptive-scheduler.scaling-interval.min: allows the user to configure the minimum time between two scaling operations.
  • jobmanager.adaptive-scheduler.scaling-interval.max: optional parameter that allows the user to configure the time after which a scaling operation is triggered regardless of whether the requirements (AdaptiveScheduler#shouldRescale()) are met. If not set, scaling is never forced.
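As a rough illustration of how the two options relate, the sketch below models them as a plain Java value class. The class name ScalingIntervals, its fields, and the check that max is not smaller than min are assumptions made for this example; they are not part of the proposal or of Flink's configuration API.

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical holder for the two proposed options; not Flink's actual configuration API. */
final class ScalingIntervals {
    /** Value of scaling-interval.min: minimum time between two scaling operations. */
    final Duration min;
    /** Value of scaling-interval.max; empty means a rescale is never forced. */
    final Optional<Duration> max;

    /** Pass null for max to leave scaling-interval.max unset. */
    ScalingIntervals(Duration min, Duration max) {
        if (min == null || min.isNegative()) {
            throw new IllegalArgumentException("scaling-interval.min must be non-negative");
        }
        // Assumption of this sketch: a max shorter than min would be contradictory, so reject it.
        if (max != null && max.compareTo(min) < 0) {
            throw new IllegalArgumentException(
                    "scaling-interval.max must not be smaller than scaling-interval.min");
        }
        this.min = min;
        this.max = Optional.ofNullable(max);
    }

    /** True only when scaling-interval.max was set, i.e. forced rescaling is enabled. */
    boolean forcingEnabled() {
        return max.isPresent();
    }
}
```

Modelling the optional parameter as an Optional keeps the "not set" case explicit, so callers cannot confuse it with a zero-length interval.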

...

  • when new slots are available, Flink should rescale immediately only if the last rescale was done more than scaling-interval.min ago; otherwise it should schedule a rescale at (last-rescale + scaling-interval.min). At that point:
    • if the minimum scaling requirements are met (AdaptiveScheduler#shouldRescale), the job is restarted with the new parallelism (as before)
    • if the minimum scaling requirements are not met but the last rescale was done more than scaling-interval.max ago, a rescale is forced.
  • when slots are lost (most of the time after a TaskManager failure), there will be no change compared to the current behavior:
    1. the pipeline transitions to the Restarting state (cf. FLIP-160)
    2. then it transitions to the Waiting for Resources state (cf. FLIP-160), in which the pipeline will not be rescaled before the resource stabilization timeout expires. This protects against subsequent scaling operations (slot losses due to more TaskManager failures, or slot offerings) during this timeout period (configurable via the existing jobmanager.adaptive-scheduler.resource-stabilization-timeout).
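The handling of newly available slots described above can be sketched as two small pure functions. This is a minimal illustration under stated assumptions, not the AdaptiveScheduler implementation: the class CooldownSketch, the Action enum, and both method names are hypothetical, and the requirementsMet flag stands in for the result of AdaptiveScheduler#shouldRescale.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

/** Hypothetical sketch of the cooldown decision; names are not taken from Flink. */
final class CooldownSketch {
    enum Action { RESCALE, FORCE_RESCALE, KEEP_RUNNING }

    /**
     * Returns when newly available slots should be evaluated: immediately if the
     * cooldown (scaling-interval.min) has elapsed, otherwise at lastRescale + min.
     */
    static Instant evaluationTime(Instant now, Instant lastRescale, Duration min) {
        Instant earliest = lastRescale.plus(min);
        return now.isBefore(earliest) ? earliest : now;
    }

    /**
     * Decides what to do at evaluation time. requirementsMet stands in for
     * AdaptiveScheduler#shouldRescale; an empty max means rescaling is never forced.
     */
    static Action decide(
            Instant evalTime, Instant lastRescale, Optional<Duration> max, boolean requirementsMet) {
        if (requirementsMet) {
            return Action.RESCALE;
        }
        boolean maxExceeded =
                max.isPresent() && evalTime.isAfter(lastRescale.plus(max.get()));
        return maxExceeded ? Action.FORCE_RESCALE : Action.KEEP_RUNNING;
    }
}
```

Taking the current time as a parameter instead of reading a system clock keeps the logic deterministic, which makes the timing conditions easy to exercise in unit tests.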

...

The new cooldown period feature should be covered by end-to-end tests. The current set of related end-to-end tests covers only resuming a pipeline with various configuration combinations (file/RocksDB, sync/async, parallelism change/no parallelism change ...). So we need to add E2E tests that cover the use cases described above by measuring the time between scaling operations in various situations. We should be able to use the same DataStreamAllroundTestProgram in the E2E tests. The tests should be comprehensive and cover rescaling under various timing conditions: scaling-interval.min exceeded and not exceeded, scaling-interval.max enabled and disabled, scaling-interval.max exceeded and not exceeded. These tests can be added to the existing ExecutingTest.

Rejected Alternatives

We rejected the option of adding a queue for scaling requests.