

Discussion thread: here (thread)
Vote thread: here ()
JIRA: here (FLINK-21883)
Release: <Flink Version>

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

In reactive mode, we would like to introduce a cooldown period after each scaling operation, during which no further scaling operation is performed. The goal is to avoid overly frequent scaling operations, whether scaling up or scaling down.

Public Interfaces

The only addition is a new user configuration option, scaling-cooldown-period, which lets the user configure the minimum time between two scaling operations.

Proposed Changes

The key points are the following. When a scaling event is received, whether for scaling up or scaling down:

  • If it falls outside a cooldown period, it is executed right away and a cooldown timer is started.
  • If it falls within a cooldown period, it is not dropped but queued.
  • Receiving a scaling event during a cooldown period does not reset the timer, so that scaling operations are not delayed further.
  • When the period ends, all queued scaling operations are aggregated into a single operation. This operation is executed and a new scaling-cooldown-period is started.
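
The rules above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the proposed implementation: the class name CooldownGate and its methods are hypothetical and are not part of Flink, timestamps are passed in explicitly instead of using a real timer, and since this page does not specify how queued operations are aggregated, the sketch assumes that the most recent requested parallelism wins.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.OptionalInt;

/** Hypothetical sketch of the cooldown rules; not a Flink class. */
final class CooldownGate {
    private final long cooldownMillis;
    private final Deque<Integer> queued = new ArrayDeque<>();
    // No cooldown is active before the first scaling operation.
    private long cooldownEndsAt = Long.MIN_VALUE;

    CooldownGate(Duration cooldown) {
        this.cooldownMillis = cooldown.toMillis();
    }

    /** Returns the parallelism to apply right away, or empty if the event was queued. */
    OptionalInt onScalingEvent(int targetParallelism, long nowMillis) {
        if (nowMillis >= cooldownEndsAt) {
            // Outside a cooldown period: execute immediately and start the timer.
            // Anything still queued is superseded under the "latest wins" assumption.
            queued.clear();
            cooldownEndsAt = nowMillis + cooldownMillis;
            return OptionalInt.of(targetParallelism);
        }
        // Inside a cooldown period: queue the event; the timer is NOT reset.
        queued.addLast(targetParallelism);
        return OptionalInt.empty();
    }

    /** Called when the cooldown timer fires. Returns the aggregated operation, if any. */
    OptionalInt onCooldownEnd(long nowMillis) {
        if (queued.isEmpty()) {
            return OptionalInt.empty(); // nothing to trigger, no new cooldown period
        }
        int aggregated = queued.peekLast(); // assumption: the latest request wins
        queued.clear();
        cooldownEndsAt = nowMillis + cooldownMillis; // executing starts a new period
        return OptionalInt.of(aggregated);
    }
}
```

With a cooldown of 300s, an event at t=0 is applied immediately, events at t=100s and t=200s are queued, and the timer firing at t=300s yields one aggregated operation and starts a new period. With a cooldown of 0s every event is applied immediately, which matches the behavior without a cooldown period.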


The diagram below shows the different steps and cases:


[Sequence diagram. Participants: Scheduler, ScalingOperationQueue, CooldownTimer. Messages in order: scaling event, trigger scale change, start timer (first scaling-cooldown-period begins); scaling event, queue operation; scaling event, queue operation; end of cooldown period, dequeue operations, aggregate operations, trigger scale change, start timer (second scaling-cooldown-period begins); end of cooldown period, no operation to trigger.]

The diagram reads as follows:

  • A first scaling event is received and the scaling operation is executed right away, which starts a cooldown period.
  • Then, two scaling events arrive during the cooldown period. These events are queued.
  • When the cooldown period ends, the queued operations are aggregated and executed as a single scaling operation.
  • Executing this operation starts another cooldown period. During this period no scaling event is received, so no new scaling operation is queued.
  • When this last cooldown period ends, there is nothing left to trigger and the scheduler has finished its scaling work.

Compatibility, Deprecation, and Migration Plan

Reactive mode and the adaptive scheduler are already released, and their current behavior has no cooldown period. The current behavior is therefore equivalent to setting the new scaling-cooldown-period configuration parameter to 0s. With 0s as the default, existing users would see no change.

Alternatively, a default scaling-cooldown-period of 300s would not break existing users and would protect them against too frequent scale changes.

=> I tend to prefer a default of scaling-cooldown-period = 300s when reactive mode is enabled.

Test Plan

The new cooldown period feature should be covered by end-to-end tests. The current set of related end-to-end tests covers only resuming a pipeline with various configuration combinations (file/rocks, sync/async, parallelism change/no parallelism change, ...). So we need to add E2E test cases that cover the use cases described in the sequence diagram above and measure the time between scaling operations in various situations. We should be able to use the same DataStreamAllroundTestProgram in the E2E tests.
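
As an illustration of the "measure the time between scaling operations" check, the following is a minimal sketch of a helper such a test could use. It is an assumption, not existing Flink test code: the class name CooldownAssertions and its methods are hypothetical, and how the rescale timestamps are collected (for example from logs) is left open here.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

/** Hypothetical test helper; not part of the Flink test utilities. */
final class CooldownAssertions {

    /** Smallest gap between consecutive rescale timestamps (sorted ascending, at least two). */
    static Duration minGap(List<Instant> rescaleTimes) {
        if (rescaleTimes.size() < 2) {
            throw new IllegalArgumentException("need at least two rescale timestamps");
        }
        Duration min = null;
        for (int i = 1; i < rescaleTimes.size(); i++) {
            Duration gap = Duration.between(rescaleTimes.get(i - 1), rescaleTimes.get(i));
            if (min == null || gap.compareTo(min) < 0) {
                min = gap;
            }
        }
        return min;
    }

    /** True if no two consecutive rescales are closer together than the cooldown. */
    static boolean respectsCooldown(List<Instant> rescaleTimes, Duration cooldown) {
        return rescaleTimes.size() < 2 || minGap(rescaleTimes).compareTo(cooldown) >= 0;
    }
}
```

A test would record the time of each observed rescale, then assert that respectsCooldown holds for the configured scaling-cooldown-period.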

Rejected Alternatives

When scaling operations are dequeued, executing them one by one, each separated by a scaling-cooldown-period, was rejected because it would add too much delay to scaling.
