Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The declarative scheduler will be a beta feature which the user has to activate explicitly by setting the config option jobmanager.scheduler: declarative. It will only be chosen if the user submitted a streaming job.

Limitations & future improvements

The first version of the declarative scheduler will come with a handful of limitations in order to reduce the scope of it.

Streaming jobs only

The declarative scheduler runs with streaming jobs only. When submitting a batch job, then the default scheduler will be used.

No support for local recovery

In the first version of the scheduler we don't intend to support local recovery. Adding support for it should be possible and we intend to add support for it as a follow up. One idea could be to extend the existing state machine by a new state "Restarting locally":

PlantUML
@startuml
hide empty description

[*] -> Created
Created --> Waiting : Start scheduling
state "Waiting for resources" as Waiting
state "Restarting globally" as RestartingG
state "Restarting locally" as RestartingL
Waiting --> Waiting : Resources are not stable yet
Waiting --> Executing : Resources are stable
Waiting --> Finished : Cancel or suspend
Executing --> Canceling : Cancel
Executing --> Failing : Unrecoverable fault
Executing --> Finished : Suspend or job reached terminal state
Executing --> RestartingG : Recoverable global fault
Executing --> RestartingL : Recoverable local fault
RestartingL --> Executing : Recovered locally
RestartingL --> RestartingL : Recoverable local fault
RestartingL --> RestartingG : Local recovery timeout
RestartingL --> Canceling : Cancel
RestartingL --> Finished : Suspend
RestartingL --> Failing : Unrecoverable fault
RestartingG --> Finished : Suspend
RestartingG --> Canceling : Cancel
RestartingG --> Waiting : Cancelation complete
Canceling --> Finished : Cancelation complete
Failing --> Finished : Failing complete
Finished -> [*]

@enduml


No support for local failovers

Supporting local failovers is another feature which we want to add as a follow up. Adding support for it allows to not having to restart the whole job.

No integration with Flink's web UI

The declarative scheduler allows that a job's parallelism can change over its lifetime. This means that we have to extend the web UI to be able to display different forms of a job. One idea would be to have a timeline which allows to pick a time for which the web UI displays the current job. This will require changes on the backend as well as frontend side.

No support for fine grained resource specifications

For the sake of simplicity and narrowing down the scope, the declarative scheduler will ignore any resource specifications. In the future when having different resource profiles to fulfil, it will be the task of the ResourceManager to make sure that different resource requirements are fulfilled equally well.

Non-zero rescaling

Rescaling happens through restarting the job, thus jobs with large state might need a lot of resources and time to rescale. Rescaling a job causes downtime of your job, but no data loss.

...