...

  • Extend jobmanager.scheduler to accept the new value declarative, which activates the declarative scheduler
  • Introduce declarative-scheduler.resource-timeout to configure the resource timeout for the "Waiting for resources" state
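As a rough illustration of the two options above, a cluster configuration might look like the following fragment. This is only a sketch: the option keys come from this proposal, while the timeout value and its duration syntax are assumptions chosen for illustration, not defaults defined here.

```yaml
# Select the declarative scheduler for the whole cluster (new value proposed in this FLIP).
jobmanager.scheduler: declarative

# How long the scheduler may stay in the "Waiting for resources" state.
# The value "10 s" is a hypothetical example; this FLIP section does not specify a default.
declarative-scheduler.resource-timeout: 10 s
```

Because the scheduler is selected in the cluster configuration, the setting applies to every job submitted to that cluster.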

Compatibility, Deprecation, and Migration Plan

...

Rescaling is performed by restarting the job, so jobs with large state might need considerable resources and time to rescale. Rescaling causes job downtime but no data loss.

Per-job configuration

It might be useful to select the scheduler on a per-job basis. Within the scope of this FLIP, the scheduler will only be configurable for the whole cluster. Hence, introducing a job-level configuration option for selecting the scheduler could be a good follow-up.

Slow performance when recovering from a fault

Creating an ExecutionGraph is a costly operation (see FLINK-21110) and can also involve I/O operations if certain sources/sinks are used, so failover might not be very fast. If this becomes a problem, we will have to consider pulling one-time initialization tasks out of the ExecutionGraph to speed up its creation and, in turn, the failover.

Test Plan

The new scheduler needs extensive unit, integration (IT), and end-to-end testing because it is a crucial component at the heart of Flink.

...