Status

...


Discussion thread

...

...

JIRA: FLINK-21075

...

Release: 1.13


Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

PlantUML
@startuml
hide empty description

[*] -> Created
Created --> Waiting : Start scheduling
state "Waiting for resources" as Waiting
Waiting --> Waiting : Resources are not stable yet
Waiting --> Executing : Resources are stable
Waiting --> Finished : Cancel, suspend or not enough \nresources for executing
Executing --> Canceling : Cancel
Executing --> Failing : Unrecoverable fault
Executing --> Finished : Suspend or job reached terminal state
Executing --> Restarting : Recoverable fault
Restarting --> Finished : Suspend
Restarting --> Canceling : Cancel
Restarting --> Waiting : Cancelation complete
Canceling --> Finished : Cancelation complete
Failing --> Finished : Failing complete
Finished -> [*]

@enduml
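
As a rough illustration, the diagram above can be read as a transition table. The following is only a sketch in Python, not the scheduler's implementation: the event names (`start_scheduling`, `resources_stable`, and so on) are hypothetical labels derived from the edge texts of the diagram, and `next_state` and `run` are illustrative helpers.

```python
from enum import Enum, auto


class State(Enum):
    CREATED = auto()
    WAITING_FOR_RESOURCES = auto()
    EXECUTING = auto()
    RESTARTING = auto()
    CANCELING = auto()
    FAILING = auto()
    FINISHED = auto()


# One entry per edge of the diagram. The event strings are hypothetical
# names chosen for this sketch; they only mirror the edge labels.
TRANSITIONS = {
    (State.CREATED, "start_scheduling"): State.WAITING_FOR_RESOURCES,
    (State.WAITING_FOR_RESOURCES, "resources_not_stable"): State.WAITING_FOR_RESOURCES,
    (State.WAITING_FOR_RESOURCES, "resources_stable"): State.EXECUTING,
    (State.WAITING_FOR_RESOURCES, "cancel"): State.FINISHED,
    (State.WAITING_FOR_RESOURCES, "suspend"): State.FINISHED,
    (State.WAITING_FOR_RESOURCES, "not_enough_resources"): State.FINISHED,
    (State.EXECUTING, "cancel"): State.CANCELING,
    (State.EXECUTING, "unrecoverable_fault"): State.FAILING,
    (State.EXECUTING, "suspend"): State.FINISHED,
    (State.EXECUTING, "job_terminal"): State.FINISHED,
    (State.EXECUTING, "recoverable_fault"): State.RESTARTING,
    (State.RESTARTING, "suspend"): State.FINISHED,
    (State.RESTARTING, "cancel"): State.CANCELING,
    (State.RESTARTING, "cancelation_complete"): State.WAITING_FOR_RESOURCES,
    (State.CANCELING, "cancelation_complete"): State.FINISHED,
    (State.FAILING, "failing_complete"): State.FINISHED,
}


def next_state(state, event):
    """Look up the successor state; reject events the diagram does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None


def run(events, start=State.CREATED):
    """Apply a sequence of events, starting from the given state."""
    state = start
    for event in events:
        state = next_state(state, event)
    return state
```

For example, `run(["start_scheduling", "resources_stable", "recoverable_fault", "cancelation_complete"])` ends in `State.WAITING_FOR_RESOURCES`, which reflects that a recoverable fault sends the job back to waiting for resources before it executes again.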



The states have the following semantics:

...

The scheduler relies on the following services to do its job. The different states use these services to decide on state transitions and to perform certain operations.


PlantUML

@startuml

package "Adaptive Scheduler" {
  [SlotAllocator]
  [FailureHandler]
  [ScaleUpController]
}

@enduml
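
To illustrate how a state might consult one of these services when deciding on a transition, here is a minimal Python sketch. It is an assumption-laden illustration, not the actual interface: the method `can_restart`, the toy policy `RestartOnIOErrors`, and the function `on_failure_while_executing` are all hypothetical names invented for this example. Only the service name `FailureHandler` and the target states come from this page.

```python
from typing import Protocol


class FailureHandler(Protocol):
    """Hypothetical shape of the FailureHandler service (assumed, not specified here)."""

    def can_restart(self, cause: Exception) -> bool:
        """Return True if the fault is considered recoverable."""
        ...


class RestartOnIOErrors:
    """Toy policy for this sketch: I/O errors are recoverable, everything else is not."""

    def can_restart(self, cause: Exception) -> bool:
        return isinstance(cause, OSError)


def on_failure_while_executing(cause: Exception, failure_handler: FailureHandler) -> str:
    """Return the name of the state to enter after a fault in Executing.

    The state delegates the decision to the service instead of hard-coding it.
    """
    return "Restarting" if failure_handler.can_restart(cause) else "Failing"
```

Passing the service in as a parameter keeps the state logic independent of any particular restart policy, so a different policy can be swapped in without touching the state itself.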

...

Supporting local failovers is another feature we want to add as a follow-up. With local failover, a recoverable local fault would no longer require restarting the whole job. One idea is to extend the existing state machine with a new state, "Restarting locally":



PlantUML

@startuml
hide empty description

[*] -> Created
Created --> Waiting : Start scheduling
state "Waiting for resources" as Waiting
state "Restarting globally" as RestartingG
state "Restarting locally" as RestartingL
Waiting --> Waiting : Resources are not stable yet
Waiting --> Executing : Resources are stable
Waiting --> Finished : Cancel, suspend or \nnot enough resources for executing
Executing --> Canceling : Cancel
Executing --> Failing : Unrecoverable fault
Executing --> Finished : Suspend or job reached terminal state
Executing --> RestartingG : Recoverable global fault
Executing --> RestartingL : Recoverable local fault
RestartingL --> Executing : Recovered locally
RestartingL --> RestartingL : Recoverable local fault
RestartingL --> RestartingG : Local recovery timeout
RestartingL --> Canceling : Cancel
RestartingL --> Finished : Suspend
RestartingL --> Failing : Unrecoverable fault
RestartingG --> Finished : Suspend
RestartingG --> Canceling : Cancel
RestartingG --> Waiting : Cancelation complete
Canceling --> Finished : Cancelation complete
Failing --> Finished : Failing complete
Finished -> [*]

@enduml
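
One way to sanity-check the extended diagram is to encode its edges as a graph and verify a few properties mechanically. The sketch below is illustrative only: the `EDGES` map is transcribed from the diagram above using its PlantUML aliases, while the helper `reachable` and its `avoid` parameter are hypothetical names introduced for this example.

```python
from collections import deque

# Successor states per state, transcribed from the diagram above.
EDGES = {
    "Created": {"Waiting"},
    "Waiting": {"Waiting", "Executing", "Finished"},
    "Executing": {"Canceling", "Failing", "Finished", "RestartingG", "RestartingL"},
    "RestartingL": {"Executing", "RestartingL", "RestartingG",
                    "Canceling", "Finished", "Failing"},
    "RestartingG": {"Finished", "Canceling", "Waiting"},
    "Canceling": {"Finished"},
    "Failing": {"Finished"},
    "Finished": set(),
}


def reachable(start, edges=EDGES, avoid=frozenset()):
    """Return all states reachable from start in one or more transitions,
    never entering any state listed in avoid (breadth-first search)."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for successor in edges[current]:
            if successor not in seen and successor not in avoid:
                seen.add(successor)
                queue.append(successor)
    return seen
```

With this encoding, `"Executing" in reachable("RestartingL", avoid={"Waiting", "RestartingG"})` holds, while `"Executing" in reachable("RestartingG", avoid={"Waiting"})` does not. In other words, a local restart can return to Executing without going back through "Waiting for resources", whereas a global restart cannot.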

...