Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. During the normal execution, Flink will record states of JM (ExecutionGraph, OperatorCoordinator, etc) to persistent storage so that we can recover based on these states after JM crash. We will introduce an event-based method to record the state of JM.
  2. During the JM crashes and restarts (generally HA will be responsible for restarting JM), the shuffle service and TM will retain the partitions related to the target job and try to continuously reconnect.
  3. After JM restarts, the connection with shuffle service and TM will be re-established. Then JM will recover the job progress based on the previous recorded states and partitions currently existing in the cluster, and restart the schedulingThen JM will recover the job progress based on the previous recorded states and partitions currently existing in the cluster, and restart the scheduling.

Record states during normal execution

...