Page History

...

Therefore, for checkpoints after some tasks are finished we need to trigger all the leaf new root tasks whose precedent tasks are all finished. Currently CheckpointCoordinator computes tasks to trigger in a separate timer thread, which runs concurrently with JobMaster’s main thread. The mechanism works well currently since it only needs to pick the current execution attempt for each execution vertex. However, to find all the leaf new root tasks needs to iterate over the whole graph and if JM changes tasks’ state during computation, it would be hard to guarantee the rightness of the algorithm, especially the JM might restarted the finished tasks on failover. To simplify the implementation we would proxy the computation of tasks to trigger to the main thread.

The basic algorithm to compute the tasks to trigger would be iterating over the ExecutionGraph to find the leaf running new root running tasks. However, the algorithm could be optimized by iterating over the JobGraph instead. The detailed algorithm is shown in Appendix.

...

	Aligned Checkpoint	Unaligned Checkpoint
Received Trigger + all EndOfPartition	Wait till processed all barriers, and then report the snapshot	Wait till processed all EndOfPartition, and then report the snapshot. In this case we would take the snapshot the same as the aligned method.
Do not receive Trigger + some barriers	Wait till received barrier or EndOfPartition from each channel, then report the snapshot.	When receiving the first barrier, start spilling the buffers before the barrier or EndOfParitition for each channel.

Tasks Waiting For Checkpoint Complete Before Finish

...

Page tree

Versions Compared

Old Version 24

New Version 25

Key

Tasks Waiting For Checkpoint Complete Before Finish