...

To find the new "root" of the current execution graph, we iterate over all the tasks to find those that are still running but have no running precedent tasks. A direct implementation would have O(n^2) time complexity, since it needs to check all the precedent tasks of each task. However, we could reduce the complexity to O(n) by exploiting the isomorphism of ALL_TO_ALL edges. The details are described in Appendix 1.
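The direct implementation described above can be sketched as follows. This is only an illustration, not the Flink implementation: the `RootTaskFinder` and `Task` classes, their fields and the `findTasksToTrigger` method are hypothetical stand-ins for the execution graph, and the O(n) optimization for ALL_TO_ALL edges from the Appendix is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RootTaskFinder {

    /** Hypothetical stand-in for one parallel subtask of the execution graph. */
    static final class Task {
        final String name;
        final boolean running;
        final List<Task> precedents = new ArrayList<>();

        Task(String name, boolean running) {
            this.name = name;
            this.running = running;
        }

        /** Registers the given upstream tasks as direct precedents of this task. */
        Task after(Task... upstream) {
            precedents.addAll(Arrays.asList(upstream));
            return this;
        }
    }

    /**
     * Direct approach: a task must be triggered if it is still running and
     * none of its direct precedent tasks is running. Every precedent of every
     * task may be inspected, hence the quadratic worst case.
     */
    static List<Task> findTasksToTrigger(List<Task> allTasks) {
        List<Task> roots = new ArrayList<>();
        for (Task task : allTasks) {
            if (!task.running) {
                continue; // finished tasks are never triggered
            }
            boolean hasRunningPrecedent = false;
            for (Task precedent : task.precedents) {
                if (precedent.running) {
                    hasRunningPrecedent = true;
                    break;
                }
            }
            if (!hasRunningPrecedent) {
                roots.add(task);
            }
        }
        return roots;
    }

    public static void main(String[] args) {
        Task source1 = new Task("source-1", false); // already finished
        Task source2 = new Task("source-2", true);
        Task map1 = new Task("map-1", true).after(source1);
        Task map2 = new Task("map-2", true).after(source2);
        Task sink = new Task("sink", true).after(map1, map2);

        List<Task> all = Arrays.asList(source1, source2, map1, map2, sink);
        for (Task root : findTasksToTrigger(all)) {
            System.out.println(root.name); // prints source-2, then map-1
        }
    }
}
```

In this example source-1 has finished, so map-1 becomes a new root alongside the still-running source-2, while map-2 and sink are reached through their running precedents.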

To avoid tasks finishing during the computation, the computation is done in the JobMaster's main thread. However, there might still be inconsistencies because:

  1. For tasks running on different TaskManagers, the order in which their FINISHED status reports arrive at the JobMaster is not guaranteed. That is to say, a task might report FINISHED after its descendant tasks do.
  2. Tasks might finish between the computation and the moment they are triggered.

In both cases, triggering the checkpoint would fail, and the checkpoint would then fail due to timeout. Since a checkpoint timeout would block the next checkpoint and cause failover by default, it would need to 


The basic algorithm to compute the tasks to trigger is to iterate over the ExecutionGraph to find the new root running tasks. However, the algorithm could be optimized by iterating over the JobGraph instead. The detailed algorithm is shown in the Appendix.

...