...

This is applicable only to bounded intermediate results (batch jobs): the consuming operator starts only after the entire bounded result has been produced, which bounds how far cancellations/restarts propagate downstream in batch jobs.
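A minimal sketch of this behavior, assuming the DataSet (batch) API: ExecutionMode.BATCH runs shuffle-style data exchanges as blocking, bounded intermediate results, so the consumer is deployed only once the complete result exists. The class name BlockingExchangeExample is illustrative.

    import org.apache.flink.api.common.ExecutionMode;
    import org.apache.flink.api.common.functions.ReduceFunction;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class BlockingExchangeExample {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // BATCH execution mode runs shuffle/broadcast exchanges as blocking,
            // bounded intermediate results: the consumer starts only after the
            // producer has written the entire result, so a downstream failure does
            // not force the producer to be restarted.
            env.getConfig().setExecutionMode(ExecutionMode.BATCH);

            env.fromElements(1, 2, 3, 4, 5)
               .reduce(new ReduceFunction<Integer>() {
                   @Override
                   public Integer reduce(Integer a, Integer b) {
                       return a + b;
                   }
               })
               .print(); // triggers execution and prints the sum
        }
    }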


Public Interfaces

This will affect the way failures are logged and displayed in the web frontend, since a failure no longer causes the job as a whole to go into recovery.

The number-of-restarts parameter of the RestartStrategy needs to be interpreted differently, for example as one of the following (see the sketch after this list):

  • maximum-per-task-failures or
  • maximum-total-task-failures
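As a non-authoritative illustration of the two interpretations, the sketch below uses the existing fixed-delay restart strategy; the names maximum-per-task-failures and maximum-total-task-failures are taken from the list above and are not existing configuration keys.

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RestartCountInterpretation {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Today "3" means: restart the whole job at most 3 times, 10 seconds apart.
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(10)));

            // With task-level recovery the same number could be read as either:
            //   - maximum-per-task-failures:   each task may fail/restart up to 3 times, or
            //   - maximum-total-task-failures: at most 3 task failures across the whole job.
            // The proposal leaves the exact semantics open; both names are illustrative.
        }
    }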


Compatibility, Deprecation, and Migration Plan

  • In the first version, the feature should be selectively activated (e.g. StreamExecutionEnvironment.setRecoveryMode(JOB_LEVEL | TASK_LEVEL)); see the sketch after this list.
  • Given the limited impact on user job configuration (and the fact that most users go for infinite restarts for streaming jobs), good documentation of the change should help.
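A rough usage sketch of the proposed opt-in, assuming the method name and mode values from the bullet above (setRecoveryMode, JOB_LEVEL, TASK_LEVEL); none of these exist in the current StreamExecutionEnvironment API, so the sketch uses a local stand-in method.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RecoveryModeSketch {

        // Hypothetical recovery modes, mirroring the proposal above.
        enum RecoveryMode { JOB_LEVEL, TASK_LEVEL }

        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Proposed opt-in: JOB_LEVEL keeps today's behavior (full job restart) as the
            // default; TASK_LEVEL enables the new fine-grained, per-task recovery.
            setRecoveryMode(env, RecoveryMode.TASK_LEVEL);
        }

        // Stand-in for the proposed StreamExecutionEnvironment.setRecoveryMode(...).
        static void setRecoveryMode(StreamExecutionEnvironment env, RecoveryMode mode) {
            // The real implementation would configure the job's failover strategy here.
            System.out.println("recovery mode: " + mode);
        }
    }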

 

Implementation Plan

Version two strictly builds upon version one; the only addition is that it takes the intermediate result types into account as backtracking barriers.

...

  1. Extend backtracking to stop at intermediate results that are available for the checkpoint to resume from.
  2. Implement “Caching Intermediate Result”
  3. Implement “Memory-only Caching Intermediate Result”
  4. Upon reaching a result that is not guaranteed to still be there (like the “Memory-only Caching Intermediate Result”), the ExecutionGraph sends a message to the TaskManager holding the result to “pin” it, so that it does not get released in the meantime.
    The response to the “pin” command is either “okay”, in which case the backtracking stops there, or “disposed”, in which case the backtracking continues (see the sketch after this list).
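A rough sketch of the pin handshake described in step 4; the class and message names (PinResultPartitionRequest, PinResponse) are hypothetical and only illustrate the intended protocol, not existing Flink classes.

    import java.io.Serializable;

    public final class PinProtocolSketch {

        // Request from the JobManager/ExecutionGraph to the TaskManager that holds
        // the memory-only cached intermediate result.
        public static final class PinResultPartitionRequest implements Serializable {
            public final String resultPartitionId; // hypothetical identifier
            public PinResultPartitionRequest(String resultPartitionId) {
                this.resultPartitionId = resultPartitionId;
            }
        }

        // Possible answers from the TaskManager.
        public enum PinResponse {
            OKAY,     // result is pinned and will not be released: stop backtracking here
            DISPOSED  // result is gone: continue backtracking towards the producer
        }

        // ExecutionGraph-side reaction to the response.
        static boolean stopBacktrackingAt(PinResponse response) {
            return response == PinResponse.OKAY;
        }
    }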

Rejected Alternatives

(none yet)