You are viewing an old version of this page. View the current version.

Compare with Current View Page History

Version 1 Next »

 

 

 

Logic for handling failures of Tasks, and how to make sure we properly attribute the failure to the correct root cause and suppress follow-up failures

 

Sender Failure

  • sender fails

  • partition becomes erroneous, with a SenderFailedException
  • netty channel needs to be closed 

  • receiver cancels itself when encountering the SenderFailedException
  • receiver may also be cancelled by the JobManager (if that call is faster than the detection of the failed sender) 

  • receiver may not be able to find the subpartition any more, when the sender has clean it away
    • receiver does not immediately fail or cancel
    • receiver requests status of sender from JobManager
    • If JobManager sees sender as failed/canceles, it responds with "cancelled, please cancel yourself"
    • If JobManager sees sender as running, it responds with "still running". In that case, the receiver retries the status pool with an exponential backoff (max 3 seconds) and fails if the JobManager never asked it to cancel.

Receiver Failure

  • receiver fails

  • sender keeps going. may be back-pressured when no receiver pulls the data any more
  • sender may be cancelled by JobManager

  • partition stays sane
  • netty channel needs to be closes, transfer canceled by a cancel message (receiver to sender)

Transport Failure

  • Attributed to the receiver

  • Receiver fails with an Exception

  • Subpartition on the sender side stays sane
  • Netty channel needs to be closed (as result of transport error) - data transfer aborted

 

  • No labels