Task Failures and Error Handling

What this is about: Logic for handling failures of tasks, and how to make sure we properly attribute the failure to the correct root cause and suppress follow-up failures.

We distinguish three types of failures:

Sender Failure
1. Sender fails
  1. Produced result partition becomes erroneous with a SenderFailedException (FLINK-1955)
  2. Receiver cancels itself when encountering the SenderFailedException
    - May also be cancelled by the JobManager (if that call is faster than the detection of the failed sender)
    - This closes the Netty channel
2. Receiver may not be able to find the subpartition any more, when the sender has cleaned it away (FLINK-1636)
  1. Receiver does not immediately fail or cancel
  2. Receiver requests status of sender from JobManager
  3. If JobManager sees sender as failed/canceled, it responds with "cancelled, please cancel yourself"
  4. If JobManager sees sender as running, it responds with "still running". In that case, the receiver retries the status pool with an exponential backoff (max 3 seconds) and fails if the JobManager never asked it to cancel
Receiver Failure: receiver fails (FLINK-1958)
1. Sender keeps going. May be back-pressured when no receiver pulls the data any more.
2. Sender may be cancelled by JobManager
3. Partition stays sane
4. Netty channel needs to be closed
5. Transfer needs to be canceled by a cancel message (receiver to sender)
Transport Failure (FLINK-1957)
1. Attributed to the receiver
2. Receiver fails with an Exception
3. Subpartition on the sender side stays sane
4. Netty channel needs to be closed (as result of transport error) and data transfer aborted

Page tree

Task Failures and Error Handling