THIS IS A TEST INSTANCE. ALL YOUR CHANGES WILL BE LOST!!!!
What this is about: Logic for handling failures of tasks, and how to make sure we properly attribute the failure to the correct root cause and suppress follow-up failures.
We distinguish three types of failures:
- Sender Failure
- Sender fails
- Produced result partition becomes erroneous with a SenderFailedException (FLINK-1955)
- Receiver cancels itself when encountering the SenderFailedException
- May also be cancelled by the JobManager (if that call is faster than the detection of the failed sender)
- This closes the Netty channel
- Receiver may not be able to find the subpartition any more, when the sender has cleaned it away (FLINK-1636)
- Receiver does not immediately fail or cancel
- Receiver requests status of sender from JobManager
- If JobManager sees sender as failed/canceled, it responds with "cancelled, please cancel yourself"
- If JobManager sees sender as running, it responds with "still running". In that case, the receiver retries the status pool with an exponential backoff (max 3 seconds) and fails if the JobManager never asked it to cancel
- Sender fails
- Receiver Failure: receiver fails (FLINK-1958)
- Sender keeps going. May be back-pressured when no receiver pulls the data any more.
- Sender may be cancelled by JobManager
- Partition stays sane
- Netty channel needs to be closed
- Transfer needs to be canceled by a cancel message (receiver to sender)
- Transport Failure (FLINK-1957)
- Attributed to the receiver
- Receiver fails with an Exception
- Subpartition on the sender side stays sane
- Netty channel needs to be closed (as result of transport error) and data transfer aborted