THIS IS A TEST INSTANCE. ALL YOUR CHANGES WILL BE LOST!!!!
What this is about: Logic for handling failures of
...
tasks, and how to make sure we properly attribute the failure to the correct root cause and suppress follow-up failures
...
.
We distinguish three types of failures:
- Sender Failure
...
- Sender fails
- Produced result partition becomes erroneous
- Sender fails
...
- with a SenderFailedException
...
- Receiver cancels itself when encountering the SenderFailedException
- Receiver cancels itself when encountering the SenderFailedException
...
- May also be cancelled by the JobManager (if that call is faster than the detection of the failed sender)
...
- This closes the Netty channel
- Receiver may not be able to find the subpartition any more, when the sender has
...
- cleaned it away
- cleaned it away
...
- Receiver does not immediately fail or cancel
...
- Receiver requests status of sender from JobManager
- If JobManager sees sender as failed/
...
- canceled, it responds with "cancelled, please cancel yourself"
- If JobManager sees sender as running, it responds with "still running". In that case, the receiver retries the status pool with an exponential backoff (max 3 seconds) and fails if the JobManager never asked it to cancel
...
- Receiver Failure
...
- : receiver fails
...
- Sender keeps going.
...
- May be back-pressured when no receiver pulls the data any more.
...
- Sender may be cancelled by JobManager
...
- Partition stays sane
...
- Netty channel needs to be closed
- Transfer needs to be
...
- canceled by a cancel message (receiver to sender)
- Transport Failure
- Attributed to the receiver
- Receiver fails with an Exception
- Subpartition on the sender side stays sane
- Netty channel needs to be closed (as result of transport error)
...
- and data transfer aborted