Logic for handling Task failures: how to attribute each failure to its correct root cause and suppress follow-up failures

Sender Failure

sender fails
partition becomes erroneous, with a SenderFailedException
netty channel needs to be closed
receiver cancels itself when encountering the SenderFailedException
receiver may also be cancelled by the JobManager (if that call is faster than the detection of the failed sender)
receiver may not be able to find the subpartition any more, when the sender has already cleaned it up
- receiver does not immediately fail or cancel
- receiver requests status of sender from JobManager
- If JobManager sees sender as failed/cancelled, it responds with "cancelled, please cancel yourself"
- If JobManager sees sender as running, it responds with "still running". In that case, the receiver retries the status poll with an exponential backoff (max 3 seconds) and fails itself if the JobManager never tells it to cancel.
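
The status poll above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: SenderStatusPoller, SenderStatus, Decision, the attempt limit, and the injected sleeper are all hypothetical names introduced here. It also assumes that "max 3 seconds" is a cap on the individual backoff delay; the notes do not say whether it is that or a bound on total waiting time.

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

/** Hypothetical sketch of the receiver-side status poll; all names are assumptions. */
class SenderStatusPoller {

    /** What the JobManager reports about the sender. */
    enum SenderStatus { RUNNING, CANCELLED_OR_FAILED }

    /** What the receiver does once polling ends. */
    enum Decision { CANCEL_SELF, FAIL_SELF }

    /** Assumed reading of "max 3 seconds": cap on a single backoff delay. */
    static final long MAX_BACKOFF_MILLIS = 3_000;

    /** Doubles the delay, never exceeding the cap. */
    static long nextBackoffMillis(long currentMillis, long maxMillis) {
        return Math.min(currentMillis * 2, maxMillis);
    }

    /**
     * Polls the JobManager until it reports the sender as cancelled/failed,
     * or until maxAttempts polls have all answered "still running".
     *
     * @param jobManager   stand-in for the status request to the JobManager
     * @param sleeper      waits for the given number of milliseconds
     * @param initialDelay first backoff delay in milliseconds (assumed parameter)
     * @param maxAttempts  number of polls before giving up (assumed parameter)
     */
    static Decision poll(Supplier<SenderStatus> jobManager,
                         LongConsumer sleeper,
                         long initialDelay,
                         int maxAttempts) {
        long backoff = initialDelay;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (jobManager.get() == SenderStatus.CANCELLED_OR_FAILED) {
                // "cancelled, please cancel yourself"
                return Decision.CANCEL_SELF;
            }
            // "still running": wait, then retry with a longer delay
            sleeper.accept(backoff);
            backoff = nextBackoffMillis(backoff, MAX_BACKOFF_MILLIS);
        }
        // The JobManager never asked the receiver to cancel: fail the receiver
        return Decision.FAIL_SELF;
    }
}
```

The sleeper is injected so the backoff sequence can be checked without real waiting; a real caller would pass something that blocks for the given duration (for example a wrapper around Thread.sleep) and a real status request instead of the Supplier.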

receiver fails
sender keeps going; it may be back-pressured once no receiver pulls the data any more
sender may be cancelled by JobManager
partition stays sane
netty channel needs to be closed; transfer is cancelled by a cancel message (receiver to sender)

Attributed to the receiver
Receiver fails with an Exception
Subpartition on the sender side stays sane
Netty channel needs to be closed (as a result of the transport error); data transfer is aborted