Page history: Version 1 (changes made by Stephan Ewen, saved on Apr 28, 2015) compared with Version 2 (changes made by Ufuk Celebi, saved on Apr 28, 2015).

What this is about: Logic for handling failures of tasks, and how to make sure we properly attribute the failure to the correct root cause and suppress follow-up failures.

We distinguish three types of failures:

Sender Failure

1. Sender fails
  1. Produced result partition becomes erroneous with a SenderFailedException
  2. Receiver cancels itself when encountering the SenderFailedException
    - May also be cancelled by the JobManager (if that call is faster than the detection of the failed sender)
    - This closes the Netty channel
2. Receiver may not be able to find the subpartition any more, when the sender has cleaned it away
  1. Receiver does not immediately fail or cancel
  2. Receiver requests status of sender from JobManager
    1. If JobManager sees sender as failed/cancelled, it responds with "cancelled, please cancel yourself"
    2. If JobManager sees sender as running, it responds with "still running". In that case, the receiver retries the status poll with an exponential backoff (max 3 seconds) and fails if the JobManager never asks it to cancel
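The receiver's status polling described above can be sketched roughly as follows. This is a minimal illustration, not Flink's implementation: the names `SenderStatus`, `ReceiverOutcome`, `backoff_delays`, and `poll_sender_status` are hypothetical, as are the initial delay, the doubling factor, and the attempt limit. The sketch also assumes that "max 3 seconds" caps each individual wait, which the text does not state explicitly.

```python
from enum import Enum
from typing import Callable, List


class SenderStatus(Enum):
    """Hypothetical JobManager answers, named after the two responses above."""
    CANCELLED = "cancelled, please cancel yourself"
    RUNNING = "still running"


class ReceiverOutcome(Enum):
    CANCEL = "cancel"  # JobManager asked the receiver to cancel itself
    FAIL = "fail"      # JobManager never asked, so the receiver fails


def backoff_delays(initial: float, cap: float, attempts: int) -> List[float]:
    """Delays that double on every retry and never exceed `cap` seconds."""
    return [min(initial * (2 ** i), cap) for i in range(attempts)]


def poll_sender_status(
    request_status: Callable[[], SenderStatus],
    sleep: Callable[[float], None],
    initial: float = 0.25,  # assumed first delay, not from the source
    cap: float = 3.0,       # assumed reading of "max 3 seconds"
    attempts: int = 6,      # assumed retry limit, not from the source
) -> ReceiverOutcome:
    """Ask the JobManager about the sender until it tells us to cancel."""
    for delay in backoff_delays(initial, cap, attempts):
        if request_status() is SenderStatus.CANCELLED:
            return ReceiverOutcome.CANCEL
        # Sender still looks alive to the JobManager: wait, then ask again.
        sleep(delay)
    # Retries exhausted without a cancel request: fail instead of cancelling.
    return ReceiverOutcome.FAIL
```

In real use, `sleep` would be something like `time.sleep` and `request_status` a call to the JobManager. Passing both in as callables keeps the retry logic testable without a network or real waiting.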

Receiver Failure

1. Receiver fails
  1. Sender keeps going
  2. Sender may be back-pressured when no receiver pulls the data any more
  3. Sender may be cancelled by JobManager
  4. Partition stays sane
  5. Netty channel needs to be closed
  6. Transfer needs to be cancelled by a cancel message (receiver to sender)

Transport Failure

1. Attributed to the receiver
2. Receiver fails with an Exception
3. Subpartition on the sender side stays sane
4. Netty channel needs to be closed (as result of transport error) and data transfer aborted
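The three cases can be summarized as a small lookup table. This is one reading of the lists above, not code from the project: the names `FailureType`, `Handling`, `HANDLING`, and `root_cause` are hypothetical. The `attributed_to` values for sender and receiver failures are inferred from which task fails in each case. Only the transport case states its attribution explicitly.

```python
from enum import Enum
from typing import Dict, NamedTuple


class FailureType(Enum):
    SENDER = "sender failure"
    RECEIVER = "receiver failure"
    TRANSPORT = "transport failure"


class Handling(NamedTuple):
    attributed_to: str          # task treated as the root cause
    receiver_action: str        # what the receiver does in this case
    partition_stays_sane: bool  # whether the sender-side (sub)partition stays usable
    close_netty_channel: bool   # whether the Netty channel ends up closed


# Assumed summary of the three cases; values paraphrase the lists above.
HANDLING: Dict[FailureType, Handling] = {
    FailureType.SENDER: Handling(
        attributed_to="sender",
        receiver_action="cancels itself",
        partition_stays_sane=False,  # partition becomes erroneous
        close_netty_channel=True,
    ),
    FailureType.RECEIVER: Handling(
        attributed_to="receiver",
        receiver_action="fails",
        partition_stays_sane=True,
        close_netty_channel=True,
    ),
    FailureType.TRANSPORT: Handling(
        attributed_to="receiver",
        receiver_action="fails with an exception",
        partition_stays_sane=True,
        close_netty_channel=True,
    ),
}


def root_cause(failure: FailureType) -> str:
    """Return the task that the failure is attributed to."""
    return HANDLING[failure].attributed_to
```

Keeping the attribution in one table makes the asymmetry visible: only a sender failure invalidates the partition, and in that case the receiver cancels rather than fails, so the follow-up is not reported as a second failure.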