Status
Current state: "Under Discussion"
Discussion thread: TODO
JIRA:
Released: target 2.6
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Today, Kafka Streams relies mainly on its internal clients (consumer/producer/admin) to handle timeout exceptions and retries (the "global thread" is the only exception). However, this approach has many disadvantages. (1) It's harder for users to configure and reason about the behavior and (2) if a client retries internally, all other tasks of the same StreamThread
are blocked.
Thus, we propose to catch all TimeoutException
s in Kafka Streams and handle them more gracefully and more robust.
Public Interfaces
There is no public interface change, because Kafka Streams already has a retires
configuration parameter. However, retries
is currently only used by the "global thread". This KIP proposes to reuse the existing config to also handle and retry timeout exceptions, hence, it is a semantic change.
Furthermore, we propose to change the default value of retries
from currently 1 to 5.
Proposed Changes
We propose to catch all TimeoutException
in Kafka Streams instead of treating them as fatal, and thus to not rely on the consumer/producer/admin client to handle all such errors. To make sure that timeout issues can be reported eventually, we use the existing retries
config to allow user to stop processing at some point if timeout errors persist and no progress can be made.
In particular to propose to apply the retries
config on a per-task level, i.e., each task gets its own retries counter. Each time a TimeoutException
occurs for a task, the counter is increased, and we move to the next task for processing. The task would be retried in the next processing loop. If the counter of a task exceed the number of configured retries, we stop processing all tasks by throwing an exception.
Compatibility, Deprecation, and Migration Plan
- TODO
Test Plan
TODO
Rejected Alternatives
TODO