You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 2 Next »

Status

Current state"Under Discussion"

Discussion thread: TODO

JIRA Unable to render Jira issues macro, execution error.

Released: target 2.6

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Today, Kafka Streams relies mainly on its internal clients (consumer/producer/admin) to handle timeout exceptions and retries (the "global thread" is the only exception). However, this approach has many disadvantages. (1) It's harder for users to configure and reason about the behavior and (2) if a client retries internally, all other tasks of the same StreamThread  are blocked.

Thus, we propose to catch all TimeoutExceptions in Kafka Streams and handle them more gracefully and more robust.

Public Interfaces

There is no public interface change, because Kafka Streams already has a retires configuration parameter. However, retries is currently only used by the "global thread". This KIP proposes to reuse the existing config to also handle and retry timeout exceptions, hence, it is a semantic change.

Furthermore, we propose to change the default value of retries from currently 1 to 5.

Proposed Changes

We propose to catch all TimeoutException in Kafka Streams instead of treating them as fatal, and thus to not rely on the consumer/producer/admin client to handle all such errors. To make sure that timeout issues can be reported eventually, we use the existing retries config to allow user to stop processing at some point if timeout errors persist and no progress can be made.

In particular to propose to apply the retries config on a per-task level, i.e., each task gets its own retries counter. Each time a TimeoutException occurs for a task, the counter is increased, and we move to the next task for processing. The task would be retried in the next processing loop. If the counter of a task exceed the number of configured retries, we stop processing all tasks by throwing an exception.

Compatibility, Deprecation, and Migration Plan

  • TODO

Test Plan

TODO

Rejected Alternatives

TODO

  • No labels