...

To make Kafka Streams more robust, we propose to catch all client TimeoutExceptions in Kafka Streams and handle them more gracefully. Furthermore, reasoning about time is simpler for users than reasoning about the number of retries. Hence, we propose to base all configs on timeouts and to deprecate all configs that rely on a number of retries; this includes the producer and admin client retries configuration parameter for Kafka Streams.

Public Interfaces

We propose to deprecate the retries configuration parameter for Kafka Streams. Furthermore, we introduce task.timeout.ms as an upper bound for any task to make progress, with a default of 5 minutes. If a task hits a client TimeoutException, the task is skipped and the next task is processed.

The existing retry.backoff.ms config is used as backoff time (default value 100 ms) if a tight retry loop is required. We rely on the clients' internal retry/backoff mechanism to avoid busy waiting (cf. KIP-580: Exponential Backoff for Kafka Clients).
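Spelled out as a Streams properties file, the two configs above could look as follows (a sketch; task.timeout.ms is the new config proposed by this KIP, and the values shown are the stated defaults):

```properties
# New config proposed by this KIP: upper bound for a single task to make
# progress before Kafka Streams gives up; default is 5 minutes.
task.timeout.ms=300000

# Existing client config, re-used as the backoff between attempts when a
# tight retry loop is required; default is 100 ms.
retry.backoff.ms=100
```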

Proposed Changes

Producer and admin client already use a default retries config value of Integer.MAX_VALUE and rely on timeouts by default (cf. KIP-91 and KIP-533). For Kafka Streams, the deprecated retries config would be ignored (we only keep it to not break code that might set it) and a warning would be logged if it is used. The current default retries value in Kafka Streams is 0, and we want to have a more robust default configuration. Note that the default retries value of 0 does not apply to the embedded producer or admin client: only if the user explicitly sets retries would the embedded producer and admin client configs be changed (this KIP does not change this behavior).

...

Last, the admin client is used within the group leader to collect topic metadata and to create internal topics if necessary. If those calls fail, they are retried within Kafka Streams, re-using the admin client's retries config. Because admin retries will be deprecated, we should not re-use it any longer for this purpose. The current retry loop spans multiple admin client calls that are issued interleaved; this interleaved retry logic should be preserved. However, we should not retry infinitely (and also not allow users to specify how long to retry), to avoid the leader being stuck forever (even though it would be removed from the group by the group coordinator anyway after a timeout that is set to max.poll.interval.ms). To avoid dropping out of the consumer group, the retry loop should be stopped before we hit that timeout. We propose to use a 50% threshold, i.e., half of max.poll.interval.ms.
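The 50% threshold can be sketched as follows. This is a hypothetical illustration, not the actual Kafka Streams implementation; the method names are made up for this sketch:

```java
public class LeaderRetryDeadline {

    // Hypothetical sketch: bound the group leader's interleaved
    // admin-client retry loop to 50% of max.poll.interval.ms, so the
    // leader stops retrying well before the group coordinator could
    // remove it from the group.
    static long retryDeadlineMs(final long startMs, final long maxPollIntervalMs) {
        return startMs + maxPollIntervalMs / 2;
    }

    static boolean shouldRetry(final long nowMs, final long deadlineMs) {
        return nowMs < deadlineMs;
    }

    public static void main(final String[] args) {
        final long maxPollIntervalMs = 300_000L; // consumer default: 5 minutes
        final long deadline = retryDeadlineMs(0L, maxPollIntervalMs);
        System.out.println(deadline);                        // prints 150000
        System.out.println(shouldRetry(149_999L, deadline)); // prints true
        System.out.println(shouldRetry(150_000L, deadline)); // prints false
    }
}
```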

Compatibility, Deprecation, and Migration Plan

...

Kafka Streams will ignore the retries config; however, the new default is more robust and thus no backward compatibility concern arises. If users really want the old non-robust fail-fast behavior, they can set task.timeout.ms=0.
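For example, assuming the config name introduced above, the old fail-fast behavior could be restored via:

```properties
# Disable the new retry behavior: a client TimeoutException fails the task immediately.
task.timeout.ms=0
```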

Test Plan

Regular unit and integration tests are sufficient. Existing system tests should provide good coverage implicitly.

...

  • Reuse the existing retries config and handle client TimeoutException based on it. Rejected because reasoning about time is easier for users, and other clients have already started to move away from count-based retries.
  • A task could be retried immediately if a client TimeoutException occurs, instead of skipping it. However, this would result in a "busy wait" pattern, and other tasks could not make progress until the "failing" task makes progress again or eventually times out.
  • It would be possible to apply retries at a per-method level (i.e., for each client method that is called, an individual retry counter is maintained). This proposal is rejected because it seems to be too fine grained and hard for users to reason about.
  • It would be possible to apply a thread.timeout.ms at the thread level instead of task.timeout.ms at the task level: whenever the thread does not make any progress on any task within the timeout (i.e., all tasks throw a timeout exception within one task-processing loop), the thread would fail. This proposal is rejected as too coarse grained. In particular, a single task could get stuck while other tasks make progress, and this case would not be detected.
  • To distinguish between retries within Kafka Streams and client retries (in particular the producer's retries config for send), we could add a new config (e.g., `task.retries`). However, keeping the number of configs small is desirable, and the gain of the new config seems limited.
  • To avoid that users need to consider setting producer.retries and admin.retries explicitly, we could change the behavior of Kafka Streams and use retries exclusively for Streams-level retries. In this case, setting retries would not affect the producer or admin client, and both client retries could only be changed via their corresponding client-prefix configs. This would be a backward incompatible change.
  • We considered deprecating the retries configuration parameter also for the producer and admin client. However, there are some use cases that need to disable retries altogether, which is not possible by setting producer/admin client timeouts to zero (in contrast to the new task.timeout.ms=0 setting that disables retrying).