...

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

One of the most common problems we have in Kafka is with respect to metadata fetches. For example, if there is a broker failure, all clients start to fetch metadata at the same time, and it often takes a while for the metadata to converge. In a high-load cluster, the sheer volume of metadata compounds this problem and makes convergence even slower. In all of the current Kafka clients, we have retry mechanisms with a fixed backoff to handle transient connection issues with the brokers. With such a small static backoff (the default is 100 ms), a client can send tens of requests per second to a broker, and if the connection issue is prolonged, this puts quite a bit of pressure on the brokers. To reduce this pressure, it would be useful to support an exponentially increasing backoff policy for all Kafka clients, similar to the configuration introduced in KIP-144 for exponentially increasing backoffs for broker reconnect attempts.

Public Interfaces

We can introduce a new common client configuration called retry.backoff.max.ms, defined as:

The maximum amount of time in milliseconds to wait when retrying a request to the broker that has repeatedly failed. If provided, the backoff per client will increase exponentially for each failed request, up to this maximum. To prevent all clients from being synchronized upon retry, a randomization factor of 0.2 will be applied to the backoff, resulting in a random range between 20% below and 20% above the computed value.

This config would default to 1000 ms if retry.backoff.ms is not set explicitly; otherwise, it would default to the same value as retry.backoff.ms. The backoff is calculated with the following formula (retry.backoff.ms defaults to 100 ms):

Code Block
(retry.backoff.ms * 2**(failures - 1)) * random(0.8, 1.2)

"random" in this case is the random function that will randomly factor in a "jitter" that is 20% higher or lower to the computed value. This will keep increasing until it hits the retry.backoff.max.ms value.

...

Ignoring the randomized jitter for the sake of the example below, we can compare exponential backoff to the current static backoff. With the current default retry backoff of 100 ms, a client would make 10 retries in 1 second. With exponential backoff, it would make only 4 retries in 1 second, fewer than half as many in the same timeframe, which lessens the pressure on the brokers. The calculation is as follows:

Try 1 at time 0 ms, failures = 0, next retry in 100 ms (retry.backoff.ms defaults to 100 ms)
Try 2 at time 100 ms, failures = 1, next retry in 200 ms
Try 3 at time 300 ms, failures = 2, next retry in 400 ms
Try 4 at time 700 ms, failures = 3, next retry in 800 ms
Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms (capped at the retry.backoff.max.ms default of 1000 ms)
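
The schedule above can be reproduced with a small standalone sketch (jitter omitted, and assuming the defaults of retry.backoff.ms = 100 ms and retry.backoff.max.ms = 1000 ms):

Code Block
public class BackoffScheduleSketch {
    public static void main(String[] args) {
        long retryBackoffMs = 100;      // retry.backoff.ms default
        long retryBackoffMaxMs = 1000;  // retry.backoff.max.ms default
        long time = 0;
        for (int attempt = 1; attempt <= 5; attempt++) {
            // After this attempt fails, the failure count equals the attempt number.
            int failures = attempt;
            long next = Math.min((long) (retryBackoffMs * Math.pow(2, failures - 1)),
                                 retryBackoffMaxMs);
            System.out.printf("Try %d at time %d ms, next retry in %d ms%n", attempt, time, next);
            time += next;
        }
    }
}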

Proposed Changes

Since several components make use of the retry.backoff.ms configuration, each module that uses it will have to be changed. That said, all components will share similar logic for dynamically updating the retry backoff, adapted to fit the component it lives in.
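
As a rough illustration of what that shared logic could look like, the following sketch shows a per-client helper that tracks consecutive failures, applies the exponential backoff with jitter, and resets on success (the class and method names here are hypothetical, not the final API):

Code Block
import java.util.concurrent.ThreadLocalRandom;

// Illustrative names only; not the final Kafka API.
public class RetryBackoff {
    private final long initialMs; // retry.backoff.ms
    private final long maxMs;     // retry.backoff.max.ms
    private int failures = 0;

    public RetryBackoff(long initialMs, long maxMs) {
        this.initialMs = initialMs;
        this.maxMs = maxMs;
    }

    // Called after a failed request; returns how long to wait before the next retry.
    public long nextBackoffMs() {
        failures++;
        double exp = initialMs * Math.pow(2, failures - 1);
        double jittered = exp * ThreadLocalRandom.current().nextDouble(0.8, 1.2);
        return (long) Math.min(jittered, maxMs);
    }

    // Called after a successful request so the backoff starts over.
    public void reset() {
        failures = 0;
    }
}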

...