Status
Current state: "Under Discussion"
Discussion thread: here
JIRA: https://issues.apache.org/jira/browse/KAFKA-9678
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
In all of the current Kafka clients, we have retry mechanisms with a fixed backoff to handle transient connection issues with the brokers. However, with a small backoff (currently defaulting to 100 ms), we could send tens of requests per second to the broker, and if the connection issue is prolonged, this could put significant pressure on the brokers.
To reduce this pressure, it would be useful to support an exponentially increasing backoff policy for all Kafka clients, similar to the configuration introduced in KIP-144 for exponentially increasing backoffs for broker reconnect attempts.
Public Interfaces
Introduce a new common client configuration called retry.backoff.max.ms that is defined as:

The maximum amount of time in milliseconds to wait when retrying a request to the broker that has repeatedly failed. If provided, the backoff per client will increase exponentially for each failed request, up to this maximum.

This config would default to 1000 ms if retry.backoff.ms is not set explicitly. Otherwise, it will default to the same value as retry.backoff.ms. The formula to calculate the backoff is as follows (where retry.backoff.ms starts at its default of 100 ms):

retry.backoff.ms * 2**(failures - 1)

This value keeps increasing until it reaches the retry.backoff.max.ms cap.
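To make the formula concrete, here is a small sketch of the capped exponential backoff calculation. This is illustrative only, not actual Kafka client code; the class and method names are hypothetical, and the defaults of retry.backoff.ms = 100 and retry.backoff.max.ms = 1000 are assumed.

```java
// Hypothetical sketch of the proposed backoff calculation (not actual Kafka code).
public final class ExponentialBackoff {
    private final long retryBackoffMs;     // retry.backoff.ms, default 100
    private final long retryBackoffMaxMs;  // retry.backoff.max.ms, default 1000

    public ExponentialBackoff(long retryBackoffMs, long retryBackoffMaxMs) {
        this.retryBackoffMs = retryBackoffMs;
        this.retryBackoffMaxMs = retryBackoffMaxMs;
    }

    // Backoff before the next attempt: retry.backoff.ms * 2^(failures - 1),
    // capped at retry.backoff.max.ms.
    public long backoffMs(int failures) {
        if (failures <= 0) return 0;
        double backoff = retryBackoffMs * Math.pow(2, failures - 1);
        return (long) Math.min(backoff, retryBackoffMaxMs);
    }

    public static void main(String[] args) {
        ExponentialBackoff b = new ExponentialBackoff(100, 1000);
        for (int f = 1; f <= 6; f++)
            System.out.println("failures=" + f + " -> backoff=" + b.backoffMs(f) + " ms");
    }
}
```

With the defaults, successive failures would back off 100, 200, 400, 800 ms, then stay at the 1000 ms cap.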
Proposed Changes
Since several components make use of the retry.backoff.ms configuration, each module that uses it will have to be changed. Across all components, however, the logic for dynamically updating the retry backoff will be similar, adapted to fit the component it lives in.
Admin Client
In KafkaAdminClient, we will have to modify the way the retry backoff is calculated for the calls that have failed and need to be retried. In place of the current static retry backoff, we have to introduce a mechanism so that, for every failed call, the next retry time is calculated dynamically.
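One possible shape for this per-call mechanism is sketched below. This is an assumption about the design, not the actual KafkaAdminClient internals: each pending call tracks its own failure count, and the next allowed retry time is derived from it instead of a fixed retry.backoff.ms (values 100 ms / 1000 ms assumed as the defaults).

```java
// Hypothetical sketch of per-call retry scheduling (not actual KafkaAdminClient code).
class Call {
    int tries = 0;
    long nextAllowedTryMs = 0;

    // Assumed config values: retry.backoff.ms = 100, retry.backoff.max.ms = 1000.
    void fail(long nowMs) {
        tries++;
        long backoff = Math.min(100L * (1L << (tries - 1)), 1000L);
        nextAllowedTryMs = nowMs + backoff;
    }
}

public class AdminRetrySketch {
    public static void main(String[] args) {
        Call call = new Call();
        long now = 0;
        for (int i = 0; i < 5; i++) {
            call.fail(now);
            System.out.println("try " + call.tries + " -> next attempt at "
                    + call.nextAllowedTryMs + " ms");
            now = call.nextAllowedTryMs; // pretend the retry happens at the earliest allowed time
        }
    }
}
```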
Consumer
In KafkaConsumer, we have to change the retry backoff logic in the Fetcher, ConsumerNetworkClient, and Metadata. Since ConsumerNetworkClient and Metadata are also used by other clients, they would have to house their own retry backoff logic. The Fetcher, however, could query a dynamically updating retryBackOffMs from KafkaConsumer.
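One way the Fetcher could query a dynamically updating backoff is for the consumer to track the failure count and expose the current value through a supplier rather than a fixed long. This is a sketch under assumed names; the actual refactoring may differ.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongSupplier;

// Hypothetical sketch: the consumer tracks failures and exposes the current
// backoff, so a component like the Fetcher always sees the up-to-date value
// without owning the backoff logic itself.
public class DynamicBackoffSketch {
    private final AtomicInteger failures = new AtomicInteger();
    private final long retryBackoffMs = 100;     // retry.backoff.ms (assumed default)
    private final long retryBackoffMaxMs = 1000; // retry.backoff.max.ms (assumed default)

    long currentBackoffMs() {
        int f = failures.get();
        if (f == 0) return 0;
        return Math.min(retryBackoffMs * (1L << (f - 1)), retryBackoffMaxMs);
    }

    void onFailure() { failures.incrementAndGet(); }
    void onSuccess() { failures.set(0); }

    public static void main(String[] args) {
        DynamicBackoffSketch consumer = new DynamicBackoffSketch();
        LongSupplier backoffForFetcher = consumer::currentBackoffMs; // what a Fetcher-like component would hold
        consumer.onFailure();
        consumer.onFailure();
        System.out.println("backoff after 2 failures: " + backoffForFetcher.getAsLong() + " ms");
        consumer.onSuccess();
        System.out.println("backoff after success: " + backoffForFetcher.getAsLong() + " ms");
    }
}
```

Resetting the count on success means a single transient failure never escalates the backoff for subsequent, unrelated requests.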
Producer
For the KafkaProducer, we have to change the retry backoff logic in ConsoleProducer, RecordAccumulator, Sender, TransactionManager, and Metadata. As mentioned above, Metadata is used by other clients, so it would have its own retry backoff logic. The rest of the classes, as described in the "Consumer" section above, can query a dynamically updating retryBackOffMs from KafkaProducer.
Broker API Versions Command
Changes made for AdminClient and ConsumerNetworkClient would apply here. The main change is for BrokerApiVersionsCommand to set the appropriate arguments for AdminClient and ConsumerNetworkClient once the changes to those are made.
Compatibility, Deprecation, and Migration Plan
For users who have not set retry.backoff.ms explicitly, the default behavior will change so that the backoff will grow up to 1000 ms. For users who have set retry.backoff.ms explicitly, the behavior will remain the same, as they could have specific requirements.
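To make the two cases concrete, here is an illustrative client configuration fragment (the 500 ms value is an arbitrary example):

```properties
# Case 1: retry.backoff.ms not set.
# retry.backoff.ms stays at its 100 ms default, and retry.backoff.max.ms
# defaults to 1000 ms, so the backoff grows from 100 ms up to 1 s.

# Case 2: retry.backoff.ms set explicitly.
retry.backoff.ms=500
# retry.backoff.max.ms defaults to the same 500 ms, so the backoff stays
# fixed and existing behavior is preserved unless the user also raises
# retry.backoff.max.ms.
```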
Rejected Alternatives
- Default retry.backoff.max.ms to the same value as retry.backoff.ms so that existing behavior is always maintained: rejected for reasons explained in the compatibility section.
- Default retry.backoff.max.ms to 1000 ms unconditionally: rejected for reasons explained in the compatibility section.