Status
Current state: "Under Discussion"
Discussion thread: here
JIRA: https://issues.apache.org/jira/browse/KAFKA-9678
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
In all of the current Kafka clients, we have retry mechanisms with a fixed backoff to handle transient connection issues with the brokers. However, with a small backoff (currently defaulting to 100 ms), we could send tens of requests per second to the broker, and if the connection issue is prolonged, this could put significant pressure on the brokers.
To reduce this pressure, it would be useful to support an exponentially increasing backoff policy for all Kafka clients, similar to the configuration introduced in KIP-144 for exponentially increasing backoffs for broker reconnect attempts.
Public Interfaces
Introduce a new common client configuration called retry.backoff.max.ms that is defined as:

The maximum amount of time in milliseconds to wait when retrying a request to the broker that has repeatedly failed. If provided, the backoff per client will increase exponentially for each failed request, up to this maximum.

This config would default to 1000 ms if retry.backoff.ms is not set explicitly. Otherwise, it will default to the same value as retry.backoff.ms. The formula to calculate the backoff is as follows (where retry.backoff.ms starts at its default of 100 ms):

retry.backoff.ms * 2**(failures - 1)

This value keeps increasing until it reaches the retry.backoff.max.ms cap.
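To make the formula concrete, here is a small sketch of the capped exponential backoff calculation. This is illustrative only, not actual Kafka client code; the class and method names are hypothetical, and the defaults of retry.backoff.ms = 100 and retry.backoff.max.ms = 1000 are assumed.

```java
// Hypothetical sketch of the proposed backoff calculation (not actual Kafka code).
public final class ExponentialBackoff {
    private final long retryBackoffMs;     // retry.backoff.ms, default 100
    private final long retryBackoffMaxMs;  // retry.backoff.max.ms, default 1000

    public ExponentialBackoff(long retryBackoffMs, long retryBackoffMaxMs) {
        this.retryBackoffMs = retryBackoffMs;
        this.retryBackoffMaxMs = retryBackoffMaxMs;
    }

    // Backoff before the next attempt: retry.backoff.ms * 2^(failures - 1),
    // capped at retry.backoff.max.ms.
    public long backoffMs(int failures) {
        if (failures <= 0) return 0;
        double backoff = retryBackoffMs * Math.pow(2, failures - 1);
        return (long) Math.min(backoff, retryBackoffMaxMs);
    }

    public static void main(String[] args) {
        ExponentialBackoff b = new ExponentialBackoff(100, 1000);
        for (int f = 1; f <= 6; f++)
            System.out.println("failures=" + f + " -> backoff=" + b.backoffMs(f) + " ms");
    }
}
```

With the defaults, successive failures would back off 100, 200, 400, 800 ms, then stay at the 1000 ms cap.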
Proposed Changes
Since several components make use of the retry.backoff.ms configuration, each module that uses it will have to be changed. Across all components, however, the logic for dynamically updating the retry backoff will be similar, adapted to fit the component it lives in.
Admin Client
In KafkaAdminClient, we will have to modify the way the retry backoff is calculated for the calls that have failed and need to be retried. In place of the current static retry backoff, we have to introduce a mechanism so that, for every failed call, the next retry time is calculated dynamically.
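One possible shape for this per-call mechanism is sketched below. This is an assumption about the design, not the actual KafkaAdminClient internals: each pending call tracks its own failure count, and the next allowed retry time is derived from it instead of a fixed retry.backoff.ms (values 100 ms / 1000 ms assumed as the defaults).

```java
// Hypothetical sketch of per-call retry scheduling (not actual KafkaAdminClient code).
class Call {
    int tries = 0;
    long nextAllowedTryMs = 0;

    // Assumed config values: retry.backoff.ms = 100, retry.backoff.max.ms = 1000.
    void fail(long nowMs) {
        tries++;
        long backoff = Math.min(100L * (1L << (tries - 1)), 1000L);
        nextAllowedTryMs = nowMs + backoff;
    }
}

public class AdminRetrySketch {
    public static void main(String[] args) {
        Call call = new Call();
        long now = 0;
        for (int i = 0; i < 5; i++) {
            call.fail(now);
            System.out.println("try " + call.tries + " -> next attempt at "
                    + call.nextAllowedTryMs + " ms");
            now = call.nextAllowedTryMs; // pretend the retry happens at the earliest allowed time
        }
    }
}
```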
Consumer
In KafkaConsumer, we have to change the retry backoff logic in the Fetcher, ConsumerNetworkClient, and Metadata. Since ConsumerNetworkClient and Metadata are also used by other clients, they would have to house their own retry backoff logic. The Fetcher, however, could query a dynamically updating retryBackOffMs from KafkaConsumer.
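One way the Fetcher could query a dynamically updating backoff is for the consumer to track the failure count and expose the current value through a supplier rather than a fixed long. This is a sketch under assumed names; the actual refactoring may differ.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongSupplier;

// Hypothetical sketch: the consumer tracks failures and exposes the current
// backoff, so a component like the Fetcher always sees the up-to-date value
// without owning the backoff logic itself.
public class DynamicBackoffSketch {
    private final AtomicInteger failures = new AtomicInteger();
    private final long retryBackoffMs = 100;     // retry.backoff.ms (assumed default)
    private final long retryBackoffMaxMs = 1000; // retry.backoff.max.ms (assumed default)

    long currentBackoffMs() {
        int f = failures.get();
        if (f == 0) return 0;
        return Math.min(retryBackoffMs * (1L << (f - 1)), retryBackoffMaxMs);
    }

    void onFailure() { failures.incrementAndGet(); }
    void onSuccess() { failures.set(0); }

    public static void main(String[] args) {
        DynamicBackoffSketch consumer = new DynamicBackoffSketch();
        LongSupplier backoffForFetcher = consumer::currentBackoffMs; // what a Fetcher-like component would hold
        consumer.onFailure();
        consumer.onFailure();
        System.out.println("backoff after 2 failures: " + backoffForFetcher.getAsLong() + " ms");
        consumer.onSuccess();
        System.out.println("backoff after success: " + backoffForFetcher.getAsLong() + " ms");
    }
}
```

Resetting the count on success means a single transient failure never escalates the backoff for subsequent, unrelated requests.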
Producer
For the KafkaProducer, we have to change the retry backoff logic in ConsoleProducer, RecordAccumulator, Sender, TransactionManager, and Metadata. As mentioned above, Metadata is used by other clients, so it would have its own retry backoff logic. The rest of the classes, as described in the "Consumer" section above, can query a dynamically updating retryBackOffMs from KafkaProducer.
Broker API Versions Command
Changes made for AdminClient and ConsumerNetworkClient would apply here. The main change is for BrokerApiVersionsCommand to set the appropriate arguments for AdminClient and ConsumerNetworkClient once the changes to those are made.
Compatibility, Deprecation, and Migration Plan
For users who have not set retry.backoff.ms explicitly, the default behavior will change so that the backoff will grow up to 1000 ms. For users who have set retry.backoff.ms explicitly, the behavior will remain the same, as they could have specific requirements.
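To make the two cases concrete, here is an illustrative client configuration fragment (the 500 ms value is an arbitrary example):

```properties
# Case 1: retry.backoff.ms not set.
# retry.backoff.ms stays at its 100 ms default, and retry.backoff.max.ms
# defaults to 1000 ms, so the backoff grows from 100 ms up to 1 s.

# Case 2: retry.backoff.ms set explicitly.
retry.backoff.ms=500
# retry.backoff.max.ms defaults to the same 500 ms, so the backoff stays
# fixed and existing behavior is preserved unless the user also raises
# retry.backoff.max.ms.
```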
Rejected Alternatives
- Default retry.backoff.max.ms to the same value as retry.backoff.ms so that existing behavior is always maintained: rejected for reasons explained in the compatibility section.
- Default retry.backoff.max.ms to 1000 ms unconditionally: rejected for reasons explained in the compatibility section.