Status

Current state: Accepted

Discussion thread: here  (Not happening yet)

...

Instantiating a new client may fail fatally if the bootstrap server cannot be resolved, whether because of misconfiguration or a transient network issue such as slow DNS. This is suboptimal because the address may take a long time to become available at the DNS server, and in the meantime users must keep retrying. In addition, the ConfigException type does not accurately reflect the root cause of the problem, which makes this failure case hard to handle. We think it is reasonable to give users a grace period to retry if the address cannot be resolved immediately. Poisoning the client during construction can also be obstructive; I think it is better to fail the client on its first attempt to connect to the network.

...

  • Timeout Configuration: bootstrap.resolve.timeout.ms
  • Exception: BootstrapResolutionException extends KafkaException
  • Logging: log.WARN("Unable to bootstrap after {} ms.", elapsedMs)

...

  • Client Constructor: The constructor will only parse the bootstrap configuration.

  • NetworkClient:

    • Bootstrapping will now occur in the poll method before attempting to update the metadata. This includes resolving the addresses and bootstrapping the metadata.
    • An error message will be logged in the event of a failed bootstrap process.
    • If the timeout is exceeded, a non-retriable BootstrapResolutionException will be thrown.
  • Consumer, Producer, and Admin Clients: The bootstrap code will be changed.
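The deferred bootstrap described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual NetworkClient change: LazyBootstrap, HostResolver, and maybeBootstrap() are hypothetical names, the exception is a local stand-in that extends RuntimeException rather than KafkaException, and the clock is injected so the timeout logic can be exercised without real DNS.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.function.LongSupplier;

// Hypothetical stand-in for the exception proposed in this KIP.
class BootstrapResolutionException extends RuntimeException {
    BootstrapResolutionException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical resolver abstraction; InetAddress::getAllByName fits this shape.
interface HostResolver {
    InetAddress[] resolve(String host) throws UnknownHostException;
}

// Hypothetical helper: the constructor only stores configuration,
// and resolution is attempted lazily from the poll path.
class LazyBootstrap {
    private final String host;
    private final long resolveTimeoutMs;
    private final HostResolver resolver;
    private final LongSupplier clockMs;
    private long firstAttemptMs = -1L;
    private InetAddress[] addresses;

    LazyBootstrap(String host, long resolveTimeoutMs, HostResolver resolver, LongSupplier clockMs) {
        this.host = host;
        this.resolveTimeoutMs = resolveTimeoutMs;
        this.resolver = resolver;
        this.clockMs = clockMs;
    }

    // Intended to be called at the start of every poll.
    // Returns true once the address is resolved, false if the caller should try again later.
    boolean maybeBootstrap() {
        if (addresses != null) {
            return true;
        }
        long nowMs = clockMs.getAsLong();
        if (firstAttemptMs < 0) {
            firstAttemptMs = nowMs; // the timer starts at the first attempt, not at construction
        }
        try {
            addresses = resolver.resolve(host);
            return true;
        } catch (UnknownHostException e) {
            long elapsedMs = nowMs - firstAttemptMs;
            if (elapsedMs >= resolveTimeoutMs) {
                // Non-retriable: with a timeout of 0 this fires on the first failure.
                throw new BootstrapResolutionException(
                        "Unable to bootstrap after " + elapsedMs + " ms.", e);
            }
            return false; // a real client would also log the failure here
        }
    }
}
```

A caller could construct it as `new LazyBootstrap("broker.example", 120_000L, InetAddress::getAllByName, System::currentTimeMillis)` and invoke `maybeBootstrap()` before each metadata update.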

New Configuration

bootstrap.resolve.timeout.ms

The proposed configuration specifies the maximum amount of time clients can spend trying to resolve the bootstrap server address. If the resolution cannot be completed within this timeframe, a BootstrapResolutionException will be thrown.

Note that the default value for this configuration option is open for discussion. It can be set to 0, which is the same as the current behavior of exiting upon the first failure.

Type: long
Default: 120000 (i.e. 2 minutes)
Valid Values: 0 - LONG_MAX
Importance: high
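As an illustration of how an application might set the proposed option, here is a small sketch. BootstrapConfigExample and clientProps() are hypothetical helpers, and the key is the one proposed in this KIP, so clients that do not implement the KIP would not recognize it.

```java
import java.util.Properties;

// Hypothetical helper that builds client properties including the proposed option.
class BootstrapConfigExample {
    // Key proposed by this KIP; an assumption for any client that does not implement it.
    static final String BOOTSTRAP_RESOLVE_TIMEOUT_MS = "bootstrap.resolve.timeout.ms";

    static Properties clientProps(String bootstrapServers, long resolveTimeoutMs) {
        if (resolveTimeoutMs < 0) {
            // Valid values are 0 - LONG_MAX.
            throw new IllegalArgumentException("resolve timeout must be >= 0");
        }
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // 0 mimics the current behavior: fail on the first resolution failure.
        props.setProperty(BOOTSTRAP_RESOLVE_TIMEOUT_MS, Long.toString(resolveTimeoutMs));
        return props;
    }
}
```

Calling `BootstrapConfigExample.clientProps("broker.example:9092", 120000L)` yields properties that spell out the proposed default explicitly.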

New Error

Name: BootstrapResolutionException extends KafkaException

Message: "Couldn't resolve server {} from {} as DNS resolution failed for {}"

Type: Non-retriable.

Compatibility, Deprecation, and Migration Plan

Compatibility

  • Failed DNS resolution throws BootstrapResolutionException: users are expected to catch the error, or the client will be poisoned.
  • Users who tried to catch ConfigException for DNS resolution errors will no longer need this logic.
  • There shouldn't be a compatibility problem as the bootstrap logic changes only affect the failure scenario.
  • The user can use a timeout (bootstrap.resolve.timeout.ms) of 0 to mimic the current behavior, i.e. fatal upon the first failure.

Deprecation

  • There's no deprecation plan

...

  1. Misconfiguration or non-transient issues with the network
  2. Transient network issues, e.g., slow DNS resolution.

KafkaConsumer

Case 1: Non-transient case

When the bootstrap timeout expires, the client will throw a BootstrapResolutionException.

Case 2: Transient Network Issue

The consumer poll won't return any records until the client has been bootstrapped. If the issue cannot be resolved within the bootstrap timeout, a BootstrapResolutionException will be thrown.

KafkaProducer

Case 1: Non-transient case

The BootstrapResolutionException will be thrown in send() and partitionsFor() when the bootstrap timeout expires. If max.block.ms elapses before the timeout expires, a TimeoutException will be thrown instead.
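The interaction between the two timers can be sketched as a simple comparison. This is a simplification under stated assumptions: ProducerBlockingSketch and its members are hypothetical names, both timers are treated as starting at the same moment, and the tie case (equal values) is not specified in this KIP, so resolving it in favor of the bootstrap exception is an arbitrary choice.

```java
// Hypothetical model of which exception a blocked send() or partitionsFor()
// call surfaces when the bootstrap address never resolves.
class ProducerBlockingSketch {
    enum Outcome { TIMEOUT_EXCEPTION, BOOTSTRAP_RESOLUTION_EXCEPTION }

    static Outcome outcomeWhenNeverResolved(long maxBlockMs, long bootstrapResolveTimeoutMs) {
        // If max.block.ms elapses first, the call fails with a TimeoutException.
        if (maxBlockMs < bootstrapResolveTimeoutMs) {
            return Outcome.TIMEOUT_EXCEPTION;
        }
        // Otherwise the bootstrap timeout expires first (ties are an assumption here).
        return Outcome.BOOTSTRAP_RESOLUTION_EXCEPTION;
    }
}
```

Under this model, the producer's default max.block.ms of 60000 combined with the proposed 120000 default means a blocked call would see a TimeoutException first.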

...

The send() and partitionsFor() methods will be blocked on bootstrap until either the max.block.ms or the bootstrap timeout elapses.

AdminClient

Case 1: Non-transient case

The API call results will either time out, if the request times out first, or be completed exceptionally with a BootstrapResolutionException.

Case 2: Transient Network Issue

The user won't be able to get the results back until the bootstrap address is resolved. Meanwhile, the API calls can expire.

What should users do after the timeout expires?

...

  1. NetworkClient

    1. Test DNS resolution upon its initial poll

    2. Test if the right exception type is thrown

  2. Existing clients (Consumer, Producer, AdminClient)

    1. Test successful bootstrapping upon retrying

Rejected Alternatives

We've discussed many alternatives. Eventually, we asked ourselves what the goal of this KIP is: giving people a chance to retry on DNS resolution without poisoning the client. That came down to two resolutions: 1. giving people a configurable timeout, and 2. adding a fatal error to alert the user.

Here are the rejected alternatives:

  1. Maintain the current code behavior and add a retry loop with a timeout.

    1. Pros: Same logic, less code change.

    2. Cons: Do users want to be blocked on instantiating the client? I don't like this idea.

  2. Throw a DNS resolution exception upon failure, with no retry

    1. Pros: No additional config is needed

    2. Cons: This is a behavioral change, and the application owner might need to rewrite the exception handling, i.e. add logic to catch the DNS failure.

  3. No timeout. The network client will continue to retry until it is interrupted.
    1. Pros: No compatibility break. No additional exception handling logic; the network client will just log the error and continue to retry upon the next poll.
    2. Cons: I think we should have some failure mechanism to notify users.
  4. Making BootstrapResolutionException retriable
    1. Pros: For the transient case, we might not even need a timeout; people are expected to retry on catching this exception.
    2. Cons: We would then rely on an alerting mechanism to alert users to the issue. If it is indeed a configuration issue, it is harder to discover.
  5. Combine DNS resolution and connection into a single timeout
    1. Pros: Using a single timer to account for the connection time.
    2. Cons: Should we make connection retry fatal after the timeout? Maybe not.
  6. 5min default timeout
    1. We've decided to reduce it to 2min to stay consistent with delivery.timeout.ms
    2. 5min can be too long