Status

Current state: Accepted

Discussion thread: here

JIRA: here

Motivation

Instantiating a new client may result in a fatal failure if the bootstrap server cannot be resolved due to misconfiguration or transient network issues such as slow DNS resolution.  This behavior is disruptive because the application owner must either implement retry logic or manually restart the application.  Addressing the network issue is not straightforward, for the following reasons.

  1. Failed DNS resolution throws a ConfigException, which is not descriptive of the actual problem.  While the error message itself is fine, catching and retrying based on the error message is not ideal.

  2. One of the common problems we encountered was DNS resolution.  In a dynamic cloud environment, it can take tens of minutes before the bootstrap server is registered with the DNS server, so it is convenient to provide a grace period before failing the client.

Public Interfaces

See Proposed Changes.

  1. A new config to time out the bootstrap process.
  2. A new exception type for this specific issue.
  3. Additional logging upon failure to bootstrap.
  4. A change in the failure condition.

Proposed Changes

Failing the client at instantiation is suboptimal because it might take a long time for the address to become available at the DNS server, and users will need to keep retrying.  Also, the ConfigException type does not accurately reflect the root cause of the problem, which makes this failure case hard to handle.  We think it is reasonable to give users a grace period to retry if the address cannot be resolved immediately. Poisoning the client during construction can also be obstructive; I think it is better to fail the client on its first attempt to connect to the network.

This KIP proposes moving the bootstrapping logic from the constructor to the NetworkClient poll, for two purposes:

1. Not failing the client upon instantiation. In many cases, this behavior also kills the app, which might not be desirable.

2. Piggybacking onto the client poll, which is a more natural way to retry.

We propose to add a new configuration option for timing out the bootstrapping process, a new exception type for handling bootstrap-related issues, and additional logging to aid in diagnosing bootstrapping failures.

Public Facing Changes

  • Timeout Configuration: bootstrap.resolve.timeout.ms
  • Exception: BootstrapResolutionException extends KafkaException (non-retriable)
  • Logging: log.WARN("Unable to bootstrap after {} ms.", elapsedMs)

Internal Changes

  • Client Constructor: The constructor will only parse the bootstrap configuration.

  • NetworkClient:

    • Bootstrapping will now occur in the poll method before attempting to update the metadata. This includes resolving the addresses and bootstrapping the metadata.
    • An error message will be logged in the event of a failed bootstrap process.
    • If the timeout is exceeded, a non-retriable BootstrapResolutionException will be thrown.
  • Consumer, Producer, and Admin Clients: The bootstrap code will be changed.
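The internal changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual NetworkClient code: LazyBootstrap, AddressResolver, and ensureBootstrapped() are hypothetical names, the exception is a dependency-free stand-in that extends RuntimeException rather than KafkaException, and System.err stands in for the client logger. The resolver and clock are injected so the behavior can be exercised without real DNS.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.function.LongSupplier;

// Stand-in for the exception proposed in this KIP (which extends KafkaException).
class BootstrapResolutionException extends RuntimeException {
    BootstrapResolutionException(String message) {
        super(message);
    }
}

// Hypothetical hook wrapping the DNS lookup, e.g. InetAddress.getAllByName.
interface AddressResolver {
    List<InetAddress> resolve(String host) throws UnknownHostException;
}

// Hypothetical helper; illustrative only.
class LazyBootstrap {
    private final String host;
    private final long timeoutMs;
    private final AddressResolver resolver;
    private final LongSupplier clockMs;
    private long firstAttemptMs = -1L;   // set on the first poll, not in the constructor
    private List<InetAddress> addresses; // stays null until resolution succeeds

    LazyBootstrap(String host, long timeoutMs, AddressResolver resolver, LongSupplier clockMs) {
        // The constructor only stores the parsed configuration; no lookup happens here.
        this.host = host;
        this.timeoutMs = timeoutMs;
        this.resolver = resolver;
        this.clockMs = clockMs;
    }

    // Intended to run at the start of every poll, before any metadata update.
    // Returns true once the bootstrap address has been resolved.
    boolean ensureBootstrapped() {
        if (addresses != null) {
            return true;
        }
        long nowMs = clockMs.getAsLong();
        if (firstAttemptMs < 0) {
            firstAttemptMs = nowMs;
        }
        try {
            addresses = resolver.resolve(host);
            return true;
        } catch (UnknownHostException e) {
            long elapsedMs = nowMs - firstAttemptMs;
            if (elapsedMs >= timeoutMs) {
                // Fatal once the grace period is used up.
                throw new BootstrapResolutionException(
                    "Unable to bootstrap " + host + " after " + elapsedMs + " ms");
            }
            // Stand-in for the proposed warning log line.
            System.err.println("WARN Unable to bootstrap after " + elapsedMs + " ms.");
            return false;
        }
    }
}
```

With a timeout of 0 the first failed lookup throws immediately in this sketch, while any positive value lets the caller keep polling until the address resolves or the grace period runs out.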

New Configuration

bootstrap.resolve.timeout.ms

The proposed configuration specifies the maximum amount of time clients can spend trying to resolve the bootstrap server address. If the resolution cannot be completed within this timeframe, a BootstrapResolutionException will be thrown.

Note: a value of 0 is the same as the current behavior, i.e., the client exits upon the first failed attempt.

Type: long
Default: 120000 (i.e. 2 minutes)
Valid Values: 0 - LONG_MAX
Importance: high
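As a rough illustration of how the proposed setting could be supplied and validated, here is a minimal sketch using plain java.util.Properties. The BootstrapTimeoutConfig class and its parse() helper are hypothetical; real clients would read the value through their own config classes, and clients that predate this KIP will not recognize the key.

```java
import java.util.Properties;

// Hypothetical helper; only the config name, default, and valid range come from this KIP.
class BootstrapTimeoutConfig {
    static final String BOOTSTRAP_RESOLVE_TIMEOUT_MS = "bootstrap.resolve.timeout.ms";
    static final long DEFAULT_TIMEOUT_MS = 120_000L; // 2 minutes

    // Returns the configured timeout, falling back to the proposed default.
    static long parse(Properties props) {
        String raw = props.getProperty(BOOTSTRAP_RESOLVE_TIMEOUT_MS);
        if (raw == null) {
            return DEFAULT_TIMEOUT_MS;
        }
        long value = Long.parseLong(raw.trim());
        if (value < 0) {
            throw new IllegalArgumentException(
                BOOTSTRAP_RESOLVE_TIMEOUT_MS + " must be >= 0, got " + value);
        }
        return value;
    }
}
```

A caller would set the key next to bootstrap.servers, for example props.setProperty("bootstrap.resolve.timeout.ms", "30000") to shorten the grace period to 30 seconds.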

New Error

Name: BootstrapResolutionException extends KafkaException

Message: "Couldn't resolve server {} from {} as DNS resolution failed for {}"

Type: Non-retriable.
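A minimal sketch of what such an exception type might look like. To stay dependency-free it extends RuntimeException as a stand-in for KafkaException, and the meaning assigned to the three placeholders (the configured URL, the config name, and the host that failed to resolve) is an assumption, since the KIP does not spell them out.

```java
// Stand-in for the proposed type; the KIP has it extend KafkaException.
class BootstrapResolutionException extends RuntimeException {
    // Argument meanings are assumed for illustration.
    BootstrapResolutionException(String url, String configName, String host) {
        super(String.format(
            "Couldn't resolve server %s from %s as DNS resolution failed for %s",
            url, configName, host));
    }
}
```

Because the type is non-retriable, callers would treat it as fatal rather than retry on it automatically.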

Compatibility, Deprecation, and Migration Plan

  • Client Behaviors

    • Clients won’t attempt to resolve the bootstrap addresses upon initialization.

    • Clients won’t exit fatally at instantiation if DNS resolution fails.

    • KafkaConsumer: Users must poll to retry the lookup if it fails.

    • KafkaAdminClient: Users will need to resend the request if it fails.

    • KafkaProducer: The sender loop should already be polling continuously.

Compatibility

  • Failed DNS resolution throws BootstrapResolutionException: Users are expected to catch the error, or the client will be poisoned.
  • Users who catch ConfigException to handle DNS resolution errors will no longer need this logic.

Deprecation

  • There's no deprecation plan

Migration

  • There's no migration plan

Case Study

In this section, I outline how clients can react to bootstrap failures. In particular, I want to cover two common cases:

  1. Misconfiguration or non-transient issues with the network
  2. Transient network issues, e.g., slow DNS resolution.

KafkaConsumer

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

Suppose the user instantiates a KafkaConsumer with an invalid bootstrap config. When the user invokes assign() and starts poll(), the poll() method will continue to return empty ConsumerRecords and log a warning message.

The user can continue to retry for the configured duration. After the bootstrap timeout expires, the client will throw a BootstrapResolutionException.

Case 2: Transient Network Issue (For example: transient DNS failure)

Now, suppose the user instantiates a KafkaConsumer with a valid bootstrap config, but there is a transient network issue, such as slow DNS resolution.

When the user starts poll(), the poll() method will return empty ConsumerRecords and log a warning message. The consumer poll won't return any records until the client has been bootstrapped.

The user can continue to retry. If the network issue is resolved within the bootstrap timeout, the KafkaConsumer will then continue to function normally. Otherwise, a BootstrapResolutionException will be thrown.

KafkaProducer

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

Suppose the user instantiates a KafkaProducer with an invalid bootstrap config. As the producer is instantiated, the sender thread starts running.  A warning message is logged every time the NetworkClient tries to poll().

If the user tries to produce messages, the producer callback may be completed with a TimeoutException until the bootstrap timeout runs out.

A BootstrapResolutionException will be thrown in send() and partitionsFor() when the bootstrap timeout expires. If max.block.ms elapses before the bootstrap timeout expires, a TimeoutException will be thrown instead.

Case 2: Transient Network Issue (For example: transient DNS failure)

Now, suppose the user instantiates a KafkaProducer with a valid bootstrap config, but there is a transient network issue. As the sender thread starts running, a warning message is logged upon trying to bootstrap the client.

If the network issue is resolved before the user tries to produce a message, only warning messages will be logged.

If the user tries to produce a message before the issue is resolved, the send() and partitionsFor() methods will block on bootstrap until either max.block.ms or the bootstrap timeout elapses. If the network issue persists, the sender callback will be completed with a TimeoutException; if it is resolved before max.block.ms is exhausted, the send will complete normally.
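The interplay between max.block.ms and the bootstrap timeout in the producer cases can be sketched as a small decision function. This is a simplified model under stated assumptions: ProducerBlockingSketch and Outcome are hypothetical names, both timers are assumed to start at the same instant, and the tie-breaking rule when the two timeouts are equal is an assumption the KIP does not specify.

```java
// Hypothetical model; not producer code.
class ProducerBlockingSketch {
    enum Outcome { BOOTSTRAPPED, TIMEOUT_EXCEPTION, BOOTSTRAP_RESOLUTION_EXCEPTION }

    // resolvedAtMs: when DNS resolution would succeed, or Long.MAX_VALUE if it never does.
    static Outcome outcome(long maxBlockMs, long bootstrapTimeoutMs, long resolvedAtMs) {
        // The call blocks until whichever limit elapses first.
        long deadlineMs = Math.min(maxBlockMs, bootstrapTimeoutMs);
        if (resolvedAtMs < deadlineMs) {
            return Outcome.BOOTSTRAPPED;
        }
        // max.block.ms elapsing first surfaces as a TimeoutException;
        // otherwise the bootstrap timeout surfaces as the fatal exception.
        return maxBlockMs < bootstrapTimeoutMs
            ? Outcome.TIMEOUT_EXCEPTION
            : Outcome.BOOTSTRAP_RESOLUTION_EXCEPTION;
    }
}
```

Under this model, a max.block.ms of 60000 with the proposed 120000 ms bootstrap timeout would surface a TimeoutException first when resolution never succeeds; a user who wants to see the BootstrapResolutionException from send() would need a max.block.ms at least as large as the bootstrap timeout.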

AdminClient

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

...

The API call results will either time out, if the request times out first, or be completed exceptionally with a BootstrapResolutionException.

Case 2: Transient Network Issue (For example: transient DNS failure)

The user won't be able to get the results back until the address is resolved.  Meanwhile, the API calls can expire.

What should users do after the timeout expires?

The exception is meant to be fatal, so the user should check their network setup, configuration, or adjust the timeout.

The user can continue to retry, but this exception is meant to alert the user to take action upon failing to bootstrap.

Test Plan

  1. NetworkClient

    1. Test DNS resolution upon its initial poll

    2. Test if the right exception type is thrown

  2. Existing clients (Consumer, Producer, AdminClient)

    1. Test successful bootstrapping upon retrying

Rejected Alternatives

We've discussed many alternatives.  Eventually, we asked ourselves what the goal of this KIP is: giving people a chance to retry on DNS resolution without poisoning the client. This came down to two resolutions: 1. giving people a configurable timeout, and 2. adding a fatal error to alert the user.

Here are the rejected alternatives:

  1. Maintain the current code behavior and add a retry loop with a timeout.

    1. Pros: Same logic, less code change.

    2. Cons: Do users want to be blocked on instantiating the client? I don't like this idea.

  2. Throw a DNS resolution exception upon failure, with no retry.

  3. Allow the application owner to specify a retry period. The clients will fail after exceeding the timeout. The default is set to 0s, which makes retry an opt-in config.

    1. Pros: Allows users to have more control over how long to retry

    2. Cons: Require a new config; client instantiation can block.

  4. No retry. Let the application owner handle the DNS resolution exception. This means we would still throw a DNSLookupException upon failing.

    1. Pros: No additional config is needed

    2. Cons: This is a behavioral change, and the application owner might need to rewrite the exception handling, i.e. catching the DNS failure logic.

  5. Not throwing an exception but letting NetworkClient retry on poll.  The network client will continue to retry until it is interrupted.
    1. Pros: No compatibility break. No additional exception handling logic; the network client will just log the error and continue to retry upon the next poll.
    2. Cons: The discussion thread mentioned that the client wouldn't fail upon starting. I think we should have some failure mechanism to notify users.
  6. Making BootstrapResolutionException retriable
    1. Pros: For the transient case, we might not even need a timeout; people are expected to retry on catching this exception.
    2. Cons: We would then rely on an alerting mechanism to alert users to the issue. If it is indeed a configuration issue, it is harder to discover.
  7. Combine DNS resolution and connection into a single timeout
    1. Pros: Using a single timer to account for the connection time.
    2. Cons: Should we make connection retry fatal after the timeout? Maybe not.
  8. 5min default timeout
    1. We've decided to reduce it to 2min to stay consistent with delivery.timeout.ms
    2. 5min can be too long