Status

Current state: Under discussion.

Discussion thread: here  (Not happening yet)

JIRA: here

Motivation

Instantiating a new client may result in a fatal failure if the bootstrap server cannot be resolved due to misconfiguration or transient network issues such as slow DNS. This is suboptimal: the ConfigException exception type does not accurately reflect the root cause of the problem, and users cannot retry to recover from transient issues. It would be more effective to provide a grace period for retry attempts before ultimately failing.

Proposed Changes

This KIP proposes moving bootstrapping logic from the constructor to the NetworkClient poll. This change ensures that the client doesn't fail at startup due to issues like misconfiguration or network disruptions and allows for retries upon subsequent poll() invocations. The proposed updates include introducing a new configuration option for timing out the bootstrapping process, a new exception type for handling bootstrap-related issues, and additional logging to aid in diagnosing bootstrapping failures.

Public Facing Changes

  • Timeout Configuration: bootstrap.connection.timeout.ms
  • Exception: BootstrapConnectionException extends KafkaException
  • Logging: log.WARN("Unable to bootstrap after {} ms.", elapsedMs)

Internal Changes

  • Client Constructor: The constructor will only parse the bootstrap configuration.

  • NetworkClient:

    • Bootstrapping will now occur in the poll method before attempting to update the metadata. This includes resolving the addresses and bootstrapping the metadata.
    • An error message will be logged in the event of a failed bootstrap process.
    • If the bootstrap timeout expires, a non-retriable BootstrapConnectionException will be thrown.
  • Consumer, Producer, and Admin Clients: The bootstrap code will be changed.
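
The internal changes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual NetworkClient code: the class BootstrapOnPollSketch, the Resolver seam, the injected clock, and the method maybeBootstrap() are all hypothetical names, and the exception here extends RuntimeException only to keep the example free of Kafka dependencies.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.function.LongSupplier;
import java.util.logging.Logger;

/** Hypothetical sketch of bootstrapping inside poll(); not the real NetworkClient. */
final class BootstrapOnPollSketch {
    private static final Logger LOG = Logger.getLogger("BootstrapOnPollSketch");

    /** Stand-in for the proposed exception (the KIP has it extend KafkaException). */
    static final class BootstrapConnectionException extends RuntimeException {
        BootstrapConnectionException(String message) { super(message); }
    }

    /** Hypothetical seam so that DNS resolution can be faked in tests. */
    interface Resolver {
        List<InetAddress> resolve(String host) throws UnknownHostException;
    }

    private final String host;
    private final long timeoutMs;
    private final Resolver resolver;
    private final LongSupplier clockMs;
    private long firstAttemptMs = -1L;
    private List<InetAddress> addresses = List.of();

    /** The constructor only stores configuration; it performs no DNS lookup. */
    BootstrapOnPollSketch(String host, long timeoutMs, Resolver resolver, LongSupplier clockMs) {
        this.host = host;
        this.timeoutMs = timeoutMs;
        this.resolver = resolver;
        this.clockMs = clockMs;
    }

    boolean isBootstrapped() { return !addresses.isEmpty(); }

    /** Intended to run at the start of every poll(); returns true once bootstrapped. */
    boolean maybeBootstrap() {
        if (isBootstrapped()) return true;
        long nowMs = clockMs.getAsLong();
        if (firstAttemptMs < 0) firstAttemptMs = nowMs; // the timer starts on the first attempt
        try {
            addresses = resolver.resolve(host);
            if (isBootstrapped()) return true;
        } catch (UnknownHostException e) {
            // Treated as retriable until the bootstrap timeout expires.
        }
        long elapsedMs = nowMs - firstAttemptMs;
        LOG.warning(String.format("Unable to bootstrap after %d ms.", elapsedMs));
        if (elapsedMs >= timeoutMs) {
            throw new BootstrapConnectionException(
                "Unable to establish a connection to the bootstrap server in " + elapsedMs + "ms.");
        }
        return false; // the caller retries on the next poll()
    }
}
```

With a timeout of 0 this sketch throws on the first failed attempt, which mirrors the fail-fast behavior the KIP associates with that setting.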

New Configuration

bootstrap.connection.timeout.ms

The proposed configuration specifies the maximum amount of time clients can spend trying to establish a connection to the bootstrap server and resolve its IP address. If the connection cannot be established and resolved within this time, a BootstrapConnectionException will be thrown.

Note that the default value for this configuration option is open for discussion. It can be set to 0, which is the same as the current behavior of exiting upon the first failure.

Type: long
Default: 300000 (5 minutes)
Valid Values: 0 - LONG_MAX
Importance: high
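
As an illustration of how an application might set the proposed option, here is a small sketch that builds client properties. The key bootstrap.connection.timeout.ms and its 300000 ms default are taken from this proposal and are not final; the helper class and method names are hypothetical.

```java
import java.util.Properties;

/** Hypothetical helper showing how the proposed option might be set and read. */
final class BootstrapTimeoutConfigSketch {
    // Proposed key and default from this KIP; both are still under discussion.
    static final String BOOTSTRAP_CONNECTION_TIMEOUT_MS = "bootstrap.connection.timeout.ms";
    static final long DEFAULT_TIMEOUT_MS = 300_000L;

    /** Builds client properties with an explicit bootstrap timeout. */
    static Properties clientProps(String bootstrapServers, long timeoutMs) {
        if (timeoutMs < 0) {
            throw new IllegalArgumentException("timeout must be in the range 0 - LONG_MAX");
        }
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty(BOOTSTRAP_CONNECTION_TIMEOUT_MS, Long.toString(timeoutMs));
        return props;
    }

    /** Reads the timeout, falling back to the proposed default when it is unset. */
    static long timeoutMs(Properties props) {
        return Long.parseLong(
            props.getProperty(BOOTSTRAP_CONNECTION_TIMEOUT_MS, Long.toString(DEFAULT_TIMEOUT_MS)));
    }
}
```

Passing 0 as the timeout would request the fail-on-first-error behavior described above.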

New Error

Name: BootstrapConnectionException extends KafkaException

Message: "Unable to establish a connection to the bootstrap server in {}ms."

Type: Non-retriable.
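
A possible shape for the new error is sketched below. To keep the example self-contained, KafkaExceptionStandIn is a hypothetical placeholder for Kafka's KafkaException; the constructor signature is also an assumption, since the KIP only specifies the name, the message, and that the error is non-retriable.

```java
/** Hypothetical sketch of the proposed exception type. */
final class BootstrapErrorSketch {

    /** Placeholder for org.apache.kafka.common.KafkaException, which is a RuntimeException. */
    static class KafkaExceptionStandIn extends RuntimeException {
        KafkaExceptionStandIn(String message) { super(message); }
    }

    /**
     * Proposed error. It is non-retriable, so in this sketch it deliberately
     * does not extend any retriable exception type.
     */
    static final class BootstrapConnectionException extends KafkaExceptionStandIn {
        BootstrapConnectionException(long elapsedMs) {
            super("Unable to establish a connection to the bootstrap server in " + elapsedMs + "ms.");
        }
    }
}
```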

Compatibility, Deprecation, and Migration Plan

  • Client Behavior Changes

    • Bootstrap upon first NetworkClient.poll() instead of in the constructor.

    • Bootstrapping is retriable.

    • KafkaConsumer: Users can retry bootstrapping via poll() if it fails. Each retry will be bounded by either the poll timer or the bootstrap timer, whichever expires first.

    • KafkaAdminClient: The bootstrap exception is thrown when the user tries to materialize the result future. The retry is bounded by the API timeout.

    • KafkaProducer: Bootstrap will be done in the background thread. If the client has not been bootstrapped when the user attempts to send a message, the call can block until the maximum block timer expires or the bootstrap completes, whichever comes first.

    Exception Handling

      • Throws BootstrapConnectionException

Case Study

Every time the network client attempts to bootstrap and fails, a warning message will be logged. In this section, I outline how clients can react to bootstrap failures. In particular, I want to cover two common cases:

  1. Misconfiguration or non-transient issues with the network
  2. Transient network issues, e.g., slow DNS resolution.

KafkaConsumer

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

When the bootstrap timeout expires, the client will throw a BootstrapConnectionException on poll().

Case 2: Transient Network Issue (For example: transient DNS failure)

If there is a transient network issue, such as slow DNS resolution, poll() will continue to return empty records until the issue is resolved. If the issue cannot be resolved within the bootstrap timeout, a BootstrapConnectionException will be thrown on poll().
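
From the application's point of view, both consumer cases might look like the loop sketched below. The Poller interface is a hypothetical stand-in for KafkaConsumer.poll(Duration), records are modeled as plain strings, and the exception is a local stand-in for the proposed type; none of this is the real consumer API.

```java
import java.time.Duration;
import java.util.List;

/** Hypothetical application-side view of polling while bootstrap is still pending. */
final class ConsumerRetrySketch {

    /** Local stand-in for the proposed exception. */
    static final class BootstrapConnectionException extends RuntimeException {
        BootstrapConnectionException(String message) { super(message); }
    }

    /** Stand-in for KafkaConsumer.poll(Duration); records are plain strings here. */
    interface Poller {
        List<String> poll(Duration timeout);
    }

    /**
     * Polls until records arrive or maxPolls is reached. An empty result simply
     * means "nothing yet", which includes the case where bootstrap is still being
     * retried. A BootstrapConnectionException is allowed to propagate because it
     * is non-retriable.
     */
    static List<String> pollUntilRecords(Poller consumer, Duration pollTimeout, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            List<String> records = consumer.poll(pollTimeout);
            if (!records.isEmpty()) {
                return records;
            }
        }
        return List.of();
    }
}
```

In Case 1 the loop ends with the exception once the bootstrap timeout expires; in Case 2 it keeps seeing empty results until DNS recovers.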

KafkaProducer

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

The BootstrapConnectionException will be thrown in send() and partitionsFor() when the bootstrap timeout expires. If max.block.ms elapses before the bootstrap timeout expires, a TimeoutException will be thrown instead.

Case 2: Transient Network Issue (For example: transient DNS failure)

The send() and partitionsFor() methods will be blocked on bootstrap until either the max.block.ms or the bootstrap timeout elapses.
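
The interaction between the two timers can be made concrete with a small decision function. This is a sketch only: the class, enum, and parameter names are hypothetical, and the tie-breaking rule (both timers expiring at the same instant yields the bootstrap exception) is an assumption that the KIP does not specify.

```java
/** Hypothetical model of which outcome a blocked send() or partitionsFor() observes. */
final class ProducerBlockingSketch {

    enum Outcome { BOOTSTRAPPED, TIMEOUT_EXCEPTION, BOOTSTRAP_CONNECTION_EXCEPTION }

    /**
     * @param maxBlockMs                the producer's max.block.ms
     * @param bootstrapRemainingMs      time left on the bootstrap timer when the call starts
     * @param bootstrapCompletesAfterMs when bootstrap would succeed (Long.MAX_VALUE if never)
     */
    static Outcome blockOnBootstrap(long maxBlockMs, long bootstrapRemainingMs,
                                    long bootstrapCompletesAfterMs) {
        // The call blocks until whichever timer expires first.
        long deadlineMs = Math.min(maxBlockMs, bootstrapRemainingMs);
        if (bootstrapCompletesAfterMs <= deadlineMs) {
            return Outcome.BOOTSTRAPPED;
        }
        // Assumption: on a tie, the bootstrap exception wins.
        return maxBlockMs < bootstrapRemainingMs
            ? Outcome.TIMEOUT_EXCEPTION
            : Outcome.BOOTSTRAP_CONNECTION_EXCEPTION;
    }
}
```

For example, assuming max.block.ms is 60000 and the full proposed 300000 ms bootstrap timeout remains, a call that blocks on a bootstrap that never completes would see a TimeoutException first.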

AdminClient

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

The API call results will either time out, if the request timeout expires first, or be completed exceptionally with a BootstrapConnectionException.

Case 2: Transient Network Issue (For example: transient DNS failure)

If there is a transient network issue, such as a transient DNS failure, the user won't be able to get the results back until the bootstrap issue is resolved. API calls that reach their timeout in the meantime will expire.
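
How a caller might tell these outcomes apart when materializing a result is sketched below. CompletableFuture is used here as a stand-in for the future type returned by the admin client, and the exception is a local stand-in for the proposed type; the helper name describe is hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

/** Hypothetical sketch of materializing an admin result future. */
final class AdminFutureSketch {

    /** Local stand-in for the proposed exception. */
    static final class BootstrapConnectionException extends RuntimeException {
        BootstrapConnectionException(String message) { super(message); }
    }

    /** Materializes the result and reports how the call ended. */
    static String describe(CompletableFuture<String> result) {
        try {
            return "ok: " + result.join();
        } catch (CompletionException e) {
            // join() wraps the original failure; the cause tells the cases apart.
            Throwable cause = e.getCause();
            if (cause instanceof BootstrapConnectionException) {
                return "bootstrap failed: " + cause.getMessage();
            }
            return "failed: " + cause;
        }
    }
}
```

The bootstrap failure only surfaces when the future is materialized, which matches the behavior described for KafkaAdminClient in the compatibility section.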

Test Plan

  1. NetworkClient

    1. Test DNS resolution upon its initial poll

    2. Test if the right exception type is thrown

  2. Existing clients (Consumer, Producer, AdminClient)

    1. Test successful bootstrapping upon retrying

Rejected Alternatives

  1. Maintain the current code behavior and add a retry loop with a timeout.

    1. Pros: Same logic, less code change.

    2. Cons: Do people want object instantiation to block? I don't think it is a good idea.

  2. Throw a DNS resolution exception upon failure, with no retry.

    1. Pros: No additional config is needed

    2. Cons: This is a behavioral change, and the application owner might need to rewrite the exception handling, i.e., the logic that catches the DNS failure.

  3. No timeout. The network client will continue to retry until it is interrupted.
    1. Pros: No compatibility break. No additional exception handling logic is needed; the network client will just log the error and retry upon the next poll.
    2. Cons: I think we should have some failure mechanism to notify users.