...

Discussion thread: here  (Not happening yet)

JIRA: here

Motivation

...

Instantiating a new client may result in a fatal failure if the bootstrap server cannot be resolved, whether because of misconfiguration or because of transient network issues such as slow DNS resolution.  This behavior is disruptive because the application owner must either implement retry logic or manually restart the application.  Network issues are also not straightforward to address, for the following reasons.

...

Failed DNS resolution throws a ConfigException, which is not descriptive of the actual problem.  While the error message itself is fine, catching the exception and retrying based on the error message is not ideal.

This is suboptimal for several reasons, including the fact that the ConfigException type does not accurately reflect the root cause of the problem. It would be more effective to provide a grace period for retry attempts before ultimately failing, as this would improve the client's resilience and increase the chances of successful initialization.

Proposed Changes

This KIP proposes changing the bootstrap behavior of the NetworkClient by moving the logic from the constructor to the first poll() call. This change ensures that the client doesn't fail at startup due to issues like misconfiguration or network disruptions and allows for retries upon subsequent poll() invocations. The proposed updates include introducing a new configuration option for timing out the bootstrapping process, a new exception type for handling bootstrap-related issues, and additional logging to aid in diagnosing bootstrapping failures.

Public Facing Changes

  • Timeout Configuration

Public Interfaces

See Proposed Changes.

  1. New config to timeout the bootstrap connection.
  2. A new exception type for this specific issue.
  3. Additional logging upon failure to bootstrap.
  4. A change in failure condition.

Proposed Changes

Public Facing Changes

  • Config: bootstrap.connection.timeout.ms
  • Exception: BootstrapConnectionException extends KafkaException (non-retriable)
  • Logging: log.WARN("Unable to bootstrap after {} ms.", elapsedMs)

Internal Changes

  • Client Constructor: The constructor will only parse the bootstrap configuration and validate its format.

  • NetworkClient:

    • Bootstrapping will now occur in the poll method before attempting to update the metadata. This includes resolving the addresses and bootstrapping the metadata.
    • An error message will be logged in the event of a failed bootstrap process.
    • If the bootstrap timeout is exceeded, a non-retriable BootstrapConnectionException will be thrown.
  • Consumer, Producer, and Admin Clients: The bootstrap code will be changed.
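
The internal changes above can be illustrated with a minimal sketch. It is not the actual NetworkClient code: LazyBootstrapClient, AddressResolver, and the injected clock are hypothetical names introduced for illustration, and BootstrapConnectionException is a local stand-in for the exception proposed in this KIP. The sketch assumes the bootstrap timer starts at the first poll() attempt, which the KIP does not state explicitly.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.function.LongSupplier;

// Local stand-in for the exception proposed in this KIP; not an existing Kafka class.
class BootstrapConnectionException extends RuntimeException {
    BootstrapConnectionException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical resolver abstraction so the sketch can run without real DNS.
interface AddressResolver {
    List<InetAddress> resolve(String host) throws UnknownHostException;
}

// Illustrative model of deferring bootstrap from the constructor to poll().
class LazyBootstrapClient {
    private final String bootstrapHost;
    private final long bootstrapTimeoutMs;
    private final AddressResolver resolver;
    private final LongSupplier clock;   // returns the current time in ms
    private long firstAttemptMs = -1L;  // assumption: the timer starts at the first poll()
    private List<InetAddress> addresses;

    LazyBootstrapClient(String bootstrapHost, long bootstrapTimeoutMs,
                        AddressResolver resolver, LongSupplier clock) {
        // The constructor only validates the format; no address resolution happens here.
        if (bootstrapHost == null || bootstrapHost.isEmpty()) {
            throw new IllegalArgumentException("bootstrap host must not be empty");
        }
        if (bootstrapTimeoutMs < 0) {
            throw new IllegalArgumentException("bootstrap timeout must be >= 0");
        }
        this.bootstrapHost = bootstrapHost;
        this.bootstrapTimeoutMs = bootstrapTimeoutMs;
        this.resolver = resolver;
        this.clock = clock;
    }

    // Returns true once bootstrapped, false if this attempt failed but the timer has
    // not expired yet. Throws once the bootstrap timeout has been reached.
    boolean poll() {
        if (addresses != null) {
            return true;
        }
        long nowMs = clock.getAsLong();
        if (firstAttemptMs < 0) {
            firstAttemptMs = nowMs;
        }
        try {
            addresses = resolver.resolve(bootstrapHost);
            return true;
        } catch (UnknownHostException e) {
            long elapsedMs = nowMs - firstAttemptMs;
            if (elapsedMs >= bootstrapTimeoutMs) {
                throw new BootstrapConnectionException(
                        "Unable to bootstrap after " + elapsedMs + " ms.", e);
            }
            return false; // the caller may retry on the next poll()
        }
    }
}
```

In this sketch a timeout of 0 makes the first failed attempt throw immediately, since the elapsed time of 0 ms already reaches the limit.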

New Configuration

bootstrap.connection.timeout.ms

The proposed configuration specifies the maximum amount of time clients can spend trying to establish a connection to the bootstrap server and resolve its IP address. If the connection cannot be established and resolved within this time, a BootstrapConnectionException will be thrown.

Note that the default value for this configuration option is open for discussion. It can be set to 0, which is the same as the current behavior of exiting upon the first failure.

Type: long
Default: 300000 (5 minutes)
Valid Values: 0 - LONG_MAX
Importance: high
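
As a rough illustration of how an application might set the proposed option, the sketch below builds a java.util.Properties object. The key bootstrap.connection.timeout.ms is the name proposed in this KIP and is not recognized by existing Kafka clients; the class name, helper method, and host in the usage line are hypothetical.

```java
import java.util.Properties;

// Illustrative helper only; the class and method names are hypothetical.
class BootstrapTimeoutConfigExample {
    // Config name as proposed in this KIP (not an existing Kafka client config).
    static final String BOOTSTRAP_CONNECTION_TIMEOUT_MS = "bootstrap.connection.timeout.ms";

    static Properties clientProps(String bootstrapServers, long timeoutMs) {
        // The proposed valid range starts at 0, so negative values are rejected here.
        if (timeoutMs < 0) {
            throw new IllegalArgumentException("timeout must be >= 0, got " + timeoutMs);
        }
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty(BOOTSTRAP_CONNECTION_TIMEOUT_MS, Long.toString(timeoutMs));
        return props;
    }
}
```

For example, clientProps("broker.example:9092", 0L) would request the fail-on-first-error behavior described above, while 300000L corresponds to the proposed five-minute default.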

...

Compatibility, Deprecation, and Migration Plan

  • Client Behaviors

    • Bootstrap upon first NetworkClient.poll() instead of in the constructor.

    • Bootstrapping is retriable.

    • Clients won’t attempt to resolve the bootstrap addresses upon initialization.

    • Clients will retry bootstrapping until the bootstrap timer expires.

    • KafkaConsumer: Users can retry bootstrapping via poll() if it fails. Each retry will be bounded by either the poll timer or the bootstrap timer, whichever expires first.

    • KafkaAdminClient: A bootstrap exception is thrown when the user tries to materialize the result future. The retry is bounded by the API timeout.

    • KafkaProducer: Bootstrapping will be done in the background thread. If the client has not been bootstrapped when the user attempts to send a message, the call can wait up to the maximum block time or until the bootstrap is complete, whichever comes first.

  • Exception Handling

    • Failed DNS resolution will result in a NetworkException.

...

Case 2: Transient Network Issue (For example: transient DNS failure)

Now, suppose the user instantiates a KafkaConsumer with a valid bootstrap config. If there is a transient network issue, such as slow DNS resolution, poll() will continue to return empty records until the issue is resolved. If the issue cannot be resolved within the bootstrap timeout, a BootstrapConnectionException will be thrown on poll().

KafkaProducer

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

The BootstrapConnectionException will be thrown in send() and partitionsFor() when the bootstrap timeout expires. If max.block.ms elapses before the bootstrap timeout expires, a TimeoutException will be thrown instead.
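
The interaction between the two timers can be sketched as a small decision helper. This is only a model of the rule stated above, not producer code: ProducerBootstrapWait and its members are hypothetical names, and the tie case (both timers expiring at the same instant) is not specified by the KIP, so the sketch arbitrarily assumes the bootstrap exception wins.

```java
// Illustrative model of which failure a blocked send() would surface if the
// producer never manages to bootstrap. All names here are hypothetical.
class ProducerBootstrapWait {
    enum Outcome { BOOTSTRAP_CONNECTION_EXCEPTION, TIMEOUT_EXCEPTION }

    // Longest time the call can block: whichever timer expires first.
    static long maxWaitMs(long maxBlockMs, long bootstrapRemainingMs) {
        return Math.min(maxBlockMs, bootstrapRemainingMs);
    }

    // max.block.ms elapsing first yields a TimeoutException; otherwise the
    // bootstrap timeout yields a BootstrapConnectionException.
    // Assumption: on a tie the bootstrap exception is reported.
    static Outcome failureIfNeverBootstrapped(long maxBlockMs, long bootstrapRemainingMs) {
        return maxBlockMs < bootstrapRemainingMs
                ? Outcome.TIMEOUT_EXCEPTION
                : Outcome.BOOTSTRAP_CONNECTION_EXCEPTION;
    }
}
```

With max.block.ms at 60000 and the proposed 300000 ms bootstrap timeout, a send() issued right after startup would block for at most 60000 ms and then fail with a TimeoutException under this model.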

Case 2: Transient Network Issue (For example: transient DNS failure)

...