Discussion thread: here  (Not happening yet)

JIRA: here

Motivation

Currently, clients fail if DNS resolution fails. Transient network issues such as slow DNS resolution can cause client instantiation to fail. This behavior is disruptive because the application owner must either implement retry logic or manually restart the application. It is inconvenient and hard to handle for the following reasons:

  1. Failed DNS resolution throws a ConfigException, which is not descriptive of the actual problem. While the error message is fine, the exception type is misleading, and catching and retrying based on the error message is not ideal.

  2. One of the common problems we encountered was DNS resolution. In a dynamic cloud environment, it can take minutes before the bootstrap server is registered with the DNS server, so it is reasonable to provide a grace period during which clients continue to retry before failing.

Public Interfaces

  • Users can catch a NetworkException and retry. (Remove the ConfigException)
  • Several warning logs around failing DNS resolution will be removed.
  • DNS lookup will happen on the first poll.

Proposed Changes

Remove the DNS lookup from the client constructor and delegate this task to the NetworkClient poll method, which means clients won't attempt to resolve DNS upon starting.

See Proposed Changes.

  1. A new config to time out the bootstrap connection.
  2. A new exception type describing this specific issue.
  3. Additional logging upon failure to bootstrap.
  4. A change in failure condition.

Proposed Changes

Public Facing Changes

  • Config: bootstrap.connection.timeout.ms
  • Exception: BootstrapConnectionException extends KafkaException (non-retriable)
  • Logging: log.WARN("Unable to bootstrap, retry in {} ms.")

Internal Changes

  • Client Constructor: Only parse the bootstrap config and validate its format there

  • Bootstrap Connection Timeout: A timeout configuration for connecting to the bootstrap server.
  • NetworkClient:

    • Bootstrapping should now occur in the poll method before attempting to update the metadata. This includes resolving the addresses and bootstrapping the metadata.

    • Logs an error message when the bootstrap process fails

    • If the timeout is exceeded, throw a BootstrapConnectionException, which is non-retriable
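
The flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual NetworkClient change: LazyBootstrap, HostResolver, and the injected clock are hypothetical names introduced here, and the BootstrapConnectionException below extends RuntimeException only to keep the sketch self-contained (the proposal has it extend KafkaException).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.function.LongSupplier;

// Stand-in for the proposed exception; in the proposal it extends KafkaException.
class BootstrapConnectionException extends RuntimeException {
    BootstrapConnectionException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical abstraction over DNS lookup so failures can be simulated.
interface HostResolver {
    InetAddress[] resolve(String host) throws UnknownHostException;
}

// Hypothetical helper: resolution is deferred from the constructor to poll().
class LazyBootstrap {
    private final String host;
    private final long timeoutMs;
    private final HostResolver resolver;
    private final LongSupplier clock;   // returns the current time in ms
    private long firstAttemptMs = -1L;  // set on the first poll()
    private InetAddress[] addresses;    // null until bootstrap succeeds

    LazyBootstrap(String host, long timeoutMs, HostResolver resolver, LongSupplier clock) {
        // The constructor only stores the configuration; no DNS lookup happens here.
        this.host = host;
        this.timeoutMs = timeoutMs;
        this.resolver = resolver;
        this.clock = clock;
    }

    // Returns true once the bootstrap address has been resolved, false if the
    // caller should poll again. Throws once the timeout has elapsed.
    boolean poll() {
        if (addresses != null) {
            return true;
        }
        long now = clock.getAsLong();
        if (firstAttemptMs < 0) {
            firstAttemptMs = now;
        }
        try {
            addresses = resolver.resolve(host);
            return true;
        } catch (UnknownHostException e) {
            long elapsed = now - firstAttemptMs;
            if (elapsed >= timeoutMs) {
                throw new BootstrapConnectionException(
                    "Unable to bootstrap within " + timeoutMs + " ms", e);
            }
            System.err.printf("WARN Unable to bootstrap after %d ms, will retry.%n", elapsed);
            return false;
        }
    }
}
```

Injecting the resolver and the clock is a choice made for this sketch so that the transient-failure and timeout paths can be exercised without real DNS or real waiting.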

...

New Configuration

bootstrap.connection.timeout.ms

...

Type: long
Default: 300000 (5 minutes)
Valid Values: 0 - LONG_MAX
Importance: high
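
As a rough illustration of the values listed above, the following sketch reads the setting from a Properties object, falls back to the 300000 ms default, and rejects negative values. BootstrapTimeoutConfig is a hypothetical helper written for this example; the actual client would presumably define the setting through its own config definitions rather than parse it this way.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of the Kafka client API.
class BootstrapTimeoutConfig {
    static final String NAME = "bootstrap.connection.timeout.ms";
    static final long DEFAULT_MS = 300_000L; // 5 minutes, per the proposal

    // Returns the configured timeout, or the default when the key is absent.
    static long parse(Properties props) {
        String raw = props.getProperty(NAME);
        if (raw == null) {
            return DEFAULT_MS;
        }
        long value = Long.parseLong(raw.trim());
        if (value < 0) {
            // Valid values are 0 through Long.MAX_VALUE.
            throw new IllegalArgumentException(NAME + " must be >= 0, got " + value);
        }
        return value;
    }
}
```

A caller that wants a shorter grace period than the default could, for example, call `props.setProperty("bootstrap.connection.timeout.ms", "60000")` before creating the client.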

New Error

...

Name: BootstrapConnectionException extends KafkaException

...

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

Suppose the user instantiates a KafkaProducer with an invalid bootstrap config. As the producer is instantiated, the sender thread starts running. A warning message is logged every time the NetworkClient tries to poll().

If the user tries to produce messages, the producer callback may be completed with a TimeoutException until the bootstrap timeout runs out.

Eventually, a BootstrapConnectionException will be thrown.

Case 2: Transient Network Issue (For example: transient DNS failure)

Now, suppose the user instantiates a KafkaProducer with a valid bootstrap config, but there is a transient network issue. As the sender thread starts running, a warning message is logged upon trying to bootstrap the client.

If the network issue is resolved before the user tries to produce a message, only warning messages will be logged.

If the user tries to produce a message before the issue is resolved, the sender callback will be completed with a TimeoutException if the network issue persists. The send will complete normally if the network issue is resolved before max.block.ms is exhausted.

AdminClient

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

...