Status

Current state: Under discussion.

Discussion thread: here  (Not happening yet)

JIRA: here

Motivation

Transient network issues, such as slow DNS resolution, can cause client instantiation to fail.  This is disruptive because the application owner must either implement retry logic or manually restart the application.  Handling such network issues is not straightforward, for the following reasons.

  1. Failed DNS resolution throws a ConfigException, which does not describe the actual problem.  The error message itself is fine, but catching the exception and retrying based on its message is not ideal.

  2. DNS resolution is one of the most common problems we have encountered.  In a dynamic cloud environment, it can take tens of minutes before the bootstrap server is registered with the DNS server, so it is convenient to allow a grace period before failing the client.

Public Interfaces

See Proposed Changes.

  1. A new config that sets a timeout for the bootstrap connection.
  2. A new exception type specific to this failure.
  3. Additional logging upon failure to bootstrap.
  4. A change in the failure condition.

Proposed Changes

Public Facing Changes

  • Config: bootstrap.connection.timeout.ms
  • Exception: BootstrapConnectionException extends KafkaException (non-retriable)
  • Logging: log.WARN("Unable to bootstrap, retry in {} ms.")

Internal Changes

  • Client Constructor: Only parses the bootstrap config and validates its format.

  • NetworkClient:

    • Bootstrapping now occurs in the poll method, before attempting to update the metadata. This includes resolving the addresses and bootstrapping the metadata.

    • Logs an error message when the bootstrap process fails.

    • If the bootstrap timeout is exceeded, throws a BootstrapConnectionException, which is non-retriable.
  • Consumer Client: Bootstrap logic will be moved.
  • Producer Client: Bootstrap logic will be moved.
  • Admin Client: Bootstrap logic will be moved.
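
The deferred bootstrap described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual NetworkClient code: the class DeferredBootstrapSketch, the Resolver interface, and the nested exception are hypothetical stand-ins, and time is passed in explicitly so the behavior is easy to follow.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical sketch: the real NetworkClient is far more involved.
final class DeferredBootstrapSketch {

    /** Stand-in for the proposed BootstrapConnectionException (assumption). */
    static final class BootstrapConnectionException extends RuntimeException {
        BootstrapConnectionException(long timeoutMs) {
            super("Unable to establish a connection to the bootstrap server in " + timeoutMs + "ms.");
        }
    }

    /** Hypothetical seam so that DNS lookups can be replaced in tests. */
    interface Resolver {
        InetAddress[] resolve(String host) throws UnknownHostException;
    }

    private final String host;
    private final long timeoutMs;
    private final Resolver resolver;
    private long firstAttemptMs = -1L;   // set on the first poll, not in the constructor
    private InetAddress[] addresses;     // null until bootstrap succeeds

    DeferredBootstrapSketch(String host, long timeoutMs, Resolver resolver) {
        // The constructor only validates the config; it performs no DNS lookup.
        if (host == null || host.isEmpty()) {
            throw new IllegalArgumentException("bootstrap host must not be empty");
        }
        if (timeoutMs < 0) {
            throw new IllegalArgumentException("timeout must be >= 0");
        }
        this.host = host;
        this.timeoutMs = timeoutMs;
        this.resolver = resolver;
    }

    /** Attempts to bootstrap if needed; returns true once bootstrapped. */
    boolean poll(long nowMs) {
        if (addresses != null) {
            return true;
        }
        if (firstAttemptMs < 0) {
            firstAttemptMs = nowMs;      // the bootstrap timer starts at the first attempt
        }
        try {
            addresses = resolver.resolve(host);
            return true;
        } catch (UnknownHostException e) {
            if (nowMs - firstAttemptMs >= timeoutMs) {
                throw new BootstrapConnectionException(timeoutMs);
            }
            // A real client would log a warning here and retry on the next poll.
            return false;
        }
    }
}
```

In this sketch a timeout of 0 fails on the first unsuccessful attempt, and production code could pass `InetAddress::getAllByName` as the resolver.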

New Configuration

bootstrap.connection.timeout.ms

The maximum amount of time the client may spend trying to resolve the bootstrap server's IP address and establish a connection to it. If this time is exceeded, a BootstrapConnectionException will be thrown.

Note: the default value is up for discussion. It could be 0, which matches the current behavior: the client fails on the first unsuccessful attempt.

Type: long
Default: 300000 (5 minutes)
Valid Values: 0 - LONG_MAX
Importance: high
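
As an illustration, an application might set the proposed config as sketched below. The key bootstrap.connection.timeout.ms exists only in this proposal, and the helper class and host name are hypothetical; the sketch uses plain java.util.Properties so it does not depend on any client version.

```java
import java.util.Properties;

// Hypothetical helper, shown only to illustrate how the proposed key would be set.
final class BootstrapTimeoutConfigExample {

    // Proposed key from this KIP; it is not part of any released client.
    static final String BOOTSTRAP_CONNECTION_TIMEOUT_MS = "bootstrap.connection.timeout.ms";

    static Properties clientProperties(String bootstrapServers, long timeoutMs) {
        // Valid values are 0 to Long.MAX_VALUE, so negative values are rejected.
        if (timeoutMs < 0) {
            throw new IllegalArgumentException(BOOTSTRAP_CONNECTION_TIMEOUT_MS + " must be >= 0");
        }
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty(BOOTSTRAP_CONNECTION_TIMEOUT_MS, Long.toString(timeoutMs));
        return props;
    }
}
```

For example, `clientProperties("broker.example.com:9092", 300000L)` would give the client the proposed five-minute grace period.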

New Error

Name: BootstrapConnectionException extends KafkaException

Message: "Unable to establish a connection to the bootstrap server in {}ms."

Type: Non-retriable.
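
A possible shape for the new exception is sketched below. The nested KafkaException is a local stand-in for org.apache.kafka.common.KafkaException so that the snippet compiles on its own, and the constructor signature is an assumption rather than part of the proposal.

```java
// Hypothetical wrapper class; it only keeps the sketch self-contained.
final class NewErrorSketch {

    /** Local stand-in for org.apache.kafka.common.KafkaException (assumption). */
    static class KafkaException extends RuntimeException {
        KafkaException(String message) {
            super(message);
        }
    }

    /**
     * Proposed exception. It extends KafkaException directly rather than
     * RetriableException, which is what marks it as non-retriable.
     */
    static class BootstrapConnectionException extends KafkaException {
        BootstrapConnectionException(long timeoutMs) {
            super("Unable to establish a connection to the bootstrap server in " + timeoutMs + "ms.");
        }
    }
}
```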

Compatibility, Deprecation, and Migration Plan

  • Client Behaviors

    • Clients won’t attempt to resolve the bootstrap addresses upon initialization.

    • Clients will retry bootstrapping until the bootstrap timer expires.

    • KafkaConsumer: If bootstrapping fails, it is retried on subsequent calls to poll().  Each attempt is bounded by the poll timer.

    • KafkaAdminClient: The bootstrap exception is thrown when the user tries to materialize the result future. The retry is bounded by the API timeout.

    • KafkaProducer: Bootstrapping is done in the background thread.  If the client hasn't been bootstrapped when the user attempts to send, the call can block for up to max.block.ms.

  • Exception Handling

    • Failed DNS resolution will result in NetworkException

Case Study

Every time the network client attempts to bootstrap and fails, a warning message will be logged.  This section outlines how each client reacts to bootstrap failures. In particular, it covers two common cases:

  1. Misconfiguration or non-transient issues with the network
  2. Transient network issues, e.g., slow DNS resolution.

KafkaConsumer

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

When the bootstrap timeout expires, the client will throw a BootstrapConnectionException on poll().

Case 2: Transient Network Issue (For example: transient DNS failure)

Now, suppose the user instantiates a KafkaConsumer with a valid bootstrap config, but there is a transient network issue, such as slow DNS resolution. poll() will continue to return empty records until the issue is resolved.  If the issue is not resolved within the bootstrap timeout, a BootstrapConnectionException will be thrown.

KafkaProducer

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

send() and partitionsFor() will throw a BootstrapConnectionException when the bootstrap timeout expires. If max.block.ms elapses before the bootstrap timeout expires, a TimeoutException will be thrown instead.

Case 2: Transient Network Issue (For example: transient DNS failure)

The send() and partitionsFor() methods will block on bootstrap until either max.block.ms or the bootstrap timeout elapses.
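
The interaction between the two producer timers can be sketched as below. The class, enum, and method names are hypothetical, and the tie-breaking rule when both timers expire at the same instant is an assumption, since the proposal does not specify it.

```java
// Hypothetical sketch of which exception a blocked send() would surface
// if bootstrapping never succeeds.
final class ProducerBlockingSketch {

    enum Outcome { TIMEOUT_EXCEPTION, BOOTSTRAP_CONNECTION_EXCEPTION }

    /**
     * @param maxBlockMs           value of max.block.ms for this call
     * @param remainingBootstrapMs time left on the bootstrap timer when the call starts
     */
    static Outcome outcomeIfBootstrapNeverSucceeds(long maxBlockMs, long remainingBootstrapMs) {
        // Whichever timer expires first decides the exception.
        // Assumption: on a tie, the bootstrap failure is reported.
        return maxBlockMs < remainingBootstrapMs
                ? Outcome.TIMEOUT_EXCEPTION
                : Outcome.BOOTSTRAP_CONNECTION_EXCEPTION;
    }
}
```

With the producer's default max.block.ms of 60000 and the proposed default bootstrap timeout of 300000, an early send() against an unreachable bootstrap server would therefore surface a TimeoutException, and only calls made near the end of the bootstrap window would see the BootstrapConnectionException.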

AdminClient

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

The API call results will either time out, if the request timeout elapses first, or be completed exceptionally with a BootstrapConnectionException.

Case 2: Transient Network Issue (For example: transient DNS failure)

If there is a transient network issue, such as a transient DNS failure, the user won't be able to get the results back until the bootstrap issue is resolved.  In the meantime, pending API calls will time out.

Test Plan

  1. NetworkClient

    1. Test DNS resolution upon its initial poll

    2. Test if the right exception type is thrown

  2. Existing clients (Consumer, Producer, AdminClient)

    1. Test successful bootstrapping upon retrying

Rejected Alternatives

  1. Maintain the current code behavior and add a retry loop with a timeout.

    1. Pros: Same logic, less code change.

    2. Cons: Do people want object instantiation to block? I don't think it is a good idea.

  2. Throw a DNS resolution exception upon failure, without retrying.

    1. Pros: No additional config is needed

    2. Cons: This is a behavioral change, and the application owner might need to rewrite the exception handling, e.g. to catch the DNS failure.

  3. No timeout.  The network client will continue to retry until it is interrupted.
    1. Pros: No compatibility break and no additional exception handling logic. The network client just logs the error and retries on the next poll.
    2. Cons: I think we should have some failure mechanism to notify users.