...

  • Client Behaviors

    • Clients won’t attempt to resolve the bootstrap addresses upon initialization.

    • Clients will retry bootstrapping until the bootstrap timer expires.

    • KafkaConsumer: Users retry bootstrapping via poll() if it fails. Each retry is bounded by the poll timer.

    • KafkaAdminClient: Bootstrap exception is thrown when user tries to materialize the result future. The retry is bounded by the API timeout.

    • KafkaProducer: Bootstrap is done in the background thread. If the client hasn't been bootstrapped when the user attempts to send, the call can block for up to max.block.ms.

  • Exception Handling

    • Failed DNS resolution will result in NetworkException
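The behaviors listed above can be pictured with a small, self-contained model. This is only a sketch under stated assumptions, not the actual NetworkClient code: the LazyBootstrap class, its Resolver interface, the injected clock, and the local BootstrapConnectionException stand-in are all hypothetical names introduced for illustration, and starting the bootstrap timer on the first attempt is an assumption as well.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.function.LongSupplier;

/** Illustrative model of deferred bootstrap resolution; not the actual NetworkClient code. */
final class LazyBootstrap {

    /** Stand-in for a DNS lookup so the sketch can run without a network. */
    interface Resolver {
        List<InetAddress> resolve(String host) throws UnknownHostException;
    }

    /** Local stand-in for the proposed exception; not part of any released Kafka client. */
    static final class BootstrapConnectionException extends RuntimeException {
        BootstrapConnectionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private final String host;
    private final Resolver resolver;
    private final LongSupplier clockMs;
    private final long bootstrapTimeoutMs;
    private long firstAttemptMs = -1L;   // assumption: the bootstrap timer starts on the first attempt
    private List<InetAddress> addresses; // null until resolution succeeds

    LazyBootstrap(String host, Resolver resolver, LongSupplier clockMs, long bootstrapTimeoutMs) {
        // No DNS lookup here: construction never fails because of an unresolvable address.
        this.host = host;
        this.resolver = resolver;
        this.clockMs = clockMs;
        this.bootstrapTimeoutMs = bootstrapTimeoutMs;
    }

    /** Makes one bootstrap attempt. Returns true once the addresses are resolved. */
    boolean tryBootstrap() {
        if (addresses != null) {
            return true;
        }
        long now = clockMs.getAsLong();
        if (firstAttemptMs < 0) {
            firstAttemptMs = now;
        }
        try {
            addresses = resolver.resolve(host);
            return true;
        } catch (UnknownHostException e) {
            if (now - firstAttemptMs >= bootstrapTimeoutMs) {
                throw new BootstrapConnectionException("Bootstrap timed out for " + host, e);
            }
            // Every failed attempt logs a warning; the caller retries on its next poll.
            System.err.println("WARN: bootstrap attempt failed for " + host + ", will retry");
            return false;
        }
    }
}
```

In this model a caller invokes tryBootstrap() from its poll loop: a false return means "try again on the next poll", and the exception only surfaces once the bootstrap timer has run out.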

Case Study

...

Every time the network client attempts to bootstrap and fails, a warning message will be logged. In this section, I outline how clients can react to bootstrap failures. In particular, I want to cover two common cases:

  1. Misconfiguration or non-transient issues with the network
  2. Transient network issues, e.g., slow DNS resolution.

KafkaConsumer

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

If the user instantiates a KafkaConsumer with an invalid bootstrap configuration and then calls poll(), the method will block until the poll timer or the bootstrap timeout expires. When the bootstrap timeout expires, the client will throw a BootstrapConnectionException on poll().

Case 2: Transient Network Issue (For example: transient DNS failure)

Now, suppose the user instantiates a KafkaConsumer with a valid bootstrap config, but there is a transient network issue, such as slow DNS resolution. When the user calls poll(), if the transient error is resolved before the poll timer runs out, the client will behave normally. Otherwise, poll() will continue to return empty records until the issue is resolved. If the issue cannot be resolved within the bootstrap timeout, a BootstrapConnectionException will be thrown.
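From the application's side, the two consumer cases could be handled roughly as follows. This is a minimal sketch, assuming a hypothetical RecordSource interface that stands in for the poll(Duration) part of KafkaConsumer and a local BootstrapConnectionException stand-in for the proposed exception; neither is a real Kafka class.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

final class ConsumerBootstrapHandling {

    /** Local stand-in for the proposed exception; not part of any released Kafka client. */
    static final class BootstrapConnectionException extends RuntimeException {
        BootstrapConnectionException(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for the part of KafkaConsumer used here. */
    interface RecordSource {
        List<String> poll(Duration timeout);
    }

    /**
     * Polls up to maxPolls times and collects whatever arrives.
     * Empty results are treated as "not ready yet" and retried (Case 2);
     * a bootstrap exception is treated as fatal and propagates to the caller (Case 1).
     */
    static List<String> pollLoop(RecordSource source, Duration pollTimeout, int maxPolls) {
        List<String> received = new ArrayList<>();
        for (int i = 0; i < maxPolls; i++) {
            // May throw BootstrapConnectionException once the bootstrap timeout has expired.
            List<String> batch = source.poll(pollTimeout);
            if (batch.isEmpty()) {
                continue; // bootstrap may still be in progress; try again
            }
            received.addAll(batch);
        }
        return received;
    }
}
```

The point of the sketch is that an empty poll needs no special handling, while the exception is the signal that retrying alone is unlikely to help.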

KafkaProducer

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

The BootstrapConnectionException will be thrown in send() and partitionsFor() when the bootstrap timeout expires. If the KafkaProducer is instantiated with an invalid bootstrap config, a warning message is logged every time the NetworkClient tries to bootstrap. If max.block.ms elapses before the bootstrap timeout expires, a TimeoutException will be thrown instead.

Case 2: Transient Network Issue (For example: transient DNS failure)

Now, suppose the user instantiates a KafkaProducer with a valid bootstrap config, but there is a transient network issue. As the sender thread starts running, a warning message is logged upon trying to bootstrap the client.

If the network issue is resolved before the user tries to produce a message, only warning messages will be logged.

If the user tries to produce a message before the issue is resolved, the send() and partitionsFor() methods will be blocked on bootstrap until either max.block.ms or the bootstrap timeout elapses. The send will be completed normally if the network issue is resolved before then; if the network issue persists, the send will be completed with a TimeoutException.
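The interplay between max.block.ms and the bootstrap timeout in the two producer cases can be worked through with a small decision function. This is a sketch only: the class, the Outcome enum, and the parameter names are hypothetical, all times are measured from the send() call, and the tie-breaking when both timers expire at the same instant is an assumption that the text above does not specify.

```java
final class ProducerBlockOutcome {

    enum Outcome { UNBLOCKED, TIMEOUT_EXCEPTION, BOOTSTRAP_CONNECTION_EXCEPTION }

    /**
     * @param maxBlockMs           max.block.ms, measured from the send() call
     * @param bootstrapRemainingMs time left on the bootstrap timer at the send() call
     * @param resolvedAfterMs      time until the network issue clears, or -1 if it never does
     */
    static Outcome onSend(long maxBlockMs, long bootstrapRemainingMs, long resolvedAfterMs) {
        long deadlineMs = Math.min(maxBlockMs, bootstrapRemainingMs);
        if (resolvedAfterMs >= 0 && resolvedAfterMs < deadlineMs) {
            return Outcome.UNBLOCKED; // bootstrap finished in time, so the send proceeds
        }
        // Assumption: when both timers expire together, the bootstrap exception wins.
        return maxBlockMs < bootstrapRemainingMs
                ? Outcome.TIMEOUT_EXCEPTION
                : Outcome.BOOTSTRAP_CONNECTION_EXCEPTION;
    }
}
```

For example, with max.block.ms at 60000 and 300000 ms left on the bootstrap timer, an issue that never clears yields TIMEOUT_EXCEPTION, whereas with only 30000 ms left on the bootstrap timer it yields BOOTSTRAP_CONNECTION_EXCEPTION.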

AdminClient

Case 1: Unable to connect to the bootstrap (For example: misconfiguration)

When the user instantiates a new admin client and makes admin client API calls, the API call results will either time out, if the request times out first, or be completed exceptionally with a BootstrapConnectionException. Note that a warning message will be logged every time the NetworkClient tries to bootstrap.

Case 2: Transient Network Issue (For example: transient DNS failure)

If there is a transient network issue, such as a transient DNS failure, the user won't be able to get the results back until the bootstrap issue is resolved. Meanwhile, if the call timeout expires, the request will be completed exceptionally with a TimeoutException.
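Because the admin client surfaces failures only when the result future is materialized, the caller's handling could look like the sketch below. It uses java.util.concurrent.CompletableFuture as a stand-in for the admin client's result future, and the nested BootstrapConnectionException and TimeoutException classes are local stand-ins introduced for illustration, not Kafka classes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

final class AdminResultHandling {

    /** Local stand-in for the proposed exception; not part of any released Kafka client. */
    static final class BootstrapConnectionException extends RuntimeException {
        BootstrapConnectionException(String message) {
            super(message);
        }
    }

    /** Local stand-in for the timeout error of an expired API call. */
    static final class TimeoutException extends RuntimeException {
        TimeoutException(String message) {
            super(message);
        }
    }

    /** Materializes the result and maps each failure mode to a short description. */
    static String describe(CompletableFuture<String> result) {
        try {
            return "ok: " + result.get();
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof BootstrapConnectionException) {
                return "bootstrap failed"; // Case 1: likely misconfiguration
            }
            if (cause instanceof TimeoutException) {
                return "timed out";        // the call expired before bootstrap completed
            }
            return "failed";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return "interrupted";
        }
    }
}
```

Nothing is thrown at the point where the API call is made in this model; both failure modes arrive wrapped in the ExecutionException raised by get().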

Test Plan

  1. NetworkClient

    1. Test DNS resolution upon its initial poll

    2. Test if the right exception type is thrown

  2. Existing clients (Consumer, Producer, AdminClient)

    1. Test successful bootstrapping upon retrying

Rejected Alternatives

  1. Maintain the current code behavior and add a retry loop with a timeout.

    1. Pros: Same logic, less code change.

    2. Cons: Do people want object instantiation to block? I don't think it is a good idea.

  2. Throw a DNS resolution exception upon failure, with no retry.

  3. Allow the application owner to specify a retry period. The clients will fail after exceeding the timeout. The default is set to 0s, which makes retry an opt-in config.

    1. Pros: Allows users to have more control over how long to retry

    2. Cons: Requires a new config; client instantiation can block.

  4. No retry. Let the application owner handle the DNS resolution exception. This means we would still throw a DNSLookupException upon failing.

    1. Pros: No additional config is needed

    2. Cons: This is a behavioral change, and the application owner might need to rewrite the exception handling, i.e., the logic that catches the DNS failure.

  5. Do not throw an exception; let NetworkClient retry on poll. The network client will continue to retry until it is interrupted.
    1. Pros: No compatibility break and no additional exception handling logic; the network client will just log the error and retry upon the next poll.
    2. Cons: As the discussion thread mentioned, the client wouldn't fail upon starting. I think we should have some failure mechanism to notify users.