
Status

Current state: Under Discussion

Discussion thread: here

JIRA: KAFKA-8206

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

A Kafka client performs bootstrapping when it’s initialized, i.e. it connects to a server from bootstrap.servers and fetches the cluster metadata, including the addresses of online brokers. The list of brokers from the fetched metadata is then used for the real work. The client periodically updates the metadata during the network client’s polls, so even if the set of brokers changes over time, this generally works well. However, subsequent metadata updates are fetched from brokers already known to the client, not from the bootstrap servers.

...

JIRA issues: KAFKA-8206, KAFKA-12480, KAFKA-13405, KAFKA-13467, KAFKA-3068

Proposed Changes

This KIP proposes to allow Java clients (producer and consumer instances) to repeat the bootstrap process when updating metadata if none of the known brokers are available. A broker is unavailable when the client doesn't have an established connection with it and cannot establish one (e.g. due to the reconnect backoff).

During the rebootstrap process, the client forgets the brokers it knows about and falls back on the bootstrap brokers (i.e. those originally provided via bootstrap.servers in the client configuration), as if it had just been initialized.

The client will check the cluster ID returned by the broker during the rebootstrap process. If no cluster ID was known to the client (i.e. it was originally bootstrapped with an old broker version that doesn't support cluster IDs), any returned value will be considered valid. Otherwise, the client will fail if the returned cluster ID doesn't match the previously known one.
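The rules described so far can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class and method names (RebootstrapSketch, shouldRebootstrap, rebootstrap, checkClusterId) are hypothetical and do not come from the Kafka code base, and brokers are modelled as plain host:port strings.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the proposed behavior; not the actual client code.
final class RebootstrapSketch {

    // Rebootstrap only if the strategy enables it and every broker
    // currently known to the client is unavailable.
    static boolean shouldRebootstrap(String strategy,
                                     List<String> knownBrokers,
                                     Set<String> unavailableBrokers) {
        return "rebootstrap".equals(strategy)
                && unavailableBrokers.containsAll(knownBrokers);
    }

    // Forget the known brokers and start again from bootstrap.servers,
    // as if the client had just been initialized.
    static List<String> rebootstrap(List<String> bootstrapServers) {
        return List.copyOf(bootstrapServers);
    }

    // Accept any cluster ID if none was known before; otherwise fail
    // when the returned ID differs from the previously known one.
    static String checkClusterId(String knownClusterId, String returnedClusterId) {
        if (knownClusterId != null && !knownClusterId.equals(returnedClusterId)) {
            throw new IllegalStateException("Cluster ID changed from "
                    + knownClusterId + " to " + returnedClusterId);
        }
        return returnedClusterId;
    }
}
```

How the real client signals a cluster ID mismatch is not specified here; the IllegalStateException above is a placeholder for "the client will fail".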

The admin client behaves differently and doesn't update metadata. Because of this, it is excluded from the scope of this KIP.

reconnect.backoff.max.ms can be configured so low that brokers that are truly unavailable will never be considered as such, i.e. they will always be eligible for reconnect. This is a known limitation. Unfortunately, it's hard to find a good criterion for when to ignore this and trigger rebootstrapping nevertheless. It was decided to keep this out of the scope of this KIP.
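To see why a very low cap matters, consider a simplified model of reconnect backoff: the wait grows exponentially with consecutive failures and is capped at a maximum. The sketch below is an assumption-laden illustration, not the client's actual logic; the names BackoffSketch, backoffMs and eligibleForReconnect are hypothetical, and jitter is omitted.

```java
// Hypothetical, simplified model of capped exponential reconnect backoff.
final class BackoffSketch {

    // Backoff after the given number of consecutive failures,
    // doubling from baseMs and capped at maxMs (jitter omitted).
    static long backoffMs(long baseMs, long maxMs, int failures) {
        double exponential = baseMs * Math.pow(2, Math.max(0, failures - 1));
        return (long) Math.min(exponential, maxMs);
    }

    // A broker is eligible for reconnect once the backoff has elapsed
    // since the last connection attempt.
    static boolean eligibleForReconnect(long nowMs, long lastAttemptMs, long backoffMs) {
        return nowMs - lastAttemptMs >= backoffMs;
    }
}
```

In this model, a cap of a few milliseconds means the backoff has almost always elapsed by the time the client checks, so the broker keeps looking eligible for reconnect and the "all brokers unavailable" condition is never met.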

Since this changes the user-facing behavior, it’s proposed to make this configurable (see Public Interfaces), defaulting to the current behavior.

Public Interfaces

Configuration Keys

Key Name	Description	Valid Values	Default Value
metadata.recovery.strategy	Controls how the consumer or producer client recovers when none of the brokers known to it are available. If set to `none`, the client fails. If set to `rebootstrap`, the client repeats the bootstrap process using `bootstrap.servers`. `reconnect.backoff.max.ms` may be so low that it prevents brokers from being identified as unavailable; in that case, rebootstrapping won't happen.	`none`, `rebootstrap`	`none`
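As an illustration, a client could opt in to the new behavior as follows. This is a minimal sketch using plain java.util.Properties; the class name RebootstrapConfigExample and the broker addresses are placeholders, and the resulting properties would be passed to a producer or consumer constructor in the usual way.

```java
import java.util.Properties;

// Hypothetical example class; only the configuration keys come from this KIP.
final class RebootstrapConfigExample {

    static Properties clientProps() {
        Properties props = new Properties();
        // Placeholder addresses; used for the initial bootstrap and for rebootstrapping.
        props.setProperty("bootstrap.servers", "broker1:9092,broker2:9092");
        // Opt in to rebootstrapping; the default `none` keeps the current behavior.
        props.setProperty("metadata.recovery.strategy", "rebootstrap");
        return props;
    }
}
```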

Compatibility, Deprecation, and Migration Plan

Migrating to the new version will have no impact on clients as the default configuration value keeps the old behavior.

...

No special migration process or tool is needed to migrate to the new version.

Test Plan

The proposed change can be tested at the integration level. This KIP proposes two test cases, one for the producer and one for the consumer. In the tests, clients will bootstrap using bootstrap.servers=broker1,broker2, where only broker1 is in the cluster. After that, broker1 will be shut down, broker2 will be brought up, and the client will be made to communicate with the cluster. Because broker1, previously known to the client, is unavailable, the client will be forced to rebootstrap and connect to broker2.

Also, the cluster ID checking logic will be tested, preferably at the unit level.

Rejected Alternatives

One alternative is to introduce a thread that periodically refreshes metadata in the background, independently of the network client's explicit polls. This was considered more complex and would introduce new failure modes, while bringing little additional value compared to the proposed approach.

...
