
Status

Current state: Accepted

Discussion thread: Here

JIRA: KAFKA-7641

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

This KIP is part of a series of related proposals which aim to solidify Kafka's consumer group protocol:

KIP-345: Introduce static membership protocol to reduce consumer rebalances

KAFKA-7610

KIP-394: Require member.id for initial join group request

Motivation

Consumer groups are an essential mechanism of Kafka. They allow consumers to share load and scale elastically by dynamically assigning the partitions of topics to consumers. In our current model of consumer groups, whenever a rebalance happens, every consumer in that group experiences downtime: its poll() calls block until every other consumer in the group calls poll(). This is because every consumer needs to call JoinGroup during a rebalance in order to confirm it is still in the group.

...

The bigger problem is the potential risk described in KAFKA-7610, where N faulty (or even malicious) clients could result in the broker thinking more than N consumers are joining during the rebalance. This has the potential to burst broker memory before the session timeout occurs and puts additional CPU strain on the Coordinator Broker - causing problems for other consumer groups using the same coordinator.
The root of the problem isn't necessarily the client's behavior (clients can behave any way they want); it is the fact that the broker has no way to shield itself from such a scenario.

Consumer Group Size Considerations

Large consumer groups can be seen as an anti-pattern. To summarize, we have the following concerns:

Memory usage of stable groups is not very high, but the runaway consumer group scenario described in KAFKA-7610 can reach large consumer numbers very quickly and affect memory (rough memory usage documented)
CPU spikes - there are a number of O(N) operations done on the consumers collection for a group
Rebalance times do not grow linearly with the consumer group size. Unfortunately, we do not have any concrete results, just anecdotes. It is recommended to be wary of rebalance frequency and duration when consumer counts reach the hundreds

Proposition

We propose to address the critical stability issue by adding a configurable upper bound on the number of consumers in a consumer group. Adding such a config will enable server-side protection against buggy/malicious applications.
This configuration also gives Admin/Ops teams better control over the cluster, limiting the ways in which novice developers can shoot themselves in the foot (via large consumer groups).
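
To make the idea concrete, here is a minimal sketch of what a server-side admission check could look like. It is not Kafka's coordinator code: the class `GroupSizeLimitSketch`, its `tryJoin` method, and the choice to always let existing members rejoin are assumptions made for illustration only.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration only; names and behavior are assumptions,
// not Kafka's actual group coordinator implementation.
final class GroupSizeLimitSketch {
    private final int groupMaxSize;                       // configured upper bound
    private final Set<String> members = new LinkedHashSet<>();

    GroupSizeLimitSketch(int groupMaxSize) {
        if (groupMaxSize < 1) {
            throw new IllegalArgumentException("group max size must be at least 1");
        }
        this.groupMaxSize = groupMaxSize;
    }

    // Returns true if the member is in the group after the call,
    // false if the join was rejected because the group is full.
    boolean tryJoin(String memberId) {
        if (members.contains(memberId)) {
            return true;                                  // existing member rejoining
        }
        if (members.size() >= groupMaxSize) {
            return false;                                 // new member rejected: group is full
        }
        members.add(memberId);
        return true;
    }

    int size() {
        return members.size();
    }

    public static void main(String[] args) {
        GroupSizeLimitSketch group = new GroupSizeLimitSketch(2);
        System.out.println(group.tryJoin("consumer-1"));  // admitted
        System.out.println(group.tryJoin("consumer-2"));  // admitted
        System.out.println(group.tryJoin("consumer-3"));  // rejected, group is full
    }
}
```

Because the check runs before a new member is stored, a burst of faulty join attempts cannot grow the member collection beyond the configured bound.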

Public Interfaces

Add a new cluster-level group.max.size config with a default value of `Int.MAX_VALUE`.
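
As a rough sketch of how such a setting might be resolved from broker properties, the snippet below falls back to `Integer.MAX_VALUE` (effectively no limit) when the key is absent. The helper class `GroupMaxSizeConfigSketch` and the rule that the value must be at least 1 are assumptions for illustration, not something this KIP specifies.

```java
import java.util.Properties;

// Hypothetical helper; only the key name "group.max.size" and the
// Int.MAX_VALUE default come from the text above.
final class GroupMaxSizeConfigSketch {
    static final String GROUP_MAX_SIZE = "group.max.size";

    static int groupMaxSize(Properties brokerProps) {
        String raw = brokerProps.getProperty(GROUP_MAX_SIZE);
        if (raw == null) {
            return Integer.MAX_VALUE;                     // default: effectively unlimited
        }
        int value = Integer.parseInt(raw.trim());
        if (value < 1) {
            // Assumed validation: a group must be able to hold at least one member.
            throw new IllegalArgumentException(GROUP_MAX_SIZE + " must be at least 1");
        }
        return value;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(groupMaxSize(props));          // default applies
        props.setProperty(GROUP_MAX_SIZE, "250");
        System.out.println(groupMaxSize(props));          // explicit limit applies
    }
}
```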

Add a new response error:

...
