Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Modify motivation section

...

Today, if the client has configured `max.poll.interval.ms` to a large value, the group coordinator broker will take in an unlimited number of join group requests and the rebalance could therefore continue for an unbounded amount of time. 
Further, large consumer groups are not very practical with our current model due to two reasons:
1. The more consumers there are, the likelier it is that one will fail/timeout its session, causing a rebalance
2. Rebalances are upper-bounded in time by the slowest-reacting consumer. The more consumers, the higher the chance one is slow (e.g called poll() right before the rebalance and is busy processing the records offline). This means that rebalances are more likely to be long-lived and disruptive to consumer applications.

There is also the
potential risk described in The bigger problem is the potential risk described in 

Jira
serverASF JIRA
serverId5aa69414-a9e9-3523-82ec-879b028fb15b
keyKAFKA-7610
 where N faulty (or even malicious) clients could result in the broker thinking more than N consumers are joining during the rebalance. ( This has the potential to burst broker memory before the session timeout occurs and puts additional CPU strain ).  on the Coordinator Broker - causing problems for other consumer groups using the same coordinator.
The root of the problem isn't necessarily the client's behavior (clients can behave any way they want), it is the fact that the broker has no way to shield itself from such a scenario.
Further, large consumer groups are not very practical with our current model due to two reasons:
1. The more consumers there are, the likelier it is that one will fail/timeout its session, causing a rebalance
2. Rebalances are upper-bounded in time by the slowest-reacting consumer. The more consumers, the higher the chance one is slow (e.g called poll() right before the rebalance and is busy processing the records offline). This means that rebalances are more likely to be long-lived and disruptive to consumer applications


To ensure stability of the broker, this KIP proposes the We propose to address the critical stability issue via the addition of a configurable upper-bound for the number of consumers in a consumer group. Adding such a config will enable server-side protection against buggy/malicious applications. There  
It is also value useful in the sense that this configuration gives Admin/Ops teams better control over the cluster, further enabling self-service Kafka which developers can uselimiting the ways in which novice developers can shoot themselves in the foot (via large consumer groups).

Public Interfaces

Add a new cluster-level group.max.size config with a default value of -1 (disabled).

...