...


Further, large consumer groups are not very practical with our current model for two reasons:
1. The more consumers there are, the likelier it is that one will fail or time out its session, causing a rebalance.
2. Rebalances are upper-bounded in time by the slowest-reacting consumer. The more consumers there are, the higher the chance that one is slow (e.g. it called poll() right before the rebalance and is busy processing the records offline). This means that rebalances are more likely to be long-lived and disruptive to consumer applications.
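
The first point can be illustrated with a rough back-of-the-envelope model. This is a sketch under a simplifying assumption that is not part of this KIP: each consumer independently times out within some fixed window with probability p, so the chance that at least one of n consumers triggers a rebalance is 1 - (1 - p)^n. The function name and the value p = 0.001 are purely illustrative.

```python
def rebalance_probability(n_consumers: int, p_timeout: float) -> float:
    """Probability that at least one of n independent consumers times out.

    Assumes independent, identically likely timeouts (an illustrative
    simplification, not a measured property of any real deployment).
    """
    return 1.0 - (1.0 - p_timeout) ** n_consumers


if __name__ == "__main__":
    # Compare small, medium and large groups for the same per-consumer risk.
    for n in (10, 100, 1000):
        print(n, round(rebalance_probability(n, 0.001), 3))
```

Under this model the risk grows quickly with group size, which is the intuition behind capping the number of members.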


To ensure stability of the broker, this KIP proposes the addition of a configurable upper bound for the number of consumers in a consumer group. Adding such a config will enable server-side protection against buggy/malicious applications. This configuration also gives Admin/Ops teams better control over the cluster, which further enables self-service Kafka for developers.

Public Interfaces

Add a new cluster-level group.max.size config with a default value of -1 (disabled).

Add a new response error:

...

Since the cap should never be reached in practice, the consumer will fatally exit upon receiving this error message.
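
As a minimal sketch of how such a cap might be enforced when a consumer tries to join, consider the simplified model below. It is not Kafka's coordinator code: the `Group` class, the `handle_join` function and the `GroupMaxSizeReached` exception are hypothetical names, and letting an existing member rejoin without counting against the cap is an assumption made for illustration.

```python
from dataclasses import dataclass, field

DISABLED = -1  # default proposed in this KIP: no limit is enforced


class GroupMaxSizeReached(Exception):
    """Hypothetical stand-in for the new response error."""


@dataclass
class Group:
    members: set = field(default_factory=set)


def handle_join(group: Group, member_id: str, group_max_size: int = DISABLED) -> None:
    """Admit member_id unless the group is already at the configured cap."""
    if member_id in group.members:
        return  # assumption: an existing member may always rejoin
    if group_max_size != DISABLED and len(group.members) >= group_max_size:
        raise GroupMaxSizeReached(
            f"group already has {len(group.members)} members"
        )
    group.members.add(member_id)
```

In this model a rejected consumer receives the error and, as described above, would treat it as fatal rather than retry.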

When the Coordinator loads the consumer group state from the log, it will force a rebalance for any group that crosses the `group.max.size` threshold, so that the newly formed generation abides by the size constraint.
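
The load-time check can be sketched as follows. This is a simplified illustration, not the Coordinator's actual implementation: the function name and the dictionary representation of loaded groups are assumptions, and "crossing" the threshold is read here as having strictly more members than the cap.

```python
def groups_needing_rebalance(loaded_groups: dict, group_max_size: int) -> list:
    """Return the names of loaded groups whose member count exceeds the cap.

    loaded_groups maps a group name to a collection of member ids.
    A negative group_max_size (the -1 default) disables the check.
    """
    if group_max_size < 0:
        return []
    return sorted(
        name
        for name, members in loaded_groups.items()
        if len(members) > group_max_size
    )
```

For example, with a cap of 2, a loaded group with three members would be selected for a forced rebalance, while a group with exactly two members would be left alone.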

Compatibility, Deprecation, and Migration Plan

This is a backward-compatible change. Old clients will still fail, by converting the new error to the non-retriable UnknownServerException.
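
The fallback behavior of old clients can be pictured with the sketch below. It models the idea only and is not the Kafka client's code: the lookup table, the function name and the error code 9999 are hypothetical placeholders, and the Python exception merely mirrors the name of the Java one.

```python
class UnknownServerException(Exception):
    """Models the non-retriable exception an old client falls back to."""


# Error codes this (old) client version knows about. Only the "no error"
# entry is listed; a real client knows many more.
KNOWN_ERRORS = {0: "NONE"}

HYPOTHETICAL_NEW_ERROR_CODE = 9999  # placeholder, not the real code


def check_response(error_code: int) -> str:
    """Return the error name, or raise if the code is unknown to this client."""
    name = KNOWN_ERRORS.get(error_code)
    if name is None:
        # An old client cannot interpret the new error, so it fails hard.
        raise UnknownServerException(f"unrecognized error code {error_code}")
    return name
```

Either way the consumer stops: a new client exits on the dedicated error, and an old client exits on the generic one.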

Migration Plan

When upgrading to the new version with a defined `group.max.size` config, we need a way to handle existing groups which cross that threshold.
Since the default value is to disable the config, users who define it should do their due diligence to shrink the consumer groups that cross it or expect them to be shrunk by Kafka.

Rejected Alternatives

  • Topic-level config
    • It is harder to enforce since a consumer group may touch multiple topics. One approach would be to take the min/max of every topic's group size configuration.
    • This fine-grained configurability does not seem needed for the time being and may best be left for the future if the need arises
  • There are other ways of limiting how long a rebalance can take, discussed here
    • In the form of time - have a max rebalance timeout (decoupled from `max.poll.interval.ms`)
      • Lacks strictness: a sufficiently buggy/malicious client could still overload the broker in a small time period
    • In the form of memory - have a maximum memory bound that can be taken up by a single group
      • Lacks intuitiveness: users shouldn't have to think about how much memory a consumer group is taking
  • Default value of 250
    • Large consumer groups are currently considered an anti-pattern and a sensible default value would hint at that well
    • It is better to be considerate of possible deployments that already pass that threshold. A Kafka update shouldn't cause disruption
  • High default value (5000)
    • This might mislead users into thinking big consumer groups aren't frowned upon
  • Do not force rebalance on already-existing groups that cross the configured threshold
    • Groups would then eventually get shrunk as each consumer inevitably gets restarted and is unable to rejoin the already-over-capacity group
    • Users might perceive this as unintuitive behavior
    • Since we settled on a default value that disables the functionality, it is reasonable to be more strict when the config is defined