Status
Current state: Under Discussion
Discussion thread: TBD
JIRA:
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Today the group coordinator accepts an unlimited number of join group requests into the membership metadata. There is a potential risk, described in the JIRA, where too many illegal joining members can exhaust broker memory before the session timeout garbage-collects them. To ensure broker stability, we propose to enforce a hard limit on the size of a consumer group in order to prevent explosion of the server-side cache/memory.
Public Interfaces
We propose to add a new configuration to KafkaConfig.scala; its behavior will affect the following coordinator APIs:
def handleJoinGroup(...)
def handleSyncGroup(...)
where we shall enforce the group size capping rules on incoming requests.
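To illustrate the intended enforcement point, below is a minimal, self-contained sketch (not actual GroupCoordinator code; names such as SimpleGroup and checkJoin are hypothetical stand-ins) of how a join request could be rejected once a group is at capacity, while existing members are still allowed to rejoin:
object GroupSizeCapSketch {
  // Illustrative stand-in for the coordinator's per-group membership state.
  final case class SimpleGroup(groupId: String, members: Set[String])

  sealed trait JoinResult
  case object Accepted extends JoinResult
  case object GroupMaxSizeReached extends JoinResult

  // Reject unknown members once the group has reached group.max.size;
  // members already in the group may still rejoin.
  def checkJoin(group: SimpleGroup, memberId: String, groupMaxSize: Int): JoinResult =
    if (!group.members.contains(memberId) && group.members.size >= groupMaxSize)
      GroupMaxSizeReached
    else
      Accepted

  def main(args: Array[String]): Unit = {
    val group = SimpleGroup("my-group", Set("member-1", "member-2"))
    println(checkJoin(group, "member-3", groupMaxSize = 2)) // GroupMaxSizeReached
    println(checkJoin(group, "member-1", groupMaxSize = 2)) // Accepted (rejoin of existing member)
  }
}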
Proposed Changes
We shall add a config called group.max.size on the coordinator side.
val GroupMaxSizeProp = "group.max.size"
...
val GroupMaxSize = 1000000
...
.define(GroupMaxSizeProp, INT, Defaults.GroupMaxSize, MEDIUM, GroupMaxSizeDoc)
The default value of 1,000,000 proposed here is based on a rough size estimate of 120 B of metadata per member, so the maximum memory usage per group is roughly 120 B * 1,000,000 ≈ 120 MB, which should leave 5x~10x headroom for most use cases I know of. Further discussion on the default value is welcome!
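For illustration, an operator who wants a tighter bound than the default could override the new property in the broker configuration like any other broker-level setting (the value below is purely an example, not a recommendation):
group.max.size=10000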
Implementation-wise, we shall block registration of new members once a group reaches its capacity, and define a new error type:
GROUP_MAX_SIZE_REACHED(77, "Consumer group is already at its full capacity.", GroupMaxSizeReachedException::new);
Since the cap should never be reached under normal operation, a consumer that receives this error will fail itself rather than retry, to reduce load on the broker; reaching the capacity limit is a red flag indicating a client-side logic bug, and treating the error as fatal helps protect server stability.
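A simplified, hypothetical sketch of this fail-fast client behavior follows (the real change would live in the consumer's join-group response handling; GroupMaxSizeReachedException here mirrors the proposed exception and the object name is illustrative):
object GroupMaxSizeErrorHandlingSketch {
  // Illustrative stand-in for the proposed fatal exception.
  class GroupMaxSizeReachedException(msg: String) extends RuntimeException(msg)

  // Error code proposed in this KIP.
  val GroupMaxSizeReachedCode: Short = 77

  // Retriable errors would normally trigger a rejoin; the new error is treated
  // as fatal so the consumer fails fast instead of retrying against the coordinator.
  def handleJoinGroupError(errorCode: Short): Unit =
    if (errorCode == GroupMaxSizeReachedCode)
      throw new GroupMaxSizeReachedException("Consumer group is already at its full capacity.")
    // other error codes keep their existing handling (retry, rebalance, etc.)

  def main(args: Array[String]): Unit =
    handleJoinGroupError(GroupMaxSizeReachedCode) // throws: the consumer fails itself
}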
Compatibility, Deprecation, and Migration Plan
- This is a backward compatible change.
Rejected Alternatives
Earlier discussion proposed other approaches, such as enforcing a memory limit or changing the initial rebalance delay. We believe those approaches are "either not strict or not intuitive" (quote from Stanislav), compared with a group size cap, which is easy for end users to understand and to configure for their own use cases.