...

so that when a member name has duplicates, we can refuse join requests from members with an outdated member.id (since we update the mapping upon each join group request). In the edge case where the client receives a DUPLICATE_STATIC_MEMBER error in the response, this suggests that some other consumer has taken its spot. The client should immediately fail itself to inform the end user that a configuration bug is generating duplicate consumers with the same identity. For the first version of this KIP, we just want straightforward handling that exposes the error at an early stage and makes bug cases easy to reproduce.
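The duplicate check described above can be sketched as a toy in-memory model. This is only an illustration, not the broker implementation: the class and method names are hypothetical, and it assumes that a join without a member.id is treated as a new instance that takes over the name, while a join carrying the currently mapped member.id is accepted unchanged.

```python
class DuplicateStaticMemberError(Exception):
    """Hypothetical stand-in for the DUPLICATE_STATIC_MEMBER error code."""


class StaticMemberRegistry:
    """Toy coordinator-side map from member name to the latest member.id."""

    def __init__(self):
        self._id_by_name = {}
        self._counter = 0

    def join(self, member_name, member_id=None):
        current = self._id_by_name.get(member_name)
        if member_id is not None:
            if member_id != current:
                # The caller holds an outdated id: another consumer with
                # the same member name has taken its spot.
                raise DuplicateStaticMemberError(member_name)
            return current
        # Assumption: a join without an id is a new instance, so it gets
        # a fresh id and replaces any previous mapping for this name.
        self._counter += 1
        new_id = f"{member_name}-{self._counter}"
        self._id_by_name[member_name] = new_id
        return new_id
```

In this model, a client that catches the error would stop itself rather than retry, matching the fail-fast behavior described above.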

For join group requests under dynamic membership (without member name set), the handling logic will remain unchanged.

If the broker version is not the latest (< v4), the join group request shall be downgraded to v3 without setting the member Id.

Scale up

We do not plan to solve the scale-up issue holistically within this KIP, since there is a parallel discussion about Incremental Cooperative Rebalancing, in which we will encode the "when to rebalance" logic at the application level instead of at the protocol level.

For scaling up from an empty stage, we also plan to deprecate group.initial.rebalance.delay.ms, since we will no longer need it once the incremental rebalancing work is done.

Rolling bounce

Currently there is a config called rebalance timeout, which is set from the consumer's max.poll.interval.ms. We set it to the poll interval because a consumer can only send requests from within a call to poll(), and we want to wait long enough for the join group request. When the rebalance timeout is reached, the group moves to the completingRebalance stage and removes unjoined members. This conflicts with the design of static membership, because those temporarily unavailable members will potentially reattempt the join group and trigger extra rebalances. Internally we would optimize this logic by making the rebalance timeout responsible only for ending the prepare rebalance stage, without removing non-responsive members immediately.
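The difference between the current and the proposed handling of the rebalance timeout can be illustrated with a small sketch. The function name and the `remove_unjoined` flag are hypothetical and exist only to contrast the two behaviors. How and when non-responsive members are eventually removed is not modeled here.

```python
def end_prepare_rebalance(members, joined, remove_unjoined):
    """Return (next_state, members kept) once the rebalance timeout fires.

    remove_unjoined=True models the current behavior, where members that
    did not rejoin in time are dropped. remove_unjoined=False models the
    proposed behavior, where the timeout only ends the prepare stage.
    """
    kept = [m for m in members if m in joined or not remove_unjoined]
    return "CompletingRebalance", kept


members = ["a", "b", "c"]
joined = {"a", "b"}  # "c" is temporarily unavailable

current_state, current_kept = end_prepare_rebalance(members, joined, True)
proposed_state, proposed_kept = end_prepare_rebalance(members, joined, False)
```

With the proposed behavior, member "c" stays in the group, so its later rejoin does not have to be treated as a brand-new member.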

Fault-tolerance of static membership 

To make sure we can recover from broker failure or leader transition, an in-memory member name map is not enough. We would reuse the `__consumer_offsets` topic to store the static member map information. When another broker takes over the leadership, the mapping is transferred along with it.
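As a rough sketch of the recovery idea, a new coordinator could rebuild the member name map by replaying the persisted records in order, with the last record for each name winning. The record shape used here, a `(member_name, member_id)` pair with `None` acting as a deletion marker, is an assumption made for illustration and is not the actual record format of the topic.

```python
def rebuild_member_map(records):
    """Replay (member_name, member_id) records in log order.

    Assumption: a member_id of None marks the removal of that name.
    """
    member_map = {}
    for member_name, member_id in records:
        if member_id is None:
            member_map.pop(member_name, None)
        else:
            member_map[member_name] = member_id
    return member_map


log = [
    ("consumer-a", "id-1"),
    ("consumer-b", "id-2"),
    ("consumer-a", "id-3"),  # consumer-a rejoined and got a new id
    ("consumer-b", None),    # consumer-b was removed from the group
]
recovered = rebuild_member_map(log)
```

Because the map is derived purely from the log, any broker that takes over leadership arrives at the same mapping.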

Command line API for membership management

forceStaticRebalance (introduced above) will trigger one rebalance immediately on static membership. It is mainly used for fast scale down/host replacement cases (we detect consumer failure faster than the session timeout). An error will be returned if

...

We need to grant access to these APIs to end users who may not hold an administrative role on the Kafka cluster. We shall allow an access level similar to that of the join group request, so that the consumer service owner can easily use this API.

Compatibility, Deprecation, and Migration Plan

The fallback logic has been discussed previously. A broker with a lower version would simply downgrade static membership to dynamic membership.

Upgrade from dynamic membership to static membership

...

  1. Upgrade your broker to include this KIP.
  2. Upgrade your client to include this KIP.
  3. Configure your consumers with a member name and a reasonable session timeout, then rolling bounce your consumer group.

That's it! We believe that the static membership logic is compatible with the current dynamic membership, which means it is allowed to have static members and dynamic members co-exist within the same consumer group. This assumption could be further verified when we do some modeling of the protocol (through TLA maybe) or dev test. 
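Step 3 of the upgrade list could look roughly like the sketch below, which builds one configuration per consumer instance. `group.id` and `session.timeout.ms` are existing consumer configs. The `member.name` key is a hypothetical placeholder for the member name setting this KIP introduces, and the group name, host names, and timeout value are illustrative only.

```python
def static_consumer_config(group_id, member_name, session_timeout_ms):
    """Build a consumer config dict for one static member."""
    return {
        "group.id": group_id,
        # Hypothetical key: stands in for the member name config of this KIP.
        "member.name": member_name,
        "session.timeout.ms": session_timeout_ms,
    }


# Each instance needs its own member name; reusing one would produce
# duplicate consumers with the same identity.
configs = [
    static_consumer_config("example-group", f"host-{i}", 60000)
    for i in range(3)
]
```

Deriving the member name from something stable per instance, such as a host name, keeps the identity unchanged across a rolling bounce.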

Compatibility, Deprecation, and Migration Plan

...

Non-goal

We have had some offline discussions on handling the leader rejoin case. For example, since the broker could also do the subscription monitoring work, we don't actually need to blindly trigger a rebalance whenever the leader sends a rejoin request. However, this is a separate topic and we will address it in another KIP.

...