...
- Static Membership: the membership protocol where the consumer group will not trigger a rebalance unless
- A new member joins
- A leader rejoins (possibly due to topic assignment change)
- An existing member's offline time exceeds the session timeout
- Broker receives a leave group request containing a list of `group.instance.id`s (details later)
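The trigger conditions above can be sketched as a small predicate. This is only an illustration: the class, enum, and method names are hypothetical, and `KNOWN_FOLLOWER_REJOINS` is an assumed example of an event outside the list (a non-leader static member rejoining under its existing `group.instance.id`).

```java
// Illustrative sketch only; names are hypothetical and not taken from the Kafka code base.
final class RebalanceTriggerSketch {

    // Events the group coordinator may observe under static membership.
    enum Event {
        NEW_MEMBER_JOINS,
        LEADER_REJOINS,
        MEMBER_SESSION_EXPIRED,
        LEAVE_GROUP_WITH_INSTANCE_IDS,
        KNOWN_FOLLOWER_REJOINS // assumed example of an event that is not in the trigger list
    }

    // Returns true only for the four conditions listed in the text.
    static boolean triggersRebalance(Event event) {
        switch (event) {
            case NEW_MEMBER_JOINS:
            case LEADER_REJOINS:
            case MEMBER_SESSION_EXPIRED:
            case LEAVE_GROUP_WITH_INSTANCE_IDS:
                return true;
            default:
                return false;
        }
    }

    // A member counts as expired once its offline time exceeds the session timeout.
    static boolean sessionExpired(long offlineMs, long sessionTimeoutMs) {
        return offlineMs > sessionTimeoutMs;
    }
}
```

Under this sketch, a short restart of a non-leader static member that stays within the session timeout does not cause a rebalance.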
...
The new `group.instance.id` config will be added to the join group request. A list of tuples containing `group.instance.id` and `member.id` will be added to the LeaveGroupRequest, replacing the single `member.id` field.
Code Block |
---|
JoinGroupRequest => GroupId SessionTimeout RebalanceTimeout MemberId GroupInstanceId ProtocolType GroupProtocols GroupId => String SessionTimeout => int32 RebalanceTimeout => int32 MemberId => String GroupInstanceId => String // new ProtocolType => String GroupProtocols => [Protocol MemberMetadata] Protocol => String MemberMetadata => bytes LeaveGroupRequest => GroupId MemberIdentityList GroupId => String MemberId => String // removed MemberIdentityList => List[Tuple[String, String]] // new |
We also bump the join/leave group request/response versions to v4/v3, respectively.
...
Code Block | ||||
---|---|---|---|---|
| ||||
MEMBER_ID_MISMATCH(78, "This implies some group.instance.id is already in the consumer group, however the corresponding member.id was not matching the record on coordinator", MemeberIdMisMatchException::new), GROUP_INSTANCE_ID_NOT_FOUND(79, "Some group.instance.id specified in the leave group request are not found", GroupInstanceIdInvalidException::new) |
...
On the server side, the broker will keep handling leave group requests <= v3 as before. We extend the LeaveGroupRequest API with a new tuple list that pairs each `group.instance.id` with a `member.id`. The reason to include `member.id` instead of solely adding a `group.instance.id` list is to move LeaveGroupRequest towards a more consistent batch API in the long term. The processing rules are as follows:
- For a static member, `group.instance.id` must be provided. The client may optionally provide a `member.id` when `group.instance.id` is configured non-empty. If `member.id` is provided, the member will only be removed if the `member.id` matches. Otherwise, only the `group.instance.id` is used. The `member.id` serves as a validation here; it is currently unused (set to empty string) but could become useful if we build a fully automated removal process.
- For leave group requests under dynamic membership, the member will supply a singleton list of one tuple containing the `member.id` that it is currently using and a `group.instance.id` set to empty string. In this case, we simply remove the given dynamic member the same way as the current leave group logic.
- Error cases expected are:
- Some instance ids (non-empty) are not found, which means the request is not valid (GROUP_INSTANCE_ID_NOT_FOUND, defined in the public changes section)
- A non-empty `member.id` in a tuple doesn't match the coordinator's record for the given `group.instance.id` (MEMBER_ID_MISMATCH, defined in the public changes section)
- A theoretical case is that both `member.id` and `group.instance.id` are set to empty string. We shall log an error in the server log. If every entry in the batch request is configured with empty strings, an UNKNOWN_MEMBER_ID error will be returned.
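The processing rules above can be sketched roughly as follows. This is a minimal illustration, not the broker implementation: the class, field, and method names are hypothetical, the coordinator state is reduced to two in-memory collections, and returning `MEMBER_ID_MISMATCH` for a non-matching `member.id`, as well as a per-entry `UNKNOWN_MEMBER_ID` for an all-empty tuple or an unknown dynamic member, are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; names and state layout are hypothetical, not Kafka's broker code.
final class LeaveGroupSketch {

    // Per-entry outcome; the error names mirror the codes discussed in the text.
    enum Result { REMOVED, GROUP_INSTANCE_ID_NOT_FOUND, MEMBER_ID_MISMATCH, UNKNOWN_MEMBER_ID }

    // One (group.instance.id, member.id) tuple from the request.
    static final class MemberIdentity {
        final String groupInstanceId;
        final String memberId;

        MemberIdentity(String groupInstanceId, String memberId) {
            this.groupInstanceId = groupInstanceId;
            this.memberId = memberId;
        }
    }

    // Assumed coordinator state: static members as group.instance.id -> member.id,
    // and the member.ids of dynamic members.
    final Map<String, String> staticMembers = new HashMap<>();
    final Set<String> dynamicMembers = new HashSet<>();

    Result handleOne(MemberIdentity identity) {
        if (identity.groupInstanceId.isEmpty()) {
            // Dynamic membership: only member.id identifies the member.
            if (identity.memberId.isEmpty()) {
                return Result.UNKNOWN_MEMBER_ID; // both fields empty (assumed per-entry result)
            }
            return dynamicMembers.remove(identity.memberId)
                    ? Result.REMOVED
                    : Result.UNKNOWN_MEMBER_ID;
        }
        String currentMemberId = staticMembers.get(identity.groupInstanceId);
        if (currentMemberId == null) {
            return Result.GROUP_INSTANCE_ID_NOT_FOUND;
        }
        // A non-empty member.id acts as an extra validation; an empty one is ignored.
        if (!identity.memberId.isEmpty() && !identity.memberId.equals(currentMemberId)) {
            return Result.MEMBER_ID_MISMATCH;
        }
        staticMembers.remove(identity.groupInstanceId);
        return Result.REMOVED;
    }

    // Processes the whole batch and returns one result per tuple, in request order.
    List<Result> handle(List<MemberIdentity> batch) {
        List<Result> results = new ArrayList<>();
        for (MemberIdentity identity : batch) {
            results.add(handleOne(identity));
        }
        return results;
    }
}
```

In this sketch a static member is removed by `group.instance.id` alone when `member.id` is empty, which matches the usage the text describes as the current default.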
...
Currently, scale-down is controlled by the session timeout: if the user removes over-provisioned consumer members, the group waits until the session timeout expires before triggering a rebalance. This is not ideal and motivates us to change LeaveGroupRequest to include a list of tuples of `group.instance.id` and `member.id`, so that we can batch remove offline members and trigger a rebalance immediately without them.
...