Status

Current state: In review

Discussion thread: TBD

JIRA: here

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

For stateful applications, one of the biggest performance bottleneck is the state shuffling. In Kafka consumer, there is a concept called "rebalance" which means that for given M partitions and N consumers in one consumer group, Kafka will try to balance the load between consumers and ideally have each consumer dealing with M/N partitions. Broker will also adjust the workload dynamically by monitoring consumers' health so that we could kick dead consumer out of the group, and handling new consumers' join group request. When the service state is heavy, a rebalance of one topic partition from instance A to B means huge amount of data transfer. If multiple rebalances are triggered, the whole service could take a very long time to recover due to data transfer.

The idea of this KIP is to reduce number of rebalances by introducing a new concept called static membership. It would help with following example use cases.

Improve performance of heavy state applications. We have seen that rebalance is the major performance killer with large state application scaling, due to the time wasted in state shuffling.
Improve general rolling bounce performance. For example MirrorMaker processes take a long time to rolling bounce the entire cluster, because one process restart will trigger one rebalance. With the change stated, we only need constant number of rebalance (e.g. for leader restart) for the entire rolling bounce, which will significantly improves the availability of the MirrorMaker pipeline as long as they could restart within the specified timeout.

Background of consumer rebalance

Right now broker handles consumer state in a two-phase protocol. To solely explain consumer rebalance, we only discuss 3 involving states here: RUNNING, PREPARE_REBALANCE and SYNC.

When a new member joins the consumer group, if this is a new member or the group leader, the broker will move this group state from RUNNING to PREPARE_REBALANCE. The reason for triggering rebalance when leader joins is because there might be assignment protocol change (for example if the consumer group is using regex subscription and new matching topics show up). If an old member rejoins the group, the state will not change.
When moved to PREPARE_REBALANCE state, the broker will mark first joined consumer as leader, and wait for all the members to rejoin the group. Once we collected enough number of consumers/ reached rebalance timeout, we will reply the leader with current member information and move the state to SYNC. All current members are informed to send SyncGroupRequest to get the final assignment.
The leader consumer will decide the assignment and send it back to broker. As last step, broker will announce the new assignment by sending SyncGroupResponse to all the followers. Till now we finished one rebalance and the group generation is incremented by 1.

In the current architecture, during each rebalance consumer groups on broker side will assign new member id with a UUID randomly generated each time. This is to make sure we have unique identity for each group member. During client restart, consumer will send a JoinGroupRequest with a special UNKNOWN_MEMBER id, which has no intention to be treated as an existing member. To make the KIP work, we need to change both client side and server side logic to make sure we persist member identity by checking `group.instance.id` (explained later) throughout restarts, which means we could reduce number of rebalances since we are able to apply the same assignment based on member identities. The idea is summarized as static membership, which in contrary to dynamic membership (the one our system currently uses), is prioritizing "state persistence" over "liveness".

We will be introducing two new terms:

Static Membership: the membership protocol where the consumer group will not trigger rebalance unless 1. a new member joins 2. a leader rejoins. 3. an existing member go offline over session timeout.
Group instance id: the unique identifier defined by user to distinguish each client instance.

Public Interfaces

New Configurations

Consumer configs

group.instance.id

The unique identifier of the consumer instance provided by end user.

Default value: empty string.

Client side changes

The new `group.instance.id` config will be added to the join group request, and a list of `group.instance.id` will be added to the LeaveGroupRequest.

JoinGroupRequest => GroupId SessionTimeout RebalanceTimeout MemberId GroupInstanceId ProtocolType GroupProtocols
  GroupId             => String
  SessionTimeout      => int32
  RebalanceTimeout	  => int32
  MemberId            => String
  GroupInstanceId     => String // new
  ProtocolType        => String
  GroupProtocols      => [Protocol MemberMetadata]
  Protocol            => String
  MemberMetadata      => bytes


LeaveGroupRequest => GroupId MemberId GroupInstanceIdList
  GroupId             => String
  MemberId            => String
  GroupInstanceIdList => list[String] // new

In the meantime, we bump the join/leave group request/response version to v4/v3.

public static Schema[] schemaVersions() {
    return new Schema[] {JOIN_GROUP_REQUEST_V0, JOIN_GROUP_REQUEST_V1, JOIN_GROUP_REQUEST_V2, JOIN_GROUP_REQUEST_V3, JOIN_GROUP_REQUEST_V4};
}

public static Schema[] schemaVersions() {
    return new Schema[] {LEAVE_GROUP_REQUEST_V0, LEAVE_GROUP_REQUEST_V1, LEAVE_GROUP_REQUEST_V2, LEAVE_GROUP_REQUEST_V3};
}

public static Schema[] schemaVersions() {
    return new Schema[] {JOIN_GROUP_RESPONSE_V0, JOIN_GROUP_RESPONSE_V1, JOIN_GROUP_RESPONSE_V2, JOIN_GROUP_RESPONSE_V3, JOIN_GROUP_RESPONSE_V4};
}

public static Schema[] schemaVersions() {
    return new Schema[] {LEAVE_GROUP_RESPONSE_V0, LEAVE_GROUP_RESPONSE_V1, LEAVE_GROUP_RESPONSE_V2, LEAVE_GROUP_RESPONSE_V3};
}

We are also introducing a new type of return error in JoinGroupResponse V4. Will explain the handling in the next section.

MEMBER_ID_MISMATCH(78, "The join group contains group instance id which is already in the consumer group, however the member id was not matching the record on coordinator", MemeberIdMisMatchException::new),

Server side changes

We shall increase the cap of session timeout to 30 min for relaxing static membership liveness tracking.

val GroupMaxSessionTimeoutMs = 1800000 // 30 min for max cap

For fault-tolerance, we also include group instance id within the member metadata to backup in the offset topic.

private val MEMBER_METADATA_V3 = new Schema(
  new Field(MEMBER_ID_KEY, STRING),
  new Field(MEMBER_NAME_KEY, STRING), // new
  new Field(CLIENT_ID_KEY, STRING),
  new Field(CLIENT_HOST_KEY, STRING),
  new Field(REBALANCE_TIMEOUT_KEY, INT32),
  new Field(SESSION_TIMEOUT_KEY, INT32),
  new Field(SUBSCRIPTION_KEY, BYTES),
  new Field(ASSIGNMENT_KEY, BYTES))

Command line API and Scripts

We will define one command line API to help us better manage consumer groups:

public static MembershipChangeResult removeMemberFromGroup(String groupId, list<String> groupInstanceIds, RemoveMemberFromGroupOptions options);

which will use the latest LeaveGroupRequest API to inform broker the permanent leaving of a bunch of instances through passing the id list.

A script called kafka-invoke-consumer-rebalance.sh will be added for end user to easily manipulate the consumer group.

./bin/kafka-invoke-consumer-rebalance.sh --zookeeper localhost:2181 --broker 1 --group-id group-1 will immediately trigger a consumer group rebalance by transiting group state to PREPARE_REBALANCE. (explanation in next section.)

Proposed Changes

In short, the proposed feature is enabled if

JoinGroupRequest V4 is supported on both client and broker
`group.instance.id` is configured with non-empty string.

Client behavior changes

On client side, we add a new config called `group.instance.id` in ConsumerConfig. On consumer service init, if the `group.instance.id` config is set, we will put it in the initial join group request to identify itself as a static member (static membership). Note that it is user's responsibility to assign unique `group.instance.id` for each consumers. This could be in service discovery hostname, unique IP address, etc. We also have logic handling duplicate `group.instance.id` in case client configured it wrong.

For the effectiveness of the KIP, consumer with `group.instance.id` set will not send leave group request when they go offline, which means we shall only rely on session.timeout to trigger group rebalance. It is because the proposed rebalance protocol will trigger rebalance with this intermittent in-and-out which is not ideal. In static membership we leverage the consumer group health management to client application such as K8. Therefore, it is also advised to make the session timeout large enough so that broker side will not trigger rebalance too frequently due to member come and go.

Kafka Streams change

KStream uses stream thread as consumer unit. For a stream instance configured with `num.threads` = 16, there would be 16 main consumers running on a single instance. We could actually borrow the idea for initializing client id for stream thread consumer to apply for `group.instance.id` generation.

For example if user specifies the client id, the stream consumer client id will be like: User client id + "-StreamThread-" + thread id + "-consumer". If user client id is not set, then we will use process id.

Our plan is to reuse the above consumer client id to define `group.instance.id`, so effectively the KStream instance will be able to use static membership if end user defines unique `client.id` for stream instances.

Server behavior changes

On server side, broker will keep handling join group request <= v3 as before. The `member.id` generation and assignment is still coordinated by broker, and broker will maintain an in-memory mapping of {group.instance.id → member.id} to track member uniqueness. When receiving an known member's (A.K.A `group.instance.id` known) rejoin request, broker will return the cached assignment back to the member, without doing any rebalance.

For join group requests under static membership (with `group.instance.id` set),

If the `member.id` uses UNKNOWN_MEMBER_NAME, we shall always generate a new member id and replace the one within current map. We also expect that after KIP-394, all the join group requests are requiring `member.id` to physically enter the consumer group, so the behavior of static member is consistent with that proposal.
we are requiring member.id (if not unknown) to match the value stored in cache, otherwise reply MEMBER_ID_MISMATCH. The edge case is that if we could have members with the same `group.instance.id` (for example mis-configured instances with a valid member.id but added a used `group.instance.id` on runtime). When `group.instance.id` has duplicates, we could refuse join request from members with an outdated `member.id` (since we update the mapping upon each join group request). In an edge case where the client hits this exception in the response, it is suggesting that some other consumer takes its spot. The client should immediately fail itself to inform end user that there is a configuration bug which is generating duplicate consumers with same identity. For first version of this KIP, we just want to have straightforward handling to expose the error in early stage and reproduce bug cases easily.

For join group requests under dynamic membership (without `group.instance.id` set), the handling logic will remain unchanged. If the broker version is not the latest (< v4), the join group request shall be downgraded to v3 without setting the member Id.

Scale up

We will not plan to solve the scale up issue holistically within this KIP, since there is a parallel discussion about I ncremental Cooperative Rebalancing, in which we will encode the "when to rebalance" logic at the application level, instead of at the protocol level.

We also plan to deprecate group.initial.rebalance.delay.ms since we no longer needs it once static membership is delivered and the incremental rebalancing work is done.

Rolling bounce

Currently there is a config called rebalance timeout which is configured by consumer max.poll.intervals. The reason we set it to poll interval is because consumer could only send request within the call of poll() and we want to wait sufficient time for the join group request. When reaching rebalance timeout, the group will move towards completingRebalance stage and remove unjoined members. This is actually conflicting with the design of static membership, because those temporarily unavailable members will potentially reattempt the join group and trigger extra rebalances. Internally we would optimize this logic by having rebalance timeout only in charge of stopping prepare rebalance stage, without removing non-responsive members immediately. There would not be a full rebalance if the lagging consumer sent a JoinGroup request within the session timeout.

So in summary, the member will only be removed due to session timeout. We shall remove it from both in-memory static group instance id mapping and member list.

Scale down

Currently the scale down is controlled by session timeout, which means if user removes the over-provisioned consumer members it waits until session timeout to trigger the rebalance. This is not ideal and motivates us to change LeaveGroupRequest to be able to include a list of `group.instance.id`s such that we could batch remove offline members and trigger rebalance immediately without them.

Fault-tolerance of static membership

To make sure we could recover from broker failure/leader transition, an in-memory `group.instance.id` map is not enough. We would reuse the `_consumer_offsets` topic to store the static member map information. When another broker takes over the leadership, it will load the static mapping info together.

Command line API for membership management

RemoveMemberFromGroup (introduced above) will trigger one rebalance immediately on static membership, which is mainly used for fast scale down/host replacement cases (we detect consumer failure faster than the session timeout). This API will first send a FindCoordinatorRequest to locate the target broker, and initiate a LeaveGroupRequest to target broker hosting that coordinator, and the coordinator will decide whether to take this metadata change request based on its status at time.

Error will be returned if

the broker is on an old version.
Consumer group does not exist.
Operator is not authorized. (neither admin nor consumer group creater)
if the group is not in a valid state to transit to rebalance. (use `canRebalance` function defined in GroupMetadata.scala to check)
Some instance ids are not found. (which means the request is not valid)

We need to enforce special access to these APIs to the end user who may not be in administrative role of Kafka Cluster. We shall allow a similar access level to the join group request, so the consumer service owner could easily use this API.

Compatibility, Deprecation, and Migration Plan

The fallback logic has been discussed previously. Broker with a lower version would just downgrade static membership towards dynamic membership.

Upgrade process

The recommended upgrade process is as follow:

Upgrade your broker to include this KIP.
Upgrade your client to include this KIP.
Set `group.instance.id` and session timeout to a reasonable number, and rolling bounce your consumer group.

That's it! We believe that the static membership logic is compatible with the current dynamic membership, which means it is allowed to have static members and dynamic members co-exist within the same consumer group. This assumption could be further verified when we do some modeling of the protocol (through TLA maybe) or dev test.

Downgrade process

The downgrade process is also straightforward. End user could just unset `group.instance.id` and do a rolling bounce to switch back to dynamic membership. The static membership metadata stored on broker will not take any effect when `group.instance.id` is empty. After consumer offset topic retention, the old mapping messages will be gone completely.

Non-goal

We do have some offline discussions on handling leader rejoin case, for example since the broker could also do the subscription monitoring work, we don't actually need to trigger rebalance on leader side blindly based on its rejoin request. However this is a separate topic and we will address it in another KIP.

Rejected Alternatives

In this pull request, we did an experimental approach to materialize member id(the identity given by broker, equivalent to the `group.instance.id` in proposal) on the instance local disk. This approach could reduce the rebalances as expected, which is the experimental foundation of KIP-345. However, KIP-345 has a few advantages over it:

It gives users more control of their `group.instance.id`; this would help for debugging purposes.
It is more cloud-/k8s-and-alike-friendly: when we move an instance from one container to another, we can copy the `group.instance.id` to the config files.
It doe not require the consumer to be able to access another dir on the local disks (think your consumers are deployed on AWS with remote disks mounted).
By allowing consumers to optionally specifying a `group.instance.id`, this rebalance benefit can be easily migrated to connect and streams as well which relies on consumers, even in a cloud environment.

Future Works

Beyond static membership we could unblock many interactive use cases between broker and consumer. We will initiate separate discussion threads once 345 is done. Examples are:

Pre-registration (Proposed by Jason). Client user could provide a list of hard-coded `group.instance.id` so that the server could respond to scaling operations more intelligently. For example when we scale up the fleet by defining 4 new client instance ids, the server shall wait until all 4 new members to join the group before kicking out the rebalance, same with scale down.
Add hot standby hosts by defining `target.group.size` (proposed by Mayuresh). We shall keep some idle consumers within the group and when one of the active member go offline, we shall trigger hot swap due to the fact that current group size is smaller than `target.group.size`. With this change we might even not need to extend the session timeout since we should easily use the spare consumer to start working.