...

This is the schema of the metadata advertised by each member.

Name          Type             Doc
ProcessId     uuid / static    Identity of the instance that may have multiple consumers.
UserEndPoint  bytes / static   Used for cross-client communication.
ClientTags    map / static     Used for rack-aware assignment algorithm.
TopologyHash  uuid / dynamic   Only updatable when reason is not zero.
TaskLag       array / dynamic  Only updatable when reason is not zero.

Member Metadata Reasons

  • None (0)
  • Shutdown (1)
  • WarmUpReady (2)
  • WarmUpFailed (3)
  • TopologyChanged (4)
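
As an illustration only, the schema and reasons above could be modelled as follows. This is a minimal sketch, not the wire format: the class and function names (`MemberMetadata`, `Reason`, `update_dynamic`) are hypothetical, and the element types of `client_tags` and `task_lag` are assumptions, since the schema only says "map" and "array".

```python
from dataclasses import dataclass, field, replace
from enum import IntEnum
from typing import Dict, List, Optional
from uuid import UUID, uuid4


class Reason(IntEnum):
    """Member metadata reasons, numbered as in the list above."""
    NONE = 0
    SHUTDOWN = 1
    WARM_UP_READY = 2
    WARM_UP_FAILED = 3
    TOPOLOGY_CHANGED = 4


@dataclass(frozen=True)
class MemberMetadata:
    # Static fields: set once when the member joins.
    process_id: UUID
    user_end_point: bytes
    client_tags: Dict[str, str]  # assumed str -> str
    # Dynamic fields: only updatable when the reason is not zero.
    topology_hash: UUID
    task_lag: List[int] = field(default_factory=list)  # assumed list of ints


def update_dynamic(
    meta: MemberMetadata,
    reason: Reason,
    topology_hash: Optional[UUID] = None,
    task_lag: Optional[List[int]] = None,
) -> MemberMetadata:
    """Return a copy with the dynamic fields updated, enforcing the reason rule."""
    if reason == Reason.NONE:
        raise ValueError("dynamic fields are only updatable when reason is not zero")
    changes = {}
    if topology_hash is not None:
        changes["topology_hash"] = topology_hash
    if task_lag is not None:
        changes["task_lag"] = list(task_lag)
    return replace(meta, **changes)
```

For example, `update_dynamic(meta, Reason.TOPOLOGY_CHANGED, topology_hash=uuid4())` yields a new value with the static fields untouched, while passing `Reason.NONE` raises `ValueError`.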

Assignment Metadata Schema

Name              Type  Doc
ActiveTasks       list  Local assignment for this consumer.
StandbyTasks      map   Local standby tasks for this consumer.
WarmupTasks       map   Local warming up tasks for this consumer.
PartitionsByHost  map   Global assignment information used for IQ.

Assignment Metadata Errors

...

Compatibility, Deprecation, and Migration Plan

Cluster Upgrade

  • Deploy the new version of the software
  • Roll the cluster
  • Enable the new protocol
  • Roll the cluster

Group Upgrade

  • Deploy the new software
  • Roll the members
  • Enable the new protocol
  • Roll the members

Regular expression.

  • Kafka 4.0 opt-in, consumer support.
  • Kafka 4.x opt-in, streams support.
  • Kafka 5.0, new protocol by default.

The change is mostly backward compatible. The current protocol is still supported without any changes. Consumers and Streams should be able to upgrade to the new protocol without any issues. The only part that may require adaptation is regex-based subscriptions. At the moment, those are validated on the client side, so the accepted syntax is language specific. We plan to use https://github.com/google/re2 on the server side, so clients may have to adapt their regex-based subscriptions when they migrate to the new protocol. We believe that the transition should be frictionless for most users.
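
To give a feel for the kind of adaptation involved: RE2 does not support backreferences or lookaround assertions, both of which many client-side regex engines accept. The sketch below is a rough heuristic for spotting such constructs in an existing subscription pattern before migrating. The function name `re2_incompatibilities` is hypothetical, the check is a plain text scan rather than a real parser (so it can misreport escaped sequences), and it does not cover every difference between engines.

```python
import re
from typing import List

# Constructs that RE2 rejects but backtracking engines commonly accept.
_UNSUPPORTED = [
    (re.compile(r"\(\?<?[=!]"), "lookaround assertion"),
    (re.compile(r"\\[1-9]"), "numbered backreference"),
    (re.compile(r"\(\?P=\w+\)|\\k<\w+>"), "named backreference"),
]


def re2_incompatibilities(pattern: str) -> List[str]:
    """Return labels for constructs in `pattern` that RE2 is known not to support."""
    return [label for matcher, label in _UNSUPPORTED if matcher.search(pattern)]
```

A subscription such as `topic-(?!internal).*` would be flagged for its negative lookahead, whereas `orders\..*` passes the check unchanged.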

The migration requires first upgrading the servers to the new software and enabling the new group coordinator. A roll is required for this. Then, the consumers must be upgraded to the new software as well and the new protocol must be enabled. This can be done with one roll if the consumer version is supported by the upgrade path. If not, two rolls are required.
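
The ordering described above can be written down as a small planning helper. This is only a sketch of the sequence in this section: `migration_plan` and its parameter are hypothetical names, and splitting the two-roll case into "deploy, roll, enable, roll" is an interpretation of the text rather than a prescribed procedure.

```python
from typing import List


def migration_plan(client_on_upgrade_path: bool) -> List[str]:
    """Return the ordered migration steps for one consumer group."""
    plan = [
        "upgrade servers and enable the new group coordinator",
        "roll servers",
    ]
    if client_on_upgrade_path:
        # Software upgrade and protocol switch share a single roll.
        plan += [
            "upgrade consumers and enable the new protocol",
            "roll consumers",
        ]
    else:
        # Assumed split: one roll for the software, one for the protocol.
        plan += [
            "upgrade consumers",
            "roll consumers",
            "enable the new protocol",
            "roll consumers",
        ]
    return plan
```

Counting the `"roll consumers"` entries in the result gives one roll when the consumer version is on the upgrade path and two otherwise.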

We plan to release the feature in a Kafka 4.x release, possibly 4.0. The feature will be opt-in to start. We plan to make it the default in Kafka 5.0.

We don't plan to deprecate the current rebalance protocol anytime soon because Connect and others still rely on it. However, we will deprecate the consumer embedded protocol used in the current protocol in Kafka 5.0 and remove it in Kafka 6.0. Consumers and Streams won't be able to use it after that.

Test Plan

Our primary method for testing the implementation will be through Discrete Event Simulation (DES). DES allows us to test a large number of deterministically generated random scenarios which include various kinds of faults (such as network partitions). It allows us to define system invariants programmatically which are then checked after each step in the simulation. The protocol will be formally verified with a TLA+ model as well. Other than that, we will use the typical suite of unit/integration/system tests. System tests will be parameterised to run with both protocols.
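
The following toy example sketches the general shape of such a simulation. It is not the planned test harness: the round-robin `assign` function, the event kinds, and the invariants are all simplified stand-ins chosen for illustration. What it does show is the pattern described above: events are generated from a seeded random source so that every run is reproducible, they are processed in timestamp order, and the invariants are checked after each step.

```python
import heapq
import random


def assign(partitions, members):
    """Toy assignor: round-robin the partitions over the sorted live members."""
    ordered = sorted(members)
    if not ordered:
        return {}
    return {p: ordered[i % len(ordered)] for i, p in enumerate(partitions)}


def check_invariants(partitions, members, assignment):
    # Every owner must be a live member.
    assert all(owner in members for owner in assignment.values())
    # With at least one live member, every partition must be owned.
    if members:
        assert set(assignment) == set(partitions)


def simulate(seed, num_events=200, num_partitions=6, num_members=4):
    """Run one deterministic scenario and return its trace."""
    rng = random.Random(seed)  # the seed fully determines the scenario
    partitions = list(range(num_partitions))

    # Schedule random join/fail events; seq breaks timestamp ties.
    queue = []
    for seq in range(num_events):
        when = rng.uniform(0.0, 100.0)
        kind = rng.choice(["join", "fail"])
        member = f"m{rng.randrange(num_members)}"
        heapq.heappush(queue, (when, seq, kind, member))

    members, trace = set(), []
    while queue:
        _, _, kind, member = heapq.heappop(queue)
        if kind == "join":
            members.add(member)
        else:
            members.discard(member)
        assignment = assign(partitions, members)
        check_invariants(partitions, members, assignment)  # after every step
        trace.append((kind, member, tuple(sorted(assignment.items()))))
    return trace
```

Because the scenario depends only on the seed, a failing run can be replayed exactly by calling `simulate` again with the same seed.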

Rejected Alternatives

TODO

Future Work

Eventually, we aim to deprecate the current membership/rebalance API. To get to this point, we would first need to move all the use cases away from it.

Connect Group/Rebalance Protocol

Kafka Connect is the second protocol type currently supported by Apache Kafka. In the future, we propose to use an approach for Connect similar to the one used by the current proposal. We would introduce a new connect group type and a new set of APIs for Connect. The rebalance protocol is very similar to the consumer rebalance protocol but works with different resource types.

Membership/Leader Election API

The group membership protocol is also used outside of Apache Kafka. For instance, the Confluent Schema Registry uses it for leader election. It is not clear whether we really want to support such cases in the future. If we do, we could also define a new set of APIs for them. That would be much cleaner in the long run.