
Status

Current state[Under Discussion]

Discussion thread: TBD

JIRA:  TBD

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Currently Kafka Streams uses the consumer membership protocol to coordinate stream task assignment. When we scale up a Streams application, the KStream group will attempt to revoke active tasks and let the newly spun-up hosts take them over. If the assigned tasks are stateful, it takes time for the new host to restore them, but the current strategy is to reassign tasks as soon as new member join-group requests arrive, in order to balance consumption across the application. For a state-heavy application it is not ideal to give up tasks immediately once the new player joins the party; instead we should buffer some time to let the new player accept some restoring tasks, and wait until it is "ready" before it takes over the active tasks. Ideally, with this approach we could achieve a no-downtime transition while scaling up the cluster. The same applies to scale-down, where we need to buffer time for migrating tasks from ready-to-shut-down hosts to the retained ones.

Recently the community has been promoting cooperative rebalancing to mitigate the pain points of the stop-the-world rebalancing protocol, and an initiative for Kafka Connect has already started in KIP-415. There has already been great discussion around it, but the hard part for KStream is that a delayed rebalance is not the ideal solution. The better approach is to adopt the great design for Kafka Connect in KIP-415, while letting KStream members explicitly announce their state changes and trigger the rebalances necessary to migrate resource ownership once they are fully ready after task restoration.

Thus we are proposing a dedicated design specifically for KStream rebalancing in order to holistically smooth the scale up/down experience.

Proposed Changes

Learn Task Definition

We will introduce a new type of stream task called a `learner task`: a special task assigned to one stream instance to restore the state of a currently active task from another instance. It shares the same semantics as a standby task; the only difference is that when restoration of a learner task completes, the stream instance initiates a new JoinGroupRequest to trigger a rebalance of the task assignment. The goal of the learner task is to delay task migration while the destination host has not finished, or even started, replaying the active task's state. This applies to both scale-up and scale-down scenarios.
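A minimal sketch of the learner task lifecycle described above. All class and method names here are hypothetical illustrations, not actual Streams internals: the task restores like a standby task, and only its completion hook differs in that it triggers a rejoin.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a learner task restores state like a standby task,
// but once restoration catches up it asks its instance to rejoin the group
// so a rebalance can transfer active ownership.
class LearnerTask {
    private final String taskId;
    private final Runnable rejoinGroup;  // e.g. sends a new JoinGroupRequest
    private final AtomicBoolean restored = new AtomicBoolean(false);

    LearnerTask(String taskId, Runnable rejoinGroup) {
        this.taskId = taskId;
        this.rejoinGroup = rejoinGroup;
    }

    // Called on each poll loop with the remaining changelog lag.
    void onRestoreProgress(long remainingLag) {
        if (remainingLag == 0 && restored.compareAndSet(false, true)) {
            rejoinGroup.run();  // trigger the rebalance exactly once
        }
    }

    boolean isRestored() {
        return restored.get();
    }
}
```

The `AtomicBoolean` guard ensures the rejoin is triggered only on the first time the lag reaches zero, mirroring the "initiate a new JoinGroupRequest once restoration completes" behavior above.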

Alleviating the Stop-the-World Effect

As mentioned in the motivation section, we also want to mitigate the stop-the-world effect of the current global rebalance protocol. A quick recap of the current rebalance semantics on KStream: when a rebalance starts, all members will

  1. Join group with all current assigned tasks revoked

  2. Wait until the group has stabilized to resume work

The reason for doing so is that we need to guarantee each topic partition is assigned to exactly one consumer at a time, so a topic partition cannot be reassigned before it is revoked.

For Kafka Connect, we choose to avoid revoking all currently assigned tasks, trading this off against one more rebalance. The behavior becomes:

  1. Join group with all current active tasks running

  2. Sync the revoked partitions and stop them (first rebalance)

  3. Rejoin group immediately with only active tasks (second rebalance)

Feel free to take a look at the KIP-415 example to get a sense of how the optimization works.
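The two-rebalance flow above can be sketched as a simple set computation per member. This is an illustrative model (hypothetical class name, not Connect's actual assignor code): tasks owned both before and after keep running through both rebalances, removed tasks stop after the first rebalance, and new tasks start only after the second.

```java
import java.util.*;

// Hypothetical sketch of the KIP-415 style two-phase assignment: compute,
// for one member, which tasks keep running, which are revoked in the first
// rebalance, and which are only granted in the follow-up rebalance.
class IncrementalAssignment {
    final Set<String> keepRunning;   // owned before and after: never stopped
    final Set<String> revoke;        // stopped after the first rebalance
    final Set<String> newlyGranted;  // started after the second rebalance

    IncrementalAssignment(Set<String> owned, Set<String> target) {
        keepRunning = new TreeSet<>(owned);
        keepRunning.retainAll(target);

        revoke = new TreeSet<>(owned);
        revoke.removeAll(target);

        newlyGranted = new TreeSet<>(target);
        newlyGranted.removeAll(owned);
    }
}
```

The key point is that `keepRunning` is never interrupted, which is exactly how the stop-the-world effect is avoided at the cost of a second rebalance.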

For KStream, we are going to take a trade-off between the "revoke all" and "revoke none" solutions: we shall only revoke tasks that have been learned since the last round. So when we assign learner tasks to a new member, we shall also mark the corresponding active tasks as `learnee tasks` on their current owners. Every time a rebalance begins, the task owners will revoke the learning bucket and join the group without affecting other ongoing tasks. This way learned tasks can transfer ownership immediately, without requiring a second round of rebalance. Compared with KIP-415, we are optimizing for fewer rebalances at the cost of increased metadata size.
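On the task owner's side, the "revoke only the learning bucket" behavior can be sketched as follows. Class and method names are hypothetical, not actual Streams internals:

```java
import java.util.*;

// Hypothetical sketch of the owner-side bookkeeping: only active tasks
// marked as learnees (i.e. a learner elsewhere is restoring their state)
// are revoked on rejoin; all other active tasks keep running.
class LearneeTracker {
    private final Set<String> activeTasks = new HashSet<>();
    private final Set<String> learningBucket = new HashSet<>();

    void runActive(String taskId) {
        activeTasks.add(taskId);
    }

    // The assignor marked this active task as a learnee since last round.
    void markAsLearnee(String taskId) {
        if (activeTasks.contains(taskId)) {
            learningBucket.add(taskId);
        }
    }

    // On rejoin, revoke only the learning bucket so learned tasks can
    // transfer ownership without a second round of rebalance.
    Set<String> revokeOnRejoin() {
        Set<String> revoked = new TreeSet<>(learningBucket);
        activeTasks.removeAll(revoked);
        learningBucket.clear();
        return revoked;
    }

    Set<String> stillRunning() {
        return new TreeSet<>(activeTasks);
    }
}
```

Contrast with the eager protocol, where `revokeOnRejoin` would return every active task; here the unaffected tasks never stop processing.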

Public Interfaces

We will be adding the following new configs:



A public interface is any change to the following:

  • Binary log format

  • The network protocol and api behavior

  • Any class in the public packages under clients

    • Configuration, especially client configuration

    • org/apache/kafka/common/serialization

    • org/apache/kafka/common

    • org/apache/kafka/common/errors

    • org/apache/kafka/clients/producer

    • org/apache/kafka/clients/consumer (eventually, once stable)

  • Monitoring

  • Command line tools and arguments

  • Anything else that will likely break existing users in some way when they upgrade

Compatibility, Deprecation, and Migration Plan

  • Metadata size increase
  • No downtime upgrade due to change of protocolType

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.
