Status
Current state: [Under Discussion]
Discussion thread: TBD
JIRA: TBD
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Currently Kafka Streams uses the consumer membership protocol to coordinate stream task assignment. When we scale up a stream application, the KStream group will attempt to revoke active tasks and let the newly spun-up hosts take them over. It takes time for a new host to restore the tasks if the assigned ones are stateful, but the current strategy is to reassign tasks upon receiving new member join-group requests in order to balance consumption across the application. For state-heavy applications, it is not ideal to give up the tasks immediately once the new player joins the party; instead, we should buffer some time to let the new player accept some restoring tasks, and wait until it is “ready” to take over the active tasks. Ideally, we could achieve a no-downtime transition during cluster scale-up if we take this approach. The same situation applies to scale-down, where we need to buffer time for migrating tasks from the ready-to-shut-down hosts to the retained ones.
Recently the community has been promoting cooperative rebalancing to mitigate the pain points of the stop-the-world rebalancing protocol, and an initiative for Kafka Connect has already started in KIP-415. There has already been great discussion around it, but the hard part for KStream is that a delayed rebalance alone is not the ideal solution. The better approach is to adopt some of the great design for KConnect in KIP-415, while letting KStream members explicitly announce their state changes and trigger the necessary rebalances to migrate resource ownership once they are fully ready after task restoration.
Thus we are proposing a dedicated design specifically for KStream rebalancing in order to holistically smooth the scale up/down experience. The primary goals are twofold:
- Reduce unnecessary downtime due to task restoration
- Improve rebalance performance for stream applications, i.e. alleviate the stop-the-world effect.
Proposed Changes
New Stream Task Type
We will introduce a new type of stream task called a `learner task`, which is a special task assigned to a stream instance to restore the state of a current active task from another instance. It shares the same semantics as a standby task; the only difference is that when a learner task completes restoration, the stream instance will initiate a new JoinGroupRequest to trigger a rebalance for the new task assignment. The goal of the learner task is to delay task migration while the destination host has not finished, or has not even started, replaying the active task. This applies to both scale-up and scale-down scenarios.
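To make the lifecycle concrete, below is a minimal sketch of how a learner task could behave, assuming hypothetical names (`LearnerTask`, `maybeCompleteRestoration`, `requestRejoin`) that are illustrative only and not part of any proposed public API:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a learner task restores like a standby task, then
// triggers a group rejoin once it has caught up with the changelog.
class LearnerTask {
    private final String taskId;
    private final AtomicBoolean ready = new AtomicBoolean(false);

    LearnerTask(final String taskId) {
        this.taskId = taskId;
    }

    // Called on each poll-loop iteration while the changelog is being replayed.
    void maybeCompleteRestoration(final long restoredOffset, final long changelogEndOffset) {
        if (restoredOffset >= changelogEndOffset && ready.compareAndSet(false, true)) {
            // Fully caught up: send a new JoinGroupRequest so that the group
            // leader can transfer active ownership in the next rebalance.
            requestRejoin();
        }
    }

    private void requestRejoin() {
        // In a real implementation this would flag the consumer coordinator
        // to rejoin the group; elided here.
    }
}
```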
Stop-The-World Effect Recap
As mentioned in the motivation section, we also want to mitigate the stop-the-world effect of the current global rebalance protocol. A quick recap of the current rebalance semantics: on KStream, when a rebalance starts, all members will:
1. Join the group with all currently assigned tasks revoked.
2. Wait until the group stabilizes to resume work.
The reason for doing so is that we need to guarantee each topic partition is assigned to exactly one consumer at a time, so a topic partition cannot be reassigned before it is revoked.
For Kafka Connect, the choice was to keep all currently assigned tasks running, trading off with one more rebalance. The behavior becomes:
1. Join the group with all current active tasks running.
2. Sync the revoked partitions and stop them (first rebalance).
3. Rejoin the group immediately with only the active tasks (second rebalance).
Feel free to take a look at the KIP-415 example to get a sense of how the algorithm works.
For KStream, we are going to take a trade-off between the “revoke all” and “revoke none” solutions: we shall only revoke tasks that are being learned since the last round. So when we assign learner tasks to a new member, we shall also mark the corresponding active tasks as “being learned” on their current owners. Every time a rebalance begins, the task owners will revoke the being-learned tasks and join the group without affecting other ongoing tasks. This way, learned tasks can immediately transfer ownership without requiring a second round of rebalance. Compared with KIP-415, we are optimizing for fewer rebalances, at the cost of increased metadata size and partial availability of the learner tasks.
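To illustrate, here is a minimal sketch of the per-member revocation rule described above, assuming illustrative names (task ids as strings, a `beingLearned` flag map) that are not proposed API:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: when a rebalance begins, a member keeps all of its active tasks
// running except those marked "being learned" by another instance.
class IncrementalRevocationSketch {
    static Set<String> tasksToRevoke(final Set<String> activeTasks,
                                     final Map<String, Boolean> beingLearned) {
        final Set<String> revoked = new HashSet<>();
        for (final String task : activeTasks) {
            if (beingLearned.getOrDefault(task, false)) {
                revoked.add(task); // a learner elsewhere will take ownership
            }
        }
        return revoked; // everything else keeps running through the rebalance
    }
}
```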
Next we are going to look at several typical scaling scenarios to better understand the algorithm.
Scale Up Running Application
The newly joined members will be assigned learner tasks by the group leader, and they will first replay the corresponding changelogs locally. By the end of the first round of rebalance, there is no “real task transfer”. When a new member finally finishes its replay, it will re-attempt to join the group to indicate that it is “ready” to take on real active tasks. During the second rebalance, the leader will then transfer the task ownership.
```
Cluster has 3 stream instances S1 (leader), S2, S3, and they each own some of tasks T1 ~ T5.
Group stable state: S1[T1, T2], S2[T3, T4], S3[T5]

#First Rebalance
New member S4 joins the group.
S1 performs task assignments:
	S1(assigned: [T1, T2], revoked: [], learning: [])
	S2(assigned: [T3, T4], revoked: [], learning: [])
	S3(assigned: [T5], revoked: [], learning: [])
	S4(assigned: [], revoked: [], learning: [T1])

#Second Rebalance
New member S5 joins the group.
Members S1~S5 join with the following metadata (S4 is not ready yet):
	S1(assigned: [T2], revoked: [T1], learning: []) // T1 revoked because it's "being learned"
	S2(assigned: [T3, T4], revoked: [], learning: [])
	S3(assigned: [T5], revoked: [], learning: [])
	S4(assigned: [], revoked: [], learning: [T1])
	S5(assigned: [], revoked: [], learning: [T3])
S1 performs task assignments:
	S1(assigned: [T1, T2], revoked: [], learning: [])
	S2(assigned: [T3, T4], revoked: [], learning: [])
	S3(assigned: [T5], revoked: [], learning: [])
	S4(assigned: [], revoked: [], learning: [T1])
	S5(assigned: [], revoked: [], learning: [T3])

#Third Rebalance
Member S4 finishes its replay, becomes ready, and rejoins the group.
Members S1~S5 join with the following status (S5 is not ready yet):
	S1(assigned: [T2], revoked: [T1], learning: [])
	S2(assigned: [T4], revoked: [T3], learning: []) // T3 revoked because it's "being learned"
	S3(assigned: [T5], revoked: [], learning: [])
	S4(assigned: [], revoked: [], learning: [T1])
	S5(assigned: [], revoked: [], learning: [T3])
S1 performs task assignments:
	S1(assigned: [T2], revoked: [T1], learning: [])
	S2(assigned: [T3, T4], revoked: [], learning: [])
	S3(assigned: [T5], revoked: [], learning: [])
	S4(assigned: [T1], revoked: [], learning: [])
	S5(assigned: [], revoked: [], learning: [T3])

#Fourth Rebalance
Member S5 is ready and rejoins the group.
Members S1~S5 join with the following status:
	S1(assigned: [T2], revoked: [], learning: [])
	S2(assigned: [T4], revoked: [T3], learning: []) // T3 revoked because it's "being learned"
	S3(assigned: [T5], revoked: [], learning: [])
	S4(assigned: [T1], revoked: [], learning: [])
	S5(assigned: [], revoked: [], learning: [T3])
S1 performs task assignments:
	S1(assigned: [T2], revoked: [], learning: [])
	S2(assigned: [T4], revoked: [T3], learning: [])
	S3(assigned: [T5], revoked: [], learning: [])
	S4(assigned: [T1], revoked: [], learning: [])
	S5(assigned: [T3], revoked: [], learning: [])
```
Now the group reaches balance with 5 members and 5 tasks.
Scaling Up from Empty Group
Normal Scale Down
As we have already discussed with the “learner” logic, when we scale down the stream group it is also favorable to initiate learner tasks before actually shutting down the instances. Although standby tasks could help in this case, they require the user to configure them in advance, which may not have been done when the admin performs the scale-down. The plan is to use a command line tool to tell certain stream members that a shutdown is about to be executed. The informed members will send a join-group request with a join reason indicating they are “leaving soon”. During rebalance assignment, the leader will perform the learner assignment among the members that have no intention of leaving, and a leaving member will shut itself down once it receives the instruction to revoke all its active tasks.
For ease of operation, a new tool for scaling down the stream application shall be built. It will have access to the application instances and compute which members to remove, while the end user just needs to provide a scale-down percentage. For example, if the current cluster size is 40 and we choose to scale down to 80%, the script will attempt to inform 8 of the 40 hosts to “prepare leaving” the group.
```
Group stable state: S1[T1, T2], S2[T3, T4], S3[T5]
Scaling down the application: S2 will be leaving.

#First Rebalance
Member S2 joins the group and claims that it is leaving.
S1 performs task assignments:
	S1(assigned: [T1, T2], revoked: [], learning: [T3])
	S2(assigned: [T3, T4], revoked: [], learning: [])
	S3(assigned: [T5], revoked: [], learning: [T4])

#Second Rebalance
S3 finishes its replay first and triggers another rebalance.
Members S1~S3 join with the following status (S1 is not ready yet):
	S1(assigned: [T1, T2], revoked: [], learning: [T3])
	S2(assigned: [T3], revoked: [T4], learning: [])
	S3(assigned: [T5], revoked: [], learning: [T4])
S1 performs task assignments:
	S1(assigned: [T1, T2], revoked: [], learning: [T3])
	S2(assigned: [T3], revoked: [T4], learning: [])
	S3(assigned: [T4, T5], revoked: [], learning: [])

#Third Rebalance
S1 finishes its replay and triggers a rebalance.
Members S1~S3 join with the following status:
	S1(assigned: [T1, T2], revoked: [], learning: [T3])
	S2(assigned: [], revoked: [T3], learning: [])
	S3(assigned: [T4, T5], revoked: [], learning: [])
S1 performs task assignments:
	S1(assigned: [T1, T2, T3], revoked: [], learning: [])
	S2(assigned: [], revoked: [T3], learning: [])
	S3(assigned: [T4, T5], revoked: [], learning: [])
```
S2 will shut itself down upon receiving the new assignment, since it has no assigned tasks left.
Note that to make sure the above resource shuffling happens as expected, at least three new task status indicators need to be provided (a rough sketch follows this list):
- Which tasks are learner tasks? This could be a tag on the standby task, e.g. "isLearner";
- Which tasks are being learned? This could be a tag on the active task, e.g. "isLearned";
- Which learner tasks have become ready? This could be a tag on the standby task, e.g. "isReady".
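As a rough illustration only (field names follow the tags above but are not final API), the three indicators could be carried in the task metadata like this:

```java
// Hypothetical container for the three new task status indicators.
class TaskStatusFlags {
    final boolean isLearner;  // on a standby task: this standby is a learner
    final boolean isLearned;  // on an active task: another instance is learning it
    final boolean isReady;    // on a learner task: restoration has caught up

    TaskStatusFlags(final boolean isLearner, final boolean isLearned, final boolean isReady) {
        this.isLearner = isLearner;
        this.isLearned = isLearned;
        this.isReady = isReady;
    }
}
```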
Optimizations
Stateful vs Stateless Tasks
For stateless tasks, the ownership transfer should happen immediately without the need for a learning stage, because there is nothing to restore. We should fall back to the KIP-415 algorithm here, where the stateless tasks will only be revoked during the second rebalance, as sketched below.
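A minimal sketch of this fallback decision, with hypothetical helper names (`statelessTasks`, `directTransfers`, `learnerAssignments`) standing in for the real assignment logic:

```java
import java.util.Map;
import java.util.Set;

// Sketch: stateless tasks skip the learner stage and transfer directly in the
// KIP-415 style; stateful tasks go through changelog replay first.
class StatelessFallbackSketch {
    static void assign(final String task,
                       final String newOwner,
                       final Set<String> statelessTasks,
                       final Map<String, String> directTransfers,
                       final Map<String, String> learnerAssignments) {
        if (statelessTasks.contains(task)) {
            // Nothing to restore: revoke from the old owner during the second
            // rebalance and hand over directly, as in KIP-415.
            directTransfers.put(task, newOwner);
        } else {
            // Stateful: assign as a learner first; ownership moves only after
            // the learner has caught up on the changelog.
            learnerAssignments.put(task, newOwner);
        }
    }
}
```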
Public Interfaces
We will be adding the following new configs: TBD
Compatibility, Deprecation, and Migration Plan
- Metadata size increase
- Ensuring a no-downtime upgrade despite the change of protocolType
Rejected Alternatives