...
We therefore propose a design dedicated to KStream rebalancing that smooths the scale up/down experience as a whole. It has two primary goals:
- Reduce unnecessary downtime due to task restoration
- Improve rebalance performance for stream applications, i.e. alleviate the stop-the-world effect.
Proposed Changes
New Stream Task Type
We will introduce a new type of stream task called `learner task`: a special task assigned to a stream instance so that it can restore the state of a currently active task owned by another instance. It shares the semantics of a standby task; the only difference is that when a learner task finishes restoring, the stream instance initiates a new JoinGroupRequest to trigger a rebalance that applies the new task assignment. The goal of the learner task is to delay task migration while the destination host has not finished, or not even started, replaying the active task. This applies to both scale-up and scale-down scenarios.
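The difference between a learner task and a standby task can be sketched as a small model. This is only an illustration under stated assumptions: `LearnerTask`, `StreamInstance`, the offset fields, and the `join_requests` counter (which stands in for sending a JoinGroupRequest) are hypothetical names, not part of the Kafka Streams API.

```python
from dataclasses import dataclass


@dataclass
class LearnerTask:
    """Hypothetical model of a learner task restoring state for an active task."""
    task_id: str
    restored_offset: int = 0  # how far the state has been replayed so far
    end_offset: int = 0       # offset the learner must reach to be caught up

    def is_ready(self) -> bool:
        return self.restored_offset >= self.end_offset


class StreamInstance:
    """Hypothetical stream instance hosting learner tasks."""

    def __init__(self, name: str):
        self.name = name
        self.learners = {}
        self.join_requests = 0  # stands in for sending a JoinGroupRequest

    def add_learner(self, task_id: str, end_offset: int) -> None:
        self.learners[task_id] = LearnerTask(task_id, 0, end_offset)

    def restore(self, task_id: str, records: int) -> bool:
        """Replay `records` more records; return True once the learner is ready."""
        task = self.learners[task_id]
        was_ready = task.is_ready()
        task.restored_offset = min(task.end_offset, task.restored_offset + records)
        # Unlike a standby task, a learner that has just finished restoring
        # makes the instance rejoin the group to trigger a rebalance.
        if task.is_ready() and not was_ready:
            self.join_requests += 1
        return task.is_ready()
```

In this model the rejoin happens exactly once, at the moment restoration completes; further calls to `restore` on a ready learner do not trigger additional rebalances.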
...
Stop-the-world Effect Recap
As mentioned in the motivation section, we also want to mitigate the stop-the-world effect of the current global rebalance protocol. As a quick recap of the current rebalance semantics on KStream: when a rebalance starts, all members would
...
For KStream, we take a trade-off between the “revoke all” and “revoke none” solutions: we only revoke tasks that have been marked as being learned since the last round. So when we assign learner tasks to a new member, we also mark the corresponding active tasks as "being learned" on their current owners. Whenever a rebalance begins, the task owners revoke their being-learned tasks and join the group without affecting other ongoing tasks. This way a learned task can transfer ownership immediately, without a second round of rebalance. Compared with KIP-415, we optimize for fewer rebalances at the cost of larger metadata and partial availability of the tasks being learned.
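The partial revocation rule can be sketched as a function that splits a member's active tasks at the start of a rebalance. The function name and the shape of the returned metadata are assumptions made for illustration; the proposal does not define a concrete encoding here.

```python
def prepare_join_metadata(owned, being_learned):
    """Split a member's active tasks when a rebalance begins.

    owned: list of active task ids currently owned by the member.
    being_learned: set of task ids marked as "being learned" on this member.
    Only the being-learned tasks are revoked; all others keep running.
    """
    revoked = [t for t in owned if t in being_learned]
    assigned = [t for t in owned if t not in being_learned]
    return {"assigned": assigned, "revoked": revoked}
```

For example, a member that owns `T1` and `T2` with `T1` marked as being learned would join with `T2` still assigned and `T1` revoked, while a member with no being-learned tasks revokes nothing.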
Next we look at several examples of auto-scaling cases.
Scale Up
We can illustrate the process with the following example:
    Cluster has 3 stream instances S1(leader), S2, S3, and they each own some tasks 1 ~ 5
    Group stable state: S1[T1, T2], S2[T3, T4], S3[T5]

    #First Rebalance
    New member S4 joins the group
    S1 performs task assignments:
        S1(assigned: [T1, T2], revoked: [], learning: [])
        S2(assigned: [T3, T4], revoked: [], learning: [])
        S3(assigned: [T5], revoked: [], learning: [])
        S4(assigned: [], revoked: [], learning: [T1])

    #Second Rebalance
    New member S5 joins the group.
    Member S1~S5 join with following metadata: (S4 is not ready yet)
        S1(assigned: [T2], revoked: [T1], learning: []) // T1 revoked because it's "being learned"
        S2(assigned: [T3, T4], revoked: [], learning: [])
        S3(assigned: [T5], revoked: [], learning: [])
        S4(assigned: [], revoked: [], learning: [T1])
        S5(assigned: [], revoked: [], learning: [T3])
    S1 performs task assignments:
        S1(assigned: [T1, T2], revoked: [], learning: [])
        S2(assigned: [T3, T4], revoked: [], learning: [])
        S3(assigned: [T5], revoked: [], learning: [])
        S4(assigned: [], revoked: [], learning: [T1])
        S5(assigned: [], revoked: [], learning: [T3])

    #Third Rebalance
    Member S4 finishes its replay and becomes ready, re-attempt to join the group. S5 is not ready yet.
    Member S1~S5 join with following status: (S5 is not ready yet)
        S1(assigned: [T2], revoked: [T1], learning: [])
        S2(assigned: [T4], revoked: [T3], learning: []) // T3 revoked because it's "being learned"
        S3(assigned: [T5], revoked: [], learning: [])
        S4(assigned: [], revoked: [], learning: [T1])
        S5(assigned: [], revoked: [], learning: [T3])
    S1 performs task assignments:
        S1(assigned: [T2], revoked: [T1], learning: [])
        S2(assigned: [T3, T4], revoked: [], learning: [])
        S3(assigned: [T5], revoked: [], learning: [])
        S4(assigned: [T1], revoked: [], learning: [])
        S5(assigned: [], revoked: [], learning: [T3])

    #Fourth Rebalance
    Member S5 is ready, re-attempt to join the group.
    Member S1~S5 join with following status:
        S1(assigned: [T2], revoked: [], learning: [])
        S2(assigned: [T4], revoked: [T3], learning: []) // T3 revoked because it's "being learned"
        S3(assigned: [T5], revoked: [], learning: [])
        S4(assigned: [T1], revoked: [], learning: [])
        S5(assigned: [T3], revoked: [], learning: [])
    S1 performs task assignments:
        S1(assigned: [T2], revoked: [], learning: [])
        S2(assigned: [T4], revoked: [T3], learning: [])
        S3(assigned: [T5], revoked: [], learning: [])
        S4(assigned: [T1], revoked: [], learning: [])
        S5(assigned: [T3], revoked: [], learning: [])

    Now the group reaches balance with 5 members and 5 tasks.
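The leader's decision in each round of the example can be approximated by a small function: a revoked task moves to its learner only if that learner is ready, otherwise it returns to the previous owner and the learner keeps learning. This is a simplified sketch, not the actual assignor; the `assign` function, the dictionary layout, and the `ready` set are illustrative assumptions, and the output omits the `revoked` field shown in the example.

```python
def assign(members, ready):
    """Sketch of the leader's assignment for one rebalance round.

    members: {name: {"assigned": [...], "revoked": [...], "learning": [...]}}
             as reported in each member's join metadata.
    ready:   set of member names whose learner tasks have finished restoring.
    Returns  {name: {"assigned": [...], "learning": [...]}}.
    """
    result = {
        name: {"assigned": list(meta["assigned"]), "learning": []}
        for name, meta in members.items()
    }
    transferred = set()
    for name, meta in members.items():
        for task in meta["learning"]:
            if name in ready:
                # The learner has caught up: ownership moves immediately.
                result[name]["assigned"].append(task)
                transferred.add(task)
            else:
                # Not caught up yet: keep learning.
                result[name]["learning"].append(task)
    for name, meta in members.items():
        for task in meta["revoked"]:
            if task not in transferred:
                # The learner is not ready, so the previous owner resumes the task.
                result[name]["assigned"].append(task)
    for entry in result.values():
        entry["assigned"].sort()
    return result
```

Feeding it the join metadata of the third rebalance with `ready={"S4"}` moves `T1` to S4, returns `T3` to S2, and leaves S5 learning `T3`, which matches the assignment in the example.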
...
For stateless tasks, ownership transfer should happen immediately without a learning stage, because there is nothing to restore. For these tasks we should fall back to the KIP-415 algorithm, where stateless tasks are only revoked during the second rebalance.
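The branching between the two migration paths might look like the following sketch. The function name, the returned fields, and the `"learner"` / `"kip-415"` labels are hypothetical and only mark which path a task would take.

```python
def plan_migration(task_id, is_stateful, destination):
    """Choose how a task moves to `destination` (illustrative only)."""
    if is_stateful:
        # State must be restored first: start a learner on the destination
        # and mark the active task as "being learned" on its current owner.
        return {
            "task": task_id,
            "protocol": "learner",
            "learner_on": destination,
            "mark_being_learned": True,
        }
    # Nothing to restore: no learner is created and the task follows
    # the KIP-415 style revocation path instead.
    return {
        "task": task_id,
        "protocol": "kip-415",
        "learner_on": None,
        "mark_being_learned": False,
    }
```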
...