...
Thus we are proposing a dedicated design specifically for KStream rebalancing in order to holistically smooth the scale up/down experience.
Proposed Changes
...
New Stream Task
...
Type
We will introduce a new type of stream task called `learner task`, which is a special task that gets assigned to one stream instance to restore a current active task state from another instance. It shares the same semantics as standby task, and the only difference is that when the restoration of learner task is complete, the stream instance will initiate a new JoinGroupRequest to call out rebalance of the new task assignment. The goal of learner task is to delay the task migration when the destination host has not finished or even started replaying the active task. This applies to both scale up and scale down scenarios.
Alleviating StopAlleviating Stop-the-world Effect
As mentioned in motivation section, we also want to mitigate the stop-the-world effect of current global rebalance protocol. A quick recap of current rebalance semantics on KStream when rebalance starts, all members would
Join group with all current assigned tasks revoked.
Wait until group stabilized to resume the work.
The reason for doing so is because we need to guarantee each topic partition is assigned with exactly one consumer at a time. So one topic partition could not be re-assigned before it is revoked.
For Kafka Connect, we choose to avoid revoking keep all current assigned tasks running and trade off with one more rebalance. The behavior becomes:
Join group with all current active tasks running.
Sync the revoked partitions and stop them (first rebalance).
Rejoin group immediately with only active tasks (second rebalance).
Feel free to take a look at KIP-415 example to get a sense of how the optimization algorithm works.
For KStream, we are going to take a trade-off between “revoking all” and “revoking none” solution: we shall only revoke tasks that are being learned since last round. So when we assign learner tasks to new member, we shall also mark active tasks as `learnee task` "being learned task" on current owners. Everytime Every time when a rebalance begins, the task owners will revoke the learning bucket being learned tasks and join group without affecting other ongoing tasks. This way learned tasks could immediately transfer ownership without attempting for a second round of rebalance. Compared with KIP-415, we are optimizing for fewer rebalances, but increasing the metadata size and sacrificing partial availability of the learner tasks.
We could illustrate the process with the following example:
Code Block | ||||
---|---|---|---|---|
| ||||
Cluster has 3 stream instances S1(leader), S2, S3, and they each own some tasks 1 ~ 5 Group stable state: S1[T1, T2], S2[T3, T4], S3[T5] #First Rebalance New member S4 joins the group S1 performs task assignments: S1(assigned: [T1, T2], revoked: [], learning: []) S2(assigned: [T3, T4], revoked: [], learning: []) S3(assigned: [T5], revoked: [], learning: []) S4(assigned: [], revoked: [], learning: [T1]) #Second Rebalance New member S5 joins the group. Member S1~S5 join with following metadata: (S4 is not ready yet) S1(assigned: [T2], revoked: [T1], learning: []) // T1 revoked because it's "being learned" S2(assigned: [T3, T4], revoked: [], learning: []) S3(assigned: [T5], revoked: [], learning: []) S4(assigned: [], revoked: [], learning: [T1]) S5(assigned: [], revoked: [], learning: [T3]) S1 performs task assignments: S1(assigned: [T1, T2], revoked: [], learning: []) S2(assigned: [T3, T4], revoked: [], learning: []) S3(assigned: [T5], revoked: [], learning: []) S4(assigned: [], revoked: [], learning: [T1]) S5(assigned: [], revoked: [], learning: [T3]) #Third Rebalance Member S4 finishes its replay and becomes ready, re-attempt to join the group. S5 is not ready yet. Member S1~S5 join with following status:(S5 is not ready yet) S1(assigned: [T2], revoked: [T1], learning: []) S2(assigned: [T4], revoked: [T3], learning: []) // T3 revoked because it's "being learned" S3(assigned: [T5], revoked: [], learning: []) S4(assigned: [], revoked: [], learning: [T1]) S5(assigned: [], revoked: [], learning: [T3]) S1 performs task assignments: S1(assigned: [T2], revoked: [T1], learning: []) S2(assigned: [T3, T4], revoked: [], learning: []) S3(assigned: [T5], revoked: [], learning: []) S4(assigned: [T1], revoked: [], learning: []) S5(assigned: [], revoked: [], learning: [T3]) #Fourth Rebalance Member S5 is ready, re-attempt to join the group. Member S1~S5 join with following status:(S5 is not ready yet) S1(assigned: [T2], revoked: [], learning: []) S2(assigned: [T4], revoked: [T3], learning: []) // T3 revoked because it's "being learned" S3(assigned: [T5], revoked: [], learning: []) S4(assigned: [T1], revoked: [], learning: []) S5(assigned: [], revoked: [], learning: [T3]) S1 performs task assignments: S1(assigned: [T2], revoked: [], learning: []) S2(assigned: [T4], revoked: [T3], learning: []) S3(assigned: [T5], revoked: [], learning: []) S4(assigned: [T1], revoked: [], learning: []) S5(assigned: [T3], revoked: [], learning: []) Now the group reaches balance with 5 members and 5 tasks. |
Note that to make sure the above resource shuffling could happen as expected, we need to have at least three new task status indicator to be provided:
- Which task is learner task? This could be a tag on standby task as "isLearner".
- Which task is being learned? This could be a tag on active task as "isLearned".
- Which learner task has become ready? This could be a tag on standby task as "isReady".
Public Interfaces
We will be adding following new configs:
...
Binary log format
The network protocol and api behavior
Any class in the public packages under clientsConfiguration, especially client configuration
org/apache/kafka/common/serialization
org/apache/kafka/common
org/apache/kafka/common/errors
org/apache/kafka/clients/producer
org/apache/kafka/clients/consumer (eventually, once stable)
Monitoring
Command line tools and arguments
- Anything else that will likely break existing users in some way when they upgrade
Compatibility, Deprecation, and Migration Plan
...