Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagetext
titleOnline Swapping
Group stable state: S1[T1, T2], S2[T3, T4]
Swapping application instances, adding S3, S4 with new instance type.

#First Rebalance 
Member S3, S4 join the group.
S1 performs task assignments:
	S1(assigned: [T1, T2], revoked: [], learning: [])
	S2(assigned: [T3, T4], revoked: [], learning: [])
	S3(assigned: [], revoked: [], learning: [T2])
	S4(assigned: [], revoked: [], learning: [T4])

Use scaling tool to indicate S1 & S2 are leaving.
#Second Rebalance 
Member S1, S2 initiate rebalance to indicate state change (leaving)
Member S1~S4 join with following status: 
	S1(assigned: [T1], revoked: [T2], learning: [])
	S2(assigned: [T3], revoked: [T4], learning: []) 
	S3(assigned: [], revoked: [], learning: [T2])
	S4(assigned: [], revoked: [], learning: [T4])
S1 performs task assignments:
	S1(assigned: [T1, T2], revoked: [], learning: [])
	S2(assigned: [T3, T4], revoked: [], learning: [])
	S3(assigned: [], revoked: [], learning: [T1, T2])
	S4(assigned: [], revoked: [], learning: [T3, T4])

#Third Rebalance 
S3 and S4 finishes replay T1 ~ T4 trigger rebalance.
Member S1~S4 join with following status: 
	S1(assigned: [], revoked: [T1, T2], learning: [])
	S2(assigned: [], revoked: [T3, T4], learning: [])
	S3(assigned: [], revoked: [], learning: [T1, T2])
	S4(assigned: [], revoked: [], learning: [T3, T4])
S1 performs task assignments:
	S1(assigned: [], revoked: [], learning: [])
	S2(assigned: [], revoked: [], learning: [])
	S3(assigned: [T1, T2], revoked: [], learning: [])
	S4(assigned: [T3, T4], revoked: [], learning: [])
S1~S2 will shutdown themselves upon new assignment since there is no assigned task left.

Edge Scenarios

Backing

...

Up Information

...

On Leader 

Since the incremental rebalancing requires certain historical information of last round assignment, the leader worker will need to maintain the knowledge of:

...

Code Block
languagetext
titleLeader crash before
Cluster has 3 stream workers S1(leader), S2 and they own tasks T1 ~ T5
Group stable state: S1[T1], S2[T2, T3, T4, T5]

#First Rebalance 
New member S4 joins the group, at the same time S1 crash.
S2 takes over the leader, while T1 is not assigned now
S2 ~ S4 join with following status
	S2(assigned: [T2, T3, T4, T5], revoked: [], learning: [])
	S3(assigned: [], revoked: [], learning: []) 
	S4(assigned: [], revoked: [], learning: [])
S2 performs task assignments: (no learner assignment since S2 doesn't know S4 is new member)
	S2(assigned: [T2, T3, T4, T5], revoked: [], learning: [])
	S3(assigned: [T1], revoked: [], learning: [])
	S4(assigned: [], revoked: [], learning: [])

Now the group reaches balance, although the eventual load is skewed.

...

Code Block
languagesql
Algorithm incremental-rebalancing

Input Set of Tasks,
	  Set of Instances,
      Set of Workers,

      Where each worker contains:
		Set of active Tasks,
		Set of standby Tasks,
		owned by which instance

Main Function
	
	Assign active tasks: (if any)
		To instances with learner tasks that indicate "ready"
		To previous owners
		To unready learner tasks owners
  	 	To instances with standby tasks
		To instances who are not marked "leaving"	
		To resource available instances

	Keep existing learner tasks' assignment unchanged

 	Pick new learner tasks out of heaviest loaded instances
 
	Assign learner tasks: (if any)
		To new-coming instances with abundant resource
		To instances who are not marked "leaving"	
		To instances with corresponding standby tasks
	Prerequisite is that the instance version supports learner mechanism. 

	Assign standby tasks: (if any)
		To instances without matching active tasks
		To previous active task owners after learner transfer in this round
		To resource availableabundant instances
		To instances who are not marked "leaving"
	Based on num.standby.task config, standby task assignment could take multiple rounds

Output Finalized Task Assignment

...

A stream worker S1 takes two learner tasks T1, T2, where restoring time time(T1) < time(T2). Under eager rebalance approach, the worker will call out rebalance immediately when T1 finishes replaying. While under conservative approach, worker will rejoin the group until it finishes replaying of both T1 and T2.

Standby Task Utilization

...

User naturally wants to reach a sweet spot between ongoing task transfer and streaming resource free-up. So we want to take a similar approach as KIP-415, where we shall introduce a client config to make sure the scale down is time-bounded. If the time takes to migrate tasks outperforms this config, the leader will send out join group request and force removing active tasks on the leaving members and transfer those tasks to other staying members, so that leaving members will shut down themselves immediately after this round of rebalance.

Trade-offs

More Rebalances vs Global Efficiency

The new algorithm will invoke many more rebalances than the current protocol as one could perceive. As we have discussed in the overall incremental rebalancing design, it is not always bad to have multiple rebalances when we do it wisely, and after KIP-345 we have a future proposal to avoid scale up rebalances for static members. The goal is to pre-register the members that are planning to be added. The broker coordinator will augment the member list and wait for all the new members to join the group before rebalancing, since by default stream application’s rebalance timeout is infinity. The conclusion is that: it is server’s responsibility to avoid excessive rebalance, and client’s responsibility to make each rebalance more efficient.

Metadata Space vs

...

Allocation Efficiency

Since we are carrying over more information during rebalance, we should be alerted on the metadata size increase. So far the hard limit is 1MB per metadata response, which means if we add-on too much information, the new protocol could hit hard failure. This is a common pain point for finding better encoding scheme for metadata if we are promoting incremental rebalancing KIPs like 415 and 429. Some thoughts from Guozhang have started in this JIRA and we will be planning to have a separate KIP discussing different encoding technologies and see which one could work.

...

For the smooth delivery of all the features discussed so far, the iteration is broken down divided into four stages:

Version 1.0
Anchor
algorithm_version_1_0
algorithm_version_1_0

...

The goal of first version is to realize the foundation of learner algorithm for scaling up scenario. The leader worker will use previous round assignment to figure out which instances are new ones, and the learner tasks shall only be assigned to new instances onceThe reason for only implementing new instances logic is because there is a potential edge case that could break current naive learner assignment: when the number of tasks are much smaller than total cluster capacity, we could fall in endless resource shuffling. We plan to better address this issue in version 4.0 where we take eventual load balance into consideration. Some discussions have been initiated on marking task weight for different types of tasks for a while. To me, it is unclear so far what kind of eventual balance model we are going to implement at current stage. In conclusion, we want to postpone the finalized design for eventual balance until last version.

...

  1. Create new tooling for marking instances as ready to scale down.
  2. Tag the leaving information for targeted members.
  3. Scale down timeout implementationsupport.

Version 3.0

Delivery goal: Eager rebalance

...

The 4.0 and the final version will take application eventual load balance into consideration. If we define a balancing factor x, the total number of tasks each instance owns should be within the range of +-x% of the expected number of tasks (according to relative instance weightcapacity), which buffers some capacity in order to avoid imbalance. A stream.imbalance.percentage will be provided for the user to configure. The smaller this number sets to, the more strict the assignment protocol will behave. 

...