Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Rebalance happens rarely in static membership. When receiving an existing member's rejoin request, broker will return the cached assignment back to the member, without doing any rebalance.

There are two configs to control the behaviorcase when we must do a rebalance: registration timeout and expansion timeout.


Code Block
titleConsumerConfig.java
public static final STRING REGISTRATION_TIMEOUT_MS = "registration.timeout.ms";

Registration timeout is the timeout we will trigger rebalance when a member goes offline for too long. It should usually be set much larger than session timeout which is used to detect consumer health. By setting it to 15 ~ 30 minutes, we are loosening the track of static member progress, and transfer the member management to client application like K8. Of course, we should not wait forever for the member to back online simply for the purpose of reducing rebalances. Eventually the member will be kicked out of group and a final rebalance is triggered. Note that we are tracking the earliest offline member and compare with the registration timeout. Example below with registration timeout 15 min:

EventTimeearliest timeAction
Member A dropped 00:00 00:00 N/A

Member B dropped

 00:10 00:00N/A
Member A back online 00:14 00:10N/A
B gone for too long00:25 00:10Rebalance

...

Rebalance 

Another case is adding new static memberships (scale up!). This operation should be happening fast enough (to make sure capacity could catch up quickly), we are defining another config called expansion timeout.

Code Block
titleConsumerConfig.java
public static final STRING EXPANSION_TIMEOUT_MS = "expansion.timeout.ms";

This is the timeout when we count down to trigger exactly one rebalance (i.e, the time estimate to spin up # of hosts) since the first new member join group request. It is advised to be set roughly the same with session timeout to make sure the workload become balanced when you 2X or 3X your stream job. Example with expansion timeout 5 min: 

 

EventTimecount timeAction
New member A join 00:00 00:00 N/A

New member B join

 00:03 00:00N/A
Expansion timeout00:05N/ARebalance
New member C join00:0600:06N/A
Expansion timeout00:11N/ARebalance

In this example unfortunately, we triggered two rebalances, because C is too late to catch first round expansion timeout. When C finally joins, it triggers the counter of expansion timeout. After 5 min, another rebalance kicks and assign new tasks to C.

...

Switch between static and dynamic membership

...