Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

There are two configs to control the case when we must do a rebalance: registration timeout and expansion timeout.

Scale down

Code Block
languagescala
titleConsumerConfigGroupMetadata.javascala
public static final STRING REGISTRATION_TIMEOUT_MSdef registrationTimeoutMs = "registration.timeout.ms";Int // Default 3015 min

Registration timeout is the timeout we will trigger rebalance when a member goes offline for too long. It should usually be set much larger than session timeout which is used to detect consumer health. It is monitored through heartbeat the same way as session timeout, and will replace the session timeout in a static membership. The reason we define a different config is because we would like easy switch between dynamic membership and static membership without adding on mental management burden. By setting it to 15 ~ 30 minutes, we are loosening the track of static member progress, and transfer the member management to client application like K8. Of course, we should not wait forever for the member to back online simply for the purpose of reducing rebalances. Eventually the member will be kicked out of group and a final rebalance is triggered. Note that we are tracking the earliest offline member and compare with the registration timeout. Example below with registration timeout 15 min:

...

Adding new static memberships should be straightforward. This operation should be happening fast enough (to make sure capacity could catch up quickly), we are defining another config called expansion timeout.

Code Block
titleConsumerConfigGroupMetadata.javascala
public static final STRING EXPANSION_TIMEOUT_MSdef expansionTimeoutMs = "expansion.timeout.ms";Int // Default 5 min

This is the timeout when we count down to trigger exactly one rebalance (i.e, the time estimate to spin up # of hosts) since the first joined member's request. It is advised to be set roughly the same with session timeout to make sure the workload become balanced when you 2X or 3X your stream job. Example with expansion timeout 5 min: 

...