Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Proposed Assignment/Rebalance Algorithm

At a high level, we are proposing an iterative algorithm that prioritizes recovery time, while planning task movements to improve balance over consecutive rebalances. This proposal builds on KIP-429, so we assume the overhead of rebalancing is much less than it is today.

The task movements are accomplished by assigning the task as a standby task to the destination instance, and then rebalancing once those standbys are up to date. This means we need some mechanism to trigger a rebalance outside of group or topic changes (which are the only triggers currently). There are a number of ways to do this, but we propose to build a UserData field into the heartbeat protocol, so that group members can continuously send their standby progress to the consumer leader. The leader will re-evaluate the assignment balance on each such metadata response, triggering a rebalance when a better balance becomes available.

One advantage of this approach is that the balancing algorithm is pluggable; over time, we can add more detailed information to the heartbeat user data to allow rebalancing in response to CPU/Memory/Disk pressure, etc., instead of purely the number of tasks.

Parameters

  • balance_factor: A scalar integer value representing the target difference in number of tasks assigned to the node with the most tasks vs. the node with the least tasks. Defaults to 1. Must be at least 1.
  • acceptable_recovery_lag: A scalar integer value indicating a task lag (number of offsets to catch up) that is acceptable for immediate assignment. Defaults to 10,000 (should be well under a minute and typically a few seconds, depending on workload). Must be at least 0.
  • num_standbys: A scalar integer indicating the number of hot-standby task replicas to maintain in addition to the active processing tasks. Defaults to 0. Must be at least 0.

...