Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Co-ordinator failure during rebalance

...

A rebalance operation goes through several phases -

  1. Co-ordinator receives notification of a rebalance - either a zookeeper watch fires for a topic/partition change or a new consumer registers or an existing consumer dies.
  2. Co-ordinator initiates a rebalance operation
  3. Consumers send a JoinGroupRequest
  4. Co-ordinator increments the group's generation id in zookeeper
  5. Co-ordinator sends a JoinGroupResponse

Co-ordinator can fail at any of the above phases during a rebalance operation. This section discusses how the failover handles each of these scenarios.

  1. If the co-ordinator fails at step #1 after receiving a notification but not getting a chance to act on it, the new co-ordinator has to be able to detect the need for a rebalance operation on completing the failover. During failover, the co-ordinator reads a group's metadata from zookeeper, including the list of topics the group has subscribed to and the previous partition ownership decision. If the # of topics or # of partitions for the subscribed topics are different from the ones in the previous partition ownership decision, the new co-ordinator detects the need for a rebalance and initiates one for the group. Similarly if the consumers that connect to the new co-ordinator are different from the ones in the group's generation in zookeeper, it initiates a rebalance for the group. For example, if a consumer that is not in the current generation sends a HeartbeatRequest or

Consumer id assignment