Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

                                                                    -- (reject) -->    Consumer (register failed due to some invalid meta info, try to reconnect to the coordinator and register again)

Detecting consumer failure. For every registered consumer, the coordinator will try to check its liveness by periodic pinging it. The replied ping response should include the partition consumed offfset if auto-offset-commit is enabled. If the consumer fails to reply the ping request for a configurable number of times. The coordinator will mark the consumer as failed and trigger the rebalance procedure for the group this consumer belongs to.

...

                                (rebalance)         time out...         Consumer (failed)

Detecting changes to topic partitions. The coordinator keeps a watch for every registered partitions stored in brokers. When a new partition is added (possibly due to a new added broker) or an old partition is deleted (possibly due to a broker failure). It will trigger the rebalance procedure for every group that is consuming this partition.

Process rebalance tasks and communicate the rebalance results to consumers. When a coordinator decides a rebalance is needed for certain group, it will first sends the stop-fetcher request to each consumers in the group and wait for all of them to reply with the last partition consumed offsets. Upon receiving the reply, the consumer should have stopped its fetchers. Then it will send a start-fetcher request to each consumers in the group with the newly computed partition assignment along with the starting offset. The coordinator will finish the rebalance by waiting for all the consumers to finish restarting the fetchers and respond. If any reply is not received after the timeout value, the coordinator will retry the rebalance procedure by marking the consumer as failed and sending the stop-fetcher request to everybody else. If the rebalance cannot finish within a number of rebalance retries, the coordinator will log that as an exception.

...

(retry rebalance)  time out...        Consumer (failed)

When a coordinator is failed, other peer brokers will notice that and try to re-elect to be the new coordinator. The new coordinator will not try to connect to the consumers but wait for the consumers to realize the previous coordinator has dead and try to re-connect to itself.

...