Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

2) Upon receiving the HeartbeatRequest, coordinator checks the generation number, the consumer id and the consumer group. If the consumer specifies an invalid or stale generation id (i.e., consumer id not assigned before, or generation number is smaller than current value), it send an IllegalGeneration error code in the HeartbeatResponse.

3) If coordinator has not heard from a consumer at least once in failed_heartbeat_threshold*( session_timeout /heartbeat_frequency), mark ms, it marks the consumer as dead, break breaks the socket connection to the consumer and trigger triggers a rebalance process for the group.

4) If consumer has not received a heartbeat from the coordinator after session timeout, it treats the coordinator as failed and triggers the co-ordinator re-discovery process.

5) If consumer finds the socket channel to the coordinator to be closed, it treats the coordinator as failed and triggers the co-ordinator re-discovery process..

Note that on coordinator failover, the consumers may discover the new coordinator before or after the new coordinator has finished the failover process including loading the consumer group metadata from ZK, etc. In the latter case, the new coordinator will just accept its ping request as normal; in the former case, the new coordinator may reject its request, causing it to re-dicover coordinators and re-connect again, which is fine. Also, if the consumer connects to the new coordinator too late, the co-ordinator may have marked the consumer has dead and will be treat the consumer as a new consumer, which is also fine.

Request formats

For each consumer group, the coordinator stores the following information:

...

  1. If a consumer receives a higher generation id for it's group, it stops fetching and commits existing offsets before sending a JoinGroupRequest to the co-ordinator.
  2. The co-ordinator checks the generation id in the OffsetCommitRequest and rejects it if the generation id in the request is higher than the generation id on the co-ordinator. This indicates a bug in the consumer code.
  3. The co-ordinator does not allow offset commit requests with generation ids older than the current group generation id in zookeeper either. This constraint is not a problem during a rebalance since until all consumers have sent a JoinGroupRequest, the co-ordinator does not increment the group's generation id. And from that point until the point the co-ordinator sends a JoinGroupResponse, it does not expect to receive any OffsetCommitRequests from any of the consumers in the group in the current generation. So the generation id on every OffsetCommitRequest sent by the consumer should always match the current generation id on the co-ordinator. Another case worth discussing is when a consumer goes through a soft failure e.g. a long GC pause during a rebalance. If a consumer pauses for more than session.timeout.ms, it is marked dead by the co-ordinator. The co-ordinator It will expire the JoinGroupRequests received so far, fail the current rebalance attempt and trigger another rebalance operation.

Heartbeats during rebalance

  1. A
  2.