Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The consumer specifies a session timeout in the JoinGroupRequest that it sends to the co-ordinator in order to join a consumer group. When the consumer has successfully joins a group, the failure detection process starts on the consumer as well as the co-ordinator. The consumer initiates periodic heartbeats (HeartbeatRequest), every session.timeout.ms/heartbeat.frequency to the co-ordinator and waits for a response. If the co-ordinator does not receive a HeartbeatRequest from a consumer for session.timeout.ms, it marks the consumer dead. Similarly, if the consumer does not receive the HeartbeatResponse within session.timeout.ms, it assumes the co-ordinator is dead and starts the co-ordinator rediscovery process. Here is the protocol in more detail -1)

  1. After receiving a ConsumerMetadataResponse or a JoinGroupResponse, a consumer periodically sends a HeartbeatRequest to the coordinator every

...

  1. session.timeout.ms/heartbeat.frequency milliseconds.

...

  1. Upon receiving the HeartbeatRequest, coordinator checks the generation number, the consumer id and the consumer group. If the consumer specifies an invalid or stale generation

...

  1. id

...

  1. , it send an IllegalGeneration error code in the HeartbeatResponse.

...

  1. If coordinator does not receive a HeartbeatRequest from a consumer at least once in session.timeout.ms, it marks the consumer dead

...

  1. and triggers a rebalance process for the group.

...

  1. If consumer does not receive a HeartbeatResponse from the coordinator after session.timeout.ms or finds the socket channel to the coordinator to be closed, it treats the coordinator as failed and triggers the co-ordinator re-discovery process.

5) If consumer finds the socket channel to the coordinator to be closed, it treats the coordinator as failed and triggers the co-ordinator re-discovery process..

Note that on coordinator Note that on coordinator failover, the consumers may discover the new coordinator before or after the new coordinator has finished the failover process including loading the consumer group metadata from ZK, etc. In the latter case, the new coordinator will just accept its ping request as normal; in the former case, the new coordinator may reject its request, causing it to re-dicover the co-ordinator and re-connect again, which is fine. Also, if the consumer connects to the new coordinator too late, the co-ordinator may have marked the consumer dead and will be treat the consumer as a new consumer, which is also fine.

...

  1. After startup, a consumer learns it's consumer id in the very first JoinGroupResponse it receives from the co-ordinator. From that point onwards, the consumer is expected to include this consumer id in every request it sends to the co-ordinator (HeartbeatRequest, JoinGroupRequest, OffsetCommitRequest). If the co-ordinator receives a HeartbeatRequest or an OffsetCommitRequest with a consumer id that is different from the ones in the group, it sends an UnknownConsumer error code in the corresponding responses.
  2. The co-ordinator assigns a consumer id to a consumer on a successful rebalance and sends it in the JoinGroupResponse. The consumer should include this id in every subsequent JoinGroupRequest as well until it is shutdown or dies.
  3. The co-ordinator does consumer id assignment after it has received a JoinGroupRequest from all existing consumers in a group. At this point, it assigns a new id <group>-<consumer_host>-<sequence> to every consumer that did not send have a consumer id in the JoinGroupRequest. The assumption is that such consumers are newly started up.
  4. If a consumer fails to send the same consumer id on subsequent JoinGroupRequests, it will cause a chain of rebalance attempts and can cause the group to never finish a rebalance operation successfully. This is because the way a co-ordinator knows that a rebalance operation should be triggered due to a new consumer, is by checking the consumer id in the JoinGroupRequest. If there is no consumer id, it assumes that a new consumer wants to join the group.
  5. If a consumer id is specified in the JoinGroupRequest but it does not match the ids in the current group membership, the co-ordinator sends an UnknownConsumer error code in the JoinGroupResponse and prevents the consumer from joining the group. This does not cause a rebalance operation for the rest of the consumers in the group, but also does not allow such a consumer to join an existing group.

...

Code Block
{
  CorrelationId          => int32
  ErrorCode              => int16
  Coordinator  CoordinatorId          => Broker
  CoordinatorEpoch       => int32
}

...

 

Code Block
{
  Version                => int16
  CorrelationId          => int32
  GroupId                => String
  ConsumerHost           => String
  SessionTimeout         => int32
  Topics                 => [String]
  DoesConsumerIdExist    => int16
  ConsumerId             => String 
 }

JoinGroupResponse

Code Block
{
  Version                => int16
  CorrelationId          => int32
  ErrorCode              => int16
  GroupId GroupGenerationId      => int32
  GroupId      => String 
  GroupGenerationId      => String int32
  ConsumerId             => String
  PartitionsToOwn        => [TopicAndPartition]
}

 

...

With wildcard subscription (for example, whitelist and blacklist), the consumers are responsible to discover matching topics through topic metadata request. That is, its topic metadata request will contain an empty topic list, whose response then will return the partition info of all topics, it will then filter the topics that match its wildcard expression, and then updates the subscription list. Again, if the subscription list has changed from the previous values, it will start trigger rebalance for the consumer re-join group procedure.

Interesting scenarios to consider

...

  1. The co-ordinator for a group detects changes to the number of partitions for the subscribed list of topics.
  2. The co-ordinator then marks the group ready for rebalance and closes the socket connection to all consumers in the group to trigger a rebalance operation.On losing connection to the co-ordinator, the consumers go through the co-ordinator re-discovery process. If the consumer receives a sends the IllegalGeneration error code in the HeartbeatResponse, it . The consumer then stops fetching data, commits offsets and sends a JoinGroupRequest to the co-ordinator.
  3. The co-ordinator waits for all consumers to send it the JoinGroupRequest for the group. Once it receives all expected JoinGroupRequests, it increments the group's generation id in zookeeper, computes the new partition assignment and returns the updated assignment and the new generation id in the JoinGroupResponse
  4. On receiving the JoinGroupResponse, the consumer stores the new generation id locally and the consumer id locally and starts fetching data for the returned set of partitions. The subsequent requests sent by the consumer to the co-ordinator will use the new generation id returned by the last JoinGroupResponse.

...

  1. If a consumer receives an IllegalGeneration error code, it stops fetching and commits existing offsets before sending a JoinGroupRequest to the co-ordinator.
  2. The co-ordinator checks the generation id in the OffsetCommitRequest and rejects it if the generation id in the request is higher than the generation id on the co-ordinator. This indicates a bug in the consumer code.
  3. The co-ordinator does not allow offset commit requests with generation ids older than the current group generation id in zookeeper either. This constraint is not a problem during a rebalance since until all consumers have sent a JoinGroupRequest, the co-ordinator does not increment the group's generation id. And from that point until the point the co-ordinator sends a JoinGroupResponse, it does not expect to receive any OffsetCommitRequests from any of the consumers in the group in the current generation. So the generation id on every OffsetCommitRequest sent by the consumer should always match the current generation id on the co-ordinator.
  4. Another case worth discussing is when a consumer goes through a soft failure e.g. a long GC pause during a rebalance. If a consumer pauses for more than session.timeout.ms, the co-ordinator will not receive a JoinGroupRequest from such a consumer within session.timeout.ms. The co-ordinator marks the consumer dead and expires the JoinGroupRequests received so far and marks the current rebalance attempt failed. It will continue to return the IllegalGeneration error code in HeartbeatResponses until the consumers sent send it another JoinGroupRequest.

...

  1. Consumer periodically sends a HeartbeatRequest to the coordinator every 0.3*every session.timeout.ms/hearbeat.frequency milliseconds. If the consumer receives the IllegalGeneration error code in the HeartbeatResponse, it stops fetching, commits offsets and sends a JoinGroupRequest to the co-ordinator. Until the consumer receives a JoinGroupResponse, it does not send any more HearbeatRequests to the co-ordinator. 
  2. A higher heartbeat.frequency ensures lower latency on a rebalance operation since the co-ordinator notifies a consumer of the need to rebalance only on a HeartbeatResponse. To protect the broker from receiving too many heartbeats, the broker will have a heartbeat.frequency.threshold config that puts a lower bound on the frequency of heartbeats sent by consumers.
     
  3. The co-ordinator pauses failure detection for a consumer that has sent a JoinGroupRequest until a JoinGroupResponse is sent out. It restarts the hearbeat timer once the JoinGroupResponse is sent out and marks a consumer dead if it does not receive a HeartbeatRequest from that time for another session.timeout.ms milliseconds. This prevents the co-ordinator from marking a consumer dead during a rebalance operation. Note that this does not prevent the rebalance operation from finishing if a consumer goes through a soft failure during a rebalance operation. If the consumer pauses before it sends a JoinGroupRequest, the co-ordinator will mark it dead during the rebalance and send a JoinGroupResponse to the rest of the consumers with a RebalanceFailed error code. If a consumer pauses after it sends a JoinGroupRequest, the co-ordinator will send it the JoinGroupResponse assuming the rebalance completed successfully and will restart it's heartbeat timer. If the consumer resumes before session.timeout.ms, consumption starts normally. If the consumer pauses for session.timeout.ms after that, then it is marked dead by the co-ordinator and it will trigger a rebalance operation.
  4. The co-ordinator returns the new generation id only in the JoinGroupResponse. Once the consumer receives a JoinGroupResponse, it sends the next HeartbeatRequest with the new generation id.

...