...

Name: max.poll.interval.ms
Description: The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.
Type: int
Default: 300000
Valid Values: [1,...]
Importance: medium

This timeout is typically hit when the application code that processes the consumer's fetched records takes longer than max.poll.interval.ms. Hitting the timeout causes the consumer to leave the group and triggers a rebalance (unless it is a static member as described in KIP-345: Introduce static membership protocol to reduce consumer rebalances). If processing time was the only issue, a non-static consumer will end up rejoining the group. This scenario is not ideal, as rebalancing disrupts processing and takes additional time.

Sometimes a long processing time is unavoidable if:

...

In such cases, the user must fine-tune the configurations to fit their use case; however, detecting such events is currently difficult. The only ways to definitively identify this scenario are to search the application logs or for the user to record processing time themselves. The consumer will log an error when max.poll.interval.ms is hit:

Code Block
Member {} sending LeaveGroup request to coordinator {} due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
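The two remedies named in the log message can be related with some rough arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, it assumes processing time is spread evenly across the records of a batch, and it uses 500 for max.poll.records (believed to be the consumer default) alongside the 300000 ms default shown above.

```java
import java.util.Properties;

// Hypothetical helper, not part of the Kafka client.
final class PollBudget {
    // Rough per-record budget: the whole batch returned by poll() must be
    // processed before max.poll.interval.ms expires.
    static long perRecordBudgetMs(long maxPollIntervalMs, int maxPollRecords) {
        return maxPollIntervalMs / maxPollRecords;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("max.poll.interval.ms", "300000");
        props.setProperty("max.poll.records", "500"); // assumed default

        long intervalMs = Long.parseLong(props.getProperty("max.poll.interval.ms"));
        int records = Integer.parseInt(props.getProperty("max.poll.records"));

        // 300000 ms / 500 records = 600 ms per record
        System.out.println(perRecordBudgetMs(intervalMs, records));
        // Halving the batch size doubles the per-record budget: 1200 ms
        System.out.println(perRecordBudgetMs(intervalMs, records / 2));
    }
}
```

Either increasing max.poll.interval.ms or lowering max.poll.records raises the per-record budget; which one is appropriate depends on how quickly a failed consumer needs to be detected.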

An application owner can write code to measure processing time, but Kafka operators cannot: they must rely on the application owner to implement such instrumentation. If the application owner does not provide it, the Kafka operator has no access to this data.
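As an illustration of the kind of instrumentation an application owner would otherwise have to write by hand, here is a minimal sketch. The class PollIntervalTracker and its methods are hypothetical and not part of the Kafka client; timestamps are passed in explicitly so the example runs without a broker. In a real poll loop, recordPoll would presumably be called with a monotonic clock reading just before each poll().

```java
// Hypothetical application-side tracker of the time between poll() calls.
final class PollIntervalTracker {
    private long lastPollMs = -1; // -1 means no poll has been recorded yet
    private long count;
    private long totalMs;
    private long maxMs;

    // Call immediately before each poll(); nowMs is a clock reading in ms.
    void recordPoll(long nowMs) {
        if (lastPollMs >= 0) {
            long elapsed = nowMs - lastPollMs;
            count++;
            totalMs += elapsed;
            maxMs = Math.max(maxMs, elapsed);
        }
        lastPollMs = nowMs;
    }

    // Average time between polls, or 0.0 if fewer than two polls were seen.
    double avgMs() {
        return count == 0 ? 0.0 : (double) totalMs / count;
    }

    // Longest observed time between two consecutive polls.
    long maxMs() {
        return maxMs;
    }

    public static void main(String[] args) {
        PollIntervalTracker tracker = new PollIntervalTracker();
        // Simulated poll timestamps: gaps of 100, 300 and 500 ms.
        for (long now : new long[] {0, 100, 400, 900}) {
            tracker.recordPoll(now);
        }
        System.out.println("avg=" + tracker.avgMs() + " max=" + tracker.maxMs());
        // prints "avg=300.0 max=500"
    }
}
```

Because this state lives inside the application, an operator only sees it if the owner also exports it, which is the gap described above.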

It would be beneficial to add a metric to record the average/max time between calls to poll as it can be used by both Kafka application owners and operators to:

  • Easily identify if/when max.poll.interval.ms needs to be changed (and to what value)
  • View trends/patterns
  • Verify max.poll.interval.ms was hit using the max metric when debugging consumption issues (if logs are not available)
  • Configure alerts to notify when average/max time is too close to max.poll.interval.ms
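The alerting use case in the last bullet could be sketched as a simple threshold check. Everything here is an assumption made for illustration: the class name, the method, and the 0.8 threshold are hypothetical, and the observed value stands in for whatever average or max reading a monitoring system would supply.

```java
// Hypothetical alert rule comparing observed poll gaps to the configured limit.
final class PollIntervalAlert {
    // True when the observed time between polls has reached at least
    // `threshold` (a fraction between 0 and 1) of max.poll.interval.ms.
    static boolean tooClose(double observedMs, long maxPollIntervalMs, double threshold) {
        return observedMs >= threshold * maxPollIntervalMs;
    }

    public static void main(String[] args) {
        long maxPollIntervalMs = 300000; // default from the configuration above
        double threshold = 0.8;          // assumed alerting threshold

        System.out.println(tooClose(250000, maxPollIntervalMs, threshold)); // true
        System.out.println(tooClose(120000, maxPollIntervalMs, threshold)); // false
    }
}
```

An alert on the max value catches single slow batches, while one on the average indicates a sustained trend; the threshold would need tuning per application.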

...