...

A timeout from this configuration typically happens when the application code that processes the consumer's fetched records takes too long (longer than max.poll.interval.ms). Hitting this timeout causes the consumer to leave the group and triggers a rebalance (unless it is a static member, as described in KIP-345: Introduce static membership protocol to reduce consumer rebalances). If processing time was the only issue and the consumer is not a static member, it will end up rejoining the group. This scenario is not ideal, because rebalancing disrupts processing and takes additional time.
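For reference, the two settings discussed above can be sketched as plain consumer properties. This is only an illustration: the class name, broker address, group name, and instance id are placeholders, and 300000 ms is the client's documented default for max.poll.interval.ms rather than a recommended value.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of the Kafka client API.
final class ConsumerConfigSketch {
    static Properties build() {
        Properties props = new Properties();
        // Placeholder connection and group settings (assumptions).
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "example-group");
        // Maximum allowed delay between calls to poll() before the
        // consumer is considered failed. 300000 ms is the default.
        props.setProperty("max.poll.interval.ms", "300000");
        // Setting group.instance.id makes the consumer a static member
        // (KIP-345). The id shown here is a placeholder.
        props.setProperty("group.instance.id", "example-consumer-1");
        return props;
    }
}
```

These properties would be passed to the consumer constructor together with the usual deserializer settings, which are omitted here.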

...

It would be beneficial to add a metric that records the average and maximum time between calls to poll, since both Kafka application owners and operators can use it to:

  • Easily identify if/when max.poll.interval.ms needs to be changed (and to what value)
  • View trends/patterns
  • Verify that max.poll.interval.ms was hit, using the max metric, when debugging consumption issues (if logs are not available)
  • Configure alerts to notify when the average/max time is too close to max.poll.interval.ms
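To make the idea concrete, the following is a minimal application-side sketch of how the average and maximum time between poll calls could be tracked. It is not the proposed client metric: the class PollIntervalTracker, its method names, and the alert ratio are all hypothetical and chosen only for illustration.

```java
// Hypothetical helper for illustration; not part of the Kafka client API.
final class PollIntervalTracker {
    private boolean hasPolled = false; // true once the first poll is recorded
    private long lastPollNanos = 0L;   // timestamp of the previous poll
    private long intervalCount = 0L;   // number of recorded intervals
    private long totalNanos = 0L;      // sum of all recorded intervals
    private long maxNanos = 0L;        // longest recorded interval

    // Call right before each poll, passing a monotonic timestamp
    // such as System.nanoTime().
    void recordPoll(long nowNanos) {
        if (hasPolled) {
            long elapsed = nowNanos - lastPollNanos;
            intervalCount++;
            totalNanos += elapsed;
            maxNanos = Math.max(maxNanos, elapsed);
        }
        hasPolled = true;
        lastPollNanos = nowNanos;
    }

    // Average time between polls in milliseconds (0.0 if no interval yet).
    double avgMillis() {
        return intervalCount == 0 ? 0.0 : (totalNanos / (double) intervalCount) / 1_000_000.0;
    }

    // Maximum time between polls in milliseconds.
    double maxMillis() {
        return maxNanos / 1_000_000.0;
    }

    // True when the max interval has reached the given fraction of
    // max.poll.interval.ms, e.g. ratio = 0.8 for an 80% alert threshold.
    boolean isCloseToLimit(long maxPollIntervalMs, double ratio) {
        return maxMillis() >= maxPollIntervalMs * ratio;
    }
}
```

In a consumer loop, recordPoll(System.nanoTime()) would be called immediately before each poll, and isCloseToLimit could drive an alert of the kind described in the last bullet.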

Example Usage

An application owner reports that their consumers are seeing the max.poll.interval.ms timeout error log mentioned in the previous section. The application owner may claim that their application code is fine and that the Kafka infrastructure is broken. The application has no instrumentation measuring its processing time, which makes it difficult for the Kafka operator to prove otherwise and resolve the issue.

...