You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 12 Next »

Status

Current state: DRAFTING

Discussion thread: here

JIRA: here

Motivation

The KafkaConsumer is a complex client that incorporates different configurations for detecting consumer failure to allow remaining consumers to pick up the partitions of failed consumers. One such configuration is max.poll.interval.ms which is defined as:

max.poll.interval.msThe maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.int300000[1,...]medium

Timeout from this configuration typically happens when the application code to process the consumer's fetched records takes too long (longer than max.poll.interval.ms). Hitting this timeout will cause the consumer to leave the group and trigger a rebalance (if it is not a static member as described in KIP-345: Introduce static membership protocol to reduce consumer rebalances). The consumer will end up rejoining the group if processing time was the only issue (and not a static member). This scenario is not ideal as rebalancing will disrupt processing and take additional time.

Additionally, sometimes a long processing time is unavoidable if:

  1. Minimum processing time is long to begin with (ex. hit a database which takes 1 second per record minimum)
  2. Processing involves talking to a downstream service which sometimes causes a spike in processing time (ex. intermittent load issues)

In such cases, the user must fine-tune the configurations to fit their use-case however detection of such events is difficult. The only way to definitely identify this scenario is by searching application logs or the user must record their processing time on their own. The consumer will log an error when max.poll.interval.ms is hit:

Member {} sending LeaveGroup request to coordinator {} due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.

An application owner has the ability to write code to measure processing time, but Kafka operators are out of luck as they must get the application owner to implement such instrumentation. If the application owner does not provide this, then the Kafka operator does not have this data.

It would be beneficial to add a metric to record the average/max time between calls to poll as it can be used by both Kafka application owners and operators to:

  • Easily identify if/when max.poll.interval.ms needs to be changed (and to what value)
  • View trends/patterns
  • Verify max.poll.interval.ms was hit using the max metric when debugging consumption issues (if logs are not available)
  • Configure alerts to notify when average/max time is too close to max.poll.interval.ms

Public Interfaces

We will add the following metrics:

Metric NameMbean NameDescription
time-between-poll-avg kafka.consumer:type=consumer-metrics,client-id=([-.\w]+)The average delay between invocations of poll().
time-between-poll-max kafka.consumer:type=consumer-metrics,client-id=([-.\w]+)The max delay between invocations of poll().

Proposed Changes

As we want this metric to measure the time that the user takes to call poll() , we will store a long lastPollMs  in KafkaConsumer. We will calculate the elapsed time (and update lastPollMs on every call to poll.

Compatibility, Deprecation, and Migration Plan

As this KIP simply adds new metrics, there is no issue regarding compatibility, deprecation, or migration plan.

Rejected Alternatives

None at the moment.


  • No labels