Status

Current state: DRAFTING

Discussion thread: here

JIRA: here

Motivation

The KafkaConsumer is a complex client that incorporates different configurations for detecting consumer failure to allow remaining consumers to pick up the partitions of failed consumers. One such configuration is max.poll.interval.ms which is defined as:

max.poll.interval.ms

The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.

int

300000

[1,...]

medium

This scenario is typically hit when the application code to process the consumer's fetched records takes too long (longer than max.poll.interval.ms). Hitting this timeout will cause the consumer to leave the group and trigger a rebalance (if it is not a static member as described in KIP-345: Introduce static membership protocol to reduce consumer rebalances). The consumer will end up rejoining the group if processing time was the only issue.

Sometimes a long processing time is unavoidable if:

Minimum processing time is long to begin with (ex. hit a database which takes 1 second per record minimum)
Processing involves talking to a downstream service which sometimes causes a spike in processing time (ex. intermittent load issues)

In such cases, the user must fine-tune the configurations to fit their use-case however detection of such events is currently difficult. The only way to identify this scenario is by searching application logs or the user must record their processing time on their own. The consumer will log an error when max.poll.interval.ms is hit:

Member {} sending LeaveGroup request to coordinator {} due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.

It would be beneficial to add a metric to record the average/max time between calls to poll as it can be used by both Kafka application owners and operators to:

Easily identify if/when max.poll.interval.ms needs to be changed (and to what value)
View trends/patterns
Verify max.poll.interval.ms was hit using the max metric (if logs are not available)
Configure alerts if average/max time is too close to max.poll.interval.ms

Public Interfaces

We will add the following metrics:

Proposed Changes

TODO

Compatibility, Deprecation, and Migration Plan

As this KIP simply adds new metrics, there is no issue regarding compatibility, deprecation, or migration plan.

Space shortcuts

Child pages

Status

Motivation

Public Interfaces

Proposed Changes

Compatibility, Deprecation, and Migration Plan

Rejected Alternatives

Space shortcuts

Child pages

KIP-517: Add consumer metric indicating time between poll calls

Status

Motivation

Public Interfaces

Proposed Changes

Compatibility, Deprecation, and Migration Plan

Rejected Alternatives