Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Status

Current state: DISCUSSIONACCEPTED

Discussion thread: here

Vote Thread: here

JIRA:

Jira
serverASF JIRA
serverId5aa69414-a9e9-3523-82ec-879b028fb15b
keyKAFKA-8874

Motivation

The KafkaConsumer is a complex client that incorporates different configurations for detecting consumer failure to allow remaining consumers to pick up the partitions of failed consumers. One such configuration is max.poll.interval.ms which is defined as:

...

Timeout from this configuration typically happens when the application code to process the consumer's fetched records takes too long (longer than max.poll.interval.ms ). Hitting this timeout will cause the consumer to leave the group and trigger a /rebalance (if it is not a static member as described in KIP-345: Introduce static membership protocol to reduce consumer rebalances). The consumer will end up rejoining the group if processing time was the only issue (and not a static member). This scenario is not ideal as rebalancing will disrupt processing and take additional time.

...

It would be beneficial to add a metric to record the average/max time between calls to poll metrics to track poll calls as it can be used by both Kafka application owners and operators to:

  • Easily identify if/when max.poll.interval.ms needs   needs to be changed (and to what value)
  • View trends/patterns
  • Verify max.poll.interval.ms was   was hit using the max metric when debugging consumption issues (if logs are not available)
  • Configure alerts to notify when average/max time is too close to to max.poll.interval.ms 

Example Usage

An application owner reports that their consumers are seeing the max.poll.interval.ms timeout error log mentioned in the previous section. The application owner may claim that their application code is fine and that the Kafka infrastructure is broken. The application does not have any instrumentation measuring the processing time of their application which makes it difficult for the Kafka operator to prove otherwise and resolve the issue.

...

  • Unusual anomaly: Check health/performance of dependencies and consumer host
  • Processing time was always close to max.poll.interval.ms: Increase max.poll.interval.ms or decrease max.poll.records to avoid hitting it due to variance

Public Interfaces

We will add the following metrics:

Metric NameMbean NameDescription
time-between-poll-avg kafka.consumer:type=consumer-metrics,client-id=([-.\w]+)The average delay between invocations of poll().
time-between-poll-max kafka.consumer:type=consumer-metrics,client-id=([-.\w]+)The max delay between invocations of poll().
last-poll-seconds-ago kafka.consumer:type=consumer-metrics,client-id=([-.\w]+)The number of seconds since the last poll() invocation.
poll-idle-ratio-avgkafka.consumer:type=consumer-metrics,client-id=([-.\w]+)The average fraction of time the consumer's poll() is idle as opposed to waiting for the user code to process records.

Proposed Changes

time-between-poll (avg/max)

As we want this metric to measure the time that the user takes to call poll(), we will store a long lastPollMs in KafkaConsumer. We will calculate the elapsed time (and update lastPollMs) on every call to poll.

On each call to poll(), we will calculate the elapsed time since lastPollMs, record it in the Sensor which has Avg & Max, and update lastPollMs .

last-poll-seconds-ago

The last-poll-seconds-ago  metric will measure the time since the user last called poll() to provide more insight into how/when the user code calls poll()  .

poll-idle-ratio-avg

There will be an additional metric that measures the idle time of the consumer in waiting for the user to process records returned from poll:

LaTeX Formatting
$\textrm{poll-idle-ratio-avg} = \frac{\textrm{time-inside-poll}}{\textrm{total-time}}$

This metric will have a value between 0.0 and 1.0.

A value approaching 1.0 means consumer is idle in poll (ex. waiting for records) while a value approaching 0.0 means the application is busy processing in user code.

A low value (approaching 0.0) could indicate a potential issue or performance bottleneck in user code.

Compatibility, Deprecation, and Migration Plan

As this KIP simply adds new metrics, there is no issue regarding compatibility, deprecation, or migration.

Rejected Alternatives

None at the moment.