Status

Current state: Accepted

Discussion thread: here

JIRA: KAFKA-6263

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

When a __consumer_offsets partition changes leadership, the new coordinator must load the consumer offset cache from the start of the log, which can take arbitrarily long depending on how large the partition has grown. The problem is hard to diagnose because it looks like a long period of inactivity in the consumer group rather than an actual fault. A broker metric that exposes how long a load took would explain these pauses. For the same reason, a corresponding metric should be added for the transaction state log, covering the case where the broker is elected leader for one of its partitions and loads state from that partition.

Public Interfaces

Add the following metrics via a sensor:

  • kafka.server:type=group-coordinator-metrics,name=group-load-time-max

Type: SampledStat.Max

Value: 0 or greater over time; maximum time, in milliseconds, it took to load offsets and group metadata from the __consumer_offsets partitions loaded in the last 30 seconds.

  • kafka.server:type=group-coordinator-metrics,name=group-load-time-avg

Type: SampledStat.Avg

Value: 0 or greater over time; average time, in milliseconds, it took to load offsets and group metadata from the __consumer_offsets partitions loaded in the last 30 seconds.

Note: this average may look very low when a majority of the partitions are unused, because their load times are close to 0 milliseconds and pull the average down.

  • kafka.server:type=transaction-coordinator-metrics,name=transaction-load-time-max

Type: SampledStat.Max

Value: 0 or greater over time; maximum time, in milliseconds, it took to load offsets and transaction state from the __transaction_state partitions loaded in the last 30 seconds.

  • kafka.server:type=transaction-coordinator-metrics,name=transaction-load-time-avg

Type: SampledStat.Avg

Value: 0 or greater over time; average time, in milliseconds, it took to load offsets and transaction state from the __transaction_state partitions loaded in the last 30 seconds.

Note: this average may look very low when a majority of the partitions are unused, because their load times are close to 0 milliseconds and pull the average down.
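The effect described in the notes on the two -avg metrics can be illustrated with a small worked example. The numbers are hypothetical and not taken from a real broker: 49 unused partitions that load in 0 ms and one large partition that takes 60,000 ms. The class name is also only illustrative.

```java
import java.util.LongSummaryStatistics;
import java.util.stream.LongStream;

final class LoadTimeAverageExample {
    // Hypothetical load times in milliseconds: 49 empty partitions (0 ms each)
    // and a single large partition that took 60,000 ms to load.
    static LongSummaryStatistics summarize() {
        return LongStream.concat(
                LongStream.generate(() -> 0L).limit(49),
                LongStream.of(60_000L))
            .summaryStatistics();
    }

    public static void main(String[] args) {
        LongSummaryStatistics stats = summarize();
        System.out.println("max ms = " + stats.getMax());     // max ms = 60000
        System.out.println("avg ms = " + stats.getAverage()); // avg ms = 1200.0
    }
}
```

In this scenario the average is 50 times smaller than the maximum, which suggests the -max metric is the more useful one to watch when looking for a single slow load.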

Proposed Changes

For each of the group metadata manager and the transaction state manager, add a sensor that records the maximum and average time, in milliseconds, taken to load a partition. The max and average are computed over a running window covering the partitions that finished loading in the last 30 seconds.
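The running-window behaviour can be sketched in plain Java as follows. This is a simplified illustration, not Kafka's SampledStat implementation: the class LoadTimeWindow and its methods are hypothetical, timestamps are assumed to be passed in non-decreasing order, and returning 0 for an empty window is an assumption of this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a time-windowed max/avg; not Kafka's SampledStat.
final class LoadTimeWindow {
    private final long windowMs;
    // Each entry is {recordedAtMs, loadTimeMs}, oldest first.
    private final Deque<long[]> samples = new ArrayDeque<>();

    LoadTimeWindow(long windowMs) {
        this.windowMs = windowMs;
    }

    // Record that a partition finished loading at nowMs after loadTimeMs.
    void record(long nowMs, long loadTimeMs) {
        samples.addLast(new long[] {nowMs, loadTimeMs});
    }

    // Drop samples recorded windowMs or more before nowMs.
    private void expire(long nowMs) {
        while (!samples.isEmpty() && nowMs - samples.peekFirst()[0] >= windowMs) {
            samples.removeFirst();
        }
    }

    // Maximum load time in the window; 0 if the window is empty (assumption).
    double max(long nowMs) {
        expire(nowMs);
        return samples.stream().mapToLong(s -> s[1]).max().orElse(0L);
    }

    // Average load time in the window; 0 if the window is empty (assumption).
    double avg(long nowMs) {
        expire(nowMs);
        return samples.stream().mapToLong(s -> s[1]).average().orElse(0.0);
    }

    public static void main(String[] args) {
        LoadTimeWindow window = new LoadTimeWindow(30_000L);
        window.record(0L, 0L);
        window.record(1_000L, 0L);
        window.record(2_000L, 45_000L);
        System.out.println(window.max(10_000L)); // 45000.0
        System.out.println(window.avg(10_000L)); // 15000.0
        System.out.println(window.max(40_000L)); // 0.0, all samples have expired
    }
}
```

Passing the current time explicitly keeps the sketch deterministic; a real sensor would read it from a clock.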

Compatibility, Deprecation, and Migration Plan

This KIP only adds new metrics.

Rejected Alternatives

Use Yammer metrics instead of Kafka metrics.

  • Yammer metrics don’t support the type of metric that we want to show the client.


