Status

Current state: Accepted

Discussion thread: TODO here

JIRA: KAFKA-6263

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

When a partition changes leadership, the new coordinator must load the consumer offset cache from the start of the log, which can take arbitrarily long depending on how large the partition has grown. The problem is hard to diagnose because it looks like a long period of inactivity in the consumer group rather than an actual fault. A broker metric that exposes how long a load took would make the cause visible. For the same reason, a similar metric should be added for the transaction state log, covering the time the broker spends loading state from a partition after it is elected leader for that partition.

Public Interfaces

Add the following metrics via a sensor:

  • kafka.server:type=group-coordinator-metrics,name=group-load-time-max

Type: SampledStat.Max

Value: 0 or greater over time; maximum time, in milliseconds, it took to load offsets and group metadata from the __consumer_offsets partitions loaded in the last 30 seconds.

  • kafka.server:type=group-coordinator-metrics,name=group-load-time-avg

Type: SampledStat.Avg

Value: 0 or greater over time; average time, in milliseconds, it took to load offsets and group metadata from the __consumer_offsets partitions loaded in the last 30 seconds.

Note: this average can look very low when a majority of the partitions are unused, because those partitions report load times of 0 and pull the average down.

  • kafka.server:type=transaction-coordinator-metrics,name=transaction-load-time-max

Type: SampledStat.Max

Value: 0 or greater over time; maximum time, in milliseconds, it took to load offsets and transaction state from the __transaction_state partitions loaded in the last 30 seconds.

  • kafka.server:type=transaction-coordinator-metrics,name=transaction-load-time-avg

Type: SampledStat.Avg

Value: 0 or greater over time; average time, in milliseconds, it took to load offsets and transaction state from the __transaction_state partitions loaded in the last 30 seconds.

Note: this average can look very low when a majority of the partitions are unused, because those partitions report load times of 0 and pull the average down.
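To make the notes about the average concrete, here is a small arithmetic sketch. The load times are invented for illustration (three unused partitions and one large one) and the class name is hypothetical; it uses only the Java standard library and is not broker code.

```java
import java.util.DoubleSummaryStatistics;
import java.util.stream.DoubleStream;

final class LoadTimeAverageDemo {

    // Summarize hypothetical per-partition load times, in milliseconds.
    static DoubleSummaryStatistics summarize(double... loadTimesMs) {
        return DoubleStream.of(loadTimesMs).summaryStatistics();
    }

    public static void main(String[] args) {
        // Assumed data: three unused partitions load in 0 ms, one large partition takes 12 s.
        DoubleSummaryStatistics stats = summarize(0, 0, 0, 12_000);
        System.out.println("avg ms = " + stats.getAverage()); // avg ms = 3000.0
        System.out.println("max ms = " + stats.getMax());     // max ms = 12000.0
    }
}
```

With these numbers the average reports 3 seconds while one partition actually took 12, which suggests reading the avg metric together with the corresponding max metric.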

Proposed Changes

For each of the group metadata manager and transaction state manager, add a sensor that records the max and average number of milliseconds it took to load each partition. The max and average are computed over a running window based on the partitions that finished loading in the last 30 seconds.
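The windowing behaviour described above can be sketched roughly as follows. This is a simplified, standard-library stand-in and not Kafka's SampledStat implementation: the class WindowedLoadTime, its methods, and the choice to report 0 for an empty window are all assumptions made for illustration. It also assumes samples are recorded in non-decreasing timestamp order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical windowed max/avg of partition load times; not Kafka's SampledStat. */
final class WindowedLoadTime {

    private static final class Sample {
        final long finishedAtMs; // when the partition finished loading
        final long loadTimeMs;   // how long the load took

        Sample(long finishedAtMs, long loadTimeMs) {
            this.finishedAtMs = finishedAtMs;
            this.loadTimeMs = loadTimeMs;
        }
    }

    private final long windowMs;
    private final Deque<Sample> samples = new ArrayDeque<>();

    WindowedLoadTime(long windowMs) {
        this.windowMs = windowMs;
    }

    /** Record a partition that finished loading at nowMs after loadTimeMs milliseconds. */
    void record(long nowMs, long loadTimeMs) {
        samples.addLast(new Sample(nowMs, loadTimeMs));
    }

    /** Drop samples that finished loading more than windowMs before nowMs. */
    private void expire(long nowMs) {
        while (!samples.isEmpty() && nowMs - samples.peekFirst().finishedAtMs > windowMs) {
            samples.removeFirst();
        }
    }

    /** Maximum load time in the window; 0 if the window is empty (sketch assumption). */
    double max(long nowMs) {
        expire(nowMs);
        return samples.stream().mapToLong(s -> s.loadTimeMs).max().orElse(0L);
    }

    /** Average load time in the window; 0 if the window is empty (sketch assumption). */
    double avg(long nowMs) {
        expire(nowMs);
        return samples.stream().mapToLong(s -> s.loadTimeMs).average().orElse(0.0);
    }

    public static void main(String[] args) {
        WindowedLoadTime stat = new WindowedLoadTime(30_000); // 30 second window
        stat.record(1_000, 0);
        stat.record(2_000, 9_000);
        stat.record(20_000, 3_000);
        System.out.println(stat.max(25_000) + " " + stat.avg(25_000)); // 9000.0 4000.0
        System.out.println(stat.max(40_000) + " " + stat.avg(40_000)); // 3000.0 3000.0
    }
}
```

In the usage above, the two earliest loads fall out of the window by the 40 second mark, so a slow load is only visible for about 30 seconds after it finishes. A monitoring system would need to poll at least that often to catch it.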

Compatibility, Deprecation, and Migration Plan

This KIP simply adds a new metric attribute.

Rejected Alternatives

Use Yammer metrics instead of Kafka metrics.

...