
Status

Current state: [One of "Under Discussion", "Accepted", "Rejected"]

Discussion thread: here [Change the link from the KIP proposal email archive to your own email thread]

JIRA: KAFKA-6263

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

When a partition changes leadership, the new coordinator must load the consumer offset cache from the start of the log, which can take arbitrarily long depending on how large the partition has grown. The problem is hard to diagnose because it looks like a long period of inactivity in the consumer group rather than an actual fault. A broker metric that exposes how long a load took would make the cause visible. For similar reasons, a second metric should cover the transaction state log: when the broker is elected leader for one of its partitions, it loads state from that partition, and that load time should be exposed as well.

Public Interfaces

Add the following metrics via a sensor:

  • kafka.server:type=group-coordinator-metrics,name=group-load-time-max

Type: SampledStat.Max

Value: 0 or greater over time; maximum time, in milliseconds, it took to load offsets and group metadata from the __consumer_offsets partitions loaded in the last 30 seconds.

  • kafka.server:type=group-coordinator-metrics,name=group-load-time-avg

Type: SampledStat.Avg

Value: 0 or greater over time; average time, in milliseconds, it took to load offsets and group metadata from the __consumer_offsets partitions loaded in the last 30 seconds.

Note: this average may look very low at times when a majority of the partitions are unused, which causes some load times to be 0 seconds.

  • kafka.server:type=transaction-coordinator-metrics,name=transaction-load-time-max

Type: SampledStat.Max

Value: 0 or greater over time; maximum time, in milliseconds, it took to load offsets and transaction state from the __transaction_state partitions loaded in the last 30 seconds.

  • kafka.server:type=transaction-coordinator-metrics,name=transaction-load-time-avg

Type: SampledStat.Avg

Value: 0 or greater over time; average time, in milliseconds, it took to load offsets and transaction state from the __transaction_state partitions loaded in the last 30 seconds.

Note: this average may look very low at times when a majority of the partitions are unused, which causes some load times to be 0 seconds.
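To make the notes on the averages concrete, here is a small arithmetic sketch in plain Java. It does not use Kafka's metrics classes; the class name LoadTimeSummary and the sample load times are made up for illustration. It shows how a single slow partition load can dominate the max while barely moving the avg when most partitions load in 0 ms.

```java
import java.util.LongSummaryStatistics;
import java.util.stream.LongStream;

/** Hypothetical helper, for illustration only; not part of Kafka. */
final class LoadTimeSummary {
    private LoadTimeSummary() {}

    /** Summarizes per-partition load times (in milliseconds) from one window. */
    static LongSummaryStatistics summarize(long... loadTimesMs) {
        return LongStream.of(loadTimesMs).summaryStatistics();
    }

    public static void main(String[] args) {
        // Made-up sample: nine unused partitions load instantly, one takes 12 seconds.
        LongSummaryStatistics stats = summarize(0, 0, 0, 0, 0, 0, 0, 0, 0, 12_000);
        System.out.println("avg ms = " + stats.getAverage());
        System.out.println("max ms = " + stats.getMax());
    }
}
```

With these sample values the average is 1200 ms while the maximum is 12000 ms, so when diagnosing a slow coordinator load the max metric is likely the more telling of the two.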

Proposed Changes

For each of the group metadata manager and the transaction state manager, add a sensor that reports the max and avg number of milliseconds it took to load each partition. The max and average are computed over a running window covering the partitions that finished loading in the last 30 seconds.
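The running-window computation can be sketched as follows. This is a simplified stand-in written in plain Java, not Kafka's SampledStat implementation: the class LoadTimeWindow, its method names, and the choice to report 0 for an empty window are assumptions made for illustration. It also assumes timestamps are passed in non-decreasing order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical time-windowed max/avg over partition load times; not Kafka code. */
final class LoadTimeWindow {
    private static final class Sample {
        final long recordedAtMs;
        final long loadTimeMs;

        Sample(long recordedAtMs, long loadTimeMs) {
            this.recordedAtMs = recordedAtMs;
            this.loadTimeMs = loadTimeMs;
        }
    }

    private final long windowMs;
    private final Deque<Sample> samples = new ArrayDeque<>();

    LoadTimeWindow(long windowMs) {
        this.windowMs = windowMs;
    }

    /** Records that a partition finished loading at nowMs after taking loadTimeMs. */
    void record(long nowMs, long loadTimeMs) {
        samples.addLast(new Sample(nowMs, loadTimeMs));
    }

    /** Drops samples that are at least windowMs old relative to nowMs. */
    private void purge(long nowMs) {
        while (!samples.isEmpty() && nowMs - samples.peekFirst().recordedAtMs >= windowMs) {
            samples.removeFirst();
        }
    }

    /** Maximum load time in the window; 0 if the window is empty (assumption). */
    double max(long nowMs) {
        purge(nowMs);
        double max = 0.0;
        for (Sample s : samples) {
            max = Math.max(max, s.loadTimeMs);
        }
        return max;
    }

    /** Average load time in the window; 0 if the window is empty (assumption). */
    double avg(long nowMs) {
        purge(nowMs);
        if (samples.isEmpty()) {
            return 0.0;
        }
        long total = 0;
        for (Sample s : samples) {
            total += s.loadTimeMs;
        }
        return (double) total / samples.size();
    }
}
```

For example, with a 30 second window, recording load times of 0 ms, 4500 ms and 0 ms at one second intervals gives a max of 4500 and an avg of 1500 while all three samples are in the window. Once 30 seconds have passed without any further loads, both values fall back to 0 in this sketch.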

Compatibility, Deprecation, and Migration Plan

This KIP simply adds new metric attributes.

Rejected Alternatives

Use Yammer metrics instead of Kafka metrics.

  • The Yammer metrics don't support the type of metric that we want to show the client.