Status

Current state[One of "Under Discussion", "Accepted", "Rejected"]

Discussion thread: here [Change the link from the KIP proposal email archive to your own email thread]

JIRA: here [Change the link from KAFKA-1 to your own ticket]

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

When a partition changes leadership, the new coordinator must load the consumer offset cache from the start of the log. This can take arbitrarily long, depending on how large the partition has grown. The problem is hard to diagnose because it looks like a long period of inactivity in the consumer group rather than an error. A broker metric that exposes how long a load took would explain such pauses. For the same reason, a similar metric should be added for the transaction state log, covering the case where the broker is elected leader for one of its partitions and loads state from that partition.

Public Interfaces

Add the following metrics via a sensor:

1) kafka.coordinator.group:type=GroupMetadataManager,name=TimeToLoad

Type: SampledStat.Max

Value: 0 or greater; the maximum time, in milliseconds, taken to load offsets and group metadata from a single __consumer_offsets partition in the last 3 hours.

2) kafka.coordinator.transaction:type=TransactionStateManager,name=TimeToLoad

Type: SampledStat.Max

Value: 0 or greater; the maximum time, in milliseconds, taken to load transaction state from a single __transaction_state partition in the last 3 hours.
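As a rough illustration of how a monitoring client might address these metrics, the sketch below builds JMX object names with the standard javax.management.ObjectName class. It assumes the metrics are exposed over JMX under names of the form listed above. The class TimeToLoadMetricNames and its methods are hypothetical helpers, not part of Kafka.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper, not part of Kafka: builds JMX names for the proposed metrics.
final class TimeToLoadMetricNames {

    // Exact name of the proposed group coordinator metric, as listed in this KIP.
    static ObjectName groupMetric() {
        return parse("kafka.coordinator.group:type=GroupMetadataManager,name=TimeToLoad");
    }

    // Pattern matching any TimeToLoad metric under a kafka.coordinator.* domain.
    // The trailing ",*" allows additional key properties such as "type".
    static ObjectName anyTimeToLoad() {
        return parse("kafka.coordinator.*:name=TimeToLoad,*");
    }

    // Converts the checked exception so callers do not need a throws clause.
    private static ObjectName parse(String name) {
        try {
            return new ObjectName(name);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid JMX object name: " + name, e);
        }
    }
}
```

A client could pass the pattern from anyTimeToLoad() to MBeanServerConnection.queryNames to discover both metrics on a broker, assuming they are registered under these names.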

Proposed Changes

For both the group metadata manager and the transaction state manager, add a sensor that reports the maximum number of milliseconds it took to load a partition. This maximum is computed over a rolling window covering the partitions that were loaded in the last 3 hours. Whether the 3-hour window should be longer or shorter is up for discussion (the default sample window is 30 seconds).
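The windowed maximum described above can be sketched with plain Java collections. This is only an illustration of the idea, not Kafka's SampledStat implementation, which groups measurements into sample buckets rather than tracking an exact sliding window. The class WindowedMaxLoadTime and its method names are hypothetical, and the sketch assumes timestamps are passed in non-decreasing order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: tracks the maximum partition load time seen in a sliding window.
final class WindowedMaxLoadTime {

    private static final class Sample {
        final long timestampMs;
        final long loadTimeMs;

        Sample(long timestampMs, long loadTimeMs) {
            this.timestampMs = timestampMs;
            this.loadTimeMs = loadTimeMs;
        }
    }

    private final long windowMs;
    // Samples in arrival order with strictly decreasing load times,
    // so the head is always the maximum of the current window.
    private final Deque<Sample> samples = new ArrayDeque<>();

    WindowedMaxLoadTime(long windowMs) {
        this.windowMs = windowMs;
    }

    // Records how long one partition load took, observed at time nowMs.
    void record(long nowMs, long loadTimeMs) {
        expire(nowMs);
        // Older samples that are not larger can never be the maximum again.
        while (!samples.isEmpty() && samples.peekLast().loadTimeMs <= loadTimeMs) {
            samples.pollLast();
        }
        samples.addLast(new Sample(nowMs, loadTimeMs));
    }

    // Returns the maximum load time within the window, or 0 if nothing was loaded.
    long max(long nowMs) {
        expire(nowMs);
        return samples.isEmpty() ? 0L : samples.peekFirst().loadTimeMs;
    }

    // Drops samples whose age has reached the window length.
    private void expire(long nowMs) {
        while (!samples.isEmpty() && nowMs - samples.peekFirst().timestampMs >= windowMs) {
            samples.pollFirst();
        }
    }
}
```

With a 3-hour window (10,800,000 ms), a slow load keeps the reported value elevated for up to 3 hours after it happened, which gives operators time to notice it before the value falls back.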

Compatibility, Deprecation, and Migration Plan

This KIP only adds new metrics; it does not change or remove any existing ones.

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.

