Status

Current state: "Under Discussion"

Discussion thread: here

JIRA: https://issues.apache.org/jira/browse/KAFKA-12713

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Produce latency is an important metric for monitoring cluster performance. It is the sum of several components: RequestQueueTime, LocalTime, RemoteTime, ResponseQueueTime, ResponseSendTime, ThrottleTime, and MessageConversionsTime. RemoteTime is the time the request spends in purgatory.

In a fully loaded cluster, RemoteTime is a very significant part of produce latency.

However, produce RemoteTime is hard to troubleshoot. It is the time the leader spends waiting for the followers to fetch the data and report back the confirmed watermark. It can be affected by: 1. network congestion, 2. slow fetch processing, 3. something else.

Currently we do not have the right metrics to tell which part contributes to a high produce RemoteTime. The reported fetch latency does not reflect the true fetch latency, because a fetch request sometimes has to stay in purgatory for up to replica.fetch.wait.max.ms when the available data is less than replica.fetch.min.bytes. This greatly skews the reported latency.

The current metrics:

1. Fetcher rate

This reflects how frequently the follower sends fetch requests to the leader.

It is currently captured using fetcherStats.requestRate.mark() in processFetchRequest.

However, the fetch rate becomes almost meaningless because a fetch request might just be waiting in purgatory.

2. Fetch processing time.

Fetch TotalTime also includes RemoteTime, which is the time the request spends waiting in purgatory.

This KIP proposes a fix so that the metrics report the real latency.

Public Interfaces

Add waitTimeMs to FetchResponse and bump the FetchResponse version.

Proposed Changes

We propose to track the real end-to-end fetch latency with the following changes:

1. Add waitTimeMs to FetchResponse.
2. In the Kafka API handler (in handleFetchRequest()), when creating the FetchResponse, set waitTimeMs to the time the request spent in purgatory.
3. In the follower broker, in processFetchRequest(), track the latency of the fetch request and subtract the waitTimeMs reported in the FetchResponse.
4. In FetcherStats, add a new histogram to track this calculated "true" fetch latency.
5. Create a sensor to report this metric.
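The follower-side calculation described above can be sketched as follows. This is only an illustration under stated assumptions: the class TrueFetchLatencySketch and the method trueFetchLatencyMs are hypothetical names, not part of the Kafka code base, and the clamp at zero is an added safeguard that the KIP does not specify.

```java
// Hypothetical sketch, not actual Kafka code: the class and method names
// are assumptions made for illustration.
final class TrueFetchLatencySketch {

    /**
     * Follower-side calculation: the round-trip time of one fetch request
     * minus the purgatory wait (waitTimeMs) reported by the leader.
     */
    static long trueFetchLatencyMs(long requestSentMs,
                                   long responseReceivedMs,
                                   long waitTimeMs) {
        long totalMs = responseReceivedMs - requestSentMs;
        // Assumption: clamp at zero. The two durations are measured on
        // different hosts at millisecond granularity, so rounding could
        // otherwise yield a small negative value.
        return Math.max(0L, totalMs - waitTimeMs);
    }

    public static void main(String[] args) {
        // A fetch that took 520 ms end to end, of which 500 ms was spent
        // waiting in purgatory on the leader.
        System.out.println(trueFetchLatencyMs(1_000L, 1_520L, 500L)); // prints 20
    }
}
```

In this sketch the value returned is what would be recorded into the new histogram, so a fetch that mostly waited in purgatory no longer shows up as a slow fetch.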

Additionally, on the leader side, we will add a new metric called TotalEffectiveTime, which is TotalTime minus RemoteTime.
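As a rough illustration of the leader-side metric, the sketch below computes TotalEffectiveTime from per-request timings. The class RequestTimingSketch and its fields are hypothetical, and treating TotalTime as the plain sum of the components listed in the Motivation section is an assumption of this example rather than a statement about the broker implementation.

```java
// Hypothetical sketch, not actual Kafka code: field names mirror the
// latency components listed in the Motivation section.
final class RequestTimingSketch {
    final long requestQueueTimeMs;
    final long localTimeMs;
    final long remoteTimeMs;          // time spent in purgatory
    final long throttleTimeMs;
    final long responseQueueTimeMs;
    final long responseSendTimeMs;
    final long messageConversionsTimeMs;

    RequestTimingSketch(long requestQueueTimeMs, long localTimeMs, long remoteTimeMs,
                        long throttleTimeMs, long responseQueueTimeMs,
                        long responseSendTimeMs, long messageConversionsTimeMs) {
        this.requestQueueTimeMs = requestQueueTimeMs;
        this.localTimeMs = localTimeMs;
        this.remoteTimeMs = remoteTimeMs;
        this.throttleTimeMs = throttleTimeMs;
        this.responseQueueTimeMs = responseQueueTimeMs;
        this.responseSendTimeMs = responseSendTimeMs;
        this.messageConversionsTimeMs = messageConversionsTimeMs;
    }

    // Assumption: TotalTime is the plain sum of the listed components.
    long totalTimeMs() {
        return requestQueueTimeMs + localTimeMs + remoteTimeMs + throttleTimeMs
                + responseQueueTimeMs + responseSendTimeMs + messageConversionsTimeMs;
    }

    // Proposed metric: TotalTime minus RemoteTime.
    long totalEffectiveTimeMs() {
        return totalTimeMs() - remoteTimeMs;
    }

    public static void main(String[] args) {
        RequestTimingSketch t = new RequestTimingSketch(2, 5, 480, 0, 1, 3, 0);
        System.out.println(t.totalTimeMs());          // prints 491
        System.out.println(t.totalEffectiveTimeMs()); // prints 11
    }
}
```

With these example numbers, a fetch that looks like it took 491 ms is shown to have needed only 11 ms of actual work on the leader.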

Compatibility, Deprecation, and Migration Plan

Follow the standard procedure for a protocol change.

KIP-736: Report the true end to end fetch latency

Status

Motivation

Public Interfaces

Proposed Changes

Compatibility, Deprecation, and Migration Plan

Rejected Alternatives