Status

Current state: "Under Discussion"

Discussion thread: here 

JIRA: https://issues.apache.org/jira/browse/KAFKA-12713

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Produce latency is an important metric for monitoring cluster performance. It is the sum of several components: RequestQueueTime, LocalTime, RemoteTime, ResponseQueueTime, ResponseSendTime, ThrottleTime, and MessageConversionsTime. RemoteTime is the time a request spends in purgatory.

In a fully loaded cluster, RemoteTime is a very significant part of produce latency.

However, produce RemoteTime is hard to troubleshoot. It is the time the leader spends waiting for the followers to fetch the data and send back the confirmed watermark. A high value can have two causes: 1. the end-to-end fetch latency is bad, or 2. fetch processing itself is slow.

The currently reported fetch latency does not reflect the true fetch latency, because a fetch request sometimes has to stay in purgatory for up to replica.fetch.wait.max.ms when the available data is less than the minimum fetch size (replica.fetch.min.bytes for follower fetches). This greatly skews the reported latency and makes it very hard to tell whether the fetch latency itself fluctuates.
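To make the skew concrete, here is a small arithmetic sketch in Java. The class and method names are hypothetical and are not Kafka APIs, and the sample values are invented for illustration, assuming a 500 ms wait cap. Two of the four fetches sit in purgatory for the full wait, and that alone dominates the mean.

```java
import java.util.Arrays;

// Hypothetical illustration only: none of these names are Kafka APIs.
final class FetchLatencySkewExample {

    /** Mean of the given latency samples, in milliseconds. */
    static double meanMs(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(0.0);
    }

    /** Subtracts each request's purgatory wait from its observed total time. */
    static long[] withoutWait(long[] totalMs, long[] waitMs) {
        long[] out = new long[totalMs.length];
        for (int i = 0; i < totalMs.length; i++) {
            out[i] = totalMs[i] - waitMs[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented samples: requests 1 and 3 waited 500 ms in purgatory.
        long[] totalMs = {505, 4, 506, 5};
        long[] waitMs = {500, 0, 500, 0};

        System.out.println(meanMs(totalMs));                      // prints 255.0
        System.out.println(meanMs(withoutWait(totalMs, waitMs))); // prints 5.0
    }
}
```

With the wait included, the mean says little about how fast the fetch path is. With it removed, the remaining 4 to 6 ms per request is the part worth monitoring.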


The current metrics:

1. Fetcher rate

This is currently captured by calling fetcherStats.requestRate.mark() in processFetchRequest.

However, the fetch rate is almost meaningless on its own, because it is hard to tell whether a slow fetch was slow due to waiting or not.

2. Fetch processing time.

Fetch TotalTime also includes RemoteTime, which is the time the request spends waiting in purgatory.

In this KIP, we propose a fix so that the metrics report the real latency.

Public Interfaces


Add a waitTimeMs field to FetchResponse.

Proposed Changes

We propose to track the real end-to-end fetch latency with the following changes:

  1. Add waitTimeMs to FetchResponse.
  2. In processResponseCallback() of handleFetchRequest, set waitTimeMs to the time the request spent in purgatory.
  3. In FetcherStats, add a new meter that tracks the real fetch latency by deducting waitTimeMs from the observed fetch time.
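The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Kafka change: SketchFetchResponse, completeFetch, and SketchFetcherStats are hypothetical stand-ins for FetchResponse, the response callback, and FetcherStats, and a plain running mean is used in place of a real metrics meter.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-ins; these are not Kafka's FetchResponse or FetcherStats.
final class FetchWaitTimeSketch {

    /** Step 1: a response that carries the proposed waitTimeMs value. */
    static final class SketchFetchResponse {
        final long waitTimeMs;

        SketchFetchResponse(long waitTimeMs) {
            this.waitTimeMs = waitTimeMs;
        }
    }

    /** Step 2 (leader side): waitTimeMs is the time spent in purgatory. */
    static SketchFetchResponse completeFetch(long enqueuedAtMs, long completedAtMs) {
        // Clamp at zero in case the two timestamps are out of order.
        long waitMs = Math.max(0L, completedAtMs - enqueuedAtMs);
        return new SketchFetchResponse(waitMs);
    }

    /** Step 3 (follower side): record latency with the wait deducted. */
    static final class SketchFetcherStats {
        private final LongAdder count = new LongAdder();
        private final LongAdder totalEffectiveMs = new LongAdder();

        void record(long roundTripMs, SketchFetchResponse response) {
            long effectiveMs = Math.max(0L, roundTripMs - response.waitTimeMs);
            count.increment();
            totalEffectiveMs.add(effectiveMs);
        }

        double meanEffectiveMs() {
            long n = count.sum();
            return n == 0 ? 0.0 : (double) totalEffectiveMs.sum() / n;
        }
    }

    public static void main(String[] args) {
        SketchFetcherStats stats = new SketchFetcherStats();
        stats.record(507, completeFetch(1_000, 1_500)); // 500 ms in purgatory
        stats.record(6, completeFetch(2_000, 2_000));   // answered immediately
        System.out.println(stats.meanEffectiveMs());    // prints 6.5
    }
}
```

Because the wait is measured entirely on the leader and shipped back in the response, the follower does not need synchronized clocks to deduct it from its own round-trip measurement.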

In addition, FetchLatency should report a time called TotalEffectiveTime, which is TotalTime minus RemoteTime.

Compatibility, Deprecation, and Migration Plan

None. 

Rejected Alternatives

