...

However, the produce RemoteTime is hard to troubleshoot. It is the time the leader spends waiting for the follower to fetch the data and send back the confirmed watermark. A high value can be caused by: 1. bad end-to-end fetch latency or a congested network; 2. slow fetch processing; 3. something else.

Currently, we don't have the right metrics to tell which part contributes to a high produce RemoteTime. The reported fetch latency does not reflect the true fetch latency, because a fetch request sometimes has to stay in purgatory and wait for up to replica.fetch.wait.max.ms when the available data is less than fetch.min.bytes. This greatly skews the latency number and makes it very hard to find out whether the fetch latency actually fluctuates.

The current metrics:

1. Fetcher rate

This reflects how frequently the follower sends fetch requests to the leader.

It is currently captured using fetcherStats.requestRate.mark() in processFetchRequest.

However, the fetch rate is almost meaningless for this purpose: a fetch request might just be waiting in purgatory, so it is hard to tell whether a low rate is due to slow fetches or merely to waiting.

2. Fetch processing time.

...

  1. Add waitTimeMs to FetchResponse().
  2. In the Kafka API handler (in the handleFetchRequest() function), set waitTimeMs in FetchResponse() to the time the request spent in purgatory.
  3. In the follower broker, processFetchRequest() will track the total duration of the fetch request and subtract waitTimeMs from it. In FetcherStats, we will add a new histogram metric to track this calculated fetch latency.
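
The calculation in step 3 can be sketched as follows. This is only an illustration under stated assumptions: FetchLatencySketch, FetchResponseSketch, FetcherStatsSketch and trueFetchLatencyMs are hypothetical names and not Kafka's actual classes or methods, a plain list stands in for the histogram, and clamping the result at zero is an assumption rather than part of the proposal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the names below are illustrative, not Kafka's real classes.
public class FetchLatencySketch {

    /** Stand-in for a fetch response carrying the proposed waitTimeMs field. */
    static final class FetchResponseSketch {
        final long waitTimeMs; // time the request spent in purgatory on the leader

        FetchResponseSketch(long waitTimeMs) {
            this.waitTimeMs = waitTimeMs;
        }
    }

    /** Stand-in for follower-side stats; a real broker would record into a histogram. */
    static final class FetcherStatsSketch {
        final List<Long> trueFetchLatenciesMs = new ArrayList<>();

        void recordTrueFetchLatency(long latencyMs) {
            trueFetchLatenciesMs.add(latencyMs);
        }
    }

    /** Follower side: subtract the leader-reported purgatory wait from the round trip. */
    static long trueFetchLatencyMs(long requestSentMs,
                                   long responseReceivedMs,
                                   FetchResponseSketch response) {
        long totalMs = responseReceivedMs - requestSentMs;
        // Clamp at zero to guard against timer granularity; this clamp is an assumption.
        return Math.max(0L, totalMs - response.waitTimeMs);
    }

    public static void main(String[] args) {
        FetcherStatsSketch stats = new FetcherStatsSketch();
        // The round trip took 520 ms, of which 500 ms was spent waiting in purgatory.
        FetchResponseSketch response = new FetchResponseSketch(500L);
        long latency = trueFetchLatencyMs(1_000L, 1_520L, response);
        stats.recordTrueFetchLatency(latency);
        System.out.println("true fetch latency ms = " + latency);
    }
}
```

With these numbers, the reported round trip of 520 ms is dominated by the purgatory wait, while the value recorded for the new metric is 20 ms.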

...