...

Produce latency is an important metric for monitoring cluster performance. The total latency is the sum of the following components: RequestQueueTime, LocalTime, RemoteTime, ResponseQueueTime, ResponseSendTime, ThrottleTime, and MessageConversionsTime. RemoteTime is the time the request spends in purgatory.
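The decomposition above can be illustrated with a minimal sketch. The millisecond values are made up for illustration and are not measurements, and the helper names `total_time_ms` and `dominant_component` are hypothetical, not part of Kafka.

```python
# Hypothetical per-request timings in milliseconds (illustrative values only).
components_ms = {
    "RequestQueueTime": 1.0,
    "LocalTime": 2.0,
    "RemoteTime": 40.0,  # time spent in purgatory
    "ResponseQueueTime": 0.5,
    "ResponseSendTime": 0.5,
    "ThrottleTime": 0.0,
    "MessageConversionsTime": 0.0,
}


def total_time_ms(components):
    """Total request latency is the sum of all component times."""
    return sum(components.values())


def dominant_component(components):
    """Name of the component that contributes the most to the total."""
    return max(components, key=components.get)


print(total_time_ms(components_ms), dominant_component(components_ms))
```

In this made-up breakdown RemoteTime dominates the total, which is the situation the rest of this section is concerned with.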

...

However, the produce RemoteTime is hard to troubleshoot. It is the time the leader spends waiting for the followers to fetch the data and send back the confirmed watermark. It can be affected by: 1. network congestion, 2. slow fetch processing, 3. something else.

Currently we don't have the right metrics to tell which part is contributing to a high Produce RemoteTime. That is because the reported fetch latency does not reflect the true fetch latency: a fetch request sometimes has to stay in purgatory and wait up to replica.fetch.wait.max.ms when the available data is less than fetch.min.bytes. This greatly skews the reported latency number.
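A rough sketch of the skew, under simplifying assumptions: a fetch request is parked in purgatory when fewer bytes than the minimum are available, and it waits for the full maximum wait because no new data arrives in the meantime. The names `FetchTiming` and `simulate_fetch` are hypothetical, and the 500 ms wait and 2 ms processing time are example values only.

```python
from dataclasses import dataclass


@dataclass
class FetchTiming:
    processing_ms: float  # time actually spent handling the fetch
    purgatory_ms: float   # time parked in purgatory waiting for data

    @property
    def reported_ms(self):
        # The reported fetch latency includes the purgatory wait.
        return self.processing_ms + self.purgatory_ms


def simulate_fetch(processing_ms, available_bytes, min_bytes, max_wait_ms):
    """Simplified model (assumption): wait the full max_wait_ms whenever
    fewer than min_bytes are available, otherwise do not wait at all."""
    purgatory_ms = max_wait_ms if available_bytes < min_bytes else 0.0
    return FetchTiming(processing_ms, purgatory_ms)


idle = simulate_fetch(2.0, available_bytes=0, min_bytes=1, max_wait_ms=500.0)
busy = simulate_fetch(2.0, available_bytes=4096, min_bytes=1, max_wait_ms=500.0)
print(idle.reported_ms, busy.reported_ms)
```

Both requests take the same time to process, but the idle one reports a latency dominated by the purgatory wait, so the reported number alone cannot show whether fetch processing itself is slow.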

...