...

However, the produce RemoteTime is hard to troubleshoot. It is the time the leader spends waiting for the follower to fetch the data and send back the confirmed watermark. It can be affected by: 1. network congestion, 2. slow fetch processing, 3. something else.

Currently, we don't have the right metrics to tell which part contributes to a high produce RemoteTime. The reported fetch latency does not reflect the true fetch latency, because a fetch request sometimes has to stay in purgatory and wait up to replica.fetch.wait.max.ms when the available data is less than fetch.min.bytes. This greatly skews the real latency number.

The limitation of the current metrics:

1. Fetcher rate

...

These are currently captured by calling fetcherStats.requestRate.mark() in processFetchRequest.

...

Add waitTimeMs in FetchResponse, and bump the FetchResponse and FetchRequest API protocol version.
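As a rough illustration of how a follower might use the new field, the sketch below subtracts waitTimeMs from the latency the follower observes for a fetch. The class and method names (FetchLatency, effectiveLatencyMs) are hypothetical and are not part of the Kafka code base; only waitTimeMs comes from this proposal.

```java
// Hypothetical helper, not an existing Kafka class.
final class FetchLatency {
    private FetchLatency() {}

    /**
     * Returns the fetch latency with purgatory wait removed.
     *
     * @param observedLatencyMs latency measured by the follower for one fetch round trip
     * @param waitTimeMs        time the request waited in purgatory, as reported
     *                          by the proposed waitTimeMs field in FetchResponse
     */
    static long effectiveLatencyMs(long observedLatencyMs, long waitTimeMs) {
        // Clamp at zero in case the two values were measured with slightly
        // different clocks and the difference comes out negative.
        return Math.max(0L, observedLatencyMs - waitTimeMs);
    }

    public static void main(String[] args) {
        // A fetch that took 520 ms end to end, 500 ms of which was purgatory wait.
        System.out.println(effectiveLatencyMs(520L, 500L));
    }
}
```

With these example numbers, a fetch that looks slow (520 ms) turns out to have spent only 20 ms on network and processing, which is the distinction the existing metrics cannot make.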

Proposed Changes

...

In addition, on the leader side, we will add a new metric called TotalLocalTime, defined as TotalLocalTime = TotalTime - RemoteTime.
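The definition above can be sketched as a small computation. This is only an illustration of the formula; the class name LeaderRequestTimes and its method are assumptions made for the example, not the actual metric implementation.

```java
// Hypothetical illustration of the TotalLocalTime definition.
final class LeaderRequestTimes {
    private LeaderRequestTimes() {}

    /**
     * TotalLocalTime = TotalTime - RemoteTime, all values in milliseconds.
     */
    static long totalLocalTimeMs(long totalTimeMs, long remoteTimeMs) {
        if (remoteTimeMs > totalTimeMs) {
            // RemoteTime is a component of TotalTime, so this should not happen.
            throw new IllegalArgumentException("remoteTimeMs exceeds totalTimeMs");
        }
        return totalTimeMs - remoteTimeMs;
    }

    public static void main(String[] args) {
        // A produce request with 130 ms total time, 100 ms of it waiting on followers.
        System.out.println(totalLocalTimeMs(130L, 100L));
    }
}
```

In this example the leader itself accounts for 30 ms of the request, so the remaining 100 ms is attributable to waiting on the followers.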

Compatibility, Deprecation, and Migration Plan

...
