...
Latency improvement of workloads run with acks=all

| Metric | Baseline latency (ms) | Optimized latency (ms) | Improvement |
|---|---|---|---|
| High Partitions | | | |
| p99 E2E | 188 | 184 | 2.1% |
| p99 Produce | 155.65 | 151.8 | 2.5% |
| Low Partitions | | | |
| p99 E2E | 393 | 374.5 | 4.7% |
| p99 Produce | 390.95 | 374.35 | 4.2% |
Latency improvement of workloads run with acks=1

| Metric | Baseline latency (ms) | Optimized latency (ms) | Improvement |
|---|---|---|---|
| High Partitions | | | |
| p99 E2E | 106.5 | 101 | 5.2% |
| p99 Produce | 84.7 | 83.3 | 1.7% |
| Low Partitions | | | |
| p99 E2E | 12.5 | 12.5 | 0% |
| p99 Produce | 3.25 | 2.95 | 9.2% |
Workload Details
All tests are run on 6 m5.xlarge Apache Kafka brokers, with KRaft as the metadata quorum running on 3 m5.xlarge instances. The clients are 6 m5.xlarge instances running the OpenMessagingBenchmark. Each test runs for 70 minutes, during which the brokers are restarted one by one with a 10-minute interval between restarts.
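For reference, the sketch below shows where the two producer knobs discussed in this section live in the Java client configuration: the acks setting that separates the two workload variants, and the static retry backoff revisited later. The broker address is a placeholder and everything else is assumed to stay at the OpenMessagingBenchmark driver defaults; this is an illustrative sketch, not the benchmark's actual config.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class WorkloadProducerConfig {
    // Assembles producer settings for one workload variant. Only the knobs
    // discussed in this post are shown; everything else is left at defaults.
    static Properties producerProps(String acks) {
        Properties props = new Properties();
        // Placeholder address; the benchmark used 6 m5.xlarge brokers.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
        // "all" for the acks=all runs, "1" for the acks=1 runs.
        props.put(ProducerConfig.ACKS_CONFIG, acks);
        // The static delay before retrying a failed request; 100 ms is the
        // client default. The alternative design discussed below aims to
        // skip this wait when the failure is caused by a leader change.
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "100");
        return props;
    }
}
```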
...
Another idea considered was to fetch the new leader on the client via the usual Metadata RPC once a produce or fetch request fails with NOT_LEADER_OR_FOLLOWER or FENCED_LEADER_EPOCH, and to save time on the client by skipping the static retry delay (RETRY_BACKOFF_MS_CONFIG) for that failed request, instead retrying immediately as soon as the new leader for the partition is available on the client; a schematic sketch of this eager-retry loop is given below. Consider the total time taken on the produce path when the leader changes:
...
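To make the eager-retry alternative concrete, here is a minimal schematic sketch. This is not Kafka's actual Sender/NetworkClient code: ProduceClient, MetadataCache, and their methods are hypothetical stand-ins for the client's request path and metadata cache.

```java
// Schematic sketch of the alternative design: on a leader-change error,
// refresh metadata and retry as soon as the new leader is known, instead
// of sleeping the static retry.backoff.ms first.
final class EagerLeaderRetrySketch {

    enum ErrorCode { NONE, NOT_LEADER_OR_FOLLOWER, FENCED_LEADER_EPOCH, OTHER }

    // Hypothetical stand-in for the client's cached cluster metadata.
    interface MetadataCache {
        String leaderFor(String topicPartition);
        // Issues a Metadata RPC and blocks until the partition's new leader
        // is known on the client.
        String awaitRefreshedLeader(String topicPartition) throws InterruptedException;
    }

    // Hypothetical stand-in for the client's produce request path.
    interface ProduceClient {
        ErrorCode sendToLeader(String leader, byte[] batch);
    }

    static void produceWithEagerRetry(ProduceClient client, MetadataCache metadata,
                                      String topicPartition, byte[] batch,
                                      long retryBackoffMs) throws InterruptedException {
        String leader = metadata.leaderFor(topicPartition);
        while (true) {
            ErrorCode error = client.sendToLeader(leader, batch);
            switch (error) {
                case NONE:
                    return; // produced successfully
                case NOT_LEADER_OR_FOLLOWER:
                case FENCED_LEADER_EPOCH:
                    // Leader moved: trigger a Metadata RPC and retry as soon
                    // as the new leader is known, with no static backoff.
                    leader = metadata.awaitRefreshedLeader(topicPartition);
                    break;
                default:
                    // Other retriable errors keep the usual static backoff.
                    Thread.sleep(retryBackoffMs);
                    break;
            }
        }
    }
}
```

The point of the sketch is the switch statement: only a leader-change error triggers the immediate metadata refresh and retry, while every other retriable error still pays the static retry.backoff.ms wait.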