...

Latency improvement of workloads run with acks=all

                                Baseline latency (ms)   Optimized latency (ms)   Improvement
High Partitions   p99 E2E       188                     184                      2.1%
                  p99 Produce   155.65                  151.8                    2.5%
Low Partitions    p99 E2E       393                     374.5                    4.7%
                  p99 Produce   390.95                  374.35                   4.2%

Latency improvement of workloads run with acks=1

                                Baseline latency (ms)   Optimized latency (ms)   Improvement
High Partitions   p99 E2E       106.5                   101                      5.2%
                  p99 Produce   84.7                    83.3                     1.7%
Low Partitions    p99 E2E       12.5                    12.5                     0%
                  p99 Produce   3.25                    2.95                     9.2%
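
For reference, the improvement figures in both tables are consistent with the relative reduction from the baseline latency:

\[
\text{Improvement} = \frac{\text{Baseline} - \text{Optimized}}{\text{Baseline}},
\qquad \text{e.g. } \frac{188 - 184}{188} \approx 2.1\%.
\]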

Workload Details

All tests are run on 6 m5.xlarge Apache Kafka brokers, with KRaft as the metadata quorum running on 3 m5.xlarge instances. The clients are 6 m5.xlarge instances running the OpenMessagingBenchmark. Each test runs for 70 minutes, during which the brokers are restarted one by one with a 10-minute interval between restarts.
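
For illustration only, here is a minimal sketch of how the two producer variants could be configured with the Java client. The bootstrap address, serializers, topic name, and record size are assumptions rather than the actual benchmark driver settings; acks and retry.backoff.ms (RETRY_BACKOFF_MS_CONFIG, the static retry delay discussed below) are the settings referenced in this document.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class AcksVariantsExample {

    // Builds producer properties for one workload variant. Only "acks" differs
    // between the acks=all and acks=1 runs in this sketch; the other values are
    // illustrative, not the benchmark's exact configuration.
    static Properties producerProps(String acks) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");   // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, acks);                           // "all" or "1"
        // Static delay applied before retrying a failed request (see discussion below).
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "100");
        return props;
    }

    public static void main(String[] args) {
        try (KafkaProducer<byte[], byte[]> producer =
                 new KafkaProducer<>(producerProps("all"))) {                  // or "1"
            producer.send(new ProducerRecord<>("test-topic", new byte[1024])); // assumed topic/size
            producer.flush();
        }
    }
}
```

In this sketch the only client-level difference between the two workload variants is the acks value passed to producerProps.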

...

Another idea considered was to fetch the new leader on the client using the usual Metadata RPC call once a produce or fetch request fails with NOT_LEADER_OR_FOLLOWER or FENCED_LEADER_EPOCH, and to save time on the client by avoiding the static retry delay (RETRY_BACKOFF_MS_CONFIG) on the failed request, instead retrying immediately as soon as the new leader for the partition is available on the client. Consider the total time taken on the produce path when the leader changes -

...