Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

While using "acks=-1", we found the P999 latency is very spiky. Since the produce request can only be acknowledged after all of the "insync.replicas" committed the message, the produce request latency is determined by the slowest replica inside "insync.replicas". In the production environment, we usually set "replica.lag.time.max.ms" to the order of 10 seconds (to protect frequent ISR shrink and expand), and producer P999 can jump to the value of "replica.lag.time.max.ms" even without any failures or retries.

The following is a normal one day P999 for a production Kafka cluster. Of course, the majority of the time was spent on the remote replication part (replicas=3 and acks=-1).

Image Added

We already have pretty stable and low P99 latency, it will definitely make Kafka more suitable for a much wider audience (even in the critical write path) if we can have the similar guarantees for P999.

...