Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. introduce a new config on the producer side called "quorum.required.acks". This config specifies the number of replicas a message should be committed to before the message got acknowledged. This config should be <= the number of replicas and >= 1. In this case, we have the data durability guarantee for "quorum.required.acks - 1" broker failures.
  2. select the live replica with the largest LEO as the new leader during leader election. For the most popular configs in today's production environment, we use replicas=3, ack=-1, and min.insync.replicas=2. In this case, we guarantee a committed message will NOT be lost if only 1 broker failed/died. If we use replicas=3 and quorum.required.acks=2, for any message got acknowledged, it has been committed to at least 2 brokers, if the leader died, when we chose the replica with the larger LEO from the two live replicas, it will ensure this replica contains all of the messages from the previous leader.

The following is the P999 for the replicas=4 and quorum.required.acks=2 configs, 4000QPS with the 2KB payload for 24 hours. (I deployed with the modified version supporting quorum.required.ack based on 0.10.2.1)

Image Added

Compatibility, Deprecation, and Migration Plan

...

I got some recommendations from the offline discussions. One of them is to set replicas=2 and ack=-1, this will only wait 1 follower fetch the pending message; however, from the experiment, the P999 is still very spiky.

The following is the P999 for the replicas=2 and ack=-1 configs, 1000QPS with the 2KB payload for only 2 hours.

Image Added