Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. This design aims to remove split-brain and herd effect issues in the V1 design. A partition has only one brain (on the leader) and all brokers only respond to state changes that are meant for them (as decided by the leader).
  2. The state machine in this design is completely controlled only by the leader for each partition. Each follower changes its state only based on such a request from the leader for a particular partition. Leader co-ordinated state machine allows central state machine verification and allows it to fail fast.
  3. This design introduces a global epoch, which is a non-decreasing value for a Kafka cluster. The epoch changes when the leader for a partition changes.
  4. This design handles delete partition or delete topic state changes for dead brokers by queuing up state change requests for a broker in Zookeeper.
  5. This design scales better wrt to number of ZK watches, since it registers fewer watches compared to V1. The motivation is to be able to reduce the load on ZK when the Kafka cluster grows to thousands of partitions. For example, if we have a cluster of 3 brokers hosting 1000 topics with 3 partitions each, the V1 design requires registering 15000 watches. The V2 design requires registering 3000 watches.
  6. This design ensures that leader change ZK notifications are not queued up on any other notifications and can happen instantaneously.
  7. This design allows explicit monitoring of
    1. the entire lifecycle of a state change -
      1. leader, broker id 0, requested start-replica to broker id 1, at epoch 10
      2. leader, broker id 0, requested start-replica to broker id 2, at epoch 10
      3. follower, broker id 1, received start-replica from leader 0, at epoch 10
      4. follower, broker id 2, received start-replica from leader 0, at epoch 10
      5. follower, broker id 1, completed start-replica request from leader 0, at epoch 10
      6. follower, broker id 2, completed start-replica request from leader 0, at epoch 10
    2. the backup of state change requests, on slow followers

...