...

KIP-595 introduced KRaft topic partitions. These are partitions with replicas that can achieve consensus on the Kafka log without relying on the Controller or ZK. The KRaft Controllers in KIP-631 use one of these topic partitions (called the cluster metadata topic partition) to order operations on the cluster, commit them to disk, and replicate them to other controllers and brokers.

The cluster metadata partition is stored in the metadata.log.dir. If this directory doesn't contain a valid meta.properties file, the Kafka node fails to start. The user can create a valid meta.properties file by executing the format command of the kafka-storage tool. The format command only creates a meta.properties file if one doesn't already exist.

If the metadata.log.dir was previously formatted, it was lost because of a disk failure, and the user runs the format command of the kafka-storage tool, then it is possible for the cluster metadata partition to establish a leader with inconsistency or data loss. To safely reformat a metadata.log.dir, the user needs to make sure that the new voter node has a superset of the previous voter's on-disk state. This solution to disk failure is manual and error prone.

This KIP describes a protocol for extending KIP-595 and KIP-630 so that the Kafka cluster can detect disk failures on a minority of the nodes in the cluster metadata topic partition. This KIP also describes the mechanism for the user to programmatically replace the failed voters in a way that is safe and available.

...

The replica id and endpoint for all of the voters are described and persisted in the configuration controller.quorum.voters. An example value for this configuration is 1@localhost:9092,2@localhost:9093,3@localhost:9094.
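For illustration, a minimal Java sketch of how a value like the one above decomposes into a replica id, host, and port; the VoterEndpoint record and its parse method are hypothetical helpers, not part of the Kafka codebase:

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Kafka codebase: parses a
// controller.quorum.voters value like "1@localhost:9092,2@localhost:9093".
public record VoterEndpoint(int replicaId, String host, int port) {
    public static List<VoterEndpoint> parse(String config) {
        List<VoterEndpoint> voters = new ArrayList<>();
        for (String entry : config.split(",")) {
            String[] idAndEndpoint = entry.split("@", 2);    // e.g. "1" and "localhost:9092"
            int colon = idAndEndpoint[1].lastIndexOf(':');
            voters.add(new VoterEndpoint(
                Integer.parseInt(idAndEndpoint[0]),
                idAndEndpoint[1].substring(0, colon),
                Integer.parseInt(idAndEndpoint[1].substring(colon + 1))));
        }
        return voters;
    }
}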

...

When the leader of the topic partition first discovers the UUID of all the voters through the KRaft RPCs (for example, Fetch and Vote), it will write an AddVoterRecord to the cluster metadata log for each voter described in controller.quorum.voters. All of these AddVoterRecords must be contained in one control record batch. Putting them all in one record batch guarantees that they all get persisted, replicated and committed atomically.
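As a rough sketch of this bootstrap step (the AddVoterRecord shape and the bootstrapRecords helper below are assumptions, not the actual KRaft types), the leader could build the full set of records up front and append them as a single control batch:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; the record layout and helper are illustrative, not Kafka code.
final class VoterBootstrap {
    record AddVoterRecord(int replicaId, String voterUuid) {}

    // Build one AddVoterRecord per voter whose UUID was discovered through Fetch/Vote.
    // Appending the returned list as a single control batch is what guarantees that
    // the records are persisted, replicated and committed atomically.
    static List<AddVoterRecord> bootstrapRecords(Map<Integer, String> discoveredVoterUuids) {
        return discoveredVoterUuids.entrySet().stream()
            .map(entry -> new AddVoterRecord(entry.getKey(), entry.getValue()))
            .collect(Collectors.toList());
    }
}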

The KRaft leader will not perform writes from the state machine (active controller) or client until it has written to the log an AddVoterRecord for every replica id in the controller.quorum.voters configuration and those control records have been replicated to all of the voters.
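A minimal sketch of that gate, with assumed class, method and parameter names that are not the actual controller code:

import java.util.Set;

// Illustrative gate for the rule above; names are assumptions, not Kafka code.
final class BootstrapGate {
    // The leader accepts state machine (active controller) or client writes only after
    // an AddVoterRecord exists for every configured replica id and those records have
    // been replicated to all of the voters.
    static boolean canAcceptWrites(
        Set<Integer> configuredVoterIds,
        Set<Integer> voterIdsWithAddVoterRecord,
        Set<Integer> voterIdsThatReplicatedBootstrapRecords
    ) {
        return voterIdsWithAddVoterRecord.containsAll(configuredVoterIds)
            && voterIdsThatReplicatedBootstrapRecords.containsAll(configuredVoterIds);
    }
}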

If the cluster metadata topic partition contains an AddVoterRecord for a replica id that is not enumerated in controller.quorum.voters, then the replica will fail to start and shut down.
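For example, the startup check could be sketched as follows; the class, method and exception choice are assumptions rather than the real Kafka implementation:

import java.util.Set;

// Illustrative startup validation for the rule above.
final class VoterSetValidation {
    static void validate(Set<Integer> addVoterRecordIds, Set<Integer> configuredVoterIds) {
        for (int replicaId : addVoterRecordIds) {
            if (!configuredVoterIds.contains(replicaId)) {
                throw new IllegalStateException(
                    "Log contains AddVoterRecord for replica " + replicaId +
                    " which is not listed in controller.quorum.voters; shutting down");
            }
        }
    }
}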

...

It is possible for the leader to write an AddVoterRecord to the log and replicate it to some of the voters in the new configuration. If the leader fails before this record has been replicated to the new voter, it is possible that a new leader cannot be elected. This is because in KRaft voters reject vote requests from replicas that are not in the voter set. This check will be removed and replicas will be able to grant vote requests when the candidate is not in the voter set or the voting replica is not in the voter set. The candidate must still have a log with an end offset and epoch that is greater than or equal to the voter's log before the voter will grant a vote to it.
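The relaxed check can be sketched as an epoch-then-end-offset comparison; the names below are illustrative and not the actual KRaft vote-handling code:

// Sketch of the log comparison described above: the voter grants a vote even if the
// candidate (or the voter itself) is not in the voter set, as long as the candidate's
// log is at least as up to date as the voter's.
final class VoteCheck {
    static boolean grantVote(
        int candidateLastEpoch, long candidateEndOffset,
        int voterLastEpoch, long voterEndOffset
    ) {
        if (candidateLastEpoch != voterLastEpoch) {
            return candidateLastEpoch > voterLastEpoch;
        }
        return candidateEndOffset >= voterEndOffset;
    }
}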

High Watermark

As described in KIP-595, the high-watermark will be calculated using the fetched log end offset of the majority of the voters. When a replica is removed or added, it is possible for the high-watermark to decrease. The leader will not allow the high-watermark to decrease and will guarantee that it is monotonically increasing for both the state machines and the remote replicas.
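A simplified sketch of this calculation, assuming the leader only has the voters' fetched log end offsets; it is not the actual Kafka implementation:

import java.util.List;
import java.util.stream.Collectors;

// Illustrative high-watermark computation.
final class HighWatermark {
    // The candidate high-watermark is the largest offset replicated to a majority of
    // the voters: sort the fetched end offsets and take the offset covered by the
    // smallest majority. The result is clamped so that it never decreases, even when
    // voters are added or removed.
    static long update(long currentHighWatermark, List<Long> voterEndOffsets) {
        List<Long> sorted = voterEndOffsets.stream()
            .sorted()
            .collect(Collectors.toList());
        int majorityIndex = (sorted.size() - 1) / 2;
        long candidate = sorted.get(majorityIndex);
        return Math.max(currentHighWatermark, candidate);
    }
}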

With this KIP it is possible for the leader to not be part of the voter set when the removed replica is the leader. In this case the leader will continue to handle Fetch and FetchSnapshot requests as normal, but it will not count itself when computing the high-watermark.

Snapshots

The snapshot generation code needs to be extended to include these new KRaft-specific control records for AddVoterRecord and RemoveVoterRecord. Before this KIP the snapshot didn't include any KRaft-generated control records.
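A hypothetical sketch of writing the voter set into a snapshot; the VoterSnapshotWriter interface below is an illustrative stand-in, not the existing Kafka snapshot API:

import java.util.Map;

// Hypothetical snapshot extension: persist the voter set known at the snapshot's end
// offset so a replica loading the snapshot can reconstruct the voters without
// replaying the log.
interface VoterSnapshotWriter {
    void appendAddVoterRecord(int replicaId, String voterUuid);
}

final class VoterSnapshotSection {
    static void writeVoters(VoterSnapshotWriter writer, Map<Integer, String> voters) {
        voters.forEach(writer::appendAddVoterRecord);
    }
}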

...