...

Disruptive server scenarios

Network Partition

When a follower becomes partitioned from the rest of the quorum, it will continuously increase its epoch to start elections until it regains connectivity to the leader or the rest of the quorum. When the server reconnects, it disturbs the rest of the quorum, which is forced to participate in an unnecessary election. While this situation only results in one leader stepping down, these events may occur more frequently per quorum as we start supporting larger quorums.

...

Code Block
  * Unattached|Resigned transitions to:
  *    Unattached: After learning of a new election with a higher epoch
  *    Voted: After granting a standard vote to a candidate
  *    Prospective: After expiration of the election timeout
  *    Follower: After discovering a leader with an equal or larger epoch
  *
  * Voted transitions to:
  *    Unattached: After learning of a new election with a higher epoch
  *    Prospective: After expiration of the election timeout
  *
  * Prospective transitions to:
  *    Unattached: After learning of a new election with a higher epoch
  *    Candidate: After receiving a majority of pre-votes
  *    Follower: After discovering a leader with an equal or larger epoch
  *
  * Candidate transitions to:
  *    Unattached: After learning of a new election with a higher epoch
  *    Prospective: After expiration of the election timeout
  *    Leader: After receiving a majority of standard votes
  *    Follower: After discovering a leader with an equal or larger epoch (already a valid transition, just missed in original docs)
  *
  * Leader transitions to:
  *    Unattached: After learning of a new election with a higher epoch
  *
  * Follower transitions to:
  *    Unattached: After learning of a new election with a higher epoch
  *    Prospective: After expiration of the election timeout
  *    Follower: After discovering a leader with a larger epoch

A candidate will now send a VoteRequest with the PreVote field set to true and CandidateEpoch set to its [epoch + 1] when its election timeout expires. If [majority - 1] VoteResponses grant the vote, the candidate will then bump its epoch and send a VoteRequest with PreVote set to false, which is our standard vote and will cause state changes for the servers receiving the request.

When servers receive VoteRequests with the PreVote field set to true, they will respond with VoteGranted set to

  • true if the epoch and offsets in the Pre-Vote request satisfy the same conditions as a standard vote
  • false otherwise

When a server receives VoteResponses, it will follow up with another VoteRequest with PreVote set to either true (send another Pre-Vote) or false (send a standard vote):

  • false (standard vote) if the server has received [majority - 1] VoteResponses with VoteGranted set to true within [election.timeout.ms + a little randomness]
  • true (another Pre-Vote) if the server receives [majority] VoteResponses with VoteGranted set to false within [election.timeout.ms + a little randomness]
  • true if the server receives fewer than [majority] VoteResponses with VoteGranted set to false within [election.timeout.ms + a little randomness] and the first bullet point does not apply
    • The reason we don't send a standard vote at this point is explained in Rejected Alternatives. 

If a server happens to receive multiple VoteResponses from another server for a particular VoteRequest, it can take the first and ignore the rest. We could also choose to take the last, but taking the first is simpler. A server does not need to worry about persisting its election state for a Pre-Vote response like we currently do for VoteResponses because the direct result of the Pre-Vote phase does not elect leaders. 
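
To make the request/response flow above concrete, the following is a minimal sketch of how the requesting server might tally Pre-Vote responses. The class and method names (PreVoteTally, recordResponse, etc.) are hypothetical and are not the actual KafkaRaftClient APIs; the sketch only assumes the rules described above: duplicate responses from the same voter are ignored (first one wins), the requester counts its own implicit grant, and it proceeds to a standard vote only once a majority of grants is reached.

Code Block
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch of tallying Pre-Vote responses (not the actual Kafka API).
 * The requesting server asks for pre-votes at (epoch + 1) without bumping its own
 * epoch, and only becomes a Candidate once a majority of grants is reached.
 */
public class PreVoteTally {
    private final int voterCount;                         // total number of voters in the quorum
    private final Map<Integer, Boolean> responses = new HashMap<>();

    public PreVoteTally(int localId, int voterCount) {
        this.voterCount = voterCount;
        // The requesting server implicitly grants its own pre-vote.
        responses.put(localId, true);
    }

    /** Record a response; duplicates from the same voter are ignored (first one wins). */
    public void recordResponse(int voterId, boolean granted) {
        responses.putIfAbsent(voterId, granted);
    }

    private int majority() {
        return voterCount / 2 + 1;
    }

    /** True once [majority - 1] remote grants plus our own grant have been received. */
    public boolean preVoteWon() {
        long grants = responses.values().stream().filter(g -> g).count();
        return grants >= majority();
    }

    /** True once a majority of voters have explicitly rejected the pre-vote. */
    public boolean preVoteLost() {
        long rejections = responses.values().stream().filter(g -> !g).count();
        return rejections >= majority();
    }
}

If preVoteWon() returns true before the election timeout expires, the server bumps its epoch and sends standard votes; if preVoteLost() returns true or the timeout expires first, it backs off and sends another Pre-Vote, per the bullets above.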

How does this prevent unnecessary elections when it comes to network partitions?

When a partitioned server rejoins and tries to force the cluster to participate in an election, all servers will reject its pre-vote request since they have recently heard from the active leader. The disruptive server continuously kicks off elections but is unable to get elected. It should rejoin the quorum when it discovers the higher epoch on the next valid election by another server. (TODO: check this for accuracy; there should be other ways for the server to become a follower earlier.)

Can this prevent necessary elections?

Yes. If a leader is unable to receive fetches from a majority of servers, it can impede followers that are able to communicate with it from electing an eligible leader that can communicate with a majority of the cluster. This is why an additional "Check Quorum" safeguard is needed, which is what KAFKA-15489 implements. Check Quorum ensures a leader steps down if it is unable to receive fetches from a majority of servers.
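
A minimal sketch of the Check Quorum idea, assuming the leader simply tracks the last time each follower fetched from it. The names below (CheckQuorumTracker, shouldResign, checkQuorumTimeoutMs) are hypothetical; this is not the KAFKA-15489 implementation, just an illustration of the safeguard described above.

Code Block
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative Check Quorum sketch (hypothetical, not the actual KAFKA-15489 code):
 * the leader resigns if a majority of voters have not fetched from it recently.
 */
public class CheckQuorumTracker {
    private final long checkQuorumTimeoutMs;
    // Last time each *other* voter fetched from this leader; the leader itself is not tracked.
    private final Map<Integer, Long> lastFetchTimeMs = new HashMap<>();

    public CheckQuorumTracker(Set<Integer> otherVoterIds, long checkQuorumTimeoutMs, long nowMs) {
        this.checkQuorumTimeoutMs = checkQuorumTimeoutMs;
        // Treat startup as if every follower had just fetched, so a newly elected
        // leader is not forced to resign before followers can reach it.
        for (int id : otherVoterIds) {
            lastFetchTimeMs.put(id, nowMs);
        }
    }

    /** Called whenever the leader handles a Fetch request from a follower. */
    public void recordFetch(int voterId, long nowMs) {
        lastFetchTimeMs.computeIfPresent(voterId, (id, old) -> nowMs);
    }

    /** The leader should step down if it no longer hears fetches from a majority of the quorum. */
    public boolean shouldResign(long nowMs) {
        long recentlyFetched = lastFetchTimeMs.values().stream()
                .filter(t -> nowMs - t <= checkQuorumTimeoutMs)
                .count();
        int quorumSize = lastFetchTimeMs.size() + 1;   // followers plus the leader itself
        int majority = quorumSize / 2 + 1;
        return recentlyFetched + 1 < majority;         // the leader always counts itself
    }
}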

Do we still need to reject VoteRequests received within fetch timeout if we have implemented Pre-Vote and Check Quorum?

Yes. Specifically, we would be rejecting Pre-Vote requests received within the fetch timeout. We need to avoid bumping epochs without a new leader being elected, otherwise the server requesting the election(s) will be unable to rejoin the quorum because its epoch is greater than everyone else's while its log continues to fall behind.

The following are two scenarios where having just Pre-Vote and Check Quorum is not enough.

  • Scenario A: A server in an old configuration (e.g. S1 in the below diagram pg. 41) starts a “pre-vote” when the leader is temporarily unavailable, and is elected because it is as up-to-date as the majority of the quorum. The Raft paper argues we cannot rely on the original leader replicating fast enough to get past this scenario, however unlikely it is. We can imagine some bug/limitation with quorum reconfiguration causing S1 to continuously try to reconnect with the quorum (i.e. start elections) while the leader is trying to remove it from the quorum.


  • Scenario B: We can also imagine a non-reconfiguration scenario where two servers, one of which is the leader, are simply unable to communicate with each other. Since the non-leader server is unable to find a leader, it will start an election and may get elected. Since the prior leader is now unable to find the new leader, it will start an election and may get elected. This could continue in a cycle.

Rejecting Pre-Vote requests received within fetch timeout

As mentioned in the prior section, a server should reject Pre-Vote requests received from other servers if its own fetch timeout has not expired yet. The logic for servers receiving VoteRequests with PreVote set to true now looks like the following.

When servers receive VoteRequests with the PreVote field set to true, they will respond with VoteGranted set to

  • true if they haven't heard from a leader in fetch.timeout.ms and all conditions that normally need to be met for VoteRequests are satisfied
  • false if they have heard from a leader in fetch.timeout.ms (could help cover Scenario A & B from above) or conditions that normally need to be met for VoteRequests are not satisfied
  • (Not in scope) To address the disk loss and 'Servers in new configuration' scenario, one option would be to have servers respond false to vote requests from servers that have a new disk and haven't caught up on replication 
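
Putting the two bullets above together, a server's decision to grant a Pre-Vote might look like the following sketch. The helper and parameter names are hypothetical (this is not the actual KafkaRaftClient logic), and the log comparison is simplified; it only illustrates that a recent sign of life from the leader short-circuits the usual epoch and log checks.

Code Block
/**
 * Illustrative sketch of the Pre-Vote granting decision (hypothetical names,
 * not the actual KafkaRaftClient logic). A Pre-Vote is granted only if this
 * server has not heard from a leader within the fetch timeout AND the request
 * satisfies the usual epoch and log conditions of a standard vote.
 */
public final class PreVoteGrantCheck {

    private PreVoteGrantCheck() {}

    public static boolean shouldGrantPreVote(
            long nowMs,
            long lastLeaderContactMs,    // last time this server heard from the active leader
            long fetchTimeoutMs,
            int requestEpoch,            // epoch carried in the Pre-Vote request (requester's epoch + 1)
            int localEpoch,
            long requestLogEndOffset,
            long localLogEndOffset) {

        // Reject if we have heard from a leader recently; this is what covers
        // Scenarios A and B above.
        if (nowMs - lastLeaderContactMs < fetchTimeoutMs) {
            return false;
        }

        // Otherwise apply the same checks as a standard vote (simplified here:
        // the real check compares the epoch and offset of the last log entry).
        if (requestEpoch < localEpoch) {
            return false;
        }
        return requestLogEndOffset >= localLogEndOffset;
    }
}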

Compatibility, Deprecation, and Migration Plan

We can gate Pre-Vote with a new VoteRequest and VoteResponse version, and gate rejecting Pre-Vote requests with the same metadata version bump. (TODO)

What is the Raft equivalent of a metadata version?

If X servers are upgraded and start ignoring vote requests, and Y servers continue to reply and transition to “Unattached”, could we run into a scenario where we violate Raft invariants (more than one leader) or cause extended quorum unavailability?

...

We can never have more than one leader - if Y servers are enough to vote in a new leader, the resulting state change they undergo (e.g. to Voted, to Follower) will cause maybeFireLeaderChange to update the current leader’s state (e.g. to Resigned) as needed.

Proving that unavailability would remain the same or better with a mix of upgraded and non-upgraded servers is a bit harder.

...

What happens if the leader is part of the Y servers that will continue to respond to unnecessary vote requests?

  • If Y is a minority of the cluster, the current leader and a minority of the cluster may vote for this disruptive server to be the new leader, but the server will never win the election. The disruptive server will keep trying with exponential backoff. At some point a majority of the cluster will not have heard from a leader for fetch.timeout.ms and will start participating in voting. A leader can be elected at this point. In the worst case this would extend the time we are leaderless by the length of the fetch timeout.

  • If Y is a majority of the cluster, we have enough servers to potentially elect the disruptive server. (This is the same as the old behavior.) If the server does not receive a majority of votes, it will keep trying with exponential backoff. At some point the rest of the cluster will not have heard from a leader for fetch.timeout.ms and will start participating in voting. In the worst case this would extend the time we are leaderless by the length of the fetch timeout.

What happens if the leader is part of the X servers which will reject unnecessary vote requests?

...

If Y is a minority of the cluster, that minority may vote for this disruptive server to be the new leader, but the server will never win the election. The disruptive server will keep trying with exponential backoff, increasing its epoch. If the current leader ever resigns or a majority of the cluster hasn't heard from a leader, there's a potential we will now vote for the server, but an election would have been necessary in this case anyway.

...



How does this prevent unnecessary leadership loss?

We prevent servers from increasing their epoch prior to establishing they can win an election. 


Do we still need to reject VoteRequests received within fetch timeout if we have implemented Pre-Vote and Check Quorum?

Yes. Specifically, we would be rejecting Pre-Vote requests received within the fetch timeout. In terms of state transitions, this means Followers will not transition to Unattached when they learn of new elections with higher epochs. If they have heard from a leader within the fetch timeout, there is no need to consider electing a new leader. 

The following are two scenarios which explain why Pre-Vote and Check Quorum are not enough to prevent disruptive servers. The second scenario is out of scope and should be covered by KIP-853: KRaft Controller Membership Changes or future work.

  • Scenario A: We can imagine a scenario where two servers (S1 & S2) are both up-to-date on the log but unable to maintain a stable connection with each other. Let's say S1 is the leader. S2 may be unable to find the leader, will start an election, and may get elected. Since S1 might be unable to find the new leader now, it will start an election and may get elected. This could continue in a cycle.
  • Scenario B: A server in an old configuration (e.g. S1 in the below diagram pg. 41) starts a “pre-vote” when the leader is temporarily unavailable, and is elected because it is as up-to-date as the majority of the quorum. The Raft paper argues we cannot rely on the original leader replicating fast enough to get past this scenario, however unlikely it is. We can imagine some bug/limitation with quorum reconfiguration causing S1 to continuously try to reconnect with the quorum (i.e. start elections) while the leader is trying to remove it from the quorum.

[Diagram referenced in Scenario B (pg. 41)]


Compatibility

We can gate Pre-Vote with a new VoteRequest and VoteResponse version. Instead of sending a Pre-Vote, a server will transition from Prospective immediately to Candidate if it knows of other servers which do not support Pre-Vote yet. This will result in the server sending standard votes which are understood by servers on older software versions. 
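
As a rough illustration of the gating described above, a Prospective server could check the highest VoteRequest version each voter supports and fall back to a standard election when any of them cannot handle Pre-Vote. The helper below is hypothetical (including the placeholder version number), not the actual implementation.

Code Block
import java.util.Collection;

/**
 * Illustrative compatibility check (hypothetical; not the actual Kafka code).
 * If any voter only supports a VoteRequest version below the one that introduces
 * the PreVote field, skip the Pre-Vote phase and become a Candidate directly,
 * sending standard votes that older servers understand.
 */
public final class PreVoteCompatibility {

    // Hypothetical placeholder for the first VoteRequest version carrying the PreVote field.
    private static final short MIN_PRE_VOTE_VERSION = 1;

    private PreVoteCompatibility() {}

    /** True only if every voter advertises support for the Pre-Vote request version. */
    public static boolean canSendPreVotes(Collection<Short> maxVoteVersionPerVoter) {
        return maxVoteVersionPerVoter.stream().allMatch(v -> v >= MIN_PRE_VOTE_VERSION);
    }
}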

Test Plan

This will be tested with unit tests, integration tests, system tests, and TLA+. (todo) 

Rejected Alternatives

Rejecting VoteRequests received within fetch timeout (w/o Pre-Vote)

...

However, this would not be a good standalone alternative to Pre-Vote because once a server starts a disruptive election (disruptive in the sense that the current leader still has a majority), its epoch may increase while none of the other servers' epochs do. The most likely way for the server to rejoin the quorum with its inflated epoch would then be to win an election. Since epochs are not increased with Pre-Vote requests, it is easier for a disruptive server to rejoin the quorum once it finds that any of the servers in the cluster has been elected. (TODO: can it rejoin once it finds the current leader?)

Separate RPC for Pre-Vote

...

This scenario shares similarities with adding new servers to the quorum, which KIP-853: KRaft Controller Membership Changes would handle. If a server loses its disk and fails to fully catch up to the leader prior to another server starting an election, it may vote for any server which is at least as caught up as itself (which might be less than the last leader). One way to handle this is to add logic preventing servers with new disks (determined via a unique storage id) from voting prior to sufficiently catching up on the log. Another way is to reject pre-vote requests from these servers. We leave this scenario to be covered by KIP-853 or future work because of the similarities with adding new servers.

Time | Server 1 | Server 2 | Server 3
T0 | Leader with majority of quorum (Server 1, Server 3) caught up with its committed data | Lagging follower | Follower
T1 |  |  | Disk failure
T2 | Leader → Unattached state | Follower → Unattached state | Comes back up w/ new disk, triggers an election before catching up on replication
   |  |  | Will not be elected
T4 |  | Election ms times out and starts an election |
T5 |  | Votes for Server 2 | Votes for Server 2
T6 |  | Elected as leader leading to data loss |



Sending Standard Votes after failure to win Pre-Vote

...