node may still believe it is leader and send fetch requests to the rest of the quorum

Table of Contents

Status

Current state: Under Discussion

...

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

This KIP will go over scenarios where we might expect disruptive servers and discuss how Pre-Vote (as originally detailed in the Raft paper and in KIP-650) along with Rejecting VoteRequests received within fetch timeout can ensure correctness when it comes to network partitions (as well as quorum reconfiguration and failed disk scenarios).

...

Rejecting VoteRequests received within fetch timeout entails servers rejecting any vote requests received prior to their own fetch timeout expiring. The idea here is if we've recently heard from a leader, we should not attempt to elect a new one just yet.

Disruptive server scenarios

Network Partition

When a follower becomes partitioned from the rest of the quorum, it will continuously increase its epoch to start elections until it is able to regain connection to the leader/rest of the quorum. When the follower regains connectivity, it will disturb the rest of the quorum as they will be forced to participate in an unnecessary election. While this situation only results in one leader stepping down, as we start supporting larger quorums these events may occur more frequently per quorum.

...

While network partitions may be the main issue we expect to encounter/mitigate impact for, it’s possible that bugs and user error create similar effects to network partitions that we need to guard against. For instance, a bug which causes fetch requests to periodically timeout or setting controller.quorum.fetch.timeout.ms and other related configs too low.

Quorum reconfiguration

The scenarios here should be covered by KIP-853: KRaft Controller Membership Changes, which may choose to make use of Pre-Vote. While KIP-853 may take another approach to handle the following scenarios, it's worth discussing how Pre-Vote can be applied as well.

Servers in old configuration

When reconfiguring a quorum, servers in the old configuration which are not also in new configuration will stop receiving heartbeats from the leader which can lead to them starting new elections and forcing the current leader to step down. This is disruptive as the servers in the old configuration are not eligible to be elected and could cause leadership to bounce prior to their complete removal.

...

Another way is to reject election requests sent by servers in old configurations due for removal - with Pre-Vote implemented this would not result in any epoch bumps. This could increase the chance of unavailability if the old server is the only one eligible for leadership. To safeguard against this we could have servers only reject election requests received before their fetch timeout hits zero.

Servers in new configuration

What happens when a new node joins the quorum? If it were allowed to participate in elections without having fully replicated the leader’s log, could a node w/ a subset of the committed data be elected leader? Quorum reconfiguration only allows one addition/deletion at a time, and we are most vulnerable when the original configuration is small (3 node minimum). Given this, if a majority of the cluster is not caught up with the leader when we add a new node, we may lose data.

...

Another way is to reject pre-vote requests from these nodes.The candidate does not increase its epoch prior to sending the request out.

Disk Loss Scenario

This scenario shares similarities with adding new nodes to the quorum. If a node loses its disk and fails to fully catch up to the leader prior to another node starting an election, it may vote for any node which is at least as caught up as itself (which might be less than the last leader). The two solutions above can be applied here as well.

Time	Node 1	Node 2	Node 3
T0	Leader with majority of quorum (node 1, node 3) caught up with its committed data	Lagging follower	Follower
T1			Disk failure
T2	Leader → Unattached state	Follower → Unattached state	Comes back up w/ new disk, triggers an election before catching up on replication
			Will not be elected
T4		Election ms times out and starts an election
T5		Votes for Node 2	Votes for Node 2
T6		Elected as leader leading to data loss

Public Interfaces

Pre-Vote

We will add a new field PreVote to VoteRequests to signal whether the request is a PreVote or not. The candidate does not increase its epoch prior to sending the request out. The VoteResponse schema does not need any additional fields (still needs a version bump to match version bump for VoteRequest).

...

todo: Should VoteResponse need to change to indicate response was for Pre-Vote vs Vote Request? (should be possible to retrieve request object for each response, but this isn't really done currently)

Proposed Changes

Pre-Vote

Code Block

/**
 * Unattached|Resigned transitions to:
 *    Unattached: After learning of a new election with a higher epoch
 *    Voted: After granting a vote to a candidate
 *    Candidate: After expiration of the election timeout
 *    Follower: After discovering a leader with an equal or larger epoch
 *
 * Voted transitions to:
 *    Unattached: After learning of a new election with a higher epoch
 *    Candidate: After expiration of the election timeout
 *
 * Candidate transitions to:
 *    Unattached: After learning of a new election with a higher epoch
 *    Candidate: After expiration of the election timeout
 *    Leader: After receiving a majority of votes
 *
 * Leader transitions to:
 *    Unattached: After learning of a new election with a higher epoch
 *    Resigned: When shutting down gracefully
 *
 * Follower transitions to:
 *    Unattached: After learning of a new election with a higher epoch
 *    Candidate: After expiration of the fetch timeout
 *    Follower: After discovering a leader with a larger epoch
 *
 * Observers follow a simpler state machine. The Voted/Candidate/Leader/Resigned
 * states are not possible for observers, so the only transitions that are possible
 * are between Unattached and Follower.
 *
 * Unattached transitions to:
 *    Unattached: After learning of a new election with a higher epoch
 *    Follower: After discovering a leader with an equal or larger epoch
 *
 * Follower transitions to:
 *    Unattached: After learning of a new election with a higher epoch
 *    Follower: After discovering a leader with a larger epoch
 *
 */

...

Scenario B: We can also imagine a non-reconfiguration scenario where two nodes, one of which is the leader, are simply unable to communicate with each other. Since the non-leader node is unable to find a leader, it will start an election and may get elected. Since the prior leader is now unable to find the new leader, it will start an election and may get elected. This could continue in a cycle.

Rejecting Pre-Vote requests received within fetch timeout

As mentioned in the prior section, a server should reject Pre-Vote requests received from other servers if its own fetch timeout has not expired yet. The logic now looks like the following for servers receiving VoteRequests with PreVote set to true

...

true if they haven't heard from a leader in fetch.timeout.ms and all conditions that normally need to be met for VoteRequests are satisfied
false if they have heard from a leader in fetch.timeout.ms (could help cover Scenario A & B from above) or conditions that normally need to be met for VoteRequests are not satisfied
(Not in scope) To address the disk loss and 'Servers in new configuration' scenario, one option would be to have servers respond false to vote requests from servers that have a new disk and haven't caught up on replication

Compatibility, Deprecation, and Migration Plan

We can gate Pre-Vote with a new VoteRequest and VoteResponse version. , and gate rejecting Pre-Vote requests with the same metadata version bump. (todo)

...

We can never have more than one leader - if Y nodes are enough to vote in a new leader, the resulting state change they undergo (e.g. to Voted, to Follower) will cause maybeFireLeaderChange to update the current leader’s state (e.g. to Resigned) as needed.
Proving that unavailability would remain the same or better w/ a mish-mash of upgraded nodes is a bit harder.
- What happens if the leader is part of the Y nodes that will continue to respond to unnecessary vote requests?
  - If Y is a minority of the cluster, the current leader and minority of cluster may vote for this disruptive node to be the new leader but the node will never win the election. The disruptive node will keep trying with exponential backoff. At some point a majority of the cluster will not have heard from a leader for fetch.timeout.ms and will start participating in voting. A leader can be elected at this point. In the worst case this would extend the time we are leader-less by the length of the fetch timeout.
  - If Y is a majority of the cluster, we have enough nodes to potentially elect the disruptive node. (This is the same as the old behavior.) If a majority of votes were not received by the node, it will keep trying with exponential backoff. At some point the rest of the cluster will not have heard from a leader for fetch.timeout.ms and will start participating in voting. In the worst case this would extend the time we are leader-less by the length of the fetch timeout.
- What happens if leader is part of the X nodes which will reject unnecessary vote requests?
  - If Y is a minority of the cluster, the minority of cluster may vote for this disruptive node to be the new leader but the node will never win the election. The disruptive node will keep trying with exponential backoff, increasing its epoch. If the current leader ever resigns or a majority of the cluster hasn’t heard from a leader, there’s a potential we will now vote for the node - but an election would have been necessary in this case anyways
  - If Y is a majority of the cluster, we have enough nodes to potentially elect the disruptive node. (This is the same as the old behavior.) If a majority of votes were not received by the node, it will keep trying with exponential backoff. The nodes in Y will transition to unattached after receiving the first vote request and potentially start their own elections. The nodes in X will still be receiving fetch responses from the current leader and will be rejecting all vote requests from nodes in Y. Although there must be at least one node in Y that is eligible to become leader (otherwise, Y couldn’t have been a majority of the cluster), if we suffer enough node failures in Y where Y is no longer a majority of the cluster, we get stuck. X (a minority of the cluster, but now needed to form a majority) stays connected to the current leader and never votes for an eligible leader in Y. No new commits can be made to the log. Check quorum would solve this case (leader resigns since it senses it no longer has a majority).

Test Plan

This will be tested with unit tests, integration tests, system tests, and TLA+. (todo)

Rejected Alternatives

Rejecting VoteRequests received within fetch timeout (w/o Pre-Vote)

This was originally proposed in the Raft paper as a necessary safeguard to prevent Scenario A from occurring, but we can see how this could extend to cover all the other disruptive scenarios mentioned.

...

However, this would not be a good standalone alternative to Pre-Vote because once a server starts a disruptive election (disruptive in the sense that the current leader still has majority), its epoch may increase while none of the other servers' epochs do. The most likely way for the server to rejoin the quorum now with its inflated epoch would be to win an election. Since epochs are not increased with Pre-Vote requests, it is easier for a disruptive server to rejoin the quorum once it finds any of the servers in the cluster have been elected. (todo: can it join once it finds the current leader)

Separate RPC for Pre-Vote

This would be added toil with no real added benefits. Since a Pre-Vote and a standard Vote are similar in concept, it makes sense to cover both with the same RPC. We can add clear logging and metrics to easily differentiate between Pre-Vote and standard Vote requests.

...

Space shortcuts

Child pages

Versions Compared

Old Version 11

New Version 12

Key

Status

Motivation

Disruptive server scenarios

Network Partition

Quorum reconfiguration

Servers in old configuration

Servers in new configuration

Disk Loss Scenario

Public Interfaces

Pre-Vote

Proposed Changes

Pre-Vote

Rejecting Pre-Vote requests received within fetch timeout

Compatibility, Deprecation, and Migration Plan

Test Plan

Rejected Alternatives

Rejecting VoteRequests received within fetch timeout (w/o Pre-Vote)

Separate RPC for Pre-Vote

Space shortcuts

Child pages

Page History

Versions Compared

Old Version 11

New Version 12

Key

Status

Motivation

Disruptive server scenarios

Network Partition

Quorum reconfiguration

Servers in old configuration

Servers in new configuration

Disk Loss Scenario

Public Interfaces

Pre-Vote

Proposed Changes

Pre-Vote

Rejecting Pre-Vote requests received within fetch timeout

Compatibility, Deprecation, and Migration Plan

Test Plan

Rejected Alternatives

Rejecting VoteRequests received within fetch timeout (w/o Pre-Vote)

Separate RPC for Pre-Vote