...

  1. When a follower becomes partitioned from the rest of the quorum, it will continuously increase its epoch to start elections until it is able to regain connection to the leader/rest of the quorum. When the follower regains connectivity, it will disturb the rest of the quorum as they will be forced to participate in an unnecessary election (the sketch after the example below illustrates this). While this situation only results in one leader stepping down, as we start supporting larger quorums these events may occur more frequently per quorum.
  2. When a leader becomes partitioned from the rest of the quorum, one of the following occurs:
    1. It becomes aware that it should step down as leader and initiates a new election.
      This then reduces to the follower case covered above.
    2. It continues to believe it is the leader until it rejoins the rest of the quorum.
      On rejoining, it will start receiving messages with a different epoch. If the epoch is higher than its own, the node will know to resign as leader and eventually become a follower of the node with the higher epoch. If the epoch is lower than its own, the node may still believe it is the leader and send fetch requests to the rest of the quorum. Nodes will transition to the Unattached state and election(s) will ensue. (todo)

For example, after a series of such partitions the quorum could end up in the following state:

A - epoch 1 (disconnected), where the leader is A

B, C, E - epoch 2, where the leader is B

D - epoch 0 (disconnected)
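
To make the disruption concrete, the following is a minimal, self-contained Java sketch (not Kafka code; all class and method names are hypothetical) of how a partitioned follower's repeated epoch bumps force the healthy nodes to abandon their current leader the moment connectivity returns.

Code Block
// Minimal sketch of the disruption described above. Not Kafka code; all names
// are hypothetical. The key Raft rule is that a message carrying a higher
// epoch always wins, so the reconnecting node's inflated epoch knocks the
// healthy nodes (including the leader) out of their current epoch.
public class PartitionedFollowerSketch {

    static final class Node {
        final int id;
        int epoch;
        Integer leaderId; // null = no known leader

        Node(int id, int epoch, Integer leaderId) {
            this.id = id;
            this.epoch = epoch;
            this.leaderId = leaderId;
        }

        // On receiving a vote request with a higher epoch, abandon the current leader.
        void onVoteRequest(int candidateEpoch) {
            if (candidateEpoch > epoch) {
                epoch = candidateEpoch;
                leaderId = null; // an unnecessary election must now take place
            }
        }
    }

    public static void main(String[] args) {
        Node a = new Node(1, 5, 1);            // current leader
        Node b = new Node(2, 5, 1);            // healthy follower
        Node partitioned = new Node(3, 5, 1);  // follower cut off from a and b

        // While partitioned, every election timeout bumps the epoch.
        for (int timeouts = 0; timeouts < 4; timeouts++) {
            partitioned.epoch++;
        }

        // On reconnect, its vote requests carry the inflated epoch.
        a.onVoteRequest(partitioned.epoch);
        b.onVoteRequest(partitioned.epoch);

        System.out.println("Partitioned follower epoch: " + partitioned.epoch); // 9
        System.out.println("A's known leader after reconnect: " + a.leaderId);  // null
    }
}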

While network partitions may be the main issue we expect to encounter and mitigate, it is possible for bugs and user error to create effects similar to network partitions that we also need to guard against. For instance, a bug which causes fetch requests to periodically time out, or setting controller.quorum.fetch.timeout.ms and other related configs too low, can have the same disruptive effect.

...

Code Block
{
  "apiKey": 52,
  "type": "request",
  "listeners": ["controller"],
  "name": "VoteRequest",
  "validVersions": "0-1",
  "flexibleVersions": "0+",
  "fields": [
    { "name": "ClusterId", "type": "string", "versions": "0+",
      "nullableVersions": "0+", "default": "null"},
    { "name": "Topics", "type": "[]TopicData",
      "versions": "0+", "fields": [
      { "name": "TopicName", "type": "string", "versions": "0+", "entityType": "topicName",
        "about": "The topic name." },
      { "name": "Partitions", "type": "[]PartitionData",
        "versions": "0+", "fields": [
        { "name": "PartitionIndex", "type": "int32", "versions": "0+",
          "about": "The partition index." },
        { "name": "CandidateEpoch", "type": "int32", "versions": "0+",
          "about": "The bumped epoch of the candidate sending the request"},
        { "name": "CandidateId", "type": "int32", "versions": "0+", "entityType": "brokerId",
          "about": "The ID of the voter sending the request"},
        { "name": "LastOffsetEpoch", "type": "int32", "versions": "0+",
          "about": "The epoch of the last record written to the metadata log"},
        { "name": "LastOffset", "type": "int64", "versions": "0+",
          "about": "The offset of the last record written to the metadata log"},
        { "name": "PreVote", "type": "boolean", "versions": "1+",
          "about": "Whether the request is a PreVote request (no epoch increase) or not."}
      ]
 ...
}


todo: Does VoteResponse need to change to indicate whether a response was for a Pre-Vote or a standard Vote request? (It should be possible to retrieve the request object for each response, but this isn't really done currently.)
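
As a generic illustration of that alternative (this is not Kafka's networking code; all names are hypothetical), the sender could remember whether each outstanding VoteRequest was a Pre-Vote, keyed by correlation id, and look that up when the response arrives rather than changing the response schema:

Code Block
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: track whether each in-flight VoteRequest was a
// Pre-Vote so the matching VoteResponse can be interpreted correctly
// without adding a field to the response.
public class InflightVoteRequests {

    // correlationId -> whether the request we sent had PreVote=true
    private final Map<Integer, Boolean> inflight = new HashMap<>();

    public void onRequestSent(int correlationId, boolean preVote) {
        inflight.put(correlationId, preVote);
    }

    // Returns true if the response with this correlation id answers a Pre-Vote request.
    public boolean wasPreVote(int correlationId) {
        Boolean preVote = inflight.remove(correlationId);
        return Boolean.TRUE.equals(preVote);
    }
}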

Proposed Changes

Pre-Vote

A candidate will now send a VoteRequest with the PreVote field set to true when its election timeout expires. If at least [majority - 1] of the VoteResponses grant the vote, the candidate will then bump its epoch and send a VoteRequest with PreVote set to false, which behaves the same way as before.
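
A minimal sketch of this candidate-side flow, assuming a fixed voter set (all names are hypothetical; this is not the KafkaRaftClient implementation):

Code Block
import java.util.Map;
import java.util.Set;

// Simplified candidate-side Pre-Vote flow. Hypothetical names only.
public class PreVoteCandidateSketch {

    interface VoteRpc {
        // Sends VoteRequests with the given PreVote flag to the other voters and
        // returns voterId -> voteGranted for the responses received in time.
        Map<Integer, Boolean> sendVoteRequests(int epoch, boolean preVote);
    }

    private final int localId;
    private final Set<Integer> voters;
    private final VoteRpc rpc;
    private int epoch;

    public PreVoteCandidateSketch(int localId, Set<Integer> voters, VoteRpc rpc, int epoch) {
        this.localId = localId;
        this.voters = voters;
        this.rpc = rpc;
        this.epoch = epoch;
    }

    // Called when the election timeout expires.
    public void onElectionTimeout() {
        // Phase 1: Pre-Vote. The epoch is NOT bumped; the request only asks
        // "would you vote for me if I started a real election?".
        long granted = countGranted(rpc.sendVoteRequests(epoch, true));

        // The candidate implicitly counts its own vote, so [majority - 1]
        // granted responses from the other voters are enough to proceed.
        if (granted + 1 >= majority()) {
            // Phase 2: a standard election, identical to the existing behavior.
            epoch++;
            long standardGranted = countGranted(rpc.sendVoteRequests(epoch, false));
            if (standardGranted + 1 >= majority()) {
                becomeLeader();
            }
        }
        // Otherwise: back off and retry later without having disturbed the
        // rest of the quorum with an epoch bump.
    }

    private long countGranted(Map<Integer, Boolean> responses) {
        return responses.values().stream().filter(Boolean::booleanValue).count();
    }

    private int majority() {
        return voters.size() / 2 + 1;
    }

    private void becomeLeader() {
        System.out.println("Node " + localId + " became leader for epoch " + epoch);
    }
}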

...

We can gate Pre-Vote with a new VoteRequest and VoteResponse version as well as a metadata version bump, and gate rejecting Pre-Vote requests on the same metadata version bump. (todo)

todo: Is there a Raft-level equivalent of the metadata version that should be used for this gating?
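
As a rough illustration only (the gating mechanism itself is still an open question above), the sender-side check could be keyed off the highest VoteRequest version the target node advertises, e.g. via ApiVersions; the names below are hypothetical:

Code Block
// Hypothetical sketch of sender-side gating: only send PreVote=true when the
// target node advertises support for VoteRequest version 1+. The real gating
// (including any metadata version / feature flag check) is not decided here.
public class PreVoteGatingSketch {

    // Highest VoteRequest version supported by a remote node, e.g. as learned
    // from its ApiVersions response.
    interface SupportedVersions {
        short latestVoteRequestVersion(int nodeId);
    }

    private final SupportedVersions supported;

    public PreVoteGatingSketch(SupportedVersions supported) {
        this.supported = supported;
    }

    // True if we may send a Pre-Vote (PreVote=true) request to this node.
    public boolean canSendPreVote(int nodeId) {
        return supported.latestVoteRequestVersion(nodeId) >= 1;
    }
}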

If X nodes are upgraded and start ignoring vote requests while Y nodes continue to reply and transition to “Unattached”, could we run into a scenario where we violate Raft invariants (more than one leader) or cause extended quorum unavailability? (The sketch after this list works through the availability cases below with concrete numbers.)

  • We can never have more than one leader - if Y nodes are enough to vote in a new leader, the resulting state change they undergo (e.g. to Voted, to Follower) will cause maybeFireLeaderChange to update the current leader’s state (e.g. to Resigned) as needed.

  • Proving that unavailability would remain the same or better with a mix of upgraded and non-upgraded nodes is a bit harder.

    • What happens if the leader is part of the Y nodes that will continue to respond to unnecessary vote requests?

      • If Y is a minority of the cluster, the current leader and the minority of the cluster may vote for this disruptive node to be the new leader, but the node will never win the election. The disruptive node will keep trying with exponential backoff. At some point a majority of the cluster will not have heard from a leader for fetch.timeout.ms and will start participating in voting. A leader can be elected at this point. In the worst case this would extend the time we are leaderless by the length of the fetch timeout.

      • If Y is a majority of the cluster, we have enough nodes to potentially elect the disruptive node. (This is the same as the old behavior.) If the node does not receive a majority of votes, it will keep trying with exponential backoff. At some point the rest of the cluster will not have heard from a leader for fetch.timeout.ms and will start participating in voting. In the worst case this would extend the time we are leaderless by the length of the fetch timeout.

    • What happens if the leader is part of the X nodes which will reject unnecessary vote requests?

      • If Y is a minority of the cluster, the minority of the cluster may vote for this disruptive node to be the new leader, but the node will never win the election. The disruptive node will keep trying with exponential backoff, increasing its epoch each time. If the current leader ever resigns, or a majority of the cluster has not heard from a leader, there is a chance the quorum will now vote for the node, but an election would have been necessary in that case anyway.

      • If Y is a majority of the cluster, we have enough nodes to potentially elect the disruptive node. (This is the same as the old behavior.) If the node does not receive a majority of votes, it will keep trying with exponential backoff. The nodes in Y will transition to Unattached after receiving the first vote request and may start their own elections. The nodes in X will still be receiving fetch responses from the current leader and will reject all vote requests from nodes in Y. Although there must be at least one node in Y that is eligible to become leader (otherwise, Y couldn’t have been a majority of the cluster), if we suffer enough node failures in Y that Y is no longer a majority of the cluster, we get stuck: X (a minority of the cluster, but now needed to form a majority) stays connected to the current leader and never votes for an eligible leader in Y, so no new commits can be made to the log. Check Quorum would solve this case (the leader resigns once it senses it no longer has a majority).
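
To make the case analysis above concrete, here is a small, self-contained sketch (purely illustrative; the numbers and names are hypothetical) that works through whether a disruptive, non-upgraded node can win an election outright in a 5-node quorum for different splits of upgraded (X) and non-upgraded (Y) nodes:

Code Block
// Works through the case analysis above with concrete numbers. In a quorum of
// 5, a candidate needs 3 votes (including its own). Before fetch.timeout.ms
// expires on the nodes that still see a healthy leader, only the non-upgraded
// (Y) nodes can grant a disruptive candidate's unnecessary vote request.
public class MixedVersionQuorumSketch {

    public static void main(String[] args) {
        int quorumSize = 5;
        int majority = quorumSize / 2 + 1; // 3

        // Assume the disruptive candidate is itself one of the Y nodes.
        for (int y = 1; y <= quorumSize; y++) {
            int x = quorumSize - y;      // upgraded nodes that reject the request
            int votesAvailable = y;      // the candidate plus the other y - 1 Y nodes
            boolean canWinImmediately = votesAvailable >= majority;
            System.out.printf(
                "X = %d upgraded, Y = %d non-upgraded -> at most %d votes -> immediate win: %b%n",
                x, y, votesAvailable, canWinImmediately);
        }
        // Y a minority (Y <= 2): the disruptive node can never win outright; it can only
        // win after a majority has not heard from the leader for fetch.timeout.ms,
        // i.e. in a situation where an election was needed anyway.
        // Y a majority (Y >= 3): the behavior matches the pre-KIP protocol.
    }
}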


Test Plan

This will be tested with unit tests, integration tests, system tests, and TLA+. (todo)

...