...
Note that only voters receive the BeginQuorumEpoch
request: observers will discover the new leader through either the DiscoverBrokers
or Fetch
APIs. For example, the old leader API. If a non-leader receives a Fetch
request, it would return an error code in Fetch
response the response indicating that it is no longer the leader and it will also encode the current known leader id / epoch as well, then the observers can start fetching from the new leader. In case the old leader does not know who's the new leader, observers can still fallback to DiscoverBrokers
request to discover the new leaderUpon initialization, an observer will send a Fetch
request to any of the voters at random until the new leader is found.
Request Schema
Code Block |
---|
{ "apiKey": N, "type": "request", "name": "BeginQuorumEpochRequest", "validVersions": "0", "fields": [ {"name": "ClusterId", "type": "string", "versions": "0+"}, { "name": "Topics", "type": "[]BeginQuorumTopicRequest", "versions": "0+", "fields": [ { "name": "TopicName", "type": "string", "versions": "0+", "entityType": "topicName", "about": "The topic name." }, { "name": "Partitions", "type": "[]BeginQuorumPartitionRequest", "versions": "0+", "fields": [ { "name": "PartitionIndex", "type": "int32", "versions": "0+", "about": "The partition index." }, {"name": "LeaderId", "type": "int32", "versions": "0+", "about": "The ID of the newly elected leader"}, {"name": "LeaderEpoch", "type": "int32", "versions": "0+", "about": "The epoch of the newly elected leader"} ] } } ] } |
...
- If the response contains FENCED_LEADER_EPOCH error code, check the
leaderId
from the response. If it is defined, then update thequorum-state
file and become a "follower" (may or may not have voting power) of that leader. Otherwise, try to learn about the new leader viaDiscoverBrokers
requests; in retry to theFetch
request again against one of the remaining voters at random; in the mean time it may receive BeginQuorumEpoch request which would also update the new epoch / leader as well. - If the response contains OUT_OF_RANGE error code, truncate its local log to the encoded nextFetchOffset, and then resend the
FetchRecord
request.
Note that followers will have to check any records received for the presence of control records. Specifically a follower/observer must check for voter assignment messages which could change its role.
Discussion:
...
Pull v.s. Push Model
In the original Raft paper, the push model is used for replicating states, where leaders actively send requests to followers and count quorum from returned
There's one caveat of the pull-based model: say a new leader has been elected with a new epoch and everyone has learned about it except the old leader (e.g. that leader was not in the voters anymore and hence not receiving the BeginQuorumEpoch as well), then that old leader would not be notified by anyone about the new leader / epoch and become a pure "zombie leader".
To resolve this issue, we will piggy-back on the "quorum.fetch.timeout.ms
" config, such that if the leader did not receive Fetch requests from a majority of the quorum for that amount of time, it would start sending Metadata
request to random nodes in the cluster to understand the latest quorum. If it couldn't connect to any known voter, the old leader shall reset the connection information and send out DiscoverBrokers
. And if the returned response includes a newer epoch leader, this zombie leader would step down and becomes an observer; and if it realized that it is still within the current quorum's voter list, it would start fetching from that leader. Note that the node will remain a leader until it finds that it has been supplanted by another voter.
Discussion: Pull v.s. Push Model
In the original Raft paper, the push model is used for replicating states, where leaders actively send requests to followers and count quorum from returned acknowledgements to decide if an entry is committed.
...
This request is always sent to the leader node. We expect AdminClient
to use DiscoverBrokers and
the Metadata
in API in order to discover the current leader. Upon receiving the request, a node will do the following:
- First check whether the node is the leader. If not, then return an error to let the client retry with
DiscoverBrokers
Metadata. If the current leader is known to the receiving node, then include theLeaderId
andLeaderEpoch
in the response. - Build the response using current assignment information and cached state about replication progress.
...
- If the response indicates that the intended node is not the current leader, then check the response to see if the
LeaderId
has been set. If so, then attempt to retry the request with the new leader. - If the current leader is not defined in the response (which could be the case if there is an election in progress), then backoff and retry with
DiscoverBrokers and Metadata
. - Otherwise the response can be returned to the application, or the request eventually times out.
...