...

  1. Need controller failover.

Paths:

  1. Epoch path: stores the current epoch.
    Code Block
    /epoch --> {long} (a monotonically increasing number; used to identify leader generations) 
  2. Controller path: stores the current controller info.
    Code Block
    /controller --> {brokerid} (ephemeral; created by controller) 
  3. Broker path: stores the information of all live brokers.
    Code Block
    /brokers/ids/[broker_id] --> host:port (ephemeral; created by admin) 
  4. Topic path: stores the replication assignment for all partitions in a topic. For each replica, we store the id of the broker to which the replica is assigned. The first replica is the preferred replica. Note that for a given partition, there is at most one replica on a broker. Therefore, the broker id can be used as the replica id.
    Code Block
    /brokers/topics/[topic] --> {part1: [broker1, broker2], part2: [broker2, broker3] ...}  (created by admin) 
  5. LeaderAndISR path: stores the leader and the ISR of a partition.
    Code Block
     /brokers/topics/[topic]/[partition_id]/leaderAndISR --> {leader_epoch: epoch, leader: broker_id, ISR: {broker1, broker2}}
     
     This path is updated by the controller or the current leader. The current leader only updates the ISR part.
     Updating the path requires synchronization using conditional updates to Zookeeper (see the sketch after this list). 
  6. PartitionReassignment path: This path is used when we want to reassign some partitions to a different set of brokers. For each partition to be reassigned, it stores a list of new replicas and their corresponding assigned brokers. This path is created by an administrative process and is automatically removed once the partition has been moved successfully.
    Code Block
     /brokers/partitions_reassigned/[topic]/[partition_id] --> {broker_id …} (created by admin) 

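The conditional updates mentioned in item 5 are ZooKeeper's versioned writes. Below is a minimal sketch using the plain ZooKeeper Java client; the helper name and the payload handling are illustrative assumptions, not the actual Kafka code.

Code Block

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalUpdate {
    // Conditionally update a leaderAndISR node: the write succeeds only if
    // the node is still at the version we read. On failure the caller must
    // re-read the node and re-derive its decision (see step 3.3 in section A).
    static boolean updateLeaderAndIsr(ZooKeeper zk, String path, byte[] newValue)
            throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        zk.getData(path, false, stat);                      // read value + version
        try {
            zk.setData(path, newValue, stat.getVersion());  // conditional write
            return true;
        } catch (KeeperException.BadVersionException e) {
            return false;  // someone else updated the node first; retry from the read
        }
    }
}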

Terminologies:

AR: assigned replicas, ISR: in-sync replicas

A. Failover during broker failure.

Controller watches child changes of /brokers/ids path. When the watcher gets triggered, it calls on_broker_change().

Code Block

on_broker_change():
The controller keeps in memory for every partition: leader, AR
1. call change_leaders() on the current list of partitions

change_leaders():
Input: a list of partitions and their leader, AR
1. Read the current live broker list
2. Determine set_p, the set of partitions whose leader is not in the live broker list.
3. For each partition P in set_p
3.1 Read the current ISR of P from ZK
3.2 Determine the new leader and the new ISR of P:
    If the ISR has at least 1 broker in the live broker list, select one of those brokers as the new leader. The new ISR includes all brokers in the current ISR that are alive.
    Otherwise, select one of the live brokers in AR as the new leader and set that broker as the new ISR (potential data loss in this case).
    Finally, if none of the brokers in AR is alive, set the new leader to -1.
3.3 Write the new leader, the new ISR and a new epoch (increase the current epoch by 1) in /brokers/topics/[topic]/[partition_id]/leaderAndISR.
    This write has to be done conditionally. If the version of the LeaderAndISR path has changed between 3.1 and 3.3, go back to 3.1.
4. Send a LeaderAndISRCommand (containing the new leader/ISR and the ZK version of the LeaderAndISR path) for each partition in set_p to the affected brokers.
   For efficiency, we can put multiple commands in 1 RPC request.
(Ideally, we want to use ZK multi to do the reads and writes in steps 3.1 and 3.3 in 1 transaction for better latency and correctness.)

Question A.2. Should the controller send the state change commands to the brokers that are currently down?
Technically, we don't need to do that. On startup, a broker can get the latest state of each partition by reading /brokers/topics/[topic]/[partition_id]/leaderAndISR and receive new state change commands from the controller afterwards.

Question A.3. Is a broker being down the only failure scenario that we worry about? Do we worry about leader failure at the individual partition level?
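
To make the leader-selection rule in step 3.2 concrete, here is a minimal sketch. The names and the Decision holder are hypothetical, and taking the first live broker is just one way to pick among the live candidates (the first replica in AR is the preferred replica).

Code Block

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class LeaderSelection {
    static final int NO_LEADER = -1;

    // The decision for one partition: its new leader and new ISR.
    record Decision(int leader, List<Integer> isr) {}

    // Step 3.2 of change_leaders(): prefer a live broker from the current ISR;
    // otherwise fall back to a live broker from AR (potential data loss);
    // otherwise leave the partition without a leader (-1).
    static Decision selectLeader(List<Integer> currentIsr, List<Integer> ar,
                                 Set<Integer> liveBrokers) {
        List<Integer> liveIsr = new ArrayList<>();
        for (int b : currentIsr) if (liveBrokers.contains(b)) liveIsr.add(b);
        if (!liveIsr.isEmpty()) return new Decision(liveIsr.get(0), liveIsr);

        for (int b : ar)
            if (liveBrokers.contains(b)) return new Decision(b, List.of(b));

        return new Decision(NO_LEADER, List.of());
    }
}
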
B. Broker acts on commands from the controller.

Each broker listens to commands from the controller through RPC. 

Code Block

For LeaderAndISRCommand, it calls on_LeaderAndISRCommand().
on_LeaderAndISRCommand(command):
1. Read the set of partitions set_p from the command.
2. For each partition P in set_p
2.1 If the command asks this broker to be the new leader for P and this broker is not already the leader for P,
2.1.1 stop the fetcher to the current leader
2.1.2 become the leader and remember the ZK version of the LeaderAndISR path
2.2 If the command asks this broker to follow a leader L and the broker is not already following L
2.2.1 stop the fetcher to the current leader
2.2.2 become a follower to L

3. If the command has a flag INIT, delete all local partitions not in set_p.


For StopReplicaCommand, it calls on_StopReplicaCommand().
on_StopReplicaCommand(command):
1. Read the list of partitions from the command.
2. For each such partition P
2.1 delete P from local storage, if present.
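
A rough sketch of the broker-side dispatch for these two commands is shown below. The command fields and the ReplicaManager interface are assumed stand-ins for whatever the RPC and log layers actually define.

Code Block

import java.util.Map;
import java.util.Set;

public class CommandHandler {
    // Hypothetical per-partition payload of a LeaderAndISRCommand.
    record PartitionState(int leader, int zkVersion) {}

    // Assumed local broker facilities; stand-ins, not real Kafka classes.
    interface ReplicaManager {
        void stopFetcher(String partition);
        void becomeLeader(String partition, int zkVersion);
        void becomeFollower(String partition, int leader);
        boolean isLeader(String partition);
        boolean isFollowing(String partition, int leader);
        Set<String> localPartitions();
        void deleteLocal(String partition);
    }

    static void onLeaderAndIsrCommand(ReplicaManager rm, int thisBrokerId,
                                      Map<String, PartitionState> setP, boolean init) {
        for (Map.Entry<String, PartitionState> e : setP.entrySet()) {
            String p = e.getKey();
            PartitionState s = e.getValue();
            if (s.leader() == thisBrokerId && !rm.isLeader(p)) {
                rm.stopFetcher(p);                  // 2.1.1
                rm.becomeLeader(p, s.zkVersion());  // 2.1.2: remember the ZK version
            } else if (s.leader() != thisBrokerId && !rm.isFollowing(p, s.leader())) {
                rm.stopFetcher(p);                  // 2.2.1
                rm.becomeFollower(p, s.leader());   // 2.2.2
            }
        }
        if (init)                                   // 3: the INIT flag
            for (String p : Set.copyOf(rm.localPartitions()))
                if (!setP.containsKey(p)) rm.deleteLocal(p);
    }

    static void onStopReplicaCommand(ReplicaManager rm, Set<String> partitions) {
        for (String p : partitions)                 // delete each partition, if present
            if (rm.localPartitions().contains(p)) rm.deleteLocal(p);
    }
}
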
C. Creating/deleting topics.

The controller watches child changes of /brokers/topics. When the watcher gets triggered, it calls on_topic_change().

Code Block

on_topic_change():
The controller keeps in memory a list of existing topics.
1. If a new topic is created, read the TopicPath in ZK to get the topic's replica assignment.
1.1 call init_leaders() on all newly created partitions.
2. If a topic is deleted, send the StopReplicaCommand to all affected brokers.

init_leaders(set_p):
Input: set_p, a set of partitions
0. Read the current live broker list from the BrokerPath in ZK
1. For each partition P in set_p
1.1 Select one of the live brokers in AR as the new leader and set all live brokers in AR as the new ISR.
1.2 Write the new leader and ISR in /brokers/topics/[topic]/[partition_id]/leaderAndISR
2. Send the LeaderAndISRCommand to the affected brokers. Again, for efficiency, the controller can send multiple commands in 1 RPC request.
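
For comparison with change_leaders(), a minimal sketch of the init_leaders() choice (hypothetical names; unlike change_leaders() there is no existing ISR to respect, so every live assigned replica starts in the ISR):

Code Block

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class InitLeaders {
    // The decision for one new partition: its initial leader and ISR.
    record Decision(int leader, List<Integer> isr) {}

    // Step 1.1: a live broker in AR becomes the leader (the first replica is
    // the preferred one) and all live AR brokers form the initial ISR.
    static Decision initLeader(List<Integer> ar, Set<Integer> liveBrokers) {
        List<Integer> liveAr = new ArrayList<>();
        for (int b : ar) if (liveBrokers.contains(b)) liveAr.add(b);
        if (liveAr.isEmpty()) return new Decision(-1, liveAr);  // no live replica yet
        return new Decision(liveAr.get(0), liveAr);
    }
}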

Question C1. How to deal with repeated topic deletion/creation? A broker can be down for a long time during which a topic can be deleted and recreated. When the broker comes up, the topic it has locally may not match the content of the newly created topic. There are a couple of ways of dealing with this.

  1. Simply let the broker with the outdated topic become a follower and figure out the right offset from which it can sync up with the leader.
  2. Keep a version ID for each topic/partition. Delete a partition on broker startup if the partition version is outdated.
  3. Queue up the close replica state change in a per-broker queue in ZK, so the broker can simply read that queue on startup and delete partitions accordingly.

D. Handling controller failure.

Each broker sets an exists watch on the ControllerPath. When the watcher gets triggered, it calls on_controller_failover(). Basically, the new controller needs to inform all brokers of the current states stored in ZK (since some state change commands could be lost during the controller failover). A broker can ignore commands that it has already followed.

Code Block
on_controller_failover():
1. create /controller --> {this broker id, new epoch}
2. if not successful, return
3. read the LeaderAndISR path from ZK for each partition
4. send a LeaderAndISRCommand (with a special flag INIT) for each partition to the relevant brokers. Those commands can be sent in 1 RPC request.
5. call on_broker_change()
6. for the list of partitions without a leader, call init_leaders().
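
Steps 1 and 2 are the standard ZooKeeper leader-election pattern: try to create the ephemeral ControllerPath and back off if another broker won the race. A minimal sketch with the plain ZooKeeper client (the payload encoding is an assumption):

Code Block

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ControllerElection {
    // Try to create the ephemeral /controller node; exactly one broker wins.
    // The node vanishes when the winner's ZK session dies, which fires the
    // exists watches of the other brokers.
    static boolean tryBecomeController(ZooKeeper zk, int brokerId, long newEpoch)
            throws KeeperException, InterruptedException {
        byte[] payload = (brokerId + ":" + newEpoch).getBytes();  // assumed encoding
        try {
            zk.create("/controller", payload,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;   // proceed with steps 3-6
        } catch (KeeperException.NodeExistsException e) {
            return false;  // step 2: another broker is already the controller
        }
    }
}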

...

When a broker starts up, it calls on_broker_startup(). Basically, the broker needs to first read the current state of each partition from ZK.

Code Block
on_broker_startup():
1. read the replica assignment of all topics from the TopicPath in ZK
2. read the leader and the ISR of each partition assigned to this broker from the LeaderAndISR path in ZK
3. for each replica assigned to this broker
3.1 start the replica
3.2 if this broker is the leader of this partition, become the leader. (shouldn't happen in general)
3.3 if this broker is a follower of this partition, become a follower.
4. delete local partitions no longer assigned to this broker (partitions deleted while the broker was down).
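
Step 4 is a reconcile-by-diff: anything in local storage that is no longer part of this broker's assignment read from ZK gets deleted. A minimal sketch (the LogManager interface is a hypothetical stand-in for local storage):

Code Block

import java.util.Set;

public class StartupReconcile {
    // Assumed local storage facade; not a real Kafka class.
    interface LogManager {
        Set<String> localPartitions();
        void deletePartition(String partition);
    }

    // Step 4 of on_broker_startup(): drop partitions that were deleted or
    // reassigned away while this broker was down.
    static void deleteUnassigned(LogManager log, Set<String> assignedFromZk) {
        for (String p : Set.copyOf(log.localPartitions()))
            if (!assignedFromZk.contains(p)) log.deletePartition(p);
    }
}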

...

Occasionally, it's possible for multiple brokers to simultaneously assume that they are the leader of a partition. For example, broker A is the initial leader of a partition and the ISR of that partition is {A,B,C}. Then, broker A goes into GC and loses its ZK registration. The controller assumes that broker A is dead, assigns the leader of the partition to broker B and sets the new ISR in ZK to {B,C}. Broker B becomes the leader and, at the same time, broker A wakes up from GC but hasn't acted on the leadership change command sent by the controller. Now, both broker A and broker B think they are the leader. It would be bad if we allowed both of them to commit new messages, since the data among the replicas would be out of sync.

Our current design actually prevents this from happening. Here is why. The claim is that after broker B becomes the new leader, broker A can no longer commit new messages. For broker A to commit a message m, it needs every replica in the ISR to receive m. At the moment, broker A still thinks the ISR is {A,B,C} (its local copy, although the ISR in ZK has changed). Broker B will never receive message m, because by becoming the new leader it must have first stopped fetching data from the previous leader. Therefore broker A can't commit message m without shrinking the ISR first. In order to shrink the ISR, broker A has to write the new ISR in ZK. However, it can't do that, because it will find that the leaderAndISR node in ZK is not at the version that it assumes it to be (since the node has already been changed by the controller). At this moment, broker A will realize that it's no longer the leader.
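
The fencing in this argument is just the conditional write again: broker A attempts the ISR shrink at the ZK version it cached when it last became the leader, so the write must fail once the controller has rewritten the node. A sketch (payload handling elided):

Code Block

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class IsrShrink {
    // Broker A tries to shrink the ISR at the ZK version it cached when it
    // last became the leader. If the controller has since rewritten the node
    // (new leader, higher version), the write fails and A learns that it is
    // no longer the leader.
    static boolean tryShrinkIsr(ZooKeeper zk, String leaderAndIsrPath,
                                byte[] shrunkIsrValue, int cachedZkVersion)
            throws KeeperException, InterruptedException {
        try {
            zk.setData(leaderAndIsrPath, shrunkIsrValue, cachedZkVersion);
            return true;                 // still the leader; shrink committed
        } catch (KeeperException.BadVersionException e) {
            return false;                // fenced: the controller changed the node first
        }
    }
}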