In a Kafka cluster, one of the brokers serves as the controller, which is responsible for managing the states of partitions and replicas and for performing administrative tasks like reassigning partitions. The following describes the states of partitions and replicas, and the kinds of operations that go through the controller.

PartitionStateChange: 

Valid states are:

...

  • NewPartition: After creation, the partition is in the NewPartition state. In this state, the partition should have replicas assigned to it, but no leader/isr yet.

  • OnlinePartition: Once a leader is elected for a partition, it is in the OnlinePartition state.
  • OfflinePartition: If, after successful leader election, the leader for the partition dies, then the partition moves to the OfflinePartition state.

...

NewPartition,OnlinePartition -> OfflinePartition
  1. nothing other than marking partition state as Offline
OfflinePartition -> NonExistentPartition
  1. nothing other than marking the partition state as NonExistentPartition

ReplicaStateChange:

Valid states are:
  1. NewReplica: Replicas are in this state when they are created during topic creation or partition reassignment. In this state, a replica can only receive a become-follower state change request.
  2. OnlineReplica: Once a replica is started and part of the assigned replicas for its partition, it is in this state. In this state, it can receive either become-leader or become-follower state change requests.
  3. OfflineReplica: If a replica dies, it moves to this state. This happens when the broker hosting the replica is down.
  4. NonExistentReplica: If a replica is deleted, it is moved to this state.

Valid state transitions are:

NonExistentReplica -> NewReplica
  1. send LeaderAndIsr request with current leader and isr to the new replica and UpdateMetadata request for the partition to every live broker
NewReplica -> OnlineReplica
  1. add the new replica to the assigned replica list if needed
OnlineReplica,OfflineReplica -> OnlineReplica
  1. send LeaderAndIsr request with current leader and isr to the new replica and UpdateMetadata request for the partition to every live broker
NewReplica,OnlineReplica -> OfflineReplica
  1. send StopReplicaRequest to the replica (w/o deletion)
  2. remove this replica from the isr and send LeaderAndIsr request (with new isr) to the leader replica and UpdateMetadata request for the partition to every live broker.
OfflineReplica -> NonExistentReplica
  1. send StopReplicaRequest to the replica (with deletion)
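
The replica transitions listed above can be read as a lookup table from a target state to the states it may be entered from. The following is a minimal sketch of that table, not the controller's actual code; the names `ReplicaState`, `VALID_PREVIOUS_STATES` and `transition` are hypothetical, and the side effects (LeaderAndIsr, UpdateMetadata and StopReplica requests) are left out.

```python
from enum import Enum


class ReplicaState(Enum):
    NON_EXISTENT = "NonExistentReplica"
    NEW = "NewReplica"
    ONLINE = "OnlineReplica"
    OFFLINE = "OfflineReplica"


# Target state -> states it may be entered from, copied from the list above.
VALID_PREVIOUS_STATES = {
    ReplicaState.NEW: {ReplicaState.NON_EXISTENT},
    ReplicaState.ONLINE: {ReplicaState.NEW, ReplicaState.ONLINE, ReplicaState.OFFLINE},
    ReplicaState.OFFLINE: {ReplicaState.NEW, ReplicaState.ONLINE},
    ReplicaState.NON_EXISTENT: {ReplicaState.OFFLINE},
}


def transition(current, target):
    """Return the target state if the move is listed as valid, else raise ValueError."""
    if current not in VALID_PREVIOUS_STATES[target]:
        raise ValueError(
            f"illegal replica transition: {current.value} -> {target.value}"
        )
    return target
```

With this table, deleting a replica that is still online is rejected: it has to pass through OfflineReplica first, which matches the two StopReplicaRequest steps (without, then with deletion).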

KafkaController Operations:

onNewTopicCreation:
  1. call onNewPartitionCreation

 

onNewPartitionCreation:
  1. new partitions -> NewPartition
  2. all replicas of new partitions -> NewReplica
  3. new partitions -> OnlinePartition
  4. all replicas of new partitions -> OnlineReplica


onBrokerFailure:
  1. partitions w/o leader -> OfflinePartition
  2. partitions in OfflinePartition and NewPartition -> OnlinePartition (with OfflinePartitionLeaderSelector)
  3. each replica on the failed broker -> OfflineReplica
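
The three onBrokerFailure steps can be sketched as follows. This is an illustration under stated assumptions rather than the controller's implementation: the dictionary layout and the function name `on_broker_failure` are hypothetical, and the leader choice (first live replica in the isr, or first live assigned replica when the partition has no isr yet) is an assumed stand-in for OfflinePartitionLeaderSelector, whose rules are not described here.

```python
def on_broker_failure(failed, live_brokers, partitions):
    """Apply the three onBrokerFailure steps to an in-memory view of the cluster.

    partitions maps a partition name to a dict with the keys
    "replicas", "leader", "isr" and "state" (hypothetical layout).
    Returns the (partition, broker) pairs moved to OfflineReplica.
    """
    live = set(live_brokers) - {failed}

    # Step 1: partitions led by the failed broker are now without a leader.
    for p in partitions.values():
        if p["leader"] == failed:
            p["leader"] = None
            p["state"] = "OfflinePartition"

    # Step 2: try to bring OfflinePartition and NewPartition partitions online.
    for p in partitions.values():
        if p["state"] in ("OfflinePartition", "NewPartition"):
            # Assumption: prefer the isr; a partition with no isr yet
            # falls back to its assigned replicas.
            pool = p["isr"] or p["replicas"]
            candidates = [b for b in pool if b in live]
            if candidates:
                p["leader"] = candidates[0]
                p["state"] = "OnlinePartition"

    # Step 3: every replica hosted on the failed broker goes offline
    # and is dropped from the isr.
    offline = []
    for name, p in partitions.items():
        if failed in p["replicas"]:
            p["isr"] = [b for b in p["isr"] if b != failed]
            offline.append((name, failed))
    return offline
```

A partition whose only in-sync replica was on the failed broker stays in OfflinePartition in this sketch, because no live candidate remains.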


onBrokerStartup:
  1. send UpdateMetadata requests for all partitions to newly started brokers
  2. replicas on the newly started broker -> OnlineReplica
  3. partitions in OfflinePartition and NewPartition -> OnlinePartition (with OfflinePartitionLeaderSelector)
  4. for partitions with replicas on newly started brokers, call onPartitionReassignment to complete any outstanding partition reassignment


onPartitionReassignment: (OAR: old assigned replicas; RAR: new re-assigned replicas when reassignment completes)
  1. update assigned replica list with OAR + RAR replicas
  2. send LeaderAndIsr request to every replica in OAR + RAR (with AR as OAR + RAR)
  3. replicas in RAR - OAR -> NewReplica
  4. wait until replicas in RAR join isr
  5. replicas in RAR -> OnlineReplica
  6. set AR to RAR in memory
  7. send LeaderAndIsr request with a potential new leader (if current leader not in RAR) and a new assigned replica list (using RAR) and same isr to every broker in RAR
  8. replicas in OAR - RAR -> OfflineReplica (force those replicas out of isr)
  9. replicas in OAR - RAR -> NonExistentReplica (force those replicas to be deleted)
  10. update assigned replica list to RAR in ZK
  11. update the /admin/reassign_partitions path in ZK to remove this partition
  12. after electing leader, the replicas and isr information changes, so resend the UpdateMetadata request to every broker
For example, if OAR = {1,2,3} and RAR = {4,5,6}, the values in the assigned replica (AR) and leader/isr paths in ZK may go through the following transitions.
AR               leader/isr
{1,2,3}          1/{1,2,3}          (initial state)
{1,2,3,4,5,6}    1/{1,2,3}          (step 2)
{1,2,3,4,5,6}    1/{1,2,3,4,5,6}    (step 4)
{1,2,3,4,5,6}    4/{1,2,3,4,5,6}    (step 7)

...


Note that we have to update AR in ZK with RAR last since it's the only place where we store the OAR persistently. This way, if the controller crashes before that step, we can still recover.
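
The set arithmetic behind the reassignment steps (OAR + RAR, RAR - OAR, OAR - RAR) can be traced with a short sketch. It is illustrative only: the function name `reassignment_trace` is hypothetical, it assumes every old replica starts in sync, and it picks the first replica in RAR as the new leader, which is an assumption and may differ from the real choice. The snapshots after step 7 are derived from steps 8 and 10 of the list, not taken from the example table.

```python
def reassignment_trace(oar, rar, leader):
    """Return (AR, leader, isr, label) snapshots for the ZK-visible steps."""
    oar, rar = list(oar), list(rar)
    isr = list(oar)  # assumption: all old replicas are in sync at the start
    trace = [(list(oar), leader, list(isr), "initial state")]

    # Steps 1-2: AR becomes OAR + RAR (order kept, duplicates dropped).
    ar = oar + [r for r in rar if r not in oar]
    trace.append((list(ar), leader, list(isr), "step 2"))

    # Steps 3-4: replicas in RAR - OAR start as NewReplica and join the isr.
    isr = isr + [r for r in rar if r not in isr]
    trace.append((list(ar), leader, list(isr), "step 4"))

    # Step 7: leadership moves only if the current leader is not in RAR.
    if leader not in rar:
        leader = rar[0]  # assumption: first replica in RAR
    trace.append((list(ar), leader, list(isr), "step 7"))

    # Step 8: replicas in OAR - RAR are forced out of the isr.
    isr = [r for r in isr if r in rar]
    trace.append((list(ar), leader, list(isr), "step 8"))

    # Step 10: only now is AR in ZK overwritten with RAR, so OAR stays
    # recoverable from ZK until this point.
    ar = list(rar)
    trace.append((list(ar), leader, list(isr), "step 10"))
    return trace


for ar, leader, isr, label in reassignment_trace([1, 2, 3], [4, 5, 6], 1):
    print(ar, f"{leader}/{isr}", f"({label})")
```

Until the final snapshot, AR still contains every replica in OAR, which is why a controller that crashes before step 10 can recover the old assignment from ZK.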

...

  1. replicaStateMachine.startup():
    1. initialize each replica to either OfflineReplica or OnlineReplica
    2. each replica -> OnlineReplica (force LeaderAndIsr request to be sent to every replica)
  2. partitionStateMachine.startup():
    1. initialize each partition to either NewPartition, OfflinePartition or OnlinePartition
    2. each OfflinePartition and NewPartition -> OnlinePartition (force leader election)
  3. resume partition reassignment, if any
  4. resume preferred leader election, if any


...