...

  • Leadership changes are now made by a controller.
  • The controller detects broker failures and elects a new leader for each affected partition.
  • Each leadership change is communicated by the controller to each affected broker.
  • The communication between the controller and the broker is done through direct RPC, instead of via Zookeeper.
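The failure-handling flow in the bullets above can be sketched as follows. This is an illustrative Python sketch, not Kafka's code; the election rule (pick the first surviving replica in the partition's ISR) and all names here are assumptions for illustration:

```python
# Hypothetical sketch of the controller's reaction to a broker failure:
# for every partition led by the failed broker, elect a new leader.
# Assumed rule: the new leader is the first surviving member of the ISR.

def elect_leaders(assignment, isr, leaders, failed_broker):
    """Return {partition: new_leader} for partitions led by failed_broker."""
    changes = {}
    for partition in assignment:
        if leaders.get(partition) != failed_broker:
            continue  # leadership of this partition is unaffected
        survivors = [b for b in isr[partition] if b != failed_broker]
        if survivors:
            changes[partition] = survivors[0]
    return changes

assignment = {"t0-p0": [1, 2, 3], "t0-p1": [2, 3, 1]}
isr = {"t0-p0": [1, 2], "t0-p1": [2, 3]}
leaders = {"t0-p0": 1, "t0-p1": 2}
print(elect_leaders(assignment, isr, leaders, failed_broker=1))
# {'t0-p0': 2} -- only the partition led by broker 1 changes leader
```

The controller would then communicate each entry of the result to the affected brokers over RPC.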
Overview:

One of the brokers is elected as the controller for the whole cluster. It will be responsible for:

...
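Since the /controller path (listed under Paths) is an ephemeral node created by the controller, election can work as create-if-absent: the first broker to create the node wins. A toy sketch of that rule — `FakeZk` is a stand-in, not a real ZooKeeper client, and all names are hypothetical:

```python
# Toy model of ZooKeeper's create-if-absent semantics for an ephemeral node.
class FakeZk:
    def __init__(self):
        self.nodes = {}

    def create_ephemeral(self, path, data):
        if path in self.nodes:
            return False  # node already exists: someone else is controller
        self.nodes[path] = data
        return True

def try_become_controller(zk, broker_id, epoch):
    """A broker wins the election iff it creates /controller first."""
    return zk.create_ephemeral("/controller",
                               {"brokerid": broker_id, "controller_epoch": epoch})

zk = FakeZk()
print(try_become_controller(zk, broker_id=1, epoch=0))  # True: broker 1 wins
print(try_become_controller(zk, broker_id=2, epoch=0))  # False: node exists
```

Because the node is ephemeral, it disappears when the controller's ZK session dies, letting the remaining brokers race to elect a replacement (the controller-failover case noted below).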

After the controller makes a decision, it publishes the decision permanently in ZK and also sends the new decisions to the affected brokers through direct RPC. The published decisions are the source of truth: they are used by clients for request routing and by each broker during startup to recover its state (i.e., to bring all replicas assigned to the broker to the right state). After a broker is started, it picks up new decisions made by the controller through RPC.
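The startup recovery step can be sketched as follows — an illustrative Python sketch under the assumptions that the broker reads the replica assignment and the published leader/ISR decisions from ZK, and that a replica's right state is "leader" or "follower"; the function and field names are hypothetical:

```python
# Hypothetical sketch of broker startup: use the decisions published in ZK
# (the source of truth) to bring every replica assigned to this broker to
# the right state.

def recover_replica_states(broker_id, assignment, leader_and_isr):
    """Return {partition: role} for each partition assigned to broker_id."""
    states = {}
    for partition, replicas in assignment.items():
        if broker_id not in replicas:
            continue  # no replica of this partition lives on this broker
        info = leader_and_isr[partition]
        states[partition] = "leader" if info["leader"] == broker_id else "follower"
    return states

assignment = {"t0-p0": [1, 2], "t0-p1": [2, 1], "t1-p0": [3, 4]}
leader_and_isr = {"t0-p0": {"leader": 1, "ISR": [1, 2]},
                  "t0-p1": {"leader": 2, "ISR": [2]},
                  "t1-p0": {"leader": 3, "ISR": [3, 4]}}
print(recover_replica_states(1, assignment, leader_and_isr))
# {'t0-p0': 'leader', 't0-p1': 'follower'}
```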

Potential benefits:

  1. Easier debugging since leadership changes are made in a central place.
  2. ZK reads/writes needed for leadership changes can be batched (also easier to exploit ZK multi) and thus reduce end-to-end latency during failover.
  3. Fewer ZK watchers.
  4. More efficient communication of state changes by using direct RPC, instead of via a queue implementation in Zookeeper.

Potential downside:

  1. Need controller failover.

Paths:

  1. Stores the latest epoch, a monotonically increasing number used to identify leader generations.
    Code Block
    /epoc --> {long} (monotonically increasing; used to identify leader generations) 
  2. Stores the current controller info.
    Code Block
    /controller --> {brokerid, controller epoch} (ephemeral; created by controller) 
  3. Stores the information of all live brokers.
    Code Block
    /brokers/ids/[broker_id] --> host:port (ephemeral; created by the broker) 
  4. Stores the replication assignment for all partitions in a topic. For each replica, we store the id of the broker to which the replica is assigned. The first replica is the preferred replica. Note that for a given partition, there is at most 1 replica on a broker. Therefore, the broker id can be used as the replica id.
    Code Block
    /brokers/topics/[topic] --> {part1: [broker1, broker2], part2: [broker2, broker3] ...}  (created by admin) 
  5. Stores the leader and ISR of a partition.
    Code Block
    /brokers/topics/[topic]/[partition_id]/leaderAndISR --> {leader: broker_id, ISR: {broker1, broker2}} 

    This path is updated by the controller or the current leader; the current leader only updates the ISR part.
  6. This path is used when we want to reassign some partitions to a different set of brokers. For each partition to be reassigned, it stores a list of new replicas and their corresponding assigned brokers. This path is created by an administrative process and is automatically removed once the partition has been moved successfully.
    Code Block
     /brokers/partitions_reassigned/[topic]/[partition_id] --> {broker_id …} (created by admin) 

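Two conventions encoded in the paths above — the first replica in a partition's list is the preferred replica, and a broker holds at most one replica of a given partition (so the broker id can double as the replica id) — can be expressed as small helpers. These are hypothetical illustrations, not Kafka code:

```python
# Hypothetical helpers over the replica assignment stored under
# /brokers/topics/[topic], e.g. {part1: [broker1, broker2], ...}.

def validate_assignment(assignment):
    """Enforce the invariant: at most one replica of a partition per broker."""
    for partition, replicas in assignment.items():
        if len(set(replicas)) != len(replicas):
            raise ValueError(f"partition {partition} assigns a broker twice: {replicas}")

def preferred_replica(assignment, partition):
    """The preferred replica is the first entry in the partition's replica list."""
    return assignment[partition][0]

assignment = {"part1": [1, 2], "part2": [2, 3]}
validate_assignment(assignment)          # passes: no duplicates
print(preferred_replica(assignment, "part1"))  # 1
```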

...