Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. Broker should always process the control requests sent by the active controller.
  2. Controller should always process the ControlledShutdownRequest sent by the broker.
  3. The first LeaderAndIsrRequest received by a broker after it starts up always contains all partitions hosted by the broker.
  4. By the time a broker finishes bouncing, controller should have processed the BrokerChange event.

...

  • If a broker bounces during controlled shutdown, the bounced broker may accidentally process its earlier generation’s StopReplicaRequest sent from the active controller for one of its follower replicas, leaving the replica offline while its remaining replicas may stay online
  • If controller receives old ControlledShutdownRequest (due to retries) after the broker has been bounced, controller will proactively send out control requests to the broker, which may leave the broker in a bad state.
  • If the first LeaderAndIsrRequest that a broker processes is sent by the active controller before its startup, the broker will overwrite the high watermark checkpoint file and may cause incorrect truncation (KAFKA-7235)
  • If a broker bounces very quickly, the controller may start processing the BrokerChange event after the broker already re-registers itself in zk. In this case, controller will miss the broker restart and will not send any requests to the broker for initialization. The broker will not be able to accept traffics.

...

This KIP will include the broker generation (broker_epoch) in LeaderAndIsrRequest, UpdateMetadataRequest and , StopReplicaRequest, ControlledShutdownRequest and bump up their protocol versions. Since we will evolve the schema of these control requests in this KIP, I would also normalize the schema of these requests by avoiding data redundancy for the topic strings to reduce the memory footprint in the controller side and reduce the amount of data we send across the network. Here are the new version of these requests:

...

Code Block
title StopReplicaRequest V1
StopReplica Request => controller_id controller_epoch broker_epoch delete_partitions [topic_partitions] 
  controller_id => INT32
  controller_epoch => INT32
  broker_epoch => INT64  					<-- NEW
  delete_partitions => BOOLEAN
  topic_partitions => topic [partition]  	<-- NEW
    partition => INT32


Code Block
title ControlledShutdownRequest V2
ControlledShutdown Request => broker_id broker_epoch
  broker_id => INT32
  broker_epoch => INT64


Note: Normalizing the schema is a good-to-have optimization because the memory footprint for the control requests hinders the controller from scaling up if we have many topics with large partition counts. We already did the same thing in other types of request (e.g. Produce, Fetch, ...).

...

A broker will extract the broker generation (czxid) in LeaderAndIsrRequest/UpdateMetadataRequest/StopReplicaRequest and will reject the requests with smaller broker generation than its current generation.

Controller Rejects ControlledShutdownRequest Sent from Former Generations

During controlled shutdown, the broker will include its current broker generation (czxid) in the ControlledShutdownRequest. Upon receiving ControlledShutdownRequest, controller will check the broker generation (czxid) in ControlledShutdownRequest and will reject the request if its broker generation is smaller broker generation than the broker generation cached in the controller side. This guarantees controller will not attempt to send out control requests to move partitions and stop replicas in reaction to a broker cleaned shutdown if the broker has already been restarted. 

Controller Detects Bounced Broker

In order to avoid missing broker state change when fast broker bounce happens, the logic in controller processing BrokerChange event should be:

...