...

  • CLUSTER_AUTHORIZATION_FAILED
  • STALE_BROKER_EPOCH
  • NOT_CONTROLLER
  • UNKNOWN_REPLICA_STATE (new)

Partition-level errors:

...

  • If the broker cannot send the AlterReplicaState request to the controller and the offline replica is a follower, it will eventually be removed from the ISR by the leader.
  • If the broker cannot send the AlterReplicaState request to the controller and the offline replica is a leader, it will remain the leader until the broker is restarted or the controller learns about the replica state through a LeaderAndIsr request (this is the current behavior).
  • If the controller reads the AlterReplicaState and encounters a fatal error before handling the subsequent ControllerEvent, a new controller will eventually be elected and the state of all replicas will become known to the new controller.

For the broker-side failures (the first two cases above), we should implement retries of the AlterReplicaState request.
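A minimal sketch of what broker-side retries could look like, assuming a capped exponential backoff. The class name, the `SendResult` enum, and the `sender` callback are hypothetical stand-ins for whatever the broker actually uses to send AlterReplicaState; this KIP does not specify the retry policy or which errors are retriable.

```java
import java.util.function.Supplier;

public class AlterReplicaStateRetry {

    /** Hypothetical outcome of a single send attempt (not part of the KIP). */
    public enum SendResult { SUCCESS, RETRIABLE_FAILURE, FATAL_FAILURE }

    /**
     * Calls the sender until it succeeds, fails fatally, or attempts run out.
     * Returns the number of attempts used on success, or -1 otherwise.
     * The backoff doubles after each retriable failure, capped at maxBackoffMs.
     */
    public static int sendWithRetries(Supplier<SendResult> sender,
                                      int maxAttempts,
                                      long initialBackoffMs,
                                      long maxBackoffMs) {
        long backoffMs = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            SendResult result = sender.get();
            if (result == SendResult.SUCCESS) {
                return attempt;
            }
            if (result == SendResult.FATAL_FAILURE) {
                return -1; // assumption: fatal errors are never retried
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return -1;
                }
                backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
            }
        }
        return -1; // all attempts were retriable failures
    }
}
```

If every attempt fails, the outcomes described in the list above still apply: an offline follower is eventually removed from the ISR by the leader, and an offline leader remains leader until the broker restarts or the controller learns of the state some other way.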

Compatibility, Deprecation, and Migration Plan

...

Another rejected approach was to add an RPC that mirrors the JSON payload used by the current ZK workflow. It was rejected in favor of a more generic RPC that could serve other purposes in the future, and to avoid "leaking" the notion of a log dir into the public API and the Controller.

Future Work

This RPC is quite similar to the existing ControlledShutdown RPC. Since we intend to interpret an empty list of topics to mean all topics, the proposed AlterReplicaState RPC could subsume the ControlledShutdown RPC. After this new RPC is implemented, we can consider consolidating ControlledShutdown into AlterReplicaState in a future KIP.
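The "empty list means all topics" interpretation can be illustrated with a small sketch. The class and method names are hypothetical, and ignoring requested topics that the broker does not host is an assumption made for illustration, not something this KIP specifies.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class TopicSelection {

    /**
     * Resolves which topics a request applies to.
     * An empty requested list is interpreted as "all topics hosted by the broker".
     */
    public static Set<String> resolve(List<String> requestedTopics, Set<String> hostedTopics) {
        if (requestedTopics.isEmpty()) {
            return new TreeSet<>(hostedTopics);
        }
        Set<String> selected = new TreeSet<>(requestedTopics);
        selected.retainAll(hostedTopics); // assumption: topics not hosted here are ignored
        return selected;
    }
}
```

Under this reading, a request with no topics covers every replica on the broker, which is what makes it comparable to ControlledShutdown.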