Status
Current state: Draft
Discussion thread: TBD
JIRA: TBD
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Following the implementation of KIP-500, we will no longer use ZooKeeper as an event bus between brokers. Currently, log dir failure notifications are sent from the broker to the controller using a ZooKeeper watch. When a broker has a log dir failure, it will write a znode under the path /log_dir_event_notification
. The controller watches this path for changes to its children. Once the watch is fired, the controller reads the data from all the children to get a list of broker IDs which had log dir errors. A LeaderAndIsr request is sent to all the brokers which were found in the notification znodes. Any broker which now has an offline replica (due to log dir failure) would respond with a storage error. This then causes the controller to mark the replica as offline and to trigger a leader election. This procedure is describe in detail in the original design KIP-112.
With this KIP, we propose to add a new RPC that allows a broker to directly communicate state changes of a replica to the controller. This will replace the ZooKeeper based notification for log dir failures and potentially could replace the existing controlled shutdown RPC. Since this RPC is somewhat generic, it could potentially be used to mark a replicas a "online" following some kind of log dir recovery procedure (out of scope for this proposal).
Public Interfaces
We will add a new RPC named ReplicaStateEvent which requires CLUSTER_ACTION permissions
ReplicaStateEventRequest => BrokerId BrokerEpoch EventType EventReason [Topic [PartitionId LeaderEpoch]] BrokerId => Int32 BrokerEpoch => Int32 EventType => Int32 EventReason => String Topic => String PartitionId => Int32 LeaderEpoch => Int32 ReplicaStateEventResponse => ErrorCode [Topic [PartitionId]] ErrorCode => Int32 Topic => String PartitionId => Int32
Possible top-level errors:
- CLUSTER_AUTHORIZATION_FAILED
- STALE_BROKER_EPOCH
- NOT_CONTROLLER
- UNKNOWN_REPLICA_EVENT_TYPE (new)
Partition-level errors:
- INVALID_REQUEST: The update was rejected due to some internal inconsistency (e.g. invalid replicas specified in the ISR)
- UNKNOWN_TOPIC_OR_PARTITION: The topic no longer exists.
Proposed Changes
Upon encountering errors relating to the log dirs, the broker will now send an RPC to the controller indicating that one or more replicas need to be marked offline. Since these types of errors are likely to occur for group of replicas at once, we will continue to use a background thread in ReplicaManager to allow these errors to accumulate before sending a message to the controller. The controller will synchronously perform basic validation of the request (permissions, do topics exists, etc) and asynchronously perform the necessary actions to process the replica state changes.
Previously, the broker would only send its ID to the controller using the ZK watch mechanism. This meant the controller had to send a LeaderAndIsr request to the brokers in order to learn which replicas were offline. Using a more direct approach as proposed here, the controller can update its view of the replica's state immediately and react accordingly.
RPC semantics
- The EventType field in the request is will only support the value 1 which will represent "offline"
- The EventReason field is an optional textual description of why the event is being sent
- If no Topic is given, it is implied that all topics on this broker are being indicated
- If a Topic and no partitions are given, it is implied that all partitions of this topic are being indicated
- LeaderEpoch is optional and is not validated for the "offline" event type
Failure Modes
Like the ZK watch before, the ReplicaStateEvent RPC from the broker to the controller is best-effort. We are not guaranteed to be able to send a message to the controller to indicate a replica state change. Also, since the processing of this RPC by the controller is asynchronous, we are not guaranteed that the subsequent handling of the state change will happen. The following failure scenarios are possible:
- If the broker cannot send the ReplicaStateEvent request to the controller and the offline replica is a follower, it will eventually be removed from the ISR by the leader.
- If the broker cannot send the ReplicaStateEvent request to the controller and the offline replica is a leader, it will remain the leader until the broker is restarted or the controller learns about the replica state through a LeaderAndIsr request (this is the current behavior).
- If the controller reads the ReplicaStateEvent and encounters a fatal error before handling the subsequent ControllerEvent, a new controller will eventually be elected and the state of all replicas will become known to the new controller.
Compatibility, Deprecation, and Migration Plan
Since this change is between the brokers, we will use the inter-broker protocol version as a flag for using this new RPC or not. Once the IBP has been increased, the brokers will not write out ZK messages for log dir failures and the controller will not set a watch on the parent znode. They will instead use the new RPC detailed here.
Rejected Alternatives
AlterIsr RPC
The AlterIsr RPC proposed in KIP-497 is similar in structure to the RPC proposed here. In the case of AlterIsr, it is used by the leader to indicate ISR changes to the controller. Since only the leader has up-to-date ISR information, a follower cannot reliably use this RPC to communicate an ISR change to the controller. It would require significant changes to the AlterIsr KIP to support this use case.
LogDir specific RPC
Another rejected approach was to add an RPC which mirrored the JSON payload used by the ZK workflow currently implemented. This was rejected in favor of a more generic RPC that could be used for other purposes in the future. It was also rejected to prevent "leaking" the notion of a log dir to the public API.