...
Failover behaviors:
Broker failover.
If the replica fails before it receives the GetReplicaLogInfo request, it can just send the log info along with its current broker epoch.
If the replica fails after it responds to the GetReplicaLogInfo request:
If the controller has received the new broker registration, it can reject the response because the broker epoch in the request does not match the broker registration.
Otherwise, the replica may become the leader but will be fenced later when it registers.
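The epoch check in the broker failover case above can be sketched as follows. This is a minimal illustration, not controller code: the class ReplicaLogInfoEpochCheck and its methods registerBroker and acceptResponse are hypothetical names, and the sketch assumes the controller tracks one current broker epoch per broker id.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: these names are illustrative and are not Kafka controller APIs.
final class ReplicaLogInfoEpochCheck {
    // Assumption: the controller keeps the latest registered epoch for each broker id.
    private final Map<Integer, Long> registeredEpochs = new HashMap<>();

    // Called when a broker (re)registers; a restart yields a new broker epoch.
    void registerBroker(int brokerId, long brokerEpoch) {
        registeredEpochs.put(brokerId, brokerEpoch);
    }

    // Accept the log info only if its broker epoch matches the current registration.
    boolean acceptResponse(int brokerId, long reportedBrokerEpoch) {
        Long current = registeredEpochs.get(brokerId);
        return current != null && current.longValue() == reportedBrokerEpoch;
    }

    public static void main(String[] args) {
        ReplicaLogInfoEpochCheck check = new ReplicaLogInfoEpochCheck();
        check.registerBroker(1, 10L);
        System.out.println(check.acceptResponse(1, 10L)); // epoch matches the registration
        check.registerBroker(1, 11L);                     // broker restarted and re-registered
        System.out.println(check.acceptResponse(1, 10L)); // stale epoch, rejected
    }
}
```

If the response is processed before the new registration arrives, the check still passes. That is the "otherwise" case above, where the replica may become the leader and is fenced later.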
Controller failover.
The controller does not store anything in the metadata log, so every controller failover will result in a new unclean recovery.
Other
The kafka-leader-election.sh tool will be upgraded to allow manual leader election.
It can directly select a leader.
It can trigger an unclean recovery for the replica with the longest log in either Proactive or Balanced mode.
- Configs to add
- unclean.recovery.strategy. Described in the above section. Balanced is the default value.
- unclean.recovery.Enabled. True for enabling the unclean recovery. False otherwise. False is the default value.
- unclean.recovery.timeout.ms. The time limit for waiting for the replicas' responses during the Unclean Recovery. 5 min is the default value.
- For a better user experience, unclean.recovery.strategy and unclean.leader.election.enable will be converted as follows when unclean.recovery.Enabled is changed.
unclean.recovery.Enabled from false to true:

| unclean.leader.election.enable | unclean.recovery.strategy |
| --- | --- |
| false | Balanced |
| true | Proactive |

unclean.recovery.Enabled from true to false:

| unclean.recovery.strategy | unclean.leader.election.enable |
| --- | --- |
| Proactive | true |
| Balanced | false |
| Manual | false |
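The two conversions can be written as a pair of small functions. This is only a sketch of the mapping described above: the class UncleanRecoveryConfigConversion, the Strategy enum, and both method names are hypothetical, and the KIP text does not prescribe where the conversion runs.

```java
// Hypothetical sketch of the conversion mapping; not Kafka code.
final class UncleanRecoveryConfigConversion {
    // Values of unclean.recovery.strategy named in this section.
    enum Strategy { PROACTIVE, BALANCED, MANUAL }

    // unclean.recovery.Enabled false -> true:
    // derive the strategy from unclean.leader.election.enable.
    static Strategy strategyWhenEnabling(boolean uncleanLeaderElectionEnable) {
        return uncleanLeaderElectionEnable ? Strategy.PROACTIVE : Strategy.BALANCED;
    }

    // unclean.recovery.Enabled true -> false:
    // only Proactive maps back to unclean.leader.election.enable=true.
    static boolean uncleanLeaderElectionWhenDisabling(Strategy strategy) {
        return strategy == Strategy.PROACTIVE;
    }

    public static void main(String[] args) {
        System.out.println(strategyWhenEnabling(true));
        System.out.println(uncleanLeaderElectionWhenDisabling(Strategy.MANUAL));
    }
}
```

The mapping is not a round trip: Manual converts to false when disabling, and false converts to Balanced when enabling again.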
Public Interfaces
We will deliver the KIP in phases, so each API change is marked as coming with either ELR or Unclean Recovery.
...