...
The Unclean Recovery uses a deterministic way to elect the leader persisted the most data. On a high level, once the unclean recovery is triggered, the controller will use a new API GetReplicaLogInfo to query the log end offset and the leader epoch from each replica. The one with the highest leader epoch plus the longest log end offset will be the new leader. To help explain when and how the Unclean Recovery is performed, let's first introduce some config changes.
The current new unclean.leader.election.enable will be extended with 3 more optionsrecovery.strategy has the following 3 options.
AggressiveProactive. It represents the intent of recovering the availability as fast as possible.
Balanced. Auto recovery on potential data loss case, wait as needed for a better result.
ManualNone. Stop the partition on potential data loss.
...
If there are other ISR members, choose an ISR member.
If there are unfenced ELR members, choose an ELR member.
If there are fenced ELR members
If the unclean.leaderrecovery.election.enablestrategy=ProactiveAggressive, then an unclean recovery will happen.
Otherwise, we will wait for the fenced ELR members to be unfenced.
If there are no ELR members.
If the unclean.leaderrecovery.election.enablestrategy=ProactiveAggressive, the controller will do the unclean recovery.
- If the unclean.leaderrecovery.election.enablestrategy=Balanced, the controller will do the unclean recovery when all the LastKnownELR are unfenced. See the following section for the explanations.
Otherwise, unclean.leaderrecovery.election.enablestrategy=ManualNone, the controller will not attempt to elect a leader. Waiting for the user operations.
...
- In Balance mode, all the LastKnownELR members have replied, plus the replicas replied within the timeout. Due to this requirement, the controller will only start the recovery if the LastKnownELR members are all unfenced.
- In Proactive Aggressive mode, any replicas replied within a fixed amount of time OR the first response received after the timeout.
...
The kafka-leader-election.sh tool will be upgraded to allow manual leader election.
It can directly select a leader.
It can trigger an unclean recovery for the replica with the longest log in either Proactive Aggressive or Balance mode.
- Configs to update
- unclean.leaderrecovery.election.enablestrategy. Described in the above section. Balanced Balanced is the default value.
- unclean.recovery.manager.enabled. True for using the unclean recovery manager to perform an unclean recovery. False otherwise. False is the default value.
- unclean.recovery.timeout.ms. The time limits of waiting for the replicas' response during the Unclean Recovery. 5 min is the default value.
- For compatibility, the compatibility issue. The original unclean.leader.election.enable options True/False will be used but meaning differently once the unclean recovery manager is in use. Here is the behavior when ISR and ELR are emptymapped to unclean.recovery.strategy options.
- unclean.leader.
...
- election.
...
- enable.
...
- false
...
- -> unclean.recovery.
...
- strategy.
...
- Balanced
- unclean.leader.election.enable
...
- .true -> unclean.recovery.strategy.Aggressive
Public Interfaces
We will deliver the KIP in phases, so the API changes are also marked coming with either ELR or Unclean Recovery.
...
|
...
Unclean Recovery is guarded by the feature flag unclean.recovery.manager.enabled.
Delivery plan
- For the existing unclean.leader.election.enable
If true, unclean.recovery.strategy will be set to Aggressive.
If false, unclean.recovery.strategy will be set to Balanced.
- unclean.leader.election.enable will be marked as deprecated.
Delivery plan
The KIP is a large plan, it can be The KIP is a large plan, it can be across multiple quarters. So we have to consider how to deliver the project in phases.
...
Actually in this model, broker 2 is not likely to have the complete log, so just forcing a fixed number of responses does not improve much durability.
Using a different set of configs
We also considered deprecating the unclean.leader.election.enable and using unclean.recovery.strategy(Manual/Balanced/Proactive). It would require the config conversion when we enable using the unclean recovery manager.
...
.