...

Randomly electing a leader is clearly worth improving, so we decided to replace the random unclean leader election with the Unclean Recovery.

...

The current unclean.leader.election.enable will be extended with 3 more options:

  • Proactive. It represents the intent of recovering availability as fast as possible.
  • Balanced. Automatically recover in potential data loss cases, waiting as needed for a better result.
  • Manual. Stop the partition on potential data loss and wait for user operations.
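For illustration, here is how a topic owner might select one of these intents once the KIP ships, using the existing AdminClient incrementalAlterConfigs API. This is a hedged sketch: the Proactive/Balanced/Manual values are proposed by this KIP and not available in released Kafka, and the topic name is hypothetical.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SetRecoveryIntent {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // "orders" is a hypothetical topic; "Proactive" is one of the three
            // proposed intent values (Proactive/Balanced/Manual).
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("unclean.leader.election.enable", "Proactive"),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}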

...

  1. If there are other ISR members, choose an ISR member.

  2. If there are unfenced ELR members, choose an ELR member.

  3. If there are fenced ELR members

    1. If unclean.leader.election.enable=Proactive, then an unclean recovery will happen.

    2. Otherwise, we will wait for the fenced ELR members to be unfenced.

  4. If there are no ELR members.

    1. If unclean.leader.election.enable=Proactive, the controller will do the unclean recovery.

    2. If unclean.leader.election.enable=Balanced, the controller will do the unclean recovery when all the LastKnownELR members are unfenced. See the following section for the explanation.
    3. Otherwise, unclean.leader.election.enable=Manual, and the controller will not attempt to elect a leader; it waits for user operations. A sketch of this selection flow follows this list.
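To make the above flow concrete, here is a minimal sketch of the selection logic. All type and helper names here (Partition, isUnfenced, startUncleanRecovery) are illustrative assumptions, not actual controller internals.

import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

final class ElectionSketch {
    enum Strategy { PROACTIVE, BALANCED, MANUAL }

    // Hypothetical snapshot of the partition state the controller consults.
    record Partition(List<Integer> isr, List<Integer> elr,
                     List<Integer> lastKnownElr, Strategy strategy) {}

    // Returns the new leader, or empty if the controller must wait (for an
    // ELR member to be unfenced, for an unclean recovery, or for the user).
    static Optional<Integer> electLeader(Partition p, Predicate<Integer> isUnfenced,
                                         Runnable startUncleanRecovery) {
        // 1. If there are other ISR members, choose an ISR member.
        if (!p.isr().isEmpty()) return Optional.of(p.isr().get(0));

        // 2. If there are unfenced ELR members, choose an ELR member.
        Optional<Integer> unfenced = p.elr().stream().filter(isUnfenced).findFirst();
        if (unfenced.isPresent()) return unfenced;

        // 3. Only fenced ELR members remain: Proactive recovers immediately;
        //    otherwise wait for a fenced ELR member to be unfenced.
        if (!p.elr().isEmpty()) {
            if (p.strategy() == Strategy.PROACTIVE) startUncleanRecovery.run();
            return Optional.empty();
        }

        // 4. No ELR members at all.
        switch (p.strategy()) {
            case PROACTIVE -> startUncleanRecovery.run();
            case BALANCED -> {
                // Recover only once all the LastKnownELR members are unfenced.
                if (p.lastKnownElr().stream().allMatch(isUnfenced)) startUncleanRecovery.run();
            }
            case MANUAL -> { /* wait for user operations */ }
        }
        return Optional.empty();
    }
}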

...


The controller will initiate the Unclean Recovery once a partition meets the above conditions. As mentioned above, the controller collects the log info from the replicas. Apart from the log end offset and the partition leader epoch in the log, the replica also returns the following for verification (a selection sketch follows the list):

...
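Once the replies arrive (or unclean.recovery.timeout.ms expires), the controller picks the best candidate among the responses. Below is a minimal sketch; the record shape is hypothetical, and the exact ordering (highest partition leader epoch first, then the longest log) is an assumption consistent with the "longest log" rule mentioned later.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class RecoverySelection {
    // Hypothetical shape of a replica's reply: its id plus the log info
    // described above (partition leader epoch and log end offset).
    record ReplicaLogInfo(int brokerId, int partitionLeaderEpoch, long logEndOffset) {}

    // Pick the reply with the highest leader epoch, breaking ties by the
    // longest log. This ordering is an assumption for illustration.
    static Optional<ReplicaLogInfo> chooseRecoveredLeader(List<ReplicaLogInfo> replies) {
        Comparator<ReplicaLogInfo> byBest =
                Comparator.comparingInt(ReplicaLogInfo::partitionLeaderEpoch)
                          .thenComparingLong(ReplicaLogInfo::logEndOffset);
        return replies.stream().max(byBest);
    }
}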

  1. The kafka-leader-election.sh tool will be upgraded to allow manual leader election.

    1. It can directly select a leader.

    2. It can trigger an unclean recovery for the replica with the longest log in either Proactive or Balanced mode. See the Admin API sketch after the table below.

  2. Configs to update
    1. unclean.leader.election.enable. Described in the above section. Balanced is the default value.
    2. unclean.recovery.manager.enabled. True for using the unclean recovery manager to perform an unclean recovery; False otherwise. False is the default value.
    3. unclean.recovery.timeout.ms. The time limit for waiting for the replicas' responses during the Unclean Recovery. 5 minutes is the default value.
  3. For compatibility, the original unclean.leader.election.enable options True/False will still be accepted, but they behave differently once the unclean recovery manager is in use. Here is the behavior when both ISR and ELR are empty.

unclean.leader.election.enable | unclean.recovery.manager.enabled=false | unclean.recovery.manager.enabled=true
Manual | Waiting for the last known leader to be unfenced | Waiting for the user operations
False/Balanced | Waiting for the last known leader to be unfenced | Start the unclean recovery process
True/Proactive | Randomly select an unfenced replica | Start the unclean recovery process
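For reference, the existing Admin API already exposes leader election, which the upgraded kafka-leader-election.sh builds on. A minimal sketch using the current API (the topic and partition are hypothetical; the new manual-selection and recovery-trigger options proposed by this KIP are not shown here, since their exact shape is defined by the KIP itself):

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

import java.util.Properties;
import java.util.Set;

public class TriggerElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Today's election types are PREFERRED and UNCLEAN; partition
            // "orders"-0 is a hypothetical target.
            admin.electLeaders(ElectionType.UNCLEAN,
                            Set.of(new TopicPartition("orders", 0)))
                 .partitions().get();
        }
    }
}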

Public Interfaces

We will deliver the KIP in phases, so the API changes are marked as coming with either ELR or Unclean Recovery.

...

  1. The High Watermark will only advance if all the messages below it have been replicated to at least min.insync.replicas ISR members. A sketch of this rule follows this list.

  2. The consumer is affected when consuming ack=1 and ack=0 messages. When there is only 1 replica (min ISR=2), the HWM advance is blocked, so the incoming ack=0/1 messages are not visible to the consumers. Users can avoid this side effect by updating min.insync.replicas to 1 for their ack=0/1 topics.

  3. Compared to the current model, the proposed design has availability trade-offs:

    1. If the network partitioning only affects the heartbeats between a follower and the controller, the controller will kick it out of the ISR. If losing this replica brings the ISR under min ISR, the HWM advancement will be blocked unnecessarily, because we require the ISR to have at least min ISR members. This is not a regression compared to the current system at this point. However, once the network partitioning ends, the current leader would put the follower into the pending ISR (aka "maximum ISR") and continue moving forward, whereas in the proposed world the leader needs to wait for the controller to ack the ISR change.

    2. Electing a leader from the ELR may mean choosing a degraded broker. Degraded means the broker can have poor replication performance due to common reasons like networking or disk IO, but it is alive. This can also be the reason why it fell out of the ISR in the first place. It is a trade-off between availability and durability.

  4. The unclean leader election will be replaced by the unclean recovery.

  5. For fsync users, the ELR can be beneficial by providing more choices when the last known leader is fenced. It is worth mentioning what to expect when ISR and ELR are both empty. We assume fsync users set unclean.leader.election.enable to false.
    1. If the KIP has been fully implemented, the behavior will be Balanced. During the unclean recovery, the controller will elect a leader once all the LastKnownELR members have replied.
    2. If only the ELR is implemented, the LastKnownLeader is preferred when ELR and ISR are both empty.
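As mentioned in item 1 above, the High Watermark only advances once enough ISR members have caught up. Here is a minimal sketch of that rule; the method shape and names are illustrative only.

import java.util.Collection;

final class HwmRule {
    // The HWM may advance to candidateHwm only if at least min.insync.replicas
    // ISR members have already replicated everything below it.
    static boolean canAdvanceHighWatermark(long candidateHwm,
                                           Collection<Long> isrLogEndOffsets,
                                           int minInsyncReplicas) {
        long caughtUp = isrLogEndOffsets.stream()
                .filter(leo -> leo >= candidateHwm)
                .count();
        return caughtUp >= minInsyncReplicas;
    }
}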

...

Unclean Recovery is guarded by the feature flag unclean.recovery.manager.enabled.

Delivery plan

The KIP is a large plan that can span multiple quarters, so we have to consider how to deliver the project in phases.

...

Actually, in this model broker 2 is not likely to have the complete log, so just forcing a fixed number of responses does not improve durability much.

Using a different set of configs

We also considered deprecating unclean.leader.election.enable and using unclean.recovery.strategy (Manual/Balanced/Proactive). It would require config conversion when enabling the unclean recovery manager.

  • It may not offer enough gains compared with extending unclean.leader.election.enable, while adding the extra overhead of config conversion.
  • Implementing a multi-level config can be complicated, and users can lose track of the new config.
  • It can be even more complicated to perform multi-level config conversion between unclean.leader.election.enable and unclean.recovery.strategy.