Status

Current state: Under Discussion

...

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

When a KRaft broker begins controlled shutdown, it immediately disables the metadata listener. This means that metadata changes made as part of the controlled shutdown are not sent to the respective components. For partitions that the broker is a follower of, that is what we want: it prevents the follower from rejoining the ISR while it is still shutting down. But for partitions that the broker is leading, it means the leader remains active until controlled shutdown finishes and the socket server is stopped. That delay can be as much as 5 seconds, and possibly longer. Note that in the ZK world, we have an explicit request, `StopReplica`, which serves the purpose of shutting down both followers and leaders, but we don't have something similar in KRaft.

When a follower catches up with the leader, the leader tries to add it back to the ISR. The AlterPartition API is used by the leader to persist the new ISR in the controller. Presently, the controller validates that the new ISR contains valid replicas of the partition, but without taking their state into account: a leader could, for instance, add a fenced or shutting down replica to the ISR. This means that we always trust the leader to do the right thing. We believe that we should be more defensive and ensure that fenced and shutting down replicas are not allowed to join the ISR in KRaft.

Proposed Changes

This KIP proposes changing the ISR expansion logic on the leader and the ISR validation logic on the controller to avoid bringing fenced or shutting down replicas back into the ISR.

The leader will consider only unfenced replicas to be eligible to join the ISR. It will rely on the metadata cache, which is populated from the metadata log, to get this information. As the metadata cache is eventually consistent, the leader might try to add a replica that was just removed by the controller back to the ISR, because it does not yet know that the controller has fenced the replica. To avoid this, the controller will validate the new ISR and reject any AlterPartition request containing an ineligible replica with the newly introduced INELIGIBLE_REPLICA error code. For backward compatibility, OPERATION_NOT_ATTEMPTED will be used for older versions.

When the leader receives an INELIGIBLE_REPLICA error code, it is expected to revert its state to the last committed state (assuming that the state did not change in the meantime) and to retry the expansion. When a broker is unfenced by the controller, the leader does nothing, because subsequent fetch requests from the followers will try to get them back into the ISR if they are caught up.

With this change, a shutting down broker can stop its metadata listener once the controlled shutdown has terminated. This allows leaders hosted on that broker to step down while allowing followers to keep fetching until the broker shuts down.
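The controller-side validation described above can be illustrated with a minimal sketch. It is not the actual controller code: the class `IsrValidationSketch`, the `BrokerState` type, and the `validateNewIsr` method are hypothetical names introduced here for illustration, and treating an unregistered broker as ineligible is an assumption of this sketch.

```java
import java.util.List;
import java.util.Map;

public class IsrValidationSketch {

    /** Hypothetical view of a broker's registration state on the controller. */
    static final class BrokerState {
        final boolean fenced;
        final boolean inControlledShutdown;

        BrokerState(boolean fenced, boolean inControlledShutdown) {
            this.fenced = fenced;
            this.inControlledShutdown = inControlledShutdown;
        }

        /** A replica is eligible only if its broker is neither fenced nor shutting down. */
        boolean eligibleForIsr() {
            return !fenced && !inControlledShutdown;
        }
    }

    /** Simplified stand-in for the error codes relevant to this check. */
    enum Result { NONE, INELIGIBLE_REPLICA }

    /**
     * Rejects the proposed ISR if any member is fenced or shutting down.
     * Assumption: a broker unknown to the controller is also treated as ineligible.
     */
    static Result validateNewIsr(List<Integer> newIsr, Map<Integer, BrokerState> brokers) {
        for (int brokerId : newIsr) {
            BrokerState state = brokers.get(brokerId);
            if (state == null || !state.eligibleForIsr()) {
                return Result.INELIGIBLE_REPLICA;
            }
        }
        return Result.NONE;
    }

    public static void main(String[] args) {
        Map<Integer, BrokerState> brokers = Map.of(
            1, new BrokerState(false, false),  // healthy
            2, new BrokerState(true, false),   // fenced
            3, new BrokerState(false, true));  // in controlled shutdown

        System.out.println(validateNewIsr(List.of(1), brokers));
        System.out.println(validateNewIsr(List.of(1, 2), brokers));
        System.out.println(validateNewIsr(List.of(1, 3), brokers));
    }
}
```

On receiving the rejection, the leader would discard the pending ISR, fall back to the last committed one, and attempt the expansion again on a later fetch from the follower.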

Public Interfaces

New Error Code

INELIGIBLE_REPLICA - At least one replica is ineligible to join the ISR.

AlterPartition RPC

The AlterPartition RPC is bumped to version 2. INELIGIBLE_REPLICA is returned in the response if the new ISR contains a fenced or shutting down replica.
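The version-dependent choice of error code can be sketched as follows. This is an illustration only: the class `AlterPartitionErrorSketch`, its local `Error` enum, and the method `ineligibleReplicaError` are hypothetical names, not part of the Kafka code base. The mapping itself follows the text: version 2 and above receive INELIGIBLE_REPLICA, older versions receive OPERATION_NOT_ATTEMPTED.

```java
public class AlterPartitionErrorSketch {

    /** Local stand-in for the two error codes involved; not Kafka's own Errors enum. */
    enum Error { INELIGIBLE_REPLICA, OPERATION_NOT_ATTEMPTED }

    /** First AlterPartition version that understands INELIGIBLE_REPLICA. */
    static final short MIN_VERSION_WITH_INELIGIBLE_REPLICA = 2;

    /**
     * Picks the error to return for an ISR containing an ineligible replica,
     * based on the version of the request that the leader sent.
     */
    static Error ineligibleReplicaError(short requestVersion) {
        return requestVersion >= MIN_VERSION_WITH_INELIGIBLE_REPLICA
            ? Error.INELIGIBLE_REPLICA
            : Error.OPERATION_NOT_ATTEMPTED;
    }

    public static void main(String[] args) {
        System.out.println(ineligibleReplicaError((short) 1));
        System.out.println(ineligibleReplicaError((short) 2));
    }
}
```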

Compatibility, Deprecation, and Migration Plan

The change is backward compatible.

Rejected Alternatives

An alternative would be to explicitly stop replicas, as we do in ZK mode. When a broker shuts down, it could treat any partition change event received after the controlled shutdown has started as an implicit `StopReplica` request. The main downside of this approach is that we don't really know from which point in the metadata log it is safe to assume this. The broker might be behind and miss real events. We felt that handling this on the leader is the right approach for KRaft.