Status
Current state: Discussion
Discussion thread:
JIRA:
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
From the perspective of the leader, there is no such thing as a partition reassignment. There is only the current replica set. This has several drawbacks:
- While a reassignment is in progress, the new replicas are trying to catch up. The broker considers these partitions under-replicated even if the desired replication factor is always satisfied. This is misleading and makes URP metrics difficult to alert on.
- During a reassignment, users often attempt to throttle the reassignment traffic in order to avoid negatively impacting the cluster. The throttle applies to any replica which is not in sync, but there is no way for the leader to distinguish the replicas which are being reassigned. If a replica falls out of sync, then suddenly it gets hit with the throttle. Not only does this make the replica slower to catch back up, it increases the amount of traffic which is subject to the throttle which could make the reassignment go even slower.
- As replicas are catching up, they are added to the ISR. Depending on the rate that the replicas catch up, there may be a non-trivial amount of time during which the ISR is larger than the desired replication factor. This can negatively impact end-to-end latency since replication . Perhaps even worse, it makes determining the correct throttle even trickier because all ISR traffic skips the throttle.
This KIP proposes to make the leader aware of partition assignments.
Proposed Changes
The problem at the moment is that only the controller knows about the reassignment. Partition leaders just see a single replica set. We propose to have the controller propagate the reassignment state to the leaders. We will distinguish between the current set of replicas and the impending set of replicas. The impending replica set will contain the new replica assignment while the reassignment is in progress.
Public Interfaces
Request APIs
We will modify the UpdateMetadata and the LeaderAndIsr request APIs to allow the controller to propagate the new reassignment to the leaders. The new LeaderAndIsr request schema is given below:
LeaderAndIsrRequest => ControllerId ControllerEpoch [PartitionState] [LiveLeader] ControllerId => INT32 ControllerEpoch => INT32 PartitionState => TopicName PartitionId ControllerEpoch LeaderId LeaderEpoch ISR ZkVersion ActiveReplicas ImpendingReplicas IsNew TopicName => STRING PartitionId => INT32 ControllerEpoch => INT32 LeaderId => INT32 LeaderEpoch => INT32 IsNew => BOOLEAN ZkVersion => INT32 ISR => [INT32] CurrentReplicas => [INT32] // Renamed TargetReplicas => [INT32] // New
Similar changes will be made to the UpdateMetadata request.
UpdateMetadataRequest => ControllerId ControllerEpoch [PartitionState] [LiveLeader] ControllerId => INT32 ControllerEpoch => INT32 PartitionState => TopicName PartitionId ControllerEpoch LeaderId LeaderEpoch ISR ZkVersion ActiveReplicas ImpendingReplicas TopicName => STRING PartitionId => INT32 ControllerEpoch => INT32 LeaderId => INT32 LeaderEpoch => INT32 ZkVersion => INT32 ISR => [INT32] CurrentReplicas => [INT32] // Renamed TargetReplicas => [INT32] // New
The response schemas for both APIs will match the previous version.
Metrics
We will change the semantics of the "UnderReplicated" metric to count only the partitions which are under-replicated from the perspective of the active replica set. We will add a new metric "ReassignedCount" which tracks the number of replicas which are currently being reassigned.
Compatibility, Deprecation, and Migration Plan
The main concern from a compatibility perspective is the semantic change to the "UnderReplicated" metric. Users may have to make changes if this is used to track the reassignment state. However, we believe that continued misuse of this metric (i.e. not taking reassignment into account) is a more substantial problem.
Rejected Alternatives
We considered leaving the "UnderReplicated" metric with its current semantics and adding a new metric to represent the "under-synchronized" replicas. We ultimately rejected this because we felt it was necessary to address the misuse of the URP metric due to its surprising behavior during a reassignment.