Status
Current state: Discussion
Discussion thread:
JIRA:
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
While a reassignment is in progress, the number of replicas for a partition being reassigned temporarily increases beyond the replication factor. Once all new replicas are in the ISR, the old replicas are removed and the number of replicas again matches the replication factor. Until that point, however, both the broker metrics and the topic command utility report the partition as under-replicated. This is misleading because the partition may satisfy the required replication factor throughout the reassignment. Furthermore, it obscures actual replication problems while a reassignment is in progress, because some number of under-replicated partitions (URPs) is expected. This makes it difficult, for example, to use URPs for alerting. In this KIP, we propose to distinguish the URPs caused by reassignment.
Proposed Changes
We will introduce two new partition categories. "UnderSynchronized" partitions are those whose in-sync replica set is smaller than the topic's replication factor. "OverReplicated" partitions are those which have more replicas than the replication factor.
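The two definitions can be sketched as simple predicates. This is only an illustration under stated assumptions: `PartitionState` and the three functions are hypothetical names, not Kafka classes, and the replication factor is passed in explicitly because this section does not say how it is determined during a reassignment. The existing under-replicated check is modeled as "fewer in-sync replicas than assigned replicas", which matches the behavior described in the Motivation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionState:
    """Hypothetical snapshot of one partition (not a Kafka class)."""
    replicas: tuple          # broker ids currently assigned, including ones added by a reassignment
    isr: tuple               # broker ids currently in sync
    replication_factor: int  # assumed to be known; the target replication factor of the topic


def is_under_replicated(p):
    # Existing notion (modeled): fewer in-sync replicas than assigned replicas.
    return len(p.isr) < len(p.replicas)


def is_under_synchronized(p):
    # Proposed: the ISR is smaller than the replication factor.
    return len(p.isr) < p.replication_factor


def is_over_replicated(p):
    # Proposed: more replicas are assigned than the replication factor.
    return len(p.replicas) > p.replication_factor


# Example: moving a partition from brokers (1, 2, 3) to (4, 5, 6) with
# replication factor 3, before the new replicas have joined the ISR.
mid_reassignment = PartitionState(
    replicas=(1, 2, 3, 4, 5, 6),
    isr=(1, 2, 3),
    replication_factor=3,
)
```

In this sketch the partition counts as under-replicated under the existing check, but it is OverReplicated and not UnderSynchronized under the proposed categories, since its three original replicas are still in sync.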
Public Interfaces
We will add two new metrics, exposed on the broker, which report the number of partitions in each of the new categories described above: "UnderSynchronizedCount" and "OverReplicatedCount".
The topic command utility will have corresponding options to display the partitions in each category: --under-synchronized-partitions and --over-replicated-partitions.
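As a rough illustration of what the new metrics and options would report, the sketch below filters a small set of partitions into the two categories and counts them. The `Partition` tuple, the helper functions, and the `orders-*` partition names are hypothetical and exist only for this example. The sketch also assumes the counts cover exactly the partitions passed in, and does not model which broker reports a given partition.

```python
from typing import NamedTuple


class Partition(NamedTuple):
    """Hypothetical partition description (not a Kafka class)."""
    name: str
    replicas: tuple
    isr: tuple
    replication_factor: int


def under_synchronized(partitions):
    # What --under-synchronized-partitions would list in this sketch.
    return [p.name for p in partitions if len(p.isr) < p.replication_factor]


def over_replicated(partitions):
    # What --over-replicated-partitions would list in this sketch.
    return [p.name for p in partitions if len(p.replicas) > p.replication_factor]


partitions = [
    Partition("orders-0", (1, 2, 3), (1, 2, 3), 3),           # healthy
    Partition("orders-1", (1, 2, 3), (1, 2), 3),              # broker 3 fell out of the ISR
    Partition("orders-2", (1, 2, 3, 4, 5, 6), (1, 2, 3), 3),  # reassignment in progress
]

# Values the two proposed metrics would take for this set of partitions.
under_synchronized_count = len(under_synchronized(partitions))
over_replicated_count = len(over_replicated(partitions))
```

Here only the partition with a lagging replica contributes to the under-synchronized count, while the partition being reassigned contributes only to the over-replicated count, so an alert on the former is not triggered by the reassignment itself.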
Compatibility, Deprecation, and Migration Plan
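These changes are backwards compatible: they only add new metrics and new topic command options, and the semantics of the existing under-replicated partition reporting are unchanged.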
Rejected Alternatives
We considered redefining "under-replicated partition" to exclude partitions being reassigned. Ultimately, given how widely the term is used, we were reluctant to change its semantics and break compatibility with previous versions.