Status

Current state: Discussion

Discussion thread:

JIRA:

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

While a reassignment is in progress, the new replicas are still catching up and are not yet in the ISR. The broker considers these partitions "under-replicated" even when the desired replication factor is satisfied throughout the reassignment. This is misleading and makes URP metrics difficult to use for alerting. In KIP-455, we gave the leader a way to detect a reassignment. Specifically, the LeaderAndIsr request now has separate fields for the replicas which are being added and those which are being removed. This allows us to compute a more useful metric value.

Proposed Changes

We will change the semantics of the "UnderReplicated" metric to only take into account the active replica assignment. In other words, we will subtract the AddingReplicas from both the total replicas and the current ISR when determining whether a partition is under-replicated.
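The rule above can be illustrated with a minimal sketch. It is not the broker's implementation: the function name `is_under_replicated` and its parameters are hypothetical, and replicas are modeled as plain broker ids.

```python
def is_under_replicated(replicas, isr, adding_replicas=()):
    """Return True if the partition is under-replicated under the proposed rule.

    All names here are hypothetical; broker ids are plain integers.
    """
    adding = set(adding_replicas)
    # Ignore replicas that are only present because of an ongoing reassignment.
    active_replicas = set(replicas) - adding
    active_isr = set(isr) - adding
    # Under-replicated only if an active replica is missing from the ISR.
    return len(active_isr) < len(active_replicas)


# Broker 4 is being added and has not caught up yet. The active replicas
# 1, 2 and 3 are all in sync, so the partition is not reported as a URP.
print(is_under_replicated(replicas=[1, 2, 3, 4], isr=[1, 2, 3], adding_replicas=[4]))

# Broker 3 has fallen out of the ISR, so the partition is a URP
# regardless of the reassignment.
print(is_under_replicated(replicas=[1, 2, 3, 4], isr=[1, 2], adding_replicas=[4]))
```

Under the previous semantics, the first case would have been reported as under-replicated because the ISR (3 replicas) is smaller than the full assignment (4 replicas).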

Similarly, we will change the behavior of the Kafka topic command so that `--under-replicated-partitions` returns results consistent with the metric. Because the adding/removing replicas are not visible from the Metadata API, we will use the new ListPartitionReassignments API introduced in KIP-455.
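As a rough sketch of how the topic command could combine the two sources of information, assume the metadata view and the reassignment listing are available as dictionaries keyed by `(topic, partition)`. The function name and both data shapes are assumptions made for illustration; they are not the tool's actual code or the wire format of either API.

```python
def under_replicated_partitions(metadata, reassignments):
    """List the partitions to report for --under-replicated-partitions.

    metadata:      {(topic, partition): {"replicas": [...], "isr": [...]}}
    reassignments: {(topic, partition): [adding broker ids]}
    Both shapes are hypothetical stand-ins for the API responses.
    """
    result = []
    for tp, state in sorted(metadata.items()):
        # Partitions without an active reassignment have no adding replicas.
        adding = set(reassignments.get(tp, ()))
        active_replicas = set(state["replicas"]) - adding
        active_isr = set(state["isr"]) - adding
        if len(active_isr) < len(active_replicas):
            result.append(tp)
    return result


metadata = {
    ("orders", 0): {"replicas": [1, 2, 3, 4], "isr": [1, 2, 3]},  # reassigning
    ("orders", 1): {"replicas": [1, 2, 3], "isr": [1, 2]},        # broker 3 lagging
}
reassignments = {("orders", 0): [4]}

# Only ("orders", 1) is reported; ("orders", 0) is merely being reassigned.
print(under_replicated_partitions(metadata, reassignments))
```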

Additionally, we are adding a new metric to track the number of partitions being reassigned.

Public Interfaces

As described above, this KIP changes the semantics of `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`.

We will add a new gauge to track the number of partitions currently being reassigned: `kafka.server:type=ReplicaManager,name=ReassigningPartitions`. Any partition which has non-empty AddingReplicas will count toward this value.
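The counting rule for the new gauge can be sketched as follows. The per-partition dictionaries and the `adding_replicas` key are hypothetical stand-ins for the broker's internal partition state.

```python
def reassigning_partitions(partitions):
    """Count partitions whose AddingReplicas set is non-empty.

    `partitions` is an iterable of dicts with an "adding_replicas" list;
    this shape is an assumption made for the example.
    """
    return sum(1 for p in partitions if p["adding_replicas"])


partitions = [
    {"name": "orders-0", "adding_replicas": [4]},
    {"name": "orders-1", "adding_replicas": []},
    {"name": "orders-2", "adding_replicas": [5, 6]},
]

# Two of the three partitions have replicas being added.
print(reassigning_partitions(partitions))
```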

Compatibility, Deprecation, and Migration Plan

The main concern from a compatibility perspective is the semantic change to the "UnderReplicated" metric. Users who currently rely on this metric to track the state of a reassignment will need to change their monitoring, since partitions that are only being reassigned will no longer be counted. However, we believe that continued misuse of this metric (i.e. not taking reassignment into account) is a more substantial problem.

Rejected Alternatives

We considered leaving the "UnderReplicated" metric with its current semantics and adding a new metric to represent the "under-synchronized" replicas. We ultimately rejected this because we felt it was necessary to address the misuse of the URP metric due to its surprising behavior during a reassignment.

KIP-352: Distinguish URPs caused by reassignment
