You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 7 Next »

Status

Current state: Discussion

Discussion thread:

JIRA:

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

While a reassignment is in progress, new replicas are trying to catch up and are not in the ISR. The broker considers these partitions "under-replicated" even if the desired replication factor is always satisfied. This is misleading and makes URP metrics difficult to use for alerts. In KIP-455, we gave the leader a way to detect a reassignment. Specifically, the LeaderAndIsr request now has a separate field for the replicas which are being added and those that are being removed. This allows us to compute a more useful metric value.

Proposed Changes

We will change the semantics of the "UnderReplicated" metric to only take into account the active replica assignment. In other words, we will subtract the adding replicas from both the total replicas and the current ISR when determining URP satisfaction. 

Additionally, we are adding a new metric to track the number of partitions being reassigned.

Public Interfaces

As described above, this KIP changes the semantics of `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`.

We will add a new metric to track the number of partitions currently being reassigned: `kafka.server:type=ReplicaManager,name=ReassigningPartitions`. Any partition which has non-empty AddingReplicas will count toward this value.

Compatibility, Deprecation, and Migration Plan

The main concern from a compatibility perspective is the semantic change to the "UnderReplicated" metric. Users may have to make changes if this is used to track the reassignment state. However, we believe that continued misuse of this metric (i.e. not taking reassignment into account) is a more substantial problem.

Rejected Alternatives

We considered leaving the "UnderReplicated" metric with its current semantics and adding a new metric to represent the "under-synchronized" replicas. We ultimately rejected this because we felt it was necessary to address the misuse of the URP metric due to its surprising behavior during a reassignment.

  • No labels