Status

Current state: Discussion

Discussion thread:

JIRA:

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

While a reassignment is in progress, the new replicas are still catching up and are not in the ISR. The broker considers these partitions "under-replicated" even though the desired replication factor remains satisfied throughout. This is misleading and makes URP metrics difficult to use for alerting. In KIP-455, we gave the leader a way to detect a reassignment. Specifically, the LeaderAndIsr request now has separate fields for the replicas being added and those being removed. This allows us to compute a more useful metric value.

Proposed Changes

We will change the semantics of the "UnderReplicated" metric to take into account only the active replica assignment. In other words, we will subtract the AddingReplicas set from both the full replica set and the current ISR before deciding whether a partition is under-replicated.
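The adjusted check can be sketched as follows. This is a minimal illustration under stated assumptions, not the broker's actual code: the class name `UrpCheck`, the method `isUnderReplicated`, and the use of plain integer sets for broker IDs are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

public class UrpCheck {
    // Hypothetical helper: decides whether a partition is under-replicated
    // after excluding the replicas that are still being added.
    public static boolean isUnderReplicated(Set<Integer> replicas,
                                            Set<Integer> isr,
                                            Set<Integer> addingReplicas) {
        // Active assignment = full replica set minus AddingReplicas.
        Set<Integer> activeReplicas = new HashSet<>(replicas);
        activeReplicas.removeAll(addingReplicas);

        // An adding replica that has already joined the ISR is excluded too,
        // so both sides of the comparison cover the same brokers.
        Set<Integer> activeIsr = new HashSet<>(isr);
        activeIsr.removeAll(addingReplicas);

        return activeIsr.size() < activeReplicas.size();
    }
}
```

For example, with replicas {1, 2, 3, 4}, AddingReplicas {4}, and ISR {1, 2, 3}, the partition is not reported as under-replicated, whereas comparing the raw ISR against the raw replica set (3 < 4) would report it.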

Similarly, we will change the behavior of the Kafka topic command so that `--under-replicated-partitions` returns results consistent with the metric. Because the adding and removing replicas are not visible through the Metadata API, the command will use the new ListPartitionReassignments API introduced in KIP-455.

Additionally, we are adding a couple of new metrics to track the progress of an active reassignment. These are described below.

Public Interfaces

As described above, this KIP changes the semantics of `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`.

We will add a new gauge to track the number of partitions currently being reassigned: `kafka.server:type=ReplicaManager,name=ReassigningPartitions`. Any partition which has non-empty AddingReplicas will count toward this value.
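As a rough sketch of how such a gauge value could be derived, assuming a hypothetical `PartitionState` view that exposes only the AddingReplicas set (neither class below is part of Kafka):

```java
import java.util.Collection;
import java.util.Set;

public class ReassigningPartitionsGauge {
    // Hypothetical, simplified view of a partition hosted on this broker.
    public static final class PartitionState {
        private final Set<Integer> addingReplicas;

        public PartitionState(Set<Integer> addingReplicas) {
            this.addingReplicas = addingReplicas;
        }

        public Set<Integer> addingReplicas() {
            return addingReplicas;
        }
    }

    // Counts the partitions whose AddingReplicas set is non-empty.
    public static int value(Collection<PartitionState> partitions) {
        int count = 0;
        for (PartitionState partition : partitions) {
            if (!partition.addingReplicas().isEmpty()) {
                count++;
            }
        }
        return count;
    }
}
```

Under this definition, a partition that has only replicas pending removal and nothing being added does not count toward the gauge.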

We will also add two new meters to track inbound and outbound bytes for reassignment traffic: `kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesInPerSec` and `kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesOutPerSec`. Fetch traffic to and from replicas in the AddingReplicas set will contribute to these metrics as well as to the total replication metrics.
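The double accounting on the outbound (leader) side might look like the sketch below. It is illustrative only: the class `ReplicationByteCounters` and its methods are hypothetical, and plain counters stand in for the rate meters the broker would actually expose.

```java
import java.util.Set;
import java.util.concurrent.atomic.LongAdder;

public class ReplicationByteCounters {
    // Stand-ins for the total replication and reassignment "bytes out" meters.
    private final LongAdder totalReplicationBytesOut = new LongAdder();
    private final LongAdder reassignmentBytesOut = new LongAdder();

    // Records bytes sent by the leader in response to a follower fetch.
    public void recordOutbound(int followerId, Set<Integer> addingReplicas, long bytes) {
        // Every follower fetch counts toward total replication traffic.
        totalReplicationBytesOut.add(bytes);
        // Fetches from replicas still being added also count as reassignment traffic.
        if (addingReplicas.contains(followerId)) {
            reassignmentBytesOut.add(bytes);
        }
    }

    public long totalReplicationBytesOut() {
        return totalReplicationBytesOut.sum();
    }

    public long reassignmentBytesOut() {
        return reassignmentBytesOut.sum();
    }
}
```

The inbound side on the adding replica would mirror this logic. Because reassignment bytes are a subset of replication bytes, the two values should not be summed when estimating total traffic.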

Compatibility, Deprecation, and Migration Plan

The main concern from a compatibility perspective is the semantic change to the "UnderReplicated" metric. Users who currently rely on this metric to track reassignment state may have to change their monitoring. However, we believe that continued misuse of this metric (i.e. not taking reassignment into account) is a more substantial problem.

Rejected Alternatives

We considered leaving the "UnderReplicated" metric with its current semantics and adding a new metric to represent the "under-synchronized" replicas. We ultimately rejected this because we felt it was necessary to address the misuse of the URP metric due to its surprising behavior during a reassignment.
