Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Status

Current state: DiscussionAdopted

Discussion thread: here

JIRA: 

Jira
serverASF JIRA
serverId5aa69414-a9e9-3523-82ec-879b028fb15b
keyKAFKA-8834
 
Jira
serverASF JIRA
serverId5aa69414-a9e9-3523-82ec-879b028fb15b
keyKAFKA-8835
Jira
serverASF JIRA
serverId5aa69414-a9e9-3523-82ec-879b028fb15b
keyKAFKA-9059

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

While a reassignment is in progress, the number of replicas for a partition being reassigned temporarily increases beyond the replication factor. Once all new replicas are in the ISR, the old replicas are removed and the number of replicas again matches the replication factor. Until that point, however, the partition is treated as under-replicated both from the perspective of metrics and from the topic command utility. This is misleading because the partitions may satisfy the required replication factor throughout the reassignment. This has two major drawbacks:

  1. URPs cannot easily be used for alerting because they are expected during a reassignment. This can obscure actual replication problems while a reassignment is in progress.
  2. We cannot easily isolate the reassignment load with a throttle. Kafka supports replication throttles which exclude ISR traffic, but if there is an unexpected URP during a reassignment, the formerly in-sync replica will get hit with the throttle. This not only makes it more difficult to rejoin the ISR, it takes traffic from the reassignment. 

In this KIP, we propose to distinguish the URPs caused by reassignment.

Proposed Changes

The problem at the moment is that only the controller knows about the reassignment. Partition leaders just see a single replica set. We propose to have the controller propagate the reassignment state to the leaders. We will distinguish between the current set of replicas and the impending set of replicas. The impending replica set will contain the new replica assignment while the reassignment is in progress. 

Public Interfaces

Request APIs

We will modify the UpdateMetadata and the LeaderAndIsr request APIs to allow the controller to propagate the new reassignment to the leaders. The new LeaderAndIsr request schema is given below:

Code Block
linenumberstrue
LeaderAndIsrRequest => ControllerId ControllerEpoch [PartitionState] [LiveLeader]
  ControllerId => INT32
  ControllerEpoch => INT32
  PartitionState => TopicName PartitionId ControllerEpoch LeaderId LeaderEpoch ISR ZkVersion ActiveReplicas ImpendingReplicas IsNew
    TopicName => STRING
    PartitionId => INT32
    ControllerEpoch => INT32
    LeaderId => INT32
    LeaderEpoch => INT32
    IsNew => BOOLEAN
    ZkVersion => INT32
    ISR => [INT32]
    CurrentReplicas => [INT32]     // New
    ImpendingReplicas => [INT32]  // New

Similar changes will be made to the UpdateMetadata request.

Code Block
linenumberstrue
UpdateMetadataRequest => ControllerId ControllerEpoch [PartitionState] [LiveLeader]
  ControllerId => INT32
  ControllerEpoch => INT32
  PartitionState => TopicName PartitionId ControllerEpoch LeaderId LeaderEpoch ISR ZkVersion ActiveReplicas ImpendingReplicas
    TopicName => STRING
    PartitionId => INT32
    ControllerEpoch => INT32
    LeaderId => INT32
    LeaderEpoch => INT32
    ZkVersion => INT32
    ISR => [INT32]
    CurrentReplicas => [INT32]     // New
    ImpendingReplicas => [INT32]  // New

The response schemas for both APIs will match the previous version.

Metrics

new replicas are trying to catch up and are not in the ISR. The broker considers these partitions "under-replicated" even if the desired replication factor is always satisfied. This is misleading and makes URP metrics difficult to use for alerts. In KIP-455, we gave the leader a way to detect a reassignment. Specifically, the LeaderAndIsr request now has a separate field for the replicas which are being added and those that are being removed. This allows us to compute a more useful metric value.

Proposed Changes

We will change the semantics of the "UnderReplicated" metric to taking into account the AddingReplicas. Specifically, we will use the following formula:

Code Block
isUnderReplicated == size(original assigned replicas) - size(isr) > 0

We count a partition as under-replicated if the current isr is smaller than the size of the current replica set. This allows us to count AddingReplicas which makes this metric consistent with UnderMinIsr criteria. Note that a reassignment may change the number of replicas, but URP satisfaction will not take this into account until the reassignment is complete.

Similarly, we will change the behavior of the kafka topic command so that `--under-replicated-partitions` returns results consistent with the change above. Because the adding/removing replicas are not visible from the Metadata API, we will use the new ListReassignment API.

Additionally, we are adding a couple new metrics to track the progress of an active reassignment. These are described below.

Public Interfaces

As described above, this KIP changes the semantics of `kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions`. Replicas which are being added as part of a reassignment will not count toward this value.

We will also add some additional metrics to improve monitoring for reassignments. The table below shows all of the changes from this KIP.

MetricIs NewType

Includes Current

Assigned Replicas

Includes Reassigning

Replicas

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitionsNoGaugeYesNo
kafka.server:type=ReplicaManager,name=ReassigningPartitionsYesGaugeNoYes
kafka.server:type=ReplicaManager,name=ReassignmentMaxLagYesGaugeNoYes
kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesOutPerSecYesMeterNoYes
kafka.server:type=BrokerTopicMetrics,name=ReassignmentBytesInPerSecYesMeterNoYes

Note that the `ReassignmentBytesOutPerSec` and `ReassignmentBytesInPerSec` meters are broker-level metrics. We are not proposing any topic-level metrics for tracking reassignment progress.

ReassignmentMaxLag will be implemented separately as it requires some more consideration. JIRA is linked on the top of the KIPWe will change the semantics of the "UnderReplicated" metric to count only the partitions which are under-replicated from the perspective of the active replica set. We will add a new metric "ReassignedCount" which tracks the number of replicas which are currently being reassigned.

Compatibility, Deprecation, and Migration Plan

...