Status

Current state: Discussion

Discussion thread:

JIRA:

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

While a reassignment is in progress, the number of replicas for a partition being reassigned temporarily increases beyond the replication factor. Once all new replicas are in the ISR, the old replicas are removed and the number of replicas again matches the replication factor. Until that point, however, the partition is reported as an under-replicated partition (URP), both in the metrics and in the topic command utility. This is misleading because the partition may satisfy the required replication factor throughout the reassignment. It has two major drawbacks:

  1. URPs cannot easily be used for alerting because they are expected during a reassignment. This can obscure actual replication problems while a reassignment is in progress.
  2. We cannot easily isolate the reassignment load with a throttle. Kafka supports replication throttles which exclude ISR traffic, but if a replica unexpectedly falls out of the ISR during a reassignment, that formerly in-sync replica will be hit by the throttle. This not only makes it more difficult for the replica to rejoin the ISR, it also takes traffic away from the reassignment.

In this KIP, we propose to distinguish the URPs caused by reassignment.

Proposed Changes

The problem at the moment is that only the controller knows about the reassignment. Partition leaders just see a single replica set. We propose to have the controller propagate the reassignment state to the leaders. We will distinguish between the current set of replicas and the impending set of replicas. While the reassignment is in progress, the impending replica set will contain the new replica assignment.
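As a rough illustration of the leader-side view described above, the following Python sketch models a partition that carries both replica sets. The class name, the field names, and the convention that an empty impending set means "no reassignment in progress" are assumptions made for this example and are not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PartitionState:
    """Hypothetical leader-side view of one partition (not Kafka's actual class)."""
    topic: str
    partition: int
    isr: List[int]
    current_replicas: List[int]
    # Assumption: this list is empty when no reassignment is in progress.
    impending_replicas: List[int] = field(default_factory=list)

    def is_reassigning(self) -> bool:
        """True if the controller has propagated an impending replica set."""
        return bool(self.impending_replicas)

    def all_replicas(self) -> List[int]:
        """Current replicas followed by impending replicas not already present."""
        extra = [r for r in self.impending_replicas if r not in self.current_replicas]
        return self.current_replicas + extra
```

With such a structure, a leader can tell which replicas exist only because of the reassignment instead of seeing one merged replica set.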

Public Interfaces

Request APIs

We will modify the UpdateMetadata and the LeaderAndIsr request APIs to allow the controller to propagate the new reassignment to the leaders. The new LeaderAndIsr request schema is given below:

LeaderAndIsrRequest => ControllerId ControllerEpoch [PartitionState] [LiveLeader]
  ControllerId => INT32
  ControllerEpoch => INT32
  PartitionState => TopicName PartitionId ControllerEpoch LeaderId LeaderEpoch ISR ZkVersion CurrentReplicas ImpendingReplicas IsNew
    TopicName => STRING
    PartitionId => INT32
    ControllerEpoch => INT32
    LeaderId => INT32
    LeaderEpoch => INT32
    ISR => [INT32]
    ZkVersion => INT32
    CurrentReplicas => [INT32]    // New
    ImpendingReplicas => [INT32]  // New
    IsNew => BOOLEAN

Similar changes will be made to the UpdateMetadata request.

UpdateMetadataRequest => ControllerId ControllerEpoch [PartitionState] [LiveLeader]
  ControllerId => INT32
  ControllerEpoch => INT32
  PartitionState => TopicName PartitionId ControllerEpoch LeaderId LeaderEpoch ISR ZkVersion CurrentReplicas ImpendingReplicas
    TopicName => STRING
    PartitionId => INT32
    ControllerEpoch => INT32
    LeaderId => INT32
    LeaderEpoch => INT32
    ZkVersion => INT32
    ISR => [INT32]
    CurrentReplicas => [INT32]     // New
    ImpendingReplicas => [INT32]  // New

The response schemas for both APIs will match the previous version.

Metrics

We will change the semantics of the "UnderReplicated" metric to count only the partitions that are under-replicated from the perspective of the current replica set. We will add a new metric, "ReassignedCount", which tracks the number of replicas that are currently being reassigned.
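The difference between the existing and the proposed semantics can be sketched in Python as follows. The function names are hypothetical. The "old" check mirrors the existing behaviour of comparing the ISR size with the full replica count. The "new" check and the reassignment count reflect one possible reading of the proposal (every current replica must be in the ISR; reassigning replicas are those in the impending set but not in the current set) and should be treated as assumptions.

```python
from typing import Iterable


def under_replicated_old(isr: Iterable[int], current: Iterable[int],
                         impending: Iterable[int]) -> bool:
    """Existing semantics: the ISR is smaller than the full (merged) replica set."""
    all_replicas = set(current) | set(impending)
    return len(set(isr)) < len(all_replicas)


def under_replicated_new(isr: Iterable[int], current: Iterable[int]) -> bool:
    """Assumed proposed semantics: some current replica is missing from the ISR."""
    return not set(current) <= set(isr)


def reassigning_replica_count(current: Iterable[int], impending: Iterable[int]) -> int:
    """Assumed reading of ReassignedCount: impending replicas not yet in the current set."""
    return len(set(impending) - set(current))


# Reassigning from brokers 1,2,3 to brokers 4,5,6; only broker 4 has caught up so far.
current, impending, isr = [1, 2, 3], [4, 5, 6], [1, 2, 3, 4]
old_view = under_replicated_old(isr, current, impending)
new_view = under_replicated_new(isr, current)
moving = reassigning_replica_count(current, impending)
```

In this example the partition is flagged under the old semantics although all three current replicas are in sync, while under the assumed new semantics it is flagged only if one of brokers 1, 2, or 3 drops out of the ISR.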

Compatibility, Deprecation, and Migration Plan

The main concern from a compatibility perspective is the semantic change to the "UnderReplicated" metric. Users who rely on this metric to track the reassignment state may have to make changes. However, we believe that continued misuse of this metric (i.e., not taking reassignment into account) is a more substantial problem.

Rejected Alternatives

We considered leaving the "UnderReplicated" metric with its current semantics and adding a new metric to represent the "under-synchronized" replicas. We ultimately rejected this because we felt it was necessary to address the misuse of the URP metric due to its surprising behavior during a reassignment.