
...

Code Block
languagejs
{
  "apiKey": 45,
  "type": "request",
  "name": "AlterPartitionReassignmentsRequest",
  "validVersions": "0",
  "fields": [
    { "name": "TimeoutMs", "type": "int32", "versions": "0+", "default": "60000",
      "about": "The time in ms to wait for the request to complete." },
    { "name": "Topics", "type": "[]ReassignableTopic", "versions": "0+", "nullableVersions": "0+",
      "about": "The topics to reassign, or null to cancel all the pending reassignments for all topics.", "fields": [
      { "name": "Name", "type": "string", "versions": "0+",
        "about": "The topic name." },
      { "name": "Partitions", "type": "[]ReassignablePartition", "versions": "0+", "nullableVersions": "0+",
        "about": "The partitions to reassign, or null to cancel all pending reassignments for this topic", "fields": [
        { "name": "PartitionIndex", "type": "int32", "versions": "0+",
          "about": "The partition index." },
        { "name": "Replicas", "type": "[]int32", "versions": "0+", "nullableVersions": "0+",
          "about": "The replicas to place the partitions on, or null to cancel a pending reassignment for this partition." }
      ]}
    ]}
  ]
}
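
As an example (topic name and broker IDs are hypothetical), a single request can start one reassignment and cancel another: a non-null Replicas list moves the partition, while a null Replicas list cancels the pending reassignment for that partition.

Code Block
languagejs
// Hypothetical request body: move foo-0 to brokers [4, 5, 6] and cancel the
// pending reassignment of foo-1 (null Replicas means "cancel").
{
  "TimeoutMs": 60000,
  "Topics": [
    {
      "Name": "foo",
      "Partitions": [
        { "PartitionIndex": 0, "Replicas": [4, 5, 6] },
        { "PartitionIndex": 1, "Replicas": null }
      ]
    }
  ]
}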

AlterPartitionReassignmentsResponse

...

  • REQUEST_TIMED_OUT: if the request timed out.
  • NOT_CONTROLLER: if the node we sent the request to was not the controller.
  • INVALID_REPLICA_ASSIGNMENT: if the specified replica assignment was not valid: for example, if it included negative numbers, repeated numbers, or specified a broker ID that the controller was not aware of.
  • NO_REASSIGNMENT_IN_PROGRESS: if a cancellation was called on a topic/partition which was not in the middle of a reassignment.
  • CLUSTER_AUTHORIZATION_FAILED

...

Code Block
languagejs
{
  "apiKey": 45,
  "type": "response",
  "name": "AlterPartitionReassignmentsResponse",
  "validVersions": "0",
  "fields": [
    { "name": "ThrottleTimeMs", "type": "int32", "versions": "0+",
      "about": "The duration in milliseconds for which the request was throttled due to a quota violation, or zero if the request did not violate any quota." },
    { "name": "ErrorCode", "type": "int16", "versions": "0+",
      "about": "The error code." },
    { "name": "ErrorString", "type": "string", "versions": "0+", "nullableVersions": "0+",
      "about": "The error string, or null if there was no error." },
    { "name": "Responses", "type": "[]ReassignableTopicResponse", "versions": "0+",
      "about": "The responses to topics to reassign.", "fields": [
      { "name": "Name", "type": "string", "versions": "0+",
        "about": "The topic name." },
      { "name": "ErrorCode", "type": "int16", "versions": "0+",
        "about": "The error code." },
      { "name": "ErrorString", "type": "string", "versions": "0+", "nullableVersions": "0+",
        "about": "The error string, or null if there was no error." },
      { "name": "Partitions", "type": "[]ReassignablePartitionResponse", "versions": "0+", "nullableVersions": "0+",
        "about": "The responses to partitions to reassign", "fields": [
        { "name": "PartitionIndex", "type": "int32", "versions": "0+",
          "about": "The partition index." },
        { "name": "ErrorCode", "type": "int16", "versions": "0+",
          "about": "The error code." },
        { "name": "ErrorString", "type": "string", "versions": "0+", "nullableVersions": "0+",
          "about": "The error string, or null if there was no error." }
      ]}
    ]}
  ]
}
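
For illustration (values hypothetical), a response where the request as a whole succeeded but one partition was rejected might look like the following; 39 is assumed here to be the numeric code for INVALID_REPLICA_ASSIGNMENT.

Code Block
languagejs
// Hypothetical response: the top-level call succeeded, foo-0 was accepted,
// and foo-1 was rejected (39 assumed to map to INVALID_REPLICA_ASSIGNMENT).
{
  "ThrottleTimeMs": 0,
  "ErrorCode": 0,
  "ErrorString": null,
  "Responses": [
    {
      "Name": "foo",
      "ErrorCode": 0,
      "ErrorString": null,
      "Partitions": [
        { "PartitionIndex": 0, "ErrorCode": 0, "ErrorString": null },
        { "PartitionIndex": 1, "ErrorCode": 39, "ErrorString": "Replica set included a negative broker ID." }
      ]
    }
  ]
}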

ListPartitionReassignmentsRequest

This RPC lists the currently active partition reassignments.  It must be sent to the controller.

It requires DESCRIBE on CLUSTER.

Code Block
languagejs
{
  "apiKey": 46,
  "type": "request",
  "name": "ListPartitionReassignmentsRequest",
  "validVersions": "0",
  "fields": [
    { "name": "TimeoutMs", "type": "int32", "versions": "0+", "default": "60000",
      "about": "The time in ms to wait for the request to complete." },
    { "name": "Topics", "type": "[]ListPartitionReassignmentsTopics", "versions": "0+", "nullableVersions": "0+",
      "about": "The topics to list partition reassignments for, or null to list everything.", "fields": [
      { "name": "Name", "type": "string", "versions": "0+",
        "about": "The topic name." },
      { "name": "PartitionIndexes", "type": "[]int32", "versions": "0+", "nullableVersions": "0+",
        "about": "The partitions to list partition reassignments for, or null to list everything" }
    ]}
  ]
}
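
For example (names hypothetical), this request lists the reassignments for two specific partitions of one topic; a null Topics field would list everything.

Code Block
languagejs
// Hypothetical request body: list reassignments for foo-0 and foo-1 only.
// "Topics": null would list every ongoing reassignment in the cluster.
{
  "TimeoutMs": 60000,
  "Topics": [
    { "Name": "foo", "PartitionIndexes": [0, 1] }
  ]
}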

ListPartitionReassignmentsResponse

Possible errors:

  • REQUEST_TIMED_OUT: if the request timed out.
  • NOT_CONTROLLER: if the node we sent the request to is not the controller.
  • CLUSTER_AUTHORIZATION_FAILED

If the top-level error code is set, no responses will be provided.


Code Block
languagejs
{
  "apiKey": 46,
  "type": "response",
  "name": "ListPartitionReassignmentsResponse",
  "validVersions": "0",
  "fields": [
    { "name": "ThrottleTimeMs", "type": "int32", "versions": "0+",
      "about": "The duration in milliseconds for which the request was throttled due to a quota violation, or zero if the request did not violate any quota." },
    { "name": "ErrorCode", "type": "int16", "versions": "0+",
      "about": "The top-level error code, or 0 on success." },
    { "name": "ErrorMessage", "type": "string", "versions": "0+", "nullableVersions": "0+",
      "about": "The top-level error message, or null on success." },
    { "name": "Topics", "type": "[]OngoingTopicReassignment", "versions": "0+",
      "about": "The ongoing reassignments for each topic.", "fields": [
      { "name": "Name", "type": "string", "versions": "0+",
        "about": "The topic name." },
      { "name": "Partitions", "type": "[]OngoingPartitionReassignment", "versions": "0+",
        "about": "The ongoing reassignments for each partition.", "fields": [
        { "name": "PartitionIndex", "type": "int32", "versions": "0+",
          "about": "The index of the partition." },
        { "name": "CurrentReplicas", "type": "[]int32", "versions": "0+",
          "about": "The replicas which the partition is currently assigned to." },
        { "name": "TargetReplicas", "type": "[]int32", "versions": "0+",
          "about": "The replicas which the partition is being reassigned to." }
      ]}
    ]}
  ]
}
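
For illustration (values hypothetical), a partition being moved from [1, 2, 3] to [4, 5, 6] would be reported as:

Code Block
languagejs
// Hypothetical response for a single in-flight reassignment.
{
  "ThrottleTimeMs": 0,
  "ErrorCode": 0,
  "ErrorMessage": null,
  "Topics": [
    {
      "Name": "foo",
      "Partitions": [
        { "PartitionIndex": 0, "CurrentReplicas": [1, 2, 3], "TargetReplicas": [4, 5, 6] }
      ]
    }
  ]
}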

Modified RPCs

...

LeaderAndIsrRequest

We will add two new fields to LeaderAndIsrRequest, named AddingReplicas and RetiringReplicas.  These fields will contain the replica IDs that the partition is being added to and removed from, respectively, while a reassignment is in progress.

Code Block
languagejs
diff --git a/clients/src/main/resources/common/message/LeaderAndIsrRequest.json b/clients/src/main/resources/common/message/LeaderAndIsrRequest.json
index b8988351c..d45727078 100644
--- a/clients/src/main/resources/common/message/LeaderAndIsrRequest.json
+++ b/clients/src/main/resources/common/message/LeaderAndIsrRequest.json
@@ -20,6 +20,8 @@
   // Version 1 adds IsNew.
   //
   // Version 2 adds broker epoch and reorganizes the partitions by topic.
+  //
+  // Version 3 adds AddingReplicas and RetiringReplicas.
   "validVersions": "0-3",
   "fields": [
     { "name": "ControllerId", "type": "int32", "versions": "0+",
@@ -48,6 +50,8 @@
           "about": "The ZooKeeper version." },
         { "name": "Replicas", "type": "[]int32", "versions": "0+",
           "about": "The replica IDs." },
+        { "name": "AddingReplicas", "type": "[]int32", "versions": "3+", "nullableVersions": "0+",
+          "about": "The replica IDs that we are adding this partition to, or null if the partition is not being added to any new replicas." },
+        { "name": "RetiringReplicas", "type": "[]int32", "versions": "3+", "nullableVersions": "0+",
+          "about": "The replica IDs that we are removing this partition from, or null if the partition is not being from any replicas." },
         { "name": "IsNew", "type": "bool", "versions": "1+", "default": "false", "ignorable": true, 
           "about": "Whether the replica should have existed on the broker or not." }
       ]}
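
To make the new fields concrete, here is a sketch (all values hypothetical) of the reassignment-related portion of a LeaderAndIsr partition entry while a partition is being moved from [1, 2, 3] to [4, 3, 2], matching the example in the Algorithm section below.

Code Block
languagejs
// Hypothetical LeaderAndIsr partition entry mid-reassignment: Replicas still
// carries the full (union) replica set, while the new fields single out the
// in-flight addition (4) and retirement (1).
{
  "PartitionIndex": 0,
  "Leader": 1,
  "Isr": [1, 2, 3],
  "Replicas": [1, 4, 3, 2],
  "AddingReplicas": [4],
  "RetiringReplicas": [1]
}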

Proposed Changes

Cancellation

We propose supporting the "cancellation" of an on-going rebalance at all levels of granularity - the partition level, the topic level and global (stop all reassignments).
Cancellation is not the same as a revert. To concretize the notion: a cancellation of a rebalance means stopping on-going partition movements. A partition reassignment which is half done will not have the new replicas that have already entered the ISR reverted; it will only stop the in-flight movements for replicas that are not yet in the ISR.
Note that it is possible for a partition's replication factor to become higher than configured if a reassignment is stopped half-way. We do not propose a clear way to guard against this nor to revert it – users are expected to run another reassignment.

kafka-reassign-partitions.sh

kafka-reassign-partitions.sh is a script to create, inspect, or modify partition reassignments.

We will remove the --zookeeper option from this script.  Instead, all operations will be performed through the admin client.  This will be initialized by the --bootstrap-server flag, which already exists.

There will be a new flag, --list, which will list all of the current replica reassignments that are going on in the cluster.  They will be output in the same JSON format that is accepted by --reassignment-json-file.  If there are no ongoing replica reassignments, an empty JSON object will be printed.
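
For illustration (topic names hypothetical), the existing --reassignment-json-file format, which --list will also emit, looks like this:

Code Block
languagejs
// Hypothetical --list output: one entry per in-flight reassignment, in the
// same format accepted by --reassignment-json-file.
{
  "version": 1,
  "partitions": [
    { "topic": "foo", "partition": 0, "replicas": [4, 3, 2] },
    { "topic": "bar", "partition": 1, "replicas": [5, 6, 7] }
  ]
}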

Cancellation with the tool

The reassign-partitions tool will work in an incremental fashion – any new reassignment will be added on top of the current ones. It will continue to output a rollback JSON as part of any reassignment in order to let the user revert the changes.
Additionally, we will support a --cancel-all flag which will cancel all on-going partition reassignments. This provides users with an intuitive, quick and easy way to cancel multiple reassignments at once.
The protocol also supports cancellation at the topic level, but we will not be leveraging that in this tool because the JSON schema used is not suitable for coarser-grained cancellations. Any further improvements to the tool and the JSON format it uses can be slated for future work.

kafka-topics.sh

When describing partitions, kafka-topics.sh will show an additional field, "targetReplicas."  This will have information about the target replicas for the given partition. kafka-topics.sh will get this information from the ListPartitionReassignments API.

Implementation

ZooKeeper Changes

/brokers/topics/[topic]

We will store the reassignment data for each partition in this znode, as that is where we already store the current replicaSet for every partition.
This node will store two new maps, named retiringReplicas and addingReplicas. Each key is a partition ID and each value is an array of broker IDs, denoting the replicas the partition should be retired from or added to, respectively.
If both maps are empty, there is no on-going reassignment for that topic.

Code Block
languagejs
titletopics/[topic] zNode content
{
  "version": 1,
  "partitions": {"0": [1, 4, 2, 3] },
  "retiringReplicas": {"0": [1] }, # <---- NEW
  "addingReplicas": {"0": [4] } # <----- NEW
}
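
For comparison, once the controller has acted on both maps, the same znode would revert to holding only the final assignment:

Code Block
languagejs
titletopics/[topic] zNode content after the reassignment completes
{
  "version": 1,
  "partitions": {"0": [4, 2, 3] },
  "retiringReplicas": {}, # <---- empty again
  "addingReplicas": {} # <----- empty again
}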


Because we are not making incompatible changes to the znode, we will not be bumping up the inter-broker protocol version.

Controller Changes

When the controller starts up, it will load all of the currently active reassignments from the `topics/[topic]` znodes and the `admin/reassign_partitions` znode, with the latter taking precedence. This will not impose any additional startup overhead, because the controller needs to read these znodes anyway to start up.
The reason for having the `reassign_partitions` take precedence is that we cannot know for sure whether the Controller has taken action on that znode, so we allow it to overwrite the API-assigned reassignments in the event of a failover.

The controller will handle ListPartitionReassignments by listing the current active reassignments. It will keep an in-memory cache of on-going reassignments that it loads from ZooKeeper and updates on further reassignments.

The controller will handle the AlterPartitionReassignments RPC by modifying the /brokers/topics/[topic] znode.  Once the node is modified, the controller will send out LeaderAndIsr requests to start all the movements.
Once the controller is notified that a new replica has entered the ISR for a particular partition, it will remove it from the "addingReplicas" field in `/topics/[topic]`. If "addingReplicas" becomes empty, the controller will send a new wave of LeaderAndIsr requests to retire all the old replicas and, if applicable, elect new leaders.

Algorithm

We will always empty out the "addingReplicas" field before starting to act on the "retiringReplicas" field.

Code Block
# Illustrating the rebalance of a single partition.
# R is the current replicas, I is the ISR list, RR is retiringReplicas and AR is addingReplicas
R: [1, 2, 3], I: [1, 2, 3], AR: [], RR: []
# Reassignment called via API with targetReplicas=[4,3,2]
R: [1, 4, 3, 2], I: [1, 2, 3], AR: [4], RR: [1] # Controller sends LeaderAndIsr requests
# (We ignore RR until AR is empty)
R: [1, 4, 3, 2], I: [1, 2, 3, 4], AR: [4], RR: [1] # Leader 1 adds 4 to ISR
# Controller picks up on the change and removes 4 from AR.
# (in-memory) - R: [1, 4, 3, 2], I: [1, 2, 3, 4], AR: [], RR: [1] 
# Controller realizes AR is empty and starts work on RR. Removes all of RR from R and sends LeaderAndIsr requests
R: [4, 3, 2], I: [1, 2, 3, 4], AR: [], RR: [] # at this point the reassignment is considered complete
# Leaders change, 1 retires and 4 starts leading. 4 shrinks the ISR
R: [4, 3, 2], I: [2, 3, 4], AR: [], RR: []


We iteratively shrink the AR collection. When AR is empty, we empty out RR in one step. 

If a new reassignment is issued during an on-going one, we cancel the current one by emptying out both AR and RR, constructing them anew from the current R (as updated by the cancelled reassignment) and the new target replicas, and starting over.

In general this algorithm is consistent with the current Kafka behavior - other brokers still get the full replica set consisting of both the original replicas and the new ones.

Note that because of iteratively shrinking the AR collection, the server will not know how to create a "revert" of a reassignment, as it is always seeing it in motion. To keep the server logic simpler, we defer to clients for potential reverts. Here is an example of a cancellation:

Code Block
# Illustrating the rebalance of a single partition.
# R is the current replicas, I is the ISR list, RR is retiringReplicas and AR is addingReplicas
R: [1, 2, 3], I: [1, 2, 3], AR: [], RR: []
# Reassignment called via API with targetReplicas=[3, 4, 5]
R: [1, 2, 3, 4, 5], I: [1, 2, 3], AR: [4, 5], RR: []
R: [1, 2, 3, 4, 5], I: [1, 2, 3, 4], AR: [5], RR: []
# Cancellation called
R: [1, 2, 3, 4], I: [1, 2, 3, 4], AR: [], RR: []

Essentially, once a cancellation is called we subtract AR from R, empty out both AR and RR, and send LeaderAndIsr requests to cancel the replica movements that have not yet completed.
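
The bookkeeping above can be summarized in a short sketch (plain JavaScript, matching the languagejs blocks on this page; the helper names are ours, and the real controller logic also persists state to ZooKeeper and sends LeaderAndIsr requests):

Code Block
languagejs
// Minimal model of the per-partition state transitions described above.
// state = { R: [...], I: [...], AR: [...], RR: [...] }, all arrays of broker IDs.
// ISR changes are driven by partition leaders and are not modeled here.

function startReassignment(state, targetReplicas) {
  // AR = targets not yet in R; RR = current replicas not in the target set.
  const AR = targetReplicas.filter(b => !state.R.includes(b));
  const RR = state.R.filter(b => !targetReplicas.includes(b));
  return { R: [...new Set([...state.R, ...AR])], I: state.I, AR, RR };
}

function onReplicaEnteredIsr(state, broker) {
  // The controller removes the replica from AR; once AR is empty,
  // it retires all of RR in one step.
  const AR = state.AR.filter(b => b !== broker);
  if (AR.length > 0) return { ...state, AR };
  return { R: state.R.filter(b => !state.RR.includes(b)), I: state.I, AR: [], RR: [] };
}

function cancelReassignment(state) {
  // Subtract the in-flight additions from R and drop both maps.
  return { R: state.R.filter(b => !state.AR.includes(b)), I: state.I, AR: [], RR: [] };
}

Running startReassignment with R=[1, 2, 3] and target replicas [4, 3, 2] yields AR=[4] and RR=[1], and onReplicaEnteredIsr for broker 4 then retires replica 1, matching the traces above.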

...

Tool Changes

The kafka-reassign-partitions.sh tool will use the new AdminClient APIs to submit the reassignment plan.  It will not use the old ZooKeeper APIs any more.  In order to contact the admin APIs, the tool will accept a --bootstrap-server argument.

When changing the throttle, we will use IncrementalAlterConfigs rather than directly writing to ZooKeeper.
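
For illustration, the configs involved are the existing replication-throttle settings; a sketch of what the tool would set via IncrementalAlterConfigs (the JSON grouping below is ours, not the RPC's wire format, and the values are hypothetical):

Code Block
languagejs
// Sketch of the throttle settings the tool manages. The config names are the
// existing Kafka replication-throttle configs; the grouping is illustrative.
{
  "brokerConfigs": {
    "leader.replication.throttled.rate": "10000000",
    "follower.replication.throttled.rate": "10000000"
  },
  "topicConfigs": {
    "leader.replication.throttled.replicas": "0:1,0:2",
    "follower.replication.throttled.replicas": "0:4"
  }
}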

Compatibility, Deprecation, and Migration Plan

/admin/reassign_partitions znode

For compatibility purposes, we will continue to allow assignments to be submitted through the /admin/reassign_partitions node.  Just as with the current code, this will only be possible if there are no current assignments.  In other words, the znode has two states: empty and waiting for a write, and non-empty because there are assignments in progress. Once the znode is non-empty, further writes to it will be ignored.

Applications using the old API will not be able to cancel in-progress assignments.  They will also not be able to monitor the status of in-progress assignments, except by polling the znode to see when it becomes empty, which indicates that no assignments are ongoing.  To get these benefits, applications will have to be upgraded to the AdminClient API.

The deprecated ZooKeeper-based API will be removed in a future release.

There is one known edge-case with using both APIs which we will intentionally leave unaddressed:

Code Block
1. znode reassignment involving partitionA and partitionB starts
2. API reassignment involving partitionA starts
3. Controller failover. It will effectively ignore the API reassignment in step 2 and move partitionA to the replicas given by the znode reassignment

There are ways to handle this edge-case, but we propose ignoring it due to the fact that:

  • it is very unlikely for users to use both APIs at once
  • we will be deprecating the znode reassignment API anyway
  • it is not worth the complexity and performance hit from addressing such a rare and seemingly non-critical edge-case

Reassignments while upgrading versions

In the event of a software upgrade, we may end up with a temporarily-stuck reassignment. Imagine the following scenario:

Code Block
Brokers A, B and C
Controller=B
Broker A gets upgraded to 2.4 (which contains this KIP)
Broker B gets upgraded to 2.4. Controller moves to Broker A
A reassignment happens. Broker A sends a LeaderAndIsr request with the new AddingReplicas/RetiringReplicas fields to C, but they are ignored by C, as it does not know about the fields yet.
Broker C gets upgraded to 2.4.

At this point, part of the reassignment is "stuck". Partitions with a target replica on C will never complete until the controller gets restarted and a new LeaderAndIsr request is sent out.

Reassignments while downgrading versions

In the event of a software downgrade, the extra fields in the LeaderAndIsr request will be ignored by brokers running the old code and the reassignment will stop. As soon as the controller gets downgraded, it will send a LeaderAndIsr request with no reassignment fields, which ensures that all brokers will stop their reassignments.


Rejected Alternatives

Store the pending replicas in a single znode rather than in a per-partition znode

We could store the pending replicas in a single znode per cluster, rather than putting them beside the ISR in /brokers/topics/[topic]/partitions/[partitionId]/state.  However, this znode might become very large, which could lead to us running into maximum znode size problems.  This is an existing scalability limitation which we would like to solve.

Another reason for storing the targetReplicas state in the ISR state znode is that this state belongs in the LeaderAndIsrRequest.  By making it clear which replicas are pending and which are existing, we pave the way for future improvements in reassignment.  For example, we will be able to distinguish between reassignment traffic and non-reassignment traffic for quota purposes, and provide better metrics.

...