...

The current kafka-reassign-partitions.sh tool imposes the limitation that only a single batch of partition reassignments can be in-flight, and it is not possible to cancel an in-flight reassignment cleanly, safely, and in a timely fashion (e.g. as reported in KAFKA-6304, the current way of cancelling a reassignment requires a lot of manual steps). This has a number of consequences:

  1. Reassignments, especially for large topics/partitions, are costly. In some cases, the performance of the Kafka cluster can be severely impacted when reassignments are kicked off. There should be a fast, clean, and safe way to cancel and roll back pending reassignments. For example, with original replicas [1,2,3] and new replicas [4,5,6], if the reassignment causes a performance impact on leader 1, it should be possible to cancel the reassignment immediately, revert to the original replicas [1,2,3], and drop the new replicas.
  2. If users need to do a large-scale reassignment, they end up having to do the reassignment in smaller batches so that they can abort the overall reassignment sooner if operationally necessary.
  3. Each batch of reassignments takes as long as its slowest partition; this slowest partition prevents other reassignments from happening. This can happen even when reassignments are submitted with similarly sized topics/partitions grouped into each batch. How to optimally group reassignments into batches for faster execution and less impact on the cluster is beyond the scope of this KIP.
  4. The ZooKeeper-imposed limit of 1MB on znode size places an upper limit on the number of reassignments that can be done at a given time.
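
To give a rough feel for the znode size limit in item 4, the following sketch estimates how many partition entries fit in one reassignment payload. It is only an illustration under stated assumptions: the limit is taken as 1024 * 1024 bytes, the JSON is serialized compactly, and the topic name "example-topic" and the three-broker replica list are hypothetical. Real payloads vary with topic name length and replication factor.

```python
import json

# Assumption: the 1MB znode limit from the text, taken here as 1024 * 1024 bytes.
ZNODE_LIMIT_BYTES = 1024 * 1024


def payload_size(num_partitions, topic="example-topic", replicas=(1, 2, 3)):
    """Size in bytes of a compact reassignment JSON with num_partitions entries.

    The topic name and replica list are hypothetical placeholders.
    """
    partitions = [
        {"topic": topic, "partition": i, "replicas": list(replicas)}
        for i in range(num_partitions)
    ]
    doc = {"version": 1, "partitions": partitions}
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


def max_partitions_under_limit(limit=ZNODE_LIMIT_BYTES, **kwargs):
    """Largest entry count whose payload still fits within limit bytes."""
    lo, hi = 0, 1
    # Grow hi until the payload no longer fits.
    while payload_size(hi, **kwargs) <= limit:
        hi *= 2
    # Binary search: payload_size(lo) fits, payload_size(hi) does not.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if payload_size(mid, **kwargs) <= limit:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    print(max_partitions_under_limit())
```

Longer topic names or larger replica lists lower the count, which is why the limit becomes noticeable for large-scale reassignments.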

...

  1. Note that in a real production environment, it is better to do reassignments in batches, with a reasonable number of reassignments in each batch. A large number of reassignments tends to cause higher producer latency. Between batches, proper staggering and throttling are recommended.

This change would enable:

  • Cancelling all pending reassignments currently in /admin/reassign_partitions and reverting them to their original replicas.
  • Disabling reassignments on the Kafka cluster while a znode (e.g. /admin/reassign_cancel) is present. This is helpful for production clusters that are sensitive to reassignments (e.g. those with min.insync.replicas > 1), as it prevents accidentally starting a reassignment on the cluster.
  • Adding more partition reassignments while some are still in-flight. Even though the intent in the original design of the reassign tool was for the znode (/admin/reassign_partitions) not to be updated by the tool unless it was empty, there are user requests to support such a feature, e.g. KAFKA-7854.
  • Cancelling individual partition reassignments (by reverting the reassignment to the old set of brokers).
  • Development of an AdminClient API which supports the above features.

To illustrate the last bullet, consider an AdminClient API for partition reassignment that returns a KafkaFuture providing access to some ReassignmentPartitionsResult which (implicitly) includes the identity of each partition's Reassignment. Further AdminClient APIs could then be added to:

  • query the status of a particular Reassignment
  • list all current Reassignments
  • change a current Reassignment
  • scope a throttle to the duration of a Reassignment

...


Public Interfaces

Strictly speaking, this is not a change that would affect any public interfaces (since ZooKeeper is not considered a public interface, and the change can be made in a backward-compatible way). However, some users are known to operate on the /admin/reassign_partitions znode directly, and such usage can break in a new version of Kafka (e.g. as reported in KAFKA-7854), so I felt it was worthwhile using the KIP process for this change.

The existing /admin/reassign_partitions znode gains an "original_replicas" field for each partition, to support rolling the topic partition back to its originally assigned replicas. How "original_replicas" gets populated will be discussed in detail later.

```json
{"version": 1,
 "partitions": [{"topic": "foo1",
                 "partition": 0,
                 "replicas": [1,2,4],
                 "original_replicas": [1,2,3]
                },
                {"topic": "foo2",
                 "partition": 1,
                 "replicas": [5,6,8],
                 "original_replicas": [7,9,8]
                }]
}
```
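
As a minimal sketch of how "original_replicas" could support a rollback, the code below takes a payload in the format above and derives the reverted assignment, plus the newly added replicas that would be dropped. The function names build_rollback and replicas_to_drop are hypothetical and used only for illustration; this is not the controller's actual logic, just one way to read the data.

```python
import json


def build_rollback(reassignment_json):
    """Return a payload that reverts each partition to its original replicas.

    Hypothetical helper: it only rewrites the JSON, it does not touch ZooKeeper.
    """
    data = json.loads(reassignment_json)
    reverted = []
    for p in data["partitions"]:
        original = p.get("original_replicas")
        if original is None:
            # Without recorded original replicas there is nothing to revert to.
            raise ValueError(
                "%s-%d has no original_replicas" % (p["topic"], p["partition"])
            )
        reverted.append(
            {"topic": p["topic"], "partition": p["partition"], "replicas": original}
        )
    return {"version": data["version"], "partitions": reverted}


def replicas_to_drop(partition):
    """Replicas introduced by the reassignment that are absent from the original set."""
    original = set(partition["original_replicas"])
    return [r for r in partition["replicas"] if r not in original]


# The example payload from the text.
PENDING = """
{"version": 1,
 "partitions": [
   {"topic": "foo1", "partition": 0,
    "replicas": [1,2,4], "original_replicas": [1,2,3]},
   {"topic": "foo2", "partition": 1,
    "replicas": [5,6,8], "original_replicas": [7,9,8]}
 ]}
"""

if __name__ == "__main__":
    print(json.dumps(build_rollback(PENDING)))
    for part in json.loads(PENDING)["partitions"]:
        print(part["topic"], part["partition"], replicas_to_drop(part))
```

For foo1 partition 0, reverting restores [1,2,3] and drops broker 4; for foo2 partition 1, reverting restores [7,9,8] and drops brokers 5 and 6, while broker 8 stays because it belongs to both sets.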

Proposed Changes

The main idea is to move away from using the single /admin/reassign_partitions znode to control reassignment.

...