...
The current kafka-reassign-partitions.sh tool imposes the limitation that only a single batch of partition reassignments can be in flight, and an in-flight reassignment cannot be cancelled cleanly, safely, and in a timely fashion (e.g. as reported in KAFKA-6304, cancelling a reassignment currently requires a lot of manual steps). This has a number of consequences:
- Reassignments, especially of large topics/partitions, are costly. In some cases, the performance of the Kafka cluster can be severely impacted when reassignments are kicked off. There should be a fast, clean, and safe way to cancel and roll back pending reassignments. For example, with original replicas [1,2,3] and new replicas [4,5,6], if the reassignment causes a performance impact on leader 1, it should be possible to cancel it immediately, reverting to the original replicas [1,2,3] and dropping the new replicas.
- If users need to do a large-scale reassignment, they end up having to do it in smaller batches so that they can abort the overall reassignment sooner if operationally necessary.
- Each batch of reassignments takes as long as its slowest partition; this slowest partition prevents other reassignments from happening. This can happen even when reassignments are submitted by grouping topics/partitions of similar size into each batch. How to optimally group reassignments into one batch for faster execution and less impact on the cluster is beyond the scope of this KIP.
- The ZooKeeper-imposed limit of 1MB on znode size places an upper limit on the number of reassignments that can be done at a given time.
...
- Note that in a real production environment, it is better to do reassignments in batches, with a reasonable number of reassignments in each batch. A large number of reassignments tends to cause higher producer latency. Between batches, proper staggering and throttling are recommended.
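To make the interaction between batching and the 1MB znode limit concrete, here is a minimal Python sketch that splits a list of reassignment entries into batches whose serialized JSON stays under a byte limit. It is only an illustration: the function names (split_into_batches, payload_size) are hypothetical, the compact JSON serialization is an assumption and may not match what the tool actually writes, and the effective limit depends on how ZooKeeper is configured.

```python
import json

# Assumption: the 1MB znode size limit mentioned above, expressed in bytes.
ZNODE_LIMIT_BYTES = 1024 * 1024

# Size of the payload with an empty "partitions" list (compact separators).
ENVELOPE_BYTES = len(json.dumps({"version": 1, "partitions": []},
                                separators=(",", ":")).encode("utf-8"))


def entry_size(entry):
    """UTF-8 size in bytes of one serialized partition entry."""
    return len(json.dumps(entry, separators=(",", ":")).encode("utf-8"))


def payload_size(partitions):
    """UTF-8 size in bytes of a full reassignment payload."""
    doc = {"version": 1, "partitions": partitions}
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


def split_into_batches(partitions, limit=ZNODE_LIMIT_BYTES):
    """Greedily group entries so that each batch serializes to <= limit bytes."""
    batches, current, size = [], [], ENVELOPE_BYTES
    for entry in partitions:
        # Each entry after the first in a batch also costs one comma separator.
        cost = entry_size(entry) + (1 if current else 0)
        if current and size + cost > limit:
            batches.append(current)
            current, size = [], ENVELOPE_BYTES
            cost = entry_size(entry)
        current.append(entry)
        size += cost
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    entries = [{"topic": "foo1", "partition": i, "replicas": [1, 2, 4]}
               for i in range(50000)]
    batches = split_into_batches(entries)
    print(len(batches), [payload_size(b) for b in batches])
```

A size-based split like this only addresses the znode limit; it says nothing about grouping partitions for faster execution, which the KIP leaves out of scope.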
This change would enable:
- Cancelling all pending reassignments currently in /admin/reassign_partitions and reverting them back to their original replicas.
- Disabling reassignments of the Kafka cluster when a znode (e.g. /admin/reassign_cancel) is present. This is helpful for production clusters (e.g. those with min.insync.replicas > 1) that are sensitive to reassignments, and it prevents accidentally starting a reassignment on the Kafka cluster.
- Adding more partition reassignments while some are still in flight. Even though in the original design of the reassign tool the intent was for the znode (/admin/reassign_partitions) not to be updated by the tool unless it was empty, there are user requests to support such a feature, e.g. KAFKA-7854.
- Cancelling individual partition reassignments (by reverting the reassignment to the old set of brokers).
- Development of an AdminClient API which supports the above features.
To illustrate the last bullet, consider an AdminClient API for partition reassignment that returns a KafkaFuture providing access to some ReassignmentPartitionsResult which (implicitly) includes the identity of each partition's Reassignment. Further AdminClient APIs could then be added to:
- query the status of a particular Reassignment
- list all current Reassignments
- change a current Reassignment
- scope a throttle to the duration of a Reassignment
...
Public Interfaces
Strictly speaking this is not a change that would affect any public interfaces (since ZooKeeper is not considered a public interface, and the change can be made in a backward-compatible way). However, since some users are known to operate on the /admin/reassign_partitions znode directly, and this can break in new versions of Kafka (e.g. as reported in KAFKA-7854), I felt it was worthwhile using the KIP process for this change.
The existing /admin/reassign_partitions znode gains an "original_replicas" field, which supports rolling a topic partition back to its originally assigned replicas. How "original_replicas" gets populated will be discussed in detail later.
```json
{"version":1,
 "partitions":[{"topic": "foo1",
                "partition": 0,
                "replicas": [1,2,4],
                "original_replicas": [1,2,3]
               },
               {"topic": "foo2",
                "partition": 1,
                "replicas": [5,6,8],
                "original_replicas": [7,9,8]
               }]
}
```
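As a rough illustration of how "original_replicas" could drive a rollback, the following Python sketch derives, from a pending reassignment document of the shape shown above, the assignment that would restore each partition and the new replicas that would be dropped. The helper names (build_rollback, replicas_to_drop) are hypothetical, and skipping entries that lack "original_replicas" is an assumption; this is not the controller's actual logic.

```python
import json


def build_rollback(reassignment):
    """Return a document that maps each partition back to its original replicas."""
    partitions = []
    for p in reassignment["partitions"]:
        original = p.get("original_replicas")
        if original is None:
            # Assumption: with no recorded original assignment, skip the entry.
            continue
        partitions.append({
            "topic": p["topic"],
            "partition": p["partition"],
            "replicas": list(original),
        })
    return {"version": reassignment["version"], "partitions": partitions}


def replicas_to_drop(entry):
    """Replicas in the new assignment that are absent from the original one."""
    original = set(entry["original_replicas"])
    return [r for r in entry["replicas"] if r not in original]


# The example payload from this section.
pending = json.loads("""
{"version":1,
 "partitions":[{"topic": "foo1", "partition": 0,
                "replicas": [1,2,4], "original_replicas": [1,2,3]},
               {"topic": "foo2", "partition": 1,
                "replicas": [5,6,8], "original_replicas": [7,9,8]}]
}
""")

rollback = build_rollback(pending)
print(json.dumps(rollback))
for entry in pending["partitions"]:
    print(entry["topic"], entry["partition"], "drop:", replicas_to_drop(entry))
```

For foo1 partition 0 only broker 4 is new, so a rollback would drop that single replica; for foo2 partition 1, brokers 5 and 6 would be dropped while broker 8 stays because it belongs to both assignments.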
Proposed Changes
The main idea is to move away from using the single /admin/reassign_partitions znode to control reassignment.
...