...
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
The current current kafka-reassign-partitons.sh
tool imposes the limitation that only a single batch of partition reassignments can be in-flight, and it is not possible to cancel a reassignment that is in-flight. This has a number of consequences:
...
This change would enable
- Adding more partition reassignments, while some are still in-flight.
- Cancel individual partition reassignments (by reverting the reassignment to the old set of brokers)
- Development of an AdminClient API which supported the above features.
To illustrate the last bullet, consider an AdminClient API for partition reassignment that returns a Future providing access to a ReassignmentPartitionsResult which (implicitly) includes the identity of each partitions Reassignment
. Further AdminClient APIs can be added to:
- query the status of a particular reassignment
- list all current reassignments
- change a current reassignment
- scope a throttle to the duration of a reassignment
Public Interfaces
Strictly speaking this is not a change that would affect any public interfaces (since ZooKeeper is not considered a public interface, and it can be made in a backward compatible way), however since some users are known to operate on the /admin/reassign_partitions
znode directly I felt it was worthwhile using the KIP process for this change.
...
The controller will also watch for changes in the contents of these znodes, and re-initiate reassignment if the content changes. This means it is possible to "cancel" the reassignment of a single partition by changing the reassignment to the old assigned replicas.Note however that the controller itself doesn't remember the old assigned replicas – it is up to the client to record the old state prior to starting a reassignment if cancellation is necessary for that client.
When the ISR includes all the replicas given as the contents of the reassignment znode, ([10,11,15]
in the example) the controller would remove the relevant child of /admin/reassignments
(i.e. at the same time the controller currently updates/removes the /admin/reassign_partitions
znode).
In this way the child nodes of /admin/reassignments
are precisely the replicas currently being reassigned. Moreover the ZooKeeper czxid of a child node identifies a reassignment uniquely. It is then possible to determine whether that exact reassignment is still on-going, or (using the mzxid) has been changed.
Compatibility, Deprecation, and Migration Plan
In order to retain compatibility with existing software that understands /admin/reassign_partitions
the controller would be changed to create the child nodes of /admin/reassignments
when the /admin/reassign_partitions
znode was created.. The code to update and delete /admin/reassign_partitions
would also be retained, so the only difference to a client that operates on /admin/reassign_partitions
would be a slight increase in latency due to the need to create the new znodes. This compatibility behaviour could be dropped in some future version of Kafka, if that was desirable.
...