Status

Current state: Under Discussion

Discussion thread: here

JIRA:

Released: 

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

If a partition reassignment involves many replicas, it can put too much overhead on the brokers.

Say you have a replication factor of 4 and you trigger a reassignment that moves all replicas to new brokers. Now 8 replicas are fetching at the same time, which means you need to account for 8 times the current producer load plus the catch-up replication. To make matters worse, the replicas won't all become in-sync at the same time; in the worst case, 7 replicas could be in-sync while one is still catching up. Currently, the old replicas won't be disabled until all new replicas are in-sync. This makes configuring the throttle tricky, since in-sync traffic is not subject to it.

Rather than trying to bring all 4 new replicas online at the same time, a friendlier approach is to do it incrementally: bring one new replica online, wait until it is in-sync, then remove one of the old replicas. Repeat until all replicas have been replaced. This reduces the impact of a reassignment and makes the throttle easier to configure, at the cost of a slower overall reassignment.

Public Interfaces

No existing public interface will be changed.

Proposed Changes

As explained above, the goal is to add new replicas incrementally, one at a time, to avoid putting too much pressure on the brokers. The new strategy therefore creates a list of steps to take, where each step adds exactly one new replica and may drop more than one old replica. It also delays the new leader election by performing the change that requires a leader election last.

Algorithm

Calculating a Reassignment Step

  1. For calculating a reassignment step, the final target replica (FTR) set and the current replica (CR) set are always used.
  2. Calculate the replicas to be dropped (DR):
    1. Calculate n = size(CR) - size(FTR)
    2. Filter those replicas from CR that are not in FTR; this is the excess replica (ER) set
    3. Sort the ER set so that the leader is last (this ensures the leader is removed only when necessary)
    4. Take the first n replicas of ER; these will be the set of dropped replicas
  3. Calculate the new replica (NR) to be added by selecting the first replica from FTR that is not in CR
  4. Create the target replica (TR) set: CR + NR - DR
  5. If this is the last step, order the replicas as specified by FTR. This means that the last step is always equal to FTR (a sketch of the whole calculation follows this list).
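
To make the calculation concrete, here is a minimal, self-contained Java sketch of the step calculation. The class and names used (ReassignmentStepCalculator, calculateStep, Step) are illustrative only and do not exist in the Kafka codebase; they simply mirror the pseudocode above.

Reassignment Step Calculation (illustrative sketch)
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Illustrative sketch of the step calculation above; not actual Kafka code.
final class ReassignmentStepCalculator {

    // One reassignment step: the replicas to drop (DR), the replica to add
    // (NR, empty on the last step) and the resulting target replica (TR) set.
    record Step(List<Integer> drop, List<Integer> add, List<Integer> targetReplicas) {}

    static Step calculateStep(List<Integer> cr, List<Integer> ftr, int leader) {
        // ER: replicas in CR that are not in FTR, sorted so the leader is last.
        List<Integer> er = new ArrayList<>(cr);
        er.removeAll(ftr);
        er.sort((a, b) -> Boolean.compare(a == leader, b == leader));

        // n = size(CR) - size(FTR); DR is the first n replicas of ER.
        int n = Math.max(0, cr.size() - ftr.size());
        List<Integer> dr = new ArrayList<>(er.subList(0, Math.min(n, er.size())));

        // NR: the first replica of FTR not yet in CR (at most one per step).
        List<Integer> nr = new ArrayList<>();
        for (int replica : ftr) {
            if (!cr.contains(replica)) {
                nr.add(replica);
                break;
            }
        }

        // TR = CR + NR - DR.
        List<Integer> tr = new ArrayList<>(cr);
        tr.addAll(nr);
        tr.removeAll(dr);

        // Last step: TR holds exactly the FTR members, so take over FTR's order.
        if (new HashSet<>(tr).equals(new HashSet<>(ftr))) {
            tr = new ArrayList<>(ftr);
        }
        return new Step(dr, nr, tr);
    }
}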

Performing a Reassignment Step

In the big picture, performing a reassignment step is similar to the currently existing algorithm (a high-level driver sketch follows the list below).

  1. Wait until CR is entirely in ISR. This will make sure that we're starting off with a solid base for reassignment.
  2. Calculate the next reassignment step as described above based on the reassignment context. 

  3. Wait until all brokers in the target replicas (TR) of the reassignment step are alive. This will make sure that reassignment starts only when the target brokers can perform the actual reassignment step.

  4. If we have new replicas in ISR from the previous step, change the state of those replicas to OnlineReplica

  5. Update CR in Zookeeper with TR: with this, the DR set will be dropped and the NR set will be added.

  6. Send a LeaderAndIsr request to all replicas in CR + NR so that they are notified of the Zookeeper events.

  7. Start new replicas in NR by moving them to NewReplica state.

  8. Set CR to TR in memory.

  9. Send a LeaderAndIsr request with a potential new leader (if the current leader is not in TR), the new CR (using TR), and the same ISR to every broker in TR

  10. Move the replicas in DR to OfflineReplica state (force those replicas out of the ISR)

  11. Move the replicas in DR to NonExistentReplica state (force those replicas to be deleted)

  12. Update the /admin/reassign_partitions path in ZK to remove this partition.

  13. After electing the leader, the replica and ISR information changes, so resend the UpdateMetadata request to every broker
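
The following Java sketch shows how these steps could drive a full reassignment of one partition, reusing the calculateStep sketch from above. Every helper method here is a hypothetical placeholder for controller logic, not a real Kafka API, and the sketch assumes step 12 runs once the partition has reached FTR.

Performing Reassignment Steps (illustrative sketch)
import java.util.ArrayList;
import java.util.List;

// Illustrative driver for steps 1-13 above; all helpers are hypothetical
// placeholders for controller logic, not real Kafka APIs.
abstract class IncrementalReassignmentDriver {

    void reassign(String partition, List<Integer> ftr) {
        List<Integer> cr = currentAssignment(partition);
        while (!cr.equals(ftr)) {
            waitUntilAllInIsr(partition, cr);                       // 1: solid base
            ReassignmentStepCalculator.Step step =                  // 2: next step
                ReassignmentStepCalculator.calculateStep(cr, ftr, leaderOf(partition));
            waitUntilBrokersAlive(step.targetReplicas());           // 3
            moveNewIsrMembersToOnline(partition);                   // 4
            writeAssignmentToZk(partition, step.targetReplicas());  // 5: CR := TR in ZK
            sendLeaderAndIsr(partition, union(cr, step.add()));     // 6: notify CR + NR
            startNewReplicas(partition, step.add());                // 7: -> NewReplica
            cr = step.targetReplicas();                             // 8: CR := TR in memory
            electLeaderIfNeededAndNotify(partition, cr);            // 9
            forceOutOfIsr(partition, step.drop());                  // 10: -> OfflineReplica
            deleteReplicas(partition, step.drop());                 // 11: -> NonExistentReplica
            broadcastUpdateMetadata(partition);                     // 13: replica/ISR info changed
        }
        removeFromReassignmentPath(partition);                      // 12: reassignment complete
    }

    private static List<Integer> union(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>(a);
        for (int r : b) {
            if (!out.contains(r)) out.add(r);
        }
        return out;
    }

    // Hypothetical controller operations; all signatures are assumptions.
    abstract List<Integer> currentAssignment(String partition);
    abstract int leaderOf(String partition);
    abstract void waitUntilAllInIsr(String partition, List<Integer> replicas);
    abstract void waitUntilBrokersAlive(List<Integer> brokers);
    abstract void moveNewIsrMembersToOnline(String partition);
    abstract void writeAssignmentToZk(String partition, List<Integer> tr);
    abstract void sendLeaderAndIsr(String partition, List<Integer> replicas);
    abstract void startNewReplicas(String partition, List<Integer> newReplicas);
    abstract void electLeaderIfNeededAndNotify(String partition, List<Integer> tr);
    abstract void forceOutOfIsr(String partition, List<Integer> dropped);
    abstract void deleteReplicas(String partition, List<Integer> dropped);
    abstract void broadcastUpdateMetadata(String partition);
    abstract void removeFromReassignmentPath(String partition);
}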

Example

The following code block shows how a transition happens from (0, 1, 2) to (3, 4, 5) when the initial leader is 0.

Reassignment Example
 (0, 1, 2)     // starting assignment
     |
(0, 1, 2, 3)   // +3
     |
(0, 2, 3, 4)   // -1 +4
     |
(0, 3, 4, 5)   // -2 +5
     |
 (3, 4, 5)     // -0, new leader (3) is elected, requested order is matched, reassignment finished

Let's take a closer look at the step above that takes (0, 1, 2, 3) to (0, 2, 3, 4):

Calculating a Reassignment Step
FTR = (3, 4, 5)
CR = (0, 1, 2, 3)
 
n = size(CR) - size(FTR)  // 1
ER = CR - FTR             // (0, 1, 2)
ER = order(ER)            // (1, 2, 0)
DR = takeFirst(ER, n)     // (1)
 
NR = first(FTR - CR)      // 4
TR = CR + NR - DR         // (0, 2, 3, 4)
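
Plugging these values into the calculateStep sketch from earlier produces the same result (again, the names are illustrative only and assume the sketch class defined above):

Worked Step Calculation (illustrative sketch)
// CR = (0, 1, 2, 3), FTR = (3, 4, 5), current leader = 0
ReassignmentStepCalculator.Step step =
    ReassignmentStepCalculator.calculateStep(List.of(0, 1, 2, 3), List.of(3, 4, 5), 0);

step.drop();            // [1]
step.add();             // [4]
step.targetReplicas();  // [0, 2, 3, 4]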

Compatibility, Deprecation, and Migration Plan

In the first release of this feature, the plan is to keep batch mode as the default while recommending incremental mode. In a subsequent major version, the default would change to incremental, with batch mode kept around.

Test Plan

  • The existing unit tests will be parameterized so that they run in both modes
  • Extra unit tests will be added to cover cases that the existing tests do not
  • Ducktape tests will be parameterized to run in both modes
  • Extra ducktape tests could be added to cover additional cases where needed

Rejected Alternatives

None so far.
