...

Currently, if a Kafka cluster loses a broker, there is no mechanism to transfer replicas from the failed node to the remaining brokers other than manually triggering ReassignPartitionCommand. This is especially a problem when operating clusters with replication factor = 2: the loss of a single broker means there is effectively no more redundancy, which is not desirable. Automatic rebalancing upon failure is important to preserve redundancy within the cluster. This KIP effectively automates the commands that operators typically perform upon failure.

Public Interfaces

Proposed Changes

...

To address this case, we should check whether all partitions in a task already have the desired number of replicas before starting it; i.e., if Broker X is back online and catching up, there is no point in executing any subsequent jobs that have been queued up. However, if a task is already running, it is simpler to let it complete; since tasks are not very large, they should not take very long. The controller should perform this check, and cancel the task, only for tasks with "isAutoGenerated": true. A minimal sketch of the precheck follows.
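
Here is one way the precheck could look in Java; the Task shape, the partition-to-ISR map, and all names are illustrative assumptions rather than the controller's actual API.

```java
import java.util.Map;
import java.util.Set;

// Illustrative precheck: skip an auto-generated reassignment task whose
// partitions already have their desired replicas in sync. All types and
// names here are hypothetical, not the actual controller code.
public class ReassignmentPrecheck {

    /** Hypothetical queued task: desired replica set per partition. */
    record Task(Map<String, Set<Integer>> desiredReplicas, boolean isAutoGenerated) {}

    /** True when every partition in the task already has its desired replicas in sync. */
    static boolean alreadySatisfied(Task task, Map<String, Set<Integer>> currentIsr) {
        return task.desiredReplicas().entrySet().stream()
                .allMatch(e -> currentIsr.getOrDefault(e.getKey(), Set.of())
                                         .containsAll(e.getValue()));
    }

    /** Only auto-generated tasks are candidates for cancellation. */
    static boolean shouldSkip(Task task, Map<String, Set<Integer>> currentIsr) {
        return task.isAutoGenerated() && alreadySatisfied(task, currentIsr);
    }
}
```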

Deletion of failed_nodes znode

How do we GC the failed_nodes entries? They are deleted in two cases:

  • If the failed broker comes back online.
  • After the reassignment tasks are scheduled, the znode can be deleted, since the tasks themselves are persisted in ZK (see the sketch below).
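
As a sketch of the second case: once the tasks are persisted, the entry can be removed with a plain ZooKeeper delete. The znode path and per-broker layout are assumptions based on this KIP, not an established layout.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Sketch of garbage-collecting a failed-node entry after its reassignment
// tasks are persisted in ZK; the /failed_nodes/<brokerId> path is assumed.
public class FailedNodeGc {

    static void deleteFailedNodeEntry(ZooKeeper zk, int brokerId)
            throws KeeperException, InterruptedException {
        String path = "/failed_nodes/" + brokerId;  // hypothetical path
        try {
            zk.delete(path, -1);  // version -1 matches any znode version
        } catch (KeeperException.NoNodeException ignored) {
            // Already deleted (e.g. by a concurrent controller); safe to ignore.
        }
    }
}
```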

...

There is a case where the controller fails after generating the reassignment tasks but before deleting the corresponding entry in /failed_brokers. There are a couple of ways to solve this:

  1. The new controller can simply compare the assignments with the failed brokers. If the assignment has a reference to the failed broker, the controller knows which failed node created it. If it finds all tasks for a failed node, then it can simply assume the assignments are sufficient and delete the znode for that broker.
  2. It should also be fine to let the new controller simply reschedule the assignment tasks. Since tasks are executed sequentially, the previously scheduled tasks complete before the rescheduled ones. Before scheduling any task, we check whether its assignment is already satisfied; for these duplicates it will be, so the task can be skipped.

Option 2 is simpler.
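
A minimal sketch of option 2, assuming hypothetical task and queue interfaces: blindly rescheduling is safe because the sequential executor skips any task whose assignment is already satisfied.

```java
import java.util.List;
import java.util.function.IntFunction;

// Sketch of controller-failover recovery (option 2). All interfaces are
// illustrative; the point is that rescheduling is idempotent.
public class FailoverRecovery {

    interface Task { boolean alreadySatisfied(); }

    interface TaskQueue { void enqueue(Task task); }

    /** Regenerate and reschedule tasks for every remaining failed-node entry. */
    static void recover(List<Integer> failedBrokers,
                        IntFunction<List<Task>> tasksFor,
                        TaskQueue queue) {
        for (int brokerId : failedBrokers) {
            for (Task task : tasksFor.apply(brokerId)) {
                // Duplicates of already-completed tasks are harmless: they are
                // found satisfied at execution time and skipped.
                queue.enqueue(task);
            }
        }
    }

    /** The sequential executor's skip check. */
    static void execute(Task task, Runnable doReassignment) {
        if (task.alreadySatisfied()) {
            return;  // completed before the previous controller failed
        }
        doReassignment.run();
    }
}
```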

Quotas

Should we throttle replica-bootstrap traffic? This is tangentially related, but we could throttle inter-broker replication traffic to, say, 50% of the available network bandwidth. Regardless of the number of failed replicas, this would impose a hard cap on the total network traffic caused by auto-rebalancing. Given that we plan to move only a small set of partitions at a time, this should not really be a problem; it is a nice-to-have rather than a requirement.
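
If such a throttle were added, it could be as simple as a token bucket capping the bytes that auto-rebalance replication may transfer per second. The sketch below is a generic rate limiter under that assumption, not Kafka's actual quota mechanism.

```java
// Generic token-bucket sketch for capping auto-rebalance replication traffic
// at a fixed byte rate; not Kafka's quota implementation.
public class ReplicationThrottle {

    private final long bytesPerSecond;
    private double availableBytes;
    private long lastRefillNanos = System.nanoTime();

    public ReplicationThrottle(long bytesPerSecond) {
        this.bytesPerSecond = bytesPerSecond;
        this.availableBytes = bytesPerSecond;  // start with a full bucket
    }

    /** Blocks until `bytes` of replication traffic may be sent. */
    public synchronized void acquire(long bytes) throws InterruptedException {
        while (true) {
            long now = System.nanoTime();
            // Refill proportionally to elapsed time, capped at one second's worth.
            availableBytes = Math.min((double) bytesPerSecond,
                    availableBytes + (now - lastRefillNanos) / 1e9 * bytesPerSecond);
            lastRefillNanos = now;
            if (availableBytes >= bytes) {
                availableBytes -= bytes;
                return;
            }
            Thread.sleep(10);  // wait for the bucket to refill
        }
    }
}
```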

...

This KIP does not solve the problem of load-balancing existing assignments within a cluster. It should be possible to ensure even partition distribution whenever new nodes are added; i.e., when expanding a cluster, it would be nice not to have to move partitions onto the new brokers manually. That problem is out of scope here, but the same pattern could be reused to tackle auto-expansion in the future.

Configs

The following configs can be added.

...