Status

Current state: Discussion

Discussion thread: here

...

When creating topics or partitions, the Kafka controller has to pick brokers to host the new partitions. The current placement logic is based on a round robin algorithm and supports rack awareness. While this works relatively well in many scenarios, in a few cases the placements it generates are not optimal because it is not aware of the state of the cluster. Many cluster administrators then rely on tools like Cruise Control to move partitions to better brokers. This process is expensive, as data often has to be copied between brokers.

It would be desirable to allow custom logic for the placer to leverage the state of the cluster and minimize the number of partition reassignments necessary. It would enable administrators to build assignment goals (similar to Cruise Control goals) for their clusters.

Some scenarios that could benefit greatly from this feature:

  • When adding brokers to a cluster, Kafka currently does not necessarily place new partitions on new brokers
  • When administrators want to remove brokers from a cluster, there is no way to prevent Kafka from placing partitions on them
  • When some brokers are near their storage/throughput limit, Kafka could avoid putting new partitions on them

...

The proposal is to expose the ReplicaPlacer interface, which is currently internal, as a public interface. It will move from the org.apache.kafka.controller package in the metadata project to the org.apache.kafka.server.placer package in the clients project. Similarly, the existing UsableBroker class will move from the org.apache.kafka.metadata package in the metadata project to the org.apache.kafka.server.placer package in the clients project.

The logic assigning replicas to partitions differs so much between ZooKeeper and KRaft that I propose making this feature only available in KRaft mode.
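
To illustrate the kind of custom logic a pluggable placer could hold, here is a minimal sketch in Java. It does not use the real ReplicaPlacer or UsableBroker signatures: PlacerSketch, BrokerInfo, the partitionCount load metric and the excluded-broker set are all hypothetical stand-ins invented for this example, and rack awareness is left out for brevity. The sketch skips brokers that are being removed and prefers the least loaded brokers, which covers the new-broker and broker-removal scenarios from the Motivation section.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for broker state; NOT Kafka's UsableBroker class.
final class BrokerInfo {
    final int id;
    final int partitionCount; // assumed load metric: replicas already hosted

    BrokerInfo(int id, int partitionCount) {
        this.id = id;
        this.partitionCount = partitionCount;
    }
}

// Hypothetical stand-in for a placer contract; NOT Kafka's ReplicaPlacer interface.
interface PlacerSketch {
    // Returns one replica list (broker ids) per new partition.
    List<List<Integer>> place(int numPartitions, int replicationFactor, List<BrokerInfo> brokers);
}

// Places each partition on the currently least loaded brokers, skipping excluded ones.
final class LeastLoadedPlacer implements PlacerSketch {
    private final Set<Integer> excluded; // e.g. brokers about to be decommissioned

    LeastLoadedPlacer(Set<Integer> excluded) {
        this.excluded = new HashSet<>(excluded);
    }

    @Override
    public List<List<Integer>> place(int numPartitions, int replicationFactor, List<BrokerInfo> brokers) {
        // Running load per eligible broker, updated as replicas are assigned.
        Map<Integer, Integer> load = new HashMap<>();
        for (BrokerInfo b : brokers) {
            if (!excluded.contains(b.id)) {
                load.put(b.id, b.partitionCount);
            }
        }
        if (replicationFactor < 1 || load.size() < replicationFactor) {
            throw new IllegalArgumentException("Not enough eligible brokers for replication factor "
                    + replicationFactor);
        }
        List<List<Integer>> assignments = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            List<Integer> ids = new ArrayList<>(load.keySet());
            // Lowest load first; ties broken by broker id for determinism.
            ids.sort(Comparator.comparingInt((Integer id) -> load.get(id))
                    .thenComparingInt(id -> id));
            List<Integer> replicas = new ArrayList<>(ids.subList(0, replicationFactor));
            for (int id : replicas) {
                load.merge(id, 1, Integer::sum);
            }
            assignments.add(replicas);
        }
        return assignments;
    }

    public static void main(String[] args) {
        // Broker 3 was just added (no load); broker 4 is being removed.
        List<BrokerInfo> brokers = Arrays.asList(
                new BrokerInfo(1, 10), new BrokerInfo(2, 10),
                new BrokerInfo(3, 0), new BrokerInfo(4, 0));
        PlacerSketch placer = new LeastLoadedPlacer(Collections.singleton(4));
        System.out.println(placer.place(2, 2, brokers)); // prints [[3, 1], [3, 2]]
    }
}
```

In this example the new broker 3 receives a replica of both partitions, while broker 4 receives none even though it is empty. A real implementation would also have to respect rack constraints and whatever broker state the public interface ends up exposing.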

Compatibility, Deprecation, and Migration Plan

...

Rejected Alternatives

  • Computing replica placement for the whole create topics/partitions request: Instead of computing assignment for each topic in the CreateTopics/CreatePartitions request one at a time, I looked at computing assignment for all of them in a single call. I rejected this approach for the following reasons:
    • All current logic works on a single topic at a time. Grouping the replica assignment computation created very complicated logic
    • It's not clear if having all topics at once would significantly improve computed assignments. This is especially true for the 4 scenarios listed in the Motivation section
  • Providing more details about the cluster to the placer: Instead of only passing usable brokers, I considered passing a data structure with more details about the cluster, such as Cluster. While this could allow some additional advanced use cases, it would potentially not scale well if we expect Kafka to support a very large number of topics with KRaft.