Author: Gokul Ramanan Subramanian

Contributors: Alexandre Dupriez, Tom Bentley, Colin McCabe, Ismael Juma, Boyang Chen, Stanislav Kozlovski

Status

Current state: Voting

Discussion thread: here

JIRA: KAFKA-9590

PR: https://github.com/apache/kafka/pull/8499 (currently only prototype, slightly out of date wrt KIP, but gets the idea across)

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

We did some performance experiments to understand the effect of increasing the number of partitions. See Appendix A1 for producer performance, and A2 for topic creation and deletion times. These consistently indicate that having a large number of partitions can lead to a malfunctioning cluster.

Topic creation policy plugins specified via the create.topic.policy.class.name configuration can partially help solve this problem by rejecting requests that result in a large number of partitions. However, these policies cannot produce a replica assignment that respects the partition limits; they can only accept or reject a request. Therefore, we need a more native solution for addressing the problem of partition limit-aware replica assignment. (See rejected alternatives for more details on why the policy approach does not work.)

In order to prevent a cluster from entering a bad state due to a large number of topic partitions, we propose having two configurations (a) max.broker.partitions to limit the number of partitions per broker, and (b) max.partitions to limit the number of partitions in the cluster overall. These can act as guard rails ensuring that the cluster is never operating with a higher number of partitions than it can handle.

...

These limits are cluster-wide. This is obviously true for max.partitions, which is meant to apply at the cluster level. However, we choose this for max.broker.partitions too, instead of supporting different values for each broker. This is in alignment with the current recommendation to run homogeneous Kafka clusters where all brokers have the same specifications (CPU, RAM, disk etc.).
If both limits max.partitions and max.broker.partitions are specified, then the more restrictive of the two applies. It is possible that a request is rejected because it causes the max.partitions limit to be hit without causing any broker to hit the max.broker.partitions limit. The converse is true as well.
These limits can be changed at runtime, without restarting brokers. This provides greater flexibility. See the "Rejected alternatives" section for why we did not go with read-only configuration.
These limits won't apply to topics created via auto topic creation (currently possible via Metadata API requests) until KIP-590. With KIP-590, auto-topic creation will leverage the CreateTopics API, and will have the same behavior as the creation of any other topic.
These limits do not apply to internal topics (i.e. __consumer_offsets and __transaction_state), which usually are not configured with too many partitions. This ensures that any internal Kafka behaviors do not break because of partition limits. The topic partitions corresponding to these internal topics also won't count towards the limit.
These limits do not apply when creating topics or partitions, or reassigning partitions via the ZooKeeper-based admin tools. This is unfortunate, because it does create a backdoor to bypass these limits. However, we leave this out of scope here given that ZooKeeper will eventually be deprecated from Kafka.
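
How the two limits interact can be illustrated with a small pre-check. This is only a sketch under stated assumptions: the class, enum and method names are hypothetical and not part of Kafka, every replica hosted on a broker is counted as one partition towards both limits, and the per-broker test is merely a necessary condition (enough total headroom remains) rather than a full replica assignment.

```java
/** Illustrative pre-check; all names here are hypothetical, not Kafka APIs. */
final class PartitionLimitCheck {

    enum Verdict { OK, EXCEEDS_MAX_PARTITIONS, EXCEEDS_MAX_BROKER_PARTITIONS }

    /**
     * perBrokerCounts[i] is the number of partitions broker i currently hosts.
     * newReplicas is the number of partition replicas the request would add.
     */
    static Verdict check(long[] perBrokerCounts, long newReplicas,
                         long maxBrokerPartitions, long maxPartitions) {
        long total = 0;
        long headroom = 0;
        for (long count : perBrokerCounts) {
            total += count;
            long room = Math.max(0L, maxBrokerPartitions - count);
            // Saturating add: with a limit of Long.MAX_VALUE a plain sum could overflow.
            headroom = (room > Long.MAX_VALUE - headroom) ? Long.MAX_VALUE : headroom + room;
        }
        // Written as a subtraction so that total + newReplicas cannot overflow.
        if (newReplicas > maxPartitions - total) {
            return Verdict.EXCEEDS_MAX_PARTITIONS;
        }
        if (newReplicas > headroom) {
            return Verdict.EXCEEDS_MAX_BROKER_PARTITIONS;
        }
        return Verdict.OK;
    }
}
```

With three brokers hosting 100 partitions each, a request for 30 more replicas is rejected by max.partitions=320 even when max.broker.partitions=200 leaves room on every broker, and it is rejected by max.broker.partitions=105 even when max.partitions=1000 is far away.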

...

Config name	Type	Default	Update-mode
max.broker.partitions	int64	int64's max value (2⁶³ - 1)	cluster-wide
max.partitions	int64	int64's max value (2⁶³ - 1)	cluster-wide

Kafka administrators can specify these in the server.properties file.

They can also use the following to set/modify these configurations via the kafka-config.sh admin tool.

Code Block

language	bash

./kafka-config.sh --bootstrap-server $SERVERS --alter --add-config max.broker.partitions=4000 --entity-type brokers --entity-default
./kafka-config.sh --bootstrap-server $SERVERS --alter --add-config max.partitions=200000 --entity-type brokers --entity-default

...

Code Block

language	bash

./kafka-config.sh --bootstrap-server $SERVERS --alter --add-config max.broker.partitions=4000 --entity-type brokers --entity-name 1
./kafka-config.sh --bootstrap-server $SERVERS --alter --add-config max.partitions=200000 --entity-type brokers --entity-name 1

...

CreateTopics, CreatePartitions, and AlterPartitionReassignments APIs will throw PolicyViolationException and correspondingly the POLICY_VIOLATION(44) error code if it is not possible to satisfy the request while respecting the max.broker.partitions or max.partitions limits. This applies to Metadata requests only in case auto-topic creation is enabled post KIP-590, which will modify the Metadata API to call CreateTopics. We will bump up the version of these APIs by one for new clients.

The actual exception will contain the values of max.broker.partitions and max.partitions in order to make it easy for users to understand why their request got rejected.

Proposed Changes

The following table shows the list of methods that will need to change in order to support the max.broker.partitions and max.partitions configurations. (We skip a few internal methods for the sake of simplicity.)

Method name	Description of what the method does currently	Context in which used	Relevant methods which directly depend on this one	Relevant methods on which this one is directly dependent
`AdminUtils.assignReplicasToBrokers`	Encapsulates the algorithm specified in KIP-36 to assign partitions to brokers on as many racks as possible. This also handles the case when rack-awareness is disabled. This is a pure function without any state or side effects.	API, ZooKeeper-based admin tools	`AdminZkClient.createTopic`, `AdminZkClient.addPartitions`, `AdminManager.createTopics`, `ReassignPartitionsCommand.generateAssignment`	
`AdminZkClient.createTopicWithAssignment`	Creates the ZooKeeper znodes required for topic-specific configuration and replica assignments for the partitions of the topic.	API, ZooKeeper-based admin tools	`AdminZkClient.createTopic`, `AdminManager.createTopics`, `ZookeeperTopicService.createTopic`	
`AdminZkClient.createTopic`	Computes replica assignment using `AdminUtils.assignReplicasToBrokers` and then reuses `AdminZkClient.createTopicWithAssignment`.	API, ZooKeeper-based admin tools	`KafkaApis.createTopic`, `ZookeeperTopicService.createTopic`	`AdminUtils.assignReplicasToBrokers`, `AdminZkClient.createTopicWithAssignment`
`AdminZkClient.addPartitions`	Computes replica assignment using `AdminUtils.assignReplicasToBrokers` when replica assignments are not specified. When replica assignments are specified, uses them as is. Creates the ZooKeeper znodes required for the new partitions with the corresponding replica assignments.	API, ZooKeeper-based admin tools	`AdminManager.createPartitions`, `ZookeeperTopicService.alterTopic`	`AdminUtils.assignReplicasToBrokers`
`AdminManager.createTopics`	Used exclusively by `KafkaApis.handleCreateTopicsRequest` to create topics. Reuses `AdminUtils.assignReplicasToBrokers` when replica assignments are not specified. When replica assignments are specified, uses them as is.	API	`KafkaApis.handleCreateTopicsRequest`	`AdminUtils.assignReplicasToBrokers`, `AdminZkClient.createTopicWithAssignment`
`AdminManager.createPartitions`	Used exclusively by `KafkaApis.handleCreatePartitionsRequest` to create partitions on an existing topic.	API	`KafkaApis.handleCreatePartitionsRequest`	`AdminZkClient.addPartitions`
`KafkaController.onPartitionReassignment`	Handles all the modifications required on ZooKeeper znodes and sending API requests required for moving partitions from some brokers to others.	API	`KafkaApis.handleAlterPartitionReassignmentsRequest` (not quite directly, but the stack trace in the middle is not relevant)	
`KafkaApis.handleCreateTopicsRequest`	Handles the CreateTopics API request sent to a broker, if that broker is the controller.	API		`AdminManager.createTopics`
`KafkaApis.handleCreatePartitionsRequest`	Handles the CreatePartitions API request sent to a broker, if that broker is the controller.	API		`AdminManager.createPartitions`
`KafkaApis.handleAlterPartitionReassignmentsRequest`	Handles the AlterPartitionReassignments API request sent to a broker, if that broker is the controller.	API		`KafkaController.onPartitionReassignment` (not quite directly, but the stack trace in the middle is not relevant)
`KafkaApis.createTopic`	Creates internal topics for storing consumer offsets (__consumer_offsets), and transaction state (__transaction_state). Also used to auto-create topics when topic auto-creation is enabled.	API	`KafkaApis.handleTopicMetadataRequest`, `KafkaApis.handleFindCoordinatorRequest`	`AdminZkClient.createTopic`
`KafkaApis.handleTopicMetadataRequest`	Handles the Metadata API request sent to a broker.	API		`KafkaApis.createTopic` (not quite directly, but the stack trace in the middle is not relevant)
`KafkaApis.handleFindCoordinatorRequest`	Handles the FindCoordinator API request sent to a broker.	API		`KafkaApis.createTopic` (not quite directly, but the stack trace in the middle is not relevant)
`ZookeeperTopicService.createTopic`	Used by the ./kafka-topics.sh admin tool to create topics when --zookeeper is specified. Reuses `AdminZkClient.createTopic` when no replica assignments are specified. Reuses `AdminZkClient.createTopicWithAssignment` when replica assignments are specified.	ZooKeeper-based admin tools		`AdminZkClient.createTopic`, `AdminZkClient.createTopicWithAssignment`
`ZookeeperTopicService.alterTopic`	Used by the ./kafka-topics.sh admin tool to alter topics when --zookeeper is specified. Calls `AdminZkClient.addPartitions` if topic alteration involves a different number of partitions than what the topic currently has.	ZooKeeper-based admin tools		`AdminZkClient.addPartitions`
`ReassignPartitionsCommand.generateAssignment`	Used by the ./kafka-reassign-partitions.sh admin tool to generate a replica assignment of partitions for the specified topics onto the set of specified brokers.	ZooKeeper-based admin tools		`AdminUtils.assignReplicasToBrokers`

For all the methods in the above table that are used in the context of both Kafka API request handling paths and ZooKeeper-based admin tools (`AdminUtils.assignReplicasToBrokers`, `AdminZkClient.createTopicWithAssignment`, `AdminZkClient.createTopic` and `AdminZkClient.addPartitions`), we will pass the values for maximum number of partitions per broker, maximum number of partitions overall, and the current number of partitions for each broker as arguments.

We will modify the core algorithm for replica assignment in the `AdminUtils.assignReplicasToBrokers` method. The modified algorithm will ensure that as replicas are being assigned to brokers iteratively one partition at a time, if assigning the next partition to a broker causes the broker to exceed the max.broker.partitions limit, then the broker is skipped. If all brokers are skipped successively in a row, then the algorithm will terminate and throw PolicyViolationException. The check for max.partitions is much simpler and based purely on the total number of partitions that exist across all brokers.
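
The skip-and-terminate behavior described above can be sketched as follows. This is a deliberately simplified illustration, not the actual `AdminUtils.assignReplicasToBrokers` code: it uses plain round-robin and ignores rack awareness, the class and method names are hypothetical, `PartitionLimitViolation` is a local stand-in for the exception so that the example compiles on its own, and each replica is assumed to count as one partition towards both limits.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, rack-unaware illustration; names are hypothetical. */
final class LimitAwareAssignmentSketch {

    /** Local stand-in for the exception thrown when a limit cannot be respected. */
    static final class PartitionLimitViolation extends RuntimeException {
        PartitionLimitViolation(String message) {
            super(message);
        }
    }

    /** Returns, for each new partition, the indices of the brokers holding its replicas. */
    static List<List<Integer>> assign(long[] currentCounts, int numPartitions, int replicationFactor,
                                      long maxBrokerPartitions, long maxPartitions) {
        int n = currentCounts.length;
        if (replicationFactor > n) {
            throw new IllegalArgumentException("replication factor exceeds broker count");
        }
        long[] counts = currentCounts.clone(); // leave the caller's array untouched
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        // Cluster-wide check: based purely on the total across all brokers.
        long requested = (long) numPartitions * replicationFactor;
        if (requested > maxPartitions - total) {
            throw new PartitionLimitViolation("max.partitions=" + maxPartitions + " would be exceeded");
        }
        List<List<Integer>> assignment = new ArrayList<>();
        int cursor = 0;
        for (int p = 0; p < numPartitions; p++) {
            List<Integer> replicas = new ArrayList<>();
            int skippedInARow = 0;
            while (replicas.size() < replicationFactor) {
                int broker = cursor;
                cursor = (cursor + 1) % n;
                // Skip brokers that are at the limit or already hold a replica of this partition.
                if (counts[broker] >= maxBrokerPartitions || replicas.contains(broker)) {
                    skippedInARow++;
                    if (skippedInARow >= n) { // every broker was skipped in a row
                        throw new PartitionLimitViolation(
                            "max.broker.partitions=" + maxBrokerPartitions + " would be exceeded");
                    }
                    continue;
                }
                replicas.add(broker);
                counts[broker]++;
                skippedInARow = 0;
            }
            assignment.add(replicas);
        }
        return assignment;
    }
}
```

For example, with current counts {1, 0, 0}, max.broker.partitions=1 and two new partitions with a replication factor of 1, broker 0 is skipped and the partitions land on brokers 1 and 2; with counts {1, 1} the same request fails because both brokers are skipped in a row.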

When the methods are invoked in the context of a Kafka API call, we will get the values for the maximum number of partitions per broker by reading the max.broker.partitions configuration from the `KafkaConfig` object (which holds the current value after applying precedence rules on configuration supplied via server.properties and those set via ZooKeeper). Similarly, we will get the maximum number of partitions overall by reading the max.partitions configuration from the `KafkaConfig` object. We will fetch the current number of partitions for each broker from either the `AdminManager` or `KafkaControllerContext` depending on the method.

When the methods are invoked in the context of ZooKeeper-based admin tools, we will set these limits equal to the maximum int64 value that Java can represent. This is basically because it is not easy (and we don't want to make it easy) to get a reference to the broker-specific `KafkaConfig` object in this context. We will also set the object representing the current number of partitions for each broker to None, since it is not relevant when the limits are not specified.

...

This change is backwards-compatible in practice because we will set the default values for max.broker.partitions and max.partitions equal to the maximum int64 value that Java can represent, which is quite large (2⁶³ - 1). Users will anyway run into system issues far before hitting these limits.

...

Similarly, a cluster that already has more than max.partitions partitions at the time the max.partitions configuration is set will continue to function just fine. It will, however, fail any further requests to create topics or partitions. Any reassignment of partitions should work fine.

These soft behaviors are also necessary because (even with this KIP), users can bypass the limit checks by using ZooKeeper-based admin tools.

...

This is in general a more flexible approach than the one described in this KIP and allows having different brokers with different resources, each having its own max.broker.partitions configuration. However, this would require sending the broker-specific configuration to the controller, which needs this while creating topics and partitions or reassigning partitions. One approach would be to put this information into the broker's ZooKeeper znode and have the controller rely on that. The other would be to create a new API request-response that brokers can use to share this information with the controller. Both of these approaches introduce complexity for little gain. We are not aware of any clusters that are running with heterogeneous configurations where having a different max.broker.partitions configuration for each broker would help. Therefore, in this KIP, we do not take this approach.

...

Use configurable topic policies for limiting number of partitions

Kafka allows plugging in custom topic creation policies via the create.topic.policy.class.name configuration. This allows administrators to install a policy that can limit the number of partitions. There are a few downsides to using this approach:

Such a configuration is not available for partition increase or reassignment. Even if we could address that, it would not fix the next problem.
Partition limits are not "yet another" policy configuration. Instead, they are fundamental to partition assignment, i.e. the partition assignment algorithm needs to be aware of the partition limits. To illustrate this, imagine that you have 3 brokers (1, 2 and 3), with 10, 20 and 30 partitions respectively, and a limit of 40 partitions on each broker enforced via the configurable policy class. This leaves legroom for 30, 20 and 10 partitions respectively on the 3 brokers, which adds up to a total legroom of 60 partitions. It should be possible to create a topic with 30 partitions and a replication factor of 2 with this configuration: assign the first 10 partitions to brokers 1 and 3, then assign the next 20 partitions to brokers 1 and 2. While the configurable policy class may accept a topic creation request for 30 partitions with a replication factor of 2 (because it is satisfiable), the non-pluggable partition assignment algorithm (in AdminUtils.assignReplicasToBrokers) has to do the assignment in such a way as to not violate the partition limits.
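
The arithmetic in the three-broker example can be checked mechanically. The following is a small sketch with hypothetical class and method names; brokers 1, 2 and 3 from the text are represented by array indices 0, 1 and 2.

```java
import java.util.Arrays;

/** Worked check of the three-broker example; names are illustrative. */
final class HeadroomExample {

    /** The placement from the text: 10 partitions on brokers 1 and 3, then 20 on brokers 1 and 2. */
    static int[][] assignmentFromText() {
        int[][] assignment = new int[30][];
        for (int p = 0; p < 30; p++) {
            assignment[p] = (p < 10) ? new int[] {0, 2} : new int[] {0, 1};
        }
        return assignment;
    }

    /** Per-broker partition counts after adding one replica per listed broker. */
    static int[] countsAfter(int[] current, int[][] assignment) {
        int[] counts = current.clone();
        for (int[] replicas : assignment) {
            for (int broker : replicas) {
                counts[broker]++;
            }
        }
        return counts;
    }

    /** True if no broker hosts more than maxBrokerPartitions partitions. */
    static boolean withinLimit(int[] counts, int maxBrokerPartitions) {
        for (int count : counts) {
            if (count > maxBrokerPartitions) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] after = countsAfter(new int[] {10, 20, 30}, assignmentFromText());
        System.out.println(Arrays.toString(after) + " within limit: " + withinLimit(after, 40));
    }
}
```

Every broker ends up with exactly 40 partitions, so the request is satisfiable, but only with an assignment that uses up each broker's remaining legroom exactly.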

Basically, partition limits cannot be viewed as a policy on top of topic creation. They are integral to topic creation / partition increase and reassignment.

Add configuration to limit number of partitions that a specific user can create

A large number of partitions can cause performance issues for a Kafka cluster irrespective of which user created those partitions. The focus of the KIP is to prevent the Kafka cluster from entering into a bad state when having a large number of partitions. Therefore, it does not focus on addressing the orthogonal use case of having partition quotas per user in a multi-tenant environment.

Appendix A: Performance with a large number of partitions

Our setup had 3 m5.large EC2 broker instances in 3 different AZs within the same AWS region us-east-1, running Kafka version 2.3.1. Each broker had an EBS GP2 volume attached to it for data storage. All communication was plaintext and records were not compressed. The brokers each had 8 IO threads (num.io.threads), 2 replica fetcher threads (num.replica.fetchers) and 5 network threads (num.network.threads).

...

We did a performance test (using kafka-producer-perf-test.sh from a single m5.4xlarge EC2 instance). On the producer side, each record was 1 KB in size. The batch size (batch.size) and artificial delay (linger.ms) were left at their default values.

...

We can see that leadership resignation times grow much faster than linearly with the number of partitions: going from 10000 to 30000 partitions increases the time from 9 to 100 minutes.

Number of partitions left	30000	20000	10000	5000	1000	100	10
Leadership resignation time (minutes)	100	43	9	3	< 1	< 1	< 1
