Current state: Draft

Discussion thread: here [Change the link from the KIP proposal email archive to your own email thread]

JIRA: here [Change the link from KAFKA-1 to your own ticket]

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Currently when handling topics or partitions creation requests, Kafka enforces all replicas to be created in order to fulfill the request. While all other functionalities (produce/consume) are fault tolerant and can handle some brokers down, topics and partitions creation stop working as soon as there are no enough replicas available. In small clusters, when one node is unavailable, for example when a broker is being restarted, it's possible that there are not enough alive replicas online brokers to satisfy topic/partition creation even though there are enough brokers to satisfy the "min.insync.replicas" configuration of the topic.

This is mostly impacting clusters with 3 brokers for which
For example, in a 3 node cluster, while a rolling restart is happening, users can't create topics with replication factor 3 . In a 4 node cluster, while a node is down, a rolling restart also prevent topics with replication factor 3 from being created.

Another example: in a 9 node cluster with 3 nodes in each zone, if 1 zone was to go offline, the cluster would still contain enough nodes (6) to host a topic with replication factor 3. However, in some environments it may still be preferable to only assign 2 replicas to currently alive nodes (in 2 zones) and assign the last replica to a broker in the unavailable zone that is expected to come back online later.

can't be created while a rolling restart is happening. This makes this cluster size unsuitable for environments requiring maximum availability and forces administrators to deploy at least 4 brokers. In case of a stretch cluster spanning 3 availability zones, which is now a relatively common deployment, this forces administrators to have 6 brokers.

This features will allow small clusters to stay fully available over rolling restarts and basic maintenance operationsThe same consideration exists for adding partitions to existing topics (CreatePartitions API).

Proposed Changes

We would like to allow users to create topics and partitions even when the current available number of brokers is less than the number of requested replicas, as long as the expected size of the cluster under normal conditions is large enough. As soon as brokers re-join the cluster, the missing replicas will be created.

To allow that, we propose to keep track of brokers that are part of the cluster in zookeeper even after they disconnect ("observed brokers"), and use that information to determine an assignment (or to determine if a given assignment is feasible) on handling a topic creation request or partition creation request.

When the number of available brokers is sufficient and no assignment is specified, creating an under-replicated topic/partition may be preferred: e.g. replication factor 3, in a 3-zones region with 2 brokers per region, but a zone is down.
In such case the create request could make use of all observed brokers rather than the currently online ones.

...

Public Interfaces

Network protocol and api behavior

Code Block

title	CreateTopicsRequest

CreateTopics Request (Version: 4) => [create_topic_requests] timeout validate_only 
  create_topic_requests => topic num_partitions replication_factor [replica_assignment] [config_entries] min_required_replication_factor
    topic => STRING
    num_partitions => INT32
    replication_factor => INT16
    replica_assignment => partition [replicas] 
      partition => INT32
      replicas => INT32
    config_entries => config_name config_value 
      config_name => STRING
      config_value => NULLABLE_STRING
    min_required_replication_factor => INT16  //new field
  timeout => INT32
  validate_only => BOOLEAN

Code Block
CreateTopicsResponse (Version: 4) //same schema as version 3

Code Block

title	CreatePartitionsRequest

CreatePartitions Request (Version: 2) => [topic_partitions] timeout validate_only 
  topic_partitions => topic new_partitions 
    topic => STRING
    new_partitions => count [assignment] min_required_replication_factor
      count => INT32
      assignment => ARRAY(INT32)
      min_required_replication_factor => INT16  //new field
  timeout => INT32
  validate_only => BOOLEAN

Code Block
CreatePartitionsResponse (Version: 2) //same schema as version 1

Error Codes

...

ERROR

...

CODE

...

RETRIABLE

...

DESCRIPTION

...

CREATED_UNDER_REPLICATED

...

79

...

False

...

Code Block

language	java
title	Errors

package org.apache.kafka.common.protocol;
public enum Errors {
...
    CREATED_UNDER_REPLICATED(79, "The server created a topic or partitions without the full set of replicas being active.",
        CreatedUnderReplicatedException::new);
}

AdminClient APIs

The AdminClient will be updated to expose the new fields.

Code Block

language	java
title	NewTopic

// Keep existing constructors but add a getter/setter pair 
// to specify the minimum required replication factor if different from full 
package org.apache.kafka.clients.admin;
public class NewTopic {
    ....   
    // return `this` for chaining 
    public NewTopic minRequiredReplicationFactor(short requiredReplicationFactor) {}
    public short minRequiredReplicationFactor() {}
}

...

language	java
title	NewPartition

...

replicas. It would still require at least enough brokers to satisfy the "min.insync.replicas" configuration of the topic. This guarantees producers will be able to use the topic/partition regardless of their acks setting, ie the lack of replicas will be invisible to all users. Because it's possible to create a topic with less replicas than "min.insync.replicas", the actual requirement will be min(min.insync.replicas, replicas).

When handling a CreateTopics or CreatePartitions request, in case not enough brokers are available to satisfy the replication factor, placeholder replica ids (-1, -2, -3) will be inserted for the missing replicas. In case a topic or partition is created under-replicated, the error code will still be NONE in the CreateTopics and CreatePartition responses.

When a new broker joins the cluster, the controller will check if any partitions have placeholders replicas ids and if so assign this broker as a replica (only if this broker is not already a replica).
This may lead to some topics/partitions not optimally spread across all racks, but note that this may already happen (without this KIP) when topics/partitions are created while all brokers in a rack are offline (ie: an availability zone is offline). Tracking topics/partitions not optimally spread across all racks can be tackled in a follow up KIP.

Because this feature can make partitions appear under replicated, it is disabled by default and can be enabled by a broker configuration: enable.under.replicated.topic.creation.

Public Interfaces

Broker Configuration

NAME	DESCRIPTION	TYPE	DEFAULT	VALID VALUES	IMPORTANCE	DYNAMIC UPDATE MODE
enable.under.replicated.topic.creation.policy	Behavior on Topic or Partition creation without the full set of replicas being active.	STRINGBOOLEAN	disabledfalse	true, falsedisabled, enabled, enabled_prefer_observed	Medium	cluster-wide

Administrators can disable (disabledfalse), or enable (enabledtrue) creations of topics/partitions without the full replication factor. When enabled a sub-option is to always prefer using all known brokers and their racks and have some under replication, rather than using currently available online brokers (enabled_prefer_observed)

Command line tools and arguments

New optional argument to the kafka-topics script:

...

Defaulting to false for compatibility.

...

Zookeeper changes:

New non-ephemeral node under /brokers called observed_ids.
Upon starting brokers will create/update the non ephemeral node for their current id containing the same information as the existing ephemeral brokers/ids node.

The data in these new nodes (including rack if available) will be used on Topic/Partition creation when the request number of replicas (or the requested broker ids if assignment is specified) are not currently available.

Compatibility, Deprecation, and Migration Plan

For compatibility the default behavior remains unaltered: a number of active online brokers greater or equals (or greater) to the full replication factor is required for successful creation.
No migration is necessary.
Updates to the Decommissioning brokers section in the documentation

will mention that if a broker id is never to be reused then its corresponding node in zookeeper /brokers/observed_ids will need to be removed manually

Rejected Alternatives

Async replica assignment: Instead of relying on previously known brokers, we considered creating replicas on the available brokers at topic/partition create time and saving in zookeeper some information to asynchronously create missing replicas on the next suitable brokers that would join the cluster. Such a process would involve the controller checking if missing replicas existed when a broker joins the cluster. This alternative felt too complicated. Also the next joining broker may not be on the preferred rack assignment.

Rejected Alternatives

Store "observed brokers" in Zookeeper and assign replicas to them even if they are offline. Because broker registration is implicit, this alternative required storing extra data in Zookeeper and also required administrators to delete znodes when decommisioning brokers. We cannot tell people to modify ZK directly for this.
Update CreateTopicsRequest and CreatePartitionsRequest to allow users to disable creation of under replicated topics. This option adds an extra configuration to think about to end users and it's unclear how many people would want to take advantage of such a featureInstead of specifying a minimum replication factor on creation, we could have used a boolean to allow creation at min-isr size only. This approach looked a bit less flexible but not significantly simpler.
Update schema of CreateTopicsResponse and CreatePartitionsResponse to contain the actual replication factor at creation. Like assignments, it's not data the client can act on.
Create a new error code when a topic/partition is created under replicated. As this is not an error case, it's best to keep returning NONE to avoid breaking existing logic We think the error message associated with the new error code can suffice.

Space shortcuts

Child pages

Versions Compared

Old Version 10

New Version Current

Key

Motivation

Proposed Changes

Public Interfaces

Network protocol and api behavior

Error Codes

AdminClient APIs

Public Interfaces

Broker Configuration

Command line tools and arguments

Zookeeper changes:

Compatibility, Deprecation, and Migration Plan

Rejected Alternatives

Rejected Alternatives

Space shortcuts

Child pages

Page History

Versions Compared

Old Version 10

New Version Current

Key

Motivation

Proposed Changes

Public Interfaces

Network protocol and api behavior

Error Codes

AdminClient APIs

Public Interfaces

Broker Configuration

Command line tools and arguments

Zookeeper changes:

Compatibility, Deprecation, and Migration Plan

Rejected Alternatives

Rejected Alternatives