Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

We would like to allow users to create topics and partitions even when the current available number of brokers is less than the number of requested replicas, as long as the expected size of the cluster under normal conditions is large enough. As soon as brokers re-join the cluster, the missing replicas will be created. Also a topic should only be created when there are . It would still require at least enough brokers to satisfy the  "min.insync.replicas" configuration of the topic.

To allow that, we propose to keep track of brokers that are part of the cluster in Zookeeper even after they disconnect ("observed brokers"), and use that information to determine an assignment (or to determine if a given assignment is feasible) on handling a topic creation request or partition creation request.

. Placeholder replica ids (-1, -2, -3) will be inserted for the missing replicas.

When a broker joins the cluster, the controller will check if any partitions have placeholders replicas ids and if so assign this broker as a replica (obviosuly only if this broker is not already a replica).

Because this feature can make partitions appear under replicated, it is disabled by default and can be enabled by a broker configuration: enable.under.replicated.topic.creationWhen the number of available brokers is less than the replication factor and no assignment is specified, creating an under-replicated topic/partition can be enabled. For example, creating a topic with replication factor 3, in a 3-zones region with 1 broker per region, but a zone is down or a rolling restart is happening.

In case a topic or partition is created under-replicated, the error code will still be NONE in the CreateTopics and CreatePartition response
s.

...

Administrators can disable (false), or enable (true) creations of topics/partitions without the full replication factor. Defaulting to false for compatibility.

...


New non-ephemeral nodes under /brokers called observed_ids.
Upon starting, brokers will create/update the non ephemeral node for their current id containing the same information as the existing ephemeral brokers/ids node. 

The data in these new nodes (including rack if available) will be used on Topic/Partition creation when the request number of replicas (or the requested broker ids if assignment is specified) are not currently available.


Compatibility, Deprecation, and Migration Plan

  • For compatibility the default behavior remains unaltered: a number of active brokers equals (or greater) to the full replication factor is required for successful creation.

  • No migration is necessary.
  • Updates to the Decommissioning brokers section in the documentation
    • will mention that if a broker id is never to be reused then its corresponding node in zookeeper /brokers/observed_ids will need to be removed if enable.under.replicated.topic.creation is enabled.

Rejected Alternatives

Rejected Alternatives

  • Store "observed brokers" in Zookeeper and assign replicas to them even if they are offline. Because broker registration is implicit, this alternative required storing extra data in Zookeeper and also required administrator to delete znodes when decommisioning brokers. We cannot tell people to modify ZK directly for thisAsync replica assignment: Instead of relying on previously known brokers, we considered creating replicas on the available brokers at topic/partition create time and saving in zookeeper some information to asynchronously create missing replicas on the next suitable brokers that would join the cluster. Such a process would involve the controller checking if missing replicas existed when a broker joins the cluster. This alternative felt too complicated. Also the next joining broker may not be on the preferred rack assignment.
  • Update CreateTopicsRequest and CreatePartitionsRequest to allow users to disable creation of under replicated topics. This option adds an extra configuration to think abobut to end users and it's unclear how many people would want to take advantageo of such a feature.
  • Update schema of CreateTopicsResponse and CreatePartitionsResponse to contain the actual replication factor at creation. Like assignments, it's not data the client can act on.
  • Create a new error code when a topic/partition is created under replicated. As this is not an error case, it's best to keep returning NONE to avoid breaking existing logic.

...