Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Clarify ZK -> KRaft migration

...

As the broker catches up with metadata, and sees the partitions which it should be hosting, it will check the associated log directory UUID for each partition.

  • If the partition is associated with not assigned to a log directory UUID with value (refers to Uuid.ZERO)
    • If the partition already exists, the broker uses the new RPC ASSIGN_REPLICAS_TO_DIRECTORIES — to notify the controller to change the metadata assignment to the actual log directory.
    • If the partition is new and the broker only has one log directory configured, it will place the replica there
  • The broker errors, and fails to place the replica if has more than one log directory
  • If the partition doesn’t yet exist, it is created in the designated log directory.
  • If any partitions already exist, but the hosting log directories do not match the cluster metadata
  • If there is a future replica in the log directory indicated by the metadata, the broker will replace the current replica with the future replica
    • .
    • If the partition is new and the broker has multiple log directories configured, the broker selects a log directory and uses the
  • Otherwise, the broker uses a
    • new RPC — ASSIGN_REPLICAS_TO_DIRECTORIES — to notify the controller to
  • change
    • create the metadata
  • association
    • assignment to the actual log directory.
  • The broker will not create the log for the partition until the log directory indicated in the cluster metadata matches the actual log directory.

...

  • If the partition is assigned to an online log directory
    • If the partition is new it is created in the indicated log directory.
    • If the partition already exists in the indicated log directory no action is taken.
    • If the partition already exists in another log directory and is a future replica in the log directory indicated by the metadata, the broker will replace the current replica with the future replica.
    • If the partition already exists in another log directory, the broker uses the new RPC — ASSIGN_REPLICAS_TO_DIRECTORIES

...

    • to the controller

...

    • to change the

...

    • metadata assignment to the

...

    • actual log directory. The partition might have been moved to a different log directory.

...

    • whilst the broker

...

    • was offline. 
  • If the partition is assigned to an offline log directory, no action is taken — the controller is already aware of this.
  • If the partition is assigned to an unknown log directory, no action is taken — the controller is already aware of this and will reassign the replica to one of the online log directories in a future metadata update. 

When replicas are moved between directories, using the existing ALTER_REPLICA_LOG_DIRS RPC, when the future replica has caught up with the current replica, the broker will send the ASSIGN_REPLICAS_TO_DIRECTORIES RPC to the controller changing the association to the future log directory. When the cluster metadata update is seen by the broker, the broker replaces the current replica with the future replica.

If the broker is configured with multiple log directories it remains FENCED until it can verify that all partitions are assigned to the correct log directories in the cluster metadata.

Metadata caching

Replicas are considered offline if the replica references a log directory which is not in the list of online log directories for the broker ID hosting the replica.

Handling log directory failures

When one or more log directories becomes offline, the broker will communicate this change using a new RPC — LOG_DIRECTORIES_OFFLINE — indicating the UUIDs of the new offline log directories.

Controller

Replica placement

For any new partitions, the active controller will select not only the broker

Metadata caching

Replicas are considered offline if the replica references a log directory which is not in the list of online log directories for the broker ID hosting the replica.

Handling log directory failures

When one or more log directories becomes offline, the broker will communicate this change using a new RPC — LOG_DIRECTORIES_OFFLINE — indicating the UUIDs of the new offline log directories.

Controller

Replica placement

For any new partitions, the controller will select not only the broker ID for each replica, but also a log directory in those brokers, from their list of online log directories. For any of the brokers which do not have associated log directories in the metadata, the special value of Uuid.ZERO is assumed and used to indicate no log directory.

This special zero value is only used to enable the transition of cluster into this feature. Once the cluster has been upgraded to support JBOD in KRaft, in KRaft the Uuid.ZERO value is no longer used. In a future release, Uuid.ZERO should not be allowed and all registered brokers should have a list of at least one online log directory UUID.
If the controller selected a log directory for a new replica and that log directory is now offline, the broker will choose a new one and invoke one of the new RPCs — ASSIGN_REPLICAS_TO_DIRECTORIES — to correct the log directory assignment for the replica in the cluster metadata and delay creating the replica until the next cluster metadata is updated by the controllerhave a list of at least one online log directory UUID.

Handling log directory failures

...

  • If there are any missing log directories this means those have been removed from the broker’s configuration, so the controller will reassign all replicas currently assigned to the missing log directories to the remaining online log directories in the same broker.
  • If there are new offline log directories, the controller will handle them as log directory failures, updating all leadership and ISR for partition with replicas on those offline log directories, persisting new PartitionChangeRecords, in similar way to how leadership and ISR is updated when a broker becomes fenced, unregistered or shuts down.
  • same broker.
  • If multiple log directories are registered and the previous registration has no log directories the broker will remain fenced until the controller learns of all the partition to log directory placements in that broker.  The broker will indicate these using the ASSIGN_REPLICAS_TO_LOG_DIRS RPC.

  • If multiple log directories are registered and some of them are new (not present in previous registration), but the previous registration includes at least one log directory then these log directories are assumed to be empty. If they are not, the broker will use ASSIGN_REPLICAS_TO_LOG_DIRS to correct assignment.
  • If a single log directory is registered and the previous registration has no log directories — indicating the broker is configured with a single log directory — the controller can assign all replicas in the broker to that log directory. This avoids having to wait for the broker to send ASSIGN_REPLICA_TO_LOG_DIRS RPCs to create the log directory assignmentIf the previous registration lists no log online directories the registration request is rejected unless it specifies a single online log directory. The controller will then assign all replicas in that broker to the specified log directory.

In a future release, broker registration without online log directories directory UUIDs will be disallowed.

Brokers whose registration indicate that multiple log directories are configured remain FENCED until all log directory assignments for that broker are learnt by the active controller and persisted into metadata.

Compatibility, Deprecation, and Migration Plan

...

The controllers are upgraded first, and broker registration is allowed without log directories. One the controllers are upgraded, the brokers can be upgraded, while still using a single log directory. Once upgraded the brokers will register the single log directory with the controller which will assign all replicas to that log directory. Then the brokers can individually be reconfigured with additional log directories.

When first registering with log directories, a broker can should only register a single log directory. This allows the controller to assign all existing replicas in that broker to the single log directorythe single log directory.

Registrations without log directories are still allowed if the previous registration also does not reference any log directory UUID. This allows the controllers to be upgraded into this feature first. In a subsequent version following the release of this feature, we can completely disallow broker registration without log directories.

Migrating a cluster in ZK mode running with JBOD

...

Migration into KRaft mode is addressed in KIP-866. That migration is extended in the following way:

  • During the migration, the active controller is still monitoring ZooKeeper for log directory failure notifications and still using full LeaderAndIsr requests to process log directory failures for any brokers still running in ZK mode
  • During the migration, the active controller accepts broker registration requests from brokers running in KRaft mode, indicating multiple online log directories, but keeps brokers fenced until it learns of all partition to log directory assignments in that broker via the new ASSIGN_REPLICAS_TO_DIRECTORIES

...

  •  RPC.
  • During the migration, replicas are assumed and assigned to log directory Uuid.ZERO until the actual log directory is learnt by the active controller from a broker running in KRaft mode.

Replica management

Existing replicas without a log directory are assumed either:

  • Assumed to live in a broker that isn’t yet running on JBOD, because the cluster was previously running in KRaft mode without this feature, and so live in a single log directory, even if the UUID for that directory isn’t yet known by the controller. It is not possible to trigger a log directory failure from a broker that has a single log directory, as the broker would simply shut down if there are no remaining online log directories. Or
  • Assigned to a log directory as of yet unknown, in a broker that remains FENCED. As the broker remains FENCED it cannot assume leadership for any partition, and so a log directory failure would be handled by the current partition leader.

Storage formatting

The changes to storage formatting simply ensure the existence of two new types of metadata files at the log directory roots. The storage format command will need to run when new log directories are configured, but also when log directories are removed.

...