Proposed changes

Storage format command

The storage format tool must be run when new log directories are added or removed from a broker’s configurationKIP-631 introduced the requirement that storage directories on a node must be formatted. The format subcommand will be updated to ensure each log directory has an assigned UUID and it will persist two new properties in the meta.properties file:

...

When the broker starts up and initializes LogManager, for each configured log directory (in log.dirs ) it will load the UUID for each log directory (directory.id ) and the list of all log directory UUIDs (directory.ids), by reading the meta.properties file at the root of each log directory.

The broker will fail at startup if either:

...

If there are any two log directories with the same UUID,

...

the broker will fail at startup
If there any meta.properties

...

file is missing directory.id

...

a new UUID is generated, and assigned to that log directory by updating the file
If there are no offline log directories the broker will also create or amend the directory.ids field in each meta.properties file as required
If there are offline log directories, the broker will fail at startup if
- There is a mismatch between the number of entries in log.dirs and directory.ids
- There is a mismatch between any two directory.ids
- If the directory.ids field is missing in any of the meta.properties files.

If there are offline log directories, the broker cannot be sure of the the value for directory.ids should be, as it cannot confirm the directory.id for any of the offline log directories.

After loading meta.properties the broker will diff all the UUIDs in directory.id with the

...

b) the full set of log directory UUIDs in directory.ids does not match in all loaded meta.properties files; or

c) the number of log directory UUIDs in directory.ids does not match the number of system paths in the broker configuration log.dirs.

The broker will then diff all the loaded UUIDs in directory.id with the full set of all UUIDs (in directory.ids) to obtain the set of UUIDs for offline log directories. The sets of both online and offline log directory UUIDs are sent along in the broker registration request to the controller. If log directory that holds the cluster metadata topic is configured separately to a different path — using metadata.log.dir — then the respective UUID for this log directory is excluded from both online and offline sets, as the broker cannot run if this particular log directory is unavailable.

...

If the partition is not assigned to a log directory (refers to Uuid.ZERO)
- If the partition already exists, the broker uses the new RPC — AssignReplicasToDirectories — to notify the controller to change the metadata assignment to the actual log directory.
- If the partition is new and , the broker selects a log directory and uses the new RPC — AssignReplicasToDirectories — to notify the controller to create the metadata assignment to the actual log directory.
If the partition is assigned to an online log directory
- If the partition is new it is created in the indicated log directory.
- If the partition already exists in the indicated log directory no action is taken.
- If the partition already exists in another log directory and is a future replica in the log directory indicated by the metadata, the broker will replace the current replica with the future replica.
- If the partition already exists in another log directory, the broker uses the new RPC — AssignReplicasToDirectories — to the controller to change the metadata assignment to the actual log directory. The partition might have been moved to a different log directory. whilst the broker was offline.
If the partition is assigned to an offline log directory, no action is taken — the controller is already aware of this.
If the partition is assigned to an unknown log directory, no action is taken — the controller is already aware of this and will reassign the replica to one of the online log directories in a future metadata update.

When replicas are moved between directories, using the existing AlterReplicaLogDirs RPC, when the future replica has caught up with the current replica, the the receiving broker will send forward the AssignReplicasToDirectories RPC to the controller changing the association to the future , converting log directory paths into log directory . When the cluster metadata update is seen by the broker, the broker replaces the current replica with the future replica.UUIDs. The controller will then perform the reassignment of the replicas and commit new metadata records, which the broker will eventually catch up to. When the broker sees the metadata update with

If the broker is configured with multiple log directories it remains FENCED until it can verify that all partitions are assigned to the correct log directories in the cluster metadata. This excludes the log directory that hosts the cluster metadata topic, if it is configured separately to a different path — using metadata.log.dir.

...

If there are any missing log directories this means those have been removed from the broker’s configuration, so the controller will reassign all replicas currently assigned to the missing log directories to the remaining online log directories in the same broker Uuid.ZERO to delegate the choice of log directory the broker, which will then report the choice via the AssignReplicasToDirectories RPC.
If multiple log directories are registered and the previous registration has no log directories the broker will remain fenced until the controller learns of all the partition to log directory placements in that broker - i.e. no remaining replicas assigned to Uuid.ZERO . The broker will indicate these using the AssignReplicasToDirectories RPC.
If multiple log directories are registered and some of them are new (not present in previous registration) , but the previous registration includes at least one log directory then these log directories are assumed to be empty. If they are not, the broker will use the AssignReplicasToDirectories RPC to correct assignment .and choose not to become UNFENCED before the metadata is correct.

In a future release, broker registration without online log directory UUIDs will be disallowed.

...

Compatibility, Deprecation, and Migration Plan

The metadata.version will be bumped to gate changes to the RPCs and metadata records.

Migrating a cluster in KRaft without JBOD

The controllers are upgraded first, and broker registration is allowed without log directories. One the controllers are upgraded, the brokers can be upgraded. Once upgraded the brokers will register log directories with the controller and use AssignReplicasToDirectories to create the partition-logdirectory assignments in the cluster metadatacluster needs to be upgraded before configuring multiple entries in log.dirs . As the upgraded brokers come up, the existing meta.properties files in each broker are updated with a generated directory.id and directory.ids . After the upgrade, the metadata.version is feature flag needs to be upgraded using kafka-features.sh. Then the brokers can individually be reconfigured with additional log directories. As this will be the first registration indicating multiple entries in log.dirs. Upon being reconfigured with multiple log directories, registering brokers will remain fenced until the log directory assignments are committed to the cluster metadata. Registrations without log directories are still allowed if the previous registration also does not reference any log directory UUID. This allows the controllers to be upgraded into this feature first. In a subsequent version following the release of this feature, we can completely disallow broker registration without log directoriesregister them with the controller via BrokerRegistration and use AssignReplicasToDirectories to create the partition-logdirectory assignments in the cluster metadata before becoming UNFENCED.

Migrating a cluster in ZK mode running with JBOD

Migration into KRaft mode is addressed in KIP-866. That migration is extended in the following way:

As per KIP-866, a separate Controller quorum is setup first, and only then the existing brokers are reconfigured and upgraded.
During the migration, the active controller is still monitoring ZooKeeper for log directory failure notifications and still using full LeaderAndIsr requests to process log directory failures for any brokers still running in ZK modeto process log directory failures for any brokers still running in ZK mode
The brokers restarting into KRaft mode for the first time update meta.properties to generate and include directory.id and directory.ids .
During the migration, the active controller accepts broker registration requests from brokers running in restarting into KRaft mode, indicating multiple online log directories, but keeps brokers fenced until it learns of all partition to log directory assignments in that broker via the new AssignReplicasToDirectories RPC.
During the migration, replicas are assumed and assigned to log directory Uuid.ZERO until the actual log directory is learnt by the active controller from a broker running in KRaft mode.

...

Assumed to live in a broker that isn’t yet running on JBOD, because the cluster was previously running in KRaft mode without this feature, configured with multiple log directories, and so live in a single log directory, even if the UUID for that directory isn’t yet known by the controller. It is not possible to trigger a log directory failure from a broker that has a single log directory, as the broker would simply shut down if there are no remaining online log directories. Or
Assigned to a log directory as of yet unknown, in a broker that remains FENCED. As the broker remains FENCED it cannot assume leadership for any partition, and so a log directory failure would be handled by the current partition leader.

...

The changes to storage formatting simply ensure the existence of two new types of metadata files at the log directory roots. The storage format command will need to run when new log directories are configured, but also when log directories are removed.

Test plan

The system test for log directory failures will be extended to KRaft mode.

...

Keeping the scope of the log directory to the broker — while this would mean a much simpler change, as was proposed in KIP-589, if only the broker itself knows which partitions were assigned to a log directory, when a log directory fails the broker will need to send a potentially very large request enumerating all the partitions in the failed disk, so that the controller can update leadership and ISRs accordingly.
Having the controller determine the log directory , and leave that up to the broker for new replicas — this would avoid require a further RPC from the broker upon selecting a new log directory for new replicas, and reduce the time until it is safe for the broker to take leadership of the replica. However the broker is in a better position to make a choice of log directory than the broker, as it has easier access to e.g. disk usage in each log directory. The controller could also have this information if the broker were to include it the broker registration. But to keep the design simple, this optimization is best left for future work.
Changing how log directory failure is handled in ZooKeeper mode — ZooKeeper mode is going away, KIP-833 proposed its deprecation in a near future release.
Using the system path to identify each log directory, or storing the identifier somewhere else — When running Kafka with multiple log directories, each log directory is typically assigned to a different system disk or volume. The same storage device can be made accessible under a different mount, and Kafka should be able to identify the contents as the same disk. Because the log directory configuration can change on the broker, the only reliable way to identify the log directory in each broker is to add metadata to the file system under the log directory itself.
Fail on broker startup if the cluster metadata indicates replicas should be in a different log directory than the directory where the broker actually finds them — Despite not being an advertised feature, currently replicas can be moved between log directories while the broker is offline. Once the broker comes back up it accepts the new location of the replica. To continue supporting this feature, the broker will need to compare the information in the cluster metadata with the actual replica location during startup and take action on any mismatch.
Not keeping a list of all configured log directory identifiers in a metadata file in the file system — If a broker finds some log directories to be offline during startup it will need another way to identify the log directory as the metadata is in the disk itself. For this reason, the list of all log directory IDs must also be persisted in somewhere that the broker can expected to be able to access. Regardless of how many log directories are configured, one of the log directories will be configured to host the cluster metadata topic. This can either be one of the data log directories or a completely separate one — typically in the system disk. Since the broker will not startup if this log directory is not available, it is the perfect location to persist the list of all log directory identifiers.

...

Space shortcuts

Child pages

Versions Compared

Old Version 15

New Version 16

Key

Proposed changes

Storage format command

Compatibility, Deprecation, and Migration Plan

Migrating a cluster in KRaft without JBOD

Migrating a cluster in ZK mode running with JBOD

Test plan

Space shortcuts

Child pages

Page History

Versions Compared

Old Version 15

New Version 16

Key

Proposed changes

Storage format command

Compatibility, Deprecation, and Migration Plan

Migrating a cluster in KRaft without JBOD

Migrating a cluster in ZK mode running with JBOD

Test plan