Versions Compared

Comment: Simplify, make OfflineLogDirs boolean

...

When multiple log.dirs are configured, a new property, directory.id, will be expected in the meta.properties file in each log directory configured under log.dirs. The property indicates the UUID for the log directory where the file is located. If any of the meta.properties files does not contain directory.id, one will be randomly generated and the file will be updated upon Broker startup. The kafka-storage.sh tool will be extended to generate this property as described in the previous section.
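The startup behaviour described above can be sketched as follows, assuming meta.properties has already been parsed into a plain dictionary. The helper names are hypothetical, and the encoding (unpadded URL-safe base64 of 16 random bytes, giving 22 characters) is an assumption based on the shape of the IDs shown in this document, not a statement about the broker's implementation.

```python
import base64
import uuid
from typing import Dict, Tuple


def new_directory_id() -> str:
    """Random 128-bit ID rendered as unpadded URL-safe base64 (assumed encoding)."""
    raw = uuid.uuid4().bytes
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def ensure_directory_id(props: Dict[str, str]) -> Tuple[Dict[str, str], bool]:
    """Return (properties, changed). Hypothetical helper, not a Kafka API.

    If directory.id is already present the properties are returned untouched;
    otherwise a copy with a freshly generated directory.id is returned and
    `changed` tells the caller that the file needs to be rewritten.
    """
    if props.get("directory.id"):
        return props, False
    updated = dict(props)
    updated["directory.id"] = new_directory_id()
    return updated, True
```

A caller would write the file back only when `changed` is true, so an existing directory.id is never replaced.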

If the log directory that holds the cluster metadata topic is configured separately to a different path (using metadata.log.dir), then this log directory does not get a UUID assigned.

Footnote

The broker cannot run if this particular log directory is unavailable, and when configured separately it cannot host any user partitions, so there's no point in identifying it in the Controller.


Metadata records

RegisterBrokerRecord and BrokerRegistrationChangeRecord will both have two new fields:

{ "name": "OnlineLogDirs", "type":  "[]uuid", "versions":  "3+", "taggedVersions": "3+", "tag": "0",
"about": "Log directories configured in this broker which are available." },
{ "name": "OfflineLogDirs", "type": "bool", "versions": "3+",
"about": "Whether any log directories configured in this broker are not available." }

...

{ "name": "OnlineLogDirs", "type":  "[]uuid", "versions":  "2+", "taggedVersions": "2+", "tag": "0",
"about": "Log directories configured in this broker which are available." },
{ "name": "OfflineLogDirs", "type": "bool", "versions": "2+",
"about": "Whether any log directories configured in this broker are not available." }

BrokerHeartbeatRequest will include the following new field:

...

The format subcommand will be updated to ensure each log directory has an assigned UUID and it will persist a new property directory.id in the meta.properties file:

...

when multiple log.dirs are configured. The value is base64 encoded, like the cluster UUID.

...

The meta.properties version field will stay set to 1, to allow for a downgrade after an upgrade on a non-JBOD KRaft cluster.

...

Having a persisted UUID at the root of each log directory allows the broker to identify the log directory regardless of the mount path.

Example

Given the following server.properties:

...

#
#Thu Aug 18 15:23:07 BST 2022
node.id=8
version=1
cluster.id=41QSStLtR3qOekbX4ZlbHA
directory.id=e6umYSUsQyq7jUUzL9iXMQ
/mnt/d1/meta.properties :
#
#Thu Aug 18 15:23:07 BST 2022
node.id=8
version=1
cluster.id=41QSStLtR3qOekbX4ZlbHA
directory.id=b4d9ExdORgaQq38CyHwWTA
/mnt/d2/meta.properties :
#
#Thu Aug 18 15:23:07 BST 2022
node.id=8
version=1
cluster.id=41QSStLtR3qOekbX4ZlbHA
directory.id=P2aL9r4sSqqyt7bC0uierg

Each directory, including the directory that holds the cluster metadata topic (metadata.log.dir), has a different and respective value as the directory ID.

In the example above, we can identify the following directory mapping:

  • /var/lib/kafka/metadata has log directory UUID e6umYSUsQyq7jUUzL9iXMQ
  • /mnt/d1 has log directory UUID b4d9ExdORgaQq38CyHwWTA
  • /mnt/d2 has log directory UUID P2aL9r4sSqqyt7bC0uierg

Brokers

Broker lifecycle management

When the broker starts up and initializes LogManager, if multiple log.dirs are configured, it will load the UUID for each log directory (directory.id) by reading the meta.properties file at the root of each of them.

  • If there are any two log directories with the same UUID, the broker will fail at startup
  • If there are any meta.properties files missing directory.id, a new UUID is generated and assigned to that log directory by updating the file

The set of all loaded log directory UUIDs is sent along in the broker registration request to the controller as the OnlineLogDirs field. If any configured log directory is unavailable, OfflineLogDirs is set to true.
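As an illustration of how the two registration fields could be derived at startup, here is a minimal sketch. It assumes the broker has already attempted to read directory.id from every configured log directory, recording None for any directory it could not read. The LogDirRegistration type and build_registration function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LogDirRegistration:
    online_log_dirs: List[str]  # corresponds to the OnlineLogDirs field
    offline_log_dirs: bool      # corresponds to the OfflineLogDirs field


def build_registration(loaded: Dict[str, Optional[str]]) -> LogDirRegistration:
    """Derive the registration fields from per-directory load results.

    `loaded` maps each configured log directory path to the directory.id
    read from its meta.properties, or None if the directory is unavailable.
    """
    online = [dir_id for dir_id in loaded.values() if dir_id is not None]
    # Two log directories sharing a UUID is a fatal startup error.
    if len(set(online)) != len(online):
        raise RuntimeError("duplicate directory.id across log directories")
    offline = any(dir_id is None for dir_id in loaded.values())
    return LogDirRegistration(online_log_dirs=online, offline_log_dirs=offline)
```

Note that the sketch never needs to know the UUID of an unavailable directory; only its existence is reported.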

Metadata caching

Replicas are considered offline if the hosting broker is offline, or if the hosting broker's registration flags offline log directories and the replica references none of the registered online log directories.
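The rule above can be written as a small predicate. This is only a sketch: BrokerView and is_replica_offline are hypothetical names, and directory UUIDs are represented as plain strings.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class BrokerView:
    online: bool                     # whether the hosting broker is online
    online_log_dirs: FrozenSet[str]  # registered OnlineLogDirs
    offline_log_dirs: bool           # registered OfflineLogDirs flag


def is_replica_offline(broker: BrokerView, replica_dir: str) -> bool:
    """Apply the offline rule to a single replica."""
    if not broker.online:
        return True
    # Only when the broker flags offline directories does the replica's
    # directory assignment matter.
    return broker.offline_log_dirs and replica_dir not in broker.online_log_dirs
```

Under this rule, a replica whose directory is not yet known stays online as long as the broker reports no offline log directories.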

Handling log directory failures

...

Because the broker is proactive in communicating any log directory assignment changes to the controller, the metadata should be up to date and correct when the controller is notified of a failed log directory. However, the consequences of some partition assignment being incorrect (due to some error or race condition) can be quite damaging, as the controller might not know to update leadership for that partition, leaving it unavailable for an indefinite amount of time. So, as a fallback mechanism, when handling a runtime directory failure, the broker must verify the assignments for the newly failed partitions against the latest metadata, and for any incorrect assignments, the broker will use AlterReplicaLogDirs to assign them to Uuid.ZERO so that the controller can update leadership and ISR.
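The fallback check can be pictured as a comparison between what the broker knows locally and what the metadata says. The sketch below is an assumption-laden illustration: the function name, the string representation of partitions and UUIDs, and the text form of the all-zero UUID are all hypothetical.

```python
from typing import Dict, List

UUID_ZERO = "A" * 22  # assumed text form of the all-zero UUID


def corrections_for_failed_dir(
    failed_dir: str,
    hosted_in_failed_dir: List[str],
    metadata_assignments: Dict[str, str],
) -> Dict[str, str]:
    """Return the reassignments the broker should send after a directory failure.

    `hosted_in_failed_dir` lists the partitions the broker knows were stored
    in the failed directory. Any of them that the metadata assigns to a
    different directory is mapped to UUID_ZERO so the controller treats the
    replica as offline.
    """
    return {
        partition: UUID_ZERO
        for partition in hosted_in_failed_dir
        if metadata_assignments.get(partition) != failed_dir
    }
```

When the metadata is already correct the result is empty and no extra request is needed.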

Replica management

When configured with multiple log.dirs, as the broker catches up with metadata and sees the partitions which it should be hosting, it will check the associated log directory UUID for each partition.

...

For any new partitions, the active controller will use Uuid.ZERO as the initial value for log directory UUID for each replica. Each broker with multiple log.dirs hosting replicas then assigns a log directory UUID and communicates it back to the active controller using the new RPC AssignReplicasToDirs so that cluster metadata can be updated with the log directory assignment.

...

  • Persist a BrokerRegistrationChange record with the new list of online log directories and the updated offline log directories flag.
  • Update the Leader and ISR for all the replicas assigned to the failed log directories, persisting PartitionChangeRecords, in a similar way to how leadership and ISR are updated when a broker becomes fenced, unregistered or shuts down.

If any of the listed log directory UUIDs is not a registered log directory then the call fails with error 57 (LOG_DIR_NOT_FOUND).

Handling replica assignments

The controller accepts the AssignReplicasToDirs RPC and persists the assignment into metadata records. If the indicated log directory UUID is not a registered log directory then the call fails with error 57 — LOG_DIR_NOT_FOUND .

If the indicated log directory UUID is Uuid.ZERO, then the replica is considered offline and the leader and ISR are updated accordingly, same as when the BrokerHeartbeat indicates a new offline log directory. This should only happen in the exceptional case that a Broker's metadata cache shows an incorrect assignment for some replica during the handling of a failure for the actual directory that hosts that replica.
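The two cases for AssignReplicasToDirs can be sketched as a single handler. Everything here is illustrative: the function name, the return shape and the string form of the all-zero UUID are hypothetical, and checking for the zero UUID before checking registration is an assumption about ordering that the text does not spell out.

```python
from typing import Dict, Set, Tuple

UUID_ZERO = "A" * 22    # assumed text form of the all-zero UUID
NONE = 0
LOG_DIR_NOT_FOUND = 57  # error code named in the text above


def handle_assignment(
    registered_dirs: Set[str],
    assignments: Dict[str, str],
    replica: str,
    dir_id: str,
) -> Tuple[int, bool]:
    """Return (error_code, replica_is_offline) and record valid assignments."""
    if dir_id == UUID_ZERO:
        # Exceptional path: the replica is treated as offline.
        assignments[replica] = UUID_ZERO
        return NONE, True
    if dir_id not in registered_dirs:
        return LOG_DIR_NOT_FOUND, False
    assignments[replica] = dir_id
    return NONE, False
```

A rejected request leaves the stored assignment untouched, so a bad UUID from a broker cannot overwrite a previously valid one.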

Broker registration

Upon a broker registration request the controller will persist the broker registration as cluster metadata including the online log directory list and offline log directories flag for that broker. The controller may receive a new list of online directories and offline log directories flag — different from what was previously persisted in the cluster metadata for the requesting broker.

  • If there are no indicated online log directory UUIDs the request is invalid and the controller replies with an error — INVALID_REQUEST.
  • If the offline log directories flag is false and there are any missing log directories this means those have been removed from the broker’s configuration, so the controller will reassign all replicas currently assigned to the missing log directories to Uuid.ZERO to delegate the choice of log directory to the broker, which will then report the choice via the AssignReplicasToDirs RPC.
  • If multiple log directories are registered the broker will remain fenced until the controller learns of all the partition to log directory placements in that broker - i.e. no remaining replicas assigned to Uuid.ZERO . The broker will indicate these using the AssignReplicasToDirs RPC.

    • The broker remains fenced by not wanting to unfence itself in heartbeat requests until the number of mismatching replica to log directory assignments is zero. This number is represented by the new metric NumMismatchingReplicaToLogDirAssignments.
  • If multiple log directories are registered and some of them are new (not present in previous registration) then these log directories are assumed to be empty. If they are not, the broker will use the AssignReplicasToDirs  RPC to correct assignment and choose not to become UNFENCED before the metadata is correct.
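The registration rules above can be summarised in a short sketch. The function name, the return shape and the string form of the all-zero UUID are hypothetical, and the fencing decision is simplified to a single boolean computed at registration time, whereas the text describes it as something the broker and controller maintain across heartbeats.

```python
from typing import Dict, List, Optional, Tuple

UUID_ZERO = "A" * 22  # assumed text form of the all-zero UUID


def handle_registration(
    previous_dirs: List[str],
    online_dirs: List[str],
    offline_flag: bool,
    assignments: Dict[str, str],
) -> Tuple[Optional[str], bool]:
    """Return (error, must_stay_fenced); `assignments` is updated in place."""
    if not online_dirs:
        return "INVALID_REQUEST", True
    if not offline_flag:
        # With nothing offline, directories that disappeared were removed
        # from the configuration: hand the placement choice to the broker.
        removed = set(previous_dirs) - set(online_dirs)
        for replica, dir_id in assignments.items():
            if dir_id in removed:
                assignments[replica] = UUID_ZERO
    pending = sum(1 for dir_id in assignments.values() if dir_id == UUID_ZERO)
    must_stay_fenced = len(online_dirs) > 1 and pending > 0
    return None, must_stay_fenced
```

In this sketch `pending` plays the role of the count that the NumMismatchingReplicaToLogDirAssignments metric is described as exposing.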

...

The cluster needs to be upgraded before configuring multiple entries in log.dirs. As the upgraded brokers come up, the existing meta.properties  files in each broker are updated with a generated directory.id  and directory.ids . After the upgrade, the metadata.version feature flag needs to be upgraded using kafka-features.sh. Then the brokers can be reconfigured with multiple entries in log.dirs.

Upon being reconfigured with multiple log directories, brokers will update and generate directory.id in meta.properties as necessary to reflect the new log directories. Brokers will then register the log directories with the controller via BrokerRegistration and use AssignReplicasToDirs to create the partition to log directory assignments in the cluster metadata before becoming UNFENCED.

...

  • As per KIP-866, a separate Controller quorum is set up first, and only then the existing brokers are reconfigured and upgraded.
  • When configured for the migration and while still in ZK mode, brokers will:
    • update meta.properties to generate and include directory.id  and directory.ids;
    • send BrokerRegistrationRequest including the log directory UUIDs;
    • notify the controller of log directory failures via BrokerHeartbeatRequest.
  • During the migration, the controller:
    • persists log directories indicated in broker registration requests in the cluster metadata;
    • relies on heartbeat requests to detect log directory failure instead of monitoring the ZK znode for notifications;
    • still uses full LeaderAndIsr requests to process log directory failures for any brokers still running in ZK mode.
  • The brokers restarting into KRaft mode will want to stay fenced until their log directory assignments for all hosted partitions are persisted in the cluster metadata.
  • The active controller will also ensure that any given broker stays fenced until it learns of all partition to log directory assignments in that specific broker via the new AssignReplicasToDirs RPC.
  • During the migration, replicas are assumed and assigned to log directory Uuid.ZERO until the actual log directory is learnt by the active controller from a broker running in KRaft mode.

...

  • Partition reassignment across directories and across brokers involves different API calls — AlterPartitionReassignments and AlterReplicaLogDirs. Whilst reassigning partitions across brokers into a specific log directory is already possible, it involves an intricate sequence of prior calls to AlterReplicaLogDirs and expecting errors as a successful result. Once this work is done we can consolidate these two API calls by extending AlterPartitionReassignments to allow target log directories to be specified and deprecate AlterReplicaLogDirs. This can be done as part of a future KIP.
  • The only way to know which log directory UUID corresponds to which log directory path is by reading the meta.properties  files in each broker. A future KIP should expand the DescribeLogDirs RPC response to include log directory UUIDs along with the system path for each log directory.
  • Partition initialization can be optimized by having the controller preselect a log directory for new partitions. This would avoid having to wait for the broker to send an AssignReplicasToDirs request to indicate the chosen log directory before it is safe for the broker to assume leadership of the partition. Maybe the Controller could also take available storage in each log directory into account if the Broker indicates the available storage space for each log directory as part of broker registration. This may be proposed in a future KIP.

...