...
When multiple log.dirs are configured, a new property, directory.id, will be expected in the meta.properties file in each log directory configured under log.dirs. The property indicates the UUID for the log directory where the file is located. If any of the meta.properties files does not contain directory.id, one will be randomly generated and the file will be updated upon broker startup. The kafka-storage.sh tool will be extended to generate this property as described in the previous section.
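As a rough illustration of the startup behaviour described above, the following Python sketch fills in a missing directory.id in an already parsed meta.properties mapping. The function names are hypothetical, and the encoding (22 characters of unpadded URL-safe base64) is an assumption based on the shape of the example UUIDs in this document; it is not the broker's actual implementation.

```python
import base64
import uuid


def new_directory_id() -> str:
    """Return a random 128-bit UUID as unpadded URL-safe base64 (assumed format)."""
    return base64.urlsafe_b64encode(uuid.uuid4().bytes).rstrip(b"=").decode("ascii")


def ensure_directory_id(properties: dict) -> bool:
    """Add directory.id to a parsed meta.properties mapping if it is missing.

    Returns True when the mapping changed and the file would need rewriting.
    """
    if "directory.id" in properties:
        return False
    properties["directory.id"] = new_directory_id()
    return True


# Hypothetical parsed contents of a meta.properties file without directory.id.
props = {"node.id": "8", "version": "1", "cluster.id": "41QSStLtR3qOekbX4ZlbHA"}
changed = ensure_directory_id(props)
```

A second call on the same mapping leaves it untouched, which mirrors the idea that an existing directory.id is never regenerated.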
If the log directory that holds the cluster metadata topic is configured separately to a different path, using metadata.log.dir, then this log directory does not get a UUID assigned.
Footnote |
---|
The broker cannot run if this particular log directory is unavailable, and when configured separately it cannot host any user partitions, so there's no point in identifying it in the Controller. |
Metadata records
RegisterBrokerRecord and BrokerRegistrationChangeRecord will both have two new fields:
{ "name": "OnlineLogDirs", "type": "[]uuid", "versions": "3+", "taggedVersions": "3+", "tag": "0",
"about": "Log directories configured in this broker which are available." },
{ "name": "OfflineLogDirs", "type": "bool", "versions": "3+",
"about": "Whether any log directories configured in this broker are not available." }
...
{ "name": "OnlineLogDirs", "type": "[]uuid", "versions": "2+", "taggedVersions": "2+", "tag": "0",
"about": "Log directories configured in this broker which are available." },
{ "name": "OfflineLogDirs", "type": "bool", "versions": "2+",
"about": "Whether any log directories configured in this broker are not available." }
BrokerHeartbeatRequest will include the following new field:
...
The format subcommand will be updated to ensure each log directory has an assigned UUID, and it will persist a new property, directory.id, in the meta.properties file:
...
when multiple log.dirs
are configured. The value is base64 encoded, like the cluster UUID.
...
The meta.properties version field will stay set to 1, to allow for a downgrade after an upgrade on a non-JBOD KRaft cluster.
...
Having a persisted UUID at the root of each log directory allows the broker to identify the log directory regardless of the mount path. Having a persisted list of all UUIDs for all configured log directories allows the broker to determine the UUIDs of unavailable (offline) log directories, as the meta.properties files for the offline log directories are likely to be unavailable.
Example
Given the following server.properties:
...
#
#Thu Aug 18 15:23:07 BST 2022
node.id=8
version=1
cluster.id=41QSStLtR3qOekbX4ZlbHA
directory.id=e6umYSUsQyq7jUUzL9iXMQ
directory.ids=e6umYSUsQyq7jUUzL9iXMQ,b4d9ExdORgaQq38CyHwWTA,P2aL9r4sSqqyt7bC0uierg
/mnt/d1/meta.properties:
#
#Thu Aug 18 15:23:07 BST 2022
node.id=8
version=1
cluster.id=41QSStLtR3qOekbX4ZlbHA
directory.id=b4d9ExdORgaQq38CyHwWTA
directory.ids=e6umYSUsQyq7jUUzL9iXMQ,b4d9ExdORgaQq38CyHwWTA,P2aL9r4sSqqyt7bC0uierg
/mnt/d2/meta.properties:
#
#Thu Aug 18 15:23:07 BST 2022
node.id=8
version=1
cluster.id=41QSStLtR3qOekbX4ZlbHA
directory.id=P2aL9r4sSqqyt7bC0uierg
directory.ids=e6umYSUsQyq7jUUzL9iXMQ,b4d9ExdORgaQq38CyHwWTA,P2aL9r4sSqqyt7bC0uierg
Each directory, including the directory that holds the cluster metadata topic (metadata.log.dir), has its own distinct directory ID. The full set of directory IDs, covering all log dirs in log.dirs as well as metadata.log.dir, is persisted in all three metadata files.
In the example above, we can identify the following directory mapping:
- /var/lib/kafka/metadata has log directory UUID e6umYSUsQyq7jUUzL9iXMQ
- /mnt/d1 has log directory UUID b4d9ExdORgaQq38CyHwWTA
- /mnt/d2 has log directory UUID P2aL9r4sSqqyt7bC0uierg
If some but not all log directories are unavailable, the broker is able to identify which UUIDs refer to offline log directories by diffing the set of loaded directory.id values from each available log directory with the loaded value from directory.ids.
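The diff described above amounts to a set difference. The Python sketch below applies it to the UUIDs from the example, assuming for illustration that /mnt/d2 is the unreadable directory; the function name is hypothetical.

```python
def offline_directory_ids(loaded_ids, all_ids):
    """Return the UUIDs listed in directory.ids that no readable directory reported."""
    return set(all_ids) - set(loaded_ids)


# directory.ids as persisted in every meta.properties file of the example.
all_ids = "e6umYSUsQyq7jUUzL9iXMQ,b4d9ExdORgaQq38CyHwWTA,P2aL9r4sSqqyt7bC0uierg".split(",")

# Assumption for this example: /mnt/d2 is unreadable, so only two
# directory.id values could be loaded.
loaded = ["e6umYSUsQyq7jUUzL9iXMQ", "b4d9ExdORgaQq38CyHwWTA"]

offline = offline_directory_ids(loaded, all_ids)
```

Here the only UUID left over is the one belonging to /mnt/d2, even though its meta.properties file could not be read.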
Brokers
Broker lifecycle management
When the broker starts up and initializes LogManager, if multiple log.dirs are configured, it will load the UUID for each configured log directory (directory.id) and the list of all log directory UUIDs (directory.ids) by reading the meta.properties file at the root of each of them.
- If there are any two log directories with the same UUID, the broker will fail at startup.
- If there are any meta.properties files missing directory.id, a new UUID is generated and assigned to that log directory by updating the file.
- If there are no offline log directories, the broker will also create or amend the directory.ids field in each meta.properties file as required.
If there are offline log directories, the broker might not be able to determine the UUID for each specific offline log directory, but by diffing directory.ids with the loaded UUIDs from all directory.id values, the set of offline log directory UUIDs can still be determined.
After loading meta.properties, the broker will diff all the UUIDs in directory.id with the full set of all UUIDs (in directory.ids) to obtain the set of UUIDs for offline log directories. The sets of both online and offline log directory UUIDs are sent along in the broker registration request to the controller. If the log directory that holds the cluster metadata topic is configured separately to a different path, using metadata.log.dir, then the respective UUID for this log directory is excluded from both online and offline sets, as the broker cannot run if this particular log directory is unavailable.
If a new entry is added in the log.dirs configuration, the broker can always expand directory.ids, as it can determine the "set of UUIDs for online log directories" + "set of UUIDs for offline log directories" + newly generated UUID for the log directory.
If an entry is removed from log.dirs, the broker can also automatically update directory.ids as long as no log directories are offline when the broker comes back up. The broker will need to be able to access all meta.properties files to determine the new full set of UUIDs. An unresolvable mismatch might occur if some log directory was removed from log.dirs and some other log directory is offline. In that case it is not possible to determine which UUID belonged to each of the missing log dirs: the UUID for the removed log directory needs to be removed from directory.ids, but the UUID for the offline log directory should stay. Upon an unresolvable mismatch between the number of entries configured in log.dirs and found in meta.properties under directory.ids, the broker will fail at startup. The set of all loaded log directory UUIDs is sent along in the broker registration request to the controller as the OnlineLogDirs field. If any configured log directory is unavailable, OfflineLogDirs is set to true.
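One possible reading of the rules for adding and removing log.dirs entries is sketched below in Python. The function and parameter names are hypothetical, and the exact test for an "unresolvable mismatch" (the number of unaccounted UUIDs differing from the number of unreadable directories) is an assumption made for illustration rather than something this document specifies.

```python
def reconcile_directory_ids(configured_count, online_ids, persisted_ids):
    """Return the new directory.ids set, or raise if it cannot be determined.

    configured_count: number of entries in log.dirs
    online_ids: directory.id values loaded (or freshly generated) for readable directories
    persisted_ids: directory.ids as previously persisted
    """
    online = set(online_ids)
    offline_count = configured_count - len(online)
    if offline_count == 0:
        # Every configured directory is readable, so the online set is the full
        # set. This also drops UUIDs of directories removed from log.dirs.
        return online
    candidates = set(persisted_ids) - online
    if len(candidates) != offline_count:
        # Assumed mismatch rule: removed directories cannot be told apart
        # from offline ones, so the broker would fail at startup.
        raise RuntimeError("unresolvable mismatch between log.dirs and directory.ids")
    return online | candidates
```

With two configured entries, one readable directory and three persisted UUIDs, the sketch raises, which corresponds to the case where one directory was removed while another is offline.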
Metadata caching
Replicas are considered offline if the hosting broker is offline, or if the hosting broker's registration flags offline log directories and the replica references none of the registered online log directories.
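A minimal Python sketch of the flag-based rule follows: a replica is offline when its broker is offline, or when the registration flags offline log directories and the replica's directory is not among the registered online ones. The BrokerRegistration class and its field names are hypothetical stand-ins for whatever the metadata cache actually holds.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class BrokerRegistration:
    """Hypothetical cached view of one broker's registration."""
    online_log_dirs: Set[str] = field(default_factory=set)
    offline_log_dirs: bool = False  # mirrors the OfflineLogDirs flag
    broker_offline: bool = False


def is_replica_offline(registration: Optional[BrokerRegistration], replica_dir: str) -> bool:
    """Apply the rule: broker offline, or offline dirs flagged and dir not online."""
    if registration is None or registration.broker_offline:
        return True
    return registration.offline_log_dirs and replica_dir not in registration.online_log_dirs


degraded = BrokerRegistration({"d1", "d2"}, offline_log_dirs=True)
healthy = BrokerRegistration({"d1"}, offline_log_dirs=False)
```

Note that under this rule a replica referencing an unknown directory on a broker with no flagged offline directories is still treated as online.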
Handling log directory failures
...
Because the broker is proactive in communicating any log directory assignment changes to the controller, the metadata should be up to date and correct when the controller is notified of a failed log directory. However, the consequences of some partition assignment being incorrect (due to some error or race condition) can be quite damaging, as the controller might not know to update leadership for that partition, leaving it unavailable for an indefinite amount of time. So, as a fallback mechanism, when handling a runtime directory failure, the broker must verify the assignments for the newly failed partitions against the latest metadata, and for any incorrect assignments, the broker will use AlterReplicaLogDirs to assign them to UUID.Zero so that the controller can update leadership and ISR.
Replica management
When configured with multiple log.dirs, as the broker catches up with metadata and sees the partitions which it should be hosting, it will check the associated log directory UUID for each partition.
...
For any new partitions, the active controller will use Uuid.ZERO as the initial value for log directory UUID for each replica. Each broker with multiple log.dirs hosting replicas then assigns a log directory UUID and communicates it back to the active controller using the new RPC AssignReplicasToDirs, so that cluster metadata can be updated with the log directory assignment.
...
- Persist a BrokerRegistrationChange record, with the new list of online log directories and the updated offline log directories flag.
- Update the Leader and ISR for all the replicas assigned to the failed log directories, persisting PartitionChangeRecords, in a similar way to how leadership and ISR are updated when a broker becomes fenced, unregistered or shuts down.
If any of the listed log directory UUIDs is not a registered log directory then the call fails with error 57 (LOG_DIR_NOT_FOUND).
Handling replica assignments
The controller accepts the AssignReplicasToDirs RPC and persists the assignment into metadata records. If the indicated log directory UUID is not a registered log directory then the call fails with error 57 (LOG_DIR_NOT_FOUND).
If the indicated log directory UUID is UUID.Zero, then the replica is considered offline and the leader and ISR are updated accordingly, same as when the BrokerHeartbeat indicates a new offline log directory. This should only happen in the exceptional case that a broker's metadata cache shows an incorrect assignment for some replica during the handling of a failure for the actual directory that hosts that replica.
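The handling above could be sketched roughly as follows in Python. The function, its arguments, and the order of the two checks are assumptions made for illustration; only the error code 57 (LOG_DIR_NOT_FOUND) and the special treatment of the all-zero UUID come from the text.

```python
import base64

ERROR_NONE = 0
ERROR_LOG_DIR_NOT_FOUND = 57

# All-zero UUID in unpadded URL-safe base64, standing in for UUID.Zero.
ZERO_DIR = base64.urlsafe_b64encode(bytes(16)).rstrip(b"=").decode("ascii")


def handle_assignment(registered_dirs, assignments, offline_replicas, replica, dir_id):
    """Apply one replica-to-directory assignment and return an error code."""
    if dir_id == ZERO_DIR:
        # Exceptional case: the broker reports that the replica's directory
        # failed, so the replica is treated as offline.
        assignments[replica] = dir_id
        offline_replicas.add(replica)
        return ERROR_NONE
    if dir_id not in registered_dirs:
        return ERROR_LOG_DIR_NOT_FOUND
    assignments[replica] = dir_id
    offline_replicas.discard(replica)
    return ERROR_NONE


assignments, offline_replicas = {}, set()
registered = {"b4d9ExdORgaQq38CyHwWTA", "P2aL9r4sSqqyt7bC0uierg"}
```

In a real controller the offline case would additionally trigger the leader and ISR update; the sketch only records which replicas would be affected.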
Broker registration
Upon a broker registration request the controller will persist the broker registration as cluster metadata including the online log directory list and offline log directories flag for that broker. The controller may receive a new list of online directories and offline log directories flag — different from what was previously persisted in the cluster metadata for the requesting broker.
- If there are no indicated online log directory UUIDs, the request is invalid and the controller replies with the error INVALID_REQUEST.
- If the offline log directories flag is false and there are any missing log directories, this means those have been removed from the broker's configuration, so the controller will reassign all replicas currently assigned to the missing log directories to Uuid.ZERO to delegate the choice of log directory to the broker, which will then report the choice via the AssignReplicasToDirs RPC. If multiple log directories are registered, the broker will remain fenced until the controller learns of all the partition to log directory placements in that broker, i.e. no remaining replicas assigned to Uuid.ZERO. The broker will indicate these using the AssignReplicasToDirs RPC.
  - The broker remains fenced by not wanting to unfence itself in heartbeat requests until the number of mismatching replica to log directory assignments is zero. This number is represented by the new metric NumMismatchingReplicaToLogDirAssignments.
- If multiple log directories are registered and some of them are new (not present in the previous registration), then these log directories are assumed to be empty. If they are not, the broker will use the AssignReplicasToDirs RPC to correct the assignment and choose not to become UNFENCED before the metadata is correct.
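The registration rules above can be illustrated with a small Python sketch. The function names, the dictionary used for replica assignments, and the string error value are hypothetical; the sketch only covers the empty-list check, the reassignment of replicas on removed directories, and the fencing condition.

```python
import base64

# All-zero UUID in unpadded URL-safe base64, standing in for Uuid.ZERO.
ZERO_DIR = base64.urlsafe_b64encode(bytes(16)).rstrip(b"=").decode("ascii")


def process_registration(previous_online, new_online, offline_flag, assignments):
    """Return (error, updated_assignments) for one broker registration."""
    if not new_online:
        return "INVALID_REQUEST", assignments
    updated = dict(assignments)
    if not offline_flag:
        # No offline directories, so any directory that disappeared was
        # removed from the broker's configuration.
        missing = set(previous_online) - set(new_online)
        for replica, dir_id in assignments.items():
            if dir_id in missing:
                updated[replica] = ZERO_DIR
    return None, updated


def may_unfence(online_dirs, assignments):
    """With multiple directories, stay fenced while any replica is unassigned."""
    if len(online_dirs) <= 1:
        return True
    return all(dir_id != ZERO_DIR for dir_id in assignments.values())
```

When the offline flag is true the sketch leaves assignments untouched, since a missing directory may simply be offline rather than removed.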
...
The cluster needs to be upgraded before configuring multiple entries in log.dirs. As the upgraded brokers come up, the existing meta.properties files in each broker are updated with a generated directory.id and directory.ids. After the upgrade, the metadata.version feature flag needs to be upgraded using kafka-features.sh. Then the brokers can be reconfigured with multiple entries in log.dirs.
Upon being reconfigured with multiple log directories, brokers will update meta.properties, generating directory.id as necessary to reflect the new log directories. Brokers will then register the log directories with the controller via BrokerRegistration and use AssignReplicasToDirs to create the partition to log directory assignments in the cluster metadata before becoming UNFENCED.
...
- As per KIP-866, a separate Controller quorum is set up first, and only then are the existing brokers reconfigured and upgraded.
- When configured for the migration and while still in ZK mode, brokers will:
  - update meta.properties to generate and include directory.id and directory.ids;
  - send BrokerRegistrationRequest including the log directory UUIDs;
  - notify the controller of log directory failures via BrokerHeartbeatRequest.
- During the migration, the controller:
  - persists log directories indicated in broker registration requests in the cluster metadata;
  - relies on heartbeat requests to detect log directory failure instead of monitoring the ZK znode for notifications;
  - still uses full LeaderAndIsr requests to process log directory failures for any brokers still running in ZK mode.
- The brokers restarting into KRaft mode will want to stay fenced until their log directory assignments for all hosted partitions are persisted in the cluster metadata.
- The active controller will also ensure that any given broker stays fenced until it learns of all partition to log directory assignments in that specific broker via the new AssignReplicasToDirs RPC.
- During the migration, replicas are assumed to be assigned to log directory Uuid.ZERO until the actual log directory is learnt by the active controller from a broker running in KRaft mode.
...
- Partition reassignment across directories and across brokers involves different API calls: AlterPartitionReassignments and AlterReplicaLogDirs. Whilst reassigning partitions across brokers into a specific log directory is already possible, it involves an intricate sequence of prior calls to AlterReplicaLogDirs and expecting errors as a successful result. Once this work is done we can consolidate these two API calls by extending AlterPartitionReassignments to allow target log directories to be specified, and deprecate AlterReplicaLogDirs. This can be done as part of a future KIP.
- The only way to know which log directory UUID corresponds to which log directory path is by reading the meta.properties files in each broker. A future KIP should expand the DescribeLogDirs RPC response to include log directory UUIDs along with the system path for each log directory.
- Partition initialization can be optimized by having the controller preselect a log directory for new partitions. This would avoid having to wait for the broker to send an AssignReplicasToDirs request to indicate the chosen log directory before it is safe for the broker to assume leadership of the partition. The controller could also take available storage in each log directory into account if the broker indicates the available storage space for each log directory as part of broker registration. This may be proposed in a future KIP.
...