...
To ensure availability, when a partition leader fails, the controller should elect a new leader from one of the other in-sync replicas. But the controller does not check whether each leader is correctly performing its duties, instead the controller simply assumes that each broker is working correctly if it is still an active member of the cluster. In KRaft, cluster membership is based on timely heartbeat requests sent by each broker to the active controller. In ZooKeeper, cluster membership is based on an ephemeral zNode under /brokers/ids
.
In KRaft mode, when a single log directory fails, the broker will be unable to be either a leader or a follower for any partitions in that log directory, but the controller will have no signal that it needs to update leadership and ISR for the replicas in that log directory, as the broker will continue to send heartbeat requests.
Footnote |
---|
The exception is the cluster metadata log directory, which can be configured separately with |
In ZooKeeper mode when a log directory fails, the broker sends a notification to the controller which then sends a full LeaderAndIsr
request to the broker, listing all the partitions for all log directories for that broker. The controller relies on per-partition error results from the broker to update leadership and ISR for the replicas in the failed log directory. Without this notification, the partitions with leadership on that log directory will not get a new leader assigned and would remain unavailable.
Support for KRaft in JBOD, was proposed and accepted back in KIP-589 — with a new RPC from the broker to the controller indicating the affected topic partitions in a failed log directory — but the implementation was never merged and concerns were raised with possible large requests from the broker to the controller.
KIP-833 was accepted, with plans to mark KRaft as production ready and deprecate ZooKeeper mode, but JBOD is still a missing feature in KRaft. This KIP aims to provide support for JBOD in KRaft, while avoiding any RPC having to list all the partitions in a log directory.
...
If some or all of the log directories are new, then the format
command should be used. Otherwise update-directories
can be used to update the two properties: directory.id
and directory.ids
. This can ensure that each log directory configured in log.dirs
has a unique UUID assigned under the property log.directory
in its meta.properties
and that the full set of UUIDs for all configured log.dirs
is persisted in directory.ids
in all meta.properties
.
All configured log directories must be available for either format
or update-directories
to run.
meta.properties
The meta.properties
version field will be bumped from 1 to 2. Two new properties directory.id
and directory.ids
will be added to the meta.properties
file in each log directory, including the metadata.log.dir
. The first property, directory.id
indicates the UUID for the log directory where the file is located, the second property, directory.ids
lists all the UUIDs for all the configured log directories. If the meta.properties
file doesn't exist for the metadata.log.dir
the Kafka node will fail to start. If the meta.properties
file exists but it doesn't contain these two properties a new one will be generated and the meta.properties
files will be updated. The kafka-storage.sh
tool will be extended to generate and update the two properties as described in the previous section.
...
The broker will then diff all the loaded UUIDs in directory.id
with the full set of all UUIDs (in directory.ids
) to obtain the set of UUIDs for offline log directories. The sets of both online and offline log directory UUIDs are sent along in the broker registration request to the controller) to obtain the set of UUIDs for offline log directories. The sets of both online and offline log directory UUIDs are sent along in the broker registration request to the controller. If log directory that holds the cluster metadata topic is configured separately to a different path — using metadata.log.dir
— then the respective UUID for this log directory is excluded from both online and offline sets, as the broker cannot run if this particular log directory is unavailable.
Replica management
As the broker catches up with metadata, and sees the partitions which it should be hosting, it will check the associated log directory UUID for each partition.
...
If the broker is configured with multiple log directories it remains FENCED until it can verify that all partitions are assigned to the correct log directories in the cluster metadata. This excludes the log directory that hosts the cluster metadata topic, if it is configured separately to a different path — using metadata.log.dir
.
Metadata caching
Replicas are considered offline if the replica references a log directory which is not in the list of online log directories for the broker ID hosting the replica.
...
- Keeping the scope of the log directory to the broker — while this would mean a much simpler change, as was proposed in KIP-589, if only the broker itself knows which partitions were assigned to a log directory, when a log directory fails the broker will need to send a potentially very large request enumerating all the partitions in the failed disk, so that the controller can update leadership and ISRs accordingly.
- Having controller determine the log directory, and leave that up to the broker — this would avoid require a RPC from the broker upon selecting a new log directory for new replicas, and reduce the time until it is safe for the broker to take leadership of the replica. However the broker is in a better position to make a choice of log directory than the broker, as it has easier access to e.g. disk usage in each log directory. The controller could also have this information if the broker were to include it the broker registration. But to keep the design simple, this optimization is best left for future work.
- Changing how log directory failure is handled in ZooKeeper mode — ZooKeeper mode is going away, KIP-833 proposed its deprecation in a near future release.
- Using the system path to identify each log directory, or storing the identifier somewhere else — When running Kafka with multiple log directories, each log directory is typically assigned to a different system disk or volume. The same storage device can be made accessible under a different mount, and Kafka should be able to identify the contents as the same disk. Because the log directory configuration can change on the broker, the only reliable way to identify the log directory in each broker is to add metadata to the file system under the log directory itself.
- Fail on broker startup if the cluster metadata indicates replicas should be in a different log directory than the directory where the broker actually finds them — Despite not being an advertised feature, currently replicas can be moved between log directories while the broker is offline. Once the broker comes back up it accepts the new location of the replica. To continue supporting this feature, the broker will need to compare the information in the cluster metadata with the actual replica location during startup and take action on any mismatch.
- Not keeping a list of all configured log directory identifiers in a metadata file in the file system — If a broker finds some log directories to be offline during startup it will need another way to identify the log directory as the metadata is in the disk itself. For this reason, the list of all log directory IDs must also be persisted in somewhere that the broker can expected to be able to access. Regardless of how many log directories are configured, one of the log directories will be configured to host the cluster metadata topic. This can either be one of the data log directories or a completely separate one — typically in the system disk. Since the broker will not startup if this log directory is not available, it is the perfect location to persist the list of all log directory identifiers.
Footnotes
Footnotes Display |
---|