...

However, these improvements still require manual intervention, or at the very least complex infrastructure code to automate the process.
It would be very useful if Kafka could quarantine unexpected failures in certain logs so that they do not affect the cleaning of other logs. While this would not fix the underlying problem, it would slow its effects significantly and give users adequate time for detection and repair.

Public Interfaces

New metrics:

  • `uncleanable-partitions-count` (Int) - Count of partitions that are uncleanable per logDir

New broker config value:

  • `max.uncleanable.partitions` - the maximum number of uncleanable partitions a single logDir can have before it is marked as offline. The default value is 10 (see the sketch below).
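
For illustration only, here is a minimal sketch of the bookkeeping that could back the `uncleanable-partitions-count` metric and the `max.uncleanable.partitions` threshold. The class and method names (`UncleanablePartitionTracker`, `recordUncleanable`) are hypothetical and are not the actual LogCleaner code; the sketch only shows the intended behavior.

```scala
import scala.collection.mutable

// Hypothetical sketch (not the actual Kafka implementation): tracks
// uncleanable partitions per log directory.
class UncleanablePartitionTracker(maxUncleanablePartitions: Int = 10) {
  // logDir -> set of topic-partitions whose cleaning hit an unexpected error
  private val uncleanableByDir = mutable.Map.empty[String, mutable.Set[String]]

  // Record a partition whose cleaning failed. Returns true when the logDir
  // now exceeds the threshold and should be marked offline.
  def recordUncleanable(logDir: String, topicPartition: String): Boolean = synchronized {
    val partitions = uncleanableByDir.getOrElseUpdate(logDir, mutable.Set.empty[String])
    partitions += topicPartition
    partitions.size > maxUncleanablePartitions
  }

  // Value behind the per-logDir `uncleanable-partitions-count` metric.
  def uncleanablePartitionsCount(logDir: String): Int = synchronized {
    uncleanableByDir.get(logDir).map(_.size).getOrElse(0)
  }
}
```

Under this sketch, the cleaner thread would call `recordUncleanable` when it catches an unexpected exception for a partition, skip that partition in later cleaning passes, and take the whole logDir offline once the threshold is crossed.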

...

Compatibility, Deprecation, and Migration Plan

The "time-since-last-run" metric will slightly change its behavior, since the LogCleaner will now continue running after it encounters an error. Deployments that track "time-since-last-run" to detect potential disk failures might be affected, but disk damage is still mitigated because a log directory with too many uncleanable partitions is marked as offline. If all log directories are offline, "time-since-last-run" will not be updated.

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.

...