...

Then again, these improvements still require manual intervention or, at the very least, complex infrastructure code to automate the process.
It would be very useful if Kafka had a way to quarantine unexpected failures in certain logs so that they do not affect the cleaning of other logs. While this would not fix the underlying issue, it would significantly slow its impact and give users adequate time for detection and repair.

Public Interfaces

New metrics:

  • `uncleanable-partitions-count` (Int) - Count of partitions that are uncleanable per logDir
  • `uncleanable-bytes` (Long) - The current number of uncleanable bytes. This is the sum of uncleanable bytes for every uncleanable partition
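A minimal sketch of the bookkeeping that could back these two gauges. Only the metric names (`uncleanable-partitions-count`, `uncleanable-bytes`) come from this KIP; the `UncleanableTracker` class and its methods are hypothetical illustrations, not Kafka's actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: tracks uncleanable partitions and their sizes per log directory.
public class UncleanableTracker {
    // logDir -> (topicPartition -> uncleanable bytes)
    private final Map<String, Map<String, Long>> byLogDir = new ConcurrentHashMap<>();

    public void markUncleanable(String logDir, String topicPartition, long uncleanableBytes) {
        byLogDir.computeIfAbsent(logDir, d -> new ConcurrentHashMap<>())
                .put(topicPartition, uncleanableBytes);
    }

    // Would back the uncleanable-partitions-count gauge (Int) for one logDir.
    public int uncleanablePartitionsCount(String logDir) {
        return byLogDir.getOrDefault(logDir, Map.of()).size();
    }

    // Would back the uncleanable-bytes gauge (Long): the sum of uncleanable
    // bytes over every uncleanable partition in the logDir.
    public long uncleanableBytes(String logDir) {
        return byLogDir.getOrDefault(logDir, Map.of())
                       .values().stream().mapToLong(Long::longValue).sum();
    }
}
```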

New broker config value:

  • `log.cleaner.max.uncleanable.partitions.bytes` - the maximum amount of uncleanable bytes a single logDir can have before it is marked as offline. Default value is set to 10GB (value of 10000000000)
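For illustration, the proposed setting would be overridden in `server.properties` like any other broker config (the name and 10GB default are taken from this KIP revision):

```properties
# Mark a logDir offline once its partitions accumulate this many uncleanable bytes.
# Default: 10000000000 (10 GB)
log.cleaner.max.uncleanable.partitions.bytes=10000000000
```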

Proposed Changes

Catch any unexpected (non-IO) exceptions in `CleanerThread#cleanOrSleep()`.
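The control flow could look roughly like the sketch below. Class and method names mimic Kafka's Scala `CleanerThread` but are simplified Java stand-ins, not the real API; the point is that an unexpected exception quarantines only the offending partition instead of killing the cleaner thread.

```java
// Illustrative sketch of the proposed error handling, not Kafka's actual code.
public class CleanerThreadSketch {
    // Minimal stand-in for the cleaner's collaborators.
    public interface LogCleaner {
        String grabFilthiestLog();          // picks the next log to clean, or null
        void clean(String topicPartition);  // may throw unexpectedly
        void markUncleanable(String topicPartition);
    }

    private final LogCleaner cleaner;

    public CleanerThreadSketch(LogCleaner cleaner) {
        this.cleaner = cleaner;
    }

    // Analogous to CleanerThread#cleanOrSleep(): any unexpected exception
    // quarantines just that partition; cleaning of other logs continues.
    public void cleanOrSleep() {
        String tp = cleaner.grabFilthiestLog();
        if (tp == null) {
            return; // nothing to clean; the real thread would back off and sleep
        }
        try {
            cleaner.clean(tp);
        } catch (Exception e) {
            cleaner.markUncleanable(tp);
        }
    }
}
```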

...

When evaluating which logs to compact, skip the ones marked as uncleanable.
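That selection step amounts to a simple filter. The names below are illustrative, not Kafka's actual API:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch: when choosing compaction candidates, drop partitions already
// marked uncleanable so one bad log cannot stall the others.
public class CleanableFilter {
    public static List<String> cleanableCandidates(List<String> allLogs,
                                                   Set<String> uncleanable) {
        return allLogs.stream()
                .filter(tp -> !uncleanable.contains(tp))
                .collect(Collectors.toList());
    }
}
```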

Introduce a new cluster-level configurable value - `log.cleaner.max.uncleanable.partitions.bytes`. When the sum of uncleanable bytes for all marked partitions reaches this threshold, mark the disk they are on as offline (this most likely indicates a problem with the disk itself).
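The threshold check could be sketched as follows. The config name and the 10GB default come from this KIP; the class and method are hypothetical:

```java
import java.util.Map;

// Sketch of the proposed per-logDir threshold check, not Kafka's actual code.
public class UncleanableDirCheck {
    // Default for log.cleaner.max.uncleanable.partitions.bytes: 10 GB.
    public static final long DEFAULT_MAX_UNCLEANABLE_BYTES = 10_000_000_000L;

    // uncleanableBytesByPartition: uncleanable bytes for each marked partition
    // in one logDir. Returns true when that logDir should be marked offline.
    public static boolean shouldMarkOffline(Map<String, Long> uncleanableBytesByPartition,
                                            long maxUncleanableBytes) {
        long total = uncleanableBytesByPartition.values().stream()
                .mapToLong(Long::longValue).sum();
        return total >= maxUncleanableBytes;
    }
}
```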

...