...

Currently, during log compaction, if the compaction of one log fails, the whole `CleanerThread` responsible for compacting and deleting old logs exits and is not automatically restarted at any point. This results in a Kafka broker that runs seemingly fine but does not delete old log segments at all. Such a broker is a ticking time bomb - it is only a matter of time until it runs out of disk space, at which point all sorts of fatal scenarios ensue.

The situation has been improving - there is a metric showing the time since the last run of the `CleanerThread` (`kafka.log:type=LogCleanerManager,name=time-since-last-run-ms`), and Kafka 1.1 (KIP-226) added functionality allowing us to restart the log cleaner threads without restarting the broker.
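
As an illustration, that metric can be read over JMX with the standard `javax.management` APIs. Below is a minimal sketch, assuming the broker exposes JMX on `localhost:9999` (the host and port are placeholders):

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CleanerLagCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint; the broker must be started with JMX enabled (e.g. JMX_PORT=9999).
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // The log cleaner gauge; Yammer gauges expose their reading via the "Value" attribute.
            ObjectName mbean = new ObjectName("kafka.log:type=LogCleanerManager,name=time-since-last-run-ms");
            Object value = conn.getAttribute(mbean, "Value");
            System.out.println("time-since-last-run-ms = " + value);
        }
    }
}
```

A monitoring system would typically alert when this value grows well beyond the expected cleaning interval, since a dead `CleanerThread` causes it to increase indefinitely.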

Even so, these improvements still require manual intervention, or at the very least complex infrastructure code that automates the process (a sketch of such automation is shown below).
It would be very useful if Kafka had a way to quarantine unexpected failures in certain logs so that they do not affect the cleaning of other logs. While this would not fix the issue, it would significantly slow down the process and give users adequate time for detection and repair.
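
For reference, the KIP-226-based restart mentioned above amounts to altering the dynamic `log.cleaner.threads` broker config, which makes the broker recreate its cleaner threads. A minimal sketch using the Java `Admin` client follows; the bootstrap address and thread count are placeholders, and `incrementalAlterConfigs` requires a newer client and broker (2.3+), whereas on 1.1 the same change would typically be made with `kafka-configs.sh`:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RestartLogCleaner {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder bootstrap address
        try (Admin admin = Admin.create(props)) {
            // Empty resource name targets the cluster-wide default for all brokers.
            ConfigResource brokers = new ConfigResource(ConfigResource.Type.BROKER, "");
            // Resizing the cleaner thread pool dynamically causes the broker to
            // recreate the cleaner threads, effectively restarting log cleaning.
            AlterConfigOp bump = new AlterConfigOp(
                new ConfigEntry("log.cleaner.threads", "2"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(brokers, List.of(bump))).all().get();
        }
    }
}
```

This is exactly the kind of external automation the proposal aims to make unnecessary: it has to be triggered by alerting on the metric above and keeps no record of which log caused the failure.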

...