...

  • `uncleanable-partitions-count` (Int) - Count of partitions that are uncleanable per logDir
  • `uncleanable-bytes` (Long) - The current number of uncleanable bytes per logDir. This is the sum of the uncleanable bytes of every uncleanable partition in a given log directory.
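As a rough illustration of how the two per-logDir values relate, the sketch below aggregates a list of uncleanable partitions by log directory. The class `UncleanableMetricsSketch`, the `UncleanablePartition` holder, and both method names are hypothetical and exist only for this example; they are not taken from the Kafka code base.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: not the broker's actual metrics code.
class UncleanableMetricsSketch {

    // Assumed holder for one uncleanable partition and its uncleanable byte count.
    static final class UncleanablePartition {
        final String logDir;
        final String topicPartition;
        final long bytes;

        UncleanablePartition(String logDir, String topicPartition, long bytes) {
            this.logDir = logDir;
            this.topicPartition = topicPartition;
            this.bytes = bytes;
        }
    }

    // uncleanable-partitions-count: number of uncleanable partitions per log directory.
    static Map<String, Integer> partitionsCountPerLogDir(List<UncleanablePartition> partitions) {
        Map<String, Integer> counts = new TreeMap<>();
        for (UncleanablePartition p : partitions) {
            counts.merge(p.logDir, 1, Integer::sum);
        }
        return counts;
    }

    // uncleanable-bytes: sum of uncleanable bytes over the uncleanable partitions of each log directory.
    static Map<String, Long> bytesPerLogDir(List<UncleanablePartition> partitions) {
        Map<String, Long> totals = new TreeMap<>();
        for (UncleanablePartition p : partitions) {
            totals.merge(p.logDir, p.bytes, Long::sum);
        }
        return totals;
    }
}
```

Both maps are keyed by log directory, so a directory with no uncleanable partitions simply has no entry in this sketch.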

Proposed Changes

Catch any unexpected (non-IO) exceptions in `CleanerThread#cleanOrSleep()`.
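A minimal sketch of the idea, written in Java for illustration only. It assumes that a partition whose cleaning throws an unexpected exception is recorded as uncleanable and skipped on later passes, which matches the metrics described above but is an assumption of this example. `CleanerLoopSketch`, `PartitionCleaner`, and `cleanOrSkip` are hypothetical names, not the real `CleanerThread` API, and IO errors are rethrown as `UncheckedIOException` purely to keep the sketch short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: not the actual CleanerThread implementation.
class CleanerLoopSketch {

    // Assumed stand-in for the work of cleaning a single partition.
    interface PartitionCleaner {
        void clean(String partition) throws IOException;
    }

    private final Set<String> uncleanable = new LinkedHashSet<>();

    // Returns true if the partition was cleaned, false if it was skipped or newly marked uncleanable.
    boolean cleanOrSkip(String partition, PartitionCleaner cleaner) {
        if (uncleanable.contains(partition)) {
            return false; // previously failed with a non-IO error, so do not retry
        }
        try {
            cleaner.clean(partition);
            return true;
        } catch (IOException e) {
            // IO failures are not swallowed here; they propagate to the caller.
            throw new UncheckedIOException(e);
        } catch (Exception e) {
            // Unexpected (non-IO) failure: remember the partition and keep the thread alive.
            uncleanable.add(partition);
            return false;
        }
    }

    Set<String> uncleanablePartitions() {
        return Collections.unmodifiableSet(uncleanable);
    }
}
```

The point of the design is that one bad partition no longer stops the thread: other partitions continue to be compacted while the failing one is tracked separately.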

...

  • Restart `CleanerThread` - it would most likely hit the same problem again before it is able to compact any more data.
  • Mark disk volumes as "uncleanable" on the first error encountered. While this would work, in practice it would not help, as most deployments use a single volume. Also, if the error is caused by a bug in the partition itself (as shown by most JIRA issues in the Motivation paragraph), this would unnecessarily stop compaction of all other partitions.
  • Mark log directories as offline after a certain threshold of uncleanable bytes or number of uncleanable partitions. A threshold on the number of uncleanable partitions proved insufficient, since the problems encountered so far affected only a small number of partitions (the `__consumer_offsets` topic). A threshold on uncleanable bytes is hard to get right, as it should be different for each user, and the default value would best be -1 (disabled).