A metric that tracks the overall uncleanable bytes seems like it would be useful. I am not sure how easy that is to implement, and I wonder whether that functionality (fetching log segments and determining their size) could cause additional errors.
Should such log directories be marked as "offline log directories", thereby stopping replicas from fetching those partitions?
Should we mark disk partitions as offline after a certain number of `IOException`s are caught? (as they imply that something might be wrong with the disk)
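
To make the questions above more concrete, here is a minimal Java sketch of the bookkeeping they imply: summing uncleanable bytes per log directory and counting `IOException`s against a threshold. The class `UncleanableTracker`, its methods, and the `maxIoErrors` threshold are all hypothetical names chosen for illustration; none of them exist in Kafka. The sketch assumes the caller already knows each partition's size, so it does not answer whether computing that size could itself fail.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not a Kafka class): per-log-directory bookkeeping for the cleaner. */
final class UncleanableTracker {
    // Assumed threshold for illustration; not an existing Kafka configuration.
    private final int maxIoErrors;
    // logDir -> (topicPartition -> total size of its segments in bytes)
    private final Map<String, Map<String, Long>> uncleanableBytes = new HashMap<>();
    // logDir -> number of IOExceptions seen so far
    private final Map<String, Integer> ioErrors = new HashMap<>();

    UncleanableTracker(int maxIoErrors) {
        this.maxIoErrors = maxIoErrors;
    }

    /** Records a partition as uncleanable; a repeated call overwrites the previous size. */
    void markUncleanable(String logDir, String topicPartition, long sizeInBytes) {
        uncleanableBytes.computeIfAbsent(logDir, d -> new HashMap<>())
                        .put(topicPartition, sizeInBytes);
    }

    /** Value a metric could report: sum of the sizes of all uncleanable partitions in logDir. */
    long totalUncleanableBytes(String logDir) {
        Map<String, Long> sizes = uncleanableBytes.get(logDir);
        if (sizes == null) {
            return 0L;
        }
        long total = 0L;
        for (long size : sizes.values()) {
            total += size;
        }
        return total;
    }

    /** Counts one IOException; returns true once logDir has reached the threshold. */
    boolean recordIoError(String logDir) {
        int count = ioErrors.merge(logDir, 1, Integer::sum);
        return count >= maxIoErrors;
    }
}
```

A caller would invoke `recordIoError` from the cleaner's exception handler and, once it returns true, decide whether to treat the directory as offline. That decision is exactly the open question above and is left out of the sketch.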

Compatibility, Deprecation, and Migration Plan

...

Restart `CleanerThread` - it will most likely hit the same problem before it is able to compact more.
Only catch specific `KafkaStorageException` raised when `CleanerThread` fails to delete logs. This will not address the most common failure cases and does not future-proof the `CleanerThread`.
Mark disk volumes as "uncleanable" on first encountered error. While this would work, in practice it would not help as most deployments use a single volume. Also, if the error is caused by a bug in the partition itself (as shown by most JIRA issues in the Motivation paragraph), this will unnecessarily stop compaction of all other partitions.
