...

  • A metric that tracks the total uncleanable bytes seems useful. I am not sure how easy it is to implement, and I wonder whether the functionality it needs (fetching log segments and determining their size) could itself cause additional errors.
  • Should such log directories be marked as "offline log directories", thereby stopping replicas from fetching those partitions?
  • Should we mark disk partitions as offline after a certain number of `IOException`s are caught? (as they imply that something might be wrong with the disk)
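
To make the last question concrete, here is a minimal sketch of a per-directory error threshold. It is not part of Kafka or of this proposal: the class `LogDirFailureTracker`, its methods, and the `maxIoErrors` threshold are all hypothetical names chosen for illustration, and the sketch assumes the caller decides what "marking offline" actually does.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not a Kafka class): counts IOExceptions seen per
 * log directory and reports when a directory reaches a configured threshold.
 */
public class LogDirFailureTracker {
    // Hypothetical threshold; the proposal does not define such a setting.
    private final int maxIoErrors;
    private final Map<String, AtomicInteger> ioErrorCounts = new ConcurrentHashMap<>();

    public LogDirFailureTracker(int maxIoErrors) {
        if (maxIoErrors < 1) {
            throw new IllegalArgumentException("maxIoErrors must be at least 1");
        }
        this.maxIoErrors = maxIoErrors;
    }

    /**
     * Records one IOException for the given log directory.
     * Returns true once the directory has reached the threshold.
     */
    public boolean recordIoError(String logDir, IOException error) {
        int count = ioErrorCounts
                .computeIfAbsent(logDir, dir -> new AtomicInteger())
                .incrementAndGet();
        return count >= maxIoErrors;
    }

    /** Returns true if the directory has already reached the threshold. */
    public boolean shouldMarkOffline(String logDir) {
        AtomicInteger count = ioErrorCounts.get(logDir);
        return count != null && count.get() >= maxIoErrors;
    }
}
```

A caller would invoke `recordIoError` from the place where the `IOException` is caught and act on a `true` result. Counting per directory rather than globally keeps one failing disk from affecting healthy ones, which is the concern the question raises.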

Compatibility, Deprecation, and Migration Plan

...

  • Restart `CleanerThread` - it will most likely hit the same problem again before it is able to compact more.
  • Only catch specific `KafkaStorageException` raised when `CleanerThread` fails to delete logs. This will not address the most common failure cases and does not future-proof the `CleanerThread`.
  • Mark disk volumes as "uncleanable" on the first encountered error. While this would work, in practice it would not help, as most deployments use a single volume. Also, if the error is caused by a bug in the partition itself (as shown by most JIRA issues in the Motivation paragraph), this will unnecessarily stop compaction of all other partitions.