Status

Current state: "Accepted"

Discussion thread: here

JIRA: KAFKA-7215

Released: 2.1

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

  • `uncleanable-partitions-count` (Int) - Count of partitions that are uncleanable per logDir
  • `uncleanable-bytes` (Long) - The current number of uncleanable bytes per logDir. This is the sum of uncleanable bytes for every uncleanable partition
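The relationship between the two metrics can be illustrated with a minimal sketch. It assumes a simple in-memory map from log directory to uncleanable partitions; the class `UncleanableMetrics` and its method names are hypothetical and are not Kafka's actual implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model for illustration only; not Kafka's actual classes.
final class UncleanableMetrics {
    // logDir -> (partition name -> uncleanable bytes of that partition)
    private final Map<String, Map<String, Long>> uncleanable = new HashMap<>();

    // Record that a partition in the given log directory cannot be cleaned.
    void markUncleanable(String logDir, String partition, long bytes) {
        uncleanable.computeIfAbsent(logDir, d -> new HashMap<>()).put(partition, bytes);
    }

    // Corresponds to uncleanable-partitions-count (Int), reported per logDir.
    int uncleanablePartitionsCount(String logDir) {
        return uncleanable.getOrDefault(logDir, Collections.<String, Long>emptyMap()).size();
    }

    // Corresponds to uncleanable-bytes (Long): the sum over every
    // uncleanable partition in that logDir.
    long uncleanableBytes(String logDir) {
        long sum = 0L;
        for (long bytes : uncleanable.getOrDefault(logDir, Collections.<String, Long>emptyMap()).values()) {
            sum += bytes;
        }
        return sum;
    }
}
```

Both values are keyed by log directory, so a directory with no uncleanable partitions reports zero for each.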

New broker config value:

  • `log.cleaner.max.uncleanable.bytes` - the maximum number of uncleanable bytes a single log directory can have before it is marked as offline. The default value is 10 GB (10000000000).

Proposed Changes

Catch any unexpected (non-IO) exceptions in `CleanerThread#cleanOrSleep()`.

...

When evaluating which logs to compact, skip the ones marked as uncleanable. Introduce a new cluster-level configurable value, `log.cleaner.max.uncleanable.bytes`. When the sum of uncleanable bytes for all marked partitions reaches this threshold, mark the disk they are on as offline (this most likely indicates a problem with the disk itself).
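The behaviour described in this section can be sketched roughly as follows. This is an illustrative model under stated assumptions, not the actual `CleanerThread` code: the class `CleanerSketch`, the `Compactor` interface, and all method names are hypothetical, partitions are identified by plain strings, and IO failures are represented by `UncheckedIOException` so that they can be told apart from unexpected non-IO errors.

```java
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch for one log directory; not Kafka's actual CleanerThread.
final class CleanerSketch {
    /** Hypothetical stand-in for the routine that compacts one partition. */
    interface Compactor {
        void compact(String partition);
    }

    private final Set<String> uncleanable = new HashSet<>();
    private final long maxUncleanableBytes; // models log.cleaner.max.uncleanable.bytes
    private long uncleanableBytes = 0L;
    private boolean logDirOffline = false;

    CleanerSketch(long maxUncleanableBytes) {
        this.maxUncleanableBytes = maxUncleanableBytes;
    }

    /** One cleaning pass; returns the partitions that were compacted. */
    List<String> cleanOnce(List<String> partitions, Map<String, Long> sizes, Compactor compactor) {
        List<String> cleaned = new ArrayList<>();
        for (String partition : partitions) {
            if (logDirOffline) break;                     // nothing more to do on an offline dir
            if (uncleanable.contains(partition)) continue; // skip partitions marked uncleanable
            try {
                compactor.compact(partition);
                cleaned.add(partition);
            } catch (UncheckedIOException io) {
                throw io; // IO errors are not handled here; they propagate unchanged
            } catch (RuntimeException unexpected) {
                // Unexpected non-IO error: mark only this partition and keep going.
                uncleanable.add(partition);
                uncleanableBytes += sizes.getOrDefault(partition, 0L);
                if (uncleanableBytes >= maxUncleanableBytes) {
                    logDirOffline = true; // threshold reached: mark the log dir offline
                }
            }
        }
        return cleaned;
    }

    boolean isUncleanable(String partition) { return uncleanable.contains(partition); }

    boolean isLogDirOffline() { return logDirOffline; }
}
```

In this sketch a failing partition no longer stops compaction of the others; only when the accumulated uncleanable bytes reach the configured threshold is the whole directory treated as offline.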

Compatibility, Deprecation, and Migration Plan

...

  • Restart `CleanerThread` - it will most likely hit the same problem again before it is able to compact more.
  • Mark disk volumes as "uncleanable" on the first encountered error. While this would work, in practice it would not help, as most deployments use a single volume. Also, if the error is caused by a bug in the partition itself (as shown by most JIRA issues in the Motivation paragraph), this will unnecessarily stop compaction of all other partitions.
  • Mark log directories as offline after a certain threshold of uncleanable bytes or number of uncleanable partitions. An uncleanable-partitions threshold proved insufficient, since previously encountered problems affected only a small number of partitions (the __consumer_offsets topic). A threshold of uncleanable bytes is hard to get right, as it should be different for each user, and the default value would best be -1 (disabled).