Status

Current state: "Accepted"

Discussion thread: here

JIRA: KAFKA-7215

Released: 2.1

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

Then again, these improvements still require manual intervention or at the very least complex infrastructure code that automates the process.
It would be very useful if Kafka had a way to quarantine unexpected failures in certain logs such that they don't affect the cleaning of other logs. While this would not fix the issue, it would significantly slow down the process and provide users with adequate time for detection and repair.

Public Interfaces

New metrics:

  • `uncleanable-partitions-count` (Int) - Count of partitions that are uncleanable per logDir
  • `uncleanable-bytes` (Long) - The current number of uncleanable bytes per logDir. This is the sum of uncleanable bytes for every uncleanable partition in a certain log directory

New broker config value:

  • `max.uncleanable.partitions` - the maximum number of uncleanable partitions a single LogDir can have before it is marked as offline. Default value is 10
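As a rough illustration of how the two per-logDir metrics could be derived, the sketch below keeps a map of uncleanable partitions per log directory and computes the count and the byte sum from it. The class `UncleanableMetrics` and its method names are hypothetical and are not taken from the Kafka code base; only the metric names come from this proposal.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not Kafka's actual implementation.
final class UncleanableMetrics {
    // logDir -> (partition name -> uncleanable bytes of that partition)
    private final Map<String, Map<String, Long>> uncleanable = new HashMap<>();

    // Record (or update) a partition as uncleanable in the given log directory.
    void markUncleanable(String logDir, String partition, long uncleanableBytes) {
        uncleanable.computeIfAbsent(logDir, d -> new HashMap<>())
                   .put(partition, uncleanableBytes);
    }

    // Value that would back `uncleanable-partitions-count` for one logDir.
    int uncleanablePartitionsCount(String logDir) {
        return uncleanable.getOrDefault(logDir, Collections.emptyMap()).size();
    }

    // Value that would back `uncleanable-bytes` for one logDir:
    // the sum of uncleanable bytes over every uncleanable partition in it.
    long uncleanableBytes(String logDir) {
        long total = 0L;
        for (long bytes : uncleanable.getOrDefault(logDir, Collections.emptyMap()).values()) {
            total += bytes;
        }
        return total;
    }
}
```

A directory with no marked partitions reports zero for both values, which keeps the metrics well defined before any failure has occurred.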

Proposed Changes

Catch any unexpected (non-IO) exceptions in `CleanerThread#cleanOrSleep()`.

...

When evaluating which logs to compact, skip those marked as uncleanable.
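The quarantine idea can be sketched roughly as follows. This is a simplified model, not the actual `CleanerThread` code: the class `QuarantiningCleaner`, the use of a `Consumer<String>` as the cleaning task, and the use of `UncheckedIOException` to stand in for IO failures are all assumptions made for illustration.

```java
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical model of the proposed behaviour; not Kafka's actual implementation.
final class QuarantiningCleaner {
    private final Set<String> uncleanable = new HashSet<>();

    // Attempts to clean each partition and returns those cleaned successfully.
    List<String> cleanAll(List<String> partitions, Consumer<String> cleanTask) {
        List<String> cleaned = new ArrayList<>();
        for (String partition : partitions) {
            if (uncleanable.contains(partition)) {
                continue; // skip partitions already marked as uncleanable
            }
            try {
                cleanTask.accept(partition);
                cleaned.add(partition);
            } catch (UncheckedIOException e) {
                throw e; // IO failures are not quarantined in this sketch
            } catch (RuntimeException e) {
                // Unexpected (non-IO) failure: quarantine only this partition
                // so that cleaning of the remaining partitions continues.
                uncleanable.add(partition);
            }
        }
        return cleaned;
    }

    Set<String> uncleanablePartitions() {
        return Collections.unmodifiableSet(uncleanable);
    }
}
```

The design point is that one failing partition no longer stops the loop: it is recorded and skipped on later passes, while every other partition keeps being compacted.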

Introduce a new broker configuration value, `max.uncleanable.partitions`. When the number of marked partitions reaches this threshold, mark the disk they are on as offline, since this most likely indicates a problem with the disk itself.
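A minimal sketch of the threshold check, assuming the count is tracked per log directory and that "reach" means greater than or equal to the configured value. The class `UncleanableThreshold` is hypothetical; only the config name `max.uncleanable.partitions` comes from this proposal.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of the `max.uncleanable.partitions` check.
final class UncleanableThreshold {
    private final int maxUncleanablePartitions;
    // logDir -> names of partitions marked as uncleanable in that directory
    private final Map<String, Set<String>> uncleanableByDir = new HashMap<>();

    UncleanableThreshold(int maxUncleanablePartitions) {
        this.maxUncleanablePartitions = maxUncleanablePartitions;
    }

    // Marks the partition and returns true when its log directory has reached
    // the threshold and should therefore be marked as offline.
    boolean markUncleanable(String logDir, String partition) {
        Set<String> marked = uncleanableByDir.computeIfAbsent(logDir, d -> new HashSet<>());
        marked.add(partition); // a Set ignores repeated marks of the same partition
        return marked.size() >= maxUncleanablePartitions;
    }
}
```

Counting distinct partitions rather than failures means that repeated errors on the same partition do not push a directory over the threshold.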

Compatibility, Deprecation, and Migration Plan

...

  • Restart `CleanerThread` - it will most likely hit the same problem again before it is able to compact more
  • Mark disk volumes as "uncleanable" on first encountered error. While this would work, in practice it would not help as most deployments use a single volume. Also, if the error is caused by a bug in the partition itself (as shown by most JIRA issues in the Motivation paragraph), this will unnecessarily stop compaction of all other partitions.
  • Mark log directories as offline after a certain threshold of uncleanable bytes or number of uncleanable partitions. A threshold on uncleanable partitions proved insufficient, since previously encountered problems affected only a small number of partitions (the `__consumer_offsets` topic). A threshold on uncleanable bytes is hard to get right, as it should be different for each user, and the default value should best be -1 (disabled).