
Status

Current state"Under DiscussionAccepted"

Discussion thread: here

JIRA: KAFKA-7215

Released: 2.1.0

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Historically, there have been numerous issues where log compaction has failed for some reason, most commonly bugs in the code (KAFKA-3330, KAFKA-6264, KAFKA-6834, KAFKA-6854 and KAFKA-6762, to name some).

Currently, during log compaction, if the compaction of one log fails unexpectedly, the whole `CleanerThread` responsible for compacting and deleting old logs exits and is not automatically restarted at any point. This results in a Kafka broker that runs seemingly fine but does not delete old log segments at all. This makes the broker a ticking time bomb: it is only a matter of time until the broker runs out of disk space, and then all sorts of fatal scenarios ensue.

The situation has been improving - we have a metric showing the time since the last run of the `CleanerThread` (`kafka.log:type=LogCleanerManager,name=time-since-last-run-ms`) and Kafka 1.1 (KIP-226) provided functionality allowing us to restart the log cleaner thread without restarting the broker.

Then again, these improvements still require manual intervention or, at the very least, complex infrastructure code that automates the process.
It would be very useful if Kafka had a way to quarantine these unexpected failures in certain logs such that they don't affect the cleaning of other logs. While this would not fix the underlying issue, it would significantly slow down the rate at which disk space is exhausted and give users adequate time for detection and repair.

Public Interfaces

Two new metrics:

  • `uncleanable-partitions-count` (Int) - Count of partitions that are uncleanable per logDir
  • `uncleanable-bytes` (Long) - The current number of uncleanable bytes per logDir. This is the sum of uncleanable bytes for every uncleanable partition in a certain log directory (see the sketch below)
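
A minimal sketch of how these per-logDir gauges could be backed, assuming a hypothetical tracker of uncleanable partitions. The class and method names below are illustrative only, not the actual `LogCleanerManager` internals:

```scala
import scala.collection.mutable

// Illustrative tracker: logDir -> (topicPartition -> uncleanable bytes).
class UncleanableTracker {
  private val uncleanable = mutable.Map.empty[String, mutable.Map[String, Long]]

  def markUncleanable(logDir: String, partition: String, bytes: Long): Unit = synchronized {
    uncleanable.getOrElseUpdate(logDir, mutable.Map.empty).put(partition, bytes)
  }

  // Would back the `uncleanable-partitions-count` gauge for one log directory
  def uncleanablePartitionsCount(logDir: String): Int = synchronized {
    uncleanable.get(logDir).map(_.size).getOrElse(0)
  }

  // Would back the `uncleanable-bytes` gauge: the sum of uncleanable bytes
  // across every uncleanable partition in the log directory
  def uncleanableBytes(logDir: String): Long = synchronized {
    uncleanable.get(logDir).map(_.values.sum).getOrElse(0L)
  }
}
```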

Proposed Changes

Catch any unexpected (non-IO) exceptions in `CleanerThread#cleanOrSleep()`.
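
A minimal sketch of the intended behaviour, assuming hypothetical helpers (`clean`, `markUncleanable`) rather than the real `LogCleaner` internals: a non-IO exception thrown while cleaning one log quarantines only that partition and lets the thread keep running.

```scala
import java.io.IOException

// Illustrative only: not the actual CleanerThread code.
object CleanerLoopSketch {
  // One cleaning attempt for a single log
  def cleanQuarantiningFailures(
      topicPartition: String,
      clean: String => Unit,           // hypothetical: compacts the given partition's log
      markUncleanable: String => Unit  // hypothetical: records the partition as uncleanable
  ): Unit = {
    try clean(topicPartition)
    catch {
      case e: IOException =>
        // IO errors keep their existing handling (the whole log directory goes offline)
        throw e
      case e: Exception =>
        // Any other unexpected error quarantines only this partition;
        // the CleanerThread itself stays alive and moves on to other logs.
        markUncleanable(topicPartition)
        System.err.println(s"Unexpected exception cleaning $topicPartition, marked uncleanable: $e")
    }
  }
}
```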

...

When evaluating which logs to compact, skip the ones marked as uncleanable.
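
A sketch of the selection step, again with illustrative names only: partitions previously marked as uncleanable are filtered out before picking the next (filthiest) log to compact.

```scala
// Illustrative only: not the actual log-selection implementation.
object LogSelectionSketch {
  final case class CandidateLog(topicPartition: String, dirtyBytes: Long)

  def nextLogToClean(
      candidates: Seq[CandidateLog],
      uncleanablePartitions: Set[String]): Option[CandidateLog] =
    candidates
      .filterNot(c => uncleanablePartitions.contains(c.topicPartition)) // skip quarantined partitions
      .sortBy(-_.dirtyBytes)                                            // most uncleaned bytes first
      .headOption
}
```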

Needs Discussion

  • A metric that tracks the overall uncleanable bytes seems like it would be useful. I am not sure how easy that is to implement, and I wonder whether that functionality (fetching log segments and determining their size) could cause additional errors.
  • Should said log directories be marked as "offline log directories", thereby stopping replicas from fetching said partitions?
  • Should we mark disk partitions as offline after a certain number of `IOException`s are caught, as they imply that something might be wrong with the disk?

Compatibility, Deprecation, and Migration Plan

The "time-since-last-run" metric will slightly change its behavior, since the LogCleaner will now continue to run once it encounters an error. Previous implementations that track the "time-since-last-run" metric for potential disk failures might be affected, but at least disk damage is maximally mitigated by marking the log directory as offline. If all log directories are offline, "time-since-last-run" will not be updatedThis KIP should have no compatibility issues.

Rejected Alternatives


  • Restart `CleanerThread` - it will most likely hit the same problem again before it is able to compact more
  • Mark disk volumes as "uncleanable" on first encountered error. While this would work, in practice it would not help as most deployments use a single volume. Also, if the error is caused by a bug in the partition itself (as shown by most JIRA issues in the Motivation paragraph), this will unnecessarily stop compaction of all other partitions.
  • Mark log directories as offline after a certain threshold of uncleanable bytes or number of uncleanable partitions - a threshold on the number of uncleanable partitions proved insufficient, since previously encountered problems affected only a small number of partitions (e.g. the __consumer_offsets topic), while a threshold on uncleanable bytes is hard to get right, as it should be different for each user and the default value would best be -1 (disabled)