Status

Current state: "Under Discussion"

...

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Historically, there have been numerous issues where log compaction has failed for some reason, most commonly bugs in code (KAFKA-3330, KAFKA-6264, KAFKA-6834, KAFKA-6854, to name some).

...

It would be very useful if Kafka had a way to quarantine these unexpected failures in certain logs so that they don't affect the cleaning of other logs. Although this would not fix the underlying issue, it would significantly slow down the process and give users adequate time to detect and repair it.

Public Interfaces

Two New Metrics:

`uncleanable-partitions-count` (Int) - Count of partitions that are uncleanable
`uncleanable-partitions` (String) - Comma-separated names of the partitions that are uncleanable. Example: "2,3,4"
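
As a rough illustration of how the two values could be derived from a single set of marked partitions, here is a minimal sketch. The class `UncleanablePartitionMetrics` and its methods are hypothetical and not part of Kafka, and partitions are assumed to be identified by integer ids to match the "2,3,4" example above. Registration with Kafka's metrics system is deliberately left out.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.stream.Collectors;

// Hypothetical helper (not a Kafka class): keeps the set of partitions that
// were marked uncleanable and derives both proposed metric values from it.
final class UncleanablePartitionMetrics {
    // Sorted, thread-safe set so the comma-separated value is stable.
    // Assumption: partitions are identified by integer ids.
    private final Set<Integer> uncleanable = new ConcurrentSkipListSet<>();

    void markUncleanable(int partition) {
        uncleanable.add(partition);
    }

    // Value for `uncleanable-partitions-count`.
    int uncleanablePartitionsCount() {
        return uncleanable.size();
    }

    // Value for `uncleanable-partitions`: ids joined by commas, ascending.
    String uncleanablePartitions() {
        return uncleanable.stream()
                .map(id -> Integer.toString(id))
                .collect(Collectors.joining(","));
    }
}
```

With partitions 3, 2 and 4 marked (in any order, duplicates ignored), `uncleanablePartitionsCount()` returns 3 and `uncleanablePartitions()` returns "2,3,4".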

Proposed Changes

Catch any unexpected exceptions in `CleanerThread#cleanOrSleep()`.

...

When evaluating which logs to compact, skip the marked ones.
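
The catch-and-skip behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `CleanerThread` code: `QuarantiningCleaner`, `cleanAll`, and the `Consumer<String>` standing in for the real compaction routine are all hypothetical names, logs are identified by plain strings, and marking a log on failure is assumed to be the step that makes it "marked".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch (not Kafka code): one failing log is quarantined
// instead of stopping the cleaning of every other log.
final class QuarantiningCleaner {
    private final Set<String> uncleanable = new LinkedHashSet<>();
    private final Consumer<String> cleanLog; // stand-in for the real compaction

    QuarantiningCleaner(Consumer<String> cleanLog) {
        this.cleanLog = cleanLog;
    }

    // Runs one cleaning pass and returns the logs that were cleaned.
    List<String> cleanAll(List<String> logs) {
        List<String> cleaned = new ArrayList<>();
        for (String log : logs) {
            if (uncleanable.contains(log)) {
                continue; // skip logs marked by an earlier failure
            }
            try {
                cleanLog.accept(log);
                cleaned.add(log);
            } catch (RuntimeException e) {
                // Unexpected failure: mark this log and keep going.
                uncleanable.add(log);
            }
        }
        return cleaned;
    }

    Set<String> uncleanableLogs() {
        return Collections.unmodifiableSet(uncleanable);
    }
}
```

In this sketch a marked log stays quarantined until the process restarts; whether and how a log should be unmarked is not addressed here.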

Needs Discussion

A metric that tracks the overall uncleanable bytes seems like it would be useful. I am not sure how easy that is to implement, and I wonder if that functionality (fetching log segments and determining their size) could cause additional errors.
Should said log directories be marked as "offline log directories", thereby stopping replicas from fetching said partitions?
Should we mark disk partitions as offline after a certain number of `IOException`s are caught? (as they imply that something might be wrong with the disk)

Compatibility, Deprecation, and Migration Plan

This KIP should have no compatibility issues.

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.

...
