
Status

Current state: Under Discussion

...

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

On some occasions we detected errors where replica fetcher threads or log cleaners died because of an unrecoverable error and caused more serious issues in the brokers (from lagging to offline replicas, filling up disks, etc.). It would often help if the monitoring systems attached to Kafka could detect these problems early, as that would allow a prompt response from the user and a greater chance of capturing the root cause.

...

The motivation for the log cleaner thread count metric is very similar. Sometimes a problem with the log cleaner threads only gets noticed when a disk fills up: a log cleaner died earlier because of some issue that prevented cleanup, and it was never restarted. A metric for the thread count would help because monitoring systems could trigger alerts based on it, and it would also be easier to find out the exact time the issue occurred.

Public Interfaces

I propose adding three gauges: AliveFetcherThreadCount for the fetcher threads, and log-cleaner-thread-count and log-cleaner-current-live-thread-rate for the log cleaner. All of these are broker-level metrics.

AliveFetcherThreadCount: this exposes the size of the internal thread map in AbstractFetcherManager.
log-cleaner-thread-count: this would expose the size of the ArrayBuffer that maintains the CleanerThread instances in LogCleaner. This can show the actual live count of the threads.
log-cleaner-current-live-thread-rate: this is a variation of the above. Dividing the alive thread count by the configured number of cleaner threads gives a ratio that is more suitable as a basis for alerts.
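
To make the two log cleaner gauges concrete, here is a minimal sketch in plain Java. It is not the actual LogCleaner code: the class CleanerThreadGauges, its methods, and the use of java.util.function.Supplier in place of a real metrics gauge are all hypothetical names chosen for illustration. The sketch also assumes that liveness is checked with Thread.isAlive() instead of taking the size of the collection, since a dead thread could otherwise still be counted.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Supplier;

/** Hypothetical stand-in for the cleaner thread bookkeeping; not Kafka's LogCleaner. */
final class CleanerThreadGauges {
    // Threads that were created as cleaners (assumed to mirror the ArrayBuffer in LogCleaner).
    private final List<Thread> cleaners = new CopyOnWriteArrayList<>();
    // Number of cleaner threads requested by configuration.
    private final int configuredThreads;

    CleanerThreadGauges(int configuredThreads) {
        if (configuredThreads <= 0) {
            throw new IllegalArgumentException("configuredThreads must be positive");
        }
        this.configuredThreads = configuredThreads;
    }

    void register(Thread cleaner) {
        cleaners.add(cleaner);
    }

    /** Value for log-cleaner-thread-count: threads that are started and not yet terminated. */
    int aliveThreadCount() {
        int alive = 0;
        for (Thread t : cleaners) {
            if (t.isAlive()) {
                alive++;
            }
        }
        return alive;
    }

    /** Value for log-cleaner-current-live-thread-rate: alive threads divided by configured threads. */
    double liveThreadRate() {
        return (double) aliveThreadCount() / configuredThreads;
    }

    // Suppliers are evaluated on every read, which is how a gauge is expected to behave.
    Supplier<Integer> threadCountGauge() {
        return this::aliveThreadCount;
    }

    Supplier<Double> liveThreadRateGauge() {
        return this::liveThreadRate;
    }
}
```

With two configured cleaner threads of which one has died, the count gauge would read 1 and the rate gauge 0.5, so an alert on "rate below 1.0" would work regardless of how many cleaner threads a given broker is configured with.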

Proposed Changes

There would be no changes beyond those listed in the previous section.

Compatibility, Deprecation, and Migration Plan

No metrics will be removed or deprecated, and no migration is required.

Test Plan

It is possible to write an automated test that sets up a few brokers, injects failures into the observed threads (or simply interrupts them), and observes the resulting changes in the metrics.
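
The core of such a test can be sketched without any brokers. The example below is only an illustration under stated assumptions: ThreadDeathProbe and its methods are hypothetical, the workers are plain Java threads standing in for fetcher or cleaner threads, and the "metric" is a direct count of live threads instead of a value read from a metrics registry.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical test helper; the workers stand in for fetcher or cleaner threads. */
final class ThreadDeathProbe {

    /** Counts threads that are started and not yet terminated. */
    static int aliveCount(List<Thread> threads) {
        int alive = 0;
        for (Thread t : threads) {
            if (t.isAlive()) {
                alive++;
            }
        }
        return alive;
    }

    /** Starts n daemon workers that block until they are interrupted. */
    static List<Thread> startWorkers(int n) {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Thread t = new Thread(() -> {
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException e) {
                    // The interrupt plays the role of the injected failure: the worker exits.
                }
            }, "worker-" + i);
            t.setDaemon(true);
            t.start();
            threads.add(t);
        }
        return threads;
    }

    /** Interrupts the first worker, waits for it to exit, and returns the new alive count. */
    static int killOneAndCount(List<Thread> threads) {
        Thread victim = threads.get(0);
        victim.interrupt();
        try {
            victim.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return aliveCount(threads);
    }
}
```

A real test would replace the direct count with a read of the proposed gauges and assert that the value drops by one after the failure is injected.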

Rejected Alternatives

No rejected alternatives so far.