Status

Current state: Accepted

Discussion thread: here

JIRA: KAFKA-7981

Released: AK 2.4.0

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

...

In this case, the first thing users usually notice is that replicas start lagging, and it takes considerable time before they conclude that something killed the replica fetchers. The causes range from bugs to rare log divergence issues. A metric for dead replica fetchers would speed up the investigation of such issues, because monitoring systems attached to Kafka could record the exact time the problem occurred. In a small number of these cases, application logs roll over too often, so the real root cause of the issue remains unknown. This metric would allow users to trigger alerts when its value changes.

Log Directory Fetchers

Similarly to the replica fetchers, a log dir fetcher might get unexpectedly interrupted while log directories are being altered. In this case the alteration will not finish, and users might have to do some digging to figure out what happened. Introducing a metric named DeadLogDirFetcherThreadCount for tracking the dead fetcher threads would speed up diagnostics.

Log Cleaners

The motivation for the dead log cleaner thread count metric is very similar. Sometimes a problem with the log cleaner threads gets noticed only when a disk fills up: a log cleaner died earlier because of some issue that prevented cleanup, and it was never restarted. A metric for the dead thread count would help because monitoring systems could trigger alerts based on it, and it would also be easier to find out the exact time of the issue.

Public Interfaces

I propose to add two gauges: DeadReplicaFetcherThreadCount for the fetcher threads and log-cleaner-dead-thread-count for the log cleaner. Both are broker-level metrics.

DeadFetcherThreadCount: this exposes the count of non-alive threads in the internal thread map in AbstractFetcherManager. Its clientId tag could be either Fetcher or ReplicaAlterLogDirs, so the two metrics are distinguishable.
log-cleaner-dead-thread-count: this would expose the number of dead threads inside the ArrayBuffer that maintains the CleanerThread instances in LogCleaner.
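
To make the idea behind DeadFetcherThreadCount concrete, here is a minimal Java sketch of a gauge that counts the non-alive threads in a thread map. It is an illustration only, not Kafka's code: the class DeadThreadGaugeSketch, its methods, and the thread names are hypothetical, and the real AbstractFetcherManager is written in Scala and registers its gauges through Kafka's metrics group. The sketch assumes that "dead" simply means a registered thread for which Thread.isAlive() returns false.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a fetcher manager; not part of Kafka.
class DeadThreadGaugeSketch {
    // Plays the role of the internal thread map described above.
    private final Map<String, Thread> threads = new ConcurrentHashMap<>();

    // Registers a thread under the given id and starts it right away.
    void addAndStart(String id, Runnable work) {
        Thread t = new Thread(work, id);
        // Swallow uncaught exceptions so the demo does not print stack traces.
        t.setUncaughtExceptionHandler((thread, error) -> { });
        threads.put(id, t);
        t.start();
    }

    // Gauge value: how many registered threads are no longer alive.
    int deadThreadCount() {
        int dead = 0;
        for (Thread t : threads.values()) {
            if (!t.isAlive()) {
                dead++;
            }
        }
        return dead;
    }

    // Waits until the thread registered under the given id has terminated.
    void join(String id) {
        try {
            threads.get(id).join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns the gauge value before and after one thread dies unexpectedly.
    static int[] demo() {
        DeadThreadGaugeSketch manager = new DeadThreadGaugeSketch();
        CountDownLatch stop = new CountDownLatch(1);

        // A healthy thread that keeps running until it is told to stop.
        manager.addAndStart("fetcher-0", () -> {
            try {
                stop.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        int before = manager.deadThreadCount();

        // A thread that is killed by a simulated bug.
        manager.addAndStart("fetcher-1", () -> {
            throw new IllegalStateException("simulated failure");
        });
        manager.join("fetcher-1");
        int after = manager.deadThreadCount();

        // Let the healthy thread finish so the JVM can exit cleanly.
        stop.countDown();
        manager.join("fetcher-0");
        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] counts = demo();
        System.out.println("dead threads before failure: " + counts[0]);
        System.out.println("dead threads after failure: " + counts[1]);
    }
}
```

A monitoring system polling such a gauge would see the value change at the moment the thread dies, which is what makes alerting and pinpointing the time of the failure possible. The same counting approach would apply to the collection of CleanerThread instances for the log cleaner metric.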

...