...

In this case the first thing users usually notice is that replicas start lagging, and it takes considerable time before they conclude that something killed the replica fetchers. The causes range from bugs to rare log divergence issues. A metric for dead replica fetchers would speed up the investigation of any such issue, because monitoring systems attached to Kafka could record the exact time the issue arose. In a relatively small number of these cases, application logs roll over too often, so the real root cause of the issue remains unknown. This metric would also allow users to trigger alerts when its value changes.

Log Directory Fetchers

Similarly to replica fetchers, a log dir fetcher can be unexpectedly interrupted while log directories are being altered. In that case the operation never finishes, and users may have to do some digging to figure out what happened. Introducing a metric named DeadLogDirFetcherThreadCount to track dead fetcher threads would speed up diagnostics.

Log Cleaners

The motivation for the dead log cleaner thread count metric is very similar. Sometimes a problem with the log cleaner threads only gets noticed when a disk fills up: a log cleaner died earlier because of some issue, which prevented cleanup, and it was never restarted. A metric for the dead thread count would help because monitoring systems could trigger alerts based on it, and it would be easier to find out exactly when the issue occurred.

...

I propose to add two gauges: DeadFetcherThreadCount for the fetcher threads and log-cleaner-dead-thread-count for the log cleaner. Both are broker-level metrics.

DeadFetcherThreadCount: exposes the count of non-alive threads in the internal thread map in AbstractFetcherManager. Its clientId tag can be either Fetcher or ReplicaAlterLogDirs, so the two metrics are distinguishable.
log-cleaner-dead-thread-count: exposes the number of dead threads in the ArrayBuffer that holds the CleanerThread instances in LogCleaner.
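
The idea behind both gauges can be illustrated with a minimal Java sketch. This is not the broker implementation: the class DeadThreadCountSketch, its methods, and the map keys are hypothetical names chosen for illustration, and the sketch assumes that "dead" simply means Thread.isAlive() returns false for a thread that was started earlier.

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.function.Supplier;

// Hypothetical illustration only; not part of Kafka.
public class DeadThreadCountSketch {

    // Counts threads that are no longer alive. Thread.isAlive() is also false
    // for a thread that was never started, so this assumes every thread in the
    // collection has already been started.
    public static int countDead(Collection<? extends Thread> threads) {
        int dead = 0;
        for (Thread t : threads) {
            if (!t.isAlive()) {
                dead++;
            }
        }
        return dead;
    }

    // A gauge-like supplier: the count is recomputed every time it is read,
    // so a monitoring system always sees the current number of dead threads.
    public static Supplier<Integer> deadThreadGauge(Map<String, ? extends Thread> threadMap) {
        return () -> countDead(threadMap.values());
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        Map<String, Thread> fetchers = new ConcurrentHashMap<>();

        // One thread exits immediately, standing in for a fetcher that died.
        Thread died = new Thread(() -> { });
        // The other blocks until released, standing in for a healthy fetcher.
        Thread healthy = new Thread(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fetchers.put("fetcher-0", died);
        fetchers.put("fetcher-1", healthy);
        died.start();
        healthy.start();
        died.join();

        Supplier<Integer> gauge = deadThreadGauge(fetchers);
        System.out.println("dead threads: " + gauge.get());

        release.countDown();
        healthy.join();
    }
}
```

Because the value is computed on read instead of being incremented when a thread fails, the gauge cannot drift out of sync with the thread collection, and it drops back as soon as a dead thread is removed or replaced.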

...