...

In this case the first thing users usually notice is that replicas start lagging, and it takes considerable time before they conclude that something has killed the replica fetchers. The causes range from bugs to rare log divergence issues. A metric for dead replica fetchers would speed up the investigation of such issues, because monitoring systems attached to Kafka could record the exact time the problem arose. In a small number of these cases the application logs roll over too quickly, so the real root cause remains unknown. This metric would also allow users to trigger alerts whenever its value changes.

Log Cleaners

The motivation for the dead log cleaner thread count metric is very similar. A problem with the log cleaner threads is sometimes only noticed when a disk fills up: a log cleaner died earlier because of some issue that prevented cleanup, and it was never restarted. A metric for the dead thread count would help because monitoring systems could trigger alerts based on it, and it would also make it easier to pinpoint the exact time of the failure.

...

I propose to add three gauges: DeadFetcherThreadCount for the fetcher threads, and dead-log-cleaner-thread-count and log-cleaner-current-live-thread-rate for the log cleaner. All of these are broker-level metrics.
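As a rough sketch of what such a broker-level gauge could look like, the snippet below registers a DeadFetcherThreadCount gauge directly against the Yammer metrics registry that Kafka brokers use. The fetcherThreadMap value and the MetricName group/type strings are hypothetical stand-ins for illustration, not the actual names inside AbstractFetcherManager.

import scala.collection.mutable

import com.yammer.metrics.Metrics
import com.yammer.metrics.core.{Gauge, MetricName}

object DeadFetcherGaugeSketch {
  // Hypothetical stand-in for the internal fetcher thread map kept by AbstractFetcherManager
  // (keyed by broker/fetcher id, values are the fetcher threads).
  val fetcherThreadMap = mutable.Map.empty[String, Thread]

  // Broker-level gauge: the value is recomputed every time the registry is polled, so an
  // attached reporter records the exact moment fetcher threads start dying.
  val deadFetcherThreadCount: Gauge[Int] = Metrics.newGauge(
    new MetricName("kafka.server", "ReplicaFetcherManager", "DeadFetcherThreadCount"),
    new Gauge[Int] {
      override def value: Int = fetcherThreadMap.values.count(t => !t.isAlive)
    })
}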

DeadFetcherThreadCount: this exposes the count of non-alive threads in the internal thread map in AbstractFetcherManager.
dead-log-cleaner-thread-count: this exposes the difference between the number of configured threads and the size of the ArrayBuffer that maintains the CleanerThread instances in LogCleaner.
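A similar sketch for the log cleaner side, computing the gauge as described above: the configured thread count minus the size of the buffer holding the CleanerThread instances. The configuredCleanerThreads and cleaners names and the MetricName coordinates are hypothetical placeholders, not LogCleaner's actual internals.

import scala.collection.mutable.ArrayBuffer

import com.yammer.metrics.Metrics
import com.yammer.metrics.core.{Gauge, MetricName}

object DeadLogCleanerGaugeSketch {
  // Hypothetical stand-ins: the configured cleaner thread count and the buffer in which
  // LogCleaner keeps its CleanerThread instances.
  val configuredCleanerThreads = 2
  val cleaners = ArrayBuffer.empty[Thread]

  // dead-log-cleaner-thread-count: configured threads minus the threads still tracked.
  val deadLogCleanerThreadCount: Gauge[Int] = Metrics.newGauge(
    new MetricName("kafka.log", "LogCleaner", "dead-log-cleaner-thread-count"),
    new Gauge[Int] {
      override def value: Int = configuredCleanerThreads - cleaners.size
    })
}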

...