Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Flink currently exposes the following GC related metrics:

Scope: Job-/TaskManager
Infix: Status.JVM.GarbageCollector

Metric | Description | Type
<GarbageCollector>.Count | The total number of collections that have occurred. | Gauge
<GarbageCollector>.Time | The total time spent performing garbage collection. | Gauge

Unfortunately, these are not very useful for monitoring purposes because they require post-processing logic that also depends on the current runtime environment:

  • Total time is not very relevant for long-running applications; only its rate of change matters (millisPerSec or similar)
  • In most cases it's best to simply aggregate the time/count across the different GarbageCollectors, but the specific collectors depend on the current Java runtime
  • It's impossible to detect long GC pauses that may cause heartbeat timeouts

We propose to improve the current situation by:

  • Exposing rate metrics per GarbageCollector
  • Exposing aggregated Total time/count/rate metrics
  • Exposing an average GC time metric for the current measurement window (the last 1 minute)

These new metrics are all derived from the existing ones with minimal overhead.
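
As a rough illustration of how the aggregated totals could be derived without knowing which collectors the runtime provides, the sketch below sums the counters reported by the standard JMX GarbageCollectorMXBean API. The GcTotals class and its method names are hypothetical and are not part of Flink; how Flink wires these values into its metric groups is not shown here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical helper (not Flink code): sums GC counters across all collectors. */
final class GcTotals {
    final long count;
    final long timeMs;

    GcTotals(long count, long timeMs) {
        this.count = count;
        this.timeMs = timeMs;
    }

    /** Sums collection count and time over the given collectors. */
    static GcTotals of(List<GarbageCollectorMXBean> collectors) {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean gc : collectors) {
            // The JMX API returns -1 when a value is undefined for a collector,
            // so negative values are treated as zero.
            count += Math.max(0, gc.getCollectionCount());
            timeMs += Math.max(0, gc.getCollectionTime());
        }
        return new GcTotals(count, timeMs);
    }

    /** Reads the totals for the collectors of the running JVM. */
    static GcTotals current() {
        return of(ManagementFactory.getGarbageCollectorMXBeans());
    }
}
```

Because the loop iterates over whatever collectors the JVM reports, the same code yields a total regardless of the garbage collector in use.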

Public Interfaces / Proposed Changes

New GC metrics (in addition to the existing ones)

Scope: Job-/TaskManager
Infix: Status.JVM.GarbageCollector

Metric | Description | Type
<GarbageCollector>.TimeMsPerSec | Milliseconds spent performing garbage collection per second. | Meter
<GarbageCollector>.AverageTime | Average collection time in the current metric window. Delta(Time) / Delta(Count) | Gauge
Total.Time | The total time spent performing garbage collection across all collectors. | Gauge
Total.TimeMsPerSec | Milliseconds spent performing garbage collection per second across all collectors. | Meter
Total.AverageTime | Average collection time in the current metric window across all collectors. Delta(Time) / Delta(Count) | Gauge
Total.Count | The total number of collections that have occurred across all collectors. | Gauge
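
The derived values can be pictured as simple arithmetic over two samples of the existing Time and Count gauges taken at the start and end of the measurement window. The following is a minimal sketch under that assumption; the GcWindow class, its method names, and the choice to report 0 for an empty window are illustrative and not taken from the Flink code base.

```java
/** Hypothetical sketch (not Flink code): derives windowed GC metrics from two samples. */
final class GcWindow {

    /** Milliseconds of GC per second of wall-clock time between two samples. */
    static double timeMsPerSec(long prevTimeMs, long currTimeMs, long windowMs) {
        if (windowMs <= 0) {
            return 0.0; // assumption: report 0 for an empty or invalid window
        }
        return (currTimeMs - prevTimeMs) * 1000.0 / windowMs;
    }

    /** Average collection time in the window: Delta(Time) / Delta(Count). */
    static double averageTime(long prevTimeMs, long currTimeMs, long prevCount, long currCount) {
        long deltaCount = currCount - prevCount;
        if (deltaCount <= 0) {
            return 0.0; // assumption: no collections in the window yields 0
        }
        return (double) (currTimeMs - prevTimeMs) / deltaCount;
    }
}
```

For example, if Time rose from 1000 ms to 1600 ms and Count from 40 to 52 over a 60-second window, TimeMsPerSec would be 10 and AverageTime would be 50 ms. A high AverageTime alongside a modest TimeMsPerSec would point to a few long pauses rather than many short ones.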


Compatibility, Deprecation, and Migration Plan

These are new metrics added alongside the existing ones, so there is no impact on users.
