Discussion thread: https://lists.apache.org/thread/qqqv54vyr4gbp63wm2d12q78m8h95xb2
Vote thread: https://lists.apache.org/thread/4gx6xv32zxdqkb2p9fdc1vdd66vq1gqw
JIRA

Release: 1.19

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Flink currently exposes the following GC-related metrics:

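Scope | Infix | Metrics | Description | Type
Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.Count | The total number of collections that have occurred. | Gauge
Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.Time | The total time spent performing garbage collection. | Gauge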

Unfortunately, these are not very useful for monitoring purposes, because they require post-processing logic that also depends on the current runtime environment:

  • Total time is not very relevant for long-running applications; only the rate of change (millisPerSec or similar) is.
  • In most cases it is best to simply aggregate the time/count across the different GarbageCollectors; however, the specific collectors depend on the current Java runtime.
  • It is impossible to detect long GC pauses that may cause heartbeat timeouts.

We propose to improve the current situation by:

  • Exposing rate metrics per GarbageCollector
  • Exposing aggregated Total time/count/rate metrics
  • Exposing an average GC time metric for the current measurement window (last 1 minute)

These new metrics are all derived from the existing ones with minimal overhead.
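
As a rough illustration of how such derived values could be computed from two readings of the existing cumulative Count and Time gauges, consider the minimal sketch below. The class GcWindowMetrics, its Sample type, and the method names are hypothetical and are not part of Flink. Returning 0 when no collection happened in the window is an assumption made here, not something this proposal specifies.

```java
/** Hypothetical helper (not Flink code): derives window metrics from cumulative GC readings. */
final class GcWindowMetrics {

    /** One reading of the existing cumulative gauges: collection count and total time in ms. */
    static final class Sample {
        final long count;
        final long timeMs;

        Sample(long count, long timeMs) {
            this.count = count;
            this.timeMs = timeMs;
        }
    }

    private GcWindowMetrics() {}

    /** Rate of change of GC time: Delta(Time) divided by the window length in seconds. */
    static double timeMsPerSec(Sample prev, Sample curr, long windowSeconds) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("windowSeconds must be positive");
        }
        return (double) (curr.timeMs - prev.timeMs) / windowSeconds;
    }

    /** Average time per collection in the window: Delta(Time) / Delta(Count). */
    static double averageTimeMs(Sample prev, Sample curr) {
        long deltaCount = curr.count - prev.count;
        if (deltaCount <= 0) {
            // Assumption: report 0 when no collection occurred in the window.
            return 0.0;
        }
        return (double) (curr.timeMs - prev.timeMs) / deltaCount;
    }
}
```

For example, with a previous reading of 100 collections and 4000 ms, and a reading one minute later of 112 collections and 4600 ms, timeMsPerSec yields 10.0 and averageTimeMs yields 50.0. Both values need only a subtraction and a division per update, which is consistent with the minimal overhead noted above.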

Public Interfaces / Proposed Changes

New GC metrics (in addition to the existing ones)

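Scope | Infix | Metrics | Description | Type
Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.TimeMsPerSec | Milliseconds spent performing garbage collection per second. | Meter
Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.AverageTime | Average collection time in the current metric window. Delta(Time) / Delta(Count) | Gauge
Job-/TaskManager | Status.JVM.GarbageCollector | All.Time | The total time spent performing garbage collection across all collectors. | Gauge
Job-/TaskManager | Status.JVM.GarbageCollector | All.TimeMsPerSec | Milliseconds spent performing garbage collection per second across all collectors. | Meter
Job-/TaskManager | Status.JVM.GarbageCollector | All.AverageTime | Average collection time in the current metric window across all collectors. Delta(Time) / Delta(Count) | Gauge
Job-/TaskManager | Status.JVM.GarbageCollector | All.Count | The total number of collections that have occurred across all collectors. | Gauge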
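
To illustrate the aggregation behind the All.Count and All.Time values, the following minimal sketch sums the cumulative readings of every collector reported by the JVM's standard GarbageCollectorMXBean interface. The class AllCollectorsSnapshot is hypothetical and only shows the idea; it is not the Flink implementation. Treating an undefined reading (-1) as 0 is an assumption made here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical helper (not Flink code): cumulative GC count and time summed over all collectors. */
final class AllCollectorsSnapshot {
    final long count;
    final long timeMs;

    private AllCollectorsSnapshot(long count, long timeMs) {
        this.count = count;
        this.timeMs = timeMs;
    }

    /** Sums the cumulative values of the given collector beans. */
    static AllCollectorsSnapshot of(List<GarbageCollectorMXBean> beans) {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean bean : beans) {
            // Both getters return -1 when the value is undefined for a collector;
            // assumption: count such readings as 0 so they do not skew the sum.
            count += Math.max(0, bean.getCollectionCount());
            timeMs += Math.max(0, bean.getCollectionTime());
        }
        return new AllCollectorsSnapshot(count, timeMs);
    }

    /** Reads whichever collectors the current Java runtime provides. */
    static AllCollectorsSnapshot current() {
        return of(ManagementFactory.getGarbageCollectorMXBeans());
    }
}
```

Because the sum runs over whatever beans the runtime reports, the result does not depend on the names of the specific collectors, which vary between Java runtimes and GC configurations.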


Compatibility, Deprecation, and Migration Plan

These are new metrics added alongside the existing ones, so there is no impact on existing users.