
Page properties


Discussion thread: https://lists.apache.org/thread/qqqv54vyr4gbp63wm2d12q78m8h95xb2
Vote thread: https://lists.apache.org/thread/4gx6xv32zxdqkb2p9fdc1vdd66vq1gqw
JIRA: FLINK-33120
Release: 1.19 / 2.0


Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Flink currently exposes the following GC-related metrics:

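Scope | Infix | Metric | Description | Type
Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.Count | The total number of collections that have occurred. | Gauge
Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.Time | The total time spent performing garbage collection. | Gauge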

Unfortunately, these are not very useful for monitoring purposes, as they require post-processing logic that also depends on the current runtime environment:

  • Total time is not very relevant for long-running applications; only the rate of change (millisPerSec or similar) matters
  • In most cases it is best to simply aggregate the time/count across the different garbage collectors; however, the specific collectors depend on the current Java runtime
  • It's impossible to detect long GC pauses that may cause heartbeat timeouts

We propose to improve the current situation by:

  • Exposing rate metrics per GarbageCollector
  • Exposing aggregated Total time/count/rate metrics
  • Exposing an average GC time metric for the current measurement window (last 1 minute)

These new metrics are all derived from the existing ones with minimal overhead.
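To illustrate how the aggregated and rate values can be derived from the existing counters, here is a minimal sketch using the standard JMX `GarbageCollectorMXBean` API. The class name `GcTotalsSketch`, the `Snapshot` holder, and the method names are hypothetical and chosen for illustration only; they are not taken from the Flink code base, and the actual implementation may differ.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical illustration, not Flink's implementation.
final class GcTotalsSketch {

    /** Immutable snapshot of the cumulative GC counters at one point in time. */
    static final class Snapshot {
        final long count;
        final long timeMs;

        Snapshot(long count, long timeMs) {
            this.count = count;
            this.timeMs = timeMs;
        }
    }

    /** Sums count and time across all collectors, independent of which collectors the JVM uses. */
    static Snapshot sumAll(List<GarbageCollectorMXBean> beans) {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean bean : beans) {
            // The JMX API returns -1 when a value is undefined for a collector.
            count += Math.max(0, bean.getCollectionCount());
            timeMs += Math.max(0, bean.getCollectionTime());
        }
        return new Snapshot(count, timeMs);
    }

    /** Milliseconds of GC per wall-clock second between two snapshots. */
    static double timeMsPerSec(Snapshot previous, Snapshot current, long elapsedMs) {
        if (elapsedMs <= 0) {
            return 0.0;
        }
        return (current.timeMs - previous.timeMs) * 1000.0 / elapsedMs;
    }

    public static void main(String[] args) {
        Snapshot now = sumAll(ManagementFactory.getGarbageCollectorMXBeans());
        System.out.println("All.Count=" + now.count + " All.Time=" + now.timeMs + "ms");
    }
}
```

Because both values come from counters the JVM already maintains, the only extra work in this sketch is keeping the previous snapshot and doing one subtraction and division per update.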

Public Interfaces / Proposed Changes

New metrics:

...

GC metrics (in addition to the existing ones)

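Scope | Infix | Metric | Description | Type
Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.TimeMsPerSec | Milliseconds spent performing garbage collection per second. | Meter
Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.AverageTime | Average collection time in the current metric window. Delta(Time) / Delta(Count) | Gauge
Job-/TaskManager | Status.JVM.GarbageCollector | All.Time | The total time spent performing garbage collection across all collectors. | Gauge
Job-/TaskManager | Status.JVM.GarbageCollector | All.TimeMsPerSec | Milliseconds spent performing garbage collection per second across all collectors. | Meter
Job-/TaskManager | Status.JVM.GarbageCollector | All.AverageTime | Average collection time in the current metric window across all collectors. Delta(Time) / Delta(Count) | Gauge
Job-/TaskManager | Status.JVM.GarbageCollector | All.Count | The total number of collections that have occurred across all collectors. | Gauge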
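The AverageTime formula, Delta(Time) / Delta(Count), can be sketched as follows. This is only an illustration under stated assumptions: the class `GcAverageTimeSketch` and its methods are hypothetical, the window is rolled explicitly by the caller rather than by a timer, and a window with no collections is reported as 0.

```java
// Hypothetical illustration of Delta(Time) / Delta(Count); not Flink's implementation.
final class GcAverageTimeSketch {
    private long windowStartCount;
    private long windowStartTimeMs;
    private long latestCount;
    private long latestTimeMs;

    /** Starts the first window at the given cumulative counter values. */
    GcAverageTimeSketch(long initialCount, long initialTimeMs) {
        this.windowStartCount = initialCount;
        this.windowStartTimeMs = initialTimeMs;
        this.latestCount = initialCount;
        this.latestTimeMs = initialTimeMs;
    }

    /** Records the most recent cumulative Count and Time values. */
    void update(long count, long timeMs) {
        this.latestCount = count;
        this.latestTimeMs = timeMs;
    }

    /** Begins a new measurement window at the latest observed values. */
    void rollWindow() {
        this.windowStartCount = latestCount;
        this.windowStartTimeMs = latestTimeMs;
    }

    /** Average time per collection in the current window, or 0 if no collection happened. */
    long averageTimeMs() {
        long deltaCount = latestCount - windowStartCount;
        if (deltaCount <= 0) {
            return 0L;
        }
        return (latestTimeMs - windowStartTimeMs) / deltaCount;
    }
}
```

In this sketch, a window containing a single collection that took 4000 ms yields an average of 4000, whereas the cumulative Time gauge alone would not reveal that the time was spent in one long pause.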


Compatibility, Deprecation, and Migration Plan

...