
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Flink currently exposes the following GC-related metrics:

Job-/TaskManager, Status.JVM.GarbageCollector:

  • <GarbageCollector>.Count (Gauge): The total number of collections that have occurred.
  • <GarbageCollector>.Time (Gauge): The total time spent performing garbage collection.

Unfortunately, these are not very useful for monitoring purposes, as they require post-processing logic that also depends on the current runtime environment:

  • Total time is not very relevant for long-running applications; only the rate of change (millisPerSec or similar) is
  • In most cases it is best to simply aggregate the time/count across the different GarbageCollectors; however, which collectors exist depends on the current Java runtime

We propose to improve the current situation by:

  • Exposing rate metrics per GarbageCollector
  • Exposing aggregated Total time/count/rate metrics

These new metrics are all derived from the existing ones with minimal overhead.
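
As a rough illustration of how such derived values could be computed, the sketch below sums the count and time over whatever collectors the current JVM reports and turns two samples of the cumulative time into a milliseconds-per-second rate. It uses only the standard java.lang.management API. The class name GcMetricsSketch, the method names, and the one-second sampling interval are hypothetical and chosen for illustration; this is not Flink's actual implementation.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical sketch of the derived GC metrics; names are illustrative only. */
public class GcMetricsSketch {

    /** Sums collection counts across all collectors; the JVM reports -1 when undefined. */
    public static long totalCount(List<GarbageCollectorMXBean> collectors) {
        long total = 0;
        for (GarbageCollectorMXBean gc : collectors) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    /** Sums cumulative collection time (ms) across all collectors; -1 is treated as 0. */
    public static long totalTimeMs(List<GarbageCollectorMXBean> collectors) {
        long total = 0;
        for (GarbageCollectorMXBean gc : collectors) {
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    /** Rate of change of cumulative GC time between two samples, in ms of GC per second. */
    public static double timeMsPerSec(long prevTimeMs, long currTimeMs, long elapsedMs) {
        if (elapsedMs <= 0) {
            return 0.0; // no time has passed, so no meaningful rate
        }
        return (currTimeMs - prevTimeMs) * 1000.0 / elapsedMs;
    }

    public static void main(String[] args) throws InterruptedException {
        // The set of collectors depends on the Java runtime, so it is discovered at run time.
        List<GarbageCollectorMXBean> collectors = ManagementFactory.getGarbageCollectorMXBeans();

        long prevTimeMs = totalTimeMs(collectors);
        long startNanos = System.nanoTime();

        System.gc();        // only a hint; the JVM may ignore it
        Thread.sleep(1000); // assumed sampling interval

        long currTimeMs = totalTimeMs(collectors);
        long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;

        System.out.println("Total.Count = " + totalCount(collectors));
        System.out.println("Total.Time = " + currTimeMs);
        System.out.println("Total.TimeMsPerSec = "
                + timeMsPerSec(prevTimeMs, currTimeMs, elapsedMs));
    }
}
```

The per-collector rate follows the same pattern with a single collector's cumulative time in place of the sum. Because the inputs are counters the JVM already maintains, the extra work per sample is a few additions and one division.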

Public Interfaces / Proposed Changes

New GC metrics (in addition to the existing ones)

Job-/TaskManager, Status.JVM.GarbageCollector:

  • <GarbageCollector>.TimeMsPerSec (Meter): Milliseconds spent performing garbage collection per second.
  • Total.Time (Gauge): The total time spent performing garbage collection across all collectors.
  • Total.TimeMsPerSec (Meter): Milliseconds spent performing garbage collection per second across all collectors.
  • Total.Count (Gauge): The total number of collections that have occurred across all collectors.


Compatibility, Deprecation, and Migration Plan

These are new metrics added alongside the existing ones, so there is no impact on existing users.
