Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Flink currently exposes the following GC-related metrics:
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.Count | The total number of collections that have occurred. | Gauge |
Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.Time | The total time spent performing garbage collection. | Gauge |
Unfortunately, these are not very useful for monitoring, because they require post-processing logic that also depends on the current runtime environment:
- Total time is not very meaningful for long-running applications; only its rate of change (millisPerSec or similar) is
- In most cases it is best to simply aggregate the time/count across the different GarbageCollectors, but the specific collectors available depend on the current Java runtime
- It's impossible to detect long GC pauses that may cause heartbeat timeouts
We propose to improve the current situation by:
- Exposing rate metrics per GarbageCollector
- Exposing aggregated Total time/count/rate metrics
- Exposing an average GC time metric over the current measurement window (last 1 minute)
These new metrics are all derived from the existing ones with minimal overhead.
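As a rough illustration of why the overhead is small, the aggregated totals can be obtained by summing the values that the JVM already reports per collector. The sketch below uses the standard `java.lang.management` API; the class name `GcTotals` and its methods are hypothetical and are not taken from the Flink code base.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical helper: cumulative GC count and time summed across all collectors. */
final class GcTotals {
    final long count;
    final long timeMs;

    GcTotals(long count, long timeMs) {
        this.count = count;
        this.timeMs = timeMs;
    }

    /** Sums count and time over the given collectors, whatever their names are. */
    static GcTotals sum(List<GarbageCollectorMXBean> collectors) {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean gc : collectors) {
            // Both getters return -1 when the value is undefined for a collector,
            // so negative values are treated as 0 here (an assumption of this sketch).
            count += Math.max(0, gc.getCollectionCount());
            timeMs += Math.max(0, gc.getCollectionTime());
        }
        return new GcTotals(count, timeMs);
    }

    /** Totals for the collectors of the currently running JVM. */
    static GcTotals current() {
        return sum(ManagementFactory.getGarbageCollectorMXBeans());
    }
}
```

Because the loop iterates over whatever collectors the runtime reports, the result does not depend on collector names that differ between Java runtimes.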
Public Interfaces / Proposed Changes
New GC metrics (in addition to the existing ones)
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.TimeMsPerSec | Milliseconds spent performing garbage collection per second. | Meter |
Job-/TaskManager | Status.JVM.GarbageCollector | <GarbageCollector>.AverageTime | Average collection time in the current metric window. Delta(Time) / Delta(Count) | Gauge |
Job-/TaskManager | Status.JVM.GarbageCollector | Total.Time | The total time spent performing garbage collection across all collectors. | Gauge |
Job-/TaskManager | Status.JVM.GarbageCollector | Total.TimeMsPerSec | Milliseconds spent performing garbage collection per second across all collectors. | Meter |
Job-/TaskManager | Status.JVM.GarbageCollector | Total.AverageTime | Average collection time in the current metric window across all collectors. Delta(Time) / Delta(Count) | Gauge |
Job-/TaskManager | Status.JVM.GarbageCollector | Total.Count | The total number of collections that have occurred across all collectors. | Gauge |
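The windowed metrics can be derived from two readings of the cumulative Time and Count values taken at the start and end of the window. The following is a minimal sketch of that arithmetic only. The class `GcWindow`, its method names, and the choice to report 0 when no collection happened in the window are assumptions made for illustration; the actual implementation may differ.

```java
/** Hypothetical helper deriving windowed GC metrics from two cumulative readings. */
final class GcWindow {

    /** Milliseconds spent in GC per second of wall-clock time over the window. */
    static double timeMsPerSec(long prevTimeMs, long currTimeMs, long windowMs) {
        if (windowMs <= 0) {
            return 0.0; // assumption: an empty window reports no GC activity
        }
        return (currTimeMs - prevTimeMs) * 1000.0 / windowMs;
    }

    /** Delta(Time) / Delta(Count), using integer division. */
    static long averageTimeMs(long prevTimeMs, long currTimeMs, long prevCount, long currCount) {
        long deltaCount = currCount - prevCount;
        // assumption: report 0 rather than dividing by zero when nothing was collected
        return deltaCount <= 0 ? 0 : (currTimeMs - prevTimeMs) / deltaCount;
    }
}
```

For example, if cumulative GC time grows from 1000 ms to 1600 ms over a 60-second window while the count grows from 10 to 13, the sketch yields 10 ms of GC per second and an average collection time of 200 ms. A high average with a low count is the signal that points at long individual pauses.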
Compatibility, Deprecation, and Migration Plan
These metrics are new and are added alongside the existing ones, so there is no impact on existing users.