Status
Current state: [Under Discussion]
Discussion threads:
- FLIP-75 discussion about the initial design
- FLIP-102 discussion after splitting up FLIP-75 into sub-flips
JIRA:
Released: <Flink Version>
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
FLIP-49 has been accepted and merged in Flink 1.10. The metrics shown on the current TaskManager details page do not correspond well to the memory model introduced by FLIP-49.
The memory model that is exposed through the configuration parameters should be visualized in the same way on the TaskManager's details page.
Proposed Changes
According to FLIP-49, we can partially correlate the configuration parameters with the metrics.
JVM Metrics
These JVM metrics are exposed and can be used through the TaskManager's metrics REST API.
JVM | Metric | Used key | Total key |
---|---|---|---|
Heap | Status.JVM.Memory.Heap | Used | Max |
Direct | Status.JVM.Memory.Direct | MemoryUsed | TotalCapacity |
Metaspace | Status.JVM.Memory.Metaspace | Used | Max |
Mapped | Status.JVM.Memory.Mapped | MemoryUsed | TotalCapacity |
NonHeap | Status.JVM.Memory.NonHeap | Used | Max |
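As a rough illustration of how a client could use the metric and key columns above, the following sketch builds a query path for one metric group and turns a used/total pair into a utilization fraction. It assumes the per-TaskManager metrics endpoint /taskmanagers/:taskmanagerid/metrics with a comma-separated get parameter; the class name JvmMetricQuery and the TaskManager id tm-1 are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Flink.
final class JvmMetricQuery {

    /** Builds the query path for one metric, e.g. Status.JVM.Memory.Heap with keys Used and Max. */
    static String queryPath(String taskManagerId, String metric, List<String> keys) {
        StringBuilder names = new StringBuilder();
        for (String key : keys) {
            if (names.length() > 0) {
                names.append(',');
            }
            names.append(metric).append('.').append(key);
        }
        return "/taskmanagers/" + taskManagerId + "/metrics?get=" + names;
    }

    /** Returns used/total, or -1 when the total is not defined (reported as zero or negative). */
    static double usedFraction(long used, long total) {
        return total > 0 ? (double) used / total : -1.0;
    }

    public static void main(String[] args) {
        System.out.println(queryPath("tm-1", "Status.JVM.Memory.Heap", Arrays.asList("Used", "Max")));
        System.out.println(usedFraction(256L << 20, 1024L << 20));
    }
}
```

The guard in usedFraction matters because the JVM reports a maximum of -1 for pools without a defined limit, so a UI should not divide by the total blindly.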
Memory Configuration
Flink's memory model (as described in org.apache.flink.runtime.clusterframework.TaskExecutorProcessSpec) can be mapped to the following Flink configuration parameters. A few of them have a corresponding Flink metric.
Flink Memory Model | Flink configuration (1) | Effective Configuration REST API (2) | Metric (3) | Used key | Total key |
---|---|---|---|---|---|
Framework Heap | taskmanager.memory.framework.heap.size | memoryConfiguration.frameworkHeap | Status.JVM.Memory.Heap | Used | Max |
Task Heap | taskmanager.memory.task.heap.size | memoryConfiguration.taskHeap | Status.JVM.Memory.Heap (shared with Framework Heap) | Used | Max |
Framework OffHeap | taskmanager.memory.framework.off-heap.size | memoryConfiguration.frameworkOffHeap | - | - | - |
Task OffHeap | taskmanager.memory.task.off-heap.size | memoryConfiguration.taskOffHeap | - | - | - |
Network Memory | taskmanager.memory.network.[min/max/fraction] | memoryConfiguration.networkMemory | Status.Shuffle.Netty | UsedMemory | TotalMemory |
Managed Memory | taskmanager.memory.managed.size | memoryConfiguration.managedMemory | Status.Flink.Memory.Managed | Used | Total |
JVM Metaspace | taskmanager.memory.jvm-metaspace.size | memoryConfiguration.jvmMetaspace | Status.JVM.Memory.Metaspace | Used | Max |
JVM Overhead | taskmanager.memory.jvm-overhead.[min/max/fraction] | memoryConfiguration.jvmOverhead | - | - | - |
(1) These are the configuration parameters used in the Flink configuration.
(2) These are the JSON paths to address the properties in the HTTP REST API response. Additionally, memoryConfiguration.totalFlinkMemory and totalProcessMemory are exposed through the REST API.
(3) The metrics which are exposed through the TaskManager's metrics REST API.
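The two totals mentioned in footnote 2 follow from the components in the table. The sketch below shows the arithmetic as defined by the FLIP-49 memory model: total Flink memory is the sum of the six Flink-managed components, and total process memory adds JVM Metaspace and JVM Overhead. The class MemoryModelTotals and the sample sizes are illustrative only.

```java
// Hypothetical helper, not part of Flink; all sizes are plain byte counts.
final class MemoryModelTotals {

    /** Sum of the six components that FLIP-49 counts as total Flink memory. */
    static long totalFlinkMemory(
            long frameworkHeap,
            long taskHeap,
            long frameworkOffHeap,
            long taskOffHeap,
            long networkMemory,
            long managedMemory) {
        return frameworkHeap + taskHeap + frameworkOffHeap + taskOffHeap + networkMemory + managedMemory;
    }

    /** Total process memory adds the JVM's own Metaspace and overhead budgets. */
    static long totalProcessMemory(long totalFlinkMemory, long jvmMetaspace, long jvmOverhead) {
        return totalFlinkMemory + jvmMetaspace + jvmOverhead;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long flink = totalFlinkMemory(128 * mib, 384 * mib, 128 * mib, 0, 128 * mib, 512 * mib);
        long process = totalProcessMemory(flink, 256 * mib, 192 * mib);
        System.out.println(flink / mib + " MiB Flink, " + process / mib + " MiB process");
    }
}
```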
Frontend Design (outdated)
Redesign the TaskManager metrics page so that users can more clearly understand the relationship between these metrics.
REST API Design
- The TaskManager resource already contains this information, so it is shown under the URL /taskmanagers/:taskmanagerid
Implementation Proposal
Step 1: Expose effective configuration parameters of TaskExecutor
- TaskManagerResourceInfo is introduced as a POJO containing the relevant values proposed in the REST response.
- The TaskManagerResourceInfo is initialized when initializing the TaskExecutor, in the same way as we do it with the HardwareDescription. It will be handed over in the same way through TaskExecutorRegistry → WorkerRegistration.
- The TaskManagerResourceInfo will be added along with HardwareDescription in ResourceManager::requestTaskManagerInfo(ResourceId, Time).
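The flow in Step 1 can be sketched roughly as follows: an immutable value object is created once, stored at registration time, and returned when the TaskManager info is requested. Both classes are simplified, hypothetical stand-ins written for this illustration (they are not the Flink classes named above), and the field names simply mirror the memoryConfiguration JSON paths.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for the proposed POJO; all sizes are bytes. */
final class ResourceInfoSketch {
    private final long frameworkHeap;
    private final long taskHeap;
    private final long frameworkOffHeap;
    private final long taskOffHeap;
    private final long networkMemory;
    private final long managedMemory;
    private final long jvmMetaspace;
    private final long jvmOverhead;

    ResourceInfoSketch(long frameworkHeap, long taskHeap, long frameworkOffHeap, long taskOffHeap,
                       long networkMemory, long managedMemory, long jvmMetaspace, long jvmOverhead) {
        this.frameworkHeap = frameworkHeap;
        this.taskHeap = taskHeap;
        this.frameworkOffHeap = frameworkOffHeap;
        this.taskOffHeap = taskOffHeap;
        this.networkMemory = networkMemory;
        this.managedMemory = managedMemory;
        this.jvmMetaspace = jvmMetaspace;
        this.jvmOverhead = jvmOverhead;
    }

    long getFrameworkHeap() { return frameworkHeap; }
    long getTaskHeap() { return taskHeap; }
    long getFrameworkOffHeap() { return frameworkOffHeap; }
    long getTaskOffHeap() { return taskOffHeap; }
    long getNetworkMemory() { return networkMemory; }
    long getManagedMemory() { return managedMemory; }
    long getJvmMetaspace() { return jvmMetaspace; }
    long getJvmOverhead() { return jvmOverhead; }
}

/** Hypothetical, heavily simplified stand-in for the ResourceManager's bookkeeping. */
final class ResourceManagerSketch {
    private final Map<String, ResourceInfoSketch> registrations = new HashMap<>();

    /** Called when a TaskExecutor registers; the info was created when the TaskExecutor started. */
    void registerTaskExecutor(String resourceId, ResourceInfoSketch info) {
        registrations.put(resourceId, info);
    }

    /** Returns the info stored at registration time, or empty for an unknown TaskExecutor. */
    Optional<ResourceInfoSketch> requestResourceInfo(String resourceId) {
        return Optional.ofNullable(registrations.get(resourceId));
    }
}
```

Keeping the object immutable mirrors the fact that the effective configuration does not change after the TaskExecutor has started, so the ResourceManager can hand out the stored instance without copying it.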
Step 2: Introduce new metric for memory usage of NetworkBufferPool
- Add a metric for the size of the shuffle memory.
- Update NettyShuffleMetricFactory#registerShuffleMetrics to register it.
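One possible way to derive the values behind such a metric is sketched below. It assumes that the buffer pool is divided into fixed-size segments and that the total and currently available segment counts are known; the class NetworkMemoryUsageSketch and its method names are hypothetical and only illustrate the arithmetic.

```java
// Hypothetical helper, not part of Flink.
final class NetworkMemoryUsageSketch {

    /** Total pool size in bytes; widened to long to avoid int overflow for large pools. */
    static long totalMemory(int totalSegments, int segmentSizeBytes) {
        return (long) totalSegments * segmentSizeBytes;
    }

    /** Bytes held by segments that are currently handed out by the pool. */
    static long usedMemory(int totalSegments, int availableSegments, int segmentSizeBytes) {
        if (availableSegments < 0 || availableSegments > totalSegments) {
            throw new IllegalArgumentException("available segments out of range: " + availableSegments);
        }
        return (long) (totalSegments - availableSegments) * segmentSizeBytes;
    }

    public static void main(String[] args) {
        System.out.println(totalMemory(1000, 32 * 1024));
        System.out.println(usedMemory(1000, 750, 32 * 1024));
    }
}
```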
Step 3: Introduce new metrics for Task's managed memory usage
We still have to discuss how to implement that in the right way. A brief proposal is the following one:
We would have to introduce a new metric that represents the aggregated memory usage of each TaskSlot. The aggregation can be maintained in the TaskExecutor.
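Since the design of this step is still open, the following is only one conceivable shape for the aggregation: each slot reports its current managed memory usage, and a single value is computed as the sum over all slots. The class ManagedMemoryAggregatorSketch, its methods, and the use of a slot index as key are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical aggregator, not part of Flink.
final class ManagedMemoryAggregatorSketch {
    private final Map<Integer, Long> usedBytesPerSlot = new HashMap<>();

    /** Records the latest managed memory usage of one slot, replacing any earlier value. */
    synchronized void reportSlotUsage(int slotIndex, long usedBytes) {
        usedBytesPerSlot.put(slotIndex, usedBytes);
    }

    /** Drops a slot, for example after it has been freed. */
    synchronized void removeSlot(int slotIndex) {
        usedBytesPerSlot.remove(slotIndex);
    }

    /** Value a gauge could expose: the sum over all currently known slots. */
    synchronized long getUsed() {
        long sum = 0L;
        for (long used : usedBytesPerSlot.values()) {
            sum += used;
        }
        return sum;
    }
}
```

The methods are synchronized because a metric reporter would typically read the value from a different thread than the one updating the slots.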
Step 4: Add Metaspace metrics
There are no metrics yet for monitoring the JVM's Metaspace pool. The newly introduced metrics will be exposed through the /taskmanagers/metrics REST API.
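A minimal sketch of where such values could come from is shown below. It reads the memory pools through the standard java.lang.management API and assumes that the pool is named "Metaspace", which is the name HotSpot-based JVMs use; other JVMs may not expose a pool with that name, so the result is optional. The class name MetaspaceProbe is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.Optional;

// Hypothetical helper, not part of Flink.
final class MetaspaceProbe {

    /** Returns the usage of the pool named "Metaspace", or empty if the JVM has no such pool. */
    static Optional<MemoryUsage> metaspaceUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                // getUsage() may return null if the pool is no longer valid.
                return Optional.ofNullable(pool.getUsage());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<MemoryUsage> usage = metaspaceUsage();
        if (usage.isPresent()) {
            MemoryUsage u = usage.get();
            // Max is -1 when no upper limit is defined for the pool.
            System.out.println("Used=" + u.getUsed() + " Committed=" + u.getCommitted() + " Max=" + u.getMax());
        } else {
            System.out.println("No Metaspace pool found on this JVM");
        }
    }
}
```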
Step 5: Update TaskManager's details page
The web UI has to be updated as proposed above.
Follow-Ups
- Create a separate independent endpoint for the effective memory configuration.
Test Plan
Existing tests are updated to verify the feature.