Status

Discussion thread	https://lists.apache.org/thread/b5dpk4g6xgx0ysgnjphlnsvxftl17lqj
Vote thread
JIRA
Release

Motivation

It is desirable to provide better visibility into the distribution of CPU resources while executing user code. One of the most visually effective means of doing that is Flame Graphs. They make it easy to answer questions such as:

Which methods are currently consuming CPU resources?
How does consumption by one method compare to the others?
Which series of calls on the stack led to executing a particular method?

Flame Graphs are constructed by sampling stack traces a number of times. Each method call is represented by a bar whose length is proportional to the number of samples in which it appears.
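As a rough illustration of the sampling idea, the sketch below folds a handful of stack samples into per-stack counts and derives the width of one bar from them. The class and method names (StackFolder, foldSamples, samplesUnder) and the sample frames are made up for this example; it is not how Flink or async-profiler implement the aggregation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class StackFolder {

    /** Folds samples (each a list of frames, root first) into "a;b;c" -> sample count. */
    static Map<String, Integer> foldSamples(List<List<String>> samples) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String> sample : samples) {
            counts.merge(String.join(";", sample), 1, Integer::sum);
        }
        return counts;
    }

    /** Number of samples whose stack starts with the given call path; this is the bar width. */
    static int samplesUnder(Map<String, Integer> folded, String callPath) {
        int total = 0;
        for (Map.Entry<String, Integer> e : folded.entrySet()) {
            String stack = e.getKey();
            if (stack.equals(callPath) || stack.startsWith(callPath + ";")) {
                total += e.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Four made-up samples taken from a running task.
        List<List<String>> samples = Arrays.asList(
            Arrays.asList("run", "map", "parse"),
            Arrays.asList("run", "map", "parse"),
            Arrays.asList("run", "map", "emit"),
            Arrays.asList("run", "checkpoint"));
        Map<String, Integer> folded = foldSamples(samples);
        System.out.println(folded);
        System.out.println(samplesUnder(folded, "run;map"));
    }
}
```

Here the call path run;map appears in three of the four samples, so its bar would span three quarters of the bar for run.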

Flink already supports flame graphs at the operator level through FLIP-165: Operator's Flame Graphs, which are drawn with the front-end library d3-flame-graph. My research shows that Arthas and IntelliJ IDEA both use async-profiler to provide this functionality.

async-profiler is a more professional tool for this purpose, and I have already added this feature to the Flink deployment at our company. Most importantly, the Operator's Flame Graphs have a serious flaw: when the job parallelism exceeds 500, they cause the Chrome browser to hang.

Once that happens, the browser becomes completely unresponsive.

Public Interfaces

N/A

Proposed Changes

We propose to provide a TaskManager-level (process-level) flame graph generated by async-profiler.

1) The TaskManager should support configurable scripts, similar to YARN. Users can configure multiple scripts.

taskmanager.execution.flame-graph.dir: /opt/flink/profiler/flamegraph

taskmanager.execution.flame-script.path: /opt/flink/bin/taskmanager-flame-graph.sh // this script wraps async-profiler

taskmanager.execution.flame-script.opts: cpu 30

User-defined scripts are supported in the same way, for example:

taskmanager.execution.xxx.path:

taskmanager.execution.xxx.opts:
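To make the configuration above more concrete, here is a minimal sketch of how a TaskManager might turn a path/opts pair into a command line. The proposal does not say how the script receives its arguments, so the order used here (the opts tokens, then the output directory, then the process id) is purely an assumption, as are the FlameScriptCommand class and its build method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the argument order is an assumption, not part of the proposal.
final class FlameScriptCommand {

    /** Builds: scriptPath, opts tokens..., outputDir, pid. */
    static List<String> build(String scriptPath, String opts, String outputDir, long pid) {
        List<String> cmd = new ArrayList<>();
        cmd.add(scriptPath);
        String trimmed = opts.trim();
        if (!trimmed.isEmpty()) {
            // "cpu 30" becomes the two arguments "cpu" and "30".
            cmd.addAll(Arrays.asList(trimmed.split("\\s+")));
        }
        cmd.add(outputDir);
        cmd.add(Long.toString(pid));
        return cmd;
    }

    public static void main(String[] args) {
        // ProcessHandle requires Java 9 or later; it returns the pid of the current JVM.
        long pid = ProcessHandle.current().pid();
        List<String> cmd = build(
            "/opt/flink/bin/taskmanager-flame-graph.sh",
            "cpu 30",
            "/opt/flink/profiler/flamegraph",
            pid);
        System.out.println(String.join(" ", cmd));
        // The script could then be launched with: new ProcessBuilder(cmd).start();
    }
}
```

Passing the command as a list of arguments to ProcessBuilder, rather than as one shell string, avoids quoting problems if a configured path contains spaces.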

2) Add two interfaces:

One that calls the TaskManager to run the script.

One that lists and displays the generated flame graphs.
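The listing side of the second interface could be as simple as scanning the directory configured by taskmanager.execution.flame-graph.dir. The sketch below assumes the script writes one .html file per flame graph into that directory. The proposal does not fix the file extension or say whether the interface is a REST handler or an RPC method, so the FlameGraphListing class is only a hypothetical illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; assumes flame graphs are stored as .html files.
final class FlameGraphListing {

    /** Returns the names of all .html files directly inside dir, sorted alphabetically. */
    static List<String> list(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                .filter(Files::isRegularFile)
                .map(p -> p.getFileName().toString())
                .filter(name -> name.endsWith(".html"))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the flame graph directory as the first argument, or use the example default.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/opt/flink/profiler/flamegraph");
        if (Files.isDirectory(dir)) {
            list(dir).forEach(System.out::println);
        } else {
            System.out.println("No flame graph directory at " + dir);
        }
    }
}
```

Because each result is a static file, the web UI would only need to fetch and display it, which avoids rendering a large graph with d3-flame-graph in the browser.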

Compatibility, Deprecation, and Migration Plan

What impact (if any) will there be on existing users?
If we are changing behavior how will we phase out the older behavior?
If we need special migration tools, describe them here.
When will we remove the existing behavior?

Test Plan

Describe in few sentences how the FLIP will be tested. We are mostly interested in system tests (since unit-tests are specific to implementation details). How will we know that the implementation works as expected? How will we know nothing broke?

Rejected Alternatives

If there are alternative ways of accomplishing the same thing, what were they? The purpose of this section is to motivate why the design is the way it is and not some other way.

FLIP-213: TaskManager's Flame Graphs