...
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
When a streaming job runs for a long time, some operators may hit performance bottlenecks caused by rising traffic, network or external-system jitter, insufficient resources, and many other factors, leading to backpressure or even processing delay. In such scenarios, we need powerful tools to identify the bottleneck quickly.
What we have...
Currently, we have the following tools at our disposal (including but not limited to):
1. Thread Dump on TaskManager/Jobmanager (Figure 1): Prints the call stacks of all threads on the TaskManager/Jobmanager via the Flink Web UI or jstack. Thread dumps taken at different moments may show different call stacks, which can confuse the person tuning the job. They are useful for analyzing where an operator blocks in a particular piece of code, but they cannot show the distribution of CPU usage under a CPU bottleneck.
...
3. Monitoring: We can evaluate whether the current job has a bottleneck through metrics such as watermark, data consumption latency, operator BusyRatio, etc. However, monitoring alone cannot pinpoint the code responsible for the bottleneck.
Why do we need this FLIP?
- Thread Dump cannot show how CPU usage is distributed over time.
- The operator-level Flame Graph provided by FLIP-165 cannot draw system call stacks, and it has several problems with saving results, searching them, and adjusting the sampling interval case by case.
As we know, the async-profiler is a low-overhead sampling profiler for Java (a profiling instance is shown in Figure 2), and it has less than 3% overhead according to the report.
With async-profiler on a Taskmanager, we can obtain the TM-level distribution of CPU usage during program execution, including the call stacks of all subtasks on that Taskmanager and even system calls. The result can be exported as HTML and searched, and the profiling interval can be specified. However, using it today requires logging into the physical machine or container hosting the TM, then downloading, installing, and running it from the command line. This is not safe in a production environment: it raises permission issues (login/export) and security risks in distributed systems.
...
Overall, this FLIP aims to provide taskmanager/jobmanager-level flame graph generation based on async-profiler in the Flink Web UI, along with parameterized sampling intervals and easy-to-use download capabilities.
Public Interfaces
To make the profiling service available in the web UI, the following REST APIs will be added:
- API for creating a profiling instance of the given taskmanager [/
- For Taskmanager [/taskmanager/:tm-id/profiler?type=create&duration=%d&mode=%s]
- For Jobmanager [/jobmanager/profiler?type=create&duration=%d&mode=%s]
- API for listing the current profiling list of the given taskmanager [/
- For Taskmanager [/taskmanager/:tm-id/profiler?type=list]
- For Jobmanager [/jobmanager/profiler?type=list]
- API for downloading a profiling result file(Flame Graph in HTML) of the given profiling instance [
- For Taskmanager [/taskmanager/:tm-id/profiler/:file]
- For Jobmanager [/jobmanager/profiler/:file]
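To make the endpoint shapes above concrete, here is a minimal sketch of how a client might format the request paths. The class `ProfilerEndpoints`, its method names, and the mode value `ITIMER` are assumptions for illustration only and are not part of the proposal; the sketch also assumes that IDs and file names are already URL-safe.

```java
import java.util.Locale;

/** Hypothetical helper (not part of Flink) that formats the request paths listed above. */
final class ProfilerEndpoints {
    private ProfilerEndpoints() {}

    /** Jobmanager base path when tmId is null, otherwise the path of the given taskmanager. */
    static String base(String tmId) {
        return tmId == null ? "/jobmanager/profiler" : "/taskmanager/" + tmId + "/profiler";
    }

    /** Path for creating a profiling instance with a duration (seconds) and a mode. */
    static String create(String tmId, int durationSeconds, String mode) {
        return String.format(Locale.ROOT, "%s?type=create&duration=%d&mode=%s",
                base(tmId), durationSeconds, mode);
    }

    /** Path for listing the profiling instances of the target. */
    static String list(String tmId) {
        return base(tmId) + "?type=list";
    }

    /** Path for downloading one profiling result file. */
    static String download(String tmId, String file) {
        return base(tmId) + "/" + file;
    }

    public static void main(String[] args) {
        System.out.println(create("tm-1", 30, "ITIMER"));
        System.out.println(list(null));
        System.out.println(download("tm-1", "profile.html"));
    }
}
```

Passing `null` as the taskmanager ID selects the Jobmanager variant of each path, which keeps the two families of endpoints in one place.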
To make the feature parameterized and controlled, some configuration options will be added:
- rest.profiling.enabled: controls whether the feature is enabled or not, false by default
- rest.profiling.max-duration-max: controls the maximum allowed sampling duration of a single profiling request, 300s by default
- rest.profiling.history-size: controls the maximum number of sampling results to keep, with rolling deletion of the oldest results, 10 by default
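The intended effect of these three options can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: the class `ProfilingLimits` and its methods are hypothetical, and the constructor arguments stand in for the enabled flag, the maximum duration, and the history size.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of how the three options could be enforced. Not Flink code. */
final class ProfilingLimits {
    private final boolean enabled;
    private final int maxDurationSeconds;
    private final int historySize;
    private final Deque<String> results = new ArrayDeque<>();

    ProfilingLimits(boolean enabled, int maxDurationSeconds, int historySize) {
        this.enabled = enabled;
        this.maxDurationSeconds = maxDurationSeconds;
        this.historySize = historySize;
    }

    /** Rejects a request when the feature is disabled or the duration is out of range. */
    void validate(int durationSeconds) {
        if (!enabled) {
            throw new IllegalStateException("profiling is disabled");
        }
        if (durationSeconds <= 0 || durationSeconds > maxDurationSeconds) {
            throw new IllegalArgumentException(
                    "duration must be between 1 and " + maxDurationSeconds + " seconds");
        }
    }

    /** Records a new result file and returns the oldest files evicted by rolling deletion. */
    List<String> record(String file) {
        results.addLast(file);
        List<String> evicted = new ArrayList<>();
        while (results.size() > historySize) {
            evicted.add(results.removeFirst());
        }
        return evicted;
    }

    /** Snapshot of the retained result files, oldest first. */
    List<String> current() {
        return new ArrayList<>(results);
    }

    public static void main(String[] args) {
        ProfilingLimits limits = new ProfilingLimits(true, 300, 10);
        limits.validate(30);
        System.out.println(limits.record("profile-1.html"));
        System.out.println(limits.current());
    }
}
```

A caller would delete the files returned by `record` from disk, so that at most the configured number of results is ever retained.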
Proposed Changes
Architecture Overview
The proposed solution in our FLIP is shown in the figure below.
On Taskmanager:
- Flink users submit profiling requests through the rest API.
- The Resource Manager forwards the request to the user-specified taskmanager.
- Task Executor invokes native methods provided by Async-profiler depending on the platform.
- After profiling completes, the taskmanager returns the file download path (this is an asynchronous process, driven by the front end continuously querying the sampling status).
- Jobmanager allows the user to download the corresponding result files from the taskmanager through the blob service.
On Jobmanager, the steps are similar to those on Taskmanager. The only difference is that Async-profiler is invoked directly in the Restful Gateway (as the dotted line in Figure 3 shows).
Figure 3. An overview of our proposal on Taskmanager & Jobmanager
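The asynchronous part of this flow, where the front end keeps querying the status until a download path is available, can be sketched as below. The class `ProfilingRecord`, its status names, and the example path are assumptions made for illustration; the proposal does not prescribe these types.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical record of one profiling request whose status is polled by the front end. */
final class ProfilingRecord {
    enum Status { RUNNING, FINISHED, FAILED }

    // Completed with the download path when profiling ends, or exceptionally on error.
    private final CompletableFuture<String> resultPath = new CompletableFuture<>();

    void complete(String path) {
        resultPath.complete(path);
    }

    void fail(Throwable cause) {
        resultPath.completeExceptionally(cause);
    }

    /** What a status query would report at this moment. */
    Status status() {
        if (!resultPath.isDone()) {
            return Status.RUNNING;
        }
        return resultPath.isCompletedExceptionally() ? Status.FAILED : Status.FINISHED;
    }

    /** Download path once profiling has finished successfully, otherwise null. */
    String downloadPathOrNull() {
        return status() == Status.FINISHED ? resultPath.join() : null;
    }

    public static void main(String[] args) {
        ProfilingRecord record = new ProfilingRecord();
        System.out.println(record.status());
        record.complete("/taskmanager/tm-1/profiler/profile.html");
        System.out.println(record.status() + " " + record.downloadPathOrNull());
    }
}
```

Modelling the result as a future keeps the request handler non-blocking: the create call returns immediately, and only later status queries observe the finished or failed state.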
Cross Platform with JNI
The package tools.profiler:async-profiler:2.9 bundles the native dynamic libraries (.so files) for all platforms supported by async-profiler behind a unified API. It selects the appropriate library file according to the runtime environment and invokes it via the Java Native Interface (JNI).
...
- Linux / x64 / x86 / arm64 / arm32 / ppc64le
- macOS / x64 / arm64
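The kind of runtime selection described above can be sketched as follows. This is not the package's actual logic: the class `PlatformTag`, the tag format such as `linux-x64`, and the mapping from `os.arch` values are assumptions for illustration.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical mapping from JVM system properties to one of the platforms listed above. */
final class PlatformTag {
    private PlatformTag() {}

    /** Returns a tag such as "linux-x64", or empty when the platform is not in the list. */
    static Optional<String> resolve(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = normalizeArch(osArch.toLowerCase(Locale.ROOT));
        if (arch == null) {
            return Optional.empty();
        }
        if (os.contains("linux")) {
            return Optional.of("linux-" + arch);
        }
        // Only x64 and arm64 are listed for macOS.
        if (os.contains("mac") && (arch.equals("x64") || arch.equals("arm64"))) {
            return Optional.of("macos-" + arch);
        }
        return Optional.empty();
    }

    /** Maps common os.arch spellings to the names used in the list; null if unknown. */
    private static String normalizeArch(String arch) {
        switch (arch) {
            case "amd64":
            case "x86_64":
                return "x64";
            case "x86":
            case "i386":
            case "i686":
                return "x86";
            case "aarch64":
            case "arm64":
                return "arm64";
            case "arm":
                return "arm32";
            case "ppc64le":
                return "ppc64le";
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```

Returning an empty result for unlisted platforms lets the caller report a clear "unsupported platform" error instead of failing while loading a native library.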
Interactive UI
Flink users can complete the profiling submission and result export in the Flink Web UI by the following simple steps:
- Select the taskmanager to be sampled in the taskmanager tab (or through the link in the operator detail drawer). Note that we also provided the ability to jump to the taskmanager page from the back-pressured node in FLINK-29996.
- Enter the desired sampling interval and profiling mode (event_mode), then click the "Create Profiling Instance" button to submit the profiling request.
- The profiling progress will be refreshed automatically. Once the sampling is complete, the link or error message will be displayed in the corresponding profiling request record.
- We can download the interactive HTML file locally by clicking on the download link for further comparison, searching, and sharing.
Figure 4. Examples of user interactions
Compatibility, Deprecation, and Migration Plan
- What impact (if any) will there be on existing users? No
- If we are changing behavior how will we phase out the older behavior? No
- If we need special migration tools, describe them here. No
- When will we remove the existing behavior? No
Test Plan
Functionality:
- Ensure that the flame graph can be generated/exported in different environments (Linux: x86/arm, macOS).
...
- Ensure that the relevant interface cannot be accessed without enabling the feature, and provide appropriate parameter prompts.
- Ensure the maximum sampling time is controlled by the configuration.
- Ensure that rolling deletion is controlled by the configuration.
Rejected Alternatives
N/A
Most Concerns
We take the following concern in the previous discussion of FLIP-213 into consideration:
...
If changing the configuration is not possible, you may fall back to -e itimer profiling mode.
It is similar to CPU mode but does not require perf_events support. As a drawback, there will be no kernel stack traces.
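A fallback along these lines could be sketched as below. The class `EventModeFallback` and its methods are hypothetical, and the caller is assumed to have already determined whether perf_events is usable. The comma-separated command string follows the style of async-profiler's agent arguments as we understand it and should be checked against the version in use.

```java
/** Hypothetical sketch of choosing between the cpu and itimer modes. Not Flink code. */
final class EventModeFallback {
    private EventModeFallback() {}

    /** Picks "cpu" when perf_events is usable, otherwise "itimer" (no kernel stack traces). */
    static String chooseEvent(boolean perfEventsUsable) {
        return perfEventsUsable ? "cpu" : "itimer";
    }

    /** Builds a start command in the style of async-profiler's agent arguments (assumed format). */
    static String startCommand(String event, String outputFile) {
        return "start,event=" + event + ",file=" + outputFile;
    }

    public static void main(String[] args) {
        String event = chooseEvent(false);
        System.out.println(startCommand(event, "/tmp/profile.html"));
    }
}
```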
UPDATE: From the email discussion, we see that users would like this feature to also leverage perf_events where possible. Since async-profiler also supports allocation and wall-clock profiling, we could extend this feature to cover more cases.