
Page properties

Discussion thread: https://lists.apache.org/thread/tp5vqgspsdko66dr6vm7cgtod9k2pct7
Vote thread: https://lists.apache.org/thread/mb2l67oqgo3mj2sjys11tj3ns4zg41sp
JIRA: FLINK-33325
Release: <Flink Version>

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

When a streaming job runs for a long time, some operators may hit performance bottlenecks caused by rising traffic, network or external-system jitter, insufficient resources, and many other factors, leading to backpressure or even delays. In such a scenario, we need powerful tools to identify the bottleneck quickly.

What we have...

Currently, we have the following tools at our disposal (including but not limited to):

1.  Thread Dump on TaskManager/JobManager (Figure 1): Prints the call stacks of all threads on the TaskManager/JobManager via the Flink Web UI or jstack. Thread dumps taken at different moments may show different call stacks, which can confuse the person tuning the job. It is still useful for analyzing where an operator is blocked in specific code, but it cannot show how CPU usage is distributed when the CPU is the bottleneck.

...

3. Monitoring: We can evaluate whether there is a bottleneck in the current job through metrics such as watermark, data consumption latency, operator BusyRatio, etc. However, monitoring alone cannot pinpoint the code where the bottleneck lies.

Why do we need this FLIP?

  • Thread Dump cannot show the distribution of CPU usage over time.
  • The operator-level Flame Graph provided by FLIP-165 cannot draw system call stacks, and it has limitations in saving results, searching them, and adjusting the sampling interval case by case.

As we know, async-profiler is a low-overhead sampling profiler for Java (a profiling instance is shown in Figure 2), and it has less than 3% overhead according to the report.

With the help of async-profiler on a Taskmanager, we can print the TM-level distribution of CPU usage during program execution, including the call stacks of all subtasks on the Taskmanager and even system calls. The result can be exported as HTML and searched, and the profiling interval can be specified. However, this requires logging into the physical machine or container hosting the TM, downloading and installing the profiler, and running it from the command line. Such operations are not safe in a production environment: they raise permission issues (login/export) and pose security risks in distributed systems.

...

Overall, this FLIP is dedicated to providing taskmanager/jobmanager-level flame graph generation based on async-profiler in the Flink Web UI, along with parameterized sampling intervals and easy-to-use downloads.

Public Interfaces

To make the profiling service available in the Web UI, some REST APIs will be added:

  • API for creating a profiling instance
    • For Taskmanager [/taskmanager/:tm-id/profiler?type=create&duration=%d&mode=%s]
    • For Jobmanager [/jobmanager/profiler?type=create&duration=%d&mode=%s]
  • API for listing the current profiling instances
    • For Taskmanager [/taskmanager/:tm-id/profiler?type=list]
    • For Jobmanager [/jobmanager/profiler?type=list]
  • API for downloading the profiling result file (Flame Graph in HTML) of a given profiling instance
    • For Taskmanager [/taskmanager/:tm-id/profiler/:file]
    • For Jobmanager [/jobmanager/profiler/:file]
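
As an illustration of how a client might address these endpoints, the following is a minimal sketch that only assembles the URLs from the path templates listed above. The base address, the taskmanager id `tm-1`, the file name, and the mode value `itimer` are assumptions made for the example, and the helper names are hypothetical; nothing here contacts a cluster.

```python
from urllib.parse import quote, urlencode


def profiler_url(base, tm_id=None, **params):
    """Build a profiler endpoint URL; tm_id=None targets the jobmanager."""
    target = f"/taskmanager/{quote(tm_id, safe='')}" if tm_id else "/jobmanager"
    url = f"{base.rstrip('/')}{target}/profiler"
    # Query parameters keep the order in which they were passed.
    return f"{url}?{urlencode(params)}" if params else url


def download_url(base, file_name, tm_id=None):
    """Build the download URL for one profiling result file."""
    return f"{profiler_url(base, tm_id)}/{quote(file_name, safe='')}"


BASE = "http://localhost:8081"  # hypothetical REST address

# Create a 30-second profiling instance on an (assumed) taskmanager "tm-1".
create = profiler_url(BASE, "tm-1", type="create", duration=30, mode="itimer")
# List the profiling instances of the jobmanager.
listing = profiler_url(BASE, type="list")
# Download an (assumed) result file from the jobmanager.
result = download_url(BASE, "result.html")

print(create)
print(listing)
print(result)
```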

To make the feature parameterized and controlled, some configuration options will be added:

  • rest.profiling.enabled: controls whether the feature is enabled or not, false by default
  • rest.profiling.max-duration-max: control the maximum allowed sampling interval, 300s by default
  • rest.profiling.history-size: controls the maximum number of sampling results to keep; older results are removed by rolling deletion. 10 by default.
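
To make the intended effect of the duration limit and the history size concrete, here is a small sketch of the bookkeeping they imply, using the defaults stated above (300 s and 10 results). The class and method names are hypothetical, and rejecting an over-long request (rather than clamping it) is an assumption of this sketch, not something the FLIP specifies.

```python
from collections import deque


class ProfilingHistory:
    """Toy model of the duration limit and rolling deletion of results."""

    def __init__(self, max_duration_s=300, history_size=10):
        self.max_duration_s = max_duration_s
        # A bounded deque drops the oldest entry once it is full.
        self.results = deque(maxlen=history_size)

    def validate(self, duration_s):
        """Accept a requested duration only if it is within the limit."""
        if not 0 < duration_s <= self.max_duration_s:
            raise ValueError(
                f"duration must be in (0, {self.max_duration_s}] seconds"
            )
        return duration_s

    def record(self, file_name):
        """Store a finished result; the oldest one is evicted when full."""
        self.results.append(file_name)


history = ProfilingHistory()
for i in range(12):
    history.record(f"profile-{i}.html")

# Only the 10 most recent results remain.
print(list(history.results))
```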

Proposed Changes

Architecture Overview

The proposed solution in our FLIP is shown in the figure below.

On Taskmanager:

  1. Flink users submit profiling requests through the REST API.
  2. The Resource Manager forwards the request to the user-specified taskmanager.
  3. The Task Executor invokes the native methods provided by async-profiler, depending on the platform.
  4. After profiling completes, the taskmanager returns the file download path (an asynchronous process, driven by the front end continuously querying the sampling status).
  5. The Jobmanager lets the user download the corresponding result files from the taskmanager via the blob service.

On Jobmanager, the steps are similar to those on Taskmanager. The only difference is that async-profiler is invoked directly in the Restful Gateway (as the dotted line in Figure 3 shows).


Figure 3. An overview of our proposal on Taskmanager & Jobmanager

Cross Platform with JNI

The package tools.profiler:async-profiler:2.9, provided by Andrei Pangin, the major contributor to async-profiler, bundles the dynamic runtime libraries (.so files) for all platforms supported by async-profiler behind a unified API. The dependency selects the appropriate dynamic runtime library according to the runtime environment and invokes it via the Java Native Interface (JNI).

...

  • Linux / x64 / x86 / arm64 / arm32 / ppc64le
  • macOS / x64 / arm64
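
The following sketch shows one way a caller could check the current environment against this list before offering profiling. It assumes the list above enumerates the platforms covered by the bundled libraries; the alias tables that map the names reported by the operating system onto the list's names are illustrative and not taken from async-profiler.

```python
import platform

# Platforms from the list above.
SUPPORTED = {
    "linux": {"x64", "x86", "arm64", "arm32", "ppc64le"},
    "macos": {"x64", "arm64"},
}

# Illustrative aliases (assumptions): OS-reported names -> names used above.
OS_ALIASES = {"linux": "linux", "darwin": "macos"}
ARCH_ALIASES = {
    "x86_64": "x64", "amd64": "x64",
    "i386": "x86", "i686": "x86",
    "aarch64": "arm64", "arm64": "arm64",
    "armv7l": "arm32",
    "ppc64le": "ppc64le",
}


def is_supported(system=None, machine=None):
    """Return True if the OS/architecture pair appears in SUPPORTED."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    os_name = OS_ALIASES.get(system)
    arch = ARCH_ALIASES.get(machine)
    return arch in SUPPORTED.get(os_name, set())


print(is_supported())  # result for the machine running this sketch
```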

Interactive UI

Flink users can complete the profiling submission and result export in the Flink Web UI by the following simple steps:

  1. Select the taskmanager to be sampled in the taskmanager tab (or through the link in the operator detail drawer). Note that FLINK-29996 also added the ability to jump to the taskmanager page from a back-pressured node.
  2. Type in the appropriate sampling interval and profiling mode (event_mode), then click the "Create Profiling Instance" button to submit the profiling request.
  3. The profiling progress will be refreshed automatically. Once the sampling is complete, the link or error message will be displayed in the corresponding profiling request record.
  4. We can download the interactive HTML file locally by clicking on the download link for further comparison, searching, and sharing.


Figure 4. Examples of user interactions

Compatibility, Deprecation, and Migration Plan

  • What impact (if any) will there be on existing users? No
  • If we are changing behavior how will we phase out the older behavior? No
  • If we need special migration tools, describe them here. No
  • When will we remove the existing behavior? No

Test Plan

Functionality:

  • Ensure that the flame graph can be generated/exported in different environments (Linux: x86/arm, macOS).

...

  • Ensure that the relevant interface cannot be accessed without enabling the feature, and provide appropriate parameter prompts.
  • Ensure the maximum sampling time is controlled by the configuration.
  • Ensure that rolling deletion is controlled by the configuration.

Rejected Alternatives

N/A

Most Concerns

We take the following concern in the previous discussion of FLIP-213 into consideration:

...

If changing the configuration is not possible, you may fall back to -e itimer profiling mode.

It is similar to CPU mode but does not require perf_events support. As a drawback, there will be no kernel stack traces.
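
The trade-off described in the quoted text can be summarized in a few lines. This is only a sketch of the decision: the function and field names are hypothetical, and how the availability of perf_events is detected is deliberately left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventChoice:
    event: str            # value passed to async-profiler's -e option
    kernel_stacks: bool   # whether kernel stack traces are expected


def choose_event(perf_events_available: bool) -> EventChoice:
    """Prefer CPU mode; fall back to itimer when perf_events is unusable."""
    if perf_events_available:
        return EventChoice(event="cpu", kernel_stacks=True)
    # itimer needs no perf_events support but yields no kernel stack traces.
    return EventChoice(event="itimer", kernel_stacks=False)


print(choose_event(False))
```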

UPDATE: From the discussion on the mailing list, we see that users would like this feature to leverage perf_events where possible. Since async-profiler also supports allocation and wall-clock profiling, we could extend this feature to cover more cases.