...

Discussion thread: TBD
Vote thread: TBD
JIRA: TBD
Release: 1.19.0


Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

When a streaming job runs for a long time, some operators may hit performance bottlenecks caused by rising traffic, network or external-system jitter, insufficient resources, and many other factors, leading to backpressure or even delays. In such scenarios, we need powerful tools to identify the bottlenecks quickly.

What we have...

Currently, we have the following tools at our disposal (including but not limited to):

...

3. Monitoring: We can evaluate whether the current job has a bottleneck through metrics such as watermark, data consumption latency, operator BusyRatio, etc. However, monitoring alone cannot pinpoint the code that causes the bottleneck.

Why do we need this FLIP?

  • A thread dump cannot show how CPU usage is distributed over time.
  • The operator-level Flame Graph provided by FLIP-165 cannot draw the system call stack, and it has several problems with saving results, searching them, and adjusting the sampling interval case by case.

...

Overall, the FLIP is dedicated to providing taskmanager-level flame graph generation capabilities based on async-profiler on the Flink Web UI, along with parameterized sampling intervals and easy-to-use download capabilities.

Public Interfaces

To make the profiling service available in the Web UI, the following REST APIs will be added:

  • API for creating a profiling instance of the given taskmanager [/profiler?type=create]
  • API for listing the current profiling list of the given taskmanager [/profiler?type=list]
  • API for downloading a profiling result file (Flame Graph in HTML) of the given profiling instance [/profiler/:file]

...

  • profiling.enabled: controls whether the feature is enabled, false by default.
  • profiling.duration-max: controls the maximum allowed sampling interval, 300s by default.
  • profiling.history-size: controls the maximum number of sampling results to keep, with rolling deletion, 10 by default.
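
The three options above can be read as simple guards around each profiling request. The following is a minimal Java sketch of that reading, not Flink code: the class name `ProfilingLimitsSketch` and its methods are hypothetical, and it assumes that a request longer than profiling.duration-max is rejected (the FLIP text does not say whether it would be rejected or clamped).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration of the three profiling.* options; not Flink code. */
final class ProfilingLimitsSketch {
    private final boolean enabled;        // profiling.enabled
    private final int durationMaxSeconds; // profiling.duration-max
    private final int historySize;        // profiling.history-size
    private final Deque<String> history = new ArrayDeque<>();

    ProfilingLimitsSketch(boolean enabled, int durationMaxSeconds, int historySize) {
        this.enabled = enabled;
        this.durationMaxSeconds = durationMaxSeconds;
        this.historySize = historySize;
    }

    /** Defaults as listed in the FLIP: disabled, 300 s, 10 results. */
    static ProfilingLimitsSketch defaults() {
        return new ProfilingLimitsSketch(false, 300, 10);
    }

    /** Assumption: an over-long or non-positive duration is rejected, not clamped. */
    void checkRequest(int durationSeconds) {
        if (!enabled) {
            throw new IllegalStateException("profiling.enabled is false");
        }
        if (durationSeconds <= 0 || durationSeconds > durationMaxSeconds) {
            throw new IllegalArgumentException(
                    "duration must be in (0, " + durationMaxSeconds + "] seconds");
        }
    }

    /** Stores a finished result and returns the oldest entries dropped by rolling deletion. */
    List<String> addResult(String fileName) {
        history.addLast(fileName);
        List<String> evicted = new ArrayList<>();
        while (history.size() > historySize) {
            evicted.add(history.removeFirst());
        }
        return evicted;
    }

    /** Returns the retained result names, oldest first. */
    List<String> history() {
        return new ArrayList<>(history);
    }
}
```

With the defaults, `checkRequest` fails until the feature is switched on, which matches the test-plan item that the interface must not be reachable while the feature is disabled.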

Proposed Changes

Architecture Overview

The proposed solution in our FLIP is shown in the figure below.

  1. Flink users submit profiling requests through the REST API.
  2. The Resource Manager forwards the request to the user-specified taskmanager.
  3. The Task Executor invokes the native methods provided by async-profiler, depending on the platform.
  4. Once profiling completes, the taskmanager returns the file download path. The process is asynchronous and is driven by the front end continuously polling the sampling status.
  5. The Jobmanager lets the user download the corresponding result files from the taskmanager through the blob service.
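
The asynchronous part of the flow (steps 3 and 4) can be sketched as a record that starts in a running state and is later completed with either a file path or an error, while the front end polls a list of records. This is a hypothetical model for illustration only: `ProfilingFlowSketch`, its `Status` values, and the `Callable` standing in for the native profiler call are assumptions, not names from the FLIP or from Flink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;

/** Hypothetical model of the asynchronous profiling flow; not Flink code. */
final class ProfilingFlowSketch {

    enum Status { RUNNING, FINISHED, FAILED }

    /** One profiling request as a polling front end would see it. */
    static final class ProfilingRecord {
        private volatile Status status = Status.RUNNING;
        private volatile String outputPath; // set before the status becomes FINISHED
        private volatile String message;    // set before the status becomes FAILED
        private CompletableFuture<Void> done;

        Status status() { return status; }
        String outputPath() { return outputPath; }
        String message() { return message; }
        CompletableFuture<Void> done() { return done; }
    }

    private final List<ProfilingRecord> records = new ArrayList<>();

    /** Starts the (stand-in) profiler in the background and returns immediately. */
    synchronized ProfilingRecord submit(Callable<String> profiler) {
        ProfilingRecord record = new ProfilingRecord();
        records.add(record);
        record.done = CompletableFuture.runAsync(() -> {
            try {
                record.outputPath = profiler.call();
                record.status = Status.FINISHED;
            } catch (Exception e) {
                record.message = e.getMessage();
                record.status = Status.FAILED;
            }
        });
        return record;
    }

    /** Snapshot of all requests; the front end would call this repeatedly. */
    synchronized List<ProfilingRecord> list() {
        return new ArrayList<>(records);
    }
}
```

Returning the record before the work finishes is what lets the UI show a request as in progress and later replace it with a download link or an error message.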

Figure 3. An overview of our proposal

Cross Platform with JNI

The package tools.profiler:async-profiler:2.9, provided by Andrei Pangin, the major contributor to async-profiler, bundles the dynamic runtime library (.so) files for all platforms supported by async-profiler behind a unified API. The dependency selects the appropriate dynamic runtime library file according to the runtime environment and invokes it via the Java Native Interface (JNI).

...

  • Linux / x64 / x86 / arm64 / arm32 / ppc64le
  • macOS / x64 / arm64
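
To illustrate what selecting a library "according to the runtime environment" can look like, here is a minimal Java sketch that maps the JVM's `os.name` and `os.arch` values onto the platform list above. It is not the loader shipped in the async-profiler package: the class `PlatformTagSketch` and the tag format (for example `linux-x64`) are hypothetical.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical platform detection for the list above; not async-profiler's loader. */
final class PlatformTagSketch {

    /** Returns a tag such as "linux-x64", or empty if the platform is not in the list. */
    static Optional<String> platformTag(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = normalizeArch(osArch.toLowerCase(Locale.ROOT));
        if (arch == null) {
            return Optional.empty();
        }
        if (os.contains("linux")) {
            return Optional.of("linux-" + arch);
        }
        if (os.contains("mac")) {
            // The list above names only x64 and arm64 for macOS.
            boolean supported = arch.equals("x64") || arch.equals("arm64");
            return supported ? Optional.of("macos-" + arch) : Optional.empty();
        }
        return Optional.empty();
    }

    /** Maps common os.arch spellings to the names used in the list above. */
    private static String normalizeArch(String arch) {
        switch (arch) {
            case "amd64":
            case "x86_64":
                return "x64";
            case "x86":
            case "i386":
            case "i686":
                return "x86";
            case "aarch64":
            case "arm64":
                return "arm64";
            case "arm":
            case "arm32":
                return "arm32";
            case "ppc64le":
                return "ppc64le";
            default:
                return null;
        }
    }
}
```

A caller would pass `System.getProperty("os.name")` and `System.getProperty("os.arch")`. An empty result corresponds to a platform on which profiling cannot be offered.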

Interactive UI

Flink users can complete the profiling submission and result export in the Flink Web UI by the following simple steps:

  1. Select the taskmanager to be sampled in the taskmanager tab (or through the link in the operator detail drawer). Note that we also provided the ability to jump to the taskmanager page from the back-pressured node in FLINK-29996.
  2. Type in the appropriate sampling interval and click the “Create Profiling Instance” button to complete the submission of the profiling request.
  3. The profiling progress will be refreshed automatically. Once the sampling is complete, the link or error message will be displayed in the corresponding profiling request record.
  4. We can download the interactive HTML file locally by clicking on the download link for further comparison, searching, and sharing.

...

Figure 4. Examples of user interactions

Compatibility, Deprecation, and Migration Plan

  • What impact (if any) will there be on existing users? No
  • If we are changing behavior how will we phase out the older behavior? No
  • If we need special migration tools, describe them here. No
  • When will we remove the existing behavior? No

Test Plan

Functionality:

  • Ensure that the flame graph can be generated/exported in different environments (Linux: x86/arm, macOS).

...

  • Ensure that the relevant interface cannot be accessed without enabling the feature, and provide appropriate parameter prompts.
  • Ensure the maximum sampling time is controlled by the configuration.
  • Ensure that rolling deletion is controlled by the configuration.

Rejected Alternatives

N/A

Most Concerns

We take into consideration the following concern raised in the previous discussion of FLIP-213:

...