...

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

When a streaming job runs for a long time, some operators may hit performance bottlenecks caused by rising traffic, network or external-system jitter, insufficient resources, and many other factors, leading to backpressure or even processing delay. In such scenarios, we need effective tools to identify the bottleneck quickly.

What we have...

Currently, we have the following tools at our disposal (including but not limited to):

...

3. Monitoring: We can evaluate whether the current job has a bottleneck through metrics such as watermark, data consumption latency, and operator BusyRatio. However, monitoring alone cannot pinpoint the code that causes the bottleneck.

Why do we need this FLIP?

  • Thread Dump cannot show how CPU usage is distributed over time.
  • The operator-level Flame Graph provided by FLIP-165 cannot draw system call stacks, and it has several limitations in saving results, searching, and adjusting the sampling interval case by case.

As we know, async-profiler is a low-overhead sampling profiler for Java (a profiling instance is shown in Figure 2), and it has less than 3% overhead according to the report.

With async-profiler on the TaskManager, we can obtain a TM-level distribution of CPU usage during program execution, including the call stacks of all subtasks on the TaskManager and even system calls. The result can be exported as HTML and searched, and the profiling interval can be specified. However, this requires logging into the physical machine or container that hosts the TM, downloading and installing the profiler, and running it from the command line. Such operations are not safe in a production environment: they raise permission issues (login/export) and security risks in distributed systems.

...

Overall, this FLIP aims to provide TaskManager/JobManager-level flame graph generation based on async-profiler in the Flink Web UI, along with parameterized sampling intervals and easy-to-use download capabilities.
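
As a rough illustration of what "parameterized sampling intervals" could mean at the async-profiler level, the sketch below builds the comma-separated command strings that async-profiler accepts (for example through its Java API method `one.profiler.AsyncProfiler#execute`). The class `ProfilerCommands` and its method names are hypothetical, the interval is passed in nanoseconds, and the choice of an `.html` output file is an assumption; whether the FLIP implementation issues exactly these commands is not stated here, so the argument format should be checked against the async-profiler documentation.

```java
import java.time.Duration;

// Hypothetical helper for illustration; not part of Flink or async-profiler.
final class ProfilerCommands {

    private ProfilerCommands() {}

    // Builds a "start" command for the given event (e.g. "cpu") and sampling interval.
    // The interval is rendered in nanoseconds.
    static String start(String event, Duration samplingInterval) {
        if (samplingInterval.isZero() || samplingInterval.isNegative()) {
            throw new IllegalArgumentException("sampling interval must be positive");
        }
        return "start,event=" + event + ",interval=" + samplingInterval.toNanos();
    }

    // Builds a "stop" command that writes the collected profile to the given file.
    static String stop(String outputFile) {
        return "stop,file=" + outputFile;
    }
}
```

With these helpers, `ProfilerCommands.start("cpu", Duration.ofMillis(10))` produces `start,event=cpu,interval=10000000`, and a matching `stop` command names the file the Web UI would later offer for download.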

Public Interfaces

To make the profiling service available in the Web UI, some REST APIs will be added:

...

  • rest.profiling.enabled: controls whether the feature is enabled, false by default.
  • rest.profiling.max-duration: controls the maximum allowed sampling duration, 300s by default.
  • rest.profiling.history-size: controls the maximum number of sampling results to keep, with rolling deletion of the oldest results, 10 by default.
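
The three options can be read as simple guards around a profiling request. The following is a minimal sketch under stated assumptions, not the FLIP's implementation: the class `ProfilingLimits` and its methods are hypothetical, and rejecting an over-long request with an exception (rather than silently shortening it) is an assumption. Only the option names and defaults come from the list above.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper for illustration; not part of Flink.
final class ProfilingLimits {
    private final boolean enabled;      // rest.profiling.enabled (default: false)
    private final Duration maxDuration; // rest.profiling.max-duration (default: 300s)
    private final int historySize;      // rest.profiling.history-size (default: 10)
    private final Deque<String> history = new ArrayDeque<>();

    ProfilingLimits(boolean enabled, Duration maxDuration, int historySize) {
        if (historySize < 1) {
            throw new IllegalArgumentException("history size must be at least 1");
        }
        this.enabled = enabled;
        this.maxDuration = maxDuration;
        this.historySize = historySize;
    }

    // Mirrors the defaults listed in this section.
    static ProfilingLimits withDefaults() {
        return new ProfilingLimits(false, Duration.ofSeconds(300), 10);
    }

    // Rejects requests when the feature is off or the duration is out of range.
    void validateRequest(Duration requested) {
        if (!enabled) {
            throw new IllegalStateException(
                    "Profiling is disabled; set rest.profiling.enabled to true.");
        }
        if (requested.isZero() || requested.isNegative()
                || requested.compareTo(maxDuration) > 0) {
            throw new IllegalArgumentException(
                    "Duration must be positive and at most " + maxDuration.getSeconds() + "s.");
        }
    }

    // Records a new result and returns the oldest results that were rolled out.
    List<String> recordResult(String resultFile) {
        history.addLast(resultFile);
        List<String> evicted = new ArrayList<>();
        while (history.size() > historySize) {
            evicted.add(history.removeFirst());
        }
        return evicted;
    }

    // Returns the retained results, oldest first.
    List<String> retainedResults() {
        return new ArrayList<>(history);
    }
}
```

In this sketch, the caller would delete the files returned by `recordResult`, which is one way to realize the rolling deletion described for `rest.profiling.history-size`.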

Proposed Changes

Architecture Overview

The proposed solution in our FLIP is shown in the figure below.

...

Figure 3. An overview of our proposal on Taskmanager & Jobmanager

Cross Platform with JNI

The package tools.profiler:async-profiler:2.9 bundles the native shared libraries (.so files) for all platforms supported by async-profiler behind a unified API. It selects the appropriate library according to the runtime environment and invokes it via the Java Native Interface (JNI).

...

  • Linux / x64 / x86 / arm64 / arm32 / ppc64le
  • macOS / x64 / arm64
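
To make the platform selection concrete, the sketch below maps the JVM system properties `os.name` and `os.arch` onto the platform list above. It is an illustration only: the class `ProfilerPlatform`, the `os-arch` tag format, and the set of recognized `os.arch` spellings are assumptions, and the actual lookup inside the async-profiler package may differ.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for illustration; not the package's real selection logic.
final class ProfilerPlatform {

    private ProfilerPlatform() {}

    // Resolves the platform of the running JVM, or empty if it is not in the list.
    static Optional<String> current() {
        return resolve(System.getProperty("os.name"), System.getProperty("os.arch"));
    }

    // Returns a tag such as "linux-x64" for supported combinations, otherwise empty.
    static Optional<String> resolve(String osName, String osArch) {
        String os = normalizeOs(osName);
        String arch = normalizeArch(osArch);
        if (os == null || arch == null) {
            return Optional.empty();
        }
        // Linux supports every listed architecture; macOS only x64 and arm64.
        boolean supported = os.equals("linux")
                || arch.equals("x64")
                || arch.equals("arm64");
        return supported ? Optional.of(os + "-" + arch) : Optional.empty();
    }

    private static String normalizeOs(String osName) {
        String name = osName == null ? "" : osName.toLowerCase(Locale.ROOT);
        if (name.contains("linux")) {
            return "linux";
        }
        if (name.contains("mac") || name.contains("darwin")) {
            return "macos";
        }
        return null;
    }

    private static String normalizeArch(String osArch) {
        String arch = osArch == null ? "" : osArch.toLowerCase(Locale.ROOT);
        switch (arch) {
            case "amd64":
            case "x86_64":
                return "x64";
            case "x86":
            case "i386":
            case "i686":
                return "x86";
            case "aarch64":
            case "arm64":
                return "arm64";
            case "arm":
            case "arm32":
                return "arm32";
            case "ppc64le":
                return "ppc64le";
            default:
                return null;
        }
    }
}
```

An empty result would be a natural point for the Web UI to report that profiling is unavailable on the current platform instead of failing when the native library is loaded.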

Interactive UI

Flink users can submit a profiling request and export the result in the Flink Web UI with the following simple steps:

...

Figure 4. Examples of user interactions

Compatibility, Deprecation, and Migration Plan

  • What impact (if any) will there be on existing users? None.
  • If we are changing behavior how will we phase out the older behavior? N/A
  • If we need special migration tools, describe them here. None.
  • When will we remove the existing behavior? N/A

Test Plan

Functionality:

  • Ensure that the flame graph can be generated and exported in different environments (Linux: x86/arm, macOS).

...

  • Ensure that the relevant interfaces cannot be accessed when the feature is disabled, and that appropriate parameter prompts are provided.
  • Ensure that the maximum sampling time is controlled by the configuration.
  • Ensure that rolling deletion of saved results is controlled by the configuration.

Rejected Alternatives

N/A

Main Concerns

We take the following concern from the previous discussion of FLIP-213 into consideration:

...