...

Currently, we have the following tools at our disposal (including but not limited to):

1.  Thread Dump on TaskManager/JobManager (Figure 1): prints the call stacks of all threads on the TaskManager/JobManager via the Flink Web UI or jstack. Thread dumps taken at different moments may show different call stacks, which can confuse the person tuning the job. A thread dump is still useful for analyzing where an operator blocks in a particular piece of code, but it cannot show how CPU usage is distributed when the CPU is the bottleneck.

...

Overall, the FLIP is dedicated to providing TaskManager/JobManager-level flame graph generation based on async-profiler in the Flink Web UI, along with parameterized sampling intervals and easy-to-use download of the results.

...

  • API for creating a profiling instance on the given TaskManager or JobManager
    • For Taskmanager [/taskmanager/:tm-id/profiler?type=create]
    • For Jobmanager [/jobmanager/profiler?type=create]
  • API for listing the current profiling instances of the given TaskManager or JobManager
    • For Taskmanager [/taskmanager/:tm-id/profiler?type=list]
    • For Jobmanager [/jobmanager/profiler?type=list]
  • API for downloading the profiling result file (flame graph in HTML) of the given profiling instance
    • For Taskmanager [/taskmanager/:tm-id/profiler/:file]
    • For Jobmanager [/jobmanager/profiler/:file]
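
As a rough illustration of how the six endpoints above fit one pattern, the sketch below builds the request path for each action. The function name `profiler_path` and the action label `"download"` are assumptions made for this example; the HTTP method and request body are not specified here, so only the path is constructed.

```python
from typing import Optional
from urllib.parse import quote


def profiler_path(action: str, tm_id: Optional[str] = None, file: Optional[str] = None) -> str:
    """Build the REST path for a profiler action (hypothetical helper).

    action: "create", "list", or "download" ("download" is a label used only here).
    tm_id:  TaskManager id; None targets the JobManager.
    file:   result file name, required for "download".
    """
    if tm_id:
        base = f"/taskmanager/{quote(tm_id, safe='')}/profiler"
    else:
        base = "/jobmanager/profiler"

    if action in ("create", "list"):
        return f"{base}?type={action}"
    if action == "download":
        if not file:
            raise ValueError("download needs a result file name")
        return f"{base}/{quote(file, safe='')}"
    raise ValueError(f"unknown action: {action}")
```

For example, `profiler_path("list", tm_id="tm-1")` yields the TaskManager listing path, and omitting `tm_id` yields the JobManager variant.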

To make the feature parameterized and controlled, some configuration options will be added:

  • rest.profiling.enabled: controls whether the feature is enabled; false by default.
  • rest.profiling.max-duration-max: controls the maximum allowed sampling interval; 300s by default.
  • rest.profiling.history-size: controls the maximum number of sampling results kept, with rolling deletion of the oldest; 10 by default.
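
The interplay of the three options can be sketched as follows. This is only a minimal model under stated assumptions, not the FLIP's implementation: the class names `ProfilingConfig` and `ProfilingHistory`, the field names, and the choice to reject (rather than clamp) an over-long request are all hypothetical.

```python
from collections import deque


class ProfilingConfig:
    """Hypothetical holder for the options listed above, with their defaults."""

    def __init__(self, enabled: bool = False, max_duration_s: int = 300, history_size: int = 10):
        self.enabled = enabled                # rest.profiling.enabled
        self.max_duration_s = max_duration_s  # maximum allowed sampling interval, in seconds
        self.history_size = history_size      # rest.profiling.history-size


class ProfilingHistory:
    """Keeps at most history_size results; the oldest is dropped first."""

    def __init__(self, config: ProfilingConfig):
        self.config = config
        # A bounded deque discards its oldest entry when a new one is appended at capacity.
        self.results = deque(maxlen=config.history_size)

    def record(self, file_name: str, duration_s: int) -> None:
        if not self.config.enabled:
            raise RuntimeError("profiling is disabled")
        if not 0 < duration_s <= self.config.max_duration_s:
            raise ValueError(f"duration must be in (0, {self.config.max_duration_s}] seconds")
        self.results.append(file_name)
```

With the defaults, recording an eleventh result silently evicts the first, which is one simple way to read "rolling deletion".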

...

The proposed solution in our FLIP is shown in the figure below.

On Taskmanager:

  1. Flink users submit profiling requests through the REST API.
  2. The Resource Manager forwards the request to the user-specified TaskManager.
  3. The Task Executor invokes the native methods provided by async-profiler for the current platform.
  4. After profiling completes, the TaskManager returns the file download path. This is asynchronous: the front end keeps querying the sampling status until the result is ready.
  5. The JobManager lets the user download the corresponding result files from the TaskManager through the blob service.
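
The asynchronous part of step 4 can be sketched as a polling loop on the client side. This is an illustration only: the function `wait_for_result`, the entry fields `id`, `status`, and `path`, and the status value `"FINISHED"` are assumptions for this example, and `list_profiles` stands in for whatever call performs the listing request.

```python
import time
from typing import Callable, Dict, List, Optional


def wait_for_result(
    list_profiles: Callable[[], List[Dict[str, str]]],
    profile_id: str,
    poll_interval_s: float = 1.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll the profiling list until the given instance finishes.

    Returns the download path of the finished instance, or None on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        for entry in list_profiles():
            # Field names and the "FINISHED" value are hypothetical.
            if entry.get("id") == profile_id and entry.get("status") == "FINISHED":
                return entry.get("path")
        sleep(poll_interval_s)
    return None
```

Injecting `list_profiles` and `sleep` keeps the loop independent of any particular HTTP client and makes it easy to exercise without a running cluster.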

On the JobManager, the steps are similar to those on the TaskManager. The only difference is that async-profiler is invoked directly in the Restful Gateway (as the dotted line in Figure 3 shows).

Figure 3. An overview of our proposal on Taskmanager & Jobmanager

Cross Platform with JNI

The package tools.profiler:async-profiler:2.9 bundles the native shared libraries (.so files) for all platforms supported by async-profiler behind a unified API. At runtime it selects the appropriate library for the current environment and invokes it through the Java Native Interface (JNI).
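
Conceptually, the selection step amounts to a lookup keyed by operating system and CPU architecture. The sketch below only illustrates that idea; the table contents, the resource paths, and the function `select_library` are hypothetical and are not taken from the async-profiler package.

```python
import platform
from typing import Optional

# Hypothetical layout: one bundled native library per (OS, architecture) pair.
_LIBRARIES = {
    ("linux", "x86_64"): "linux-x64/libasyncProfiler.so",
    ("linux", "aarch64"): "linux-arm64/libasyncProfiler.so",
    ("darwin", "x86_64"): "macos/libasyncProfiler.so",
    ("darwin", "arm64"): "macos/libasyncProfiler.so",
}

# Some systems report the same architecture under a different name.
_ARCH_ALIASES = {"amd64": "x86_64"}


def select_library(os_name: Optional[str] = None, arch: Optional[str] = None) -> Optional[str]:
    """Return the bundled library path for the platform, or None if unsupported."""
    os_key = (os_name or platform.system()).lower()
    arch_key = (arch or platform.machine()).lower()
    arch_key = _ARCH_ALIASES.get(arch_key, arch_key)
    return _LIBRARIES.get((os_key, arch_key))
```

Returning None for an unknown platform lets the caller report that profiling is unsupported there instead of failing while loading a library.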

...