Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

As a streaming job runs over a long period, some operators may hit performance bottlenecks caused by rising traffic, network or external system jitter, insufficient resources, and many other factors, leading to backpressure or even growing latency. In such scenarios, we need powerful tools to quickly identify the performance bottleneck.

What we have...

Currently, we have the following tools at our disposal (including but not limited to):

1. Thread Dump on TaskManager/JobManager (Figure 1): prints the call stacks of all threads on a TaskManager/JobManager via the Flink Web UI or jstack. Thread dumps taken at different moments may show different call stacks, which can confuse the tuner, but they are still useful for analyzing where an operator blocks in a particular piece of code. However, a thread dump cannot reveal how CPU usage is distributed when the job is CPU-bound.

Figure 1. Thread dumps of the sample job's TaskManager shown in Figure 2.

(Thread dumps taken at different moments may show different call stacks, which can confuse the tuner.)

2. Flink Operator Flame Graph (FLIP-165): analyzes CPU performance bottlenecks at the operator level, but lacks detailed information such as system calls. The feature also has the following limitations:

  • Limitations of Functionality
    • Operator-level stacks only: the popular profiling tool async-profiler covers CPU, memory allocation, locks, and other profiling scenarios. More importantly, the full Java thread stacks and system call stacks it provides are critical in some cases. In contrast, this feature only provides the call stack of the operator.
    • Inability to adjust the sampling duration: this feature uses a fixed sampling duration. The same duration is not suitable for every job; a parameterized sampling duration is needed to capture the critical CPU performance bottlenecks in different cases.
  • Inconvenient for Analysis
    • Not downloadable: the feature refreshes continuously, so just as you settle in to examine a stack in detail, it is likely to be redrawn in the background. Moreover, since the graph is rendered in real time, it cannot be saved as an interactive file (e.g. an HTML file), which makes it difficult to analyze, compare, and share.
    • No keyword search in the flame graph: searching for suspicious stacks is a very useful capability; manually scanning all the stacks for the desired information is exhausting.

3. Monitoring: metrics such as watermarks, data consumption latency, and operator busyRatio tell us whether the job currently has a bottleneck, but monitoring alone cannot pinpoint the code where the bottleneck lies.

Why do we need this FLIP?

  • A thread dump cannot show how CPU usage is distributed over time.
  • The operator-level flame graph provided by FLIP-165 cannot draw system call stacks, and it offers no way to save the graph, search it, or adjust the sampling duration case by case.

As we know, async-profiler is a low-overhead sampling profiler for Java (a profiling instance is shown in Figure 2), with less than 3% overhead according to the report.

With async-profiler running on a TaskManager, we can capture the TM-level CPU usage distribution of program execution, including the call stacks of all subtasks on the TaskManager and even system calls. The result can be exported as HTML, searched, and profiled over a user-specified duration. However, using it this way requires logging into the physical machine or container hosting the TM, downloading and installing the tool, and running it from the command line. Such operations are clearly unsafe in a production environment: there are both permission issues (login/export) and security risks in distributed systems.

Figure 2. Sample Profiling Instance of GZip Bottleneck Job

(Unlike a thread dump, you can quickly see that `deflater` is the most CPU-intensive stack.)

The steps currently required to use async-profiler are cumbersome, and the shell-based operation raises permission and security concerns. If we integrated it into Flink and exposed it conveniently in the Web UI, it would greatly improve the experience and efficiency of locating bottlenecks.

Overall, this FLIP aims to provide TaskManager/JobManager-level flame graph generation based on async-profiler in the Flink Web UI, along with parameterized sampling durations and easy-to-use download capabilities.

Public Interfaces

To make the profiling service available in the Web UI, the following REST APIs will be added (a usage sketch follows the list):

  • API for creating a profiling instance
    • For TaskManager [/taskmanager/:tm-id/profiler?type=create&duration=%d&mode=%s]
    • For JobManager [/jobmanager/profiler?type=create&duration=%d&mode=%s]
  • API for listing the current profiling instances
    • For TaskManager [/taskmanager/:tm-id/profiler?type=list]
    • For JobManager [/jobmanager/profiler?type=list]
  • API for downloading a profiling result file (flame graph in HTML)
    • For TaskManager [/taskmanager/:tm-id/profiler/:file]
    • For JobManager [/jobmanager/profiler/:file]
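For illustration only, submitting a profiling request to the proposed TaskManager endpoint could look like the sketch below. The host, port, tm-id, HTTP verb, and parameter values (duration=30, mode=itimer) are placeholders, not a finalized contract:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SubmitProfilingRequest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Ask TaskManager "tm-1" to profile for 30 seconds in itimer mode (values are illustrative).
        HttpRequest create = HttpRequest.newBuilder()
                .uri(URI.create(
                        "http://localhost:8081/taskmanager/tm-1/profiler?type=create&duration=30&mode=itimer"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(create, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());

        // The same pattern applies to listing results (?type=list) and downloading a finished
        // result file (/taskmanager/tm-1/profiler/:file).
    }
}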

To keep the feature parameterized and under control, the following configuration options will be added (a declaration sketch follows the list):

  • rest.profiling.enabled: controls whether the feature is enabled, false by default
  • rest.profiling.max-duration: controls the maximum allowed sampling duration, 300s by default
  • rest.profiling.history-size: controls the maximum number of sampling results to retain (older results are deleted on a rolling basis), 10 by default
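As a rough sketch of how these options could be declared, following Flink's existing ConfigOptions pattern (the class name, field names, and value types below are illustrative; only the keys and defaults come from the list above):

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

import java.time.Duration;

public class ProfilingOptions {
    // Feature switch; the profiling REST endpoints are rejected when this is false.
    public static final ConfigOption<Boolean> PROFILING_ENABLED =
            ConfigOptions.key("rest.profiling.enabled")
                    .booleanType()
                    .defaultValue(false)
                    .withDescription("Whether the TaskManager/JobManager profiler is enabled.");

    // Upper bound for a single sampling run requested via the REST API.
    public static final ConfigOption<Duration> PROFILING_MAX_DURATION =
            ConfigOptions.key("rest.profiling.max-duration")
                    .durationType()
                    .defaultValue(Duration.ofSeconds(300))
                    .withDescription("Maximum allowed sampling duration of a profiling request.");

    // Number of finished profiling results to keep; older results are rolled off.
    public static final ConfigOption<Integer> PROFILING_HISTORY_SIZE =
            ConfigOptions.key("rest.profiling.history-size")
                    .intType()
                    .defaultValue(10)
                    .withDescription("Maximum number of profiling result files to retain.");
}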

Proposed Changes

Architecture Overview

The proposed solution in our FLIP is shown in the figure below.

On the TaskManager:

  1. Flink users submit a profiling request through the REST API.
  2. The Resource Manager forwards the request to the user-specified TaskManager.
  3. The Task Executor invokes the native methods provided by async-profiler for the current platform.
  4. When profiling completes, the TaskManager returns the file download path (this is asynchronous and driven by the front end continuously polling the sampling status).
  5. The JobManager lets the user download the corresponding result files from the TaskManager via the blob service.

On the JobManager, the steps are similar to those on the TaskManager; the only difference is that the invocation of async-profiler is completed directly in the REST gateway (as the dotted line in Figure 3 shows). A sketch of the profiling entry point implied by this flow is given after the figure.

Figure 3. An overview of our proposal on TaskManager & JobManager
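The flow above implies a dedicated profiling entry point behind the REST handlers. The sketch below is purely hypothetical (ProfilingService, its method names, and its signatures are not existing Flink APIs) and only illustrates the asynchronous shape of steps 2-5:

import java.util.Collection;
import java.util.concurrent.CompletableFuture;

/**
 * Hypothetical sketch of the profiling entry point behind the REST handlers.
 * The REST layer forwards the user's request to the target TaskManager/JobManager (step 2),
 * the implementation drives async-profiler (step 3), and the returned future completes with
 * the result file name once sampling finishes (step 4), ready to be served for download (step 5).
 */
public interface ProfilingService {

    /** Starts a sampling run; completes with the flame-graph file name when profiling ends. */
    CompletableFuture<String> requestProfiling(int durationInSeconds, String mode);

    /** Lists the retained profiling result files, bounded by rest.profiling.history-size. */
    CompletableFuture<Collection<String>> listProfilingResults();
}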

Cross Platform with JNI

The package tools.profiler:async-profiler:2.9 bundles the native runtime libraries for all platforms supported by async-profiler behind a unified API; at runtime it selects the appropriate library for the current environment and invokes it via the Java Native Interface (JNI).

With this package, we can directly support the platforms currently covered by async-profiler; a minimal invocation sketch follows the dependency declaration below.

<dependency>
   <groupId>tools.profiler</groupId>
   <artifactId>async-profiler</artifactId>
   <version>2.9</version>
</dependency>
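With this dependency on the classpath, driving the profiler from Java is a thin call into the bundled library. A minimal sketch is shown below; the event name, sampling duration, and output path are illustrative, and production code would add proper error handling:

import one.profiler.AsyncProfiler;

public class AsyncProfilerSketch {
    public static void main(String[] args) throws Exception {
        // Loads the native async-profiler library for the current platform via JNI
        // (bundled in the Maven artifact, per the description above).
        AsyncProfiler profiler = AsyncProfiler.getInstance();

        // Start CPU sampling in itimer mode (the mode proposed as the default in this FLIP).
        profiler.execute("start,event=itimer");

        // Sample for the user-specified duration (30s here, purely for illustration).
        Thread.sleep(30_000);

        // Stop and write an interactive HTML flame graph that can later be served for download.
        profiler.execute("stop,file=/tmp/profiling-result.html");
    }
}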

Note that the platforms currently supported by async-profiler are as follows (excerpted from async-profiler):

  • Linux / x64 / x86 / arm64 / arm32 / ppc64le
  • macOS / x64 / arm64

Interactive UI

Flink users can submit a profiling request and export the result in the Flink Web UI via the following simple steps:

  1. Select the TaskManager to be sampled in the TaskManagers tab (or through the link in the operator detail drawer). Note that FLINK-29996 already provides the ability to jump to the TaskManager page from a back-pressured node.
  2. Enter the desired sampling duration and profiling mode (event_mode), then click the "Create Profiling Instance" button to submit the profiling request.
  3. The profiling progress refreshes automatically. Once sampling completes, a download link or an error message is displayed in the corresponding profiling request record.
  4. Download the interactive HTML file locally by clicking the download link, for further comparison, searching, and sharing.

Figure 4. Examples of user interactions

Compatibility, Deprecation, and Migration Plan

  • What impact (if any) will there be on existing users? None.
  • If we are changing behavior how will we phase out the older behavior? Not applicable; no existing behavior changes.
  • If we need special migration tools, describe them here. None are needed.
  • When will we remove the existing behavior? Not applicable.

Test Plan

Functionality:

  • Ensure that the flame graph can be generated and exported in different environments (Linux: x86/arm, macOS).

Parameter Testing:

  • Ensure that the relevant interfaces cannot be accessed when the feature is disabled, and that an appropriate hint about the configuration option is returned.
  • Ensure the maximum sampling time is controlled by the configuration.
  • Ensure that rolling deletion is controlled by the configuration.

Rejected Alternatives

N/A

Main Concerns

We took the following concerns from the earlier discussion of FLIP-213 into consideration:

1. Calling shell scripts triggered via the web UI is a security concern and it needs to be evaluated carefully if it could introduce any unexpected attack vectors (depending on the implementation, passed parameters, etc.)

In our proposal, we only expose three REST APIs in the Web UI: submit, status query, and download. After the user submits a profiling request, we invoke the async-profiler functions at runtime through the API provided by tools.profiler:async-profiler:2.9, i.e. by calling into async-profiler via native methods rather than invoking shell scripts from Java.


2. My understanding is that the async-profiler implementation is system-dependent. How do you propose to handle multiple architectures? Would you like to ship each available implementation within Flink?

As illustrated in the section 'Cross Platform with JNI'.


3. Do you plan to make use of full async-profiler features including native calls sampling with perf_events? If so, the issue I see is that some environments restrict ptrace calls by default.

For now, we only plan to provide the complete workflow of sampling submission, sampling status query, and sampling result download, based on async-profiler's itimer mode, to draw a flame graph of CPU usage.

The reason for using itimer mode is that some containerized environments restrict perf_events support; itimer does not require perf_events, which keeps the feature compatible with such environments. As described in the async-profiler troubleshooting section:

If changing the configuration is not possible, you may fall back to -e itimer profiling mode.

It is similar to CPU mode but does not require perf_events support. As a drawback, there will be no kernel stack traces.

UPDATE: From the discussion thread, we see that users also want this feature to leverage perf_events where possible, and since async-profiler also supports allocation and wall-clock profiling, we could extend this feature to cover more cases.
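To make this concrete, one possible event-selection strategy is to try perf_events-based CPU profiling first and fall back to itimer when the environment rejects it. The sketch below is illustrative only (the exception-based fallback and the file path are assumptions, not a settled design):

import one.profiler.AsyncProfiler;

public class EventFallbackSketch {
    public static void main(String[] args) throws Exception {
        AsyncProfiler profiler = AsyncProfiler.getInstance();
        try {
            // Preferred: perf_events-based CPU profiling, which includes kernel stack traces.
            profiler.execute("start,event=cpu");
        } catch (Exception e) {
            // Restricted environments (e.g. containers without perf_events access) may reject
            // the cpu event; fall back to itimer, losing kernel stacks but keeping Java stacks.
            profiler.execute("start,event=itimer");
        }

        Thread.sleep(10_000); // illustrative sampling duration

        profiler.execute("stop,file=/tmp/cpu-profile.html");
    }
}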