JEP 518: JFR Cooperative Sampling

Owner	Markus Grönlund
Type	Feature
Scope	Implementation
Status	Closed / Delivered
Release	25
Component	hotspot / jfr
Discussion	hotspot dash jfr dash dev at openjdk dot org
Effort	M
Duration	M
Reviewed by	Erik Gahlin, Vladimir Kozlov
Endorsed by	Vladimir Kozlov
Created	2025/02/19 14:12
Updated	2025/06/10 16:31
Issue	8350338

Summary

Improve the stability of the JDK Flight Recorder (JFR) when it asynchronously samples Java thread stacks. Achieve this by walking call stacks only at safepoints, while minimizing safepoint bias.

Motivation

A running program consumes computational resources such as memory, CPU cycles, and elapsed time. To profile a program is to measure the consumption of such resources by specific elements of the program. A profile might indicate that, e.g., one method consumes 20% of a resource, while another consumes only 0.1%.

Profiling can help make a program more efficient, and developers more productive, by identifying which program elements to optimize. Without profiling, we might optimize a method that was consuming few resources to begin with, having little impact on the program's overall performance while wasting effort. For example, optimizing a method that takes 0.1% of the program's total execution time to run ten times faster will only reduce the program's execution time by 0.09%.
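The arithmetic behind that example can be checked with a small sketch. The class and method names below are hypothetical and exist only for illustration. The calculation assumes that the rest of the program's run time is unaffected by the optimization.

```java
// Illustrative only: the class and method names are hypothetical.
final class OptimizationPayoff {

    /**
     * Fraction of the program's total run time that is saved when a method
     * accounting for `share` of the run time is made `speedup` times faster.
     */
    static double timeSaved(double share, double speedup) {
        return share * (1.0 - 1.0 / speedup);
    }

    public static void main(String[] args) {
        // A method using 0.1% of the run time, made ten times faster.
        System.out.printf("cold method: %.2f%%%n", 100 * timeSaved(0.001, 10));
        // A method using 20% of the run time, made only twice as fast.
        System.out.printf("hot method:  %.2f%%%n", 100 * timeSaved(0.20, 2));
    }
}
```

The first call reproduces the 0.09% figure above. The second shows that even a modest speedup of a method consuming 20% of the run time saves about 10%.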

JFR, the JDK Flight Recorder, is the JDK's profiling and monitoring facility. The core of JFR is a low-overhead mechanism for recording events emitted by the HotSpot JVM or by program code. Some events, such as loading a class, are recorded whenever an action occurs. Others, such as those used for profiling, are recorded by statistically sampling the program's activity as it consumes a resource. The various JFR events can be turned on or off, allowing a more detailed, higher-overhead collection of information during development and a less detailed, lower-overhead collection of information in production.

JFR can create an execution-time profile that shows which program elements consume significant elapsed real time, i.e., wall-clock time. It does this by sampling the execution stacks of program threads at fixed intervals of, say, 20 milliseconds. Each sample produces a JFR event containing a stack trace. Tools such as jfr and JDK Mission Control can summarize a stream of such events into a textual or graphical profile.
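As a rough illustration of how such samples can be collected programmatically, the sketch below uses the jdk.jfr API to enable the jdk.ExecutionSample event with a 20 millisecond period, run a workload, and read the samples back. The class name, the helper methods, and the busy-loop workload are hypothetical. How many samples appear depends on how long the workload runs, so a short run may record none.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

// Sketch only: the class, method, and workload names are hypothetical.
final class ExecutionSampleDemo {

    // CPU-bound busy work so that the sampler has Java frames to observe.
    static long burn(long iterations) {
        long acc = 0;
        for (long i = 0; i < iterations; i++) {
            acc += (i ^ acc) % 7;
        }
        return acc;
    }

    // Runs the workload under a recording and returns the execution samples.
    static List<RecordedEvent> record(Runnable workload) {
        try {
            Path file = Files.createTempFile("execution-samples", ".jfr");
            try (Recording recording = new Recording()) {
                // Sample Java thread stacks every 20 ms, as in the text above.
                recording.enable("jdk.ExecutionSample").withPeriod(Duration.ofMillis(20));
                recording.start();
                workload.run();
                recording.stop();
                recording.dump(file);
            }
            // Keep only the sampling events, in case others were recorded.
            List<RecordedEvent> samples = RecordingFile.readAllEvents(file).stream()
                    .filter(e -> e.getEventType().getName().equals("jdk.ExecutionSample"))
                    .toList();
            Files.deleteIfExists(file);
            return samples;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] sink = new long[1]; // keeps the workload's result live
        List<RecordedEvent> samples = record(() -> sink[0] = burn(500_000_000L));
        System.out.println("result " + sink[0] + ", samples " + samples.size());
        for (RecordedEvent sample : samples) {
            // Print the top frame of each sampled stack trace, if present.
            if (sample.getStackTrace() != null && !sample.getStackTrace().getFrames().isEmpty()) {
                var top = sample.getStackTrace().getFrames().get(0).getMethod();
                System.out.println(top.getType().getName() + "." + top.getName());
            }
        }
    }
}
```

Counting how often each top frame appears gives a crude version of the profile that tools such as jfr and JDK Mission Control present.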

In order to produce a stack trace for a program thread, JFR's sampler thread must suspend the target thread and parse the call frames on the stack. The HotSpot JVM maintains metadata to guide the parsing of stack frames, but that metadata is valid only when a thread is suspended at well-defined code locations known as safepoints. If we sample stacks only at safepoints, however, then we will likely suffer from the safepoint bias problem: We risk losing accuracy, since a frequently-executed span of code might not be anywhere near a safepoint. The safepoint bias problem is well known and thoroughly researched.

So as to avoid the safepoint bias problem, JFR samples the stacks of program threads asynchronously, suspending threads and parsing their stacks at code locations that are not necessarily safepoints. Since the metadata for parsing stack frames is not guaranteed to be valid at non-safepoints, JFR's sampler thread uses heuristics in order to generate a stack trace.

Unfortunately, these stack-parsing heuristics are inefficient and, worse, can crash the JVM when their results are incorrect. JFR attempts to prevent such crashes via platform-specific crash-protection mechanisms, but those mechanisms can fail in the presence of concurrent activity such as class unloading.

Description

We redesign JFR's sampling mechanism to avoid relying on risky stack-parsing heuristics. Instead, we parse thread stacks only at safepoints.

To avoid the safepoint bias problem, we take samples cooperatively. When it is time to take a sample, JFR's sampler thread still suspends the target thread. Rather than attempting to parse the stack, however, it just records the target's program counter and stack pointer in a sample request, which it appends to an internal thread-local queue. It then arranges for the target thread to stop at its next safepoint, and resumes the thread.

The target runs normally until its next safepoint. At that time, the safepoint handling code inspects the queue. If it finds any sample requests, then, for each one, it reconstructs a stack trace, adjusting for safepoint bias, and emits a JFR execution-time sampling event.
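The handoff between the sampler and the target can be modeled in a few lines of plain Java. This is a toy simulation of the protocol described above, not HotSpot code: the class names, the fields standing in for the program counter and stack pointer, the poll flag, and the event format are all hypothetical. Real stack-trace reconstruction is replaced by formatting the recorded values.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model only: every name here is hypothetical, not a HotSpot internal.
final class CooperativeSamplingSketch {

    /** What the sampler captures while the target is briefly suspended. */
    record SampleRequest(long pc, long sp) {}

    /** Simulated target thread with its own queue of pending requests. */
    static final class Target {
        private final Queue<SampleRequest> requests = new ArrayDeque<>();
        private final List<String> events = new ArrayList<>();
        private boolean pollArmed;
        long pc; // stand-in for the program counter
        long sp; // stand-in for the stack pointer

        /** Sampler side: record pc and sp, arm the poll, do no stack parsing. */
        synchronized void requestSample() {
            requests.add(new SampleRequest(pc, sp));
            pollArmed = true;
        }

        /** Target side: called when the thread reaches a safepoint. */
        synchronized void safepointPoll() {
            if (!pollArmed) {
                return; // fast path: nothing was requested
            }
            pollArmed = false;
            SampleRequest request;
            while ((request = requests.poll()) != null) {
                // The event reports where the thread was when the request was
                // made, not where the safepoint happens to be.
                events.add(String.format("ExecutionSample[pc=0x%x, sp=0x%x]",
                        request.pc(), request.sp()));
            }
        }

        synchronized List<String> events() {
            return List.copyOf(events);
        }
    }

    public static void main(String[] args) {
        Target target = new Target();
        target.pc = 0x1000;       // target is somewhere between safepoints
        target.sp = 0x7ff0;
        target.requestSample();   // sampler enqueues a request and arms the poll
        target.pc = 0x1040;       // target keeps running to its next safepoint
        target.safepointPoll();   // target drains the queue and emits the event
        target.safepointPoll();   // nothing pending, so nothing is emitted
        target.events().forEach(System.out::println);
    }
}
```

In this model the sampler's work is constant per sample, and all event construction happens on the target itself, where allocation is unrestricted. The model uses a lock for simplicity; it makes no claim about how the JVM synchronizes the real queue.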

Aside from being safe, this approach has several other advantages:

Creating a sample request requires hardly any work, and could be done in response to a hardware event or inside a signal handler.
The code to create stack traces and emit events is simpler. For example, it can dynamically allocate memory when it runs on the target thread, which it could not do when running on the sampler thread.
The sampler thread has less work to do, since it need not run heuristics, improving scalability.

This approach works well when the target thread is running Java code, whether interpreted or compiled, but not when the target thread is running native code. For threads running native code, we continue to use the existing approach.

Future Work

Our new approach does not entirely avoid safepoint bias. In some situations, such as when sampling inside a method for which the HotSpot JVM has an intrinsic implementation, it may be impossible to parse the stack. In these cases, the recorded stack trace will reflect the last Java stack frame, thereby introducing some bias. We intend to address this in future work.

Alternatives

The HotSpot JVM has an existing internal but unsupported mechanism, AsyncGetCallTrace, which is used by some third-party tools. Unfortunately, this mechanism relies on the same kind of risky stack-parsing heuristics that JFR uses today, but without any crash protection, so it is even riskier. Another drawback is that it is based on the POSIX SIGPROF signal, an equivalent of which does not exist on Windows.

Testing

This is strictly an implementation change. Existing unit, integration, and stress tests will suffice.

Dependencies

The implementation of JEP 509 (JFR CPU-Time Profiling) leverages the mechanism introduced here.