JEP 509: JFR CPU-Time Profiling (Experimental)

Authors: Jaroslav Bachorík, Johannes Bechberger & Ron Pressler
Owner: Johannes Bechberger
Type: Feature
Scope: JDK
Status: Candidate
Component: hotspot / jfr
Discussion: hotspot-jfr-dev at openjdk.org
Effort: M
Duration: S
Reviewed by: Markus Grönlund
Created: 2024/08/04 10:34
Updated: 2025/04/15 20:26
Issue: 8337789

Summary

Enhance the JDK Flight Recorder (JFR) to capture CPU-time profiling information on Linux. This is an experimental feature.

Motivation

A running program consumes computational resources such as memory, CPU cycles, and elapsed time. To profile a program is to measure the consumption of such resources by specific elements of the program. A profile might indicate that, e.g., one method consumes 20% of a resource, while another method consumes only 0.1%.

Profiling can help make a program more efficient, and developers more productive, by identifying which program elements to optimize. Without profiling, we might optimize a method that consumes few resources to begin with, having little impact on the program's overall performance while wasting effort. For example, optimizing a method that takes 0.1% of the program’s total execution time to run ten times faster will only reduce the program's execution time by 0.09%.

JFR, the JDK Flight Recorder, is the JDK's profiling and monitoring facility. The core of JFR is a low-overhead mechanism for recording events emitted by the JVM or by program code. Some events, such as loading a class, are recorded whenever an action occurs. Others, such as those used for profiling, are recorded by statistically sampling the program's activity as it consumes a resource. The various JFR events can be turned on or off, allowing a more detailed, higher-overhead collection of information during development and a less detailed, lower-overhead collection of information in production.

JFR can already profile a program's consumption of two resources: heap memory and execution time. A heap-memory profile shows which program elements allocate objects that consume a significant portion of the heap, and what types of objects those are. An execution-time profile shows which program elements consume significant elapsed real time, i.e., wall-clock time.

Execution time vs. CPU time

To create an execution-time profile, JFR samples the execution stacks of program threads at fixed intervals of, say, 20ms. Each sample results in a JFR event, the stream of which a suitable tool can summarize into a textual or graphical profile.

An execution-time profile shows the total amount of elapsed time spent executing each of the program's methods relative to the total time spent executing the entire program. It thus identifies methods whose execution, on average, takes a relatively long amount of time. Optimizing these methods will improve the program's latency.

Execution time does not necessarily reflect CPU time. A method that sorts an array, e.g., spends all of its time on the CPU. Its execution time corresponds to the number of CPU cycles it consumes. In contrast, a method that reads from a network socket might spend most of its time idly waiting for bytes to arrive over the wire. Of the time it consumes, only a small portion is spent on the CPU. An execution-time profile will not distinguish between these cases.

Even a program that does a lot of I/O can be constrained by the CPU. A computation-heavy method might consume little execution time compared to the program's I/O operations, thus having little effect on latency — but it might consume most of the program's CPU cycles, thus affecting throughput. Identifying and optimizing such methods will reduce CPU consumption and improve the program's throughput — but in order to do so, we need to profile CPU time rather than execution time.

For example, consider a program, HttpRequests, with two threads, each performing HTTP requests. One thread runs a tenFastRequests method that makes ten requests, sequentially, to an HTTP endpoint that responds in 10ms; the other runs a oneSlowRequest method that makes a single request to an endpoint that responds in 100ms. The average latency of both methods should be about the same, and so the total time spent executing them should be about the same.
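The HttpRequests program might look like the following hypothetical sketch. The class and method names are taken from the text; everything else is illustrative, and the real HTTP round trips are replaced by sleeps of the stated latencies so that the sketch is self-contained:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the HttpRequests example: two threads whose
// methods take the same elapsed time (~100 ms) but do different amounts
// of CPU work. Real HTTP calls are simulated with sleeps.
public class HttpRequests {

    // Ten sequential "requests" to an endpoint that responds in ~10 ms.
    static void tenFastRequests() {
        for (int i = 0; i < 10; i++) {
            simulateRequest(10);   // ~10 ms each, ~100 ms in total
        }
    }

    // One "request" to an endpoint that responds in ~100 ms.
    static void oneSlowRequest() {
        simulateRequest(100);      // ~100 ms in total
    }

    // Stand-in for an HTTP round trip: mostly idle waiting, little CPU work.
    static void simulateRequest(long millis) {
        try {
            TimeUnit.MILLISECONDS.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread fast = new Thread(HttpRequests::tenFastRequests);
        Thread slow = new Thread(HttpRequests::oneSlowRequest);
        long start = System.nanoTime();
        fast.start();
        slow.start();
        fast.join();
        slow.join();
        // Both threads finish after roughly the same elapsed time (~100 ms),
        // so an execution-time profile attributes similar time to each method.
        System.out.printf("elapsed: ~%d ms%n",
                (System.nanoTime() - start) / 1_000_000);
    }
}
```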

We can record a stream of execution-time profiling events like so:

$ java -XX:StartFlightRecording=filename=profile.jfr,settings=profile.jfc HttpRequests

At fixed time intervals, JFR records ExecutionSample events into the file profile.jfr. Each event captures the stack trace of a thread running Java code, thus recording all of the methods currently running on that thread. (The file profile.jfc is a JFR configuration file, included in the JDK, that configures the JFR events needed for an execution-time profile.)
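Such an event stream can also be produced and read back programmatically with the supported jdk.jfr API. The following sketch records jdk.ExecutionSample events around a CPU-bound workload and counts the methods that appear most often at the top of the stack — a simplified version of what the jfr tool and JMC do. The class name and the burnCpu workload are illustrative:

```java
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

// Records jdk.ExecutionSample events around a CPU-bound workload and
// summarizes which methods appear most often at the top of the stack.
public class TopFrames {

    static Map<String, Integer> profile() throws Exception {
        Path file = Files.createTempFile("profile", ".jfr");
        try (Recording recording = new Recording()) {
            recording.enable("jdk.ExecutionSample")
                     .withPeriod(Duration.ofMillis(10));  // sample every 10 ms
            recording.start();
            burnCpu(300);                                 // illustrative workload
            recording.stop();
            recording.dump(file);
        }
        Map<String, Integer> counts = new HashMap<>();
        for (RecordedEvent event : RecordingFile.readAllEvents(file)) {
            if (event.getEventType().getName().equals("jdk.ExecutionSample")
                    && event.getStackTrace() != null
                    && !event.getStackTrace().getFrames().isEmpty()) {
                var method = event.getStackTrace().getFrames().get(0).getMethod();
                counts.merge(method.getType().getName() + "." + method.getName(),
                             1, Integer::sum);
            }
        }
        Files.deleteIfExists(file);
        return counts;
    }

    // Busy-loop for roughly the given number of milliseconds.
    static void burnCpu(long millis) {
        long deadline = System.nanoTime() + millis * 1_000_000;
        long x = 0;
        while (System.nanoTime() < deadline) {
            x += Long.hashCode(x);  // keep the CPU busy
        }
    }

    public static void main(String[] args) throws Exception {
        profile().entrySet().stream()
                 .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                 .limit(5)
                 .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```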

We can generate a textual profile from the recorded event stream by using the jfr tool included in the JDK:

$ jfr view native-methods profile.jfr

                      Waiting or Executing Native Methods

Method                                                          Samples Percent
--------------------------------------------------------------- ------- -------
sun.nio.ch.SocketDispatcher.read0(FileDescriptor, long, int)        102  98.08%
...

This clearly shows that most of the program’s time is spent waiting for socket I/O.

We can generate a graphical profile, in the form of a flame graph, by using the JDK Mission Control tool (JMC):

[Figure: execution-time flame graph for HttpRequests]

Here we see that the oneSlowRequest and tenFastRequests methods take a similar amount of execution time, as we expect.

However, we also expect tenFastRequests to take more CPU time than oneSlowRequest, since ten rounds of creating requests and processing responses requires more CPU cycles than just one round. If these methods were run concurrently on many threads then the program could become CPU-bound, yet an execution-time profile would still show most of the program's time being spent waiting for socket I/O. If we could profile CPU time then we could see that optimizing tenFastRequests, rather than oneSlowRequest, could improve the program’s throughput.

(JFR's execution time sampling offers only an approximation of the distribution of execution time: To reduce profiling overhead, at each sampling period only up to six randomly selected platform threads are sampled. Furthermore, JFR may fail to obtain a sample due to the thread being in a state that prevents a sample from being recorded, yet such failed attempts are not recorded, which can result in a skewed profile. We may, in the future, offer a more accurate execution-time profile when a higher overhead is acceptable.)

CPU-time profiling

The ability to accurately and precisely measure CPU-cycle consumption was added to the Linux kernel in version 2.6.12 via a timer that emits signals at fixed intervals of CPU time rather than fixed intervals of elapsed real time. Most profilers on Linux use this mechanism to produce CPU-time profiles.

Some popular third-party Java tools, including async-profiler, use Linux's CPU timer to produce CPU-time profiles of Java programs. However, to do so, such tools interact with the Java runtime through unsupported internal interfaces. This is inherently unsafe and can lead to process crashes.

We should enhance JFR to use the Linux kernel's CPU timer to safely produce CPU-time profiles of Java programs. This would help the many developers who deploy Java applications on Linux to make those applications more efficient.

Description

We add CPU-time profiling to JFR, on Linux systems only. This feature is experimental for now, so that we may refine it based on experience before making it permanent. We may add CPU-time profiling to JFR on other platforms in the future.

JFR will use Linux's CPU-timer mechanism to sample the stack of every thread running Java code at fixed intervals of CPU time. Each such sample is recorded in a new type of event, jdk.CPUTimeSample. This event is not enabled by default.

This event is similar to the existing jdk.ExecutionSample event for execution-time sampling. Enabling CPU-time events does not affect execution-time events in any way, so the two can be collected simultaneously.

We can enable the new event in a recording started at launch like so:

$ java -XX:StartFlightRecording=jdk.CPUTimeSample#enabled=true,filename=profile.jfr ...

With the new CPU-time sampler, in the flame graph it becomes clear that the application spends nearly all of its CPU cycles in tenFastRequests:

[Figure: CPU-time flame graph for HttpRequests]

A textual profile of the hot CPU methods, i.e., those that consume many CPU cycles in their own bodies rather than in calls to other methods, can be obtained like so:

$ jfr view cpu-time-hot-methods profile.jfr

However, in this particular example, the output is not as useful as the flame graph.

Event details

The jdk.CPUTimeSample event for a specific stack trace has these fields:

The rate of sampling for this event can be set through the event's throttle property, in two ways: as a maximum rate of events per unit of time (e.g., 500/s) or as a fixed sampling period (e.g., 10ms).

The default throttle setting is 500/s, while in the profiling configuration included in the JDK, profile.jfc, it is 10ms. Higher rates produce a more accurate and precise profile at the cost of increasing the program's sampling overhead.
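The throttle setting can also be applied programmatically through the supported jdk.jfr API. Because settings are addressed by event name, the following sketch compiles and runs even on JDKs or platforms where the experimental event is unavailable; the class name and output file name are illustrative:

```java
import jdk.jfr.Recording;
import java.nio.file.Path;

// Starts a recording with CPU-time sampling enabled at a 10 ms throttle.
// On platforms without the experimental jdk.CPUTimeSample event, the
// setting is ignored and the recording simply contains no such events.
public class CpuTimeRecording {
    public static void main(String[] args) throws Exception {
        try (Recording recording = new Recording()) {
            recording.enable("jdk.CPUTimeSample")
                     .with("throttle", "10ms");   // or, e.g., "500/s"
            recording.start();
            Thread.sleep(1_000);                  // profiled work would run here
            recording.stop();
            recording.dump(Path.of("profile.jfr"));  // illustrative file name
        }
    }
}
```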

Another new event, jdk.CPUTimeSampleLost, is emitted when samples are lost due to implementation constraints, such as internal data structures becoming full. Its lostSamples field contains the number of samples dropped in the last sampling round. This event ensures that the event stream, as a whole, correctly reflects the program's overall CPU activity. It is enabled by default, but emitted only when jdk.CPUTimeSample is enabled.

Because this feature is experimental, these JFR events are tagged as such with the @Experimental annotation on their JFR event type. Unlike other kinds of experimental features of the HotSpot JVM, these events do not need to be specially enabled by command-line options such as -XX:+UnlockExperimentalVMOptions.

Profiling in production

Continuous profiling, in which an application running in production is profiled throughout its lifetime, is gaining popularity. Choosing a throttle setting with a sufficiently low overhead can make it practical to collect CPU-time profiling information continuously in production.

If continuous CPU-time profiling in production is not possible, another option is to occasionally gather a CPU profile. First, create an appropriate JFR configuration file:

$ jfr configure --input profile.jfc --output /tmp/cpu_profile.jfc \
  jdk.CPUTimeSample#enabled=true jdk.CPUTimeSample#throttle=20ms

Then use the jcmd tool to start a recording for the running program, using that configuration:

$ jcmd <pid> JFR.start settings=/tmp/cpu_profile.jfc duration=4m

Here the recording stops after four minutes, but we can choose other durations or, alternatively, manually stop the recording after a while via jcmd <pid> JFR.stop.

As with every JFR event, jdk.CPUTimeSample events can be streamed as they are recorded, even over the wire to a remote consumer, for online analysis.
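An in-process consumer can, for example, subscribe to both new events with the supported jdk.jfr.consumer.RecordingStream API. This is a sketch with an illustrative class name; on platforms or JDK versions without the experimental events, the handlers simply never fire:

```java
import jdk.jfr.consumer.RecordingStream;

// Streams CPU-time samples and lost-sample notifications as they are
// recorded, printing the top frame of each sample for online analysis.
public class CpuTimeStreaming {
    public static void main(String[] args) throws InterruptedException {
        try (RecordingStream stream = new RecordingStream()) {
            stream.enable("jdk.CPUTimeSample").with("throttle", "10ms");
            stream.onEvent("jdk.CPUTimeSample", event -> {
                var stack = event.getStackTrace();
                if (stack != null && !stack.getFrames().isEmpty()) {
                    System.out.println(
                        stack.getFrames().get(0).getMethod().getName());
                }
            });
            stream.onEvent("jdk.CPUTimeSampleLost", event ->
                System.out.println("lost: " + event.getLong("lostSamples")));
            stream.startAsync();       // do not block the current thread
            Thread.sleep(1_000);       // application work would happen here
        }
    }
}
```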

Alternatives

We could enable external tools to use the Linux CPU-time sampler by adding safe and supported native HotSpot APIs for that purpose. Doing so would, however, expose internal details of the runtime, which would make the JDK harder to evolve. This approach would also be less efficient, and thus less suitable for profiling in production, than implementing the feature directly in the JDK.