JEP 509: JFR CPU-Time Profiling (Experimental)
Authors | Jaroslav Bachorík, Johannes Bechberger & Ron Pressler
Owner | Johannes Bechberger |
Type | Feature |
Scope | JDK |
Status | Candidate |
Component | hotspot / jfr |
Discussion | hotspot dash jfr dash dev at openjdk dot org |
Effort | M |
Duration | S |
Reviewed by | Markus Grönlund |
Created | 2024/08/04 10:34 |
Updated | 2025/04/15 20:26 |
Issue | 8337789 |
Summary
Enhance the JDK Flight Recorder (JFR) to capture CPU-time profiling information on Linux. This is an experimental feature.
Motivation
A running program consumes computational resources such as memory, CPU cycles, and elapsed time. To profile a program is to measure the consumption of such resources by specific elements of the program. A profile might indicate that, e.g., one method consumes 20% of a resource, while another method consumes only 0.1%.
Profiling can help make a program more efficient, and developers more productive, by identifying which program elements to optimize. Without profiling, we might optimize a method that consumes few resources to begin with, having little impact on the program's overall performance while wasting effort. For example, optimizing a method that takes 0.1% of the program’s total execution time to run ten times faster will only reduce the program's execution time by 0.09%.
JFR, the JDK Flight Recorder, is the JDK's profiling and monitoring facility. The core of JFR is a low-overhead mechanism for recording events emitted by the JVM or by program code. Some events, such as loading a class, are recorded whenever an action occurs. Others, such as those used for profiling, are recorded by statistically sampling the program's activity as it consumes a resource. The various JFR events can be turned on or off, allowing a more detailed, higher-overhead collection of information during development and a less detailed, lower-overhead collection of information in production.
JFR can already profile a program's consumption of two resources: heap memory and execution time. A heap-memory profile shows which program elements allocate objects that consume a significant portion of the heap, and what types of objects those are. An execution-time profile shows which program elements consume significant elapsed real time, i.e., wall-clock time.
Execution time vs. CPU time
To create an execution-time profile, JFR samples the execution stacks of program threads at fixed intervals of, say, 20ms. Each sample results in a JFR event, the stream of which a suitable tool can summarize into a textual or graphical profile.
An execution-time profile shows the total amount of elapsed time spent executing each of the program's methods relative to the total time spent executing the entire program. It thus identifies methods whose execution, on average, takes a relatively long amount of time. Optimizing these methods will improve the program's latency.
Execution time does not necessarily reflect CPU time. A method that sorts an array, e.g., spends all of its time on the CPU. Its execution time corresponds to the number of CPU cycles it consumes. In contrast, a method that reads from a network socket might spend most of its time idly waiting for bytes to arrive over the wire. Of the time it consumes, only a small portion is spent on the CPU. An execution-time profile will not distinguish between these cases.
Even a program that does a lot of I/O can be constrained by the CPU. A computation-heavy method might consume little execution time compared to the program's I/O operations, thus having little effect on latency — but it might consume most of the program's CPU cycles, thus affecting throughput. Identifying and optimizing such methods will reduce CPU consumption and improve the program's throughput — but in order to do so, we need to profile CPU time rather than execution time.
For example, consider a program, HttpRequests, with two threads, each performing HTTP requests. One thread runs a tenFastRequests method that makes ten requests, sequentially, to an HTTP endpoint that responds in 10ms; the other runs a oneSlowRequest method that makes a single request to an endpoint that responds in 100ms. The average latency of both methods should be about the same, and so the total time spent executing them should be about the same.
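In code, the program might look like the following sketch. (This is an illustrative reconstruction, not code from the JEP; the class and method names follow the text, while the endpoint URLs are assumptions standing in for services that respond in roughly 10ms and 100ms.)

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HttpRequests {
    static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Perform one blocking GET request and discard the response body.
    static void get(String url) throws Exception {
        CLIENT.send(HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.discarding());
    }

    // Ten sequential requests to an endpoint that responds in ~10ms.
    static void tenFastRequests() throws Exception {
        for (int i = 0; i < 10; i++) get("http://localhost:8080/fast");
    }

    // A single request to an endpoint that responds in ~100ms.
    static void oneSlowRequest() throws Exception {
        get("http://localhost:8080/slow");
    }

    public static void main(String[] args) throws Exception {
        Thread fast = new Thread(() -> { try { tenFastRequests(); } catch (Exception e) { e.printStackTrace(); } });
        Thread slow = new Thread(() -> { try { oneSlowRequest(); } catch (Exception e) { e.printStackTrace(); } });
        fast.start(); slow.start();
        fast.join(); slow.join();
    }
}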
We can record a stream of execution-time profiling events like so:
$ java -XX:StartFlightRecording=filename=profile.jfr,settings=profile.jfc HttpRequests
At fixed time intervals, JFR records ExecutionSample events into the file profile.jfr. Each event captures the stack trace of a thread running Java code, thus recording all of the methods currently running on that thread. (The file profile.jfc is a JFR configuration file, included in the JDK, that configures the JFR events needed for an execution-time profile.)
We can generate a textual profile from the recorded event stream by using the jfr tool included in the JDK:
$ jfr view native-methods profile.jfr
Waiting or Executing Native Methods
Method Samples Percent
--------------------------------------------------------------- ------- -------
sun.nio.ch.SocketDispatcher.read0(FileDescriptor, long, int) 102 98.08%
...
This clearly shows that most of the program’s time is spent waiting for socket I/O.
We can generate a graphical profile, in the form of a flame graph, by using the JDK Mission Control tool (JMC):

[flame graph: execution-time profile of HttpRequests in JMC]
Here we see that the oneSlowRequest and tenFastRequests methods take a similar amount of execution time, as we expect.
However, we also expect tenFastRequests to take more CPU time than oneSlowRequest, since ten rounds of creating requests and processing responses require more CPU cycles than just one round. If these methods were run concurrently on many threads then the program could become CPU-bound, yet an execution-time profile would still show most of the program's time being spent waiting for socket I/O. If we could profile CPU time then we could see that optimizing tenFastRequests, rather than oneSlowRequest, could improve the program's throughput.
(JFR's execution time sampling offers only an approximation of the distribution of execution time: To reduce profiling overhead, at each sampling period only up to six randomly selected platform threads are sampled. Furthermore, JFR may fail to obtain a sample due to the thread being in a state that prevents a sample from being recorded, yet such failed attempts are not recorded, which can result in a skewed profile. We may, in the future, offer a more accurate execution-time profile when a higher overhead is acceptable.)
CPU-time profiling
The ability to accurately and precisely measure CPU-cycle consumption was added to the Linux kernel in version 2.6.12 via a timer that emits signals at fixed intervals of CPU time rather than fixed intervals of elapsed real time. Most profilers on Linux use this mechanism to produce CPU-time profiles.
Some popular third-party Java tools, including async-profiler, use Linux's CPU timer to produce CPU-time profiles of Java programs. However, to do so, such tools interact with the Java runtime through unsupported internal interfaces. This is inherently unsafe and can lead to process crashes.
We should enhance JFR to use the Linux kernel's CPU timer to safely produce CPU-time profiles of Java programs. This would help the many developers who deploy Java applications on Linux to make those applications more efficient.
Description
We add CPU-time profiling to JFR, on Linux systems only. This feature is experimental for now, so that we may refine it based on experience before making it permanent. We may add CPU-time profiling to JFR on other platforms in the future.
JFR will use Linux's CPU-timer mechanism to sample the stack of every thread running Java code at fixed intervals of CPU time. Each such sample is recorded in a new type of event, jdk.CPUTimeSample. This event is not enabled by default.
This event is similar to the existing jdk.ExecutionSample event for execution-time sampling. Enabling CPU-time events does not affect execution-time events in any way, so the two can be collected simultaneously.
We can enable the new event in a recording started at launch like so:
$ java -XX:StartFlightRecording=jdk.CPUTimeSample#enabled=true,filename=profile.jfr ...
With the new CPU-time sampler, the flame graph makes it clear that the application spends nearly all of its CPU cycles in tenFastRequests:

[flame graph: CPU-time profile of HttpRequests in JMC]
A textual profile of the hot CPU methods, i.e., those that consume many CPU cycles in their own bodies rather than in calls to other methods, can be obtained like so:
$ jfr view cpu-time-hot-methods profile.jfr
However, in this particular example, the output is not as useful as the flame graph.
Event details
The jdk.CPUTimeSample event for a specific stack trace has these fields, which can be read programmatically as sketched below:
- startTime — the CPU time just before walking the stack,
- endTime — the CPU time just after walking the stack,
- eventThread — the identity of the sampled thread,
- stacktrace — the stack trace, or null if the sampler failed to walk the stack, and
- samplingPeriod — the sampling period at the time the sample was obtained.
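These fields can be inspected with the jdk.jfr.consumer API. The following is a minimal sketch, not part of the JEP itself; it assumes that samplingPeriod is exposed as a duration, in line with the throttle settings described next.

import java.nio.file.Path;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class ReadCpuSamples {
    public static void main(String[] args) throws Exception {
        try (RecordingFile recording = new RecordingFile(Path.of("profile.jfr"))) {
            while (recording.hasMoreEvents()) {
                RecordedEvent event = recording.readEvent();
                if (!event.getEventType().getName().equals("jdk.CPUTimeSample")) {
                    continue;
                }
                // stacktrace is null when the sampler failed to walk the stack.
                int frames = event.getStackTrace() == null
                        ? 0 : event.getStackTrace().getFrames().size();
                System.out.printf("%s: %d frames, period %s%n",
                        event.getThread("eventThread").getJavaName(),
                        frames,
                        event.getDuration("samplingPeriod")); // assumed duration field
            }
        }
    }
}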
The rate of sampling for this event can be set through the event's throttle property, in two ways:
- As a time period: throttle=10ms causes a sample event to be produced for each platform thread after it consumes 10ms of CPU time.
- As an overall rate: throttle=500/s causes 500 sample events to be produced every second, spread evenly over all platform threads.
The default throttle setting is 500/s, while in the profiling configuration included in the JDK, profile.jfc, it is 10ms. Higher rates produce a more accurate and precise profile at the cost of increasing the program's sampling overhead.
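The same setting can be applied from within the process via the jdk.jfr.Recording API. A minimal sketch, assuming the throttle property accepts the same values as on the command line:

import java.nio.file.Path;
import jdk.jfr.Recording;

public class CpuProfiledRun {
    public static void main(String[] args) throws Exception {
        try (Recording recording = new Recording()) {
            // Request one sample per platform thread for every 10ms of CPU time.
            recording.enable("jdk.CPUTimeSample").with("throttle", "10ms");
            recording.start();
            // ... run the workload to be profiled ...
            recording.stop();
            recording.dump(Path.of("profile.jfr"));
        }
    }
}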
Another new event, jdk.CPUTimeSampleLost, is emitted when samples are lost due to implementation constraints, such as internal data structures becoming full. Its lostSamples field contains the number of samples dropped in the last sampling round. This event ensures that the event stream, as a whole, correctly reflects the program's overall CPU activity. It is enabled by default, but emitted only when jdk.CPUTimeSample is enabled.
Because this feature is experimental, these JFR events are tagged as such with the @Experimental annotation on their JFR event type. Unlike other kinds of experimental features of the HotSpot JVM, these events do not need to be specially enabled by command-line options such as -XX:+UnlockExperimentalVMOptions.
Profiling in production
Continuous profiling, in which an application running in production is profiled throughout its lifetime, is gaining popularity. Choosing a throttle setting with a sufficiently low overhead can make it practical to collect CPU-time profiling information continuously in production.
If continuous CPU-time profiling in production is not possible, another option is to occasionally gather a CPU profile. First, create an appropriate JFR configuration file:
$ jfr configure --input profile.jfc --output /tmp/cpu_profile.jfc \
jdk.CPUTimeSample#enabled=true jdk.CPUTimeSample#throttle=20ms
Then use the jcmd tool to start a recording for the running program, using that configuration:
$ jcmd <pid> JFR.start settings=/tmp/cpu_profile.jfc duration=4m
Here the recording stops after four minutes, but we can choose other durations or, alternatively, manually stop the recording after a while via jcmd <pid> JFR.stop.
As with every JFR event, jdk.CPUTimeSample events can be streamed as they are recorded, even over the wire to a remote consumer, for online analysis.
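For instance, here is a minimal in-process sketch using the jdk.jfr.consumer.RecordingStream API; that lostSamples can be read as a long is an assumption based on its description above.

import jdk.jfr.consumer.RecordingStream;

public class LiveCpuProfile {
    public static void main(String[] args) {
        try (RecordingStream stream = new RecordingStream()) {
            stream.enable("jdk.CPUTimeSample").with("throttle", "10ms");
            stream.enable("jdk.CPUTimeSampleLost");
            stream.onEvent("jdk.CPUTimeSample", event ->
                    System.out.println(event.getThread("eventThread").getJavaName()));
            stream.onEvent("jdk.CPUTimeSampleLost", event ->
                    System.out.println("lost samples: " + event.getLong("lostSamples")));
            stream.start(); // blocks, dispatching events from the current JVM
        }
    }
}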
Alternatives
We could enable external tools to use the Linux CPU-time sampler by adding safe and supported native HotSpot APIs for that purpose. Doing so would, however, expose internal details of the runtime, which would make the JDK harder to evolve. This approach would also be less efficient, and thus less suitable for profiling in production, than implementing the feature directly in the JDK.