JEP draft: CPU Time based profiling for JFR

AuthorJaroslav Bachorik
OwnerJohannes Bechberger
TypeFeature
ScopeImplementation
StatusDraft
Componenthotspot / jfr
Discussionhotspot dash jfr dash dev at openjdk dot org
EffortM
DurationS
Created2024/08/04 10:34
Updated2024/09/12 11:19
Issue8337789

CPU Profiler

Summary

Offer a new CPU-time sampling strategy for JFR on Linux to profile CPU usage.

Goals

Non-Goals

Motivation

Profiling a program means determining how its various units consume different kinds of computational resources such as memory, computation, or elapsed time. A program profile shows what portion of a relevant resource is consumed by the different parts of the program, i.e. method foo consumes 20% of the resource, while method bar consumes only 0.1%.

Profiling is what makes the difference between an efficient program and an inefficient one because a profile shows what the developer needs to optimize to make the program more efficient. Optimizing parts of the program that don't account for a significant portion of the profile will be a wasted effort as they will have little or no impact.

JFR, the JDK Flight Recorder, is a JDK mechanism for recording various events at a high rate and with low overhead, and it serves as the core of the JDK's profiling capabilities as well as the core of the JDK's monitoring capabilities.

JFR can profile a program with respect to two resources: allocation of heap memory and elapsed ("wall-clock") time. A heap memory profile shows which parts of the code allocate objects that consume a significant portion of the heap and what type of objects these are; it helps to make the program's memory use more efficient. An execution time profiler shows where different program threads spend their time, and it helps improve the program's latency.

To capture a profile, JFR records relevant events that sample the program's execution. For example, to obtain an allocation profile we sample which object is allocated, and by what code, after every allocation of some fixed amount of memory, say, 10 MB; to obtain an execution profile we sample the execution stack of program threads every fixed interval, say, 20ms. The respective profile is then created by analyzing the recording of the profiling events.

For example, consider a program with two threads, each performing HTTP requests: one runs the tenFastRequests method that makes ten requests, sequentially, to a URL that responds in 100ms; the other runs the oneSlowRequest method that makes a single request to URL that responds in 100ms. Both threads run after each other.

An execution profile of that program is obtained like so:

java -XX:StartFlightRecording=filename=profile.jfr,settings=profile.jfc HttpRequestsExample client

JFR obtains this profile by recording ExecutionSample and NativeMethodSample events on a subset of the threads at a regular time interval.

We can view the profile using a JFR view as follows:

> jfr view native-methods profile.jfr

                      Waiting or Executing Native Methods

Method                                                          Samples Percent
--------------------------------------------------------------- ------- -------
sun.nio.ch.SocketDispatcher.read0(FileDescriptor, long, int)        102  98.08%
jdk.jfr.internal.JVM.emitEvent(long, long, long)                      2   1.92%

If we use JMC to display the results as a flame graph, we see that oneSlowRequest and tenFastRequests take up a similar amount of execution time, signifying their similar latency:

Flamegraph for the currently implemented profiler

But clearly, we expect tenFastRequests to take up ten times more CPU time than oneSlowRequest.

Profiling CPU time rather than execution time can tell us how the CPU is utilized and is essential to improving a program's throughput.

There are other shortcomings to the current execution profile: every sampling period only up to six randomly-selected platform threads are sampled to reduce the profiling overhead at the expense of a less accurate profile, and JFR may fail to obtain a sample due to the thread being in a state that prevents a sample from being recorded, but JFR doesn't record such failed attempts, which may result in a skewed profile. These shortcomings of the execution profile may be addressed in the future.

Description

We add experimental CPU-time profiling to JFR, on Linux systems only at this time.

A CPU profile is obtained by the use of Linux's CPU-timer mechanism to sample the stack of all threads regularly, every fixed period of CPU time. In addition, it records any failure to obtain stack samples, so the overall CPU portions reported correctly represent the application's CPU activity.

The new experimental event, jdk.CPUTimeSample is not enabled by default, and the sample period can be set with the event's period property.

It comprises the following fields:

This event is similar to the execution-time sampler's jdk.ExecutionSample event, and enabling the new event does not affect the execution sample event in any way, so they can be used simultaneously.

The new event jdk.CPUTimeSampleLoss is emitted when samples are lossed due to internal implementation reasons, such as internal data structures becoming full, and contains the number of dropped samples in the last sampling round (lossedSamples). This event is enabled by default, but is only emitted when jdk.CPUTimeSample is enabled.

Usage

Enable and configure the new JFR event:

> jfr configure --input profile.jfc --output /tmp/my_profile.jfc jdk.CPUTimeSample#enabled=true jdk.CPUTimeSample#period=9ms # create new settings
> jcmd <pid> JFR.start /tmp/my_profile.jfc # use those settings to start a new profiling session for your application

Or use it directly with -XX:StartFlightRecording:

java -XX:StartFlightRecording=jdk.CPUTimeSample#enabled=true,filename=profile.jfr ...

With the new CPU-time sampler, it becomes clear that the applications spends nearly all its time in tenFastRequests:

> jfr view cpu-time-hot-methods profile.jfr

                             Java Methods that Execute the Most from CPU time sampler

Method                                                                                              Samples Percent
--------------------------------------------------------------------------------------------------- ------- -------
jdk.jfr.internal.JVM.emitEvent(long, long, long)                                                         88  77.19%
N/A                                                                                                       7   6.14%
java.lang.Object.wait0(long)                                                                              2   1.75%
java.lang.Class.forName0(String, boolean, ClassLoader, Class)                                             2   1.75%
java.lang.Thread.currentCarrierThread()                                                                   1   0.88%

Flamegraph of the CPU profiler

CPU profiling may be added to other platforms in the future.

Alternatives

Rely on external tools

This is the current situation. We are relying on external tools, like async-profiler to provide the proper CPU profiling. These external tools are based on internal, unspecified, JDK APIs.

Risk and Assumptions

The main risk is the users not understanding that the new CPU profiler and JFR event type are specific to Linux only. The assumption is that Linux is the most important platform for any production-level deployment and that we can get away with implementing Linux only.

Another risk is that users on multi-core systems will be overwhelmed by the number of emitted events. While the current JFR profiler has a hard limit for the number of sampled threads per interval, the new CPU profiler does not. This lets the number of emitted events scale linearly with the number of simultaneously running threads.