JEP draft: CPU Time based profiling for JFR
Author | Jaroslav Bachorik |
Owner | Johannes Bechberger |
Type | Feature |
Scope | Implementation |
Status | Draft |
Component | hotspot / jfr |
Discussion | hotspot dash jfr dash dev at openjdk dot org |
Effort | M |
Duration | S |
Created | 2024/08/04 10:34 |
Updated | 2024/09/12 11:19 |
Issue | 8337789 |
CPU Profiler
Summary
Offer a new CPU-time sampling strategy for JFR on Linux to profile CPU usage.
Goals
- Add a CPU-time-based sampling represented by a new JFR event on Linux systems.
- Record events even when the JVM fails to walk the stack.
Non-Goals
- Do not include non-Java frames in the event stack traces.
- Do not extend the implementation of this sampler to platforms other than Linux.
Motivation
Profiling a program means determining how its various units consume different kinds of computational resources such as memory, computation, or elapsed time. A program profile shows what portion of a relevant resource is consumed by the different parts of the program, i.e. method foo
consumes 20% of the resource, while method bar
consumes only 0.1%.
Profiling is what makes the difference between an efficient program and an inefficient one because a profile shows what the developer needs to optimize to make the program more efficient. Optimizing parts of the program that don't account for a significant portion of the profile will be a wasted effort as they will have little or no impact.
JFR, the JDK Flight Recorder, is a JDK mechanism for recording various events at a high rate and with low overhead, and it serves as the core of the JDK's profiling capabilities as well as the core of the JDK's monitoring capabilities.
JFR can profile a program with respect to two resources: allocation of heap memory and elapsed ("wall-clock") time. A heap memory profile shows which parts of the code allocate objects that consume a significant portion of the heap and what type of objects these are; it helps to make the program's memory use more efficient. An execution time profiler shows where different program threads spend their time, and it helps improve the program's latency.
To capture a profile, JFR records relevant events that sample the program's execution. For example, to obtain an allocation profile we sample which object is allocated, and by what code, after every allocation of some fixed amount of memory, say, 10 MB; to obtain an execution profile we sample the execution stack of program threads every fixed interval, say, 20ms. The respective profile is then created by analyzing the recording of the profiling events.
For example, consider a program with two threads, each performing HTTP requests: one runs the tenFastRequests
method that makes ten requests, sequentially, to a URL that responds in 100ms; the other runs the oneSlowRequest
method that makes a single request to URL that responds in 100ms. Both threads run after each other.
An execution profile of that program is obtained like so:
java -XX:StartFlightRecording=filename=profile.jfr,settings=profile.jfc HttpRequestsExample client
JFR obtains this profile by recording ExecutionSample
and NativeMethodSample
events on a subset of the threads at a regular time interval.
We can view the profile using a JFR view as follows:
> jfr view native-methods profile.jfr
Waiting or Executing Native Methods
Method Samples Percent
--------------------------------------------------------------- ------- -------
sun.nio.ch.SocketDispatcher.read0(FileDescriptor, long, int) 102 98.08%
jdk.jfr.internal.JVM.emitEvent(long, long, long) 2 1.92%
If we use JMC to display the results as a flame graph, we see that oneSlowRequest
and tenFastRequests
take up a similar amount of execution time, signifying their similar latency:
But clearly, we expect tenFastRequests
to take up ten times more CPU time than oneSlowRequest
.
Profiling CPU time rather than execution time can tell us how the CPU is utilized and is essential to improving a program's throughput.
There are other shortcomings to the current execution profile: every sampling period only up to six randomly-selected platform threads are sampled to reduce the profiling overhead at the expense of a less accurate profile, and JFR may fail to obtain a sample due to the thread being in a state that prevents a sample from being recorded, but JFR doesn't record such failed attempts, which may result in a skewed profile. These shortcomings of the execution profile may be addressed in the future.
Description
We add experimental CPU-time profiling to JFR, on Linux systems only at this time.
A CPU profile is obtained by the use of Linux's CPU-timer mechanism to sample the stack of all threads regularly, every fixed period of CPU time. In addition, it records any failure to obtain stack samples, so the overall CPU portions reported correctly represent the application's CPU activity.
The new experimental event, jdk.CPUTimeSample
is not enabled by default, and the sample period can be set with the event's period
property.
It comprises the following fields:
startTime
: time directly before walking the specific stackendTime
: time directly after walking the specific stacksampledThread
: the thread sampled to obtain the stack tracestacktrace
: the obtained stack trace ornull
if the sampler failed to walk the stack
This event is similar to the execution-time sampler's jdk.ExecutionSample
event, and enabling the new event does not affect the execution sample event in any way, so they can be used simultaneously.
The new event jdk.CPUTimeSampleLoss
is emitted when samples are lossed due to internal implementation reasons, such as internal data structures becoming full, and contains the number of dropped samples in the last sampling round (lossedSamples
). This event is enabled by default, but is only emitted when jdk.CPUTimeSample
is enabled.
Usage
Enable and configure the new JFR event:
> jfr configure --input profile.jfc --output /tmp/my_profile.jfc jdk.CPUTimeSample#enabled=true jdk.CPUTimeSample#period=9ms # create new settings
> jcmd <pid> JFR.start /tmp/my_profile.jfc # use those settings to start a new profiling session for your application
Or use it directly with -XX:StartFlightRecording
:
java -XX:StartFlightRecording=jdk.CPUTimeSample#enabled=true,filename=profile.jfr ...
With the new CPU-time sampler, it becomes clear that the applications spends nearly all its time in tenFastRequests
:
> jfr view cpu-time-hot-methods profile.jfr
Java Methods that Execute the Most from CPU time sampler
Method Samples Percent
--------------------------------------------------------------------------------------------------- ------- -------
jdk.jfr.internal.JVM.emitEvent(long, long, long) 88 77.19%
N/A 7 6.14%
java.lang.Object.wait0(long) 2 1.75%
java.lang.Class.forName0(String, boolean, ClassLoader, Class) 2 1.75%
java.lang.Thread.currentCarrierThread() 1 0.88%
CPU profiling may be added to other platforms in the future.
Alternatives
Rely on external tools
This is the current situation. We are relying on external tools, like async-profiler
to provide the proper CPU profiling. These external tools are based on internal, unspecified, JDK APIs.
Risk and Assumptions
The main risk is the users not understanding that the new CPU profiler and JFR event type are specific to Linux only. The assumption is that Linux is the most important platform for any production-level deployment and that we can get away with implementing Linux only.
Another risk is that users on multi-core systems will be overwhelmed by the number of emitted events. While the current JFR profiler has a hard limit for the number of sampled threads per interval, the new CPU profiler does not. This lets the number of emitted events scale linearly with the number of simultaneously running threads.