JEP draft: Enable post-mortem crash analysis with jcmd
Owner | Kevin Walls |
Type | Feature |
Scope | JDK |
Status | Submitted |
Component | core-svc / tools |
Discussion | serviceability dash dev at openjdk dot org |
Effort | M |
Duration | M |
Reviewed by | Alex Buckley |
Created | 2024/03/18 12:00 |
Updated | 2025/04/28 17:13 |
Issue | 8328351 |
Summary
Extend the jcmd tool to provide diagnostics on a Java Virtual Machine that has terminated unexpectedly. Achieve this by a novel technique of process revival, which provides a foundation for post-mortem analysis. Users will enjoy a consistent troubleshooting experience across live environments and post-mortem environments.
Goals
- Make the troubleshooting of a crashed JVM as familiar and productive as troubleshooting a live JVM.
- Enable post-mortem diagnostics on Linux and Windows.
- Reduce the cost of JDK maintenance by focusing the implementation efforts for serviceability on jcmd.
Non-Goals
- It is not a goal to support all jcmd diagnostics in post-mortem environments.
- It is not a goal to run and debug Java code in post-mortem environments.
- It is not a goal to enable post-mortem diagnostics on all operating systems supported by OpenJDK at this time.
- It is not a goal to remove legacy serviceability tools, such as the Serviceability Agent, from the JDK at this time.
Motivation
Serviceability is the ability of a system operator to monitor, observe, debug, and troubleshoot an application. Monitoring and observability tools allow the operator to connect to a live JVM and examine the application. This includes the code of the application, such as loaded classes and just-in-time compiled methods, as well as the progress of execution, such as the stacks of Java threads and native threads. JDK tools such as jstack and jmap produce thread dumps and heap dumps from a live JVM, while tools such as Java Mission Control let operators browse threads and memory visually. Depending on how a tool connects to the JVM (for example, via the JMX protocol), the operator may also be able to troubleshoot the application by, for example, activating more verbose logging by the garbage collector.
In extreme scenarios, the JVM terminates unexpectedly in a way that cannot be monitored by such tools. This can occur because of buggy native code in the application or libraries, or due to bugs in the JVM itself. At termination, the JVM emits a crash report (hs_err_pidXXX.log) that contains information about the fault and the state of the application, such as the stack trace of the failing thread and a list of loaded libraries. The operating system also saves the memory of the JVM process to a file known as a core dump. System operators use crash reports and core dumps post-mortem to gain a deeper understanding of what went wrong and to identify steps toward resolution.
Unfortunately, the tools available for post-mortem analysis of a core dump are problematic:
- Using a native debugger such as gdb is frustrating because it has no Java-level knowledge. Understanding the representation of JVM artifacts in the memory of the crashed process is up to the operator. For example, if the operator determines that a Java object starts at a particular address in memory, then finding something as basic as the class of the object means manually decoding words in the object's header, which vary between JDK releases. Debugger scripts can help to automate the decoding of JVM artifacts in the core dump, but the work remains error-prone and the scripts need ongoing maintenance.
- JDK 6 introduced the Serviceability Agent (SA). This is not a Java agent but rather a tool that can open a core dump and decode JVM artifacts automatically. SA also exposes these artifacts through an unsupported API. The SA codebase is dated and requires continual maintenance as the JVM evolves, as well as major work to expose new JVM features. This causes friction for operators because the depth of information available from SA depends on which JVM features were in use.
jcmd was introduced in JDK 7 as a lightweight tool for JVM diagnostics. It connects to a live JVM via the Attach API and can present Java-level information about the application. It offers over 50 commands for listing Java threads, detailing memory use, examining the state of the garbage collector, etc. However, jcmd is limited to attaching to live processes. Given its flexibility and popularity, it is appealing to enable the use of jcmd for post-mortem analysis on a core dump. This would give operators a symmetrical experience for live and post-mortem troubleshooting.
Description
We extend jcmd so it can produce diagnostics from the core dump of a JVM process. This will simplify the troubleshooting process for system operators, and unify the serviceability experience across live and post-mortem environments.
Post-mortem analysis with jcmd uses a "revival" technique for diagnosing a crashed process. By using data from the core dump to recreate the process's memory image at the time of the crash, and by executing code in the JVM binary, it is possible to make jcmd's diagnostic commands work as they do in a live JVM, with no changes to the commands or their implementations.
For example, if a JVM crash resulted in the core dump core.1234, then running:
$ jcmd core.1234 Thread.print
will produce the same kind of output as when jcmd is connected to a live JVM:
Opening dump file 'core.1234'...
2025-04-01 14:17:18
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25-internal-LTS-2025-03-30-1738352.name... mixed mode, sharing):
...
"Thread-0" #34 [1183517] prio=5 os_prio=0 cpu=0.99ms elapsed=0.07s tid=0x00007ff8fc208cc0 ...
java.lang.Thread.State: RUNNABLE
Thread: 0x00007ff8fc208cc0 [0x120f1d] State: _at_safepoint _at_poll_safepoint 0
JavaThread state: _thread_blocked
at ThreadsMem$1.run(ThreadsMem.java:25)
- locked <0x00000000fe300c98> (a java.lang.Object)
at java.lang.Thread.runWith(java.base@25-internal/Thread.java:1460)
at java.lang.Thread.run(java.base@25-internal/Thread.java:1447)
...
jcmd in post-mortem environments
jcmd supports 56 commands in a live JVM. Of these, 28 are available in the post-mortem environment:
Compiler.CodeHeap_Analytics Compiler.codecache Compiler.codelist
Compiler.directives_print Compiler.memory Compiler.perfmap
Compiler.queue
GC.heap_dump GC.heap_info
JVMTI.data_dump
System.dump_map System.map System.native_heap_info
Thread.print
VM.class_hierarchy VM.classes VM.classloader_stats
VM.classloaders VM.command_line VM.dynlibs
VM.events VM.flags VM.metaspace
VM.native_memory VM.stringtable VM.symboltable
VM.systemdictionary VM.version
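A wrapper script or tool that drives jcmd against core dumps might want to check up front whether a requested command is in the list above. The following is a minimal sketch of such a check. The class and method names are hypothetical and are not part of the JDK; the set simply transcribes the 28 commands listed in this draft, and the real set is determined by how the JVM was built.

```java
import java.util.Set;

// Hypothetical helper, not a JDK API: transcribes the post-mortem command list
// from this draft so a wrapper can reject unsupported commands early.
final class PostMortemCommands {
    static final Set<String> AVAILABLE = Set.of(
        "Compiler.CodeHeap_Analytics", "Compiler.codecache", "Compiler.codelist",
        "Compiler.directives_print", "Compiler.memory", "Compiler.perfmap",
        "Compiler.queue",
        "GC.heap_dump", "GC.heap_info",
        "JVMTI.data_dump",
        "System.dump_map", "System.map", "System.native_heap_info",
        "Thread.print",
        "VM.class_hierarchy", "VM.classes", "VM.classloader_stats",
        "VM.classloaders", "VM.command_line", "VM.dynlibs",
        "VM.events", "VM.flags", "VM.metaspace",
        "VM.native_memory", "VM.stringtable", "VM.symboltable",
        "VM.systemdictionary", "VM.version");

    // True if the command appears in the post-mortem list of this draft.
    static boolean isAvailable(String command) {
        return AVAILABLE.contains(command);
    }

    public static void main(String[] args) {
        for (String command : args) {
            String verdict = isAvailable(command) ? "listed" : "not listed";
            System.out.println(command + ": " + verdict + " for post-mortem use");
        }
    }
}
```

Commands that are only available in a live JVM, such as Thread.dump_to_file, are reported as not listed.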
The post-mortem environment must have the same operating system and CPU architecture as the environment where the JVM crashed.
It is often difficult to access production servers where the JVM has crashed, so it is common to transport core dumps to developer workstations for analysis. Developer workstations typically run newer JDKs than production servers, so to facilitate analysis, it is not necessary for jcmd to come from the same JDK as the JVM that crashed. jcmd in one JDK release can revive core dumps from another JDK release as long as the JVM binary from the other release is available. The other release may be older or newer than the release where jcmd is running, as long as both releases are at least JDK NN. When running jcmd, the path to the JVM binary is specified via the new -L option:
$ jcmd -L /transported_files/libjvm.so core.1234 Thread.print
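When several diagnostics are wanted from one transported core dump, the invocation above could be scripted. The sketch below assembles the same command line and can run it as a subprocess. It assumes the -L option behaves as proposed in this draft (it is not available in released JDKs); the class name, method names, and the chosen diagnostics are illustrative only.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical wrapper around the proposed "jcmd -L <jvm-binary> <core> <command>" form.
final class PostMortemJcmd {
    // Builds the argument vector in the order shown in this draft's example.
    static List<String> commandLine(Path libjvm, Path core, String diagnostic) {
        return List.of("jcmd", "-L", libjvm.toString(), core.toString(), diagnostic);
    }

    // Runs jcmd with output forwarded to this process; returns jcmd's exit status.
    static int run(Path libjvm, Path core, String diagnostic)
            throws IOException, InterruptedException {
        return new ProcessBuilder(commandLine(libjvm, core, diagnostic))
                .inheritIO()
                .start()
                .waitFor();
    }

    public static void main(String[] args) {
        Path libjvm = Path.of("/transported_files/libjvm.so");
        Path core = Path.of("core.1234");
        // Dry run: print the command lines rather than executing them.
        for (String diagnostic : List.of("VM.version", "Thread.print", "GC.heap_info")) {
            System.out.println(String.join(" ", commandLine(libjvm, core, diagnostic)));
        }
    }
}
```

Replacing the print statement with a call to run(...) would execute each diagnostic in turn on a JDK that implements this proposal.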
In JDK NN, jcmd can take either the name of a Java class or the filename of a core dump as an argument. Since the filename of a core dump might resemble a class name, the new -c option indicates that the argument is, in fact, a core dump:
$ jcmd -c MyApp GC.heap_dump
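A script that decides whether to pass -c could inspect the file rather than its name. The following sketch shows one way to do that on Linux, where core dumps are ELF files whose header type field is ET_CORE. This is an illustration only and is not a description of how jcmd itself classifies its argument; the class and method names are hypothetical, and Windows dump files would need a different check.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: recognizes a Linux ELF core file from its first 18 bytes.
final class CoreFileCheck {
    private static final int ET_CORE = 4; // e_type value for core files in the ELF format

    static boolean looksLikeElfCore(byte[] header) {
        if (header.length < 18) return false;
        // ELF magic number: 0x7F 'E' 'L' 'F'
        if (header[0] != 0x7F || header[1] != 'E' || header[2] != 'L' || header[3] != 'F') {
            return false;
        }
        // e_ident[EI_DATA] at index 5: 1 = little-endian, 2 = big-endian
        boolean littleEndian = header[5] == 1;
        int first = header[16] & 0xFF;
        int second = header[17] & 0xFF;
        // e_type is a 16-bit field at offset 16, stored in the file's byte order
        int type = littleEndian ? (second << 8) | first : (first << 8) | second;
        return type == ET_CORE;
    }

    static boolean looksLikeElfCore(Path file) throws IOException {
        if (!Files.isRegularFile(file)) return false;
        try (InputStream in = Files.newInputStream(file)) {
            return looksLikeElfCore(in.readNBytes(18));
        }
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + ": " + looksLikeElfCore(Path.of(name)));
        }
    }
}
```

A wrapper could add -c whenever this check succeeds, so that a core dump named like a class (MyApp in the example above) is still treated as a dump.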
Reviving a core dump
jcmd first invokes a native helper program, which "revives" the memory of the crashed process, then executes the command specified on the command line. The helper subprocess is needed to give the crashed (and now revived) process its own address space, avoiding conflicts with the address space of the JVM running jcmd.
The helper subprocess populates its address space from the data in the core dump. It also loads the JVM binary at the same virtual address as in the crashed process. The ability to load the JVM binary at a virtual address matching the core dump is achieved by relocating a copy of the binary to that preferred address. In turn, the relocation is achieved by copying and patching the JVM binary file.
Platform-dependent analysis of the core dump is required to identify which memory regions to revive. The revived regions include memory storing the internal data structures of the JVM, and memory storing the Java heap. The memory storing the stacks of native threads is revived so that references into them will resolve. There is no reconstruction of the native threads themselves, as they are not going to execute.
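To illustrate what this analysis can involve on Linux, the sketch below lists the PT_LOAD program headers of a 64-bit little-endian ELF core file. Each such header records where a region's bytes are stored in the file and the virtual address at which the region was mapped in the crashed process. This is not the helper's actual implementation, which this draft does not show; the class, record, and method names are hypothetical. The sketch assumes the ELF header and program header table are already in the buffer, and it does not handle the extended numbering used when a file has 65,535 or more program headers.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: enumerate loadable segments of an ELF64 little-endian core file.
final class CoreSegments {
    // One PT_LOAD entry: file location of the bytes and the address they occupied.
    record Segment(long fileOffset, long vaddr, long fileSize, long memSize) {}

    private static final int PT_LOAD = 1;

    static List<Segment> loadSegments(ByteBuffer elf) {
        ByteBuffer b = elf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        long phoff = b.getLong(32);              // e_phoff: offset of program header table
        int phentsize = b.getShort(54) & 0xFFFF; // e_phentsize: size of one entry
        int phnum = b.getShort(56) & 0xFFFF;     // e_phnum: number of entries
        List<Segment> segments = new ArrayList<>();
        for (int i = 0; i < phnum; i++) {
            int ph = Math.toIntExact(phoff + (long) i * phentsize);
            if (b.getInt(ph) != PT_LOAD) continue;   // p_type
            segments.add(new Segment(
                b.getLong(ph + 8),    // p_offset
                b.getLong(ph + 16),   // p_vaddr
                b.getLong(ph + 32),   // p_filesz
                b.getLong(ph + 40))); // p_memsz
        }
        return segments;
    }
}
```

A reviver could then map each segment's file range at its recorded virtual address, which is the kind of operation the helper's private address space makes possible.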
The JVM is not "live" as it was at run time. No Java code can be executed, and no garbage collection occurs. However, the JVM binary is loaded at the correct address so its code can be executed. Absolute pointers resolve because their targets are memory-mapped from the core dump, as do memory references relative to the running code. A JVM helper method is called to reset JVM state from the time of the crash, such as the addresses of native libraries.
This revival technique does not require loading every native library from the crashed process. This is to enable running diagnostics when the core dump is transported to a different machine, where the same libraries are not available. These transported core dumps are traditionally tricky to set up in a debugger, often requiring native libraries to match the original machine. jcmd needs only the JVM binary that crashed and the core dump.
Alternatives
- Invest in improvements to the Serviceability Agent. This is problematic for several architectural reasons.
  SA is written in Java and relies on a native library (libsaproc) to return the contents of raw memory from either a running process or a core dump. This shields SA from knowing whether the observed JVM is alive or dead, but means that SA is responsible for turning byte arrays from libsaproc into useful Java constructs such as classes that model threads, stack frames, locks, etc. Turning a low-level abstraction into a high-level abstraction is tedious and intricate, requiring SA to, e.g., know how each garbage collector organizes Java objects in the heap. It also duplicates the functionality of vast swathes of native code in the JVM which manage run-time data structures.
  Conversely, the new jcmd technique -- reviving the memory of a crashed JVM -- reuses the native code which managed the data structures in memory when the JVM was alive. This shields jcmd diagnostics from knowing whether the observed JVM is alive or dead; the diagnostics call the same native code regardless. No new code is needed to, e.g., extract a heap dump or obtain a thread's stack frames. Duplicating the memory from a formerly-live process is much more efficient than duplicating the code to understand it.
  In a nutshell, SA embraced high implementation complexity in order to support a featureful Java API that is orthogonal to the health of the JVM process, but this approach has run out of road: The rapid pace of VM development makes the complexity too high to manage and the functionality of the API has suffered. Instead of SA's high-cost / high-feature approach, jcmd with process revival takes a low-cost / core-feature approach. Since it is low cost, it can be supported long term.
- Native debug information goes some way to providing low-level JVM diagnostics, and will remain an essential part of debugging. However, the technical effort needed to extract Java objects in human-readable form makes it a poor alternative to an enhanced jcmd.
Risks and Assumptions
- A risk of allowing jcmd to be used post-mortem is that core dumps containing sensitive information may be transferred to insecure environments for analysis. However, this is no different than existing troubleshooting efforts, so there is no new security risk.
- We assume that the post-mortem environment contains the JVM binary in use at the time of the crash. This assumption is reasonable because access to exact binaries is standard in troubleshooting, for reproducibility.
- The set of jcmd commands usable with a revived process is configured when the JVM is built. We assume this is acceptable because the standard build configuration includes core jcmd commands that are widely applicable and well known. Additional commands that act on a revived process can be created for specific requirements and incorporated into the JVM via the JDK build process.
Future Work
We plan to support post-mortem troubleshooting with jcmd on macOS, in addition to Linux and Windows.
We expect to make further enhancements to jcmd to aid troubleshooting in both live and post-mortem environments. Two examples are new commands for inspecting arbitrary Java objects and extracting a Java class definition ("class dumping"). We also expect to enhance some existing commands, e.g., VM.uptime, to work in both environments.
Some existing commands are implemented in Java rather than in native code. This means they are not compatible with process revival and cannot be used in post-mortem environments. An example is Thread.dump_to_file, which outputs a list of virtual threads in JSON format. However, these commands tend to be of greatest value in live environments, so it is not critical to make them work in post-mortem environments. In the future, the developers of new commands will need to consider the possibility of post-mortem execution when choosing the implementation language.