JEP draft: Enable post-mortem crash analysis with jcmd

Owner: Kevin Walls
Type: Feature
Scope: JDK
Status: Submitted
Component: core-svc / tools
Discussion: serviceability dash dev at openjdk dot org
Effort: M
Duration: M
Reviewed by: Alex Buckley
Created: 2024/03/18 12:00
Updated: 2025/04/28 17:13
Issue: 8328351

Summary

Extend the jcmd tool to provide diagnostics on a Java Virtual Machine that has terminated unexpectedly. Achieve this with a novel technique, process revival, which provides a foundation for post-mortem analysis. Users will enjoy a consistent troubleshooting experience across live and post-mortem environments.

Goals

Non-Goals

Motivation

Serviceability is the ability of a system operator to monitor, observe, debug, and troubleshoot an application. Monitoring and observability tools allow the operator to connect to a live JVM and examine the application. This includes the code of the application, such as loaded classes and just-in-time compiled methods, as well as the progress of execution, such as the stacks of Java threads and native threads. JDK tools such as jstack and jmap produce thread dumps and heap dumps from a live JVM, while tools such as Java Mission Control let operators browse threads and memory visually. Depending on how a tool connects to the JVM (for example, via the JMX protocol), the operator may also be able to troubleshoot the application by, for instance, activating more verbose garbage collector logging.

In extreme scenarios, the JVM terminates unexpectedly in a way that cannot be monitored by such tools. This can occur because of buggy native code in the application or libraries, or because of bugs in the JVM itself. At termination, the JVM emits a crash report (hs_err_pidXXX.log) that contains information about the fault and the state of the application, such as the stack trace of the failing thread and a list of loaded libraries. The operating system also saves the memory of the JVM process to a file known as a core dump. System operators use crash reports and core dumps post-mortem to gain a deeper understanding of what went wrong and to identify steps toward resolution.

Unfortunately, the tools available for post-mortem analysis of a core dump are problematic:

jcmd was introduced in JDK 7 as a lightweight tool for JVM diagnostics. It connects to a live JVM via the Attach API and can present Java-level information about the application. It offers over 50 commands for listing Java threads, detailing memory use, examining the state of the garbage collector, etc. However, jcmd is limited to attaching to live processes. Given its flexibility and popularity, it is appealing to enable the use of jcmd for post-mortem analysis on a core dump. This would give operators a symmetrical experience for live and post-mortem troubleshooting.

Description

We extend jcmd so it can produce diagnostics from the core dump of a JVM process. This will simplify the troubleshooting process for system operators, and unify the serviceability experience across live and post-mortem environments.

Post-mortem analysis with jcmd uses a "revival" technique for diagnosing a crashed process. By using data from the core dump to recreate the process's memory image at the time of the crash, and by executing code in the JVM binary, it is possible to make jcmd's diagnostic commands work as they do in a live JVM, with no changes to the commands or their implementations.

For example, if a JVM crash resulted in the core dump core.1234, then running:

$ jcmd  core.1234  Thread.print

will produce the same kind of output as when jcmd is connected to a live JVM:

Opening dump file 'core.1234'...
2025-04-01 14:17:18
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25-internal-LTS-2025-03-30-1738352.name... mixed mode, sharing):
...
"Thread-0" #34 [1183517] prio=5 os_prio=0 cpu=0.99ms elapsed=0.07s tid=0x00007ff8fc208cc0 ...
   java.lang.Thread.State: RUNNABLE
Thread: 0x00007ff8fc208cc0  [0x120f1d] State: _at_safepoint _at_poll_safepoint 0
   JavaThread state: _thread_blocked
        at ThreadsMem$1.run(ThreadsMem.java:25)
        - locked <0x00000000fe300c98> (a java.lang.Object)
        at java.lang.Thread.runWith(java.base@25-internal/Thread.java:1460)
        at java.lang.Thread.run(java.base@25-internal/Thread.java:1447)
        ...

jcmd in post-mortem environments

jcmd supports 56 commands in a live JVM. 28 of them are available in the post-mortem environment:

Compiler.CodeHeap_Analytics    Compiler.codecache    Compiler.codelist
Compiler.directives_print      Compiler.memory       Compiler.perfmap
Compiler.queue

GC.heap_dump    GC.heap_info

JVMTI.data_dump

System.dump_map    System.map    System.native_heap_info

Thread.print

VM.class_hierarchy     VM.classes         VM.classloader_stats
VM.classloaders        VM.command_line    VM.dynlibs
VM.events              VM.flags           VM.metaspace
VM.native_memory       VM.stringtable     VM.symboltable
VM.systemdictionary    VM.version

The post-mortem environment must have the same operating system and CPU architecture as the environment where the JVM crashed.

It is often difficult to access production servers where the JVM has crashed, so it is common to transport core dumps to developer workstations for analysis. Developer workstations typically run newer JDKs than production servers, so jcmd does not need to come from the same JDK as the JVM that crashed. jcmd in one JDK release can revive core dumps from another JDK release, provided the JVM binary from that other release is available. The other release may be older or newer than the release where jcmd is running, as long as both releases are at least JDK NN. When running jcmd, the path to the JVM binary is specified via the new -L option:

$ jcmd  -L /transported_files/libjvm.so  core.1234  Thread.print
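
When several diagnostics are wanted from the same transported core dump, the invocation above can be scripted. The following is a minimal sketch rather than part of the proposal: the class PostMortemJcmd and its commandLine method are hypothetical names, and the program only does useful work with a jcmd from a JDK that includes this feature, found on the PATH.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the JDK: runs a few post-mortem
// diagnostic commands against one core dump.
final class PostMortemJcmd {

    // Builds: jcmd [-L <libjvm>] <core> <diagnostic command>
    // libjvm may be null when the JVM binary does not need to be specified.
    static List<String> commandLine(Path core, Path libjvm, String diagnosticCommand) {
        List<String> cmd = new ArrayList<>();
        cmd.add("jcmd");
        if (libjvm != null) {
            cmd.add("-L");
            cmd.add(libjvm.toString());
        }
        cmd.add(core.toString());
        cmd.add(diagnosticCommand);
        return cmd;
    }

    // Usage: java PostMortemJcmd.java <core> [<libjvm>]
    public static void main(String[] args) throws IOException, InterruptedException {
        Path core = Path.of(args[0]);
        Path libjvm = args.length > 1 ? Path.of(args[1]) : null;
        // All three commands appear in the post-mortem list above.
        for (String dc : List.of("VM.version", "VM.command_line", "Thread.print")) {
            Process p = new ProcessBuilder(commandLine(core, libjvm, dc))
                    .inheritIO()   // show jcmd's output directly
                    .start();
            System.out.println(dc + " exited with " + p.waitFor());
        }
    }
}
```

Each diagnostic command is run as a separate jcmd invocation, which mirrors the single-command examples in this document.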

In JDK NN, jcmd can take either the name of a Java class or the filename of a core dump as an argument. Since the filename of a core dump might resemble a class name, the new -c option indicates that the argument is, in fact, a core dump:

$ jcmd  -c MyApp  GC.heap_dump
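
The ambiguity that -c resolves can be illustrated with a small sketch. This is not jcmd's actual argument handling, which this document does not specify. The class TargetKind, the classify method, and in particular the fallback rule "an argument that names an existing regular file is a core dump" are assumptions made for illustration only.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of why an explicit -c flag is useful.
final class TargetKind {
    enum Kind { PID, CORE_DUMP, MAIN_CLASS }

    // forceCore models the -c option: the argument is a core dump, no guessing.
    static Kind classify(String arg, boolean forceCore) {
        if (forceCore) {
            return Kind.CORE_DUMP;
        }
        // A purely numeric argument is taken to be a process id.
        if (!arg.isEmpty() && arg.chars().allMatch(Character::isDigit)) {
            return Kind.PID;
        }
        // ASSUMPTION: without -c, guess "core dump" only if such a file exists.
        if (Files.isRegularFile(Path.of(arg))) {
            return Kind.CORE_DUMP;
        }
        return Kind.MAIN_CLASS;
    }

    public static void main(String[] args) {
        System.out.println(classify("1234", false));   // numeric, so a pid
        System.out.println(classify("MyApp", true));   // -c given, so a core dump
        System.out.println(classify("MyApp", false));  // depends on whether ./MyApp exists
    }
}
```

Under this assumed heuristic, the last call is the problematic case: its result depends on the contents of the current directory, which is why an explicit flag is the more robust choice.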

Reviving a core dump

jcmd first invokes a native helper program, which "revives" the memory of the crashed process, then executes the command specified on the command line. The helper subprocess is needed to give the crashed (and now revived) process its own address space, avoiding conflicts with the address space of the JVM running jcmd.

The helper subprocess populates its address space from the data in the core dump. It also loads the JVM binary at the same virtual address as in the crashed process. To load the JVM binary at the address that matches the core dump, the helper relocates a copy of the binary to that preferred address. The relocation, in turn, is done by copying and patching the JVM binary file.

Platform-dependent analysis of the core dump is required to identify which memory regions to revive. The revived regions include memory storing the internal data structures of the JVM, and memory storing the Java heap. The memory storing the stacks of native threads is revived so that references into them will resolve. There is no reconstruction of the native threads themselves, as they are not going to execute.
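
This document does not describe how the helper performs this analysis. As one illustration of what it involves on Linux, where core dumps are ELF files, the sketch below lists the PT_LOAD program headers of a core dump. Each one records a virtual address range of the crashed process and where its contents sit in the file. The class CoreSegments and its members are hypothetical names. The code assumes a 64-bit little-endian ELF file whose program headers lie within the first 16 MiB, and it does not handle the extended numbering (PN_XNUM) used by core dumps with more than 65534 segments.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: enumerate candidate memory regions in an ELF core dump.
final class CoreSegments {
    record Segment(long vaddr, long fileOffset, long fileSize, long memSize) {}

    static final int PT_LOAD = 1;   // ELF program header type for loadable segments

    static List<Segment> loadSegments(ByteBuffer elf) {
        ByteBuffer b = elf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        if (b.get(0) != 0x7f || b.get(1) != 'E' || b.get(2) != 'L' || b.get(3) != 'F') {
            throw new IllegalArgumentException("not an ELF file");
        }
        if (b.get(4) != 2 || b.get(5) != 1) {   // EI_CLASS = ELFCLASS64, EI_DATA = little-endian
            throw new IllegalArgumentException("only 64-bit little-endian ELF is handled");
        }
        long phoff = b.getLong(0x20);                           // e_phoff
        int phentsize = Short.toUnsignedInt(b.getShort(0x36));  // e_phentsize
        int phnum = Short.toUnsignedInt(b.getShort(0x38));      // e_phnum
        List<Segment> out = new ArrayList<>();
        for (int i = 0; i < phnum; i++) {
            int ph = Math.toIntExact(phoff + (long) i * phentsize);
            if (b.getInt(ph) != PT_LOAD) {                      // p_type
                continue;
            }
            out.add(new Segment(b.getLong(ph + 16),             // p_vaddr
                                b.getLong(ph + 8),              // p_offset
                                b.getLong(ph + 32),             // p_filesz
                                b.getLong(ph + 40)));           // p_memsz
        }
        return out;
    }

    // Usage: java CoreSegments.java <core file>
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of(args[0]), StandardOpenOption.READ)) {
            // ASSUMPTION: the ELF header and program headers fit in the first 16 MiB.
            long len = Math.min(ch.size(), 16L << 20);
            ByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, len);
            for (Segment s : loadSegments(buf)) {
                System.out.printf("vaddr=0x%016x offset=0x%x filesz=0x%x memsz=0x%x%n",
                        s.vaddr(), s.fileOffset(), s.fileSize(), s.memSize());
            }
        }
    }
}
```

Deciding which of these ranges hold JVM data structures, the Java heap, or native thread stacks is a further step that this sketch does not attempt.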

The JVM is not "live" as it was at run time. No Java code can be executed, and no garbage collection occurs. However, the JVM binary is loaded at the correct address, so its code can be executed. Absolute pointers resolve because the memory they refer to is mapped in from the core dump, and memory references relative to the running code resolve in the same way. A JVM helper method is called to reset JVM state left over from the time of the crash, such as the addresses of native libraries.

This revival technique does not require loading every native library from the crashed process. This makes it possible to run diagnostics when the core dump is transported to a different machine where the same libraries are not available. Transported core dumps are traditionally tricky to set up in a debugger, which often requires native libraries that match the original machine. jcmd needs only the core dump and the JVM binary that crashed.

Alternatives

Risks and Assumptions

Future Work

We plan to support post-mortem troubleshooting with jcmd on macOS, in addition to Linux and Windows.

We expect to make further enhancements to jcmd to aid troubleshooting in both live and post-mortem environments. Two examples are new commands for inspecting arbitrary Java objects and extracting a Java class definition ("class dumping"). We also expect to enhance some existing commands, e.g., VM.uptime, to work in both environments.

Some existing commands are implemented in Java rather than in native code. This means they are not compatible with process revival and cannot be used in post-mortem environments. An example is Thread.dump_to_file, which outputs a list of virtual threads in JSON format. However, these commands tend to be of greatest value in live environments, so it is not critical to make them work in post-mortem environments. In future, the developers of new commands will need to consider the possibility of post-mortem execution when choosing the implementation language.