JEP draft: Process Reanimation for Serviceability

Owner: Kevin Walls
Component: core-svc / tools
Discussion: serviceability dash dev at openjdk dot org
Created: 2024/03/18 12:00
Updated: 2024/05/03 15:49

Summary

Provide a mechanism to run existing HotSpot serviceability tools (jcmd) on a crashed JVM. This is achieved by reviving the crashed JVM inside a new process and calling JVM code directly.

Goals

Enable existing native HotSpot JVM diagnostics to be used after a crash: enable jcmd on a crash dump (e.g. core file or minidump).

Avoid the need for diagnostic code to duplicate JVM internals, and the maintenance cost that duplication incurs.

Non-Goals

Do not attempt to replace all of the Serviceability Agent (SA) in one integration. The technique can be a platform for further tools over time.

Running Java code in the revived JVM is not a goal. Some jcmds are implemented in Java, and these will not be usable with this technique; additionally, some jcmds are not relevant after a crash. The few jcmds which are useful after a crash but are implemented in Java will need to be the subject of further work.

Motivation

The support and maintenance of the JVM requires tools to investigate problems, on live processes and after a crash.

The jcmd tool provides live JVM inspection, with a variety of commands whose names indicate a namespace of JVM subsystems (e.g., Compiler.codecache, GC.heap_info, Thread.print, etc.).

Investigating JVM crashes (post-mortem analysis) requires different tools. Native debuggers expose the raw details, but have no insight into the Java context, such as the Java heap and Java code. This JVM-specific data can be decoded, but scripting this in a native debugger is laborious.

The Serviceability Agent (SA) provides Java-level insight into the JVM. The SA attaches to a live process or opens a crash dump. It decodes JVM information by having built-in knowledge: Java classes model the JVM features, and the JVM explicitly chooses to expose certain structures.

Tools such as debugger scripts or the SA require continual maintenance as the JVM evolves, and major work to support new features.

This difficult maintenance work causes friction and lag. JVM changes require corresponding updates in the SA, which may land alongside the feature or only later. Some JVM changes, such as a new GC, require very significant updates to the SA; in the case of ZGC these are not yet implemented, and may never be fully implemented.

The result can be that the information available using the SA depends on which VM options are set. Additionally, the separation of tools between live and post-mortem use is a complication for users.

It is therefore appealing to enable execution of the same JVM code during post-mortem analysis (using a core file or minidump), as is used live by jcmd. This will remove the duplication of effort caused by having separate runtime and debug-time code, and move towards symmetrical experiences for live and post-mortem JVM analysis.

This can be achieved by recreating the process memory image using data from the core file and code from the JVM binary, producing a process memory image in which the diagnostic commands that jcmd invokes can be run.

Note that some SA features are missing from the current jcmd feature set, and these omissions can be covered in separate enhancements.

Description

Given a crash dump (core or minidump) from a JVM process, enable jcmd diagnostics on that crash dump.

Adding core file/minidump support must not be ambiguous with the existing main-class argument, so the jcmd usage description becomes:

jcmd [pid | main-class | -c crash-dump-name] command... | PerfCounter.print | -f filename

e.g. jcmd -c core.1234 Thread.print

Given a named core file or minidump, the jcmd launcher will invoke a native helper to perform the operation, as the reanimated process needs its own address space (to avoid conflicts with the launcher's JVM).

The new helper process runs native code to populate its memory space from data in the core dump, and code from the native JVM binary. The JVM code must be loaded at the same virtual address as in the crashed process, so that the executing JVM code has the correct memory image: absolute pointers are satisfied by memory mapped in from the crash dump, as are references relative to the running code. Restoring memory mappings includes data local to the JVM and global data, such as the Java heap. JVM symbols resolve as they normally would as the JVM library is loaded using normal methods (e.g. dlopen on Linux, LoadLibrary on Windows).

The memory representing native thread stacks is restored, so memory references into them will resolve. There does not need to be any reconstruction of the threads as the native OS libraries knew them, as these threads are not going to execute.

This revival technique must not require loading every native library from the previous process. This is to enable running diagnostics when the core file is transported to a different machine, where the same libraries are not available. However, the JVM has some references into OS libraries which are specific to the previous process. These must be reset, as they will be at a different address in the new process: a small helper routine built into the JVM can reset this state. This helper can also set JVM flags if necessary to assist diagnostic code. The helper method is exposed as a public, global symbol so it can be resolved from outside the JVM, in the same process.

After invoking the helper method, the tool can then make a call into the JVM. Overall control is not going to pass to the JVM, only to diagnostic code and back to the tool. For executing diagnostics, DCmd::parse_and_execute is the JVM entry point required. That can also be exposed as a global symbol, or the helper can return information on where to find it and any other symbols or information that could be required.

There is no new security impact. Using the new feature requires access to the crash dump from a JVM process, which can contain private information, but this is no different to existing debugging efforts. The additional helper method built into the JVM is of no value to an attacker. It will never be intentionally called by the JVM during normal operation, and would offer no security risk if an attacker forced it to be called.

Alternatives

Simply investing more effort in maintenance is always a possibility. The SA will work with all JVM features if enough time is spent on it. The SA and other alternative proposals over the years have always had duplication of effort somewhere.

Native debug information goes some way to providing low-level HotSpot diagnostics, and will remain an essential part of debugging. But extracting an Object from the Java heap in human-readable form still requires manual effort or scripting. As soon as such a script is written, its maintenance is required.

Risks and Assumptions

At post-mortem debug time, the exact JVM library in use at the time of the crash is required. This is worth noting, but it is the same requirement as for existing post-mortem debugging.

The basic feature set of diagnostic tools usable in the reanimated process is set at build time. While this could be a limitation, the core set of tools is well established. Additional tools that act on the revived data can still be created for specific requirements.


Dependencies

The ability to load the native JVM library at a virtual address matching the core file. This is currently achieved by relocating a copy of the library so that it prefers that address. The relocation can be performed by existing tools, or implemented directly. Cooperation from the operating system's loader is required to honour the requested address.

Additional jcmd tooling will be required to offer features comparable to the Serviceability Agent, e.g., JDK-8318026.