JEP draft: AOT Class Loading & Linking with Multiple GCs

Owner: Erik Österlund
Type: Feature
Scope: Implementation
Status: Submitted
Component: hotspot / gc
Discussion: hotspot-dev at openjdk.org
Effort: M
Duration: M
Reviewed by: Alex Buckley, Ioi Lam, Stefan Karlsson, Vladimir Kozlov
Created: 2024/02/16 09:49
Updated: 2025/01/29 20:54
Issue: 8326035

Summary

Enhance the ahead-of-time cache that is used by the HotSpot Java Virtual Machine to improve startup time. In particular, enable the cache to be used regardless of which garbage collector the user chooses in training or production runs. Previously, the cache could not be used with the low-latency Z Garbage Collector (ZGC).

Goals

Allow the ahead-of-time cache to be used regardless of which garbage collector is selected in training and production runs, including ZGC, while preserving the startup improvements provided by ahead-of-time class loading and linking.
Motivation

Most garbage collectors, e.g., G1, pause application threads to collect garbage. This causes the application's handling of some requests to take significantly longer than usual, a phenomenon known as tail latency. Application developers can attempt to manage tail latency by tuning the GC so that, e.g., 99% of response times are below 10ms.

The Z Garbage Collector (ZGC), introduced as an experimental feature in JDK 11 and made production-ready in JDK 15, minimizes tail latency induced by garbage collection. ZGC collects garbage concurrently so that application threads are never paused for more than a millisecond. ZGC requires minimal configuration and operates as a generational collector in JDK 23 and later.

However, garbage collection is not the only runtime feature that causes tail latency. Java applications are often "scaled out" by starting new JVM instances to handle more requests, but requests sent to a new instance take significantly longer than requests sent to a warmed-up instance. To address this source of tail latency, users can enable ahead-of-time class loading and linking. This improves application startup by caching the classes of an application in a training run so that they appear to be loaded instantaneously in production. For example, a training run of the Spring PetClinic application caches ~21,000 classes used to handle requests; in production, these classes are mapped into the JVM's heap at startup, in effect loading them without scanning the class path 21,000 times.

Unfortunately, the way that classes are cached is incompatible with ZGC. This forces users to choose between suffering GC-induced tail latency or suffering startup-induced tail latency. If they use ZGC to reduce the former, they cannot enable ahead-of-time class loading and linking to reduce the latter, and vice versa.

To help users avoid this painful choice, we propose to make the cache work with all GCs, including ZGC. Users will be able to improve tail latency during startup by using the cache, without having to select a GC other than ZGC which adds tail latency elsewhere.

Description

An ahead-of-time cache is a file that contains data about the state of an application, collected during a training run. An AOT cache primarily contains Class objects, representing classes that were loaded and linked during the training run. In subsequent runs of the application, the JVM consults the AOT cache so that cached classes are not loaded and linked from scratch. This improves startup of both the JVM and the application.

Traditionally, the JVM maps the AOT cache into the memory that stores the heap and is managed by the GC used in production. This is efficient, but it means that objects in the cache, written during a training run, must be bitwise-compatible with the format of heap objects expected by the GC used in production. That is, the cache is GC-specific.

In JDK NN, the cache is GC-agnostic: it works regardless of which GC is used in production. This allows users to choose ZGC in either training or production runs. It also allows the JDK to ship a baseline AOT cache that works in all environments.

Obstacles to GC-agnostic caching

The main challenge with caching objects in a GC-agnostic manner relates to how objects refer to each other. In Java source code, the value of a field that holds a reference to an object is opaque, but at run time each GC has its own rules for laying out objects in memory and representing references from one object to another, known as its reference format. For example, the Serial, Parallel, and G1 collectors can represent a reference either as a raw 64-bit address or as a compressed 32-bit value, while ZGC embeds metadata bits inside its 64-bit references.

The multitude of reference formats makes it challenging to take objects managed by one GC, cache them, and reify them later under another GC.

GC-agnostic object caching

In JDK NN, object references are written to the AOT cache in an abstract reference format that is GC-agnostic. Raw pointers are never stored in the cache.

When an object is written to the cache at the conclusion of a training run, the cached representation mirrors the memory layout of the object in the heap except that reference fields are encoded as logical indices. For example, consider a String object with the following fields:

public class String {
    private final byte[] value;
    private final byte coder;
    private int hash;
    private boolean hashIsZero;
}

Traditionally, the cached form of a String object would have a 64-bit address in the value field:

Header: ...  |  value: 0x4002045278  |  coder: ...  |  hash: ...  |  hashIsZero: ...

Furthermore, the address would use a bit pattern that is valid across the Serial, Parallel, and G1 GCs, by reducing every object reference to a lowest-common-denominator format. For example, if the user trains with Parallel, then objects are stored in the AOT cache such that none cross the boundaries of heap regions of a predetermined size; this allows the user to run in production with G1. However, since ZGC does not use plain 64-bit addresses as object references and does not support a global region size, the AOT cache cannot be used in a production run with ZGC.

In JDK NN, the cached form of a String object would look like this, where the value field stores the logical index of another object:

Header: ...  |  value: 5  |  coder: ...  |  hash: ...  |  hashIsZero: ...

Given a logical index from the cache, the GC in production will find the address of the actual object in the heap by performing a lookup in a side table, and will fix up the values of reference fields as it goes.
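The two-pass scheme described above can be sketched in plain Java. This is a simplified model, not HotSpot code: the CachedObject and HeapObject types, and the materialize method, are invented for illustration. The side table maps each logical index to the heap object allocated for it, so reference fields can be fixed up after all objects exist:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of materializing cached objects whose reference
// fields hold logical indices instead of raw addresses.
public class LogicalIndexDemo {

    // Cached form: a payload plus the logical index of the object its
    // single reference field points to (-1 means null).
    record CachedObject(String name, int referenceIndex) {}

    // Materialized (heap) form: the reference field is a real pointer.
    static final class HeapObject {
        final String name;
        HeapObject reference;
        HeapObject(String name) { this.name = name; }
    }

    // Materialize every cached object, then fix up references through
    // the side table that maps logical index -> heap object.
    static List<HeapObject> materialize(List<CachedObject> cache) {
        List<HeapObject> sideTable = new ArrayList<>();
        for (CachedObject c : cache) {               // pass 1: allocate
            sideTable.add(new HeapObject(c.name()));
        }
        for (int i = 0; i < cache.size(); i++) {     // pass 2: fix up
            int ref = cache.get(i).referenceIndex();
            if (ref >= 0) {
                sideTable.get(i).reference = sideTable.get(ref);
            }
        }
        return sideTable;
    }

    public static void main(String[] args) {
        // Object 0 references object 1; object 1 references nothing.
        List<CachedObject> cache = List.of(
            new CachedObject("a", 1),
            new CachedObject("b", -1));
        List<HeapObject> heap = materialize(cache);
        System.out.println(heap.get(0).reference.name);  // prints "b"
    }
}
```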

Object Streaming

Traditionally, the AOT cache was mapped into memory, providing the appearance of instant loading of objects. In JDK NN, interpreting the GC-agnostic cache requires some indirection, so objects are streamed rather than mapped into memory.

To stream objects, a background thread is started early in the JVM bootstrapping process. When the AOT cache is opened, the thread eagerly starts materializing objects one by one. Materializing an object involves allocating memory in the heap, initializing the object's fields according to the data in the cache, and building object references to other materialized objects via lookups in a side table. Since the materialized objects are Class objects, representing Java classes, it appears to the application that its classes are all loaded instantly at startup.
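The interplay between the background thread and an application thread that needs an object before the stream has caught up can be sketched as follows. Again this is an illustrative model, not HotSpot code; the ensureMaterialized method and the latch-per-object scheme are invented for the sketch:

```java
import java.util.concurrent.CountDownLatch;

// Illustrative sketch of object streaming: a background thread
// materializes objects one by one, while any thread can block until a
// particular object is ready before using it.
public class StreamingDemo {
    static final int COUNT = 5;
    static final String[] heap = new String[COUNT];           // "materialized" objects
    static final CountDownLatch[] ready = new CountDownLatch[COUNT];

    static void startMaterializer() {
        for (int i = 0; i < COUNT; i++) ready[i] = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            for (int i = 0; i < COUNT; i++) {
                heap[i] = "object-" + i;                      // allocate + initialize
                ready[i].countDown();                         // publish safely
            }
        });
        t.setDaemon(true);
        t.start();
    }

    // Synchronization point: wait until object i has been materialized.
    static String ensureMaterialized(int i) {
        try {
            ready[i].await();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return heap[i];
    }

    public static void main(String[] args) {
        startMaterializer();
        System.out.println(ensureMaterialized(4));            // prints "object-4"
    }
}
```

The CountDownLatch establishes a happens-before edge between the write of heap[i] and any reader, mirroring the requirement that materialization be flushed out before the application accesses an object.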

Offloading materialization to a background thread requires synchronization points before objects are accessed by the application, so that any necessary materialization can be flushed out. A natural synchronization point is when the application uses a class for the first time. For example, given class Foo below, consider the first time that the application invokes Foo.getMessage():

class Foo {
    public static String getMessage() { return "hello"; }
}
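The reason first use is a natural synchronization point is that the JVM already defers per-class work until that moment: by the Java Language Specification, a class's static initializer runs only when the class is first used. The runnable example below demonstrates this existing behavior (the fooInitialized flag is added purely for observation):

```java
// Demonstrates that class initialization is deferred until first use,
// making first use a natural point at which to also flush any pending
// materialization of the class's cached objects.
public class FirstUseDemo {
    static boolean fooInitialized = false;

    static class Foo {
        static { fooInitialized = true; }                 // runs at first use of Foo
        static String getMessage() { return "hello"; }
    }

    public static void main(String[] args) {
        System.out.println(fooInitialized);               // prints "false"
        System.out.println(Foo.getMessage());             // initializes Foo, prints "hello"
        System.out.println(fooInitialized);               // prints "true"
    }
}
```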

Mapping versus Streaming

Some applications run in an environment where mapping the AOT cache into memory performs better than streaming objects from the cache, and vice versa.

A cold start is the first JVM start in a while, such as when deploying JVMs in the cloud. The AOT cache is unlikely to be in the file system cache, and the larger the AOT cache, the larger the cost of loading it from disk becomes. Streaming, however, can hide the latency of materializing objects from the cache.

Conversely, a warm start is when the JVM starts close in time to a previous start, such as when running a Java program over and over on the same machine. Because the AOT cache stays in the file system cache between runs, mapping the AOT cache into the Java heap is instant, thus classes appear to load instantly. Streaming achieves the same illusion by performing materialization in a background thread, but it relies on the availability of an extra core during startup.

The least advantageous situation for streaming is a warm start in a constrained environment that does not have a spare core. The JVM tries to avoid this situation in production by applying a heuristic based on how the AOT cache was created.

The JDK ships with one AOT cache created with -XX:+UseCompressedOops and another created with -XX:-UseCompressedOops. This ensures that the JVM can use streaming or mapping, as appropriate, to achieve the best startup performance. Advanced users can explicitly enable streaming when creating an AOT cache by specifying -XX:+AOTStreamableObjects, even if they also specify -XX:+UseCompressedOops.
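For illustration, a training and production flow might look like the following, using the AOT-cache workflow flags introduced by the earlier ahead-of-time class loading and linking work; the application name and file names here are placeholders:

```shell
# Training run: record an AOT configuration for the application
java -XX:AOTMode=record -XX:AOTConfiguration=app.aotconf \
     -cp app.jar com.example.App

# Create the AOT cache; -XX:+AOTStreamableObjects requests the
# GC-agnostic, streamable object format described above
java -XX:AOTMode=create -XX:AOTConfiguration=app.aotconf \
     -XX:AOTCache=app.aot -XX:+AOTStreamableObjects -cp app.jar

# Production run: any collector, including ZGC, can use the cache
java -XX:+UseZGC -XX:AOTCache=app.aot -cp app.jar com.example.App
```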

Alternatives

Building ZGC support for the AOT cache does not require a GC-agnostic solution. It would be possible to continue the GC-specific approach by creating and using a ZGC-specific cache containing ZGC-specific object references, which would maximize startup performance under ZGC. However, the GC-agnostic solution, with objects materialized in the background, does not materially affect startup performance as long as an extra core is available, so the only situation in which a ZGC-specific cache would start faster than the GC-agnostic cache is when using ZGC on a single-core machine. Given the concurrent nature of ZGC, that is an unusual environment for it, and it does not justify building a ZGC-specific cache. We prefer to rely on the maxim that the best way to reduce tail latency is a systems approach, in which the design of discrete components is coordinated; here, that translates to the GC-agnostic approach.

Another alternative would be to modify ZGC so it can interpret both its own object references and the G1-influenced object references currently found in AOT caches. The Serial and Parallel GCs were modified in this way, but ZGC is significantly more complex. This approach would effectively couple the implementation of all GCs to each other, which is undesirable. In contrast, the GC-agnostic approach decouples the implementations, allowing GC implementations to evolve while allowing users to choose from the full range of GCs in training and again in production. Furthermore, since the bitwise layout of objects in the AOT cache is not entangled with the memory layout of objects in the heap, we expect to be able to optimize the layout in the AOT cache to shrink its static footprint without significantly affecting GC implementations.

Testing

A large number of object-archiving tests have already been written. They will be adapted to test regularly with ZGC and the new object-streaming approach.