JEP draft: Ahead-of-Time Class Loading & Linking with Any GC
Owner | Erik Österlund |
Type | Feature |
Scope | JDK |
Status | Submitted |
Component | hotspot / gc |
Discussion | hotspot dash dev at openjdk dot org |
Effort | M |
Duration | M |
Reviewed by | Alex Buckley, Ioi Lam, Stefan Karlsson, Vladimir Kozlov |
Created | 2024/02/16 09:49 |
Updated | 2025/05/06 19:14 |
Issue | 8326035 |
Summary
Enhance the ahead-of-time cache, which enables the HotSpot Java Virtual Machine to improve startup and warmup time, so that it can be used with any garbage collector, including the low-latency Z Garbage Collector (ZGC).
Goals
- Allow all garbage collectors to work smoothly with the AOT cache introduced by Project Leyden.
- Separate the AOT cache from GC implementation details and policies.
- Ensure that use of the AOT cache does not materially impact startup time, relative to previous releases.
Motivation
Most of the HotSpot JVM's garbage collectors pause application threads in order to reclaim memory. This causes the application to take significantly longer than usual to handle some requests, increasing its tail latency. For example, 99% of all requests may be handled within 10ms, but 1% of the requests may take 100ms or more. You can minimize the tail latency caused by garbage collection by using the Z Garbage Collector (ZGC). ZGC reclaims memory concurrently, never pausing application threads for more than a millisecond.
Garbage collection is, however, not the only cause of high tail latency.
Java applications are often scaled by starting new JVM instances to handle more requests, but requests sent to a new instance take significantly longer than requests sent to a warmed-up instance. To address this source of tail latency, you can enable ahead-of-time class loading and linking, introduced in JDK 24. This improves application startup by caching your application's classes in a training run so that they are available immediately in production. For example, the Spring PetClinic demo application starts 41% more quickly in production because the cache enables some 21,000 classes to appear immediately loaded and linked at startup. Forthcoming features, such as ahead-of-time method profiling and code compilation, will leverage the AOT cache to extend these gains further.
Unfortunately, the way that classes are cached is incompatible with ZGC. This forces you to choose between suffering GC-induced tail latency or suffering startup-induced tail latency. If you use ZGC to reduce the former then you cannot enable ahead-of-time class loading and linking to reduce the latter, and vice versa.
We could avoid this painful choice if AOT caches could be used with any of the HotSpot JVM's garbage collectors, including ZGC.
Description
An AOT cache contains, among other things, representations of the Class objects for classes that were loaded and linked during a training run. It also contains objects referenced by those Class objects, such as strings and byte arrays.
Today, AOT cache files are GC-specific: Their format is bitwise-compatible with the format of heap objects understood by the GC, so that the JVM can map them directly into the heap memory managed by the GC.
We propose to make AOT caches optionally GC-agnostic so that they work with all garbage collectors, regardless of which GC is used in training or in production. As an additional benefit, this will allow the JDK to include a baseline AOT cache that works in all environments.
Obstacles to a GC-agnostic AOT cache
The main challenge of caching objects in a GC-agnostic manner is how to handle object references. From the perspective of Java code, the value of a field that holds a reference to an object is opaque. From the perspective of the JVM, however, each GC has its own policies for laying out objects in memory and for representing references from one object to another:
- Heap size policies (Parallel, G1) — For heaps larger than 32 GB, object references are represented as 64-bit addresses and stored directly in reference fields. For heaps smaller than 32 GB, object references are stored in reference fields as compressed 32-bit values; since offsets into such a heap can require up to 35 bits, decompression may involve shifting as well as adding a base. There are three compression schemes, selected heuristically at run time based on heap size and other factors.
- Object size policies (G1, ZGC) — Objects are placed within heap regions according to their sizes. In G1, the high-order bits of a 64-bit address identify the region, the low-order bits encode an offset into the region, and objects never cross region boundaries. An object that G1 considers large gets its own exclusive heap region, and any reference to the object must have all-zero low-order bits. ZGC, on the other hand, distinguishes between small, medium, and large objects, using a different reference format for each.
- Metadata (ZGC) — ZGC encodes metadata bits in object references; these bits are used to manage concurrent garbage collection. No other GC supports this reference format.
The multitude of reference formats makes it challenging to take objects managed by one GC, cache them, and reify them later for a different GC.
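To make the heap-size policy above concrete, here is a minimal sketch of the arithmetic behind one compressed-reference scheme: a 32-bit value, shifted by the object alignment and added to a base, can address up to 32 GB (2^35 bytes). The class name, base address, and shift are illustrative assumptions, not HotSpot's actual constants or code.

```java
// Illustrative sketch of a base-plus-shift compressed-reference scheme.
// HEAP_BASE and SHIFT are hypothetical choices, not HotSpot internals.
public class CompressedRefSketch {
    static final int SHIFT = 3;                   // log2 of 8-byte alignment
    static final long HEAP_BASE = 0x4_0000_0000L; // assumed heap base address

    static int encode(long address) {
        return (int) ((address - HEAP_BASE) >>> SHIFT);
    }

    static long decode(int narrow) {
        // Widen without sign extension, then undo the shift.
        return HEAP_BASE + ((narrow & 0xFFFF_FFFFL) << SHIFT);
    }

    public static void main(String[] args) {
        long address = HEAP_BASE + 0x7_FFFF_FFF8L; // just under the 32 GB limit
        System.out.println(decode(encode(address)) == address); // prints: true
    }
}
```

The shift is why 35-bit offsets fit in a 32-bit field: 8-byte alignment guarantees the low three bits of every offset are zero.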
Object caching today
The representation of an object in an AOT cache mirrors its representation in memory. For example, consider a String object with these fields:
public class String {
    private final byte[] value;
    private final byte coder;
    private int hash;
    private boolean hashIsZero;
}
In the cached form of a String object, the value field contains the 64-bit memory address of a byte array:
header: ... | value: 0x4002045278 | coder: ... | hash: ... | hashIsZero: ...
The address is in a lowest-common-denominator format that is valid across the Serial, Parallel, and G1 collectors. Objects are stored in AOT caches such that none crosses the boundaries of heap regions, using a predetermined region size. This allows you to run in production with G1 even if you trained with Serial or Parallel.
ZGC does not use 64-bit addresses as object references, however, and it does not support a global size for regions. Hence ZGC cannot be used with AOT caches.
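The rule that no cached object crosses a region boundary can be sketched as a simple packing routine. The region size and method below are hypothetical, chosen purely to illustrate the invariant that makes a Serial- or Parallel-trained cache mappable under G1.

```java
// Hypothetical sketch of packing objects so that none crosses a region
// boundary of a predetermined size, as the GC-specific cache does.
import java.util.ArrayList;
import java.util.List;

public class RegionPackingSketch {
    static final long REGION_SIZE = 4L * 1024 * 1024; // assumed region size

    // Returns a start offset for each object; an object that would straddle
    // a boundary is bumped to the start of the next region instead.
    static List<Long> place(long[] objectSizes) {
        List<Long> offsets = new ArrayList<>();
        long cursor = 0;
        for (long size : objectSizes) {
            long regionEnd = (cursor / REGION_SIZE + 1) * REGION_SIZE;
            if (cursor + size > regionEnd) {
                cursor = regionEnd; // skip to the next region boundary
            }
            offsets.add(cursor);
            cursor += size;
        }
        return offsets;
    }
}
```

Packing this way wastes a little space at region boundaries, but every cached object then lands wholly inside one region regardless of which of the compatible GCs maps the file.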
GC-agnostic object caching
We can make a GC-agnostic AOT cache by storing object references in a format that is GC-agnostic, namely as logical indices. The cached form of a String object would then look like this, with the value field containing the logical index of the byte array:
header: ... | value: 5 | coder: ... | hash: ... | hashIsZero: ...
Using a GC-agnostic cache requires converting the logical indices back into memory addresses. The JVM therefore reads objects from the cache sequentially, that is, it streams them into memory. When the cache is opened, a background thread eagerly starts materializing objects, one by one. Materializing an object involves allocating memory in the heap, initializing the object's fields according to the data in the cache, and building object references to other materialized objects via lookups in a side table. When the application uses a class for the first time, it synchronizes with the background thread to ensure that the Class object for the class, and any related objects, are materialized.
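As an illustration of the side-table lookup described above, the following hypothetical sketch materializes a stream of cached records whose reference fields hold logical indices. The record layout and the ordering assumption (every referent precedes its referrers) are simplifications for clarity, not HotSpot's actual cache format.

```java
// Illustrative sketch (not HotSpot code) of streaming materialization:
// a side table maps each logical index to its materialized object.
import java.util.HashMap;
import java.util.Map;

public class MaterializeSketch {
    // A cached record: its own logical index plus the logical indices
    // that its reference fields point to (hypothetical layout).
    record Cached(int index, int[] refs) {}

    // Materialize sequentially; this sketch assumes the stream is ordered
    // so that every referent appears before the records that reference it.
    static Map<Integer, Object[]> materialize(Cached[] stream) {
        Map<Integer, Object[]> sideTable = new HashMap<>();
        for (Cached c : stream) {
            Object[] obj = new Object[c.refs().length];
            for (int i = 0; i < obj.length; i++) {
                obj[i] = sideTable.get(c.refs()[i]); // logical index -> object
            }
            sideTable.put(c.index(), obj);
        }
        return sideTable;
    }
}
```

In the real JVM this pass runs on a background thread, and application threads synchronize with it on first use of a class rather than waiting for the whole cache.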
Choosing an AOT cache format
A GC-specific AOT cache is mapped directly into memory, while a GC-agnostic cache is streamed into memory. Both create the appearance of instantly-loaded objects, but in some scenarios mapping the cache into memory performs better than streaming the cache into memory — and vice versa.
A cold start is the first start of an application on a particular machine in some time. Cold starts can happen frequently when deploying applications in a cloud. In a cold start, the AOT cache is unlikely to be in the filesystem cache, and the larger the cache, the greater the cost of loading it from disk. Streaming can, however, hide the latency of reading data from disk, at the cost of requiring an additional CPU core.
Conversely, a warm start is when an application starts close in time to a previous start, such as when running over and over on the same machine. Because the AOT cache stays in the filesystem cache between runs, it can be mapped into the JVM's heap instantly.
The least advantageous situation for streaming is a warm start in a constrained environment that does not have a spare CPU core. The JVM tries to avoid this situation in production by applying a heuristic when creating an AOT cache after a training run:
- It creates a streamable, GC-agnostic cache if, in training, either ZGC was used, the heap was larger than 32 GB, or the -XX:-CompressedOops option was used. Training with ZGC, a large heap, or this particular option implies that the training environment was large, with more than a single core available. We assume that the production environment is similarly unconstrained, meaning that streaming will be most effective.
- It creates a mappable, GC-specific cache if, in training, the -XX:+UseCompressedOops option was used. This option indicates that the training environment had a heap smaller than 32 GB and did not use ZGC. This implies that the training environment was a constrained system without a spare core. We assume that the production environment is similar, meaning that mapping will be most effective.
You can explicitly create a streaming, GC-agnostic cache by specifying -XX:+AOTStreamableObjects, even if you also specify -XX:+UseCompressedOops.
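The selection heuristic above can be restated as a predicate. This is a hypothetical paraphrase for clarity; the class and method names are invented, and the real selection logic lives inside HotSpot.

```java
// Hypothetical restatement of the cache-format heuristic: stream
// (GC-agnostic) when training suggests a large environment; map
// (GC-specific) otherwise.
public class CacheFormatHeuristic {
    static final long THRESHOLD = 32L * 1024 * 1024 * 1024; // 32 GB

    static boolean streamable(boolean usedZGC,
                              long trainingHeapBytes,
                              boolean compressedOops,
                              boolean aotStreamableObjects) {
        // -XX:+AOTStreamableObjects always forces the GC-agnostic format.
        if (aotStreamableObjects) return true;
        // ZGC, a heap over 32 GB, or -XX:-CompressedOops imply a large
        // training environment; otherwise create a mappable cache.
        return usedZGC || trainingHeapBytes > THRESHOLD || !compressedOops;
    }
}
```

Note that the three streaming triggers are exactly the conditions under which compressed oops are unavailable, which is why -XX:+UseCompressedOops alone identifies the mappable case.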
The JDK includes two baseline AOT caches, one GC-agnostic and one GC-specific, which the JVM uses when the application does not provide a cache. This ensures that the JVM can use streaming or mapping, as appropriate, to achieve the best startup performance.
Alternatives
- Enabling ZGC to support AOT caches does not require a GC-agnostic solution. We could, instead, continue the GC-specific approach by creating ZGC-specific caches containing ZGC-specific object references. This would optimize startup performance. However, the GC-agnostic solution, with objects materialized in the background, does not materially affect startup performance as long as an extra core is available, so the only situation in which a ZGC-specific cache would outperform a GC-agnostic cache is when using ZGC on a single-core machine. This is an unusual environment for the highly-concurrent ZGC, and thus does not motivate creating ZGC-specific caches. We prefer to rely on the maxim that the best way to reduce tail latency is with a systems approach, where the design of discrete components is coordinated, which leads us to the GC-agnostic approach.
- We could modify ZGC so it can interpret both its own object references and the G1-influenced object references currently found in AOT caches. The Serial and Parallel GCs were modified in this way, but ZGC is significantly more complex. This approach would effectively couple the implementations of all the GCs to each other, which is undesirable. In contrast, the GC-agnostic approach decouples the implementations, allowing GC implementations to evolve while allowing you to choose from the full range of GCs in training and again in production. Furthermore, since the bitwise layout of objects in a GC-agnostic cache is not entangled with the memory layout of objects in the heap, we expect to be able to optimize the layout of the cache to shrink its static footprint without significantly affecting GC implementations.
Testing
Many object-archiving tests already exist. We will adapt them to test with ZGC and the new streaming, GC-agnostic approach.