JEP draft: AOT Class Loading & Linking with Multiple GCs
Owner | Erik Österlund |
Type | Feature |
Scope | Implementation |
Status | Submitted |
Component | hotspot / gc |
Discussion | hotspot dash dev at openjdk dot org |
Effort | M |
Duration | M |
Reviewed by | Alex Buckley, Ioi Lam, Stefan Karlsson, Vladimir Kozlov |
Created | 2024/02/16 09:49 |
Updated | 2025/01/29 20:54 |
Issue | 8326035 |
Summary
Enhance the ahead-of-time cache that is used by the HotSpot Java Virtual Machine to improve startup time. In particular, enable the cache to be used regardless of which garbage collector the user chooses in training or production runs. Previously, the cache could not be used with the low-latency Z Garbage Collector (ZGC).
Goals
- Allow all garbage collectors to work smoothly with the AOT cache from Project Leyden.
- Separate the AOT cache from GC implementation details and policies.
- Ensure that use of the AOT cache does not materially impact startup time, relative to previous JDKs.
Motivation
Most garbage collectors, e.g., G1, pause application threads to collect garbage. This causes the application's handling of some requests to take significantly longer than usual, a phenomenon known as tail latency. Application developers can attempt to manage tail latency by tuning the GC so that, e.g., 99% of response times are below 10ms.
The Z Garbage Collector (ZGC), a production feature since JDK 15, minimizes the tail latency induced by garbage collection. ZGC collects garbage concurrently, so that application threads are never paused for more than a millisecond. ZGC requires minimal configuration and operates as a generational collector in JDK 23 and later.
However, garbage collection is not the only runtime feature that causes tail latency. Java applications are often "scaled out" by starting new JVM instances to handle more requests, but requests sent to a new instance take significantly longer than requests sent to a warmed-up instance. To address this source of tail latency, users can enable ahead-of-time class loading and linking. This improves application startup by caching the classes of an application in a training run so that they appear to be loaded instantaneously in production. For example, a training run of the Spring PetClinic application caches ~21,000 classes used to handle requests; in production, these classes are mapped into the JVM's heap at startup, in effect loading them without scanning the class path 21,000 times.
Unfortunately, the way that classes are cached is incompatible with ZGC. This forces users to choose between suffering GC-induced tail latency or suffering startup-induced tail latency. If they use ZGC to reduce the former, they cannot enable ahead-of-time class loading and linking to reduce the latter, and vice versa.
To help users avoid this painful choice, we propose to make the cache work with all GCs, including ZGC. Users will be able to improve tail latency during startup by using the cache, without having to select a GC other than ZGC which adds tail latency elsewhere.
Description
An ahead-of-time cache is a file that contains data about the state of an application, collected during a training run. An AOT cache primarily contains Class objects, representing classes that were loaded and linked during the training run. In subsequent runs of the application, the JVM consults the AOT cache so that cached classes are not loaded and linked from scratch. This improves startup of both the JVM and the application.
Traditionally, the JVM maps the AOT cache into the memory that backs the Java heap, which is managed by the GC used in production. This is efficient, but it means that objects in the cache, written in a training run, must be bitwise-compatible with the format of heap objects expected by the GC used in production. That is, the cache is GC-specific.
In JDK NN, the cache is GC-agnostic: it works regardless of which GC is used in production. This allows users to choose ZGC in either training or production runs. It also allows the JDK to ship a baseline AOT cache that works in all environments.
Obstacles to GC-agnostic caching
The main challenge with caching objects in a GC-agnostic manner relates to how objects refer to each other. In Java source code, the value of a field that holds a reference to an object is opaque, but at run time, each GC has its own rules for laying out objects in memory and representing references from one object to another (its "reference format"):
- Heap size policies (Parallel, G1): For heaps larger than 32 GB, object references are represented as 64-bit addresses ("raw pointers") and stored directly in reference fields. For heaps smaller than 32 GB, where an address fits in 35 bits, object references are stored in reference fields as 32-bit values, using compression if necessary. There are three compression schemes, selected heuristically at run time based on the heap size and other factors. (One such scheme is sketched below.)
- Object size policies (G1, ZGC): Objects are placed within heap regions, based on the size of the object. In G1, the high-order bits of a 64-bit address identify the region, the low-order bits encode an offset into the region, and an object must never cross a region boundary. An object that G1 considers large gets its own exclusive heap region, and any reference to the object must have all-zero low-order bits. ZGC, on the other hand, distinguishes between small, medium, and large objects, using different reference formats for each.
- Metadata: ZGC encodes metadata bits into object references. These bits are used to manage concurrent garbage collection. No other GC understands this reference format.
The multitude of reference formats makes it challenging to take objects managed by one GC, cache them, and reify them later under another GC.
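To make the first policy concrete, here is a minimal sketch, in plain Java rather than HotSpot's C++, of one base-and-shift compression scheme of the kind chosen heuristically at run time. The heap base address and all names here are hypothetical, not HotSpot's:

// Because objects are 8-byte aligned, the low three bits of any object
// address are zero, so a 32-bit value shifted by 3 can span 2^35 bytes
// (32 GB) of heap.
public class CompressedReferenceSketch {
    static final long HEAP_BASE = 0x4000000000L;  // hypothetical heap start
    static final int SHIFT = 3;                   // log2 of 8-byte alignment

    // Encode: subtract the heap base, drop the three always-zero low bits.
    static int compress(long rawAddress) {
        return (int) ((rawAddress - HEAP_BASE) >>> SHIFT);
    }

    // Decode: restore the dropped bits, add the heap base back.
    static long decompress(int compressedRef) {
        return HEAP_BASE + ((compressedRef & 0xFFFF_FFFFL) << SHIFT);
    }

    public static void main(String[] args) {
        long address = HEAP_BASE + 0x2045278L;    // an aligned heap address
        int ref = compress(address);
        System.out.printf("0x%x -> 0x%x -> 0x%x%n", address, ref, decompress(ref));
    }
}

The other compression schemes differ mainly in whether the base is zero and whether the shift is applied; the arithmetic shape is the same.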
GC-agnostic object caching
In JDK NN, object references are written to the AOT cache in an abstract reference format that is GC-agnostic. Raw pointers are never stored in the cache.
When an object is written to the cache at the conclusion of a training run, the cached representation mirrors the memory layout of the object in the heap, except that reference fields are encoded as logical indices. For example, consider a String object with the following fields:
public class String {
    private final byte[] value;
    private final byte coder;
    private int hash;
    private boolean hashIsZero;
}
Traditionally, the cached form of a String object would have a 64-bit address in the value field:
Header: ... | value: 0x4002045278 | coder: ... | hash: ... | hashIsZero: ...
Furthermore, the address would use a bit pattern that is valid across the Serial, Parallel, and G1 GCs, by reducing every object reference to a lowest-common-denominator format. For example, if the user trains with Parallel, then objects are stored in the AOT cache such that none cross the boundaries of heap regions, using a predetermined region size; this allows the user to run in production with G1. However, since ZGC does not use 64-bit addresses as object references, and does not support a global region size, the AOT cache cannot be used in a production run with ZGC.
In JDK NN, the cached form of a String object would look like this, where the value field stores the logical index of another object:
Header: ... | value: 5 | coder: ... | hash: ... | hashIsZero: ...
Given a logical index from the cache, the GC in production will find the address of the actual object in the heap by performing a lookup in a side table, and will fix up the values of reference fields as it goes.
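As an illustration, here is a minimal sketch, with hypothetical names and no claim to mirror the HotSpot implementation, of fixing up a reference field that was cached as a logical index:

import java.util.HashMap;
import java.util.Map;

// Reference fields in the cache hold logical indices; a side table maps
// each index to the object materialized for it in the heap.
public class SideTableSketch {
    private final Map<Integer, Object> sideTable = new HashMap<>();

    // Called as each cached object is materialized.
    void recordMaterialized(int logicalIndex, Object heapObject) {
        sideTable.put(logicalIndex, heapObject);
    }

    // Fix up one reference field: replace the cached logical index with a
    // direct reference to the materialized object.
    Object fixUp(int logicalIndex) {
        return sideTable.get(logicalIndex);
    }

    public static void main(String[] args) {
        SideTableSketch table = new SideTableSketch();
        byte[] chars = {'h', 'e', 'l', 'l', 'o'};
        table.recordMaterialized(5, chars);  // the byte[] behind the String above
        // The cached String's value field reads "5"; fixing it up yields a
        // direct heap reference, regardless of the production GC's format.
        System.out.println(table.fixUp(5) == chars);  // prints: true
    }
}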
Object Streaming
Traditionally, the AOT cache was mapped into memory, providing the appearance of instant loading of objects. In JDK NN, interpreting the GC-agnostic cache requires some indirection, so objects are streamed rather than mapped into memory.
To stream objects, a background thread is started early in the JVM bootstrapping process. When the AOT cache is opened, the thread eagerly starts materializing objects one by one. Materializing an object involves allocating memory in the heap, initializing the object's fields according to the data in the cache, and building object references to other materialized objects via lookups in a side table. Since the materialized objects are Class objects, representing Java classes, it appears to the application that its classes are all loaded instantly at startup.
Offloading materialization to a background thread requires synchronization points before objects are accessed by the application, so that any necessary materialization can be flushed out. A natural synchronization point is when the application uses a class for the first time. For example, given class Foo below, consider the first time that the application invokes Foo.getMessage():
class Foo {
    public static String getMessage() { return "hello"; }
}
- Without an AOT cache, the Class object for class Foo must be created in the heap by loading Foo via a class loader; the "hello" literal must be resolved from Foo's constant pool; and a String object denoting "hello" must be created in the heap. The resolved String object is stored in Foo's constant pool so it can be reused the next time Foo.getMessage() is invoked.
- With an AOT cache, the JVM has to do much less work. The AOT cache has the Class object for class Foo in its cached heap. This object includes a pre-resolved constant pool entry containing the pre-resolved String object denoting "hello". These objects are already materialized from the AOT cache by the time Foo.getMessage() is invoked for the first time, either due to the background thread materializing them earlier or due to a synchronization point in the application thread (see the sketch after this list). There is no need to create the Class object for Foo from scratch, nor resolve any of Foo's constant pool entries.
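To make the division of labor concrete, the following is a minimal sketch, with hypothetical names and types rather than HotSpot's internal ones, of a background materializer and the first-use synchronization point that application threads hit:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// A simplified model, not the HotSpot implementation: the background thread
// materializes cached objects eagerly, while application threads that reach
// a synchronization point (e.g., first use of a class) wait only if the
// object they need has not been materialized yet.
public class StreamingSketch {
    interface CachedObject {          // hypothetical cache entry
        int index();                  // its logical index
        Object materialize();         // allocate and initialize in the heap
    }

    private final ConcurrentHashMap<Integer, CompletableFuture<Object>> slots =
            new ConcurrentHashMap<>();

    private CompletableFuture<Object> slot(int logicalIndex) {
        return slots.computeIfAbsent(logicalIndex, i -> new CompletableFuture<>());
    }

    // Started early in JVM bootstrap, as soon as the AOT cache is opened.
    void startBackgroundMaterializer(List<CachedObject> cache) {
        Thread thread = new Thread(() -> {
            for (CachedObject cached : cache) {
                slot(cached.index()).complete(cached.materialize());
            }
        }, "aot-materializer");       // hypothetical thread name
        thread.setDaemon(true);
        thread.start();
    }

    // Synchronization point: usually returns immediately, because the
    // background thread has already materialized the object.
    Object awaitMaterialized(int logicalIndex) {
        return slot(logicalIndex).join();
    }
}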
Mapping versus Streaming
Some applications run in environments where mapping the AOT cache into memory performs better than streaming objects from it; for other applications, the reverse is true.
A cold start is the first JVM start in a while, such as when deploying JVMs in the cloud. The AOT cache is unlikely to be in the file system cache, and the larger the AOT cache, the larger the cost of loading it from disk becomes. Streaming, however, can hide the latency of materializing objects from the cache.
Conversely, a warm start is when the JVM starts close in time to a previous start, such as when running a Java program over and over on the same machine. Because the AOT cache stays in the file system cache between runs, mapping the AOT cache into the Java heap is instant, thus classes appear to load instantly. Streaming achieves the same illusion by performing materialization in a background thread, but it relies on the availability of an extra core during startup.
The least advantageous situation for streaming is a warm start in a constrained environment that does not have a spare core. The JVM tries to avoid this situation in production by applying a heuristic:
- Streaming is used if the AOT cache indicates that, in training, either ZGC was used, or the heap size was larger than 32 GB, or -XX:-UseCompressedOops was specified. The implication of training with ZGC, a large heap, or this particular option is that the training environment was large, with more than a single core available. The JVM assumes the production environment is similarly unconstrained, meaning that streaming will be effective.
- Mapping is used if the AOT cache was created with -XX:+UseCompressedOops. This option indicates that the training environment had a heap smaller than 32 GB and did not use ZGC. This implies that the training environment was a constrained system without a spare core. The JVM assumes the production environment is similar, meaning that mapping will be most effective. (This heuristic is sketched in code below.)
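In code form, the heuristic amounts to a single test of the facts recorded in the AOT cache. This sketch uses hypothetical names, and the actual HotSpot logic may weigh additional factors:

public class LoadModeHeuristic {
    enum Mode { MAP, STREAM }

    // Facts about the training run, as recorded in the AOT cache.
    record TrainingRun(boolean usedZGC, boolean heapOver32GB, boolean compressedOops) {}

    static Mode choose(TrainingRun t) {
        // ZGC, a heap over 32 GB, and -XX:-UseCompressedOops all imply an
        // unconstrained training environment, so assume a spare core exists
        // in production and stream.
        if (t.usedZGC() || t.heapOver32GB() || !t.compressedOops()) {
            return Mode.STREAM;
        }
        // -XX:+UseCompressedOops implies a heap under 32 GB and no ZGC:
        // assume a constrained environment and map the cache directly.
        return Mode.MAP;
    }
}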
The JDK ships with one AOT cache created with -XX:+UseCompressedOops and another created with -XX:-UseCompressedOops. This ensures that the JVM can use streaming or mapping, as appropriate, to achieve the best startup performance. Advanced users can explicitly enable streaming when creating an AOT cache by specifying -XX:+AOTStreamableObjects, even if they also specify -XX:+UseCompressedOops.
Alternatives
Building ZGC support for the AOT cache does not require a GC-agnostic solution. It would be possible to continue the GC-specific approach by creating and using a ZGC-specific cache containing object references in ZGC's own format, which would maximize startup performance under ZGC. However, the GC-agnostic solution (with objects materialized in the background) does not materially affect startup performance as long as an extra core is available, so the only situation in which a ZGC-specific cache would be faster than the GC-agnostic cache is when using ZGC on a single-core machine. Given the concurrent nature of ZGC, that is an unusual environment for it, and does not justify building a ZGC-specific cache. We prefer to rely on the maxim that the best way to reduce tail latency is a systems approach, in which the design of discrete components is coordinated; that maxim favors the GC-agnostic approach.
Another alternative would be to modify ZGC so it can interpret both its own object references and the G1-influenced object references currently found in AOT caches. The Serial and Parallel GCs were modified in this way, but ZGC is significantly more complex. This approach would effectively couple the implementation of all GCs to each other, which is undesirable. In contrast, the GC-agnostic approach decouples the implementations, allowing GC implementations to evolve while allowing users to choose from the full range of GCs in training and again in production. Furthermore, since the bitwise layout of objects in the AOT cache is not entangled with the memory layout of objects in the heap, we expect to be able to optimize the layout in the AOT cache to shrink its static footprint without significantly affecting GC implementations.
Testing
A large number of object-archiving tests have already been written. They will be adapted to run regularly with ZGC and the new object-streaming approach.