JEP draft: Ahead-of-Time GC Agnostic Object Archiving

Owner: Erik Österlund
Type: Feature
Scope: Implementation
Status: Submitted
Component: hotspot / gc
Discussion: hotspot dash dev at openjdk dot org
Effort: M
Duration: M
Reviewed by: Ioi Lam, Stefan Karlsson, Vladimir Kozlov
Created: 2024/02/16 09:49
Updated: 2024/10/17 19:33
Issue: 8326035

Summary

An Ahead-of-Time (AOT) object archiving mechanism, agnostic to which Garbage Collector (GC) is selected at deployment time.

Goals

The AOT cache delivered by JEP 483: Ahead-of-Time Class Loading & Linking embeds state computed ahead of time, in order to start the JVM faster. This cache contains an object archive as well as other program state. Currently, the Z Garbage Collector (ZGC) does not support the object archiving mechanism of the AOT cache, so ZGC is not fully supported. This JEP aims to address that. The primary goals of this JEP are:

Secondary goals:

Non-Goals

It is not a goal at this time to:

While removing the existing GC-dependent object archiving mechanism of the AOT cache would allow disentangling implementation details of other GCs from object archiving, we will not consider that at this time, as there is not yet enough data to make such a decision.

Success Metrics

It should not take significantly longer for the JVM to start with the new GC-agnostic archived object loader than with the alternative GC-specific archived object loaders for Serial GC, Parallel GC, and G1.

Motivation

Traditional garbage collectors (GCs) are famous for causing "tail latency" problems in Java workloads. By pausing application threads to collect garbage, they make some requests take significantly longer than usual. Applications may have a service level agreement (SLA) requiring tail latencies to be bounded at particular percentiles. For example, an SLA could require that P99 response times (the 99th percentile) stay below 10 ms, meaning that 99% of responses must complete within 10 ms. ZGC is a low-latency GC that has been available since JDK 15 (JEP 377). It greatly reduces GC-induced tail latency by performing GC work concurrently.

However, GC is not the only JVM mechanism that causes tail latency. Java workloads are often "scaled out" by starting new instances to handle more incoming requests. Requests sent to the new instance take significantly longer than requests sent to a warmed-up JVM. This also causes tail latency. JEP 483: Ahead-of-Time Class Loading & Linking improves startup/warmup induced tail latency by capturing much of the corresponding work in an AOT cache.

The AOT cache contains data about the state of an application from a training run. Some of this data consists of Class objects for all of the classes loaded by the program. For example, a training run of the Spring Petclinic 3.2.0 application creates a 130 MB AOT cache containing Class objects for ~21,000 loaded and linked classes. These objects are stored in the AOT cache and are loaded into the Java heap when the application runs again, in order to make class loading appear instantaneous. Unfortunately, the object archiving mechanism used by the AOT cache is incompatible with ZGC. This forces latency-conscious users to choose whether their application should suffer from GC-induced tail latency or from startup/warmup-induced tail latency.

To reduce tail latency, it is important to use a systems approach, where all components are designed to work together. This JEP introduces a GC-agnostic object archiving mechanism for the AOT cache, allowing it to be used with ZGC as well as any other GC. Users who wish to reduce startup/warmup-induced tail latency by using the AOT cache are then no longer forced to select a GC other than ZGC, which would add tail latency elsewhere.

Description

An AOT cache can help a program start 42% faster. However, the startup/warmup optimizations that rely on object archiving risk being undermined by the cost of loading archived objects. Efficient archived object loading is therefore important for the AOT cache.

Offline Layout Challenges

The current object archiving system of the AOT cache directly maps memory from an archive file into the Java heap, which is efficient. However, for this approach to work, the layout in the file has to match, bit by bit, exactly what the GC (and the rest of the JVM) expects to see at runtime. There are three layers of layout policy that might cause bits not to match:

  1. Heap layout. The heap layout is the high-level strategy for where in the heap a GC places objects of a particular size and class.
  2. Field layout. The field layout determines where the contents of fields are stored within an object. It is not GC-dependent.
  3. Object reference layout. This is the bit-encoding strategy for reference fields. It varies with the optimization goals of different GCs.

These three layers of object layout policies can vary significantly between GC implementations and heap sizes. For each level of layout policy, there are various factors that can affect the bit pattern of how objects are represented in memory. For example:

These low-level bit differences make it challenging to load the archived objects in a GC-agnostic fashion. That is why ZGC is not supported by the AOT cache today.
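To make the reference-layout concern concrete, the sketch below contrasts two simplified pointer encodings: a compressed-oops style offset encoding and a ZGC-style "colored" encoding that carries metadata bits in the pointer. The class and method names are hypothetical, and the encodings are deliberate simplifications of what HotSpot actually does; the point is only that the same logical reference yields incompatible bit patterns under different GCs.

```java
// Illustrative sketch (not HotSpot code): the same object reference has
// different bit patterns under different GC pointer layouts.
public class PointerLayouts {
    // Compressed-oops style: a reference is stored as a 32-bit offset from
    // the heap base, shifted right by log2 of the object alignment.
    static long encodeCompressed(long address, long heapBase, int shift) {
        return (address - heapBase) >>> shift;
    }

    static long decodeCompressed(long narrow, long heapBase, int shift) {
        return heapBase + (narrow << shift);
    }

    // ZGC-style colored pointer (simplified): low bits of a 64-bit word
    // carry GC metadata ("color" bits), the address occupies the upper bits.
    static long encodeColored(long address, int colorBits, long color) {
        return (address << colorBits) | color;
    }

    static long decodeColored(long colored, int colorBits) {
        return colored >>> colorBits;
    }

    public static void main(String[] args) {
        long heapBase = 0x1000_0000L;
        long address = heapBase + 0x40L;  // some object in the heap
        long narrow = encodeCompressed(address, heapBase, 3);
        long colored = encodeColored(address, 4, 0b0001);
        // Same logical reference, two incompatible bit patterns, yet both
        // decode back to the same address:
        System.out.println(narrow != colored);                              // true
        System.out.println(decodeCompressed(narrow, heapBase, 3) == address); // true
        System.out.println(decodeColored(colored, 4) == address);             // true
    }
}
```

A memory-mapped archive must commit to one such encoding at dump time; any other GC (or heap configuration) at runtime then sees the wrong bits.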

Object Streaming

This JEP introduces a new object archiving mechanism that abstracts away the two GC-dependent layout concerns: heap layout and object reference layout. The fundamental insight is that object references need not be stored in the archive as physical addresses; they can be stored as logical values instead. When the archive is written, one object's reference to another is recorded not as an actual pointer but as a logical identifier. When the archive is read, the logical identifiers are turned back into actual pointers appropriate for the GC in effect.

The new mechanism archives high-level object descriptors that are used at runtime to materialize objects. It allocates objects, initializes their payloads, and links objects together one by one, based on these descriptors. Loading objects in this way is referred to as "object streaming" in this document. Streaming allows any GC selected at runtime to materialize the archived objects, because the object layout policy is applied online.

Archived object descriptors are laid out contiguously in the archived heap of the AOT cache. Each object descriptor has an “object index” based on the order in which the object descriptor has been laid out in the archived heap. This object index is used to describe the identity of archived objects. For example, object references between objects are encoded as object indices. There is also a table that maps object indices to corresponding materialized heap objects. There is another table that maps object indices to object descriptors in the archived heap. The tables are embodied as arrays making table lookups fast.
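The tables described above can be sketched as two plain arrays keyed by object index. The sketch below is a hypothetical simplification, not HotSpot code: it materializes every descriptor in one pass and then resolves the logical object indices into real references in a second pass, whereas the actual mechanism streams objects during a traversal.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of index-keyed archive tables (names are hypothetical).
public class ObjectIndexTables {
    // An archived object descriptor: its reference fields hold object
    // indices rather than addresses.
    record Descriptor(String className, int[] refIndices) {}

    // A stand-in for a materialized heap object.
    static final class HeapObject {
        final String className;
        final List<HeapObject> refs = new ArrayList<>();
        HeapObject(String className) { this.className = className; }
    }

    // Materialize every descriptor, then resolve logical object indices
    // into real references via the object-index -> heap-object table.
    static HeapObject[] materializeAll(Descriptor[] descriptors) {
        HeapObject[] materialized = new HeapObject[descriptors.length];
        for (int i = 0; i < descriptors.length; i++) {
            materialized[i] = new HeapObject(descriptors[i].className());
        }
        for (int i = 0; i < descriptors.length; i++) {
            for (int ref : descriptors[i].refIndices()) {
                materialized[i].refs.add(materialized[ref]);
            }
        }
        return materialized;
    }

    public static void main(String[] args) {
        // Object 0 references objects 1 and 2; object 1 references object 2.
        Descriptor[] archive = {
            new Descriptor("Root", new int[] {1, 2}),
            new Descriptor("Child", new int[] {2}),
            new Descriptor("Leaf", new int[] {})
        };
        HeapObject[] heap = materializeAll(archive);
        System.out.println(heap[0].refs.get(0) == heap[1]); // true
        System.out.println(heap[1].refs.get(0) == heap[2]); // true
    }
}
```

Because both tables are arrays indexed by object index, each lookup is a single indexed load, regardless of which GC allocated the materialized objects.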

Offloading

The AOT cache has roots into the archived heap, allowing it to refer to Java objects. Loading a root object from the archived heap requires streaming all objects transitively reachable from that root. This is a graph traversal, which takes some time to complete. The latency of this traversal is hidden by offloading most of the object streaming work to a background thread that starts streaming archived objects early during JVM bootstrap. Root loading is performed lazily while the background thread concurrently streams objects from the archive.

The bulk of the object streaming work is offloaded to the background thread, which traverses archived object descriptors from the roots, streaming all transitively reachable object descriptors. To accelerate this traversal, object descriptors are laid out in the order expected by the traversal algorithm (depth-first search). A linear iteration over the archived object descriptors therefore yields the same visiting order as the graph traversal, which makes the traversal faster. The ordering also makes it possible to define, for each root being traversed, a linear range of object descriptors currently being materialized by the background thread. Archived objects can then be partitioned into three distinct partitions:

  1. Objects already processed by the background thread
  2. Objects currently being processed by the background thread
  3. Objects not yet processed by the background thread

This partitioning of archived objects allows the background thread to perform the bulk of its work without interfering with lazily triggered object streaming from the application. When an application thread requests an archived object in the already-processed range, the object can simply be looked up; it is guaranteed to have been transitively streamed already. When requesting an object in the currently-processed range, the application thread waits for streaming to finish and then looks up the object. Only when requesting an object in the not-yet-processed range does an application thread have to perform an explicit graph traversal to ensure all transitively reachable objects are materialized. That traversal can run without expensive synchronization with the background thread, because the boundaries of the range being materialized are clearly defined.
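Because descriptors are laid out in traversal order, the background thread's progress can be summarized by two linear bounds, and an application thread's decision reduces to two comparisons. The sketch below illustrates this; the class, enum, and bound names are hypothetical, and real synchronization with the background thread is elided.

```java
// Illustrative sketch (hypothetical names) of the three-partition decision
// an application thread makes when it lazily requests an archived object.
public class PartitionedLookup {
    enum Action { LOOK_UP, WAIT_THEN_LOOK_UP, TRAVERSE }

    // Descriptors are laid out in depth-first order, so background progress
    // is two bounds: [0, doneUpTo) is already processed,
    // [doneUpTo, inProgressUpTo) is being processed,
    // [inProgressUpTo, ...) is not yet processed.
    static Action classify(int objectIndex, int doneUpTo, int inProgressUpTo) {
        if (objectIndex < doneUpTo) {
            return Action.LOOK_UP;            // guaranteed fully streamed
        }
        if (objectIndex < inProgressUpTo) {
            return Action.WAIT_THEN_LOOK_UP;  // background thread will finish it
        }
        return Action.TRAVERSE;               // application thread streams it
    }

    public static void main(String[] args) {
        int done = 100, inProgress = 250;
        System.out.println(classify(42, done, inProgress));   // LOOK_UP
        System.out.println(classify(180, done, inProgress));  // WAIT_THEN_LOOK_UP
        System.out.println(classify(900, done, inProgress));  // TRAVERSE
    }
}
```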

Mapping versus Streaming

Some applications run in an environment where mapping the archive into memory performs better than streaming objects from it; in other environments, the reverse is true.

Alternatives

Building ZGC support for the AOT cache does not require a GC-agnostic solution. One possible solution would be to double down on GC-specific logic and build a ZGC-specific object archiving mechanism that lays out objects with the heap layout and pointer layout that ZGC expects. This has some notable disadvantages:

The main advantage of a ZGC specific solution would presumably be faster warm JVM starts with ZGC. However, from current experiments, it appears that the streaming object loader is efficient without needing to resort to that.

As for GC-agnostic object archiving, different approaches that bulk-materialize all objects eagerly have been considered. Without offloading and laziness, however, they impacted startup times negatively.

Testing

A large number of object archiving tests have already been written. They will be adapted to test regularly with ZGC and the new object streaming approach.

Risks and Assumptions