JEP draft: Compiled Code in CDS Archives

Owner: John Rose
Type: Feature
Scope: Implementation
Status: Draft
Component: hotspot / compiler
Created: 2024/06/30 04:47
Updated: 2024/06/30 05:13
Issue: 8335368

Summary

Enhance CDS to store compiled code from training runs, reducing delays during application startup and warmup.

Goals

This improvement to CDS (a HotSpot optimization mechanism) indirectly serves the larger goals of Project Leyden, which include better startup, warmup, and footprint for Java applications.

Non-Goals

Success Metrics

Motivation

Java virtual machines (or “VMs”) have brought just in time (or “JIT”) compilation into mainstream use. Early misconceptions that Java is only a slow interpreted programming language were replaced by enthusiastic adoption when HotSpot (and other virtual machines) introduced today’s optimizing Java compilers, which are competitive with offline “static” compilers, but run concurrently with the Java application itself.

Although not interpreted, Java, as run in the Java VM, is highly dynamic. During startup, the various classes in the program link themselves together dynamically, and create their initial working sets of objects dynamically by running setup code and the application’s main routine.

The use of JIT compilation enhances Java’s dynamism. Although at first the VM uses its bytecode interpreter to organize startup activities, the VM’s JIT compiler is also dynamically generating code, upgrading many thousands of methods to use native code. At first the JIT may generate less optimized general-purpose code for a given method. However, as that method executes many times, the VM also gathers profile information about its exact behavior, and eventually recompiles it, generating optimized code for the observed behaviors.

Methods which do not contribute significantly to performance might not be compiled or fully optimized; we say such methods are not “hot” enough to receive extra attention from the VM. But after the application has run for enough time, all of the “hot” methods are fully optimized. (This is the “hot spot” referred to by the name of the HotSpot VM!) At that point, the application is said to have reached “peak” performance.

In any given run, the peak performance comes from JIT code tuned to exactly the current behavior of the application. Where a static compiler would conservatively support all possible behaviors, the dynamic JIT assumes that rarely taken program paths or rarely used data types are irrelevant to performance, and does not let them complicate the optimized code. Such use of program behavior is called “profile guided optimization”, and Java JITs are masters of this craft.

In a Java VM, profile guided optimizations are “speculative”, in the sense that the JIT guesses good choices for code, but is prepared to correct that code if its guesses are wrong. As yet another form of dynamism in Java VMs, JIT code which encounters an unoptimized path or data type can be “deoptimized”. As mistakes in speculation are corrected by recompilation, the application regains its peak performance. Deoptimization and recompilation cycles happen routinely in practice, and can repeat many times as the application makes previously unexpected detours into new parts of the application logic.
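The speculation described above can be illustrated with a small program. This is only a sketch: the class and method names are invented for illustration, and whether and when a given VM compiles, deoptimizes, or recompiles this code depends on its tuning. The negative branch is never taken during the warm-up loop, so a profiling JIT may treat it as a rarely taken path; the later call with negative inputs is the kind of “detour” that can trigger deoptimization. Either way, the results are the same.

```java
public class SpeculationDemo {
    // During warm-up x is never negative, so a profiling JIT may
    // speculate that the first branch is never taken.
    static long clampedSquare(int x) {
        if (x < 0) {
            return 0;               // rare path: not seen during warm-up
        }
        return (long) x * x;        // common path
    }

    // Sums clampedSquare(i) for from <= i < to.
    static long sum(int from, int to) {
        long total = 0;
        for (int i = from; i < to; i++) {
            total += clampedSquare(i);
        }
        return total;
    }

    public static void main(String[] args) {
        // Warm-up: only non-negative inputs, repeated many times so the
        // methods are likely to become "hot".
        long warm = 0;
        for (int round = 0; round < 20_000; round++) {
            warm = sum(0, 1_000);
        }
        // Detour: negative inputs now reach the rare path. Speculatively
        // optimized code, if any, has to be corrected at this point.
        long mixed = sum(-1_000, 1_000);
        // Negative inputs contribute zero, so both totals agree.
        System.out.println(warm == mixed);   // prints "true"
    }
}
```

Running such a program with the HotSpot diagnostic flag -XX:+PrintCompilation may show the affected methods being compiled and later invalidated, although the exact output varies by VM version and settings.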

The result of all this dynamism is that Java programs are easy to debug, configure, build, and deploy, without any compromise to application throughput.

There is a small problem in this pleasant picture, however. Every Java programmer has noticed at some point that the benefits of dynamism are paid for around the time the program is starting up. Applications running on HotSpot do not start up instantly, and they can run slower than their expected peak performance for some time, until the JIT does its work. When looking in detail at processing costs, one can see the JIT using many CPU seconds generating code before the application is fully warmed up. For very large applications, warmup may even require minutes or hours, due to JIT activity.

One might wish for a static compiler like that of C++, but this is a poor match for Java’s otherwise dynamic characteristics. Such a static compiler is likely to produce too much code of low quality, since it is difficult to accurately anticipate the actual dynamic behavior of the application. Static compilers sometimes support profile guided optimization, but profiles are difficult to gather and can be inaccurate; if there is a speculation error, it must often be corrected by a system rebuild, where HotSpot would simply add a cycle of deoptimization and recompilation.

Today, the CDS feature of HotSpot (CDS means Cached Data Storage) allows programmers to profile application behavior ahead of time using “training runs” of the application. During a training run, the VM records which classes are loaded, producing a file called a CDS archive. The archive contains assets which enable the VM to quickly reload those classes when the same application is started up. Recent improvements to CDS also place profile data in the CDS archive, as additional assets. This allows the VM JIT to begin generating code immediately. The JIT optimizations continue to be speculative: If the application behavior is similar to what was observed during the training run, the JIT’s code runs at peak performance. If the application behaves in some new way (not seen in the training run), the VM deoptimizes and recompiles the problematic code.

As an alternative to generating static code outside of the VM, it would clearly be helpful if the VM could generate its usual excellent JIT code during training runs, and then somehow make it available to a subsequent production run. In this workflow, we can call this code, generated during the training run, AOT code (ahead of time code). This is true even though, from the point of view of the training run, it is just the same as familiar JIT code (just in time code) which powers all Java VMs.

To make this work, the VM, as it collects observations for CDS about classes and profiles, will also observe JIT activity. The VM will arrange for the CDS archive to contain assets which are simply blobs of JIT code, already known to be useful during the training run. The production run of the application will quickly adopt those precompiled blobs of code, if they are relevant, and thus skip over many cycles of interpretation and compilation. If some blob of code is not fully correct, it can be deoptimized and recompiled, in exactly the same way it would be if it were compiled without the help of CDS. The result will be the best of the static and dynamic worlds: Quick to start up and warm up, yet fully and seamlessly embracing all the benefits of Java’s dynamism.

Description

Today’s CDS is able to record decisions about loading and linking classes, initializing system tables, and profiling application activity. These decisions are speculatively recorded in a training run and replayed, when relevant, in a production run.

We add the ability to speculatively record the decisions to generate optimized code for methods in the training run. The recording in the CDS archive will include the actual code generated by the JIT. These assets of AOT code, if relevant to the production run, will be quickly installed in the VM code cache, and replace slower execution by the interpreter and/or delays from obtaining JIT code. This starts happening as soon as the application starts up, even before the application main routine is invoked.

As a small but ubiquitous improvement, certain small “code stubs” present in all VM executions will also be recorded as AOT assets in the CDS archive. These include handlers for runtime events like null pointer exceptions, array copy loops, adapters for moving between interpreted and compiled modes, and virtual dispatch trampolines. As a group, they do not require much time to build, but having them present and stabilized early during VM execution makes the installation of AOT code go even faster.

This functionality is enabled by default in every workflow that uses the already existing flag -XX:+CDSLoadedClasses, as defined in JDK-8315737. No additional command line switches are needed. There may be switches to guide or inhibit the storage of these new JIT code assets.

The JIT in the premain phase

From a programmer’s point of view, the overall effect of -XX:+CDSLoadedClasses is as if class loading is initiated by the platform, not the application, in a very early period before the application main routine starts to run. We call this early period the premain phase of execution. With the new support for AOT assets, JIT activity will appear to happen during this premain phase. It will be as if the JIT knows in advance that some methods are important to optimize, and it gets working on them immediately. This JIT compilation also appears to happen very quickly, since the AOT code assets in the CDS archive are readily adopted into VM memory. Roughly speaking, much of the work of the JIT has happened long ago, in a training run which created a CDS archive.

The loading of AOT code in the premain phase will cause even the earliest phases of application startup to run faster, since it is much faster to load precompiled code than to generate it from scratch. The warmup will also be accelerated, since both profiling and JIT activity will be skipped, in favor of adopting code assets from the CDS archive.

Of course, if the application’s behavior in the production run is significantly different from the training run, some AOT code might not be adopted, or after adoption it might be deoptimized and replaced. This is nothing new: JIT code also gets generated only conditionally (on proof of importance) and is then subject to deoptimization and replacement. Indeed, AOT code and JIT code are fundamentally the same kind of code; their names simply reflect their differing routes into the VM’s code cache.

Consistency between training and production

As usual with CDS workflows, the training and production runs must have consistent VM and platform configurations, or else the VM may not use the CDS archive. This is also true, at a finer granularity, with AOT code. For example, if a method has been optimized under the assumption that some other method has no overrides (so the VM can “devirtualize” calls to it), but then the production run loads an overriding class for that method, the code is broken and must not be used. Such broken code must be deoptimized if already running (whether JIT or AOT), or at least never adopted from the CDS archive. HotSpot has been doing this work correctly for decades, with JIT code, and it is easy to adapt the same dependency checks to AOT code.
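The devirtualization scenario can be sketched in a few lines of Java. The class names below are invented for illustration. While Triangle is the only implementation of Shape in use, a JIT may compile the call to sides() in the loop as a direct or inlined call; once Square comes into play, code built on that assumption is no longer valid. The exact moment at which the VM loads Square, and whether it devirtualizes at all, is VM-dependent, so this is a sketch of the situation rather than a guaranteed trigger.

```java
public class DevirtualizationDemo {
    abstract static class Shape {
        abstract long sides();
    }

    static class Triangle extends Shape {
        @Override long sides() { return 3; }
    }

    static class Square extends Shape {
        @Override long sides() { return 4; }
    }

    // Hot loop containing a virtual call. With a single known
    // implementation of sides(), a JIT may devirtualize this call.
    static long totalSides(Shape s, int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += s.sides();
        }
        return total;
    }

    // Runs the loop first with the only implementation seen so far,
    // then with a second implementation that overrides the same method.
    static long[] run(int iterations) {
        long before = totalSides(new Triangle(), iterations);
        long after = totalSides(new Square(), iterations);
        return new long[] { before, after };
    }

    public static void main(String[] args) {
        long[] r = run(1_000_000);
        System.out.println(r[0] + " " + r[1]);   // prints "3000000 4000000"
    }
}
```

The program's results are the same whether or not any deoptimization occurs, which is the point of the dependency checks: a broken assumption costs time, not correctness.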

As another possible mismatch, suppose an AOT code asset is compiled to use a specific level of ISA, such as Intel’s AVX-512, but the production run takes place on a machine that does not support that ISA level. In that case the AOT code asset must not be adopted. Just as with the previous case of a devirtualized method, the presence of AVX-512 is a dependency attached to the AOT asset which prevents it from being adopted into the running VM. Compare this with the parallel case with static compilers: A miscompiled method would probably lead to a crash. But with Java, there is absolutely no change to program execution as a result of the mismatch in ISA level in the CDS archive. Future improvements are possible, where the training run may generate more than one AOT code asset, for a method that is vectorized, so as to cover various possibilities of ISA level support in production.
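The adoption rule for ISA-dependent assets can be modeled as a simple subset check. The following is a toy model only, not HotSpot code: the CodeAsset class, the feature strings, and the method names are all hypothetical, and the real VM records and checks its dependencies in its own internal form.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AotAdoptionSketch {
    // Hypothetical stand-in for an AOT code asset: a method name plus
    // the CPU features its compiled code assumes are present.
    static final class CodeAsset {
        final String method;
        final Set<String> requiredCpuFeatures;

        CodeAsset(String method, Set<String> requiredCpuFeatures) {
            this.method = method;
            this.requiredCpuFeatures = requiredCpuFeatures;
        }
    }

    // An asset may be adopted only if every feature it requires is
    // available on the machine running the production run.
    static boolean canAdopt(CodeAsset asset, Set<String> availableCpuFeatures) {
        return availableCpuFeatures.containsAll(asset.requiredCpuFeatures);
    }

    // Decides, per method, whether the archived code is adopted or the
    // method is left to the ordinary JIT.
    static Map<String, String> plan(List<CodeAsset> archive, Set<String> available) {
        Map<String, String> decisions = new LinkedHashMap<>();
        for (CodeAsset asset : archive) {
            decisions.put(asset.method,
                    canAdopt(asset, available) ? "adopt AOT code" : "fall back to JIT");
        }
        return decisions;
    }

    public static void main(String[] args) {
        List<CodeAsset> archive = List.of(
                new CodeAsset("Vec.sum", Set.of("avx512")),   // vectorized method
                new CodeAsset("Parser.next", Set.of()));       // no special ISA needs
        // Production machine without AVX-512: the vectorized asset is skipped.
        System.out.println(plan(archive, Set.of("avx2")));
        // Production machine with AVX-512: both assets are adopted.
        System.out.println(plan(archive, Set.of("avx2", "avx512")));
    }
}
```

In this model a skipped asset never runs at all, so a mismatch in ISA level changes only how the method obtains its native code, not what the program does.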

It is often the case that JIT code benefits from observing the initialization state of classes used by the method being compiled, a benefit not available to the interpreter. For example, when calling a static method like List::of(), the Java VM Specification mandates that, if the declaring class (in this case, the interface List) is not yet initialized, the call must be delayed until the class has been initialized; this is an essential part of Java’s dynamism and flexible configurability. The initialization check, although simple, can inhibit optimization of the method call itself. In the worst case the method call does nothing but return a constant, and the optimized code should have no control flow at all. Since List::of() called on no arguments simply returns a constant object, the cost of the initialization check completely swamps the actual work of the method, which is to return that constant.
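The initialization rule is easy to observe from plain Java. In the sketch below, the Holder class (a hypothetical stand-in for a class like List) is initialized on the first call to its static method, not before, and later calls return the same constant without running the initializer again. The event list exists only to make the ordering visible.

```java
import java.util.ArrayList;
import java.util.List;

public class InitOrderDemo {
    // Records the order of events so that lazy initialization is observable.
    static final List<String> EVENTS = new ArrayList<>();

    static class Holder {
        static final Object EMPTY;
        static {
            // Runs once, the first time Holder is actively used.
            EVENTS.add("Holder initialized");
            EMPTY = new Object();
        }

        // Does nothing but return a constant, like List.of() with no arguments.
        static Object empty() {
            return EMPTY;
        }
    }

    static List<String> run() {
        EVENTS.clear();
        EVENTS.add("before first call");
        Object a = Holder.empty();    // triggers initialization if not yet done
        Object b = Holder.empty();    // no initialization work this time
        EVENTS.add("same constant: " + (a == b));
        return new ArrayList<>(EVENTS);
    }

    public static void main(String[] args) {
        // On a first run this prints "before first call", then
        // "Holder initialized", then "same constant: true".
        run().forEach(System.out::println);
    }
}
```

The interpreter must be ready to perform this check on every such call. Compiled code would prefer to skip it entirely, which is safe only once the class is known to be initialized.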

Happily, the JIT usually runs long after all such classes have been initialized, so it is sufficient for the JIT to observe that List is, in fact, initialized, and so the call to List::of() inlines into a constant load, causing further optimizations to happen downstream. In the AOT setting, this is a little harder, simply because AOT code assets can be adopted so quickly that List is still getting its act together. It would be wrong for the AOT code to return a constant that has not been created yet! Therefore, AOT code often includes extra class initialization checks not present in JIT code. The cost of such checking can be controlled, in part, by using the VM’s code dependency mechanism, which already supports method devirtualization and many other things. The loading of an AOT method is simply deferred until all its initialization dependencies are satisfied. For some methods, it is also useful to compile in the initialization checks, and then replace such a method with a better version (which can also be AOT) when the initializations are done. This tactic, of using multiple compilation versions of the same method, is already standard practice in HotSpot, which has multiple distinct JITs that run different sets of optimizations.

In all possible cases of AOT code mismatch, there may be lost performance, but there is never incorrect behavior. In all cases, any important method that cannot be compiled AOT is simply compiled JIT. Eventually, the best code wins, and peak performance is reached.

Testing

Risks and Assumptions

We continue to hold to the basic assumption of CDS, that training runs are accurate predictors of production runs.

Dependencies

This JEP is an evolution of the existing CDS implementation. Future work in Project Leyden, especially involving the premain phase, is likely to depend on it.