JEP draft: Ahead-of-Time Method Profiling
Authors | Igor Veresov, John Rose |
Owner | John Rose |
Type | Feature |
Scope | Implementation |
Status | Draft |
Component | hotspot / compiler |
Effort | M |
Duration | M |
Created | 2024/02/01 20:40 |
Updated | 2024/08/04 21:58 |
Issue | 8325147 |
Summary
Enhance the AOT cache to store method profiles from training runs, reducing profiling delays during application warmup.
Goals
-
Help applications warm up more quickly, by supplying precomputed method profiles, immediately upon VM startup.
-
Do not introduce new constraints on application execution. (Profile-based optimizations are speculative, not assertive.) Thus, even if application behavior diverges wildly from archived profile history, execution semantics must be unchanged, although in such cases application warmup may be delayed.
This improvement indirectly serves the larger goals of Project Leyden, which include better startup, warmup, and footprint for Java applications.
Supporting Detail
-
Enable users to store method profiles, as produced by training runs, in AOT caches, for all classes supported.
-
Give the JIT access to the additional archived profile information, as an additional source beyond normal method profiles generated during application execution.
-
Tune the VM’s compilation policy so that the JIT can exploit both archived and online profiles appropriately.
Non-Goals
-
It is not a goal to merge profiles from multiple training runs. Aggregating profiles is a reasonable operation, but in this setting would require new AOT cache access and creation modes, which are left for possible future work.
-
It is not a goal to profile classes loaded by user-defined class loaders; enhancing AOT coverage for such classes is left for possible future work.
-
It is not a goal to improve the existing AOT workflow, as based on -XX:AOTMode=... and similar options; improvements to usability are left for possible future work.
-
It is not a goal to blur the distinction between training and production runs, visible to users of today’s AOT workflow. “Auto-training” workflows are left for possible future work.
-
It is not a goal to store JIT output from training runs (that is, AOT compilation); that is left for possible future work.
Success Metrics
-
Measurable warmup time improvements, due to shifting of execution profiling work.
-
Profile assets in the AOT cache are observably adopted by the JIT, even in a deployed application which has new profiling activity disabled (for testing).
-
Even if the training and production runs execute different workloads, the production run eventually reaches comparable peak performance. (This will require old profiles to be replaced by new ones, and may require a longer warmup.)
Motivation
Java’s separate compilation, dynamic linking, and just-in-time (JIT) compilation have enabled users to compose their applications without ceremony. Libraries can be replaced at any point, with new class versions being automatically picked up from the class path or other configuration settings and used on new executions. Java developers don’t have to wait through long static compilation and linking cycles when developing their applications.
And yet, Java still has excellent peak performance. The Java runtime - appropriately named HotSpot for its ability to find the “hot spots” where the application spends most of its time - uses online profiling to determine which code should be compiled at runtime and seamlessly transitions from the interpreter through the initial compiled code before landing in the most optimized code.
The VM uses a mixture of profiling - recording what has already happened - and speculation - assuming what has happened will continue to happen - to pick the right code to compile. This allows the VM to respond to changes in application behaviour, and to recover from speculations that no longer hold true, without changing the semantics of the application. This provides good peak performance without compromising the meaning of the program!
Now, gathering a profile of application behaviour takes time, which causes delays between application startup and peak performance. But since application behaviour is fairly consistent from one run to the next, profile data gathered from an earlier application run could be used to optimize a similar later run, if only there were a way to feed that data forward to the application that needs it.
The HotSpot CDS technology, notably JEP 483 (Ahead-of-Time Class Loading & Linking), provides the necessary concept of a training run. A training run of an application runs a representative workload, gathering data about decisions made by the application code and VM. These decisions are aggregated and stored in CDS so that later production runs can reuse the stored decisions. This is a form of shifting computation, as per Project Leyden, from production runs to training runs at build time.
Shifting the work of profiling to a training run would be a win, because the JIT cannot produce fully optimized code without a profile, but profile gathering typically requires many milliseconds of CPU time, even minutes for large applications. If the JIT starts its work sooner, the application achieves peak performance more quickly; we say it warms up sooner. Even before warmup, the application may get through its startup phases more quickly as well, because less time profiling allows more time for JIT compiling. Thus, even during startup, the application can spend more time in optimized code and less time in the VM interpreter, if only it can get access to profiles from previous runs.
Why We Profile
Let’s take a closer look at how HotSpot profiles - what is in them, and how they are used.
Profiles record observations about application dynamics for use by the JIT. Profile data includes method invocation counts, branch frequencies, and object classes encountered, throughout all warm parts of the program. Seldom-executed (“cold”) methods are not allocated profile data. It is generally a bad idea to collect method profiles for all methods; the space costs are prohibitive, and the JIT doesn’t need data for cold methods. Also, already-optimized methods do not accumulate additional profile data, unless they deoptimize, since collecting “more of the same” would slow the optimized code.
The JIT uses profiles to produce code which is highly optimized, for the specific application dynamics which are recorded in the profiles. The benefit depends on whether the dynamics are stable: If future application execution is similar to past execution, the JIT code continues to perform well. If a new branch or a new class appears, the JIT may recompile in response, causing a temporary slowdown, until the better code is installed. In this way the JIT responds to differences, either small or large, in future execution dynamics.
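The following is a minimal sketch of the kind of application dynamics described above; it is an illustration, not part of this JEP. The class and method names (`ProfileDemo`, `Shape`, `total`) are hypothetical. In the first phase only one receiver class reaches the call site; in the second phase a new class appears. Whether and when HotSpot speculates on the first phase and recompiles in the second depends on the VM and its settings, but the computed result is the same either way.

```java
import java.util.Arrays;

public class ProfileDemo {
    interface Shape { double area(); }

    record Circle(double r) implements Shape {
        public double area() { return Math.PI * r * r; }
    }

    record Square(double s) implements Shape {
        public double area() { return s * s; }
    }

    // Hot method. The null check is a branch whose frequency can be profiled,
    // and the call to area() is a call site whose receiver classes can be profiled.
    static double total(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            if (s != null) {
                sum += s.area();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] circles = new Shape[1_000];
        Arrays.fill(circles, new Circle(1.0));
        double acc = 0;

        // Phase 1: only Circle is ever seen at the area() call site.
        for (int i = 0; i < 20_000; i++) {
            acc += total(circles);
        }

        // Phase 2: a class not seen before reaches the same call site.
        // Any earlier speculation may be invalidated; semantics do not change.
        Shape[] mixed = circles.clone();
        mixed[0] = new Square(2.0);
        for (int i = 0; i < 20_000; i++) {
            acc += total(mixed);
        }

        System.out.println(acc > 0);
    }
}
```

Running such a program with the existing `-XX:+PrintCompilation` option is one way to observe compilation events around the phase change, though the exact output varies by VM version and machine.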
The previous description applies fully to classic HotSpot execution. Even in one application run, the “hot spots” can move around over time, and the JIT reoptimizes as they do. The description also applies fully to the present JEP, where profiles from past VM behavior are collected in training runs and passed to the VM through the CDS archive. It is always the case that the VM and its JIT are open to new behavior that invalidates past optimizations. With profiles stored in CDS, it is to be expected that the old and the new profiles will resemble each other. If they do, then the early runs of the JIT will produce properly optimized code, even with respect to the newer profile data.
A difference may be observed here, between adaptive JIT-based and non-adaptive pure AOT code shapes. The JIT can make optimistic speculations and recover from their failure, while pure AOT code must make conservative assumptions about future execution, and cannot optimize as aggressively. Even the best static analysis techniques cannot predict a full set of optimizations for application behavior, since such behavior emerges from a Turing-complete evolution of computation states. That behavior is undecidable, even by the most powerful software, without actually running the program itself, on a particular input. But partially executed programs, in practice, provide useful information about likely future execution, information which can be gathered in no other way. This is why profiles, including the stored profiles of this JEP, are important to modern computing systems.
Although the VM performs some static analysis, and Leyden may allow more expensive and comprehensive static analyses in the future, the special strength of the VM is its ability to respond flexibly and dynamically to unpredictable application behaviors. These behaviors are often statically unpredictable, and may even include dynamic loading of code not present during any static analysis. But the VM optimizes it all, because it can use the evidence of actual prior execution, rather than rely solely on a weakly predictive static model. The resulting flexibility, with full optimization, benefits many Java workloads, and is one of the reasons for Java’s success. The cost for this is time spent gathering and using the dynamic profile information, which is the issue addressed by Project Leyden.
Current CDS technology has focused on improving startup time by making class loading faster. For example, JEP 483 improves on startup by speculatively shifting more class loading and linking from production to training runs. CDS archives already store a variety of asset types, such as loaded classes, linkage information, and even Java objects. Adding profile assets into this mix is a natural move.
Description
This JEP enhances CDS to capture execution profiles, so that warmup times can improve as well. Building on CDS’s data storage capabilities, additional application decisions, as observed in the form of profiling data, are captured during training runs, for use by production runs. These profiling observations can be used to speculate on behaviour in the production runs, resulting in better JIT code which reduces startup and warmup time.
This is an enhancement to existing CDS workflows and archive file formats. A user of CDS executes a training run with a special switch (of the form -Xshare:dump) telling the VM to emit a CDS archive when the training run exits. This archive contains various assets (loaded class data, and now profiles) extracted from the training run. With this JEP, the collected profiles are stored as new assets in the CDS archive. In the production run, these assets are immediately usable as profile information, which gives an accurate preview of application dynamics. This in turn can be used by the JIT to produce better code.
This functionality will be turned on by default in newer CDS workflows, those which use JEP 483. It may be individually disabled or otherwise tuned by additional command line options.
Under the usual rules of CDS, a CDS archive may be used in a production run of the application. When possible, the VM adopts profile assets from CDS and uses them to reduce startup and warmup time. These assets may be adopted and used even before the application begins to execute.
Stored profiles do not interfere with online profiling. That is, the application will also collect its own profiles, as usual. The VM’s compilation policy will be tuned to read from both profiles, and to allow the JIT to run earlier if profile information is available from a training run.
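As a toy model only, the idea of a policy that consults both profile sources might be sketched as follows. Everything here is an assumption made for illustration: the `Profile` record, the two thresholds, and the decision rule are hypothetical and do not describe HotSpot's actual tiered compilation policy. The sketch shows one property stated above: a method with an archived profile can become eligible for optimizing compilation after less online profiling than a method without one.

```java
public class PolicySketch {
    /** Hypothetical, heavily simplified view of one method's profile. */
    record Profile(long invocations) {}

    // Hypothetical thresholds, chosen only to make the example concrete.
    static final long ONLINE_ONLY_THRESHOLD = 5_000;
    static final long WITH_ARCHIVE_THRESHOLD = 500;

    /**
     * Decides whether a method is ready for the optimizing JIT.
     * The archived profile may be null when the training run produced none.
     * The online profile is always consulted, so online profiling still matters.
     */
    static boolean readyForOptimizingJit(Profile archived, Profile online) {
        boolean hasArchived = archived != null && archived.invocations() > 0;
        long threshold = hasArchived ? WITH_ARCHIVE_THRESHOLD : ONLINE_ONLY_THRESHOLD;
        return online.invocations() >= threshold;
    }

    public static void main(String[] args) {
        Profile online = new Profile(1_000);
        // Without an archived profile the method must keep profiling.
        System.out.println(readyForOptimizingJit(null, online));
        // With an archived profile the same method is eligible earlier.
        System.out.println(readyForOptimizingJit(new Profile(90_000), online));
    }
}
```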
This is a win because the gathering of profiles normally requires many milliseconds of CPU time, sometimes even minutes for large applications. This work is required before the JIT can do its own work to create code optimized for those profiles. As previously noted, shifting this work out of the application production runs can make the application warm up faster.
Method profiles will be stored in the CDS archive if they are created normally during execution of the training run. In addition, only methods which were compiled during the training run will have their method profiles (“method data objects”) stored in the CDS archive. The intention is to avoid bloating the CDS archive with useless assets.
Note that it is generally a bad idea to collect method profiles for all methods; the space costs would be prohibitive. Profile assets are placed in the CDS archive only if they are likely to be useful to the JIT in a production run.
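The selection rule described above can be sketched as a simple filter. This is an illustrative model, not VM code: `MethodInfo` and `profilesToArchive` are hypothetical names, and the real VM works on its internal method and method-data structures rather than on Java records.

```java
import java.util.List;

public class ArchiveFilterSketch {
    /** Hypothetical summary of what the training run observed for one method. */
    record MethodInfo(String name, boolean hasProfile, boolean wasCompiled) {}

    /**
     * Returns the names of methods whose profiles would be archived:
     * a profile must have been created during the training run, and the
     * method must have been compiled during the training run.
     */
    static List<String> profilesToArchive(List<MethodInfo> methods) {
        return methods.stream()
                .filter(m -> m.hasProfile() && m.wasCompiled())
                .map(MethodInfo::name)
                .toList();
    }

    public static void main(String[] args) {
        List<MethodInfo> observed = List.of(
                new MethodInfo("hotLoop", true, true),      // archived
                new MethodInfo("warmHelper", true, false),  // profiled, never compiled
                new MethodInfo("coldInit", false, false));  // never profiled
        System.out.println(profilesToArchive(observed));
    }
}
```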
Format of Stored Profiles
CDS provides an excellent storage medium for data which is easy for the VM to adopt directly into its runtime data structures. CDS data is organized for efficient sharing, from a training run to any number of production runs. In particular, the CDS data is mapped directly into the VM’s memory address space, and edited lightly to relocate pointers to be consistent with the base address of the mapped segments. This organization of data is closely similar to the shared libraries found on all platforms on which HotSpot runs today.
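The pointer relocation step mentioned above can be illustrated with a small model. This is a sketch under stated assumptions, not the CDS implementation: addresses are modeled as `long` values, `relocate` is a hypothetical helper, and zero stands for a null pointer that must stay null.

```java
import java.util.Arrays;

public class RelocationSketch {
    /**
     * Adjusts stored pointers for a segment that was written assuming
     * requestedBase but was actually mapped at mappedBase.
     * Null pointers (modeled as 0) are left unchanged.
     */
    static long[] relocate(long[] storedPointers, long requestedBase, long mappedBase) {
        long delta = mappedBase - requestedBase;
        long[] relocated = new long[storedPointers.length];
        for (int i = 0; i < storedPointers.length; i++) {
            long p = storedPointers[i];
            relocated[i] = (p == 0) ? 0 : p + delta;
        }
        return relocated;
    }

    public static void main(String[] args) {
        long[] stored = { 0x1000L, 0L, 0x1040L };
        long[] mapped = relocate(stored, 0x1000L, 0x8000L);
        System.out.println(Arrays.toString(mapped));
    }
}
```

When the segment is mapped at the address it was written for, the delta is zero and no pointer changes, which is the cheapest case.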
The stability of class pointers provided by JEP 483 will make execution profiles easier to adopt from CDS directly into the VM. If this JEP is made to apply to CDS classes in the unloaded states, additional barrier logic could be added to prevent the JIT from acting on class profile records which are not relevant to the application, because a class encountered during a training run is not yet loaded during the production run. Such barrier logic would add its own costs, but more crucially it could reduce the amount of useful data the JIT can rely on during warmup. Therefore, these two JEPs work best in concert.
Compatibility Issues
This JEP introduces no new compatibility issues. If a profile adopted from CDS turns out to mispredict the actual application behavior, then there will be some wasted effort by the JIT, slowing down the application.
This phenomenon, of a possible but unlikely slowdown, is akin to the well known problem that a data compression algorithm can sometimes increase the size of its input, when that input fails to be predictable.
Alternatives
Doing nothing preserves the current profiling delays before the JIT can get to its useful work.
Making the JIT itself optional, by generating AOT code, is often a useful tactic. However, a pure AOT solution where there is no JIT cannot adequately optimize for actual application dynamics, as explained above (“Why We Profile”).
A partial AOT solution, where reasonable AOT code is replaced by a delayed JIT, seems to be the best solution on balance. (The delayed JIT can stay out of the application’s way, and take its time to get the final code just right, based on the latest profiling information.) That requires additional work to build out the AOT infrastructure, so it is a follow-on JEP. Even before AOT, the present JEP provides important speedups to warmup.
We could put all of our resources into a partial AOT solution, backed up by a delayed JIT phase. However, prototyping indicates that stored profiles improve the performance of that design as well, since they allow the JIT, though mostly delayed, to attack performance problems earlier during warmup, when there are small changes in application behavior that require recompilation. So this JEP has its own place, not subsumed by any other JEP.
Testing
-
We will create new unit test cases that cover specific behavior of this JEP.
-
We can run existing CDS test cases with this option implicitly enabled. Such test cases should still pass.
Risks and Assumptions
There are no new risks, beyond those already inherent to the AOT technology as noted in JEP 483.
The base assumption of CDS is operative: A training run is assumed to be a good source of observable decisions, such that, when passed through a CDS archive to a production run, they will benefit the performance of that production run.
Dependencies
This JEP is an evolution of the existing CDS implementation. It depends on JEP 483. Future work in Project Leyden, especially involving the premain phase, is likely to depend on it.