JEP draft: Instruction Issue Cache Hardware Accommodation

Owner: Paul Hohensee
Type: Feature
Scope: JDK
Status: Draft
Component: hotspot / runtime
Effort: L
Duration: L
Created: 2021/12/22 21:32
Updated: 2023/06/14 16:03
Issue: 8279184

Summary

This JEP discusses projects that can mitigate the negative effect of instruction issue hardware limitations (IIHL) on generated code performance. Such hardware includes, but is not limited to, on-chip memory caches, ITLBs (Instruction Translation Lookaside Buffers), and BTPs (Branch Target Predictors). Mitigations discussed include co-locating hot code, periodically compacting it, and splitting nmethods into frequently and infrequently accessed parts.

This JEP focuses on compaction, but includes a few other project suggestions. Co-location can occur within both segmented and non-segmented code caches. Periodic re-colocation is necessary to adapt to application phase changes.

If there are platform-dependent aspects of the implementation, the initial platform targets would be linux-x64 and linux-aarch64.

Goals

Reduce the negative impact of IIHL on Java application performance.

Non-Goals

Success Metrics

Motivation

IIHL constrain the number and size of virtual address ranges that can be simultaneously mapped, so co-locating an application’s hot code can improve performance. Co-locating hot code using profile information rather than counter overflow events should reduce both capacity and conflict misses and the consequent miss penalties: for example, instruction fetch, ITLB refill, instruction pipeline drain, and branch target buffer entry refill stalls.

The segmented code cache, JEP-197 (https://openjdk.java.net/jeps/197), was created in part for this purpose. Its design does not, however, recognize that a method or loop back-branch counter overflow event is, by itself, unrelated to how hot the method or loop actually is. Counter overflow is a heat measure only when combined with the time it takes to overflow the counter, and even then is only a single profile data point. The result is that the code heap is filled in roughly chronological order of counter overflow, which often leaves steady-state hot code scattered around a code heap with no way to compact it. This is especially true when application startup is compute intensive: most of the startup code is cold in the steady state, but ends up interleaved with the steady-state hot code.

Description

Instruction issue hardware can be fragile in the sense that its performance is optimized when executing code is confined to one or a few restricted virtual address ranges. The current 240 MB ReservedCodeCacheSize default is large in order to accommodate the segmented code cache and the tiered compilation system. The segmented code cache splits the cache into three code heaps, one each for non-nmethods (template interpreter, etc.), compiler-generated profiled nmethods, and compiler-generated non-profiled/optimized nmethods. See JEP-197 (https://openjdk.java.net/jeps/197) for a more detailed description. Per the 80/20 principle, the total size of steady-state hot code is usually relatively small, but the nmethods containing it end up scattered around the code heap address space due to tiered compilation order, which in turn runs up against IIHL. Experiments using a smaller (64 MB) non-segmented code cache with tiered compilation turned off have shown that the impact of IIHL can be partially mitigated by grouping hot nmethods together, i.e., in a more hardware-friendly way.
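
As a point of reference, the three-way split and its bump-pointer allocation order can be pictured as below. This is a minimal sketch with hypothetical names, not HotSpot’s actual CodeCache interface:

    #include <cstddef>

    // Hypothetical blob kinds mirroring JEP 197's three code heaps.
    enum class BlobKind { NonNMethod, ProfiledNMethod, NonProfiledNMethod };

    // Stand-in for a contiguous code heap. Allocation is a simple bump
    // pointer, so heap order is roughly the chronological order of
    // compilation, which is what scatters steady-state hot code among
    // cold code.
    struct CodeHeap {
        char*  base;
        size_t used;
        size_t capacity;
        void* allocate(size_t size) {
            if (used + size > capacity) return nullptr;  // caller must flush/sweep
            void* p = base + used;
            used += size;
            return p;
        }
    };

    struct SegmentedCodeCache {
        CodeHeap heaps[3];  // one heap per BlobKind, as in JEP 197
        void* allocate(BlobKind kind, size_t size) {
            return heaps[static_cast<int>(kind)].allocate(size);
        }
    };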

Another approach is to split nmethods into frequently accessed (“code”) and infrequently accessed (“metadata”) parts and allocate the parts of each kind together. Code includes, but is not limited to, the generated code and the constant pool. Metadata includes, but is not limited to, relocation data, dependency data, and oopmaps. Metadata can be allocated completely outside the code cache, or code could be allocated starting at the bottom of the code heap working up, and metadata at the top of the code heap working down. A JBS issue, JDK-7072317: move metadata from CodeCache (https://bugs.openjdk.java.net/browse/JDK-7072317), contemplates removing metadata from the code cache entirely.
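
A minimal sketch of the second option, assuming a single contiguous code heap and hypothetical types, with code bump-allocated up from the bottom and metadata down from the top:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical two-ended bump allocator: frequently accessed code grows
    // up from the bottom of the heap and infrequently accessed metadata
    // grows down from the top, keeping hot code densely packed in a single
    // address range.
    struct TwoEndedCodeHeap {
        uint8_t* base;
        size_t   capacity;
        size_t   code_top;     // next free offset for code, from the bottom
        size_t   meta_bottom;  // next free offset for metadata, from the top

        TwoEndedCodeHeap(uint8_t* b, size_t cap)
            : base(b), capacity(cap), code_top(0), meta_bottom(cap) {}

        void* allocate_code(size_t size) {
            if (size > meta_bottom - code_top) return nullptr;  // zones would collide
            void* p = base + code_top;
            code_top += size;
            return p;
        }
        void* allocate_metadata(size_t size) {
            if (size > meta_bottom - code_top) return nullptr;  // zones would collide
            meta_bottom -= size;
            return base + meta_bottom;
        }
    };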

A more complex and thorough solution would be to adapt to application behavior changes by monitoring nmethod execution and periodically compacting hot code together. It would not be strictly necessary to compact metadata, but doing so may be desirable to mitigate code heap fragmentation. Assuming a target code heap location has been made available, moving an nmethod’s code can be done concurrently by creating a relocated copy of the nmethod, fixing up external references to its entry points, and then following the existing code invalidation and nmethod recovery protocols on the now-unreferenced original.
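
In outline, the concurrent move could look like the following sketch. All names are hypothetical, and the safepoint/handshake synchronization a real implementation would need is elided:

    #include <cstring>

    struct NMethod {
        void*  code;
        size_t code_size;
        bool   entrant;
    };

    // 1. Create a relocated copy of the nmethod at the target address and
    //    re-run relocation fixups for PC-relative references (fixups elided).
    NMethod copy_to(const NMethod& nm, void* target) {
        std::memcpy(target, nm.code, nm.code_size);
        return NMethod{target, nm.code_size, true};
    }

    NMethod relocate(NMethod& nm, void* target) {
        NMethod copy = copy_to(nm, target);
        // 2. Fix up external references (call sites, stubs, tables) to point
        //    at the copy's entry points (elided).
        // 3. Invalidate the original via the existing invalidation protocol.
        nm.entrant = false;
        // 4. The existing nmethod recovery protocol reclaims the old, now
        //    unreferenced copy once no activations remain.
        return copy;
    }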

The bulk of the work is expected to be detecting hot code, deciding when to compact it, and efficiently rearranging/compacting hot code within a code heap. Compaction must be completely or almost completely concurrent with application execution. Profiling hot code via tracing entails execution time overhead, so a sampling mechanism will likely be preferred. JFR (JDK Flight Recorder) includes a sampler that could be used for this purpose, but its overhead must be carefully controlled.
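
A minimal sketch of sample-based hotness tracking, assuming a periodic sampler thread that maps sampled PCs to nmethods (names are illustrative, not JFR’s or HotSpot’s actual API):

    #include <atomic>
    #include <cstdint>
    #include <unordered_map>

    // Hypothetical hotness tracking fed by a periodic sampler (for example,
    // one modeled on JFR's method sampler). Synchronization between the
    // sampler and the decay/query threads is elided.
    struct Hotness { std::atomic<uint64_t> samples{0}; };

    class HotCodeProfile {
        std::unordered_map<const void*, Hotness> table_;  // keyed by nmethod
    public:
        // Sampler thread: attribute one sample to the nmethod whose code
        // range contains the sampled PC.
        void record_sample(const void* nm) {
            table_[nm].samples.fetch_add(1, std::memory_order_relaxed);
        }
        // Periodic decay: halve all counters so stale heat fades and the
        // profile adapts to application phase changes.
        void decay() {
            for (auto& e : table_) {
                auto v = e.second.samples.load(std::memory_order_relaxed);
                e.second.samples.store(v / 2, std::memory_order_relaxed);
            }
        }
        bool is_hot(const void* nm, uint64_t threshold) const {
            auto it = table_.find(nm);
            return it != table_.end() &&
                   it->second.samples.load(std::memory_order_relaxed) >= threshold;
        }
    };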

Rearranging the code heap to allow compaction can be done by simply invalidating code that is in the way and allowing it to be recompiled later. A more complex approach would move it out of the way, but might require more code heap space, since both copies of the code to be moved would occupy code heap space at the same time.

Readers with garbage collector implementation experience will observe that there are similarities and analogies between code cache management and mostly/fully concurrent collectors. The implementation of this JEP should be informed by the mechanisms and implementations of such collectors.

Possible projects:

  1. Split nmethods into frequently and infrequently accessed parts and allocate them separately as described above. See JDK-7072317 (https://bugs.openjdk.java.net/browse/JDK-7072317).
  2. Move JFR’s sample-based method profiling facility into the core JVM. Both code cache management and JFR would use the resulting common implementation. Replace the existing cold-nmethod tracking with the common implementation. The Watcher or Low Memory thread might be used as the sampling thread, though a separate sampling thread would avoid overloading them.
  3. Invent and implement a policy to determine when compaction should be done. If there are hardware counters that record instruction issue cache misses, they might be used by the policy; if so, an asynchronous mechanism to access them would be needed, which should have its own JEP. For the purposes of this JEP, a software-only solution could be used.
  4. In order to move an nmethod, the old copy must be removed, which requires a fast way to discover inactive nmethods. Invent and implement a less time-consuming way to do so: the current dependency on full marking cycles means discovery is usually slow and infrequent.
  5. First stage compaction: reserve a fixed amount of code heap space for hot nmethods and move them into that address range. A from/to survivor space approach (see the sketch after this list) might be used to avoid having to handle the case of fragmentation due to hot nmethods that already occupy the target address range.
  6. Second stage compaction: determine target addresses for nmethods to be compacted (perhaps the bottom of a code heap, but selection of an address range that already contains many of them might be better), deoptimize non-hot nmethods occupying the target address range, and move hot nmethods into it as space becomes available. Handle the special case of hot nmethods that already exist in the target address range.
  7. Third stage compaction: Rather than deoptimize nmethods occupying the target address range, move them elsewhere in the code cache.
  8. Compilers could recognize strongly connected regions of the program call graph and generate multiple nmethods whose code is statically compacted together. An example of a system that statically compacts hot code using profile feedback is described in Facebook’s “Automated Hot Text and Huge Pages: An Easy-to-adopt Solution Towards High Performing Services”.
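
Returning to project 5, a from/to survivor-space scheme sidesteps fragmentation in the hot region because each cycle evacuates into an empty space. A minimal sketch, with hypothetical types, and assuming a relocate() helper that follows the copy/patch/invalidate protocol described earlier:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical from/to survivor spaces reserved for hot nmethods. Each
    // cycle evacuates the current hot set into the empty "to" space and then
    // flips the two spaces, so the target range never needs defragmenting in
    // place. Capacity checks and synchronization are elided.
    struct Space { uint8_t* base; size_t top; };

    struct HotRegion {
        Space spaces[2];
        int   from = 0;  // index of the currently occupied space

        // relocate(nm, target) is assumed to return the nmethod's size in
        // bytes once the copy/patch/invalidate protocol has completed.
        template <typename NM, typename RelocFn>
        void compact(const std::vector<NM*>& hot_set, RelocFn relocate) {
            Space& to = spaces[1 - from];
            to.top = 0;
            for (NM* nm : hot_set) {
                void* target = to.base + to.top;
                to.top += relocate(nm, target);  // bump-allocate in "to" space
            }
            from = 1 - from;  // flip: the old "from" space is now free
        }
    };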

Performance degradation might occur due to the CPU time used by the co-location mechanism, so it should be possible to revert to the existing code cache management implementation. Multiple code cache management policies tailored to different performance goals may be desirable, just as multiple garbage collection policies are. A formal code cache management policy mechanism would be the subject of another JEP.

The code cache is a central JVM component, so other components will be affected by these changes.

Alternatives

The existing segmented code cache was created in order to achieve many of the same goals as this JEP. For example, allocating server-compiler-generated code in its own code heap has the effect of reducing the number of address ranges covered by presumably hot code. In the segmented code cache, presumably cool/cold client-compiler-generated code is not interleaved with server-compiler-generated code. Since the facility to co-locate hot code described in this JEP is dynamic, it would enable merging the client and server compiler code heaps. Doing so could be the subject of another JEP.

ITLB limitations can in theory be mitigated by using huge pages, but Linux (the primary target OS) by default coalesces huge pages from small ones synchronously when huge pages are mapped (https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html#thp-sysfs) rather than in the background. Enabling transparent huge pages by default with the defrag policy set to “defer” is likely to eliminate the pauses incurred by the default “madvise” policy.

Testing

With the addition of stress modes that repeatedly relocate and compact hot nmethods, existing tests, including performance tests (standard benchmarks, etc.), should suffice. There should be no application performance degradation.

Risks and Assumptions

Code cache implementation reliability and performance are critical to HotSpot reliability and performance. An incremental implementation approach and additional stress testing modes are necessary to have confidence in the implementation.

Splitting nmethods may increase harmful code cache fragmentation, since there would be two fragments per nmethod instead of one. If the two allocation zones grow from opposite ends of a code heap toward the middle, there will be two corresponding allocation/fragmentation zones instead of one until the zones meet. A mitigating factor is that each of the two fragments is smaller than the original nmethod, so there might be a greater chance of finding suitably sized free chunks.

There is concurrent overhead associated with managing the code cache more aggressively, but the steadily increasing number of hardware threads in modern CPUs is a mitigation. Where overhead is a problem, HotSpot can revert to the existing policy.

Dependencies

If JFR’s method profile sampling mechanism is used as described above, this JEP depends on it.