JEP draft: Late G1 Barrier Expansion

AuthorRoberto Castañeda Lozano & Erik Österlund
OwnerRoberto Castaneda Lozano
TypeFeature
ScopeImplementation
StatusDraft
Componenthotspot / compiler
EffortM
DurationM
Reviewed byErik Österlund
Created2023/12/18 14:09
Updated2024/03/01 18:19
Issue8322295

Summary

Garbage collectors instrument application memory accesses with additional instructions (called barriers) that register information about the accessed objects. This JEP proposes simplifying the handling of G1 (HotSpot's default collector) barriers in C2 (HotSpot's optimizing just-in-time compiler), by hiding their internal structure from C2's main code transformations and optimizations.

Goals

Motivation

The increasing popularity of cloud-based Java deployments has led to a stronger focus on reducing overall JVM overhead. JIT compilation is an effective technique to speed up Java applications, but incurs a significant overhead in terms of processing time and memory usage. This overhead is particularly noticeable for optimizing JIT compilers such as JDK's C2. Reducing the overhead is key to making Java a better fit for the cloud.

Another core JVM component that contributes to the JVM's total overhead, in addition to JIT compilers, is the garbage collector. As a semi-concurrent, generational garbage collector (GC), G1 needs to interface with the JIT compilers to instrument application memory accesses with barrier code. In the case of C2, maintaining and evolving this interface requires deep knowledge of C2 internals, which few GC engineers possess. Furthermore, some barrier optimizations require applying low-level transformations and techniques which cannot be expressed in C2's intermediate representation. These obstacles have slowed down or directly blocked the evolution and optimization of key aspects of G1. Decoupling G1 barrier instrumentation from C2 internals would enable GC engineers to further optimize and reduce the overhead of G1, by means of algorithmic improvements and low-level micro-optimizations.

C2 represents the methods under compilation as a "sea of nodes", a type of program dependence graph that gives great freedom to the compiler to choose where to schedule program instructions in the final code. While this simplifies and increases the scope of many optimizations, it makes it hard to preserve invariants about the relative ordering of instructions. In the case of G1 barriers, which are currently treated as any other operations in the sea of nodes representation, this has resulted in complex bugs such as JDK-8242115 and JDK-8295066, and even as of today it is hard to guarantee the absence of other issues of similar nature.

Early experiments and manual inspection of the C2-generated code have revealed that the assembly instruction sequences generated by C2 to implement barriers are very similar to the handwritten assembly implementation used by the interpreter to instrument memory accesses. This suggests that the scope for C2 to optimize barrier code is limited, and code of similar quality could be obtained if the barrier implementation details were hidden from C2 and only expanded at the very end of the compilation chain.

Description

Early G1 Barrier Expansion

In the current model, C2 captures all detailed barrier operations together with the original operations of the application in its intermediate representation (IR). C2 expands the detailed barrier operations of each object memory access during bytecode parsing, the initial phase where C2 transforms bytecode into IR operations. Specific logic for G1 and C2 guides this expansion via the Heap Access API. Once the barriers are expanded, C2 transforms and optimizes all operations uniformly through the compilation chain, as depicted in the following diagram where IR<C,P> denotes an intermediate representation that is specific to collector C and platform P:

Exposing all GC barrier details early in the IR has two potential advantages. First, the same GC barrier implementation (expressed as IR operations) can be reused for all platforms. Second, C2 can optimize and transform the barrier operations in the scope of the entire compilation unit, which can potentially improve the code quality. However, there is limited practical benefit because

  1. platform-specific G1 barrier implementations have to be provided nevertheless for other execution modes such as bytecode interpretation, and
  2. G1 barrier code operations are not very amenable to optimization, due to control-flow density, memory-induced serialization, and other factors.

The early expansion model has three tangible and significant disadvantages, already discussed in the Motivation section: opacity to GC engineers, difficulty to guarantee absence of barrier ordering issues, and substantial C2 compilation overhead.

Late G1 Barrier Expansion

This JEP proposes expanding G1 barriers as late as possible in the compilation chain of C2, as outlined in the following diagram (using the same notation as above):

The key idea is to represent memory accesses in C2's IR only by their core operations, and postpone barrier instrumentation to code emission, where IR operations are translated into actual machine code. Barrier implementations are provided in (pseudo) assembly code, which is a familiar level of abstraction for all HotSpot engineers. Crucially, these assembly implementations are already available for all platforms to support bytecode interpretation, and can simply be reused.

Late barrier expansion can be implemented as follows. IR memory accesses generated during bytecode parsing are tagged with the information required to generate their barrier code at code emission. This information is not exposed to C2's analysis and optimization mechanisms. Instruction selection replaces abstract memory access operations with platform- and GC-specific instructions where barriers are still implicit. GC-specific instructions are required to accommodate late barrier expansion, for example making sure that the register allocator reserves enough temporary registers to perform the barrier operations. Finally, during code emission, each GC-specific memory access instruction is transformed into actual assembly code according to its tagged barrier information. The assembly code consists of the platform-specific memory instruction itself surrounded with barrier assembly code. The barrier code is generated using the existing interpreter implementation, augmented with routines ("assembly stubs") implementing potential calls from the barrier into the JVM runtime.

ZGC, an alternative fully-concurrent collector available in the JDK, already applies the late barrier expansion model. Many of the mechanisms developed for ZGC, such as the logic to perform runtime calls in the barrier code, can be abstracted and reused for G1.

Optimizations

Preliminary experiments show that a naive application of the late barrier expansion model (using the interpreter barrier implementation without applying any optimization) already achieves code quality very close to that of the C2-optimized code. Fully closing the performance gap, however, requires adopting some of the key optimizations that are currently applied. This JEP proposes re-evaluating these optimizations in the context of late barrier expansion, and re-implementing those which have a demonstrable performance effect at the application level. The optimizations focus on barriers for write operations (of the form x.f = y, where y is an object), as our preliminary experiments show that these constitute around 99% of all executed G1 barriers. Write barriers consist of a pre-barrier part, to support concurrent marking, and a post-barrier part, to support the partition of heap regions into generations.

The following optimizations are candidates for adoption by the late barrier expansion model:

Alternatives

The C2 compilation chain contains a few natural locations at which GC barriers may be expanded:

Each of the above locations offers a different trade-off point in terms of C2 overhead, required C2 knowledge, risk of suffering instruction scheduling issues, and required platform-specific effort. Generally, the later barriers are expanded the lower the C2 overhead is. The largest savings are observed when barrier expansion is moved from bytecode parsing to macro expansion, and then from code motion to after register allocation. All expansion locations but code emission require significant C2 knowledge, but only bytecode parsing and macro expansion require familiarity with C2's platform-independent IR. Expanding barriers at these two locations does not require platform-specific support, but on the other hand risks triggering instruction scheduling issues during C2's code motion phase. Expanding barriers at code emission (as proposed here) is the alternative with lowest overhead and the only one that does not require C2-specific knowledge. As for all other platform-dependent phases, it shares the advantage of precluding instruction scheduling issues and the disadvantage of requiring implementation effort for each platform.

The following table summarizes the advantages and disadvantages of each phase:

| phase | C2 overhead | requires C2 knowledge | control over scheduling | platform-independent | --- | --- | --- | --- | --- | | at bytecode parsing (current) | high | yes | no | yes | | at macro expansion | medium | yes | no | yes | | after code motion | medium | yes | yes | no | | after register allocation | low | yes | yes | no | | at code emission (this JEP) | low | no | yes | no |

An alternative dimension that could be explored is the granularity of the barrier implementation exposed to C2. However, experiments conducted for ZGC concluded that even exposing barriers as single operations in C2's intermediate representation risks causing instruction scheduling issues.

Another alternative to achieve lower C2 overhead is to simply choose, if allowed by the application requirements, a GC using more lightweight barriers (Serial GC or Parallel GC) and/or a late barrier expansion model (ZGC).

Testing

To mitigate the risk of introducing functional failures, we plan to combine:

To mitigate the risk of performance regressions, we plan to evaluate the JEP implementation using a set of industry-standard Java benchmarks on different platforms.

To measure and compare compilation speed and code size, we plan to use the functionality provided by the -XX:+CITime HotSpot option. To control the variation in size and scope of the compiled methods across JVM runs, we plan to measure multiple iterations of each benchmark run, and use JVM options such as -Xbatch that make each JVM run more deterministic.

Risks and Assumptions

The viability of the late C2 barrier expansion design has been already proven by ZGC, which has been based on this model since JDK 13. However, as with every change affecting multiple core HotSpot components (in this case, the interface between HotSpot's default garbage collector and optimizing JIT compiler), there is a non-negligible risk of introducing bugs in the implementation that may cause failures and performance regressions. To mitigate this risk, we plan to conduct internal code reviews and extensive testing and benchmarking beyond the usual activities conducted for regular enhancements and bug fixes, as outlined in the "Testing" section above.