JEP draft: G1: Improve Application Throughput with a More Efficient Write-Barrier
Author | Ivan Walulya & Thomas Schatzl
Owner | Ivan Walulya
Type | Feature
Scope | Implementation |
Status | Submitted |
Component | hotspot / gc |
Discussion | hotspot dash gc dash dev at openjdk dot org |
Effort | M |
Duration | M |
Reviewed by | Vladimir Kozlov |
Created | 2024/09/24 16:13 |
Updated | 2025/01/27 19:02 |
Issue | 8340827 |
Summary
Increase application throughput when the G1 garbage collector is used by reducing the overhead of its write-barriers.
Goals
- Reduce the overall synchronization overhead required to operate the G1 garbage collector.
- Reduce the size of the generated code for the G1 write-barriers.
- Keep the overall behavior of the G1 garbage collector the same: G1 continues to alternate between two phases, collecting either only the young generation or both the young and old generations, and reclaims memory incrementally.
Non-Goals
- Match the throughput of the G1 garbage collector to that of any other garbage collector.
Motivation
Java Virtual Machine (JVM) overhead is always a concern, especially in cloud-based Java deployments where billing is based on actual resource usage. One component that can significantly impact CPU usage is the garbage collector (GC). The default GC in the HotSpot JVM, the Garbage-First (G1) collector, is tuned to balance latency and throughput for Java applications. However, maintaining this balance can incur considerable processing overhead compared to throughput-oriented garbage collectors such as Parallel GC and Serial GC.
The main reason for this difference is that G1 performs some tasks concurrently with the application in order to meet its latency goals, tasks that other garbage collectors handle during the collection pause. This concurrent execution requires additional synchronization between the garbage collector and the application, leading to substantial overhead. Specifically, G1 uses a large amount of code, referred to as a write-barrier, to track every modification to the object graph. For every reference field modification in the Java application, the G1 write-barrier adds about 50 x64 instructions, including very expensive memory-barrier instructions [Protopopovs23], [JEP475], compared to 3 x64 instructions for the Parallel GC write-barrier.
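As a rough illustration, the following C++-style sketch contrasts the two barriers. It is a simplification of the barrier pseudocode in [Protopopovs23] and [JEP475]; the helper names (`same_region`, `storeload_fence`, `enqueue`) and the card values are made up for this example and are not actual HotSpot identifiers.

```
#include <cstdint>

extern volatile uint8_t card_table[];          // one byte per 512-byte card
const int     card_shift = 9;                  // 2^9 = 512-byte cards
const uint8_t dirty_card = 0, young_card = 2;  // values illustrative

bool same_region(void* a, void* b);            // hypothetical helpers
void storeload_fence();
void enqueue(volatile uint8_t* card);

// Parallel GC post-write barrier: unconditionally mark the card that
// covers the updated field (roughly 3 x64 instructions).
void parallel_post_barrier(void* field) {
  card_table[(uintptr_t)field >> card_shift] = dirty_card;
}

// Current G1 post-write barrier: filter out uninteresting stores, then
// mark and enqueue the card, using an expensive StoreLoad fence to
// synchronize with the concurrent refinement threads.
void g1_post_barrier(void* field, void* new_value) {
  if (same_region(field, new_value)) return;   // cross-region filter
  if (new_value == nullptr) return;            // null-store filter
  volatile uint8_t* card = &card_table[(uintptr_t)field >> card_shift];
  if (*card == young_card) return;             // young-region filter
  storeload_fence();                           // the costly synchronization
  if (*card == dirty_card) return;             // already-marked filter
  *card = dirty_card;                          // mark the card
  enqueue(card);                               // hand off for refinement
}
```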
Depending on the Java application, this difference in write-barriers can result in a 10-20% throughput regression compared to the Parallel collector in the current JDK (as noted in [JDK-8062128], [JDK-8253230], [JDK-8132937], and [deSmet21]).
Reducing this overhead is essential to increasing the adoption of G1 in the cloud and minimizing the need for users to switch to a different garbage collection algorithm.
Description
G1 garbage collection pause times depend on the number of changes made to the object graph during a Java application's execution. To track these changes, G1 uses a "card table", in which each entry corresponds to a small, fixed-size section of Java heap memory called a "card". This information is essential for quickly identifying live objects during a garbage collection pause [Hölzle93].
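For illustration, the mapping from heap addresses to card-table entries might look like the following sketch, assuming HotSpot's default 512-byte cards; `heap_base` and `card_for` are hypothetical names used only here.

```
#include <cstdint>

// Each card-table entry covers one 512-byte "card" of the heap, so the
// table itself is about 1/512th (~0.2%) of the heap size.
extern uint8_t   card_table[];   // one entry per card
extern uintptr_t heap_base;      // start address of the Java heap
const int card_shift = 9;        // 2^9 = 512 bytes per card

// Return the card-table entry covering a given heap address.
uint8_t* card_for(void* heap_addr) {
  return &card_table[((uintptr_t)heap_addr - heap_base) >> card_shift];
}
```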
During the garbage collection pause, G1 examines the memory contents corresponding to the marked cards. In G1, there are two sources of these marked cards:
- cards recently marked by the application, for which G1 has not yet determined whether they contain information useful for the current garbage collection;
- cards that were already examined concurrently with the Java application, found to be relevant for collecting specific regions of the Java heap, where those regions have since been selected for collection.
Examining cards on the card table can take up to 60% of the total garbage collection pause time. G1 tries to minimize the number of cards to process during the collection pause to reduce the pause time and maintain its pause time goal.
G1 controls the number of cards to process during the pause by examining recently marked cards concurrently with the Java application. Cards of interest are recorded in a separate data structure, and the cards themselves are cleared (unmarked) so that they need not be processed again in the pause. This task runs concurrently with the Java application, at whatever frequency the pause-time goal demands. Marking, unmarking, and examining cards from different threads (application and garbage collection threads) requires expensive memory synchronization between these threads for correctness.
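For illustration, refining a single card under the current scheme might look like the following sketch. The helper names and card values are hypothetical, but the essential point stands: the card must be unmarked before its memory is examined, and both the refinement threads and the write-barrier need memory barriers to make that race safe.

```
#include <cstdint>

const uint8_t dirty_card = 0, clean_card = 1;        // values illustrative
void storeload_fence();                              // hypothetical helpers
void record_if_interesting(volatile uint8_t* card);

// Refine one card; this races with application threads re-marking the
// same card, hence the memory synchronization.
void refine_card(volatile uint8_t* card) {
  if (*card != dirty_card) return;  // already refined or never marked
  *card = clean_card;               // unmark, racing with the write-barrier
  storeload_fence();                // order the clear before the examination
  record_if_interesting(card);      // examine the covered heap memory and
                                    // remember the card in a separate
                                    // data structure if it is relevant
}
```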
This JEP changes how G1 manages marked cards, replacing the current fine-grained memory synchronization in the write-barrier with much less frequent but more intrusive coarse-grained synchronization. This allows G1 to use a much smaller write-barrier; in our implementation, for example, the write-barrier is about 20 instructions shorter on the x86-64 architecture. Overall, this approach results in higher throughput.
The main idea is to confine the application threads and the garbage collection threads to different card tables, so that no memory synchronization is needed to examine marked cards concurrently with the Java application. The application marks cards in one (the regular, existing) card table, while the garbage collection threads process a second, initially empty, refinement (card) table. When G1 determines that enough marks have accumulated in the card table, the VM atomically swaps the card table and the refinement table. The Java application then resumes marking on the formerly empty refinement table, while the garbage collection threads process the marks on the former regular card table as described above, without any further synchronization. This process repeats as necessary.
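The swap itself can be pictured with the following minimal sketch; the names are illustrative rather than actual HotSpot identifiers, and the swap runs inside a global synchronization point (per the Alternatives section below, a thread-local handshake [JEP312]) so that no application thread is marking cards while the tables are exchanged.

```
#include <cstdint>
#include <utility>

uint8_t* card_table;        // application threads mark cards here
uint8_t* refinement_table;  // GC threads scan here; starts out all clean

void swap_card_tables() {
  // Runs inside a global synchronization point (a thread-local
  // handshake [JEP312]); no application thread marks cards meanwhile.
  std::swap(card_table, refinement_table);
  // Application threads now mark on the former (clean) refinement
  // table, while refinement threads examine, record, and clear marks
  // on the former card table without per-card synchronization.
}
```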
Performance
With the proposed change, concurrent operations typically consume fewer CPU resources, because the refinement table is much more cache-friendly to access than the previous data structure, which holds random card locations. Garbage collection pause times also improve slightly, because some previously required garbage collection phases can be simplified or removed entirely.
Overall, the main benefit is increased Java application throughput due to less memory synchronization in the write-barrier, particularly for Java applications that heavily modify the object graph in the Java heap and therefore frequently execute the write-barrier in full. Depending on the CPU architecture and the Java application, we measured throughput increases of 5-15% or more from the decreased memory synchronization and concurrent examination.
Even throughput-oriented applications that trigger little or no concurrent examination at all may benefit from the decreased size of the barrier code, showing throughput increases of around 5%.
Because G1 sizes the Java heap based on garbage collection CPU usage, the impact of this change often shows up as a decreased Java heap footprint rather than as increased throughput.
Native Memory Footprint
This change allocates an extra card table of the same size as the regular card table. By default, a card table occupies around 0.2% of the Java heap size, that is, 2MB of native memory per 1GB of Java heap.
Removing the previous data structures that track the concurrent examination of cards offsets approximately half of that native memory, unless the application performs almost no object graph modifications. The optimization of keeping the interesting cards of always-collected regions on the card table also saves a significant amount of native memory.
Overall, some applications may even use less native memory than before.
Alternatives
Several alternatives for reducing the write-barrier-induced throughput difference between G1 and other collectors have been proposed and prototyped before:
- Use operating-system-dependent synchronization mechanisms to implement the coarse-grained synchronization between application and garbage collection threads. This would make the synchronization dependent on the operating system, and even on the operating system version. Some platforms do not implement this functionality directly (for example, OSX-aarch64 or old versions of linux-riscv), or implement it only in a too-generic way (older Linux kernels), so workarounds, for example using debugging APIs, would be needed. The performance of this functionality would be very poor on some platforms, so the current technique using Thread-Local Handshakes [JEP312] would have been needed anyway as a workaround for these deficiencies.
- Keep the existing data structures that record where the marked cards on the card table are, instead of using the refinement table. This would have the advantage of a smaller change; however, the possible reduction of the write-barrier would be much smaller, limited to avoiding the memory synchronization and saving only five instructions instead of twenty. Removing the old data structures also reduces work in the garbage collection pause, shortening pause times, and management of the refinement table is much less complex: the prototype in this proposal removes around 1,000 lines of code from the JVM sources, mostly related to this data structure, significantly reducing code complexity.
- Use the same throughput-optimized write-barrier as Parallel GC, disable all concurrent examination of cards, and have G1 do all of the work in the garbage collection pauses, letting end users choose between the two "garbage collection modes" based on their service-level preferences. A master's thesis [Protopopovs23] showed that the pause-time impact can be negligible in throughput-oriented applications, but this proposal would give G1 two distinct modes of operation. That adds significant complexity to the code base, increases the test surface significantly, and is inconvenient for end users, who need to know in advance which mode to select. Alternatively, end users might just as well select Parallel GC, which, as shown in the thesis, would still be slightly faster than G1 on pure throughput loads.
- Modify the original write-barrier to batch the memory-barrier executions, removing the memory synchronization from every object graph modification. The cost of this approach is much more enqueuing work in some applications, decreasing throughput compared to the original G1. The approach proposed here shows no such performance regressions.
Overall, we think that the trade-off between code complexity, maintenance overhead, and performance characteristics is most favorable for this proposal.
Risks and Assumptions
We assume that the refinement table slightly increases the native memory footprint of G1. We also assume that a significant portion of this increase will be compensated for by removing the data structures for tracking marked cards and by optimizations such as keeping the interesting cards of always-collected regions on the card table.
This is an intrusive change to critical components of G1's interaction with application threads, so there is a non-negligible risk of bugs that may cause failures or introduce performance regressions. To mitigate this risk, we will conduct careful code reviews and perform extensive testing.