JEP draft: G1: Improve Application Throughput with a More Efficient Write-Barrier

Author: Ivan Walulya & Thomas Schatzl
Owner: Ivan Walulya
Type: Feature
Scope: Implementation
Status: Submitted
Component: hotspot / gc
Discussion: hotspot dash gc dash dev at openjdk dot org
Effort: M
Duration: M
Reviewed by: Vladimir Kozlov
Created: 2024/09/24 16:13
Updated: 2025/01/27 19:02
Issue: 8340827

Summary

Increase the application throughput when the G1 garbage collector is used by reducing the impact of its barriers.

Goals

Non-Goals

Motivation

Java Virtual Machine (JVM) overhead is always a concern, especially in cloud-based Java deployments where billing is based on actual resource usage. One component that can significantly impact CPU usage is the garbage collector (GC). The default GC in the HotSpot JVM, the Garbage-First (G1) collector, is tuned to balance latency and throughput for Java applications. However, keeping this balance can sometimes incur considerable processing overhead compared to throughput-oriented garbage collectors such as the Parallel and Serial GCs.

The main reason for this difference is that G1 performs some tasks concurrently with the application to meet its latency goals, tasks that other garbage collectors handle during the collection pause. This concurrent execution requires additional synchronization between the garbage collector and the application, leading to substantial overhead. Specifically, G1 uses a large amount of code to track every modification to the object graph, in what is referred to as a write-barrier. For every reference field modification in the Java application, the G1 write-barrier adds about 50 x64 instructions [Protopopovs23], [JEP475], including very expensive memory barriers, compared to 3 x64 instructions for the Parallel GC write-barrier.
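To make this difference concrete, the following C++-style sketch contrasts the rough shape of the two post-write barriers. The constants, card values, and the enqueue_for_refinement helper are placeholders for this sketch, not HotSpot source; [Protopopovs23] and [JEP475] give the precise pseudocode.

    #include <atomic>
    #include <cstdint>

    // Illustrative constants and helpers -- not HotSpot source.
    static const int kCardShift = 9;              // 512-byte cards
    static const int kRegionShift = 21;           // assume 2MB heap regions
    enum : uint8_t { kDirtyCard = 0, kYoungCard = 2 };

    extern volatile uint8_t* card_table;          // one byte per card
    void enqueue_for_refinement(volatile uint8_t* card);  // hypothetical

    // Parallel GC post-write barrier: unconditionally dirty the card.
    void parallel_post_barrier(void** field) {
      card_table[reinterpret_cast<uintptr_t>(field) >> kCardShift] = kDirtyCard;
    }

    // Current G1 post-write barrier: filtering, plus a StoreLoad fence to
    // synchronize with concurrent refinement threads, then card enqueueing.
    void g1_post_barrier(void** field, void* new_val) {
      uintptr_t f = reinterpret_cast<uintptr_t>(field);
      uintptr_t v = reinterpret_cast<uintptr_t>(new_val);
      if (((f ^ v) >> kRegionShift) == 0) return; // same region: not tracked
      if (new_val == nullptr) return;             // null stores: not tracked
      volatile uint8_t* card = &card_table[f >> kCardShift];
      if (*card == kYoungCard) return;            // young regions: not tracked
      std::atomic_thread_fence(std::memory_order_seq_cst); // expensive fence
      if (*card == kDirtyCard) return;            // already pending
      *card = kDirtyCard;
      enqueue_for_refinement(card);               // hand off to refinement
    }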

Depending on the Java application, this difference can result in a 10-20% throughput regression compared to the Parallel collector in the current JDK (as noted in [JDK-8062128], [JDK-8253230], [JDK-8132937], and [deSmet21]).

Reducing this overhead is essential to increasing G1's adoption in the cloud and minimizing the need for users to switch to a different garbage collection algorithm.

Description

G1 garbage collection pause times depend on the number of changes made in the object graph during a Java application's execution. To track these changes, G1 uses a "card table," where each entry corresponds to a small section of memory, or "card," in the Java heap. This information is essential for quickly identifying live objects during a garbage collection pause [Hölzle93].
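The following is a minimal model of such a card table, assuming HotSpot's default 512-byte cards; the class and member names are illustrative only.

    #include <cstdint>
    #include <vector>

    class CardTableModel {
      static const int kCardShift = 9;            // 2^9 = 512 bytes per card
      static const uint8_t kClean = 1, kDirty = 0;

      uintptr_t heap_base_;
      std::vector<uint8_t> cards_;                // one byte per card

    public:
      CardTableModel(uintptr_t heap_base, size_t heap_bytes)
        : heap_base_(heap_base), cards_(heap_bytes >> kCardShift, kClean) {}

      // The write-barrier marks the card covering a modified field...
      void mark(uintptr_t field_addr) {
        cards_[(field_addr - heap_base_) >> kCardShift] = kDirty;
      }

      // ...and a garbage collection pause later examines the 512-byte heap
      // section behind every marked card, looking for relevant references.
      bool is_marked(size_t card_index) const {
        return cards_[card_index] == kDirty;
      }
    };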

During the garbage collection pause, G1 examines the memory contents corresponding to the marked cards. In G1, there are two sources of these marked cards:

- the card table, in which the application's write-barrier marks cards as the object graph is modified, and
- a separate data structure in which the garbage collector records cards of interest found during concurrent examination (described below).

Examining cards on the card table can take up to 60% of the total garbage collection pause time. G1 therefore tries to minimize the number of cards processed during the pause in order to meet its pause time goal.

G1 controls the number of cards to process during the pause by examining recently marked cards concurrently with the Java application. Cards of interest are stored in a separate data structure, and the cards are cleared (unmarked) so that they need not be processed in the pause. This task runs concurrently with the Java application, at a frequency dictated by the pause time goal. Marking, unmarking, and examining cards from different threads (application and garbage collection threads) requires expensive memory synchronization between these threads for correctness.
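A very rough sketch of this per-card handshake is shown below; the real HotSpot code is considerably more involved, and all names here are made up for illustration. The key point is the atomic exchange on each individual card, which pairs with the fence in the write-barrier shown earlier.

    #include <atomic>
    #include <cstdint>

    extern std::atomic<uint8_t>* card_table;          // model cards as atomics
    void remember_card_of_interest(size_t card_index); // hypothetical storage

    enum : uint8_t { kDirty = 0, kClean = 1 };

    void refine_card(size_t card_index) {
      // Unmark the card first, so that a racing application thread that
      // writes to the same card afterwards re-marks it rather than being
      // lost.
      uint8_t prev = card_table[card_index].exchange(kClean,
                                                     std::memory_order_seq_cst);
      if (prev != kDirty) return;    // raced with another refinement thread
      // Now examine the heap memory behind the card and record interesting
      // references in a separate data structure for the next pause.
      remember_card_of_interest(card_index);
    }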

This JEP changes how G1 manages marked cards, replacing the current fine-grained memory synchronization in the write-barrier with much less frequent but more intrusive coarse-grained synchronization. This allows G1 to use a much smaller write-barrier; in our implementation, for example, the write-barrier is about 20 instructions shorter on the x86-64 architecture. Overall this approach results in higher throughput.

The main idea is to confine the application threads and the garbage collection threads to different card tables, so that no memory synchronization is needed to examine marked cards concurrently with the Java application. The application marks cards in one (regular, existing) card table, while the garbage collection threads process a second, initially empty, refinement (card) table. When G1 determines that enough marks have accumulated in the card table, the VM atomically swaps the card table and the refinement table. The Java application then resumes marking in the formerly empty refinement table, and the garbage collection threads process the marks in the former regular card table as described above, without any further synchronization. This process repeats as necessary.
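A hedged sketch of this scheme, with illustrative names and without the VM-internal synchronization that guards the actual swap, might look as follows. Note how the write-barrier shrinks to a plain card mark, and how the detached table can be scanned without per-card fences.

    #include <cstddef>
    #include <cstdint>

    struct DualCardTables {
      uint8_t*  current;     // table the application's write-barrier marks
      uint8_t*  refinement;  // initially clean table owned by the GC threads
      size_t    num_cards;
      uintptr_t heap_base;

      // The new, smaller write-barrier: mark the card, nothing else --
      // no memory fence and no queueing needed any more.
      void post_barrier(uintptr_t field_addr) {
        current[(field_addr - heap_base) >> 9] = 0;   // 0 = dirty
      }

      // Coarse-grained synchronization point: exchange the roles of the
      // two tables once enough marks have accumulated (atomically, in the
      // real implementation).
      void swap_tables() {
        uint8_t* detached = current;
        current = refinement;      // application resumes marking here
        refinement = detached;     // GC threads now own the accumulated marks
      }

      // Refinement after the swap: the detached table is private to the GC
      // threads, so its cards can be examined and cleaned without fences.
      void refine_detached() {
        for (size_t i = 0; i < num_cards; i++) {
          if (refinement[i] == 0) {
            // ...examine the heap memory behind card i, remember what
            // matters for the next pause...
            refinement[i] = 1;     // leave the table clean for the next swap
          }
        }
      }
    };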

Performance

With the proposed change, concurrent operations typically consume fewer CPU resources, because the refinement table is very cache-friendly to access compared to the previous data structure, which held arbitrary card locations. Garbage collection pause times also improve slightly, because some previously required garbage collection phases can be simplified or removed entirely.

Overall, the main benefit is increased throughput of the Java application due to less memory synchronization in the write-barrier, particularly for Java applications that heavily modify the object graph in the Java heap and therefore frequently execute the full write-barrier. Depending on the CPU architecture and the Java application, we measured throughput increases of 5-15% or more, due to the decreased memory synchronization and cheaper concurrent examination.

Even throughput-oriented applications that trigger little or no concurrent examination at all may benefit from the smaller barrier code, showing throughput increases of around 5%.

Because G1 sizes the Java heap based on garbage collection CPU usage, the impact of this change often manifests as a decreased Java heap footprint instead of increased throughput.

Native Memory Footprint

This change allocates an extra card table of the same size as the regular card table. A card table occupies one byte per card; with the default 512-byte cards this corresponds to around 0.2% of Java heap size, or 2MB of native memory per 1GB of Java heap.

Removing the previous data structures used to track cards for concurrent examination offsets roughly half of that additional native memory, unless the application performs almost no object graph modifications. The optimization of keeping the interesting cards of always-collected regions on the card table also saves a significant amount of native memory.

Overall, some applications may even use less native memory than before.

Alternatives

Several alternatives to reduce the throughput gap between G1 and other collectors caused by G1's complex write-barriers have been proposed and prototyped before.

Overall, we think this proposal offers the most favorable trade-off between code complexity, maintenance overhead, and performance characteristics.

Risks and Assumptions

We assume that the refinement table slightly increases G1's native memory footprint. We also assume that a significant portion of this will be compensated by removing the data structures previously used to track marked cards for concurrent examination, and by optimizations such as keeping the interesting cards of always-collected regions on the card table.

Because this is an intrusive change to critical components of G1's interaction with application threads, there is a non-negligible risk of bugs that may cause failures or introduce performance regressions. To mitigate this risk, we will conduct careful code reviews and perform extensive testing.