JEP draft: G1: Improve Application Throughput with a More Efficient Write-Barrier

Author: Ivan Walulya & Thomas Schatzl
Owner: Ivan Walulya
Type: Feature
Scope: Implementation
Status: Submitted
Component: hotspot / gc
Discussion: hotspot dash gc dash dev at openjdk dot org
Effort: M
Duration: M
Reviewed by: Vladimir Kozlov
Created: 2024/09/24 16:13
Updated: 2025/02/07 13:56
Issue: 8340827

Summary

Increase application throughput when the Garbage-First (G1) garbage collector is used by reducing the overhead of its write barriers.

Goals

Non-Goals

Motivation

Java Virtual Machine (JVM) overhead is always a concern, especially in cloud-based Java deployments where billing is based on actual resource usage. One component that can significantly impact CPU usage is the garbage collector (GC). The default GC in the HotSpot JVM, the Garbage-First (G1) collector, is tuned to keep a balance between latency and throughput for Java applications. However, keeping this balance can sometimes lead to considerable processing overhead compared to throughput-oriented garbage collectors such as Parallel and Serial GC.

The main reason for this difference is that G1 performs some tasks concurrently with the application to meet latency goals. Other garbage collectors may handle these tasks during the collection pause; concurrent execution, however, requires additional synchronization between the garbage collector and the application, which leads to substantial overhead. Specifically, G1 uses a relatively large piece of code, called the write barrier, to track every modification to the object graph in the Java heap. For every reference write in the Java application, the G1 write barrier adds about 50 x64 instructions [Protopopovs23], [JEP475], including very expensive cross-CPU memory synchronization instructions, compared to 3 x64 instructions for the Parallel GC write barrier.

Depending on the Java application, this difference can result in a 10-20% throughput regression compared to the Parallel collector in the current JDK (as noted in [JDK-8062128], [JDK-8253230], [JDK-8132937], and [deSmet21]).

Reducing this overhead is essential for increasing the adoption of the G1 in the cloud and minimizing the need for users to switch to a different garbage collection algorithm.

Description

G1 is an incremental garbage collector: it selectively evacuates parts of the Java heap in every garbage collection. G1 must track references into the selected parts of the heap from areas outside the selection, since inspecting the entire heap to find such references during a garbage collection pause would take too long. To solve this problem, G1 keeps track of locations that may contain such references of interest. Tracking each reference directly would require significant memory, so G1 instead employs a technique called card marking [Hölzle93].

Card marking divides the heap into fixed-size chunks called cards. Every card is represented by a byte in a separate fixed-size array called the card table, with each entry corresponding to one card in the Java heap. A card is either marked or unmarked; a mark indicates that the part of the Java heap corresponding to the card may contain interesting references.

When the application modifies an object reference, the G1 write barrier intercepts the modification and marks the corresponding card in the card table.
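For illustration, here is a minimal sketch of the card table mapping and the marking step, assuming the default 512-byte card size; all names are hypothetical, and this is not the actual HotSpot code:

```c++
#include <cstddef>
#include <cstdint>

// Illustrative names only; not the HotSpot implementation.
static const size_t  kCardSize = 512;   // heap bytes covered per card
static uint8_t*      card_table;        // one byte per card
static uintptr_t     heap_base;         // start of the Java heap

// Mark the card covering the written-to field as potentially containing
// an interesting reference (0 means "marked" in this sketch).
inline void mark_card(uintptr_t field_address) {
  card_table[(field_address - heap_base) / kCardSize] = 0;
}
```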

During garbage collection pauses, G1 examines the Java heap contents corresponding to the marked cards. G1 garbage collection pause times therefore depend on the number of changes the application makes to the object graph during its execution. Examining marked cards can take up a majority of the total garbage collection pause time simply because of the large number of cards to examine.

To reduce pause times and stay within the pause time goal, G1 tries to minimize the number of cards to examine during a garbage collection pause by re-examining, unmarking, and classifying cards according to which areas they reference, concurrently with the application.

These measures reduce the set of cards to inspect during the pause to those referring to areas selected for garbage collection, plus those not yet re-examined concurrently.
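Conceptually, this concurrent re-examination can be sketched as follows; the types and helpers are hypothetical placeholders, and the real implementation must additionally synchronize with the write barrier:

```c++
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for G1 internals; not HotSpot code.
struct Card { uint8_t* entry; };

// Placeholder: re-examine the heap words the card covers and decide
// whether any reference still points into an area of interest.
static bool refers_to_interesting_area(const Card&) { return false; }

// Placeholder: classify the card so the next pause examines it.
static void classify_for_pause(const Card&) {}

// Concurrent re-examination: unmark each card, keeping only the cards
// that still refer to areas of interest for the next pause.
static void refine(std::vector<Card>& marked_cards) {
  for (Card& c : marked_cards) {
    *c.entry = 1;                    // unmark (1 == unmarked in this sketch)
    if (refers_to_interesting_area(c)) {
      classify_for_pause(c);
    }
  }
}
```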

Currently, the G1 write barrier implements the following tasks:

- Filtering: skip writes that cannot yield an interesting reference, such as null stores, stores where source and destination are in the same region, and stores for which the corresponding card is already marked or covers an area that is always collected anyway.
- Card marking: mark the card corresponding to the written-to location on the card table.
- Memory synchronization: execute a costly memory fence so that the concurrent re-examination threads observe card marks and heap contents in a consistent order.
- Enqueueing: record the newly marked card for the concurrent re-examination threads, potentially calling into the JVM runtime.

This explains the large size of the G1 write barrier.
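To make the size difference concrete, the following sketch contrasts a Parallel-style unconditional card mark with the structure of the current G1 barrier. It is a simplification with hypothetical names, not the code HotSpot actually emits:

```c++
#include <atomic>
#include <cstdint>

// Hypothetical names; a simplification of both barriers, not HotSpot code.
static const int CARD_SHIFT = 9;                    // 512-byte cards
static uint8_t   card_table[1 << 20];               // one byte per card
enum : uint8_t { kMarked = 0, kUnmarked = 1, kAlwaysCollected = 2 };

static bool same_region(void* p, void* q) {         // filter helper (sketch)
  return ((uintptr_t)p >> 21) == ((uintptr_t)q >> 21);  // e.g. 2MB regions
}
static void enqueue_for_refinement(uint8_t*) {}     // hand card to GC threads

// Parallel GC: unconditionally mark the card; about three instructions.
inline void parallel_post_barrier(void* field) {
  card_table[(uintptr_t)field >> CARD_SHIFT] = kMarked;
}

// Current G1: filter, mark, fence, enqueue; dozens of instructions.
inline void g1_post_barrier(void* field, void* new_value) {
  if (new_value == nullptr) return;                      // filter: null store
  if (same_region(field, new_value)) return;             // filter: same region
  uint8_t* card = &card_table[(uintptr_t)field >> CARD_SHIFT];
  if (*card == kAlwaysCollected) return;                 // filter: young area
  std::atomic_thread_fence(std::memory_order_seq_cst);   // costly fence
  if (*card != kMarked) {                                // filter: already marked
    *card = kMarked;
    enqueue_for_refinement(card);                        // may call the runtime
  }
}
```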

This JEP changes how G1 manages marked cards, switching from the current fine-grained memory synchronization in the write barrier to a much less frequent but more intrusive coarse-grained approach. G1 also adopts a different data structure for storing to-be-examined card marks. This enables G1 to use a much smaller write barrier: this change keeps only the pre-filters mentioned above, which may avoid card marks completely and so lessen concurrent re-examination work, reducing the write barrier by about 20 instructions on the x86-64 architecture.

Overall, this approach results in significantly higher throughput: the smaller write barrier allows better compiled code quality and smaller code size, and the reduced synchronization increases concurrent re-examination performance.

The main idea is to segregate the activities of the application and the garbage collection threads onto distinct card tables. This eliminates the need for memory synchronization between the application marking cards and the concurrent garbage collection threads examining them. The application marks cards on one (regular, existing) card table, while the garbage collection threads inspect a second, initially empty, refinement (card) table. When G1 heuristically detects that processing the card marks on the card table in the garbage collection pause would start to exceed the pause time goal, the HotSpot VM atomically swaps the card table and the refinement table. The Java application resumes marking on the formerly empty refinement table, and the garbage collection threads process marks from the previous regular card table without any synchronization. G1 repeats this process as necessary to keep the overall number of cards to examine during garbage collection below a certain threshold.
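A minimal sketch of the two-table scheme, assuming the barrier loads a single pointer to find the current table; all names are hypothetical, and the real swap is coordinated with the application (for example in a safepoint):

```c++
#include <atomic>
#include <cstdint>

// Hypothetical names; a sketch of the proposed scheme, not HotSpot code.
static const int CARD_SHIFT = 9;                        // 512-byte cards
static uint8_t table_a[1 << 20], table_b[1 << 20];
static std::atomic<uint8_t*> mutator_table{table_a};    // application marks here
static uint8_t* refinement_table = table_b;             // GC threads scan here

// New, smaller write barrier: the pre-filters from above (elided here)
// plus a plain card mark; no memory fence, no enqueueing.
inline void g1_post_barrier(uintptr_t field) {
  mutator_table.load(std::memory_order_relaxed)[field >> CARD_SHIFT] = 0;
}

// Run when the estimated pause-time cost of the marked cards grows too
// large: swap the tables, so the application marks on the formerly empty
// refinement table while GC threads examine the frozen former card table
// without any further synchronization with the application.
void swap_card_tables() {
  refinement_table = mutator_table.exchange(refinement_table);
}
```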

Performance

With the proposed change, concurrent operations typically consume fewer CPU resources, because the refinement table is very cache-friendly to access, compared to the previous data structure holding card locations, which requires random memory access. There is also a small improvement in garbage collection pause times, because previously required garbage collection phases can be simplified or removed entirely.

However, the main benefit is increased throughput of Java applications due to less memory synchronization in the write barrier, particularly for Java applications that heavily modify the object graph in the Java heap and therefore frequently execute the write barrier in full. Depending on the CPU architecture and the Java application, we measured throughput increases of 5-15% or more due to decreased memory synchronization and concurrent examination.

Even throughput-oriented applications that perform very little or no concurrent examination at all may benefit from the smaller barrier code improving compiled code performance, showing throughput increases of around 5%.

Because G1 sizes the Java heap based on garbage collection CPU usage, the impact of this change often manifests as a decreased Java heap footprint instead of increased throughput.

Native Memory Footprint

This change allocates an extra card table of the same size as the regular card table. By default, a card table corresponds to around 0.2% of the Java heap size, that is, 2MB of native memory per 1GB of Java heap.
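For example, a 32GB Java heap implies a regular card table of about 64MB, and the refinement table adds roughly another 64MB of native memory.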

Removing the previous data structures that track cards for concurrent examination offsets approximately half of that additional native memory, unless the application performs almost no object graph modifications. The optimization of keeping the interesting cards of always-collected regions on the card table also saves a significant amount of native memory.

Overall, some applications may even use less native memory than before.

Alternatives

Several alternatives to reduce the write-barrier-induced throughput difference between G1 and other collectors have been proposed and prototyped before:

Overall, we think that the trade-off between code complexity, maintenance overhead, and performance characteristics is most favorable for this proposal.

Risks and Assumptions

We assume that the refinement table increases G1's native memory footprint slightly. We also assume that a significant portion of this increase will be compensated by the removal of the data structures for tracking marked cards to be examined, and by optimizations such as keeping the interesting cards of always-collected regions on the card table.

This is an intrusive change to critical components of G1's interaction with application threads, so there is a non-negligible risk of bugs that may cause failures or introduce performance regressions. To mitigate this risk, we will conduct careful code reviews and perform extensive testing.