JEP draft: G1: Improve Application Throughput with a More Efficient Write-Barrier
Author | Ivan Walulya & Thomas Schatzl
Owner | Ivan Walulya
Type | Feature
Scope | Implementation |
Status | Submitted |
Component | hotspot / gc |
Discussion | hotspot dash gc dash dev at openjdk dot org |
Effort | M |
Duration | M |
Reviewed by | Vladimir Kozlov |
Created | 2024/09/24 16:13 |
Updated | 2025/01/27 19:02 |
Issue | 8340827 |
Summary
Increase application throughput when the G1 garbage collector is used by reducing the overhead of its write-barriers.
Goals
- Reduce the overall synchronization overhead required to operate the G1 garbage collector.
- Reduce the size of the generated code for the G1 write-barriers.
- Keep the overall behavior of the G1 garbage collector the same: G1 continues to alternate between two phases, collecting either only the young generation or both the young and old generations, and reclaims memory incrementally.
Non-Goals
- Match the throughput of the G1 garbage collector to that of any other garbage collector.
Motivation
Java Virtual Machine (JVM) overhead is always a concern, especially in cloud-based Java deployments where billing is based on actual resource usage. One component that can significantly impact CPU usage is the garbage collector (GC). The default GC in the HotSpot JVM, the Garbage-First (G1) collector, is tuned to balance latency and throughput for Java applications. However, maintaining this balance can incur considerable processing overhead compared to throughput-oriented garbage collectors such as Parallel GC and Serial GC.
The main reason for this difference is that G1 performs some tasks concurrently with the application in order to meet its latency goals, tasks that other garbage collectors handle during the collection pause. This concurrent execution requires additional synchronization between the garbage collector and the application, leading to substantial overhead. Specifically, G1 uses a large amount of code, referred to as a write-barrier, to track every modification to the object graph. For every reference field modification in the Java application, the G1 write-barrier adds about 50 x64 instructions, including very expensive memory-barrier instructions [Protopopovs23], [JEP475], compared to 3 x64 instructions for the Parallel GC write-barrier.
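As a rough illustration, the following C++-style sketch contrasts the two barriers. It is a simplification of the barrier pseudocode in [Protopopovs23] and [JEP475]; the helper names (`same_region`, `storeload_fence`, `enqueue`) and the card values are made up for this example and are not actual HotSpot identifiers.

```
#include <cstdint>

extern volatile uint8_t card_table[];          // one byte per 512-byte card
const int     card_shift = 9;                  // 2^9 = 512-byte cards
const uint8_t dirty_card = 0, young_card = 2;  // values illustrative

bool same_region(void* a, void* b);            // hypothetical helpers
void storeload_fence();
void enqueue(volatile uint8_t* card);

// Parallel GC post-write barrier: unconditionally mark the card that
// covers the updated field (roughly 3 x64 instructions).
void parallel_post_barrier(void* field) {
  card_table[(uintptr_t)field >> card_shift] = dirty_card;
}

// Current G1 post-write barrier: filter out uninteresting stores, then
// mark and enqueue the card, using an expensive StoreLoad fence to
// synchronize with the concurrent refinement threads.
void g1_post_barrier(void* field, void* new_value) {
  if (same_region(field, new_value)) return;   // cross-region filter
  if (new_value == nullptr) return;            // null-store filter
  volatile uint8_t* card = &card_table[(uintptr_t)field >> card_shift];
  if (*card == young_card) return;             // young-region filter
  storeload_fence();                           // the costly synchronization
  if (*card == dirty_card) return;             // already-marked filter
  *card = dirty_card;                          // mark the card
  enqueue(card);                               // hand off for refinement
}
```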
Depending on the Java application, this difference in write-barriers can result in a 10-20% throughput regression compared to the Parallel collector in the current JDK (as noted in [JDK-8062128], [JDK-8253230], [JDK-8132937], and [deSmet21]).
Reducing this overhead is essential to increasing the adoption of G1 in the cloud and minimizing the need for users to switch to a different garbage collection algorithm.
Description
G1 garbage collection pause times depend on the number of changes made to the object graph during a Java application's execution. To track these changes, G1 uses a "card table", in which each entry corresponds to a small, fixed-size section of Java heap memory called a "card". This information is essential for quickly identifying live objects during a garbage collection pause [Hölzle93].
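For illustration, the mapping from heap addresses to card-table entries might look like the following sketch, assuming HotSpot's default 512-byte cards; `heap_base` and `card_for` are hypothetical names used only here.

```
#include <cstdint>

// Each card-table entry covers one 512-byte "card" of the heap, so the
// table itself is about 1/512th (~0.2%) of the heap size.
extern uint8_t   card_table[];   // one entry per card
extern uintptr_t heap_base;      // start address of the Java heap
const int card_shift = 9;        // 2^9 = 512 bytes per card

// Return the card-table entry covering a given heap address.
uint8_t* card_for(void* heap_addr) {
  return &card_table[((uintptr_t)heap_addr - heap_base) >> card_shift];
}
```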
During the garbage collection pause, G1 examines the memory contents corresponding to the marked cards. In G1, there are two sources of these marked cards:
- cards recently marked by the application, for which G1 has not yet determined whether they contain information useful for the current garbage collection;
- cards that were already examined concurrently with the Java application, found to be relevant for collecting specific regions of the Java heap, where those regions have since been selected for collection.
Examining cards on the card table can take up to 60% of the total garbage collection pause time. G1 tries to minimize the number of cards to process during the collection pause to reduce the pause time and maintain its pause time goal.
G1 controls the number of cards to process during the pause by examining recently marked cards concurrently with the Java application. Cards of interest are recorded in a separate data structure, and the cards themselves are cleared (unmarked) so that they need not be processed again in the pause. This task runs concurrently with the Java application, at whatever frequency the pause-time goal demands. Marking, unmarking, and examining cards from different threads (application and garbage collection threads) requires expensive memory synchronization between these threads for correctness.
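For illustration, refining a single card under the current scheme might look like the following sketch. The helper names and card values are hypothetical, but the essential point stands: the card must be unmarked before its memory is examined, and both the refinement threads and the write-barrier need memory barriers to make that race safe.

```
#include <cstdint>

const uint8_t dirty_card = 0, clean_card = 1;        // values illustrative
void storeload_fence();                              // hypothetical helpers
void record_if_interesting(volatile uint8_t* card);

// Refine one card; this races with application threads re-marking the
// same card, hence the memory synchronization.
void refine_card(volatile uint8_t* card) {
  if (*card != dirty_card) return;  // already refined or never marked
  *card = clean_card;               // unmark, racing with the write-barrier
  storeload_fence();                // order the clear before the examination
  record_if_interesting(card);      // examine the covered heap memory and
                                    // remember the card in a separate
                                    // data structure if it is relevant
}
```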
This JEP changes how G1 manages marked cards, replacing the current fine-grained memory synchronization in the write-barrier with much less frequent but more intrusive coarse-grained synchronization. This allows G1 to use a much smaller write-barrier; in our implementation, for example, the write-barrier is about 20 instructions shorter on the x86-64 architecture. Overall, this approach results in higher throughput.
The main idea is to confine the application threads and the garbage collection threads to different card tables, so that no memory synchronization is needed to examine marked cards concurrently with the Java application. The application marks cards in one (the regular, existing) card table, while the garbage collection threads process a second, initially empty, refinement (card) table. When G1 determines that enough marks have accumulated in the card table, the VM atomically swaps the card table and the refinement table. The Java application then resumes marking on the formerly empty refinement table, while the garbage collection threads process the marks on the former regular card table as described above, without any further synchronization. This process repeats as necessary.
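The swap itself can be pictured with the following minimal sketch; the names are illustrative rather than actual HotSpot identifiers, and the swap runs inside a global synchronization point (per the Alternatives section below, a thread-local handshake [JEP312]) so that no application thread is marking cards while the tables are exchanged.

```
#include <cstdint>
#include <utility>

uint8_t* card_table;        // application threads mark cards here
uint8_t* refinement_table;  // GC threads scan here; starts out all clean

void swap_card_tables() {
  // Runs inside a global synchronization point (a thread-local
  // handshake [JEP312]); no application thread marks cards meanwhile.
  std::swap(card_table, refinement_table);
  // Application threads now mark on the former (clean) refinement
  // table, while refinement threads examine, record, and clear marks
  // on the former card table without per-card synchronization.
}
```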
Performance
With the proposed change, concurrent operations typically consume fewer CPU resources, because the refinement table is much more cache-friendly to access than the previous data structure, which holds random card locations. Garbage collection pause times also improve slightly, because some previously required garbage collection phases can be simplified or removed entirely.
Overall, the main benefit is increased Java application throughput due to less memory synchronization in the write-barrier, particularly for Java applications that heavily modify the object graph in the Java heap and therefore frequently execute the write-barrier in full. Depending on the CPU architecture and the Java application, we measured throughput increases of 5-15% or more from the decreased memory synchronization and concurrent examination.
Even throughput-oriented applications that trigger little or no concurrent examination at all may benefit from the decreased size of the barrier code, showing throughput increases of around 5%.
Because G1 sizes the Java heap based on garbage collection CPU usage, the impact of this change often shows up as a decreased Java heap footprint rather than as increased throughput.
Native Memory Footprint
This change allocates an extra card table of the same size as the regular card table. By default, a card table occupies around 0.2% of the Java heap size, that is, 2MB of native memory per 1GB of Java heap.
Removing the previous data structures that track the concurrent examination of cards offsets approximately half of that native memory, unless the application performs almost no object graph modifications. The optimization of keeping the interesting cards of always-collected regions on the card table also saves a significant amount of native memory.
Overall, some applications may even use less native memory than before.
Alternatives
Several alternatives for reducing the write-barrier-induced throughput difference between G1 and other collectors have been proposed and prototyped before:
- Use operating-system-dependent synchronization mechanisms to implement the coarse-grained synchronization between application and garbage collection threads. This would make the synchronization dependent on the operating system, and even on the operating system version. Some platforms do not implement this functionality directly (for example, OSX-aarch64 or old versions of linux-riscv), or implement it only in a too-generic way (older Linux kernels), so workarounds, for example using debugging APIs, would be needed. The performance of this functionality would be very poor on some platforms, so the current technique using Thread-Local Handshakes [JEP312] would have been needed anyway as a workaround for these deficiencies.
- Keep the existing data structures that record where the marked cards on the card table are, instead of using the refinement table. This would have the advantage of a smaller change; however, the possible reduction of the write-barrier would be much smaller, limited to avoiding the memory synchronization and saving only five instructions instead of twenty. Removing the old data structures also reduces work in the garbage collection pause, shortening pause times, and management of the refinement table is much less complex: the prototype in this proposal removes around 1,000 lines of code from the JVM sources, mostly related to this data structure, significantly reducing code complexity.
- Use the same throughput-optimized write-barrier as Parallel GC, disable all concurrent examination of cards, and have G1 do all of the work in the garbage collection pauses, letting end users choose between the two "garbage collection modes" based on their service-level preferences. A master's thesis [Protopopovs23] showed that the pause-time impact can be negligible in throughput-oriented applications, but this proposal would give G1 two distinct modes of operation. That adds significant complexity to the code base, increases the test surface significantly, and is inconvenient for end users, who need to know in advance which mode to select. Alternatively, end users might just as well select Parallel GC, which, as shown in the thesis, would still be slightly faster than G1 on pure throughput loads.
- Modify the original write-barrier to batch the memory-barrier executions, removing the memory synchronization from every object graph modification. The cost of this approach is much more enqueuing work in some applications, decreasing throughput compared to the original G1. The approach proposed here shows no such performance regressions.
Overall, we think that the trade-off between code complexity, maintenance overhead, and performance characteristics is most favorable for this proposal.
Risks and Assumptions
We assume that the refinement table slightly increases the native memory footprint of G1. We also assume that a significant portion of this increase will be compensated for by removing the data structures for tracking marked cards and by optimizations such as keeping the interesting cards of always-collected regions on the card table.
This is an intrusive change to critical components of G1's interaction with application threads, so there is a non-negligible risk of bugs that may cause failures or introduce performance regressions. To mitigate this risk, we will conduct careful code reviews and perform extensive testing.