JEP draft: Throughput post-write barrier for G1

Owner	Man Cao
Type	Feature
Scope	JDK
Status	Closed / Withdrawn
Component	hotspot / gc
Created	2019/08/26 21:14
Updated	2025/04/23 22:50
Issue	8230187

Summary

Have the G1 garbage collector use a throughput-optimized write barrier when the user disables concurrent refinement. This trades some latency for better throughput on workloads that are not latency-sensitive.

Non-Goals

Let the VM determine, either at startup or at runtime, when to disable concurrent refinement and optimize its barriers.
Changes to the general garbage collection cycle, i.e. G1 will stay generational, and alternate between a young-only phase and a space reclamation phase that incrementally reclaims space in the old generation.
Changes to G1’s throughput or features when concurrent refinement is enabled.
Limit the availability of other G1 features like string deduplication, AppCDS, and eager reclamation of humongous objects when concurrent refinement is disabled.
Additional ergonomics changes to improve throughput or better meet pause time goals when concurrent refinement is disabled.
Match the performance of G1 to Parallel GC when concurrent refinement is disabled.

Motivation

The G1 garbage collector has a more complicated post-reference-write barrier (hereafter simply "write barrier") than more traditional collectors such as the Parallel or the Concurrent Mark-Sweep collector. This complexity is largely due to support for concurrent refinement, which moves some scanning work out of the collection pause so that it is done concurrently with the application.

The current mechanism is as follows: the write barrier adds newly dirtied cards to a local per-thread dirty card queue. When this queue is full, the write barrier either hands the full queue to a global dirty card queue set and receives a new empty queue to fill, or is told to process the entries in the queue itself. Concurrent refinement threads also pick up dirty card queues from the global dirty card queue set and process them. In either case, this processing determines whether a given dirty card needs to be scanned in the next collection.
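The queueing scheme described above can be sketched as a small model. This is an illustration under stated assumptions, not HotSpot code: the class names, the int card indices, the backlog threshold that decides when a mutator must process its own queue, and the `refined` list standing in for actual refinement work are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; none of these names exist in HotSpot.
final class DirtyCardQueueSketch {

    /** Global set of completed (full) dirty card queues, shared by all threads. */
    static final class GlobalQueueSet {
        private final ArrayDeque<int[]> completed = new ArrayDeque<>();
        private final int backlogLimit;                      // hypothetical threshold
        final List<Integer> refined = new ArrayList<>();     // stand-in for refinement work

        GlobalQueueSet(int backlogLimit) {
            this.backlogLimit = backlogLimit;
        }

        /** Accepts a full queue, or returns false to tell the caller to process it itself. */
        synchronized boolean offer(int[] fullQueue) {
            if (completed.size() >= backlogLimit) {
                return false;
            }
            completed.add(fullQueue);
            return true;
        }

        /** Called by a concurrent refinement thread; returns false if nothing was pending. */
        synchronized boolean refineOne() {
            int[] queue = completed.poll();
            if (queue == null) {
                return false;
            }
            refine(queue);
            return true;
        }

        /** Placeholder for deciding whether each card must be scanned at the next collection. */
        synchronized void refine(int[] cards) {
            for (int card : cards) {
                refined.add(card);
            }
        }

        synchronized int pending() {
            return completed.size();
        }
    }

    /** Per-thread dirty card queue filled by the write barrier. */
    static final class ThreadLocalQueue {
        private final GlobalQueueSet global;
        private int[] buffer;
        private int size;

        ThreadLocalQueue(GlobalQueueSet global, int capacity) {
            this.global = global;
            this.buffer = new int[capacity];
        }

        void enqueue(int card) {
            buffer[size++] = card;
            if (size < buffer.length) {
                return;
            }
            if (global.offer(buffer)) {
                buffer = new int[buffer.length];   // receive a fresh empty queue
            } else {
                global.refine(buffer);             // mutator processes the entries inline
            }
            size = 0;
        }
    }
}
```

Even in this simplified form, every reference store that dirties a card pays for the enqueue and occasionally for a hand-off or inline processing, which is the overhead the following paragraphs discuss.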

As a result, this refinement mechanism incurs noticeable overhead in several places. The G1 write barrier is significantly larger and more complicated than those of other collectors, so it consumes more execution resources and has a larger code cache footprint. The larger write barrier may also negatively affect compiler decisions during code generation, e.g. for inlining. Additionally, the concurrent dirty card processing, whether done inline or by additional threads, takes additional CPU cycles.

An observation we have made in the past is that concurrent refinement offers limited benefit for certain types of workloads. Examples include throughput-oriented workloads where latency is not the primary concern or workloads that are tuned to minimize old-generation collections. For these cases, G1 could perform better if concurrent refinement could be disabled to allow the use of a simpler write barrier.

Currently, concurrent refinement cannot be disabled completely. G1 creates concurrent GC worker threads to do the refinement work by default. The user can specify e.g. -XX:G1ConcRefinementThreads=0 to disable these worker threads, but the processing of dirty card queues directly by mutator threads cannot be disabled, and so the write barrier cannot be simplified.

Description

We propose a new JVM flag -XX:-G1UseConcRefinement to turn off concurrent refinement and allow G1 to use a new "throughput post-write barrier". By default, G1UseConcRefinement is enabled. If the user specifies -XX:-G1UseConcRefinement, the compilers and the interpreter will emit a simplified post-barrier for a given reference write p.f = q:

if (p and q in same region) -> exit
if (q is NULL) -> exit
if (*card(p) == DIRTY) -> exit
*card(p) = DIRTY
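As a minimal sketch, the four steps above can be modeled in Java over a simulated heap. Everything here is an assumption made for illustration: addresses are plain long offsets, regions are 1 MiB, cards are 512 bytes, address 0 stands for NULL, and the CLEAN and DIRTY byte values are placeholders rather than HotSpot's encodings. The real barrier is emitted as machine code by the interpreter and compilers, not written in Java.

```java
import java.util.Arrays;

// Illustrative model only; not how the barrier is implemented in the VM.
final class ThroughputBarrierSketch {
    static final int LOG_REGION_BYTES = 20;   // assumed 1 MiB regions
    static final int LOG_CARD_BYTES = 9;      // assumed 512-byte cards
    static final byte CLEAN = 1;              // placeholder encodings
    static final byte DIRTY = 0;
    static final long NULL_REF = 0L;          // address 0 models a null reference

    final byte[] cardTable;

    ThroughputBarrierSketch(long heapBytes) {
        cardTable = new byte[(int) (heapBytes >>> LOG_CARD_BYTES)];
        Arrays.fill(cardTable, CLEAN);
    }

    /**
     * Models the post-barrier for the store p.f = q, where fieldAddr is the
     * address of the field p.f. Returns true if a card was newly dirtied.
     */
    boolean postWrite(long fieldAddr, long q) {
        // Filter 1: p and q in the same region, nothing to record.
        if (((fieldAddr ^ q) >>> LOG_REGION_BYTES) == 0) {
            return false;
        }
        // Filter 2: storing null creates no cross-region reference.
        if (q == NULL_REF) {
            return false;
        }
        // Filter 3: the card is already dirty, avoid a redundant write.
        int card = (int) (fieldAddr >>> LOG_CARD_BYTES);
        if (cardTable[card] == DIRTY) {
            return false;
        }
        cardTable[card] = DIRTY;
        return true;
    }
}
```

Note that the model has no queue and no hand-off to other threads: the only side effect of the barrier is a single byte store into the card table.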

To ensure correctness under -XX:-G1UseConcRefinement, during a collection pause G1 now scans all dirty cards mapped to regions not in the collection set, in addition to the remembered sets for regions in the collection set.
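The selection of cards to scan during a pause can be sketched as follows. This is a simplified model, assuming a tiny fixed number of cards per region, a boolean array as the card table, and remembered sets represented as plain sets of card indices keyed by region number; all names are hypothetical and the actual data structures in G1 differ.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative model only; G1's remembered sets and card table are not Java collections.
final class PauseScanSketch {
    static final int CARDS_PER_REGION = 4;   // deliberately tiny for illustration

    /**
     * Returns the card indices to scan in a pause: the remembered-set entries of
     * collection-set regions, plus every dirty card that lies outside the collection set.
     */
    static SortedSet<Integer> cardsToScan(boolean[] dirty,
                                          Set<Integer> collectionSet,
                                          Map<Integer, Set<Integer>> rememberedSets) {
        SortedSet<Integer> result = new TreeSet<>();
        // Remembered sets for regions in the collection set, as before.
        for (int region : collectionSet) {
            Set<Integer> remset = rememberedSets.get(region);
            result.addAll(remset != null ? remset : Collections.<Integer>emptySet());
        }
        // Additionally, all dirty cards mapped to regions not in the collection set.
        for (int card = 0; card < dirty.length; card++) {
            int region = card / CARDS_PER_REGION;
            if (dirty[card] && !collectionSet.contains(region)) {
                result.add(card);
            }
        }
        return result;
    }
}
```

The second loop is the extra pause-time work that replaces concurrent refinement, which is why pause times can grow when many cards are dirtied between collections.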

-XX:-G1UseConcRefinement will improve G1's throughput and reduce overall CPU usage. The simplified write barrier is much shorter and thus also improves the instruction cache hit rate. This mode reduces both the total amount of work for handling a dirty card and the compilation work for JIT compilers. In addition, it reduces memory footprint by shrinking remembered sets and not using per-thread dirty card queues.

Alternatives

Performance testing of alternative throughput post-write barriers has been conducted, ranging from using the same barrier as Parallel GC to more complicated variants.

We found that the first two lines (if (p and q in same region) -> exit and if (q is NULL) -> exit) in the proposed write barrier above effectively filter out cards that would otherwise be dirtied unnecessarily during execution. These cards then do not need to be processed in the GC pause, which reduces GC pause times without impacting throughput. The filter in the third line (if (*card(p) == DIRTY) -> exit) has been kept because:

it does not have noticeable impact on throughput;
it corresponds to conditional card marking other collectors already optionally do (via -XX:+UseCondCardMark) to reduce coherency traffic on larger machines;
by keeping this filter, the proposed barrier can be a complete prefix of the default write barrier. This simplifies, or even makes possible, further enhancements such as dynamically switching between the default and the throughput write barrier in the future.

Some of the overhead of concurrent refinement may be removed by better handling in the compiler and by improved scheduling of the concurrent refinement threads. We expect the gains from these changes to have a significantly smaller impact on throughput than simplifying the barrier as proposed here, because some refinement work that decreases throughput will always remain. Such an effort would be orthogonal to this change.

Testing

To verify the correctness of the new write barrier, existing test cases must pass with -XX:-G1UseConcRefinement to cover the case where concurrent refinement is disabled.
Regarding performance, we intend to compare benchmark scores between -XX:-G1UseConcRefinement and -XX:+G1UseConcRefinement for several well-known benchmarks.

Risks and Assumptions

For certain workloads, it will be harder to meet small pause time goals with concurrent refinement disabled. Examples include workloads with large heaps and a considerable proportion of long-lived objects. We suggest that on such a workload the user should keep concurrent refinement enabled and use the default write barrier.