JEP draft: ZGC: Automatic Heap Sizing

Owner: Erik Österlund
Type: Feature
Scope: Implementation
Status: Submitted
Component: hotspot / gc
Effort: M
Duration: M
Reviewed by: Axel Boldt-Christmas, Vladimir Kozlov
Created: 2024/04/05 09:47
Updated: 2024/09/25 16:26
Issue: 8329758

Summary

Automatically size the heap appropriately by default, adapting dynamically to the workload and environment, when using the Z Garbage Collector (ZGC).

Goals

While using ZGC, then:

Non-Goals

It is not a goal of this effort to:

Success Metrics

The new heap sizing policy should yield a better balance between CPU and memory overhead, compared to the current policy. One way of quantifying this is to combine the CPU and memory overhead percentages. This metric should be lower with the new policy compared to the current policy, for a majority of workloads.
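For example, a configuration whose GC consumes 10% extra CPU while the heap is 30% larger than needed would score 40 under such a combined metric; the new policy should produce a lower score than the current policy on most workloads.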

Motivation

One of the most important goals of the ZGC project has always been to reduce the need for user configuration. Significant effort has been spent on reducing the need for most JVM flags. The most notorious configuration requirement still expected of users is heap sizing. Unfortunately, heap sizing can be rather difficult to reason about, and selecting a good heap size tends to require a deep understanding of the deployed application, and often of the JVM as well. What follows is a discussion of the various difficulties of selecting a heap size.

CPU Versus Memory Tradeoff

The larger the heap is, the less CPU is spent on garbage collection (GC). The practice of increasing the heap size to reduce CPU utilization is usually referred to as adding heap memory headroom. However, the relationship is not linear: the CPU overhead saved per byte of extra heap memory headroom decreases with every byte of headroom added. A heap that is too small leads to excessive garbage collection, which is expensive in CPU use. However, if the heap is sized to minimize the CPU use of garbage collection, it might be several orders of magnitude larger than it needs to be, leading to excessive memory bloat. Finding a good balance can be rather tedious.

Allocation Stalls

To deal with the CPU versus memory tradeoff, users sometimes specify a relatively low value for the maximum heap size with the -Xmx command-line option, then double that value until problems seem to stop. However, the garbage collector (GC) treats this value as a hard limit. If the GC cannot keep up and collect garbage fast enough, the result is an allocation stall: an application thread blocks because it requested heap memory that is not immediately available, and has to wait for the GC to finish and reclaim memory before it can continue running. This is a latency disaster. In many situations it would have been preferable to just use a bit more memory instead of hitting an allocation stall, when more memory is available on the machine. A similar situation can occur when the user does not specify a value for -Xmx and the JVM selects an arbitrary value. The JVM treats this value as a hard limit that might cause allocation stalls, even though plenty more memory is available.

Neighboring Processes

When a user runs a JVM on a system with potentially many programs running concurrently, the JVM should not use an excessive amount of memory, causing other programs to fail. Even if the other programs can still run, memory bloating can have other unexpected consequences, such as disturbing file system caches, which might be important for performance. It is also important that the system does not select a scapegoat JVM that gets punished and pays an excessive burden for other, oblivious JVMs.

Conversely, in containers running a single JVM process, it might prove useful to use a majority of the available RAM of the container, as long as doing so improves performance without causing native out-of-memory issues. Selecting a good heap size for such deployments typically involves taking the available RAM of the computer and subtracting the amount of memory used by everything other than the heap. This means computing the memory used by other subsystems (e.g., the code cache, metaspace, application or library direct-mapped memory, GC implementation-specific heap metadata, and predicted fragmentation levels), which can be tricky to get right without being an expert.

Memory Usage Spikes

The memory demands of some programs can suddenly surge. When this happens, a user might want to increase the heap size to reduce the pressure on the GC. With the -XX:SoftMaxHeapSize option, a user can set a soft limit on how large the Java heap can grow. ZGC strives not to grow beyond this limit, but is still allowed to, up to the maximum heap size. However, it can be difficult to determine what this soft limit should be. For instance, multi-phased programs with fluctuating levels of activity, possibly dependent on input data, add to the difficulty of finding the ideal soft limit.
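As an illustration, a deployment might cap the heap at 16 GB while asking ZGC to stay around 8 GB under normal conditions; because SoftMaxHeapSize is manageable, it can also be adjusted at runtime, for example with jcmd (the sizes below are arbitrary):

    java -XX:+UseZGC -Xmx16g -XX:SoftMaxHeapSize=8g MyApplication
    jcmd <pid> VM.set_flag SoftMaxHeapSize 12g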

GC Barrier Storms

Developers might find it difficult to account for GC barrier storms when sizing the heap. When ZGC accesses object references in the Java heap, it executes additional instructions to perform GC-related bookkeeping; these are called GC barriers. They typically do not cost much, but when the GC switches phases, "storms" of barrier activity occur due to the additional accounting. Therefore, as GC frequency increases, the performance impact of GC barrier storms might become more noticeable. This problem diminishes as the heap size increases.
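The following is a conceptual sketch of the shape of such a barrier on reference loads; it is not ZGC's actual implementation, and needsGcAttention and slowPathFix are illustrative stand-ins for the real phase-dependent checks:

    // Conceptual sketch of a GC load barrier (illustrative, not ZGC code).
    static Object loadReference(Object[] holder, int index) {
        Object ref = holder[index];            // the ordinary load
        if (needsGcAttention(ref)) {           // fast check, usually false
            ref = slowPathFix(holder, index);  // rare: GC bookkeeping slow path
        }
        return ref;
    }

    // Stubs standing in for the real checks and fix-ups.
    static boolean needsGcAttention(Object ref) { return false; }
    static Object slowPathFix(Object[] holder, int index) { return holder[index]; }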

Relation to Concurrent GC Threads

When the GC struggles to keep up with the application's allocation rate, a user can ensure that the GC finishes in a timely fashion, before an allocation stall occurs, by doing one of the following:

  1. Use more concurrent GC threads (specified with the -XX:ConcGCThreads option)
  2. Use a larger heap

However, when too many concurrent GC threads are used, there is a risk of noticeable latency impacts, as the GC threads compete with the application for CPU resources. As a result, the OS might preempt the application to let the GC run. Consequently, it is preferable to minimize the number of concurrent GC threads, and doing so is an important aspect of ZGC's heuristics.

Operating System Interactions

Some modern operating systems (e.g., macOS and Windows) apply page-level memory compression when memory pressure on the computer increases, in an attempt to free up memory. Compression trades memory for CPU: by spending CPU cycles on compressing used memory, a portion of that memory can be given back to the system. However, when running a JVM with a generous heap size, the CPU typically runs hot compressing and decompressing garbage, and the GC eventually forces all compressed memory to be decompressed again as it traverses the heap. A more profitable use of CPU resources is therefore to perform more garbage collection, which also frees up memory on the computer. Removing excessive garbage from the system is typically more CPU-efficient than compressing the garbage.

With manual heap sizing, it can be challenging to find the sweet spot where selecting a smaller heap, with higher GC pressure, wastes less CPU than letting the OS compress and decompress garbage. The problem is not made simpler by the fact that the answer depends on what the rest of the system is doing, which changes over time.

Summary

In general, performance and latency improve as the heap size increases, but with diminishing returns. Even in single-application deployments it can be difficult to configure the heap well, as doing so tends to depend on estimates of how much memory is used by everything other than the heap, which is hard to know without deep knowledge of both the application and the JVM. When multiple processes run on the same computer, it gets increasingly difficult to find a good balance between avoiding Java heap bloat, good performance, and low latency. This JEP proposes a mechanism to improve the default strategy for finding this balance without user input.

Description

This JEP proposes an adaptive heap sizing policy for ZGC. The new policy selects heap sizes within the minimum and maximum size boundaries, which users can set as usual with the -Xms and -Xmx command-line options. However, because setting these flags should matter much less under the adaptive policy, the default maximum and minimum heap sizes will be changed when using ZGC, giving the algorithm better flexibility. The changes will be as follows:

The default aggressiveness of the GC, which affects the heap size, can be controlled by a new JVM flag called -XX:ZGCPressure. Lower values trigger GC less aggressively, and higher values trigger GC more aggressively. The default value tries to strike a reasonable balance between memory and CPU usage for the GC, within the boundaries of the minimum and maximum heap sizes. The flag is manageable, meaning that it may be updated at runtime, if desired.
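For illustration, the pressure could be set on the command line and adjusted later at runtime, for example with jcmd; the values shown are arbitrary, as the JEP does not define a value scale:

    java -XX:+UseZGC -XX:ZGCPressure=5 MyApplication
    jcmd <pid> VM.set_flag ZGCPressure 2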

Rapid Expansion

Avoiding allocation stalls is the most important goal for a low-latency GC. Consider an enterprise-level application booting with an initial heap size of 16 MB on a large computer with hundreds of cores. It will quickly find itself in a situation where the default minimum heap size of 16 MB is not enough; it might require a heap of, say, 160 GB. In this scenario, the initial heap size is about four orders of magnitude too small. The first problem automatic heap sizing faces is recognizing the need for rapid heap expansion.

Expansion during GC

Garbage collection will likely trigger early on, and during that GC an enterprise-level application will likely need more memory before the collection has had time to complete concurrently. Because the heuristic target heap size is a soft limit, the heap simply expands during garbage collection to accommodate the memory requirements, instead of triggering an allocation stall. If garbage collections run back-to-back, the GC has at its disposal at least the number of concurrent GC threads given by -XX:ConcGCThreads, whose default value is currently 25% of the available cores. Even though early rapid heap expansion puts maximum memory pressure on the GC, an enterprise-level application should nevertheless run efficiently.

Allocation Rate Analysis

Another factor used to determine how quickly to expand the heap is an analysis of the allocation rate. The goal is to heuristically limit the frequency of garbage collection. Allocation rate analysis can predict when the heuristic target heap size is going to be too small. With this information, an early, conservative lower bound on the heuristic target heap size can be set. This further improves the ability to recognize the need for expansion early on and to rapidly grow the heap by several orders of magnitude.
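A minimal sketch of what such a prediction could look like, assuming the heuristic wants collections no more frequent than some minimum interval; the names and formula are illustrative, not actual HotSpot code:

    // Conservative lower bound on the target heap size: the live data
    // plus enough headroom to absorb allocations between collections.
    static long targetHeapLowerBound(double allocatedBytesPerSecond,
                                     double minSecondsBetweenGCs,
                                     long estimatedLiveBytes) {
        long headroom = (long) (allocatedBytesPerSecond * minSecondsBetweenGCs);
        return estimatedLiveBytes + headroom;
    }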

Concurrent Heap Committing

Committing and paging in memory can cause latency problems if performed by application threads. Currently, when the user sets -Xms and -Xmx to the same value, ZGC commits the memory upfront. Moreover, when a user specifies the -XX:+AlwaysPreTouch option, the heap memory is paged in before main runs. There is a tradeoff between startup and warmup performance here: AlwaysPreTouch is disabled by default, which favors startup but reduces warmup performance. With the proposed defaults, users would no longer get the benefits of committing memory upfront or paging in heap memory.

However, with this JEP, users can get both the startup benefits of lazy committing and paging and the warmup benefits of committing memory upfront and paging it in. The GC monitors heap usage and concurrently commits and pre-touches heap pages before the application needs them. When the heuristic target size changes based on allocation rate analysis, concurrent committing and uncommitting is triggered at a reasonable pace.
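Pre-touching itself is a simple idea; the sketch below illustrates it over a plain buffer (the real work happens inside the JVM on heap pages, concurrently with the application):

    import java.nio.ByteBuffer;

    // Touch one byte per OS page so the memory is paged in eagerly,
    // instead of application threads paying the cost on first access.
    static void preTouch(ByteBuffer memory, int pageSize) {
        for (int offset = 0; offset < memory.capacity(); offset += pageSize) {
            memory.put(offset, (byte) 0);
        }
    }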

Incremental Tuning

After finding an initial lower bound heap size, the system continuously monitors the behavior of the GC and application, and applies incremental tuning at the end of every GC cycle.

GC CPU Overhead

Without knowing anything about the nature of an application, it can be difficult to guess a reasonable heap size, and such a guess can easily be several orders of magnitude off. For example, a heartbeat application, which only occasionally checks that a service is running, could end up using a large fraction of the computer's memory even though it only requires a small amount. This JEP proposes to choose how much extra CPU overhead garbage collection activity may add to the rest of the application's CPU usage, and then to adjust the heap size over time to accommodate that target GC CPU overhead.

This is based on work by Tavakolisomeh et al. (see https://dl.acm.org/doi/abs/10.1145/3617651.3622988), which proposed using a target GC CPU overhead to automatically size the Java heap when using ZGC. This JEP intentionally does not define the relationship between ZGCPressure and the CPU overhead of GC. The primary reason for not defining what different levels of GC pressure mean is that doing so would make it difficult to evolve and improve these policies over time. Instead, the guiding principle is that higher values of ZGCPressure result in higher CPU pressure but lower memory pressure, and vice versa. The default GC pressure results in a reasonably balanced distribution of pressure across the CPU and memory resources of the computer.

Minor vs Major Collections

Minor collections in ZGC collect only the young generation, where the youngest objects live; they do not collect the entire Java heap. Over time, it is common for minor collections to collect most of the garbage. Major collections are less frequent, but they collect the entire Java heap, including the old generation, where the oldest objects live.

Between two major collections, potentially many minor collections may have run. Therefore, a good way of accurately calculating the CPU overhead is to estimate the accumulated GC CPU time between two major collections, including the minor collections that ran between them. This makes it known how much CPU the generational GC system is imposing, and hence whether the heap should expand or shrink to satisfy the CPU overhead target.

If a sequence of minor collections alone contributes more CPU overhead than the target, then it is already clear that the heap should grow. Waiting for the eventual major collection would only reinforce that observation, because the total CPU overhead is higher than the overhead of the minor collections alone. Therefore, even though a sequence of minor collections carries incomplete information for deciding whether the Java heap should shrink, the heap can still be expanded when it is too small.
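A minimal sketch of this accounting, with illustrative names and types that are not actual HotSpot code:

    // CPU time observed in the window between two major collections,
    // including all minor collections that ran in between.
    record GcWindow(double gcCpuSeconds, double processCpuSeconds) {}

    // Positive: grow the heap; negative: shrink it; zero: leave it alone.
    // Note: per the text above, a shrink decision needs a full window ending
    // in a major collection, while a grow decision can be taken from minor
    // collections alone.
    static int resizeDirection(GcWindow window, double targetGcOverhead) {
        double overhead = window.gcCpuSeconds / window.processCpuSeconds;
        return Double.compare(overhead, targetGcOverhead);
    }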

Social Distancing

Looking at CPU overhead alone is not sufficient for incrementally tuning the heap size. When the GC becomes too frequent, the application can suffer from other impacts, such as GC barrier storms and cache invalidation, which affect performance. Therefore, even if the CPU overhead of garbage collection is very low, one might want to "socially distance" GC cycles to be nicer to the application. The proposed heuristics take this into account and expand the heap to avoid such impacts. The amount of social distancing also depends on the CPU load induced by the applications running on the computer: when the load is higher, the need for social distancing increases, as processes run out of CPU resources and global synchronization becomes more expensive.

Smooth Deltas

The concerns described above can be turned into error signals for expanding or shrinking the heap, as in a control system. Sometimes such error signals can take extreme values. However, extreme increments or decrements of the heap size should be avoided, as the reason for the extreme values can be temporary and misleading; the heap size should be rather stable. Error signals are therefore fed through a sigmoid function, defined as y = 1 / (1 + e^(-x)) + 0.5. This yields values between 0.5 and 1.5 and is almost linear in the middle. The function looks like the following:

The x axis denotes the magnitude of error signals, and the y axis denotes the resulting heap sizing factor.
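In code, the damping is a direct transcription of the formula above; applying the factor to the current target size is an illustrative assumption:

    // Map an error signal to a bounded heap sizing factor in (0.5, 1.5).
    static double heapSizingFactor(double errorSignal) {
        return 1.0 / (1.0 + Math.exp(-errorSignal)) + 0.5;
    }

    // Illustrative use: scale the current target heap size by the factor.
    static long newTargetSize(long currentTargetSize, double errorSignal) {
        return (long) (currentTargetSize * heapSizingFactor(errorSignal));
    }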

Memory Pressure

With the tactics described thus far, a JVM process can automatically find an appropriate heap size, given a default GC pressure. However, multiple processes might be running concurrently, and if each JVM uses as much memory as it desires, the computer may not have enough memory to satisfy everyone.

This JEP proposes a mechanism for reacting when the computer runs low on memory. A small portion of the computer's memory is treated as a reserve that we prefer not to use. The GC pressure of the automatic heap sizing heuristics is scaled by how much of this memory reserve is consumed. The memory usage of the computer is monitored continuously, and as the computer runs low on memory, the GC heuristics work harder to shrink the heap. Before the computer runs out of memory entirely, the GC will work very hard. As the memory reserve gets consumed, the memory pressure increases first linearly, and then exponentially.
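A sketch of the shape of this scaling; the constants and the exact crossover point are illustrative, not specified by the JEP:

    // Scale GC pressure by how much of the memory reserve is consumed:
    // roughly linear while the reserve is mostly intact, exponential as
    // the reserve nears exhaustion.
    static double memoryPressureScale(double reserveConsumed /* 0.0..1.0 */) {
        double linear = 1.0 + reserveConsumed;
        double exponential = Math.exp(6.0 * Math.max(0.0, reserveConsumed - 0.75));
        return linear * exponential;
    }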

Importantly, this mechanism gives all JVMs a unified view of how critical the memory pressure is, which allows processes under the control of these heuristics to reach an equilibrium of GC pressure, as opposed to each randomly reacting to the reactions of other JVMs without a shared goal and strategy.

Even in single-application deployments, keeping a reserve of memory that we prefer not to use is typically a good idea, particularly with a concurrent GC. It allows, for example, file system caches to be populated, typically improving the performance of the system. At the same time, the reserve acts as a safety buffer that can be used to avoid allocation stalls, which can be disastrous for latencies. It is preferable to let go of some file system caches if doing so can prevent an allocation stall.

When using macOS or Windows with memory compression enabled, the ratio of used memory that is compressed versus not compressed is continuously monitored, and the perceived size of the memory reserve is scaled according to that compression ratio. As a consequence, when the OS starts compressing more memory, the GC works harder to reclaim garbage and give memory back to the OS, relieving its compression pressure.
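One plausible reading of this scaling, as a sketch (the exact formula is not specified by the JEP):

    // Shrink the perceived reserve in proportion to how much of the
    // used memory the OS currently keeps compressed.
    static double perceivedReserveBytes(double reserveBytes, double compressedFraction) {
        return reserveBytes * (1.0 - compressedFraction);
    }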

The maximum heap size is dynamically adapted to be at most the memory available on the computer, plus a small critical reserve of memory. Exceeding this threshold results in allocation stalls, and in OutOfMemoryError if the situation cannot be resolved in time.

Generation Sizing

When updating the heap size, the distribution of memory between the young and old generations needs to be reconsidered. In Generational ZGC, there are no hard boundaries between the two generations, and the distribution of memory between the generations is a function of how frequently major collections are triggered, compared to minor collections. A major collection reclaims garbage in both the young generation and the old generation, while a minor collection only reclaims memory in the young generation. By triggering major collections more frequently, the garbage reclaimed from the old generation may be redistributed to satisfy allocations in the young generation.

There are rules for automatically finding a good balance between young and old generation residency. The rule for triggering major collections finds the break-even point where the cost of an old generation collection becomes smaller than the cost of the more frequent minor collections caused by old generation garbage occupying memory that could otherwise satisfy allocations in the young generation.
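Expressed as a sketch, with illustrative cost estimates as inputs:

    // Trigger a major collection once collecting the old generation is
    // estimated to cost less than the extra minor collections caused by
    // old generation garbage occupying memory.
    static boolean shouldTriggerMajor(double oldGenCollectionCost,
                                      double extraMinorCostFromOldGarbage) {
        return oldGenCollectionCost < extraMinorCostFromOldGarbage;
    }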

Since all work is done concurrently in ZGC, choosing the memory distribution between the generations that yields the lowest net CPU cost should have no noticeable latency impact.

Alternatives

Some other managed language runtimes let users tune the memory versus CPU tradeoff of GC by specifying a target residency, i.e., by tuning the tradeoff between live data size and heap size. However, for the same heap residency, the CPU overhead of garbage collecting can vary quite drastically, and aggregating residency also becomes challenging for generational GCs, whereas aggregating the total CPU overhead of collecting multiple generations is more straightforward. Users also tend to monitor and care about CPU overhead the most. However, since multiple parameters are involved, and the rules might grow increasingly complex going forward, the target CPU overhead is not exposed to users as a flag; it would risk becoming misleading.

Testing

This enhancement primarily affects performance metrics. Therefore, it will be thoroughly performance tested with a wide variety of workloads. The defined success metrics will be tested on said workloads.

Risks and Assumptions

By changing the default maximum heap size from 25% of the available memory to most of the available memory, there is a risk that the new heuristics use more memory than the current implementation would, causing other processes to run out of memory. However, with the current 25% heap size policy, and few heuristics trying to limit the heap size, that risk already exists when several JVMs with the default heap size run on the same computer. Moreover, the dynamically updated maximum heap size is very likely to allow throwing OutOfMemoryError before the computer's memory limits are exceeded.