JEP draft: ZGC: Automatic Heap Sizing
Owner | Erik Österlund |
Type | Feature |
Scope | Implementation |
Status | Submitted |
Component | hotspot / gc |
Effort | M |
Duration | M |
Reviewed by | Axel Boldt-Christmas, Vladimir Kozlov |
Created | 2024/04/05 09:47 |
Updated | 2024/12/05 16:24 |
Issue | 8329758 |
Summary
Automatically size the heap appropriately by default, adapting dynamically to the workload and environment, when using the Z Garbage Collector (ZGC).
Goals
When using ZGC:
- Make it likely that the default heap size is appropriate.
- Dynamically adapt the heap size to unpredictable circumstances.
- Performance should not change noticeably compared to a manually tuned heap size.
Non-Goals
It is not a goal of this JEP to:
- Find the optimal heap size.
- Remove the existing configurability of static heap bounds using existing heap sizing JVM options.
Motivation
One of the most important goals of the ZGC project has always been to reduce the need for user configuration. Significant effort has been spent on reducing the need for most GC-related JVM flags. The most noticeable configuration requirement still expected from users is heap sizing. Unfortunately, selecting a good heap size is notoriously difficult, yet it has a large impact on the service level of applications. The range of performant heap sizes for an application depends on many technical details across the entire technology stack, such as:
- The user: The level of work a user exposes an application to.
- The application: The application code might cause different object liveness distributions or allocation rates.
- The libraries: When libraries use direct mapped byte buffers or spin up threads, the amount of memory needed for other things than the heap increases.
- The JVM: The JVM needs to use memory resources for other things than the heap, such as GC metadata, metaspace, code cache, JIT threads, etc.
- The other processes: Memory demands of other processes need to be met.
- The OS: Policies around when to start memory compression or swapping.
- The hardware: Memory availability has a big impact. The time to perform GC depends on the number of CPU cores, memory bandwidth, caching policies, atomic instruction implementation, etc.
Finding the performant heap size range requires conducting experiments in an environment resembling the production environment as closely as possible, measuring throughput, latency and memory usage to find a good balance. Experiments starting with a small heap size and then incrementally selecting larger heap sizes tend to result in the following range of service levels being observed:
- Graceful Failure: The application throws OutOfMemoryError and is unable to successfully execute.
- Stalling: The application is capable of running bytecodes, but spends most of the time waiting for GC to finish in "allocation stalls". This yields poor throughput and latency.
- Concurrent: GC is running concurrently back-to-back. Excessive GC activity yields high CPU pressure and as a result poor latencies and compromised throughput.
- Performant: The GC runs concurrently with a low CPU utilization, resulting in good latency and application throughput.
- Bloated: Even larger heap sizes have diminishing returns for reducing CPU utilization. Bloating the heap unnecessarily may instead cause OS file caches to be sacrificed, potentially compromising performance.
- Unresponsive: The OS starts swapping or compressing memory. Now performance deteriorates rapidly to a crippling level.
- Ungraceful Failure: The required memory is greater than what is available. The process is killed without any OutOfMemoryError being thrown.
Conducting this type of experiment requires closely monitoring application latency, throughput and hardware resource utilization. This is rather tedious work, especially given that it is only the application end user that is positioned to do it. The outcome of the experiments is a heap size selection. However, this number is static. A fundamental problem with this practice is that there can be many dynamic circumstances that greatly impact the range for performant heap sizes, such as:
- Unpredictable CPU usage due to bursts in the workload during peak hours.
- Unpredictable allocation patterns due to phase shifts or one-off events in the application.
- Unpredictable application profiles after upgrading software (application, libraries, JVM, OS, etc.).
- Unpredictable memory usage due to direct mapped byte buffers.
- Unpredictable memory usage due to JVM implementation details (GC metadata, metaspace, code cache, thread stacks).
- Unpredictable memory usage due to other processes.
- Unpredictable proactive OS memory compression policies compressing garbage instead of collecting it.
- Unpredictable memory usage due to fragmentation of JVM internal memory slowly increasing over a long period of time.
Finding a static heap size for all possible dynamic circumstances is not only unreasonably tedious for end users; sometimes it is not even possible to find a performant static heap size configuration, due to the fundamentally dynamic nature of the system.
Description
This JEP proposes an automatic heap sizing policy when using ZGC. This new policy automatically finds a performant heap size when it is possible, dynamically adapting to changing circumstances in the system. It selects heap sizes within the minimum and maximum size boundaries, which users can still set as usual with the -Xms and -Xmx command-line options. However, the default maximum and minimum heap sizes will be changed when using ZGC to give the automatic heap sizing as much flexibility as possible by default. The changes will be as follows:
- Default static minimum and initial heap sizes (-Xms) are changed to 16 MB.
- Default static maximum heap size (-Xmx) is changed to 100% of the available RAM of the computer.
- A new dynamic maximum heap size dynamically adapts to changes in memory availability of the computer.
The default aggressiveness of the GC, which will affect the heap size, can be controlled by a new JVM flag called -XX:ZGCPressure. Lower values will trigger GC less aggressively, and higher values will trigger GC more aggressively. The default value tries to strike a reasonable balance between memory and CPU usage for the GC, within the boundaries of the minimum and maximum heap sizes. The flag is manageable, meaning that it may be updated at runtime, if desired.
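Because the flag is described as manageable, one way to update it at runtime could be through the existing HotSpotDiagnosticMXBean API, which is how other manageable HotSpot flags are changed today. The following is a minimal sketch under that assumption. The class and method names are hypothetical, and the set of valid ZGCPressure values is not defined in this text, so the value is taken from the command line rather than hard-coded.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Sketch: adjust a manageable HotSpot flag on the running JVM. */
final class ManageableFlagSketch {

    /**
     * Tries to set a manageable VM option at runtime.
     * Returns false when the running JVM does not know the option,
     * the option is not writeable, or the value is rejected.
     */
    static boolean trySetVmOption(String name, String value) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            return false; // not a HotSpot JVM
        }
        try {
            bean.setVMOption(name, value);
            return true;
        } catch (IllegalArgumentException e) {
            return false; // unknown flag, not manageable, or invalid value
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: ManageableFlagSketch <pressure-value>");
            return;
        }
        // "ZGCPressure" is the flag name proposed in this JEP. On a JVM
        // that does not implement the proposal, this simply reports false.
        boolean updated = trySetVmOption("ZGCPressure", args[0]);
        System.out.println("ZGCPressure updated: " + updated);
    }
}
```

On a JVM that does not implement this proposal the call reports failure instead of throwing, which keeps the sketch safe to run anywhere.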
Rapid Expansion
Avoiding allocation stalls is the most important goal for a low latency GC. When the JVM boots with an initial heap size of 16 MB on a large computer with hundreds of cores, an enterprise-level application, for example, will quickly find that the default minimum heap size of 16 MB is not enough. This application might require a heap size of, for example, 160 GB. In this scenario, the initial heap size is about 4 orders of magnitude too small. The first problem automatic heap sizing will face is recognizing the need for rapid heap expansion.
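The size of that gap can be checked with a few lines of arithmetic. The sketch below uses the two example sizes from the paragraph above. The class name and the idea of counting doublings are illustrative only and do not describe how ZGC actually grows the heap.

```java
/** Sketch: how far apart are a 16 MB initial heap and a 160 GB requirement? */
final class ExpansionGapSketch {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    /** Ratio between the required and the initial heap size. */
    static double ratio(long initialBytes, long requiredBytes) {
        return (double) requiredBytes / (double) initialBytes;
    }

    /** Number of doublings needed to grow from initialBytes to at least requiredBytes. */
    static int doublingsNeeded(long initialBytes, long requiredBytes) {
        int doublings = 0;
        long size = initialBytes;
        while (size < requiredBytes) {
            size *= 2;
            doublings++;
        }
        return doublings;
    }

    public static void main(String[] args) {
        double r = ratio(16 * MB, 160 * GB);
        System.out.printf("ratio=%.0f, orders of magnitude=%.2f, doublings=%d%n",
                r, Math.log10(r), doublingsNeeded(16 * MB, 160 * GB));
    }
}
```

The ratio is 10240, which is about 4 orders of magnitude, and even a policy that doubled the heap at every step would need 14 steps to close the gap. This is why early recognition of the need to expand matters.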
Expansion during GC
Garbage collection will likely trigger early on, and during that GC, an enterprise-level application will likely need more memory before the garbage collection has time to complete concurrently. Because the heuristic target heap size is a soft limit, the heap will simply expand during garbage collection to accommodate its memory requirements, instead of triggering an allocation stall. Even if garbage collections are running back-to-back, the default value of -XX:ConcGCThreads is currently 25% of the available cores, which bounds the CPU resources the GC consumes. Although an enterprise-level application would generate maximum memory pressure on a GC through early rapid heap expansion, the application should nevertheless run efficiently.
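The difference between a soft target and a hard maximum can be illustrated with a toy model. This is a sketch only: the class, its fields and the Result enum are hypothetical, and it ignores everything a real allocator does apart from the limit checks.

```java
/** Sketch: a heap whose heuristic target is a soft limit and whose maximum is a hard limit. */
final class SoftTargetHeapSketch {
    enum Result { OK, EXPANDED, STALL }

    private final long maxBytes;   // hard limit, e.g. -Xmx or the dynamic maximum
    private long targetBytes;      // heuristic target heap size (soft limit)
    private long usedBytes;

    SoftTargetHeapSketch(long initialTargetBytes, long maxBytes) {
        this.targetBytes = initialTargetBytes;
        this.maxBytes = maxBytes;
    }

    /** Models one allocation request made while a GC may still be running. */
    Result allocate(long bytes) {
        long needed = usedBytes + bytes;
        if (needed > maxBytes) {
            return Result.STALL;       // only the hard limit can stall the allocation
        }
        usedBytes = needed;
        if (needed > targetBytes) {
            targetBytes = needed;      // soft limit: expand instead of stalling
            return Result.EXPANDED;
        }
        return Result.OK;
    }

    long targetBytes() {
        return targetBytes;
    }
}
```

In this model an allocation that exceeds the target simply raises the target, and only a request that would exceed the hard maximum is refused.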
Allocation Rate Analysis
Another factor used to determine how quickly to expand the heap is an analysis of the allocation rate. The goal is to limit the frequency of garbage collection heuristically. Allocation rate analysis can predict when the heuristic target heap size is going to be too small. With this information, an early conservative lower bound on the heuristic target heap size can be specified. This further improves the ability to recognize the need for expansion early on, and to rapidly increase the heap size by several orders of magnitude.
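This text does not give a formula for the lower bound. One plausible reading, shown here purely as an assumption, is that the heap must hold the live data plus whatever the application allocates during the shortest interval one is willing to tolerate between two collections. All names and parameters below are hypothetical.

```java
/** Sketch: derive a conservative lower bound on the target heap size from the allocation rate. */
final class AllocationRateBoundSketch {

    /**
     * Assumed model: live data plus the bytes allocated during the minimum
     * desired interval between two garbage collections.
     */
    static long targetLowerBound(long liveBytes,
                                 double allocatedBytesPerSecond,
                                 double minSecondsBetweenGcs) {
        return liveBytes + (long) Math.ceil(allocatedBytesPerSecond * minSecondsBetweenGcs);
    }

    /** The bound only ever raises the heuristic target, it never lowers it. */
    static long applyLowerBound(long currentTargetBytes, long lowerBoundBytes) {
        return Math.max(currentTargetBytes, lowerBoundBytes);
    }
}
```

With such a bound, a high observed allocation rate can lift the target by several orders of magnitude in a single step rather than through many small increments.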
Concurrent Heap Committing
Committing and paging in memory can cause latency problems if it is performed by application threads. Currently, when the user sets -Xms and -Xmx to the same value, ZGC commits the memory upfront. Moreover, when a user specifies the -XX:+AlwaysPreTouch option, the heap memory is paged in before running main. There is a tradeoff between startup and warmup performance involved here. The AlwaysPreTouch option is disabled by default, which favors startup but reduces warmup performance. With the proposed defaults, users won’t benefit from committing memory upfront or paging in heap memory.
However, with this JEP, users can get both the startup benefits of lazy committing and paging and the warmup benefits of committing memory upfront and paging in memory. The GC monitors the heap usage and concurrently commits and pre-touches heap pages before they are needed by the application. When the heuristic target size changes based on allocation rate analysis, concurrent committing and uncommitting is triggered at a reasonable pace.
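What "a reasonable pace" means is not specified here. A simple way to picture it is a pacer that moves the committed size toward the target by a bounded step on each tick of a background GC thread. The sketch below assumes exactly that; the step bound and all names are hypothetical.

```java
/** Sketch: move the committed heap size toward the target in bounded steps. */
final class CommitPacerSketch {

    /**
     * Returns the committed size after one pacing tick. The change per tick
     * is clamped to maxStepBytes in either direction, so commits (growth)
     * and uncommits (shrinkage) are both spread out over time.
     */
    static long nextCommitted(long committedBytes, long targetBytes, long maxStepBytes) {
        long delta = targetBytes - committedBytes;
        long step = Math.max(-maxStepBytes, Math.min(maxStepBytes, delta));
        return committedBytes + step;
    }

    public static void main(String[] args) {
        long committed = 16;
        long target = 400;
        // Each loop iteration stands in for one tick of a background thread.
        while (committed != target) {
            committed = nextCommitted(committed, target, 128);
            System.out.println("committed = " + committed);
        }
    }
}
```

Because the work happens off the application threads and ahead of demand, the application would not pay the commit and page-in cost at allocation time.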
Incremental Tuning
After finding an initial lower bound heap size, the system continuously monitors the behavior of the GC and application, and applies incremental tuning at the end of every GC cycle.
GC CPU Overhead
Without knowing anything about the nature of an application, it can be difficult to guess a reasonable heap size. Such a guess can easily be several orders of magnitude incorrect. For example, a heartbeat application, which only occasionally checks that a service is running, can start using a large fraction of the computer's memory even though it only requires a small amount. This JEP proposes to guess how much extra CPU overhead garbage collection activity can add to the rest of the application’s CPU usage, and then to adjust the heap size over time to accommodate the target GC CPU overhead.
This is based on work from Tavakolisomeh et al. (see https://dl.acm.org/doi/abs/10.1145/3617651.3622988), which proposed using a target GC CPU overhead to automatically size the Java heap when using ZGC. This JEP intentionally does not define the relationship between ZGCPressure and CPU overhead of GC. The primary reason for not defining what different levels of GC pressure mean is that doing so would make it difficult to evolve and improve these policies over time. Instead, the guiding principle is that higher values for ZGCPressure result in higher CPU pressure but lower memory pressure, and vice versa. The default GC pressure results in a reasonably balanced distribution of pressure across the CPU and memory resources of the computer.
Minor vs Major Collections
Minor collections in ZGC collect only the young generation where the youngest objects live. They don’t collect the entire Java heap. Compared to major collections, it is common for minor collections to collect the most garbage over time. Major collections are less frequent, but they collect the entire Java heap, including the old generation where the oldest objects live.
Between two major collections, potentially many minor collections could have run. Therefore, a good way of accurately calculating the amount of CPU overhead is to estimate the accumulated GC CPU time between two major collections, including the minor collections that have run between them. As a result, the CPU overhead that the generational GC system is imposing is known, and hence whether the heap should expand or shrink to satisfy the CPU overhead target.
If a sequence of running minor collections alone contributes more CPU overhead than the CPU overhead target, then it is already clear that the heap should grow. Waiting for the eventual major collection is only going to reinforce the observation that the heap should increase because the total CPU overhead is higher than the overhead of minor collections alone. Therefore, even though a sequence of minor collections has incomplete information to determine if the Java heap should shrink, the heap can still be expanded when it is too small.
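The asymmetry described above, where minor collections can justify growth early but only a major collection completes the picture needed for shrinking, can be sketched as a small accumulator. Everything here is an assumption made for illustration: the class, the Decision enum, and the definition of overhead as GC CPU time divided by application CPU time over the same window.

```java
/** Sketch: accumulate GC CPU time between two major collections and derive a sizing decision. */
final class GcCpuOverheadSketch {
    enum Decision { EXPAND, SHRINK, KEEP, WAIT }

    private final double targetOverhead; // hypothetical target, e.g. 0.1 for 10%
    private double gcCpuSeconds;         // GC CPU time since the last major collection
    private double appCpuSeconds;        // application CPU time over the same window

    GcCpuOverheadSketch(double targetOverhead) {
        this.targetOverhead = targetOverhead;
    }

    private double overhead() {
        return appCpuSeconds > 0.0 ? gcCpuSeconds / appCpuSeconds : 0.0;
    }

    /** Minor collections give incomplete information: they can only justify expansion. */
    Decision onMinorCollection(double gcCpu, double appCpu) {
        gcCpuSeconds += gcCpu;
        appCpuSeconds += appCpu;
        return overhead() > targetOverhead ? Decision.EXPAND : Decision.WAIT;
    }

    /** A major collection closes the window, so shrinking can be considered too. */
    Decision onMajorCollection(double gcCpu, double appCpu) {
        gcCpuSeconds += gcCpu;
        appCpuSeconds += appCpu;
        double observed = overhead();
        gcCpuSeconds = 0.0;
        appCpuSeconds = 0.0;
        if (observed > targetOverhead) {
            return Decision.EXPAND;
        }
        if (observed < targetOverhead) {
            return Decision.SHRINK;
        }
        return Decision.KEEP;
    }
}
```

Note that a minor collection never returns SHRINK in this model. A low overhead seen so far says nothing about the cost of the major collection still to come.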
Social Distancing
Looking at the CPU overhead alone is not sufficient for incrementally tuning the heap size. When the GC becomes too frequent, the application can suffer from other impacts such as GC barrier storms and cache invalidation, which affect performance. Therefore, even if the CPU overhead of doing garbage collection is very low, one might want to "socially distance" GC cycles to be nicer to the application. The proposed heuristics take this into account and expand the heap to avoid such impacts. The amount of social distancing also depends on the CPU load induced by the application running on the computer. When the load is higher, the need for social distancing increases as processes are running out of CPU resources, and global synchronization becomes more expensive.
Smooth Deltas
The concerns described above can be turned into error signals for expanding or shrinking the heap, like a control system. Sometimes, such error signals can take the form of extreme values. However, extreme increments or decrements of the heap size should be avoided, as the reason for the extreme values can be temporary and misleading. We want the heap size to be rather stable. Error signals are therefore fed through a sigmoid function, which is defined as y = 1/(1 + e^-x) + 0.5. This yields values between 0.5 and 1.5, and is almost linear in the middle. The function looks like the following:
[Figure: plot of the sigmoid function]
The x axis denotes the magnitude of error signals, and the y axis denotes the resulting heap sizing factor.
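The sigmoid above is straightforward to express in code. In the sketch below, sizingFactor implements the formula exactly as given. The nextHeapSize helper, which multiplies the current heap size by the factor, is an assumption about how the factor is applied, and the class and method names are hypothetical.

```java
/** Sketch: damp an error signal into a heap sizing factor between 0.5 and 1.5. */
final class SmoothDeltaSketch {

    /** y = 1/(1 + e^-x) + 0.5, as defined in the text. */
    static double sizingFactor(double errorSignal) {
        return 1.0 / (1.0 + Math.exp(-errorSignal)) + 0.5;
    }

    /** Assumed application of the factor: scale the current heap size by it. */
    static long nextHeapSize(long currentBytes, double errorSignal) {
        return Math.round(currentBytes * sizingFactor(errorSignal));
    }

    public static void main(String[] args) {
        for (double x : new double[] {-10.0, -1.0, 0.0, 1.0, 10.0}) {
            System.out.printf("x=%6.1f factor=%.4f%n", x, sizingFactor(x));
        }
    }
}
```

An error signal of zero yields a factor of exactly 1.0, so the heap size is left unchanged. However extreme the signal, a single step can at most halve the heap or grow it by half.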
Memory Pressure
With the tactics described thus far, a JVM process may automatically find an appropriate heap size, given a default GC pressure. However, it might be that multiple processes are running concurrently with each other, and that if we let the JVM use as much memory as it desires, the computer will not have enough memory available to satisfy everyone.
This JEP proposes a mechanism for reacting to running low on memory on the computer. A small portion of the computer's memory is treated as a reserve that we prefer not to use. The GC pressure of the automatic heap sizing heuristics is scaled by how much of said memory reserve is consumed. The memory usage of the computer is monitored continuously, and as the computer runs low on memory, GC heuristics will work harder to shrink the heap. Before the heap runs out of memory, the GC will work very hard. As the memory reserve gets consumed, the memory pressure increases first linearly, and then exponentially.
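The exact scaling curve is not given in this text. The sketch below shows one possible shape that matches the description: a multiplier that grows linearly while the first part of the reserve is consumed and exponentially after that. The knee position, the growth constant and all names are hypothetical choices made only to keep the example concrete.

```java
/** Sketch: scale GC pressure by how much of the memory reserve has been consumed. */
final class MemoryPressureSketch {
    static final double KNEE = 0.5;   // hypothetical point where growth turns exponential
    static final double GROWTH = 4.0; // hypothetical exponential growth constant

    /** Fraction of the reserve that is consumed, clamped to [0, 1]. */
    static double reserveConsumed(long availableBytes, long reserveBytes) {
        if (reserveBytes <= 0) {
            throw new IllegalArgumentException("reserveBytes must be positive");
        }
        double consumed = 1.0 - (double) availableBytes / (double) reserveBytes;
        return Math.max(0.0, Math.min(1.0, consumed));
    }

    /** Multiplier for the GC pressure: linear up to the knee, exponential beyond it. */
    static double pressureScale(double consumedFraction) {
        double c = Math.max(0.0, Math.min(1.0, consumedFraction));
        if (c <= KNEE) {
            return 1.0 + c;
        }
        // Continuous at the knee: both branches give 1 + KNEE there.
        return (1.0 + KNEE) * Math.exp(GROWTH * (c - KNEE));
    }
}
```

Because every JVM on the machine would compute the multiplier from the same system-wide measurement, they would all see the same pressure level at the same time.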
Importantly, this mechanism gives all JVMs a unified view of how critical memory pressure is, which allows processes under the control of these heuristics to reach an equilibrium of GC pressure, as opposed to randomly reacting to reactions of other JVMs without a shared goal and strategy.
Even in single application deployments, having a reserve of memory that we prefer not to use is typically a good idea, particularly with a concurrent GC. It allows, for example, file caches to be populated, typically improving the performance of the system. At the same time, this acts as a safety buffer that can be used to avoid allocation stalls. Allocation stalls can be disastrous for latencies. It is preferable to let go of some file system caches, if doing so can prevent an allocation stall.
When using macOS or Windows with memory compression enabled, the ratio of compressed to uncompressed used memory is continuously monitored. The perceived size of the memory reserve is scaled according to said compression ratio. As a consequence, when the OS starts compressing more memory, the GC will work harder to reclaim garbage and give memory back to the OS, relieving its compression pressure.
The max heap size is dynamically adapted to be at most the memory available on the computer, plus a small critical reserve of memory. Exceeding said threshold will result in allocation stalls and OutOfMemoryError if the situation cannot be resolved in time.
Generation Sizing
When updating the heap size, the distribution of memory between the young and old generations needs to be reconsidered. In Generational ZGC, there are no hard boundaries between the two generations, and the distribution of memory between the generations is a function of how frequently major collections are triggered, compared to minor collections. A major collection reclaims garbage in both the young generation and the old generation, while a minor collection only reclaims memory in the young generation. By triggering major collections more frequently, the garbage reclaimed from the old generation may be redistributed to satisfy allocations in the young generation.
There are rules to automatically find a good balance between young and old generation residency. The rule for triggering major collections finds the break-even point where the cost of an old generation collection is smaller than the cost of the more frequent minor collections that are triggered because garbage in the old generation occupies memory that could have been used to satisfy allocations in the young generation.
Since all work is done concurrently in ZGC, there should not be a noticeable latency impact of choosing the memory distribution between the generations that yields the lowest net CPU cost.
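The break-even rule can be pictured as an accumulator. The sketch below rests on a simplifying assumption that is not stated in this text: the frequency of minor collections is inversely proportional to the memory available to the young generation, so that with G bytes of old-generation garbage and Y bytes of young capacity, a share G/(Y+G) of each minor collection's cost is attributed to that garbage. All names are hypothetical.

```java
/** Sketch: trigger a major collection once old-generation garbage has cost more than collecting it would. */
final class MajorTriggerSketch {
    private final double majorCostCpuSeconds;   // estimated cost of one major collection
    private double accumulatedExtraMinorCost;   // cost of minors attributed to old garbage

    MajorTriggerSketch(double majorCostCpuSeconds) {
        this.majorCostCpuSeconds = majorCostCpuSeconds;
    }

    /**
     * Records one minor collection and returns true when a major collection
     * has passed the break-even point under the assumed cost model.
     */
    boolean onMinorCollection(double minorCostCpuSeconds,
                              long oldGarbageBytes,
                              long youngCapacityBytes) {
        long total = oldGarbageBytes + youngCapacityBytes;
        // Share of minor collection frequency caused by old garbage: G / (Y + G).
        double share = total > 0 ? (double) oldGarbageBytes / (double) total : 0.0;
        accumulatedExtraMinorCost += minorCostCpuSeconds * share;
        return accumulatedExtraMinorCost > majorCostCpuSeconds;
    }

    /** A major collection reclaims the old garbage, so the accounting starts over. */
    void onMajorCollection() {
        accumulatedExtraMinorCost = 0.0;
    }
}
```

With little old-generation garbage the attributed share stays small and major collections remain rare. As that garbage grows, the accumulated cost reaches the break-even point sooner.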
Alternatives
Some other managed language runtimes enable users to tune the memory versus CPU tradeoff of GC by specifying a target residency, that is, the tradeoff between live data size and heap size. However, for the same heap residency, the CPU overhead of garbage collecting can vary quite drastically, and aggregating this number also becomes challenging for generational GCs. Aggregating the total CPU overhead of garbage collecting with multiple generations is more straightforward. Users tend to monitor and care about CPU overhead the most. However, since there are multiple parameters, and the rules might grow increasingly complex going forward, the target CPU overhead is not exposed to users as a flag; it risks becoming misleading.
Testing
This enhancement primarily affects performance metrics. Therefore, it will be thoroughly performance tested with a wide variety of workloads. The defined success metrics will be tested on said workloads.
Risks and Assumptions
By changing the default maximum heap size from 25% of the available memory to most of the available memory, there is a risk that the new heuristics use more memory than the current implementation would, and other processes run out of memory. However, with a 25% heap size policy and few heuristics to try to limit the heap size, there is already a risk of that happening when several JVMs with the default heap size run on the same computer. Moreover, the dynamically updated max heap size is very likely to be able to throw OutOfMemoryError before exceeding the computer memory limits.