JEP 318: Epsilon: A No-Op Garbage Collector (Experimental)

Owner	Aleksey Shipilev
Type	Feature
Scope	Implementation
Status	Closed / Delivered
Release	11
Component	hotspot / gc
Discussion	hotspot dash gc dash dev at openjdk dot java dot net
Effort	S
Duration	S
Relates to	JEP 304: Garbage Collector Interface
Reviewed by	Andrew Haley, Roman Kennke
Endorsed by	Mikael Vidstedt
Created	2017/02/14 08:23
Updated	2018/09/24 15:53
Issue	8174901

Summary

Develop a GC that handles memory allocation but does not implement any actual memory reclamation mechanism. Once the available Java heap is exhausted, the JVM will shut down.

Goals

Provide a completely passive GC implementation with a bounded allocation limit and the lowest latency overhead possible, at the expense of memory footprint and memory throughput. A successful implementation is an isolated code change, does not touch other GCs, and makes minimal changes in the rest of the JVM.

Non-Goals

It is not a goal to introduce manual memory management features to the Java language and/or JVM. It is not a goal to introduce new APIs to manage the Java heap. It is not a goal to change or clean up internal JVM interfaces to fit this GC.

Motivation

Java implementations are well known for a broad choice of highly configurable GC implementations. The variety of available collectors caters for different needs, even if their configurability makes their functionality intersect. It is sometimes easier to maintain a separate implementation than to pile another configuration option onto an existing GC implementation.

There are a few use cases where a trivial no-op GC proves useful:

Performance testing. Having a GC that does almost nothing is a useful tool for differential performance analysis of other, real GCs. A no-op GC can help to filter out GC-induced performance artifacts, such as GC worker scheduling, GC barrier costs, GC cycles triggered at unfortunate times, locality changes, etc. Moreover, there are latency artifacts that are not GC-induced (e.g., scheduling hiccups, compiler transition hiccups, etc.), and removing the GC-induced artifacts helps to contrast those. For example, a no-op GC makes it possible to estimate the natural "background" latency baseline for low-latency GC work.
Memory pressure testing. For Java code testing, a way to establish a threshold for allocated memory is useful to assert memory pressure invariants. Today, we have to pick up the allocation data from MXBeans, or even resort to parsing GC logs. Having a GC that accepts only a bounded amount of allocation, and fails on heap exhaustion, simplifies testing. For example, knowing that a test should allocate no more than 1 GB of memory, we can configure the no-op GC with -Xmx1g, and let it crash with a heap dump if that constraint is violated.
VM interface testing. For VM development purposes, having a simple GC helps to understand the absolute minimum required from the VM-GC interface to have a functional allocator. For a no-op GC, the interface should not need anything implemented, and a good interface means Epsilon's BarrierSet would just use the no-op barrier implementations from the default implementation. This serves as proof that the VM-GC interface is sane, which is important in light of JEP 304 ("Garbage Collector Interface").
Extremely short lived jobs. A short-lived job might rely on exiting quickly to free the resources (e.g. heap memory). In this case, accepting the GC cycle to futilely clean up the heap is a waste of time, because the heap would be freed on exit anyway. Note that the GC cycle might take a while, because it would depend on the amount of live data in the heap, which can be a lot.
Last-drop latency improvements. For ultra-latency-sensitive applications, where developers are conscious about memory allocations and know the application memory footprint exactly, or even have (almost) completely garbage-free applications, accepting the GC cycle might be a design issue. There are also cases when restarting the JVM -- letting load balancers figure out failover -- is a better recovery strategy than accepting a GC cycle. In those applications, a long GC cycle may be considered the wrong thing to do, because it prolongs the detection of the failure, and ultimately delays recovery.
Last-drop throughput improvements. Even for non-allocating workloads, the choice of GC means choosing the set of GC barriers that the workload has to use, even if no GC cycle actually happens. All OpenJDK GCs are generational (with the notable exceptions of non-mainline Shenandoah and ZGC), and they emit at least one reference write barrier. Avoiding this barrier can bring the last bit of throughput improvement. There are locality caveats to this, see below.
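
For the memory pressure testing case, the MXBean-based approach that is needed today might look roughly like the sketch below. It relies on com.sun.management.ThreadMXBean, a HotSpot-specific extension that may be unsupported or disabled on some JVMs; the class and method names (AllocationBudget, allocatedBy, assertWithin) are hypothetical and exist only for illustration.

```java
import java.lang.management.ManagementFactory;

// Illustrative only: measures allocation through the HotSpot-specific
// com.sun.management.ThreadMXBean. With Epsilon, the same invariant could
// instead be enforced by the heap limit itself, for example (MyTest is a
// placeholder class name):
//   java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xmx1g \
//        -XX:+HeapDumpOnOutOfMemoryError MyTest
final class AllocationBudget {
    private AllocationBudget() {}

    /** Returns the bytes allocated by the calling thread while running the task. */
    static long allocatedBy(Runnable task) {
        com.sun.management.ThreadMXBean threads =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        // Throws UnsupportedOperationException if the JVM lacks this feature,
        // and returns -1 if allocation measurement is disabled.
        long before = threads.getThreadAllocatedBytes(id);
        task.run();
        return threads.getThreadAllocatedBytes(id) - before;
    }

    /** Fails if the task allocates more than budgetBytes on the calling thread. */
    static void assertWithin(long budgetBytes, Runnable task) {
        long used = allocatedBy(task);
        if (used > budgetBytes) {
            throw new AssertionError(
                "allocated " + used + " bytes, budget was " + budgetBytes);
        }
    }
}
```

This only accounts for the calling thread, which is one reason a heap-wide hard limit is simpler for this kind of test.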

Description

Epsilon GC looks and feels like any other OpenJDK GC, enabled with -XX:+UseEpsilonGC.

Epsilon GC works by implementing linear allocation in a single contiguous chunk of allocated memory. This allows for trivial lock-free TLAB (thread-local allocation buffer) issuance code in the GC, which can then reuse the lock-free within-TLAB allocation handled by existing VM code. Issuing TLABs also helps to keep the resident memory taken by a process bounded by what had been actually allocated. Humongous/out-of-TLAB allocations are handled by the same code, because there is little difference between allocating a TLAB and allocating large objects in this scheme.
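
As a rough illustration of linear allocation in one contiguous chunk, the sketch below models the heap as a range of offsets and hands out chunks by advancing a shared pointer with compare-and-set. It is a conceptual model in Java, not the HotSpot code; LinearSpace and its methods are hypothetical names, and offsets stand in for real addresses.

```java
import java.util.concurrent.atomic.AtomicLong;

// Conceptual model of lock-free linear ("bump the pointer") allocation.
// The same allocate() path serves TLAB-sized and large requests alike.
final class LinearSpace {
    private final long capacityBytes;
    private final AtomicLong top = new AtomicLong(0); // next free offset

    LinearSpace(long capacityBytes) {
        if (capacityBytes < 0) {
            throw new IllegalArgumentException("negative capacity");
        }
        this.capacityBytes = capacityBytes;
    }

    /** Reserves sizeBytes and returns the start offset, or -1 when exhausted. */
    long allocate(long sizeBytes) {
        if (sizeBytes <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        while (true) {
            long start = top.get();
            long end = start + sizeBytes;
            // Reject requests that do not fit (or that overflow the counter).
            if (end < start || end > capacityBytes) {
                return -1;
            }
            // Publish the new top; retry if another thread moved it first.
            if (top.compareAndSet(start, end)) {
                return start;
            }
        }
    }

    /** Bytes handed out so far; nothing is ever returned to the space. */
    long used() {
        return top.get();
    }
}
```

A thread would call allocate once per TLAB and then carve objects out of that chunk privately, so the shared pointer is only contended on refills.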

The barrier set used by Epsilon is completely empty/no-op, because the GC does not do any GC cycles, and therefore does not care about the object graph, object marking, object copying, etc. Introducing a new barrier-set implementation is likely to be the most disruptive JVM change in this implementation.
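
To make the contrast concrete, the sketch below compares a no-op write barrier with a simple card-marking barrier of the kind generational collectors use. This is a conceptual model only and does not mirror HotSpot's C++ BarrierSet; all type names are hypothetical, and the 512-byte card size is an illustrative assumption.

```java
// Hook invoked after a reference is stored into a field at the given offset.
interface WriteBarrier {
    void afterReferenceStore(long fieldOffset);
}

// What a collector without GC cycles needs: nothing at all.
final class NoOpBarrier implements WriteBarrier {
    @Override
    public void afterReferenceStore(long fieldOffset) {
        // Intentionally empty: no object graph bookkeeping is required.
    }
}

// What a generational collector typically pays on every reference store.
final class CardMarkingBarrier implements WriteBarrier {
    static final int CARD_SHIFT = 9;               // assumed 512-byte cards
    static final int CARD_SIZE = 1 << CARD_SHIFT;
    private final byte[] cards;

    CardMarkingBarrier(long heapBytes) {
        // One byte per card, rounding the heap size up to whole cards.
        this.cards = new byte[(int) ((heapBytes + CARD_SIZE - 1) >>> CARD_SHIFT)];
    }

    @Override
    public void afterReferenceStore(long fieldOffset) {
        // Extra store on the hot path: mark the card covering the field.
        cards[(int) (fieldOffset >>> CARD_SHIFT)] = 1;
    }

    boolean isDirty(int cardIndex) {
        return cards[cardIndex] == 1;
    }
}
```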

Since the only important part of the runtime interface for Epsilon is that for issuing TLABs, its latency largely depends on the TLAB sizes issued. With arbitrarily large TLABs and arbitrarily large heap, the latency overhead can be described by an arbitrarily low positive value, hence the name. (Alternative origin story: "epsilon" frequently means "empty symbol", which is aligned with the no-op nature of this GC).
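
A back-of-the-envelope model of this relationship: if every TLAB is filled completely before the next one is requested, and out-of-TLAB allocations are ignored, the number of trips into the shared allocation path is the allocated volume divided by the TLAB size. Both simplifications are assumptions, and TlabRefills is a hypothetical helper; the 2K and 4096K sizes are the TLAB bounds that appear in the sample -Xlog:gc output in this section.

```java
// Simplified model: counts TLAB requests needed to allocate a given volume.
final class TlabRefills {
    private TlabRefills() {}

    /** Ceiling of allocatedBytes / tlabBytes; assumes each TLAB is fully used. */
    static long refills(long allocatedBytes, long tlabBytes) {
        if (allocatedBytes < 0 || tlabBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (allocatedBytes + tlabBytes - 1) / tlabBytes;
    }
}
```

Under this model, allocating 1 GB through 2K TLABs takes 524288 refills, while 4096K TLABs need only 256, which is the sense in which larger TLABs push the overhead toward zero.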

Once the Java heap is exhausted, no allocation is possible, no memory reclamation is possible, and therefore we have to fail. There are several options at that point; most are in line with what existing GCs do:

Throw an OutOfMemoryError with a descriptive message.
Perform a heap dump (enabled, as usual, with -XX:+HeapDumpOnOutOfMemoryError)
Fail the JVM hard and optionally perform an external action (through the usual -XX:OnOutOfMemoryError=...), e.g., starting a debugger or notifying an external monitoring system about the failure.
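
One way to observe these failure modes is a small program that retains everything it allocates. The sketch below is an assumption-laden illustration rather than part of the prototype's test suite; HeapFiller and fill are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Retains every chunk it allocates, so no collector could reclaim anything.
final class HeapFiller {
    private HeapFiller() {}

    /**
     * Allocates and retains chunks of chunkBytes until maxChunks is reached.
     * An OutOfMemoryError is deliberately not caught, so that the JVM-level
     * options (for example -XX:+HeapDumpOnOutOfMemoryError) can act on it.
     */
    static int fill(int chunkBytes, int maxChunks) {
        List<byte[]> retained = new ArrayList<>();
        while (retained.size() < maxChunks) {
            retained.add(new byte[chunkBytes]);
        }
        return retained.size();
    }
}
```

Calling fill(1 << 20, Integer.MAX_VALUE) from a main method, with -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC and a small -Xmx, should end in an OutOfMemoryError once the heap is exhausted; adding -XX:+HeapDumpOnOutOfMemoryError should additionally produce a heap dump.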

There is nothing to be done on a System.gc() call, because no memory reclamation code is implemented. The implementation may warn users that the attempt to force the GC was futile.

The prototype runs prove the concept by surviving small workloads and failing predictably on larger ones. The prototype implementation and some tests can be found in the sandbox repository:

$ hg clone http://hg.openjdk.java.net/jdk/sandbox sandbox
$ cd sandbox
$ hg up -r epsilon-gc-branch
$ sh ./configure
$ make images

One can see the difference between the baseline and the patched runtime by using:

$ hg diff -r default:epsilon-gc-branch

Automatically generated webrev: https://builds.shipilev.net/patch-openjdk-epsilon-jdk/

Sample binary builds: https://builds.shipilev.net/openjdk-epsilon-jdk/

Or in Docker:

$ docker pull shipilev/openjdk-epsilon
$ docker run -it --rm shipilev/openjdk-epsilon java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xlog:gc -version
[0.006s][info][gc] Initialized with 2009M heap, resizeable to up to 30718M heap with 128M steps
[0.006s][info][gc] Using TLAB allocation; min: 2K, max: 4096K
[0.006s][info][gc] Using Epsilon GC
openjdk version "11-internal" 2018-03-20
OpenJDK Runtime Environment (build 11-internal+0-nightly-sobornost-builds.shipilev.net-epsilon-jdkX-b32)
OpenJDK 64-Bit Server VM (build 11-internal+0-nightly-sobornost-builds.shipilev.net-epsilon-jdkX-b32, mixed mode)
[0.071s][info][gc] Total allocated: 899 KB
[0.071s][info][gc] Average allocation rate: 12600 KB/sec

Alternatives

Configure existing GCs to never do the cycle. For example, using Serial or Parallel GCs should fit the same latency profile, assuming we can configure their respective heuristics to never do GC cycles before they face complete heap exhaustion (i.e., by pre-sizing the heap, setting a very large young-generation size, disabling adaptive heuristics, etc.). This is hard to guarantee reliably, given the multitude of GC options they provide and the ongoing improvements to those GCs, which would force us to think twice about no-op paths.

Amend existing GCs to never do the cycle. We can add special options to those GCs to make this more reliable, but that might be against those GCs' design goals. For example, protecting most of the code paths in those GCs with DoNotGC does not look significantly better than providing a separate standalone implementation.

Gut the existing GC implementation. The alternative would be to no-op out an existing GC implementation to get a baseline implementation for testing. The problem with this is inconvenience: developers would need to make sure that such an implementation is still correct, that it provides enough performance to be a good baseline, and that it is hooked up to the other runtime facilities (heap dumping, thread stack walking, MXBeans) to support the differential analysis. The implementations for other platforms would require much more work. Having a ready-to-go no-op implementation in the mainline solves this inconvenience.

Gut the existing GC barrier set. There are no existing alternatives that disable all GC barriers, but we can stub out the barrier set for an existing GC. Unfortunately, this raises the same problems as above, and it is compounded by the dire need to disable the GC after this gutting, because the basic invariants the GC expects via barriers would not hold.

Further improvements in the Parallel, G1, and Shenandoah GCs may eventually achieve overheads sufficiently low that a no-op GC is no longer needed. If and when that happens, Epsilon would still be useful for memory pressure and performance testing.

Testing

Common GC tests would not be suitable for Epsilon GC, because most tests assume they can allocate an arbitrary amount of garbage. New tests would need to be developed to test that the GC indeed works well on low-allocating workloads, and that it fails on heap exhaustion in a predictable manner. New jtreg tests under hotspot/gc/epsilon would be enough to assert correctness.

One-off performance testing during the development would be enough to ensure the desired performance characteristics when running with interpreter, C1, and C2 compilers. On-going performance testing is not required since the implementation is intended never to change after the initial implementation, and its performance-sensitive paths are implicitly tested by other GCs.

Risks and Assumptions

Usefulness vs. maintenance costs. It can be argued that such an implementation is useless to have in the product, because no one needs it. Experience, however, tells us that many players in the Java ecosystem have already done this exercise by expunging GC from their custom-built JVMs. That means having a standard no-op GC option would help that part of the ecosystem. Coupled with the low maintenance costs if the implementation proves trivial, this risk is minimal. We also think this risk is minimal if the feature remains available in non-product builds only, under a "develop" flag. Users and downstream distributions may change it to "product" or "experimental" to expose Epsilon to their applications.

Public expectations. Providing a garbage collector that does not in fact do garbage collection may be seen as a dangerous practice. Accidentally enabling Epsilon GC in production may lead to surprise JVM failures when the heap is exhausted. We think this risk is minimal if the feature remains unavailable by default in product builds, under either a "develop" or "experimental" option.

Locality considerations. A non-compacting GC implicitly maintains the object graph in its allocation order. This has an impact on spatial locality, and regular applications may experience a throughput hit if allocations are random or generate lots of sparse garbage. While this may entail some throughput overhead, this is outside of GC control, and would affect most non-moving GCs. Locality-aware application coding would be required to mitigate this drawback, if locality proves to be a problem.

Implementation complexity. It might be the case that the implementation would need more changes in the shared code than anticipated, for example in compilers and platform-specific backends. Our prototype indicates these changes are isolated enough to be benign. If that proves to be a risk, it should be mitigated by JEP 304 ("Garbage Collector Interface").

Dependencies

This work may depend on JEP 304 ("Garbage Collector Interface") to minimize shared code changes. It might not require that interface, however, if the shared code changes are minimal.