JEP 157: G1 GC: NUMA-Aware Allocation

Author: Y. Srinivas Ramakrishna
Owner: Jesper Wilhelmsson
Type: Feature
Scope: Implementation
Status: Closed / Withdrawn
Component: hotspot / gc
Discussion: hotspot dash gc dash dev at openjdk dot java dot net
Effort: S
Duration: M
Duplicates: JEP 345: NUMA-Aware Memory Allocation for G1
Reviewed by: Igor Veresov, Jesper Wilhelmsson, Jon Masamitsu, Paul Hohensee, Tony Printezis
Endorsed by: Mikael Vidstedt
Created: 2011/07/28 20:00
Updated: 2018/09/18 19:31
Issue: 8046147

Summary

Enhance G1 to improve allocation performance on NUMA memory systems.

Non-Goals

Extend NUMA-awareness to work on any OS other than Linux and Solaris, which provide appropriate NUMA interfaces.

Motivation

Modern multi-socket machines are increasingly NUMA, with not all memory equidistant from each socket or core. The more traditional SMPs built on conventional dance-hall architectures are increasingly rare, except perhaps at the very high end, because of the cost and difficulty of scaling up such architectures and the resulting latency and bandwidth limitations of their interconnects. Most modern operating systems, starting with Solaris about a decade ago, now offer interfaces through which the memory topology of the platform can be queried and physical memory preferentially mapped from a specific locality group.

HotSpot's ParallelScavengeHeap has been NUMA-aware for many years now, and this has helped scale the performance of configurations that run a single JVM over multiple sockets, presenting a NUMA platform to the JVM. Certain other HotSpot collectors, most notably the concurrent ones, have not had the benefit of this feature and have not been able to take advantage of such vertical multi-socket NUMA scaling. As large enterprise applications increasingly run with large heaps, need the power of multiple sockets, and yet want the manageability advantage of running within a single JVM, customers using our concurrent collectors will increasingly run up against this scaling bottleneck.

This JEP aims to extend NUMA-awareness to the heap managed by the G1 garbage collector.

Description

G1's heap is organized as a collection of fixed-size regions from what currently happens to be a convex interval of the virtual address space. Generations, or individual logical spaces (such as Eden, Survivor, and Old), are then formed as dynamic disjoint subsets of this collection of regions. A region is typically a set of physical pages, although when using very large pages (say 256M superpages on SPARC), several regions may make up a single physical page.

To make G1's allocation NUMA-aware we shall initially focus on the so-called Eden regions. Survivor regions may be considered in a second enhancement phase, but are not within the scope of this JEP. At a very high level, we want to fix the Eden regions to come from a set of physical pages that are allocated at specific locality groups (henceforth, "lgrps"). The idea is analogous to the NUMA spaces used by ParallelScavengeHeap. Let's call these "per-lgrp region pools", for lack of a better phrase.
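As a rough illustration of the idea, the following sketch shows what such per-lgrp region pools might look like. This is not actual HotSpot/G1 code; all type and member names here are hypothetical.

    // A minimal sketch (not actual HotSpot/G1 code) of per-lgrp Eden region
    // pools; all type and member names here are hypothetical.
    #include <vector>

    struct HeapRegion { };                    // stand-in for G1's fixed-size heap region

    struct PerLgrpEdenPool {
      int lgrp_id;                            // locality group this pool is biased to
      std::vector<HeapRegion*> free_regions;  // regions whose pages are bound to lgrp_id
    };

    class NumaEdenPools {
      std::vector<PerLgrpEdenPool> _pools;    // one pool per locality group
    public:
      explicit NumaEdenPools(int num_lgrps) : _pools(num_lgrps) {
        for (int i = 0; i < num_lgrps; i++) _pools[i].lgrp_id = i;
      }
      // Hand out a free region from the pool biased to the given lgrp, or NULL
      // if that pool is empty (the caller would then fall back to another pool
      // or to the global free list).
      HeapRegion* claim_region(int lgrp_id) {
        PerLgrpEdenPool& pool = _pools[lgrp_id];
        if (pool.free_regions.empty()) return nullptr;
        HeapRegion* r = pool.free_regions.back();
        pool.free_regions.pop_back();
        return r;
      }
    };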

We envisage the lifetime of an Eden region to be roughly as follows:

ParallelScavengeHeap allocates the pages of its survivor spaces across lgrps in round-robin fashion. As mentioned above, NUMA-biasing of survivor regions is not a goal of this JEP.
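Building on the sketch above, the following illustrative routine shows how a mutator's request for a new Eden region (typically triggered by a TLAB refill) might be routed to the pool of its home lgrp, with a simple fallback when that pool is empty. The lgrp_of_current_thread() helper is an assumption, not G1 code; a real implementation would query the OS (for example, lgrp_home() on Solaris, or sched_getcpu() together with numa_node_of_cpu() on Linux).

    // Hypothetical routing of an Eden-region request to the allocating thread's
    // home lgrp pool; builds on the NumaEdenPools sketch above.
    int lgrp_of_current_thread() {
      // Placeholder: a real implementation would ask the OS which locality
      // group (NUMA node) the current thread is running on.
      return 0;
    }

    HeapRegion* new_eden_region_for_current_thread(NumaEdenPools& pools, int num_lgrps) {
      int home = lgrp_of_current_thread();
      if (HeapRegion* r = pools.claim_region(home)) {
        return r;                             // common case: region biased to the home lgrp
      }
      // Fallback: borrow from another lgrp's pool rather than failing the request.
      for (int l = 0; l < num_lgrps; l++) {
        if (l == home) continue;
        if (HeapRegion* r = pools.claim_region(l)) {
          return r;
        }
      }
      return nullptr;                         // no free Eden regions in any pool
    }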

When using large pages, where multiple regions map to the same physical page, things get a bit more complicated. For now, we will finesse this by disabling the NUMA optimizations as soon as the page size exceeds some small multiple of the region size (say 4), and deal with the more general case in a separate, later phase. When the page size is below this threshold, we shall allocate and bias contiguous sets of regions into the per-lgrp Eden pools. This author is not sufficiently familiar with G1's current region allocation policy, but believes that this will likely require some small changes to it to allow allocating a set of regions at a time.
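A rough sketch of the threshold policy just described, assuming the factor of 4 named above; the function and constant names are hypothetical, not existing G1 code.

    // Illustrative gating of the NUMA optimization by page size, following the
    // policy described above; the names and the factor of 4 are hypothetical.
    #include <cstddef>

    const size_t MaxPageToRegionRatioForNUMA = 4;   // "some small multiple" of the region size

    bool numa_eden_biasing_enabled(size_t page_size, size_t region_size) {
      // Disable the optimization once a single physical page spans more than a
      // few regions; the general large-page case is deferred to a later phase.
      return page_size <= MaxPageToRegionRatioForNUMA * region_size;
    }

    // When enabled and the page is larger than a region, regions must be handed
    // out in page-sized groups so that a whole physical page is biased to one lgrp.
    size_t regions_per_allocation_unit(size_t page_size, size_t region_size) {
      return (page_size > region_size) ? page_size / region_size : 1;
    }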

The -XX:+UseNUMA command-line switch should enable the feature for G1 when -XX:+UseG1GC is also specified. If the option is found to perform well for a large class of programs, we may enable it by default on NUMA platforms (as I believe is the case for ParallelScavenge today). Other options related to NUMA adaptation and features should be supported in the same manner as for ParallelScavengeHeap. We should avoid introducing collector-specific NUMA options to the extent possible.
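For example, a user would request the feature with java -XX:+UseG1GC -XX:+UseNUMA. Internally, the gating might amount to little more than the following sketch; the two flag names are the existing HotSpot ones, while the function itself is purely illustrative.

    // Hypothetical gating of G1's NUMA-aware allocation on the existing HotSpot
    // flags; only the flag names are real, the function itself is illustrative.
    extern bool UseNUMA;    // set by -XX:+UseNUMA
    extern bool UseG1GC;    // set by -XX:+UseG1GC

    bool g1_numa_allocation_enabled() {
      // Active only when both flags are set (and, per the text above, only on
      // platforms that expose usable lgrp interfaces).
      return UseNUMA && UseG1GC;
    }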

Testing

Normal testing (with -XX:+UseNUMA as appropriate) should flush out any correctness issues. This JEP assumes the use of NUMA hardware for testing. Targeted performance testing will be done, using a variety of benchmarks and applications on a variety of NUMA and non-NUMA platforms.

Risks and Assumptions

As in the case of the ParallelScavenge collector, the implementation here assumes that most short-lived objects are accessed most often by the thread that allocated them. This is certainly true of the majority of short-lived objects in most object-oriented programs, as experience with ParallelScavenge has already shown us. There is, however, a small class of programs for which this assumption does not quite hold. The benefits also depend on the interplay between the extent of NUMA-ness of the underlying system and the overheads associated with migrating pages on such systems, especially in the face of frequent thread migrations when load is high. Finally, there may be platforms for which the appropriate lgrp interfaces are either not publicly accessible or available, or have not been implemented for other reasons.

There is some risk that the assignment of regions to specific lgrp pools will reduce flexibility in moving regions between the various logical spaces, but we do not consider this a serious impediment.

Somewhat more seriously, the assignment of regions to lgrp pools will cause some internal fragmentation within these pools, not unlike the case of ParallelScavengeHeap. This is a known issue and, given that the unit of lgrp allocation in ParallelScavengeHeap is a page while that of G1 is a region, which may span several (smaller) pages, we typically do not expect the G1 implementation to fare any better than the ParallelScavengeHeap one in this respect.