JEP draft: Parallelize the Full GC Phase in CMS

Authors	Jungwoo Ha, Wessam Hassanein, Hiroshi Yamauchi
Owner	Jungwoo Ha
Type	Feature
Scope	JDK
Status	Closed / Withdrawn
Component	hotspot / gc
Discussion	hotspot dash gc dash dev
Effort	XL
Duration	XL
Created	2015/06/30 20:19
Updated	2018/03/27 14:36
Issue	8130200

Summary

Parallelize the CMS full GC phase, which is currently serial.

Motivation

CMS GC performs a whole heap collection (full GC) in various cases, including concurrent mode failures, promotion failures, and explicit GC requests. The current implementation performs a serial mark-and-sweep GC, which only uses a single thread to execute a whole heap collection. This causes most of the worst case GC pause times when using CMS. It is well-known that mark-and-sweep can be performed in parallel, and it is an obvious win, especially nowadays, as multi-core architectures prevail in most computing environments.

Description

The focus of this JEP is not the implementation itself, which has been completed and which is running in productive environments since years, but rather the integration of the code into the OpenJDK master repositories. Following is a short description of the current implementation which is available from http://cr.openjdk.java.net/~jwha/8086706/webrev.jdk9.00

The original full GC code consists of the following four phases:

Phase 1. Marking phase: mark every live object.

Phase 2. Forwarding phase: Compute where the live objects should moved to. This phase computes the new location of objects by sliding objects to the lower end of the space, and installing forwarding pointers into object headers. It does not move objects, which is done at phase 4. Note that objects can be moved to different spaces, e.g. the eden to the old gen space, more details later. This roughly corresponds to the Space::prepare_for_compaction() function.

Phase 3. Adjust pointer phase: updating pointers/oops to new locations of objects. Note that right after this phase, pointers are pointing to the new locations of objects, but the objects themselves are not yet in their new locations. This roughly corresponds to the Space::adjust_pointers() function.

Phase 4. Compaction phase: moving/copying objects to their new locations. This phase moves the objects to their new locations, which are computed in phase 2. Note that there is an ordering/dependency constraint between objects because we can't overwrite an object before it's copied/evacuated. This roughly corresponds to the Space::compact() function.

The heap has two generations and four spaces: The young generation, which has an eden space and two survivor spaces, which are of type ContiguousSpace. The old generation, which consists of an old space, which is of type CompactibleFreeListSpace. The parallelism is implemented on a per-phase, per-space basis, except for phase 1.

The compaction is performed as follows: First, live objects in the old gen space are slid to the lower address end of the same space. Second, objects in the survivor space are moved right after the last live objects in the old gen space. Third, the objects in the eden space are moved right after the last live objects which were in the survivor space.

In other words, the objects in the young gen are automatically promoted at this full GC. Note that when a space is full, the next space is used. For example, if the old gen/space is full in the middle of the compaction of the eden space. The remaining objects in the eden space are moved into the survivor space. If the survivor space is full, the eden space is used.

Parallel Full GC does not change the base algorithm. It only parallelizes each phase independently. The original GC uses bits in the object header to mark objects. Parallel Full GC uses the mark bitmap for parallelism. The mark bitmap is better for quickly dividing work (live objects in a region) without accessing the objects in the heap for parallel processing, just like the ParallelCompact (UseParallelOldGC).

A mark bitmap adds some memory overhead, compared to the original GC. In order to minimize this, the mark bitmap shares the underlying memory with the existing mark bitmap of the concurrent part of the CMS collector. Since the mark bitmap of the concurrent part covers only the old gen, the new overhead is only for the young gen, which is smaller than the whole heap. There is new code in CMSBitMap that implements the sharing of the underlying memory of the mark bitmap. PMSMarkBitMap is the mark bitmap used for the parallel marking. It shares the memory with CMSBitMap.

Parallel Full GC introduces 'regions' (class PMSRegion) as a unit of parallelism. For example, during the forwarding phase, each worker thread performs the task one region at a time. There is an array of regions for each space (PMSRegionArray) and there is one array per space (for a total of four) for the entire heap (PMSRegionArraySet). Most of the support code, including the PMSRegion, is defined in markSweep.hpp/cpp. Note that some of the support code is supplied as a preprocessor macro for performance reasons (to remove virtual calls for getting object size (obj_size), just like the existing SCAN_AND_FORWARD/_ADJUST/_COMPACT macros do.)

The four phase are extended as follows:

Phase 1. Marking: The marking is done by multiple worker threads, instead of the single VMThread. The marking code records amount of live objects for each region in the regions during marking. This is to avoid having to scan the entire mark bitmap in phase 2, which uses this data.

Phase 2. Forwarding: Based on the per-region live size info from phase 1, this phase computes the destination of the live objects in each region on a per-region basis, in parallel. This is done in two stages:

per-region destinations are computed sequentially, because a region's destination cannot be computed before the destinations of the regions that are at lower addresses than the region are computed.
the forwarding pointer of each object in each region is computed and installed in parallel. As in the original GC, this phase performs the packing of spaces and potentially inter-space moves of objects.

Phase 3. Adjust pointer: This is a fairly straightforward parallelization of pointers updates in GC roots and each heap space.

Phase 4. Compaction: To keep the ordering and dependency constraints of copying, the worker threads scan the region array and acquire regions to work on. The threads then perform the copying of objects on a per-region basis. If the destination region of the live objects in a source region hasn't been evacuated yet, the worker thread working on the source region waits until the destination region is evacuated by another worker thread. There is no deadlock because the region array is scanned in ascending order, and there must be some other thread which must have 'won' the right to work on the destination region.

Several fields were added to NamedThread (thread.hpp/cpp) in order to point to per-gc-worker-thread data structures (marking stacks/oop task queues, etc.) for the convenience of parallel marking. In the original GC, a single set of global/static fields for these data structures (markSweep.hpp) was sufficient because there was only one thread (VMThread) performing the marking.

Testing

This implementation has been deployed in large scale data-center workloads and stress tested for many years without any downsides reported by our users. We have also stress-tested on public benchmark suites.

Risks and Assumptions

Though thoroughly tested, this feature may not be bug-free. However, it is wrapped with a flag (CMSParallelFullGC) and turned off by default, and users can enable it at their own risk.