JEP 192: String Deduplication in G1

OwnerPer Liden
TypeFeature
ScopeImplementation
StatusClosed / Delivered
Release8u20
Componenthotspot / gc
Discussionhotspot dash gc dash dev at openjdk dot java dot net
EffortM
DurationL
Relates toJEP 254: Compact Strings
Reviewed byBengt Rutisson, John Coomes, Jon Masamitsu
Endorsed byMikael Vidstedt
Created2013/11/22 20:00
Updated2017/06/07 22:25
Issue8046182

Summary

Reduce the Java heap live-data set by enhancing the G1 garbage collector so that duplicate instances of String are automatically and continuously deduplicated.

Non-Goals

It not a goal to implement this feature for garbage collectors other than G1.

Motivation

Many large-scale Java applications are currently bottlenecked on memory. Measurements have shown that roughly 25% of the Java heap live data set in these types of applications is consumed by String objects. Further, roughly half of those String objects are duplicates, where duplicates means string1.equals(string2) is true. Having duplicate String objects on the heap is, essentially, just a waste of memory. This project will implement automatic and continuous String deduplication in the G1 garbage collector to avoid wasting memory and reduce the memory footprint.

Description

String Deduplication

The String class has two fields:

private final char[] value
private int hash

The value field is implementation-specific and not observable from outside of the String class itself. The String class does not modify the contents of the char[] array, nor does it synchronize on the array object itself. This means that it can safely and transparently be used by multiple instances of String at the same time.

Deduplicating a String object is conceptually just an re-assignment of the value field, i.e., aString.value = anotherString.value. The actual re-assignment is however done by the VM, which in turn means that the final property of the value field is not a problem in practice.

We are not actually deduplicating the String objects, only their backing character arrays. Deduplicating the actual String object cannot be done safely, since such a change would be observable from the Java application and could cause problems if, for example, the application used that object for synchronization or in some other way relied on the object's identity.

String deduplication will not require any changes to the JDK class library or to any other existing Java code.

Expected Benefit

Measurements done on a large number of Java applications (big and small) have shown the following:

Given that we are only deduplicating character arrays we will still carry the overhead of the String objects (object header, fields, and padding). This overhead is platform/configuration dependent and varies between 24 and 32 bytes. However, given an average String length of 45 characters (90 bytes + array header) there is still a significant win to be had.

Taking the above into account, the actual expected benefit ends up at around 10% heap reduction. Note that this number is a calculated average based on a wide range of applications. The heap reduction for a specific application could vary significantly both up and down.

Implementation

Overview

When garbage collection is performed, live objects on the heap are visited. For each object we visit a check is applied to see if the object is a candidate for string deduplication. If the check indicates that this is a candidate then a reference to the object is inserted into a queue for later processing. A deduplication thread runs in the background and processes the queue. Processing a queue entry means removing it from the queue and attempting to deduplicate the String object it references. A hashtable is used to keep track of all unique character arrays used by String objects. When deduplicating, a lookup is made in this table to see if there is already an identical character array somewhere on the heap. If so, the String object is adjusted to point to that character array, releasing the reference to the original array allowing it to eventually be garbage collected. If the lookup fails the character array is instead inserted into the hashtable so that this array can be shared at some point in the future.

Candidate Selection

Candidate selection is done during young/mixed and full collections. This is a performance sensitive operation since it is applied to all visited objects. An object is considered a deduplication candidate if all of the following statements are true:

Once a String object has been promoted to an old region, or its age is higher than the deduplication age threshold, it will never become a candidate again. This approach avoids making the same object a candidate more than once.

Interned strings are a bit special. These are explicitly deduplicated before being inserted into the StringTable (see below for details on why). These can later also become deduplication candidates if they reach the deduplication age threshold or are evacuated to an old heap region. The second attempt to deduplicate such strings will be in vain, but we have no fast way of filtering them out. This has been shown to not be a problem, as the number of interned strings is usually dwarfed by the number of normal (non-interned) strings.

Deduplication Age Threshold

It is assumed that String objects either live for a very short time or live for a long time. Deduplicating objects that will die soon is just a waste of CPU and memory resources. To avoid deduplicating strings too early the deduplication age theshold dictates how old a String object must be before it will be considered a candidate for deduplication. This threshold will have a reasonable default, but will also be configurable using a VM option.

Deduplication Queue

The deduplication queue actually consists of several queues, one queue per GC worker thread. This allows lock-free and cache-friendly enqueue operations by the GC workers. This is important since these operations are done during a stop-the-world phase.

Deduplication Hashtable

The deduplication hashtable is used to keep track of all unique character arrays (which are attached to String objects) found on the heap. When a deduplication candidate is processed, a lookup is made in the hashtable to see if an identical character array already exists. If the lookup is successful the String object's value field is updated to point to the character array found in the hashtable, allowing the garbage collector to eventually collect the original array. If the lookup fails the character array is instead added to the hashtable so that this array can be shared at some point in the future. A character array is removed from the hashtable when it is garbage collected, i.e., when all String objects referring to it have become unreachable.

The hashtable is dynamically resized to accommodate the current number of table entries. The table has hash buckets with chains for hash collision. If the average chain length goes above or below given thresholds the table grows or shrinks accordingly.

The hashtable is also dynamically rehashed (using a new hash seed) if it becomes severely unbalanced, i.e., a hash chain is significantly longer than average. This is similar to how StringTable handles an unbalanced hashtable.

For workloads that produce a large number of unique strings, where there is little opportunity for deduplication, the hashtable could consume more memory than deduplication frees. In those cases string deduplication should not be enabled. The deduplication statistics printed to the GC log will give guidance in making such decisions.

Deduplication Thread

The deduplication thread is a VM internal thread which runs concurrently with the Java application. This thread is where the actual deduplication work is done. It waits for String object references to appear on the deduplication queue and starts to dequeue them one by one. For each String it dequeues, it computes the string hash code (if needed), looks it up in the deduplication hashtable and possibly deduplicates the string. The deduplication thread maintains deduplication statistics (number of candidates inspected, number of strings deduplicated, etc) which it can print to the GC log.

Interned Strings

When a String is interned (String.intern() is invoked) it will be deduplicated before it is inserted in the StringTable. This ensures that once a String has been interned it will never be deduplicated again. Deduplicating a String after it has been interned has shown to be a bad idea since it will counteract compiler optimizations done for string literals. Some optimizations assume (and rightly so) that the String.value field is never changed to point to a different array. With this knowledge the compiler can emit code with the address of the character array as an immediate value. This optimization allows, for example, String.equals() to do a simple pointer comparison in a fast path. If the array is moved by the GC the address will be adjusted accordingly in such code blocks. However, if String.value is outside of the GC the optimization will silently fail and fall back to the normal (slower) character by character comparison.

Impact on GC Pause Times

The following items can/will affect GC pause times:

The assumption is that a high enough deduplication success rate will balance out most or all of this impact, because deduplication can reduce the amount of work needed in other phases of a GC pause (like reduced amount of objects to evacuate) as well as reduce the GC frequency (due to reduced pressure on the heap).

Command-line Options

The following new command-line options will be made available:

Alternatives

There are numerous other ways of deduplicating String objects.

Testing

jtreg tests will be added to make sure the deduplication works as expected. System tests and performance tests are needed to assess the Java heap live data set reduction and performance regressions/improvements.

Risks and Assumptions

Impact