JEP 250: Store Interned Strings in CDS Archives

Author	Jiangli Zhou
Owner	Ioi Lam
Type	Feature
Scope	JDK
Status	Closed / Delivered
Release	9
Component	hotspot / runtime
Discussion	hotspot dash dev at openjdk dot java dot net
Relates to	JEP 254: Compact Strings
Reviewed by	Karen Kinnear, Mikael Vidstedt
Endorsed by	Mikael Vidstedt
Created	2014/09/24 23:49
Updated	2022/10/03 17:25
Issue	8059092

Summary

Store interned strings in class-data sharing (CDS) archives.

Goals

Reduce memory consumption by sharing the String objects and underlying char array objects amongst different JVM processes.
Only support shared strings for the G1 GC. Shared strings require a pinned region, and G1 is the only HotSpot GC that supports pinning.
Only support 64-bit platforms with compressed object and class pointers.
No significant degradation (< 2-3%) on startup time, string-lookup time, GC pause time, or runtime performance using the usual benchmarks.

Non-Goals

Reducing startup time is not a goal.
Other types of GCs (besides G1) will not be supported.
32-bit platforms will not be supported.

Motivation

Currently, when CDS stores classes into the archive, the CONSTANT_String items in the constant pools are represented by UTF-8 strings. When the class is loaded, the UTF-8 strings are converted into java.lang.String objects on demand. This potentially wastes memory, since each character in each interned string takes up three bytes or more (two bytes in the String, plus one to three bytes in the UTF-8 representation).
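The per-character cost described above can be illustrated with a minimal sketch. The class and method names are hypothetical, object headers are ignored, and standard UTF-8 is used as an approximation of the modified UTF-8 that class files actually contain.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: estimates the payload bytes held for one string constant
// when both the UTF-8 form and the char-array form are kept in memory.
class StringFootprintSketch {
    // Bytes of the UTF-8 form (standard UTF-8 here; class files use a
    // modified UTF-8 that differs for U+0000 and supplementary characters).
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes of the char-array payload: two bytes per UTF-16 code unit.
    static int charArrayBytes(String s) {
        return 2 * s.length();
    }

    public static void main(String[] args) {
        String s = "java/lang/Object";
        int total = utf8Bytes(s) + charArrayBytes(s);
        System.out.println(s.length() + " chars, " + total + " payload bytes");
    }
}
```

For an ASCII-only constant the two forms together come to three bytes per character, which is the lower bound quoted above.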

Also, because the strings are created dynamically, they cannot easily be shared across JVM processes.

Description

At dump time, a designated string space is allocated within the Java heap during heap initialization. When the interned string table and the String objects are written out, pointers to the interned String objects and their underlying char-array objects are modified as if those objects were located in the designated space.

The string table is compressed and then stored in the archive at dump time. The compression technique for the string table is the same as for the shared symbol table (see JDK-8059510). The regular narrow oop encoding and decoding is used to access the shared String objects from the compressed-string table.

On 64-bit platforms with compressed oops, the narrow oops are encoded using offsets (with or without scaling) from the narrow oop base. Currently there are four different encoding modes: 32-bit unscaled, zero based, disjoint heap based, and heap based. Depending on the heap size and the heap minimum base, an appropriate encoding mode is selected. The narrow-oop encoding mode (including the encoding shift) must be the same at both dump time and run time, so that the oops within the shared-string space remain valid at run time. The shared-string space can be considered relocatable, with restrictions, at run time. It is not required to be mapped at the same address as at dump time, but it should be at the same offset from the narrow oop base at dump time and run time. The heap size is not required to be the same at dump time and run time, as long as the same encoding mode is used. The offset of the string space and the oop-encoding mode (and shift) should be stored in the archive for run-time validation. If the encoding mode changes, it will invalidate the encoding of the oop that points to the char array from each shared String. In such cases the shared-string data is ignored while the rest of the shared data can still be used by the VM. A warning indicating that shared strings are not used due to incompatible GC configuration will be reported by the VM.
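The relocation rule above can be sketched with simple arithmetic. This is an illustration, not HotSpot code: the class name, method names, and example addresses are all assumptions. It models a narrow oop as (address - base) >> shift and shows that a value written at dump time still decodes to the right object when the base differs at run time, provided the shift and the offset of the string space from the base are unchanged.

```java
// Illustrative model of narrow-oop encoding; not the HotSpot implementation.
class NarrowOopSketch {
    // Encode a 64-bit address as a 32-bit narrow oop relative to the base.
    static int encode(long address, long base, int shift) {
        return (int) ((address - base) >>> shift);
    }

    // Decode a narrow oop, treating its 32 bits as an unsigned value.
    static long decode(int narrowOop, long base, int shift) {
        return base + (Integer.toUnsignedLong(narrowOop) << shift);
    }

    public static void main(String[] args) {
        long spaceOffset = 0x1000_0000L;   // string space offset from the base (example value)
        long objectOffset = 0x40L;         // object position inside the string space
        long dumpBase = 0x8_0000_0000L;    // narrow oop base at dump time (example value)
        long runBase = 0x10_0000_0000L;    // different base at run time (example value)
        int shift = 3;

        int archived = encode(dumpBase + spaceOffset + objectOffset, dumpBase, shift);

        // Same shift and same offset from the base: the archived value is still valid.
        long decoded = decode(archived, runBase, shift);
        System.out.println(decoded == runBase + spaceOffset + objectOffset);

        // A different shift at run time decodes to the wrong address.
        long wrong = decode(archived, runBase, 0);
        System.out.println(wrong == runBase + spaceOffset + objectOffset);
    }
}
```

The first comparison holds and the second does not, which is why the shift and the string-space offset are the values worth validating at run time.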

At run time, the string space is mapped as part of the Java heap at the same offset from the oop encoding base as at dump time. The mapping starts at the lowest page-aligned address of the string space saved in the archive. The mapped string space contains the shared String and char-array objects. All G1 regions which overlap this mapped space will be marked as pinned; these G1 regions are unavailable for run-time allocation. There may be unused space wasted in a region that partially overlaps, but there should be at most one such region, at the end of the mapping. No patching is required for the oop pointers within the string space since the same narrow oop encoding is used. The shared-string space is writable, but the GC should not write to the oops in the space in order to preserve shareability across different processes. An application that attempts to lock one of these shared strings, and thus writes to the shared space, will get a private copy of the page, and therefore lose the benefit of sharing that particular page. Such cases are rare.
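To make the region accounting above concrete, the following sketch computes which fixed-size regions a mapped range overlaps and how much of the final region is left unused. It is a simplified model and not G1 code: the names are hypothetical, and the example assumes that the mapping starts on a region boundary.

```java
// Illustrative model of region overlap for a mapped range; not G1 code.
class PinnedRegionSketch {
    // Index of the first region touched by a mapping that begins at 'start'.
    static long firstRegion(long heapBase, long regionSize, long start) {
        return (start - heapBase) / regionSize;
    }

    // Index of the last region touched by the mapping [start, start + length).
    static long lastRegion(long heapBase, long regionSize, long start, long length) {
        return (start + length - 1 - heapBase) / regionSize;
    }

    // Bytes between the end of the mapping and the end of its last region.
    static long unusedTail(long heapBase, long regionSize, long start, long length) {
        long last = lastRegion(heapBase, regionSize, start, length);
        long lastRegionEnd = heapBase + (last + 1) * regionSize;
        return lastRegionEnd - (start + length);
    }

    public static void main(String[] args) {
        long heapBase = 1L << 32;                       // example heap base
        long regionSize = 1L << 20;                     // 1 MB regions (example value)
        long start = heapBase + 4 * regionSize;         // region-aligned start (assumption)
        long length = 2 * regionSize + regionSize / 2;  // 2.5 MB of shared strings

        System.out.println("pinned regions: "
            + firstRegion(heapBase, regionSize, start) + ".."
            + lastRegion(heapBase, regionSize, start, length));
        System.out.println("unused bytes: "
            + unusedTail(heapBase, regionSize, start, length));
    }
}
```

With these example values, three regions are pinned and half of the last one is unavailable for allocation.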

The shared-string table is distinct from the regular string table at runtime. Both tables are searched when looking up interned strings. The shared-string table is a read-only table at run time; no entries can be added or removed from it.
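The two-table lookup can be modeled in a few lines. This is a conceptual sketch using ordinary Java collections rather than the VM's hash tables. The class name is hypothetical, and searching the shared table first is an assumption, since the text above only states that both tables are searched.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Conceptual model of interning against a read-only shared table plus a
// regular, growable table. Not the HotSpot StringTable implementation.
class TwoTableInternSketch {
    private final Map<String, String> shared;                      // fixed after construction
    private final Map<String, String> regular = new HashMap<>();   // grows at run time

    TwoTableInternSketch(Collection<String> archivedStrings) {
        Map<String, String> m = new HashMap<>();
        for (String s : archivedStrings) {
            m.putIfAbsent(s, s);
        }
        this.shared = Collections.unmodifiableMap(m);
    }

    // Returns the canonical instance: a shared one if present, otherwise the
    // entry in the regular table, which is added on first use.
    String intern(String s) {
        String hit = shared.get(s);     // assumed lookup order: shared table first
        if (hit != null) {
            return hit;
        }
        return regular.computeIfAbsent(s, k -> k);
    }

    int sharedSize() {
        return shared.size();
    }

    public static void main(String[] args) {
        String archived = new String("hello");
        TwoTableInternSketch table = new TwoTableInternSketch(Collections.singletonList(archived));
        System.out.println(table.intern(new String("hello")) == archived);
        System.out.println(table.sharedSize());
    }
}
```

New strings only ever land in the regular table, so the shared table keeps the same size for the lifetime of the process.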

The G1 string-deduplication table is a separate hash table containing the char arrays for deduplication at runtime. When a string is interned and added to the StringTable, the string is deduplicated and the underlying char array is added to the deduplication table if it is not there already. The deduplication table is not stored into the archive. The deduplication table is populated during VM startup using the shared-string data. As an optimization, the work is done in the G1StringDedupThread (in G1StringDedupThread::run(), after initialize_in_thread()) to reduce startup time. The shared strings' hash values are precomputed and stored in the strings at dump time to avoid the deduplication code writing the hash values at runtime.
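The value of precomputing hashes comes from the fact that a String normally caches its hash code lazily in a field, so the first hash computation on a shared string would write to the shared page. A minimal sketch of computing the same value ahead of time from the character data follows. The class and method names are hypothetical, and the loop is the formula documented for String.hashCode(), not the dump-time code itself.

```java
// Illustrative: computes the String.hashCode() value directly from the
// characters, as a dump-time tool could do before writing the archive.
class PrecomputedHashSketch {
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], using int arithmetic.
    static int stringHash(char[] value) {
        int h = 0;
        for (char c : value) {
            h = 31 * h + c;
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "shared";
        System.out.println(stringHash(s.toCharArray()) == s.hashCode());
    }
}
```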

Testing

Testing for this feature will cover the following areas:

Basic operation of this feature;
Modes that are incompatible with this feature, such as non-G1 GC and uncompressed object/class pointers;
Variation of ordinary-object-pointer encoding between dump time and run time;
Invalid string-file format;
Selected string operations when using this feature, such as interning and string comparison; and
Absence of heap corruption when using this feature, verified using GC diagnostic modes.

Dependences

The serviceability agent needs to be updated to add support for the shared-string table (see JDK-8079830).

With the change proposed by JDK-8054307, the underlying char array will be changed to be a byte array. The code that copies interned strings to the string space and performs deduplication will need to reflect that if and when JDK-8054307 is integrated. The impact should be minimal.