JEP 250: Store Interned Strings in CDS Archives
Author | Jiangli Zhou |
Owner | Ioi Lam |
Type | Feature |
Scope | JDK |
Status | Closed / Delivered |
Release | 9 |
Component | hotspot / runtime |
Discussion | hotspot dash dev at openjdk dot java dot net |
Relates to | JEP 254: Compact Strings |
Reviewed by | Karen Kinnear, Mikael Vidstedt |
Endorsed by | Mikael Vidstedt |
Created | 2014/09/24 23:49 |
Updated | 2022/10/03 17:25 |
Issue | 8059092 |
Summary
Store interned strings in class-data sharing (CDS) archives.
Goals
-
Reduce memory consumption by sharing the String objects and underlying char array objects amongst different JVM processes.
-
Only support shared strings for the G1 GC. Shared strings require a pinned region, and G1 is the only HotSpot GC that supports pinning.
-
Only support 64-bit platforms with compressed object and class pointers.
-
No significant degradation (< 2-3%) on startup time, string-lookup time, GC pause time, or runtime performance using the usual benchmarks.
Non-Goals
-
Reducing startup time is not a goal.
-
Other types of GCs (besides G1) will not be supported.
-
32-bit platforms will not be supported.
Motivation
Currently, when CDS stores classes into the archive, the CONSTANT_String
items in the constant pools are represented by UTF-8 strings. When the
class is loaded, the UTF-8 strings are converted into java.lang.String
objects on demand. This potentially wastes memory, since each character
in each interned string takes up three bytes or more (two bytes in the
String, 1-3 bytes in the UTF-8).
Also, because the strings are created dynamically, they cannot easily be shared across JVM processes.
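The byte counts above can be checked with a small sketch. It assumes standard UTF-8 as a stand-in for the class file's modified UTF-8 (the two differ only for the NUL character and supplementary characters), and it counts only character payload, not object headers. The class and method names are illustrative and are not part of HotSpot.

```java
import java.nio.charset.StandardCharsets;

public class StringFootprintSketch {
    /** Payload bytes of the UTF-8 form, as kept for the constant-pool entry. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Payload bytes of a char-array-backed String: two bytes per char. */
    static int charArrayBytes(String s) {
        return 2 * s.length();
    }

    public static void main(String[] args) {
        String s = "java/lang/Object";
        // Both copies exist once the string has been resolved.
        System.out.println("UTF-8 bytes:      " + utf8Bytes(s));
        System.out.println("char-array bytes: " + charArrayBytes(s));
    }
}
```

For an ASCII-only string each character therefore costs one byte in the UTF-8 form plus two bytes in the char array, which is the three-byte lower bound quoted above.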
Description
At dump time, a designated string space is allocated within the Java heap
during heap initialization. Pointers to the interned String objects and
their underlying char-array objects are modified, as if those objects
are from the designated space, when writing out the interned string table
and the String objects.
The string table is compressed and then stored in the archive at dump
time. The compression technique for the string table is the same as for
the shared symbol table (see JDK-8059510). The regular narrow
oop encoding and decoding is used to access the shared String objects
from the compressed-string table.
On 64-bit platforms with compressed oop pointers, the narrow oops are
encoded using offsets (with or without scaling) from the narrow oop base.
Currently there are four different encoding modes: 32-bit unscaled, zero
based, disjoint heap based, and heap based. Depending on the heap size
and the heap minimum base, an appropriate encoding mode is selected. The
narrow-oop encoding mode (including the encoding shift) must be the same
at both dump time and run time, so that the oop pointers within the
shared string space remain valid at run time. The shared-string space
can be considered relocatable, with restrictions, at runtime. It is not
required to be mapped at the same address as at dump time, but it should
be at the same offset from the narrow oop base at dump time and run time.
The heap size is not required to be the same at dump time and run time,
as long as the same encoding mode is used. The offset of the string
space and the oop-encoding mode (and shift) should be stored in the
archive for run-time validation. If the encoding mode changes, it will
invalidate the encoding of the oop pointer to the char array from each
shared String. In such cases the shared-string data is ignored while
the rest of the shared data can still be used by the VM. A warning
indicating that shared strings are not used due to incompatible GC
configuration will be reported by the VM.
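The relocation rule above can be illustrated with a minimal sketch of base-plus-shift narrow oop arithmetic. This is not HotSpot code: the class name, the field names, and the addresses used are assumptions chosen for illustration. It shows that a narrow oop written at dump time decodes to the same offset from the base at run time when the shift is unchanged, even if the base itself has moved, and that a different shift breaks this.

```java
public class NarrowOopSketch {
    final long base;  // narrow oop base (0 in the unscaled and zero-based modes)
    final int shift;  // encoding shift (0 when unscaled)

    NarrowOopSketch(long base, int shift) {
        this.base = base;
        this.shift = shift;
    }

    /** Encodes an address as a 32-bit offset from the base, scaled by the shift. */
    int encode(long address) {
        return (int) ((address - base) >>> shift);
    }

    /** Decodes a narrow oop back to a full address. */
    long decode(int narrow) {
        return base + ((narrow & 0xFFFFFFFFL) << shift);
    }

    public static void main(String[] args) {
        // Hypothetical dump-time and run-time configurations.
        NarrowOopSketch dump = new NarrowOopSketch(0x800000000L, 3);
        NarrowOopSketch run = new NarrowOopSketch(0x1000000000L, 3);

        long dumpAddress = dump.base + 0x10000040L; // an object in the string space
        int narrow = dump.encode(dumpAddress);

        // Same shift: the object is found at the same offset from the new base.
        System.out.println(run.decode(narrow) - run.base == dumpAddress - dump.base);

        // Different shift: the stored narrow oop no longer points at the object.
        NarrowOopSketch mismatched = new NarrowOopSketch(run.base, 0);
        System.out.println(mismatched.decode(narrow) - mismatched.base == dumpAddress - dump.base);
    }
}
```

Because only the offset and the shift enter the comparison, storing those two values in the archive is enough for a run-time check of this kind.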
At run time, the string space is mapped as part of the Java heap at the
same offset from the oop encoding base as at dump time. The mapping
starts at the lowest page-aligned address of the string space saved in
the archive. The mapped string space contains the shared String and
char-array objects. All G1 regions which overlap this mapped space
will be marked as pinned; these G1 regions are unavailable for run-time
allocation. There may be unused space wasted in a region that partially
overlaps, but there should be at most one such region, at the end of the
mapping. No patching is required for the oop pointers within the string
space since the same narrow oop encoding is used. The shared-string
space is writable, but the GC should not write to the oops in the space
in order to preserve shareability across different processes. An
application that attempts to lock one of these shared strings, and thus
writes to the shared space, will get a private copy of the page, and
therefore lose the benefit of sharing that particular page. Such cases
are rare.
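The region accounting described above can be sketched as simple index arithmetic. The sketch assumes fixed-size regions laid out contiguously from a heap base and a mapping that starts on a region boundary. All names and sizes are hypothetical and do not reflect G1's internal data structures.

```java
public class PinnedRegionSketch {
    /** Index of the first region that overlaps a mapping starting at start. */
    static long firstRegion(long heapBase, long regionSize, long start) {
        return (start - heapBase) / regionSize;
    }

    /** Index of the last region that overlaps a mapping ending at end (exclusive). */
    static long lastRegion(long heapBase, long regionSize, long end) {
        return (end - 1 - heapBase) / regionSize;
    }

    /** Unused bytes between the end of the mapping and the end of its last region. */
    static long wastedBytes(long heapBase, long regionSize, long end) {
        long lastRegionEnd = heapBase + (lastRegion(heapBase, regionSize, end) + 1) * regionSize;
        return lastRegionEnd - end;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        long heapBase = 0, regionSize = mb;
        long start = 4 * mb;                 // region-aligned start (assumption)
        long end = start + 2 * mb + mb / 2;  // a 2.5 MB string space

        System.out.println("pinned regions: " + firstRegion(heapBase, regionSize, start)
                + " through " + lastRegion(heapBase, regionSize, end));
        System.out.println("wasted bytes in last region: " + wastedBytes(heapBase, regionSize, end));
    }
}
```

With a region-aligned start, only the final region can be partially covered, which matches the "at most one such region" observation above.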
The shared-string table is distinct from the regular string table at runtime. Both tables are searched when looking up interned strings. The shared-string table is a read-only table at run time; no entries can be added or removed from it.
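A two-table lookup of this kind might look like the following sketch. It uses ordinary Java maps in place of the VM's hash tables, and the order in which the two tables are consulted is an assumption, since the text only states that both are searched. The class and method names are illustrative.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class TwoTableInternSketch {
    // Stand-in for the shared-string table: fixed once constructed.
    private final Map<String, String> shared;
    // Stand-in for the regular run-time string table.
    private final Map<String, String> dynamic = new HashMap<>();

    TwoTableInternSketch(Map<String, String> archived) {
        this.shared = Collections.unmodifiableMap(new HashMap<>(archived));
    }

    /** Returns the canonical instance, consulting the shared table first (assumed order). */
    String intern(String s) {
        String hit = shared.get(s);
        if (hit != null) {
            return hit;
        }
        // Only the regular table ever gains entries.
        return dynamic.computeIfAbsent(s, k -> k);
    }

    int dynamicSize() {
        return dynamic.size();
    }
}
```

Interning a string that is already in the shared table returns the archived instance and leaves the regular table untouched; any other string is added to the regular table only.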
The G1 string-deduplication table is a separate hash table containing the
char arrays for deduplication at runtime. When a string is interned and
added to the StringTable, the string is deduplicated and the underlying
char array is added to the deduplication table if it is not
there already. The deduplication table is not stored into the archive.
The deduplication table is populated during VM startup using the
shared-string data. As an optimization, the work is done in the
G1StringDedupThread (in G1StringDedupThread::run(), after
initialize_in_thread()) to reduce startup time. The shared strings'
hash values are precomputed and stored in the strings at dump time to
avoid the deduplication code writing the hash values at runtime.
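As an illustration of precomputing hash values, the sketch below computes the hash documented for java.lang.String.hashCode() directly from a char array and stores it next to the data. It assumes that the value stored "in the strings" is this String hash. The ArchivedString record is hypothetical and only stands in for an archived String whose hash field is already filled in.

```java
public class PrecomputedHashSketch {
    /** The hash documented for String.hashCode(): s[0]*31^(n-1) + ... + s[n-1]. */
    static int stringHash(char[] value) {
        int h = 0;
        for (char c : value) {
            h = 31 * h + c;
        }
        return h;
    }

    /** Hypothetical dump-time record: character data plus its hash, computed once. */
    static final class ArchivedString {
        final char[] value;
        final int hash;

        ArchivedString(String s) {
            this.value = s.toCharArray();
            this.hash = stringHash(this.value); // done at "dump time"
        }
    }

    public static void main(String[] args) {
        ArchivedString a = new ArchivedString("shared");
        // A reader only reads the stored hash and never writes to the record.
        System.out.println(a.hash == "shared".hashCode());
    }
}
```

Keeping the hash in the archived data means that run-time readers only load it, so the pages holding the strings are not dirtied by a lazy hash computation.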
Testing
Testing for this feature will cover the following areas:
-
Basic operation of this feature;
-
Modes that are incompatible with this feature, such as non-G1 GC and uncompressed object/class pointers;
-
Variation of ordinary-object-pointer encoding between dump time and run time;
-
Invalid string-file format;
-
Selected string operations when using this feature, such as interning and string comparison; and
-
Ensure that this feature does not cause heap corruption using GC diagnostic modes.
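A check of the "interning and string comparison" kind could be as small as the following sketch. It relies only on behavior guaranteed by the Java language (string literals are interned) and is not taken from the actual test suite. Whether the canonical instance comes from the shared-string table depends on how the VM is launched, which this sketch does not control.

```java
public class InternCheckSketch {
    /** Verifies that interning a fresh copy yields the canonical literal instance. */
    static boolean internReturnsCanonical() {
        String literal = "shared-string-example";
        String copy = new String(literal.toCharArray()); // distinct object, same contents
        return copy != literal
                && copy.equals(literal)
                && copy.intern() == literal;
    }

    public static void main(String[] args) {
        System.out.println(internReturnsCanonical());
    }
}
```

The same assertions should hold with and without the archive in use, since sharing must not change the observable semantics of interning.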
Dependences
The serviceability agent needs to be updated to add support for the shared-string table (see JDK-8079830).
With the change proposed by JDK-8054307, the underlying char array will
be changed to be a byte array. The code that copies interned strings to
the string space and performs deduplication will need to reflect that if
and when JDK-8054307 is integrated. The impact should be minimal.