JEP draft: Deprecate the UTF-16-only String Representation

Owner: Stuart Marks
Type: Feature
Scope: JDK
Status: Draft
Component: core-libs / java.lang
Created: 2025/11/05 23:10
Updated: 2025/11/12 19:22
Issue: 8371379

Summary

Deprecate, for removal, the ability to disable Compact Strings. Compact Strings were introduced in JDK 9 to save memory and are enabled by default. In a future release, the alternative to Compact Strings, a UTF-16-only representation of strings, will be removed.

Motivation

From JDK 1.0 through JDK 8, String objects had a single internal representation: an array of char values. Each char value occupied two bytes (16 bits), and the string data stored in the char array was encoded in UTF-16. Even if every character of a string could be encoded in one byte, the string was encoded in UTF-16 and occupied two bytes per character. This mode of operation is referred to as UTF-16-only mode.

The introduction of Compact Strings in JDK 9 changed the internal representation of String objects. (The change was strictly to the implementation; there were no public API changes.) The new internal representation allowed two forms: one byte per character (with characters encoded in ISO Latin 1) or two bytes per character (encoded in UTF-16). Many strings can be encoded in ISO Latin 1, using only one byte per character and saving considerable space compared to UTF-16-only mode, so Compact Strings were enabled by default.

Accordingly, in JDK 9 and later, there are three possible internal representations for String and three possible code paths for every String operation:

  1. Compact Strings with the ISO Latin 1 encoding;
  2. Compact Strings with the UTF-16 encoding; and
  3. UTF-16-only.
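The three cases above can be illustrated with a minimal sketch. It is not the JDK's implementation: the class and the helpers `fitsLatin1` and `payloadBytes` are hypothetical names introduced here, and the sketch only estimates the size of the character payload, ignoring object and array headers.

```java
// Illustrative sketch only; not the actual java.lang.String implementation.
class CompactStringSketch {

    // Hypothetical helper: true if every char is in the ISO Latin 1 range
    // (U+0000..U+00FF) and can therefore be stored in one byte.
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical helper: approximate payload size in bytes for the given mode.
    static int payloadBytes(String s, boolean compactStrings) {
        if (compactStrings && fitsLatin1(s)) {
            return s.length();      // case 1: Compact Strings, ISO Latin 1
        }
        return 2 * s.length();      // case 2 (Compact Strings, UTF-16) or case 3 (UTF-16-only)
    }

    public static void main(String[] args) {
        String header = "Content-Type";          // 12 chars, all Latin 1
        String cjk = "\u65e5\u672c\u8a9e";       // 3 chars, none in Latin 1

        System.out.println(payloadBytes(header, true));   // prints 12
        System.out.println(payloadBytes(header, false));  // prints 24
        System.out.println(payloadBytes(cjk, true));      // prints 6
        System.out.println(payloadBytes(cjk, false));     // prints 6
    }
}
```

In this sketch, cases 2 and 3 produce the same payload size; they differ in the code path taken to get there, which is the source of the maintenance cost.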

Maintaining three code paths is a maintenance burden. Since Compact Strings are enabled by default, many optimizations have been applied to their code paths, and those paths are tested thoroughly. By contrast, the UTF-16-only code paths have received less optimization and less testing. Indeed, several bugs have occurred only in UTF-16-only mode.

Removing the UTF-16-only representation and supporting code will simplify the implementation of String and reduce the cost of maintaining the JDK.

Description

We propose to deprecate the UTF-16-only mode for removal in a future release. The eventual removal will change only the internal representation of String; there will be no change to any public API, and no compatibility impact on existing source code or binaries.

The only effect of deprecating the UTF-16-only mode is on users of -XX:-CompactStrings. This command line option was introduced in JDK 9 to disable Compact Strings and run the system in UTF-16-only mode, primarily in case an application's workload and data mix caused a performance regression with Compact Strings enabled. In JDK NN, it will continue to work as in JDK 9 but will issue a warning that the ability to run in UTF-16-only mode will be removed in the future.

Application developers should examine their Java startup scripts and configurations to determine whether any of them use the -XX:-CompactStrings option. If this option is used, developers should remove the option (or equivalently, change it to -XX:+CompactStrings) and test their system to assess any impact that might arise from running with Compact Strings enabled.
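As a complement to inspecting scripts by hand, a running application can report whether it was started with the option. The following is a minimal sketch, not part of this proposal: the class and method names are hypothetical, it relies on the standard `RuntimeMXBean.getInputArguments()` call, and it assumes that when the option appears more than once the last occurrence wins. Whether options supplied through environment variables or argument files appear in that list can depend on the JVM implementation.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Illustrative sketch only; the class and method names are hypothetical.
class CompactStringsFlagCheck {

    // Returns true if the given JVM arguments disable Compact Strings.
    // Assumption: the last occurrence of the option takes effect.
    static boolean disablesCompactStrings(List<String> jvmArgs) {
        boolean disabled = false;
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:-CompactStrings")) {
                disabled = true;
            } else if (arg.equals("-XX:+CompactStrings")) {
                disabled = false;
            }
        }
        return disabled;
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM, excluding the arguments to main.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        if (disablesCompactStrings(jvmArgs)) {
            System.out.println("This JVM was started with -XX:-CompactStrings");
        } else {
            System.out.println("No -XX:-CompactStrings option found in the JVM arguments");
        }
    }
}
```

A check like this could be logged once at startup to help locate deployments that still depend on UTF-16-only mode.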

Risks

Unlike other JDK ports, the ARM32 port uses UTF-16-only mode. The ARM32 port has not been tested in Compact Strings mode. It is therefore likely that the ARM32 port will need some work to bring its Compact Strings implementation up to production quality. Alternatively, support for ARM32 could be dropped entirely from the JDK.

Asian languages such as Chinese, Japanese, and Korean (CJK) use many characters that cannot be encoded in a single byte. UTF-16-only mode works reasonably well for string data in these languages. With Compact Strings enabled, the system must perform additional work to check every string to see whether it can be encoded in ISO Latin 1; if it cannot, the string data will be encoded and stored in UTF-16. An application that processes CJK text with Compact Strings enabled may therefore consume extra CPU time checking whether string data can be encoded in ISO Latin 1, only to determine that it must be encoded in UTF-16 anyway. The result may be higher CPU consumption without any net space savings.

A key design assumption underlying the Compact Strings feature is that, even in applications that process mainly CJK string data, there will be many other strings (such as class names, request headers, and URLs) that can be encoded in ISO Latin 1, and that this will result in net space savings and reduced GC overhead. However, there may be some applications for which this assumption does not hold. Installations may choose to run such applications in production with Compact Strings disabled. The Compact Strings implementation has been optimized since its introduction, so the regressions that motivated disabling it may have been mitigated. However, when UTF-16-only mode is eventually removed, these applications will need to either migrate, possibly suffering a performance regression, or stay on older JDK releases indefinitely.

Under UTF-16-only mode, the JNI function GetStringCritical returns a direct pointer to the internal String char array. With Compact Strings, this function makes a copy of the array. Applications that make heavy use of GetStringCritical will see a regression if they switch from UTF-16-only mode to using Compact Strings. There is no obvious mitigation for this problem.