JEP 355: Text Blocks (Preview)

Owner	Jim Laskey
Type	Feature
Scope	SE
Status	Closed / Delivered
Release	13
Component	specification / language
Discussion	amber dash dev at openjdk dot java dot net
Effort	M
Duration	M
Relates to	JEP 326: Raw String Literals (Preview)
	JEP 368: Text Blocks (Second Preview)
	JEP 378: Text Blocks
Reviewed by	Alex Buckley, Brian Goetz
Endorsed by	Brian Goetz
Created	2019/04/16 13:50
Updated	2021/08/28 00:17
Issue	8222530

Summary

Add text blocks to the Java language. A text block is a multi-line string literal that avoids the need for most escape sequences, automatically formats the string in a predictable way, and gives the developer control over format when desired. This is a preview language feature in JDK 13.

History

This is a follow-on effort to explorations begun in JEP 326 (Raw String Literals), which was withdrawn and did not appear in JDK 12.

Goals

Simplify the task of writing Java programs by making it easy to express strings that span several lines of source code, while avoiding escape sequences in common cases.
Enhance the readability of strings in Java programs that denote code written in non-Java languages.
Support migration from string literals by stipulating that any new construct can express the same set of strings as a string literal, and interpret the same escape sequences, and be manipulated like a string literal.

Non-Goals

It is not a goal to define a new reference type, distinct from java.lang.String, for the strings expressed by any new construct.
It is not a goal to define new operators, distinct from +, that take String operands.
Text blocks do not directly support string interpolation. Interpolation may be considered in a future JEP.
Text blocks do not support raw strings, that is, strings which intend no processing of their characters whatsoever.

Motivation

In Java, embedding a snippet of HTML, XML, SQL, or JSON in a string literal "..." usually requires significant editing with escapes and concatenation before the code containing the snippet will compile. The snippet is often difficult to read and arduous to maintain.

More generally, the need to denote short, medium, and long blocks of text in a Java program is near universal, whether the text is code from other programming languages, structured text representing golden files, or messages in natural languages. On the one hand, the Java language recognizes this need by allowing strings of unbounded size and content, but on the other hand, it embodies a design default that strings should be small enough to denote on a single line of a source file, and simple enough to escape easily. The Java language can be regularized by accepting that strings may be large enough to span multiple lines of a source file, and by envisaging that escapes in their content may represent formatting and layout operations as well as individual characters.

Accordingly, it would improve both the readability and the writeability of a broad class of Java programs to have a linguistic mechanism for denoting strings more literally than a string literal -- across multiple lines and without the visual clutter of escapes. In essence, a two-dimensional block of text, rather than a one-dimensional sequence of characters.

HTML example

Using "one-dimensional" string literals

String html = "<html>\n" +
              "    <body>\n" +
              "        <p>Hello, world</p>\n" +
              "    </body>\n" +
              "</html>\n";

Using a "two-dimensional" block of text

String html = """
              <html>
                  <body>
                      <p>Hello, world</p>
                  </body>
              </html>
              """;

SQL example

Using "one-dimensional" string literals

String query = "SELECT `EMP_ID`, `LAST_NAME` FROM `EMPLOYEE_TB`\n" +
               "WHERE `CITY` = 'INDIANAPOLIS'\n" +
               "ORDER BY `EMP_ID`, `LAST_NAME`;\n";

Using a "two-dimensional" block of text

String query = """
               SELECT `EMP_ID`, `LAST_NAME` FROM `EMPLOYEE_TB`
               WHERE `CITY` = 'INDIANAPOLIS'
               ORDER BY `EMP_ID`, `LAST_NAME`;
               """;

Polyglot language example

Using "one-dimensional" string literals

ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
Object obj = engine.eval("function hello() {\n" +
                         "    print('\"Hello, world\"');\n" +
                         "}\n" +
                         "\n" +
                         "hello();\n");

Using a "two-dimensional" block of text

ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
Object obj = engine.eval("""
                         function hello() {
                             print('"Hello, world"');
                         }
                         
                         hello();
                         """);

Description

A text block is a new kind of literal in the Java language. It may be used to denote a string anywhere that a string literal could appear, but offers greater expressiveness and less accidental complexity.

A text block consists of zero or more content characters, enclosed by opening and closing delimiters.

The opening delimiter is a sequence of three double quote characters (""") followed by zero or more white spaces followed by a line terminator. The content begins at the first character after the line terminator of the opening delimiter.

The closing delimiter is a sequence of three double quote characters. The content ends at the last character before the first double quote of the closing delimiter.

The content may include double quote characters directly, unlike the characters in a string literal. The use of \" in a text block is permitted, but not necessary or recommended. Fat delimiters (""") were chosen so that " characters could appear unescaped, and also to visually distinguish a text block from a string literal.

The content may include line terminators directly, unlike the characters in a string literal. The use of \n in a text block is permitted, but not necessary or recommended. For example, the text block:

"""
line 1
line 2
line 3
"""

is equivalent to the string literal:

"line 1\nline 2\nline 3\n"

or a concatenation of string literals:

"line 1\n" +
"line 2\n" +
"line 3\n"

If a line terminator is not required at the end of the string, then the closing delimiter can be placed on the last line of content. For example, the text block:

"""
line 1
line 2
line 3"""

is equivalent to the string literal:

"line 1\nline 2\nline 3"

A text block can denote the empty string, although this is not recommended because it needs two lines of source code:

String empty = """
""";

Here are some examples of ill-formed text blocks:

String a = """""";   // no line terminator after opening delimiter
String b = """ """;  // no line terminator after opening delimiter
String c = """
           ";        // no closing delimiter (text block continues to EOF)
String d = """
           abc \ def
           """;      // unescaped backslash (see below for escape processing)

Compile-time processing

A text block is a constant expression of type String, just like a string literal. However, unlike a string literal, the content of a text block is processed by the Java compiler in three distinct steps:

Line terminators in the content are translated to LF (\u000A). The purpose of this translation is to follow the principle of least surprise when moving Java source code across platforms.
Incidental white space surrounding the content, introduced to match the indentation of Java source code, is removed.
Escape sequences in the content are interpreted. Performing interpretation as the final step means developers can write escape sequences such as \n without them being modified or deleted by earlier steps.

The processed content is recorded in the class file as a CONSTANT_String_info entry in the constant pool, just like the characters of a string literal. The class file does not record whether a CONSTANT_String_info entry was derived from a text block or a string literal.

At run time, a text block is evaluated to an instance of String, just like a string literal. Instances of String that are derived from text blocks are indistinguishable from instances derived from string literals. Two text blocks with the same processed content will refer to the same instance of String due to interning, just like for string literals.

The following sections discuss the compile-time processing in more detail.

1. Line terminators

Line terminators in the content are normalized from CR (\u000D) and CRLF (\u000D\u000A) to LF (\u000A) by the Java compiler. This ensures that the string derived from the content is equivalent across platforms, even if the source code has been translated to a platform encoding (see javac -encoding).

For example, if Java source code that was created on a Unix platform (where the line terminator is LF) is edited on a Windows platform (where the line terminator is CRLF), then without normalization, the content would become one character longer for each line. Any algorithm that relied on LF being the line terminator might fail, and any test that needed to verify string equality with String::equals would fail.

The escape sequences \n (LF), \f (FF), and \r (CR) are not interpreted during normalization; escape processing happens later.

2. Incidental white space

The text blocks in the Motivation were easier to read than their concatenated string literal counterparts, but the obvious interpretation for the content of a text block would include the spaces added to indent the embedded string so that it lines up neatly with the opening delimiter. Here is the HTML example using dots to visualize the spaces that the developer added for indentation:

String html = """
..............<html>
..............    <body>
..............        <p>Hello, world</p>
..............    </body>
..............</html>
..............""";

Since the opening delimiter is generally positioned to appear on the same line as the statement or expression which consumes the text block, there is no real significance to the fact that 14 visualized spaces start each line. Including those spaces in the content would mean the text block denotes a string different from the one denoted by the concatenated string literals. This would hurt migration, and be a recurring source of surprise: it is overwhelmingly likely that the developer does not want those spaces in the string. Also, the closing delimiter is generally positioned to align with the content, which further suggests that the 14 visualized spaces are insignificant.

Spaces may also appear at the end of each line, especially when a text block is populated by copy-pasting snippets from other files (which may themselves have been formed by copy-pasting from yet more files). Here is the HTML example reimagined with some trailing white space, again using dots to visualize spaces:

String html = """
..............<html>...
..............    <body>
..............        <p>Hello, world</p>....
..............    </body>.
..............</html>...
..............""";

Trailing white space is most often unintentional, idiosyncractic, and insignificant. It is overwhelmingly likely that the developer does not care about it. Trailing white space characters are similar to line terminators, in that both are invisible artifacts of the source code editing environment. With no visual guide to the presence of trailing white space characters, including them in the content would be a recurring source of surprise, as it would affect the length, hash code, etc, of the string.

Accordingly, an appropriate interpretation for the content of a text block is to differentiate incidental white space at the start and end of each line, from essential white space. The Java compiler processes the content by removing incidental white space to yield what the developer intended. String::indent can then be used to manage indentation if desired. Using | to visualize margins:

|<html>|
|    <body>|
|        <p>Hello, world</p>|
|    </body>|
|</html>|

The re-indentation algorithm takes the content of a text block whose line terminators have been normalized to LF. It removes the same amount of white space from each line of content until at least one of the lines has a non-white space character in the leftmost position. The position of the opening """ characters has no effect on the algorithm, but the position of the closing """ characters does have an effect if placed on its own line. The algorithm is as follows:

Split the content of the text block at every LF, producing a list of individual lines. Note that any line in the content which was just an LF will become an empty line in the list of individual lines.
Add all non-blank lines from the list of individual lines into a set of determining lines. (Blank lines -- lines that are empty or are composed wholly of white space -- have no visible influence on the indentation. Excluding blank lines from the set of determining lines avoids throwing off step 4 of the algorithm.)
If the last line in the list of individual lines (i.e., the line with the closing delimiter) is blank, then add it to the set of determining lines. (The indentation of the closing delimiter should influence the indentation of the content as a whole -- a "significant trailing line" policy.)
Compute the common white space prefix of the set of determining lines, by counting the number of leading white space characters on each line and taking the minimum count.
Remove the common white space prefix from each non-blank line in the list of individual lines.
Remove all trailing white space from all lines in the modified list of individual lines from step 5. This step collapses wholly-whitespace lines in the modified list so that they are empty, but does not discard them.
Construct the result string by joining all the lines in the modified list of individual lines from step 6, using LF as the separator between lines. If the final line in the list from step 6 is empty, then the joining LF from the previous line will be the last character in the result string.

The escape sequences \b (backspace) and \t (tab) are not interpreted by the algorithm; escape processing happens later.

The re-indentation algorithm will be normative in The Java Language Specification. Developers will have access to it via String::stripIndent, a new instance method.

Significant trailing line policy

Normally, one would format a text block in two ways: first, position the left edge of the content to appear under the first " of the opening delimiter, and second, place the closing delimiter on its own line to appear exactly under the opening delimiter. The resulting string will have no white space at the start of any line, and will not include the trailing blank line of the closing delimiter.

However, because the trailing blank line is considered a determining line, moving it to the left has the effect of reducing the common white space prefix, and therefore reducing the the amount of white space that is stripped from the start of every line. In the extreme case, where the closing delimiter is moved all the way to the left, that reduces the common white space prefix to zero, effectively opting out of white space stripping.

For example, with the closing delimiter moved all the way to the left, there is no incidental white space to visualize with dots:

String html = """
              <html>
                  <body>
                      <p>Hello, world</p>
                  </body>
              </html>
""";

Including the trailing blank line with the closing delimiter, the common white space prefix is zero, so zero white space is removed from the start of each line. The algorithm thus produces: (using | to visualize the left margin)

|              <html>
|                  <body>
|                      <p>Hello, world</p>
|                  </body>
|              </html>

Alternatively, suppose the closing delimiter is not moved all the way to the left, but rather under the t of html so it is eight spaces deeper than the variable declaration:

String html = """
              <html>
                  <body>
                      <p>Hello, world</p>
                  </body>
              </html>
        """;

The spaces visualized with dots are considered to be incidental:

String html = """
........      <html>
........          <body>
........              <p>Hello, world</p>
........          </body>
........      </html>
........""";

Including the trailing blank line with the closing delimiter, the common white space prefix is eight, so eight white spaces are removed from the start of each line. The algorithm thus preserves the essential indentation of the content relative to the closing delimiter:

|      <html>
|          <body>
|              <p>Hello, world</p>
|          </body>
|      </html>

Finally, suppose the closing delimiter is moved slightly to the right of the content:

String html = """
              <html>
                  <body>
                      <p>Hello, world</p>
                  </body>
              </html>
                  """;

The spaces visualized with dots are considered to be incidental:

String html = """
..............<html>
..............    <body>
..............        <p>Hello, world</p>
..............    </body>
..............</html>
..............    """;

The common white space prefix is 14, so 14 white spaces are removed from the start of each line. The trailing blank line is stripped to leave an empty line, which being the last line is then discarded. In other words, moving the closing delimiter to the right of the content has no effect, and the algorithm again preserves the essential indentation of the content:

|<html>
|    <body>
|        <p>Hello, world</p>
|    </body>
|</html>

3. Escape sequences

After the content is re-indented, any escape sequences in the content are interpreted. A text block supports the same escape sequences as a string literal, such as \n, \t, \', \", and \\. See section 3.10.6 of the The Java Language Specification for the full list. Developers will have access to escape processing via String::translateEscapes, a new instance method.

Interpreting escapes as the final step allows developers to use \n, \f, and \r for vertical formatting of a string without it affecting the translation of line terminators in step 1, and to use \b and \t for horizontal formatting of a string without it affecting the removal of incidental white space in step 2. For example, consider this text block that contains the \r escape sequence (CR):

String html = """
              <html>\r
                  <body>\r
                      <p>Hello, world</p>\r
                  </body>\r
              </html>\r
              """;

The CR escapes are not processed until after the line terminators have been normalized to LF. Using Unicode escapes to visualize LF (\u000A) and CR (\u000D), the result is:

|<html>\u000D\u000A
|    <body>\u000D\u000A
|        <p>Hello, world</p>\u000D\u000A
|    </body>\u000D\u000A
|</html>\u000D\u000A

Note that it is legal to use " freely inside a text block, even next to the opening or closing delimiter. For example:

String story = """
    "When I use a word," Humpty Dumpty said,
    in rather a scornful tone, "it means just what I
    choose it to mean - neither more nor less."
    "The question is," said Alice, "whether you
    can make words mean so many different things."
    "The question is," said Humpty Dumpty,
    "which is to be master - that's all."
    """;

However, sequences of three " characters require the escaping of at least one " to avoid mimicking the closing delimiter:

String code = 
    """
    String text = \"""
        A text block inside a text block
    \""";
    """;

Concatenation of text blocks

Text blocks can be used anywhere a string literal can be used. For example, text blocks and string literals may be concatenated interchangeably:

String code = "public void print(Object o) {" +
              """
                  System.out.println(Objects.toString(o));
              }
              """;

However, concatenation involving a text block can become rather clunky. Take this text block as a starting point:

String code = """
              public void print(Object o) {
                  System.out.println(Objects.toString(o));
              }
              """;

Suppose it needs to be changed so that the type of o comes from a variable. Using concatenation, the text block that contains the trailing code will need to start on a new line. Unfortunately, the straightforward insertion of a newline in the program, as below, will cause a long span of white space between the type and the text beginning o :

String code = """
              public void print(""" + type + """
                                                 o) {
                  System.out.println(Objects.toString(o));
              }
              """;

The white space can be removed manually, but this hurts readability of the quoted code:

String code = """
              public void print(""" + type + """
               o) {
                  System.out.println(Objects.toString(o));
              }
              """;

A cleaner alternative is to use String::replace or String::format, as follows:

String code = """
              public void print($type o) {
                  System.out.println(Objects.toString(o));
              }
              """.replace("$type", type);

String code = String.format("""
              public void print(%s o) {
                  System.out.println(Objects.toString(o));
              }
              """, type);

Another alternative involves the introduction of a new instance method, String::formatted, which could be used as follows:

String source = """
                public void print(%s object) {
                    System.out.println(Objects.toString(object));
                }
                """.formatted(type);

Additional Methods

The following methods will be added to support text blocks;

String::stripIndent(): used to strip away incidental white space from the text block content
String::translateEscapes(): used to translate escape sequences
String::formatted(Object... args): simplify value substitution in the text block

Alternatives

Do nothing

Java has prospered for over 20 years with string literals that required newlines to be escaped. IDEs ease the maintainence burden by supporting automatic formatting and concatenation of strings that span several lines of source code. The String class has also evolved to include methods that simplify the processing and formatting of long strings, such as a method that presents a string as a stream of lines. However, strings are such a fundamental part of the Java language that the shortcomings of string literals are apparent to vast numbers of developers. Other JVM languages have also made advances in how long and complex strings are denoted. Unsurprisingly, then, multi-line string literals have consistently been one of the most requested features for Java. Introducing a multi-line construct of low to moderate complexity would have a high payoff.

Allow a string literal to span multiple lines

Multi-line string literals could be introduced in Java simply by allowing line terminators in existing string literals. However, this would do nothing about the pain of escaping " characters. \" is the most frequently occuring escape sequence after \n, because of frequency of code snippets. The only way to avoid escaping " in a string literal would be to provide an alternate delimiter scheme for string literals. Delimiters were much discussed for JEP 326 (Raw String Literals), and the lessons learned were used to inform the design of text blocks, so it would be misguided to upset the stability of string literals.

Adopting another language's multi-string literal

According to Brian Goetz:

Many people have suggested that Java should adopt multi-line string literals from Swift or Rust. However, the approach of “just do what language X does” is intrinsically irresponsible; nearly every feature of every language is conditioned by other features of that language. Instead, the game is to learn from how other languages do things, assess the tradeoffs they’ve chosen (explicitly and implicitly), and ask what can be applied to the constraints of the language we have and user expectations within the community we have.

For JEP 326 (Raw String Literals), we surveyed many modern programming languages and their support for multi-line string literals. The results of these surveys influenced the current proposal, such as the choice of three " characters for delimiters (although there were other reasons for this choice too) and the recognition of the need for automatic indentation management.

Removal of incidental white space

If Java introduced multi-line string literals without support for automatically removing incidental white space, then many developers would write a method to remove it themselves, or lobby for the String class to include a removal method. However, that implies a potentially expensive computation every time the string is instantiated at run time, which would reduce the benefit of string interning. Having the Java language mandate the removal of incidental white space, both in leading and trailing positions, seems the most appropriate solution. Developers can opt out of leading white space removal by careful placement of the closing delimiter.

It is not possible to opt out of trailing white space removal, so in rare contexts where trailing white space is significant (such as two spaces for a <br /> tag in Markdown), developers must resort to unusual measures to force its inclusion, such as the octal escape sequence \040 (ASCII character 32, white space):

"""
The quick brown fox\040\040
jumps over the lazy dog
"""

"""
The quick brown fox""" + "  \n" + """
jumps over the lazy dog
"""

"""
The quick brown fox  |
jumps over the lazy dog
""".replace("|", "");

Raw string literals

For JEP 326 (Raw String Literals), we took a different approach to the problem of denoting strings without escaping newlines and quotes, focusing on the raw-ness of strings. We now believe that this focus was wrong, because while raw string literals could easily span multiple lines of source code, the cost of supporting unescaped delimiters in their content was extreme. This limited the effectiveness of the feature in the multi-line use case, which is a critical one because of the frequency of embedding multi-line (but not truly raw) code snippets in Java programs. A good outcome of the pivot from raw-ness to multi-line-ness was a renewed focus on having a consistent escape language between string literals, text blocks, and related features that may be added in future.

Testing

Tests that use string literals for the creation, interning, and manipulation of instances of String should be duplicated to use text blocks too. Negative tests should be added for corner cases involving line terminators and EOF.

Tests should be added to ensure that text blocks can embed Java-in-Java, Markdown-in-Java, SQL-in-Java, and at least one JVM-language-in-Java.