JEP 368: Text Blocks (Second Preview)
Owner | Jim Laskey |
Type | Feature |
Scope | SE |
Status | Closed / Delivered |
Release | 14 |
Component | specification / language |
Discussion | amber dash dev at openjdk dot java dot net |
Effort | M |
Duration | M |
Relates to | JEP 355: Text Blocks (Preview); JEP 378: Text Blocks |
Reviewed by | Alex Buckley |
Created | 2019/09/30 14:10 |
Updated | 2021/08/28 00:19 |
Issue | 8231623 |
Summary
Add text blocks to the Java language. A text block is a multi-line string literal that avoids the need for most escape sequences, automatically formats the string in a predictable way, and gives the developer control over the format when desired. This is a preview language feature in JDK 14.
History
Text blocks were proposed by JEP 355 in early 2019 as a follow-on to explorations begun in JEP 326 (Raw String Literals), which was withdrawn and did not appear in JDK 12. JEP 355 was targeted to JDK 13 in mid 2019 as a preview feature. Feedback on JDK 13 suggested that text blocks should be previewed again in JDK 14, with the addition of two new escape sequences.
Goals
- Simplify the task of writing Java programs by making it easy to express strings that span several lines of source code, while avoiding escape sequences in common cases.
- Enhance the readability of strings in Java programs that denote code written in non-Java languages.
- Support migration from string literals by stipulating that any new construct can express the same set of strings as a string literal, interpret the same escape sequences, and be manipulated in the same ways as a string literal.
- Add escape sequences for managing explicit white space and newline control.
Non-Goals
- It is not a goal to define a new reference type, distinct from java.lang.String, for the strings expressed by any new construct.
- It is not a goal to define new operators, distinct from +, that take String operands.
- Text blocks do not directly support string interpolation. Interpolation may be considered in a future JEP.
- Text blocks do not support raw strings, that is, strings whose characters are not processed in any way.
Motivation
In Java, embedding a snippet of HTML, XML, SQL, or JSON in a string literal "..." usually requires significant editing with escapes and concatenation before the code containing the snippet will compile. The snippet is often difficult to read and arduous to maintain.
More generally, the need to denote short, medium, and long blocks of text in a Java program is near universal, whether the text is code from other programming languages, structured text representing golden files, or messages in natural languages. On the one hand, the Java language recognizes this need by allowing strings of unbounded size and content; on the other hand, it embodies a design default that strings should be small enough to denote on a single line of a source file (surrounded by " characters), and simple enough to escape easily. This design default is at odds with the large number of Java programs where strings are too long to fit comfortably on a single line.
Accordingly, it would improve both the readability and the writability of a broad class of Java programs to have a linguistic mechanism for denoting strings more literally than a string literal -- across multiple lines and without the visual clutter of escapes. In essence, a two-dimensional block of text, rather than a one-dimensional sequence of characters.
Still, it is impossible to predict the role of every string in Java programs. Just because a string spans multiple lines of source code does not mean that newline characters are desirable in the string. One part of a program may be more readable when strings are laid out over multiple lines, but the embedded newline characters may change the behavior of another part of the program. Accordingly, it would be helpful if the developer had precise control over where newlines appear, and, as a related matter, how much white space appears to the left and right of the "block" of text.
HTML example
Using "one-dimensional" string literals
String html = "<html>\n" +
              "    <body>\n" +
              "        <p>Hello, world</p>\n" +
              "    </body>\n" +
              "</html>\n";
Using a "two-dimensional" block of text
String html = """
              <html>
                  <body>
                      <p>Hello, world</p>
                  </body>
              </html>
              """;
SQL example
Using "one-dimensional" string literals
String query = "SELECT `EMP_ID`, `LAST_NAME` FROM `EMPLOYEE_TB`\n" +
               "WHERE `CITY` = 'INDIANAPOLIS'\n" +
               "ORDER BY `EMP_ID`, `LAST_NAME`;\n";
Using a "two-dimensional" block of text
String query = """
               SELECT `EMP_ID`, `LAST_NAME` FROM `EMPLOYEE_TB`
               WHERE `CITY` = 'INDIANAPOLIS'
               ORDER BY `EMP_ID`, `LAST_NAME`;
               """;
Polyglot language example
Using "one-dimensional" string literals
ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
Object obj = engine.eval("function hello() {\n" +
                         "    print('\"Hello, world\"');\n" +
                         "}\n" +
                         "\n" +
                         "hello();\n");
Using a "two-dimensional" block of text
ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
Object obj = engine.eval("""
                         function hello() {
                             print('"Hello, world"');
                         }

                         hello();
                         """);
Description
This section is identical to the same section in this JEP's predecessor, JEP 355, except for the addition of the subsection on new escape sequences.
A text block is a new kind of literal in the Java language. It may be used to denote a string anywhere that a string literal could appear, but offers greater expressiveness and less accidental complexity.
A text block consists of zero or more content characters, enclosed by opening and closing delimiters.
The opening delimiter is a sequence of three double quote characters (""") followed by zero or more white spaces followed by a line terminator. The content begins at the first character after the line terminator of the opening delimiter.
The closing delimiter is a sequence of three double quote characters. The content ends at the last character before the first double quote of the closing delimiter.
The content may include double quote characters directly, unlike the characters in a string literal. The use of \" in a text block is permitted, but not necessary or recommended. Fat delimiters (""") were chosen so that " characters could appear unescaped, and also to visually distinguish a text block from a string literal.
The content may include line terminators directly, unlike the characters in a string literal. The use of \n in a text block is permitted, but not necessary or recommended. For example, the text block:
"""
line 1
line 2
line 3
"""
is equivalent to the string literal:
"line 1\nline 2\nline 3\n"
or a concatenation of string literals:
"line 1\n" +
"line 2\n" +
"line 3\n"
If a line terminator is not required at the end of the string, then the closing delimiter can be placed on the last line of content. For example, the text block:
"""
line 1
line 2
line 3"""
is equivalent to the string literal:
"line 1\nline 2\nline 3"
A text block can denote the empty string, although this is not recommended because it needs two lines of source code:
String empty = """
""";
Here are some examples of ill-formed text blocks:
String a = """"""; // no line terminator after opening delimiter
String b = """ """; // no line terminator after opening delimiter
String c = """
"; // no closing delimiter (text block continues to EOF)
String d = """
abc \ def
"""; // unescaped backslash (see below for escape processing)
Compile-time processing
A text block is a constant expression of type String, just like a string literal. However, unlike a string literal, the content of a text block is processed by the Java compiler in three distinct steps:
1. Line terminators in the content are translated to LF (\u000A). The purpose of this translation is to follow the principle of least surprise when moving Java source code across platforms.
2. Incidental white space surrounding the content, introduced to match the indentation of Java source code, is removed.
3. Escape sequences in the content are interpreted. Performing interpretation as the final step means developers can write escape sequences such as \n without them being modified or deleted by earlier steps.
The processed content is recorded in the class file as a CONSTANT_String_info entry in the constant pool, just like the characters of a string literal. The class file does not record whether a CONSTANT_String_info entry was derived from a text block or a string literal.
At run time, a text block is evaluated to an instance of String, just like a string literal. Instances of String that are derived from text blocks are indistinguishable from instances derived from string literals. Two text blocks with the same processed content will refer to the same instance of String due to interning, just like for string literals.
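The equivalence between text blocks and string literals can be checked directly. The following is a minimal sketch, not part of the proposal itself: the class and member names are hypothetical, and it assumes a JDK where text blocks are enabled (JDK 15 or later, or JDK 14 with --enable-preview).

```java
public class TextBlockIdentityDemo {
    // Closing delimiter on its own line: the string ends with a line terminator.
    static final String WITH_NEWLINE = """
        line 1
        line 2
        line 3
        """;

    // Closing delimiter on the last content line: no trailing line terminator.
    static final String WITHOUT_NEWLINE = """
        line 1
        line 2
        line 3""";

    // Text blocks and string literals with the same content are equal strings.
    static boolean sameContent() {
        return WITH_NEWLINE.equals("line 1\nline 2\nline 3\n")
            && WITHOUT_NEWLINE.equals("line 1\nline 2\nline 3");
    }

    // Both forms are compile-time constants, so they share one interned instance.
    static boolean sameInstance() {
        return WITH_NEWLINE == "line 1\nline 2\nline 3\n";
    }

    public static void main(String[] args) {
        System.out.println("same content:  " + sameContent());
        System.out.println("same instance: " + sameInstance());
    }
}
```

The reference comparison with == is used here only to make interning visible; ordinary code should keep comparing strings with equals.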
The following sections discuss compile-time processing in more detail.
1. Line terminators
Line terminators in the content are normalized from CR (\u000D) and CRLF (\u000D\u000A) to LF (\u000A) by the Java compiler. This ensures that the string derived from the content is equivalent across platforms, even if the source code has been translated to a platform encoding (see javac -encoding).
For example, if Java source code that was created on a Unix platform (where the line terminator is LF) is edited on a Windows platform (where the line terminator is CRLF), then without normalization, the content would become one character longer for each line. Any algorithm that relied on LF being the line terminator might fail, and any test that needed to verify string equality with String::equals would fail.
The escape sequences \n (LF), \f (FF), and \r (CR) are not interpreted during normalization; escape processing happens later.
2. Incidental white space
The text blocks shown above were easier to read than their concatenated string literal counterparts, but the obvious interpretation for the content of a text block would include the spaces added to indent the embedded string so that it lines up neatly with the opening delimiter. Here is the HTML example using dots to visualize the spaces that the developer added for indentation:
String html = """
..............<html>
..............    <body>
..............        <p>Hello, world</p>
..............    </body>
..............</html>
..............""";
Since the opening delimiter is generally positioned to appear on the same line as the statement or expression which consumes the text block, there is no real significance to the fact that 14 visualized spaces start each line. Including those spaces in the content would mean the text block denotes a string different from the one denoted by the concatenated string literals. This would hurt migration, and be a recurring source of surprise: it is overwhelmingly likely that the developer does not want those spaces in the string. Also, the closing delimiter is generally positioned to align with the content, which further suggests that the 14 visualized spaces are insignificant.
Spaces may also appear at the end of each line, especially when a text block is populated by copy-pasting snippets from other files (which may themselves have been formed by copy-pasting from yet more files). Here is the HTML example reimagined with some trailing white space, again using dots to visualize spaces:
String html = """
..............<html>...
..............    <body>
..............        <p>Hello, world</p>....
..............    </body>.
..............</html>...
..............""";
Trailing white space is most often unintentional, idiosyncratic, and insignificant. It is overwhelmingly likely that the developer does not care about it. Trailing white space characters are similar to line terminators, in that both are invisible artifacts of the source code editing environment. With no visual guide to the presence of trailing white space characters, including them in the content would be a recurring source of surprise, as it would affect the length, hash code, etc, of the string.
Accordingly, an appropriate interpretation for the content of a text block is to differentiate incidental white space at the start and end of each line, from essential white space. The Java compiler processes the content by removing incidental white space to yield what the developer intended. String::indent can then be used to manage indentation if desired. Using | to visualize margins:
|<html>|
|    <body>|
|        <p>Hello, world</p>|
|    </body>|
|</html>|
The re-indentation algorithm takes the content of a text block whose line terminators have been normalized to LF. It removes the same amount of white space from each line of content until at least one of the lines has a non-white space character in the leftmost position. The position of the opening """ characters has no effect on the algorithm, but the position of the closing """ characters does have an effect if placed on its own line. The algorithm is as follows:
1. Split the content of the text block at every LF, producing a list of individual lines. Note that any line in the content which was just an LF will become an empty line in the list of individual lines.
2. Add all non-blank lines from the list of individual lines into a set of determining lines. (Blank lines -- lines that are empty or are composed wholly of white space -- have no visible influence on the indentation. Excluding blank lines from the set of determining lines avoids throwing off step 4 of the algorithm.)
3. If the last line in the list of individual lines (i.e., the line with the closing delimiter) is blank, then add it to the set of determining lines. (The indentation of the closing delimiter should influence the indentation of the content as a whole -- a significant trailing line policy.)
4. Compute the common white space prefix of the set of determining lines, by counting the number of leading white space characters on each line and taking the minimum count.
5. Remove the common white space prefix from each non-blank line in the list of individual lines.
6. Remove all trailing white space from all lines in the modified list of individual lines from step 5. This step collapses wholly-white-space lines in the modified list so that they are empty, but does not discard them.
7. Construct the result string by joining all the lines in the modified list of individual lines from step 6, using LF as the separator between lines. If the final line in the list from step 6 is empty, then the joining LF from the previous line will be the last character in the result string.
The escape sequences \b (backspace) and \t (tab) are not interpreted by the algorithm; escape processing happens later.
The re-indentation algorithm will be normative in The Java Language Specification. Developers will have access to it via String::stripIndent, a new instance method.
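To make the seven steps concrete, here is an illustrative sketch that follows them literally. It is not the JDK implementation: the class ReindentSketch and its methods are hypothetical names, white space is assumed to mean Character.isWhitespace, and the input is assumed to have its line terminators already normalized to LF.

```java
import java.util.ArrayList;
import java.util.List;

public class ReindentSketch {
    // Re-indents text block content that already uses LF as its only line terminator.
    static String reindent(String content) {
        // Step 1: split at every LF; the -1 limit keeps a trailing empty line.
        String[] lines = content.split("\n", -1);

        // Steps 2-4: the determining lines are the non-blank lines plus the
        // last line (even if blank); take the minimum leading white space count.
        int prefix = Integer.MAX_VALUE;
        for (int i = 0; i < lines.length; i++) {
            boolean isLast = (i == lines.length - 1);
            if (!lines[i].isBlank() || isLast) {
                prefix = Math.min(prefix, leadingWhiteSpace(lines[i]));
            }
        }

        // Steps 5-6: drop the common prefix from non-blank lines, then strip
        // trailing white space; blank lines collapse to empty lines.
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            result.add(line.isBlank() ? "" : line.substring(prefix).stripTrailing());
        }

        // Step 7: join with LF; an empty final line leaves a trailing LF.
        return String.join("\n", result);
    }

    // Counts the leading white space characters of a single line.
    static int leadingWhiteSpace(String line) {
        int count = 0;
        while (count < line.length() && Character.isWhitespace(line.charAt(count))) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String pad = " ".repeat(14);
        // Raw content of the HTML example, with trailing spaces on the first
        // line and the closing delimiter aligned with the content.
        String raw = pad + "<html>   \n" + pad + "    <body>\n" + pad + "</html>\n" + pad;
        System.out.print(reindent(raw));
    }
}
```

For LF-only input such as the one in main, the sketch is expected to agree with String::stripIndent, which makes that method a convenient cross-check.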
Significant trailing line policy
Normally, one would format a text block in two ways: first, position the left edge of the content to appear under the first " of the opening delimiter, and second, place the closing delimiter on its own line to appear exactly under the opening delimiter. The resulting string will have no white space at the start of any line, and will not include the trailing blank line of the closing delimiter.
However, because the trailing blank line is considered a determining line, moving it to the left has the effect of reducing the common white space prefix, and therefore reducing the amount of white space that is stripped from the start of every line. In the extreme case, where the closing delimiter is moved all the way to the left, that reduces the common white space prefix to zero, effectively opting out of white space stripping.
For example, with the closing delimiter moved all the way to the left, there is no incidental white space to visualize with dots:
String html = """
              <html>
                  <body>
                      <p>Hello, world</p>
                  </body>
              </html>
""";
Including the trailing blank line with the closing delimiter, the common white space prefix is zero, so zero white space is removed from the start of each line. The algorithm thus produces (using | to visualize the left margin):
|              <html>
|                  <body>
|                      <p>Hello, world</p>
|                  </body>
|              </html>
Alternatively, suppose the closing delimiter is not moved all the way to the left, but rather under the t of html so it is eight spaces deeper than the variable declaration:
String html = """
              <html>
                  <body>
                      <p>Hello, world</p>
                  </body>
              </html>
        """;
The spaces visualized with dots are considered to be incidental:
String html = """
........      <html>
........          <body>
........              <p>Hello, world</p>
........          </body>
........      </html>
........""";
Including the trailing blank line with the closing delimiter, the common white space prefix is eight, so eight white spaces are removed from the start of each line. The algorithm thus preserves the essential indentation of the content relative to the closing delimiter:
|      <html>
|          <body>
|              <p>Hello, world</p>
|          </body>
|      </html>
Finally, suppose the closing delimiter is moved slightly to the right of the content:
String html = """
              <html>
                  <body>
                      <p>Hello, world</p>
                  </body>
              </html>
                  """;
The spaces visualized with dots are considered to be incidental:
String html = """
..............<html>
..............    <body>
..............        <p>Hello, world</p>
..............    </body>
..............</html>
..............    """;
The common white space prefix is 14, so 14 white spaces are removed from the start of each line. The trailing blank line is stripped to leave an empty line, which being the last line is then discarded. In other words, moving the closing delimiter to the right of the content has no effect, and the algorithm again preserves the essential indentation of the content:
|<html>
|    <body>
|        <p>Hello, world</p>
|    </body>
|</html>
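The effect of the closing delimiter's position can be observed in a small program. This is a sketch with a hypothetical class name and a shortened HTML snippet; the content of every block starts in column 8, and only the closing delimiter moves. It assumes a JDK where text blocks are enabled.

```java
public class ClosingDelimiterDemo {
    // Closing delimiter aligned with the content: all 8 leading spaces are incidental.
    static final String ALIGNED = """
        <html>
            <body/>
        </html>
        """;

    // Closing delimiter flush left: the common prefix is zero, so all 8 spaces are kept.
    static final String FLUSH_LEFT = """
        <html>
            <body/>
        </html>
""";

    // Closing delimiter in column 4: only 4 spaces are incidental, 4 are kept.
    static final String FOUR_LEFT = """
        <html>
            <body/>
        </html>
    """;

    // Closing delimiter to the right of the content: no effect on the result.
    static final String RIGHT = """
        <html>
            <body/>
        </html>
            """;

    public static void main(String[] args) {
        System.out.print(ALIGNED);
        System.out.print(FLUSH_LEFT);
        System.out.print(FOUR_LEFT);
        System.out.print(RIGHT);
    }
}
```

Because the examples depend on exact columns, the sketch only behaves as described if the indentation shown here is preserved when the code is copied.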
3. Escape sequences
After the content is re-indented, any escape sequences in the content are interpreted. Text blocks support all of the escape sequences supported in string literals, including \n, \t, \', \", and \\. See section 3.10.6 of The Java Language Specification for the full list. Developers will have access to escape processing via String::translateEscapes, a new instance method.
Interpreting escapes as the final step allows developers to use \n, \f, and \r for vertical formatting of a string without it affecting the translation of line terminators in step 1, and to use \b and \t for horizontal formatting of a string without it affecting the removal of incidental white space in step 2. For example, consider this text block that contains the \r escape sequence (CR):
String html = """
              <html>\r
                  <body>\r
                      <p>Hello, world</p>\r
                  </body>\r
              </html>\r
              """;
The CR escapes are not processed until after the line terminators have been normalized to LF. Using Unicode escapes to visualize LF (\u000A) and CR (\u000D), the result is:
|<html>\u000D\u000A
|    <body>\u000D\u000A
|        <p>Hello, world</p>\u000D\u000A
|    </body>\u000D\u000A
|</html>\u000D\u000A
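A short sketch can confirm this ordering. The class name and the shortened snippet are hypothetical, and the code assumes a JDK where text blocks and String::translateEscapes are available (JDK 15 or later, or JDK 14 with --enable-preview).

```java
public class EscapeOrderDemo {
    // Each \r is an escape sequence, so it is untouched by line terminator
    // normalization and white space stripping, and becomes CR only at the end.
    static final String CRLF_HTML = """
        <html>\r
            <body/>\r
        </html>\r
        """;

    // The same escape processing applied at run time to an ordinary string;
    // the doubled backslashes keep the escapes uninterpreted by the compiler.
    static String translated() {
        return "a\\tb\\sc".translateEscapes();
    }

    public static void main(String[] args) {
        System.out.println(CRLF_HTML.equals("<html>\r\n    <body/>\r\n</html>\r\n"));
        System.out.println(translated().equals("a\tb c"));
    }
}
```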
Note that it is legal to use " freely inside a text block, even next to the opening or closing delimiter. For example:
String story = """
    "When I use a word," Humpty Dumpty said,
    in rather a scornful tone, "it means just what I
    choose it to mean - neither more nor less."
    "The question is," said Alice, "whether you
    can make words mean so many different things."
    "The question is," said Humpty Dumpty,
    "which is to be master - that's all."
    """;
However, sequences of three " characters require the escaping of at least one " to avoid mimicking the closing delimiter:
String code =
    """
    String text = \"""
        A text block inside a text block
    \""";
    """;
New escape sequences
To allow finer control of the processing of newlines and white space, we introduce two new escape sequences.
First, the \<line-terminator> escape sequence explicitly suppresses the insertion of a newline character.
For example, it is common practice to split very long string literals into concatenations of smaller substrings, and then hard wrap the resulting string expression onto multiple lines:
String literal = "Lorem ipsum dolor sit amet, consectetur adipiscing " +
                 "elit, sed do eiusmod tempor incididunt ut labore " +
                 "et dolore magna aliqua.";
With the \<line-terminator> escape sequence this could be expressed as:
String text = """
              Lorem ipsum dolor sit amet, consectetur adipiscing \
              elit, sed do eiusmod tempor incididunt ut labore \
              et dolore magna aliqua.\
              """;
For the simple reason that character literals and traditional string literals don't allow embedded newlines, the \<line-terminator> escape sequence is only applicable to text blocks.
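As a rough check of the claim that the two forms denote the same string, the following sketch (hypothetical class name, assuming a JDK where text blocks are enabled) compares the concatenated literal with the text block that uses \<line-terminator>:

```java
public class LineContinuationDemo {
    // The traditional form: a concatenation of shorter string literals.
    static final String LITERAL =
        "Lorem ipsum dolor sit amet, consectetur adipiscing " +
        "elit, sed do eiusmod tempor incididunt ut labore " +
        "et dolore magna aliqua.";

    // The text block form: a backslash at the end of a line suppresses the
    // newline, and the space before it is kept because it is not trailing.
    static final String TEXT = """
        Lorem ipsum dolor sit amet, consectetur adipiscing \
        elit, sed do eiusmod tempor incididunt ut labore \
        et dolore magna aliqua.\
        """;

    public static void main(String[] args) {
        System.out.println(TEXT.equals(LITERAL));
        System.out.println(TEXT.contains("\n"));
    }
}
```

The backslash must be the very last character on its source line; any white space after it would no longer form the \<line-terminator> escape.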
Second, the new \s escape sequence simply translates to a single space (\u0020).
Escape sequences aren't translated until after incidental white space stripping, so \s can act as a fence to prevent the stripping of trailing white space. Using \s at the end of each line in this example guarantees that each line is exactly six characters long:
String colors = """
                red  \s
                green\s
                blue \s
                """;
The \s escape sequence can be used in both text blocks and traditional string literals.
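The fence behavior can be verified with a small sketch. The class and method names are hypothetical, and the code assumes a JDK where text blocks and the \s escape are enabled.

```java
public class SpaceEscapeDemo {
    // The spaces before each \s are not trailing white space (the escape
    // follows them), so they survive stripping; \s then becomes one space.
    static final String COLORS = """
        red  \s
        green\s
        blue \s
        """;

    // True when every line of the block has exactly six characters.
    static boolean allSixWide() {
        return COLORS.lines().allMatch(line -> line.length() == 6);
    }

    // The \s escape is also accepted in a traditional string literal.
    static boolean worksInLiteral() {
        return "a\sb".equals("a b");
    }

    public static void main(String[] args) {
        System.out.println(allSixWide());
        System.out.println(worksInLiteral());
    }
}
```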
Concatenation of text blocks
Text blocks can be used anywhere a string literal can be used. For example, text blocks and string literals may be concatenated interchangeably:
String code = "public void print(Object o) {" +
              """
                  System.out.println(Objects.toString(o));
              }
              """;
However, concatenation involving a text block can become rather clunky. Take this text block as a starting point:
String code = """
              public void print(Object o) {
                  System.out.println(Objects.toString(o));
              }
              """;
Suppose it needs to be changed so that the type of o comes from a variable. Using concatenation, the text block that contains the trailing code will need to start on a new line. Unfortunately, the straightforward insertion of a newline in the program, as below, will cause a long span of white space between the type and the text beginning o:
String code = """
              public void print(""" + type + """
                                                 o) {
                  System.out.println(Objects.toString(o));
              }
              """;
The white space can be removed manually, but this hurts readability of the quoted code:
String code = """
              public void print(""" + type + """
               o) {
                  System.out.println(Objects.toString(o));
              }
              """;
A cleaner alternative is to use String::replace or String::format, as follows:
String code = """
              public void print($type o) {
                  System.out.println(Objects.toString(o));
              }
              """.replace("$type", type);
String code = String.format("""
              public void print(%s o) {
                  System.out.println(Objects.toString(o));
              }
              """, type);
Another alternative involves the introduction of a new instance method, String::formatted, which could be used as follows:
String source = """
                public void print(%s object) {
                    System.out.println(Objects.toString(object));
                }
                """.formatted(type);
Additional Methods
The following methods will be added to support text blocks:
- String::stripIndent(): used to strip away incidental white space from the text block content
- String::translateEscapes(): used to translate escape sequences
- String::formatted(Object... args): simplify value substitution in the text block
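As an illustration of value substitution, the sketch below produces the same generated source in three ways: with String::replace, with String::format, and with the new String::formatted. The class and method names are hypothetical, and the code assumes a JDK where text blocks and these methods are available.

```java
public class SubstitutionDemo {
    // Substitution with a placeholder token and String::replace.
    static String viaReplace(String type) {
        return """
            public void print($type o) {
                System.out.println(Objects.toString(o));
            }
            """.replace("$type", type);
    }

    // Substitution with the static String::format.
    static String viaFormat(String type) {
        return String.format("""
            public void print(%s o) {
                System.out.println(Objects.toString(o));
            }
            """, type);
    }

    // Substitution with the instance method String::formatted.
    static String viaFormatted(String type) {
        return """
            public void print(%s o) {
                System.out.println(Objects.toString(o));
            }
            """.formatted(type);
    }

    public static void main(String[] args) {
        System.out.print(viaFormatted("String"));
    }
}
```

The format-based variants interpret % specially, so a text block that contains a literal percent sign would need %% there; the replace-based variant has no such constraint but requires a token that cannot occur in the content.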
Alternatives
Do nothing
Java has prospered for over 20 years with string literals that required newlines to be escaped. IDEs ease the maintenance burden by supporting automatic formatting and concatenation of strings that span several lines of source code. The String class has also evolved to include methods that simplify the processing and formatting of long strings, such as a method that presents a string as a stream of lines. However, strings are such a fundamental part of the Java language that the shortcomings of string literals are apparent to vast numbers of developers. Other JVM languages have also made advances in how long and complex strings are denoted. Unsurprisingly, then, multi-line string literals have consistently been one of the most requested features for Java. Introducing a multi-line construct of low to moderate complexity would have a high payoff.
Allow a string literal to span multiple lines
Multi-line string literals could be introduced in Java simply by allowing line terminators in existing string literals. However, this would do nothing about the pain of escaping " characters. \" is the most frequently occurring escape sequence after \n, because of the frequency of code snippets. The only way to avoid escaping " in a string literal would be to provide an alternate delimiter scheme for string literals. Delimiters were much discussed for JEP 326 (Raw String Literals), and the lessons learned were used to inform the design of text blocks, so it would be misguided to upset the stability of string literals.
Adopt another language's multi-line string literal
According to Brian Goetz:
Many people have suggested that Java should adopt multi-line string literals from Swift or Rust. However, the approach of “just do what language X does” is intrinsically irresponsible; nearly every feature of every language is conditioned by other features of that language. Instead, the game is to learn from how other languages do things, assess the tradeoffs they’ve chosen (explicitly and implicitly), and ask what can be applied to the constraints of the language we have and user expectations within the community we have.
For JEP 326 (Raw String Literals), we surveyed many modern programming languages and their support for multi-line string literals. The results of these surveys influenced the current proposal, such as the choice of three " characters for delimiters (although there were other reasons for this choice too) and the recognition of the need for automatic indentation management.
Do not remove incidental white space
If Java introduced multi-line string literals without support for automatically removing incidental white space, then many developers would write a method to remove it themselves, or lobby for the String class to include a removal method. However, that implies a potentially expensive computation every time the string is instantiated at run time, which would reduce the benefit of string interning. Having the Java language mandate the removal of incidental white space, both in leading and trailing positions, seems the most appropriate solution. Developers can opt out of leading white space removal by careful placement of the closing delimiter.
Raw string literals
For JEP 326 (Raw String Literals), we took a different approach to the problem of denoting strings without escaping newlines and quotes, focusing on the raw-ness of strings. We now believe that this focus was wrong, because while raw string literals could easily span multiple lines of source code, the cost of supporting unescaped delimiters in their content was extreme. This limited the effectiveness of the feature in the multi-line use case, which is a critical one because of the frequency of embedding multi-line (but not truly raw) code snippets in Java programs. A good outcome of the pivot from raw-ness to multi-line-ness was a renewed focus on having a consistent escape language between string literals, text blocks, and related features that may be added in future.
Testing
Tests that use string literals for the creation, interning, and manipulation of instances of String should be duplicated to use text blocks too. Negative tests should be added for corner cases involving line terminators and EOF.
Tests should be added to ensure that text blocks can embed Java-in-Java, Markdown-in-Java, SQL-in-Java, and at least one JVM-language-in-Java.