JEP 326: Raw String Literals (Preview)
Owner | Jim Laskey |
Type | Feature |
Scope | SE |
Status | Closed / Withdrawn |
Component | specification / language |
Discussion | amber dash dev at openjdk dot java dot net |
Effort | M |
Duration | M |
Relates to | JEP 355: Text Blocks (Preview) |
Reviewed by | Alex Buckley |
Endorsed by | Brian Goetz |
Created | 2018/01/23 15:40 |
Updated | 2020/05/01 19:04 |
Issue | 8196004 |
Summary
Add raw string literals to the Java programming language. A raw string
literal can span multiple lines of source code and does not interpret
escape sequences, such as \n, or Unicode escapes, of the form \uXXXX.
Please note: This was intended to be a preview language feature in JDK 12, but it was withdrawn and did not appear in JDK 12. It was superseded by Text Blocks (JEP 355) in JDK 13.
Goals
- Make it easier for developers to
  - express sequences of characters in a readable form, free of Java indicators,
  - supply strings targeted for grammars other than Java, and
  - supply strings that span several lines of source without supplying special indicators for new lines.
- Raw string literals should be able to express the same strings as traditional string literals, except for platform-specific line terminators.
- Include library support to replicate the current javac string-literal interpretation of escapes and manage left-margin trimming.
Non-Goals
- Do not introduce any new String operators.
- Raw string literals do not directly support string interpolation. Interpolation may be considered in a future JEP.
- No change in the interpretation of traditional string literals in any way, including:
  - multi-line capability,
  - customization of delimiters with repeating open and close double-quotes, and
  - handling of escape sequences.
Motivation
Escape sequences have been defined in many programming languages, including
Java, to represent characters that cannot be easily represented
directly. As an example, the escape sequence \n represents the ASCII
newline control character. To print "hello" and "world" on separate
lines, the string "hello\nworld\n" can be used:
System.out.print("hello\nworld\n");
Output:
hello
world
Besides suffering from readability issues, this example hard-codes the
Unix line terminator, whereas other operating systems use different
newline representations, such as \r\n (Windows). In Java, we use a
higher-level method such as println to provide the platform-appropriate
newline sequence:
System.out.println("hello");
System.out.println("world");
If "hello" and "world" are being displayed using a GUI library, control characters may not have any significance at all.
The escape sequence indicator, backslash, is represented in Java string
literals as \\. This doubling up of backslashes leads to the
Leaning Toothpick Syndrome,
where strings become difficult to interpret because of excessive
backslashes. Java developers are familiar with examples such as:
Path path = Paths.get("C:\\Program Files\\foo");
Escape sequences, such as \" to represent the double-quote
character, also lead to interpretation issues when used in non-Java
grammars. For example, searching for a double-quote within a string
requires:
Pattern pattern = Pattern.compile("\\\"");
In everyday Java development, escape sequences are the exception rather than the rule. Control characters are seldom needed, and the presence of escapes hurts the readability and maintainability of our code. Once we recognize this, a non-interpreted string literal is a natural conclusion.
Real-world Java code, which frequently embeds fragments of other programs (SQL, JSON, XML, regex, etc) in Java programs, needs a mechanism for capturing literal strings as-is, without special handling of Unicode escaping, backslash, or new lines.
This JEP proposes a new kind of literal, a raw string literal, which sets aside both Java escapes and Java line terminator specifications, to provide character sequences that under many circumstances are more readable and maintainable than the existing traditional string literal.
File Paths Example
Traditional String Literals:
Runtime.getRuntime().exec("\"C:\\Program Files\\foo\" bar");
Raw String Literals:
Runtime.getRuntime().exec(`"C:\Program Files\foo" bar`);
Multi-line Example
Traditional String Literals:
String html = "<html>\n" +
              "    <body>\n" +
              "        <p>Hello World.</p>\n" +
              "    </body>\n" +
              "</html>\n";
Raw String Literals:
String html = `<html>
                   <body>
                       <p>Hello World.</p>
                   </body>
               </html>
              `;
Regular Expression Example
Traditional String Literals:
System.out.println("this".matches("\\w\\w\\w\\w"));
Raw String Literals:
System.out.println("this".matches(`\w\w\w\w`));
Output:
true
Polyglot Example
Traditional String Literals:
String script = "function hello() {\n" +
                "    print(\'\"Hello World\"\');\n" +
                "}\n" +
                "\n" +
                "hello();\n";
ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
Object obj = engine.eval(script);
Raw String Literals:
String script = `function hello() {
                    print('"Hello World"');
                 }

                 hello();
                `;
ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
Object obj = engine.eval(script);
Output:
"Hello World"
Database Example
Traditional String Literals:
String query = "SELECT `EMP_ID`, `LAST_NAME` FROM `EMPLOYEE_TB`\n" +
               "WHERE `CITY` = 'INDIANAPOLIS'\n" +
               "ORDER BY `EMP_ID`, `LAST_NAME`;\n";
Raw String Literals:
String query = ``SELECT `EMP_ID`, `LAST_NAME` FROM `EMPLOYEE_TB`
                 WHERE `CITY` = 'INDIANAPOLIS'
                 ORDER BY `EMP_ID`, `LAST_NAME`;
               ``;
Description
A raw string literal is a new form of literal.
Literal:
  IntegerLiteral
  FloatingPointLiteral
  BooleanLiteral
  CharacterLiteral
  StringLiteral
  RawStringLiteral
  NullLiteral
RawStringLiteral:
  RawStringDelimiter RawInputCharacter {RawInputCharacter} RawStringDelimiter
RawStringDelimiter:
  ` {`}
A raw string literal consists of one or more characters enclosed in
sequences of backticks ` (\u0060) (backquote, accent grave).
A raw string literal opens with a sequence of one or more
backticks. The raw string literal closes when a backtick sequence
of the same length as the opening sequence is encountered.
Any other sequence of backticks is treated as part of the string body.
Embedding backticks in a raw string literal can be accomplished by increasing or decreasing the number of backticks in the open/close sequences to mismatch any embedded sequences. However, this does not help when a backtick is desired as the first or last character in the raw string literal, since that character will be treated as part of the open/close sequence. In this case, it is necessary to use a workaround, such as padding the body of the raw string literal and then stripping the padding.
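The delimiter rule above can be sketched as a small scanner. This is only an illustration of the matching rule, not the javac lexer: the class RawLiteralScanner and its method body are hypothetical names, and the input is assumed to start at the opening backtick.

```java
final class RawLiteralScanner {
    /**
     * Returns the body of the raw string literal that starts at index 0 of src.
     * Throws IllegalArgumentException if src does not start with a backtick or
     * if no closing run of the same length is found.
     */
    static String body(String src) {
        // Length of the opening delimiter: the maximal run of leading backticks.
        int open = 0;
        while (open < src.length() && src.charAt(open) == '`') {
            open++;
        }
        if (open == 0) {
            throw new IllegalArgumentException("no opening backtick");
        }
        int i = open;
        while (i < src.length()) {
            if (src.charAt(i) != '`') {
                i++;
                continue;
            }
            // Measure this run of backticks inside the literal.
            int runStart = i;
            while (i < src.length() && src.charAt(i) == '`') {
                i++;
            }
            if (i - runStart == open) {
                // Same length as the opening run: this is the closing delimiter.
                return src.substring(open, runStart);
            }
            // Any other run length is ordinary body content; keep scanning.
        }
        throw new IllegalArgumentException("unterminated raw string literal");
    }
}
```

For instance, scanning the source text of the second example literal below (two backticks, can`t, two backticks) yields the five-character body can`t, because the single embedded backtick does not match the two-backtick opening sequence.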
Characters in a raw string literal are never interpreted, with the
exception of CR and CRLF, which are platform-specific line terminators.
CR (\u000D) and CRLF (\u000D\u000A) sequences are always translated
to LF (\u000A). This translation provides the least surprising behavior across platforms.
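A minimal sketch of this translation, assuming it can be modeled as a plain replacement on the body of the literal. LineTerminators.normalize is a hypothetical helper used only for illustration; it is not part of the proposal.

```java
final class LineTerminators {
    /** Translates every CRLF and every lone CR in body to a single LF. */
    static String normalize(String body) {
        // Handle CRLF first so that its CR is not translated a second time.
        return body.replace("\r\n", "\n").replace('\r', '\n');
    }
}
```

With this rule, the same source file yields the same string value whether it was saved with Unix, classic Mac OS, or Windows line endings.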
Traditional string literals support two kinds of escapes:
Unicode escapes of the form \uXXXX,
and escape sequences such as \n.
Neither kind of escape is processed in raw string literals;
the individual characters that make up the escape are used as-is.
This implies that processing of Unicode escapes is disabled
when the lexer encounters an opening backtick and reenabled when
encountering a closing backtick. For consistency, the Unicode escape
\u0060 may not be used as a substitute for the opening backtick.
The following are examples of raw string literals:
`"` // a string containing " alone
``can`t`` // a string containing 'c', 'a', 'n', '`' and 't'
`This is a string` // a string containing 16 characters
`\n` // a string containing '\' and 'n'
`\u2022` // a string containing '\', 'u', '2', '0', '2' and '2'
`This is a
two-line string` // a single string constant
It is a compile-time error to have an open backtick sequence and no corresponding close backtick sequence before the end of the compilation unit.
In a class file, a string constant does not record whether it was derived from a raw string literal or a traditional string literal.
At run time, a raw string literal is evaluated to an instance of String,
like a traditional string literal. Instances of String that are derived
from raw string literals are treated in the same manner as those
derived from traditional string literals.
Escapes
A developer may well want a string that is multi-line but has
interpreted escape sequences. To meet this requirement, instance
methods will be added to the String class to support the run-time
interpretation of escape sequences. Primarily,
public String unescape()
will translate each character sequence beginning with \ that has
the same spelling as an escape defined in the JLS
(either a Unicode escape or an escape sequence)
to the character represented by that sequence.
Examples (b0 through b3 are true):
boolean b0 = `\n`.equals("\\n");
boolean b1 = `\n`.unescape().equals("\n");
boolean b2 = `\n`.length() == 2;
boolean b3 = `\n`.unescape().length() == 1;
Other methods will provide finer control over which escapes are translated.
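As a rough illustration of what unescape() is described as doing, the sketch below translates the single-character escape sequences and four-digit Unicode escapes in ordinary Java. It is not the proposed implementation: UnescapeSketch.unescape is a hypothetical static helper, octal escapes and Unicode escapes with repeated u are left out, and unrecognized sequences are assumed to be copied through unchanged.

```java
final class UnescapeSketch {
    // Letter following the backslash, and the character it stands for.
    private static final String ESCAPE_LETTERS = "btnfr\"'\\";
    private static final String ESCAPE_VALUES = "\b\t\n\f\r\"'\\";

    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                int simple = ESCAPE_LETTERS.indexOf(next);
                if (simple >= 0) {
                    // Two-character escape sequence such as backslash-n.
                    out.append(ESCAPE_VALUES.charAt(simple));
                    i += 2;
                    continue;
                }
                // Backslash, the letter u, then exactly four hex digits.
                int codeUnit = next == 'u' ? hex4(s, i + 2) : -1;
                if (codeUnit >= 0) {
                    out.append((char) codeUnit);
                    i += 6;
                    continue;
                }
            }
            // Not a recognized escape: copy the character as-is.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    /** Parses four hex digits starting at from, or returns -1 if absent. */
    private static int hex4(String s, int from) {
        if (from + 4 > s.length()) {
            return -1;
        }
        int value = 0;
        for (int k = from; k < from + 4; k++) {
            int digit = Character.digit(s.charAt(k), 16);
            if (digit < 0) {
                return -1;
            }
            value = value * 16 + digit;
        }
        return value;
    }
}
```

Here the two-character string written "\\n" as a traditional literal stands in for the raw literal `\n`, since raw literals cannot be compiled today.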
There will also be a provision for tools to invert escapes. The following
method will also be added to the String class:
public String escape()
which will convert all characters less than ' ' into Unicode or
character escape sequences, characters above '~' to Unicode escape
sequences, and the characters ", ', \ to escape sequences.
Examples (b0 through b3 are true):
boolean b0 = "\n".escape().equals(`\n`);
boolean b1 = `•`.escape().equals(`\u2022`);
boolean b2 = "•".escape().equals(`\u2022`);
boolean b3 = !"•".escape().equals("\u2022");
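The inverse direction can be sketched the same way. EscapeSketch.escape is a hypothetical helper that follows the description of escape() given above; the use of lowercase hex digits and the escaping of each UTF-16 code unit separately are assumptions, since the text does not specify them.

```java
final class EscapeSketch {
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                // Control characters that have a character escape sequence.
                case '\b': out.append("\\b"); break;
                case '\t': out.append("\\t"); break;
                case '\n': out.append("\\n"); break;
                case '\f': out.append("\\f"); break;
                case '\r': out.append("\\r"); break;
                // Quotes and the backslash itself.
                case '"': out.append("\\\""); break;
                case '\'': out.append("\\'"); break;
                case '\\': out.append("\\\\"); break;
                default:
                    if (c < ' ' || c > '~') {
                        // Everything else outside printable ASCII becomes
                        // a backslash, the letter u, and four hex digits.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }
}
```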
Source Encoding
If a source file contains non-ASCII characters, ensure use of the correct encoding on the javac command line (see javac -encoding). Alternatively, supply the appropriate Unicode escapes in the raw string and then use one of the provided library routines described above to translate Unicode escapes to the desired non-ASCII characters.
Margin Management
One of the issues with multi-line strings is whether to format the string against the left margin (as in heredoc) or, ideally, blend with the indentation used by surrounding code. The question then becomes, how to manage this incidental indentation.
For example, some developers may choose to code as
String s = `
this is my
    embedded string
`;
while other developers may not like the outdenting style and choose to embed relative to the indentation of the code
    String html = `
                       this is my
                           embedded string
                  `;
In the latter case, the developer probably intends that "this" should be
left-justified while "embedded" should be relatively indented by four
spaces, and we surely want to support this, but we are reluctant to try
and read the developer's mind and assume that this white space is
incidental.
To allow for contrasting coding styles, while providing a flexible and enduring solution, raw string literals are scanned with the incidental indentation intact; i.e., raw. The consequence of this design is that if the developer chooses the above former case, they need no further processing. Otherwise, the developer will have access to easy-to-use library support for a variety of alternate coding styles. This will permit coding style change without affecting the JLS.
We believe the most common case will be the latter case above. For that
reason, we will provide the following String instance method:
public String align()
which, after removing all leading and trailing blank lines, left-justifies each line without loss of relative indentation, thus stripping away all incidental indentation and line spacing.
Example:
String html = `
                   <html>
                       <body>
                           <p>Hello World.</p>
                       </body>
                   </html>
              `.align();
System.out.print(html);
Output:
<html>
    <body>
        <p>Hello World.</p>
    </body>
</html>
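The behavior described for align() can be approximated in ordinary Java. The following is a sketch under stated assumptions, not the proposed implementation: AlignSketch.align is a hypothetical static helper, the input is assumed to use LF line terminators, every output line is assumed to end with LF, and trailing white space on each line is left untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AlignSketch {
    /** Number of leading white space characters in line. */
    private static int leadingWhitespace(String line) {
        int i = 0;
        while (i < line.length() && Character.isWhitespace(line.charAt(i))) {
            i++;
        }
        return i;
    }

    private static boolean blank(String line) {
        return leadingWhitespace(line) == line.length();
    }

    static String align(String s) {
        List<String> lines = new ArrayList<>(Arrays.asList(s.split("\n", -1)));
        // Drop leading and trailing blank lines.
        while (!lines.isEmpty() && blank(lines.get(0))) {
            lines.remove(0);
        }
        while (!lines.isEmpty() && blank(lines.get(lines.size() - 1))) {
            lines.remove(lines.size() - 1);
        }
        // The incidental margin is the smallest indentation of any non-blank line.
        int margin = Integer.MAX_VALUE;
        for (String line : lines) {
            if (!blank(line)) {
                margin = Math.min(margin, leadingWhitespace(line));
            }
        }
        // Remove the margin from each line, keeping relative indentation.
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            out.append(blank(line) ? "" : line.substring(margin)).append('\n');
        }
        return out.toString();
    }
}
```

Taking the minimum indentation over the non-blank lines is what keeps nested lines indented relative to the outermost one.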
Further, generalized control of indentation will be provided with the
following String instance method:
public String indent(int n)
where n specifies the number of white spaces to add or remove from
each line of the string; a positive n adds n spaces (U+0020) and
a negative n removes |n| leading white spaces.
Example:
String html = `
                   <html>
                       <body>
                           <p>Hello World.</p>
                       </body>
                   </html>
              `.align().indent(4);
System.out.print(html);
Output:
    <html>
        <body>
            <p>Hello World.</p>
        </body>
    </html>
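A sketch of the per-line adjustment that indent(n) is described as performing. IndentSketch.indent is a hypothetical static helper; it assumes LF line terminators, ends every output line with LF, and, for a negative n, removes at most |n| leading white space characters from a line that has fewer than that (the text above does not say what happens in that case).

```java
final class IndentSketch {
    static String indent(String s, int n) {
        if (s.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        // split drops the empty string after a final line terminator.
        for (String line : s.split("\n")) {
            if (n > 0) {
                // Positive n: prefix the line with n spaces.
                for (int k = 0; k < n; k++) {
                    out.append(' ');
                }
                out.append(line);
            } else {
                // Zero or negative n: drop up to -n leading white space characters.
                int remove = 0;
                while (remove < -n && remove < line.length()
                        && Character.isWhitespace(line.charAt(remove))) {
                    remove++;
                }
                out.append(line.substring(remove));
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```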
In the cases where align() is not what the developer wants, we expect the
preponderance of cases to be align().indent(n). Therefore, an additional
variation of align will be provided:
public String align(int n)
where n is the indentation applied to the string after alignment.
Example:
String html = `
                   <html>
                       <body>
                           <p>Hello World.</p>
                       </body>
                   </html>
              `.align(4);
System.out.print(html);
Output:
    <html>
        <body>
            <p>Hello World.</p>
        </body>
    </html>
Customizable margin management will be provided by the String instance method:
<R> R transform(Function<String,R> f)
where the supplied function f is called with this string as the argument.
Example:
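// requires: import java.util.stream.Collectors;
public class MyClass {
    private static final String MARKER = "| ";
    // static, so that MyClass::stripMargin can be used as a Function<String,String>
    public static String stripMargin(String string) {
        return string.lines().map(String::strip)
                     .filter(s -> !s.isEmpty()) // drop the blank first and last lines
                     .map(s -> s.startsWith(MARKER) ? s.substring(MARKER.length()) : s)
                     .collect(Collectors.joining("\n", "", "\n"));
    }
}
String stripped = `
                      | The content of
                      | the string
                  `.transform(MyClass::stripMargin);
System.out.print(stripped);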
Output:
The content of
the string
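Because raw string literals cannot be compiled today, the same idea can be tried with a traditional literal. This is a sketch only: MarginDemo, its local transform helper (a stand-in for the proposed String instance method) and demo are hypothetical names, and blank lines are dropped on the assumption that the leading and trailing blank lines of the literal are not wanted in the result.

```java
import java.util.Arrays;
import java.util.function.Function;
import java.util.stream.Collectors;

final class MarginDemo {
    private static final String MARKER = "| ";

    /** Strips surrounding white space and a leading margin marker from each line. */
    static String stripMargin(String string) {
        return Arrays.stream(string.split("\n"))
                .map(String::trim)
                .filter(s -> !s.isEmpty()) // assumption: blank lines are dropped
                .map(s -> s.startsWith(MARKER) ? s.substring(MARKER.length()) : s)
                .collect(Collectors.joining("\n", "", "\n"));
    }

    /** Stand-in for the proposed instance method: applies f to s. */
    static <R> R transform(String s, Function<String, R> f) {
        return f.apply(s);
    }

    static String demo() {
        // Traditional-literal spelling of the multi-line body used above.
        String body = "\n"
                + "        | The content of\n"
                + "        | the string\n"
                + "    ";
        return transform(body, MarginDemo::stripMargin);
    }
}
```

Calling System.out.print(MarginDemo.demo()) prints the two lines "The content of" and "the string".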
It should be noted that concern for class file size and runtime impact by this design is addressed by the constant folding features of JEP 303.
Alternatives
Choice of Delimiters
A traditional string literal and a raw string literal both enclose their character sequence with delimiters. A traditional string literal uses the double-quote character as both the opening and closing delimiter. This symmetry makes the literal easy to read and parse. A raw string literal will also adopt symmetric delimiters, but it must use a different delimiter because the double-quote character may appear unescaped in the character sequence. The choice of delimiters for a raw string literal is informed by the following considerations:
- Delimiters should have a low profile for small character sequences, margin management, and general readability.
- The opening delimiter should be a clear indication that what follows is the body of a raw string literal.
- The closing delimiter should have a low probability of occurring in the string body. If the closing delimiter needs to occur in the body of the string then the rules for embedding the closing delimiter should be clean and simple. Embedding must be accomplished without the use of escapes.
We assume that the string-literal delimiter choice includes only the three Latin1 quote characters: single-quote, double-quote, and backtick. Any other choice would affect clarity and be inconsistent with traditional string literals.
Still, it is necessary to differentiate a raw string literal from a
traditional string literal. For example, double-quote could be combined
with other characters or custom phrases to form a kind of compound
delimiter for raw string literals. For example, $"xyz"$ or
abcd"xyz"abcd. These compound delimiters meet the basic requirements,
but lack a clean and simple embedding of the closing delimiter. Also,
there is a temptation in the custom phrases case to assign semantic
meaning to the phrase, heralding another industry similar to Java
annotations.
There is the possibility to use quote repetition: """xyz""". Here we have
to be cautious to avoid ambiguity. Example: "" + x + "" can be
parsed as the concatenation of a traditional string literal with a
variable and another traditional string literal, or as a raw string
literal for the seven-character string " + x + ".
The advantage of the backtick is that it does not require repurposing. We can also avoid the ambiguity created by quote repetition and the empty string. It is a new delimiter in terms of the Java Language Specification. It meets all the delimiter requirements, including a simple embedding rule.
Another consideration for choice of delimiters is the potential for future technologies. With raw and traditional string literals both using simple delimiters, any future technology could be applied symmetrically.
This JEP proposes to use the backtick character. It is distinct from the existing quotes in the language but conveys a similar purpose.
Multi-line Traditional String Literals
Even though this option has been set aside as a raw string literal solution, it may still be reasonable to allow multi-line traditional string literals in addition to raw string literals. Enabling such a feature would affect tools and tests that treat multi-line traditional string literals as an error.
Other Languages
Java remains one of a small group of contemporary programming languages that do not provide language-level support for raw strings.
The following programming languages support raw string literals and were surveyed for their delimiters and use of raw and multi-line strings: C, C++, C#, Dart, Go, Groovy, Haskell, JavaScript, Kotlin, Perl, PHP, Python, R, Ruby, Scala and Swift. The Unix tools bash, grep and sed were also examined for string representations.
A multi-line literal solution could have been simply achieved by changing the Java specification to allow CR and LF in the body of a double-quote traditional string literal. However, the use of double quote implies that escapes must be interpreted.
A different delimiter was required to signify different interpretation behavior. Other languages chose a variety of delimiters:
Delimiters | Language/Tool |
---|---|
"""...""" | Groovy, Kotlin, Python, Scala, Swift |
`...` | Go, JavaScript |
@"..." | C# |
 | Groovy (old style) |
 | C/C++ |
 | Ruby |
 | Perl |
Groovy, Kotlin, Python, Scala and Swift have opted to use triple double quotes to indicate raw strings. This choice reflects the connection with existing string literals.
Go and JavaScript use the backtick. This choice uses a character that is not commonly used in strings. This is not ideal for use in Markdown documents, but addresses a majority of cases.
A unique meta-tag such as @"..." used in C# provides similar
functionality to the backticks proposed here. However, @ suggests
annotations in Java. The use of another meta-tag limits the use of that
meta-tag for future purposes.
Heredoc
An alternative to quoting for raw strings is using "here" documents or heredocs. Heredocs were first used in Unix shells and have found their way into programming languages such as Perl. A heredoc has a placeholder and an end marker. The placeholder indicates where the string is to be inserted in the code as well as providing the description of end marker. The end marker comes after the body of the string. For example,
System.out.println(<<HTML);
<html>
    <body>
        <p>Hello World.</p>
    </body>
</html>
HTML
Heredocs provide a solution for raw strings, but are thought by many to be an anachronism. They are also obtrusive and complicate margin management.
Testing
String test suites should be extended to duplicate existing tests replacing traditional string literals with raw string literals.
Negative tests should be added to test corner cases for line terminators and end of compilation unit.
Tests should be added to test escape and margin management methods.
Tests should be added to ensure we can embed Java-in-Java and Markdown-in-Java.