JEP draft: Keyword Management for the Java Language

Owner	Alex Buckley
Type	Informational
Scope	SE
Status	Draft
Component	specification / language
Discussion	jdk dash dev at openjdk dot java dot net
Effort	M
Duration	M
Created	2019/04/26 00:26
Updated	2020/01/23 22:55
Issue	8223002

Summary

Evolving the Java language often means new keywords for new features, but new keywords risk breaking old programs. To balance compatibility and readability, a new kind of keyword may be used: a hyphenated keyword that is a compound of pre-existing keywords and identifiers, such as non-final, break-with, and short-circuit.

Note: All examples in this JEP are intended solely to illustrate a syntactic form under discussion. They are not intended to suggest that any particular language feature is being considered for inclusion in Java now or in the future.

Goals

Explore the syntactic options open to Java language designers for denoting new features.
Solve the perpetual problem of keyword tokens being so scarce and expensive that language designers have to constrain or corrupt the Java programming model to fit the keywords available.
Advise language designers on the style of keyword suited to different kinds of features.

Non-Goals

In any proposal for new elements of Java syntax, it is important to avoid being influenced by the (often strawman) syntax of language features presently in development.
It is not a goal to optimize new elements of Java syntax for ease of implementation by compiler developers.

Motivation

A keyword is a sequence of ASCII letters that cannot be used as an identifier in Java programs. Java uses a small set of keywords to denote the most fundamental features of the language:

Primitive types: boolean, byte, char, double, float, int, long, short
Reference types and their members: package, class, interface, extends, implements, throws, enum, abstract, final, native, private, protected, public, static, strictfp, synchronized, transient, void, volatile
Statements: assert, break, case, catch, continue, default, do, else, for, finally, if, import, return, switch, throw, try, while
Expressions: instanceof, new, super, this

Over time, Java language designers face a challenge: the keywords conceived for the features of Java 1.0 are rarely suitable for denoting new features. There are several obvious techniques for addressing this problem:

Eminent domain: Reclassify an identifier as a keyword, such as assert in Java 1.4 and enum in Java 1.5. A similar but more conservative move is to reclassify some unusual set of identifiers as keywords, such as identifiers that begin with two underscores (e.g., __nonnull), a style often seen in feature prototypes and inspired by reserved identifiers in C.
Overload: Reuse an existing keyword for a new feature. For example, reuse the default keyword from a switch statement to declare an annotation element and a default method. As another example, reuse the break keyword from a switch statement to yield a value as the result of a switch expression (break <value>; which unfortunately looks like break <label>;).
Distort: Find a syntax that doesn't require a new keyword, such as @interface to declare an annotation type.
Smoke and mirrors: Create the illusion of new keywords in new contexts through various linguistic heroics, such as treating the identifier var as a type name but only in local variable declarations, or reclassifying the identifier module as a keyword but only in module declarations.
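
As a small illustration of the "Overload" and "Distort" techniques, the following sketch uses only syntax that already exists in Java. It shows default in three roles (switch label, annotation element default, default method) and @interface standing in for a dedicated keyword. The names KeywordReuseDemo, Timeout, Greeter, and Impl are hypothetical and chosen only for this example.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

final class KeywordReuseDemo {
    // "Distort": @interface declares an annotation type without a new keyword.
    @Retention(RetentionPolicy.RUNTIME)
    @interface Timeout {
        // "Overload": default supplies the default value of an annotation element.
        int seconds() default 30;
    }

    interface Greeter {
        // "Overload": default marks a default method in an interface.
        default String greet() {
            return "hello";
        }
    }

    @Timeout
    static final class Impl implements Greeter {
    }

    static String describe(int n) {
        switch (n) {
            case 0:
                return "zero";
            // The original role of default: a label in a switch statement.
            default:
                return "other";
        }
    }
}
```

The same word carries three unrelated meanings here, which is the readability cost of overloading.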

For most new features, all of these techniques are on the table -- but most of the time, none are very good. Given that all of these techniques are problematic, and there is not even a least-problematic technique that works in all situations, it is desirable to try to expand the set of syntactic forms that serve as keywords. Otherwise, the lack of reasonable techniques for extending the syntax of the language will become a significant impediment to language evolution.

In addition, modifiers like static and final make up a quarter of all keywords, but the set of modifiers is not complete; there is no way to say "not static" or "not final". Consequently, it is not possible to create features where variables or classes are final by default, or members are static by default, because there is no way to denote the opt out of "not static" or "not final". Leaving a feature out of Java for reasons of simplicity is fine; leaving it out because there is no way to denote the obvious semantics is not. This is a constant problem in evolving the language, and an ongoing tax paid by every Java developer.

Description

Syntax in feature design

The best syntax for a new feature -- whether declaration, statement, or expression -- is inherently feature-specific.

Some features are denoted best with tokens other than keywords: the operator -> for a lambda expression, the separator :: for a method reference expression, and the separator ... for a varargs parameter declaration. Meanwhile, features that support built-in types tend to find their own syntactic ground independent of keywords: the literals true, false, and null, the prefix 0b for binary literals, the suffixes L, F, D, etc. for numeric literals, and the delimiter """ for text blocks.

Most features, though, are denoted best with keywords whose length, alphabet, and tone align with pre-existing keywords. That means 2-20 ASCII letters which spell out a simple noun, verb, or adjective of U.S. English. Traditionally, there were two kinds of keyword that met these constraints:

Classic keyword: A sequence of Java letters that is always tokenized as a keyword, never as an identifier.
Contextual keyword: A sequence of Java letters that is tokenized as a keyword in certain contexts but as an identifier in all other contexts (e.g. module, a restricted keyword in module declarations, since Java 9). Alternatively, a sequence of Java letters that is always tokenized as an identifier but for which special provision is made in certain contexts (e.g., var, a restricted identifier which has special meaning in local variable declarations and cannot be used as a type name, since Java 10).
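
To make the distinction concrete, the following sketch (assuming Java 10 or later) shows that module and var remain usable as variable names outside their special contexts. The class name ContextualKeywordDemo is hypothetical.

```java
final class ContextualKeywordDemo {
    static int demo() {
        // 'module' is an ordinary identifier outside a module declaration.
        int module = 1;
        // The first 'var' requests type inference; the second is a variable name.
        var var = module + 1;
        return var;
    }
}
```

By contrast, a classic keyword such as int or final could not be used as a variable name in either position.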

Each classic and contextual keyword is a unitary keyword, that is, made up of one token. This JEP opens up new syntactic ground by allowing multiple tokens to be joined together by the token -, with no white space between any of the tokens. The result is a hyphenated keyword. The token -, which is usually an operator, is pressed into service as a hyphen; this has implications for compiler engineers and for language designers, discussed later.

There are two kinds of hyphenated keyword, depending on the tokens being joined together:

Hyphenated classic keyword: A keyword that is formed by using a hyphen to join, in any order, a classic keyword with another token. The other token is either an identifier, a literal, a classic keyword, or a contextual keyword.
Hyphenated contextual keyword: A keyword that is formed by using a hyphen to join, in any order, a contextual keyword with another token. The other token is either an identifier, a literal, or a contextual keyword.

Hyphenated keywords

Hyphenation admits a rich array of phrases relevant to current and potential constructs of the Java language:

Hyphenated classic keywords (at least one classic keyword is present)

non-final (if the default for method parameters was to be made final)
break-with (to yield a value from a switch expression)
package-private (the default accessibility for class members)
public-read (to denote "publicly readable, privately writable")
enum-class and annotation-interface (versus enum and @interface)
value-class and record-class (versus value class and record)
default-value (for elements of an annotation type)
this-class (to denote the class literal for the current class)
this-return (to mark a setter or builder method as returning its receiver)
short-circuit (perhaps useful for fibers)

Hyphenated contextual keywords (no classic keywords are present)

non-null
read-only
lazy-var (to declare a lazy final field)
eventually-true (perhaps useful for lazy final fields)

A hyphenated keyword is a terminal symbol of the syntactic grammar of the Java language. This presents a challenge for the lower-level lexical grammar of the Java language, where input characters and line terminators are tokenized into keywords, identifiers, literals, operators, and separators. For example, if a hyphenated keyword starts with a classic keyword (e.g. short-circuit), then after consuming the Java letters that make up the classic keyword, the lexer must not tokenize them as a keyword and proceed to tokenize the - character and further Java letters as an operator and an identifier respectively. Instead, the lexer must look ahead and respect the hyphenation, that is, tokenize the entire sequence of Java letters, -, and more Java letters as one (hyphenated) keyword. White space cannot appear "inside" a hyphenated keyword, so if the lexer consumes any white space in its look ahead, then it can give up on detecting hyphenation and proceed with tokenization as usual.
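
The look-ahead described above can be sketched as a toy tokenizer. This is a minimal illustration under stated assumptions, not how any Java compiler is implemented: the class HyphenLexerSketch, its HYPHENATED set, and the choice to emit every non-letter character as a one-character token are all hypothetical simplifications. The keyword set is taken from the illustrative examples in this JEP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class HyphenLexerSketch {
    // Hypothetical set of hyphenated keywords; none of these exist in Java.
    static final Set<String> HYPHENATED =
            Set.of("non-final", "break-with", "short-circuit");

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            if (Character.isJavaIdentifierStart(c)) {
                int end = wordEnd(src, i);
                // Look ahead: a '-' immediately followed by another word,
                // with no white space on either side of the hyphen.
                if (end + 1 < src.length()
                        && src.charAt(end) == '-'
                        && Character.isJavaIdentifierStart(src.charAt(end + 1))) {
                    int end2 = wordEnd(src, end + 1);
                    String candidate = src.substring(i, end2);
                    if (HYPHENATED.contains(candidate)) {
                        tokens.add(candidate); // one hyphenated keyword token
                        i = end2;
                        continue;
                    }
                }
                // No hyphenation detected: tokenize the word as usual.
                tokens.add(src.substring(i, end));
                i = end;
            } else {
                // Simplification: every other character is its own token.
                tokens.add(String.valueOf(c));
                i++;
            }
        }
        return tokens;
    }

    // Returns the index just past the run of Java letters and digits at start.
    private static int wordEnd(String src, int start) {
        int j = start;
        while (j < src.length() && Character.isJavaIdentifierPart(src.charAt(j))) {
            j++;
        }
        return j;
    }
}
```

With this sketch, tokenize("non-final int x") yields the three tokens non-final, int, and x, whereas tokenize("non - final") yields non, -, and final, because white space defeats the hyphenation. An input such as a-b, which is not in the keyword set, still produces three tokens.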

A future version of this JEP may suggest hyphenation of more than two tokens (e.g. never-short-circuit, non-null-except-local), as well as notions of compounding other than hyphenation (e.g. keywords joined with + or : tokens).

Keyword management

The following policy is commended to Java language designers:

Use a hyphenated classic keyword when you want to introduce a keyword in the middle of code, at a place where an identifier may occur.
Use a hyphenated {classic, contextual} keyword when you want a keyword at a declaration site (class, field, method).
Use a unitary {classic, contextual} keyword only in the most extreme cases where no hyphenated keyword is suitable.

The following subsections provide rationale for this policy.

Avoid classic keywords

While it may be legal for language designers to define i as a keyword in a future version of Java, it would likely break every program in the world, since i is used so commonly as an identifier. (When the assert keyword was added in 1.4, it broke every testing framework.) The cost of remediating the effect of such an incompatible change varies as well: invalidating a name choice for a local variable has a local fix, but invalidating the name of a public type or an interface method might well be fatal.

Additionally, the keywords that language designers are likely to want to reclaim are often those that are popular as identifiers (e.g., value, var, method), making such fatal collisions more likely. In some cases, if the keyword candidate in question is rarely used as an identifier, designers might opt to take the source-compatibility hit -- but candidates that are unlikely to collide (e.g., usually_but_not_always_final) are probably not the keywords anyone is hoping for.

Realistically, the space of identifiers is unlikely to be a well that language designers can draw from very often to find keywords, and the bar must be very high.

As a historical note, const and goto have been keywords since Java 1.0, even though they are not used by any language feature. They were defined as keywords not because a future version of Java was expected to use them, but because doing so supported a broader goal: migration from the then-preeminent C++ to the then-fledgling Java. Per the Java Language Specification in 1996, this allowed "a Java compiler to produce better error messages if these C++ keywords incorrectly appear in programs". (Namely, if const had been an identifier, then const int x = ... would have been flagged by a Java compiler as "Error, 'const' found where a keyword was expected", which is incongruent to a C++ developer who thinks const is a keyword; by making const a keyword in Java, Java compilers were forced to recognize it and flag "Error, 'const' keyword not allowed here", which is more comprehensible to a C++ developer.) Given the vast amount of code now written in Java, and the source incompatibility of a new classic keyword, there would be no justification for eagerly defining a classic keyword to support migration from another language. For example, it would be unacceptable to reclassify function from an identifier to a keyword in order to improve error messages for code copy-pasted from ECMAScript.

Cautiously consider contextual keywords

At first glance, unitary contextual keywords (and their friends, reserved type names) appear to be a magic wand: they let language designers create the illusion of new keywords without breaking existing programs. However, the positive track record of unitary contextual keywords hides a great deal of complexity and distortion.

The process of introducing a unitary contextual keyword is not a simple matter of choosing a word and adding it to the grammar; each one requires an analysis of potential current and future interactions. Each grammar position is its own story: contextual keywords that might be used as modifiers (e.g., readonly) have different ambiguity considerations than those that might be used in code (e.g., a match expression). While a small number of special situations can be managed in a specification or a compiler, the more heavily unitary contextual keywords are used, the more likely they are to bring significant maintenance costs and longer bug tails.

Beyond specifications and compilers, unitary contextual keywords distort the language for IDEs. An IDE often has to guess whether an identifier is meant to be an identifier or a unitary contextual keyword, and it may not have enough information to make a good guess until it has seen more input. While this is easy to dismiss as “not my problem”, in reality, it results in worse code highlighting, auto-completion, and refactoring abilities for everybody. (IDEs have the same trouble with hyphenated contextual keywords too.)

Finally, each identifier that is a candidate for dual-purposing as a unitary contextual keyword may have its own special considerations. For example, the use of var as a restricted identifier is justified only because the naming conventions for type names are so broadly adhered to. Using a hyphenated contextual keyword rather than a unitary contextual keyword can sidestep these considerations, since the hyphenated phrase has never been used as an identifier, though the ambiguity issue remains.

In summary, unitary contextual keywords are a tool in the language design toolbox, but they should be used with care.

Prefer hyphenated keywords

Hyphenated {classic, contextual} keywords create less trouble than unitary contextual keywords because the lexer can tell with fixed lookahead whether A-B should become three tokens (keyword/identifier, operator, keyword/identifier) or one (hyphenated keyword), whereas arbitrary lookahead may be required to tokenize an identifier as a unitary contextual keyword. There is less trouble for parsing as well; for example, non-null cannot be confused for a subtraction expression. In sum, this gives a lot more room for creating new, less-conflicting keywords. Happily, these new keywords are likely to be good names, as many of the missing concepts that might be added to Java can fundamentally be described by their relationship to pre-existing concepts (e.g., non-null).

There is a technical constraint on the space of hyphenated keywords, because some terms of the form A-B already have semantic meaning as expressions or statements:

Expressions that use a classic keyword as their first token and may appear on the RHS of a subtraction. For example, the notional hyphenated keyword lazy-int would clash with pre-existing code that uses the expression int.class in a subtraction, as in int lazy = ...; int x = lazy-int.class.hashCode();. Similarly, the notional hyphenated keyword object-new would clash with pre-existing code that uses a new expression in a subtraction, as in int object = ...; int x = object-new Foo().f;
Statements that take an expression as an operand. For example, the notional hyphenated keyword return-never would clash with pre-existing code that returns the negation of the numeric variable never.
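
The clashes above can be checked against a current Java compiler. The following sketch compiles today and shows how each term of the form A-B is already parsed as a subtraction or negation. The class name HyphenClashDemo, its method names, and the nested class Foo with field f are hypothetical and follow the examples in the text.

```java
final class HyphenClashDemo {
    static final class Foo {
        int f = 3;
    }

    static int lazyMinusIntClass(int lazy) {
        // Parsed today as: lazy - (int.class.hashCode())
        return lazy-int.class.hashCode();
    }

    static int objectMinusNew(int object) {
        // Parsed today as: object - (new Foo().f)
        return object-new Foo().f;
    }

    static int returnNegated(int never) {
        // Parsed today as: return (-never);
        return-never;
    }
}
```

Introducing lazy-int, object-new, or return-never as hyphenated keywords would change the meaning of code like this, which is why such combinations are excluded.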

These examples show type-correct expressions and statements, but there are also type-incorrect expressions and statements that would clash with hyphenated keywords. That is, some terms of the form A-B are not semantically meaningful, but they are syntactically valid, and overloading them as hyphenated keywords would make lexing and parsing very difficult. In particular, the terms are:

Expressions that use a classic keyword as their last token. For example, consider the reference-typed expressions Foo.class, Foo.this, and Foo::new -- the subtractions Foo.class-day, Foo.this-day, and Foo::new-day are valid Java syntax when day is a numeric variable, but they are not semantically meaningful because subtraction does not accept a reference-typed expression as its left operand. Overloading the syntax by introducing a notional hyphenated keyword class-day, this-day, or new-day would be an unreasonable burden on compiler and IDE vendors.
Statements that take an expression as an operand. For example, the statement throw-quickly is valid Java syntax when quickly is a variable in scope, but it is not semantically meaningful (-quickly is not a Throwable regardless of the type of quickly). Overloading the syntax by introducing a notional hyphenated keyword throw-quickly would also roil compiler and IDE vendors.

Formally, the hyphenated classic keyword A-B would be problematic if A is one of {assert, case, class, new, return, this, throw}, or if B is one of {boolean, byte, char, double, float, int, long, new, short, super, switch, this, void}.
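
The rule above is simple enough to express as a membership test. The following is a minimal sketch that encodes the two sets exactly as listed; the class HyphenationRule and the method isProblematic are hypothetical names, and the check assumes a two-part keyword of the form A-B.

```java
import java.util.Set;

final class HyphenationRule {
    // First tokens that already begin an expression or statement taking an operand.
    static final Set<String> PROBLEMATIC_FIRST =
            Set.of("assert", "case", "class", "new", "return", "this", "throw");

    // Second tokens that can already begin the right operand of a subtraction.
    static final Set<String> PROBLEMATIC_SECOND =
            Set.of("boolean", "byte", "char", "double", "float", "int", "long",
                   "new", "short", "super", "switch", "this", "void");

    // Returns true if the candidate A-B falls under the stated rule.
    static boolean isProblematic(String candidate) {
        int dash = candidate.indexOf('-');
        if (dash < 0) {
            throw new IllegalArgumentException("not of the form A-B: " + candidate);
        }
        String a = candidate.substring(0, dash);
        String b = candidate.substring(dash + 1);
        return PROBLEMATIC_FIRST.contains(a) || PROBLEMATIC_SECOND.contains(b);
    }
}
```

For example, isProblematic("return-never") and isProblematic("lazy-int") return true, while isProblematic("non-final") and isProblematic("break-with") return false.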

Alternatives

A strategy to mitigate the cost of a new classic keyword would be to have a mechanism that allows the keyword to still be used as an identifier. This would have allowed developers in the Java 1.4 era to fix up their variables called assert so that their programs still compiled. However, any such mechanism would bring its own complexity and interactions with other features, and the idea of asking developers to revisit code in this way is undesirable. As a matter of interest, Kotlin allows a keyword to be used as an identifier by enclosing the keyword in backticks, but the goal is specifically to allow Kotlin code to use Java declarations whose names are identifiers in Java but keywords in Kotlin, such as is and when. General-purpose expansion of the Kotlin keyword space is accomplished with soft keywords, which map to unitary contextual keywords in this JEP.

Reusing the same classic keyword for different features has ample precedent in Java. For example, final is (ab)used to mean "not mutable" and "not overridable" and "not extensible". Using a pre-existing keyword in a new feature is sometimes natural and sensible, but usually it is not the first choice. Over time, as the range of demands placed on the keyword space expands, this may descend into the ridiculous; no one wants to use null final as a way of negating finality. (While one might think such things are too ridiculous to consider, there were serious-seeming suggestions during JEP 325 to use new switch to describe a switch with different semantics, presumably to be followed by new new switch in ten years.)

One way to live without making new keywords is to stop evolving Java entirely. While there are some who think this is a fine idea, doing so because of the lack of available tokens would be a silly reason. Java has a long life ahead, and Java developers are excited about new features that enable them to write more expressive and reliable code.

Risks and Assumptions

Some Java developers will have a negative reaction to the idea of hyphenated keywords, while others may accept the idea but dislike the hyphenated suggestions that emerge over time for particular language features. However, this risk is likely to diminish over time, because many such reactions are possibly-transient responses to unfamiliarity.

Java has a long tradition of declarations having default properties (e.g., package accessibility for classes, mutability for fields, and concreteness for methods) and then using keywords to modify the properties of a given declaration (e.g., public, final, abstract). Hyphenated keywords could subvert this tradition by "merging" a modifier and the declaration into a single term, such as value-class D {..} rather than value class D {..}. Similarly, a hyphenated keyword could simulate a modifier on a modifier (public-read, non-final, semi-abstract) when it may be better to find a unitary term that describes the desired concept and introduce it as a contextual or even classic keyword.