JEP draft: Type operator expressions in the JVM

Owner	John Rose
Type	Feature
Scope	JDK
Status	Draft
Component	hotspot / runtime
Created	2018/06/13 06:48
Updated	2024/09/25 17:27
Issue	8204937

DRAFT DRAFT DRAFT

Summary

Extend the space of JVM type descriptors to include type operators, which are symbolic references to factory-made types. This is a separable component of template classes.

Goals

Allow JVM type descriptors (for methods, fields, and constants) to make new distinctions between types not already present in the system of classes, primitives, and arrays. Support future translation strategies which must make distinctions between different usages of the same basic JVM type, or which must provide a way to specify factory input to a class factory or template species factory.

Non-Goals

This work is a low-level VM hook, like invokedynamic, not a language feature like lambdas. As such, it will not propose any specific mechanism for representing parameterized types; it will only provide a necessary "hook" to name such types. It will not provide a new way to define classes; it will only provide a way to associate such classes with a public symbolic descriptor. It will not define any language features, nor translation strategies. It will not attempt to extend, conflict with, or rationalize the current syntax for static generic signatures (JVMS 4.7.9.1).

Success Metrics

Experimental translation strategies can be created which distinguish List<Integer> from List<String> in classfiles. Experimental class templating mechanisms will be able to create species that are denotable from JVM type descriptors. Designers of language features and translation strategies will be able to vary the encodings of new source-level types by changing a bootstrap method, rather than changing the JVM's core logic. Security proofs will be easier to construct, given the black-box nature of type operators, decoupled from the complex details of templates and other advanced language features. Experimental migration strategies can be tested without fully instantiating new language features, since new place-holder types can easily be posited by simple changes in javac.

Motivation

Descriptors which can denote complex type instances, such as List<int> or List<ComplexDouble>, are a necessary component of "reified generics", which in turn are a goal of Project Valhalla. If a value type is to "code like a class, work like an int", then it seems necessary to be able to denote container types which are customized to that value type, rather than being erased to Object like a reference type.

Description

We will extend the JVM's fundamental syntax for field descriptors, once for all future type schemes (we hope!). The syntax will allow any single type descriptor to be modified by an optional suffix, which has the effect of constraining the original type descriptor in an ad hoc, programmable manner. The combination of the original type descriptor and the suffix is called a type operator expression.

The resolvable semantic elements of this expression are:

carrier type: the original type descriptor (before the suffix)
type operator name: a resolved class name and/or simple identifier
type arguments: one or more type descriptors and/or other constants

All of the above semantic elements are optional; any may be omitted. If the type operator name is omitted, it will be derived from the carrier type, as in the case of a template class whose top type is the unspecialized class itself. If the carrier type is omitted, it is defined to be Object, the customary carrier for untyped values in the JVM.

For example, here are some potential use cases for type operator expressions:

reified generics: The carrier type is Map, the type operator name is omitted, and the arguments are int and String. The whole expression denotes Map<int,String>.
wildcards: The carrier type is List, the type operator name is omitted, and the argument is the symbol (not type) ?. The whole expression denotes List<?>, as distinct from raw List. Given that wildcards are a special case of a concept called "existential types", it is notable that type operator expressions provide a way to wrap any bounded type (a carrier type) inside a symbolically labeled existential type.
non-nullable references: The carrier type is String and the type operator is ! (or java/lang/NotNull) with no arguments. The whole expression denotes String!, a non-nullable string reference.
nullable values: The omitted carrier type defaults to Object and the type operator is ? (or java/lang/Nullable) with one argument int. The whole expression denotes int?, a nullable integer.
reified intersections: The carrier type is some interface I and the type operator is & with a type argument J. The whole expression denotes the intersection type I&J.
reified unions: The omitted carrier type defaults to Object and the type operator name is | with two or more type arguments I, J. The whole expression denotes the union type I|J.
fixed-sized arrays: The carrier type is an array type double[] and the type operator name is Array.length with one argument 5. The whole expression denotes double[5], a length-constrained array.
range constraints: The carrier type is a primitive type int and the type operator name is Integer.interval with arguments ge and 0. The whole expression denotes int constrained to non-negative values.
null and notreached type tokens: The omitted carrier type defaults to Object and the type operator is java/lang/Null or java/lang/NotReached. The whole expression denotes a reference constrained to be null, or a reference that is never delivered to its consumer (i.e., the constraint always fails).

The concrete grammar for such descriptors, including new productions, will be something like the following:

MethodType: '(' (FieldType)* ')' (FieldType | 'V')
FieldType: PrimitiveType | ArrayType | ObjectType | *TypeExpr
PrimitiveType: 'B' | 'C' | 'D' | 'F' | 'I' | 'J' | 'S' | 'Z'
ArrayType: '[' (PrimitiveType | ArrayType | ObjectType)
ObjectType: 'L' ClassName ';'
*TypeExpr: TypeCarrier '/' (TypeOpName)? (';' | '[' (TypeArg)+ ']' )
*TypeCarrier: FieldType | 'L'
*TypeOpName: '$' Identifier | ('L' ClassName) (';' '$' Identifier)?
*TypeArg: FieldType | MethodType | NameArg | NumberArg
*NumberArg: ('-')? DigitNotZero (Digit)* ';' | '0' ';'
*NameArg: '$' Identifier ';'
Identifier: (any character except '.' ';' '[' '/' '<' '>' ':')*

This grammar is built on a slightly edited form of the one in JVMS 4.3. The new productions which support type operators are TypeExpr, TypeCarrier, TypeOpName, TypeArg, NumberArg, and NameArg. (They are starred.) The production for Identifier is taken from JVMS 4.7.9.1.
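Because TypeCarrier refers back to FieldType, the grammar is left-recursive; one way to read it is as a base type followed by zero or more operator suffixes. The following is a minimal sketch of a recognizer written that way. The class TypeExprSyntax and its method names are hypothetical and not part of any JDK API. The sketch also makes some simplifying assumptions that the grammar does not state: class names and identifiers must be non-empty, a class name runs to the next ';' or '[', and the optional ';' '$' Identifier member selector is matched greedily.

```java
/** Hypothetical recognizer for the draft descriptor grammar; not a JDK API. */
final class TypeExprSyntax {
    private final String s;
    private int pos;

    private TypeExprSyntax(String s) { this.s = s; }

    /** True if the whole string is a FieldType, or a MethodType when it starts with '('. */
    static boolean parses(String s) {
        TypeExprSyntax p = new TypeExprSyntax(s);
        try {
            if (p.peek() == '(') p.methodType(); else p.fieldType();
            return p.pos == s.length();
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    private char peek() { return pos < s.length() ? s.charAt(pos) : '\0'; }

    private void expect(char c) {
        if (peek() != c) throw new IllegalArgumentException("expected '" + c + "' at " + pos);
        pos++;
    }

    // FieldType with the left recursion unrolled: a base type, then any number of suffixes.
    private void fieldType() {
        char c = peek();
        if (c == '[') {
            arrayType();
        } else if (c == 'L') {
            pos++;
            if (peek() != '/') { className(); expect(';'); }  // a bare 'L' is the omitted carrier
        } else if ("BCDFIJSZ".indexOf(c) >= 0) {
            pos++;
        } else {
            throw new IllegalArgumentException("bad type at " + pos);
        }
        while (peek() == '/') { pos++; suffix(); }
    }

    // ArrayType: '[' (PrimitiveType | ArrayType | ObjectType); no TypeExpr components.
    private void arrayType() {
        expect('[');
        char c = peek();
        if (c == '[') arrayType();
        else if (c == 'L') { pos++; className(); expect(';'); }
        else if ("BCDFIJSZ".indexOf(c) >= 0) pos++;
        else throw new IllegalArgumentException("bad array component at " + pos);
    }

    // The part of a TypeExpr after '/': (TypeOpName)? (';' | '[' (TypeArg)+ ']')
    private void suffix() {
        if (peek() == '$') {
            pos++; identifier();
        } else if (peek() == 'L') {
            pos++; className();
            if (s.startsWith(";$", pos)) { pos += 2; identifier(); }  // greedy member selector
        }
        if (peek() == ';') { pos++; return; }
        expect('[');
        do { typeArg(); } while (pos < s.length() && peek() != ']');
        expect(']');
    }

    private void typeArg() {
        char c = peek();
        if (c == '$') { pos++; identifier(); expect(';'); }        // NameArg
        else if (c == '-' || (c >= '0' && c <= '9')) numberArg();  // NumberArg
        else if (c == '(') methodType();
        else fieldType();
    }

    // NumberArg: ('-')? DigitNotZero (Digit)* ';' | '0' ';'
    private void numberArg() {
        boolean negative = peek() == '-';
        if (negative) pos++;
        if (peek() == '0' && !negative) pos++;
        else if (peek() >= '1' && peek() <= '9') { while (peek() >= '0' && peek() <= '9') pos++; }
        else throw new IllegalArgumentException("bad number at " + pos);
        expect(';');
    }

    private void methodType() {
        expect('(');
        while (pos < s.length() && peek() != ')') fieldType();
        expect(')');
        if (peek() == 'V') pos++; else fieldType();
    }

    private void className() { run(";[."); }
    private void identifier() { run(".;[/<>:"); }

    // Consumes a non-empty run of characters that are not in the stop set.
    private void run(String stop) {
        int start = pos;
        while (pos < s.length() && stop.indexOf(s.charAt(pos)) < 0) pos++;
        if (pos == start) throw new IllegalArgumentException("empty name at " + pos);
    }
}
```

For instance, TypeExprSyntax.parses("I/$interval[$ge;0;]") accepts a constrained int, while an empty argument list such as "Ljava/util/List;/[]" is rejected because the TypeArg list inside brackets must be non-empty.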

A TypeExpr denotes a fresh type which is treated by the JVM as distinct from any other type with a different descriptor string, including primitives, arrays, classes, and other TypeExprs.

The syntactic components of a TypeExpr are a TypeCarrier, a TypeOpName, and a sequence of zero or more TypeArgs. These denote the resolvable semantic components of a resolved type operator expression, which are respectively the carrier type, the type operator name, and the type arguments.

Two TypeExprs with exactly the same spelling denote the same type. Any FieldType which is a proper prefix of another FieldType is a proper supertype of the longer FieldType. Other than those relations, the JVM does not recognize any equivalences or relations between types with differently spelled TypeExprs.

In particular, the verifier treats every distinct type operator expression as a generic "black box" type, which starts with the carrier type and constrains it in some way, unknowable to the verifier.

Thus, the verifier will allow values of the type operator type to implicitly convert to its carrier type, or any supertypes of its carrier type, but it will not allow such values to be converted to any other type. Also, the verifier will not convert implicitly from a carrier type to a type operator type built on top of that carrier type; such conversions must be performed by explicit bytecode execution.
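The spelling-based rules above can be sketched as plain string operations. This is only an illustration, and the class VerifierTypeRelation and its method names are hypothetical. It assumes both inputs are already well-formed FieldType strings, takes the prefix rule literally, and ignores the ordinary class hierarchy of the carrier (so it does not model conversion to supertypes of the carrier type).

```java
/** Hypothetical illustration of the spelling-based type relations; not a JDK API. */
final class VerifierTypeRelation {

    /** Two descriptors denote the same type exactly when they are spelled the same. */
    static boolean sameType(String a, String b) {
        return a.equals(b);
    }

    /** A FieldType that is a proper prefix of another FieldType is its proper supertype. */
    static boolean isProperSupertype(String sup, String sub) {
        return sub.length() > sup.length() && sub.startsWith(sup);
    }

    /**
     * Implicit conversion goes only from a type operator type toward its carrier,
     * never from a carrier to a type operator type built on top of it.
     */
    static boolean implicitlyConvertible(String from, String to) {
        return sameType(from, to) || isProperSupertype(to, from);
    }
}
```

Under these assumptions, I/$N; converts implicitly to I but not the other way around, and I/$J;/$K; is unrelated to I/$K;/$J; because neither spelling is a prefix of the other.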

Here are some syntax examples of descriptors containing type operator expressions (along with some hypothetical meanings):

Ljava/util/Map;/[ID] (the type species Map<int,double>)
Ljava/util/List;/[I] (the type species List<int>)
Ljava/util/List;/[[I] (the type species List<int[]>)
Ljava/util/List;/[$wild;] (the wildcard species of List)
Ljava/util/List;/$Wild; (wildcard alternate spelling)
Ljava/util/List;/$Wild[Ljava/lang/Object;] (wildcard alternate spelling)
[D/$length[5;] (fixed-sized array double[5], not array of length-5 double)
I/$interval[$ge;0;] (int whose value is non-negative)
L/Ljava/util/TupleTemplate[Ljava/lang/String;I] (a pair of String and int)
L/Ljava/util/TupleTemplate[FFF] (a triple of floats)
Ljava/lang/String;/$N; (the N variant of String)
(Ljava/lang/String;)Ljava/lang/String;/$N; (method wrapping an N-String)
(Ljava/lang/String;/$N;)Ljava/lang/String; (method unwrapping an N-String)
L/; (shortest possible expression, a trivially constrained Object)
LFoo;/; (carrier type only, with trivial modification)
L/$N; (type operator only, N-Object)
L/[$Arg;] (lone argument with no type operator: no hypothetical meaning)
L/LFoo[LBar;/$N;] (the type species Foo<N-Bar>)
L/LFoo[LBar;]/$N; (the type species N-Foo<Bar>)
[D/$length[5;]/$N; (N variant of fixed-sized array double[5])
[D/$N;/$length[5;] (fixed-sized variant of N-variant of double[])

The last four examples show that type operator expressions can nest. For example, L/LFoo[LBar;/$N;] denotes a type which is derived first from Bar by modifying it with N, then passing the modified type to the parameterized type constructor Foo. (The carrier type of the result is Object, not Foo.) The last two examples show that type expressions can nest by piling up several TypeOp suffixes. The order of these suffixes is significant purely because the descriptor strings are different: I/$J;/$K; is a different verifier type from I/$K;/$J; even if the computational effects of the J and K type modifiers happen to commute.

The JVM will accept type operator expressions, structured as TypeExpr strings, in the following contexts:

class field types -- allocated as "black box" references
class method types -- treated as "black box" arguments and returns
CONSTANT_NameAndType types -- resolvable black box types (field or method)
CONSTANT_Class names -- resolvable type expressions with programmed resolution
instanceof operands (via CONSTANT_Class) -- resolvable types with programmed behavior
checkcast, anewarray operands -- similar to instanceof
invokevirtual receivers (via CONSTANT_Methodref) -- programmably resolved methods
getstatic, putfield, etc. -- similar to invokevirtual

Normally, descriptor syntaxes are disjoint from the syntax of class names that appear with CONSTANT_Class constants. For example, the descriptor I is very different from the class name I. However, in some cases the syntaxes can overlap; the class name of an array is the same as its descriptor, including the trailing semicolon. We use this trick with type operator expressions also, so that the same type operator expression can be inserted directly into a descriptor, and also used as a class name.

A class name string can be unambiguously distinguished as a type operator expression in three steps:

check if the last character is ] or ';' (otherwise, fail)
if the string begins with [, parse the array type name and look for a following /
otherwise, scan the string to see if the character ; or [ occurs

If the first step and any of the remaining steps pass, then the class name string is proven not to be a plain class name or an array class name, and may be assumed to be a type operator expression (or else an erroneous input). Otherwise it can be assumed to be a plain class name (or array class name). Another simpler technique (though perhaps a slower one) is simply to parse the class name string as a simple class or array name, and see if the end is reached, or else the next remaining character is slash / introducing a type operator suffix; in that case the second step must be executed first.

The second and third steps are more expensive than the first, but they can be deferred until the cheap first step has passed. Note that the JVMS specifies that a class name may not contain an open bracket [ unless it is an array type name, and in that case the bracket will not follow a package separator /. Therefore the class name grammar is not ambiguous, even after type operator expressions are added.
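The three-step check described above might look roughly like the following. This is a sketch that mirrors the steps as written; the class ClassNameKind and its method name are hypothetical, and the check does not validate that the string is a well-formed type operator expression, only that it cannot be a plain class name or array class name.

```java
/** Hypothetical classifier for CONSTANT_Class name strings; not a JDK API. */
final class ClassNameKind {

    /** True if the name cannot be a plain or array class name, per the three steps. */
    static boolean isTypeOperatorExpression(String name) {
        if (name.isEmpty()) return false;

        // Step 1 (cheap): the last character must be ']' or ';'.
        char last = name.charAt(name.length() - 1);
        if (last != ']' && last != ';') return false;

        // Step 2: for names starting with '[', skip the array type name
        // and look for a '/' immediately after it.
        if (name.charAt(0) == '[') {
            int i = 0;
            while (i < name.length() && name.charAt(i) == '[') i++;
            if (i < name.length() && name.charAt(i) == 'L') {
                i = name.indexOf(';', i);      // end of the element class name
                if (i < 0) return false;
            }
            i++;                               // step past the primitive letter or ';'
            return i < name.length() && name.charAt(i) == '/';
        }

        // Step 3: otherwise, a ';' or '[' anywhere rules out a plain class name.
        return name.indexOf(';') >= 0 || name.indexOf('[') >= 0;
    }
}
```

With this sketch, java/lang/String and [Ljava/lang/String; are classified as ordinary names, while Ljava/util/List;/[I] and [D/$length[5;] are classified as type operator expressions.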

Some operations on a type expression require access to the inside of the black box. These include loading a reflective constant for a type expression, making a type test (checkcast), making an array type whose component is the type expression, calling a method on an instance whose verified type is a type expression, etc.

The built-in resolution mechanism for type operator expressions will perform the following jobs:

Derive a bootstrap method ("BSM") from the TypeCarrier and TypeOpName.
Call the BSM on the TypeArgs, suitably parsed and reified.
(Also pass relevant context, such as the current class, the carrier, and the operator name.)
Receive in reply from the BSM a resolved type descriptor for the type.
Permanently and atomically record that descriptor for that exact type expression.
Use the descriptor to derive the various behaviors required for that type.

The details of these steps and the associated APIs are defined elsewhere, and may be extended over time. See below for a sketch of resolved type descriptors and their behavior. Type operators are named by an optional class and optional identifier. If the class is present, it will help determine the bootstrap method; for example, if it is a template, the template will be specialized to the given arguments. If the identifier only is present, the BSM will be a centralized one which assigns fixed standard meanings to a small number of names.
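The "permanently and atomically record" step can be illustrated with a small cache keyed by the exact spelling of the type expression. This is a sketch under stated assumptions, not the JVM's mechanism: the class TypeExprResolver is hypothetical, the bootstrap method is modeled as a plain Function from the descriptor string, and the real scoping of the record (per class, per loader, or global) is left open by this draft.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical resolution cache; D stands in for a resolved type descriptor. */
final class TypeExprResolver<D> {
    private final ConcurrentMap<String, D> resolved = new ConcurrentHashMap<>();
    private final Function<String, D> bootstrap;   // stand-in for the derived BSM

    TypeExprResolver(Function<String, D> bootstrap) {
        this.bootstrap = bootstrap;
    }

    /**
     * Resolves a type expression at most once per spelling. computeIfAbsent
     * runs the bootstrap function atomically for a given key, so racing
     * threads all observe the same recorded result.
     */
    D resolve(String typeExpr) {
        return resolved.computeIfAbsent(typeExpr, bootstrap);
    }
}
```

Resolving the same spelling twice returns the identical descriptor object and calls the bootstrap function only once; a differently spelled expression is resolved independently.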

When value types become available, type operator expressions will also be allowed to interoperate with value types. A given type operator expression will always be unambiguously assigned a kind, as a value or a reference. If other kinds are invented, type operator expressions will be "kinded" in the same way. For example, the '$' could be followed by a kind character, or additional characters besides '$' could be assigned to introduce type operator expressions of various distinct kinds.

The descriptor will not be a Class but will have its own reflective type and API. The descriptor will report a concrete carrier Class which is compatible with all values described by the original type operator expression. The BSM for a type operator may return a resolved type descriptor which reports only Object as its carrier class, or it may spin and load a new anonymous class, and use that. In either case, the JVM will be able to use the carrier class as a safe supertype for the type operator expression. The JVM will not freely convert from the carrier class to the type operator type, except via a checkcast bytecode, whose behavior is under the control of the resolved type descriptor selected by the BSM.
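To illustrate how a checkcast could be placed under the control of a resolved descriptor, here is a small sketch of a descriptor-like object that reports a carrier class and applies an extra constraint when casting. The class ConstrainedType is a hypothetical stand-in invented for this illustration; it is not the resolved type descriptor API, and modeling the constraint as a Predicate is an assumption.

```java
import java.util.function.Predicate;

/** Hypothetical stand-in for a resolved type descriptor with a programmable cast. */
final class ConstrainedType<C> {
    private final Class<C> carrier;
    private final Predicate<C> constraint;

    ConstrainedType(Class<C> carrier, Predicate<C> constraint) {
        this.carrier = carrier;
        this.constraint = constraint;
    }

    /** The safe supertype that the JVM could use for values of this type. */
    Class<C> carrierClass() {
        return carrier;
    }

    /** A value belongs to the type if it fits the carrier and passes the constraint. */
    boolean isInstance(Object x) {
        // Class.isInstance(null) is false, so null never passes.
        return carrier.isInstance(x) && constraint.test(carrier.cast(x));
    }

    /** Models the explicit conversion from carrier to type operator type. */
    C cast(Object x) {
        if (!isInstance(x)) {
            throw new ClassCastException("value does not satisfy the type operator constraint");
        }
        return carrier.cast(x);
    }
}
```

For example, new ConstrainedType<>(Integer.class, i -> i >= 0) models a non-negative integer whose carrier is Integer: casting 5 succeeds, while casting -1 fails with ClassCastException. Going the other way, from the constrained type to its carrier, needs no check at all.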

Note that the type operator expression language is self-contained and pre-normalized. It does not make references into any constant pool, nor is there any "calculus" for proving that two distinct type expressions denote the same type.

The API for resolved type descriptors will be something like this:

interface ResolvedTypeDescriptor<T extends C, C> {
  Class<T> resolvedType();
  Class<C> carrierClass();
  static <T> ResolvedTypeDescriptor<T,?> of(Class<T> clazz);

  // These defaults may be wired into the JVM bytecodes if desired.
  default boolean isInstance(C x) {
    if (this != of(carrierClass()))  throw subclassResponsibility();
    return carrierClass().isInstance(x);
  }
  default boolean isAssignableFrom(ResolvedTypeDescriptor<?,?> subDesc) {
    if (this != of(carrierClass()))  throw subclassResponsibility();
    return carrierClass().isAssignableFrom(subDesc.resolvedType());
  }
  default T cast(C x) {
    if (this != of(carrierClass()))  throw subclassResponsibility();
    return carrierClass().cast(x);
  }
  default T newArray(int length) {
    if (this != of(carrierClass()))  throw subclassResponsibility();
    return java.lang.reflect.Array.newInstance(carrierClass().getComponentType(), length);
  }
  default MethodHandle findVirtual(Lookup lookup, String name, MethodTypeDescriptor type) {
    if (this != of(carrierClass()))  throw subclassResponsibility();
    return lookup.findVirtual(carrierClass(), name, type.asMethodType());
  }
  private static RuntimeException subclassResponsibility() {
    // Callers write "throw subclassResponsibility()", so build and return the exception.
    return new IllegalArgumentException();
  }
  /**
   * Initial entry point called from the VM when a type operator
   * expression must be resolved.
   */
  static <C> ResolvedTypeDescriptor<?,C> initialMetafactory(
    Lookup lookup, TypeDescriptorBootstrapCallInfo<C> bci
  ) throws BootstrapMethodError {
    String descriptor = bci.invocationName();
    Class<C> carrierType = bci.invocationType();
    Class<?> typeOpClass = bci.typeOperatorClass();
    String typeOpName = bci.typeOperatorName();
    List<Object> typeOpArgs = bci.asList();
    ...
  }
}

It is an open question whether any of the ResolvedTypeDescriptor API should be merged into the Class API. That decision could create a set of secondary "crasses" (runtime type quasi-classes) which do not directly represent a classfile, but instead represent a type somehow derived from or related to one or more classfiles. There is some precedent for this, since the existing Class instances for primitives and void, and for arrays, may be viewed as "crasses". In that case, the carrierClass API would probably be named getPrimaryClass, and would map a "crass" to its nearest proper supertype (or Object or an interface), and there would be a new query isTypeExpression.

Keeping the ResolvedTypeDescriptor API disjoint from the legacy Class API would be cleaner, but would also require us to duplicate or extend many APIs, such as Lookup, in which Class is a proxy for a JVM type descriptor. An interface TypeDescriptor (proposed by the Constable project) may give us a hook to generify those APIs, rather than brutally duplicating them, and without introducing "crasses".

Alternatives

This design can be viewed as a refinement of an earlier experimental mechanism called "class-dynamic", which decoded a sub-language from class name strings and spun classfiles on the fly in response to resolution requests. That mechanism funneled the type operator expression through the class name, which is similar to the above design, but makes no distinction between a regular class reference and a type operator expression.

The integration of type operators into the JVM seems to be cleaner if the distinction between regular named classes and type expressions is explicit from the beginning. In addition, we do not want to commit to spinning classfiles in response to type operators; some use cases of type operators intentionally alias regular classes, but with some extra "annotation" payload injected. This cannot be done in a framework which confuses class names with type expressions.

When we design template classes, we could attempt to add a purpose-built descriptor syntax designed expressly for templates. However, a design like the one in this JEP would be needed anyway.

We could try to live without reified generics altogether, in which case the existing type descriptors would be serviceable.

Testing

// What kinds of test development and execution will be required in order // to validate this enhancement, beyond the usual mandatory unit tests?

Risks and Assumptions

// Describe any risks or assumptions that must be considered along with // this proposal.

Dependencies

// Describe all dependencies that this JEP has on other JEPs, JBS issues, // components, products, or anything else.

Design FAQ

DRAFT DRAFT DRAFT The following section will be part of the comments, not the JEP proper.

You didn't use dot . for type operator syntax; why not? Because in some pathways, descriptors flow through class names, and slashes are converted to dots and vice versa. Any distinction between slash and dot would be lost at that point, without complicated context-sensitive rules for dot-preservation or dot-recovery.
That grammar is complicated: Everything seems optional. Why not get rid of some optionality? Briefly, each optionality is motivated as follows. The TypeCarrier could be removed in favor of making it always Object, but many use cases for type operators work within a static bound type, and it is wasteful not to allow that static bound to appear as a true verifier type. Given a TypeCarrier, it makes sense that the actual type operator should sometimes be derived directly from the carrier and other times be a separately specified parameter, hence the optionality of the TypeOpName. But if the TypeOpName is unrelated to the carrier type, the carrier is often Object, hence a special abbreviation for that common case that makes the TypeCarrier optional. So the carrier can be either identical with the type operator, or completely separate. The argument list is optional since some type operators inherently require arguments while some are "just the mode" (as with "not null"). The trailing semicolon ; for a missing TypeArg list is a judgment call; it could be denoted instead by [], but that seems egregiously noisy for a simple modifier like "not null", and requiring a non-empty TypeArg list in the brackets adds trivial complexity.
That grammar is complicated: Why are there different ways to denote a type operator? Dropping the TypeOpName allows the carrier and the type operator to come from the same class, as noted above, while allowing the type operator to be a fully resolved class name gives obvious modularity benefits. In the latter case allowing an additional name to select a class member gives a way for one class to expose a library of type operators. The final case, of a simple identifier, allows either the carrier type to select a class member (or "mode" argument such as "wildcarded"), or the system to globally define a handful of type operators outside of the package scoping system: ! (for "not null") and ? (for "maybe null") are two such likely global operators.
That grammar is complicated: You allow too many kinds of type operator arguments. Why not just have types as parameters? Type arguments are clearly all you need to upgrade today's generics in place, to reify their types inside of descriptors. But this is short-sighted, since C++ templates allow many other kinds of arguments. The grammar chosen above allows specification of a reasonable array of non-type arguments corresponding to common use cases of template arguments in C++ and other languages. After types, strings are the obvious next candidate, and indeed strings can denote anything else we need, and are agreeably fundamental in the JVM. We threw in MethodType because that is a fundamental construct in the JVM, and shouldn't be passed through a stringy encoding channel. We threw in NumberArg because small integral numbers are fundamental in various use cases, such as definite arrays. All of the above correspond to natively encoded constant pool entries (except integers which are larger than a long).
You forgot MethodHandle and Double arguments; aren't those fundamental also? Yes, they are, but they can be readily encoded to bootstrap methods using combinations of the other argument types, and designing a hardwired stringy encoding for them would be needlessly complex. For a method handle, just pass several arguments denoting its class, name, and type, with maybe a ref-kind also. For a floating point number, consider using a string containing its hex-float representation, to avoid problems with rounding and ambiguity.
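The hex-float suggestion can be illustrated with the standard Double.toHexString and Double.parseDouble methods, which round-trip a double exactly. The wrapper class HexFloatArg is hypothetical. Note one caveat the sketch does not solve: the hex form contains a '.' character, which the Identifier production excludes, so a real encoding would presumably also need some escaping of that character.

```java
/** Hypothetical helper showing an exact string round trip for double arguments. */
final class HexFloatArg {

    /** Encodes a double as a hex-float string, e.g. 1.0 becomes "0x1.0p0". */
    static String encode(double value) {
        // The result contains '.', which is not legal in an Identifier as-is.
        return Double.toHexString(value);
    }

    /** Decodes a hex-float string back to the identical double value. */
    static double decode(String arg) {
        return Double.parseDouble(arg);
    }
}
```

Because the hex form spells out the binary significand and exponent, decode(encode(x)) yields the same bits as x for every finite value, including cases like 0.1 that have no exact decimal form.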
Those identifier strings are useless without a way to quote the illegal characters; why not have strings with proper quoting? The limitations on TypeArg strings are the same as those on class names, and there are standard systems (such as the "Symbolic Freedom" encoding) for representing the handful of illegal characters using escape sequences. Bootstrap methods which need general strings should use such a scheme. This is much easier than somehow telling the JVM it must start allowing hitherto "dangerous characters" in small parts of the descriptor grammar.
Constrained primitive types, seriously? An earlier version of the grammar assumed that the only carrier type was Object, allowing the "head" of the type operator expression to be a type operator name (such as a template class). This had two major downsides: First, it didn't capture the fact that a template might well be the supertype of all its instances; this is certainly true for containers like List<int>; throwing away that type bound means more checkcast bytecodes to restore it in method code, which seems a sorry waste. Second, the L descriptor letter might be augmented (at some point) by additional classy descriptors (such as the Q descriptor of the "minimal value type" prototype). Allowing carrier types to be any pre-existing verifier types seems prudent. Given that, the primitive types and arrays come in pretty much "for free", although it would be reasonable to disallow constrained primitives if that turns out to be hard to implement, and add them in later when primitives are unified more fully with other types.
Why doesn't the ArrayType production mention FieldType any more? The array type syntax is our sole legacy syntax that is similar to a type operator. From a prior component type it creates a complex new array object type. We don't want to pretend that there is a way to customize that array object type by adding arbitrary "tweaks" to its component type -- it is hard enough to manage constrained scalar types without cutting them into the "guts" of the JVM's built-in array object mechanism. We take the simpler choice of allowing array instances to be constrained without asking questions about what is inside them. When arrays are virtualized (made instances of interfaces) then we can fully nest constraints within array component types, but not until then.