JEP draft: enhanced checkcast for Valhalla type unification

Owner	John Rose
Type	Feature
Scope	JDK
Status	Draft
Created	2022/11/18 02:39
Updated	2025/05/28 05:54
Issue	8297236

Motivation

Invisible conversions everywhere dense in your classfile...

In the Java language, the Valhalla project unifies the type system so that all values (except possibly null) are assigned subclasses of Object, notably int and double and other legacy primitive types (also known as the non-reference types in Java before Valhalla). This unification is done by more closely identifying, as "the same thing", both the boxed and the native forms of any given primitive value. This allows the legacy primitives to work in generic classes and methods (with suitably permissive type variable bounds).

But the JVM knows different: The native form of a double occupies 64 bits and two stack slots, under the descriptor "D", while the boxed form stores the 64-bit value in one or more buffer objects in the heap, and its reference occupies just one stack slot, under the descriptor "Ljava/lang/Double;". There is no way, within the JVM, to retroactively unify those two representations of "the same" double value object.

Instead, the Java static compiler (e.g., javac) needs to juggle at least two representations for double values, the two-slot native and the one-slot boxed reference. It must supply the correct format to each operation: If the operation is native arithmetic (like the dadd bytecode), the value must be in the native "D" format, and if the value is passed to generic code or stored into an array of Object or Number, it must be in the boxed reference format.

The Java static compiler has historically done such juggling in order to implement language rules for implicit auto-boxing and auto-unboxing. To do this, it makes implicit calls (not seen in the source code) to well-defined API points such as Double::valueOf (for boxing) and Double::doubleValue (for unboxing).
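As an illustration of those implicit calls, here is a minimal sketch in today's Java. The class and method names (BoxingDesugarDemo, sumImplicit, sumExplicit) are made up for this example; the second method spells out, approximately, the calls that javac inserts for the first.

```java
// Illustrative sketch only; names are hypothetical, not part of any proposal.
final class BoxingDesugarDemo {
    // Source-level form: the conversions are invisible.
    static double sumImplicit(Double a, double b) {
        Object boxed = b;              // implicit boxing
        double x = a;                  // implicit unboxing
        return x + (Double) boxed;     // cast, then implicit unboxing
    }

    // Roughly what javac emits: the same conversions as explicit calls.
    static double sumExplicit(Double a, double b) {
        Object boxed = Double.valueOf(b);             // boxing call
        double x = a.doubleValue();                   // unboxing call
        return x + ((Double) boxed).doubleValue();    // checkcast + unboxing call
    }

    public static void main(String[] args) {
        System.out.println(sumImplicit(1.5, 2.0));
        System.out.println(sumExplicit(1.5, 2.0));
    }
}
```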

Note also that the static compiler often implicitly changes the reference type of values around the boundaries of (erased) Java generic APIs, for example quietly casting the result of a call to List<String>::get from Object (the erased type variable bound, returned from the get method) to String (the type known at compile time). To do that task, it emits implicit uses of the checkcast bytecode.

In the case of a generic API point like List<double>::get, the Object value returned from the method must be implicitly retyped as double in two operational steps: It is first cast to Double as a reference, and then unboxed to a native two-slot double with Double::doubleValue.
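The two steps can be sketched in today's Java, with the caveat that List<double> is not expressible yet, so this example assumes List<Double> as a stand-in. The class and method names are hypothetical.

```java
import java.util.List;

// Illustrative sketch only; List<Double> stands in for List<double>.
final class GenericUnboxDemo {
    // Source-level form: javac inserts the cast and the unboxing call.
    static double firstImplicit(List<Double> list) {
        return list.get(0);
    }

    // The same thing with both steps written out against the erased API.
    @SuppressWarnings("rawtypes")
    static double firstExplicit(List list) {
        Object erased = list.get(0);     // erased return type is Object
        Double box = (Double) erased;    // step 1: checkcast to Double
        return box.doubleValue();        // step 2: unbox to a two-slot double
    }

    public static void main(String[] args) {
        System.out.println(firstImplicit(List.of(2.5)));
        System.out.println(firstExplicit(List.of(2.5)));
    }
}
```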

For Valhalla user-defined primitives, the representations of boxed and unboxed values are the same, but there are still going to be implicit changes in their types, and those changes will (probably, in the unboxing direction only) be reflected by operational null-checks, using Objects::requireNonNull or the equivalent.

With Valhalla, we expect that the frequency (or density) of such conversions and checks may increase in some codes, as users enjoy freedom from worry about whether their values are boxed or not. But the JVM will have to worry all the more, especially for legacy primitives. Also, the static compiler will have to send the right guidance to it, in the form of implicit bytecode instructions to manage the implicit boxing and unboxing.

Valhalla does not plan to enhance the verifier type system beyond where it is today. In particular, we do not plan to propagate the results of null checks in the verifier type system. This means that if javac forgets to put in a call to Objects::requireNonNull, nulls will be checked later if at all, when a variable is reached that is positively null-rejecting. (The argument to Double::doubleValue is null-rejecting in that way, since a null receiver elicits a NullPointerException.)

In summary, this means that the following operations will effectively be used as virtual machine instructions for managing low-level type changes in code generated by javac:

Double::doubleValue -- for unboxing Double to double
Double::valueOf -- for boxing double to Double
<Primtype>::<primtype>Value -- likewise for unboxing any legacy primitive type.
<Primtype>::valueOf -- likewise for boxing any legacy primitive type.
Objects::requireNonNull -- for unboxing any user-defined primitive type
(no code) -- for boxing any user-defined primitive type (verifier sees no type change)

In addition, calls to requireNonNull will, in many cases, need to be followed by a checkcast to reassure the verifier that what came out of the method is the same type (in fact, the same reference) as what went in. (This effect is not visible at the language level, since requireNonNull generically returns its input type. But the JVM requires a checkcast here.)
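A minimal sketch of this effect, using String as a stand-in for a user-defined primitive type (an assumption made so the example compiles today). The class and method names are hypothetical. The second method writes out the erased call and the cleanup cast that javac would otherwise insert silently.

```java
import java.util.Objects;

// Illustrative sketch only; String stands in for a user-defined primitive.
final class NullCheckCastDemo {
    // Source-level form: the generic return type hides the cast.
    static String checkedImplicit(String s) {
        return Objects.requireNonNull(s);
    }

    // Erased view: the method returns Object, so a checkcast must follow.
    static String checkedExplicit(String s) {
        Object erased = Objects.requireNonNull((Object) s);  // null check
        return (String) erased;                              // cleanup checkcast
    }

    public static void main(String[] args) {
        System.out.println(checkedImplicit("ok"));
        System.out.println(checkedExplicit("ok"));
    }
}
```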

Additionally, calls to doubleValue (or any <primtype>Value) will, in many cases, need to be preceded by a checkcast to the box type. (These cases are either explicit user casts from a supertype like Object, or implicit casts inserted around a generic API point.)

This prospect of a much greater volume of implicit conversion bytecodes, or pairs of such conversions, suggests that perhaps the translation strategy for Valhalla would benefit from new support in the JVM bytecode instruction set for expressing those conversions more simply.

(That is a big "perhaps"; it is expensive to add new bytecodes. This memo explores that expensive option. The fallback position, and plan of record at the moment, is to use as many library routine calls as it takes to get the job done, and call it a day.)

Description

Enhance the checkcast instruction in three directions:

polymorphically produce legacy primitives as well as references (cf. getstatic for a precedent)
polymorphically consume legacy primitives as well as references (cf. putstatic for a precedent)
optionally perform null check operations

All three enhancements are enabled by a condition which previously has been illegal. That condition holds when the checkcast instruction operand field, an index into the constant pool, refers to a CONSTANT_Utf8 item, rather than a CONSTANT_Class item, as is already legal.

The spelling of the CONSTANT_Utf8 item selects the function:

">D" pops an Object reference, casts to Double and then calls doubleValue
">I" pops an Object reference, casts to Integer and then calls intValue
"<D" pops a double (two slots) and calls Double::valueOf
"<I" pops an int and calls Integer::valueOf
(and so on for other ">x" and "<x", where x is in [BSIJZCFD])
"!" peeks at the value on the stack and throws NPE if it is null

Any other operand (any other spelling or other constant pool entry type) will fail verification of the checkcast instruction, and is thereby reserved for future use.
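The operand spellings listed above could be classified along the following lines. This is a sketch under stated assumptions, not a specification: the class CheckboxOperand, its Kind enum, and the choice of VerifyError as the failure signal are all illustrative.

```java
// Illustrative sketch only; all names here are hypothetical.
final class CheckboxOperand {
    enum Kind { UNBOX, BOX, NULL_CHECK }

    final Kind kind;
    final char primitive;   // descriptor letter for UNBOX/BOX, '\0' for NULL_CHECK

    private CheckboxOperand(Kind kind, char primitive) {
        this.kind = kind;
        this.primitive = primitive;
    }

    // Classifies a Utf8 operand; any other spelling is rejected.
    static CheckboxOperand parse(String utf8) {
        if (utf8.equals("!")) {
            return new CheckboxOperand(Kind.NULL_CHECK, '\0');
        }
        if (utf8.length() == 2 && "BSIJZCFD".indexOf(utf8.charAt(1)) >= 0) {
            char x = utf8.charAt(1);
            if (utf8.charAt(0) == '>') return new CheckboxOperand(Kind.UNBOX, x);
            if (utf8.charAt(0) == '<') return new CheckboxOperand(Kind.BOX, x);
        }
        // Assumption: verification failure is modeled as a VerifyError.
        throw new VerifyError("unsupported checkcast operand: " + utf8);
    }

    public static void main(String[] args) {
        CheckboxOperand op = parse(">D");
        System.out.println(op.kind + " " + op.primitive);
    }
}
```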

The descriptions above are carefully crafted to imply the following interactions with the verifier type system:

">x" requires a reference (Object) on the stack and leaves a primitive (x) on the stack
"<x" requires a primitive (x) on the stack and leaves its box type (not merely Object) on the stack
"!" requires a reference on the stack and leaves that reference alone, with the same verifier type

It is likely that an efficient interpreter would choose to rewrite these new UTF8-using forms of checkcast to an internal, otherwise unused bytecode, and use the operand field of that bytecode to efficiently select the required behavior corresponding to the UTF8 string.
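One way an interpreter might encode the selected behavior is as a small integer computed once from the UTF8 string. The sketch below is purely hypothetical: the selector numbering, the class CheckboxRewrite, and its methods are assumptions made for illustration and are not part of this proposal.

```java
// Illustrative sketch only; the selector encoding is an assumption.
final class CheckboxRewrite {
    private static final String PRIMS = "BSIJZCFD";
    static final int NULL_CHECK = 16;   // hypothetical selector for "!"

    // One-time step: map the Utf8 operand to a compact selector.
    // Selectors 0..7 mean unbox, 8..15 mean box, 16 means null check.
    static int selectorFor(String utf8) {
        if (utf8.equals("!")) return NULL_CHECK;
        if (utf8.length() == 2) {
            int p = PRIMS.indexOf(utf8.charAt(1));
            if (p >= 0) {
                if (utf8.charAt(0) == '>') return p;
                if (utf8.charAt(0) == '<') return 8 + p;
            }
        }
        throw new VerifyError("unsupported checkcast operand: " + utf8);
    }

    // Fast path: the behavior is recovered from the selector alone.
    static String describe(int selector) {
        if (selector == NULL_CHECK) return "null check";
        if (selector < 0 || selector > 15) {
            throw new IllegalArgumentException("bad selector: " + selector);
        }
        char x = PRIMS.charAt(selector & 7);
        return (selector < 8 ? "unbox to " : "box from ") + x;
    }

    public static void main(String[] args) {
        System.out.println(describe(selectorFor("<I")));
    }
}
```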

It seems likely that the javac compiler should choose to emit these new instances of checkcast in some or all of the cases where it previously has emitted the method calls (whether implicit or explicit in the source code).

For presentations of this bytecode by other low-level tools, it is suggested that the name checkbox be used instead of checkcast, and the instruction be presented with its string operand unchanged. But the code point for checkcast (decimal 192) should be reused (overloaded) for this new purpose, rather than allocating a new code point.

Alternatives

The individual operations proposed here could also be expressed as static methods in a new helper class, such as java.lang.runtime.Checks. These could be defined in such a way as to minimize bytecode complexity. For example, void Checks::requireNonNull(Object) would do the job of Objects::requireNonNull, but would not return an inconvenient (T) value which needs a cleanup checkcast. This is inconvenient for programmers, but convenient for a bytecode generator, which is the target user of java.lang.runtime. Instead of a special bytecode, we would have special static methods in java.lang.runtime. Any specialized optimization or diagnostic behavior can be given to these runtime methods by making them VM intrinsics, in the interpreter and the JITs. Those VM components would have to be modified anyway (and with more difficulty) if we introduced new bytecode syntaxes. So it seems a net win to use intrinsic methods instead of new bytecode syntaxes.
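A rough sketch of what such a helper class might look like follows. Only the name Checks and the shape of void requireNonNull(Object) come from the text above; the methods unboxDouble and boxDouble are hypothetical names invented for this example, and the class is declared outside java.lang.runtime so that it compiles on its own.

```java
// Illustrative sketch only; unboxDouble and boxDouble are hypothetical names.
final class Checks {
    private Checks() {}

    // Null check with no return value, so no cleanup checkcast is needed.
    static void requireNonNull(Object o) {
        if (o == null) throw new NullPointerException();
    }

    // Combines the cast to Double and the unboxing call in one step.
    static double unboxDouble(Object o) {
        return ((Double) o).doubleValue();
    }

    // Boxing counterpart; returns the box type, not merely Object.
    static Double boxDouble(double d) {
        return Double.valueOf(d);
    }

    public static void main(String[] args) {
        Object boxed = boxDouble(4.0);
        requireNonNull(boxed);
        System.out.println(unboxDouble(boxed));
    }
}
```

A bytecode generator would emit a single invokestatic to one of these methods in place of each checkcast and library call pair.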