JEP draft: enhanced checkcast for Valhalla type unification

Owner: John Rose
Created: 2022/11/18 02:39
Updated: 2022/11/18 21:18


Invisible conversions everywhere dense in your classfile...

In the Java language, the Valhalla project unifies the type system so that all values (except possibly null) belong to subclasses of Object, including the values of int, double, and the other legacy primitive types (also known as the non-reference types in Java before Valhalla). This unification is done by more closely identifying, as "the same thing", both the boxed and the native forms of any given primitive value. This allows the legacy primitives to work in generic classes and methods (with suitably permissive type variable bounds).

But the JVM knows different: The native form of a double occupies 64 bits and two stack slots, under the descriptor "D", while the boxed form stores the 64-bit value in one or more buffer objects in the heap, and its reference occupies just one stack slot, under the descriptor "Ljava/lang/Double;". There is no way, within the JVM, to retroactively unify those two representations of "the same" double value object.

Instead, the Java static compiler (e.g., javac) needs to juggle at least two representations for double values, the two-slot native and the one-slot boxed reference. It must supply the correct format to each operation: For native arithmetic (like the dadd bytecode) the value must be in the native "D" format, and for communicating with generic code or storing into an array of Object or Number, it must be in the boxed reference format.

The Java static compiler has historically done such juggling in order to implement language rules for implicit auto-boxing and auto-unboxing. To do this, it makes implicit calls (not seen in the source code) to well-defined API points such as Double::valueOf (for boxing) and Double::doubleValue (for unboxing).
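As a minimal sketch of those implicit calls, the two methods below should behave identically: the first relies on auto-boxing and auto-unboxing, and the second spells out the API points by hand. The class and method names are hypothetical and chosen only for illustration.

```java
// Sketch: what javac's implicit boxing and unboxing amount to.
class BoxingSketch {
    // Source-level form: the conversions are invisible.
    static double implicitRoundTrip(double d) {
        Double boxed = d;   // javac inserts a call to Double.valueOf(d)
        return boxed;       // javac inserts a call to boxed.doubleValue()
    }

    // Hand-written equivalent with the calls made explicit.
    static double explicitRoundTrip(double d) {
        Double boxed = Double.valueOf(d);
        return boxed.doubleValue();
    }

    public static void main(String[] args) {
        // Both forms yield the same native double.
        System.out.println(implicitRoundTrip(1.5) == explicitRoundTrip(1.5));
    }
}
```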

Note also that the static compiler often implicitly changes the reference type of values around the boundaries of (erased) Java generic APIs, for example quietly casting the result of a call to List<String>::get from Object (the erased type variable bound, returned from the get method) to String (the type known at compile time). To do that task, it emits implicit uses of the checkcast bytecode.
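The implicit checkcast can be made observable by polluting a list through a raw type, as in the sketch below. The names are hypothetical; the point is only that the ClassCastException arises at a cast that appears nowhere in the source.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the checkcast that javac emits around an erased generic call.
class ErasedCastSketch {
    static String first(List<String> list) {
        // Erased List.get returns Object; javac follows the call with
        // a checkcast to java/lang/String before returning.
        return list.get(0);
    }

    @SuppressWarnings({"unchecked", "rawtypes"})
    static boolean implicitCastFails() {
        List raw = new ArrayList();
        raw.add(42);                 // heap pollution: an Integer where a String is expected
        try {
            first(raw);              // the hidden checkcast rejects the Integer
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(implicitCastFails());
    }
}
```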

In the case of a generic API point like List<double>::get, the Object value returned from the method must be implicitly retyped as double in two operational steps: It is cast to Double as a reference, and then unboxed to a native two-slot double with Double::doubleValue.
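The two steps can be sketched in present-day Java, using List<Double> as a stand-in for List<double> (which is not valid syntax today). This is an illustration of the shape of the translation under that assumption, not the Valhalla translation strategy itself; the names are hypothetical.

```java
import java.util.List;

// Sketch: retyping the erased result of List.get as a native double.
class UnboxAfterGetSketch {
    // Source-level form: both steps are implicit.
    static double implicitGet(List<Double> list) {
        return list.get(0);
    }

    // The same two steps written out by hand.
    static double explicitGet(List<Double> list) {
        Object erased = list.get(0);       // what the erased method hands back
        Double boxed = (Double) erased;    // step 1: checkcast to the box type
        return boxed.doubleValue();        // step 2: unbox to a two-slot double
    }

    public static void main(String[] args) {
        List<Double> list = List.of(2.5);
        System.out.println(implicitGet(list) == explicitGet(list));
    }
}
```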

For Valhalla user-defined primitives, the representations of boxed and unboxed values are the same, but there are still going to be implicit changes in their types, and those changes will (probably, in the unboxing direction only) be reflected by operational null-checks, using Objects::requireNonNull or the equivalent.

With Valhalla, we expect that the frequency (or density) of such conversions and checks may increase in some code, as users enjoy freedom from worry about whether their values are boxed or not. But the JVM will have to worry all the more, especially for legacy primitives. Also, the static compiler will have to send the right guidance to it, in the form of implicit bytecode instructions to manage the implicit boxing and unboxing.

Valhalla does not plan to enhance the verifier type system beyond where it is today. In particular, we do not plan to propagate the results of null checks in the verifier type system. This means that if javac forgets to put in a call to Objects::requireNonNull, nulls will be checked later if at all, when a variable is reached that is positively null-rejecting. (The argument to Double::doubleValue is null-rejecting in that way, since a null receiver elicits a NullPointerException.)
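A small sketch of that deferred checking, using today's Double box: a null travels freely through reference-typed variables and is only rejected once it reaches the null-rejecting receiver of Double::doubleValue. The class and method names are hypothetical.

```java
// Sketch: a null is checked late, at the first null-rejecting use.
class NullRejectSketch {
    // Nothing here checks for null; the reference just flows through.
    static Double passThrough(Double boxed) {
        Double copy = boxed;
        return copy;
    }

    // The receiver of doubleValue is null-rejecting.
    static boolean unboxRejectsNull(Double boxed) {
        try {
            boxed.doubleValue();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Double flowed = passThrough(null);              // no exception so far
        System.out.println(unboxRejectsNull(flowed));   // the check happens here
    }
}
```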

In summary, this means that the following operations will effectively be used as virtual machine instructions for managing low-level type changes in code generated by javac:

In addition, calls to requireNonNull will, in many cases, need to be followed by a checkcast to reassure the verifier that what came out of the method is the same type (in fact, the same reference) as what went in. (This effect is not visible at the language level, since requireNonNull generically returns its input type. But the JVM requires a checkcast here.)
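The requireNonNull pairing can be sketched as follows. In the source, the first method shows no cast at all; the second writes out what the erased descriptor of requireNonNull, (Ljava/lang/Object;)Ljava/lang/Object;, forces on the classfile. The names are hypothetical.

```java
import java.util.Objects;

// Sketch: the checkcast that follows an erased call to requireNonNull.
class RequireNonNullSketch {
    // Source-level view: String in, String out, no visible cast.
    static String checked(String s) {
        return Objects.requireNonNull(s);
    }

    // Classfile-level view, written out by hand.
    static String checkedExplicit(String s) {
        Object erased = Objects.requireNonNull((Object) s);  // erased result type is Object
        return (String) erased;                              // checkcast back to String
    }

    public static void main(String[] args) {
        String s = "hello";
        // The very same reference comes back in both forms.
        System.out.println(checked(s) == s && checkedExplicit(s) == s);
    }
}
```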

Additionally, calls to doubleValue (or any <primtype>Value) will, in many cases, need to be preceded by a checkcast to the box type. (These cases are either explicit user casts from a supertype like Object, or implicit casts inserted around a generic API point.)
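For the explicit-cast case, present-day Java already accepts a cast from Object to double, which amounts to the checkcast-then-unbox pair. The sketch below shows the cast, its hand-expanded form, and the failure when the object is a different box type. The names are hypothetical.

```java
// Sketch: an explicit user cast from Object to double.
class CastThenUnboxSketch {
    static double fromObject(Object o) {
        return (double) o;   // checkcast to java/lang/Double, then Double.doubleValue
    }

    // The same pair of operations written out.
    static double fromObjectExplicit(Object o) {
        return ((Double) o).doubleValue();
    }

    // The checkcast step rejects any reference that is not a Double.
    static boolean rejectsWrongBox(Object o) {
        try {
            fromObject(o);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Object boxed = 4.0;
        System.out.println(fromObject(boxed) == fromObjectExplicit(boxed));
        System.out.println(rejectsWrongBox(Integer.valueOf(4)));
    }
}
```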

This prospect of a much greater volume of implicit conversion bytecodes, or pairs of such conversions, suggests that perhaps the translation strategy for Valhalla would benefit from new support in the JVM bytecode instruction set for expressing those conversions more simply.

(That is a big "perhaps"; it is expensive to add new bytecodes. This memo explores that expensive option. The fallback position, and plan of record at the moment, is to use as many library routine calls as it takes to get the job done, and call it a day.)


Enhance the checkcast instruction in three directions:

All three enhancements are enabled by a condition which previously has been illegal. That condition holds when the checkcast instruction operand field, an index into the constant pool, refers to a CONSTANT_Utf8 item, rather than a CONSTANT_Class item, as is already legal.

The spelling of the CONSTANT_Utf8 item selects the function:

Any other operand (any other spelling or other constant pool entry type) will fail verification of the checkcast instruction, and is thereby reserved for future use.
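The operand rule above might be modeled, very roughly, as a dispatch on the tag of the constant pool entry. The tag values 1 (CONSTANT_Utf8) and 7 (CONSTANT_Class) are those of the classfile format; the class, the enum, and the method are hypothetical, and the sketch deliberately does not model the individual Utf8 spellings.

```java
// Sketch: classifying a checkcast operand by its constant pool tag.
class CheckcastOperandSketch {
    static final int CONSTANT_Utf8 = 1;    // classfile tag for a Utf8 entry
    static final int CONSTANT_Class = 7;   // classfile tag for a Class entry

    enum Form { CLASSIC, ENHANCED, REJECTED }

    static Form classify(int tag) {
        switch (tag) {
            case CONSTANT_Class:
                return Form.CLASSIC;    // today's checkcast
            case CONSTANT_Utf8:
                // Enhanced form; the spelling of the string would then select
                // the function, and an unrecognized spelling would be rejected.
                return Form.ENHANCED;
            default:
                return Form.REJECTED;   // fails verification, reserved for future use
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(CONSTANT_Class));
        System.out.println(classify(CONSTANT_Utf8));
    }
}
```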

The descriptions above are carefully crafted to imply the following interactions with the verifier type system:

An efficient interpreter would probably choose to rewrite these new UTF8-using forms of checkcast to an internal, otherwise unused bytecode, and use the operand field of that bytecode to efficiently select the required behavior corresponding to the UTF8 string.

It seems likely that the javac compiler should choose to emit these new instances of checkcast in some or all of the cases where it previously has emitted the method calls (whether implicit or explicit in the source code).

For presentations of this bytecode by other low-level tools, it is suggested that the name checkbox be used instead of checkcast, and the instruction be presented with its string operand unchanged. But the code point for checkcast (decimal 192) should be reused (overloaded) for this new purpose, rather than allocating a new codepoint.