JEP draft: Better hash codes
Owner | John Rose
Type | Feature
Scope | JDK
Status | Draft
Component | hotspot / runtime
Created | 2018/04/12 01:23
Updated | 2024/09/25 17:30
Issue | 8201462
Summary
Introduce the new method Object.longHashCode to support the generation
of 64-bit hash codes. Provide a set of JVM functions to optimize hash
code generation for a wide range of types, using modern CPU
instructions. Arrange appropriate forward and backward interoperation
with the existing Object.hashCode API.
Goals
- Generality: The new hash function, like the existing standard hash
  functions, will operate (at least potentially) on all field and
  array element types.
- Size: The new hash function will supply a full 64 bits of mixed
  output, enabling it to scale to cloud-sized applications.
- Mixing: As the hardware platform permits, the hash function will mix
  its inputs thoroughly. In particular, every output bit should depend
  on every input bit, with approximately random statistics. That is,
  there should be no "dead bits" in the output, with respect to any
  part of the input. In addition, small numbers of input bits should
  not be able to "cancel" each other's effects on the output. That is,
  "funnels" between inputs should be avoided.
- Speed: The hash function will be fast enough to serve as a
  replacement for most current applications of Object.hashCode,
  especially those with more than a few words of hashable input.
  Naturally, this goal must be traded off carefully with the previous
  goal of good mixing.
- Saltable: The hash function will be instantiable into multiple
  versions, using an optional static "salt" parameter. This will
  enable VM instances and individual data structures to somewhat
  harden themselves against hash collision attacks.
- Compatible: The existing specified behavior of Object.hashCode will
  not change. Use of the new algorithm by legacy classes will be on an
  opt-in basis. There will be upgrade paths which will not require
  replacing existing class or interface hierarchies.
Non-goals
- Non-secure: The hash function will not claim to be cryptographic in
  strength, as that is currently impossible to implement at speeds
  consistent with ordinary uses of hash codes.
- Non-reproducible: The output of the hash function will not be
  specified in such a way that independent conforming implementations
  of the Java libraries or VM will necessarily produce the same hashes
  for the same inputs. In this, the hash function will be free to
  produce arbitrary VM-assigned results, much like today's
  System::identityHashCode. (As a matter of good engineering,
  accidentally non-reproducible results are to be avoided.)
Success Metrics
- Collection classes can be adapted (at user option) to use new hash
  codes, and show better performance when using them on appropriate
  workloads.
- The hash code is usable as a substitutability hash code for value
  types.
- Some bulk algorithms run significantly faster.
Motivation
Existing standard implementations of the Object.hashCode API have
well-known flaws, leading to excessive hash collisions, wasted CPU
cycles, and excessive memory footprint in hashed structures.
In many cases every bit of the input affects at most one bit of the
output (Integer.hashCode), and/or can be trivially canceled by another
bit (Long.hashCode). Even for variably-sized types, such as strings
and lists, this observation often applies to the last elements of the
type.
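For concreteness, both flaws are visible in the JDK's specified
definitions: Integer.hashCode(x) is x itself, and Long.hashCode(v) is
(int)(v ^ (v >>> 32)).

  // Each input bit of an int affects exactly one output bit.
  assert Integer.hashCode(42) == 42;
  // The two halves of a long can cancel: distinct inputs collide easily.
  long v = 0x0000000100000001L;                 // identical halves...
  assert Long.hashCode(v) == Long.hashCode(0L); // ...XOR together to zero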
Even if the input bits are mixed a little, in most cases they are not
mixed thoroughly, because the usual mixing function is exclusive or
between parallel inputs, or the simple recurrence h = h*31 + x for
sequences of inputs. In these functions there are many funnels between
inputs and dead bits in the output.
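The funnel in h = h*31 + x can be seen directly in String.hashCode:
increasing one character by 1 and decreasing the next by 31 leaves the
hash unchanged.

  // Classic colliding pair under h = h*31 + x:
  assert "Aa".hashCode() == "BB".hashCode(); // both evaluate to 2112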
The existing hash codes are insufficient for industrial-strength
hashing, which must avoid pathological collision behaviors even in the
presence of overloaded or subverted inputs.
Meanwhile, the JVM has a special role to play here. For any given
object layout (or value type layout), it knows how to load the hashed
input with as few memory operations as possible, and it knows whether
special-purpose instructions (such as AES steps) are available to
accelerate the hash algorithm while retaining good mixing properties.
The JVM can even sense when a vectorized hashing algorithm might be
appropriate. All of these considerations are inconsistent with an
externally-specified hash algorithm such as XOR or h*31+x, and all
of them are important for obtaining the best possible trade-off
between hashing speed and quality. It follows that we don't need just
another library function for a few use cases, but rather a deeply
embedded facility that the JVM can optimize for (potentially) any Java
object, value, or array.
Description
The low-level API point System.primitiveHashCode will be defined
alongside the API point System.identityHashCode, as a way to obtain a
fast, high-quality hash code for the publicly readable contents of an
object, value, or array. A variant entry point (see below) will direct
the JVM to privately readable contents on an opt-in basis.
The goals given above will be met by all instances of this algorithm.
(Exception: instances whose salt is selected non-randomly by an
adversary will not serve as proof of failure.) The JVM is encouraged
to use appropriate technology, possibly including built-in
cryptographic instructions or specialized vector hardware, to
implement these goals.
Any non-public fields, and any fields or elements of reference type,
in the object, value, or array will be ignored by primitiveHashCode.
This may seem to be a deep usability problem, but in fact it allows
primitiveHashCode to be used as a building block for a range of useful
hash code behaviors, including those which treat references as object
identities or as pointers to more hashable content, or some mix of
both.
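As an illustration, a class with both primitive and reference fields
might use the proposed entry points like this. (A hypothetical sketch:
System.primitiveHashCode and System.mixHashCodes are the API points
proposed in this JEP, not existing methods.)

  final class Point {
      public final int x, y;       // primitive fields: hashed by primitiveHashCode
      public final String label;   // reference field: ignored, mixed in by hand

      Point(int x, int y, String label) {
          this.x = x; this.y = y; this.label = label;
      }

      public long longHashCode() {
          long prims = System.primitiveHashCode(this);  // covers x and y
          long ref   = label.hashCode();                // legacy hash for the reference
          return System.mixHashCodes(prims, ref);       // proposed mixer (see below)
      }
  }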
The API point System.primitiveHashCode(Object,Lookup) will provide a
version of the hash algorithm which inspects all fields in the given
object, under the requirement that the Lookup argument validates
access to these fields. This API point is intended for use as a
building block with classes whose internal structure includes
primitive fields.
Any fields of value type declared in the object or value class are
recursively scanned for non-reference inputs to the hash code
algorithm. Likewise, arrays of value types are recursively scanned.
Arrays of Object or interface types will not be scanned, even if some
of the elements might refer to value types.
The class of the operand passed to System.primitiveHashCode is
significant to the hash code. That is, two objects with equivalent
fields but different classes will produce independent hash codes. Two
arrays of the same length and numeric contents, but with different
types (short vs. long), will produce independent hash codes. Likewise,
the length of an array is significant to the hash code, as well as all
of its non-reference (primitive or value) elements.
As a special case, reference arrays could distribute the hash
algorithm to each of their reference elements, instead of ignoring
them. But this special case seems to work better as a separate method;
see below. For consistency, passing a reference array to
System.primitiveHashCode will simply hash the array's class and
length.
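These rules can be summarized with a few illustrative (and assumed)
assertions; "independent" hashes are overwhelmingly likely, though not
guaranteed, to differ.

  short[] s = {1, 2, 3};
  long[]  l = {1, 2, 3};
  // Same numeric contents, different array classes: independent hashes.
  assert System.primitiveHashCode(s) != System.primitiveHashCode(l);

  String[] refs = {"a", "b"};
  // Reference elements are ignored; only String[].class and the
  // length 2 contribute to the hash.
  long h = System.primitiveHashCode(refs);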
Since public fields are relatively rare in standard Java APIs, there
is also an API to adapt primitiveHashCode to values or objects with
private fields. This API, System.primitiveHashCode(Object,Lookup),
includes a permission check based on its second argument, a Lookup
which must exactly match the class of the operand. It is not
caller-sensitive.
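A class might grant access to its own private fields like this (again
a hypothetical sketch against the proposed API):

  import java.lang.invoke.MethodHandles;

  final class Secret {
      private final long a, b;
      Secret(long a, long b) { this.a = a; this.b = b; }

      // A Lookup created here has private access and exactly matches
      // Secret.class, as the API requires.
      private static final MethodHandles.Lookup L = MethodHandles.lookup();

      public long longHashCode() {
          return System.primitiveHashCode(this, L);  // proposed API point
      }
  }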
Internally, the JVM will make two key choices to implement
System.primitiveHashCode. First, it will statically sense any
configuration parameters that add "salt" to the hashing algorithm for
this JVM instance. Second, for each operand, it will sense the class
and vector (if necessary for efficiency) to an appropriately
specialized implementation of the hash function. This second step will
often be omitted if JIT-time type information is available. If the
lookup parameter is present, a third step is required: comparing the
lookup class against the object class, and ensuring that the lookup
has the required "private" bit set, which allows full access.
The API point MethodHandles.primitiveHashCode(String) will provide an
instance of the long hash code algorithm, using the given string as
"salt". If the string is empty, the algorithm will behave the same as
that of System.primitiveHashCode.
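Assuming the factory returns a method handle of type (Object)long (the
draft does not pin down the return type), usage might look like:

  import java.lang.invoke.MethodHandle;
  import java.lang.invoke.MethodHandles;

  // Hypothetical sketch: obtain a salted instance of the hash algorithm.
  MethodHandle salted = MethodHandles.primitiveHashCode("per-table-salt");
  long h = (long) salted.invokeExact((Object) someValue);
  // (invokeExact throws Throwable; error handling elided)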
The combined API point that takes both the string and the Lookup
object may also be provided, as well as (perhaps) API points which
allow selection of a set of relevant fields and other inputs to the
hash function, such as hashes recursively derived from reference
fields.
There will be a way to specify a salt string at JVM start-up, for use
only in that instance of the JVM. There will be a way to ask the JVM
to generate such a salt string randomly at start-up, and in this case
the salt will be as unpredictable as possible.
For object and value classes, the set of non-reference fields which
are public, or the set of all fields, is often only a first
approximation to the set of inputs required for a correct hash
function. In those cases, user code (or a method handle generated by a
bootstrap method) can use primitiveHashCode to extract all of the
primitive fields at once, and then mix them with hashes derived from
other sources, such as affiliated objects or arrays. This mixing can
be performed on an intermediate array of type long[].
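Concretely, the pattern might look like this (a hedged sketch: the
fields obj.coords and obj.name are hypothetical, and primitiveHashCode
is the proposed entry point):

  // Gather component hashes into a long[] and hash that array once.
  long[] parts = {
      System.primitiveHashCode(obj),        // all primitive fields at once
      System.primitiveHashCode(obj.coords), // an affiliated primitive array
      obj.name.hashCode()                   // legacy hash of a reference field
  };
  long h = System.primitiveHashCode(parts); // one final well-mixed pass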
Additional API points may be built on top as needed (a combined usage
sketch follows this list):
- Object.longHashCode is a new overridable API point in Object, having
  the same guaranteed correspondence to Object.equals as
  Object.hashCode. For compatibility, its default must be built on a
  call to Object.hashCode plus some extra mixing via
  primitiveHashCode. Classes can easily upgrade over time.
- Collection.longHashCode is a new overridable API point, with
  appropriate defaults based on the long hashes of the elements.
  Existing abstract superclass methods for hashCode will be joined by
  methods for longHashCode. For lists, the definition is to call
  longHashCode on each element and then call it once more on the
  resulting sequence of 64-bit hash codes, as if in a temporary long[]
  array.
- Objects.longHash(Object...) will call longHashCode on each
  individual argument, and then accumulate all of the resulting 64-bit
  hash values into a single hash value, as if via a temporary array of
  long values of the same length as the original array, to which is
  applied a final call of primitiveHashCode (or mixHashCodes, see
  below).
- Arrays.longHashCode(T[]) will produce the same result as
  System.primitiveHashCode if the component type is primitive or a
  value; else the component type is a reference and it will produce
  the same result as Objects.longHash.
- Arrays.deepLongHashCode(T[]) will produce the same result as
  System.primitiveHashCode if the component type is primitive or a
  value; else the component type is a reference and it will produce
  the same result as calling deepLongHashCode on each element of the
  reference array, thus traversing a dynamic nest. It may produce
  stack overflow conditions for certain inputs.
- Collectors.longHash(), when used as the terminal operation of a
  stream, will produce the same result as Collectors.toList() followed
  by longHashCode (or mixHashCodes, see below) on the resulting list.
- HashMap(int,float,boolean) will allow individual hash maps to make
  use of longHashCode instead of hashCode, if the new boolean
  parameter is true. (Alternatively, a separate class LongHashMap
  could be built, but this seems unnecessary.)
- System.mixHashCodes(long...) will allow users to mix hash codes
  derived from separate sources into a single hash code. It may be
  implemented as a redirect to call primitiveHashCode on a long[]
  array, or it may do something more efficient, under the assumption
  that the inputs are already well-conditioned.
- Additional overloadings of mixHashCodes on between one and five long
  values will be equivalent to calling mixHashCodes on an array of the
  operands, but can help avoid boxing overheads. As a special case,
  mixHashCodes on a single long value is an acceptable way to
  condition the result of a legacy Object.hashCode result to be better
  mixed, although it cannot magically remix the legacy result to
  perfection.
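As promised above, here is a hypothetical sketch of several of these
proposed API points in use; none of the names or signatures exist yet,
and the variables name, age, address, and names are illustrative
(assume the usual java.util imports):

  long h1 = Objects.longHash(name, age, address);           // varargs accumulation
  long h2 = Arrays.longHashCode(new int[] {1, 2, 3});       // primitive component type
  long h3 = names.stream().collect(Collectors.longHash());  // terminal stream op
  long h4 = System.mixHashCodes(h1, h2, h3);                // combine separate sources
  Map<String, Long> m = new HashMap<>(16, 0.75f, true);     // opt in to longHashCode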
Publicly available block data hash algorithms, such as xxHash or
entries in Google's CityHash family, are likely to be useful for
building the above functionality.
An important advantage of defining hashes with JVM assistance is accelerating the acquisition of the initial inputs to the hash, the primitive bits from the original object, value, or array. On modern hardware, this can often be performed most efficiently by loading an entire vector worth of data from the input instance. A vector masking operation (either with the load, or after it) can suppress irrelevant portions of the object layout, such as the object header, fragmentation overhead, nearby objects, private fields (if suppressed), and fields of reference type. If the object is dense enough with primitive fields, a single masked vector load is likely to be competitive with a series of scalar loads, and the vector processing unit is often capable of performing significant hash steps immediately after the load.
As a specific algorithm for hash steps, note that user-mode instructions for a single step of the 128-bit AES encode algorithm are available on many significant platforms supported today by Java, and are well integrated with vector units. (In some cases, the vector unit can run several steps in parallel.) A 128-bit AES step is likely to be about as cheap as many traditional hashing steps, such as a 64-bit multiply followed by bit swizzling. The AES step operates on 128 bits at a time, which gives it an edge on some platforms. Early experiments with SMHasher suggest that two AES steps (but not one) are usually enough to produce robust mixing of bits.
Thus, an AES-based algorithm something like this would seem to work well for the above APIs:
primitiveHashCode(x) {
  cls = x.getClass()
  // The mask selects only the hashable words of the instance layout,
  // excluding the header, padding, and any reference fields.
  mask1 = vectorMaskForInstance(cls)
  bits = unsafe_vector_load(x, mask1)
  bits = aes_step(bits, SALT1)        // fold in per-JVM salt
  if (cls.isArray()) {
    mask2 = vectorMaskForArrayBody(cls)
    // Absorb the array body one vector-sized block at a time.
    for (off in vectorOffsetsFor(cls, x.length)) {
      bits2 = unsafe_vector_load(x + off, mask2)
      bits = aes_step(bits, bits2)    // absorb the new block
      bits2 = aes_step(bits2, SALT2)  // re-salt the block...
      bits = aes_step(bits, bits2)    // ...and absorb it again (second round)
    }
  }
  bits = aes_step(bits, SALT3)        // final mixing round
  return bits  // 128-bit state; reduction to the 64-bit long result is elided here
}
Indeed, the CityHash project appears to have experimented briefly with the approach of using AES as a mixing step.
The SALT values above can be any pseudo-random 128-bit values. They
determine which instance of the hash algorithm is in use. They could
be derived from a random number generator, or from the hashing
(cryptographic or otherwise) of the salt string mentioned earlier.
For machines which can issue several AES steps at once, and for larger arrays, it would be a routine adjustment to unroll the above algorithm and rearrange it to accumulate larger state vectors in parallel. The choice of whether to do this depends sensitively on the array length and the processor, which means it must be made on-line by the JVM and not by a prior decision, such as a mandate from a library API.
A JVM instance which runs on hardware that doesn't accelerate AES could fall back to a standard ALU-intensive hash algorithm, such as MurmurHash or xxHash. The freedom to back off like this is integral to getting the best-quality hash available on any given platform.
Collision resistance can be increased by adding more instructions, including XOR steps to carry forward duplicate values of previous steps (Davies-Meyer and similar constructions) and/or additional AES round instructions. This construction is not designed to be a cryptographically strong hash, and collisions may occur. The use of randomly selected salt values will make it more difficult, though not impossible, for adversaries to engineer collisions.
Java currently taps out at 64 bits on primitive size, but future value types will easily grow larger. At that point, the APIs sketched above could be expanded to provide "extra long" hashes of arbitrary size, if this seems desirable. Such jumbo hashes are relatively simple to derive from the above hashing algorithms, by using distinct salt values for each channel of the jumbo hash.
Alternatives
The special case for reference arrays could be removed, and instead
special API points reserved for those uses, such as
Arrays.deepLongHashCode.
The various API points suggested above could be deferred until proven useful.
We could continue to live with hash collisions from h*31+x.
Library writers could continue to create their own hash functions, which will not be optimized well by the JVM and/or will have poorer hashing quality than what the JVM could do.
It is an unusual API design to mix array-element hashing with
instance-field hashing, all in one place. Perhaps the concerns should
be separated. Perhaps polymorphism should be expressed statically, as
method handle factories, rather than via dynamic type tests performed
by a single magic entry point. Note that the magic entry point has
precedents such as System.arraycopy and Class.newInstance. The common
thread handled by the proposed System.primitiveHashCode is that only
the JVM can know the fastest and best thing to do for each distinct
object, value, or array layout (and for each selection of public
fields vs. all fields).