JEP draft: Better hash codes

Owner	John Rose
Type	Feature
Scope	JDK
Status	Draft
Component	hotspot / runtime
Created	2018/04/12 01:23
Updated	2024/09/25 17:30
Issue	8201462

Summary

Introduce the new method Object.longHashCode to support the generation of 64 bit hash codes. Provide a set of JVM functions to optimize hash code generation for a wide range of types, using modern CPU instructions. Arrange appropriate forward and backward interoperation with the existing Object.hashCode API.

Goals

Generality: The new hash function, like the existing standard hash functions, will operate (at least potentially) on all field and array element types.
Size: The new hash function will supply a full 64 bits of mixed output, enabling it to scale to cloud-sized applications.
Mixing: As the hardware platform permits, the hash function will attempt to make the hash function mix thoroughly. In particular, every output bit should depend on every input, with approximately random statistics. That is, there should be no "dead bits" in the output, with respect to any part of the input. In addition, small numbers of input bits should not be able to "cancel" each others' effects on the output. That is, "funnels" between inputs should be avoided.
Speed: The hash function will be fast enough to serve as a replacement for most current applications of Object.hashCode, especially those with more than a few words of hashable input. Naturally this goal must be traded off carefully with the previous goal of good mixing.
Saltable: The hash function will be instantiable into multiple versions, using an optional static "salt" parameter. This will enable VM instances and individual data structures to somewhat harden themselves against hash collision attacks.
Compatible: The existing specified behavior of Object.hashCode will not change. Use by legacy classes of the new algorithm will be on an opt-in basis. There will be upgrade paths which will not require replacing existing class or interface hierarchies.

Non-goals

Non-secure: The hash function will not claim to be cryptographic in strength, as that is currently impossible to implement at speeds consistent with ordinary uses of hash codes.
Non-reproducible: The output of the hash function will not be specified in such a way that independent conforming implementations of the Java libraries or VM will necessarily produce the same hashes for the same inputs. In this, the hash function will be free to produce arbitrary VM-assigned results, much like today's System::identityHashCode. (As a matter of good engineering, accidentally non-reproducible results are to be avoided.)

Success Metrics

Collection classes can be adapted (at user option) to use new hash codes, and show better performance when using them on appropriate workloads.
Hash code is usable as a substitability hash code for value types.
Some bulk algorithms run significantly faster.

Motivation

Existing standard implementations of the Object.hashCode API have well-known flaws, leading to excessive hash collisions and poor use of CPU cycles and excessive memory footprint in hashed structures.

In many cases every bit of the input affects at most one bit of the output (Integer.hashCode), and/or can be trivially canceled by another bit (Long.hashCode). Even for variably-sized types, such as strings and lists, this observation often applies to the last elements of the type.

Even if the input bits are mixed a little, in most cases they are not mixed thoroughly, because the usual mixing function is exclusive or between parallel inputs, or the simple recurrence h=h*31+x for sequences of inputs. In these functions there are many funnels between inputs and dead bits in the output.

The existing hash codes are insufficient for industrial strength hashing that must avoid of pathological collision behaviors, even in the presence of overloaded or subverted inputs.

Meanwhile, the JVM has a special role to play here. For any given object layout (or value type layout), it knows how to load the hashed input with as few memory operations as possible, and it knows whether special-purpose instructions (such as AES steps) are available to accelerate the hash algorithm while retaining good mixing properties. The JVM can even sense when a vectorized hashing algorithm might be appropriate. All of these considerations are inconsistent with an externally-specified hash algorithm such as XOR or h*31+x, and all of them are important for obtaining the best possible trade-off between hashing speed and quality. It follows that we don't need just another library function for a few use cases, but rather a deeply embedded facility that the JVM can optimize for (potentially) any Java object, value, or array.

Description

The low-level API point System.primitiveHashCode will be defined alongside the API point System.identityHashCode, as a way to obtain a fast, high quality hash code for the publicly readable contents of an object, value, or array. A variant entry point (see below) will direct the JVM to privately readable contents on an opt-in basis.

The goals given above will be met by all instances of this algorithm. (Exception instances of salt selected non-randomly by an adversary will not serve as proof of failure.) The JVM is encouraged to use appropriate technology, possibly including built-in cryptographic instructions or specialized vector hardware, to implement these goals.

Any non-public fields, and any fields or elements of reference type in the object value, or array will be ignored by primitiveHashCode. This may seem to be a deep usability problem, but in fact allows primitiveHashCode to be used as building block for a range of useful hashcode behaviors, including those which treat references as object identities or as pointers to more hashable content, or some mix of both.

The API point System.primitiveHashCode(Object,Lookup) will provide a version of the hash algorithm which inspects all fields in the given object, under the requirement that the Lookup argument validates access to these fields. This API point is intended for use as a building block with classes whose internal structure has primitives.

Any fields of value type declared in the object or value class are recursively scanned for non-reference inputs to the hash code algorithm. Likewise, arrays of value types are recursively scanned. Arrays of Object or interface types will not be scanned, even if some of the elements might refer to value types.

The class of the operand passed to System.primitiveHashCode is significant to the hash code. That is, two objects with equivalent fields but different classes will produce independent hash codes. Two arrays of the same length and numeric contents, but with different types (short vs. long) will produce independent hash codes. Likewise, the length of an array is significant to the hash code, as well as all of its non-reference (primitive or value) elements.

As a special case, reference arrays could distribute the hash algorithm to each of their reference elements, instead of ignoring them. But this special case seems to work better as a special method; see below. For consistency, passing a reference array to System.primitiveHashCode will simply hash the array's class and length.

Since public fields are relatively rare in standard Java APIs, there is also an API to adapt primitiveHashCode to values or objects with private fields. This API includes a permission check based on a second argument, a Lookup which must exactly match the class of the operand. It is System.primitiveHashCode(Object,Lookup). It is not caller sensitive.

Internally, the JVM will make two key choices to implement System.primitiveHashCode. First, it will statically sense any configuration parameters that add "salt" to the hashing algorithm, for this JVM instance. Second, for each operand, it will sense the class and vector (if necessary for efficiency) to an appropriately specialized implementation of the hash function. This second step will often be omitted if JIT-time type information is available.

If the lookup parameter is present, a third step is required, which is comparing the lookup class against the object class, and ensuring that the lookup has the required "private" bit set which allows full access.

The API point MethodHandles.primitiveHashCode(String) will provide an instance of the long hash code algorithm, using the given string as "salt". If the string is empty, the algorithm will behave the same as that of System.primitiveHashCode.

The combined API point that takes both the string and the Lookup object may also be provided, as well as (perhaps) API points which allow selection of a set of relevant fields and other inputs to the hash function, such as hashes recursively derived from reference fields.

There will be a way to specify a salt string at JVM start-up for use only the in that instance of the JVM. There will be way to ask the JVM to generate such a salt string randomly at start-up, and in this case the salt will be as unpredictable as possible.

For object and value classes, the set of non-reference fields which are public, or the set of all fields, is often only a first approximation to the set of inputs required for a correct hash function. In those cases, user code (or a method handle generated by a bootstrap method) can use primitiveHashCode to extract all of the primitive fields at once, and then mix them with hashes derived from other sources, such as affiliated objects or arrays. This mixing can be performed on an intermediate array of type long[].

Additional API points may be built on top as needed:

Object.longHashCode is a new overridable API point in Object, having the same guaranteed correspondence to Object.equals as Object.hashCode. For compatibility, its default must be built on a call to Object.hashCode plus some extra mixing via primitiveHashCode. Classes can easily upgrade over time.
Collection.longHashCode is a new overridable API point, with appropriate defaults based on the long hashes of the elements. Existing abstract superclass methods for hashCode will be joined by methods for longHashCode. For lists, the definition is to call longHashCode on each element then call it once more on the resulting sequence of 64-bit hash codes, as if in a temporary long[] array.
Objects.longHash(Object...) will call longHashCode on each individual argument, and then accumulate all of the resulting 64-bit hash values into a single hash value, as if via a temporary array of long values of the same length as the original array, to which is applied a final call of primitiveHashCode (or mixHashCodes, see below).
Arrays.longHashCode(T[]) will produce the same result as System.primitiveHashCode if the component type is primitive or a value, else the the component type is a reference and it will produce the same result as Objects.longHash.
Arrays.deepLongHashCode(T[]) will produce the same result as System.primitiveHashCode if the component type is primitive or value, else the the component type is a reference and it will produce the same result as calling deepLongHashCode on each element of the reference array, thus traversing a dynamic nest. It may produce stack overflow conditions for certain inputs.
Collectors.longHash(), when used as the terminal operation of a stream, will produce the same result as Collectors.toList() followed by longHashCode (or mixHashCodes, see below) on the resulting list.
HashMap(int,float,boolean) will allow individual hashmaps to make use of longHashCode instead of hashCode, if the new boolean parameter is true. (Alternatively, a separate class LongHashMap could be built, but this seems unnecessary.)
System.mixHashCodes(long...) will allow users to mix hash codes derived from separate sources into a single hash code. It may be implemented as a redirect to call primitiveHashCode on a long[] array, or it may do something more efficient, under the assumption that the inputs are already well-conditioned.
Additional overloadings of mixHashCodes on between one and five long values will be equivalent to calling mixHashCodes on an array of the operands, but can help avoid boxing overheads. As a special case, mixHashCodes on a single long value is an acceptable way to condition the result of a legacy Object.hashCode result to be better mixed, although it cannot magically remix the legacy result to perfection.

Publicly available block data hash algorithms, such as xxHash or entries in Google's CityHash are likely to be useful for building the above functionality.

An important advantage of defining hashes with JVM assistance is accelerating the acquisition of initial inputs to the hash, the primitive bits from the original object, value, or array. On modern hardware, this can often be performed most efficiently by loading an entire vector worth of data from the input instance. A vector masking operation (either with the load, or after it) can suppress irrelevant portions of the object layout, such object header, fragmentation overhead, nearby objects, private fields (if suppressed) and fields of reference type. If the object is dense enough with primitive fields, it is likely that a single masked vector load is competitive with a series of scalar loads, and the vector processing unit is often capable of performing significant hash steps, immediately after the load.

As a specific algorithm for hash steps, note that user-mode instructions for a single step of the 128-bit AES encode algorithm is available on many significant platforms supported today by Java, and are well integrated with vector units. (In some cases, the vector unit can run several steps in parallel.) A 128-bit AES step is likely to be about as cheap as many traditiional hashing steps, such as 64-bit multiply followed by bit swizzling. The AES step operates on 128 bits at a time, which gives it an edge on some platforms. Early experiments with SMHasher suggest that two AES steps (but not one) are usually enough to produce robust mixing of bits.

Thus, an AES-based algorithm something like this would seem to work well for the above APIs:

primitiveHashCode(x) {
  cls = x.getClass()
  mask1 = vectorMaskForInstance(cls)
  bits = unsafe_vector_load(x, mask1)
  bits = aes_step(bits, SALT1)
  if (cls.isArray()) {
    mask2 = vectorMaskForArrayBody(cls)
    for (off in vectorOffsetsFor(cls, x.length)) {
      bits2 = unsafe_vector_load(x + off, mask2)
      bits = aes_step(bits, bits2)
      bits2 = aes_step(bits2, SALT2)
      bits = aes_step(bits, bits2)
    }
  }
  bits = aes_step(bits, SALT3)
  return bits
}

Indeed, the CityHash project appears to have experimented briefly with the approach of using AES as a mixing step.

The SALT values above can be any pseudo-random 128-bit value. They determine which instance of the hash algorithm is in use. They could be derived from a random number generator, or from the hashing (cryptographic or otherwise) of the salt string mentioned earlier.

For machines which can issue several AES steps at once, for larger arrays, it would be a routine adjustment to unroll the above algorithm and rearrange it to accumulate larger state vectors, in parallel. The choice of whether to do this depends sensitively on the the array length and the processor, which means it must be made on-line by the JVM and not by a prior decision, such as a mandate from a library API.

A JVM instance which runs on hardware that doesn't accelerate AES could fall back to a standard ALU-intensive hash algorithm, such as MurmurHash or xxHash. The freedom to back off like this is integral to getting the best-quality hash available on any given platform.

Collision resistance can be increased by adding more instructions, including XOR steps to carry forward duplicate values of previous steps (Davies-Mayer and similar constructions) and/or additional AES round instructions. This construction is not designed to be a cryptographically strong hash, and collisions may occur. The use of randomly selected salt values will make it more difficult, though not impossible, for adversaries to engineer collisions.

Java currently taps out at 64 bits on primitive size, but future value types will easily grow larger. At that point, the APIs sketched above could be expanded to provided "extra long" hashes of arbitrary size, if this seems desirable. Such jumbo hashes are relatively simple to derive from the above hashing algorithms, by using distinct salt values for each channel of the jumbo hash.

Alternatives

The special case for reference arrays could be removed, and instead special API points reserved for those uses: Arrays.deepLongHashCode

The various API points suggested above could be deferred until proven useful.

We could continue to live with hash collisions from h*31+x.

Library writers could continue to create their own hash functions, which will not be optimized well by the JVM and/or will have poorer hashing quality than what the JVM could do.

It is an unusual API design to mix array-element hashing with instance-field hashing, all in one place. Perhaps the concerns should be separated. Perhaps polymorphism should be expressed statically, as method handle factories, rather than via dynamic type tests performed by a single magic entry point. Note that the magic entry point has precedents such as System.arraycopy and Class.newInstance. The common thread handled by the proposed System.primitiveHashCode is that only the JVM can know the fastest and best thing to do for each distinct object, value, or array layout (and for each selection of public fields vs. all fields).