JEP draft: Compact Object Headers (64 bit) (Experimental)

Owner       Roman Kennke
Type        Feature
Scope       Implementation
Status      Draft
Component   hotspot / runtime
Discussion  hotspot dash dev
Effort      L
Duration    L
Created     2022/10/07 19:27
Updated     2023/03/31 19:39
Issue       8294992

Summary

Reduce Java object header size from 96...128 bits to 64 bits on 64-bit systems, improving deployment density, allowing smaller Java heaps, and unlocking data locality improvements.

Goals

Introducing an intrusive experimental feature has a broad impact on real-world applications, and experimental code might have inefficiencies, bugs, and unanticipated non-bug behaviours. To provide extra safety for the evolution of the Java platform, the feature is gated by the runtime option -XX:(-|+)UseCompactObjectHeaders, which is initially off by default, with the plan to enable it by default in a future release and to remove the legacy code in a more distant future.

Because of this, there are two sets of goals.

When the feature is enabled (explicitly, by user request), it:

When the feature is disabled (default behaviour), it:

Non-Goals

It is not a goal for this JEP to:

Motivation

In the current implementation of HotSpot, Java objects need to have associated metadata. Object metadata is stored in the object header. Object header size is static and independent of object type or array shape and contents. In current 64-bit HotSpot, header sizes are between 96 bits (12 bytes) and 128 bits (16 bytes). Object sizes in Java programs tend to be small. Experiments that have been conducted as part of the Lilliput project [Insert link to table] have shown that many workloads have average object sizes of 256...512 bits (32...64 bytes). This implies that >20% of live data might be taken up by object headers alone.

This is why even a small improvement in object header size can have a large footprint impact. Cutting down the header of each object from 128 bits to 64 bits means a >10% improvement in overall heap usage, with corresponding improvements in memory and GC pressure: for example, at an average object size of 64 bytes, saving 8 header bytes per object reclaims 12.5% of the live data, and at an average of 32 bytes it reclaims 25%. Since object metadata is inlined into the objects, reducing the header size makes objects themselves smaller, which brings the additional benefit of improved data locality.

Early adopters of the Lilliput project who tried it with real-world applications confirm that live memory in the heap is typically reduced by 10%-20%.

Description

In current HotSpot, the object header is used to support the following features:

The current object header layout is conceptually split into a mark word and a class word.

The mark word comes first, has the size of a machine address, and contains:

Mark Word (normal):
   0  3    8                            39                       64
  [TT.AAAA.HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH.......................]
(tag)(GC age)    (Hash code)                 (unused)

Depending on runtime needs, the header can contain a lot more information through indirection to a separate data-structure:

Mark Word (overwritten):
   0 2                                                            64
  [TTppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp]
(tag)                       (native pointer)

When this happens, the tag (also known in HotSpot as the lock bits) describes the type of pointer stored in the header (lock record, ObjectMonitor, or something else), and the original mark word is preserved (displaced), if necessary, in the relevant data structure to which this pointer refers. Accesses to the fields of the original header (like the age bits and hash code) are done by dereferencing that pointer and eventually reaching the displaced header.

The class word comes after the mark word, and takes one of two shapes, depending on whether compressed class pointers are enabled:

Class Word (uncompressed):
 0                                                                64
 [cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc]
                          (class pointer)
                      
Class Word (compressed):
 0                                32
 [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC]
     (compressed class pointer)

The class word is never overwritten, which means type information is always available, and additional steps are not required to perform a type check, perform a call, etc. Importantly, the parts of the runtime that need that data do not need to coordinate with the locking, hashing, and GC subsystems that may change the mark word.

With compact object headers, the conceptual division into mark word and class word is removed, as the class pointer is subsumed into the “mark word” part of the header. The layout changes to:

Header (compact):
   0 2   6                         32                              64
  [TTAAAAHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC]
(tag)(GC age)    (Hash code)          (compressed class pointer)
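
A minimal illustrative sketch (not HotSpot source code) that models the compact layout above as bit fields of a single 64-bit word, reading the positions in the diagram as bit indices from the least-significant bit:

    // Illustrative model of the compact header layout; field widths follow the
    // diagram: 2 tag bits, 4 GC age bits, 26 identity hash bits, 32 compressed
    // class pointer bits. Not HotSpot code.
    final class CompactHeaderSketch {
        static final int TAG_BITS   = 2;
        static final int AGE_BITS   = 4;
        static final int HASH_BITS  = 26;
        static final int KLASS_BITS = 32;

        static final int TAG_SHIFT   = 0;
        static final int AGE_SHIFT   = TAG_SHIFT + TAG_BITS;    // 2
        static final int HASH_SHIFT  = AGE_SHIFT + AGE_BITS;    // 6
        static final int KLASS_SHIFT = HASH_SHIFT + HASH_BITS;  // 32

        // Extract 'bits' bits of 'header' starting at 'shift'.
        static long field(long header, int shift, int bits) {
            return (header >>> shift) & ((1L << bits) - 1);
        }

        static long tag(long header)             { return field(header, TAG_SHIFT, TAG_BITS); }
        static long age(long header)             { return field(header, AGE_SHIFT, AGE_BITS); }
        static long identityHash(long header)    { return field(header, HASH_SHIFT, HASH_BITS); }
        static long compressedKlass(long header) { return field(header, KLASS_SHIFT, KLASS_BITS); }
    }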

Storing a tagged pointer in the header is now problematic, because doing so would destroy direct access to the compressed class pointer. The following sections of this JEP discuss the runtime changes required to support compact headers.

Due to the complexity of dealing with compact object headers, the feature is gated by an experimental -XX:(-|+)UseCompactObjectHeaders JVM option, which is off by default. Once the feature is deemed stable, we plan to turn it on by default, and remove legacy object header support in some future release.

Note that mark and class words are machine-pointer sized. This means this JEP does not have to deal with header layout on 32-bit platforms, since 32-bit VM headers are already 64 bits.

Interaction with Class Pointers Compression

An important prerequisite for compact headers is that compressed class pointers be unconditionally enabled, since only 32 bits are available to store the class data. Since JDK 8, compressed class pointers are enabled whenever compressed oops are enabled. Since JDK 15, compressed class pointers no longer depend on compressed oops and are enabled by default in most configurations. Therefore, with compact object headers, we are able to use compressed class pointers unconditionally. If uncompressed class pointers are required, then compact object headers cannot be used (see Risks).
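
For background, a compressed class pointer is conceptually decoded by shifting the 32-bit value and adding it to the base of the class space; the sketch below illustrates the idea with hypothetical base and shift values (the real values are chosen by the VM at startup):

    // Conceptual encoding/decoding of a compressed (narrow) class pointer.
    // The base and shift below are hypothetical; HotSpot picks them at startup.
    final class NarrowKlassSketch {
        static final long KLASS_SPACE_BASE = 0x0000_0008_0000_0000L; // hypothetical class-space base
        static final int  KLASS_SHIFT      = 3;                      // hypothetical encoding shift

        // Widen a 32-bit compressed class pointer into a full 64-bit address.
        static long decode(int narrowKlass) {
            return KLASS_SPACE_BASE + ((narrowKlass & 0xFFFF_FFFFL) << KLASS_SHIFT);
        }

        // Compress a full class pointer back into 32 bits; assumes it lies in the class space.
        static int encode(long klassPointer) {
            return (int) ((klassPointer - KLASS_SPACE_BASE) >>> KLASS_SHIFT);
        }
    }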

Interaction with Stack Locking

Stack locking is used as the first light-weight step in the object locking implementation. It covers the case where the monitor is uncontended, no thread control methods such as wait() or notify() are called, and no JNI locking is used. In this case, HotSpot simply displaces the original header to the stack of the locking thread and overwrites the header with a pointer to that stack location. This identifies the thread that locked the object (since thread stack address ranges are known), while retaining the original header.

The complication with compact object headers is that stack locking is inherently racy during unlocking, from the perspective of non-owner threads. If a thread A tries to access the compressed class pointer in the displaced header while a thread B stack-unlocks the object and removes the displaced header from its stack, thread A is left with a dangling stack reference that now likely contains garbage. The same problem exists in the current stack locking code, for example when accessing the identity hash code stored in the displaced header, and it is solved by first inflating the lock into a full-blown monitor lock (see the section below). This approach would not work efficiently with compact object headers, because accesses to class pointers are orders of magnitude more frequent than accesses to the identity hash code, and would thus inflate the majority of stack locks and yield unacceptable performance overheads.

To alleviate that problem, a prerequisite improvement for this JEP is an alternative light-weight locking protocol that stores locking data in a separate thread-local area rather than in the object header, which preserves the original header and the class information. See JDK-8291555: Implement alternative fast-locking scheme.
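
A minimal sketch of the idea behind such a protocol, loosely modeled on the lock-stack approach of JDK-8291555 (names simplified, and without the atomic header-bit transitions the real scheme performs):

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Simplified model of a thread-local "lock stack": locking pushes the object
    // onto the current thread's stack instead of displacing the object header, so
    // the header (and the class information in it) stays intact. Names are illustrative.
    final class LockStackSketch {
        private static final ThreadLocal<Deque<Object>> LOCK_STACK =
                ThreadLocal.withInitial(ArrayDeque::new);

        // Fast-path lock: the real VM also flips the tag bits in the header with a
        // CAS; here we only model the bookkeeping on the owner's side.
        static void lock(Object o) {
            LOCK_STACK.get().push(o);
        }

        // Fast-path unlock: pop the most recently locked object.
        static void unlock(Object o) {
            Object top = LOCK_STACK.get().pop();
            assert top == o : "unbalanced locking";
        }

        // Ownership test: the current thread owns the lock if the object is on its stack.
        static boolean ownsLock(Object o) {
            for (Object e : LOCK_STACK.get()) {
                if (e == o) return true;
            }
            return false;
        }
    }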

Interaction with Monitor Locking

Monitor locking is the final heavy-weight step in the locking implementation. It is invoked when the monitor is contended, when thread control methods are used, or when any light-weight locking step fails. Monitor locking creates a separate ObjectMonitor structure, stores the original header there, and overwrites the header with a pointer to that ObjectMonitor.

Compared to stack locking, monitor locking is not as racy. Once a monitor has been created and installed (inflated), it stays in place until deflation disposes of it. Java threads coordinate with monitor deflation such that, when a Java thread loads a header that carries a monitor pointer, it can also safely access that monitor without the risk of touching a stale monitor. Therefore, once a lock has been inflated to a full ObjectMonitor, it is safe to load the displaced header from the ObjectMonitor and decode the class from there.
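
Conceptually, decoding the class of a monitor-locked object then follows the path sketched below; all types and helpers are hypothetical stand-ins for HotSpot-internal C++ code:

    // Conceptual path for reading the class of an object whose header has been
    // overwritten with an ObjectMonitor pointer. Hypothetical stand-in types.
    final class MonitorHeaderSketch {
        static final long TAG_MASK    = 0b11;   // low tag (lock) bits
        static final long TAG_MONITOR = 0b10;   // tag value indicating an installed monitor

        interface ObjectMonitor {
            long displacedHeader();             // original header saved at inflation time
        }

        interface HeaderAccess {
            long header(Object o);              // read the 64-bit header word
            ObjectMonitor monitor(long header); // interpret the header as a monitor pointer
        }

        static long compressedKlass(Object o, HeaderAccess access) {
            long header = access.header(o);
            if ((header & TAG_MASK) == TAG_MONITOR) {
                // Header is overwritten: fetch the preserved (displaced) header from the monitor.
                header = access.monitor(header).displacedHeader();
            }
            // The upper 32 bits of the compact header hold the compressed class pointer.
            return header >>> 32;
        }
    }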

Interaction with GC Forwarding

Moving GCs relocate objects in two steps: first, they move each object and record the mapping between its old and new location (forwarding); then they use this mapping to update references in the entire heap or in a particular generation.

Out of the current HotSpot GCs, only ZGC uses a separate forwarding table to record forwardings. All other HotSpot GCs record forwarding information by overlaying the header of the old object with the new location of the object. There are two distinct scenarios that involve headers.

Copying phases: objects are copied to an empty space, and the forwarding pointer is stored in the header of the old copy. This way, the header is preserved in the new copy. Therefore, all data from the original header is available. GC code that requires the class pointer, for example, to determine the object size for heap iteration, can reach the new copy and the original header there. This is the simple case.

Sliding phases: objects are relocated (slid towards lower addresses) within the same space. This is typically invoked when heap memory is exhausted and there is not enough empty space left to copy objects into. When that happens, a last-ditch effort is made to do a full collection using a sliding GC.

Sliding GC works in 4 phases:

  1. Mark: Determine the set of live objects.
  2. Compute Addresses: Walk over all live objects and compute their new locations (where they would be placed one after another). Record those final locations as forwardings in the object headers.
  3. Update References: Walk over all live objects and update all object references to point to the new locations.
  4. Copy: Actually copy all live objects to their new locations.
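
As a sketch of how header-based forwarding interacts with these phases (all names below are hypothetical stand-ins for VM-internal code, and phase 3 is omitted), note how phase 2 overwrites the original header with the forwarding address:

    import java.util.List;

    // Simplified model of sliding compaction with header-based forwarding,
    // illustrating why the original headers are lost in the compute-addresses phase.
    // HeapObject and its methods are hypothetical stand-ins for VM-internal code.
    final class SlidingCompactionSketch {
        interface HeapObject {
            long size();                  // object size in heap words
            long header();                // current 64-bit header word
            void setHeader(long value);   // overwrite the header word
            void copyTo(long newAddress); // move the object to its new location
        }

        // Phase 2: walk live objects in address order and record their new locations.
        static void computeAddresses(List<HeapObject> liveObjects, long compactionStart) {
            long next = compactionStart;
            for (HeapObject obj : liveObjects) {
                long size = obj.size();   // read the size before the header is clobbered
                obj.setHeader(next);      // forwarding overwrites the original header
                next += size;
            }
        }

        // Phase 4: copy each object to the address recorded in phase 2.
        static void copy(List<HeapObject> liveObjects) {
            for (HeapObject obj : liveObjects) {
                obj.copyTo(obj.header()); // the header now holds only the forwarding address
            }
        }
    }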

Notice that in step 2 we lose the original headers. This is also a problem for the current implementation: if the header is “interesting”, i.e. it contains an identity hash code, locking information, etc., then it needs to be preserved. Current GCs do that by storing such headers in a side table and restoring them after the GC. This works well because there are usually only a few objects with “interesting” headers. With compact object headers, every object has an interesting header, because that header now contains crucial class information. Storing such a large number of preserved headers would consume a significant amount of native memory.

To overcome this problem, a prerequisite of this JEP proposes an alternative implementation that uses a compact native forwarding table to store forwardings, leaving the original headers intact. We know from ZGC experience that it is possible to implement a forwarding table with a reasonable footprint. See (insert the link here once ready) “JDK-XXXXXX: Forwarding tables for sliding GCs”.
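
Conceptually, such a table is just a mapping from old to new addresses that is consulted instead of the header; the sketch below uses a plain hash map for clarity, whereas the actual implementation would use a denser, GC-specific data structure:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a side forwarding table: forwardings live outside the objects,
    // so the original headers (and the class information in them) remain intact.
    final class ForwardingTableSketch {
        private final Map<Long, Long> oldToNew = new HashMap<>();

        void forward(long oldAddress, long newAddress) {
            oldToNew.put(oldAddress, newAddress);
        }

        long forwardee(long oldAddress) {
            Long newAddress = oldToNew.get(oldAddress);
            return newAddress != null ? newAddress : oldAddress; // not forwarded: stays in place
        }
    }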

Interaction with GC Walking

Garbage collectors frequently need to walk the “parsable heap” by iterating over objects using their sizes. Determining an object’s size involves an access to its class pointer.

When the class pointer is encoded in the header, there is an additional cost for decoding the class from the header. This cost amounts to simple arithmetic after reading the header, and it is low compared to the cost of the actual memory accesses involved in such a GC walk. No additional implementation work is needed here, since GCs access classes via the common VM interfaces.
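
For illustration, a parsable-heap walk is an address-ordered loop; with compact headers the only extra header-related cost on this path is decoding the class from the header word. All types and helpers in this sketch are hypothetical stand-ins for VM-internal code:

    // Sketch of a parsable-heap walk: visit objects in address order, advancing
    // by each object's size, which is obtained through its class. Hypothetical types.
    final class HeapWalkSketch {
        interface Heap {
            long bottom();                                          // first object address
            long top();                                             // end of the parsable region
            long headerAt(long address);                            // read the header word at 'address'
            long sizeFromKlass(long compressedKlass, long address); // object size via its class
            void visit(long address);                               // whatever the walk is for
        }

        static void walk(Heap heap) {
            long addr = heap.bottom();
            while (addr < heap.top()) {
                long header = heap.headerAt(addr);
                long klass = header >>> 32;              // decode compressed class pointer: simple math
                heap.visit(addr);
                addr += heap.sizeFromKlass(klass, addr); // advance to the next object
            }
        }
    }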

When the header is displaced (the object is locked, already forwarded, etc.), there is an additional cost of dereferencing the pointer to reach the original header. This cost can be significant. However, we anticipate that these cases are rare in practice: locked objects are rare, size-walking over already-forwarded objects is rare, and so on. Further testing on real-world applications once this experimental feature is delivered will give us more data on these overheads.

Interaction with Identity Hash Code

The current implementation cannot and does not store the entire 32-bit identity hash code: it stores only 31 bits on 64-bit platforms and only 25 bits on 32-bit platforms. With compact object headers, the space constraints get even tighter: we can only store 26 bits of the identity hash code. As with the current implementation, this does not affect the correctness of the identity hash code, but it may affect the performance of large hash tables when using keys that rely on higher System.identityHashCode() entropy.
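
For a sense of scale, 26 bits still give 2^26 = 67,108,864 distinct identity hash values (versus 2^31 with 31 bits). A minimal sketch of constraining a freshly generated hash to 26 bits:

    // Sketch: constraining a generated identity hash to the 26 bits available in
    // the compact header. Correctness is unaffected; only the value range shrinks.
    final class IdentityHashSketch {
        static final int HASH_BITS = 26;
        static final int HASH_MASK = (1 << HASH_BITS) - 1;

        static int constrain(int rawHash) {
            int h = rawHash & HASH_MASK;
            return h == 0 ? 1 : h; // 0 is conventionally reserved to mean "no hash assigned yet"
        }
    }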

Alternatives

Inline types. We could continue with current object headers and rely on Valhalla inline types, which obviate the need for object headers when values are inlined. This, however, does not help existing applications that have not migrated to inline types. Therefore, we see compact object headers as a complementary feature to the improvements that inline types bring to the table.

Continue to maintain 32-bit platforms. Since mark and class words are sized as machine pointers, the headers on 32-bit platforms are already 64 bits. However, the difficulty of maintaining the 32-bit ports, coupled with the industry’s move away from 32-bit environments, makes this alternative impractical in the future.

Delay this JEP and implement 32-bit headers right away. With much more effort, we could implement 32-bit headers. This would likely involve changing the object→monitor mapping to use a side table, implementing an on-demand side table for identity hash codes, etc. That is the ultimate goal, but initial explorations show it requires much more work. This JEP captures an important milestone that brings substantial improvements to the Java ecosystem, and it is more beneficial, at lower risk, to deliver this experimental milestone to end users while Project Lilliput continues with further improvements towards 32-bit headers.

Testing

Changing the header layout of Java objects touches many HotSpot JVM subsystems: the runtime, all garbage collectors, all just-in-time compilers, the interpreters, the serviceability agent, and architecture-specific code for all supported platforms. Such massive changes warrant massive testing.

Compact object headers are tested by:

All these tests are executed with the feature turned on and off, with multiple combinations of GCs and JIT compilers, and on several hardware targets.

This JEP also delivers a set of new test cases which measure the size of a variety of objects, e.g. plain objects, primitive type arrays, reference arrays and their headers.
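
Outside the JDK test suite, object and header sizes can also be inspected with the JOL (Java Object Layout) tool; the example below assumes the org.openjdk.jol library is on the classpath and is illustrative rather than part of this JEP’s test suite:

    import org.openjdk.jol.info.ClassLayout;
    import org.openjdk.jol.vm.VM;

    // Prints the VM's object layout details and the field/size layout of a small
    // object, which makes changes in header size directly visible.
    public class ObjectSizeProbe {
        static final class Pair {
            int x;
            int y;
        }

        public static void main(String[] args) {
            System.out.println(VM.current().details());
            System.out.println(ClassLayout.parseInstance(new Pair()).toPrintable());
        }
    }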

The ultimate test for performance and correctness would be real-world workloads once this experimental feature is delivered.

Risks and Assumptions

Future runtime features need object header bits. Other runtime features might need bits in the compact object header. The current proposal leaves no spare bits in the header, which would be a problem for those features. We mitigate this risk organizationally by discussing object header needs with major JDK projects. We mitigate this risk technically by assuming that identity hash code width and compressed class pointer width can be reduced even more in order to make several bits available should future runtime features need them.

Implementation bugs on feature path. The usual risk for an intrusive feature such as this is bugs in the implementation. While issues in the normal header layout would likely be visible immediately in most tests, subtleties in the new locking and GC forwarding protocols may expose rare bugs. We mitigate this risk with careful reviews by the component owners, and by running many tests with the feature enabled. This risk does not affect the default product, as long as the feature remains experimental and off by default.

Implementation bugs on legacy path. While the new code tries to avoid changing legacy code paths, some refactorings necessarily touch shared code. This exposes the risk of bugs even when the feature is disabled. In addition to careful reviews and testing, we mitigate this risk by coding defensively and trying to avoid modifying shared code paths, even if it requires more work in the feature code paths.

Performance issues on feature path. There are known risks that the more complicated interactions with compact object headers may introduce performance issues when the feature is enabled. We mitigate this risk by running major benchmarks and understanding the feature’s impact on their performance. There are performance costs for accessing the class pointer indirectly, for using the alternative stack locking scheme, for employing the alternative GC sliding forwarding machinery, and for having fewer identity hash code bits. This risk does not affect the default product, as long as the feature remains experimental and off by default.

Performance issues on legacy path. There is a minor risk that refactoring in the legacy code paths affects performance in unexpected ways. We mitigate this risk by minimizing the changes in the legacy code paths and by verifying that performance on major workloads is not substantially affected.

Compressed class pointers. Compact object headers require compressed class pointers to be enabled. In older JDKs, disabling compressed oops (even implicitly, on large Java heaps) also disabled compressed class pointers. After JDK 15, that is no longer the case for most configurations; see JDK-8241825 and related issues. The only exception is when JVMCI is enabled. JVMCI usages need to be fixed before the feature graduates out of experimental status. We mitigate the current risk by disabling compact object headers if JVMCI is enabled.

Changing low-level interfaces. Some components that poke into object headers directly, notably GraalVM as the major user of JVMCI, would have to adopt the new header layout. We mitigate the current risk by identifying these components and disabling the feature when they are in use. Before the feature graduates from the experimental phase, those components need to be fixed.

Project failure. While very unlikely, it may turn out that the complexity the feature brings does not yield tangible real-world improvements, or that the achievable improvements do not balance out the additional complexity of dealing with compact object headers. We mitigate this minor risk by gating the feature code paths behind the experimental runtime flag, thus keeping an open and clear path to remove the feature in a future release, should the need arise.

Dependencies

The following issues must be resolved as prerequisites for this JEP:

JDK-8291555: Implement alternative fast-locking scheme

JDK-XXXXXXX: Forwarding tables for sliding GCs

The following issue optionally improves the footprint after this JEP integrates:

JDK-8139457: Array bases are aligned at HeapWord granularity