JEP 312: Thread-Local Handshakes
Owner | Robbin Ehn |
Type | Feature |
Scope | JDK |
Status | Closed / Delivered |
Release | 10 |
Component | hotspot / runtime |
Discussion | hotspot dash dev at openjdk dot java dot net |
Blocks | JEP 333: ZGC: A Scalable Low-Latency Garbage Collector (Experimental) |
Reviewed by | Mikael Vidstedt |
Endorsed by | Mikael Vidstedt |
Created | 2017/08/01 13:30 |
Updated | 2019/08/21 10:41 |
Issue | 8185640 |
Summary
Introduce a way to execute a callback on threads without performing a global VM safepoint. Make it both possible and cheap to stop individual threads and not just all threads or none.
Non-Goals
It may not be feasible to implement this efficiently on all supported architectures. It is initially not a goal to support all processor architectures and all versions of processor architectures.
Success metrics
- The new mechanism does not incur performance overheads larger than 1% in standard benchmarks.
- The new mechanism does not increase the time required to reach a traditional global safepoint.
Motivation
Being able to stop individual threads has a multitude of applications:
- Improving biased lock revocation to stop only individual threads when revoking biases, rather than all of them.
- Reducing the overall VM latency impact of different types of serviceability queries, such as acquiring stack traces for all threads, which can be a slow operation on a VM with a large number of Java threads.
- Performing safer stack trace sampling by reducing reliance on signals.
- Eliding some memory barriers using so-called Asymmetric Dekker Synchronization techniques, by performing handshakes with Java threads. For example, the conditional card mark code, inherently required by G1 and used by CMS, will not need memory barriers. As a result, the G1 post write barrier can be optimized, and branches that try to avoid the memory barrier can be removed.
All of these will help the VM achieve lower latency by reducing the number of global safepoints.
Description
A handshake operation is a callback that is executed for each JavaThread while that thread is in a safepoint-safe state. The callback is executed either by the thread itself or by the VM thread while keeping the thread in a blocked state. The big difference between safepointing and handshaking is that the per-thread operation will be performed on all threads as soon as possible, and each thread will continue to execute as soon as its own operation is completed. If a JavaThread is known to be running, then a handshake can be performed with that single JavaThread as well.
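The behavior described above can be illustrated with a minimal Java simulation. This is a sketch under stated assumptions, not HotSpot code: `HandshakeSketch`, `Worker`, and `handshakeAll` are hypothetical names, the poll is modeled as an atomic read in a loop, and the case where the VM thread runs the callback on behalf of a blocked thread is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class HandshakeSketch {
    /** Simulated Java thread: does work and polls for a pending operation. */
    static final class Worker extends Thread {
        final AtomicReference<Runnable> pending = new AtomicReference<>();
        volatile boolean running = true;

        @Override
        public void run() {
            while (running) {
                // ... simulated application work would go here ...
                Runnable op = pending.getAndSet(null); // simulated poll
                if (op != null) {
                    op.run(); // the thread runs its own operation, then resumes at once
                }
            }
        }
    }

    /** Posts perThreadOp to every worker and waits until each has run it. */
    static void handshakeAll(List<Worker> workers, Runnable perThreadOp)
            throws InterruptedException {
        CountDownLatch done = new CountDownLatch(workers.size());
        for (Worker w : workers) {
            w.pending.set(() -> {
                perThreadOp.run();
                done.countDown();
            });
        }
        done.await(); // only the requester waits; workers never wait for each other
    }

    /** Starts threadCount workers, handshakes with all, returns callbacks executed. */
    static int run(int threadCount) {
        List<Worker> workers = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            Worker w = new Worker();
            w.setDaemon(true);
            w.start();
            workers.add(w);
        }
        AtomicInteger executed = new AtomicInteger();
        try {
            handshakeAll(workers, executed::incrementAndGet);
            for (Worker w : workers) {
                w.running = false;
            }
            for (Worker w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return executed.get();
    }

    public static void main(String[] args) {
        System.out.println("handshakes completed: " + run(4));
    }
}
```

Unlike a global safepoint, no worker in this sketch waits for the others: each one resumes its loop immediately after its own callback returns.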
In the initial implementation there will be a limitation of at most one handshake operation in flight at a given time. The operation can, however, involve any subset of all JavaThreads. The VM thread will coordinate the handshake operation through a VM operation which will in effect prevent global safepoints from occurring during the handshake operation.
The current safepointing scheme is modified to perform an indirection through a per-thread pointer which will allow a single thread's execution to be forced to trap on the guard page. Essentially, at all times there will be two polling pages: one that is always guarded, and one that is always unguarded. In order to force a thread to yield, the VM updates the per-thread pointer for the corresponding thread to point to the guarded page.
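The per-thread indirection can be modeled roughly as follows. This is an illustrative sketch only: `PollingPageSketch`, `Page`, and `ThreadState` are hypothetical names, and the hardware trap caused by loading from a guarded page is replaced here by an explicit check, which is not how the real poll works. The sketch also assumes that the trap handler resets the pointer once the thread has yielded.

```java
final class PollingPageSketch {
    /** Stand-in for a polling page; 'guarded' models page protection. */
    static final class Page {
        final boolean guarded;
        Page(boolean guarded) { this.guarded = guarded; }
    }

    // Two shared pages: one always guarded, one always unguarded.
    static final Page GUARDED = new Page(true);
    static final Page UNGUARDED = new Page(false);

    /** Per-thread state holding the pointer that the poll goes through. */
    static final class ThreadState {
        volatile Page pollingPage = UNGUARDED;
        int traps; // how many times this thread was forced to yield

        /** Simulated poll: a load through the per-thread pointer. */
        void poll() {
            if (pollingPage.guarded) {   // real code would fault here instead
                traps++;
                pollingPage = UNGUARDED; // assumption: handler disarms afterwards
            }
        }
    }

    /** Forces a single thread to yield at its next poll. */
    static void arm(ThreadState target) {
        target.pollingPage = GUARDED;
    }

    public static void main(String[] args) {
        ThreadState a = new ThreadState();
        ThreadState b = new ThreadState();
        arm(a);   // only thread a is asked to stop
        a.poll();
        b.poll();
        System.out.println("a traps: " + a.traps + ", b traps: " + b.traps);
    }
}
```

Because the page protections never change, arming one thread is a single pointer store, and other threads polling through their own pointers are unaffected.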
Thread-local handshakes will be implemented initially on x64 and SPARC. Other platforms will fall back to normal safepoints.
A new product option, -XX:ThreadLocalHandshakes (default value true), allows users to select normal safepoints on supported platforms.
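As with other boolean HotSpot options, the flag would be switched off with -XX:-ThreadLocalHandshakes on the command line. One possible way to inspect its value from inside a running JVM is sketched below. `HandshakeFlagCheck` and `describeFlag` are hypothetical names; the sketch assumes a HotSpot JVM that exposes `com.sun.management.HotSpotDiagnosticMXBean`, and it reports the flag as unavailable on JVMs that do not recognize the option.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class HandshakeFlagCheck {
    /** Returns "name=value", or a note if this JVM does not know the option. */
    static String describeFlag(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return name + " cannot be queried on this JVM";
            }
            return name + "=" + bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            // Thrown when the option (or the MXBean) does not exist in this JVM.
            return name + " is not available in this JVM";
        }
    }

    public static void main(String[] args) {
        System.out.println(describeFlag("ThreadLocalHandshakes"));
    }
}
```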
Alternatives
Multiple alternatives were considered:
- Emit conditional branches instead. This consumes branch predictor state and is not as tight as just a load. Experiments in this area have shown that the performance of conditional branches can be highly dependent on the specific microarchitecture of the target CPU. Another drawback with the conditional branches approach is that each conditional branch safepoint would need a corresponding stub to be output to take care of returning to the location of the poll.
- Another idea is to sacrifice an additional register and have each poll load from the address the register holds into the register itself, assuming the register contains the address of its own thread-local field. One would start the thread-local handshake by changing the field to NULL. At the next poll the register would be set to NULL, and at the second poll the load would trap. This requires sacrificing a register globally, the traps are more expensive, and on average it will take twice as many polls to reach the safepoint once the request is made for a thread to stop. The benefit is that it theoretically has a lower impact on application execution.
- Previously a prototype was constructed where the global polling page was left as-is but only the actual target thread(s) were caught in the VM code. Threads which were not targets of the handshake would simply return from the signal handler and continue executing. A drawback with this approach is that if a target thread is slow to respond then this can cause a signal storm for other Java threads since the polling page cannot be disarmed until the target thread has responded.
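The two-poll delay of the self-loading register alternative can be traced with a small simulation. This is a sketch under assumptions: `SelfLoadPollSketch` and `Cell` are hypothetical names, Java object references stand in for machine addresses, and the trap is modeled as an explicit null check rather than a faulting load.

```java
final class SelfLoadPollSketch {
    /** Stand-in for the thread-local field; 'value' is the word stored in it. */
    static final class Cell {
        Cell value;
    }

    /** Returns how many polls occur after the request before the trap fires. */
    static int pollsUntilTrap() {
        Cell field = new Cell();
        field.value = field;   // the field holds its own address
        Cell register = field; // the dedicated register points at the field

        // Normal execution: each poll reloads the same address, nothing changes.
        for (int i = 0; i < 3; i++) {
            register = register.value;
        }

        field.value = null;    // handshake request: write NULL into the field

        int polls = 0;
        while (true) {
            polls++;
            if (register == null) {
                return polls;  // loading through NULL would trap here
            }
            register = register.value; // first poll after the request loads NULL
        }
    }

    public static void main(String[] args) {
        System.out.println("polls until trap: " + pollsUntilTrap());
    }
}
```

In this trace the first poll after the request only moves NULL into the register, and the trap happens on the second poll, which matches the extra poll the alternative is described as costing.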