JEP 270: Reserved Stack Areas for Critical Sections
Owner | Frederic Parain |
Type | Feature |
Scope | JDK |
Status | Closed / Delivered |
Release | 9 |
Component | hotspot / runtime |
Discussion | hotspot dash runtime dash dev at openjdk dot java dot net |
Effort | M |
Duration | M |
Reviewed by | Karen Kinnear, Mikael Vidstedt |
Created | 2014/06/16 16:35 |
Updated | 2023/10/30 09:29 |
Issue | 8046936 |
Summary
Reserve extra space on thread stacks for use by critical sections, so that they can complete even when stack overflows occur.
Goals
- Provide a mechanism to mitigate the risk of deadlocks that arise when a StackOverflowError thrown in a critical section corrupts critical data such as java.util.concurrent locks (for example, ReentrantLock).
- The solution must be mostly JVM-based in order not to require modifications to java.util.concurrent algorithms or published interfaces, or existing library and application code.
- The solution must not be limited to the ReentrantLock case, and should be applicable to any critical section in privileged code.
Non-Goals
- The solution doesn't aim to provide robustness against stack overflows to non-privileged code.
- The solution doesn't aim to avoid StackOverflowErrors, but rather to mitigate the risk that such an error is thrown inside a critical section and thereby corrupts some data structures.
- The proposed solution is a trade-off: it solves some well-known corruption cases while preserving performance, at a reasonable resource cost and with relatively low complexity.
Motivation
StackOverflowError
is an asynchronous exception that can be thrown by
the Java Virtual Machine whenever the computation in a thread requires a
larger stack than is permitted (JVM spec §2.5.2 and
§2.5.6). The Java Language Specification permits a
StackOverflowError
to be thrown synchronously by method invocation (JLS
§11.1.3). The HotSpot VM uses this property to implement a
"stack-banging" mechanism on method entry.
The stack-banging mechanism is a clean way to report that a stack overflow has occurred while preserving the JVM's integrity, but it doesn't provide a safe way for the application to recover from this situation. A stack overflow could occur in the middle of a sequence of modifications which, if not complete, could leave a data structure in an inconsistent state.
For instance, when a StackOverflowError is thrown in a critical section
of the java.util.concurrent.locks.ReentrantLock class, the lock status
can be left in an inconsistent state, leading to potential deadlocks. The
ReentrantLock class uses an instance of AbstractQueuedSynchronizer to
implement its critical section. The implementation of its lock() method
is:
    final void lock() {
        if (compareAndSetState(0, 1))
            setExclusiveOwnerThread(Thread.currentThread());
        else
            acquire(1);
    }
The method tries to change the status word with an atomic operation. If
the modification is successful then the owner is set by invoking a setter
method, otherwise the slow path is invoked. The problem is that if a
StackOverflowError
is thrown after the status word has been changed and
before the owner has been effectively set then the lock becomes unusable:
Its status word indicates it is locked but no owner has been set, so no
thread can unlock it. Because stack-size checks are performed at
method-invocation time (in HotSpot, at least), a StackOverflowError
can
be thrown either when Thread.currentThread()
is invoked or when
setExclusiveOwnerThread()
is invoked. In either case it leads to a
corruption of the ReentrantLock
instance, and all threads trying to
acquire this lock will be blocked forever.
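The window described above can be sketched with a toy lock. This is only an illustration under stated assumptions: ToyLock and ToyLockDemo are hypothetical names, the code is not the real ReentrantLock implementation, and the StackOverflowError is thrown by hand to stand in for the stack check that would fail at the method invocation following the state change.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the lock() fast path; hypothetical, not the JDK code.
final class ToyLock {
    private final AtomicInteger state = new AtomicInteger(0);
    private Thread owner;

    // failBeforeOwnerSet simulates a StackOverflowError raised by the
    // stack check of a method invoked after the state word has changed.
    void lock(boolean failBeforeOwnerSet) {
        if (state.compareAndSet(0, 1)) {
            if (failBeforeOwnerSet) {
                throw new StackOverflowError("simulated");
            }
            owner = Thread.currentThread();
        } else {
            throw new IllegalStateException("contended path not modelled");
        }
    }

    // Only the recorded owner may unlock; returns false otherwise.
    boolean unlock() {
        if (owner != Thread.currentThread()) {
            return false;
        }
        owner = null;
        state.set(0);
        return true;
    }

    boolean isLocked() { return state.get() != 0; }

    Thread owner() { return owner; }
}

final class ToyLockDemo {
    // True if the lock ends up locked, without an owner, and cannot be unlocked.
    static boolean corruptedAfterSimulatedOverflow() {
        ToyLock lock = new ToyLock();
        try {
            lock.lock(true);
        } catch (StackOverflowError expected) {
            // The thread survives the error, but the lock state does not.
        }
        return lock.isLocked() && lock.owner() == null && !lock.unlock();
    }

    public static void main(String[] args) {
        System.out.println("corrupted: " + corruptedAfterSimulatedOverflow());
    }
}
```

In this model the lock stays marked as locked with no owner, so no thread passes the ownership test in unlock(), which mirrors the deadlock scenario in the text.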
This particular problem caused some serious issues in JDK 7 because
parallel class loading was implemented using a ConcurrentHashMap
and,
at that time, the ConcurrentHashMap
code used ReentrantLock
instances. If a ReentrantLock
instance was corrupted because of a
StackOverflowError
then the class-loading mechanism itself could
deadlock. (This happened in stress tests
(JDK-7011862), but
could also happen in the field.)
The implementation of the ConcurrentHashMap class was completely
changed in June 2013. The new implementation uses synchronized
statements rather than ReentrantLock instances, so JDK 8 and later
releases are not subject to class-loading deadlock due to corrupted
ReentrantLocks. However, any code using ReentrantLock can still be
impacted and cause deadlock. Such issues have already been reported on
the concurrency-interest@cs.oswego.edu mailing list.
The problem is not limited to the ReentrantLock
class.
Java applications or libraries often rely on the consistency of data structures to work properly. Any modification of those data structures is a critical section: Before the execution of the critical section the data structures are consistent, and after its execution the data structures are consistent too. During its execution, however, the data structure could go through transient inconsistent states.
If a critical section is made of a single Java method containing no other
method invocation, the current stack overflow mechanism works well:
Either the available stack is sufficient and the method executes without
trouble, or it is not sufficient and so a StackOverflowError
is thrown
before the first bytecode of the method is executed.
The problem occurs when a critical section is made of several methods,
for instance a method A which invokes a method B. The available stack can
be sufficient to let method A start its execution. Method A starts to
modify a data structure and then invokes method B, but the remaining
stack is not sufficient to execute B, causing a StackOverflowError
to
be thrown. Because method B and the remainder of method A have not been
executed, the consistency of the data structure might have been
compromised.
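The A-invokes-B scenario can be tried with real recursion. The sketch below uses hypothetical names (PairedCounters, update, bumpSecond) and keeps two counters that should always be equal outside update(). Whether the overflow lands between the two increments depends on stack layout and JIT inlining, so a given run may show an imbalance of 1 or of 0; the example only demonstrates that the inconsistent outcome is possible.

```java
// Two counters that must be equal whenever update() is not running.
final class PairedCounters {
    int first;
    int second;

    // "Method A": starts the modification, then invokes method B.
    void update() {
        first++;
        bumpSecond();
    }

    // "Method B": completes the modification.
    private void bumpSecond() {
        second++;
    }

    // Recurses until the thread runs out of stack.
    void recurseForever() {
        update();
        recurseForever();
    }

    // Runs the recursion and returns first - second after the overflow.
    static int imbalanceAfterOverflow() {
        PairedCounters c = new PairedCounters();
        try {
            c.recurseForever();
        } catch (StackOverflowError e) {
            // Deliberately swallowed so the final state can be inspected.
        }
        return c.first - c.second;
    }

    public static void main(String[] args) {
        System.out.println("first - second = " + imbalanceAfterOverflow());
    }
}
```

A result of 1 means the error was raised when update() invoked bumpSecond(), leaving the pair inconsistent; a result of 0 means the error was raised at one of the other invocations.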
Description
The main idea of the proposed solution is to reserve some space on the
execution stack for critical sections, to allow them to complete their
execution where regular code would have been interrupted by a stack
overflow. The assumption is that critical sections are relatively small
and do not require enormous space on the execution stack to complete
successfully. The goal is not to rescue a faulty thread which hits its
stack limit, but rather to preserve shared data structures that could be
corrupted if the StackOverflowError
is thrown in a critical section.
The main mechanism will be implemented in the JVM. The only modification
required in the Java source code is the annotation that must be used to
identify the critical sections. This annotation, currently named
jdk.internal.vm.annotation.ReservedStackAccess, is a runtime method
annotation that can be used by any class of privileged code (see the
paragraph below about the accessibility of this annotation).
In order to prevent the corruption of shared data structures, the JVM
will try to delay the throwing of a StackOverflowError
until the thread
in question has exited all of its critical sections. Each Java thread has
a new zone defined in its execution stack, called the reserved zone. This
zone can be used only if the Java thread has a method annotated with
jdk.internal.vm.annotation.ReservedStackAccess
in its current call stack. When a stack
overflow condition is detected by the JVM, and the thread has an
annotated method in its call stack, the JVM grants temporary access to
the reserved zone until no more annotated methods are present in the call
stack. When access to the reserved zone is revoked, a delayed
StackOverflowError is thrown. If the thread has no annotated method in
its call stack when the stack overflow condition is detected then the
StackOverflowError is thrown immediately (this is current JVM behavior).
Note that the reserved stack space is usable by annotated methods but also by methods invoked, directly or transitively, from them. The nesting of annotated methods is naturally supported, but there's a single shared reserved zone per thread; that is, the invocation of an annotated method does not add a new reserved zone. The sizing of the reserved zone must be done according to the worst case of all annotated critical sections.
By default, the jdk.internal.vm.annotation.ReservedStackAccess
annotation is applicable
only to privileged code (code loaded by the bootstrap or the extension
class loader). Both privileged code and non-privileged code can be
annotated with this annotation but by default the JVM will ignore it for
non-privileged code. The rationale behind this default policy is that the
reserved stack space for critical sections is a shared resource among all
critical sections. If any arbitrary code is able to use this space then
it is not a reserved space anymore, and this would defeat the whole
solution. A JVM flag is available, even in product builds, to relax this
policy and allow any code to be able to benefit from this feature.
Implementation
In the HotSpot VM, each Java thread has two zones defined at the end of its execution stack: the yellow zone and the red zone. Both memory areas are protected against all accesses.
If, during its execution, a thread tries to use the memory in the yellow
zone, a protection fault is triggered, the protection of the yellow zone
is temporarily removed, and a StackOverflowError is created and
thrown. Before unwinding the thread execution stack to propagate the
StackOverflowError, the protection of the yellow zone is restored.
If the thread tries to use the memory in its red zone, the JVM immediately branches to JVM error-reporting code, leading to the generation of an error report and a crash dump of the JVM process.
The new zone defined by the proposed solution is placed just before the
yellow zone. This reserved zone will behave like regular stack space if
the thread has a ReservedStackAccess-annotated method in its call
stack, and like the yellow zone otherwise.
During the setup of the execution stack of a Java thread, the reserved
zone is protected the same way as the yellow and the red zones. If,
during its execution, the thread hits its reserved zone, a SIGSEGV
signal is generated and the signal handler applies the following
algorithm:
- If the address of the fault is in the red zone, generate a JVM error report and a crash dump.
- If the address of the fault is in the yellow zone, create and throw a StackOverflowError.
- If the address of the fault is in the reserved zone, perform a stack walk to check if there's a method annotated with jdk.internal.vm.annotation.ReservedStackAccess on the call stack. If not, create and throw a StackOverflowError. If an annotated method is found, remove the protection of the reserved zone and store in the C++ Thread object the stack pointer of the outermost activation (frame) related to an annotated method.
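The three cases above can be modelled in a few lines of Java. This is a toy model, not HotSpot code: the class StackZoneModel, its Frame type, the Action names, and the use of plain numbers as addresses are all assumptions made for illustration. The real logic lives in the JVM's native signal handler.

```java
import java.util.List;

// Toy model of the fault-handling decision; addresses are plain numbers.
final class StackZoneModel {
    enum Action { ERROR_REPORT_AND_CRASH, THROW_STACK_OVERFLOW_ERROR, GRANT_RESERVED_ACCESS }

    // One activation on the modelled call stack.
    static final class Frame {
        final String method;
        final long sp;
        final boolean annotated;

        Frame(String method, long sp, boolean annotated) {
            this.method = method;
            this.sp = sp;
            this.annotated = annotated;
        }
    }

    // Stacks grow downward: the red zone is lowest, then yellow, then reserved.
    final long redStart, yellowStart, reservedStart, reservedEnd;

    // Models the value stored in the Thread object; 0 means "none stored".
    long reservedActivation = 0;
    boolean reservedZoneProtected = true;

    StackZoneModel(long redStart, long yellowStart, long reservedStart, long reservedEnd) {
        this.redStart = redStart;
        this.yellowStart = yellowStart;
        this.reservedStart = reservedStart;
        this.reservedEnd = reservedEnd;
    }

    Action onFault(long faultAddress, List<Frame> callStack) {
        if (faultAddress < redStart || faultAddress >= reservedEnd) {
            throw new IllegalArgumentException("not a guard-zone fault");
        }
        if (faultAddress < yellowStart) {
            return Action.ERROR_REPORT_AND_CRASH;
        }
        if (faultAddress < reservedStart) {
            return Action.THROW_STACK_OVERFLOW_ERROR;
        }
        // Reserved zone: walk the stack for the outermost annotated frame,
        // which is the annotated frame with the highest stack pointer.
        Frame outermost = null;
        for (Frame f : callStack) {
            if (f.annotated && (outermost == null || f.sp > outermost.sp)) {
                outermost = f;
            }
        }
        if (outermost == null) {
            return Action.THROW_STACK_OVERFLOW_ERROR;
        }
        reservedZoneProtected = false;
        reservedActivation = outermost.sp;
        return Action.GRANT_RESERVED_ACCESS;
    }
}
```

Recording the outermost annotated frame, rather than the innermost, matches the text: access to the reserved zone lasts until no annotated method remains on the call stack.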
If the protection of the reserved zone has been removed to allow a
critical section to complete its execution, the protection must be
restored and the delayed StackOverflowError
thrown as soon as the
thread exits the critical section. The HotSpot interpreter has been
modified to check if the registered outermost annotated method is being
exited. The check is performed on every frame-activation removal by
comparing the value of the stack pointer being restored with the value
stored in the C++ Thread
object. If the restored stack pointer is above
the stored value (stacks grow downward), a call to the runtime is
performed to change the memory protection and reset the stack pointer
value in the Thread
object before jumping to the StackOverflowError
generation code. The two compilers have been modified to perform the same
check on method exit, but only for ReservedStackAccess-annotated
methods or methods with annotated methods inlined in their compiled
code.
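The method-exit check described above reduces to one comparison on the fast path. The following toy model shows the idea; ReservedExitCheck and its members are hypothetical names, addresses are plain numbers, and the Long.MAX_VALUE sentinel for "no access granted" is a modelling choice, not a statement about how HotSpot initializes the stored value.

```java
// Toy model of the check performed when a frame is removed.
final class ReservedExitCheck {
    // Sentinel meaning "no access granted". No stack pointer exceeds it,
    // so the common case below stays a single comparison.
    static final long NO_ACTIVATION = Long.MAX_VALUE;

    long reservedActivation = NO_ACTIVATION;
    boolean reservedZoneProtected = true;

    // Called when the fault handler grants access to the reserved zone.
    void grantAccess(long outermostAnnotatedSp) {
        reservedZoneProtected = false;
        reservedActivation = outermostAnnotatedSp;
    }

    // restoredSp is the stack pointer value restored as a frame is removed.
    // Returns true when the delayed StackOverflowError must now be thrown.
    boolean onFrameExit(long restoredSp) {
        if (restoredSp <= reservedActivation) {
            return false; // still inside the outermost annotated method
        }
        // Stacks grow downward, so a higher value means the outermost
        // annotated frame has been left: re-protect and reset.
        reservedZoneProtected = true;
        reservedActivation = NO_ACTIVATION;
        return true;
    }

    public static void main(String[] args) {
        ReservedExitCheck check = new ReservedExitCheck();
        check.grantAccess(0x5000L);
        System.out.println(check.onFrameExit(0x4800L)); // callee of the annotated method returns
        System.out.println(check.onFrameExit(0x5200L)); // the annotated method itself returns
    }
}
```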
When an exception is thrown, the control flow doesn't go through the
regular method-exit code, so there's a possibility that the protection of
the reserved zone will not be restored correctly if the exception is
propagated above the annotated method. To prevent this situation, the
protection of the reserved zone is restored and the stack pointer value
stored in the C++ Thread
object is reset each time an exception starts
being propagated. In this scenario, the delayed StackOverflowError
is
not thrown. The rationale is that the thrown exception is more important
than the delayed StackOverflowError
because it indicates a cause and a
point where normal execution has been interrupted.
Throwing a StackOverflowError is the Java way to notify the application
that a thread reached its stack limits. However, exceptions and errors
are sometimes caught by Java code and the notification is lost or not
handled correctly, which can make the investigation of the issue really
hard. To ease troubleshooting of stack overflow errors in the presence of a
reserved stack area, the JVM provides two other notifications when access
to the reserved stack area is granted: One is a warning printed by the
JVM (on the same stream as all other JVM messages), and the second is a
JFR event. Note that even if the delayed StackOverflowError is not
thrown because another exception has been thrown in a critical section,
the JVM warning and the JFR event are generated and are available for
troubleshooting.
The reserved-stack feature is controlled by two JVM flags, one to configure the size of the reserved zone (all threads use the same size), and one to allow non-privileged code to use the feature. Setting the size of the reserved zone to zero disables the feature entirely. When disabled, interpreted code and compiled code do not perform the check on method exit.
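The text does not name the two flags. In HotSpot builds that include this feature they are, as far as we know, StackReservedPages (size of the reserved zone, in pages) and RestrictReservedStack (whether the annotation is honored only for privileged code); treat both names as assumptions here. The sketch below reads their current values through the HotSpotDiagnosticMXBean API and returns null when a flag or the bean is unavailable. The class name ReservedStackFlags is hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class ReservedStackFlags {
    // Returns the current value of a HotSpot flag, or null if the flag
    // (or the diagnostic bean) is not available in the running JVM.
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return null;
            }
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException | UnsupportedOperationException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Flag names are assumptions; see the paragraph above.
        System.out.println("StackReservedPages = " + flagValue("StackReservedPages"));
        System.out.println("RestrictReservedStack = " + flagValue("RestrictReservedStack"));
    }
}
```

Under the same naming assumption, a command line such as `java -XX:StackReservedPages=0 ...` would disable the feature as described above.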
Memory cost of this solution: For each thread the cost is the virtual memory of its reserved zone, as part of its stack space. The option to implement the reserved zone in a different memory area, as an alternate stack, has been considered. It would, however, significantly increase the complexity of any stack-walking code, so this option has been rejected.
Performance cost: Measurements done with JSR-166 tests on ReentrantLocks
didn't show any significant impact on performance on x86 platforms.
Performance
Here's how this solution could impact performance.
The most costly operation in this solution is the stack walking performed
when looking for an annotated method in the call stack. This operation is
performed only when the JVM has detected a potential stack
overflow. Without this fix, the JVM would throw a
StackOverflowError. So even if the operation is relatively costly, it
is better than the current behavior since it will prevent data
corruption. The most frequently-executed part of this solution is the
check performed when an annotated method exits, to check if the
protection of the reserved zone has to be re-enabled or not. The
performance-critical version of this check is in the compiler. The
current implementation adds the following code sequence to the compiled
code of an annotated method:
0x00007f98fcef5809: cmp rsp,QWORD PTR [r15+0x298]
0x00007f98fcef5810: jle 0x00007f98fcef583c
0x00007f98fcef5816: mov rdi,r15
0x00007f98fcef5819: test esp,0xf
0x00007f98fcef581f: je 0x00007f98fcef5837
0x00007f98fcef5825: sub rsp,0x8
0x00007f98fcef5829: call 0x00007f9910f62670 ; {runtime_call}
0x00007f98fcef582e: add rsp,0x8
0x00007f98fcef5832: jmp 0x00007f98fcef583c
0x00007f98fcef5837: call 0x00007f9910f62670 ; {runtime_call}
This code is for the x86_64 platform. In fast cases (no need to re-enable
protection of the reserved zone) it adds two instructions including a
small jump. The version for x86_32 is bigger because it doesn't have the
address of the Thread
object always available in a register. The
feature is also implemented for Solaris/SPARC.
Open issues
The default size of the reserved zone is still an open issue. This size
will depend on the longest critical section in JDK code that uses the
ReservedStackAccess annotation and will also depend on the platform
architecture. We could also consider different defaults depending upon
whether the JVM is running on a high-end server or in a
virtual-memory-constrained environment.
To mitigate the sizing issue a debug/troubleshooting feature has been
added. This feature is enabled by default on debug builds and available
as a diagnostic JVM option in product builds. When activated, it is run
when the JVM is about to throw a StackOverflowError: It walks the call
stack and if one or more methods annotated with the
ReservedStackAccess annotation are found, their names are printed with
a warning message on the JVM standard output. The name of the JVM flag
controlling this feature is PrintReservedStackAccessOnStackOverflow.
The default size of the reserved area is one page (4K) and experiments
have shown that this is sufficient to cover the critical sections of
java.util.concurrent
locks that have been annotated so far.
The reserved stack area is not fully supported on Windows
platforms. During the development of the feature on Windows, a bug was
found in the way the stack's special zones are controlled
(JDK-8067946). This
bug prevents the JVM from granting access to the reserved stack area. As
a consequence, when a stack overflow condition is detected on Windows,
and an annotated method is on the call stack, the JVM warning is printed,
the JFR event is fired, and a StackOverflowError
is thrown
immediately. There's no change in the behavior of the JVM for the
application. However, the JVM warning and the JFR event can help
troubleshooting, indicating that a potentially-harmful situation
occurred.
Alternatives
Several alternative approaches have been considered and some of them have been implemented and tested. Here's a list of those approaches.
Language-based solutions:
- try/catch/finally constructs: They don't solve anything, since there's no guarantee that the finally clause will not trigger a stack overflow too.
- New constructs such as:

      new CriticalSection( () -> {
          // do critical section code
      }).enter();

  This construct might require significant work in javac and the JVM, and its usage is likely to have high impact on performance compared to the reserved stack area, even when not run in a stack-overflow condition.
Code-transformation solutions:
- Avoid method calls (because stack overflow checks are performed at method invocation time) by forcing the JIT to inline all called methods: Inlining could require the loading and initialization of classes not used by the application, forcing inlining could conflict with compiler rules (code size, inlining depth), and inlining is not applicable to all code patterns (e.g., reflection).
- Code refactoring to avoid method calls at source level: Refactoring would require the modification of already-complex code (java.util.concurrent), and this kind of refactoring would break encapsulation.
Stack-based solutions:
- Extended stack banging: Bang the stack further before entering a critical section. This solution has a performance cost, even when not in a stack-overflow condition, and it is hard to maintain with nested critical sections.
- Extensible stacks: Build stacks from several non-contiguous memory chunks, adding a new chunk when a stack overflow is detected. This solution adds significant complexity to the JVM to manage non-contiguous stacks (including all the logic currently based on pointer comparisons in stack management); it could also require us to copy/move some section of the stack, and it puts more pressure on the memory-allocation backend due to fragmentation issues.
Testing
This change comes with a reliable unit test able to reproduce the
java.util.concurrent.locks.ReentrantLock corruption caused by a stack
overflow.
Dependencies
The reserved stack area relies on the "yellow pages" mechanism. This mechanism is currently partly broken on Windows (JDK-8067946), so the reserved stack area is not fully supported on this platform.