JEP draft: Enable execution of Java methods on GPU
Type | Feature |
Scope | Implementation |
Status | Closed / Withdrawn |
Component | hotspot / runtime |
Discussion | sumatra dash dev at openjdk dot java dot net |
Effort | XL |
Duration | L |
Reviewed by | Mikael Vidstedt |
Created | 2014/06/17 15:05 |
Updated | 2020/09/01 18:36 |
Issue | 8047074 |
This is the JEP draft for OpenJDK project Sumatra: https://wiki.openjdk.java.net/display/Sumatra/Main
Summary
Enable Java applications to take advantage of GPUs, using JDK 8 Stream API parallel streams and lambdas as the programming model.
Goals
Enable seamless offload of Java 8 parallel stream APIs to GPGPU when possible.
By seamless we mean:
- No syntactic changes to Java 8 parallel stream API
- Autodetection of hardware and software stack
- Heuristic to decide when offloading to the GPU yields performance gains
- Performance improvement for embarrassingly parallel workloads
- Code accuracy has the same (non-)guarantees as multi-core parallelism
- Code will always run with graceful fallback to normal CPU execution if offload fails
- Will not expose any additional security risks
- Offloaded code will maintain Java memory model correctness (find JSR)
- Where possible enable JVM languages to be offloaded
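The fallback goal in the list above can be pictured with a small sketch. This is not Sumatra's implementation: tryOffload is a hypothetical stand-in for the JVM-internal offload attempt, and here it always reports failure so that the CPU path runs. The point is only that the caller gets the same result either way.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

class FallbackSketch {
    // Hypothetical stand-in for a GPU offload attempt. A real implementation
    // would live inside the JVM; this one always reports failure.
    static boolean tryOffload(int length, IntConsumer body) {
        return false;
    }

    // Run body for every index in [0, length): use the offload path if it
    // succeeds, otherwise fall back to the ordinary CPU parallel stream.
    static void forEachIndex(int length, IntConsumer body) {
        if (!tryOffload(length, body)) {
            IntStream.range(0, length).parallel().forEach(body);
        }
    }

    public static void main(String[] args) {
        int[] out = new int[8];
        forEachIndex(out.length, i -> out[i] = i * 2);
        System.out.println(Arrays.toString(out)); // [0, 2, 4, 6, 8, 10, 12, 14]
    }
}
```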
Non Goals
- Not intended to offload all code or all of Java 8 stream API to GPU
- No plan to support auto vectorization and auto parallelization offload to GPU
- No support for devices that do not support shared virtual memory
- Initially not exposing all GPU capabilities to the Java language, for example group local memory
Metrics
An initial success metric would be to offload a parallel workload using the Stream API and observe better performance in that part of the application.
Motivation
Many Java workloads are growing steadily larger. GPUs offer computing power that is more efficient in both power and performance for some workloads, but earlier Java/GPU offload solutions such as Aparapi or JOCL are not integrated into the JDK and require their own programming model.
With Sumatra, we plan to offer seamless offload of some Stream API parallel lambda functions. The Stream API is designed to simplify parallel programming and Sumatra is a natural extension of the parallel capability already in the Stream API. Since Sumatra will be integrated into the JDK, it will simplify both development and deployment of offloadable applications compared to existing Java/GPU solutions.
Description
Our implementation uses the Heterogeneous System Architecture (HSA) supported in certain AMD APUs with a related software stack, and uses the Graal JVM, which includes an HSAIL back end. The JDK is modified such that for certain Stream API operations, the application's lambda function is extracted from the stream and compiled into an HSA kernel. The stream data structures are examined to extract the lambda arguments, which are then passed to the HSA kernel.
Current GPUs have hundreds to thousands of stream cores. Ideally, for parallelizable workloads all the stream cores can operate on the input data at the same time. We use the Stream API parallel() method as the indicator that it is safe to offload the following part of the stream since the programmer explicitly wrote it. For example, we have implemented offloadable versions of parallel().forEach() and some parallel().reduce() operations in the Stream API.
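As an illustration of the kind of reduction this paragraph refers to, the sketch below sums an array with parallel().reduce(). It uses only the standard JDK 8 Stream API; whether this exact form is among the reduce operations the prototype offloads is an assumption. The operator is associative with identity 0, so the result does not depend on how the work is split.

```java
import java.util.stream.IntStream;

class ReduceSketch {
    // Sum the array with a parallel reduction over its indices.
    static int sum(int[] data) {
        return IntStream.range(0, data.length)
                .parallel()
                .map(i -> data[i])
                .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(sum(data)); // 36
    }
}
```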
Work sent to a GPU is generally in the form of an array. The length of the input array is sometimes called the "range" in GPU terms, and it indicates how many "work items" are in the task. In the GPU programming model it is common for each stream core to use the work item id as an index into an array to get the data that stream core will process. In Sumatra, we find the source Java array in the stream, pass the array to the kernel, and use the work item id to retrieve the array element for that stream core. Each stream core processes one array element, which corresponds to one execution of the lambda for one value of the iteration variable in the Stream API.
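The mapping from work items to array elements can be mimicked on the CPU. In the sketch below, kernel and dispatch are hypothetical names, and a plain loop stands in for the GPU running one work item per stream core; no real HSA call is involved.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

class WorkItemSketch {
    // Per-work-item body: the work item id selects the array element to process.
    static void kernel(int workItemId, int[] in, int[] out, IntUnaryOperator lambda) {
        out[workItemId] = lambda.applyAsInt(in[workItemId]);
    }

    // CPU stand-in for a dispatch whose range equals the input array length.
    static int[] dispatch(int[] in, IntUnaryOperator lambda) {
        int[] out = new int[in.length];
        int range = in.length; // number of work items
        for (int id = 0; id < range; id++) {
            kernel(id, in, out, lambda);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] squares = dispatch(new int[] {1, 2, 3, 4}, x -> x * x);
        System.out.println(Arrays.toString(squares)); // [1, 4, 9, 16]
    }
}
```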
Note that with HSA the GPU operates on main memory and has direct access to the Java heap, so there is no copying of data. Thus we can operate on Java objects and are not limited to arrays of primitive types.
Garbage collection cannot occur while a kernel is executing. Our prototype executes the kernels from inside the JVM and does not use JNI, so no extra object pinning is required.
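Because the kernel can read the Java heap directly, a lambda over an array of objects is in principle as much a candidate for offload as one over an int[]. The following is a minimal sketch using a hypothetical Body class (not taken from the Sumatra code base); each iteration updates only its own object, so the iterations stay independent.

```java
import java.util.Arrays;

class ObjectArraySketch {
    // Hypothetical element type, used only for illustration.
    static final class Body {
        double position;
        final double velocity;

        Body(double position, double velocity) {
            this.position = position;
            this.velocity = velocity;
        }
    }

    // Advance every body by one time step; each lambda execution
    // touches exactly one object.
    static void step(Body[] bodies, double dt) {
        Arrays.stream(bodies).parallel().forEach(b -> b.position += b.velocity * dt);
    }

    public static void main(String[] args) {
        Body[] bodies = { new Body(0.0, 2.0), new Body(1.0, 4.0) };
        step(bodies, 0.5);
        System.out.println(bodies[0].position + ", " + bodies[1].position); // 1.0, 3.0
    }
}
```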
We support deoptimization of HSA kernels back to CPU execution, and handle safepoints by deoptimizing back to the CPU. In this way the CPU execution of the application is not blocked or delayed by execution of a kernel.
Here is a simple use of the parallel Stream API showing examples of what can and cannot be offloaded:
    package simple;

    import java.util.stream.IntStream;

    public class Simple {

        public static void main(String[] args) {
            final int length = 8;
            int[] ina = new int[length];
            int[] inb = new int[length];
            int[] out = new int[length];

            // Initialize the input arrays - this is offloadable.
            // Each iteration of this lambda is independent and
            // always produces the same answer whether executed single-threaded,
            // by CPU thread pool or GPU kernel.
            IntStream.range(0, length).parallel().forEach(p -> {
                ina[p] = 1;
                inb[p] = 2;
            });

            // Sum each pair of elements into out[] - this is offloadable.
            // Meets the same criteria as the above example.
            IntStream.range(0, length).parallel().forEach(p -> {
                out[p] = ina[p] + inb[p];
            });

            // Print results - this is not offloadable since it is calling
            // native code etc. Also it is not really parallelizable even
            // on the CPU since it is printing messages that might become garbled.
            IntStream.range(0, length).forEach(p -> {
                System.out.println(out[p] + ", " + ina[p] + ", " + inb[p]);
            });
        }
    }
Alternatives
There are several open source packages available to offload some Java methods to GPUs with OpenCL or CUDA. They generally require their own programming model, their own jars on the classpath, and native libraries.
- Aparapi
- Rootbeer
- JCuda / JOCL
- ScalaCL
Testing
- Pass all JCK tests
- Develop new targeted tests for compilation failure and fallback to normal Java execution
- Develop new targeted tests for deoptimization, safepoints and allocation from kernels
Risks and Assumptions
- Other offload solutions besides HSA require copying data over a bus to the offload device. Thus the offload benefit/penalty will be completely different from an HSA based solution.
- The floating point standard used on GPUs is different from that used in Java.
Dependencies
- Our version depends on the HSA runtime. Other offload platforms will have their own software layer.
- For HSA, there will be modifications to the Linux kernel, which should be generally available in future distributions
Impact
- JVM modifications similar to what we have implemented in the Graal JVM
- Possibly JDK modifications to direct the workload to the GPU, unless this can be done completely in the JVM
- Requires a new compiler/backend to produce the GPU kernels from the lambda method, similar to what we have implemented in the Graal JVM