JEP draft: Enable execution of Java methods on GPU

Type	Feature
Scope	Implementation
Status	Closed / Withdrawn
Component	hotspot / runtime
Discussion	sumatra dash dev at openjdk dot java dot net
Effort	XL
Duration	L
Reviewed by	Mikael Vidstedt
Created	2014/06/17 15:05
Updated	2020/09/01 18:36
Issue	8047074

This is the JEP draft for OpenJDK project Sumatra: https://wiki.openjdk.java.net/display/Sumatra/Main

Summary

Enable Java applications to take advantage of GPUs, using JDK 8 Stream API parallel streams and lambdas as the programming model.

Goals

Enable seamless offload of Java 8 parallel stream APIs to GPGPU when possible.

By seamless we mean:

No syntactic changes to Java 8 parallel stream API
Autodetection of hardware and software stack
Heuristic to decide when offloading to the GPU gives performance gains
Performance improvement for embarrassingly parallel workloads
Code accuracy has the same (non-)guarantees as multi-core parallelism
Code will always run with graceful fallback to normal CPU execution if offload fails
Will not expose any additional security risks
Offloaded code will maintain Java memory model correctness (find JSR)
Where possible enable JVM languages to be offloaded
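
The graceful-fallback goal can be pictured with a small sketch. Everything in it is hypothetical: OffloadSketch, Device and runWithFallback are illustrative names, not part of the JDK, Graal or Sumatra. The sketch assumes that a failed offload signals failure by throwing, and that it fails before any work item has executed, so rerunning the whole range on the CPU is safe.

```java
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

final class OffloadSketch {

    /** Hypothetical stand-in for a device launch; may throw if offload is unavailable. */
    interface Device {
        void launch(int range, IntConsumer body);
    }

    /**
     * Tries the device first and falls back to the normal CPU parallel stream.
     * Returns true if the device ran the work, false if the CPU fallback did.
     * Assumption: a failed launch has not executed any work item.
     */
    static boolean runWithFallback(Device device, int range, IntConsumer body) {
        try {
            device.launch(range, body);
            return true;
        } catch (RuntimeException offloadFailure) {
            // Normal Java execution on the CPU thread pool.
            IntStream.range(0, range).parallel().forEach(body);
            return false;
        }
    }

    public static void main(String[] args) {
        int[] out = new int[8];
        // A device that always fails, to exercise the fallback path.
        Device broken = (range, body) -> {
            throw new UnsupportedOperationException("no GPU available");
        };
        boolean offloaded = runWithFallback(broken, out.length, i -> out[i] = i * 2);
        System.out.println("offloaded=" + offloaded + ", out[3]=" + out[3]);
    }
}
```

The caller sees the same results either way; only the returned flag reveals which path ran.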

Non Goals

Not intended to offload all code or all of Java 8 stream API to GPU
No plan to support auto vectorization and auto parallelization offload to GPU
No support for devices that do not support shared virtual memory
Initially not exposing all GPU capabilities to Java language, for example group local memory

Metrics

An initial success metric would be to offload a parallel workload using Stream API and observe better performance in that part of the application.

Motivation

Many Java workloads are becoming larger and larger. GPUs offer computing power that is more efficient in both power and performance for some workloads, but earlier Java/GPU offload solutions such as Aparapi or JOCL are not integrated into the JDK and require their own programming model.

With Sumatra, we plan to offer seamless offload of some Stream API parallel lambda functions. The Stream API is designed to simplify parallel programming and Sumatra is a natural extension of the parallel capability already in the Stream API. Since Sumatra will be integrated into the JDK, it will simplify both development and deployment of offloadable applications compared to existing Java/GPU solutions.

Description

Our implementation uses the Heterogeneous System Architecture (HSA) supported in certain AMD APUs with a related software stack, and uses the Graal JVM, which includes an HSAIL back end. The JDK is modified such that for certain Stream API operations, the application's lambda function is extracted from the stream and compiled into an HSA kernel. The stream data structures are examined to extract the lambda arguments, which are then passed to the HSA kernel.

Current GPUs have hundreds to thousands of stream cores. Ideally, for parallelizable workloads all the stream cores can operate on the input data at the same time. We use the Stream API parallel() method as the indicator that it is safe to offload the following part of the stream since the programmer explicitly wrote it. For example, we have implemented offloadable versions of parallel().forEach() and some parallel().reduce() operations in the Stream API.
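
As a hedged illustration of the reduce() shape mentioned above, here is a dot product written as a parallel stream reduction. Whether this exact form is among the reduce() variants the prototype offloads is an assumption, and the class and method names are illustrative only. On a stock JDK it simply runs on the CPU thread pool.

```java
import java.util.stream.IntStream;

final class DotProduct {

    /** Integer dot product as map + reduce; int addition is associative, so grouping does not matter. */
    static int dot(int[] a, int[] b) {
        return IntStream.range(0, a.length)
                .parallel()
                .map(i -> a[i] * b[i])
                .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {5, 6, 7, 8};
        System.out.println(dot(a, b)); // 5 + 12 + 21 + 32 = 70
    }
}
```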

Work sent to a GPU is generally in the form of an array. The length of the input array is sometimes called the "range" in GPU terms, and it determines how many "work items" are in the task. In the GPU programming model it is common for each stream core to use the work item id as an index into an array to get the data that stream core will process. In Sumatra, we find the source Java array in the stream, pass the array to the kernel, and use the work item id to retrieve the array element for that stream core. Each stream core processes one array element, which corresponds to one execution of the lambda for one value of the iteration variable in the Stream API.
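
The work-item model can be mimicked on the CPU to make the mapping concrete. The sketch below is plain Java, not HSAIL and not what the Graal back end emits; kernelBody and dispatch are hypothetical names. A sequential loop stands in for the stream cores, with each invocation receiving one work item id.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

final class WorkItemModel {

    /** Body executed by one simulated stream core: the work item id is the array index. */
    static void kernelBody(int workItemId, int[] data, IntUnaryOperator lambda) {
        data[workItemId] = lambda.applyAsInt(data[workItemId]);
    }

    /** Simulated dispatch: the range equals the array length, one work item per element. */
    static void dispatch(int[] data, IntUnaryOperator lambda) {
        int range = data.length;
        for (int id = 0; id < range; id++) { // on a GPU these would run concurrently
            kernelBody(id, data, lambda);
        }
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        dispatch(data, x -> x * x);
        System.out.println(Arrays.toString(data)); // [1, 4, 9, 16]
    }
}
```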

Note that with HSA the GPU operates on main memory and has direct access to the Java heap, so there is no copying of data. Thus we can operate on Java objects and are not limited to arrays of primitive types.
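
To illustrate operating on objects rather than primitive arrays, the following sketch updates fields of heap objects from a parallel forEach(). The Body class is invented for this example, and whether this particular lambda would be accepted for offload by the prototype is an assumption. Each iteration touches only its own object, so the iterations stay independent.

```java
import java.util.stream.IntStream;

final class ObjectOffload {

    /** Hypothetical heap object; each instance is touched by exactly one iteration. */
    static final class Body {
        float x;
        float vx;

        Body(float x, float vx) {
            this.x = x;
            this.vx = vx;
        }
    }

    /** Advances every body by one time step; iterations share no mutable state. */
    static void step(Body[] bodies, float dt) {
        IntStream.range(0, bodies.length).parallel().forEach(i -> {
            Body b = bodies[i];
            b.x += b.vx * dt;
        });
    }

    public static void main(String[] args) {
        Body[] bodies = { new Body(0f, 1f), new Body(10f, -2f) };
        step(bodies, 0.5f);
        System.out.println(bodies[0].x + ", " + bodies[1].x); // 0.5, 9.0
    }
}
```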

Garbage collection cannot occur while a kernel is executing. Our prototype executes the kernels from inside the JVM and does not use JNI, so no extra object pinning is required.

We support deoptimization of HSA kernels back to CPU execution, and handle safepoints by deoptimizing back to the CPU. In this way the CPU execution of the application is not blocked or delayed by execution of a kernel.

Here is a simple use of the parallel Stream API showing examples of what can and cannot be offloaded:

package simple;

import java.util.stream.IntStream;

public class Simple {

    public static void main(String[] args) {
        final int length = 8;
        int[] ina = new int[length];
        int[] inb = new int[length];
        int[] out = new int[length];

        // Initialize the input arrays - this is offloadable.
        // Each iteration of this lambda is independent and
        // always produces the same answer whether executed single-threaded,
        // by CPU thread pool or GPU kernel.
        IntStream.range(0, length).parallel().forEach(p -> {
            ina[p] = 1;
            inb[p] = 2;
        });

        // Sum each pair of elements into out[] - this is offloadable.
        // Meets the same criteria as the above example.
        IntStream.range(0, length).parallel().forEach(p -> {
            out[p] = ina[p] + inb[p];
        });

        // Print results - this is not offloadable since it is calling
        // native code etc. Also it is not really parallelizable even
        // on the CPU since it is printing messages that might become garbled.
        IntStream.range(0, length).forEach(p -> {
            System.out.println(out[p] + ", " + ina[p] + ", " + inb[p]);
        });
    }

}

Alternatives

There are several open source packages available to offload some Java methods to GPUs with OpenCL or CUDA. They generally require their own programming model, their own jars in the classpath and native libraries.

Aparapi
Rootbeer
JCuda / JOCL
ScalaCL

Testing

Pass all JCK tests
Develop new targeted tests for compilation failure and fallback to normal Java execution
Develop new targeted tests for deoptimization, safepoints and allocation from kernels

Risks and Assumptions

Other offload solutions besides HSA require copying data over a bus to the offload device. Thus the offload benefit/penalty will be completely different from an HSA based solution.
The floating point standard used on GPUs is different from that used in Java.
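
Separately from any difference between floating-point standards, the order in which a parallel reduction combines values can change a floating-point result, on the CPU as well as on a GPU. The sketch below shows only that ordering effect in plain Java; it does not model GPU arithmetic, and the class and method names are illustrative.

```java
final class FloatOrder {

    /** Sums left to right: (a + b) + c. */
    static double leftToRight(double a, double b, double c) {
        return (a + b) + c;
    }

    /** Sums right to left: a + (b + c). */
    static double rightToLeft(double a, double b, double c) {
        return a + (b + c);
    }

    public static void main(String[] args) {
        double a = 1e16, b = -1e16, c = 1.0;
        System.out.println(leftToRight(a, b, c)); // 1.0
        // b + c rounds back to -1e16 because doubles near 1e16 are spaced 2 apart.
        System.out.println(rightToLeft(a, b, c)); // 0.0
    }
}
```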

Dependencies

Our version depends on the HSA runtime. Other offload platforms will have their own software layer.
For HSA, there will be modifications to the Linux kernel, which should be generally available in future distributions.

Impact

JVM modifications similar to what we have implemented in the Graal JVM
Possibly JDK modifications to direct the workload to the GPU, unless this can be done completely in the JVM
Requires a new compiler back end to produce the GPU kernels from the lambda method, similar to what we have implemented in the Graal JVM