JEP 107: Bulk Data Operations for Collections

Owner: Mike Duigou
Type: Feature
Scope: SE
Status: Closed / Delivered
Release: 8
Component: core-libs
JSR: 335
Discussion: lambda dash dev at openjdk dot java dot net
Effort: L
Duration: XL
Depends: JEP 109: Enhance Core Libraries with Lambda
         JEP 126: Lambda Expressions & Virtual Extension Methods
Endorsed by: Brian Goetz
Created: 2011/09/23 20:00
Updated: 2024/04/22 16:06
Issue: 8046097

Summary

Add functionality to the Java Collections Framework for bulk operations on data. This is commonly referred to as "filter/map/reduce for Java." The bulk data operations include both serial (on the calling thread) and parallel (using many threads) versions of the operations. Operations on data are generally expressed as lambda functions.

Goals

Provide new features for bulk data processing that utilize lambda functions, including parallel operations.

Non-Goals

Convert existing usages to parallel operation.

Motivation

FlumeJava, used internally by Google, and PLINQ, offered by Microsoft, are the most directly similar offerings. LINQ and PLINQ in particular are seen as extremely valuable by .NET developers and are a subject of much envy among Java developers.

The primary benefits are for developers currently building single-threaded business process applications. Being able to take advantage of concurrency with minimal changes to their applications is expected to be a huge benefit.

Description

The serial implementation provides a bridge from existing collections bulk-data operations to parallel operation that does not change the threading model of the application.
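As an illustration of serial "filter/map/reduce", here is a minimal sketch. It assumes the `java.util.stream` API as it eventually shipped in Java 8, which this text does not itself name; the class and method names are hypothetical. Everything runs on the calling thread, so the application's threading model is unchanged.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical example class; not part of the JEP or the JDK.
final class SerialBulkSketch {

    // Sums the squares of the even numbers, entirely on the calling thread.
    static int sumOfEvenSquares(List<Integer> numbers) {
        return numbers.stream()                  // a serial stream by default
                      .filter(n -> n % 2 == 0)   // filter: keep even values
                      .map(n -> n * n)           // map: square each value
                      .reduce(0, Integer::sum);  // reduce: add them up
    }

    public static void main(String[] args) {
        // 2*2 + 4*4 + 6*6 = 56
        System.out.println(sumOfEvenSquares(Arrays.asList(1, 2, 3, 4, 5, 6)));
    }
}
```

Because the pipeline is expressed as lambda functions rather than an explicit loop, the same operations could later be run in parallel without restructuring the code.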

The parallel implementation is the central element of this feature. It provides the opportunity to accelerate operations on large amounts of data by dividing the task among multiple threads (processors). The parallel implementation builds upon the java.util.concurrent Fork/Join framework introduced in Java 7.
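The divide-and-conquer idea behind Fork/Join can be sketched directly with `RecursiveTask` and `ForkJoinPool`. This is only an illustration of the underlying mechanism, not the implementation this JEP delivered; the class names and the `THRESHOLD` value are assumptions made for the example, and `ForkJoinPool.commonPool()` requires Java 8.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example class; not part of the JEP or the JDK.
final class ForkJoinSumSketch {

    // Assumed cutoff: ranges at or below this size are summed sequentially.
    static final int THRESHOLD = 10_000;

    static final class SumTask extends RecursiveTask<Long> {
        private final long[] data;
        private final int from; // inclusive
        private final int to;   // exclusive

        SumTask(long[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                // Small enough: sum this range on the current thread.
                long sum = 0;
                for (int i = from; i < to; i++) {
                    sum += data[i];
                }
                return sum;
            }
            // Otherwise split the range in half.
            int mid = from + (to - from) / 2;
            SumTask left = new SumTask(data, from, mid);
            SumTask right = new SumTask(data, mid, to);
            left.fork();                      // schedule the left half asynchronously
            long rightSum = right.compute();  // work on the right half in this thread
            return left.join() + rightSum;    // wait for the left half and combine
        }
    }

    static long parallelSum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1; // values 1..1,000,000
        }
        System.out.println(parallelSum(data)); // 500000500000
    }
}
```

The threshold trades scheduling overhead against available parallelism; a suitable value depends on the workload and hardware, so it would need to be measured rather than assumed.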

For both the serial and parallel implementations, an "eager" mode and a "lazy" mode are possible. In eager mode, operations are performed directly on the data at the time the operation function is invoked. In lazy mode, the operations on the data are deferred until the final result is requested. Lazy mode gives the implementation more optimization opportunities, because it can reorganize the data and the operations to be performed.
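The difference between the two modes can be observed with APIs that shipped in Java 8. This is a sketch under assumptions: this text names neither API, and the class and method names are hypothetical. `Collection.removeIf` acts eagerly, while a `java.util.stream` pipeline does no work until a terminal operation requests the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical example class; not part of the JEP or the JDK.
final class LazyVersusEagerSketch {

    // Eager: removeIf has already modified the copy by the time it returns.
    static List<Integer> eagerKeepEven(List<Integer> source) {
        List<Integer> copy = new ArrayList<>(source);
        copy.removeIf(n -> n % 2 != 0);
        return copy;
    }

    // Lazy: returns how often the predicate had run {before, after}
    // the terminal operation was invoked.
    static int[] lazyPredicateCalls(List<Integer> source) {
        AtomicInteger calls = new AtomicInteger();
        Stream<Integer> pipeline = source.stream()
                .filter(n -> {
                    calls.incrementAndGet(); // count each predicate evaluation
                    return n % 2 == 0;
                });
        int before = calls.get();                // nothing has been evaluated yet
        pipeline.collect(Collectors.toList());   // terminal operation triggers the work
        int after = calls.get();
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(eagerKeepEven(numbers));                       // [2, 4, 6]
        System.out.println(Arrays.toString(lazyPredicateCalls(numbers))); // [0, 6]
    }
}
```

Deferring the work is what leaves room for optimization: until the final result is requested, the implementation is free to decide how the pending operations should be combined and executed.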

Testing

Benchmarking and performance regression testing are going to be critical to delivering a quality final product.

This work will require significant hardware resources to test fully, i.e., dedicated systems with eight or more cores for all of the primary supported platforms.

Dependences

Impact