JEP draft: Predictable regex performance

Authormartin
OwnerMartin Buchholz
TypeFeature
ScopeSE
StatusDraft
Componentcore-libs / java.util.regex
EffortL
Created2021/01/30 23:20
Updated2022/09/01 10:18
Issue8260688

Problem

There is an increasing focus on the performance of regular expression matching, and especially the predictability.

The current (jdk16) implementation is a NFA-based backtracking engine. It mitigates but does not eliminate O(2^N) performance. StackOverflowError is a risk.

Goals

Make it possible to build a ReDoS-safe "regexp search engine" that can safely accept user-provided regexes with O(N) runtime performance, O(1) stack space usage, and as close as possible to O(M) space usage for the compiled regex. It's OK for some regex features like backrefs to be dropped. Users can ask for safety or power; no one knows how to provide both. Of course, we maintain compatibility - the legacy API must remain unsafe.

Possible Approaches

The re2 library strongly advocates building a regex engine using DFA, and this guarantees O(N) runtime matching performance. But a DFA engine cannot handle some popular modern regex engine features, notably back-references. The re2 library simply chooses not to support such features, and that is a reasonable restriction when regexes are user-supplied. One plausible evolution of the API is to add a flag to Pattern.compile(String, int) that would specifically request O(N) performance (or fail). Another plausible evolution of the API is to allow alternate implementations of Pattern, and ensure that most Java SE APIs that take a String regex also take a Pattern as input.

The re2 library has been ported from C++ to other languages:

Some library implementations are a thin wrapper around native re2:

Wrappers around native re2 work better in languages other than java. There is more cultural acceptance of such wrappers, and the refcounting GC common in such languages is better at promptly releasing resources. No one wants to close() their Patterns when they are done with them, to release the associated native memory!

GNU grep, which has invested heavily in performance, implements both DFA and NFA engines, falling back to NFA if the regex cannot be compiled to DFA.

Openjdk addressed the problem of exponential O(2^N) performance in JDK-6328855 by memoizing previous failed match attempts, and this mitigates most exponential performance problems caused by sloppy regexes, but:

Perl addressed the problem by adding Backtracking Control Verbs which allow a highly skilled regex programmer to produce an efficient program, but at the cost of lock-in to the backtracking execution model and not solving the general problem.

Modern Java includes invokedynamic and the ability to build highly performant languages on the JVM. Those techniques can be applied to the regular expression language as well.

An optimized DFA or NFA engine may benefit from the value types in the valhalla project.

Even though the very first implementation of regular expressions by Ken Thompson used efficient DFA, most current implementations use NFA backtracking. It may be best to re-engineer all such implementations to return to DFA (where possible), but it's an enormous engineering effort.

There are resources other than runtime cpu to be considered:

The very ambitious sregex project (also based on re2) claims

No pathological regexes exist for this regex engine because it does not use a backtracking algorithm at all.

but it appears to be unfinished and inactive.

Building a great engine is hard!