JEP 111: Additional Unicode Constructs for Regular Expressions

Owner	Xueming Shen
Type	Feature
Scope	SE
Status	Candidate
Component	core-libs
Discussion	core dash libs dash dev at openjdk dot java dot net
Effort	S
Duration	S
Endorsed by	Brian Goetz
Created	2011/07/26 20:00
Updated	2016/01/18 04:55
Issue	8046101

Summary

Adopt further regular-expression constructs from Unicode TR#18.

Motivation

The primary motivation is to raise the level of Unicode support so that developers can write sophisticated Unicode-enabled regular expressions on the Java platform. This is important to keep the Java platform competitive with other languages that already offer more complete support for Unicode regular expressions.

Description

Java regular expressions are derived from Perl regular expressions and are intended to give Java developers most of the Perl-style regular-expression features. Perl regular expressions have evolved rapidly in the past few years to follow Unicode Standard TR#18, Unicode Regular Expressions. Java regular expressions claim conformance with Level 1 of the same TR#18, which is the "lowest" level of conformance, plus RL2.1 Canonical Equivalents. Given that the Unicode Standard has been widely accepted as the de facto standard for development platforms, and that Java uses Unicode as its internal encoding scheme, higher-level Unicode support is desirable for developers working on Unicode-aware applications. The following new constructs and features are proposed to provide better Unicode support in Java regular expressions:

\N{...} -- Unicode Name Properties
\X -- Extended Grapheme Clusters
Fix the broken Canonical Equivalent support
\R -- Unicode line-break sequence, as suggested in TR#18 Line Boundaries
\g{...} -- Perl-style construct for referring to named and numbered capturing groups
More complete Unicode properties, as in \p{IsXXXX}
\h \H \v \V -- Horizontal/vertical whitespace
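
As a rough illustration of how several of the constructs listed above might be used, the following sketch assumes a JDK whose java.util.regex.Pattern already accepts \R, \X, \N{...}, \h and \p{IsXXXX}, which is believed to be the case from JDK 9 onward. The class name UnicodeRegexSketch and its method names are hypothetical and are not part of this proposal. The \g{...} construct and the canonical-equivalence fix are not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class UnicodeRegexSketch {

    // \R matches any Unicode line-break sequence and treats CR LF as one unit.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    // \X matches one extended grapheme cluster, i.e. a user-perceived character.
    static List<String> graphemes(String text) {
        List<String> clusters = new ArrayList<>();
        Matcher m = Pattern.compile("\\X").matcher(text);
        while (m.find()) {
            clusters.add(m.group());
        }
        return clusters;
    }

    // \N{...} matches the single character that has the given Unicode name.
    static boolean containsEAcute(String text) {
        return Pattern.compile("\\N{LATIN SMALL LETTER E WITH ACUTE}")
                .matcher(text)
                .find();
    }

    // \h matches a horizontal whitespace character (space, tab, no-break space, ...).
    static String collapseHorizontalSpace(String text) {
        return text.replaceAll("\\h+", " ");
    }

    // \p{IsAlphabetic} tests the Unicode binary property Alphabetic.
    static boolean isAlphabeticWord(String text) {
        return Pattern.matches("\\p{IsAlphabetic}+", text);
    }

    public static void main(String[] args) {
        System.out.println(splitLines("a\r\nb\nc\rd").length);
        System.out.println(graphemes("e\u0301x").size());
        System.out.println(containsEAcute("caf\u00e9"));
        System.out.println(collapseHorizontalSpace("a\t \u00A0b"));
        System.out.println(isAlphabeticWord("\u00e9t\u00e9"));
    }
}
```

Each helper exercises exactly one construct, so a failure on a given JDK points directly at the construct that is missing or behaves differently there.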

Testing

All of the new regex constructs listed here will be covered by new unit tests, which will be run by the existing test framework.