JEP 111: Additional Unicode Constructs for Regular Expressions
Owner | Xueming Shen |
Type | Feature |
Scope | SE |
Status | Candidate |
Component | core-libs |
Discussion | core dash libs dash dev at openjdk dot java dot net |
Effort | S |
Duration | S |
Endorsed by | Brian Goetz |
Created | 2011/07/26 20:00 |
Updated | 2016/01/18 04:55 |
Issue | 8046101 |
Summary
Adopt further regular-expression constructs from Unicode TR#18.
Motivation
The primary motivation is to enrich Unicode support so that developers can write sophisticated Unicode-enabled regular expressions on the Java platform. This is important to keep the Java Platform competitive with other languages that already offer more complete support for Unicode regular expressions.
Description
Java Regular Expressions are derived from Perl Regular Expressions and are intended to give Java developers most of the Perl-style regular-expression features. Perl Regular Expressions have evolved rapidly in the past couple of years to follow Unicode Standard TR#18 Unicode Regular Expressions. Java Regular Expressions claim conformance with Level 1 of the same standard, which is the "lowest" level of conformance, plus RL2.1 Canonical Equivalents. Given that the Unicode Standard has been widely accepted as the de facto standard for development platforms, and that Java uses Unicode as its internal encoding scheme, higher-level Unicode support is desirable for developers working on Unicode-aware applications. The following new constructs and features are proposed to provide better Unicode support in Java Regular Expressions:
- \N{...} -- Unicode Name Properties
- \X -- Extended Grapheme Clusters
- Fix the broken Canonical Equivalent support
- \R -- Unicode line-break sequence, as suggested in TR#18 Line Boundaries
- \g{...} -- Perl-style construct for named capturing groups and capturing groups
- More complete Unicode properties, as in \p{IsXXXX}
- \h \H \v \V -- Horizontal/vertical whitespace
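To make a few of the proposed constructs concrete, the following is a minimal sketch of how \R, \X, \h and \N{...} might be used through `java.util.regex.Pattern`. It assumes a JDK in which these constructs are available (JDK 9 or later is assumed here). The class name `UnicodeRegexSketch` and its method names are hypothetical and exist only for illustration. The \g{...} construct is left out because this sketch does not assume it is available.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not part of the JEP.
// Assumption: a JDK whose Pattern accepts \R, \X, \h and \N{...}.
final class UnicodeRegexSketch {

    // \R matches any Unicode line-break sequence and treats CR LF as one break.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    // \X matches one extended grapheme cluster, i.e. a user-perceived character
    // such as a base letter followed by its combining marks.
    static List<String> graphemes(String text) {
        List<String> clusters = new ArrayList<>();
        Matcher m = Pattern.compile("\\X").matcher(text);
        while (m.find()) {
            clusters.add(m.group());
        }
        return clusters;
    }

    // \h matches horizontal whitespace, which includes tab and U+00A0
    // (NO-BREAK SPACE) in addition to the ordinary space.
    static String collapseHorizontalSpace(String text) {
        return text.replaceAll("\\h+", " ");
    }

    // \N{...} matches a single character identified by its Unicode name.
    static boolean containsEuroSign(String text) {
        return Pattern.compile("\\N{EURO SIGN}").matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(splitLines("a\r\nb\nc").length);
        System.out.println(graphemes("e\u0301x").size());
        System.out.println(collapseHorizontalSpace("a \t\u00A0b"));
        System.out.println(containsEuroSign("price: \u20AC5"));
    }
}
```

Under these assumptions, `splitLines("a\r\nb\nc")` yields three lines because the CR LF pair counts as a single break, and `graphemes` keeps a letter together with a combining accent that follows it rather than splitting the pair into two code points.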
Testing
All the features (new regex constructs) listed here will be covered by new unit tests, which will be run by the existing test framework.