JEP 252: Use CLDR Locale Data by Default

AuthorNaoto Sato & Alex Buckley
OwnerNaoto Sato
TypeFeature
ScopeJDK
StatusClosed / Delivered
Release9
Componentcore-libs / java.util:i18n
Discussioni18n dash dev at openjdk dot java dot net
EffortM
DurationM
Reviewed byAlan Bateman, Brian Goetz, Mark Reinhold
Endorsed byBrian Goetz
Created2014/05/20 17:14
Updated2024/06/25 15:14
Issue8043554

Summary

Use the locale data in the Common Locale Data Repository (CLDR) to format dates, times, currencies, languages, countries, and time zones in the standard Java APIs. CLDR, which is maintained by the Unicode Consortium, provides locale data of higher quality than the legacy data in JDK 8. Locale-sensitive applications may be affected by the switch to CLDR locale data, and, in the future, by revisions of the CLDR locale data.

History

The original text of this JEP, written in 2014 for JDK 9, did not provide adequate guidance to developers. We rewrote it in 2024 to add such guidance and to include information about JDK 21 and JDK 23.

Goals

Non-Goals

Motivation

The Java Platform offers APIs that help to localize Java programs, i.e., adapt them to different languages and countries. The APIs, principally in the java.text and java.util packages, are locale-sensitive: They depend upon a Locale that tailors an operation to a specific language, country, calendar system, and other cultural norms. Each locale is associated with locale data that describes how dates, times, currencies, languages, countries, and time zones are presented. In the example below, words such as "Thursday" and "March", as well as the pattern "EEEE, MMMM, d, y", come from the locale data for Locale.US, while "木曜日" and the pattern "y年M月d日EEEE" come from the locale data for Locale.JAPAN:

jshell> Date today = new Date();
today ==> Thu Mar 14 09:49:43 PDT 2024

jshell> import java.text.*;

jshell> DateFormat.getDateInstance(DateFormat.FULL, Locale.US).format(today);
$2 ==> "Thursday, March 14, 2024"

jshell> DateFormat.getDateInstance(DateFormat.FULL, Locale.JAPAN).format(today);
$3 ==> "2024年3月14日木曜日"

JDK 8 contains locale data for about 160 locales, originally created in the 1990s by Sun Microsystems and its industry partners. While cutting edge for its time, this locale data has various problems:

The Unicode Consortium created the Common Locale Data Repository (CLDR) in 2003 to address quality and extensibility issues with locale data. CLDR contains locale data for over 500 locales. It is released every six months, to stay in sync with regional and cultural developments, and changes to its content are managed through a formal public process. Locale data is described with a domain-specific markup language, LDML, which ensures that CLDR is well structured and extensible. As a result, CLDR has been adopted by all major operating systems.

JDK 8 was the first Java release to contain CLDR locale data as well as the legacy locale data from the 1990s, though it used the legacy data by default. Given the high quality and widespread adoption of CLDR, the entire Java ecosystem would benefit if JDK 9 switched to using CLDR locale data by default. It is neither realistic nor advantageous for the JDK to keep using its own legacy locale data when CLDR exists as a superior alternative. JDK 9 and later releases will continue to contain the legacy locale data in order to ease migration for localized applications.

Description

In JDK 8 and later, there are two built-in providers of locale data: JRE, which provides the legacy locale data from the 1990s, and CLDR, which provides the CLDR locale data from the Unicode Consortium.

JDK 8, by default, selects only the JRE provider at run time, so locale-sensitive Java APIs use only legacy locale data.

JDK 9, by default, will give priority to the CLDR provider at run time, so locale-sensitive Java APIs will use CLDR locale data in preference to legacy locale data.

The use of CLDR locale data is an implementation characteristic of JDK 9; it is not mandated by the Java Platform Specification. Other implementations of the Platform need not use CLDR locale data by default, and they need not even provide it as an option. This approach is in line with how the Java Platform works in other areas of internationalization, such as the handling of time zones (see below).

Regardless of provider, the locale data for the US country locale, the ENGLISH language locale, and the technical root locale is contained in the java.base module; all other locale data is contained in the jdk.localedata module. Developers who use the jlink tool to build custom run-time images can save space by selecting which locales to include in a run-time image.

Where locale data is used

Applications represent dates, times, currencies, languages, countries, and time zones with objects of the following classes:

Locale-sensitive APIs convert these objects to and from strings, so that a date, time, currency, language, country, or time zone can be denoted in plain text. The APIs use locale data in both directions: to convert an object to a string (formatting), and to convert a string to an object (parsing). The default behavior of these APIs will change after the switch to CLDR locale data.

The Calendar, Currency, and TimeZone classes in the java.util package are inherently locale-sensitive because they are instantiated with reference to a specific locale. They provide formatting and parsing methods which use the locale data for that specific locale. In contrast, java.util.Date and the six classes in the java.time package are not locale-sensitive because they are not instantiated with reference to a specific locale. Companion classes provide their locale-sensitive API, e.g., the java.text.DateFormat class is responsible for formatting and parsing Date objects. Some general-purpose I/O classes also provide locale-sensitive APIs for formatting. Here are the companion and I/O classes that provide locale-sensitive APIs:

Some APIs that are critical to localization are not locale-sensitive and thus are unaffected by the switch to CLDR locale data:

How applications are affected by CLDR locale data

Applications that expect locale-sensitive APIs to use legacy locale data will see different results when formatting, and possibly exceptions when parsing, when the APIs use CLDR locale data in JDK 9.

It is impractical to list all the differences between the legacy and CLDR locale data, but here are seven notable differences that will be visible to applications (no significance is implied by the order of this list):

Here are examples of these differences:

System.out.println(DateFormat.getDateInstance(DateFormat.MEDIUM, Locale.UK)
                             .format(new Date()));
// JDK 8:  15-Mar-2024
// JDK 9:  15 Mar 2024

System.out.println(DateFormat.getDateTimeInstance(DateFormat.SHORT,
                                                  DateFormat.SHORT,
                                                  Locale.ENGLISH)
                             .format(new Date()));
// JDK 8:  3/19/24 2:35 PM
// JDK 9:  3/19/24, 2:35 PM

System.out.println(DateFormat.getTimeInstance(DateFormat.FULL, Locale.ENGLISH)
                             .format(new Date()));
// JDK 8:  2:27:03 PM PDT
// JDK 9:  2:27:03 PM Pacific Daylight Time

System.out.println(NumberFormat.getInstance(Locale.ENGLISH).format(Double.NaN));
// JDK 8:  �
// JDK 9:  NaN

System.out.println(new SimpleDateFormat("dd MMM", Locale.GERMANY)
                       .format(new GregorianCalendar(2024, Calendar.MARCH, 19)
                       .getTime()));
// JDK 8:  19 Mär
// JDK 9:  19 März

System.out.println(NumberFormat.getCurrencyInstance(Locale.ITALY).format(100));
// JDK 8:  € 100,00
// JDK 9:  100,00 €

System.out.println(new Locale("lt").getDisplayName(Locale.FRENCH));
// JDK 8:  lithuanien
// JDK 9:  lituanien

Prior to deploying on JDK 9 or later where CLDR locale data is used by default, we strongly encourage you to check for compatibility issues by running your applications on JDK 8 with the CLDR provider selected. Do this by starting the Java 8 runtime with

$ java -Djava.locale.providers=CLDR,JRE ...

so that CLDR locale data has priority over legacy locale data.

If your code uses locale-sensitive APIs, we strongly encourage you to revise it, as necessary, to align with CLDR locale data as soon as possible. Code that interacts with locale-sensitive APIs must work properly when dates, times, currencies, languages, countries, and time zones are formatted and parsed using CLDR locale data.

The impact on code can depend on whether the string representations of dates, times, etc., are exchanged with or stored in systems outside the application. For example, suppose an application has a Date object that it needs to persist, so it formats the Date for the UK locale and stores the resulting string in a database. If the application, later in the same session, retrieves the string from the database and parses it as a Date in the UK locale, there will be no impact from the switch to CLDR locale data. The application will get the same Date that it started with, since both formatting and parsing are performed on the same JDK, with the same locale data.

However, suppose the application ran on JDK 8 when it stored the string in the database, but runs on JDK 17 when it retrieves the string. The Date object was formatted as a string using legacy locale data, but the string will be parsed as a Date using CLDR locale data. The code will trigger a java.text.ParseException because, e.g., the hyphenated string "15-Mar-2024" does not match the dd MMM yyyy pattern used for UK dates in CLDR. As a result of the exception, the application could fail or behave in unexpected ways.

Beyond the code of the application itself, code used for testing the application may be impacted by the switch to CLDR locale data. Unit tests frequently include hard-coded date/time strings that the application is expected to parse in a locale-sensitive way. If the tests were written with JDK 8 and the application is migrated to JDK 9 or later then the tests could fail.

Continuing to use legacy locale data

If it is impractical to revise code to format and parse strings using CLDR locale data, there are three measures that you can take to continue formatting and parsing strings using legacy locale data:

  1. Force locale-sensitive APIs to use legacy locale data at startup. Do this by starting the Java runtime with

    $ java -Djava.locale.providers=JRE,CLDR ...

    The system property value COMPAT can be used as a synonym for JRE, e.g., -Djava.locale.providers=COMPAT,CLDR ...

    Forcing the use of legacy locale data must be treated as a temporary measure. In a release after JDK 9, only CLDR locale data will be available.

  2. Modify your code to always format and parse strings with the same patterns as those in legacy locale data.

    For example, suppose your code uses the locale-sensitive SimpleDateFormat API to format Date objects. On JDK 8, the code might have obtained a SimpleDateFormat as follows:

    SimpleDateFormat fmt
        = (SimpleDateFormat)DateFormat.getDateInstance(DateFormat.MEDIUM, Locale.UK);
    // prints "19-Mar-2024" on JDK 8 but "19 Mar 2024" on JDK 9
    System.out.println(fmt.format(new Date()));

    You could modify the code to create a SimpleDateFormat directly, passing the desired pattern (date components separated by hyphens) to the constructor of SimpleDateFormat:

    SimpleDateFormat fmt = new SimpleDateFormat("dd-MMM-yyyy", Locale.UK);
    // prints "19-Mar-2024", even on JDK 9
    System.out.println(fmt.format(new Date()));

    This solution can work well for small applications, or for large applications that store formats in singleton variables whose use is rigorously enforced across the codebase.

  3. Create a custom locale data provider and include it in the application. This provider can override the CLDR provider so that locale-sensitive APIs, when formatting and parsing strings, give priority to the patterns defined by the custom provider.

    For example, here is a custom locale data provider that can be used on JDK 9 to reinstate the hyphen-separated pattern for UK dates from JDK 8:

    package com.example.localization;
    import java.text.*;
    import java.text.spi.*;
    import java.util.*;
    
    public class HyphenatedUKDates extends DateFormatProvider {
    
         @Override
         public Locale[] getAvailableLocales() {
             return new Locale[]{Locale.UK};
         }
    
         @Override
         public DateFormat getDateInstance(int style, Locale locale) {
             assert locale.equals(Locale.UK);
             switch (style) {
                 case DateFormat.FULL:
                     return new SimpleDateFormat("EEEE, d MMMM yyyy");
                 case DateFormat.LONG:
                     return new SimpleDateFormat("dd MMMM yyyy");
                 case DateFormat.MEDIUM:
                     return new SimpleDateFormat("dd-MMM-yyyy");
                 case DateFormat.SHORT:
                     return new SimpleDateFormat("dd/MM/yy");
                 default:
                     throw new IllegalArgumentException("style not supported");
             }
         }
    
         @Override
         public DateFormat getDateTimeInstance(int dateStyle, int timeStyle,
                                               Locale locale)
         {
             ...
         }
    
         @Override
         public DateFormat getTimeInstance(int style, Locale locale) {
             ...
         }
    
    }

Future plans for legacy locale data

In a release after JDK 9, we will stop shipping legacy locale data entirely. We will gradually degrade support for legacy locale data:

Risks and Assumptions