[PATCH v4 4/4] Add generic C.UTF-8 locale (Bug 17318)

From: Carlos O'Donell <carlos@redhat.com>
To: libc-alpha@sourceware.org, fweimer@redhat.com
Subject: [PATCH v4 4/4] Add generic C.UTF-8 locale (Bug 17318)
Date: Wed, 28 Apr 2021 09:00:33 -0400
Message-ID: <20210428130033.3196848-5-carlos@redhat.com>
In-Reply-To: <20210428130033.3196848-1-carlos@redhat.com>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 9198 bytes --]

We add a new C.UTF-8 locale.  This locale is not built into glibc, but
is provided as a distinct locale.  The locale provides full support
for UTF-8, including full code point sorting via collation
(surrogates are excluded).  Unfortunately, given the present
implementation in glibc, this results in 28MiB of LC_COLLATE data for
all possible Unicode code points.  Future improvements may reduce this
size.  Such improvements likely require a shortcut for the collation
data that relies on C.UTF-8 single-byte sorting being equivalent to
strcmp.
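
The shortcut described above rests on a property of UTF-8: comparing
encoded strings byte by byte, as strcmp does, gives the same order as
comparing their code points.  A minimal Python sketch of that property
follows; the sample strings are illustrative assumptions and are not
taken from the patch.

```python
# Illustrative check: UTF-8 byte order matches code point order.
# The sample strings below are made up for this sketch.
samples = [
    "a", "Z", "ab", "a\u00e9", "\u00e9", "\u0100", "\u0fff",
    "\u1000", "\uffee", "\U00010000", "\U0010fffd",
]


def by_code_point(s):
    """Sort key: the sequence of Unicode code points."""
    return [ord(ch) for ch in s]


def by_utf8_bytes(s):
    """Sort key: the UTF-8 encoding, compared as unsigned bytes."""
    return s.encode("utf-8")


sorted_cp = sorted(samples, key=by_code_point)
sorted_bytes = sorted(samples, key=by_utf8_bytes)
print(sorted_cp == sorted_bytes)
```

Both keys produce the same ordering, which is why a byte-wise
comparison could in principle stand in for the full collation table.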

The new locale is NOT added to SUPPORTED.  Minimal test data for
specific code points (minus those not supported by collate-test) is
provided in C.UTF-8.in, and this verifies code point sorting is
working reasonably across the range.
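
As a rough illustration of what the test data expresses, the sketch
below parses lines in the "character ; <Uxxxx>" layout used by
C.UTF-8.in and checks that the labelled code points are strictly
increasing.  The helper names are hypothetical, and this is not how
collate-test itself works; it only checks the labels.

```python
import re

# A short excerpt in the layout of C.UTF-8.in: "<character> ; <Uxxxx>".
SAMPLE = """\
! ; <U21>
0 ; <U30>
A ; <U41>
a ; <U61>
~ ; <U7E>
"""

# Matches the trailing <Uxxxx> label on each line.
LABEL = re.compile(r"<U([0-9A-Fa-f]+)>\s*$")


def labelled_code_points(text):
    """Return the code points named by the <Uxxxx> labels, in file order."""
    points = []
    for line in text.splitlines():
        match = LABEL.search(line)
        if match:
            points.append(int(match.group(1), 16))
    return points


def is_code_point_ordered(text):
    """True if the labelled code points are strictly increasing."""
    points = labelled_code_points(text)
    return all(a < b for a, b in zip(points, points[1:]))


print(is_code_point_ordered(SAMPLE))
```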

The next step is to reduce LC_COLLATE to a manageable size before we
enable the locale in SUPPORTED. Fully testing C.UTF-8 collation can
add ~5-7 minutes to the locale testing (collate-test, and xfrm-test
twice) so we don't enable full testing of all code points until we can
parallelize the sort-test test. Testing sort-test with C.UTF-8 minimal
test data passes cleanly.

Tested on x86_64 and i686 without regression.
---
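For a sense of scale, the LC_COLLATE block in localedata/locales/C
covers two ranges, <U00000000>..<U0000D7FF> and
<U0000E000>..<U0010FFFF>.  The sketch below counts the code points in
those ranges and divides the quoted 28MiB by that count.  The
per-entry figure is only a back-of-the-envelope average; it assumes
the whole 28MiB is spread evenly and says nothing about the actual
table layout.

```python
# Ranges listed in the order_start block, surrogates excluded.
RANGES = [(0x0000, 0xD7FF), (0xE000, 0x10FFFF)]

# Number of Unicode scalar values covered by the collation rule.
scalar_values = sum(hi - lo + 1 for lo, hi in RANGES)

# "28MiB" as quoted in the commit message; treated as approximate.
collate_bytes = 28 * 1024 * 1024

# Average bytes of LC_COLLATE data per scalar value (rough estimate).
bytes_per_value = collate_bytes / scalar_values

print(scalar_values)
print(round(bytes_per_value, 1))
```
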
 localedata/C.UTF-8.in | 156 +++++++++++++++++++++++++++++++++++
 localedata/Makefile   |   2 +
 localedata/locales/C  | 188 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 346 insertions(+)
 create mode 100644 localedata/C.UTF-8.in
 create mode 100644 localedata/locales/C

diff --git a/localedata/C.UTF-8.in b/localedata/C.UTF-8.in
new file mode 100644
index 0000000000..b8764a4e04
--- /dev/null
+++ b/localedata/C.UTF-8.in
@@ -0,0 +1,156 @@
+\x01 ; <U1>
+\x02 ; <U2>
+\x03 ; <U3>
+\x04 ; <U4>
+\x05 ; <U5>
+\x06 ; <U6>
+\a ; <U7>
+\b ; <U8>
+\x0e ; <UE>
+\x0f ; <UF>
+\x10 ; <U10>
+\x11 ; <U11>
+\x12 ; <U12>
+\x13 ; <U13>
+\x14 ; <U14>
+\x15 ; <U15>
+\x16 ; <U16>
+\x17 ; <U17>
+\x18 ; <U18>
+\x19 ; <U19>
+\x1a ; <U1A>
+^[ ; <U1B>
+\x1c ; <U1C>
+\x1d ; <U1D>
+\x1e ; <U1E>
+\x1f ; <U1F>
+! ; <U21>
+" ; <U22>
+# ; <U23>
+$ ; <U24>
+% ; <U25>
+& ; <U26>
+' ; <U27>
+) ; <U29>
+* ; <U2A>
++ ; <U2B>
+, ; <U2C>
+- ; <U2D>
+. ; <U2E>
+/ ; <U2F>
+0 ; <U30>
+1 ; <U31>
+2 ; <U32>
+3 ; <U33>
+4 ; <U34>
+5 ; <U35>
+6 ; <U36>
+7 ; <U37>
+8 ; <U38>
+9 ; <U39>
+< ; <U3C>
+= ; <U3D>
+> ; <U3E>
+? ; <U3F>
+@ ; <U40>
+A ; <U41>
+B ; <U42>
+C ; <U43>
+D ; <U44>
+E ; <U45>
+F ; <U46>
+G ; <U47>
+H ; <U48>
+I ; <U49>
+J ; <U4A>
+K ; <U4B>
+L ; <U4C>
+M ; <U4D>
+N ; <U4E>
+O ; <U4F>
+P ; <U50>
+Q ; <U51>
+R ; <U52>
+S ; <U53>
+T ; <U54>
+U ; <U55>
+V ; <U56>
+W ; <U57>
+X ; <U58>
+Y ; <U59>
+Z ; <U5A>
+[ ; <U5B>
+\ ; <U5C>
+] ; <U5D>
+^ ; <U5E>
+_ ; <U5F>
+` ; <U60>
+a ; <U61>
+b ; <U62>
+c ; <U63>
+d ; <U64>
+e ; <U65>
+f ; <U66>
+g ; <U67>
+h ; <U68>
+i ; <U69>
+j ; <U6A>
+k ; <U6B>
+l ; <U6C>
+m ; <U6D>
+n ; <U6E>
+o ; <U6F>
+p ; <U70>
+q ; <U71>
+r ; <U72>
+s ; <U73>
+t ; <U74>
+u ; <U75>
+v ; <U76>
+w ; <U77>
+x ; <U78>
+y ; <U79>
+z ; <U7A>
+{ ; <U7B>
+| ; <U7C>
+} ; <U7D>
+~ ; <U7E>
+\x7f ; <U7F>
+Â€ ; <U80>
+Ã¿ ; <UFF>
+Ä€ ; <U100>
+à¿¿ ; <UFFF>
+á€€ ; <U1000>
+ï¿¿ ; <UFFFF>
+ð€€ ; <U10000>
+ðŸ¿¿ ; <U1FFFF>
+ð €€ ; <U20000>
+ð¯¿¿ ; <U2FFFF>
+ð°€€ ; <U30000>
+ð¿¿¾ ; <U3FFFE>
+ñ€€€ ; <U40000>
+ñ¿¿ ; <U4FFFF>
+ñ€€ ; <U50000>
+ñŸ¿¿ ; <U5FFFF>
+ñ €€ ; <U60000>
+ñ¯¿¿ ; <U6FFFF>
+ñ°€€ ; <U70000>
+ñ¿¿¿ ; <U7FFFF>
+ò€€€ ; <U80000>
+ò¿¿ ; <U8FFFF>
+ò€€ ; <U90000>
+òŸ¿¿ ; <U9FFFF>
+ò €€ ; <UA0000>
+ò¯¿¿ ; <UAFFFF>
+ò°€€ ; <UB0000>
+ò¿¿¿ ; <UBFFFF>
+ó€€ ; <UC0001>
+ó¿Œ ; <UCFFCC>
+ó€Ž ; <UD000E>
+óŸ¿¿ ; <UDFFFF>
+ó € ; <UE0001>
+ó¯¿¿ ; <UEFFFF>
+ó°€ ; <UF0001>
+ó¿¿¿ ; <UFFFFF>
+ô€€ ; <U100001>
+ô¿¿ ; <U10FFFF>
diff --git a/localedata/Makefile b/localedata/Makefile
index 14e04cd3c5..38017f2c4c 100644
--- a/localedata/Makefile
+++ b/localedata/Makefile
@@ -47,6 +47,7 @@ test-input := \
 	bg_BG.UTF-8 \
 	br_FR.UTF-8 \
 	bs_BA.UTF-8 \
+	C.UTF-8 \
 	ckb_IQ.UTF-8 \
 	cmn_TW.UTF-8 \
 	crh_UA.UTF-8 \
@@ -206,6 +207,7 @@ LOCALES := \
 	bg_BG.UTF-8 \
 	br_FR.UTF-8 \
 	bs_BA.UTF-8 \
+	C.UTF-8 \
 	ckb_IQ.UTF-8 \
 	cmn_TW.UTF-8 \
 	crh_UA.UTF-8 \
diff --git a/localedata/locales/C b/localedata/locales/C
new file mode 100644
index 0000000000..67e5bd913b
--- /dev/null
+++ b/localedata/locales/C
@@ -0,0 +1,188 @@
+escape_char /
+comment_char %
+% Locale for C locale in UTF-8
+
+LC_IDENTIFICATION
+title      "C locale"
+source     ""
+address    ""
+contact    ""
+email      "bug-glibc-locales@gnu.org"
+tel        ""
+fax        ""
+language   ""
+territory  ""
+revision   "2.0"
+date       "2020-06-28"
+category  "i18n:2012";LC_IDENTIFICATION
+category  "i18n:2012";LC_CTYPE
+category  "i18n:2012";LC_COLLATE
+category  "i18n:2012";LC_TIME
+category  "i18n:2012";LC_NUMERIC
+category  "i18n:2012";LC_MONETARY
+category  "i18n:2012";LC_MESSAGES
+category  "i18n:2012";LC_PAPER
+category  "i18n:2012";LC_NAME
+category  "i18n:2012";LC_ADDRESS
+category  "i18n:2012";LC_TELEPHONE
+category  "i18n:2012";LC_MEASUREMENT
+END LC_IDENTIFICATION
+
+LC_CTYPE
+
+% Include only the i18n character type classes without any of the
+% transliteration that i18n uses by default.  The C locale has no
+% transliteration and passes all characters through unchanged.
+copy "i18n_ctype"
+
+END LC_CTYPE
+
+% One rule, sort forward, for all Unicode scalar values to give
+% code point order sorting for Unicode (excludes surrogates
+% which are not in the UTF-8 character map).
+LC_COLLATE
+order_start forward
+<U00000000>
+..
+<U0000D7FF>
+% Exclude surrogates <UD800> to <UDFFF> from collation.
+<U0000E000>
+..
+<U0010FFFF>
+UNDEFINED
+order_end
+END LC_COLLATE
+
+LC_MONETARY
+
+% This is the 14652 i18n fdcc-set definition for the LC_MONETARY
+% category (except for the int_curr_symbol and currency_symbol, they are
+% empty in the 14652 i18n fdcc-set definition and also empty in
+% glibc/locale/C-monetary.c.).
+int_curr_symbol     ""
+currency_symbol     ""
+mon_decimal_point   "."
+mon_thousands_sep   ""
+mon_grouping        -1
+positive_sign       ""
+negative_sign       "-"
+int_frac_digits     -1
+frac_digits         -1
+p_cs_precedes       -1
+int_p_sep_by_space  -1
+p_sep_by_space      -1
+n_cs_precedes       -1
+int_n_sep_by_space  -1
+n_sep_by_space      -1
+p_sign_posn         -1
+n_sign_posn         -1
+%
+END LC_MONETARY
+
+LC_NUMERIC
+% This is the POSIX Locale definition for
+% the LC_NUMERIC category.
+%
+decimal_point   "."
+thousands_sep   ""
+grouping        -1
+END LC_NUMERIC
+
+LC_TIME
+% This is the POSIX Locale definition for the LC_TIME category with the
+% exception that time is per ISO 8601 and 24-hour.
+%
+% Abbreviated weekday names (%a)
+abday       "Sun";"Mon";"Tue";"Wed";"Thu";"Fri";"Sat"
+
+% Full weekday names (%A)
+day         "Sunday";"Monday";"Tuesday";"Wednesday";"Thursday";/
+            "Friday";"Saturday"
+
+% Abbreviated month names (%b)
+abmon       "Jan";"Feb";"Mar";"Apr";"May";"Jun";"Jul";"Aug";"Sep";/
+            "Oct";"Nov";"Dec"
+
+% Full month names (%B)
+mon         "January";"February";"March";"April";"May";"June";"July";/
+            "August";"September";"October";"November";"December"
+
+% Week description, consists of three fields:
+% 1. Number of days in a week.
+% 2. Gregorian date that is a first weekday (19971130 for Sunday, 19971201 for Monday).
+% 3. The weekday number to be contained in the first week of the year.
+%
+% ISO 8601 conforming applications should use the values 7, 19971201 (a
+% Monday), and 4 (Thursday), respectively.
+week    7;19971201;4
+first_weekday	1
+first_workday	1
+
+% Appropriate date and time representation (%c)
+d_t_fmt "%a %b %e %H:%M:%S %Y"
+
+% Appropriate date representation (%x)
+d_fmt   "%m/%d/%y"
+
+% Appropriate time representation (%X)
+t_fmt   "%H:%M:%S"
+
+% Appropriate AM/PM time representation (%r)
+t_fmt_ampm "%I:%M:%S %p"
+
+% Equivalent of AM/PM (%p)
+am_pm	"AM";"PM"
+
+% Appropriate date representation (date(1))   "%a %b %e %H:%M:%S %Z %Y"
+date_fmt	"%a %b %e %H:%M:%S %Z %Y"
+END LC_TIME
+
+LC_MESSAGES
+% This is the POSIX Locale definition for
+% the LC_MESSAGES category.
+%
+yesexpr "^[yY]"
+noexpr  "^[nN]"
+yesstr  "Yes"
+nostr   "No"
+END LC_MESSAGES
+
+LC_PAPER
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_PAPER category.
+% (A4 paper, this is also used in the built in C/POSIX
+% locale in glibc/locale/C-paper.c)
+height   297
+width    210
+END LC_PAPER
+
+LC_NAME
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_NAME category.
+% (also used in the built in C/POSIX locale in glibc/locale/C-name.c)
+name_fmt    "%p%t%g%t%m%t%f"
+END LC_NAME
+
+LC_ADDRESS
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_ADDRESS category.
+% (also used in the built in C/POSIX locale in glibc/locale/C-address.c)
+postal_fmt    "%a%N%f%N%d%N%b%N%s %h %e %r%N%C-%z %T%N%c%N"
+END LC_ADDRESS
+
+LC_TELEPHONE
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_TELEPHONE category.
+% "+%c %a %l"
+tel_int_fmt    "+%c %a %l"
+% (also used in the built in C/POSIX locale in glibc/locale/C-telephone.c)
+END LC_TELEPHONE
+
+LC_MEASUREMENT
+% This is the ISO/IEC 14652 "i18n" definition for
+% the LC_MEASUREMENT category.
+% (same as in the built in C/POSIX locale in glibc/locale/C-measurement.c)
+%metric
+measurement    1
+END LC_MEASUREMENT
+
-- 
2.26.3
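
The "week" field in LC_TIME above uses 19971201 as the reference date
for the first weekday, which the comment describes as a Monday.  A
small Python check of that claim, with a hypothetical helper to parse
the YYYYMMDD value:

```python
from datetime import date


def parse_yyyymmdd(value):
    """Convert an integer such as 19971201 to a datetime.date."""
    text = str(value)
    return date(int(text[:4]), int(text[4:6]), int(text[6:8]))


# isoweekday() returns 1 for Monday through 7 for Sunday.
monday_anchor = parse_yyyymmdd(19971201)
sunday_anchor = parse_yyyymmdd(19971130)

print(monday_anchor.isoweekday())
print(sunday_anchor.isoweekday())
```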