[PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).

public inbox for libc-alpha@sourceware.org

* [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
@ 2018-07-19 19:43 Carlos O'Donell
  2018-07-19 20:39 ` Florian Weimer
                   ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-19 19:43 UTC (permalink / raw)
  To: GNU C Library, Florian Weimer, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 3871 bytes --]

In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
the collation data to harmonize with the new version of ISO 14651,
which is derived from Unicode 9.0.0.  This collation update brought
with it some changes to locales which some users did not want.  In
particular, it altered the meaning of the locale-dependent range
expressions [a-z] and [A-Z], and for en_US it caused uppercase letters
to be matched by [a-z] for the first time.  The matching of uppercase
letters by [a-z] is already familiar to users of other locales which
have this property, but this change could cause significant problems
for users of en_US and similar locales that had never behaved this way
before.

Whether this behaviour is desirable or not is contentious.  GNU Awk
has this to say on the topic:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

The POSIX standard adds this, under "RE Bracket Expression":
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
"The current standard leaves unspecified the behavior of a range
expression outside the POSIX locale. ... As noted above, efforts were
made to resolve the differences, but no solution has been found that
would be specific enough to allow for portable software while not
invalidating existing implementations."

In glibc we implement the requirement of ISO POSIX-2:1993 and use
collation element order (CEO) to construct the range expression; the
internal API is __collseq_table_lookup().  Because we use CEO and also
have 4-level weights on each collation rule, we can in practice reorder
the collation rules in iso14651_t1_common (the new data) to provide
consistent range expression resolution *and* the weights should
maintain the expected total order.  Therefore this patch does three
things:

* Reorder the collation rules for the LATIN script in
  iso14651_t1_common to deinterlace uppercase and lowercase letters in
  the collation element orders.

* Add new test data en_US.UTF-8.in for sort-test.sh which exercises
  strcoll* and strxfrm* and ensures the ISO 14651 collation remains.

* Add back tests to tst-fnmatch.input and tst-regexloc.c which
  exercise that [a-z] does not match A or Z.
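
To illustrate why the position of a rule in the file matters for
ranges, here is a toy model; it is not glibc code.  It assumes that a
range [x-y] matches every element whose position in the collation
element order lies between the positions of x and y.  The two lists are
hypothetical, abbreviated stand-ins for the rule order in
iso14651_t1_common, and range_matches() is an invented helper.

```python
# Toy model of range resolution by collation element order (CEO).
# The lists are hypothetical, abbreviated stand-ins for the rule order
# in iso14651_t1_common; this is not how glibc stores the data.

def range_matches(order, lo, hi, ch):
    """Return True if ch sits between lo and hi (inclusive) in order."""
    pos = {c: i for i, c in enumerate(order)}
    return pos[lo] <= pos[ch] <= pos[hi]

# Interleaved: each lowercase letter is followed by its uppercase form.
interleaved = ["a", "A", "b", "B", "y", "Y", "z", "Z"]

# Deinterlaced: all lowercase rules first, then all uppercase rules.
deinterlaced = ["a", "b", "y", "z", "A", "B", "Y", "Z"]

print(range_matches(interleaved, "a", "z", "A"))   # True: A falls inside [a-z]
print(range_matches(interleaved, "A", "Z", "z"))   # True: z falls inside [A-Z]
print(range_matches(deinterlaced, "a", "z", "A"))  # False
print(range_matches(deinterlaced, "A", "Z", "z"))  # False
```

In this model only the positions change between the two lists; the
letters themselves are the same, which is the sense in which the
reordering is mechanical.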

The reordering of the ISO 14651 data is done in an entirely mechanical
fashion using the following program attached to the bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c28

It is up for discussion if the iso14651_t1_common data should be
refined further to have 3 very tight collation element ranges that
include only a-z, A-Z, and 0-9, which would implement the solution
sought after in:
https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c12

No regressions on x86_64.
Verified that removal of the iso14651_t1_common change causes tst-fnmatch
to regress with:
422: fnmatch ("[a-z]", "A", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
...
425: fnmatch ("[A-Z]", "z", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
---
 ChangeLog                             |   11 +
 localedata/Makefile                   |    1 +
 localedata/en_US.UTF-8.in             | 2159 +++++++++++++++++++++++++++++++++
 localedata/locales/iso14651_t1_common | 1928 ++++++++++++++---------------
 posix/tst-fnmatch.input               |  125 +-
 posix/tst-regexloc.c                  |    8 +-
 6 files changed, 3224 insertions(+), 1008 deletions(-)
 create mode 100644 localedata/en_US.UTF-8.in

I'm suggesting this change immediately for 2.28 to avoid further
problems with users' expectations and sorting with [a-z] and [A-Z] until
a clearer consensus can be reached for a final solution.

File attached as .tar.gz to get past spam detectors. There is a lot
of UTF-8 data in en_US.UTF-8 (every possible character in the LATIN
set that can be sorted with the existing test case infrastructure).

-- 
Cheers,
Carlos.

[-- Attachment #2: swbz23393.tar.gz --]
[-- Type: application/gzip, Size: 57788 bytes --]


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-19 19:43 [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393) Carlos O'Donell
@ 2018-07-19 20:39 ` Florian Weimer
  2018-07-20 18:49   ` Carlos O'Donell
  2018-07-25 21:35 ` [PATCH] Keep expected behaviour for [a-z] and [A-z] " Carlos O'Donell
  2018-07-26  1:33 ` Jonathan Nieder
  2 siblings, 1 reply; 42+ messages in thread
From: Florian Weimer @ 2018-07-19 20:39 UTC (permalink / raw)
  To: Carlos O'Donell, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
>    exercise that [a-z] does not match A or Z.

[a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.  It's an
improvement, and it may be good enough for glibc 2.28, but I would 
rather see us implement the rational ranges interpretation.
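
For comparison, the rational ranges interpretation can be sketched as a
plain comparison of character code values, following the reading in the
GNU Awk manual page linked earlier in the thread.  This is a minimal
sketch under that assumption; rational_match() is a hypothetical helper
and not a glibc or gawk function.

```python
# Rational range interpretation, sketched as a code point comparison.
# rational_match is a hypothetical helper, not a glibc or gawk function.

def rational_match(lo, hi, ch):
    """Return True if ch's code point lies in [ord(lo), ord(hi)]."""
    return ord(lo) <= ord(ch) <= ord(hi)

print(rational_match("a", "z", "z"))           # True
print(rational_match("a", "z", "A"))           # False
print(rational_match("a", "z", "\u00f1"))      # False: U+00F1 is above U+007A
print(rational_match("a", "z", "\U0001d6a3"))  # False: far outside a..z
print(rational_match("0", "9", "\u0660"))      # False: ARABIC-INDIC DIGIT ZERO
```

Under this reading the result does not depend on the locale's
collation data at all.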

Thanks,
Florian


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-19 20:39 ` Florian Weimer
@ 2018-07-20 18:49   ` Carlos O'Donell
  2018-07-20 19:02     ` Rich Felker
  2018-07-20 19:19     ` Florian Weimer
  0 siblings, 2 replies; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-20 18:49 UTC (permalink / raw)
  To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 2202 bytes --]

On 07/19/2018 04:39 PM, Florian Weimer wrote:
> On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
>> * Add back tests to tst-fnmatch.input and tst-regexloc.c which 
>> exercise that [a-z] does not match A or Z.
> 
> [a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.

Sorry, I don't follow, it absolutely matches ASCII z.

We deinterlace the collation element ordering (not sequence) to get
the right range expression resolution.

See the added fnmatch tests:

+en_US.UTF-8     "a"                    "[a-z]"                0
+en_US.UTF-8     "z"                    "[a-z]"                0
+en_US.UTF-8     "A"                    "[a-z]"                NOMATCH
+en_US.UTF-8     "Z"                    "[a-z]"                NOMATCH
+en_US.UTF-8     "a"                    "[A-Z]"                NOMATCH
+en_US.UTF-8     "z"                    "[A-Z]"                NOMATCH
+en_US.UTF-8     "A"                    "[A-Z]"                0
+en_US.UTF-8     "Z"                    "[A-Z]"                0
+en_US.UTF-8     "0"                    "[0-9]"                0
+en_US.UTF-8     "9"                    "[0-9]"                0

[a-z] matches a-z (including z), *and* all the lowercase in between,
and so effectively behaves like [:lower:].

[A-Z] matches A-Z (including Z), *and* all the uppercase in between,
and so effectively behaves like [:upper:].

I left in all the matches for the accented characters because it was
the most conservative thing to do for now.
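
A small sketch of why a range resolved by collation element order
still takes in variant letters: anything whose rule sits between the
rules for a and z is inside the range, and anything after z is not.
The order below is a hypothetical, heavily abbreviated list, and
in_range() is an invented helper; the real file holds many more rules.

```python
# Hypothetical, abbreviated lowercase rule order after deinterlacing.
# Rules stay grouped by base letter, so variants of "n" sit between
# "a" and "z", while variants of "z" follow "z" itself.
N_TILDE = "\u00f1"     # LATIN SMALL LETTER N WITH TILDE
MONO_N = "\U0001d697"  # MATHEMATICAL MONOSPACE SMALL N
MONO_Z = "\U0001d6a3"  # MATHEMATICAL MONOSPACE SMALL Z

lower_order = ["a", "n", MONO_N, N_TILDE, "z", MONO_Z]

def in_range(order, lo, hi, ch):
    """Inclusive position test within a collation element order."""
    return order.index(lo) <= order.index(ch) <= order.index(hi)

# Print each code point and whether the model puts it inside [a-z].
for ch in (N_TILDE, MONO_N, "z", MONO_Z):
    print(f"U+{ord(ch):04X}", in_range(lower_order, "a", "z", ch))
```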

I think I could be persuaded otherwise: reading the old history and
seeing the new reports seems to indicate we should back down to
behaving like C/POSIX in these cases.

> It's an improvement, and it may be good enough for glibc 2.28, but I would
> rather see us implement the rational ranges interpretation.

Does that require all ranges to behave rationally?

We could fix a-z, A-Z, and 0-9 easily.

Patch attached.

It has no effect on collation sequence, but it will break scripts
that expect the new-style behaviour, and we knew that, but it
certainly aligns us with the pre-POSIX requirement and the rest of
the GNU tools implementing rational ranges, which is a much better
reason.
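
The claim that moving rules changes ranges but not collation can be
sketched with a toy model.  It assumes sorting looks only at each
rule's weights while a range looks only at rule position.  The numeric
weights are made up (loosely mirroring the <Sxxxx>;<BASE>;<MIN>/<WIDE>
pattern in the data), only single characters are compared, and both
helpers are hypothetical.

```python
# Each rule is (character, weights). The weights are invented numbers
# standing in for the 4-level weights; only their relative order matters.
FW_A, FW_B, FW_Z = "\uff41", "\uff42", "\uff5a"  # fullwidth a, b, z

before = [  # ASCII letter followed directly by its fullwidth variant
    ("a", (0x61, 0, 0, 0x61)), (FW_A, (0x61, 0, 1, 0xFF41)),
    ("b", (0x62, 0, 0, 0x62)), (FW_B, (0x62, 0, 1, 0xFF42)),
    ("z", (0x7A, 0, 0, 0x7A)), (FW_Z, (0x7A, 0, 1, 0xFF5A)),
]
# Same rules and weights, with the ASCII letters moved into one block.
after = [before[0], before[2], before[4], before[1], before[3], before[5]]

def sort_chars(rules, chars):
    """Sort single characters by their weights, ignoring rule position."""
    weights = dict(rules)
    return sorted(chars, key=weights.__getitem__)

def range_members(rules, lo, hi):
    """List the characters whose rules lie between lo and hi, inclusive."""
    order = [ch for ch, _ in rules]
    return order[order.index(lo):order.index(hi) + 1]

chars = [FW_Z, "b", FW_A, "z", "a", FW_B]
print(sort_chars(before, chars) == sort_chars(after, chars))  # True
print(range_members(before, "a", "z"))  # includes fullwidth a and b
print(range_members(after, "a", "z"))   # ['a', 'b', 'z']
```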

-- 
Cheers,
Carlos.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: rational-ranges.diff --]
[-- Type: text/x-patch; name="rational-ranges.diff", Size: 44348 bytes --]

diff --git a/localedata/locales/iso14651_t1_common b/localedata/locales/iso14651_t1_common
index 227400cc4e..7248074a8b 100644
--- a/localedata/locales/iso14651_t1_common
+++ b/localedata/locales/iso14651_t1_common
@@ -63177,7 +63177,19 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U20BC> <S20BC>;<BASE>;<MIN>;<U20BC> % MANAT SIGN
 <U20BD> <S20BD>;<BASE>;<MIN>;<U20BD> % RUBLE SIGN
 <U20BE> <S20BE>;<BASE>;<MIN>;<U20BE> % LARI SIGN
+% Implement rational range for [0-9] in regular expressions.
+% We order the collation element order to support rational ranges.
+% Collation is unaffected because the 4-level weights remain the same.
 <U0030> <S0030>;<BASE>;<MIN>;<U0030> % DIGIT ZERO
+<U0031> <S0031>;<BASE>;<MIN>;<U0031> % DIGIT ONE
+<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO
+<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
+<U0034> <S0034>;<BASE>;<MIN>;<U0034> % DIGIT FOUR
+<U0035> <S0035>;<BASE>;<MIN>;<U0035> % DIGIT FIVE
+<U0036> <S0036>;<BASE>;<MIN>;<U0036> % DIGIT SIX
+<U0037> <S0037>;<BASE>;<MIN>;<U0037> % DIGIT SEVEN
+<U0038> <S0038>;<BASE>;<MIN>;<U0038> % DIGIT EIGHT
+<U0039> <S0039>;<BASE>;<MIN>;<U0039> % DIGIT NINE
 <U0660> <S0030>;<BASE>;<MIN>;<U0660> % ARABIC-INDIC DIGIT ZERO
 <U06F0> <S0030>;<BASE>;<MIN>;<U06F0> % EXTENDED ARABIC-INDIC DIGIT ZERO
 <U07C0> <S0030>;<BASE>;<MIN>;<U07C0> % NKO DIGIT ZERO
@@ -63250,7 +63262,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U2080> <S0030>;<BASE>;<MNS>;<U2080> % SUBSCRIPT ZERO
 <U2189> "<S0030><S0033>";"<BASE><BASE>";"<FRACTION><FRACTION>";<U2189> % VULGAR FRACTION ZERO THIRDS
 <U3358> "<S0030><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3358> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ZERO
-<U0031> <S0031>;<BASE>;<MIN>;<U0031> % DIGIT ONE
 <U0661> <S0031>;<BASE>;<MIN>;<U0661> % ARABIC-INDIC DIGIT ONE
 <U06F1> <S0031>;<BASE>;<MIN>;<U06F1> % EXTENDED ARABIC-INDIC DIGIT ONE
 <U07C1> <S0031>;<BASE>;<MIN>;<U07C1> % NKO DIGIT ONE
@@ -63440,7 +63451,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E0> "<S0031><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E0> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY ONE
 <U32C0> "<S0031><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C0> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JANUARY
 <U3359> "<S0031><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3359> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ONE
-<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO
 <U0662> <S0032>;<BASE>;<MIN>;<U0662> % ARABIC-INDIC DIGIT TWO
 <U06F2> <S0032>;<BASE>;<MIN>;<U06F2> % EXTENDED ARABIC-INDIC DIGIT TWO
 <U07C2> <S0032>;<BASE>;<MIN>;<U07C2> % NKO DIGIT TWO
@@ -63583,7 +63593,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E1> "<S0032><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E1> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY TWO
 <U32C1> "<S0032><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C1> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR FEBRUARY
 <U335A> "<S0032><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335A> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR TWO
-<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
 <U0663> <S0033>;<BASE>;<MIN>;<U0663> % ARABIC-INDIC DIGIT THREE
 <U06F3> <S0033>;<BASE>;<MIN>;<U06F3> % EXTENDED ARABIC-INDIC DIGIT THREE
 <U07C3> <S0033>;<BASE>;<MIN>;<U07C3> % NKO DIGIT THREE
@@ -63709,7 +63718,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E2> "<S0033><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E2> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY THREE
 <U32C2> "<S0033><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C2> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR MARCH
 <U335B> "<S0033><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335B> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR THREE
-<U0034> <S0034>;<BASE>;<MIN>;<U0034> % DIGIT FOUR
 <U0664> <S0034>;<BASE>;<MIN>;<U0664> % ARABIC-INDIC DIGIT FOUR
 <U06F4> <S0034>;<BASE>;<MIN>;<U06F4> % EXTENDED ARABIC-INDIC DIGIT FOUR
 <U07C4> <S0034>;<BASE>;<MIN>;<U07C4> % NKO DIGIT FOUR
@@ -63829,7 +63837,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E3> "<S0034><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E3> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY FOUR
 <U32C3> "<S0034><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C3> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR APRIL
 <U335C> "<S0034><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335C> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR FOUR
-<U0035> <S0035>;<BASE>;<MIN>;<U0035> % DIGIT FIVE
 <U0665> <S0035>;<BASE>;<MIN>;<U0665> % ARABIC-INDIC DIGIT FIVE
 <U06F5> <S0035>;<BASE>;<MIN>;<U06F5> % EXTENDED ARABIC-INDIC DIGIT FIVE
 <U07C5> <S0035>;<BASE>;<MIN>;<U07C5> % NKO DIGIT FIVE
@@ -63941,7 +63948,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E4> "<S0035><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E4> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY FIVE
 <U32C4> "<S0035><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C4> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR MAY
 <U335D> "<S0035><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335D> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR FIVE
-<U0036> <S0036>;<BASE>;<MIN>;<U0036> % DIGIT SIX
 <U0666> <S0036>;<BASE>;<MIN>;<U0666> % ARABIC-INDIC DIGIT SIX
 <U06F6> <S0036>;<BASE>;<MIN>;<U06F6> % EXTENDED ARABIC-INDIC DIGIT SIX
 <U07C6> <S0036>;<BASE>;<MIN>;<U07C6> % NKO DIGIT SIX
@@ -64036,7 +64042,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E5> "<S0036><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E5> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY SIX
 <U32C5> "<S0036><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C5> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JUNE
 <U335E> "<S0036><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335E> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR SIX
-<U0037> <S0037>;<BASE>;<MIN>;<U0037> % DIGIT SEVEN
 <U0667> <S0037>;<BASE>;<MIN>;<U0667> % ARABIC-INDIC DIGIT SEVEN
 <U06F7> <S0037>;<BASE>;<MIN>;<U06F7> % EXTENDED ARABIC-INDIC DIGIT SEVEN
 <U07C7> <S0037>;<BASE>;<MIN>;<U07C7> % NKO DIGIT SEVEN
@@ -64132,7 +64137,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E6> "<S0037><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E6> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY SEVEN
 <U32C6> "<S0037><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C6> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JULY
 <U335F> "<S0037><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335F> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR SEVEN
-<U0038> <S0038>;<BASE>;<MIN>;<U0038> % DIGIT EIGHT
 <U0668> <S0038>;<BASE>;<MIN>;<U0668> % ARABIC-INDIC DIGIT EIGHT
 <U06F8> <S0038>;<BASE>;<MIN>;<U06F8> % EXTENDED ARABIC-INDIC DIGIT EIGHT
 <U07C8> <S0038>;<BASE>;<MIN>;<U07C8> % NKO DIGIT EIGHT
@@ -64226,7 +64230,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E7> "<S0038><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E7> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY EIGHT
 <U32C7> "<S0038><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C7> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR AUGUST
 <U3360> "<S0038><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3360> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR EIGHT
-<U0039> <S0039>;<BASE>;<MIN>;<U0039> % DIGIT NINE
 <U0669> <S0039>;<BASE>;<MIN>;<U0669> % ARABIC-INDIC DIGIT NINE
 <U06F9> <S0039>;<BASE>;<MIN>;<U06F9> % EXTENDED ARABIC-INDIC DIGIT NINE
 <U07C9> <S0039>;<BASE>;<MIN>;<U07C9> % NKO DIGIT NINE
@@ -64326,7 +64329,35 @@ order_start <LATIN>;forward;backward;forward;forward,position
 else
 order_start <LATIN>;forward;forward;forward;forward,position
 endif
+% Implement rational range for [a-z] in regular expressions.
+% We order the collation element order to support rational ranges.
+% Collation is unaffected because the 4-level weights remain the same.
 <U0061> <S0061>;<BASE>;<MIN>;<U0061> % LATIN SMALL LETTER A
+<U0062> <S0062>;<BASE>;<MIN>;<U0062> % LATIN SMALL LETTER B
+<U0063> <S0063>;<BASE>;<MIN>;<U0063> % LATIN SMALL LETTER C
+<U0064> <S0064>;<BASE>;<MIN>;<U0064> % LATIN SMALL LETTER D
+<U0065> <S0065>;<BASE>;<MIN>;<U0065> % LATIN SMALL LETTER E
+<U0066> <S0066>;<BASE>;<MIN>;<U0066> % LATIN SMALL LETTER F
+<U0067> <S0067>;<BASE>;<MIN>;<U0067> % LATIN SMALL LETTER G
+<U0068> <S0068>;<BASE>;<MIN>;<U0068> % LATIN SMALL LETTER H
+<U0069> <S0069>;<BASE>;<MIN>;<U0069> % LATIN SMALL LETTER I
+<U006A> <S006A>;<BASE>;<MIN>;<U006A> % LATIN SMALL LETTER J
+<U006B> <S006B>;<BASE>;<MIN>;<U006B> % LATIN SMALL LETTER K
+<U006C> <S006C>;<BASE>;<MIN>;<U006C> % LATIN SMALL LETTER L
+<U006D> <S006D>;<BASE>;<MIN>;<U006D> % LATIN SMALL LETTER M
+<U006E> <S006E>;<BASE>;<MIN>;<U006E> % LATIN SMALL LETTER N
+<U006F> <S006F>;<BASE>;<MIN>;<U006F> % LATIN SMALL LETTER O
+<U0070> <S0070>;<BASE>;<MIN>;<U0070> % LATIN SMALL LETTER P
+<U0071> <S0071>;<BASE>;<MIN>;<U0071> % LATIN SMALL LETTER Q
+<U0072> <S0072>;<BASE>;<MIN>;<U0072> % LATIN SMALL LETTER R
+<U0073> <S0073>;<BASE>;<MIN>;<U0073> % LATIN SMALL LETTER S
+<U0074> <S0074>;<BASE>;<MIN>;<U0074> % LATIN SMALL LETTER T
+<U0075> <S0075>;<BASE>;<MIN>;<U0075> % LATIN SMALL LETTER U
+<U0076> <S0076>;<BASE>;<MIN>;<U0076> % LATIN SMALL LETTER V
+<U0077> <S0077>;<BASE>;<MIN>;<U0077> % LATIN SMALL LETTER W
+<U0078> <S0078>;<BASE>;<MIN>;<U0078> % LATIN SMALL LETTER X
+<U0079> <S0079>;<BASE>;<MIN>;<U0079> % LATIN SMALL LETTER Y
+<U007A> <S007A>;<BASE>;<MIN>;<U007A> % LATIN SMALL LETTER Z
 <UFF41> <S0061>;<BASE>;<WIDE>;<UFF41> % FULLWIDTH LATIN SMALL LETTER A
 <U0363> <S0061>;<BASE>;<COMPAT>;<U0363> % COMBINING LATIN SMALL LETTER A
 <U249C> <S0061>;<BASE>;<COMPAT>;<U249C> % PARENTHESIZED LATIN SMALL LETTER A
@@ -64418,7 +64449,6 @@ endif
 <U0252> <S0252>;<BASE>;<MIN>;<U0252> % LATIN SMALL LETTER TURNED ALPHA
 <U1D9B> <S0252>;<BASE>;<MNN>;<U1D9B> % MODIFIER LETTER SMALL TURNED ALPHA
 <UAB64> <SAB64>;<BASE>;<MIN>;<UAB64> % LATIN SMALL LETTER INVERTED ALPHA
-<U0062> <S0062>;<BASE>;<MIN>;<U0062> % LATIN SMALL LETTER B
 <UFF42> <S0062>;<BASE>;<WIDE>;<UFF42> % FULLWIDTH LATIN SMALL LETTER B
 <U1DE8> <S0062>;<BASE>;<COMPAT>;<U1DE8> % COMBINING LATIN SMALL LETTER B
 <U249D> <S0062>;<BASE>;<COMPAT>;<U249D> % PARENTHESIZED LATIN SMALL LETTER B
@@ -64454,7 +64484,6 @@ endif
 <U0183> <S0183>;<BASE>;<MIN>;<U0183> % LATIN SMALL LETTER B WITH TOPBAR
 <UA7B5> <SA7B5>;<BASE>;<MIN>;<UA7B5> % LATIN SMALL LETTER BETA
 <U1DE9> <SA7B5>;<BASE>;<COMPAT>;<U1DE9> % COMBINING LATIN SMALL LETTER BETA
-<U0063> <S0063>;<BASE>;<MIN>;<U0063> % LATIN SMALL LETTER C
 <UFF43> <S0063>;<BASE>;<WIDE>;<UFF43> % FULLWIDTH LATIN SMALL LETTER C
 <U0368> <S0063>;<BASE>;<COMPAT>;<U0368> % COMBINING LATIN SMALL LETTER C
 <U217D> <S0063>;<BASE>;<COMPAT>;<U217D> % SMALL ROMAN NUMERAL ONE HUNDRED
@@ -64504,7 +64533,6 @@ endif
 <U1D9D> <S0255>;<BASE>;<MNN>;<U1D9D> % MODIFIER LETTER SMALL C WITH CURL
 <U2184> <S2184>;<BASE>;<MIN>;<U2184> % LATIN SMALL LETTER REVERSED C
 <UA73F> <SA73F>;<BASE>;<MIN>;<UA73F> % LATIN SMALL LETTER REVERSED C WITH DOT
-<U0064> <S0064>;<BASE>;<MIN>;<U0064> % LATIN SMALL LETTER D
 <UFF44> <S0064>;<BASE>;<WIDE>;<UFF44> % FULLWIDTH LATIN SMALL LETTER D
 <U0369> <S0064>;<BASE>;<COMPAT>;<U0369> % COMBINING LATIN SMALL LETTER D
 <U217E> <S0064>;<BASE>;<COMPAT>;<U217E> % SMALL ROMAN NUMERAL FIVE HUNDRED
@@ -64563,7 +64591,6 @@ endif
 <U0221> <S0221>;<BASE>;<MIN>;<U0221> % LATIN SMALL LETTER D WITH CURL
 <UA771> <SA771>;<BASE>;<MIN>;<UA771> % LATIN SMALL LETTER DUM
 <U1E9F> <S1E9F>;<BASE>;<MIN>;<U1E9F> % LATIN SMALL LETTER DELTA
-<U0065> <S0065>;<BASE>;<MIN>;<U0065> % LATIN SMALL LETTER E
 <UFF45> <S0065>;<BASE>;<WIDE>;<UFF45> % FULLWIDTH LATIN SMALL LETTER E
 <U0364> <S0065>;<BASE>;<COMPAT>;<U0364> % COMBINING LATIN SMALL LETTER E
 <U24A0> <S0065>;<BASE>;<COMPAT>;<U24A0> % PARENTHESIZED LATIN SMALL LETTER E
@@ -64641,7 +64668,6 @@ endif
 <U025E> <S025E>;<BASE>;<MIN>;<U025E> % LATIN SMALL LETTER CLOSED REVERSED OPEN E
 <U029A> <S029A>;<BASE>;<MIN>;<U029A> % LATIN SMALL LETTER CLOSED OPEN E
 <U0264> <S0264>;<BASE>;<MIN>;<U0264> % LATIN SMALL LETTER RAMS HORN
-<U0066> <S0066>;<BASE>;<MIN>;<U0066> % LATIN SMALL LETTER F
 <UFF46> <S0066>;<BASE>;<WIDE>;<UFF46> % FULLWIDTH LATIN SMALL LETTER F
 <U1DEB> <S0066>;<BASE>;<COMPAT>;<U1DEB> % COMBINING LATIN SMALL LETTER F
 <U24A1> <S0066>;<BASE>;<COMPAT>;<U24A1> % PARENTHESIZED LATIN SMALL LETTER F
@@ -64680,7 +64706,6 @@ endif
 <U0192> <S0192>;<BASE>;<MIN>;<U0192> % LATIN SMALL LETTER F WITH HOOK
 <U214E> <S214E>;<BASE>;<MIN>;<U214E> % TURNED SMALL F
 <UA7FB> <SA7FB>;<BASE>;<MIN>;<UA7FB> % LATIN EPIGRAPHIC LETTER REVERSED F
-<U0067> <S0067>;<BASE>;<MIN>;<U0067> % LATIN SMALL LETTER G
 <UFF47> <S0067>;<BASE>;<WIDE>;<UFF47> % FULLWIDTH LATIN SMALL LETTER G
 <U1DDA> <S0067>;<BASE>;<COMPAT>;<U1DDA> % COMBINING LATIN SMALL LETTER G
 <U24A2> <S0067>;<BASE>;<COMPAT>;<U24A2> % PARENTHESIZED LATIN SMALL LETTER G
@@ -64727,7 +64752,6 @@ endif
 <U0263> <S0263>;<BASE>;<MIN>;<U0263> % LATIN SMALL LETTER GAMMA
 <U02E0> <S0263>;<BASE>;<MNN>;<U02E0> % MODIFIER LETTER SMALL GAMMA
 <U01A3> <S01A3>;<BASE>;<MIN>;<U01A3> % LATIN SMALL LETTER OI
-<U0068> <S0068>;<BASE>;<MIN>;<U0068> % LATIN SMALL LETTER H
 <UFF48> <S0068>;<BASE>;<WIDE>;<UFF48> % FULLWIDTH LATIN SMALL LETTER H
 <U036A> <S0068>;<BASE>;<COMPAT>;<U036A> % COMBINING LATIN SMALL LETTER H
 <U24A3> <S0068>;<BASE>;<COMPAT>;<U24A3> % PARENTHESIZED LATIN SMALL LETTER H
@@ -64780,7 +64804,6 @@ endif
 <U0267> <S0267>;<BASE>;<MIN>;<U0267> % LATIN SMALL LETTER HENG WITH HOOK
 <U02BB> <S02BB>;<BASE>;<MIN>;<U02BB> % MODIFIER LETTER TURNED COMMA
 <U02BD> <S02BD>;<BASE>;<MIN>;<U02BD> % MODIFIER LETTER REVERSED COMMA
-<U0069> <S0069>;<BASE>;<MIN>;<U0069> % LATIN SMALL LETTER I
 <UFF49> <S0069>;<BASE>;<WIDE>;<UFF49> % FULLWIDTH LATIN SMALL LETTER I
 <U0365> <S0069>;<BASE>;<COMPAT>;<U0365> % COMBINING LATIN SMALL LETTER I
 <U2170> <S0069>;<BASE>;<COMPAT>;<U2170> % SMALL ROMAN NUMERAL ONE
@@ -64844,7 +64867,6 @@ endif
 <U0269> <S0269>;<BASE>;<MIN>;<U0269> % LATIN SMALL LETTER IOTA
 <U1DA5> <S0269>;<BASE>;<MNN>;<U1DA5> % MODIFIER LETTER SMALL IOTA
 <U1D7C> <S1D7C>;<BASE>;<MIN>;<U1D7C> % LATIN SMALL LETTER IOTA WITH STROKE
-<U006A> <S006A>;<BASE>;<MIN>;<U006A> % LATIN SMALL LETTER J
 <UFF4A> <S006A>;<BASE>;<WIDE>;<UFF4A> % FULLWIDTH LATIN SMALL LETTER J
 <U24A5> <S006A>;<BASE>;<COMPAT>;<U24A5> % PARENTHESIZED LATIN SMALL LETTER J
 <U2149> <S006A>;<BASE>;<FONT>;<U2149> % DOUBLE-STRUCK ITALIC SMALL J
@@ -64876,7 +64898,6 @@ endif
 <U025F> <S025F>;<BASE>;<MIN>;<U025F> % LATIN SMALL LETTER DOTLESS J WITH STROKE
 <U1DA1> <S025F>;<BASE>;<MNN>;<U1DA1> % MODIFIER LETTER SMALL DOTLESS J WITH STROKE
 <U0284> <S0284>;<BASE>;<MIN>;<U0284> % LATIN SMALL LETTER DOTLESS J WITH STROKE AND HOOK
-<U006B> <S006B>;<BASE>;<MIN>;<U006B> % LATIN SMALL LETTER K
 <UFF4B> <S006B>;<BASE>;<WIDE>;<UFF4B> % FULLWIDTH LATIN SMALL LETTER K
 <U1DDC> <S006B>;<BASE>;<COMPAT>;<U1DDC> % COMBINING LATIN SMALL LETTER K
 <U24A6> <S006B>;<BASE>;<COMPAT>;<U24A6> % PARENTHESIZED LATIN SMALL LETTER K
@@ -64926,7 +64947,6 @@ endif
 <UA743> <SA743>;<BASE>;<MIN>;<UA743> % LATIN SMALL LETTER K WITH DIAGONAL STROKE
 <UA745> <SA745>;<BASE>;<MIN>;<UA745> % LATIN SMALL LETTER K WITH STROKE AND DIAGONAL STROKE
 <U029E> <S029E>;<BASE>;<MIN>;<U029E> % LATIN SMALL LETTER TURNED K
-<U006C> <S006C>;<BASE>;<MIN>;<U006C> % LATIN SMALL LETTER L
 <UFF4C> <S006C>;<BASE>;<WIDE>;<UFF4C> % FULLWIDTH LATIN SMALL LETTER L
 <U1DDD> <S006C>;<BASE>;<COMPAT>;<U1DDD> % COMBINING LATIN SMALL LETTER L
 <U217C> <S006C>;<BASE>;<COMPAT>;<U217C> % SMALL ROMAN NUMERAL FIFTY
@@ -64996,7 +65016,6 @@ endif
 <UA781> <SA781>;<BASE>;<MIN>;<UA781> % LATIN SMALL LETTER TURNED L
 <U019B> <S019B>;<BASE>;<MIN>;<U019B> % LATIN SMALL LETTER LAMBDA WITH STROKE
 <U028E> <S028E>;<BASE>;<MIN>;<U028E> % LATIN SMALL LETTER TURNED Y
-<U006D> <S006D>;<BASE>;<MIN>;<U006D> % LATIN SMALL LETTER M
 <UFF4D> <S006D>;<BASE>;<WIDE>;<UFF4D> % FULLWIDTH LATIN SMALL LETTER M
 <U036B> <S006D>;<BASE>;<COMPAT>;<U036B> % COMBINING LATIN SMALL LETTER M
 <U217F> <S006D>;<BASE>;<COMPAT>;<U217F> % SMALL ROMAN NUMERAL ONE THOUSAND
@@ -65055,7 +65074,6 @@ endif
 <UA7FD> <SA7FD>;<BASE>;<MIN>;<UA7FD> % LATIN EPIGRAPHIC LETTER INVERTED M
 <UA7FF> <SA7FF>;<BASE>;<MIN>;<UA7FF> % LATIN EPIGRAPHIC LETTER ARCHAIC M
 <UA773> <SA773>;<BASE>;<MIN>;<UA773> % LATIN SMALL LETTER MUM
-<U006E> <S006E>;<BASE>;<MIN>;<U006E> % LATIN SMALL LETTER N
 <UFF4E> <S006E>;<BASE>;<WIDE>;<UFF4E> % FULLWIDTH LATIN SMALL LETTER N
 <U1DE0> <S006E>;<BASE>;<COMPAT>;<U1DE0> % COMBINING LATIN SMALL LETTER N
 <U24A9> <S006E>;<BASE>;<COMPAT>;<U24A9> % PARENTHESIZED LATIN SMALL LETTER N
@@ -65114,7 +65132,6 @@ endif
 <U014B> <S014B>;<BASE>;<MIN>;<U014B> % LATIN SMALL LETTER ENG
 <U1D51> <S014B>;<BASE>;<MNN>;<U1D51> % MODIFIER LETTER SMALL ENG
 <UAB3C> <SAB3C>;<BASE>;<MIN>;<UAB3C> % LATIN SMALL LETTER ENG WITH CROSSED-TAIL
-<U006F> <S006F>;<BASE>;<MIN>;<U006F> % LATIN SMALL LETTER O
 <UFF4F> <S006F>;<BASE>;<WIDE>;<UFF4F> % FULLWIDTH LATIN SMALL LETTER O
 <U0366> <S006F>;<BASE>;<COMPAT>;<U0366> % COMBINING LATIN SMALL LETTER O
 <U24AA> <S006F>;<BASE>;<COMPAT>;<U24AA> % PARENTHESIZED LATIN SMALL LETTER O
@@ -65213,7 +65230,6 @@ endif
 <U0223> <S0223>;<BASE>;<MIN>;<U0223> % LATIN SMALL LETTER OU
 <U1D3D> <S0223>;<BASE>;<MISCCAP>;<U1D3D> % MODIFIER LETTER CAPITAL OU
 <U1D15> <S1D15>;<BASE>;<MIN>;<U1D15> % LATIN LETTER SMALL CAPITAL OU
-<U0070> <S0070>;<BASE>;<MIN>;<U0070> % LATIN SMALL LETTER P
 <UFF50> <S0070>;<BASE>;<WIDE>;<UFF50> % FULLWIDTH LATIN SMALL LETTER P
 <U1DEE> <S0070>;<BASE>;<COMPAT>;<U1DEE> % COMBINING LATIN SMALL LETTER P
 <U24AB> <S0070>;<BASE>;<COMPAT>;<U24AB> % PARENTHESIZED LATIN SMALL LETTER P
@@ -65262,7 +65278,6 @@ endif
 <U0278> <S0278>;<BASE>;<MIN>;<U0278> % LATIN SMALL LETTER PHI
 <U1DB2> <S0278>;<BASE>;<MNN>;<U1DB2> % MODIFIER LETTER SMALL PHI
 <U2C77> <S2C77>;<BASE>;<MIN>;<U2C77> % LATIN SMALL LETTER TAILLESS PHI
-<U0071> <S0071>;<BASE>;<MIN>;<U0071> % LATIN SMALL LETTER Q
 <UFF51> <S0071>;<BASE>;<WIDE>;<UFF51> % FULLWIDTH LATIN SMALL LETTER Q
 <U24AC> <S0071>;<BASE>;<COMPAT>;<U24AC> % PARENTHESIZED LATIN SMALL LETTER Q
 <U0001D42A> <S0071>;<BASE>;<FONT>;<U0001D42A> % MATHEMATICAL BOLD SMALL Q
@@ -65285,7 +65300,6 @@ endif
 <U02A0> <S02A0>;<BASE>;<MIN>;<U02A0> % LATIN SMALL LETTER Q WITH HOOK
 <U024B> <S024B>;<BASE>;<MIN>;<U024B> % LATIN SMALL LETTER Q WITH HOOK TAIL
 <U0138> <S0138>;<BASE>;<MIN>;<U0138> % LATIN SMALL LETTER KRA
-<U0072> <S0072>;<BASE>;<MIN>;<U0072> % LATIN SMALL LETTER R
 <UFF52> <S0072>;<BASE>;<WIDE>;<UFF52> % FULLWIDTH LATIN SMALL LETTER R
 <U036C> <S0072>;<BASE>;<COMPAT>;<U036C> % COMBINING LATIN SMALL LETTER R
 <U1DCA> <S0072>;<BASE>;<COMPAT>;<U1DCA> % COMBINING LATIN SMALL LETTER R BELOW
@@ -65354,7 +65368,6 @@ endif
 <UA775> <SA775>;<BASE>;<MIN>;<UA775> % LATIN SMALL LETTER RUM
 <UA776> <SA776>;<BASE>;<MIN>;<UA776> % LATIN LETTER SMALL CAPITAL RUM
 <UA75D> <SA75D>;<BASE>;<MIN>;<UA75D> % LATIN SMALL LETTER RUM ROTUNDA
-<U0073> <S0073>;<BASE>;<MIN>;<U0073> % LATIN SMALL LETTER S
 <UFF53> <S0073>;<BASE>;<WIDE>;<UFF53> % FULLWIDTH LATIN SMALL LETTER S
 <U1DE4> <S0073>;<BASE>;<COMPAT>;<U1DE4> % COMBINING LATIN SMALL LETTER S
 <U24AE> <S0073>;<BASE>;<COMPAT>;<U24AE> % PARENTHESIZED LATIN SMALL LETTER S
@@ -65417,7 +65430,6 @@ endif
 <U0285> <S0285>;<BASE>;<MIN>;<U0285> % LATIN SMALL LETTER SQUAT REVERSED ESH
 <U1D98> <S1D98>;<BASE>;<MIN>;<U1D98> % LATIN SMALL LETTER ESH WITH RETROFLEX HOOK
 <U0286> <S0286>;<BASE>;<MIN>;<U0286> % LATIN SMALL LETTER ESH WITH CURL
-<U0074> <S0074>;<BASE>;<MIN>;<U0074> % LATIN SMALL LETTER T
 <UFF54> <S0074>;<BASE>;<WIDE>;<UFF54> % FULLWIDTH LATIN SMALL LETTER T
 <U036D> <S0074>;<BASE>;<COMPAT>;<U036D> % COMBINING LATIN SMALL LETTER T
 <U24AF> <S0074>;<BASE>;<COMPAT>;<U24AF> % PARENTHESIZED LATIN SMALL LETTER T
@@ -65467,7 +65479,6 @@ endif
 <U0236> <S0236>;<BASE>;<MIN>;<U0236> % LATIN SMALL LETTER T WITH CURL
 <UA777> <SA777>;<BASE>;<MIN>;<UA777> % LATIN SMALL LETTER TUM
 <U0287> <S0287>;<BASE>;<MIN>;<U0287> % LATIN SMALL LETTER TURNED T
-<U0075> <S0075>;<BASE>;<MIN>;<U0075> % LATIN SMALL LETTER U
 <UFF55> <S0075>;<BASE>;<WIDE>;<UFF55> % FULLWIDTH LATIN SMALL LETTER U
 <U0367> <S0075>;<BASE>;<COMPAT>;<U0367> % COMBINING LATIN SMALL LETTER U
 <U24B0> <S0075>;<BASE>;<COMPAT>;<U24B0> % PARENTHESIZED LATIN SMALL LETTER U
@@ -65552,7 +65563,6 @@ endif
 <U028A> <S028A>;<BASE>;<MIN>;<U028A> % LATIN SMALL LETTER UPSILON
 <U1DB7> <S028A>;<BASE>;<MNN>;<U1DB7> % MODIFIER LETTER SMALL UPSILON
 <U1D7F> <S1D7F>;<BASE>;<MIN>;<U1D7F> % LATIN SMALL LETTER UPSILON WITH STROKE
-<U0076> <S0076>;<BASE>;<MIN>;<U0076> % LATIN SMALL LETTER V
 <UFF56> <S0076>;<BASE>;<WIDE>;<UFF56> % FULLWIDTH LATIN SMALL LETTER V
 <U036E> <S0076>;<BASE>;<COMPAT>;<U036E> % COMBINING LATIN SMALL LETTER V
 <U2174> <S0076>;<BASE>;<COMPAT>;<U2174> % SMALL ROMAN NUMERAL FIVE
@@ -65593,7 +65603,6 @@ endif
 <U1EFD> <S1EFD>;<BASE>;<MIN>;<U1EFD> % LATIN SMALL LETTER MIDDLE-WELSH V
 <U028C> <S028C>;<BASE>;<MIN>;<U028C> % LATIN SMALL LETTER TURNED V
 <U1DBA> <S028C>;<BASE>;<MNN>;<U1DBA> % MODIFIER LETTER SMALL TURNED V
-<U0077> <S0077>;<BASE>;<MIN>;<U0077> % LATIN SMALL LETTER W
 <UFF57> <S0077>;<BASE>;<WIDE>;<UFF57> % FULLWIDTH LATIN SMALL LETTER W
 <U1DF1> <S0077>;<BASE>;<COMPAT>;<U1DF1> % COMBINING LATIN SMALL LETTER W
 <U24B2> <S0077>;<BASE>;<COMPAT>;<U24B2> % PARENTHESIZED LATIN SMALL LETTER W
@@ -65627,7 +65636,6 @@ endif
 <U1D21> <S1D21>;<BASE>;<MIN>;<U1D21> % LATIN LETTER SMALL CAPITAL W
 <U2C73> <S2C73>;<BASE>;<MIN>;<U2C73> % LATIN SMALL LETTER W WITH HOOK
 <U028D> <S028D>;<BASE>;<MIN>;<U028D> % LATIN SMALL LETTER TURNED W
-<U0078> <S0078>;<BASE>;<MIN>;<U0078> % LATIN SMALL LETTER X
 <UFF58> <S0078>;<BASE>;<WIDE>;<UFF58> % FULLWIDTH LATIN SMALL LETTER X
 <U036F> <S0078>;<BASE>;<COMPAT>;<U036F> % COMBINING LATIN SMALL LETTER X
 <U2179> <S0078>;<BASE>;<COMPAT>;<U2179> % SMALL ROMAN NUMERAL TEN
@@ -65660,7 +65668,6 @@ endif
 <UAB53> <SAB53>;<BASE>;<MIN>;<UAB53> % LATIN SMALL LETTER CHI
 <UAB54> <SAB54>;<BASE>;<MIN>;<UAB54> % LATIN SMALL LETTER CHI WITH LOW RIGHT RING
 <UAB55> <SAB55>;<BASE>;<MIN>;<UAB55> % LATIN SMALL LETTER CHI WITH LOW LEFT SERIF
-<U0079> <S0079>;<BASE>;<MIN>;<U0079> % LATIN SMALL LETTER Y
 <UFF59> <S0079>;<BASE>;<WIDE>;<UFF59> % FULLWIDTH LATIN SMALL LETTER Y
 <U24B4> <S0079>;<BASE>;<COMPAT>;<U24B4> % PARENTHESIZED LATIN SMALL LETTER Y
 <U0001D432> <S0079>;<BASE>;<FONT>;<U0001D432> % MATHEMATICAL BOLD SMALL Y
@@ -65694,7 +65701,6 @@ endif
 <U1EFF> <S1EFF>;<BASE>;<MIN>;<U1EFF> % LATIN SMALL LETTER Y WITH LOOP
 <UAB5A> <SAB5A>;<BASE>;<MIN>;<UAB5A> % LATIN SMALL LETTER Y WITH SHORT RIGHT LEG
 <U021D> <S021D>;<BASE>;<MIN>;<U021D> % LATIN SMALL LETTER YOGH
-<U007A> <S007A>;<BASE>;<MIN>;<U007A> % LATIN SMALL LETTER Z
 <UFF5A> <S007A>;<BASE>;<WIDE>;<UFF5A> % FULLWIDTH LATIN SMALL LETTER Z
 <U1DE6> <S007A>;<BASE>;<COMPAT>;<U1DE6> % COMBINING LATIN SMALL LETTER Z
 <U24B5> <S007A>;<BASE>;<COMPAT>;<U24B5> % PARENTHESIZED LATIN SMALL LETTER Z
@@ -65796,7 +65802,35 @@ endif
 <U0001D736> <S03B1>;<BASE>;<FONT>;<U0001D736> % MATHEMATICAL BOLD ITALIC SMALL ALPHA
 <U0001D770> <S03B1>;<BASE>;<FONT>;<U0001D770> % MATHEMATICAL SANS-SERIF BOLD SMALL ALPHA
 <U0001D7AA> <S03B1>;<BASE>;<FONT>;<U0001D7AA> % MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL ALPHA
+% Implement rational range for [A-Z] in regular expressions.
+% We order the collation element order to support rational ranges.
+% Collation is unaffected because the 4-level weights remain the same.
 <U0041> <S0061>;<BASE>;<CAP>;<U0041> % LATIN CAPITAL LETTER A
+<U0042> <S0062>;<BASE>;<CAP>;<U0042> % LATIN CAPITAL LETTER B
+<U0043> <S0063>;<BASE>;<CAP>;<U0043> % LATIN CAPITAL LETTER C
+<U0044> <S0064>;<BASE>;<CAP>;<U0044> % LATIN CAPITAL LETTER D
+<U0045> <S0065>;<BASE>;<CAP>;<U0045> % LATIN CAPITAL LETTER E
+<U0046> <S0066>;<BASE>;<CAP>;<U0046> % LATIN CAPITAL LETTER F
+<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G
+<U0048> <S0068>;<BASE>;<CAP>;<U0048> % LATIN CAPITAL LETTER H
+<U0049> <S0069>;<BASE>;<CAP>;<U0049> % LATIN CAPITAL LETTER I
+<U004A> <S006A>;<BASE>;<CAP>;<U004A> % LATIN CAPITAL LETTER J
+<U004B> <S006B>;<BASE>;<CAP>;<U004B> % LATIN CAPITAL LETTER K
+<U004C> <S006C>;<BASE>;<CAP>;<U004C> % LATIN CAPITAL LETTER L
+<U004D> <S006D>;<BASE>;<CAP>;<U004D> % LATIN CAPITAL LETTER M
+<U004E> <S006E>;<BASE>;<CAP>;<U004E> % LATIN CAPITAL LETTER N
+<U004F> <S006F>;<BASE>;<CAP>;<U004F> % LATIN CAPITAL LETTER O
+<U0050> <S0070>;<BASE>;<CAP>;<U0050> % LATIN CAPITAL LETTER P
+<U0051> <S0071>;<BASE>;<CAP>;<U0051> % LATIN CAPITAL LETTER Q
+<U0052> <S0072>;<BASE>;<CAP>;<U0052> % LATIN CAPITAL LETTER R
+<U0053> <S0073>;<BASE>;<CAP>;<U0053> % LATIN CAPITAL LETTER S
+<U0054> <S0074>;<BASE>;<CAP>;<U0054> % LATIN CAPITAL LETTER T
+<U0055> <S0075>;<BASE>;<CAP>;<U0055> % LATIN CAPITAL LETTER U
+<U0056> <S0076>;<BASE>;<CAP>;<U0056> % LATIN CAPITAL LETTER V
+<U0057> <S0077>;<BASE>;<CAP>;<U0057> % LATIN CAPITAL LETTER W
+<U0058> <S0078>;<BASE>;<CAP>;<U0058> % LATIN CAPITAL LETTER X
+<U0059> <S0079>;<BASE>;<CAP>;<U0059> % LATIN CAPITAL LETTER Y
+<U005A> <S007A>;<BASE>;<CAP>;<U005A> % LATIN CAPITAL LETTER Z
 <UFF21> <S0061>;<BASE>;<WIDECAP>;<UFF21> % FULLWIDTH LATIN CAPITAL LETTER A
 <U0001F110> <S0061>;<BASE>;<COMPATCAP>;<U0001F110> % PARENTHESIZED LATIN CAPITAL LETTER A
 <U0001D400> <S0061>;<BASE>;<FONTCAP>;<U0001D400> % MATHEMATICAL BOLD CAPITAL A
@@ -65860,7 +65894,6 @@ endif
 <U2C6F> <S0250>;<BASE>;<CAP>;<U2C6F> % LATIN CAPITAL LETTER TURNED A
 <U2C6D> <S0251>;<BASE>;<CAP>;<U2C6D> % LATIN CAPITAL LETTER ALPHA
 <U2C70> <S0252>;<BASE>;<CAP>;<U2C70> % LATIN CAPITAL LETTER TURNED ALPHA
-<U0042> <S0062>;<BASE>;<CAP>;<U0042> % LATIN CAPITAL LETTER B
 <UFF22> <S0062>;<BASE>;<WIDECAP>;<UFF22> % FULLWIDTH LATIN CAPITAL LETTER B
 <U0001F111> <S0062>;<BASE>;<COMPATCAP>;<U0001F111> % PARENTHESIZED LATIN CAPITAL LETTER B
 <U212C> <S0062>;<BASE>;<FONTCAP>;<U212C> % SCRIPT CAPITAL B
@@ -65888,7 +65921,6 @@ endif
 <U0181> <S0253>;<BASE>;<CAP>;<U0181> % LATIN CAPITAL LETTER B WITH HOOK
 <U0182> <S0183>;<BASE>;<CAP>;<U0182> % LATIN CAPITAL LETTER B WITH TOPBAR
 <UA7B4> <SA7B5>;<BASE>;<CAP>;<UA7B4> % LATIN CAPITAL LETTER BETA
-<U0043> <S0063>;<BASE>;<CAP>;<U0043> % LATIN CAPITAL LETTER C
 <UFF23> <S0063>;<BASE>;<WIDECAP>;<UFF23> % FULLWIDTH LATIN CAPITAL LETTER C
 <U216D> <S0063>;<BASE>;<COMPATCAP>;<U216D> % ROMAN NUMERAL ONE HUNDRED
 <U0001F112> <S0063>;<BASE>;<COMPATCAP>;<U0001F112> % PARENTHESIZED LATIN CAPITAL LETTER C
@@ -65921,7 +65953,6 @@ endif
 <U0187> <S0188>;<BASE>;<CAP>;<U0187> % LATIN CAPITAL LETTER C WITH HOOK
 <U2183> <S2184>;<BASE>;<CAP>;<U2183> % ROMAN NUMERAL REVERSED ONE HUNDRED
 <UA73E> <SA73F>;<BASE>;<CAP>;<UA73E> % LATIN CAPITAL LETTER REVERSED C WITH DOT
-<U0044> <S0064>;<BASE>;<CAP>;<U0044> % LATIN CAPITAL LETTER D
 <UFF24> <S0064>;<BASE>;<WIDECAP>;<UFF24> % FULLWIDTH LATIN CAPITAL LETTER D
 <U216E> <S0064>;<BASE>;<COMPATCAP>;<U216E> % ROMAN NUMERAL FIVE HUNDRED
 <U0001F113> <S0064>;<BASE>;<COMPATCAP>;<U0001F113> % PARENTHESIZED LATIN CAPITAL LETTER D
@@ -65959,7 +65990,6 @@ endif
 <U0189> <S0256>;<BASE>;<CAP>;<U0189> % LATIN CAPITAL LETTER AFRICAN D
 <U018A> <S0257>;<BASE>;<CAP>;<U018A> % LATIN CAPITAL LETTER D WITH HOOK
 <U018B> <S018C>;<BASE>;<CAP>;<U018B> % LATIN CAPITAL LETTER D WITH TOPBAR
-<U0045> <S0065>;<BASE>;<CAP>;<U0045> % LATIN CAPITAL LETTER E
 <UFF25> <S0065>;<BASE>;<WIDECAP>;<UFF25> % FULLWIDTH LATIN CAPITAL LETTER E
 <U0001F114> <S0065>;<BASE>;<COMPATCAP>;<U0001F114> % PARENTHESIZED LATIN CAPITAL LETTER E
 <U2130> <S0065>;<BASE>;<FONTCAP>;<U2130> % SCRIPT CAPITAL E
@@ -66010,7 +66040,6 @@ endif
 <U0190> <S025B>;<BASE>;<CAP>;<U0190> % LATIN CAPITAL LETTER OPEN E
 <U2107> <S025B>;<BASE>;<COMPATCAP>;<U2107> % EULER CONSTANT
 <UA7AB> <S025C>;<BASE>;<CAP>;<UA7AB> % LATIN CAPITAL LETTER REVERSED OPEN E
-<U0046> <S0066>;<BASE>;<CAP>;<U0046> % LATIN CAPITAL LETTER F
 <UFF26> <S0066>;<BASE>;<WIDECAP>;<UFF26> % FULLWIDTH LATIN CAPITAL LETTER F
 <U0001F115> <S0066>;<BASE>;<COMPATCAP>;<U0001F115> % PARENTHESIZED LATIN CAPITAL LETTER F
 <U2131> <S0066>;<BASE>;<FONTCAP>;<U2131> % SCRIPT CAPITAL F
@@ -66035,7 +66064,6 @@ endif
 <UA798> <SA799>;<BASE>;<CAP>;<UA798> % LATIN CAPITAL LETTER F WITH STROKE
 <U0191> <S0192>;<BASE>;<CAP>;<U0191> % LATIN CAPITAL LETTER F WITH HOOK
 <U2132> <S214E>;<BASE>;<CAP>;<U2132> % TURNED CAPITAL F
-<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G
 <UFF27> <S0067>;<BASE>;<WIDECAP>;<UFF27> % FULLWIDTH LATIN CAPITAL LETTER G
 <U0001F116> <S0067>;<BASE>;<COMPATCAP>;<U0001F116> % PARENTHESIZED LATIN CAPITAL LETTER G
 <U0001D406> <S0067>;<BASE>;<FONTCAP>;<U0001D406> % MATHEMATICAL BOLD CAPITAL G
@@ -66071,7 +66099,6 @@ endif
 <UA77E> <SA77F>;<BASE>;<CAP>;<UA77E> % LATIN CAPITAL LETTER TURNED INSULAR G
 <U0194> <S0263>;<BASE>;<CAP>;<U0194> % LATIN CAPITAL LETTER GAMMA
 <U01A2> <S01A3>;<BASE>;<CAP>;<U01A2> % LATIN CAPITAL LETTER OI
-<U0048> <S0068>;<BASE>;<CAP>;<U0048> % LATIN CAPITAL LETTER H
 <UFF28> <S0068>;<BASE>;<WIDECAP>;<UFF28> % FULLWIDTH LATIN CAPITAL LETTER H
 <U0001F117> <S0068>;<BASE>;<COMPATCAP>;<U0001F117> % PARENTHESIZED LATIN CAPITAL LETTER H
 <U210B> <S0068>;<BASE>;<FONTCAP>;<U210B> % SCRIPT CAPITAL H
@@ -66104,7 +66131,6 @@ endif
 <U2C67> <S2C68>;<BASE>;<CAP>;<U2C67> % LATIN CAPITAL LETTER H WITH DESCENDER
 <U2C75> <S2C76>;<BASE>;<CAP>;<U2C75> % LATIN CAPITAL LETTER HALF H
 <UA726> <SA727>;<BASE>;<CAP>;<UA726> % LATIN CAPITAL LETTER HENG
-<U0049> <S0069>;<BASE>;<CAP>;<U0049> % LATIN CAPITAL LETTER I
 <UFF29> <S0069>;<BASE>;<WIDECAP>;<UFF29> % FULLWIDTH LATIN CAPITAL LETTER I
 <U2160> <S0069>;<BASE>;<COMPATCAP>;<U2160> % ROMAN NUMERAL ONE
 <U0001F118> <S0069>;<BASE>;<COMPATCAP>;<U0001F118> % PARENTHESIZED LATIN CAPITAL LETTER I
@@ -66149,7 +66175,6 @@ endif
 <UA7AE> <S026A>;<BASE>;<CAP>;<UA7AE> % LATIN CAPITAL LETTER SMALL CAPITAL I
 <U0197> <S0268>;<BASE>;<CAP>;<U0197> % LATIN CAPITAL LETTER I WITH STROKE
 <U0196> <S0269>;<BASE>;<CAP>;<U0196> % LATIN CAPITAL LETTER IOTA
-<U004A> <S006A>;<BASE>;<CAP>;<U004A> % LATIN CAPITAL LETTER J
 <UFF2A> <S006A>;<BASE>;<WIDECAP>;<UFF2A> % FULLWIDTH LATIN CAPITAL LETTER J
 <U0001F119> <S006A>;<BASE>;<COMPATCAP>;<U0001F119> % PARENTHESIZED LATIN CAPITAL LETTER J
 <U0001D409> <S006A>;<BASE>;<FONTCAP>;<U0001D409> % MATHEMATICAL BOLD CAPITAL J
@@ -66172,7 +66197,6 @@ endif
 <U0134> <S006A>;"<BASE><CIRCF>";"<CAP><MIN>";<U0134> % LATIN CAPITAL LETTER J WITH CIRCUMFLEX
 <U0248> <S0249>;<BASE>;<CAP>;<U0248> % LATIN CAPITAL LETTER J WITH STROKE
 <UA7B2> <S029D>;<BASE>;<CAP>;<UA7B2> % LATIN CAPITAL LETTER J WITH CROSSED-TAIL
-<U004B> <S006B>;<BASE>;<CAP>;<U004B> % LATIN CAPITAL LETTER K
 <U212A> <S006B>;<BASE>;<CAP>;<U212A> % KELVIN SIGN
 <UFF2B> <S006B>;<BASE>;<WIDECAP>;<UFF2B> % FULLWIDTH LATIN CAPITAL LETTER K
 <U0001F11A> <S006B>;<BASE>;<COMPATCAP>;<U0001F11A> % PARENTHESIZED LATIN CAPITAL LETTER K
@@ -66206,7 +66230,6 @@ endif
 <UA742> <SA743>;<BASE>;<CAP>;<UA742> % LATIN CAPITAL LETTER K WITH DIAGONAL STROKE
 <UA744> <SA745>;<BASE>;<CAP>;<UA744> % LATIN CAPITAL LETTER K WITH STROKE AND DIAGONAL STROKE
 <UA7B0> <S029E>;<BASE>;<CAP>;<UA7B0> % LATIN CAPITAL LETTER TURNED K
-<U004C> <S006C>;<BASE>;<CAP>;<U004C> % LATIN CAPITAL LETTER L
 <UFF2C> <S006C>;<BASE>;<WIDECAP>;<UFF2C> % FULLWIDTH LATIN CAPITAL LETTER L
 <U216C> <S006C>;<BASE>;<COMPATCAP>;<U216C> % ROMAN NUMERAL FIFTY
 <U0001F11B> <S006C>;<BASE>;<COMPATCAP>;<U0001F11B> % PARENTHESIZED LATIN CAPITAL LETTER L
@@ -66249,7 +66272,6 @@ endif
 <U2C62> <S026B>;<BASE>;<CAP>;<U2C62> % LATIN CAPITAL LETTER L WITH MIDDLE TILDE
 <UA7AD> <S026C>;<BASE>;<CAP>;<UA7AD> % LATIN CAPITAL LETTER L WITH BELT
 <UA780> <SA781>;<BASE>;<CAP>;<UA780> % LATIN CAPITAL LETTER TURNED L
-<U004D> <S006D>;<BASE>;<CAP>;<U004D> % LATIN CAPITAL LETTER M
 <UFF2D> <S006D>;<BASE>;<WIDECAP>;<UFF2D> % FULLWIDTH LATIN CAPITAL LETTER M
 <U216F> <S006D>;<BASE>;<COMPATCAP>;<U216F> % ROMAN NUMERAL ONE THOUSAND
 <U0001F11C> <S006D>;<BASE>;<COMPATCAP>;<U0001F11C> % PARENTHESIZED LATIN CAPITAL LETTER M
@@ -66275,7 +66297,6 @@ endif
 <U1E42> <S006D>;"<BASE><POINS>";"<CAP><MIN>";<U1E42> % LATIN CAPITAL LETTER M WITH DOT BELOW
 <U1DDF> <S1D0D>;<BASE>;<COMPAT>;<U1DDF> % COMBINING LATIN LETTER SMALL CAPITAL M
 <U2C6E> <S0271>;<BASE>;<CAP>;<U2C6E> % LATIN CAPITAL LETTER M WITH HOOK
-<U004E> <S006E>;<BASE>;<CAP>;<U004E> % LATIN CAPITAL LETTER N
 <UFF2E> <S006E>;<BASE>;<WIDECAP>;<UFF2E> % FULLWIDTH LATIN CAPITAL LETTER N
 <U0001F11D> <S006E>;<BASE>;<COMPATCAP>;<U0001F11D> % PARENTHESIZED LATIN CAPITAL LETTER N
 <U2115> <S006E>;<BASE>;<FONTCAP>;<U2115> % DOUBLE-STRUCK CAPITAL N
@@ -66312,7 +66333,6 @@ endif
 <U0220> <S019E>;<BASE>;<CAP>;<U0220> % LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
 <UA790> <SA791>;<BASE>;<CAP>;<UA790> % LATIN CAPITAL LETTER N WITH DESCENDER
 <U014A> <S014B>;<BASE>;<CAP>;<U014A> % LATIN CAPITAL LETTER ENG
-<U004F> <S006F>;<BASE>;<CAP>;<U004F> % LATIN CAPITAL LETTER O
 <UFF2F> <S006F>;<BASE>;<WIDECAP>;<UFF2F> % FULLWIDTH LATIN CAPITAL LETTER O
 <U0001F11E> <S006F>;<BASE>;<COMPATCAP>;<U0001F11E> % PARENTHESIZED LATIN CAPITAL LETTER O
 <U0001D40E> <S006F>;<BASE>;<FONTCAP>;<U0001D40E> % MATHEMATICAL BOLD CAPITAL O
@@ -66377,7 +66397,6 @@ endif
 <UA74A> <SA74B>;<BASE>;<CAP>;<UA74A> % LATIN CAPITAL LETTER O WITH LONG STROKE OVERLAY
 <UA7B6> <SA7B7>;<BASE>;<CAP>;<UA7B6> % LATIN CAPITAL LETTER OMEGA
 <U0222> <S0223>;<BASE>;<CAP>;<U0222> % LATIN CAPITAL LETTER OU
-<U0050> <S0070>;<BASE>;<CAP>;<U0050> % LATIN CAPITAL LETTER P
 <UFF30> <S0070>;<BASE>;<WIDECAP>;<UFF30> % FULLWIDTH LATIN CAPITAL LETTER P
 <U0001F11F> <S0070>;<BASE>;<COMPATCAP>;<U0001F11F> % PARENTHESIZED LATIN CAPITAL LETTER P
 <U2119> <S0070>;<BASE>;<FONTCAP>;<U2119> % DOUBLE-STRUCK CAPITAL P
@@ -66405,7 +66424,6 @@ endif
 <U01A4> <S01A5>;<BASE>;<CAP>;<U01A4> % LATIN CAPITAL LETTER P WITH HOOK
 <UA752> <SA753>;<BASE>;<CAP>;<UA752> % LATIN CAPITAL LETTER P WITH FLOURISH
 <UA754> <SA755>;<BASE>;<CAP>;<UA754> % LATIN CAPITAL LETTER P WITH SQUIRREL TAIL
-<U0051> <S0071>;<BASE>;<CAP>;<U0051> % LATIN CAPITAL LETTER Q
 <UFF31> <S0071>;<BASE>;<WIDECAP>;<UFF31> % FULLWIDTH LATIN CAPITAL LETTER Q
 <U0001F120> <S0071>;<BASE>;<COMPATCAP>;<U0001F120> % PARENTHESIZED LATIN CAPITAL LETTER Q
 <U211A> <S0071>;<BASE>;<FONTCAP>;<U211A> % DOUBLE-STRUCK CAPITAL Q
@@ -66428,7 +66446,6 @@ endif
 <UA756> <SA757>;<BASE>;<CAP>;<UA756> % LATIN CAPITAL LETTER Q WITH STROKE THROUGH DESCENDER
 <UA758> <SA759>;<BASE>;<CAP>;<UA758> % LATIN CAPITAL LETTER Q WITH DIAGONAL STROKE
 <U024A> <S024B>;<BASE>;<CAP>;<U024A> % LATIN CAPITAL LETTER SMALL Q WITH HOOK TAIL
-<U0052> <S0072>;<BASE>;<CAP>;<U0052> % LATIN CAPITAL LETTER R
 <UFF32> <S0072>;<BASE>;<WIDECAP>;<UFF32> % FULLWIDTH LATIN CAPITAL LETTER R
 <U0001F121> <S0072>;<BASE>;<COMPATCAP>;<U0001F121> % PARENTHESIZED LATIN CAPITAL LETTER R
 <U211B> <S0072>;<BASE>;<FONTCAP>;<U211B> % SCRIPT CAPITAL R
@@ -66466,7 +66483,6 @@ endif
 <U024C> <S024D>;<BASE>;<CAP>;<U024C> % LATIN CAPITAL LETTER R WITH STROKE
 <U2C64> <S027D>;<BASE>;<CAP>;<U2C64> % LATIN CAPITAL LETTER R WITH TAIL
 <UA75C> <SA75D>;<BASE>;<CAP>;<UA75C> % LATIN CAPITAL LETTER RUM ROTUNDA
-<U0053> <S0073>;<BASE>;<CAP>;<U0053> % LATIN CAPITAL LETTER S
 <UFF33> <S0073>;<BASE>;<WIDECAP>;<UFF33> % FULLWIDTH LATIN CAPITAL LETTER S
 <U0001F122> <S0073>;<BASE>;<COMPATCAP>;<U0001F122> % PARENTHESIZED LATIN CAPITAL LETTER S
 <U0001F12A> <S0073>;<BASE>;<COMPATCAP>;<U0001F12A> % TORTOISE SHELL BRACKETED LATIN CAPITAL LETTER S
@@ -66502,7 +66518,6 @@ endif
 <U1E9E> "<S0073><S0073>";"<BASE><VRNT1><BASE>";"<COMPATCAP><COMPAT><COMPATCAP>";<U1E9E> % LATIN CAPITAL LETTER SHARP S
 <U2C7E> <S023F>;<BASE>;<CAP>;<U2C7E> % LATIN CAPITAL LETTER S WITH SWASH TAIL
 <U01A9> <S0283>;<BASE>;<CAP>;<U01A9> % LATIN CAPITAL LETTER ESH
-<U0054> <S0074>;<BASE>;<CAP>;<U0054> % LATIN CAPITAL LETTER T
 <UFF34> <S0074>;<BASE>;<WIDECAP>;<UFF34> % FULLWIDTH LATIN CAPITAL LETTER T
 <U0001F123> <S0074>;<BASE>;<COMPATCAP>;<U0001F123> % PARENTHESIZED LATIN CAPITAL LETTER T
 <U0001D413> <S0074>;<BASE>;<FONTCAP>;<U0001D413> % MATHEMATICAL BOLD CAPITAL T
@@ -66536,7 +66551,6 @@ endif
 <U01AC> <S01AD>;<BASE>;<CAP>;<U01AC> % LATIN CAPITAL LETTER T WITH HOOK
 <U01AE> <S0288>;<BASE>;<CAP>;<U01AE> % LATIN CAPITAL LETTER T WITH RETROFLEX HOOK
 <UA7B1> <S0287>;<BASE>;<CAP>;<UA7B1> % LATIN CAPITAL LETTER TURNED T
-<U0055> <S0075>;<BASE>;<CAP>;<U0055> % LATIN CAPITAL LETTER U
 <UFF35> <S0075>;<BASE>;<WIDECAP>;<UFF35> % FULLWIDTH LATIN CAPITAL LETTER U
 <U0001F124> <S0075>;<BASE>;<COMPATCAP>;<U0001F124> % PARENTHESIZED LATIN CAPITAL LETTER U
 <U0001D414> <S0075>;<BASE>;<FONTCAP>;<U0001D414> % MATHEMATICAL BOLD CAPITAL U
@@ -66591,7 +66605,6 @@ endif
 <UA78D> <S0265>;<BASE>;<CAP>;<UA78D> % LATIN CAPITAL LETTER TURNED H
 <U019C> <S026F>;<BASE>;<CAP>;<U019C> % LATIN CAPITAL LETTER TURNED M
 <U01B1> <S028A>;<BASE>;<CAP>;<U01B1> % LATIN CAPITAL LETTER UPSILON
-<U0056> <S0076>;<BASE>;<CAP>;<U0056> % LATIN CAPITAL LETTER V
 <UFF36> <S0076>;<BASE>;<WIDECAP>;<UFF36> % FULLWIDTH LATIN CAPITAL LETTER V
 <U2164> <S0076>;<BASE>;<COMPATCAP>;<U2164> % ROMAN NUMERAL FIVE
 <U0001F125> <S0076>;<BASE>;<COMPATCAP>;<U0001F125> % PARENTHESIZED LATIN CAPITAL LETTER V
@@ -66622,7 +66635,6 @@ endif
 <U01B2> <S028B>;<BASE>;<CAP>;<U01B2> % LATIN CAPITAL LETTER V WITH HOOK
 <U1EFC> <S1EFD>;<BASE>;<CAP>;<U1EFC> % LATIN CAPITAL LETTER MIDDLE-WELSH V
 <U0245> <S028C>;<BASE>;<CAP>;<U0245> % LATIN CAPITAL LETTER TURNED V
-<U0057> <S0077>;<BASE>;<CAP>;<U0057> % LATIN CAPITAL LETTER W
 <UFF37> <S0077>;<BASE>;<WIDECAP>;<UFF37> % FULLWIDTH LATIN CAPITAL LETTER W
 <U0001F126> <S0077>;<BASE>;<COMPATCAP>;<U0001F126> % PARENTHESIZED LATIN CAPITAL LETTER W
 <U0001D416> <S0077>;<BASE>;<FONTCAP>;<U0001D416> % MATHEMATICAL BOLD CAPITAL W
@@ -66649,7 +66661,6 @@ endif
 <U1E86> <S0077>;"<BASE><POINT>";"<CAP><MIN>";<U1E86> % LATIN CAPITAL LETTER W WITH DOT ABOVE
 <U1E88> <S0077>;"<BASE><POINS>";"<CAP><MIN>";<U1E88> % LATIN CAPITAL LETTER W WITH DOT BELOW
 <U2C72> <S2C73>;<BASE>;<CAP>;<U2C72> % LATIN CAPITAL LETTER W WITH HOOK
-<U0058> <S0078>;<BASE>;<CAP>;<U0058> % LATIN CAPITAL LETTER X
 <UFF38> <S0078>;<BASE>;<WIDECAP>;<UFF38> % FULLWIDTH LATIN CAPITAL LETTER X
 <U2169> <S0078>;<BASE>;<COMPATCAP>;<U2169> % ROMAN NUMERAL TEN
 <U0001F127> <S0078>;<BASE>;<COMPATCAP>;<U0001F127> % PARENTHESIZED LATIN CAPITAL LETTER X
@@ -66675,7 +66686,6 @@ endif
 <U216A> "<S0078><S0069>";"<BASE><BASE>";"<COMPATCAP><COMPATCAP>";<U216A> % ROMAN NUMERAL ELEVEN
 <U216B> "<S0078><S0069><S0069>";"<BASE><BASE><BASE>";"<COMPATCAP><COMPATCAP><COMPATCAP>";<U216B> % ROMAN NUMERAL TWELVE
 <UA7B3> <SAB53>;<BASE>;<CAP>;<UA7B3> % LATIN CAPITAL LETTER CHI
-<U0059> <S0079>;<BASE>;<CAP>;<U0059> % LATIN CAPITAL LETTER Y
 <UFF39> <S0079>;<BASE>;<WIDECAP>;<UFF39> % FULLWIDTH LATIN CAPITAL LETTER Y
 <U0001F128> <S0079>;<BASE>;<COMPATCAP>;<U0001F128> % PARENTHESIZED LATIN CAPITAL LETTER Y
 <U0001D418> <S0079>;<BASE>;<FONTCAP>;<U0001D418> % MATHEMATICAL BOLD CAPITAL Y
@@ -66708,7 +66718,6 @@ endif
 <U01B3> <S01B4>;<BASE>;<CAP>;<U01B3> % LATIN CAPITAL LETTER Y WITH HOOK
 <U1EFE> <S1EFF>;<BASE>;<CAP>;<U1EFE> % LATIN CAPITAL LETTER Y WITH LOOP
 <U021C> <S021D>;<BASE>;<CAP>;<U021C> % LATIN CAPITAL LETTER YOGH
-<U005A> <S007A>;<BASE>;<CAP>;<U005A> % LATIN CAPITAL LETTER Z
 <UFF3A> <S007A>;<BASE>;<WIDECAP>;<UFF3A> % FULLWIDTH LATIN CAPITAL LETTER Z
 <U0001F129> <S007A>;<BASE>;<COMPATCAP>;<U0001F129> % PARENTHESIZED LATIN CAPITAL LETTER Z
 <U2124> <S007A>;<BASE>;<FONTCAP>;<U2124> % DOUBLE-STRUCK CAPITAL Z
diff --git a/posix/tst-fnmatch.input b/posix/tst-fnmatch.input
index dc2ca8d01a..0b3c78fd1c 100644
--- a/posix/tst-fnmatch.input
+++ b/posix/tst-fnmatch.input
@@ -67,9 +67,11 @@
 # https://sourceware.org/bugzilla/show_bug.cgi?id=23393
 # https://sourceware.org/bugzilla/show_bug.cgi?id=23420
 #
-# No consensus exists on how best to handle the changes so the
-# iso14651_t1_common collation element order (CEO) has been changed to
-# deinterlace the a-z and A-Z regions.
+# The solution was to implement rational ranges by moving the collation
+# element order to fix this for [a-z], [A-Z], and [0-9]. Likewise the
+# upper and lower case letters are deinterlaced to allow for accented
+# ranges that don't include uppercase e.g. [a-ñ] should not include
+# any uppercase letters but may include a-z and more.
 #
 # With the deinterlacing commit ac3a3b4b0d561d776b60317d6a926050c8541655
 # could be reverted to re-test the correct non-interleaved expectations.
@@ -77,9 +79,7 @@
 # Please note that despite the region being deinterlaced, the ordering
 # of collation remains the same.  In glibc we implement CEO and because of
 # that we can reorder the elements to reorder ranges without impacting
-# collation which depends on weights.  The collation element ordering
-# could have been changed to include just a-z, A-Z, and 0-9 in three
-# distinct blocks, but this needs more discussion by the community.
+# collation which depends on weights.
 
 # B.6 004(C)
 C		 "!#%+,-./01234567889"	"!#%+,-./01234567889"  0
@@ -477,9 +477,9 @@ C		"-"			"[Z-\\]]"	       NOMATCH
 # handling of ranges and the recognition of character (vs bytes).
 de_DE.ISO-8859-1 "a"			"[a-z]"		       0
 de_DE.ISO-8859-1 "z"			"[a-z]"		       0
-de_DE.ISO-8859-1 "ä"			"[a-z]"		       0
-de_DE.ISO-8859-1 "ö"			"[a-z]"		       0
-de_DE.ISO-8859-1 "ü"			"[a-z]"		       0
+de_DE.ISO-8859-1 "ä"			"[a-z]"		       NOMATCH
+de_DE.ISO-8859-1 "ö"			"[a-z]"		       NOMATCH
+de_DE.ISO-8859-1 "ü"			"[a-z]"		       NOMATCH
 de_DE.ISO-8859-1 "A"			"[a-z]"		       NOMATCH
 de_DE.ISO-8859-1 "Z"			"[a-z]"		       NOMATCH
 de_DE.ISO-8859-1 "Ä"			"[a-z]"		       NOMATCH
@@ -492,9 +492,9 @@ de_DE.ISO-8859-1 "
 de_DE.ISO-8859-1 "ü"			"[A-Z]"		       NOMATCH
 de_DE.ISO-8859-1 "A"			"[A-Z]"		       0
 de_DE.ISO-8859-1 "Z"			"[A-Z]"		       0
-de_DE.ISO-8859-1 "Ä"			"[A-Z]"		       0
-de_DE.ISO-8859-1 "Ö"			"[A-Z]"		       0
-de_DE.ISO-8859-1 "Ü"			"[A-Z]"		       0
+de_DE.ISO-8859-1 "Ä"			"[A-Z]"		       NOMATCH
+de_DE.ISO-8859-1 "Ö"			"[A-Z]"		       NOMATCH
+de_DE.ISO-8859-1 "Ü"			"[A-Z]"		       NOMATCH
 de_DE.ISO-8859-1 "a"			"[[:lower:]]"	       0
 de_DE.ISO-8859-1 "z"			"[[:lower:]]"	       0
 de_DE.ISO-8859-1 "ä"			"[[:lower:]]"	       0
@@ -568,20 +568,34 @@ de_DE.ISO-8859-1 "ba"			"[[.a.]]a"	       NOMATCH
 
 # And with a multibyte character set.
 en_US.UTF-8	 "a"			"[a-z]"		       0
+# Test that <U00F1> LATIN SMALL LETTER N WITH TILDE is not in [a-z].
+en_US.UTF-8	 "ñ"			"[a-z]"		       NOMATCH
 en_US.UTF-8	 "z"			"[a-z]"		       0
 en_US.UTF-8	 "A"			"[a-z]"		       NOMATCH
+# Test that <U00D1> LATIN CAPITAL LETTER N WITH TILDE is not in [a-z].
+en_US.UTF-8	 "Ñ"			"[a-z]"		       NOMATCH
 en_US.UTF-8	 "Z"			"[a-z]"		       NOMATCH
 en_US.UTF-8	 "a"			"[A-Z]"		       NOMATCH
+# Test that <U00F1> LATIN SMALL LETTER N WITH TILDE is not in [A-Z].
+en_US.UTF-8	 "ñ"			"[A-Z]"		       NOMATCH
 en_US.UTF-8	 "z"			"[A-Z]"		       NOMATCH
 en_US.UTF-8	 "A"			"[A-Z]"		       0
+# Test that <U00D1> LATIN CAPITAL LETTER N WITH TILDE is not in [A-Z].
+en_US.UTF-8	 "Ñ"			"[A-Z]"		       NOMATCH
 en_US.UTF-8	 "Z"			"[A-Z]"		       0
 en_US.UTF-8	 "0"			"[0-9]"		       0
+# Test that <UFF10> FULLWIDTH DIGIT ZERO is not in [0-9].
+en_US.UTF-8	 "０"			"[0-9]"		       NOMATCH
+# Test that <U00BD> VULGAR FRACTION ONE HALF is not in [0-9].
+en_US.UTF-8	 "½"			"[0-9]"		       NOMATCH
 en_US.UTF-8	 "9"			"[0-9]"		       0
+# Test that <UFF19> FULLWIDTH DIGIT NINE is not in [0-9].
+en_US.UTF-8	 "９"			"[0-9]"		       NOMATCH
 de_DE.UTF-8	 "a"			"[a-z]"		       0
 de_DE.UTF-8	 "z"			"[a-z]"		       0
-de_DE.UTF-8	 "ä"			"[a-z]"		       0
-de_DE.UTF-8	 "ö"			"[a-z]"		       0
-de_DE.UTF-8	 "ü"			"[a-z]"		       0
+de_DE.UTF-8	 "ä"			"[a-z]"		       NOMATCH
+de_DE.UTF-8	 "ö"			"[a-z]"		       NOMATCH
+de_DE.UTF-8	 "ü"			"[a-z]"		       NOMATCH
 de_DE.UTF-8	 "A"			"[a-z]"		       NOMATCH
 de_DE.UTF-8	 "Z"			"[a-z]"		       NOMATCH
 de_DE.UTF-8	 "Ä"			"[a-z]"		       NOMATCH
@@ -594,9 +608,9 @@ de_DE.UTF-8	 "ö"			"[A-Z]"		       NOMATCH
 de_DE.UTF-8	 "ü"			"[A-Z]"		       NOMATCH
 de_DE.UTF-8	 "A"			"[A-Z]"		       0
 de_DE.UTF-8	 "Z"			"[A-Z]"		       0
-de_DE.UTF-8	 "Ä"			"[A-Z]"		       0
-de_DE.UTF-8	 "Ö"			"[A-Z]"		       0
-de_DE.UTF-8	 "Ü"			"[A-Z]"		       0
+de_DE.UTF-8	 "Ä"		"[A-Z]"		       NOMATCH
+de_DE.UTF-8	 "Ö"		"[A-Z]"		       NOMATCH
+de_DE.UTF-8	 "Ü"		"[A-Z]"		       NOMATCH
 de_DE.UTF-8	 "a"			"[[:lower:]]"	       0
 de_DE.UTF-8	 "z"			"[[:lower:]]"	       0
 de_DE.UTF-8	 "ä"			"[[:lower:]]"	       0


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-20 18:49   ` Carlos O'Donell
@ 2018-07-20 19:02     ` Rich Felker
  2018-07-20 19:19     ` Florian Weimer
  1 sibling, 0 replies; 42+ messages in thread
From: Rich Felker @ 2018-07-20 19:02 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Florian Weimer, GNU C Library, Mike Fabian, Zorro Lang, Joseph S. Myers

On Fri, Jul 20, 2018 at 02:49:07PM -0400, Carlos O'Donell wrote:
> On 07/19/2018 04:39 PM, Florian Weimer wrote:
> > On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
> >> * Add back tests to tst-fnmatch.input and tst-regexloc.c which 
> >> exercise that [a-z] does not match A or Z.
> > 
> > [a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.
> 
> Sorry, I don't follow, it absolutely matches ASCII z.

That's not an ASCII z. It's some plane-1 mathematical z. :-)

Rich


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-20 18:49   ` Carlos O'Donell
  2018-07-20 19:02     ` Rich Felker
@ 2018-07-20 19:19     ` Florian Weimer
  2018-07-20 21:56       ` Carlos O'Donell
  1 sibling, 1 reply; 42+ messages in thread
From: Florian Weimer @ 2018-07-20 19:19 UTC (permalink / raw)
  To: Carlos O'Donell, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

On 07/20/2018 08:49 PM, Carlos O'Donell wrote:
> On 07/19/2018 04:39 PM, Florian Weimer wrote:
>> On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
>>> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
>>> exercise that [a-z] does not match A or Z.
>>
>> [a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.
> 
> Sorry, I don't follow, it absolutely matches ASCII z.

The z I wrote above is one of the non-BMP math characters.

> We deinterlace the collation element ordering (not sequence) to get
> the right range expression resolution.
> 
> See the added fnmatch tests:
> 
> +en_US.UTF-8     "a"                    "[a-z]"                0
> +en_US.UTF-8     "z"                    "[a-z]"                0
> +en_US.UTF-8     "A"                    "[a-z]"                NOMATCH
> +en_US.UTF-8     "Z"                    "[a-z]"                NOMATCH
> +en_US.UTF-8     "a"                    "[A-Z]"                NOMATCH
> +en_US.UTF-8     "z"                    "[A-Z]"                NOMATCH
> +en_US.UTF-8     "A"                    "[A-Z]"                0
> +en_US.UTF-8     "Z"                    "[A-Z]"                0
> +en_US.UTF-8     "0"                    "[0-9]"                0
> +en_US.UTF-8     "9"                    "[0-9]"                0
> 
> [a-z] matches a-z (including z), *and* all the lowercase inbetween,
> and so behaves like :lower: effectively.

There are characters equivalent to ASCII z (like the z above), but which 
sort after z, so they are not matched.  This is one reason why I think 
this is a bad idea: it looks like [:lower:], but it's not.  Same for 
[0-9], I assume.
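
To illustrate the point above, here is a toy model in Python (a sketch, not glibc code). It assumes that range membership is decided by position in the collation element order, while collation itself uses the weights. The table, the helper names, and the assumption that the mathematical monospace z carries the same primary weight as ASCII z are all hypothetical.

```python
# Toy model (not glibc code): each entry is (character, primary weight),
# listed in collation element order, i.e. the order of lines in the
# locale source. The weights shown are illustrative assumptions.
MATH_Z = "\U0001D6A3"  # MATHEMATICAL MONOSPACE SMALL Z

ELEMENT_ORDER = [
    ("a", "S0061"),
    ("b", "S0062"),
    ("y", "S0079"),
    ("z", "S007A"),
    (MATH_Z, "S007A"),  # same primary weight as "z", but listed after it
]


def in_range(ch, first, last, order=ELEMENT_ORDER):
    """Return True if ch lies between first and last in element order."""
    position = {c: i for i, (c, _) in enumerate(order)}
    return position[first] <= position[ch] <= position[last]


def primary_weight(ch, order=ELEMENT_ORDER):
    """Return the (illustrative) primary weight of ch."""
    return dict(order)[ch]


print(in_range("z", "a", "z"))      # ASCII z is inside the range
print(in_range(MATH_Z, "a", "z"))   # the math z sorts after z, so it is not
print(primary_weight(MATH_Z) == primary_weight("z"))
```

In this model the math z compares equal to z at the primary level yet falls outside [a-z], which is why the range looks like [:lower:] without being it.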

>> It's an improvement, and it may be good enough for glibc 2.28, but I would
>> rather see us implement the rational ranges interpretation.
> 
> That requires all ranges behave rationally?
> 
> We could fix a-z, A-Z, and 0-9 easily.
> 
> Patch attached.

(NB: Patch is relative to the previous patch.)

My enumeration tester likes it much more. 8-)

   actual:   "abcdefghijklmnopqrstuvwxyz"
   actual:   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
   actual:   "0123456789"

That's for [a-z], [A-Z], [0-9], in en_US.UTF-8 and de_DE.ISO-8859-1. 
However, I still get this:

tst-regex-classes.script:85:0: result character set difference in locale 
tr_TR.ISO-8859-9
enumerate_chars '[a-z]' "abcdefghijklmnopqrstuvwxyz";
^
   expected: "abcdefghijklmnopqrstuvwxyz"
   actual:   "abcdefghjklmnopqrstuvwxyz"
tst-regex-classes.script:86:0: result character set difference in locale 
tr_TR.ISO-8859-9
enumerate_chars '[A-Z]' "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
^
   expected: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
   actual:   "ABCDEFGHJKLMNOPQRSTUVWXYZ"
error: 2 test failures

Can you fix this with data-only changes, too?
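
The enumeration idea can be sketched in a few lines of Python. This is not the actual tester (tst-regex-classes.script is not shown in this thread); `enumerate_chars` below is a hypothetical helper that only borrows the name. Python's `re` compares code points rather than locale collation order, so the sketch shows the shape of the check, not glibc's behaviour.

```python
import re


def enumerate_chars(pattern, candidates):
    """Return the candidate characters matched by pattern, in input order."""
    compiled = re.compile(pattern)
    return "".join(ch for ch in candidates if compiled.fullmatch(ch))


# Every printable code point of a single-byte (Latin-1 sized) character set.
latin1 = [chr(cp) for cp in range(0x20, 0x100)]

print(enumerate_chars("[a-z]", latin1))  # abcdefghijklmnopqrstuvwxyz
print(enumerate_chars("[A-Z]", latin1))  # ABCDEFGHIJKLMNOPQRSTUVWXYZ
print(enumerate_chars("[0-9]", latin1))  # 0123456789
```

Comparing the returned string with the expected one, per locale, gives exactly the kind of expected/actual report quoted above.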

posix/bug-regex17 regresses as well in the test for bug 9697, but I can 
incorporate that into my enumeration tester.  I don't think the bug is 
actually regressing, it's just that the test objective is not expressed 
properly in it.

posix/tst-rxspencer fails as well, presumably due to this:

UTF-8 aA FAIL regcomp failed: Invalid range end
UTF-8 aAcC FAIL regcomp failed: Invalid range end

I think this happens because the test blindly replaces ASCII characters 
with non-ASCII characters, which causes issues if they are not ordered 
as expected.
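
A toy model of that failure mode, in Python (a sketch under stated assumptions, not the regcomp implementation): a range is treated as valid only if its start does not come after its end in the collation element order. The two orderings and the `check_range` helper are hypothetical and much smaller than the real data.

```python
def check_range(first, last, order):
    """Return the characters from first to last in the given element order.

    Raises ValueError if first comes after last, mirroring the
    "Invalid range end" failure reported above.
    """
    position = {ch: i for i, ch in enumerate(order)}
    if position[first] > position[last]:
        raise ValueError("Invalid range end")
    return order[position[first]:position[last] + 1]


interleaved = ["a", "A", "b", "B", "c", "C"]   # old order: a A b B c C
deinterlaced = ["a", "b", "c", "A", "B", "C"]  # new order: lowercase, then uppercase

print(check_range("A", "b", interleaved))      # ['A', 'b']

try:
    check_range("A", "b", deinterlaced)
except ValueError as err:
    print("regcomp-style failure:", err)
```

A range written against one ordering can therefore stop compiling once the elements are reordered, even though no weight has changed.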

Thanks,
Florian


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-20 19:19     ` Florian Weimer
@ 2018-07-20 21:56       ` Carlos O'Donell
  2018-07-23 15:11         ` Florian Weimer
  0 siblings, 1 reply; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-20 21:56 UTC (permalink / raw)
  To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 5429 bytes --]

On 07/20/2018 03:19 PM, Florian Weimer wrote:
> On 07/20/2018 08:49 PM, Carlos O'Donell wrote:
>> On 07/19/2018 04:39 PM, Florian Weimer wrote:
>>> On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
>>>> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
>>>> exercise that [a-z] does not match A or Z.
>>>
>>> [a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.
>>
>> Sorry, I don't follow, it absolutely matches ASCII z.
> 
> The z I wrote above is one of the non-BMP math characters.

Thanks :-}

It was a conservative solution.

>> We deinterlace the collation element ordering (not sequence) to get
>> the right range expression resolution.
>>
>> See the added fnmatch tests:
>>
>> +en_US.UTF-8     "a"                    "[a-z]"                0
>> +en_US.UTF-8     "z"                    "[a-z]"                0
>> +en_US.UTF-8     "A"                    "[a-z]"                NOMATCH
>> +en_US.UTF-8     "Z"                    "[a-z]"                NOMATCH
>> +en_US.UTF-8     "a"                    "[A-Z]"                NOMATCH
>> +en_US.UTF-8     "z"                    "[A-Z]"                NOMATCH
>> +en_US.UTF-8     "A"                    "[A-Z]"                0
>> +en_US.UTF-8     "Z"                    "[A-Z]"                0
>> +en_US.UTF-8     "0"                    "[0-9]"                0
>> +en_US.UTF-8     "9"                    "[0-9]"                0
>>
>> [a-z] matches a-z (including z), *and* all the lowercase inbetween,
>> and so behaves like :lower: effectively.
> 
> There are characters equivalent to ASCII z (like the z above), but
> which sort after z, so they are not matched.  This is one reason why
> I think this is a bad idea: it looks like [:lower:], but it's not.
> Same for [0-9], I assume.

Again, conservatively, this is how it worked before, and now works again
the same, but retains the improvement of ISO 14651 data being added.
 
>>> It's an improvement, and it may be good enough for glibc 2.28, but I would
>>> rather see us implement the rational ranges interpretation.
>>
>> That requires all ranges behave rationally?
>>
>> We could fix a-z, A-Z, and 0-9 easily.
>>
>> Patch attached.
> 
> (NB: Patch is relative to the previous patch.)
> 
> My enumeration tester likes it much more. 8-)

It was designed exactly for your enumerator ;-)

>   actual:   "abcdefghijklmnopqrstuvwxyz"
>   actual:   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
>   actual:   "0123456789"
> 
> That's for [a-z], [A-Z], [0-9], in en_US.UTF-8 and de_DE.ISO-8859-1. However, I still get this:
> 
> tst-regex-classes.script:85:0: result character set difference in locale tr_TR.ISO-8859-9
> enumerate_chars '[a-z]' "abcdefghijklmnopqrstuvwxyz";
> ^
>   expected: "abcdefghijklmnopqrstuvwxyz"
>   actual:   "abcdefghjklmnopqrstuvwxyz"
>
> tst-regex-classes.script:86:0: result character set difference in locale tr_TR.ISO-8859-9
> enumerate_chars '[A-Z]' "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
> ^
>   expected: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
>   actual:   "ABCDEFGHJKLMNOPQRSTUVWXYZ"
> error: 2 test failures
> 
> Can you fix this with data-only changes, too?

Yes, I need to duplicate the rational range for A-Z in tr_TR and
remove 'i' since it's just fine the way it is, the existing

New patch attached with additional tests in tst-fnmatch.input to
test tr_TR.UTF-8, and ISO-8859-9.

I noticed equivalence class issues, filed a bug, and added an XFAIL-ish
test case in tst-fnmatch.input:
https://sourceware.org/bugzilla/show_bug.cgi?id=23437

> posix/bug-regex17 regresses as well in the test for bug 9697, but I
> can incorporate that into my enumeration tester.  I don't think the
> bug is actually regressing, it's just that the test objective is not
> expressed properly in it.

Fixed.

> 
> posix/tst-rxspencer fails as well, presumably due to this:
> 
> UTF-8 aA FAIL regcomp failed: Invalid range end
> UTF-8 aAcC FAIL regcomp failed: Invalid range end
> 
> I think this happens because the test blindly replaces ASCII
> characters with non-ASCII characters, which causes issues if they are
> not ordered as expected.

Fixed.

v2
- Fixed tr_TR by duplicating A-Z rational range.
- Fixed tst-rxspencer.
- Fixed bug-regex17.

Tell me how the new version does.

-- 
Cheers,
Carlos.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: rational-ranges-v2.diff --]
[-- Type: text/x-patch; name="rational-ranges-v2.diff", Size: 49826 bytes --]

diff --git a/localedata/locales/iso14651_t1_common b/localedata/locales/iso14651_t1_common
index 227400cc4e..7248074a8b 100644
--- a/localedata/locales/iso14651_t1_common
+++ b/localedata/locales/iso14651_t1_common
@@ -63177,7 +63177,19 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U20BC> <S20BC>;<BASE>;<MIN>;<U20BC> % MANAT SIGN
 <U20BD> <S20BD>;<BASE>;<MIN>;<U20BD> % RUBLE SIGN
 <U20BE> <S20BE>;<BASE>;<MIN>;<U20BE> % LARI SIGN
+% Implement rational range for [0-9] in regular expressions.
+% We order the collation element order to support rational ranges.
+% Collation is unaffected because the 4-level weights remain the same.
 <U0030> <S0030>;<BASE>;<MIN>;<U0030> % DIGIT ZERO
+<U0031> <S0031>;<BASE>;<MIN>;<U0031> % DIGIT ONE
+<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO
+<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
+<U0034> <S0034>;<BASE>;<MIN>;<U0034> % DIGIT FOUR
+<U0035> <S0035>;<BASE>;<MIN>;<U0035> % DIGIT FIVE
+<U0036> <S0036>;<BASE>;<MIN>;<U0036> % DIGIT SIX
+<U0037> <S0037>;<BASE>;<MIN>;<U0037> % DIGIT SEVEN
+<U0038> <S0038>;<BASE>;<MIN>;<U0038> % DIGIT EIGHT
+<U0039> <S0039>;<BASE>;<MIN>;<U0039> % DIGIT NINE
 <U0660> <S0030>;<BASE>;<MIN>;<U0660> % ARABIC-INDIC DIGIT ZERO
 <U06F0> <S0030>;<BASE>;<MIN>;<U06F0> % EXTENDED ARABIC-INDIC DIGIT ZERO
 <U07C0> <S0030>;<BASE>;<MIN>;<U07C0> % NKO DIGIT ZERO
@@ -63250,7 +63262,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U2080> <S0030>;<BASE>;<MNS>;<U2080> % SUBSCRIPT ZERO
 <U2189> "<S0030><S0033>";"<BASE><BASE>";"<FRACTION><FRACTION>";<U2189> % VULGAR FRACTION ZERO THIRDS
 <U3358> "<S0030><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3358> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ZERO
-<U0031> <S0031>;<BASE>;<MIN>;<U0031> % DIGIT ONE
 <U0661> <S0031>;<BASE>;<MIN>;<U0661> % ARABIC-INDIC DIGIT ONE
 <U06F1> <S0031>;<BASE>;<MIN>;<U06F1> % EXTENDED ARABIC-INDIC DIGIT ONE
 <U07C1> <S0031>;<BASE>;<MIN>;<U07C1> % NKO DIGIT ONE
@@ -63440,7 +63451,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E0> "<S0031><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E0> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY ONE
 <U32C0> "<S0031><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C0> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JANUARY
 <U3359> "<S0031><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3359> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ONE
-<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO
 <U0662> <S0032>;<BASE>;<MIN>;<U0662> % ARABIC-INDIC DIGIT TWO
 <U06F2> <S0032>;<BASE>;<MIN>;<U06F2> % EXTENDED ARABIC-INDIC DIGIT TWO
 <U07C2> <S0032>;<BASE>;<MIN>;<U07C2> % NKO DIGIT TWO
@@ -63583,7 +63593,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E1> "<S0032><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E1> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY TWO
 <U32C1> "<S0032><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C1> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR FEBRUARY
 <U335A> "<S0032><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335A> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR TWO
-<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE
 <U0663> <S0033>;<BASE>;<MIN>;<U0663> % ARABIC-INDIC DIGIT THREE
 <U06F3> <S0033>;<BASE>;<MIN>;<U06F3> % EXTENDED ARABIC-INDIC DIGIT THREE
 <U07C3> <S0033>;<BASE>;<MIN>;<U07C3> % NKO DIGIT THREE
@@ -63709,7 +63718,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E2> "<S0033><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E2> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY THREE
 <U32C2> "<S0033><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C2> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR MARCH
 <U335B> "<S0033><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335B> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR THREE
-<U0034> <S0034>;<BASE>;<MIN>;<U0034> % DIGIT FOUR
 <U0664> <S0034>;<BASE>;<MIN>;<U0664> % ARABIC-INDIC DIGIT FOUR
 <U06F4> <S0034>;<BASE>;<MIN>;<U06F4> % EXTENDED ARABIC-INDIC DIGIT FOUR
 <U07C4> <S0034>;<BASE>;<MIN>;<U07C4> % NKO DIGIT FOUR
@@ -63829,7 +63837,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E3> "<S0034><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E3> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY FOUR
 <U32C3> "<S0034><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C3> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR APRIL
 <U335C> "<S0034><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335C> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR FOUR
-<U0035> <S0035>;<BASE>;<MIN>;<U0035> % DIGIT FIVE
 <U0665> <S0035>;<BASE>;<MIN>;<U0665> % ARABIC-INDIC DIGIT FIVE
 <U06F5> <S0035>;<BASE>;<MIN>;<U06F5> % EXTENDED ARABIC-INDIC DIGIT FIVE
 <U07C5> <S0035>;<BASE>;<MIN>;<U07C5> % NKO DIGIT FIVE
@@ -63941,7 +63948,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E4> "<S0035><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E4> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY FIVE
 <U32C4> "<S0035><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C4> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR MAY
 <U335D> "<S0035><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335D> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR FIVE
-<U0036> <S0036>;<BASE>;<MIN>;<U0036> % DIGIT SIX
 <U0666> <S0036>;<BASE>;<MIN>;<U0666> % ARABIC-INDIC DIGIT SIX
 <U06F6> <S0036>;<BASE>;<MIN>;<U06F6> % EXTENDED ARABIC-INDIC DIGIT SIX
 <U07C6> <S0036>;<BASE>;<MIN>;<U07C6> % NKO DIGIT SIX
@@ -64036,7 +64042,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E5> "<S0036><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E5> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY SIX
 <U32C5> "<S0036><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C5> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JUNE
 <U335E> "<S0036><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335E> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR SIX
-<U0037> <S0037>;<BASE>;<MIN>;<U0037> % DIGIT SEVEN
 <U0667> <S0037>;<BASE>;<MIN>;<U0667> % ARABIC-INDIC DIGIT SEVEN
 <U06F7> <S0037>;<BASE>;<MIN>;<U06F7> % EXTENDED ARABIC-INDIC DIGIT SEVEN
 <U07C7> <S0037>;<BASE>;<MIN>;<U07C7> % NKO DIGIT SEVEN
@@ -64132,7 +64137,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E6> "<S0037><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E6> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY SEVEN
 <U32C6> "<S0037><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C6> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JULY
 <U335F> "<S0037><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335F> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR SEVEN
-<U0038> <S0038>;<BASE>;<MIN>;<U0038> % DIGIT EIGHT
 <U0668> <S0038>;<BASE>;<MIN>;<U0668> % ARABIC-INDIC DIGIT EIGHT
 <U06F8> <S0038>;<BASE>;<MIN>;<U06F8> % EXTENDED ARABIC-INDIC DIGIT EIGHT
 <U07C8> <S0038>;<BASE>;<MIN>;<U07C8> % NKO DIGIT EIGHT
@@ -64226,7 +64230,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U33E7> "<S0038><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E7> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY EIGHT
 <U32C7> "<S0038><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C7> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR AUGUST
 <U3360> "<S0038><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3360> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR EIGHT
-<U0039> <S0039>;<BASE>;<MIN>;<U0039> % DIGIT NINE
 <U0669> <S0039>;<BASE>;<MIN>;<U0669> % ARABIC-INDIC DIGIT NINE
 <U06F9> <S0039>;<BASE>;<MIN>;<U06F9> % EXTENDED ARABIC-INDIC DIGIT NINE
 <U07C9> <S0039>;<BASE>;<MIN>;<U07C9> % NKO DIGIT NINE
@@ -64326,7 +64329,35 @@ order_start <LATIN>;forward;backward;forward;forward,position
 else
 order_start <LATIN>;forward;forward;forward;forward,position
 endif
+% Implement rational range for [a-z] in regular expressions.
+% We order the collation element order to support rational ranges.
+% Collation is unaffected because the 4-level weights remain the same.
 <U0061> <S0061>;<BASE>;<MIN>;<U0061> % LATIN SMALL LETTER A
+<U0062> <S0062>;<BASE>;<MIN>;<U0062> % LATIN SMALL LETTER B
+<U0063> <S0063>;<BASE>;<MIN>;<U0063> % LATIN SMALL LETTER C
+<U0064> <S0064>;<BASE>;<MIN>;<U0064> % LATIN SMALL LETTER D
+<U0065> <S0065>;<BASE>;<MIN>;<U0065> % LATIN SMALL LETTER E
+<U0066> <S0066>;<BASE>;<MIN>;<U0066> % LATIN SMALL LETTER F
+<U0067> <S0067>;<BASE>;<MIN>;<U0067> % LATIN SMALL LETTER G
+<U0068> <S0068>;<BASE>;<MIN>;<U0068> % LATIN SMALL LETTER H
+<U0069> <S0069>;<BASE>;<MIN>;<U0069> % LATIN SMALL LETTER I
+<U006A> <S006A>;<BASE>;<MIN>;<U006A> % LATIN SMALL LETTER J
+<U006B> <S006B>;<BASE>;<MIN>;<U006B> % LATIN SMALL LETTER K
+<U006C> <S006C>;<BASE>;<MIN>;<U006C> % LATIN SMALL LETTER L
+<U006D> <S006D>;<BASE>;<MIN>;<U006D> % LATIN SMALL LETTER M
+<U006E> <S006E>;<BASE>;<MIN>;<U006E> % LATIN SMALL LETTER N
+<U006F> <S006F>;<BASE>;<MIN>;<U006F> % LATIN SMALL LETTER O
+<U0070> <S0070>;<BASE>;<MIN>;<U0070> % LATIN SMALL LETTER P
+<U0071> <S0071>;<BASE>;<MIN>;<U0071> % LATIN SMALL LETTER Q
+<U0072> <S0072>;<BASE>;<MIN>;<U0072> % LATIN SMALL LETTER R
+<U0073> <S0073>;<BASE>;<MIN>;<U0073> % LATIN SMALL LETTER S
+<U0074> <S0074>;<BASE>;<MIN>;<U0074> % LATIN SMALL LETTER T
+<U0075> <S0075>;<BASE>;<MIN>;<U0075> % LATIN SMALL LETTER U
+<U0076> <S0076>;<BASE>;<MIN>;<U0076> % LATIN SMALL LETTER V
+<U0077> <S0077>;<BASE>;<MIN>;<U0077> % LATIN SMALL LETTER W
+<U0078> <S0078>;<BASE>;<MIN>;<U0078> % LATIN SMALL LETTER X
+<U0079> <S0079>;<BASE>;<MIN>;<U0079> % LATIN SMALL LETTER Y
+<U007A> <S007A>;<BASE>;<MIN>;<U007A> % LATIN SMALL LETTER Z
 <UFF41> <S0061>;<BASE>;<WIDE>;<UFF41> % FULLWIDTH LATIN SMALL LETTER A
 <U0363> <S0061>;<BASE>;<COMPAT>;<U0363> % COMBINING LATIN SMALL LETTER A
 <U249C> <S0061>;<BASE>;<COMPAT>;<U249C> % PARENTHESIZED LATIN SMALL LETTER A
@@ -64418,7 +64449,6 @@ endif
 <U0252> <S0252>;<BASE>;<MIN>;<U0252> % LATIN SMALL LETTER TURNED ALPHA
 <U1D9B> <S0252>;<BASE>;<MNN>;<U1D9B> % MODIFIER LETTER SMALL TURNED ALPHA
 <UAB64> <SAB64>;<BASE>;<MIN>;<UAB64> % LATIN SMALL LETTER INVERTED ALPHA
-<U0062> <S0062>;<BASE>;<MIN>;<U0062> % LATIN SMALL LETTER B
 <UFF42> <S0062>;<BASE>;<WIDE>;<UFF42> % FULLWIDTH LATIN SMALL LETTER B
 <U1DE8> <S0062>;<BASE>;<COMPAT>;<U1DE8> % COMBINING LATIN SMALL LETTER B
 <U249D> <S0062>;<BASE>;<COMPAT>;<U249D> % PARENTHESIZED LATIN SMALL LETTER B
@@ -64454,7 +64484,6 @@ endif
 <U0183> <S0183>;<BASE>;<MIN>;<U0183> % LATIN SMALL LETTER B WITH TOPBAR
 <UA7B5> <SA7B5>;<BASE>;<MIN>;<UA7B5> % LATIN SMALL LETTER BETA
 <U1DE9> <SA7B5>;<BASE>;<COMPAT>;<U1DE9> % COMBINING LATIN SMALL LETTER BETA
-<U0063> <S0063>;<BASE>;<MIN>;<U0063> % LATIN SMALL LETTER C
 <UFF43> <S0063>;<BASE>;<WIDE>;<UFF43> % FULLWIDTH LATIN SMALL LETTER C
 <U0368> <S0063>;<BASE>;<COMPAT>;<U0368> % COMBINING LATIN SMALL LETTER C
 <U217D> <S0063>;<BASE>;<COMPAT>;<U217D> % SMALL ROMAN NUMERAL ONE HUNDRED
@@ -64504,7 +64533,6 @@ endif
 <U1D9D> <S0255>;<BASE>;<MNN>;<U1D9D> % MODIFIER LETTER SMALL C WITH CURL
 <U2184> <S2184>;<BASE>;<MIN>;<U2184> % LATIN SMALL LETTER REVERSED C
 <UA73F> <SA73F>;<BASE>;<MIN>;<UA73F> % LATIN SMALL LETTER REVERSED C WITH DOT
-<U0064> <S0064>;<BASE>;<MIN>;<U0064> % LATIN SMALL LETTER D
 <UFF44> <S0064>;<BASE>;<WIDE>;<UFF44> % FULLWIDTH LATIN SMALL LETTER D
 <U0369> <S0064>;<BASE>;<COMPAT>;<U0369> % COMBINING LATIN SMALL LETTER D
 <U217E> <S0064>;<BASE>;<COMPAT>;<U217E> % SMALL ROMAN NUMERAL FIVE HUNDRED
@@ -64563,7 +64591,6 @@ endif
 <U0221> <S0221>;<BASE>;<MIN>;<U0221> % LATIN SMALL LETTER D WITH CURL
 <UA771> <SA771>;<BASE>;<MIN>;<UA771> % LATIN SMALL LETTER DUM
 <U1E9F> <S1E9F>;<BASE>;<MIN>;<U1E9F> % LATIN SMALL LETTER DELTA
-<U0065> <S0065>;<BASE>;<MIN>;<U0065> % LATIN SMALL LETTER E
 <UFF45> <S0065>;<BASE>;<WIDE>;<UFF45> % FULLWIDTH LATIN SMALL LETTER E
 <U0364> <S0065>;<BASE>;<COMPAT>;<U0364> % COMBINING LATIN SMALL LETTER E
 <U24A0> <S0065>;<BASE>;<COMPAT>;<U24A0> % PARENTHESIZED LATIN SMALL LETTER E
@@ -64641,7 +64668,6 @@ endif
 <U025E> <S025E>;<BASE>;<MIN>;<U025E> % LATIN SMALL LETTER CLOSED REVERSED OPEN E
 <U029A> <S029A>;<BASE>;<MIN>;<U029A> % LATIN SMALL LETTER CLOSED OPEN E
 <U0264> <S0264>;<BASE>;<MIN>;<U0264> % LATIN SMALL LETTER RAMS HORN
-<U0066> <S0066>;<BASE>;<MIN>;<U0066> % LATIN SMALL LETTER F
 <UFF46> <S0066>;<BASE>;<WIDE>;<UFF46> % FULLWIDTH LATIN SMALL LETTER F
 <U1DEB> <S0066>;<BASE>;<COMPAT>;<U1DEB> % COMBINING LATIN SMALL LETTER F
 <U24A1> <S0066>;<BASE>;<COMPAT>;<U24A1> % PARENTHESIZED LATIN SMALL LETTER F
@@ -64680,7 +64706,6 @@ endif
 <U0192> <S0192>;<BASE>;<MIN>;<U0192> % LATIN SMALL LETTER F WITH HOOK
 <U214E> <S214E>;<BASE>;<MIN>;<U214E> % TURNED SMALL F
 <UA7FB> <SA7FB>;<BASE>;<MIN>;<UA7FB> % LATIN EPIGRAPHIC LETTER REVERSED F
-<U0067> <S0067>;<BASE>;<MIN>;<U0067> % LATIN SMALL LETTER G
 <UFF47> <S0067>;<BASE>;<WIDE>;<UFF47> % FULLWIDTH LATIN SMALL LETTER G
 <U1DDA> <S0067>;<BASE>;<COMPAT>;<U1DDA> % COMBINING LATIN SMALL LETTER G
 <U24A2> <S0067>;<BASE>;<COMPAT>;<U24A2> % PARENTHESIZED LATIN SMALL LETTER G
@@ -64727,7 +64752,6 @@ endif
 <U0263> <S0263>;<BASE>;<MIN>;<U0263> % LATIN SMALL LETTER GAMMA
 <U02E0> <S0263>;<BASE>;<MNN>;<U02E0> % MODIFIER LETTER SMALL GAMMA
 <U01A3> <S01A3>;<BASE>;<MIN>;<U01A3> % LATIN SMALL LETTER OI
-<U0068> <S0068>;<BASE>;<MIN>;<U0068> % LATIN SMALL LETTER H
 <UFF48> <S0068>;<BASE>;<WIDE>;<UFF48> % FULLWIDTH LATIN SMALL LETTER H
 <U036A> <S0068>;<BASE>;<COMPAT>;<U036A> % COMBINING LATIN SMALL LETTER H
 <U24A3> <S0068>;<BASE>;<COMPAT>;<U24A3> % PARENTHESIZED LATIN SMALL LETTER H
@@ -64780,7 +64804,6 @@ endif
 <U0267> <S0267>;<BASE>;<MIN>;<U0267> % LATIN SMALL LETTER HENG WITH HOOK
 <U02BB> <S02BB>;<BASE>;<MIN>;<U02BB> % MODIFIER LETTER TURNED COMMA
 <U02BD> <S02BD>;<BASE>;<MIN>;<U02BD> % MODIFIER LETTER REVERSED COMMA
-<U0069> <S0069>;<BASE>;<MIN>;<U0069> % LATIN SMALL LETTER I
 <UFF49> <S0069>;<BASE>;<WIDE>;<UFF49> % FULLWIDTH LATIN SMALL LETTER I
 <U0365> <S0069>;<BASE>;<COMPAT>;<U0365> % COMBINING LATIN SMALL LETTER I
 <U2170> <S0069>;<BASE>;<COMPAT>;<U2170> % SMALL ROMAN NUMERAL ONE
@@ -64844,7 +64867,6 @@ endif
 <U0269> <S0269>;<BASE>;<MIN>;<U0269> % LATIN SMALL LETTER IOTA
 <U1DA5> <S0269>;<BASE>;<MNN>;<U1DA5> % MODIFIER LETTER SMALL IOTA
 <U1D7C> <S1D7C>;<BASE>;<MIN>;<U1D7C> % LATIN SMALL LETTER IOTA WITH STROKE
-<U006A> <S006A>;<BASE>;<MIN>;<U006A> % LATIN SMALL LETTER J
 <UFF4A> <S006A>;<BASE>;<WIDE>;<UFF4A> % FULLWIDTH LATIN SMALL LETTER J
 <U24A5> <S006A>;<BASE>;<COMPAT>;<U24A5> % PARENTHESIZED LATIN SMALL LETTER J
 <U2149> <S006A>;<BASE>;<FONT>;<U2149> % DOUBLE-STRUCK ITALIC SMALL J
@@ -64876,7 +64898,6 @@ endif
 <U025F> <S025F>;<BASE>;<MIN>;<U025F> % LATIN SMALL LETTER DOTLESS J WITH STROKE
 <U1DA1> <S025F>;<BASE>;<MNN>;<U1DA1> % MODIFIER LETTER SMALL DOTLESS J WITH STROKE
 <U0284> <S0284>;<BASE>;<MIN>;<U0284> % LATIN SMALL LETTER DOTLESS J WITH STROKE AND HOOK
-<U006B> <S006B>;<BASE>;<MIN>;<U006B> % LATIN SMALL LETTER K
 <UFF4B> <S006B>;<BASE>;<WIDE>;<UFF4B> % FULLWIDTH LATIN SMALL LETTER K
 <U1DDC> <S006B>;<BASE>;<COMPAT>;<U1DDC> % COMBINING LATIN SMALL LETTER K
 <U24A6> <S006B>;<BASE>;<COMPAT>;<U24A6> % PARENTHESIZED LATIN SMALL LETTER K
@@ -64926,7 +64947,6 @@ endif
 <UA743> <SA743>;<BASE>;<MIN>;<UA743> % LATIN SMALL LETTER K WITH DIAGONAL STROKE
 <UA745> <SA745>;<BASE>;<MIN>;<UA745> % LATIN SMALL LETTER K WITH STROKE AND DIAGONAL STROKE
 <U029E> <S029E>;<BASE>;<MIN>;<U029E> % LATIN SMALL LETTER TURNED K
-<U006C> <S006C>;<BASE>;<MIN>;<U006C> % LATIN SMALL LETTER L
 <UFF4C> <S006C>;<BASE>;<WIDE>;<UFF4C> % FULLWIDTH LATIN SMALL LETTER L
 <U1DDD> <S006C>;<BASE>;<COMPAT>;<U1DDD> % COMBINING LATIN SMALL LETTER L
 <U217C> <S006C>;<BASE>;<COMPAT>;<U217C> % SMALL ROMAN NUMERAL FIFTY
@@ -64996,7 +65016,6 @@ endif
 <UA781> <SA781>;<BASE>;<MIN>;<UA781> % LATIN SMALL LETTER TURNED L
 <U019B> <S019B>;<BASE>;<MIN>;<U019B> % LATIN SMALL LETTER LAMBDA WITH STROKE
 <U028E> <S028E>;<BASE>;<MIN>;<U028E> % LATIN SMALL LETTER TURNED Y
-<U006D> <S006D>;<BASE>;<MIN>;<U006D> % LATIN SMALL LETTER M
 <UFF4D> <S006D>;<BASE>;<WIDE>;<UFF4D> % FULLWIDTH LATIN SMALL LETTER M
 <U036B> <S006D>;<BASE>;<COMPAT>;<U036B> % COMBINING LATIN SMALL LETTER M
 <U217F> <S006D>;<BASE>;<COMPAT>;<U217F> % SMALL ROMAN NUMERAL ONE THOUSAND
@@ -65055,7 +65074,6 @@ endif
 <UA7FD> <SA7FD>;<BASE>;<MIN>;<UA7FD> % LATIN EPIGRAPHIC LETTER INVERTED M
 <UA7FF> <SA7FF>;<BASE>;<MIN>;<UA7FF> % LATIN EPIGRAPHIC LETTER ARCHAIC M
 <UA773> <SA773>;<BASE>;<MIN>;<UA773> % LATIN SMALL LETTER MUM
-<U006E> <S006E>;<BASE>;<MIN>;<U006E> % LATIN SMALL LETTER N
 <UFF4E> <S006E>;<BASE>;<WIDE>;<UFF4E> % FULLWIDTH LATIN SMALL LETTER N
 <U1DE0> <S006E>;<BASE>;<COMPAT>;<U1DE0> % COMBINING LATIN SMALL LETTER N
 <U24A9> <S006E>;<BASE>;<COMPAT>;<U24A9> % PARENTHESIZED LATIN SMALL LETTER N
@@ -65114,7 +65132,6 @@ endif
 <U014B> <S014B>;<BASE>;<MIN>;<U014B> % LATIN SMALL LETTER ENG
 <U1D51> <S014B>;<BASE>;<MNN>;<U1D51> % MODIFIER LETTER SMALL ENG
 <UAB3C> <SAB3C>;<BASE>;<MIN>;<UAB3C> % LATIN SMALL LETTER ENG WITH CROSSED-TAIL
-<U006F> <S006F>;<BASE>;<MIN>;<U006F> % LATIN SMALL LETTER O
 <UFF4F> <S006F>;<BASE>;<WIDE>;<UFF4F> % FULLWIDTH LATIN SMALL LETTER O
 <U0366> <S006F>;<BASE>;<COMPAT>;<U0366> % COMBINING LATIN SMALL LETTER O
 <U24AA> <S006F>;<BASE>;<COMPAT>;<U24AA> % PARENTHESIZED LATIN SMALL LETTER O
@@ -65213,7 +65230,6 @@ endif
 <U0223> <S0223>;<BASE>;<MIN>;<U0223> % LATIN SMALL LETTER OU
 <U1D3D> <S0223>;<BASE>;<MISCCAP>;<U1D3D> % MODIFIER LETTER CAPITAL OU
 <U1D15> <S1D15>;<BASE>;<MIN>;<U1D15> % LATIN LETTER SMALL CAPITAL OU
-<U0070> <S0070>;<BASE>;<MIN>;<U0070> % LATIN SMALL LETTER P
 <UFF50> <S0070>;<BASE>;<WIDE>;<UFF50> % FULLWIDTH LATIN SMALL LETTER P
 <U1DEE> <S0070>;<BASE>;<COMPAT>;<U1DEE> % COMBINING LATIN SMALL LETTER P
 <U24AB> <S0070>;<BASE>;<COMPAT>;<U24AB> % PARENTHESIZED LATIN SMALL LETTER P
@@ -65262,7 +65278,6 @@ endif
 <U0278> <S0278>;<BASE>;<MIN>;<U0278> % LATIN SMALL LETTER PHI
 <U1DB2> <S0278>;<BASE>;<MNN>;<U1DB2> % MODIFIER LETTER SMALL PHI
 <U2C77> <S2C77>;<BASE>;<MIN>;<U2C77> % LATIN SMALL LETTER TAILLESS PHI
-<U0071> <S0071>;<BASE>;<MIN>;<U0071> % LATIN SMALL LETTER Q
 <UFF51> <S0071>;<BASE>;<WIDE>;<UFF51> % FULLWIDTH LATIN SMALL LETTER Q
 <U24AC> <S0071>;<BASE>;<COMPAT>;<U24AC> % PARENTHESIZED LATIN SMALL LETTER Q
 <U0001D42A> <S0071>;<BASE>;<FONT>;<U0001D42A> % MATHEMATICAL BOLD SMALL Q
@@ -65285,7 +65300,6 @@ endif
 <U02A0> <S02A0>;<BASE>;<MIN>;<U02A0> % LATIN SMALL LETTER Q WITH HOOK
 <U024B> <S024B>;<BASE>;<MIN>;<U024B> % LATIN SMALL LETTER Q WITH HOOK TAIL
 <U0138> <S0138>;<BASE>;<MIN>;<U0138> % LATIN SMALL LETTER KRA
-<U0072> <S0072>;<BASE>;<MIN>;<U0072> % LATIN SMALL LETTER R
 <UFF52> <S0072>;<BASE>;<WIDE>;<UFF52> % FULLWIDTH LATIN SMALL LETTER R
 <U036C> <S0072>;<BASE>;<COMPAT>;<U036C> % COMBINING LATIN SMALL LETTER R
 <U1DCA> <S0072>;<BASE>;<COMPAT>;<U1DCA> % COMBINING LATIN SMALL LETTER R BELOW
@@ -65354,7 +65368,6 @@ endif
 <UA775> <SA775>;<BASE>;<MIN>;<UA775> % LATIN SMALL LETTER RUM
 <UA776> <SA776>;<BASE>;<MIN>;<UA776> % LATIN LETTER SMALL CAPITAL RUM
 <UA75D> <SA75D>;<BASE>;<MIN>;<UA75D> % LATIN SMALL LETTER RUM ROTUNDA
-<U0073> <S0073>;<BASE>;<MIN>;<U0073> % LATIN SMALL LETTER S
 <UFF53> <S0073>;<BASE>;<WIDE>;<UFF53> % FULLWIDTH LATIN SMALL LETTER S
 <U1DE4> <S0073>;<BASE>;<COMPAT>;<U1DE4> % COMBINING LATIN SMALL LETTER S
 <U24AE> <S0073>;<BASE>;<COMPAT>;<U24AE> % PARENTHESIZED LATIN SMALL LETTER S
@@ -65417,7 +65430,6 @@ endif
 <U0285> <S0285>;<BASE>;<MIN>;<U0285> % LATIN SMALL LETTER SQUAT REVERSED ESH
 <U1D98> <S1D98>;<BASE>;<MIN>;<U1D98> % LATIN SMALL LETTER ESH WITH RETROFLEX HOOK
 <U0286> <S0286>;<BASE>;<MIN>;<U0286> % LATIN SMALL LETTER ESH WITH CURL
-<U0074> <S0074>;<BASE>;<MIN>;<U0074> % LATIN SMALL LETTER T
 <UFF54> <S0074>;<BASE>;<WIDE>;<UFF54> % FULLWIDTH LATIN SMALL LETTER T
 <U036D> <S0074>;<BASE>;<COMPAT>;<U036D> % COMBINING LATIN SMALL LETTER T
 <U24AF> <S0074>;<BASE>;<COMPAT>;<U24AF> % PARENTHESIZED LATIN SMALL LETTER T
@@ -65467,7 +65479,6 @@ endif
 <U0236> <S0236>;<BASE>;<MIN>;<U0236> % LATIN SMALL LETTER T WITH CURL
 <UA777> <SA777>;<BASE>;<MIN>;<UA777> % LATIN SMALL LETTER TUM
 <U0287> <S0287>;<BASE>;<MIN>;<U0287> % LATIN SMALL LETTER TURNED T
-<U0075> <S0075>;<BASE>;<MIN>;<U0075> % LATIN SMALL LETTER U
 <UFF55> <S0075>;<BASE>;<WIDE>;<UFF55> % FULLWIDTH LATIN SMALL LETTER U
 <U0367> <S0075>;<BASE>;<COMPAT>;<U0367> % COMBINING LATIN SMALL LETTER U
 <U24B0> <S0075>;<BASE>;<COMPAT>;<U24B0> % PARENTHESIZED LATIN SMALL LETTER U
@@ -65552,7 +65563,6 @@ endif
 <U028A> <S028A>;<BASE>;<MIN>;<U028A> % LATIN SMALL LETTER UPSILON
 <U1DB7> <S028A>;<BASE>;<MNN>;<U1DB7> % MODIFIER LETTER SMALL UPSILON
 <U1D7F> <S1D7F>;<BASE>;<MIN>;<U1D7F> % LATIN SMALL LETTER UPSILON WITH STROKE
-<U0076> <S0076>;<BASE>;<MIN>;<U0076> % LATIN SMALL LETTER V
 <UFF56> <S0076>;<BASE>;<WIDE>;<UFF56> % FULLWIDTH LATIN SMALL LETTER V
 <U036E> <S0076>;<BASE>;<COMPAT>;<U036E> % COMBINING LATIN SMALL LETTER V
 <U2174> <S0076>;<BASE>;<COMPAT>;<U2174> % SMALL ROMAN NUMERAL FIVE
@@ -65593,7 +65603,6 @@ endif
 <U1EFD> <S1EFD>;<BASE>;<MIN>;<U1EFD> % LATIN SMALL LETTER MIDDLE-WELSH V
 <U028C> <S028C>;<BASE>;<MIN>;<U028C> % LATIN SMALL LETTER TURNED V
 <U1DBA> <S028C>;<BASE>;<MNN>;<U1DBA> % MODIFIER LETTER SMALL TURNED V
-<U0077> <S0077>;<BASE>;<MIN>;<U0077> % LATIN SMALL LETTER W
 <UFF57> <S0077>;<BASE>;<WIDE>;<UFF57> % FULLWIDTH LATIN SMALL LETTER W
 <U1DF1> <S0077>;<BASE>;<COMPAT>;<U1DF1> % COMBINING LATIN SMALL LETTER W
 <U24B2> <S0077>;<BASE>;<COMPAT>;<U24B2> % PARENTHESIZED LATIN SMALL LETTER W
@@ -65627,7 +65636,6 @@ endif
 <U1D21> <S1D21>;<BASE>;<MIN>;<U1D21> % LATIN LETTER SMALL CAPITAL W
 <U2C73> <S2C73>;<BASE>;<MIN>;<U2C73> % LATIN SMALL LETTER W WITH HOOK
 <U028D> <S028D>;<BASE>;<MIN>;<U028D> % LATIN SMALL LETTER TURNED W
-<U0078> <S0078>;<BASE>;<MIN>;<U0078> % LATIN SMALL LETTER X
 <UFF58> <S0078>;<BASE>;<WIDE>;<UFF58> % FULLWIDTH LATIN SMALL LETTER X
 <U036F> <S0078>;<BASE>;<COMPAT>;<U036F> % COMBINING LATIN SMALL LETTER X
 <U2179> <S0078>;<BASE>;<COMPAT>;<U2179> % SMALL ROMAN NUMERAL TEN
@@ -65660,7 +65668,6 @@ endif
 <UAB53> <SAB53>;<BASE>;<MIN>;<UAB53> % LATIN SMALL LETTER CHI
 <UAB54> <SAB54>;<BASE>;<MIN>;<UAB54> % LATIN SMALL LETTER CHI WITH LOW RIGHT RING
 <UAB55> <SAB55>;<BASE>;<MIN>;<UAB55> % LATIN SMALL LETTER CHI WITH LOW LEFT SERIF
-<U0079> <S0079>;<BASE>;<MIN>;<U0079> % LATIN SMALL LETTER Y
 <UFF59> <S0079>;<BASE>;<WIDE>;<UFF59> % FULLWIDTH LATIN SMALL LETTER Y
 <U24B4> <S0079>;<BASE>;<COMPAT>;<U24B4> % PARENTHESIZED LATIN SMALL LETTER Y
 <U0001D432> <S0079>;<BASE>;<FONT>;<U0001D432> % MATHEMATICAL BOLD SMALL Y
@@ -65694,7 +65701,6 @@ endif
 <U1EFF> <S1EFF>;<BASE>;<MIN>;<U1EFF> % LATIN SMALL LETTER Y WITH LOOP
 <UAB5A> <SAB5A>;<BASE>;<MIN>;<UAB5A> % LATIN SMALL LETTER Y WITH SHORT RIGHT LEG
 <U021D> <S021D>;<BASE>;<MIN>;<U021D> % LATIN SMALL LETTER YOGH
-<U007A> <S007A>;<BASE>;<MIN>;<U007A> % LATIN SMALL LETTER Z
 <UFF5A> <S007A>;<BASE>;<WIDE>;<UFF5A> % FULLWIDTH LATIN SMALL LETTER Z
 <U1DE6> <S007A>;<BASE>;<COMPAT>;<U1DE6> % COMBINING LATIN SMALL LETTER Z
 <U24B5> <S007A>;<BASE>;<COMPAT>;<U24B5> % PARENTHESIZED LATIN SMALL LETTER Z
@@ -65796,7 +65802,35 @@ endif
 <U0001D736> <S03B1>;<BASE>;<FONT>;<U0001D736> % MATHEMATICAL BOLD ITALIC SMALL ALPHA
 <U0001D770> <S03B1>;<BASE>;<FONT>;<U0001D770> % MATHEMATICAL SANS-SERIF BOLD SMALL ALPHA
 <U0001D7AA> <S03B1>;<BASE>;<FONT>;<U0001D7AA> % MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL ALPHA
+% Implement rational range for [A-Z] in regular expressions.
+% We order the collation element order to support rational ranges.
+% Collation is unaffected because the 4-level weights remain the same.
 <U0041> <S0061>;<BASE>;<CAP>;<U0041> % LATIN CAPITAL LETTER A
+<U0042> <S0062>;<BASE>;<CAP>;<U0042> % LATIN CAPITAL LETTER B
+<U0043> <S0063>;<BASE>;<CAP>;<U0043> % LATIN CAPITAL LETTER C
+<U0044> <S0064>;<BASE>;<CAP>;<U0044> % LATIN CAPITAL LETTER D
+<U0045> <S0065>;<BASE>;<CAP>;<U0045> % LATIN CAPITAL LETTER E
+<U0046> <S0066>;<BASE>;<CAP>;<U0046> % LATIN CAPITAL LETTER F
+<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G
+<U0048> <S0068>;<BASE>;<CAP>;<U0048> % LATIN CAPITAL LETTER H
+<U0049> <S0069>;<BASE>;<CAP>;<U0049> % LATIN CAPITAL LETTER I
+<U004A> <S006A>;<BASE>;<CAP>;<U004A> % LATIN CAPITAL LETTER J
+<U004B> <S006B>;<BASE>;<CAP>;<U004B> % LATIN CAPITAL LETTER K
+<U004C> <S006C>;<BASE>;<CAP>;<U004C> % LATIN CAPITAL LETTER L
+<U004D> <S006D>;<BASE>;<CAP>;<U004D> % LATIN CAPITAL LETTER M
+<U004E> <S006E>;<BASE>;<CAP>;<U004E> % LATIN CAPITAL LETTER N
+<U004F> <S006F>;<BASE>;<CAP>;<U004F> % LATIN CAPITAL LETTER O
+<U0050> <S0070>;<BASE>;<CAP>;<U0050> % LATIN CAPITAL LETTER P
+<U0051> <S0071>;<BASE>;<CAP>;<U0051> % LATIN CAPITAL LETTER Q
+<U0052> <S0072>;<BASE>;<CAP>;<U0052> % LATIN CAPITAL LETTER R
+<U0053> <S0073>;<BASE>;<CAP>;<U0053> % LATIN CAPITAL LETTER S
+<U0054> <S0074>;<BASE>;<CAP>;<U0054> % LATIN CAPITAL LETTER T
+<U0055> <S0075>;<BASE>;<CAP>;<U0055> % LATIN CAPITAL LETTER U
+<U0056> <S0076>;<BASE>;<CAP>;<U0056> % LATIN CAPITAL LETTER V
+<U0057> <S0077>;<BASE>;<CAP>;<U0057> % LATIN CAPITAL LETTER W
+<U0058> <S0078>;<BASE>;<CAP>;<U0058> % LATIN CAPITAL LETTER X
+<U0059> <S0079>;<BASE>;<CAP>;<U0059> % LATIN CAPITAL LETTER Y
+<U005A> <S007A>;<BASE>;<CAP>;<U005A> % LATIN CAPITAL LETTER Z
 <UFF21> <S0061>;<BASE>;<WIDECAP>;<UFF21> % FULLWIDTH LATIN CAPITAL LETTER A
 <U0001F110> <S0061>;<BASE>;<COMPATCAP>;<U0001F110> % PARENTHESIZED LATIN CAPITAL LETTER A
 <U0001D400> <S0061>;<BASE>;<FONTCAP>;<U0001D400> % MATHEMATICAL BOLD CAPITAL A
@@ -65860,7 +65894,6 @@ endif
 <U2C6F> <S0250>;<BASE>;<CAP>;<U2C6F> % LATIN CAPITAL LETTER TURNED A
 <U2C6D> <S0251>;<BASE>;<CAP>;<U2C6D> % LATIN CAPITAL LETTER ALPHA
 <U2C70> <S0252>;<BASE>;<CAP>;<U2C70> % LATIN CAPITAL LETTER TURNED ALPHA
-<U0042> <S0062>;<BASE>;<CAP>;<U0042> % LATIN CAPITAL LETTER B
 <UFF22> <S0062>;<BASE>;<WIDECAP>;<UFF22> % FULLWIDTH LATIN CAPITAL LETTER B
 <U0001F111> <S0062>;<BASE>;<COMPATCAP>;<U0001F111> % PARENTHESIZED LATIN CAPITAL LETTER B
 <U212C> <S0062>;<BASE>;<FONTCAP>;<U212C> % SCRIPT CAPITAL B
@@ -65888,7 +65921,6 @@ endif
 <U0181> <S0253>;<BASE>;<CAP>;<U0181> % LATIN CAPITAL LETTER B WITH HOOK
 <U0182> <S0183>;<BASE>;<CAP>;<U0182> % LATIN CAPITAL LETTER B WITH TOPBAR
 <UA7B4> <SA7B5>;<BASE>;<CAP>;<UA7B4> % LATIN CAPITAL LETTER BETA
-<U0043> <S0063>;<BASE>;<CAP>;<U0043> % LATIN CAPITAL LETTER C
 <UFF23> <S0063>;<BASE>;<WIDECAP>;<UFF23> % FULLWIDTH LATIN CAPITAL LETTER C
 <U216D> <S0063>;<BASE>;<COMPATCAP>;<U216D> % ROMAN NUMERAL ONE HUNDRED
 <U0001F112> <S0063>;<BASE>;<COMPATCAP>;<U0001F112> % PARENTHESIZED LATIN CAPITAL LETTER C
@@ -65921,7 +65953,6 @@ endif
 <U0187> <S0188>;<BASE>;<CAP>;<U0187> % LATIN CAPITAL LETTER C WITH HOOK
 <U2183> <S2184>;<BASE>;<CAP>;<U2183> % ROMAN NUMERAL REVERSED ONE HUNDRED
 <UA73E> <SA73F>;<BASE>;<CAP>;<UA73E> % LATIN CAPITAL LETTER REVERSED C WITH DOT
-<U0044> <S0064>;<BASE>;<CAP>;<U0044> % LATIN CAPITAL LETTER D
 <UFF24> <S0064>;<BASE>;<WIDECAP>;<UFF24> % FULLWIDTH LATIN CAPITAL LETTER D
 <U216E> <S0064>;<BASE>;<COMPATCAP>;<U216E> % ROMAN NUMERAL FIVE HUNDRED
 <U0001F113> <S0064>;<BASE>;<COMPATCAP>;<U0001F113> % PARENTHESIZED LATIN CAPITAL LETTER D
@@ -65959,7 +65990,6 @@ endif
 <U0189> <S0256>;<BASE>;<CAP>;<U0189> % LATIN CAPITAL LETTER AFRICAN D
 <U018A> <S0257>;<BASE>;<CAP>;<U018A> % LATIN CAPITAL LETTER D WITH HOOK
 <U018B> <S018C>;<BASE>;<CAP>;<U018B> % LATIN CAPITAL LETTER D WITH TOPBAR
-<U0045> <S0065>;<BASE>;<CAP>;<U0045> % LATIN CAPITAL LETTER E
 <UFF25> <S0065>;<BASE>;<WIDECAP>;<UFF25> % FULLWIDTH LATIN CAPITAL LETTER E
 <U0001F114> <S0065>;<BASE>;<COMPATCAP>;<U0001F114> % PARENTHESIZED LATIN CAPITAL LETTER E
 <U2130> <S0065>;<BASE>;<FONTCAP>;<U2130> % SCRIPT CAPITAL E
@@ -66010,7 +66040,6 @@ endif
 <U0190> <S025B>;<BASE>;<CAP>;<U0190> % LATIN CAPITAL LETTER OPEN E
 <U2107> <S025B>;<BASE>;<COMPATCAP>;<U2107> % EULER CONSTANT
 <UA7AB> <S025C>;<BASE>;<CAP>;<UA7AB> % LATIN CAPITAL LETTER REVERSED OPEN E
-<U0046> <S0066>;<BASE>;<CAP>;<U0046> % LATIN CAPITAL LETTER F
 <UFF26> <S0066>;<BASE>;<WIDECAP>;<UFF26> % FULLWIDTH LATIN CAPITAL LETTER F
 <U0001F115> <S0066>;<BASE>;<COMPATCAP>;<U0001F115> % PARENTHESIZED LATIN CAPITAL LETTER F
 <U2131> <S0066>;<BASE>;<FONTCAP>;<U2131> % SCRIPT CAPITAL F
@@ -66035,7 +66064,6 @@ endif
 <UA798> <SA799>;<BASE>;<CAP>;<UA798> % LATIN CAPITAL LETTER F WITH STROKE
 <U0191> <S0192>;<BASE>;<CAP>;<U0191> % LATIN CAPITAL LETTER F WITH HOOK
 <U2132> <S214E>;<BASE>;<CAP>;<U2132> % TURNED CAPITAL F
-<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G
 <UFF27> <S0067>;<BASE>;<WIDECAP>;<UFF27> % FULLWIDTH LATIN CAPITAL LETTER G
 <U0001F116> <S0067>;<BASE>;<COMPATCAP>;<U0001F116> % PARENTHESIZED LATIN CAPITAL LETTER G
 <U0001D406> <S0067>;<BASE>;<FONTCAP>;<U0001D406> % MATHEMATICAL BOLD CAPITAL G
@@ -66071,7 +66099,6 @@ endif
 <UA77E> <SA77F>;<BASE>;<CAP>;<UA77E> % LATIN CAPITAL LETTER TURNED INSULAR G
 <U0194> <S0263>;<BASE>;<CAP>;<U0194> % LATIN CAPITAL LETTER GAMMA
 <U01A2> <S01A3>;<BASE>;<CAP>;<U01A2> % LATIN CAPITAL LETTER OI
-<U0048> <S0068>;<BASE>;<CAP>;<U0048> % LATIN CAPITAL LETTER H
 <UFF28> <S0068>;<BASE>;<WIDECAP>;<UFF28> % FULLWIDTH LATIN CAPITAL LETTER H
 <U0001F117> <S0068>;<BASE>;<COMPATCAP>;<U0001F117> % PARENTHESIZED LATIN CAPITAL LETTER H
 <U210B> <S0068>;<BASE>;<FONTCAP>;<U210B> % SCRIPT CAPITAL H
@@ -66104,7 +66131,6 @@ endif
 <U2C67> <S2C68>;<BASE>;<CAP>;<U2C67> % LATIN CAPITAL LETTER H WITH DESCENDER
 <U2C75> <S2C76>;<BASE>;<CAP>;<U2C75> % LATIN CAPITAL LETTER HALF H
 <UA726> <SA727>;<BASE>;<CAP>;<UA726> % LATIN CAPITAL LETTER HENG
-<U0049> <S0069>;<BASE>;<CAP>;<U0049> % LATIN CAPITAL LETTER I
 <UFF29> <S0069>;<BASE>;<WIDECAP>;<UFF29> % FULLWIDTH LATIN CAPITAL LETTER I
 <U2160> <S0069>;<BASE>;<COMPATCAP>;<U2160> % ROMAN NUMERAL ONE
 <U0001F118> <S0069>;<BASE>;<COMPATCAP>;<U0001F118> % PARENTHESIZED LATIN CAPITAL LETTER I
@@ -66149,7 +66175,6 @@ endif
 <UA7AE> <S026A>;<BASE>;<CAP>;<UA7AE> % LATIN CAPITAL LETTER SMALL CAPITAL I
 <U0197> <S0268>;<BASE>;<CAP>;<U0197> % LATIN CAPITAL LETTER I WITH STROKE
 <U0196> <S0269>;<BASE>;<CAP>;<U0196> % LATIN CAPITAL LETTER IOTA
-<U004A> <S006A>;<BASE>;<CAP>;<U004A> % LATIN CAPITAL LETTER J
 <UFF2A> <S006A>;<BASE>;<WIDECAP>;<UFF2A> % FULLWIDTH LATIN CAPITAL LETTER J
 <U0001F119> <S006A>;<BASE>;<COMPATCAP>;<U0001F119> % PARENTHESIZED LATIN CAPITAL LETTER J
 <U0001D409> <S006A>;<BASE>;<FONTCAP>;<U0001D409> % MATHEMATICAL BOLD CAPITAL J
@@ -66172,7 +66197,6 @@ endif
 <U0134> <S006A>;"<BASE><CIRCF>";"<CAP><MIN>";<U0134> % LATIN CAPITAL LETTER J WITH CIRCUMFLEX
 <U0248> <S0249>;<BASE>;<CAP>;<U0248> % LATIN CAPITAL LETTER J WITH STROKE
 <UA7B2> <S029D>;<BASE>;<CAP>;<UA7B2> % LATIN CAPITAL LETTER J WITH CROSSED-TAIL
-<U004B> <S006B>;<BASE>;<CAP>;<U004B> % LATIN CAPITAL LETTER K
 <U212A> <S006B>;<BASE>;<CAP>;<U212A> % KELVIN SIGN
 <UFF2B> <S006B>;<BASE>;<WIDECAP>;<UFF2B> % FULLWIDTH LATIN CAPITAL LETTER K
 <U0001F11A> <S006B>;<BASE>;<COMPATCAP>;<U0001F11A> % PARENTHESIZED LATIN CAPITAL LETTER K
@@ -66206,7 +66230,6 @@ endif
 <UA742> <SA743>;<BASE>;<CAP>;<UA742> % LATIN CAPITAL LETTER K WITH DIAGONAL STROKE
 <UA744> <SA745>;<BASE>;<CAP>;<UA744> % LATIN CAPITAL LETTER K WITH STROKE AND DIAGONAL STROKE
 <UA7B0> <S029E>;<BASE>;<CAP>;<UA7B0> % LATIN CAPITAL LETTER TURNED K
-<U004C> <S006C>;<BASE>;<CAP>;<U004C> % LATIN CAPITAL LETTER L
 <UFF2C> <S006C>;<BASE>;<WIDECAP>;<UFF2C> % FULLWIDTH LATIN CAPITAL LETTER L
 <U216C> <S006C>;<BASE>;<COMPATCAP>;<U216C> % ROMAN NUMERAL FIFTY
 <U0001F11B> <S006C>;<BASE>;<COMPATCAP>;<U0001F11B> % PARENTHESIZED LATIN CAPITAL LETTER L
@@ -66249,7 +66272,6 @@ endif
 <U2C62> <S026B>;<BASE>;<CAP>;<U2C62> % LATIN CAPITAL LETTER L WITH MIDDLE TILDE
 <UA7AD> <S026C>;<BASE>;<CAP>;<UA7AD> % LATIN CAPITAL LETTER L WITH BELT
 <UA780> <SA781>;<BASE>;<CAP>;<UA780> % LATIN CAPITAL LETTER TURNED L
-<U004D> <S006D>;<BASE>;<CAP>;<U004D> % LATIN CAPITAL LETTER M
 <UFF2D> <S006D>;<BASE>;<WIDECAP>;<UFF2D> % FULLWIDTH LATIN CAPITAL LETTER M
 <U216F> <S006D>;<BASE>;<COMPATCAP>;<U216F> % ROMAN NUMERAL ONE THOUSAND
 <U0001F11C> <S006D>;<BASE>;<COMPATCAP>;<U0001F11C> % PARENTHESIZED LATIN CAPITAL LETTER M
@@ -66275,7 +66297,6 @@ endif
 <U1E42> <S006D>;"<BASE><POINS>";"<CAP><MIN>";<U1E42> % LATIN CAPITAL LETTER M WITH DOT BELOW
 <U1DDF> <S1D0D>;<BASE>;<COMPAT>;<U1DDF> % COMBINING LATIN LETTER SMALL CAPITAL M
 <U2C6E> <S0271>;<BASE>;<CAP>;<U2C6E> % LATIN CAPITAL LETTER M WITH HOOK
-<U004E> <S006E>;<BASE>;<CAP>;<U004E> % LATIN CAPITAL LETTER N
 <UFF2E> <S006E>;<BASE>;<WIDECAP>;<UFF2E> % FULLWIDTH LATIN CAPITAL LETTER N
 <U0001F11D> <S006E>;<BASE>;<COMPATCAP>;<U0001F11D> % PARENTHESIZED LATIN CAPITAL LETTER N
 <U2115> <S006E>;<BASE>;<FONTCAP>;<U2115> % DOUBLE-STRUCK CAPITAL N
@@ -66312,7 +66333,6 @@ endif
 <U0220> <S019E>;<BASE>;<CAP>;<U0220> % LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
 <UA790> <SA791>;<BASE>;<CAP>;<UA790> % LATIN CAPITAL LETTER N WITH DESCENDER
 <U014A> <S014B>;<BASE>;<CAP>;<U014A> % LATIN CAPITAL LETTER ENG
-<U004F> <S006F>;<BASE>;<CAP>;<U004F> % LATIN CAPITAL LETTER O
 <UFF2F> <S006F>;<BASE>;<WIDECAP>;<UFF2F> % FULLWIDTH LATIN CAPITAL LETTER O
 <U0001F11E> <S006F>;<BASE>;<COMPATCAP>;<U0001F11E> % PARENTHESIZED LATIN CAPITAL LETTER O
 <U0001D40E> <S006F>;<BASE>;<FONTCAP>;<U0001D40E> % MATHEMATICAL BOLD CAPITAL O
@@ -66377,7 +66397,6 @@ endif
 <UA74A> <SA74B>;<BASE>;<CAP>;<UA74A> % LATIN CAPITAL LETTER O WITH LONG STROKE OVERLAY
 <UA7B6> <SA7B7>;<BASE>;<CAP>;<UA7B6> % LATIN CAPITAL LETTER OMEGA
 <U0222> <S0223>;<BASE>;<CAP>;<U0222> % LATIN CAPITAL LETTER OU
-<U0050> <S0070>;<BASE>;<CAP>;<U0050> % LATIN CAPITAL LETTER P
 <UFF30> <S0070>;<BASE>;<WIDECAP>;<UFF30> % FULLWIDTH LATIN CAPITAL LETTER P
 <U0001F11F> <S0070>;<BASE>;<COMPATCAP>;<U0001F11F> % PARENTHESIZED LATIN CAPITAL LETTER P
 <U2119> <S0070>;<BASE>;<FONTCAP>;<U2119> % DOUBLE-STRUCK CAPITAL P
@@ -66405,7 +66424,6 @@ endif
 <U01A4> <S01A5>;<BASE>;<CAP>;<U01A4> % LATIN CAPITAL LETTER P WITH HOOK
 <UA752> <SA753>;<BASE>;<CAP>;<UA752> % LATIN CAPITAL LETTER P WITH FLOURISH
 <UA754> <SA755>;<BASE>;<CAP>;<UA754> % LATIN CAPITAL LETTER P WITH SQUIRREL TAIL
-<U0051> <S0071>;<BASE>;<CAP>;<U0051> % LATIN CAPITAL LETTER Q
 <UFF31> <S0071>;<BASE>;<WIDECAP>;<UFF31> % FULLWIDTH LATIN CAPITAL LETTER Q
 <U0001F120> <S0071>;<BASE>;<COMPATCAP>;<U0001F120> % PARENTHESIZED LATIN CAPITAL LETTER Q
 <U211A> <S0071>;<BASE>;<FONTCAP>;<U211A> % DOUBLE-STRUCK CAPITAL Q
@@ -66428,7 +66446,6 @@ endif
 <UA756> <SA757>;<BASE>;<CAP>;<UA756> % LATIN CAPITAL LETTER Q WITH STROKE THROUGH DESCENDER
 <UA758> <SA759>;<BASE>;<CAP>;<UA758> % LATIN CAPITAL LETTER Q WITH DIAGONAL STROKE
 <U024A> <S024B>;<BASE>;<CAP>;<U024A> % LATIN CAPITAL LETTER SMALL Q WITH HOOK TAIL
-<U0052> <S0072>;<BASE>;<CAP>;<U0052> % LATIN CAPITAL LETTER R
 <UFF32> <S0072>;<BASE>;<WIDECAP>;<UFF32> % FULLWIDTH LATIN CAPITAL LETTER R
 <U0001F121> <S0072>;<BASE>;<COMPATCAP>;<U0001F121> % PARENTHESIZED LATIN CAPITAL LETTER R
 <U211B> <S0072>;<BASE>;<FONTCAP>;<U211B> % SCRIPT CAPITAL R
@@ -66466,7 +66483,6 @@ endif
 <U024C> <S024D>;<BASE>;<CAP>;<U024C> % LATIN CAPITAL LETTER R WITH STROKE
 <U2C64> <S027D>;<BASE>;<CAP>;<U2C64> % LATIN CAPITAL LETTER R WITH TAIL
 <UA75C> <SA75D>;<BASE>;<CAP>;<UA75C> % LATIN CAPITAL LETTER RUM ROTUNDA
-<U0053> <S0073>;<BASE>;<CAP>;<U0053> % LATIN CAPITAL LETTER S
 <UFF33> <S0073>;<BASE>;<WIDECAP>;<UFF33> % FULLWIDTH LATIN CAPITAL LETTER S
 <U0001F122> <S0073>;<BASE>;<COMPATCAP>;<U0001F122> % PARENTHESIZED LATIN CAPITAL LETTER S
 <U0001F12A> <S0073>;<BASE>;<COMPATCAP>;<U0001F12A> % TORTOISE SHELL BRACKETED LATIN CAPITAL LETTER S
@@ -66502,7 +66518,6 @@ endif
 <U1E9E> "<S0073><S0073>";"<BASE><VRNT1><BASE>";"<COMPATCAP><COMPAT><COMPATCAP>";<U1E9E> % LATIN CAPITAL LETTER SHARP S
 <U2C7E> <S023F>;<BASE>;<CAP>;<U2C7E> % LATIN CAPITAL LETTER S WITH SWASH TAIL
 <U01A9> <S0283>;<BASE>;<CAP>;<U01A9> % LATIN CAPITAL LETTER ESH
-<U0054> <S0074>;<BASE>;<CAP>;<U0054> % LATIN CAPITAL LETTER T
 <UFF34> <S0074>;<BASE>;<WIDECAP>;<UFF34> % FULLWIDTH LATIN CAPITAL LETTER T
 <U0001F123> <S0074>;<BASE>;<COMPATCAP>;<U0001F123> % PARENTHESIZED LATIN CAPITAL LETTER T
 <U0001D413> <S0074>;<BASE>;<FONTCAP>;<U0001D413> % MATHEMATICAL BOLD CAPITAL T
@@ -66536,7 +66551,6 @@ endif
 <U01AC> <S01AD>;<BASE>;<CAP>;<U01AC> % LATIN CAPITAL LETTER T WITH HOOK
 <U01AE> <S0288>;<BASE>;<CAP>;<U01AE> % LATIN CAPITAL LETTER T WITH RETROFLEX HOOK
 <UA7B1> <S0287>;<BASE>;<CAP>;<UA7B1> % LATIN CAPITAL LETTER TURNED T
-<U0055> <S0075>;<BASE>;<CAP>;<U0055> % LATIN CAPITAL LETTER U
 <UFF35> <S0075>;<BASE>;<WIDECAP>;<UFF35> % FULLWIDTH LATIN CAPITAL LETTER U
 <U0001F124> <S0075>;<BASE>;<COMPATCAP>;<U0001F124> % PARENTHESIZED LATIN CAPITAL LETTER U
 <U0001D414> <S0075>;<BASE>;<FONTCAP>;<U0001D414> % MATHEMATICAL BOLD CAPITAL U
@@ -66591,7 +66605,6 @@ endif
 <UA78D> <S0265>;<BASE>;<CAP>;<UA78D> % LATIN CAPITAL LETTER TURNED H
 <U019C> <S026F>;<BASE>;<CAP>;<U019C> % LATIN CAPITAL LETTER TURNED M
 <U01B1> <S028A>;<BASE>;<CAP>;<U01B1> % LATIN CAPITAL LETTER UPSILON
-<U0056> <S0076>;<BASE>;<CAP>;<U0056> % LATIN CAPITAL LETTER V
 <UFF36> <S0076>;<BASE>;<WIDECAP>;<UFF36> % FULLWIDTH LATIN CAPITAL LETTER V
 <U2164> <S0076>;<BASE>;<COMPATCAP>;<U2164> % ROMAN NUMERAL FIVE
 <U0001F125> <S0076>;<BASE>;<COMPATCAP>;<U0001F125> % PARENTHESIZED LATIN CAPITAL LETTER V
@@ -66622,7 +66635,6 @@ endif
 <U01B2> <S028B>;<BASE>;<CAP>;<U01B2> % LATIN CAPITAL LETTER V WITH HOOK
 <U1EFC> <S1EFD>;<BASE>;<CAP>;<U1EFC> % LATIN CAPITAL LETTER MIDDLE-WELSH V
 <U0245> <S028C>;<BASE>;<CAP>;<U0245> % LATIN CAPITAL LETTER TURNED V
-<U0057> <S0077>;<BASE>;<CAP>;<U0057> % LATIN CAPITAL LETTER W
 <UFF37> <S0077>;<BASE>;<WIDECAP>;<UFF37> % FULLWIDTH LATIN CAPITAL LETTER W
 <U0001F126> <S0077>;<BASE>;<COMPATCAP>;<U0001F126> % PARENTHESIZED LATIN CAPITAL LETTER W
 <U0001D416> <S0077>;<BASE>;<FONTCAP>;<U0001D416> % MATHEMATICAL BOLD CAPITAL W
@@ -66649,7 +66661,6 @@ endif
 <U1E86> <S0077>;"<BASE><POINT>";"<CAP><MIN>";<U1E86> % LATIN CAPITAL LETTER W WITH DOT ABOVE
 <U1E88> <S0077>;"<BASE><POINS>";"<CAP><MIN>";<U1E88> % LATIN CAPITAL LETTER W WITH DOT BELOW
 <U2C72> <S2C73>;<BASE>;<CAP>;<U2C72> % LATIN CAPITAL LETTER W WITH HOOK
-<U0058> <S0078>;<BASE>;<CAP>;<U0058> % LATIN CAPITAL LETTER X
 <UFF38> <S0078>;<BASE>;<WIDECAP>;<UFF38> % FULLWIDTH LATIN CAPITAL LETTER X
 <U2169> <S0078>;<BASE>;<COMPATCAP>;<U2169> % ROMAN NUMERAL TEN
 <U0001F127> <S0078>;<BASE>;<COMPATCAP>;<U0001F127> % PARENTHESIZED LATIN CAPITAL LETTER X
@@ -66675,7 +66686,6 @@ endif
 <U216A> "<S0078><S0069>";"<BASE><BASE>";"<COMPATCAP><COMPATCAP>";<U216A> % ROMAN NUMERAL ELEVEN
 <U216B> "<S0078><S0069><S0069>";"<BASE><BASE><BASE>";"<COMPATCAP><COMPATCAP><COMPATCAP>";<U216B> % ROMAN NUMERAL TWELVE
 <UA7B3> <SAB53>;<BASE>;<CAP>;<UA7B3> % LATIN CAPITAL LETTER CHI
-<U0059> <S0079>;<BASE>;<CAP>;<U0059> % LATIN CAPITAL LETTER Y
 <UFF39> <S0079>;<BASE>;<WIDECAP>;<UFF39> % FULLWIDTH LATIN CAPITAL LETTER Y
 <U0001F128> <S0079>;<BASE>;<COMPATCAP>;<U0001F128> % PARENTHESIZED LATIN CAPITAL LETTER Y
 <U0001D418> <S0079>;<BASE>;<FONTCAP>;<U0001D418> % MATHEMATICAL BOLD CAPITAL Y
@@ -66708,7 +66718,6 @@ endif
 <U01B3> <S01B4>;<BASE>;<CAP>;<U01B3> % LATIN CAPITAL LETTER Y WITH HOOK
 <U1EFE> <S1EFF>;<BASE>;<CAP>;<U1EFE> % LATIN CAPITAL LETTER Y WITH LOOP
 <U021C> <S021D>;<BASE>;<CAP>;<U021C> % LATIN CAPITAL LETTER YOGH
-<U005A> <S007A>;<BASE>;<CAP>;<U005A> % LATIN CAPITAL LETTER Z
 <UFF3A> <S007A>;<BASE>;<WIDECAP>;<UFF3A> % FULLWIDTH LATIN CAPITAL LETTER Z
 <U0001F129> <S007A>;<BASE>;<COMPATCAP>;<U0001F129> % PARENTHESIZED LATIN CAPITAL LETTER Z
 <U2124> <S007A>;<BASE>;<FONTCAP>;<U2124> % DOUBLE-STRUCK CAPITAL Z
diff --git a/localedata/locales/tr_TR b/localedata/locales/tr_TR
index f7c13ddf4b..7d5c9d878e 100644
--- a/localedata/locales/tr_TR
+++ b/localedata/locales/tr_TR
@@ -81,6 +81,8 @@ copy "iso14651_t1"
 %
 % The following rules implement the same order for glibc.
 
+% All of these collating symbols are used as primary weights
+% and cause equivalence class problems, see Bug 23437.
 collating-symbol <c-cedilla>
 collating-symbol <g-breve>
 collating-symbol <i-dotless>
@@ -111,8 +113,40 @@ reorder-after <AFTER-U>
 <U011F> <g-breve>;<BASE>;<MIN>;IGNORE % ğ
 <U011E> <g-breve>;<BASE>;<CAP>;IGNORE % Ğ
 <U0131> <i-dotless>;<BASE>;<MIN>;IGNORE % ı
+
+% tr_TR must copy the rational range definition here for CEO:
+% Implement rational range for [A-Z] in regular expressions.
+% We order the collation element order to support rational ranges.
+% Collation is unaffected because the 4-level weights remain the same.
+<U0041> <S0061>;<BASE>;<CAP>;<U0041> % LATIN CAPITAL LETTER A
+<U0042> <S0062>;<BASE>;<CAP>;<U0042> % LATIN CAPITAL LETTER B
+<U0043> <S0063>;<BASE>;<CAP>;<U0043> % LATIN CAPITAL LETTER C
+<U0044> <S0064>;<BASE>;<CAP>;<U0044> % LATIN CAPITAL LETTER D
+<U0045> <S0065>;<BASE>;<CAP>;<U0045> % LATIN CAPITAL LETTER E
+<U0046> <S0066>;<BASE>;<CAP>;<U0046> % LATIN CAPITAL LETTER F
+<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G
+<U0048> <S0068>;<BASE>;<CAP>;<U0048> % LATIN CAPITAL LETTER H
+% Turkish sorting of I, but within rational range.
+% FIXME: 'I' is no longer in the equivalence class of i's.
 <U0049> <i-dotless>;<BASE>;<CAP>;IGNORE % I
-<U0069> <S0069>;<BASE>;<MIN>;IGNORE % i
+<U004A> <S006A>;<BASE>;<CAP>;<U004A> % LATIN CAPITAL LETTER J
+<U004B> <S006B>;<BASE>;<CAP>;<U004B> % LATIN CAPITAL LETTER K
+<U004C> <S006C>;<BASE>;<CAP>;<U004C> % LATIN CAPITAL LETTER L
+<U004D> <S006D>;<BASE>;<CAP>;<U004D> % LATIN CAPITAL LETTER M
+<U004E> <S006E>;<BASE>;<CAP>;<U004E> % LATIN CAPITAL LETTER N
+<U004F> <S006F>;<BASE>;<CAP>;<U004F> % LATIN CAPITAL LETTER O
+<U0050> <S0070>;<BASE>;<CAP>;<U0050> % LATIN CAPITAL LETTER P
+<U0051> <S0071>;<BASE>;<CAP>;<U0051> % LATIN CAPITAL LETTER Q
+<U0052> <S0072>;<BASE>;<CAP>;<U0052> % LATIN CAPITAL LETTER R
+<U0053> <S0073>;<BASE>;<CAP>;<U0053> % LATIN CAPITAL LETTER S
+<U0054> <S0074>;<BASE>;<CAP>;<U0054> % LATIN CAPITAL LETTER T
+<U0055> <S0075>;<BASE>;<CAP>;<U0055> % LATIN CAPITAL LETTER U
+<U0056> <S0076>;<BASE>;<CAP>;<U0056> % LATIN CAPITAL LETTER V
+<U0057> <S0077>;<BASE>;<CAP>;<U0057> % LATIN CAPITAL LETTER W
+<U0058> <S0078>;<BASE>;<CAP>;<U0058> % LATIN CAPITAL LETTER X
+<U0059> <S0079>;<BASE>;<CAP>;<U0059> % LATIN CAPITAL LETTER Y
+<U005A> <S007A>;<BASE>;<CAP>;<U005A> % LATIN CAPITAL LETTER Z
+
 <U0130> <S0069>;<BASE>;<CAP>;IGNORE % İ
 <U00F6> <o-diaresis>;<BASE>;<MIN>;IGNORE % ö
 <U00D6> <o-diaresis>;<BASE>;<CAP>;IGNORE % Ö
diff --git a/posix/bug-regex17.c b/posix/bug-regex17.c
index 893b9654b8..341fe4d827 100644
--- a/posix/bug-regex17.c
+++ b/posix/bug-regex17.c
@@ -46,14 +46,25 @@ struct
     { { 2, 10 }, { -1, -1 } } },
 
   /* Tests for bug 9697:
+     Look for a multibyte sequence in a range. We pick the range based
+     on collation element order, since a-z is no longer valid since it's
+     a rational range.
+
+     We use U+FF53 FULLWIDTH LATIN SMALL LETTER S as the start of the
+     range, and U+33DC SQUARE SV as the end of the range.  These were
+     chosen by looking at collation element ordering and picking a range
+     in which the matching character was listed.
+
+     U+02E2	\xcb\xa2	MODIFIER LETTER SMALL S
      U+00DF	\xc3\x9f	LATIN SMALL LETTER SHARP S
      U+02DA	\xcb\x9a	RING ABOVE
-     U+02E2	\xcb\xa2	MODIFIER LETTER SMALL S  */
-  { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
+     
+     The U+02DA RING ABOVE is chosen because it's not in [ｓ-㏜].  */
+  { "[ｓ-㏜]|[^ｓ-㏜]", "\xcb\xa2", REG_EXTENDED, 2,
     { { 0, 2 }, { -1, -1 } } },
-  { "[a-z]", "\xc3\x9f", REG_EXTENDED, 2,
+  { "[ｓ-㏜]", "\xc3\x9f", REG_EXTENDED, 2,
     { { 0, 2 }, { -1, -1 } } },
-  { "[^a-z]", "\xcb\x9a", REG_EXTENDED, 2,
+  { "[^ｓ-㏜]", "\xcb\x9a", REG_EXTENDED, 2,
     { { 0, 2 }, { -1, -1 } } },
 };
 
diff --git a/posix/tst-fnmatch.input b/posix/tst-fnmatch.input
index dc2ca8d01a..2131d1e437 100644
--- a/posix/tst-fnmatch.input
+++ b/posix/tst-fnmatch.input
@@ -67,9 +67,11 @@
 # https://sourceware.org/bugzilla/show_bug.cgi?id=23393
 # https://sourceware.org/bugzilla/show_bug.cgi?id=23420
 #
-# No consensus exists on how best to handle the changes so the
-# iso14651_t1_common collation element order (CEO) has been changed to
-# deinterlace the a-z and A-Z regions.
+# The solution was to implement rational ranges by moving the collation
+# element order to fix this for [a-z], [A-Z], and [0-9]. Likewise the
+# upper and lower case letters are deinterlaced to allow for accented
+# ranges that don't include uppercase e.g. [a-ñ] should not include
+# any uppercase letters but may include a-z and more.
 #
 # With the deinterlacing commit ac3a3b4b0d561d776b60317d6a926050c8541655
 # could be reverted to re-test the correct non-interleaved expectations.
@@ -77,9 +79,7 @@
 # Please note that despite the region being deinterlaced, the ordering
 # of collation remains the same.  In glibc we implement CEO and because of
 # that we can reorder the elements to reorder ranges without impacting
-# collation which depends on weights.  The collation element ordering
-# could have been changed to include just a-z, A-Z, and 0-9 in three
-# distinct blocks, but this needs more discussion by the community.
+# collation which depends on weights.
 
 # B.6 004(C)
 C		 "!#%+,-./01234567889"	"!#%+,-./01234567889"  0
@@ -477,9 +477,9 @@ C		"-"			"[Z-\\]]"	       NOMATCH
 # handling of ranges and the recognition of character (vs bytes).
 de_DE.ISO-8859-1 "a"			"[a-z]"		       0
 de_DE.ISO-8859-1 "z"			"[a-z]"		       0
-de_DE.ISO-8859-1 "ä"			"[a-z]"		       0
-de_DE.ISO-8859-1 "ö"			"[a-z]"		       0
-de_DE.ISO-8859-1 "ü"			"[a-z]"		       0
+de_DE.ISO-8859-1 "ä"			"[a-z]"		       NOMATCH
+de_DE.ISO-8859-1 "ö"			"[a-z]"		       NOMATCH
+de_DE.ISO-8859-1 "ü"			"[a-z]"		       NOMATCH
 de_DE.ISO-8859-1 "A"			"[a-z]"		       NOMATCH
 de_DE.ISO-8859-1 "Z"			"[a-z]"		       NOMATCH
 de_DE.ISO-8859-1 "Ä"			"[a-z]"		       NOMATCH
@@ -492,9 +492,9 @@ de_DE.ISO-8859-1 "
 de_DE.ISO-8859-1 "ü"			"[A-Z]"		       NOMATCH
 de_DE.ISO-8859-1 "A"			"[A-Z]"		       0
 de_DE.ISO-8859-1 "Z"			"[A-Z]"		       0
-de_DE.ISO-8859-1 "Ä"			"[A-Z]"		       0
-de_DE.ISO-8859-1 "Ö"			"[A-Z]"		       0
-de_DE.ISO-8859-1 "Ü"			"[A-Z]"		       0
+de_DE.ISO-8859-1 "Ä"			"[A-Z]"		       NOMATCH
+de_DE.ISO-8859-1 "Ö"			"[A-Z]"		       NOMATCH
+de_DE.ISO-8859-1 "Ü"			"[A-Z]"		       NOMATCH
 de_DE.ISO-8859-1 "a"			"[[:lower:]]"	       0
 de_DE.ISO-8859-1 "z"			"[[:lower:]]"	       0
 de_DE.ISO-8859-1 "ä"			"[[:lower:]]"	       0
@@ -566,22 +566,46 @@ de_DE.ISO-8859-1 "aa"			"[[.a.]]a"	       0
 de_DE.ISO-8859-1 "ba"			"[[.a.]]a"	       NOMATCH
 
 
-# And with a multibyte character set.
+# And with a multibyte character set:
+# Ensure that Turkish reordering rules don't move 'i' out of a-z set,
+# or 'I' out of A-Z set.
+tr_TR.UTF-8	 "i"			"[a-z]"		       0
+tr_TR.UTF-8	 "ı"			"[a-z]"		       NOMATCH
+tr_TR.UTF-8	 "I"			"[A-Z]"		       0
+tr_TR.UTF-8	 "İ"			"[A-Z]"		       NOMATCH
+tr_TR.ISO-8859-9 "i"			"[a-z]"		       0
+tr_TR.ISO-8859-9 "I"			"[A-Z]"		       0
+# See bug 23437 for I not being in [=i=].
+tr_TR.UTF-8	 "I"			"[=i=]"		       NOMATCH
 en_US.UTF-8	 "a"			"[a-z]"		       0
+# Test that <U00F1> LATIN SMALL LETTER N WITH TILDE is not in [a-z].
+en_US.UTF-8	 "ñ"			"[a-z]"		       NOMATCH
 en_US.UTF-8	 "z"			"[a-z]"		       0
 en_US.UTF-8	 "A"			"[a-z]"		       NOMATCH
+# Test that <U00D1> LATIN CAPITAL LETTER N WITH TILDE is not in [a-z].
+en_US.UTF-8	 "Ñ"			"[a-z]"		       NOMATCH
 en_US.UTF-8	 "Z"			"[a-z]"		       NOMATCH
 en_US.UTF-8	 "a"			"[A-Z]"		       NOMATCH
+# Test that <U00F1> LATIN SMALL LETTER N WITH TILDE is not in [A-Z].
+en_US.UTF-8	 "ñ"			"[A-Z]"		       NOMATCH
 en_US.UTF-8	 "z"			"[A-Z]"		       NOMATCH
 en_US.UTF-8	 "A"			"[A-Z]"		       0
+# Test that <U00D1> LATIN CAPITAL LETTER N WITH TILDE is not in [A-Z].
+en_US.UTF-8	 "Ñ"			"[A-Z]"		       NOMATCH
 en_US.UTF-8	 "Z"			"[A-Z]"		       0
 en_US.UTF-8	 "0"			"[0-9]"		       0
+# Test that <UFF10> FULLWIDTH DIGIT ZERO is not in [0-9].
+en_US.UTF-8	 "０"			"[0-9]"		       NOMATCH
+# Test that <U00BD> VULGAR FRACTION ONE HALF is not in [0-9].
+en_US.UTF-8	 "½"			"[0-9]"		       NOMATCH
 en_US.UTF-8	 "9"			"[0-9]"		       0
+# Test that <UFF19> FULLWIDTH DIGIT NINE is not in [0-9].
+en_US.UTF-8	 "９"			"[0-9]"		       NOMATCH
 de_DE.UTF-8	 "a"			"[a-z]"		       0
 de_DE.UTF-8	 "z"			"[a-z]"		       0
-de_DE.UTF-8	 "ä"			"[a-z]"		       0
-de_DE.UTF-8	 "ö"			"[a-z]"		       0
-de_DE.UTF-8	 "ü"			"[a-z]"		       0
+de_DE.UTF-8	 "ä"			"[a-z]"		       NOMATCH
+de_DE.UTF-8	 "ö"			"[a-z]"		       NOMATCH
+de_DE.UTF-8	 "ü"			"[a-z]"		       NOMATCH
 de_DE.UTF-8	 "A"			"[a-z]"		       NOMATCH
 de_DE.UTF-8	 "Z"			"[a-z]"		       NOMATCH
 de_DE.UTF-8	 "Ä"			"[a-z]"		       NOMATCH
@@ -594,9 +618,9 @@ de_DE.UTF-8	 "ö"			"[A-Z]"		       NOMATCH
 de_DE.UTF-8	 "ü"			"[A-Z]"		       NOMATCH
 de_DE.UTF-8	 "A"			"[A-Z]"		       0
 de_DE.UTF-8	 "Z"			"[A-Z]"		       0
-de_DE.UTF-8	 "Ä"			"[A-Z]"		       0
-de_DE.UTF-8	 "Ö"			"[A-Z]"		       0
-de_DE.UTF-8	 "Ü"			"[A-Z]"		       0
+de_DE.UTF-8	 "Ä"		"[A-Z]"		       NOMATCH
+de_DE.UTF-8	 "Ö"		"[A-Z]"		       NOMATCH
+de_DE.UTF-8	 "Ü"		"[A-Z]"		       NOMATCH
 de_DE.UTF-8	 "a"			"[[:lower:]]"	       0
 de_DE.UTF-8	 "z"			"[[:lower:]]"	       0
 de_DE.UTF-8	 "ä"			"[[:lower:]]"	       0
diff --git a/posix/tst-rxspencer.c b/posix/tst-rxspencer.c
index 9d597ef3e9..a3d836679a 100644
--- a/posix/tst-rxspencer.c
+++ b/posix/tst-rxspencer.c
@@ -155,7 +155,12 @@ mb_frob_pattern (const char *str, const char *letters)
 	*dst++ = *src;
 	continue;
       }
-    else if (!in_class && strchr (letters, *src))
+    /* We do a replacement, but not for the start of ranges, because
+       mb_replace will create invalid rational ranges.  For example
+       [á-z] is an invalid range because á comes after z, but [a-á]
+       is a valid range.  So we avoid replacing the start of ranges
+       to avoid this problem.  */
+    else if (!in_class && src[1] != '-' && strchr (letters, *src))
       dst = mb_replace (dst, *src);
     else
       {
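
The effect of the added src[1] != '-' test can be sketched in isolation.
This is a simplified illustration only: mark_replaceable is a hypothetical
helper that does not exist in tst-rxspencer.c, and unlike mb_frob_pattern it
does not track bracket or character-class state.  It only shows that a
letter immediately followed by '-' (the start of a range) is left alone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of tst-rxspencer.c).  For each character
   of PATTERN, write 'R' into MARKS if a replacement pass would rewrite
   it, or '.' if it is left alone.  A character is left alone when it is
   not in LETTERS or when it starts a range, i.e. the next character is
   '-'.  MARKS must have room for strlen (PATTERN) + 1 bytes.  */
static void
mark_replaceable (const char *pattern, const char *letters, char *marks)
{
  size_t i;

  assert (pattern != NULL && letters != NULL && marks != NULL);
  for (i = 0; pattern[i] != '\0'; i++)
    {
      int starts_range = pattern[i + 1] == '-';
      int is_letter = strchr (letters, pattern[i]) != NULL;

      marks[i] = (is_letter && !starts_range) ? 'R' : '.';
    }
  marks[i] = '\0';
}
```

For the pattern "[a-z]b" with letters "ab", the 'a' that opens the range is
skipped and only the trailing 'b' is marked, giving ".....R".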

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-20 21:56       ` Carlos O'Donell
@ 2018-07-23 15:11         ` Florian Weimer
  2018-07-23 18:09           ` Rational Ranges - Rafal and Mike's opinion? " Carlos O'Donell
  2018-07-25 15:54           ` [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 " Carlos O'Donell
  0 siblings, 2 replies; 42+ messages in thread
From: Florian Weimer @ 2018-07-23 15:11 UTC (permalink / raw)
  To: Carlos O'Donell, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

On 07/20/2018 11:56 PM, Carlos O'Donell wrote:
> v2
> - Fixed tr_TR by duplicating A-Z rational range.
> - Fixed tst-rxspencer.
> - Fixed bug-regex17.
> 
> Tell me how the new version does.

My tester likes it.  tr_TR.ISO-8859-9 is now fixed.  I added fnmatch 
support, too, and initial results look good as well.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Rational Ranges - Rafal and Mike's opinion? (Bug 23393).
  2018-07-23 15:11         ` Florian Weimer
@ 2018-07-23 18:09           ` Carlos O'Donell
  2018-07-24 20:45             ` Rafal Luzynski
  2018-07-25 15:44             ` Mike FABIAN
  2018-07-25 15:54           ` [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 " Carlos O'Donell
  1 sibling, 2 replies; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-23 18:09 UTC (permalink / raw)
  To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers, Rafal Luzynski

On 07/23/2018 11:10 AM, Florian Weimer wrote:
> On 07/20/2018 11:56 PM, Carlos O'Donell wrote:
>> v2
>> - Fixed tr_TR by duplicating A-Z rational range.
>> - Fixed tst-rxspencer.
>> - Fixed bug-regex17.
>>
>> Tell me how the new version does.
> 
> My tester likes it.  tr_TR.ISO-8859-9 is now fixed.  I added fnmatch
> support, too, and initial results look good as well.

OK, so we have the capability to deploy rational ranges.

Florian,

Should we do so in 2.28? Avoiding all possible problems in the future
and making the ranges portable, rational, and safe from a security
perspective?

Rafal,

As localedata maintainer what is your opinion of changing the meaning
of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
which mean exactly the Latin character sequences you would expect
e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
[A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?

Mike,

Same question to you.

For historical context in gawk:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

For context from POSIX:
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
(see the section on "RE Bracket Expressions").

Support for rational ranges would make [a-z], [A-Z], [0-9] and other subranges
rational for all locales, and would no longer include mixed case, or accents.

I'd like to hear affirmatives from the localedata maintainers on this issue.

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Rational Ranges - Rafal and Mike's opinion? (Bug 23393).
  2018-07-23 18:09           ` Rational Ranges - Rafal and Mike's opinion? " Carlos O'Donell
@ 2018-07-24 20:45             ` Rafal Luzynski
  2018-07-24 20:53               ` Carlos O'Donell
  2018-07-24 20:59               ` Carlos O'Donell
  2018-07-25 15:44             ` Mike FABIAN
  1 sibling, 2 replies; 42+ messages in thread
From: Rafal Luzynski @ 2018-07-24 20:45 UTC (permalink / raw)
  To: GNU C Library, Mike Fabian, Florian Weimer, Joseph S. Myers,
	Carlos O'Donell

23.07.2018 20:09 Carlos O'Donell <carlos@redhat.com> wrote:
> [...]
> Rafal,
>
> As localedata maintainer what is your opinion of changing the meaning
> of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
> which mean exactly the Latin character sequences you would expect
> e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
> [A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?

Having discussed this off-list my answer is: I'm in favor of implementing
rational ranges treating [a-z], [A-Z], [0-9], and all their subsets as
code-point ranges.  But I understand that this is possible only in 2.29.
Therefore for 2.28 I support this data-based solution.

Regards,

Rafal

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Rational Ranges - Rafal and Mike's opinion? (Bug 23393).
  2018-07-24 20:45             ` Rafal Luzynski
@ 2018-07-24 20:53               ` Carlos O'Donell
  2018-07-24 20:59               ` Carlos O'Donell
  1 sibling, 0 replies; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-24 20:53 UTC (permalink / raw)
  To: Rafal Luzynski, GNU C Library, Mike Fabian, Florian Weimer,
	Joseph S. Myers

On 07/24/2018 04:45 PM, Rafal Luzynski wrote:
> 23.07.2018 20:09 Carlos O'Donell <carlos@redhat.com> wrote:
>> [...]
>> Rafal,
>>
>> As localedata maintainer what is your opinion of changing the meaning
>> of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
>> which mean exactly the Latin character sequences you would expect
>> e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
>> [A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
> 
> Having discussed this off-list my answer is: I'm in favor of implementing
> rational ranges treating [a-z], [A-Z], [0-9], and all their subsets as
> code-point ranges.  But I understand that this is possible only in 2.29.
> Therefore for 2.28 I support this data-based solution.

From the perspective of the user of the library and the locales the
rational ranges we implement will look as-if they were code point ranges
for the ranges in question e.g. a-z, A-Z, 0-9 and their subranges.

For 2.28 we will implement rational ranges for [a-z], [A-Z], and [0-9],
and all of their subsets via a data-only solution. Just wanted to make
it clear that all subsets will be treated as rational ranges.

It is only for other subsets like [!-~] (ASCII range) where we will not
have a rational range until we switch to making ranges operate on code
points. That will be a 2.29 optimization.
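
A minimal sketch of what "operate on code points" would mean for a range
test, assuming characters are already decoded to Unicode code points.
in_codepoint_range is a hypothetical name for illustration and is not an
existing glibc function; the data-only 2.28 approach does not work this way
internally, it only aims to look the same for a-z, A-Z, and 0-9.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if code point C lies in [LO, HI] when the
   range is evaluated purely by Unicode code point value.  */
static bool
in_codepoint_range (uint32_t lo, uint32_t hi, uint32_t c)
{
  assert (lo <= hi);	/* A reversed range is a caller error here.  */
  return lo <= c && c <= hi;
}
```

Under this model U+00F1 (n with tilde) and 'A' both fall outside [a-z],
while [!-~] covers exactly U+0021 through U+007E.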

OK, I will prepare a patch to fix this.

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Rational Ranges - Rafal and Mike's opinion? (Bug 23393).
  2018-07-24 20:45             ` Rafal Luzynski
  2018-07-24 20:53               ` Carlos O'Donell
@ 2018-07-24 20:59               ` Carlos O'Donell
  1 sibling, 0 replies; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-24 20:59 UTC (permalink / raw)
  To: Rafal Luzynski, GNU C Library, Mike Fabian, Florian Weimer,
	Joseph S. Myers

On 07/24/2018 04:45 PM, Rafal Luzynski wrote:
> 23.07.2018 20:09 Carlos O'Donell <carlos@redhat.com> wrote:
>> [...]
>> Rafal,
>>
>> As localedata maintainer what is your opinion of changing the meaning
>> of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
>> which mean exactly the Latin character sequences you would expect
>> e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
>> [A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
> 
> Having discussed this off-list my answer is: I'm in favor of implementing
> rational ranges treating [a-z], [A-Z], [0-9], and all their subsets as
> code-point ranges.  But I understand that this is possible only in 2.29.
> Therefore for 2.28 I support this data-based solution.

I'll put together a final patch ASAP that provides:

* Deinterlace upper/lower
* Group a-z, A-Z, 0-9,
* NEWS entry for rational ranges.

Note: manual/stdio.texi also makes the mistake of saying [a-z] is lowercase
      characters, so this will fix the manual bug with no change :-)

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: Rational Ranges - Rafal and Mike's opinion? (Bug 23393).
  2018-07-23 18:09           ` Rational Ranges - Rafal and Mike's opinion? " Carlos O'Donell
  2018-07-24 20:45             ` Rafal Luzynski
@ 2018-07-25 15:44             ` Mike FABIAN
  1 sibling, 0 replies; 42+ messages in thread
From: Mike FABIAN @ 2018-07-25 15:44 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Florian Weimer, GNU C Library, Rich Felker, Zorro Lang,
	Joseph S. Myers, Rafal Luzynski

Carlos O'Donell <carlos@redhat.com> wrote:

> On 07/23/2018 11:10 AM, Florian Weimer wrote:
>> On 07/20/2018 11:56 PM, Carlos O'Donell wrote:
>>> v2
>>> - Fixed tr_TR by duplicating A-Z rational range.
>>> - Fixed tst-rxspencer.
>>> - Fixed bug-regex17.
>>>
>>> Tell me how the new version does.
>> 
>> My tester likes it.  tr_TR.ISO-8859-9 is now fixed.  I added fnmatch
>> support, too, and initial results look good as well.
>
> OK, so we have the capability to deploy rational ranges.
>
> Florian,
>
> Should we do so in 2.28? Avoiding all possible problems in the future
> and making the ranges portable, rational, and safe from a security
> perspective?
>
> Rafal,
>
> As localedata maintainer what is your opinion of changing the meaning
> of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
> which mean exactly the Latin character sequences you would expect
> e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
> [A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
>
> Mike,
>
> Same question to you.

I agree that rational ranges are much more useful.

I cannot imagine any use case for [a-z] matching aAbB...z and not Z.

One never knows what [a-z] would match if it uses the locale sort order,
it is just too confusing.

In the long run, I think implementing ranges by code points would be
the best solution and make updates of the iso14651_t1_common file easier
because we would then need to make fewer changes to the upstream version
of that file.

But for 2.28 this cannot be done. Therefore, I think the solution
by Carlos is very good.

> For historical context in gawk:
> https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
>
> For context from POSIX:
> http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
> (see the section on "RE Bracket Expressions").
>
> Support for rational ranges would make [a-z], [A-Z], [0-9] and other subranges
> rational for all locales, and would no longer include mixed case, or accents.
>
> I'd like to hear affirmatives from the localedata maintainers on this issue.
>
> Cheers,
> Carlos.

-- 
Mike FABIAN <mfabian@redhat.com>
Lack of sleep is the enemy of good work.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-23 15:11         ` Florian Weimer
  2018-07-23 18:09           ` Rational Ranges - Rafal and Mike's opinion? " Carlos O'Donell
@ 2018-07-25 15:54           ` Carlos O'Donell
  2018-07-25 20:19             ` Florian Weimer
  1 sibling, 1 reply; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-25 15:54 UTC (permalink / raw)
  To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 1921 bytes --]

On 07/23/2018 11:10 AM, Florian Weimer wrote:
> On 07/20/2018 11:56 PM, Carlos O'Donell wrote:
>> v2
>> - Fixed tr_TR by duplicating A-Z rational range.
>> - Fixed tst-rxspencer.
>> - Fixed bug-regex17.
>>
>> Tell me how the new version does.
> 
> My tester likes it.  tr_TR.ISO-8859-9 is now fixed.  I added fnmatch
> support, too, and initial results look good as well.

OK, here is v3.

~~~ NEWS ~~
* The GNU C Library now uses rational ranges for regular expression
  matching of ranges that are within a-z, A-Z, and 0-9 for all
  locales.  This means that the range [a-c] will no longer match
  accented letter a's and will only match exactly a, b, and c. Likewise
  [0-9] will only include the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and
  no other characters.  Rational ranges have been implemented by
  several other GNU projects to provide straightforward rules for
  regular expression ranges and to make them portable across locales.
  The current rational ranges are implemented using collation element
  ordering, which may yield unexpected results if the range includes
  accented characters e.g. [a-ñ]: such a range will include all of a-z
  because ñ comes after the rational range in collation element order.
  In the future the library may implement full rational ranges covering
  all characters by using Unicode code point ordering which will make
  the sequences faster to match and more portable.
~~~
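
As a rough illustration of the behaviour the NEWS entry describes, the
sketch below uses the standard POSIX regcomp/regexec interfaces.  The
helper name matches() is made up for this example.  It only exercises the
C locale (the program never calls setlocale); with this patch the same
results would be expected for these ranges in other locales, which the
sketch does not demonstrate.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if PATTERN, compiled as a POSIX extended
   regular expression in the current locale, matches somewhere in TEXT.
   Returns false if the pattern fails to compile.  */
static bool
matches (const char *pattern, const char *text)
{
  regex_t re;
  bool ok;

  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return false;
  ok = regexec (&re, text, 0, NULL, 0) == 0;
  regfree (&re);
  return ok;
}
```

Anchoring the pattern, as in matches ("^[a-c]$", "b"), restricts the test
to a single character: "b" is accepted and "A" is rejected.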

We have approval from Mike and Rafal, the two localedata subsystem
maintainers.

This solution matches what you and Rich Felker both think is the
correct solution.

So for 2.28 we would use rational ranges for a-z, A-Z, and 0-9, until
we can implement code point ranges.

v3
- Merged lowercase/uppercase deinterlacing.
- Added NEWS entry.

Please run this through your checker, and ACK this for 2.28 and I'll
commit.

Attaching it as swbz23393v3.tar.gz to avoid spam rejection.

Cheers,
Carlos.

[-- Attachment #2: swbz23393v3.tar.gz --]
[-- Type: application/gzip, Size: 50219 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-25 15:54           ` [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 " Carlos O'Donell
@ 2018-07-25 20:19             ` Florian Weimer
  2018-07-25 20:25               ` Carlos O'Donell
  0 siblings, 1 reply; 42+ messages in thread
From: Florian Weimer @ 2018-07-25 20:19 UTC (permalink / raw)
  To: Carlos O'Donell, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

On 07/25/2018 05:54 PM, Carlos O'Donell wrote:
> Attaching it as swbz23393v3.tar.gz to avoid spam rejection.

Quick comment.  The middle line here adds trailing whitespace:

-  { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
+
+     The U+02DA RING ABOVE is chosen because it's not in [ｓ-㏜].  */

Florian

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-25 20:19             ` Florian Weimer
@ 2018-07-25 20:25               ` Carlos O'Donell
  2018-07-25 20:31                 ` Florian Weimer
  2018-07-25 21:06                 ` [PATCHv3] " Rafal Luzynski
  0 siblings, 2 replies; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-25 20:25 UTC (permalink / raw)
  To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

On 07/25/2018 04:18 PM, Florian Weimer wrote:
> On 07/25/2018 05:54 PM, Carlos O'Donell wrote:
>> Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
> 
> Quick comment.  The middle line here adds trailing whitespace:
> 
> -  { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
> +
> +     The U+02DA RING ABOVE is chosen because it's not in [ｓ-㏜].  */

Thanks. I'll fix this with v4.

I had to fix the following locales:

	modified:   localedata/locales/ar_SA
	modified:   localedata/locales/km_KH
	modified:   localedata/locales/lo_LA
	modified:   localedata/locales/or_IN
	modified:   localedata/locales/sl_SI
	modified:   localedata/locales/th_TH

They all re-arranged ASCII character collation element ordering like tr_TR,
and so they needed manual fixing.

Could you please add these locales to your tester?

c.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-25 20:25               ` Carlos O'Donell
@ 2018-07-25 20:31                 ` Florian Weimer
  2018-07-25 20:57                   ` [PATCHv4] " Carlos O'Donell
  2018-07-25 21:06                 ` [PATCHv3] " Rafal Luzynski
  1 sibling, 1 reply; 42+ messages in thread
From: Florian Weimer @ 2018-07-25 20:31 UTC (permalink / raw)
  To: Carlos O'Donell, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

On 07/25/2018 10:25 PM, Carlos O'Donell wrote:
> On 07/25/2018 04:18 PM, Florian Weimer wrote:
>> On 07/25/2018 05:54 PM, Carlos O'Donell wrote:
>>> Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
>>
>> Quick comment.  The middle line here adds trailing whitespace:
>>
>> -  { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
>> +
>> +     The U+02DA RING ABOVE is chosen because it's not in [ï½“-ãœ].  */
> 
> Thanks. I'll fix this with v4.

I have verified that localedata/locales/iso14651_t1_common is just a 
reordering (except for the new comments).

localedata/locales/tr_TR is more complicated, but looks like an 
order-only change for me too.

> I had to fix the following locales:
> 
> 	modified:   localedata/locales/ar_SA
> 	modified:   localedata/locales/km_KH
> 	modified:   localedata/locales/lo_LA
> 	modified:   localedata/locales/or_IN
> 	modified:   localedata/locales/sl_SI
> 	modified:   localedata/locales/th_TH

Do you have the actual locale names handy?  localedata/SUPPORTED 
contains charsets, but I'm not sure if the translation to locale names 
is completely regular.

> They all re-arranged ASCII character collation element ordering like tr_TR,
> and so they needed manual fixing.
> 
> Could you please add these locales to your tester?

I will try.  I already have an xtests part, and these probably need to 
go there as well.

Thanks,
Florian


* [PATCHv4] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-25 20:31                 ` Florian Weimer
@ 2018-07-25 20:57                   ` Carlos O'Donell
  2018-07-26  2:34                     ` [PATCHv4a] " Carlos O'Donell
  0 siblings, 1 reply; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-25 20:57 UTC (permalink / raw)
  To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 2016 bytes --]

On 07/25/2018 04:31 PM, Florian Weimer wrote:
> On 07/25/2018 10:25 PM, Carlos O'Donell wrote:
>> On 07/25/2018 04:18 PM, Florian Weimer wrote:
>>> On 07/25/2018 05:54 PM, Carlos O'Donell wrote:
>>>> Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
>>>
>>> Quick comment.  The middle line here adds trailing whitespace:
>>>
>>> -  { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
>>> +
>>> +     The U+02DA RING ABOVE is chosen because it's not in [ï½“-ãœ].  */
>>
>> Thanks. I'll fix this with v4.
> 
> I have verified that localedata/locales/iso14651_t1_common is just a reordering (except for the new comments).
> 
> localedata/locales/tr_TR is more complicated, but looks like an order-only change for me too.
> 
>> I had to fix the following locales:
>>
>> 	modified:   localedata/locales/ar_SA
>> 	modified:   localedata/locales/km_KH
>> 	modified:   localedata/locales/lo_LA
>> 	modified:   localedata/locales/or_IN
>> 	modified:   localedata/locales/sl_SI
>> 	modified:   localedata/locales/th_TH
> 
> Do you have the actual locale names handy?  localedata/SUPPORTED contains charsets, but I'm not sure if the translation to locale names is completely regular.

It is completely regular: ar_SA => ar_SA.UTF-8, and so forth.

>> They all re-arranged ASCII character collation element ordering like tr_TR,
>> and so they needed manual fixing.
>>
>> Could you please add these locales to your tester?
> 
> I will try.  I already have an xtests part, and these probably need to go there as well.

v4
- Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH.
- Added range checking for a-z, A-Z for all supported UTF-8 locales.

All of my testers are clean.

So the question is now:

Do we commit to rational ranges for a-z, A-Z, 0-9 ... for 2.28?

or

Do we just do the deinterlacing of iso14651_t1_common to fix en_US.UTF-8?

Cheers,
Carlos.

[-- Attachment #2: swbz23393v4.tar.gz --]
[-- Type: application/gzip, Size: 67108 bytes --]


* Re: [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-25 20:25               ` Carlos O'Donell
  2018-07-25 20:31                 ` Florian Weimer
@ 2018-07-25 21:06                 ` Rafal Luzynski
  2018-07-25 21:12                   ` Carlos O'Donell
  1 sibling, 1 reply; 42+ messages in thread
From: Rafal Luzynski @ 2018-07-25 21:06 UTC (permalink / raw)
  To: GNU C Library, Mike Fabian, Florian Weimer, Joseph S. Myers,
	Carlos O'Donell

25.07.2018 22:25 Carlos O'Donell <carlos@redhat.com> wrote:
> [...]
> I had to fix the following locales:
>
> modified: localedata/locales/ar_SA
> modified: localedata/locales/km_KH
> modified: localedata/locales/lo_LA
> modified: localedata/locales/or_IN
> modified: localedata/locales/sl_SI
> modified: localedata/locales/th_TH
>
> They all re-arranged ASCII character collation element ordering like tr_TR,
> and so they needed manual fixing.

Please check bg_BG.  It also has a large reorder: puts all Cyrillic characters
before Latin.  (However, this may not be relevant at all.)

Regards,

Rafal


* Re: [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-25 21:06                 ` [PATCHv3] " Rafal Luzynski
@ 2018-07-25 21:12                   ` Carlos O'Donell
  0 siblings, 0 replies; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-25 21:12 UTC (permalink / raw)
  To: Rafal Luzynski, GNU C Library, Mike Fabian, Florian Weimer,
	Joseph S. Myers

On 07/25/2018 05:06 PM, Rafal Luzynski wrote:
> 25.07.2018 22:25 Carlos O'Donell <carlos@redhat.com> wrote:
>> [...]
>> I had to fix the following locales:
>>
>> modified: localedata/locales/ar_SA
>> modified: localedata/locales/km_KH
>> modified: localedata/locales/lo_LA
>> modified: localedata/locales/or_IN
>> modified: localedata/locales/sl_SI
>> modified: localedata/locales/th_TH
>>
>> They all re-arranged ASCII character collation element ordering like tr_TR,
>> and so they needed manual fixing.
> 
> Please check bg_BG.  It also has a large reorder: puts all Cyrillic characters
> before Latin.  (However, this may not be relevant at all.)

Right, that won't affect the rational range for ASCII.

The new tst-fnmatch.input has this:

 bg_BG.UTF-8     "a"     "[a-z]"         0
 bg_BG.UTF-8     "z"     "[a-z]"         0
 bg_BG.UTF-8     "A"     "[a-z]"         NOMATCH
 bg_BG.UTF-8     "Z"     "[a-z]"         NOMATCH
 bg_BG.UTF-8     "A"     "[A-Z]"         0
 bg_BG.UTF-8     "Z"     "[A-Z]"         0
 bg_BG.UTF-8     "a"     "[A-Z]"         NOMATCH
 bg_BG.UTF-8     "z"     "[A-Z]"         NOMATCH

Which tests the range extremes, and it passes.

It doesn't reorder any actual LATIN characters and so it's safe.
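
The range-extreme checks above can be approximated outside the test
suite.  What follows is only a sketch: range_matches is a hypothetical
helper (it is not part of glibc or of tst-fnmatch), and it assumes the
named locale is installed on the system.

```c
#include <assert.h>
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, not part of glibc or its test suite.
   Returns 1 if PATTERN matches STRING under LOCALE_NAME, 0 if it does
   not match, and -1 if the locale cannot be selected or fnmatch
   reports an error.  */
int
range_matches (const char *locale_name, const char *pattern,
               const char *string)
{
  assert (locale_name != NULL && pattern != NULL && string != NULL);

  /* setlocale returns NULL when the requested locale is unavailable.  */
  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  int result = fnmatch (pattern, string, 0);
  if (result == 0)
    return 1;
  if (result == FNM_NOMATCH)
    return 0;
  return -1;
}
```

With the patch applied and bg_BG.UTF-8 installed, a call such as
range_matches ("bg_BG.UTF-8", "[a-z]", "A") would be expected to return
0, mirroring the NOMATCH lines above; it returns -1 when the locale is
missing.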

Cheers,
Carlos.


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-19 19:43 [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393) Carlos O'Donell
  2018-07-19 20:39 ` Florian Weimer
@ 2018-07-25 21:35 ` Carlos O'Donell
  2018-07-25 22:50   ` Florian Weimer
  2018-07-26  1:33 ` Jonathan Nieder
  2 siblings, 1 reply; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-25 21:35 UTC (permalink / raw)
  To: GNU C Library, Florian Weimer, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

On 07/19/2018 03:43 PM, Carlos O'Donell wrote:
> In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
> the collation data to harmonize with the new version of ISO 14651
> which is derived from Unicode 9.0.0.  This collation update brought
> with it some changes to locales which were not desirable by some
> users, in particular it altered the meaning of the
> locale-dependent-range regular expression, namely [a-z] and [A-Z], and
> for en_US it caused uppercase letters to be matched by [a-z] for the
> first time.  The matching of uppercase letters by [a-z] is something
> which is already known to users of other locales which have this
> property, but this change could cause significant problems to en_US
> and other similar locales that had never had this change before.
> Whether this behaviour is desirable or not is contentious and GNU Awk
> has this to say on the topic:
> https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
> While the POSIX standard also has this further to say: "RE Bracket
> Expression":
> http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
> "The current standard leaves unspecified the behavior of a range
> expression outside the POSIX locale. ... As noted above, efforts were
> made to resolve the differences, but no solution has been found that
> would be specific enough to allow for portable software while not
> invalidating existing implementations."
> In glibc we implement the requirement of ISO POSIX-2:1993 and use
> collation element order (CEO) to construct the range expression, the
> API internally is __collseq_table_lookup().  The fact that we use CEO
> and also have 4-level weights on each collation rule means that we can
> in practice reorder the collation rules in iso14651_t1_common (the new
> data) to provide consistent range expression resolution *and* the
> weights should maintain the expected total order.  Therefore this
> patch does three things:
> 
> * Reorder the collation rules for the LATIN script in
>   iso14651_t1_common to deinterlace uppercase and lowercase letters in
>   the collation element orders.
> 
> * Adds new test data en_US.UTF-8.in for sort-test.sh which exercises
>   strcoll* and strxfrm* and ensures the ISO 14651 collation remains.
> 
> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
>   exercise that [a-z] does not match A or Z.
> 
> The reordering of the ISO 14651 data is done in an entirely mechanical
> fashion using the following program attached to the bug:
> https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c28
> 
> It is up for discussion if the iso14651_t1_common data should be
> refined further to have 3 very tight collation element ranges that
> include only a-z, A-Z, and 0-9, which would implement the solution
> sought after in:
> https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c12
> 
> No regressions on x86_64.
> Verified that removal of the iso14651_t1_common change causes tst-fnmatch
> to regress with:
> 422: fnmatch ("[a-z]", "A", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
> ...
> 425: fnmatch ("[A-Z]", "z", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
> ---
>  ChangeLog                             |   11 +
>  localedata/Makefile                   |    1 +
>  localedata/en_US.UTF-8.in             | 2159 +++++++++++++++++++++++++++++++++
>  localedata/locales/iso14651_t1_common | 1928 ++++++++++++++---------------
>  posix/tst-fnmatch.input               |  125 +-
>  posix/tst-regexloc.c                  |    8 +-
>  6 files changed, 3224 insertions(+), 1008 deletions(-)
>  create mode 100644 localedata/en_US.UTF-8.in
> 
> I'm suggesting this change immediately for 2.28 to avoid further
> problems with users expectations and sorting with [a-z] and [A-Z] until
> a clearer consensus can be reached for a final solution.
> 
> File attached as .tar.gz to get past spam detectors. There is a lot
> of UTF-8 data in en_US.UTF-8 (every possible character in the LATIN
> set that can be sorted with the existing test case infrastructure).
> 

I have committed only the most conservative fix for this issue, which is
to deinterlace the lower and upper case ranges.

I think we are too late to commit rational ranges, and we can do that in
2.29 when it opens. Right now I want to remove the blocker that is causing
regressions for en_US.UTF-8 scripts that use [a-z], and [A-Z].

We have consensus that this is the right direction for a solution,
and if anyone objects, please speak up before I cut the branch on August 1st
(if we can still achieve that and get good machine coverage).

Cheers,
Carlos.


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-25 21:35 ` [PATCH] Keep expected behaviour for [a-z] and [A-z] " Carlos O'Donell
@ 2018-07-25 22:50   ` Florian Weimer
  2018-07-26  1:20     ` Carlos O'Donell
  0 siblings, 1 reply; 42+ messages in thread
From: Florian Weimer @ 2018-07-25 22:50 UTC (permalink / raw)
  To: Carlos O'Donell, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

On 07/25/2018 11:35 PM, Carlos O'Donell wrote:
> I have committed only the most conservative fix for this issue, which is
> to deinterlace the lower and upper case ranges.
> 
> I think we are too late to commit rational ranges, and we can do that in
> 2.29 when it opens. Right now I want to remove the blocker that is causing
> regressions for en_US.UTF-8 scripts that use [a-z], and [A-Z].

How is this the most conservative fix, relative to glibc 2.27 upstream?

[a-z] still matches lots of non-ASCII characters, which it did not before.

When I meant that we left regression-fixing territory, I was talking 
about the locales which had iso14651_t1_common customizations.

Thanks,
Florian


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-25 22:50   ` Florian Weimer
@ 2018-07-26  1:20     ` Carlos O'Donell
  2018-07-26  8:09       ` Andreas Schwab
  0 siblings, 1 reply; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-26  1:20 UTC (permalink / raw)
  To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

On 07/25/2018 06:50 PM, Florian Weimer wrote:
> On 07/25/2018 11:35 PM, Carlos O'Donell wrote:
>> I have committed only the most conservative fix for this issue,
>> which is to deinterlace the lower and upper case ranges.
>> 
>> I think we are too late to commit rational ranges, and we can do
>> that in 2.29 when it opens. Right now I want to remove the blocker
>> that is causing regressions for en_US.UTF-8 scripts that use [a-z],
>> and [A-Z].
> 
> How is this the most conservative fix, relative to glibc 2.27
> upstream?

We have two solutions to fix the regression:

* Revert the entire ISO 14651 update.
  - This is 13 commits for just the update.
  - Several more commits for Rafal and Mike's work on locales on top of that.

* Fix the key issue of a-z interleaving with A-Z.

My opinion is that it is most conservative to fix the interleaving.

In 2.27 we accepted 297 characters between A-Z.

In 2.28 we accept 2280 characters between A-Z as part of the ISO 14651 update.

> [a-z] still matches lots of non-ASCII characters, which it did not
> before.

This is not true, we were already matching 297 characters between A-Z
in 2.27. It has always been the case that we accepted non-ASCII characters
in the range. With the ISO 14651 update the *key* issue was that lowercase
and uppercase were now mixed in collation element ordering, resulting in
surprising matches and failures like the reported xfs test failure where
[a-z] matched "Makefile" and broke their test infrastructure.
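
One rough way to see how wide a range is in a given locale is to count
the single characters it accepts.  The sketch below is an illustration
only: count_ascii_matches is a hypothetical helper, and it walks ASCII
code points 1..127 only, so it cannot reproduce the 297 or 2280 figures
quoted above, which count characters well beyond ASCII.

```c
#include <assert.h>
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, not part of glibc.
   Count the ASCII characters (code points 1..127) that PATTERN matches
   as a single-character string under LOCALE_NAME.  Returns -1 if the
   locale cannot be selected.  */
int
count_ascii_matches (const char *locale_name, const char *pattern)
{
  assert (locale_name != NULL && pattern != NULL);

  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  int count = 0;
  for (int c = 1; c < 128; c++)
    {
      /* Build a one-character, NUL-terminated string for fnmatch.  */
      char s[2] = { (char) c, '\0' };
      if (fnmatch (pattern, s, 0) == 0)
        count++;
    }
  return count;
}
```

In the C locale, count_ascii_matches ("C", "[a-z]") counts only the 26
lowercase letters.  In a locale where lowercase and uppercase are
interleaved in collation element order, the same call would be expected
to also count uppercase letters, which is the surprise described above.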

> When I meant that we left regression-fixing territory, I was talking
> about the locales which had iso14651_t1_common customizations.

OK, so to be clear you think we *should* go forward with rational ranges?

I don't think it's too late, we could commit it tomorrow, it should not
impact machine testing in any way.

My v4 fixes all of the locales that either have customizations on
iso14651_t1_common or have their own custom locales. No more locales
remain to be fixed, I tested all of them with tst-fnmatch.input additions
to catch the ones that needed fixing.

Cheers,
Carlos.


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-19 19:43 [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393) Carlos O'Donell
  2018-07-19 20:39 ` Florian Weimer
  2018-07-25 21:35 ` [PATCH] Keep expected behaviour for [a-z] and [A-z] " Carlos O'Donell
@ 2018-07-26  1:33 ` Jonathan Nieder
  2018-07-26  1:49   ` Carlos O'Donell
  2 siblings, 1 reply; 42+ messages in thread
From: Jonathan Nieder @ 2018-07-26  1:33 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: GNU C Library, Florian Weimer, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

Hi,

Carlos O'Donell wrote:

> In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
> the collation data to harmonize with the new version of ISO 14651
> which is derived from Unicode 9.0.0.  This collation update brought
> with it some changes to locales which were not desirable by some
> users, in particular it altered the meaning of the
> locale-dependent-range regular expression, namely [a-z] and [A-Z], and
> for en_US it caused uppercase letters to be matched by [a-z] for the
> first time.

The Debian system where it is most convenient for me to test has
Debian's libc6 package, version 2.24-12.  [a-z] matches uppercase
letters.  I've always considered that undesirable but I'm confused
about the described regression.  Did one of Debian's patches to
localedata cause it to pick up the regression early (by which I mean,
more than 5 years ago)?

> In glibc we implement the requirement of ISO POSIX-2:1993 and use
> collation element order (CEO) to construct the range expression, the
> API internally is __collseq_table_lookup().  The fact that we use CEO
> and also have 4-level weights on each collation rule means that we can
> in practice reorder the collation rules in iso14651_t1_common (the new
> data) to provide consistent range expression resolution *and* the
> weights should maintain the expected total order.
[...]
> * Adds new test data en_US.UTF-8.in for sort-test.sh which exercises
>   strcoll* and strxfrm* and ensures the ISO 14651 collation remains.

Cool!  Checking my understanding: does this mean that if I have files

	lll
	MMM
	nnn

that with this patch,

	echo [a-z]*

would no longer match MMM, and

	ls | sort

would continue to sort in the order lll < MMM < nnn?

I wish we had done it 10 years ago. ;-)  Thanks for getting it done.

Jonathan


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-26  1:33 ` Jonathan Nieder
@ 2018-07-26  1:49   ` Carlos O'Donell
  2018-07-26  2:16     ` Jonathan Nieder
  0 siblings, 1 reply; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-26  1:49 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: GNU C Library, Florian Weimer, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

On 07/25/2018 09:33 PM, Jonathan Nieder wrote:
> Hi,
> 
> Carlos O'Donell wrote:
> 
>> In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
>> the collation data to harmonize with the new version of ISO 14651
>> which is derived from Unicode 9.0.0.  This collation update brought
>> with it some changes to locales which were not desirable by some
>> users, in particular it altered the meaning of the
>> locale-dependent-range regular expression, namely [a-z] and [A-Z], and
>> for en_US it caused uppercase letters to be matched by [a-z] for the
>> first time.
> 
> The Debian system where it is most convenient for me to test has
> Debian's libc6 package, version 2.24-12.  [a-z] matches uppercase
> letters.  I've always considered that undesirable but I'm confused
> about the described regression.  Did one of Debian's patches to
> localedata cause it to pick up the regression early (by which I mean,
> more than 5 years ago)?

It depends entirely on the locale you use. Some locales already have
[a-z] matching uppercase and have had it for years. The problem is that
this is new for en_US.UTF-8.

Which locale did you use? en_US.UTF-8? If so, then yes, Debian must have
done something different with iso14651_t1_common to change this, or added
something else. I did a quick look at the debian patches for 2.24-12 and
didn't see anything that would change this materially for en_US.

>> In glibc we implement the requirement of ISO POSIX-2:1993 and use
>> collation element order (CEO) to construct the range expression, the
>> API internally is __collseq_table_lookup().  The fact that we use CEO
>> and also have 4-level weights on each collation rule means that we can
>> in practice reorder the collation rules in iso14651_t1_common (the new
>> data) to provide consistent range expression resolution *and* the
>> weights should maintain the expected total order.
> [...]
>> * Adds new test data en_US.UTF-8.in for sort-test.sh which exercises
>>   strcoll* and strxfrm* and ensures the ISO 14651 collation remains.
> 
> Cool!  Checking my understanding: does this mean that if I have files
> 
> 	lll
> 	MMM
> 	nnn
> 
> that with this patch,
> 
> 	echo [a-z]*
> 
> would no longer match MMM, and

Correct.

> 
> 	ls | sort
> 
> would continue to sort in the order lll < MMM < nnn?

Yes.

> 
> I wish we had done it 10 years ago. ;-)  Thanks for getting it done.

The rational ranges follow code point order.

The sorting follows collation sequence.
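
The two behaviours can be looked at separately in a small sketch.  The
helper names below are hypothetical and are not taken from glibc:
sort_by_collation orders names with strcoll, which is the collation
sequence that `sort` follows, and count_glob_matches applies an fnmatch
pattern, which is where the range expression is resolved.

```c
#include <assert.h>
#include <fnmatch.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: order two strings by the current locale's
   collation sequence.  */
static int
compare_collated (const void *a, const void *b)
{
  const char *const *sa = a;
  const char *const *sb = b;
  return strcoll (*sa, *sb);
}

/* Hypothetical helper: sort NAMES in place using the collation
   sequence of LOCALE_NAME.  Returns 0 on success, -1 if the locale is
   unavailable.  */
int
sort_by_collation (const char *locale_name, const char **names, size_t n)
{
  assert (locale_name != NULL && names != NULL);
  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;
  qsort (names, n, sizeof names[0], compare_collated);
  return 0;
}

/* Hypothetical helper: count how many NAMES match the glob PATTERN in
   the currently selected locale.  */
size_t
count_glob_matches (const char *pattern, const char **names, size_t n)
{
  assert (pattern != NULL && names != NULL);
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if (fnmatch (pattern, names[i], 0) == 0)
      count++;
  return count;
}
```

In the C locale both steps follow byte order, so the names sort as MMM,
lll, nnn and "[a-z]*" matches two of them.  Per the answers above, with
the patch and en_US.UTF-8 the sort order would be lll, MMM, nnn while
"[a-z]*" would still skip MMM; that result depends on the installed
locale data and is not checked here.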

I think this was never an issue because most locales following ISO 14651
were using an old data set which never exhibited this issue. However, thanks
to Mike Fabian's hard work (and no good deed goes unpunished) we have updated
collation all the way to Unicode 9.0.0-era and so encountered this problem.

Cheers,
Carlos.


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-26  1:49   ` Carlos O'Donell
@ 2018-07-26  2:16     ` Jonathan Nieder
  2018-07-26  3:48       ` Carlos O'Donell
  2018-07-26  7:42       ` Florian Weimer
  0 siblings, 2 replies; 42+ messages in thread
From: Jonathan Nieder @ 2018-07-26  2:16 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: GNU C Library, Florian Weimer, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

Carlos O'Donell wrote:
> On 07/25/2018 09:33 PM, Jonathan Nieder wrote:

>> The Debian system where it is most convenient for me to test has
>> Debian's libc6 package, version 2.24-12.  [a-z] matches uppercase
>> letters.  I've always considered that undesirable but I'm confused
>> about the described regression.  Did one of Debian's patches to
>> localedata cause it to pick up the regression early (by which I mean,
>> more than 5 years ago)?
>
> It depends entirely on the locale you use. Some locales already have
> [a-z] matching uppercase and have had it for years. The problem is that
> this is new for en_US.UTF-8.
>
> Which locale did you use? en_US.UTF-8? If so, then yes, Debian must have
> done something different with iso14651_t1_common to change this, or added
> something else. I did a quick look at the debian patches for 2.24-12 and
> didn't see anything that would change this materially for en_US.

I tried with the following locales:

 en_US:		matches (bad)
 en_US.UTF-8:	matches (bad)
 C:		does not match (good)
 C.UTF-8:	does not match (good)
 fr_CH:		matches (bad)
 fr_CH.UTF-8:	matches (bad)

Looking over
https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/localedata
and https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/locale,
I don't see any obvious culprits.  Anyway, please just take this as more
feedback in favor of your approach.

See the user reports merged with https://bugs.debian.org/301717.

Thanks,
Jonathan


* Re: [PATCHv4a] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-25 20:57                   ` [PATCHv4] " Carlos O'Donell
@ 2018-07-26  2:34                     ` Carlos O'Donell
  2018-07-26 14:51                       ` Florian Weimer
  0 siblings, 1 reply; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-26  2:34 UTC (permalink / raw)
  To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 432 bytes --]

On 07/25/2018 04:57 PM, Carlos O'Donell wrote:
> v4
> - Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH.
> - Added range checking for a-z, A-Z for all supported UTF-8 locales.
> 
> All of my testers are clean.

Attaching v4 on top of the current master.

This fixes all the locales.

All locales, even with tailoring have rational range support now.

If this passes your tests tomorrow I'm OK to put this into 2.28.

Cheers,
Carlos.

[-- Attachment #2: swbz23393v4a.tar.gz --]
[-- Type: application/gzip, Size: 29142 bytes --]


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-26  2:16     ` Jonathan Nieder
@ 2018-07-26  3:48       ` Carlos O'Donell
  2018-07-26  7:42       ` Florian Weimer
  1 sibling, 0 replies; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-26  3:48 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: GNU C Library, Florian Weimer, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

On 07/25/2018 10:16 PM, Jonathan Nieder wrote:
> Carlos O'Donell wrote:
>> On 07/25/2018 09:33 PM, Jonathan Nieder wrote:
> 
>>> The Debian system where it is most convenient for me to test has
>>> Debian's libc6 package, version 2.24-12.  [a-z] matches uppercase
>>> letters.  I've always considered that undesirable but I'm confused
>>> about the described regression.  Did one of Debian's patches to
>>> localedata cause it to pick up the regression early (by which I mean,
>>> more than 5 years ago)?
>>
>> It depends entirely on the locale you use. Some locales already have
>> [a-z] matching uppercase and have had it for years. The problem is that
>> this is new for en_US.UTF-8.
>>
>> Which locale did you use? en_US.UTF-8? If so, then yes, Debian must have
>> done something different with iso14651_t1_common to change this, or added
>> something else. I did a quick look at the debian patches for 2.24-12 and
>> didn't see anything that would change this materially for en_US.
> 
> I tried with the following locales:
> 
>  en_US:		matches (bad)
>  en_US.UTF-8:	matches (bad)
>  C:		does not match (good)
>  C.UTF-8:	does not match (good)
>  fr_CH:		matches (bad)
>  fr_CH.UTF-8:	matches (bad)
> 
> Looking over
> https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/localedata
> and https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/locale,
> I don't see any obvious culprits.  Anyway, please just take this as more
> feedback in favor of your approach.
> 
> See the user reports merged with https://bugs.debian.org/301717.

This is your shell doing the expanding, and worse doing it
differently from glibc.

My bash shell also handles [a-z] expansion differently given 
the locale data. It appears to be using collation sequence
i.e. the order in which the elements sort in. 

Using grep doesn't result in these matches.

The fix is this: `shopt -s globasciiranges`, and we should
make it the default from now on. The option turns on rational
ranges for bash. Florian found this out when digging into
the issue.

We have a lot of cleanup to do to get rational ranges on
at each step of expansion.

Cheers,
Carlos.



* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-26  2:16     ` Jonathan Nieder
  2018-07-26  3:48       ` Carlos O'Donell
@ 2018-07-26  7:42       ` Florian Weimer
  2018-07-26  8:18         ` Andreas Schwab
  1 sibling, 1 reply; 42+ messages in thread
From: Florian Weimer @ 2018-07-26  7:42 UTC (permalink / raw)
  To: Jonathan Nieder, Carlos O'Donell
  Cc: GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/26/2018 04:16 AM, Jonathan Nieder wrote:
> Looking over
> https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/localedata
> and https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/locale,
> I don't see any obvious culprits.  Anyway, please just take this as more
> feedback in favor of your approach.
> 
> See the user reports merged with https://bugs.debian.org/301717.

The bash implementation of glob always uses strcoll/wcscoll ordering 
when globasciiranges is not active.  It does not use collation element 
ordering, so rearranging collation data does not affect it.  This means 
that the changes discussed here will not affect bash (well, the glob 
part at least).
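
To make the distinction concrete, here is a simplified sketch of two
ways a range test could be decided.  It is not bash's actual code and
not glibc's: both function names are hypothetical, and the sketch
handles single-byte characters only.  The first function compares by
strcoll, which is the kind of ordering described above; the second
compares by code point, which is what a rational range amounts to for
ASCII.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: is C within [LO-HI] according to the current
   locale's collation sequence, as reported by strcoll?  */
int
in_range_by_strcoll (char lo, char c, char hi)
{
  /* A NUL byte would produce empty strings, so reject it up front.  */
  assert (lo != '\0' && c != '\0' && hi != '\0');

  char slo[2] = { lo, '\0' };
  char sc[2] = { c, '\0' };
  char shi[2] = { hi, '\0' };
  return strcoll (slo, sc) <= 0 && strcoll (sc, shi) <= 0;
}

/* Hypothetical sketch: is C within [LO-HI] by plain code point value,
   independent of the locale?  */
int
in_range_by_code_point (char lo, char c, char hi)
{
  unsigned char ulo = (unsigned char) lo;
  unsigned char uc = (unsigned char) c;
  unsigned char uhi = (unsigned char) hi;
  return ulo <= uc && uc <= uhi;
}
```

A caller would normally select the locale first, for example with
setlocale (LC_COLLATE, "").  In the C locale the two functions agree.
In a locale whose collation sequence places 'M' between 'a' and 'z',
in_range_by_strcoll ('a', 'M', 'z') would be expected to return 1 while
the code point version keeps returning 0.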

Thanks,
Florian


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-26  1:20     ` Carlos O'Donell
@ 2018-07-26  8:09       ` Andreas Schwab
  2018-07-26  9:16         ` Florian Weimer
  0 siblings, 1 reply; 42+ messages in thread
From: Andreas Schwab @ 2018-07-26  8:09 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

On Jul 25 2018, Carlos O'Donell <carlos@redhat.com> wrote:

> surprising matches and failures like the reported xfs test failure where
> [a-z] matched "Makefile"

??? [a-z] has always done that.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-26  7:42       ` Florian Weimer
@ 2018-07-26  8:18         ` Andreas Schwab
  2018-07-26  9:15           ` Florian Weimer
  2018-07-26 13:25           ` Carlos O'Donell
  0 siblings, 2 replies; 42+ messages in thread
From: Andreas Schwab @ 2018-07-26  8:18 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Jonathan Nieder, Carlos O'Donell, GNU C Library, Rich Felker,
	Mike Fabian, Zorro Lang, Joseph S. Myers

On Jul 26 2018, Florian Weimer <fweimer@redhat.com> wrote:

> The bash implementation of glob always uses strcoll/wcscoll ordering when
> globasciiranges is not active.  It does not use collation element ordering,
> so rearranging collation data does not affect it.

Why does strcoll not agree with the collation sequence?

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-26  8:18         ` Andreas Schwab
@ 2018-07-26  9:15           ` Florian Weimer
  2018-07-26 13:25           ` Carlos O'Donell
  1 sibling, 0 replies; 42+ messages in thread
From: Florian Weimer @ 2018-07-26  9:15 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Jonathan Nieder, Carlos O'Donell, GNU C Library, Rich Felker,
	Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/26/2018 10:18 AM, Andreas Schwab wrote:
> On Jul 26 2018, Florian Weimer <fweimer@redhat.com> wrote:
> 
>> The bash implementation of glob always uses strcoll/wcscoll ordering when
>> globasciiranges is not active.  It does not use collation element ordering,
>> so rearranging collation data does not affect it.
> 
> Why does strcoll not agree with the collation sequence?

The collation element ordering is encoded in the _NL_COLLATE_COLLSEQMB 
and _NL_COLLATE_COLLSEQWC tables, and not the weights used by strcoll.

Thanks,
Florian


* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-26  8:09       ` Andreas Schwab
@ 2018-07-26  9:16         ` Florian Weimer
  0 siblings, 0 replies; 42+ messages in thread
From: Florian Weimer @ 2018-07-26  9:16 UTC (permalink / raw)
  To: Andreas Schwab, Carlos O'Donell
  Cc: GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/26/2018 10:08 AM, Andreas Schwab wrote:
> On Jul 25 2018, Carlos O'Donell <carlos@redhat.com> wrote:
> 
>> surprising matches and failures like the reported xfs test failure where
>> [a-z] matched "Makefile"
> 
> ??? [a-z] has always done that.

It's about the glob/fnmatch pattern “[a-z]*”.
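
To make the distinction concrete, here is a minimal sketch (not the xfs
test itself) of how the glob pattern can be checked under a given
locale.  The helper name glob_matches_in_locale is hypothetical; only
setlocale() and fnmatch() are real interfaces, and the result for any
locale other than "C" depends on the installed collation data.

```cpp
#include <fnmatch.h>
#include <locale.h>

// Hypothetical helper: returns 1 if NAME matches the glob PATTERN under
// LOCALE_NAME, 0 if it does not, and -1 if the locale is unavailable
// or fnmatch reports an error.
static int
glob_matches_in_locale(const char *locale_name, const char *pattern,
                       const char *name)
{
  if (setlocale(LC_ALL, locale_name) == nullptr)
    return -1;
  int ret = fnmatch(pattern, name, 0);
  if (ret == 0)
    return 1;
  return ret == FNM_NOMATCH ? 0 : -1;
}
```

In the C locale, glob_matches_in_locale("C", "[a-z]*", "Makefile")
returns 0.  With interleaved upper and lowercase collation data, the
same call for en_US.UTF-8 could return 1, which is the failure
discussed here.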

Florian

* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-26  8:18         ` Andreas Schwab
  2018-07-26  9:15           ` Florian Weimer
@ 2018-07-26 13:25           ` Carlos O'Donell
  1 sibling, 0 replies; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-26 13:25 UTC (permalink / raw)
  To: Andreas Schwab, Florian Weimer
  Cc: Jonathan Nieder, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

On 07/26/2018 04:18 AM, Andreas Schwab wrote:
> On Jul 26 2018, Florian Weimer <fweimer@redhat.com> wrote:
> 
>> The bash implementation of glob always uses strcoll/wcscoll ordering when
>> globasciirange is not active.  It does not use collation element ordering,
>> so rearranging collation data does not affect it.
> 
> Why does strcoll not agree with the collation sequence?

There are two terms that mean very different things.

The strcoll output and collation sequence are the same.

The collation sequence is not the same as the collation element ordering
(the order of the rules in the source file).

POSIX mandated the use of collation element ordering (not sequence) for
regular expression ranges, and then decided this was a bad idea and instead
made it unspecified.

In glibc we continue to implement and support collation element ordering,
not collation sequence, for POSIX regular expression ranges.

Even collation sequence is a bad idea because [a-z] does not include all the
z's that are sorted after z, and you need special collation element markers
like AFTER-Z to find all the z's. Instead we should use rational ranges
and make everything based on code points to make it portable across all
locales.
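
As a rough sketch of the two notions (neither function is glibc's
regex code; both names are hypothetical), a code point range compares
numeric values only, while a collation sequence range asks wcscoll()
and therefore changes with LC_COLLATE:

```cpp
#include <wchar.h>

// Code point ("rational") range: membership depends only on the
// numeric values, so the result is the same in every locale.
static bool
in_code_point_range(wchar_t first, wchar_t last, wchar_t ch)
{
  return first <= ch && ch <= last;
}

// Collation sequence range: membership follows wcscoll ordering in
// the current LC_COLLATE locale, so the result is locale-dependent.
static bool
in_collation_range(wchar_t first, wchar_t last, wchar_t ch)
{
  const wchar_t lo[2] = { first, L'\0' };
  const wchar_t hi[2] = { last, L'\0' };
  const wchar_t s[2] = { ch, L'\0' };
  return wcscoll(lo, s) <= 0 && wcscoll(s, hi) <= 0;
}
```

The first function never accepts 'M' or U+00E9 for a-z.  The second
may accept either, depending on where the locale sorts them.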

Cheers,
Carlos.

* Re: [PATCHv4a] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-26  2:34                     ` [PATCHv4a] " Carlos O'Donell
@ 2018-07-26 14:51                       ` Florian Weimer
  2018-07-26 14:59                         ` Carlos O'Donell
  2018-07-28  1:12                         ` [WIPv5] " Carlos O'Donell
  0 siblings, 2 replies; 42+ messages in thread
From: Florian Weimer @ 2018-07-26 14:51 UTC (permalink / raw)
  To: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 3967 bytes --]

On 07/26/2018 04:34 AM, Carlos O'Donell wrote:
> On 07/25/2018 04:57 PM, Carlos O'Donell wrote:
>> v4
>> - Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH.
>> - Added range checking for a-z, A-Z for all supported UTF-8 locales.
>>
>> All of my testers are clean.
> 
> Attaching v4 on top of the current master.
> 
> This fixes all the locales.

I wrote another enumeration tester, this time covering all locales.  It 
found these issues:

az_AZ: U+000069 fails to match /[a-z]/
az_AZ: U+000049 fails to match /[A-Z]/
az_AZ.utf8: U+000069 fails to match /[a-z]/
az_AZ.utf8: U+000049 fails to match /[A-Z]/
crh_UA: U+000069 fails to match /[a-z]/
crh_UA: U+000049 fails to match /[A-Z]/
crh_UA.utf8: U+000069 fails to match /[a-z]/
crh_UA.utf8: U+000049 fails to match /[A-Z]/
ku_TR: U+000069 fails to match /[a-z]/
ku_TR: U+000049 fails to match /[A-Z]/
ku_TR.iso88599: U+000069 fails to match /[a-z]/
ku_TR.iso88599: U+000049 fails to match /[A-Z]/
ku_TR.utf8: U+000069 fails to match /[a-z]/
ku_TR.utf8: U+000049 fails to match /[A-Z]/
lv_LV: U+000079 fails to match /[a-z]/
lv_LV: U+000059 fails to match /[A-Z]/
lv_LV.iso885913: U+000079 fails to match /[a-z]/
lv_LV.iso885913: U+000059 fails to match /[A-Z]/
lv_LV.utf8: U+000079 fails to match /[a-z]/
lv_LV.utf8: U+000059 fails to match /[A-Z]/
shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
slovene: U+00006A fails to match /[a-z]/
slovene: U+00006B fails to match /[a-z]/
slovene: U+00006C fails to match /[a-z]/
slovene: U+00006D fails to match /[a-z]/
slovene: U+00006E fails to match /[a-z]/
slovene: U+00006F fails to match /[a-z]/
slovenian: U+00006A fails to match /[a-z]/
slovenian: U+00006B fails to match /[a-z]/
slovenian: U+00006C fails to match /[a-z]/
slovenian: U+00006D fails to match /[a-z]/
slovenian: U+00006E fails to match /[a-z]/
slovenian: U+00006F fails to match /[a-z]/
sl_SI: U+00006A fails to match /[a-z]/
sl_SI: U+00006B fails to match /[a-z]/
sl_SI: U+00006C fails to match /[a-z]/
sl_SI: U+00006D fails to match /[a-z]/
sl_SI: U+00006E fails to match /[a-z]/
sl_SI: U+00006F fails to match /[a-z]/
sl_SI.iso88592: U+00006A fails to match /[a-z]/
sl_SI.iso88592: U+00006B fails to match /[a-z]/
sl_SI.iso88592: U+00006C fails to match /[a-z]/
sl_SI.iso88592: U+00006D fails to match /[a-z]/
sl_SI.iso88592: U+00006E fails to match /[a-z]/
sl_SI.iso88592: U+00006F fails to match /[a-z]/
sl_SI.utf8: U+00006A fails to match /[a-z]/
sl_SI.utf8: U+00006B fails to match /[a-z]/
sl_SI.utf8: U+00006C fails to match /[a-z]/
sl_SI.utf8: U+00006D fails to match /[a-z]/
sl_SI.utf8: U+00006E fails to match /[a-z]/
sl_SI.utf8: U+00006F fails to match /[a-z]/
sv_FI: U+000077 fails to match /[a-z]/
sv_FI: U+000057 fails to match /[A-Z]/
sv_FI@euro: U+000077 fails to match /[a-z]/
sv_FI@euro: U+000057 fails to match /[A-Z]/
sv_FI.iso88591: U+000077 fails to match /[a-z]/
sv_FI.iso88591: U+000057 fails to match /[A-Z]/
sv_FI.iso885915@euro: U+000077 fails to match /[a-z]/
sv_FI.iso885915@euro: U+000057 fails to match /[A-Z]/
sv_FI.utf8: U+000077 fails to match /[a-z]/
sv_FI.utf8: U+000057 fails to match /[A-Z]/
sv_SE: U+000077 fails to match /[a-z]/
sv_SE: U+000057 fails to match /[A-Z]/
sv_SE.iso88591: U+000077 fails to match /[a-z]/
sv_SE.iso88591: U+000057 fails to match /[A-Z]/
sv_SE.utf8: U+000077 fails to match /[a-z]/
sv_SE.utf8: U+000057 fails to match /[A-Z]/
swedish: U+000077 fails to match /[a-z]/
swedish: U+000057 fails to match /[A-Z]/
tt_RU: U+000069 fails to match /[a-z]/
tt_RU: U+000049 fails to match /[A-Z]/
tt_RU@iqtelif: U+000069 fails to match /[a-z]/
tt_RU@iqtelif: U+000049 fails to match /[A-Z]/
tt_RU.utf8: U+000069 fails to match /[a-z]/
tt_RU.utf8: U+000049 fails to match /[A-Z]/
tt_RU.utf8@iqtelif: U+000069 fails to match /[a-z]/
tt_RU.utf8@iqtelif: U+000049 fails to match /[A-Z]/

Thanks,
Florian

[-- Attachment #2: rational-ranges-1.cc --]
[-- Type: text/x-c++src, Size: 2944 bytes --]

#include <err.h>
#include <errno.h>
#include <limits.h>
#include <locale.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Return the sorted list of locale names reported by "locale -a".
static std::vector<std::string>
get_locales()
{
  FILE *fp = popen("locale -a", "r");
  if (fp == NULL)
    err(1, "running locale -a");

  std::vector<std::string> result;
  while (!feof(fp))
    {
      char *elem{};
      int ret = fscanf(fp, "%ms", &elem);
      if (ret == 1)
        {
          if (elem == nullptr)
            errx(1, "invalid fscanf result");
          result.emplace_back(elem);
          free(elem);
        }
      else if (ferror(fp))
        err(1, "fscanf failed");
    }

  int ret = pclose(fp);
  if (ret == -1)
    err(1, "pclose");
  if (ret != 0)
    errx(1, "locale -a failed with status %d", ret);

  std::sort(result.begin(), result.end());
  return result;
}

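// Check every code point against PATTERN in the current locale and
// report each one whose match result differs from membership in the
// code point interval RANGE.  LOCALE is used only for reporting.
static void
test_regexp_range(const char *locale, const char *pattern,
                  std::pair<wchar_t, wchar_t> range)
{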
  regex_t reg;
  {
    int ret = regcomp(&reg, pattern, REG_EXTENDED | REG_NOSUB);
    if (ret != 0)
      errx(1, "Cannot compile regular expression /%s/: %d", pattern, ret);
  }

  const wchar_t maximum_character = 0x10FFFF;
  const unsigned maximum_length = 5; /* With NUL.  */
  for (wchar_t ch = 1; ch <= maximum_character; ++ch)
    {
      char uch[MB_LEN_MAX];
      mbstate_t ps{};
      {
        size_t ret = wcrtomb(uch, ch, &ps);
        if (ret == static_cast<size_t>(-1))
          {
            if (errno == EILSEQ)
              continue;
            err(1, "wcrtomb(0x%x)", static_cast<unsigned>(ch));
          }
        else if (ret == 0)
          continue;             // Some anomaly.
        if (ret >= maximum_length)
          errx(1, "multi-byte length %zu at 0x%x exceeds %u",
               ret, static_cast<unsigned>(ch), maximum_length);
        uch[ret]  = '\0';
      }
      int ret = regexec(&reg, uch, 0, NULL, 0);
      if (ret != 0 && ret != REG_NOMATCH)
        errx(1, "regexec of /%s/ failed with code %d", pattern, ret);
      bool regex_matches = ret == 0;
      bool range_matches = range.first <= ch && ch <= range.second;
      if (regex_matches != range_matches) {
        if (regex_matches)
          printf("%s: U+%06X matches /%s/ unexpectedly\n",
                 locale, static_cast<unsigned>(ch), pattern);
        else
          printf("%s: U+%06X fails to match /%s/\n",
                 locale, static_cast<unsigned>(ch), pattern);
      }
    }

  regfree(&reg);
}

int
main()
{
  std::vector<std::string> locales{get_locales()};
  for (const auto &locale : locales) {
    if (setlocale(LC_ALL, locale.c_str()) == NULL)
      err(1, "Cannot set locale to %s", locale.c_str());
    test_regexp_range(locale.c_str(), "[0-9]", std::make_pair(L'0', L'9'));
    test_regexp_range(locale.c_str(), "[a-z]", std::make_pair(L'a', L'z'));
    test_regexp_range(locale.c_str(), "[A-Z]", std::make_pair(L'A', L'Z'));
  }
}

* Re: [PATCHv4a] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-26 14:51                       ` Florian Weimer
@ 2018-07-26 14:59                         ` Carlos O'Donell
  2018-07-28  1:12                         ` [WIPv5] " Carlos O'Donell
  1 sibling, 0 replies; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-26 14:59 UTC (permalink / raw)
  To: Florian Weimer, libc-alpha

On 07/26/2018 10:50 AM, Florian Weimer wrote:
> On 07/26/2018 04:34 AM, Carlos O'Donell wrote:
>> On 07/25/2018 04:57 PM, Carlos O'Donell wrote:
>>> v4
>>> - Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH.
>>> - Added range checking for a-z, A-Z for all supported UTF-8 locales.
>>>
>>> All of my testers are clean.
>>
>> Attaching v4 on top of the current master.
>>
>> This fixes all the locales.
> 
> I wrote another enumeration tester, this time covering all locales.  It found these issues:
> 
> az_AZ: U+000069 fails to match /[a-z]/
> az_AZ: U+000049 fails to match /[A-Z]/
> az_AZ.utf8: U+000069 fails to match /[a-z]/
> az_AZ.utf8: U+000049 fails to match /[A-Z]/

See it.

> crh_UA: U+000069 fails to match /[a-z]/
> crh_UA: U+000049 fails to match /[A-Z]/
> crh_UA.utf8: U+000069 fails to match /[a-z]/
> crh_UA.utf8: U+000049 fails to match /[A-Z]/

See it.

> ku_TR: U+000069 fails to match /[a-z]/
> ku_TR: U+000049 fails to match /[A-Z]/
> ku_TR.iso88599: U+000069 fails to match /[a-z]/
> ku_TR.iso88599: U+000049 fails to match /[A-Z]/
> ku_TR.utf8: U+000069 fails to match /[a-z]/
> ku_TR.utf8: U+000049 fails to match /[A-Z]/

See it.

> lv_LV: U+000079 fails to match /[a-z]/
> lv_LV: U+000059 fails to match /[A-Z]/
> lv_LV.iso885913: U+000079 fails to match /[a-z]/
> lv_LV.iso885913: U+000059 fails to match /[A-Z]/
> lv_LV.utf8: U+000079 fails to match /[a-z]/
> lv_LV.utf8: U+000059 fails to match /[A-Z]/

See it.

> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly

Good catch. These were the ones I was hoping your finder would catch.

> slovene: U+00006A fails to match /[a-z]/
> slovene: U+00006B fails to match /[a-z]/
> slovene: U+00006C fails to match /[a-z]/
> slovene: U+00006D fails to match /[a-z]/
> slovene: U+00006E fails to match /[a-z]/
> slovene: U+00006F fails to match /[a-z]/

This is an alias for sl_SI.ISO-8859-2 and we see it below.

> slovenian: U+00006A fails to match /[a-z]/
> slovenian: U+00006B fails to match /[a-z]/
> slovenian: U+00006C fails to match /[a-z]/
> slovenian: U+00006D fails to match /[a-z]/
> slovenian: U+00006E fails to match /[a-z]/
> slovenian: U+00006F fails to match /[a-z]/

This is an alias for sl_SI.ISO-8859-2 and we see it below.

> sl_SI: U+00006A fails to match /[a-z]/
> sl_SI: U+00006B fails to match /[a-z]/
> sl_SI: U+00006C fails to match /[a-z]/
> sl_SI: U+00006D fails to match /[a-z]/
> sl_SI: U+00006E fails to match /[a-z]/
> sl_SI: U+00006F fails to match /[a-z]/

See it.

> sl_SI.iso88592: U+00006A fails to match /[a-z]/
> sl_SI.iso88592: U+00006B fails to match /[a-z]/
> sl_SI.iso88592: U+00006C fails to match /[a-z]/
> sl_SI.iso88592: U+00006D fails to match /[a-z]/
> sl_SI.iso88592: U+00006E fails to match /[a-z]/
> sl_SI.iso88592: U+00006F fails to match /[a-z]/

See it (aliased above twice).

> sl_SI.utf8: U+00006A fails to match /[a-z]/
> sl_SI.utf8: U+00006B fails to match /[a-z]/
> sl_SI.utf8: U+00006C fails to match /[a-z]/
> sl_SI.utf8: U+00006D fails to match /[a-z]/
> sl_SI.utf8: U+00006E fails to match /[a-z]/
> sl_SI.utf8: U+00006F fails to match /[a-z]/

See it.

> sv_FI: U+000077 fails to match /[a-z]/
> sv_FI: U+000057 fails to match /[A-Z]/

See it.

> sv_FI@euro: U+000077 fails to match /[a-z]/
> sv_FI@euro: U+000057 fails to match /[A-Z]/

Same as sv_FI.

> sv_FI.iso88591: U+000077 fails to match /[a-z]/
> sv_FI.iso88591: U+000057 fails to match /[A-Z]/

Likewise.

> sv_FI.iso885915@euro: U+000077 fails to match /[a-z]/
> sv_FI.iso885915@euro: U+000057 fails to match /[A-Z]/

Likewise.

> sv_FI.utf8: U+000077 fails to match /[a-z]/
> sv_FI.utf8: U+000057 fails to match /[A-Z]/

Likewise.

> sv_SE: U+000077 fails to match /[a-z]/
> sv_SE: U+000057 fails to match /[A-Z]/

See it.

> sv_SE.iso88591: U+000077 fails to match /[a-z]/
> sv_SE.iso88591: U+000057 fails to match /[A-Z]/

Same as above.

> sv_SE.utf8: U+000077 fails to match /[a-z]/
> sv_SE.utf8: U+000057 fails to match /[A-Z]/

Likewise.

> swedish: U+000077 fails to match /[a-z]/
> swedish: U+000057 fails to match /[A-Z]/

Alias for sv_SE.

> tt_RU: U+000069 fails to match /[a-z]/
> tt_RU: U+000049 fails to match /[A-Z]/

See it.

> tt_RU@iqtelif: U+000069 fails to match /[a-z]/
> tt_RU@iqtelif: U+000049 fails to match /[A-Z]/

See it.

> tt_RU.utf8: U+000069 fails to match /[a-z]/
> tt_RU.utf8: U+000049 fails to match /[A-Z]/

See it.

> tt_RU.utf8@iqtelif: U+000069 fails to match /[a-z]/
> tt_RU.utf8@iqtelif: U+000049 fails to match /[A-Z]/

See it.

Thank you!

I increased tst-fnmatch.input coverage and I get this:

Line #3699: Test #3548 (az_AZ.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #3751: Test #3600 (az_AZ.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #6819: Test #6668 (crh_UA.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #6871: Test #6720 (crh_UA.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #18675: Test #18524 (ku_TR.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #18727: Test #18576 (ku_TR.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #19835: Test #19684 (lv_LV.UTF-8): fnmatch ("[a-z]", "y", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #19887: Test #19736 (lv_LV.UTF-8): fnmatch ("[A-Z]", "Y", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26684: Test #26533 (sl_SI.UTF-8): fnmatch ("[a-z]", "j", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26685: Test #26534 (sl_SI.UTF-8): fnmatch ("[a-z]", "k", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26686: Test #26535 (sl_SI.UTF-8): fnmatch ("[a-z]", "l", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26687: Test #26536 (sl_SI.UTF-8): fnmatch ("[a-z]", "m", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26688: Test #26537 (sl_SI.UTF-8): fnmatch ("[a-z]", "n", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #26689: Test #26538 (sl_SI.UTF-8): fnmatch ("[a-z]", "o", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #28049: Test #27898 (sv_FI.UTF-8): fnmatch ("[a-z]", "w", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #28101: Test #27950 (sv_FI.UTF-8): fnmatch ("[A-Z]", "W", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #28153: Test #28002 (sv_SE.UTF-8): fnmatch ("[a-z]", "w", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #28205: Test #28054 (sv_SE.UTF-8): fnmatch ("[A-Z]", "W", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #30427: Test #30276 (tt_RU.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #30479: Test #30328 (tt_RU.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #30531: Test #30380 (tt_RU.UTF-8@iqtelif): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) ***
Line #30583: Test #30432 (tt_RU.UTF-8@iqtelif): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) ***

Which matches all the locales you saw failures in except for shs_CA, which is a real bug.
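
For reference, the per-letter check behind those lines can be sketched
as follows.  This is an illustration only, not tst-fnmatch itself; the
helper name unmatched_in_range is hypothetical, and the caller is
assumed to have selected the locale with setlocale() beforehand.

```cpp
#include <fnmatch.h>
#include <string>

// Hypothetical helper: return the characters in [first, last] that
// PATTERN fails to match under the current locale.
static std::string
unmatched_in_range(const char *pattern, char first, char last)
{
  std::string missing;
  for (int c = first; c <= last; ++c)
    {
      const char text[2] = { static_cast<char>(c), '\0' };
      if (fnmatch(pattern, text, 0) != 0)
        missing.push_back(static_cast<char>(c));
    }
  return missing;
}
```

In the C locale the result for "[a-z]" over 'a'..'z' is empty.  Going
by the failures listed above, az_AZ.UTF-8 with the v4 data would
presumably yield "i".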

I'll fix these up quickly.

Cheers,
Carlos.

* [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-26 14:51                       ` Florian Weimer
  2018-07-26 14:59                         ` Carlos O'Donell
@ 2018-07-28  1:12                         ` Carlos O'Donell
  2018-07-30 17:40                           ` Florian Weimer
  1 sibling, 1 reply; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-28  1:12 UTC (permalink / raw)
  To: Florian Weimer, libc-alpha

[-- Attachment #1: Type: text/plain, Size: 900 bytes --]

On 07/26/2018 10:50 AM, Florian Weimer wrote:
> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly

This is a WIP, because the number of tests now is too big
to simply add them to tst-fnmatch.input, and so I'm writing
a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
expecting all of the locales to be built for testing, and
then running through all the rational ranges to test
inclusion of the required datums.
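
A sketch of the SUPPORTED parsing step, assuming each entry has the
form "locale/charmap" followed by a line-continuation backslash, as in
localedata/SUPPORTED.  The type and function names are hypothetical and
this is not the code in tst-rational-ranges.c, which is not shown here.

```cpp
#include <string>

// One entry of the SUPPORTED list.  Both fields are empty when the
// line does not describe a locale.
struct supported_entry
{
  std::string locale;
  std::string charmap;
};

// Split a line such as "az_AZ/UTF-8 \" into locale and charmap.
// Comment lines, blank lines, and the "SUPPORTED-LOCALES=\" header
// yield an empty entry.
static supported_entry
parse_supported_line(const std::string &line)
{
  supported_entry entry;
  // Ignore trailing whitespace and the continuation backslash.
  std::string::size_type end = line.find_last_not_of(" \t\r\n\\");
  if (end == std::string::npos)
    return entry;
  std::string::size_type begin = line.find_first_not_of(" \t");
  std::string text = line.substr(begin, end - begin + 1);
  if (text[0] == '#' || text.find('=') != std::string::npos)
    return entry;
  std::string::size_type slash = text.find('/');
  if (slash == std::string::npos || slash == 0
      || slash + 1 == text.size())
    return entry;
  entry.locale = text.substr(0, slash);
  entry.charmap = text.substr(slash + 1);
  return entry;
}
```

Each non-empty entry's locale name can then be passed to setlocale()
before running the range checks.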

How slow is your tester? Should I do what you do to test
for the inclusion of characters that shouldn't be in the
range? Or will that take too long?

v5
- Add ~30k+ tests to tst-fnmatch.input.
- Fix broken locales: 
- Fix shs_CA to not reorder-after for no reason.

Could you run this through the tester please?

Cheers,
Carlos.

[-- Attachment #2: swbz23393v5.tar.gz --]
[-- Type: application/gzip, Size: 126966 bytes --]

* Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-28  1:12                         ` [WIPv5] " Carlos O'Donell
@ 2018-07-30 17:40                           ` Florian Weimer
  2018-07-30 17:45                             ` Carlos O'Donell
  2018-07-31  2:18                             ` Carlos O'Donell
  0 siblings, 2 replies; 42+ messages in thread
From: Florian Weimer @ 2018-07-30 17:40 UTC (permalink / raw)
  To: Carlos O'Donell, libc-alpha

On 07/28/2018 03:12 AM, Carlos O'Donell wrote:
> On 07/26/2018 10:50 AM, Florian Weimer wrote:
>> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
>> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
>> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
>> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly

> This is a WIP, because the number of tests now is too big
> to simply add them to tst-fnmatch.input, and so I'm writing
> a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
> expecting all of the locales to be built for testing, and
> then running through all the rational ranges to test
> inclusion of the required datums.

Let me repeat my suggestion that we should initially fix the locales 
with the common collation order, where glibc 2.28 regresses.

> How slow is your tester? Should I do what you do to test
> for the inclusion of characters that shouldn't be in the
> range? Or will that take too long?
> 
> v5
> - Add ~30k+ tests to tst-fnmatch.input.
> - Fix broken locales:
> - Fix shs_CA to not reorder-after for no reason.
> 
> Could you run this through the tester please?

It fails installation for me:

$ make localedata/install-locales DESTDIR=/tmp/locales
sl_SI.UTF-8...locales/sl_SI:1230: order for `U00000061' already defined at locales/sl_SI:998
locales/sl_SI:1231: [error] symbol `S0062' not defined
locales/sl_SI:1231: [error] symbol `BASE' not defined
/bin/sh: line 17:  4148 Segmentation fault      (core dumped) I18NPATH=. 
GCONV_PATH=/home/fweimer/src/gnu/glibc/build/iconvdata LC_ALL=C 
/home/fweimer/src/gnu/glibc/build/elf/ld-linux-x86-64.so.2 
--library-path 
/home/fweimer/src/gnu/glibc/build:/home/fweimer/src/gnu/glibc/build/math:/home/fweimer/src/gnu/glibc/build/elf:/home/fweimer/src/gnu/glibc/build/dlfcn:/home/fweimer/src/gnu/glibc/build/nss:/home/fweimer/src/gnu/glibc/build/nis:/home/fweimer/src/gnu/glibc/build/rt:/home/fweimer/src/gnu/glibc/build/resolv:/home/fweimer/src/gnu/glibc/build/mathvec:/home/fweimer/src/gnu/glibc/build/support:/home/fweimer/src/gnu/glibc/build/crypt:/home/fweimer/src/gnu/glibc/build/nptl 
/home/fweimer/src/gnu/glibc/build/locale/localedef $flags 
--alias-file=../intl/locale.alias -i locales/$input -f charmaps/$charset 
--prefix=/tmp/locales $locale

GDB says this:

Core was generated by 
`/home/fweimer/src/gnu/glibc/build/elf/ld-linux-x86-64.so.2 
--library-path /home'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000419234 in output_weight (pool=pool@entry=0x7ffdf1550ce0, 
collate=collate@entry=0x7fd5a8a03240,
     elem=elem@entry=0x7fd5a8a9b300) at programs/ld-collate.c:1912
1912              len += utf8_encode (&buf[len],
(gdb) bt
#0  0x0000000000419234 in output_weight (pool=pool@entry=0x7ffdf1550ce0, 
collate=collate@entry=0x7fd5a8a03240,
     elem=elem@entry=0x7fd5a8a9b300) at programs/ld-collate.c:1912
#1  0x000000000041dc4a in collate_output () at programs/ld-collate.c:2180
#2  0x000000000042709f in write_all_categories 
(definitions=0x7ffdf15513c0, charmap=charmap@entry=0x7fd5a71786a0,
     locname=0x7ffdf1552e33 "sl_SI.UTF-8", 
output_path=output_path@entry=0x7fd5a7178310 
"/tmp/locales/usr/lib64/locale/sl_SI.utf8/")
     at programs/locfile.c:337
#3  0x0000000000402f69 in main (argc=<optimized out>, 
argv=0x7ffdf1551630) at programs/localedef.c:300
(gdb) l
1907          int i;
1908
1909          for (i = 0; i < elem->weights[cnt].cnt; ++i)
1910            /* Encode the weight value.  We do nothing for IGNORE entries.  */
1911            if (elem->weights[cnt].w[i] != NULL)
1912              len += utf8_encode (&buf[len],
1913                                  elem->weights[cnt].w[i]->mborder[cnt]);
1914
1915          /* And add the buffer content.  */
1916          obstack_1grow (pool, len);
(gdb) print elem->weights[cnt].w[i]->mborder[cnt]
Cannot access memory at address 0x0
(gdb) print elem->weights[cnt].w[i]->mborder
$3 = (int *) 0x0
(gdb)

Any idea what is going on?

Thanks,
Florian

* Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-30 17:40                           ` Florian Weimer
@ 2018-07-30 17:45                             ` Carlos O'Donell
  2018-07-30 17:54                               ` Florian Weimer
  2018-07-31  2:18                             ` Carlos O'Donell
  1 sibling, 1 reply; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-30 17:45 UTC (permalink / raw)
  To: Florian Weimer, libc-alpha

On 07/30/2018 01:39 PM, Florian Weimer wrote:
> On 07/28/2018 03:12 AM, Carlos O'Donell wrote:
>> On 07/26/2018 10:50 AM, Florian Weimer wrote:
>>> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
>>> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
>>> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
>>> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
> 
>> This is a WIP, because the number of tests now is too big
>> to simply add them to tst-fnmatch.input, and so I'm writing
>> a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
>> expecting all of the locales to be built for testing, and
>> then running through all the rational ranges to test
>> inclusion of the required datums.
> 
> Let me repeat my suggestion that we should initially fix the locales
> with the common collation order, where glibc 2.28 regresses.

I do not think it is appropriate to release rational range support on
only a subset of the SUPPORTED set of locales. Either we support it on
all SUPPORTED locales or we work until we are ready.

At present glibc 2.28 does not regress because of commit 
7cd7d36f1feb3ccacf476e909b115b45cdd46e77 to deinterlace lower and
uppercase.

In glibc 2.28 we simply have ~2500 characters in the range of a-z,
and in 2.27 we had ~250; either way it's a large set of non-ASCII
characters accepted by the range.  The increase is all because we
caught up to Unicode 9.0.0 with the ISO 14651 collation update (and
will soon update to Unicode 10.0.0 with the next release, probably
always lagging a bit).
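
Counts like these can be obtained by enumeration.  The following is a
sketch only: the function name count_range_matches is hypothetical,
the upper bound is left to the caller, and the exact numbers depend on
the installed locale data.

```cpp
#include <limits.h>
#include <locale.h>
#include <regex.h>
#include <wchar.h>

// Count the code points in [1, max_cp] whose multibyte encoding in
// LOCALE_NAME matches PATTERN.  Returns -1 if the locale cannot be
// selected or the pattern does not compile.
static long
count_range_matches(const char *locale_name, const char *pattern,
                    wchar_t max_cp)
{
  if (setlocale(LC_ALL, locale_name) == nullptr)
    return -1;
  regex_t reg;
  if (regcomp(&reg, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  long count = 0;
  for (wchar_t ch = 1; ch <= max_cp; ++ch)
    {
      char buf[MB_LEN_MAX + 1];
      mbstate_t ps{};
      size_t len = wcrtomb(buf, ch, &ps);
      if (len == static_cast<size_t>(-1) || len == 0)
        continue;  // Not representable in this locale's charset.
      buf[len] = '\0';
      if (regexec(&reg, buf, 0, nullptr, 0) == 0)
        ++count;
    }
  regfree(&reg);
  return count;
}
```

For example, count_range_matches("en_US.UTF-8", "^[a-z]$", 0x10FFFF)
scans the whole Unicode range; in the C locale only the 26 ASCII
lowercase letters match.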

I don't see an urgent need to get rational range support into 2.28.
I was happy to get it in earlier, but now with deeper testing showing
that not all locales are working correctly, I'm not happy to see this
go out the door. I think it will be ready very shortly, and we can check
it in immediately into 2.29, and then continue our work on code point
ranges as the next step, which will require even more testing, and
internal API cleanup.

-- 
Cheers,
Carlos.

* Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-30 17:45                             ` Carlos O'Donell
@ 2018-07-30 17:54                               ` Florian Weimer
  2018-07-30 18:26                                 ` Carlos O'Donell
  0 siblings, 1 reply; 42+ messages in thread
From: Florian Weimer @ 2018-07-30 17:54 UTC (permalink / raw)
  To: Carlos O'Donell, libc-alpha

On 07/30/2018 07:45 PM, Carlos O'Donell wrote:
> On 07/30/2018 01:39 PM, Florian Weimer wrote:
>> On 07/28/2018 03:12 AM, Carlos O'Donell wrote:
>>> On 07/26/2018 10:50 AM, Florian Weimer wrote:
>>>> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
>>>> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
>>>> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
>>>> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
>>
>>> This is a WIP, because the number of tests now is too big
>>> to simply add them to tst-fnmatch.input, and so I'm writing
>>> a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
>>> expecting all of the locales to be built for testing, and
>>> then running through all the rational ranges to test
>>> inclusion of the required datums.
>>
>> Let me repeat my suggestion that we should initially fix the locales
>> with the common collation order, where glibc 2.28 regresses.
> 
> I do not think it is appropriate to release rational range support on
> only a subset of the SUPPORTED set of locales. Either we support it on
> all SUPPORTED locales or we work until we are ready.
> 
> At present glibc 2.28 does not regress because of commit
> 7cd7d36f1feb3ccacf476e909b115b45cdd46e77 to deinterlace lower and
> uppercase.
> 
> In glibc 2.28 we simply have ~2500 characters in the range of a-z,
> and in 2.27 we had ~250, it's still a large set of non-ASCII characters
> accepted by the range, all because we caught up to Unicode 9.0.0 with
> the ISO 14651 collation update (and will soon update to Unicode 10.0.0
> with the next release, and probably always lagging a bit).

Ahh.  So it's more complex and a regression longer in the making.

> I don't see an urgent need to get rational range support into 2.28.
> I was happy to get it in earlier, but now with deeper testing showing
> that not all locales are working correctly, I'm not happy to see this
> go out the door. I think it will be ready very shortly, and we can check
> it in immediately into 2.29, and then continue our work on code point
> ranges as the next step, which will require even more testing, and
> internal API cleanup.

Sounds reasonable.

Thanks,
Florian

* Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-30 17:54                               ` Florian Weimer
@ 2018-07-30 18:26                                 ` Carlos O'Donell
  2018-07-30 18:34                                   ` Florian Weimer
  0 siblings, 1 reply; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-30 18:26 UTC (permalink / raw)
  To: Florian Weimer, libc-alpha

On 07/30/2018 01:54 PM, Florian Weimer wrote:
> On 07/30/2018 07:45 PM, Carlos O'Donell wrote:
>> On 07/30/2018 01:39 PM, Florian Weimer wrote:
>>> On 07/28/2018 03:12 AM, Carlos O'Donell wrote:
>>>> On 07/26/2018 10:50 AM, Florian Weimer wrote:
>>>>> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
>>>>> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
>>>>> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
>>>>> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
>>>
>>>> This is a WIP, because the number of tests now is too big
>>>> to simply add them to tst-fnmatch.input, and so I'm writing
>>>> a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
>>>> expecting all of the locales to be built for testing, and
>>>> then running through all the rational ranges to test
>>>> inclusion of the required datums.
>>>
>>> Let me repeat my suggestion that we should initially fix the locales
>>> with the common collation order, where glibc 2.28 regresses.
>>
>> I do not think it is appropriate to release rational range support on
>> only a subset of the SUPPORTED set of locales. Either we support it on
>> all SUPPORTED locales or we work until we are ready.
>>
>> At present glibc 2.28 does not regress because of commit
>> 7cd7d36f1feb3ccacf476e909b115b45cdd46e77 to deinterlace lower and
>> uppercase.
>>
>> In glibc 2.28 we simply have ~2500 characters in the range of a-z,
>> and in 2.27 we had ~250, it's still a large set of non-ASCII characters
>> accepted by the range, all because we caught up to Unicode 9.0.0 with
>> the ISO 14651 collation update (and will soon update to Unicode 10.0.0
>> with the next release, and probably always lagging a bit).
> 
> Ahh.  So it's more complex and a regression longer in the making.

I'm worried I don't quite follow your statement of "longer in the making,"
but let me summarize what I think you wrote, and tell me if I have
it right.

The regression, from the perspective of en_US, is that [a-z] in master
accepts uppercase ASCII characters, and this breaks user expectations.

This is the only regression I'm considering serious enough to block the
release for and we've fixed it for now.

The regression which you say is "longer in the making" is that at some
point in the past the collation data for en_US contained only ASCII
ranges for a-z, A-Z, and 0-9. Then at some point in the past the ranges,
particularly those from a-z, and A-Z began accepting non-ASCII characters.

Thus the regression, from your perspective, happened far in the past.

As far as I can tell the regression has existed since the first import
for en_US which copied LC_COLLATE from en_DK (showing en_DK):
~~~
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000  967) <A> <A>;<NONE>;<CAPITAL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000  968) <a> <A>;<NONE>;<SMALL>;IGNORE
...
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1546) <Z> <Z>;<NONE>;<CAPITAL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1547) <z> <Z>;<NONE>;<SMALL>;IGNORE
f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1548) <Z'>        <Z>;<ACUTE>;<CAPITAL>;IGNORE
~~~
Is this what you mean by "longer in the making?"
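
To illustrate why the interleaved ordering above matters for ranges, here
is a minimal Python model (not glibc code). It assumes a range [x-y]
matches every character whose position in the collation sequence lies
between x and y. The helper name range_members and both toy sequences are
made up for illustration; the deinterlaced order in particular is an
assumption, not the real locale data.

```python
import string

def range_members(order, lo, hi):
    """Return the characters of `order` between lo and hi, inclusive."""
    start, end = order.index(lo), order.index(hi)
    return order[start:end + 1]

# Interleaved, capital first, as in the 1997 en_DK excerpt: A a B b ... Z z
interleaved = [c
               for pair in zip(string.ascii_uppercase, string.ascii_lowercase)
               for c in pair]

# Toy deinterlaced order (an assumption): all lowercase, then all uppercase.
deinterlaced = list(string.ascii_lowercase + string.ascii_uppercase)

upper_in_az = [c for c in range_members(interleaved, "a", "z") if c.isupper()]
print("".join(upper_in_az))  # prints BCDEFGHIJKLMNOPQRSTUVWXYZ
print(any(c.isupper() for c in range_members(deinterlaced, "a", "z")))  # prints False
```

In this model the interleaved sequence puts B through Z inside [a-z] while
leaving A outside, and the deinterlaced sequence keeps [a-z] lowercase-only.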

I expect that en_US at some point along the way was switched to use the
iso14651_t1 data, and so gained non-interleaved a-z/A-Z CEO, but it's hard
to tell exactly if CEO was fully functional, if fnmatch worked as expected,
etc.

Either way this is all a poorly understood and poorly structured solution at
this point, and I hope that in one or two releases we go from "unusable
interface" to "rational ranges (data)" to "full rational ranges (code point
ranges)" and end up with a sensible portable solution.
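
As a rough sketch of what the last step could mean, a code point range
decides membership by comparing code points instead of collation
positions. The function below is a hypothetical Python model under that
assumption, not the planned glibc implementation.

```python
def in_code_point_range(ch, lo, hi):
    """True if ch's Unicode code point lies between those of lo and hi."""
    return ord(lo) <= ord(ch) <= ord(hi)

# Only the lowercase ASCII letter falls inside a-z under this rule.
for ch in ["m", "M", "\u00e9", "5"]:
    print(repr(ch), in_code_point_range(ch, "a", "z"))
```

Under this rule the result no longer depends on the locale's collation
data at all, which is what would make it portable.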

>> I don't see an urgent need to get rational range support into 2.28.
>> I was happy to get it in earlier, but now with deeper testing showing
>> that not all locales are working correctly, I'm not happy to see this
>> go out the door. I think it will be ready very shortly, and we can check
>> it in immediately into 2.29, and then continue our work on code point
>> ranges as the next step, which will require even more testing, and
>> internal API cleanup.
> 
> Sounds reasonable.

That sounds great. I will continue to update this patch set and get some
independent checking from your scripts, and my own testing. I also need
to add collation tests for all the locales I touch to ensure that the
reordering is just that, and that it doesn't materially change the collation
sequence (if it does it's a bug). This all adds more coverage to the
SUPPORTED set of languages which is a positive thing.
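
One way to picture such a collation test is to sort the same word list with
the old and the new data and require identical results. This is a toy
Python model, not the glibc test harness. The weight tables, the two-level
key (primary weight, then case) and the names collate and same_collation
are all assumptions made for illustration.

```python
def collate(words, weights):
    """Sort words by all primary weights first, then by all case weights."""
    def key(word):
        return ([weights[c][0] for c in word], [weights[c][1] for c in word])
    return sorted(words, key=key)

def same_collation(words, old_weights, new_weights):
    """True if both weight tables sort the words identically."""
    return collate(words, old_weights) == collate(words, new_weights)

words = ["b", "A", "a", "B", "ab", "Ab"]

# Old table: entries listed interleaved (a, A, b, B).
old = {"a": (1, 0), "A": (1, 1), "b": (2, 0), "B": (2, 1)}
# Reordered table: same weights, entries listed lowercase first.
reordered = {"a": (1, 0), "b": (2, 0), "A": (1, 1), "B": (2, 1)}
# Broken table: the reordering accidentally changed a weight.
broken = dict(old, A=(3, 1))

print(same_collation(words, old, reordered))  # prints True
print(same_collation(words, old, broken))     # prints False
```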

-- 
Cheers,
Carlos.

* Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-30 18:26                                 ` Carlos O'Donell
@ 2018-07-30 18:34                                   ` Florian Weimer
  0 siblings, 0 replies; 42+ messages in thread
From: Florian Weimer @ 2018-07-30 18:34 UTC (permalink / raw)
  To: Carlos O'Donell, libc-alpha

On 07/30/2018 08:25 PM, Carlos O'Donell wrote:
> As far as I can tell the regression has existed since the first import
> for en_US which copied LC_COLLATE from en_DK (showing en_DK):
> ~~~
> f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000  967) <A> <A>;<NONE>;<CAPITAL>;IGNORE
> f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000  968) <a> <A>;<NONE>;<SMALL>;IGNORE
> ...
> f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1546) <Z> <Z>;<NONE>;<CAPITAL>;IGNORE
> f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1547) <z> <Z>;<NONE>;<SMALL>;IGNORE
> f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1548) <Z'>        <Z>;<ACUTE>;<CAPITAL>;IGNORE
> ~~~
> Is this what you mean by "longer in the making?"

Yes, that's what I meant.  I didn't check whether it went back to 2.17, 
2.12, or even earlier.

Thanks,
Florian

* Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
  2018-07-30 17:40                           ` Florian Weimer
  2018-07-30 17:45                             ` Carlos O'Donell
@ 2018-07-31  2:18                             ` Carlos O'Donell
  1 sibling, 0 replies; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-31  2:18 UTC (permalink / raw)
  To: Florian Weimer, libc-alpha

On 07/30/2018 01:39 PM, Florian Weimer wrote:
> It fails installation for me:

I'm so sorry to waste your time like this.

I apparently failed to test sl_SI.

> $ make localedata/install-locales DESTDIR=/tmp/locales
> sl_SI.UTF-8...locales/sl_SI:1230: order for `U00000061' already defined at locales/sl_SI:998
> locales/sl_SI:1231: [error] symbol `S0062' not defined
> locales/sl_SI:1231: [error] symbol `BASE' not defined

... this is a cascading set of errors.

> (gdb) print elem->weights[cnt].w[i]->mborder[cnt]
> Cannot access memory at address 0x0
> (gdb) print elem->weights[cnt].w[i]->mborder
> $3 = (int *) 0x0
> (gdb)
> 
> Any idea what is going on?

The parser should have stopped at the first error IMO; going any further
just results in problems. It's very hard to roll back the state of the parser
and data structures if there is an error in the source files. It should just
have stopped at the duplicate U0061 definition.
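
The stop-at-first-error behaviour can be sketched as follows. This is a
hypothetical Python model, not localedef's parser: the input format (symbol
as the first field of each line), the function parse_order and the
exception DuplicateOrderError are invented for illustration.

```python
class DuplicateOrderError(Exception):
    """Raised when a symbol's order is defined a second time."""

def parse_order(lines):
    """Map each symbol to its defining line number; fail on the first duplicate."""
    defined = {}
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        symbol = fields[0]
        if symbol in defined:
            # Abort immediately instead of continuing with inconsistent state.
            raise DuplicateOrderError(
                f"line {lineno}: order for {symbol!r} already defined "
                f"at line {defined[symbol]}")
        defined[symbol] = lineno
    return defined

try:
    parse_order(["<U0061> first", "<U0062> second", "<U0061> again"])
except DuplicateOrderError as err:
    print(err)
```

Raising at the duplicate means no later line is ever processed against
half-built data, so there is nothing to roll back.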

I'm testing a v6 with the sl_SI fixes, and a new test case.

-- 
Cheers,
Carlos.

end of thread, other threads:[~2018-07-31  2:18 UTC | newest]

Thread overview: 42+ messages
2018-07-19 19:43 [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393) Carlos O'Donell
2018-07-19 20:39 ` Florian Weimer
2018-07-20 18:49   ` Carlos O'Donell
2018-07-20 19:02     ` Rich Felker
2018-07-20 19:19     ` Florian Weimer
2018-07-20 21:56       ` Carlos O'Donell
2018-07-23 15:11         ` Florian Weimer
2018-07-23 18:09           ` Rational Ranges - Rafal and Mike's opinion? " Carlos O'Donell
2018-07-24 20:45             ` Rafal Luzynski
2018-07-24 20:53               ` Carlos O'Donell
2018-07-24 20:59               ` Carlos O'Donell
2018-07-25 15:44             ` Mike FABIAN
2018-07-25 15:54           ` [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 " Carlos O'Donell
2018-07-25 20:19             ` Florian Weimer
2018-07-25 20:25               ` Carlos O'Donell
2018-07-25 20:31                 ` Florian Weimer
2018-07-25 20:57                   ` [PATCHv4] " Carlos O'Donell
2018-07-26  2:34                     ` [PATCHv4a] " Carlos O'Donell
2018-07-26 14:51                       ` Florian Weimer
2018-07-26 14:59                         ` Carlos O'Donell
2018-07-28  1:12                         ` [WIPv5] " Carlos O'Donell
2018-07-30 17:40                           ` Florian Weimer
2018-07-30 17:45                             ` Carlos O'Donell
2018-07-30 17:54                               ` Florian Weimer
2018-07-30 18:26                                 ` Carlos O'Donell
2018-07-30 18:34                                   ` Florian Weimer
2018-07-31  2:18                             ` Carlos O'Donell
2018-07-25 21:06                 ` [PATCHv3] " Rafal Luzynski
2018-07-25 21:12                   ` Carlos O'Donell
2018-07-25 21:35 ` [PATCH] Keep expected behaviour for [a-z] and [A-z] " Carlos O'Donell
2018-07-25 22:50   ` Florian Weimer
2018-07-26  1:20     ` Carlos O'Donell
2018-07-26  8:09       ` Andreas Schwab
2018-07-26  9:16         ` Florian Weimer
2018-07-26  1:33 ` Jonathan Nieder
2018-07-26  1:49   ` Carlos O'Donell
2018-07-26  2:16     ` Jonathan Nieder
2018-07-26  3:48       ` Carlos O'Donell
2018-07-26  7:42       ` Florian Weimer
2018-07-26  8:18         ` Andreas Schwab
2018-07-26  9:15           ` Florian Weimer
2018-07-26 13:25           ` Carlos O'Donell
