* [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
@ 2018-07-19 19:43 Carlos O'Donell
  2018-07-19 20:39 ` Florian Weimer
                   ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-19 19:43 UTC
To: GNU C Library, Florian Weimer, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 3871 bytes --]

In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of the
collation data to harmonize with the new version of ISO 14651, which is
derived from Unicode 9.0.0.  This collation update brought with it some
changes to locales which some users found undesirable.  In particular,
it altered the meaning of the locale-dependent range expressions [a-z]
and [A-Z], and for en_US it caused uppercase letters to be matched by
[a-z] for the first time.  The matching of uppercase letters by [a-z] is
already familiar to users of other locales that have this property, but
this change could cause significant problems for en_US and other similar
locales that had never behaved this way before.

Whether this behaviour is desirable or not is contentious, and GNU Awk
has this to say on the topic:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

The POSIX standard also has this to say, under "RE Bracket Expression":
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
"The current standard leaves unspecified the behavior of a range
expression outside the POSIX locale. ... As noted above, efforts were
made to resolve the differences, but no solution has been found that
would be specific enough to allow for portable software while not
invalidating existing implementations."

In glibc we implement the requirement of ISO POSIX-2:1993 and use
collation element order (CEO) to construct the range expression; the
internal API is __collseq_table_lookup().
The fact that we use CEO and also have 4-level weights on each collation
rule means that we can in practice reorder the collation rules in
iso14651_t1_common (the new data) to provide consistent range expression
resolution *and* the weights should maintain the expected total order.

Therefore this patch does three things:

* Reorder the collation rules for the LATIN script in
  iso14651_t1_common to deinterlace uppercase and lowercase letters in
  the collation element orders.

* Add new test data en_US.UTF-8.in for sort-test.sh which exercises
  strcoll* and strxfrm* and ensures the ISO 14651 collation remains.

* Add back tests to tst-fnmatch.input and tst-regexloc.c which
  exercise that [a-z] does not match A or Z.

The reordering of the ISO 14651 data is done in an entirely mechanical
fashion using the program attached to the bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c28

It is up for discussion whether the iso14651_t1_common data should be
refined further to have 3 very tight collation element ranges that
include only a-z, A-Z, and 0-9, which would implement the solution
sought after in:
https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c12

No regressions on x86_64.

Verified that removal of the iso14651_t1_common change causes
tst-fnmatch to regress with:
422: fnmatch ("[a-z]", "A", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
...
425: fnmatch ("[A-Z]", "z", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
---
 ChangeLog                             |   11 +
 localedata/Makefile                   |    1 +
 localedata/en_US.UTF-8.in             | 2159 +++++++++++++++++++++++++++++++++
 localedata/locales/iso14651_t1_common | 1928 ++++++++++++++---------------
 posix/tst-fnmatch.input               |  125 +-
 posix/tst-regexloc.c                  |    8 +-
 6 files changed, 3224 insertions(+), 1008 deletions(-)
 create mode 100644 localedata/en_US.UTF-8.in

I'm suggesting this change immediately for 2.28 to avoid further
problems with users' expectations and sorting with [a-z] and [A-Z] until
a clearer consensus can be reached for a final solution.
File attached as .tar.gz to get past spam detectors.  There is a lot of
UTF-8 data in en_US.UTF-8 (every possible character in the LATIN set
that can be sorted with the existing test case infrastructure).

--
Cheers,
Carlos.

[-- Attachment #2: swbz23393.tar.gz --]
[-- Type: application/gzip, Size: 57788 bytes --]
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-19 19:43 [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393) Carlos O'Donell
@ 2018-07-19 20:39 ` Florian Weimer
  2018-07-20 18:49   ` Carlos O'Donell
  2018-07-25 21:35 ` [PATCH] Keep expected behaviour for [a-z] and [A-z] " Carlos O'Donell
  2018-07-26  1:33 ` Jonathan Nieder
  2 siblings, 1 reply; 42+ messages in thread
From: Florian Weimer @ 2018-07-19 20:39 UTC
To: Carlos O'Donell, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
>   exercise that [a-z] does not match A or Z.

[a-z] still matches ñ, ð, but not ð£, which I doubt is useful.

It's an improvement, and it may be good enough for glibc 2.28, but I
would rather see us implement the rational ranges interpretation.

Thanks,
Florian
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-19 20:39 ` Florian Weimer
@ 2018-07-20 18:49   ` Carlos O'Donell
  2018-07-20 19:02     ` Rich Felker
  2018-07-20 19:19     ` Florian Weimer
  0 siblings, 2 replies; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-20 18:49 UTC
To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 2202 bytes --]

On 07/19/2018 04:39 PM, Florian Weimer wrote:
> On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
>> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
>>   exercise that [a-z] does not match A or Z.
>
> [a-z] still matches ñ, ð, but not ð£, which I doubt is useful.

Sorry, I don't follow; it absolutely matches ASCII z.

We deinterlace the collation element ordering (not sequence) to get the
right range expression resolution.

See the added fnmatch tests:

+en_US.UTF-8    "a"    "[a-z]"    0
+en_US.UTF-8    "z"    "[a-z]"    0
+en_US.UTF-8    "A"    "[a-z]"    NOMATCH
+en_US.UTF-8    "Z"    "[a-z]"    NOMATCH
+en_US.UTF-8    "a"    "[A-Z]"    NOMATCH
+en_US.UTF-8    "z"    "[A-Z]"    NOMATCH
+en_US.UTF-8    "A"    "[A-Z]"    0
+en_US.UTF-8    "Z"    "[A-Z]"    0
+en_US.UTF-8    "0"    "[0-9]"    0
+en_US.UTF-8    "9"    "[0-9]"    0

[a-z] matches a-z (including z), *and* all the lowercase in between, and
so behaves like :lower: effectively.

[A-Z] matches A-Z (including Z), *and* all the uppercase in between, and
so behaves like :upper: effectively.

I left in all the matches for the accented characters because it was
the most conservative thing to do for now.

I could be persuaded otherwise, I think; just reading the old history
and seeing the new reports seems to indicate that we should back down to
behaving like C/POSIX in these cases.

> It's an improvement, and it may be good enough for glibc 2.28, but I would
> rather see us implement the rational ranges interpretation.

That requires all ranges to behave rationally?

We could fix a-z, A-Z, and 0-9 easily.

Patch attached.
It has no effect on collation sequence, but it will break scripts that
expect the new-style behaviour, and we knew that, but it certainly
aligns us with the pre-POSIX requirement and the rest of the GNU tools
implementing rational ranges, which is a much better reason.

--
Cheers,
Carlos.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: rational-ranges.diff --]
[-- Type: text/x-patch; name="rational-ranges.diff", Size: 44348 bytes --]

diff --git a/localedata/locales/iso14651_t1_common b/localedata/locales/iso14651_t1_common
index 227400cc4e..7248074a8b 100644
--- a/localedata/locales/iso14651_t1_common
+++ b/localedata/locales/iso14651_t1_common
@@ -63177,7 +63177,19 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U20BC> <S20BC>;<BASE>;<MIN>;<U20BC> % MANAT SIGN
 <U20BD> <S20BD>;<BASE>;<MIN>;<U20BD> % RUBLE SIGN
 <U20BE> <S20BE>;<BASE>;<MIN>;<U20BE> % LARI SIGN
+% Implement rational range for [0-9] in regular expressions.
+% We order the collation element order to support rational ranges.
+% Collation is unaffected because the 4-level weights remain the same.
<U0030> <S0030>;<BASE>;<MIN>;<U0030> % DIGIT ZERO +<U0031> <S0031>;<BASE>;<MIN>;<U0031> % DIGIT ONE +<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO +<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE +<U0034> <S0034>;<BASE>;<MIN>;<U0034> % DIGIT FOUR +<U0035> <S0035>;<BASE>;<MIN>;<U0035> % DIGIT FIVE +<U0036> <S0036>;<BASE>;<MIN>;<U0036> % DIGIT SIX +<U0037> <S0037>;<BASE>;<MIN>;<U0037> % DIGIT SEVEN +<U0038> <S0038>;<BASE>;<MIN>;<U0038> % DIGIT EIGHT +<U0039> <S0039>;<BASE>;<MIN>;<U0039> % DIGIT NINE <U0660> <S0030>;<BASE>;<MIN>;<U0660> % ARABIC-INDIC DIGIT ZERO <U06F0> <S0030>;<BASE>;<MIN>;<U06F0> % EXTENDED ARABIC-INDIC DIGIT ZERO <U07C0> <S0030>;<BASE>;<MIN>;<U07C0> % NKO DIGIT ZERO @@ -63250,7 +63262,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U2080> <S0030>;<BASE>;<MNS>;<U2080> % SUBSCRIPT ZERO <U2189> "<S0030><S0033>";"<BASE><BASE>";"<FRACTION><FRACTION>";<U2189> % VULGAR FRACTION ZERO THIRDS <U3358> "<S0030><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3358> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ZERO -<U0031> <S0031>;<BASE>;<MIN>;<U0031> % DIGIT ONE <U0661> <S0031>;<BASE>;<MIN>;<U0661> % ARABIC-INDIC DIGIT ONE <U06F1> <S0031>;<BASE>;<MIN>;<U06F1> % EXTENDED ARABIC-INDIC DIGIT ONE <U07C1> <S0031>;<BASE>;<MIN>;<U07C1> % NKO DIGIT ONE @@ -63440,7 +63451,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U33E0> "<S0031><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E0> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY ONE <U32C0> "<S0031><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C0> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JANUARY <U3359> "<S0031><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3359> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ONE -<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO <U0662> <S0032>;<BASE>;<MIN>;<U0662> % ARABIC-INDIC DIGIT TWO <U06F2> <S0032>;<BASE>;<MIN>;<U06F2> % EXTENDED ARABIC-INDIC DIGIT TWO <U07C2> <S0032>;<BASE>;<MIN>;<U07C2> % NKO DIGIT TWO @@ -63583,7 +63593,6 @@ 
order_start <SPECIAL>;forward;backward;forward;forward,position <U33E1> "<S0032><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E1> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY TWO <U32C1> "<S0032><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C1> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR FEBRUARY <U335A> "<S0032><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335A> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR TWO -<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE <U0663> <S0033>;<BASE>;<MIN>;<U0663> % ARABIC-INDIC DIGIT THREE <U06F3> <S0033>;<BASE>;<MIN>;<U06F3> % EXTENDED ARABIC-INDIC DIGIT THREE <U07C3> <S0033>;<BASE>;<MIN>;<U07C3> % NKO DIGIT THREE @@ -63709,7 +63718,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U33E2> "<S0033><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E2> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY THREE <U32C2> "<S0033><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C2> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR MARCH <U335B> "<S0033><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335B> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR THREE -<U0034> <S0034>;<BASE>;<MIN>;<U0034> % DIGIT FOUR <U0664> <S0034>;<BASE>;<MIN>;<U0664> % ARABIC-INDIC DIGIT FOUR <U06F4> <S0034>;<BASE>;<MIN>;<U06F4> % EXTENDED ARABIC-INDIC DIGIT FOUR <U07C4> <S0034>;<BASE>;<MIN>;<U07C4> % NKO DIGIT FOUR @@ -63829,7 +63837,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U33E3> "<S0034><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E3> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY FOUR <U32C3> "<S0034><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C3> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR APRIL <U335C> "<S0034><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335C> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR FOUR -<U0035> <S0035>;<BASE>;<MIN>;<U0035> % DIGIT FIVE <U0665> <S0035>;<BASE>;<MIN>;<U0665> % ARABIC-INDIC DIGIT FIVE <U06F5> <S0035>;<BASE>;<MIN>;<U06F5> % EXTENDED ARABIC-INDIC DIGIT FIVE <U07C5> 
<S0035>;<BASE>;<MIN>;<U07C5> % NKO DIGIT FIVE @@ -63941,7 +63948,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U33E4> "<S0035><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E4> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY FIVE <U32C4> "<S0035><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C4> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR MAY <U335D> "<S0035><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335D> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR FIVE -<U0036> <S0036>;<BASE>;<MIN>;<U0036> % DIGIT SIX <U0666> <S0036>;<BASE>;<MIN>;<U0666> % ARABIC-INDIC DIGIT SIX <U06F6> <S0036>;<BASE>;<MIN>;<U06F6> % EXTENDED ARABIC-INDIC DIGIT SIX <U07C6> <S0036>;<BASE>;<MIN>;<U07C6> % NKO DIGIT SIX @@ -64036,7 +64042,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U33E5> "<S0036><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E5> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY SIX <U32C5> "<S0036><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C5> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JUNE <U335E> "<S0036><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335E> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR SIX -<U0037> <S0037>;<BASE>;<MIN>;<U0037> % DIGIT SEVEN <U0667> <S0037>;<BASE>;<MIN>;<U0667> % ARABIC-INDIC DIGIT SEVEN <U06F7> <S0037>;<BASE>;<MIN>;<U06F7> % EXTENDED ARABIC-INDIC DIGIT SEVEN <U07C7> <S0037>;<BASE>;<MIN>;<U07C7> % NKO DIGIT SEVEN @@ -64132,7 +64137,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U33E6> "<S0037><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E6> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY SEVEN <U32C6> "<S0037><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C6> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JULY <U335F> "<S0037><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335F> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR SEVEN -<U0038> <S0038>;<BASE>;<MIN>;<U0038> % DIGIT EIGHT <U0668> <S0038>;<BASE>;<MIN>;<U0668> % ARABIC-INDIC DIGIT EIGHT <U06F8> <S0038>;<BASE>;<MIN>;<U06F8> % 
EXTENDED ARABIC-INDIC DIGIT EIGHT <U07C8> <S0038>;<BASE>;<MIN>;<U07C8> % NKO DIGIT EIGHT @@ -64226,7 +64230,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U33E7> "<S0038><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E7> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY EIGHT <U32C7> "<S0038><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C7> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR AUGUST <U3360> "<S0038><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3360> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR EIGHT -<U0039> <S0039>;<BASE>;<MIN>;<U0039> % DIGIT NINE <U0669> <S0039>;<BASE>;<MIN>;<U0669> % ARABIC-INDIC DIGIT NINE <U06F9> <S0039>;<BASE>;<MIN>;<U06F9> % EXTENDED ARABIC-INDIC DIGIT NINE <U07C9> <S0039>;<BASE>;<MIN>;<U07C9> % NKO DIGIT NINE @@ -64326,7 +64329,35 @@ order_start <LATIN>;forward;backward;forward;forward,position else order_start <LATIN>;forward;forward;forward;forward,position endif +% Implement rational range for [a-z] in regular expressions. +% We order the collation element order to support rational ranges. +% Collation is unaffected because the 4-level weights remain the same. 
<U0061> <S0061>;<BASE>;<MIN>;<U0061> % LATIN SMALL LETTER A +<U0062> <S0062>;<BASE>;<MIN>;<U0062> % LATIN SMALL LETTER B +<U0063> <S0063>;<BASE>;<MIN>;<U0063> % LATIN SMALL LETTER C +<U0064> <S0064>;<BASE>;<MIN>;<U0064> % LATIN SMALL LETTER D +<U0065> <S0065>;<BASE>;<MIN>;<U0065> % LATIN SMALL LETTER E +<U0066> <S0066>;<BASE>;<MIN>;<U0066> % LATIN SMALL LETTER F +<U0067> <S0067>;<BASE>;<MIN>;<U0067> % LATIN SMALL LETTER G +<U0068> <S0068>;<BASE>;<MIN>;<U0068> % LATIN SMALL LETTER H +<U0069> <S0069>;<BASE>;<MIN>;<U0069> % LATIN SMALL LETTER I +<U006A> <S006A>;<BASE>;<MIN>;<U006A> % LATIN SMALL LETTER J +<U006B> <S006B>;<BASE>;<MIN>;<U006B> % LATIN SMALL LETTER K +<U006C> <S006C>;<BASE>;<MIN>;<U006C> % LATIN SMALL LETTER L +<U006D> <S006D>;<BASE>;<MIN>;<U006D> % LATIN SMALL LETTER M +<U006E> <S006E>;<BASE>;<MIN>;<U006E> % LATIN SMALL LETTER N +<U006F> <S006F>;<BASE>;<MIN>;<U006F> % LATIN SMALL LETTER O +<U0070> <S0070>;<BASE>;<MIN>;<U0070> % LATIN SMALL LETTER P +<U0071> <S0071>;<BASE>;<MIN>;<U0071> % LATIN SMALL LETTER Q +<U0072> <S0072>;<BASE>;<MIN>;<U0072> % LATIN SMALL LETTER R +<U0073> <S0073>;<BASE>;<MIN>;<U0073> % LATIN SMALL LETTER S +<U0074> <S0074>;<BASE>;<MIN>;<U0074> % LATIN SMALL LETTER T +<U0075> <S0075>;<BASE>;<MIN>;<U0075> % LATIN SMALL LETTER U +<U0076> <S0076>;<BASE>;<MIN>;<U0076> % LATIN SMALL LETTER V +<U0077> <S0077>;<BASE>;<MIN>;<U0077> % LATIN SMALL LETTER W +<U0078> <S0078>;<BASE>;<MIN>;<U0078> % LATIN SMALL LETTER X +<U0079> <S0079>;<BASE>;<MIN>;<U0079> % LATIN SMALL LETTER Y +<U007A> <S007A>;<BASE>;<MIN>;<U007A> % LATIN SMALL LETTER Z <UFF41> <S0061>;<BASE>;<WIDE>;<UFF41> % FULLWIDTH LATIN SMALL LETTER A <U0363> <S0061>;<BASE>;<COMPAT>;<U0363> % COMBINING LATIN SMALL LETTER A <U249C> <S0061>;<BASE>;<COMPAT>;<U249C> % PARENTHESIZED LATIN SMALL LETTER A @@ -64418,7 +64449,6 @@ endif <U0252> <S0252>;<BASE>;<MIN>;<U0252> % LATIN SMALL LETTER TURNED ALPHA <U1D9B> <S0252>;<BASE>;<MNN>;<U1D9B> % MODIFIER LETTER SMALL TURNED ALPHA <UAB64> 
<SAB64>;<BASE>;<MIN>;<UAB64> % LATIN SMALL LETTER INVERTED ALPHA -<U0062> <S0062>;<BASE>;<MIN>;<U0062> % LATIN SMALL LETTER B <UFF42> <S0062>;<BASE>;<WIDE>;<UFF42> % FULLWIDTH LATIN SMALL LETTER B <U1DE8> <S0062>;<BASE>;<COMPAT>;<U1DE8> % COMBINING LATIN SMALL LETTER B <U249D> <S0062>;<BASE>;<COMPAT>;<U249D> % PARENTHESIZED LATIN SMALL LETTER B @@ -64454,7 +64484,6 @@ endif <U0183> <S0183>;<BASE>;<MIN>;<U0183> % LATIN SMALL LETTER B WITH TOPBAR <UA7B5> <SA7B5>;<BASE>;<MIN>;<UA7B5> % LATIN SMALL LETTER BETA <U1DE9> <SA7B5>;<BASE>;<COMPAT>;<U1DE9> % COMBINING LATIN SMALL LETTER BETA -<U0063> <S0063>;<BASE>;<MIN>;<U0063> % LATIN SMALL LETTER C <UFF43> <S0063>;<BASE>;<WIDE>;<UFF43> % FULLWIDTH LATIN SMALL LETTER C <U0368> <S0063>;<BASE>;<COMPAT>;<U0368> % COMBINING LATIN SMALL LETTER C <U217D> <S0063>;<BASE>;<COMPAT>;<U217D> % SMALL ROMAN NUMERAL ONE HUNDRED @@ -64504,7 +64533,6 @@ endif <U1D9D> <S0255>;<BASE>;<MNN>;<U1D9D> % MODIFIER LETTER SMALL C WITH CURL <U2184> <S2184>;<BASE>;<MIN>;<U2184> % LATIN SMALL LETTER REVERSED C <UA73F> <SA73F>;<BASE>;<MIN>;<UA73F> % LATIN SMALL LETTER REVERSED C WITH DOT -<U0064> <S0064>;<BASE>;<MIN>;<U0064> % LATIN SMALL LETTER D <UFF44> <S0064>;<BASE>;<WIDE>;<UFF44> % FULLWIDTH LATIN SMALL LETTER D <U0369> <S0064>;<BASE>;<COMPAT>;<U0369> % COMBINING LATIN SMALL LETTER D <U217E> <S0064>;<BASE>;<COMPAT>;<U217E> % SMALL ROMAN NUMERAL FIVE HUNDRED @@ -64563,7 +64591,6 @@ endif <U0221> <S0221>;<BASE>;<MIN>;<U0221> % LATIN SMALL LETTER D WITH CURL <UA771> <SA771>;<BASE>;<MIN>;<UA771> % LATIN SMALL LETTER DUM <U1E9F> <S1E9F>;<BASE>;<MIN>;<U1E9F> % LATIN SMALL LETTER DELTA -<U0065> <S0065>;<BASE>;<MIN>;<U0065> % LATIN SMALL LETTER E <UFF45> <S0065>;<BASE>;<WIDE>;<UFF45> % FULLWIDTH LATIN SMALL LETTER E <U0364> <S0065>;<BASE>;<COMPAT>;<U0364> % COMBINING LATIN SMALL LETTER E <U24A0> <S0065>;<BASE>;<COMPAT>;<U24A0> % PARENTHESIZED LATIN SMALL LETTER E @@ -64641,7 +64668,6 @@ endif <U025E> <S025E>;<BASE>;<MIN>;<U025E> % LATIN SMALL LETTER CLOSED 
REVERSED OPEN E <U029A> <S029A>;<BASE>;<MIN>;<U029A> % LATIN SMALL LETTER CLOSED OPEN E <U0264> <S0264>;<BASE>;<MIN>;<U0264> % LATIN SMALL LETTER RAMS HORN -<U0066> <S0066>;<BASE>;<MIN>;<U0066> % LATIN SMALL LETTER F <UFF46> <S0066>;<BASE>;<WIDE>;<UFF46> % FULLWIDTH LATIN SMALL LETTER F <U1DEB> <S0066>;<BASE>;<COMPAT>;<U1DEB> % COMBINING LATIN SMALL LETTER F <U24A1> <S0066>;<BASE>;<COMPAT>;<U24A1> % PARENTHESIZED LATIN SMALL LETTER F @@ -64680,7 +64706,6 @@ endif <U0192> <S0192>;<BASE>;<MIN>;<U0192> % LATIN SMALL LETTER F WITH HOOK <U214E> <S214E>;<BASE>;<MIN>;<U214E> % TURNED SMALL F <UA7FB> <SA7FB>;<BASE>;<MIN>;<UA7FB> % LATIN EPIGRAPHIC LETTER REVERSED F -<U0067> <S0067>;<BASE>;<MIN>;<U0067> % LATIN SMALL LETTER G <UFF47> <S0067>;<BASE>;<WIDE>;<UFF47> % FULLWIDTH LATIN SMALL LETTER G <U1DDA> <S0067>;<BASE>;<COMPAT>;<U1DDA> % COMBINING LATIN SMALL LETTER G <U24A2> <S0067>;<BASE>;<COMPAT>;<U24A2> % PARENTHESIZED LATIN SMALL LETTER G @@ -64727,7 +64752,6 @@ endif <U0263> <S0263>;<BASE>;<MIN>;<U0263> % LATIN SMALL LETTER GAMMA <U02E0> <S0263>;<BASE>;<MNN>;<U02E0> % MODIFIER LETTER SMALL GAMMA <U01A3> <S01A3>;<BASE>;<MIN>;<U01A3> % LATIN SMALL LETTER OI -<U0068> <S0068>;<BASE>;<MIN>;<U0068> % LATIN SMALL LETTER H <UFF48> <S0068>;<BASE>;<WIDE>;<UFF48> % FULLWIDTH LATIN SMALL LETTER H <U036A> <S0068>;<BASE>;<COMPAT>;<U036A> % COMBINING LATIN SMALL LETTER H <U24A3> <S0068>;<BASE>;<COMPAT>;<U24A3> % PARENTHESIZED LATIN SMALL LETTER H @@ -64780,7 +64804,6 @@ endif <U0267> <S0267>;<BASE>;<MIN>;<U0267> % LATIN SMALL LETTER HENG WITH HOOK <U02BB> <S02BB>;<BASE>;<MIN>;<U02BB> % MODIFIER LETTER TURNED COMMA <U02BD> <S02BD>;<BASE>;<MIN>;<U02BD> % MODIFIER LETTER REVERSED COMMA -<U0069> <S0069>;<BASE>;<MIN>;<U0069> % LATIN SMALL LETTER I <UFF49> <S0069>;<BASE>;<WIDE>;<UFF49> % FULLWIDTH LATIN SMALL LETTER I <U0365> <S0069>;<BASE>;<COMPAT>;<U0365> % COMBINING LATIN SMALL LETTER I <U2170> <S0069>;<BASE>;<COMPAT>;<U2170> % SMALL ROMAN NUMERAL ONE @@ -64844,7 +64867,6 @@ endif 
<U0269> <S0269>;<BASE>;<MIN>;<U0269> % LATIN SMALL LETTER IOTA <U1DA5> <S0269>;<BASE>;<MNN>;<U1DA5> % MODIFIER LETTER SMALL IOTA <U1D7C> <S1D7C>;<BASE>;<MIN>;<U1D7C> % LATIN SMALL LETTER IOTA WITH STROKE -<U006A> <S006A>;<BASE>;<MIN>;<U006A> % LATIN SMALL LETTER J <UFF4A> <S006A>;<BASE>;<WIDE>;<UFF4A> % FULLWIDTH LATIN SMALL LETTER J <U24A5> <S006A>;<BASE>;<COMPAT>;<U24A5> % PARENTHESIZED LATIN SMALL LETTER J <U2149> <S006A>;<BASE>;<FONT>;<U2149> % DOUBLE-STRUCK ITALIC SMALL J @@ -64876,7 +64898,6 @@ endif <U025F> <S025F>;<BASE>;<MIN>;<U025F> % LATIN SMALL LETTER DOTLESS J WITH STROKE <U1DA1> <S025F>;<BASE>;<MNN>;<U1DA1> % MODIFIER LETTER SMALL DOTLESS J WITH STROKE <U0284> <S0284>;<BASE>;<MIN>;<U0284> % LATIN SMALL LETTER DOTLESS J WITH STROKE AND HOOK -<U006B> <S006B>;<BASE>;<MIN>;<U006B> % LATIN SMALL LETTER K <UFF4B> <S006B>;<BASE>;<WIDE>;<UFF4B> % FULLWIDTH LATIN SMALL LETTER K <U1DDC> <S006B>;<BASE>;<COMPAT>;<U1DDC> % COMBINING LATIN SMALL LETTER K <U24A6> <S006B>;<BASE>;<COMPAT>;<U24A6> % PARENTHESIZED LATIN SMALL LETTER K @@ -64926,7 +64947,6 @@ endif <UA743> <SA743>;<BASE>;<MIN>;<UA743> % LATIN SMALL LETTER K WITH DIAGONAL STROKE <UA745> <SA745>;<BASE>;<MIN>;<UA745> % LATIN SMALL LETTER K WITH STROKE AND DIAGONAL STROKE <U029E> <S029E>;<BASE>;<MIN>;<U029E> % LATIN SMALL LETTER TURNED K -<U006C> <S006C>;<BASE>;<MIN>;<U006C> % LATIN SMALL LETTER L <UFF4C> <S006C>;<BASE>;<WIDE>;<UFF4C> % FULLWIDTH LATIN SMALL LETTER L <U1DDD> <S006C>;<BASE>;<COMPAT>;<U1DDD> % COMBINING LATIN SMALL LETTER L <U217C> <S006C>;<BASE>;<COMPAT>;<U217C> % SMALL ROMAN NUMERAL FIFTY @@ -64996,7 +65016,6 @@ endif <UA781> <SA781>;<BASE>;<MIN>;<UA781> % LATIN SMALL LETTER TURNED L <U019B> <S019B>;<BASE>;<MIN>;<U019B> % LATIN SMALL LETTER LAMBDA WITH STROKE <U028E> <S028E>;<BASE>;<MIN>;<U028E> % LATIN SMALL LETTER TURNED Y -<U006D> <S006D>;<BASE>;<MIN>;<U006D> % LATIN SMALL LETTER M <UFF4D> <S006D>;<BASE>;<WIDE>;<UFF4D> % FULLWIDTH LATIN SMALL LETTER M <U036B> 
<S006D>;<BASE>;<COMPAT>;<U036B> % COMBINING LATIN SMALL LETTER M <U217F> <S006D>;<BASE>;<COMPAT>;<U217F> % SMALL ROMAN NUMERAL ONE THOUSAND @@ -65055,7 +65074,6 @@ endif <UA7FD> <SA7FD>;<BASE>;<MIN>;<UA7FD> % LATIN EPIGRAPHIC LETTER INVERTED M <UA7FF> <SA7FF>;<BASE>;<MIN>;<UA7FF> % LATIN EPIGRAPHIC LETTER ARCHAIC M <UA773> <SA773>;<BASE>;<MIN>;<UA773> % LATIN SMALL LETTER MUM -<U006E> <S006E>;<BASE>;<MIN>;<U006E> % LATIN SMALL LETTER N <UFF4E> <S006E>;<BASE>;<WIDE>;<UFF4E> % FULLWIDTH LATIN SMALL LETTER N <U1DE0> <S006E>;<BASE>;<COMPAT>;<U1DE0> % COMBINING LATIN SMALL LETTER N <U24A9> <S006E>;<BASE>;<COMPAT>;<U24A9> % PARENTHESIZED LATIN SMALL LETTER N @@ -65114,7 +65132,6 @@ endif <U014B> <S014B>;<BASE>;<MIN>;<U014B> % LATIN SMALL LETTER ENG <U1D51> <S014B>;<BASE>;<MNN>;<U1D51> % MODIFIER LETTER SMALL ENG <UAB3C> <SAB3C>;<BASE>;<MIN>;<UAB3C> % LATIN SMALL LETTER ENG WITH CROSSED-TAIL -<U006F> <S006F>;<BASE>;<MIN>;<U006F> % LATIN SMALL LETTER O <UFF4F> <S006F>;<BASE>;<WIDE>;<UFF4F> % FULLWIDTH LATIN SMALL LETTER O <U0366> <S006F>;<BASE>;<COMPAT>;<U0366> % COMBINING LATIN SMALL LETTER O <U24AA> <S006F>;<BASE>;<COMPAT>;<U24AA> % PARENTHESIZED LATIN SMALL LETTER O @@ -65213,7 +65230,6 @@ endif <U0223> <S0223>;<BASE>;<MIN>;<U0223> % LATIN SMALL LETTER OU <U1D3D> <S0223>;<BASE>;<MISCCAP>;<U1D3D> % MODIFIER LETTER CAPITAL OU <U1D15> <S1D15>;<BASE>;<MIN>;<U1D15> % LATIN LETTER SMALL CAPITAL OU -<U0070> <S0070>;<BASE>;<MIN>;<U0070> % LATIN SMALL LETTER P <UFF50> <S0070>;<BASE>;<WIDE>;<UFF50> % FULLWIDTH LATIN SMALL LETTER P <U1DEE> <S0070>;<BASE>;<COMPAT>;<U1DEE> % COMBINING LATIN SMALL LETTER P <U24AB> <S0070>;<BASE>;<COMPAT>;<U24AB> % PARENTHESIZED LATIN SMALL LETTER P @@ -65262,7 +65278,6 @@ endif <U0278> <S0278>;<BASE>;<MIN>;<U0278> % LATIN SMALL LETTER PHI <U1DB2> <S0278>;<BASE>;<MNN>;<U1DB2> % MODIFIER LETTER SMALL PHI <U2C77> <S2C77>;<BASE>;<MIN>;<U2C77> % LATIN SMALL LETTER TAILLESS PHI -<U0071> <S0071>;<BASE>;<MIN>;<U0071> % LATIN SMALL LETTER Q <UFF51> 
<S0071>;<BASE>;<WIDE>;<UFF51> % FULLWIDTH LATIN SMALL LETTER Q <U24AC> <S0071>;<BASE>;<COMPAT>;<U24AC> % PARENTHESIZED LATIN SMALL LETTER Q <U0001D42A> <S0071>;<BASE>;<FONT>;<U0001D42A> % MATHEMATICAL BOLD SMALL Q @@ -65285,7 +65300,6 @@ endif <U02A0> <S02A0>;<BASE>;<MIN>;<U02A0> % LATIN SMALL LETTER Q WITH HOOK <U024B> <S024B>;<BASE>;<MIN>;<U024B> % LATIN SMALL LETTER Q WITH HOOK TAIL <U0138> <S0138>;<BASE>;<MIN>;<U0138> % LATIN SMALL LETTER KRA -<U0072> <S0072>;<BASE>;<MIN>;<U0072> % LATIN SMALL LETTER R <UFF52> <S0072>;<BASE>;<WIDE>;<UFF52> % FULLWIDTH LATIN SMALL LETTER R <U036C> <S0072>;<BASE>;<COMPAT>;<U036C> % COMBINING LATIN SMALL LETTER R <U1DCA> <S0072>;<BASE>;<COMPAT>;<U1DCA> % COMBINING LATIN SMALL LETTER R BELOW @@ -65354,7 +65368,6 @@ endif <UA775> <SA775>;<BASE>;<MIN>;<UA775> % LATIN SMALL LETTER RUM <UA776> <SA776>;<BASE>;<MIN>;<UA776> % LATIN LETTER SMALL CAPITAL RUM <UA75D> <SA75D>;<BASE>;<MIN>;<UA75D> % LATIN SMALL LETTER RUM ROTUNDA -<U0073> <S0073>;<BASE>;<MIN>;<U0073> % LATIN SMALL LETTER S <UFF53> <S0073>;<BASE>;<WIDE>;<UFF53> % FULLWIDTH LATIN SMALL LETTER S <U1DE4> <S0073>;<BASE>;<COMPAT>;<U1DE4> % COMBINING LATIN SMALL LETTER S <U24AE> <S0073>;<BASE>;<COMPAT>;<U24AE> % PARENTHESIZED LATIN SMALL LETTER S @@ -65417,7 +65430,6 @@ endif <U0285> <S0285>;<BASE>;<MIN>;<U0285> % LATIN SMALL LETTER SQUAT REVERSED ESH <U1D98> <S1D98>;<BASE>;<MIN>;<U1D98> % LATIN SMALL LETTER ESH WITH RETROFLEX HOOK <U0286> <S0286>;<BASE>;<MIN>;<U0286> % LATIN SMALL LETTER ESH WITH CURL -<U0074> <S0074>;<BASE>;<MIN>;<U0074> % LATIN SMALL LETTER T <UFF54> <S0074>;<BASE>;<WIDE>;<UFF54> % FULLWIDTH LATIN SMALL LETTER T <U036D> <S0074>;<BASE>;<COMPAT>;<U036D> % COMBINING LATIN SMALL LETTER T <U24AF> <S0074>;<BASE>;<COMPAT>;<U24AF> % PARENTHESIZED LATIN SMALL LETTER T @@ -65467,7 +65479,6 @@ endif <U0236> <S0236>;<BASE>;<MIN>;<U0236> % LATIN SMALL LETTER T WITH CURL <UA777> <SA777>;<BASE>;<MIN>;<UA777> % LATIN SMALL LETTER TUM <U0287> <S0287>;<BASE>;<MIN>;<U0287> % LATIN 
SMALL LETTER TURNED T -<U0075> <S0075>;<BASE>;<MIN>;<U0075> % LATIN SMALL LETTER U <UFF55> <S0075>;<BASE>;<WIDE>;<UFF55> % FULLWIDTH LATIN SMALL LETTER U <U0367> <S0075>;<BASE>;<COMPAT>;<U0367> % COMBINING LATIN SMALL LETTER U <U24B0> <S0075>;<BASE>;<COMPAT>;<U24B0> % PARENTHESIZED LATIN SMALL LETTER U @@ -65552,7 +65563,6 @@ endif <U028A> <S028A>;<BASE>;<MIN>;<U028A> % LATIN SMALL LETTER UPSILON <U1DB7> <S028A>;<BASE>;<MNN>;<U1DB7> % MODIFIER LETTER SMALL UPSILON <U1D7F> <S1D7F>;<BASE>;<MIN>;<U1D7F> % LATIN SMALL LETTER UPSILON WITH STROKE -<U0076> <S0076>;<BASE>;<MIN>;<U0076> % LATIN SMALL LETTER V <UFF56> <S0076>;<BASE>;<WIDE>;<UFF56> % FULLWIDTH LATIN SMALL LETTER V <U036E> <S0076>;<BASE>;<COMPAT>;<U036E> % COMBINING LATIN SMALL LETTER V <U2174> <S0076>;<BASE>;<COMPAT>;<U2174> % SMALL ROMAN NUMERAL FIVE @@ -65593,7 +65603,6 @@ endif <U1EFD> <S1EFD>;<BASE>;<MIN>;<U1EFD> % LATIN SMALL LETTER MIDDLE-WELSH V <U028C> <S028C>;<BASE>;<MIN>;<U028C> % LATIN SMALL LETTER TURNED V <U1DBA> <S028C>;<BASE>;<MNN>;<U1DBA> % MODIFIER LETTER SMALL TURNED V -<U0077> <S0077>;<BASE>;<MIN>;<U0077> % LATIN SMALL LETTER W <UFF57> <S0077>;<BASE>;<WIDE>;<UFF57> % FULLWIDTH LATIN SMALL LETTER W <U1DF1> <S0077>;<BASE>;<COMPAT>;<U1DF1> % COMBINING LATIN SMALL LETTER W <U24B2> <S0077>;<BASE>;<COMPAT>;<U24B2> % PARENTHESIZED LATIN SMALL LETTER W @@ -65627,7 +65636,6 @@ endif <U1D21> <S1D21>;<BASE>;<MIN>;<U1D21> % LATIN LETTER SMALL CAPITAL W <U2C73> <S2C73>;<BASE>;<MIN>;<U2C73> % LATIN SMALL LETTER W WITH HOOK <U028D> <S028D>;<BASE>;<MIN>;<U028D> % LATIN SMALL LETTER TURNED W -<U0078> <S0078>;<BASE>;<MIN>;<U0078> % LATIN SMALL LETTER X <UFF58> <S0078>;<BASE>;<WIDE>;<UFF58> % FULLWIDTH LATIN SMALL LETTER X <U036F> <S0078>;<BASE>;<COMPAT>;<U036F> % COMBINING LATIN SMALL LETTER X <U2179> <S0078>;<BASE>;<COMPAT>;<U2179> % SMALL ROMAN NUMERAL TEN @@ -65660,7 +65668,6 @@ endif <UAB53> <SAB53>;<BASE>;<MIN>;<UAB53> % LATIN SMALL LETTER CHI <UAB54> <SAB54>;<BASE>;<MIN>;<UAB54> % LATIN SMALL LETTER 
CHI WITH LOW RIGHT RING <UAB55> <SAB55>;<BASE>;<MIN>;<UAB55> % LATIN SMALL LETTER CHI WITH LOW LEFT SERIF -<U0079> <S0079>;<BASE>;<MIN>;<U0079> % LATIN SMALL LETTER Y <UFF59> <S0079>;<BASE>;<WIDE>;<UFF59> % FULLWIDTH LATIN SMALL LETTER Y <U24B4> <S0079>;<BASE>;<COMPAT>;<U24B4> % PARENTHESIZED LATIN SMALL LETTER Y <U0001D432> <S0079>;<BASE>;<FONT>;<U0001D432> % MATHEMATICAL BOLD SMALL Y @@ -65694,7 +65701,6 @@ endif <U1EFF> <S1EFF>;<BASE>;<MIN>;<U1EFF> % LATIN SMALL LETTER Y WITH LOOP <UAB5A> <SAB5A>;<BASE>;<MIN>;<UAB5A> % LATIN SMALL LETTER Y WITH SHORT RIGHT LEG <U021D> <S021D>;<BASE>;<MIN>;<U021D> % LATIN SMALL LETTER YOGH -<U007A> <S007A>;<BASE>;<MIN>;<U007A> % LATIN SMALL LETTER Z <UFF5A> <S007A>;<BASE>;<WIDE>;<UFF5A> % FULLWIDTH LATIN SMALL LETTER Z <U1DE6> <S007A>;<BASE>;<COMPAT>;<U1DE6> % COMBINING LATIN SMALL LETTER Z <U24B5> <S007A>;<BASE>;<COMPAT>;<U24B5> % PARENTHESIZED LATIN SMALL LETTER Z @@ -65796,7 +65802,35 @@ endif <U0001D736> <S03B1>;<BASE>;<FONT>;<U0001D736> % MATHEMATICAL BOLD ITALIC SMALL ALPHA <U0001D770> <S03B1>;<BASE>;<FONT>;<U0001D770> % MATHEMATICAL SANS-SERIF BOLD SMALL ALPHA <U0001D7AA> <S03B1>;<BASE>;<FONT>;<U0001D7AA> % MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL ALPHA +% Implement rational range for [A-Z] in regular expressions. +% We order the collation element order to support rational ranges. +% Collation is unaffected because the 4-level weights remain the same. 
<U0041> <S0061>;<BASE>;<CAP>;<U0041> % LATIN CAPITAL LETTER A +<U0042> <S0062>;<BASE>;<CAP>;<U0042> % LATIN CAPITAL LETTER B +<U0043> <S0063>;<BASE>;<CAP>;<U0043> % LATIN CAPITAL LETTER C +<U0044> <S0064>;<BASE>;<CAP>;<U0044> % LATIN CAPITAL LETTER D +<U0045> <S0065>;<BASE>;<CAP>;<U0045> % LATIN CAPITAL LETTER E +<U0046> <S0066>;<BASE>;<CAP>;<U0046> % LATIN CAPITAL LETTER F +<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G +<U0048> <S0068>;<BASE>;<CAP>;<U0048> % LATIN CAPITAL LETTER H +<U0049> <S0069>;<BASE>;<CAP>;<U0049> % LATIN CAPITAL LETTER I +<U004A> <S006A>;<BASE>;<CAP>;<U004A> % LATIN CAPITAL LETTER J +<U004B> <S006B>;<BASE>;<CAP>;<U004B> % LATIN CAPITAL LETTER K +<U004C> <S006C>;<BASE>;<CAP>;<U004C> % LATIN CAPITAL LETTER L +<U004D> <S006D>;<BASE>;<CAP>;<U004D> % LATIN CAPITAL LETTER M +<U004E> <S006E>;<BASE>;<CAP>;<U004E> % LATIN CAPITAL LETTER N +<U004F> <S006F>;<BASE>;<CAP>;<U004F> % LATIN CAPITAL LETTER O +<U0050> <S0070>;<BASE>;<CAP>;<U0050> % LATIN CAPITAL LETTER P +<U0051> <S0071>;<BASE>;<CAP>;<U0051> % LATIN CAPITAL LETTER Q +<U0052> <S0072>;<BASE>;<CAP>;<U0052> % LATIN CAPITAL LETTER R +<U0053> <S0073>;<BASE>;<CAP>;<U0053> % LATIN CAPITAL LETTER S +<U0054> <S0074>;<BASE>;<CAP>;<U0054> % LATIN CAPITAL LETTER T +<U0055> <S0075>;<BASE>;<CAP>;<U0055> % LATIN CAPITAL LETTER U +<U0056> <S0076>;<BASE>;<CAP>;<U0056> % LATIN CAPITAL LETTER V +<U0057> <S0077>;<BASE>;<CAP>;<U0057> % LATIN CAPITAL LETTER W +<U0058> <S0078>;<BASE>;<CAP>;<U0058> % LATIN CAPITAL LETTER X +<U0059> <S0079>;<BASE>;<CAP>;<U0059> % LATIN CAPITAL LETTER Y +<U005A> <S007A>;<BASE>;<CAP>;<U005A> % LATIN CAPITAL LETTER Z <UFF21> <S0061>;<BASE>;<WIDECAP>;<UFF21> % FULLWIDTH LATIN CAPITAL LETTER A <U0001F110> <S0061>;<BASE>;<COMPATCAP>;<U0001F110> % PARENTHESIZED LATIN CAPITAL LETTER A <U0001D400> <S0061>;<BASE>;<FONTCAP>;<U0001D400> % MATHEMATICAL BOLD CAPITAL A @@ -65860,7 +65894,6 @@ endif <U2C6F> <S0250>;<BASE>;<CAP>;<U2C6F> % LATIN CAPITAL LETTER TURNED A <U2C6D> 
<S0251>;<BASE>;<CAP>;<U2C6D> % LATIN CAPITAL LETTER ALPHA <U2C70> <S0252>;<BASE>;<CAP>;<U2C70> % LATIN CAPITAL LETTER TURNED ALPHA -<U0042> <S0062>;<BASE>;<CAP>;<U0042> % LATIN CAPITAL LETTER B <UFF22> <S0062>;<BASE>;<WIDECAP>;<UFF22> % FULLWIDTH LATIN CAPITAL LETTER B <U0001F111> <S0062>;<BASE>;<COMPATCAP>;<U0001F111> % PARENTHESIZED LATIN CAPITAL LETTER B <U212C> <S0062>;<BASE>;<FONTCAP>;<U212C> % SCRIPT CAPITAL B @@ -65888,7 +65921,6 @@ endif <U0181> <S0253>;<BASE>;<CAP>;<U0181> % LATIN CAPITAL LETTER B WITH HOOK <U0182> <S0183>;<BASE>;<CAP>;<U0182> % LATIN CAPITAL LETTER B WITH TOPBAR <UA7B4> <SA7B5>;<BASE>;<CAP>;<UA7B4> % LATIN CAPITAL LETTER BETA -<U0043> <S0063>;<BASE>;<CAP>;<U0043> % LATIN CAPITAL LETTER C <UFF23> <S0063>;<BASE>;<WIDECAP>;<UFF23> % FULLWIDTH LATIN CAPITAL LETTER C <U216D> <S0063>;<BASE>;<COMPATCAP>;<U216D> % ROMAN NUMERAL ONE HUNDRED <U0001F112> <S0063>;<BASE>;<COMPATCAP>;<U0001F112> % PARENTHESIZED LATIN CAPITAL LETTER C @@ -65921,7 +65953,6 @@ endif <U0187> <S0188>;<BASE>;<CAP>;<U0187> % LATIN CAPITAL LETTER C WITH HOOK <U2183> <S2184>;<BASE>;<CAP>;<U2183> % ROMAN NUMERAL REVERSED ONE HUNDRED <UA73E> <SA73F>;<BASE>;<CAP>;<UA73E> % LATIN CAPITAL LETTER REVERSED C WITH DOT -<U0044> <S0064>;<BASE>;<CAP>;<U0044> % LATIN CAPITAL LETTER D <UFF24> <S0064>;<BASE>;<WIDECAP>;<UFF24> % FULLWIDTH LATIN CAPITAL LETTER D <U216E> <S0064>;<BASE>;<COMPATCAP>;<U216E> % ROMAN NUMERAL FIVE HUNDRED <U0001F113> <S0064>;<BASE>;<COMPATCAP>;<U0001F113> % PARENTHESIZED LATIN CAPITAL LETTER D @@ -65959,7 +65990,6 @@ endif <U0189> <S0256>;<BASE>;<CAP>;<U0189> % LATIN CAPITAL LETTER AFRICAN D <U018A> <S0257>;<BASE>;<CAP>;<U018A> % LATIN CAPITAL LETTER D WITH HOOK <U018B> <S018C>;<BASE>;<CAP>;<U018B> % LATIN CAPITAL LETTER D WITH TOPBAR -<U0045> <S0065>;<BASE>;<CAP>;<U0045> % LATIN CAPITAL LETTER E <UFF25> <S0065>;<BASE>;<WIDECAP>;<UFF25> % FULLWIDTH LATIN CAPITAL LETTER E <U0001F114> <S0065>;<BASE>;<COMPATCAP>;<U0001F114> % PARENTHESIZED LATIN CAPITAL LETTER E 
<U2130> <S0065>;<BASE>;<FONTCAP>;<U2130> % SCRIPT CAPITAL E @@ -66010,7 +66040,6 @@ endif <U0190> <S025B>;<BASE>;<CAP>;<U0190> % LATIN CAPITAL LETTER OPEN E <U2107> <S025B>;<BASE>;<COMPATCAP>;<U2107> % EULER CONSTANT <UA7AB> <S025C>;<BASE>;<CAP>;<UA7AB> % LATIN CAPITAL LETTER REVERSED OPEN E -<U0046> <S0066>;<BASE>;<CAP>;<U0046> % LATIN CAPITAL LETTER F <UFF26> <S0066>;<BASE>;<WIDECAP>;<UFF26> % FULLWIDTH LATIN CAPITAL LETTER F <U0001F115> <S0066>;<BASE>;<COMPATCAP>;<U0001F115> % PARENTHESIZED LATIN CAPITAL LETTER F <U2131> <S0066>;<BASE>;<FONTCAP>;<U2131> % SCRIPT CAPITAL F @@ -66035,7 +66064,6 @@ endif <UA798> <SA799>;<BASE>;<CAP>;<UA798> % LATIN CAPITAL LETTER F WITH STROKE <U0191> <S0192>;<BASE>;<CAP>;<U0191> % LATIN CAPITAL LETTER F WITH HOOK <U2132> <S214E>;<BASE>;<CAP>;<U2132> % TURNED CAPITAL F -<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G <UFF27> <S0067>;<BASE>;<WIDECAP>;<UFF27> % FULLWIDTH LATIN CAPITAL LETTER G <U0001F116> <S0067>;<BASE>;<COMPATCAP>;<U0001F116> % PARENTHESIZED LATIN CAPITAL LETTER G <U0001D406> <S0067>;<BASE>;<FONTCAP>;<U0001D406> % MATHEMATICAL BOLD CAPITAL G @@ -66071,7 +66099,6 @@ endif <UA77E> <SA77F>;<BASE>;<CAP>;<UA77E> % LATIN CAPITAL LETTER TURNED INSULAR G <U0194> <S0263>;<BASE>;<CAP>;<U0194> % LATIN CAPITAL LETTER GAMMA <U01A2> <S01A3>;<BASE>;<CAP>;<U01A2> % LATIN CAPITAL LETTER OI -<U0048> <S0068>;<BASE>;<CAP>;<U0048> % LATIN CAPITAL LETTER H <UFF28> <S0068>;<BASE>;<WIDECAP>;<UFF28> % FULLWIDTH LATIN CAPITAL LETTER H <U0001F117> <S0068>;<BASE>;<COMPATCAP>;<U0001F117> % PARENTHESIZED LATIN CAPITAL LETTER H <U210B> <S0068>;<BASE>;<FONTCAP>;<U210B> % SCRIPT CAPITAL H @@ -66104,7 +66131,6 @@ endif <U2C67> <S2C68>;<BASE>;<CAP>;<U2C67> % LATIN CAPITAL LETTER H WITH DESCENDER <U2C75> <S2C76>;<BASE>;<CAP>;<U2C75> % LATIN CAPITAL LETTER HALF H <UA726> <SA727>;<BASE>;<CAP>;<UA726> % LATIN CAPITAL LETTER HENG -<U0049> <S0069>;<BASE>;<CAP>;<U0049> % LATIN CAPITAL LETTER I <UFF29> <S0069>;<BASE>;<WIDECAP>;<UFF29> % 
FULLWIDTH LATIN CAPITAL LETTER I <U2160> <S0069>;<BASE>;<COMPATCAP>;<U2160> % ROMAN NUMERAL ONE <U0001F118> <S0069>;<BASE>;<COMPATCAP>;<U0001F118> % PARENTHESIZED LATIN CAPITAL LETTER I @@ -66149,7 +66175,6 @@ endif <UA7AE> <S026A>;<BASE>;<CAP>;<UA7AE> % LATIN CAPITAL LETTER SMALL CAPITAL I <U0197> <S0268>;<BASE>;<CAP>;<U0197> % LATIN CAPITAL LETTER I WITH STROKE <U0196> <S0269>;<BASE>;<CAP>;<U0196> % LATIN CAPITAL LETTER IOTA -<U004A> <S006A>;<BASE>;<CAP>;<U004A> % LATIN CAPITAL LETTER J <UFF2A> <S006A>;<BASE>;<WIDECAP>;<UFF2A> % FULLWIDTH LATIN CAPITAL LETTER J <U0001F119> <S006A>;<BASE>;<COMPATCAP>;<U0001F119> % PARENTHESIZED LATIN CAPITAL LETTER J <U0001D409> <S006A>;<BASE>;<FONTCAP>;<U0001D409> % MATHEMATICAL BOLD CAPITAL J @@ -66172,7 +66197,6 @@ endif <U0134> <S006A>;"<BASE><CIRCF>";"<CAP><MIN>";<U0134> % LATIN CAPITAL LETTER J WITH CIRCUMFLEX <U0248> <S0249>;<BASE>;<CAP>;<U0248> % LATIN CAPITAL LETTER J WITH STROKE <UA7B2> <S029D>;<BASE>;<CAP>;<UA7B2> % LATIN CAPITAL LETTER J WITH CROSSED-TAIL -<U004B> <S006B>;<BASE>;<CAP>;<U004B> % LATIN CAPITAL LETTER K <U212A> <S006B>;<BASE>;<CAP>;<U212A> % KELVIN SIGN <UFF2B> <S006B>;<BASE>;<WIDECAP>;<UFF2B> % FULLWIDTH LATIN CAPITAL LETTER K <U0001F11A> <S006B>;<BASE>;<COMPATCAP>;<U0001F11A> % PARENTHESIZED LATIN CAPITAL LETTER K @@ -66206,7 +66230,6 @@ endif <UA742> <SA743>;<BASE>;<CAP>;<UA742> % LATIN CAPITAL LETTER K WITH DIAGONAL STROKE <UA744> <SA745>;<BASE>;<CAP>;<UA744> % LATIN CAPITAL LETTER K WITH STROKE AND DIAGONAL STROKE <UA7B0> <S029E>;<BASE>;<CAP>;<UA7B0> % LATIN CAPITAL LETTER TURNED K -<U004C> <S006C>;<BASE>;<CAP>;<U004C> % LATIN CAPITAL LETTER L <UFF2C> <S006C>;<BASE>;<WIDECAP>;<UFF2C> % FULLWIDTH LATIN CAPITAL LETTER L <U216C> <S006C>;<BASE>;<COMPATCAP>;<U216C> % ROMAN NUMERAL FIFTY <U0001F11B> <S006C>;<BASE>;<COMPATCAP>;<U0001F11B> % PARENTHESIZED LATIN CAPITAL LETTER L @@ -66249,7 +66272,6 @@ endif <U2C62> <S026B>;<BASE>;<CAP>;<U2C62> % LATIN CAPITAL LETTER L WITH MIDDLE TILDE <UA7AD> 
<S026C>;<BASE>;<CAP>;<UA7AD> % LATIN CAPITAL LETTER L WITH BELT <UA780> <SA781>;<BASE>;<CAP>;<UA780> % LATIN CAPITAL LETTER TURNED L -<U004D> <S006D>;<BASE>;<CAP>;<U004D> % LATIN CAPITAL LETTER M <UFF2D> <S006D>;<BASE>;<WIDECAP>;<UFF2D> % FULLWIDTH LATIN CAPITAL LETTER M <U216F> <S006D>;<BASE>;<COMPATCAP>;<U216F> % ROMAN NUMERAL ONE THOUSAND <U0001F11C> <S006D>;<BASE>;<COMPATCAP>;<U0001F11C> % PARENTHESIZED LATIN CAPITAL LETTER M @@ -66275,7 +66297,6 @@ endif <U1E42> <S006D>;"<BASE><POINS>";"<CAP><MIN>";<U1E42> % LATIN CAPITAL LETTER M WITH DOT BELOW <U1DDF> <S1D0D>;<BASE>;<COMPAT>;<U1DDF> % COMBINING LATIN LETTER SMALL CAPITAL M <U2C6E> <S0271>;<BASE>;<CAP>;<U2C6E> % LATIN CAPITAL LETTER M WITH HOOK -<U004E> <S006E>;<BASE>;<CAP>;<U004E> % LATIN CAPITAL LETTER N <UFF2E> <S006E>;<BASE>;<WIDECAP>;<UFF2E> % FULLWIDTH LATIN CAPITAL LETTER N <U0001F11D> <S006E>;<BASE>;<COMPATCAP>;<U0001F11D> % PARENTHESIZED LATIN CAPITAL LETTER N <U2115> <S006E>;<BASE>;<FONTCAP>;<U2115> % DOUBLE-STRUCK CAPITAL N @@ -66312,7 +66333,6 @@ endif <U0220> <S019E>;<BASE>;<CAP>;<U0220> % LATIN CAPITAL LETTER N WITH LONG RIGHT LEG <UA790> <SA791>;<BASE>;<CAP>;<UA790> % LATIN CAPITAL LETTER N WITH DESCENDER <U014A> <S014B>;<BASE>;<CAP>;<U014A> % LATIN CAPITAL LETTER ENG -<U004F> <S006F>;<BASE>;<CAP>;<U004F> % LATIN CAPITAL LETTER O <UFF2F> <S006F>;<BASE>;<WIDECAP>;<UFF2F> % FULLWIDTH LATIN CAPITAL LETTER O <U0001F11E> <S006F>;<BASE>;<COMPATCAP>;<U0001F11E> % PARENTHESIZED LATIN CAPITAL LETTER O <U0001D40E> <S006F>;<BASE>;<FONTCAP>;<U0001D40E> % MATHEMATICAL BOLD CAPITAL O @@ -66377,7 +66397,6 @@ endif <UA74A> <SA74B>;<BASE>;<CAP>;<UA74A> % LATIN CAPITAL LETTER O WITH LONG STROKE OVERLAY <UA7B6> <SA7B7>;<BASE>;<CAP>;<UA7B6> % LATIN CAPITAL LETTER OMEGA <U0222> <S0223>;<BASE>;<CAP>;<U0222> % LATIN CAPITAL LETTER OU -<U0050> <S0070>;<BASE>;<CAP>;<U0050> % LATIN CAPITAL LETTER P <UFF30> <S0070>;<BASE>;<WIDECAP>;<UFF30> % FULLWIDTH LATIN CAPITAL LETTER P <U0001F11F> 
<S0070>;<BASE>;<COMPATCAP>;<U0001F11F> % PARENTHESIZED LATIN CAPITAL LETTER P <U2119> <S0070>;<BASE>;<FONTCAP>;<U2119> % DOUBLE-STRUCK CAPITAL P @@ -66405,7 +66424,6 @@ endif <U01A4> <S01A5>;<BASE>;<CAP>;<U01A4> % LATIN CAPITAL LETTER P WITH HOOK <UA752> <SA753>;<BASE>;<CAP>;<UA752> % LATIN CAPITAL LETTER P WITH FLOURISH <UA754> <SA755>;<BASE>;<CAP>;<UA754> % LATIN CAPITAL LETTER P WITH SQUIRREL TAIL -<U0051> <S0071>;<BASE>;<CAP>;<U0051> % LATIN CAPITAL LETTER Q <UFF31> <S0071>;<BASE>;<WIDECAP>;<UFF31> % FULLWIDTH LATIN CAPITAL LETTER Q <U0001F120> <S0071>;<BASE>;<COMPATCAP>;<U0001F120> % PARENTHESIZED LATIN CAPITAL LETTER Q <U211A> <S0071>;<BASE>;<FONTCAP>;<U211A> % DOUBLE-STRUCK CAPITAL Q @@ -66428,7 +66446,6 @@ endif <UA756> <SA757>;<BASE>;<CAP>;<UA756> % LATIN CAPITAL LETTER Q WITH STROKE THROUGH DESCENDER <UA758> <SA759>;<BASE>;<CAP>;<UA758> % LATIN CAPITAL LETTER Q WITH DIAGONAL STROKE <U024A> <S024B>;<BASE>;<CAP>;<U024A> % LATIN CAPITAL LETTER SMALL Q WITH HOOK TAIL -<U0052> <S0072>;<BASE>;<CAP>;<U0052> % LATIN CAPITAL LETTER R <UFF32> <S0072>;<BASE>;<WIDECAP>;<UFF32> % FULLWIDTH LATIN CAPITAL LETTER R <U0001F121> <S0072>;<BASE>;<COMPATCAP>;<U0001F121> % PARENTHESIZED LATIN CAPITAL LETTER R <U211B> <S0072>;<BASE>;<FONTCAP>;<U211B> % SCRIPT CAPITAL R @@ -66466,7 +66483,6 @@ endif <U024C> <S024D>;<BASE>;<CAP>;<U024C> % LATIN CAPITAL LETTER R WITH STROKE <U2C64> <S027D>;<BASE>;<CAP>;<U2C64> % LATIN CAPITAL LETTER R WITH TAIL <UA75C> <SA75D>;<BASE>;<CAP>;<UA75C> % LATIN CAPITAL LETTER RUM ROTUNDA -<U0053> <S0073>;<BASE>;<CAP>;<U0053> % LATIN CAPITAL LETTER S <UFF33> <S0073>;<BASE>;<WIDECAP>;<UFF33> % FULLWIDTH LATIN CAPITAL LETTER S <U0001F122> <S0073>;<BASE>;<COMPATCAP>;<U0001F122> % PARENTHESIZED LATIN CAPITAL LETTER S <U0001F12A> <S0073>;<BASE>;<COMPATCAP>;<U0001F12A> % TORTOISE SHELL BRACKETED LATIN CAPITAL LETTER S @@ -66502,7 +66518,6 @@ endif <U1E9E> "<S0073><S0073>";"<BASE><VRNT1><BASE>";"<COMPATCAP><COMPAT><COMPATCAP>";<U1E9E> % LATIN CAPITAL LETTER 
SHARP S <U2C7E> <S023F>;<BASE>;<CAP>;<U2C7E> % LATIN CAPITAL LETTER S WITH SWASH TAIL <U01A9> <S0283>;<BASE>;<CAP>;<U01A9> % LATIN CAPITAL LETTER ESH -<U0054> <S0074>;<BASE>;<CAP>;<U0054> % LATIN CAPITAL LETTER T <UFF34> <S0074>;<BASE>;<WIDECAP>;<UFF34> % FULLWIDTH LATIN CAPITAL LETTER T <U0001F123> <S0074>;<BASE>;<COMPATCAP>;<U0001F123> % PARENTHESIZED LATIN CAPITAL LETTER T <U0001D413> <S0074>;<BASE>;<FONTCAP>;<U0001D413> % MATHEMATICAL BOLD CAPITAL T @@ -66536,7 +66551,6 @@ endif <U01AC> <S01AD>;<BASE>;<CAP>;<U01AC> % LATIN CAPITAL LETTER T WITH HOOK <U01AE> <S0288>;<BASE>;<CAP>;<U01AE> % LATIN CAPITAL LETTER T WITH RETROFLEX HOOK <UA7B1> <S0287>;<BASE>;<CAP>;<UA7B1> % LATIN CAPITAL LETTER TURNED T -<U0055> <S0075>;<BASE>;<CAP>;<U0055> % LATIN CAPITAL LETTER U <UFF35> <S0075>;<BASE>;<WIDECAP>;<UFF35> % FULLWIDTH LATIN CAPITAL LETTER U <U0001F124> <S0075>;<BASE>;<COMPATCAP>;<U0001F124> % PARENTHESIZED LATIN CAPITAL LETTER U <U0001D414> <S0075>;<BASE>;<FONTCAP>;<U0001D414> % MATHEMATICAL BOLD CAPITAL U @@ -66591,7 +66605,6 @@ endif <UA78D> <S0265>;<BASE>;<CAP>;<UA78D> % LATIN CAPITAL LETTER TURNED H <U019C> <S026F>;<BASE>;<CAP>;<U019C> % LATIN CAPITAL LETTER TURNED M <U01B1> <S028A>;<BASE>;<CAP>;<U01B1> % LATIN CAPITAL LETTER UPSILON -<U0056> <S0076>;<BASE>;<CAP>;<U0056> % LATIN CAPITAL LETTER V <UFF36> <S0076>;<BASE>;<WIDECAP>;<UFF36> % FULLWIDTH LATIN CAPITAL LETTER V <U2164> <S0076>;<BASE>;<COMPATCAP>;<U2164> % ROMAN NUMERAL FIVE <U0001F125> <S0076>;<BASE>;<COMPATCAP>;<U0001F125> % PARENTHESIZED LATIN CAPITAL LETTER V @@ -66622,7 +66635,6 @@ endif <U01B2> <S028B>;<BASE>;<CAP>;<U01B2> % LATIN CAPITAL LETTER V WITH HOOK <U1EFC> <S1EFD>;<BASE>;<CAP>;<U1EFC> % LATIN CAPITAL LETTER MIDDLE-WELSH V <U0245> <S028C>;<BASE>;<CAP>;<U0245> % LATIN CAPITAL LETTER TURNED V -<U0057> <S0077>;<BASE>;<CAP>;<U0057> % LATIN CAPITAL LETTER W <UFF37> <S0077>;<BASE>;<WIDECAP>;<UFF37> % FULLWIDTH LATIN CAPITAL LETTER W <U0001F126> <S0077>;<BASE>;<COMPATCAP>;<U0001F126> % PARENTHESIZED 
LATIN CAPITAL LETTER W <U0001D416> <S0077>;<BASE>;<FONTCAP>;<U0001D416> % MATHEMATICAL BOLD CAPITAL W @@ -66649,7 +66661,6 @@ endif <U1E86> <S0077>;"<BASE><POINT>";"<CAP><MIN>";<U1E86> % LATIN CAPITAL LETTER W WITH DOT ABOVE <U1E88> <S0077>;"<BASE><POINS>";"<CAP><MIN>";<U1E88> % LATIN CAPITAL LETTER W WITH DOT BELOW <U2C72> <S2C73>;<BASE>;<CAP>;<U2C72> % LATIN CAPITAL LETTER W WITH HOOK -<U0058> <S0078>;<BASE>;<CAP>;<U0058> % LATIN CAPITAL LETTER X <UFF38> <S0078>;<BASE>;<WIDECAP>;<UFF38> % FULLWIDTH LATIN CAPITAL LETTER X <U2169> <S0078>;<BASE>;<COMPATCAP>;<U2169> % ROMAN NUMERAL TEN <U0001F127> <S0078>;<BASE>;<COMPATCAP>;<U0001F127> % PARENTHESIZED LATIN CAPITAL LETTER X @@ -66675,7 +66686,6 @@ endif <U216A> "<S0078><S0069>";"<BASE><BASE>";"<COMPATCAP><COMPATCAP>";<U216A> % ROMAN NUMERAL ELEVEN <U216B> "<S0078><S0069><S0069>";"<BASE><BASE><BASE>";"<COMPATCAP><COMPATCAP><COMPATCAP>";<U216B> % ROMAN NUMERAL TWELVE <UA7B3> <SAB53>;<BASE>;<CAP>;<UA7B3> % LATIN CAPITAL LETTER CHI -<U0059> <S0079>;<BASE>;<CAP>;<U0059> % LATIN CAPITAL LETTER Y <UFF39> <S0079>;<BASE>;<WIDECAP>;<UFF39> % FULLWIDTH LATIN CAPITAL LETTER Y <U0001F128> <S0079>;<BASE>;<COMPATCAP>;<U0001F128> % PARENTHESIZED LATIN CAPITAL LETTER Y <U0001D418> <S0079>;<BASE>;<FONTCAP>;<U0001D418> % MATHEMATICAL BOLD CAPITAL Y @@ -66708,7 +66718,6 @@ endif <U01B3> <S01B4>;<BASE>;<CAP>;<U01B3> % LATIN CAPITAL LETTER Y WITH HOOK <U1EFE> <S1EFF>;<BASE>;<CAP>;<U1EFE> % LATIN CAPITAL LETTER Y WITH LOOP <U021C> <S021D>;<BASE>;<CAP>;<U021C> % LATIN CAPITAL LETTER YOGH -<U005A> <S007A>;<BASE>;<CAP>;<U005A> % LATIN CAPITAL LETTER Z <UFF3A> <S007A>;<BASE>;<WIDECAP>;<UFF3A> % FULLWIDTH LATIN CAPITAL LETTER Z <U0001F129> <S007A>;<BASE>;<COMPATCAP>;<U0001F129> % PARENTHESIZED LATIN CAPITAL LETTER Z <U2124> <S007A>;<BASE>;<FONTCAP>;<U2124> % DOUBLE-STRUCK CAPITAL Z diff --git a/posix/tst-fnmatch.input b/posix/tst-fnmatch.input index dc2ca8d01a..0b3c78fd1c 100644 --- a/posix/tst-fnmatch.input +++ b/posix/tst-fnmatch.input @@ 
-67,9 +67,11 @@
 # https://sourceware.org/bugzilla/show_bug.cgi?id=23393
 # https://sourceware.org/bugzilla/show_bug.cgi?id=23420
 #
-# No consensus exists on how best to handle the changes so the
-# iso14651_t1_common collation element order (CEO) has been changed to
-# deinterlace the a-z and A-Z regions.
+# The solution was to implement rational ranges by moving the collation
+# element order to fix this for [a-z], [A-Z], and [0-9]. Likewise the
+# upper and lower case letters are deinterlaced to allow for accented
+# ranges that don't include uppercase e.g. [a-ñ] should not include
+# any uppercase letters but may include a-z and more.
 #
 # With the deinterlacing commit ac3a3b4b0d561d776b60317d6a926050c8541655
 # could be reverted to re-test the correct non-interleaved expectations.
@@ -77,9 +79,7 @@
 # Please note that despite the region being deinterlaced, the ordering
 # of collation remains the same. In glibc we implement CEO and because of
 # that we can reorder the elements to reorder ranges without impacting
-# collation which depends on weights. The collation element ordering
-# could have been changed to include just a-z, A-Z, and 0-9 in three
-# distinct blocks, but this needs more discussion by the community.
+# collation which depends on weights.

 # B.6 004(C)
 C "!#%+,-./01234567889" "!#%+,-./01234567889" 0
@@ -477,9 +477,9 @@ C "-" "[Z-\\]]" NOMATCH
 # handling of ranges and the recognition of character (vs bytes).
 de_DE.ISO-8859-1 "a" "[a-z]" 0
 de_DE.ISO-8859-1 "z" "[a-z]" 0
-de_DE.ISO-8859-1 "ä" "[a-z]" 0
-de_DE.ISO-8859-1 "ö" "[a-z]" 0
-de_DE.ISO-8859-1 "ü" "[a-z]" 0
+de_DE.ISO-8859-1 "ä" "[a-z]" NOMATCH
+de_DE.ISO-8859-1 "ö" "[a-z]" NOMATCH
+de_DE.ISO-8859-1 "ü" "[a-z]" NOMATCH
 de_DE.ISO-8859-1 "A" "[a-z]" NOMATCH
 de_DE.ISO-8859-1 "Z" "[a-z]" NOMATCH
 de_DE.ISO-8859-1 "Ä" "[a-z]" NOMATCH
@@ -492,9 +492,9 @@ de_DE.ISO-8859-1 "
 de_DE.ISO-8859-1 "ü" "[A-Z]" NOMATCH
 de_DE.ISO-8859-1 "A" "[A-Z]" 0
 de_DE.ISO-8859-1 "Z" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ä" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ö" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ü" "[A-Z]" 0
+de_DE.ISO-8859-1 "Ä" "[A-Z]" NOMATCH
+de_DE.ISO-8859-1 "Ö" "[A-Z]" NOMATCH
+de_DE.ISO-8859-1 "Ü" "[A-Z]" NOMATCH
 de_DE.ISO-8859-1 "a" "[[:lower:]]" 0
 de_DE.ISO-8859-1 "z" "[[:lower:]]" 0
 de_DE.ISO-8859-1 "ä" "[[:lower:]]" 0
@@ -568,20 +568,34 @@ de_DE.ISO-8859-1 "ba" "[[.a.]]a" NOMATCH

 # And with a multibyte character set.
 en_US.UTF-8 "a" "[a-z]" 0
+# Test that <U00F1> LATIN SMALL LETTER N WITH TILDE is not in [a-z].
+en_US.UTF-8 "ñ" "[a-z]" NOMATCH
 en_US.UTF-8 "z" "[a-z]" 0
 en_US.UTF-8 "A" "[a-z]" NOMATCH
+# Test that <U00D1> LATIN CAPITAL LETTER N WITH TILDE is not in [a-z].
+en_US.UTF-8 "Ñ" "[a-z]" NOMATCH
 en_US.UTF-8 "Z" "[a-z]" NOMATCH
 en_US.UTF-8 "a" "[A-Z]" NOMATCH
+# Test that <U00F1> LATIN SMALL LETTER N WITH TILDE is not in [A-Z].
+en_US.UTF-8 "ñ" "[A-Z]" NOMATCH
 en_US.UTF-8 "z" "[A-Z]" NOMATCH
 en_US.UTF-8 "A" "[A-Z]" 0
+# Test that <U00D1> LATIN CAPITAL LETTER N WITH TILDE is not in [A-Z].
+en_US.UTF-8 "Ñ" "[A-Z]" NOMATCH
 en_US.UTF-8 "Z" "[A-Z]" 0
 en_US.UTF-8 "0" "[0-9]" 0
+# Test that <UFF10> FULLWIDTH DIGIT ZERO is not in [0-9].
+en_US.UTF-8 "０" "[0-9]" NOMATCH
+# Test that <U00BD> VULGAR FRACTION ONE HALF is not in [0-9].
+en_US.UTF-8 "½" "[0-9]" NOMATCH
 en_US.UTF-8 "9" "[0-9]" 0
+# Test that <UFF19> FULLWIDTH DIGIT NINE is not in [0-9].
+en_US.UTF-8 "９" "[0-9]" NOMATCH
 de_DE.UTF-8 "a" "[a-z]" 0
 de_DE.UTF-8 "z" "[a-z]" 0
-de_DE.UTF-8 "ä" "[a-z]" 0
-de_DE.UTF-8 "ö" "[a-z]" 0
-de_DE.UTF-8 "ü" "[a-z]" 0
+de_DE.UTF-8 "ä" "[a-z]" NOMATCH
+de_DE.UTF-8 "ö" "[a-z]" NOMATCH
+de_DE.UTF-8 "ü" "[a-z]" NOMATCH
 de_DE.UTF-8 "A" "[a-z]" NOMATCH
 de_DE.UTF-8 "Z" "[a-z]" NOMATCH
 de_DE.UTF-8 "Ä" "[a-z]" NOMATCH
@@ -594,9 +608,9 @@ de_DE.UTF-8 "ö" "[A-Z]" NOMATCH
 de_DE.UTF-8 "ü" "[A-Z]" NOMATCH
 de_DE.UTF-8 "A" "[A-Z]" 0
 de_DE.UTF-8 "Z" "[A-Z]" 0
-de_DE.UTF-8 "Ä" "[A-Z]" 0
-de_DE.UTF-8 "Ö" "[A-Z]" 0
-de_DE.UTF-8 "Ü" "[A-Z]" 0
+de_DE.UTF-8 "Ä" "[A-Z]" NOMATCH
+de_DE.UTF-8 "Ö" "[A-Z]" NOMATCH
+de_DE.UTF-8 "Ü" "[A-Z]" NOMATCH
 de_DE.UTF-8 "a" "[[:lower:]]" 0
 de_DE.UTF-8 "z" "[[:lower:]]" 0
 de_DE.UTF-8 "ä" "[[:lower:]]" 0

^ permalink raw reply	[flat|nested] 42+ messages in thread
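The effect that deinterlacing has on range resolution can be illustrated with a toy model. This is only a sketch under stated assumptions: the two order tables and the helper names `ceo_index` and `in_range` are hypothetical, the tables cover just a, b, z and their capitals, and glibc's real lookup goes through its collation sequence tables rather than a string search.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy collation element orders (hypothetical, not glibc data).
   Each string lists its elements from first to last. */
const char interlaced[]   = "aAbBzZ";  /* per letter: small, then capital */
const char deinterlaced[] = "abzABZ";  /* all small letters, then all capitals */

/* Position of c in the order, or -1 if c is not listed. */
int ceo_index(const char *order, char c)
{
    const char *p = (c != '\0') ? strchr(order, c) : NULL;
    return p ? (int)(p - order) : -1;
}

/* Nonzero if c lies within [lo-hi] when the range is resolved by
   position in the collation element order. */
int in_range(const char *order, char lo, char hi, char c)
{
    int l = ceo_index(order, lo);
    int h = ceo_index(order, hi);
    int i = ceo_index(order, c);
    assert(l >= 0 && h >= 0);  /* range endpoints must be in the toy table */
    return i >= l && i <= h;
}
```

With the interlaced table, `in_range(interlaced, 'a', 'z', 'A')` is true, which mirrors the `fnmatch ("[a-z]", "A", 0) = 0` regression quoted in the patch description; with the deinterlaced table the same query is false.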
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-20 18:49   ` Carlos O'Donell
@ 2018-07-20 19:02     ` Rich Felker
  2018-07-20 19:19     ` Florian Weimer
  1 sibling, 0 replies; 42+ messages in thread
From: Rich Felker @ 2018-07-20 19:02 UTC (permalink / raw)
  To: Carlos O'Donell
  Cc: Florian Weimer, GNU C Library, Mike Fabian, Zorro Lang, Joseph S. Myers

On Fri, Jul 20, 2018 at 02:49:07PM -0400, Carlos O'Donell wrote:
> On 07/19/2018 04:39 PM, Florian Weimer wrote:
> > On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
> >> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
> >> exercise that [a-z] does not match A or Z.
> >
> > [a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.
>
> Sorry, I don't follow, it absolutely matches ASCII z.

That's not an ASCII z. It's some plane-1 mathematical z. :-)

Rich

^ permalink raw reply	[flat|nested] 42+ messages in thread
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-20 18:49   ` Carlos O'Donell
  2018-07-20 19:02     ` Rich Felker
@ 2018-07-20 19:19     ` Florian Weimer
  2018-07-20 21:56       ` Carlos O'Donell
  1 sibling, 1 reply; 42+ messages in thread
From: Florian Weimer @ 2018-07-20 19:19 UTC (permalink / raw)
  To: Carlos O'Donell, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

On 07/20/2018 08:49 PM, Carlos O'Donell wrote:
> On 07/19/2018 04:39 PM, Florian Weimer wrote:
>> On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
>>> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
>>> exercise that [a-z] does not match A or Z.
>>
>> [a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.
>
> Sorry, I don't follow, it absolutely matches ASCII z.

The z I wrote above is one of the non-BMP math characters.

> We deinterlace the collation element ordering (not sequence) to get
> the right range expression resolution.
>
> See the added fnmatch tests:
>
> +en_US.UTF-8 "a" "[a-z]" 0
> +en_US.UTF-8 "z" "[a-z]" 0
> +en_US.UTF-8 "A" "[a-z]" NOMATCH
> +en_US.UTF-8 "Z" "[a-z]" NOMATCH
> +en_US.UTF-8 "a" "[A-Z]" NOMATCH
> +en_US.UTF-8 "z" "[A-Z]" NOMATCH
> +en_US.UTF-8 "A" "[A-Z]" 0
> +en_US.UTF-8 "Z" "[A-Z]" 0
> +en_US.UTF-8 "0" "[0-9]" 0
> +en_US.UTF-8 "9" "[0-9]" 0
>
> [a-z] matches a-z (including z), *and* all the lowercase inbetween,
> and so behaves like :lower: effectively.

There are characters equivalent to ASCII z (like the z above), but which
sort after z, so they are not matched. This is one reason why I think
this is a bad idea: it looks like [:lower:], but it's not. Same for
[0-9], I assume.

>> It's an improvement, and it may be good enough for glibc 2.28, but I would
>> rather see us implement the rational ranges interpretation.
>
> That requires all ranges behave rationally?
>
> We could fix a-z, A-Z, and 0-9 easily.
>
> Patch attached.

(NB: Patch is relative to the previous patch.)

My enumeration tester likes it much more. 8-)

  actual:   "abcdefghijklmnopqrstuvwxyz"
  actual:   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  actual:   "0123456789"

That's for [a-z], [A-Z], [0-9], in en_US.UTF-8 and de_DE.ISO-8859-1.
However, I still get this:

tst-regex-classes.script:85:0: result character set difference in locale tr_TR.ISO-8859-9
enumerate_chars '[a-z]' "abcdefghijklmnopqrstuvwxyz";
^
  expected: "abcdefghijklmnopqrstuvwxyz"
  actual:   "abcdefghjklmnopqrstuvwxyz"
tst-regex-classes.script:86:0: result character set difference in locale tr_TR.ISO-8859-9
enumerate_chars '[A-Z]' "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
^
  expected: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  actual:   "ABCDEFGHJKLMNOPQRSTUVWXYZ"
error: 2 test failures

Can you fix this with data-only changes, too?

posix/bug-regex17 regresses as well in the test for bug 9697, but I can
incorporate that into my enumeration tester. I don't think the bug is
actually regressing, it's just that the test objective is not expressed
properly in it.

posix/tst-rxspencer fails as well, presumably due to this:

UTF-8 aA FAIL regcomp failed: Invalid range end
UTF-8 aAcC FAIL regcomp failed: Invalid range end

I think this happens because the test blindly replaces ASCII characters
with non-ASCII characters, which causes issues if they are not ordered
as expected.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 42+ messages in thread
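For readers who want to try the kind of enumeration discussed above, here is a minimal sketch. It is not the tester referred to in the message (that script and its `enumerate_chars` command are not shown in this thread); the function name `enumerate_ascii` is hypothetical, it uses POSIX `fnmatch` rather than the regex interfaces, and it only walks the ASCII range.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* Collect every ASCII character in 1..127 that PATTERN matches as a
   one-character string, in code point order.  OUT must have room for
   at least 128 bytes (127 characters plus the terminator).  Matching
   uses the locale currently in effect; this function never calls
   setlocale, so unless the caller does, that is the "C" locale.
   Returns the number of characters stored. */
size_t enumerate_ascii(const char *pattern, char *out, size_t cap)
{
    size_t n = 0;
    assert(cap >= 128);
    for (int c = 1; c < 128; c++) {
        char s[2] = { (char)c, '\0' };
        if (fnmatch(pattern, s, 0) == 0)
            out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}
```

To look at a locale-specific case such as tr_TR.ISO-8859-9, a caller would first select that locale with `setlocale(LC_ALL, ...)`, provided it is installed, and would need to extend the loop beyond ASCII to see the non-ASCII members of a range.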
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
  2018-07-20 19:19   ` Florian Weimer
@ 2018-07-20 21:56     ` Carlos O'Donell
  2018-07-23 15:11       ` Florian Weimer
  0 siblings, 1 reply; 42+ messages in thread
From: Carlos O'Donell @ 2018-07-20 21:56 UTC (permalink / raw)
  To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian,
	Zorro Lang, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 5429 bytes --]

On 07/20/2018 03:19 PM, Florian Weimer wrote:
> On 07/20/2018 08:49 PM, Carlos O'Donell wrote:
>> On 07/19/2018 04:39 PM, Florian Weimer wrote:
>>> On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
>>>> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
>>>> exercise that [a-z] does not match A or Z.
>>>
>>> [a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.
>>
>> Sorry, I don't follow, it absolutely matches ASCII z.
>
> The z I wrote above is one of the non-BMP math characters.

Thanks :-}

It was a conservative solution.

>> We deinterlace the collation element ordering (not sequence) to get
>> the right range expression resolution.
>>
>> See the added fnmatch tests:
>>
>> +en_US.UTF-8 "a" "[a-z]" 0
>> +en_US.UTF-8 "z" "[a-z]" 0
>> +en_US.UTF-8 "A" "[a-z]" NOMATCH
>> +en_US.UTF-8 "Z" "[a-z]" NOMATCH
>> +en_US.UTF-8 "a" "[A-Z]" NOMATCH
>> +en_US.UTF-8 "z" "[A-Z]" NOMATCH
>> +en_US.UTF-8 "A" "[A-Z]" 0
>> +en_US.UTF-8 "Z" "[A-Z]" 0
>> +en_US.UTF-8 "0" "[0-9]" 0
>> +en_US.UTF-8 "9" "[0-9]" 0
>>
>> [a-z] matches a-z (including z), *and* all the lowercase inbetween,
>> and so behaves like :lower: effectively.
>
> There are characters equivalent to ASCII z (like the z above), but
> which sort after z, so they are not matched. This is one reason why
> I think this is a bad idea: it looks like [:lower:], but it's not.
> Same for [0-9], I assume.

Again, conservatively, this is how it worked before, and now works again
the same, but retains the improvement of ISO 14651 data being added.

>>> It's an improvement, and it may be good enough for glibc 2.28, but I would
>>> rather see us implement the rational ranges interpretation.
>>
>> That requires all ranges behave rationally?
>>
>> We could fix a-z, A-Z, and 0-9 easily.
>>
>> Patch attached.
>
> (NB: Patch is relative to the previous patch.)
>
> My enumeration tester likes it much more. 8-)

It was designed exactly for your enumerator ;-)

>  actual:   "abcdefghijklmnopqrstuvwxyz"
>  actual:   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
>  actual:   "0123456789"
>
> That's for [a-z], [A-Z], [0-9], in en_US.UTF-8 and de_DE.ISO-8859-1. However, I still get this:
>
> tst-regex-classes.script:85:0: result character set difference in locale tr_TR.ISO-8859-9
> enumerate_chars '[a-z]' "abcdefghijklmnopqrstuvwxyz";
> ^
>  expected: "abcdefghijklmnopqrstuvwxyz"
>  actual:   "abcdefghjklmnopqrstuvwxyz"
>
> tst-regex-classes.script:86:0: result character set difference in locale tr_TR.ISO-8859-9
> enumerate_chars '[A-Z]' "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
> ^
>  expected: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
>  actual:   "ABCDEFGHJKLMNOPQRSTUVWXYZ"
> error: 2 test failures
>
> Can you fix this with data-only changes, too?

Yes, I need to duplicate the rational range for A-Z in tr_TR and remove 'i'
since it's just fine the way it is, the existing

New patch attached with additional tests in tst-fnmatch.input to test
tr_TR.UTF-8, and ISO-8859-9.

Noticed equivalence class issues and filed a bug and added an XFAIL-ish
test case in tst-fnmatch.input:
https://sourceware.org/bugzilla/show_bug.cgi?id=23437

> posix/bug-regex17 regresses as well in the test for bug 9697, but I
> can incorporate that into my enumeration tester. I don't think the
> bug is actually regressing, it's just that the test objective is not
> expressed properly in it.

Fixed.

>
> posix/tst-rxspencer fails as well, presumably due to this:
>
> UTF-8 aA FAIL regcomp failed: Invalid range end
> UTF-8 aAcC FAIL regcomp failed: Invalid range end
>
> I think this happens because the test blindly replaces ASCII
> characters with non-ASCII characters, which causes issues if they are
> not ordered as expected.

Fixed.

v2
- Fixed tr_TR by duplicating A-Z rational range.
- Fixed tst-rxspencer.
- Fixed bug-regex17.

Tell me how the new version does.

-- 
Cheers,
Carlos.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: rational-ranges-v2.diff --]
[-- Type: text/x-patch; name="rational-ranges-v2.diff", Size: 49826 bytes --]

diff --git a/localedata/locales/iso14651_t1_common b/localedata/locales/iso14651_t1_common
index 227400cc4e..7248074a8b 100644
--- a/localedata/locales/iso14651_t1_common
+++ b/localedata/locales/iso14651_t1_common
@@ -63177,7 +63177,19 @@ order_start <SPECIAL>;forward;backward;forward;forward,position
 <U20BC> <S20BC>;<BASE>;<MIN>;<U20BC> % MANAT SIGN
 <U20BD> <S20BD>;<BASE>;<MIN>;<U20BD> % RUBLE SIGN
 <U20BE> <S20BE>;<BASE>;<MIN>;<U20BE> % LARI SIGN
+% Implement rational range for [0-9] in regular expressions.
+% We order the collation element order to support rational ranges.
+% Collation is unaffected because the 4-level weights remain the same.
<U0030> <S0030>;<BASE>;<MIN>;<U0030> % DIGIT ZERO +<U0031> <S0031>;<BASE>;<MIN>;<U0031> % DIGIT ONE +<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO +<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE +<U0034> <S0034>;<BASE>;<MIN>;<U0034> % DIGIT FOUR +<U0035> <S0035>;<BASE>;<MIN>;<U0035> % DIGIT FIVE +<U0036> <S0036>;<BASE>;<MIN>;<U0036> % DIGIT SIX +<U0037> <S0037>;<BASE>;<MIN>;<U0037> % DIGIT SEVEN +<U0038> <S0038>;<BASE>;<MIN>;<U0038> % DIGIT EIGHT +<U0039> <S0039>;<BASE>;<MIN>;<U0039> % DIGIT NINE <U0660> <S0030>;<BASE>;<MIN>;<U0660> % ARABIC-INDIC DIGIT ZERO <U06F0> <S0030>;<BASE>;<MIN>;<U06F0> % EXTENDED ARABIC-INDIC DIGIT ZERO <U07C0> <S0030>;<BASE>;<MIN>;<U07C0> % NKO DIGIT ZERO @@ -63250,7 +63262,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U2080> <S0030>;<BASE>;<MNS>;<U2080> % SUBSCRIPT ZERO <U2189> "<S0030><S0033>";"<BASE><BASE>";"<FRACTION><FRACTION>";<U2189> % VULGAR FRACTION ZERO THIRDS <U3358> "<S0030><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3358> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ZERO -<U0031> <S0031>;<BASE>;<MIN>;<U0031> % DIGIT ONE <U0661> <S0031>;<BASE>;<MIN>;<U0661> % ARABIC-INDIC DIGIT ONE <U06F1> <S0031>;<BASE>;<MIN>;<U06F1> % EXTENDED ARABIC-INDIC DIGIT ONE <U07C1> <S0031>;<BASE>;<MIN>;<U07C1> % NKO DIGIT ONE @@ -63440,7 +63451,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U33E0> "<S0031><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E0> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY ONE <U32C0> "<S0031><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C0> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JANUARY <U3359> "<S0031><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3359> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ONE -<U0032> <S0032>;<BASE>;<MIN>;<U0032> % DIGIT TWO <U0662> <S0032>;<BASE>;<MIN>;<U0662> % ARABIC-INDIC DIGIT TWO <U06F2> <S0032>;<BASE>;<MIN>;<U06F2> % EXTENDED ARABIC-INDIC DIGIT TWO <U07C2> <S0032>;<BASE>;<MIN>;<U07C2> % NKO DIGIT TWO @@ -63583,7 +63593,6 @@ 
order_start <SPECIAL>;forward;backward;forward;forward,position <U33E1> "<S0032><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E1> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY TWO <U32C1> "<S0032><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C1> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR FEBRUARY <U335A> "<S0032><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335A> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR TWO -<U0033> <S0033>;<BASE>;<MIN>;<U0033> % DIGIT THREE <U0663> <S0033>;<BASE>;<MIN>;<U0663> % ARABIC-INDIC DIGIT THREE <U06F3> <S0033>;<BASE>;<MIN>;<U06F3> % EXTENDED ARABIC-INDIC DIGIT THREE <U07C3> <S0033>;<BASE>;<MIN>;<U07C3> % NKO DIGIT THREE @@ -63709,7 +63718,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U33E2> "<S0033><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E2> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY THREE <U32C2> "<S0033><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C2> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR MARCH <U335B> "<S0033><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335B> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR THREE -<U0034> <S0034>;<BASE>;<MIN>;<U0034> % DIGIT FOUR <U0664> <S0034>;<BASE>;<MIN>;<U0664> % ARABIC-INDIC DIGIT FOUR <U06F4> <S0034>;<BASE>;<MIN>;<U06F4> % EXTENDED ARABIC-INDIC DIGIT FOUR <U07C4> <S0034>;<BASE>;<MIN>;<U07C4> % NKO DIGIT FOUR @@ -63829,7 +63837,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U33E3> "<S0034><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E3> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY FOUR <U32C3> "<S0034><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C3> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR APRIL <U335C> "<S0034><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335C> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR FOUR -<U0035> <S0035>;<BASE>;<MIN>;<U0035> % DIGIT FIVE <U0665> <S0035>;<BASE>;<MIN>;<U0665> % ARABIC-INDIC DIGIT FIVE <U06F5> <S0035>;<BASE>;<MIN>;<U06F5> % EXTENDED ARABIC-INDIC DIGIT FIVE <U07C5> 
<S0035>;<BASE>;<MIN>;<U07C5> % NKO DIGIT FIVE @@ -63941,7 +63948,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U33E4> "<S0035><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E4> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY FIVE <U32C4> "<S0035><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C4> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR MAY <U335D> "<S0035><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335D> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR FIVE -<U0036> <S0036>;<BASE>;<MIN>;<U0036> % DIGIT SIX <U0666> <S0036>;<BASE>;<MIN>;<U0666> % ARABIC-INDIC DIGIT SIX <U06F6> <S0036>;<BASE>;<MIN>;<U06F6> % EXTENDED ARABIC-INDIC DIGIT SIX <U07C6> <S0036>;<BASE>;<MIN>;<U07C6> % NKO DIGIT SIX @@ -64036,7 +64042,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U33E5> "<S0036><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E5> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY SIX <U32C5> "<S0036><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C5> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JUNE <U335E> "<S0036><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335E> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR SIX -<U0037> <S0037>;<BASE>;<MIN>;<U0037> % DIGIT SEVEN <U0667> <S0037>;<BASE>;<MIN>;<U0667> % ARABIC-INDIC DIGIT SEVEN <U06F7> <S0037>;<BASE>;<MIN>;<U06F7> % EXTENDED ARABIC-INDIC DIGIT SEVEN <U07C7> <S0037>;<BASE>;<MIN>;<U07C7> % NKO DIGIT SEVEN @@ -64132,7 +64137,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U33E6> "<S0037><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E6> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY SEVEN <U32C6> "<S0037><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C6> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR JULY <U335F> "<S0037><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U335F> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR SEVEN -<U0038> <S0038>;<BASE>;<MIN>;<U0038> % DIGIT EIGHT <U0668> <S0038>;<BASE>;<MIN>;<U0668> % ARABIC-INDIC DIGIT EIGHT <U06F8> <S0038>;<BASE>;<MIN>;<U06F8> % 
EXTENDED ARABIC-INDIC DIGIT EIGHT <U07C8> <S0038>;<BASE>;<MIN>;<U07C8> % NKO DIGIT EIGHT @@ -64226,7 +64230,6 @@ order_start <SPECIAL>;forward;backward;forward;forward,position <U33E7> "<S0038><RFB40><TE5E5>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U33E7> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY EIGHT <U32C7> "<S0038><RFB40><TE708>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U32C7> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR AUGUST <U3360> "<S0038><RFB40><TF0B9>";"<BASE><BASE>";"<COMPAT><COMPAT>";<U3360> % IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR EIGHT -<U0039> <S0039>;<BASE>;<MIN>;<U0039> % DIGIT NINE <U0669> <S0039>;<BASE>;<MIN>;<U0669> % ARABIC-INDIC DIGIT NINE <U06F9> <S0039>;<BASE>;<MIN>;<U06F9> % EXTENDED ARABIC-INDIC DIGIT NINE <U07C9> <S0039>;<BASE>;<MIN>;<U07C9> % NKO DIGIT NINE @@ -64326,7 +64329,35 @@ order_start <LATIN>;forward;backward;forward;forward,position else order_start <LATIN>;forward;forward;forward;forward,position endif +% Implement rational range for [a-z] in regular expressions. +% We order the collation element order to support rational ranges. +% Collation is unaffected because the 4-level weights remain the same. 
<U0061> <S0061>;<BASE>;<MIN>;<U0061> % LATIN SMALL LETTER A +<U0062> <S0062>;<BASE>;<MIN>;<U0062> % LATIN SMALL LETTER B +<U0063> <S0063>;<BASE>;<MIN>;<U0063> % LATIN SMALL LETTER C +<U0064> <S0064>;<BASE>;<MIN>;<U0064> % LATIN SMALL LETTER D +<U0065> <S0065>;<BASE>;<MIN>;<U0065> % LATIN SMALL LETTER E +<U0066> <S0066>;<BASE>;<MIN>;<U0066> % LATIN SMALL LETTER F +<U0067> <S0067>;<BASE>;<MIN>;<U0067> % LATIN SMALL LETTER G +<U0068> <S0068>;<BASE>;<MIN>;<U0068> % LATIN SMALL LETTER H +<U0069> <S0069>;<BASE>;<MIN>;<U0069> % LATIN SMALL LETTER I +<U006A> <S006A>;<BASE>;<MIN>;<U006A> % LATIN SMALL LETTER J +<U006B> <S006B>;<BASE>;<MIN>;<U006B> % LATIN SMALL LETTER K +<U006C> <S006C>;<BASE>;<MIN>;<U006C> % LATIN SMALL LETTER L +<U006D> <S006D>;<BASE>;<MIN>;<U006D> % LATIN SMALL LETTER M +<U006E> <S006E>;<BASE>;<MIN>;<U006E> % LATIN SMALL LETTER N +<U006F> <S006F>;<BASE>;<MIN>;<U006F> % LATIN SMALL LETTER O +<U0070> <S0070>;<BASE>;<MIN>;<U0070> % LATIN SMALL LETTER P +<U0071> <S0071>;<BASE>;<MIN>;<U0071> % LATIN SMALL LETTER Q +<U0072> <S0072>;<BASE>;<MIN>;<U0072> % LATIN SMALL LETTER R +<U0073> <S0073>;<BASE>;<MIN>;<U0073> % LATIN SMALL LETTER S +<U0074> <S0074>;<BASE>;<MIN>;<U0074> % LATIN SMALL LETTER T +<U0075> <S0075>;<BASE>;<MIN>;<U0075> % LATIN SMALL LETTER U +<U0076> <S0076>;<BASE>;<MIN>;<U0076> % LATIN SMALL LETTER V +<U0077> <S0077>;<BASE>;<MIN>;<U0077> % LATIN SMALL LETTER W +<U0078> <S0078>;<BASE>;<MIN>;<U0078> % LATIN SMALL LETTER X +<U0079> <S0079>;<BASE>;<MIN>;<U0079> % LATIN SMALL LETTER Y +<U007A> <S007A>;<BASE>;<MIN>;<U007A> % LATIN SMALL LETTER Z <UFF41> <S0061>;<BASE>;<WIDE>;<UFF41> % FULLWIDTH LATIN SMALL LETTER A <U0363> <S0061>;<BASE>;<COMPAT>;<U0363> % COMBINING LATIN SMALL LETTER A <U249C> <S0061>;<BASE>;<COMPAT>;<U249C> % PARENTHESIZED LATIN SMALL LETTER A @@ -64418,7 +64449,6 @@ endif <U0252> <S0252>;<BASE>;<MIN>;<U0252> % LATIN SMALL LETTER TURNED ALPHA <U1D9B> <S0252>;<BASE>;<MNN>;<U1D9B> % MODIFIER LETTER SMALL TURNED ALPHA <UAB64> 
<SAB64>;<BASE>;<MIN>;<UAB64> % LATIN SMALL LETTER INVERTED ALPHA -<U0062> <S0062>;<BASE>;<MIN>;<U0062> % LATIN SMALL LETTER B <UFF42> <S0062>;<BASE>;<WIDE>;<UFF42> % FULLWIDTH LATIN SMALL LETTER B <U1DE8> <S0062>;<BASE>;<COMPAT>;<U1DE8> % COMBINING LATIN SMALL LETTER B <U249D> <S0062>;<BASE>;<COMPAT>;<U249D> % PARENTHESIZED LATIN SMALL LETTER B @@ -64454,7 +64484,6 @@ endif <U0183> <S0183>;<BASE>;<MIN>;<U0183> % LATIN SMALL LETTER B WITH TOPBAR <UA7B5> <SA7B5>;<BASE>;<MIN>;<UA7B5> % LATIN SMALL LETTER BETA <U1DE9> <SA7B5>;<BASE>;<COMPAT>;<U1DE9> % COMBINING LATIN SMALL LETTER BETA -<U0063> <S0063>;<BASE>;<MIN>;<U0063> % LATIN SMALL LETTER C <UFF43> <S0063>;<BASE>;<WIDE>;<UFF43> % FULLWIDTH LATIN SMALL LETTER C <U0368> <S0063>;<BASE>;<COMPAT>;<U0368> % COMBINING LATIN SMALL LETTER C <U217D> <S0063>;<BASE>;<COMPAT>;<U217D> % SMALL ROMAN NUMERAL ONE HUNDRED @@ -64504,7 +64533,6 @@ endif <U1D9D> <S0255>;<BASE>;<MNN>;<U1D9D> % MODIFIER LETTER SMALL C WITH CURL <U2184> <S2184>;<BASE>;<MIN>;<U2184> % LATIN SMALL LETTER REVERSED C <UA73F> <SA73F>;<BASE>;<MIN>;<UA73F> % LATIN SMALL LETTER REVERSED C WITH DOT -<U0064> <S0064>;<BASE>;<MIN>;<U0064> % LATIN SMALL LETTER D <UFF44> <S0064>;<BASE>;<WIDE>;<UFF44> % FULLWIDTH LATIN SMALL LETTER D <U0369> <S0064>;<BASE>;<COMPAT>;<U0369> % COMBINING LATIN SMALL LETTER D <U217E> <S0064>;<BASE>;<COMPAT>;<U217E> % SMALL ROMAN NUMERAL FIVE HUNDRED @@ -64563,7 +64591,6 @@ endif <U0221> <S0221>;<BASE>;<MIN>;<U0221> % LATIN SMALL LETTER D WITH CURL <UA771> <SA771>;<BASE>;<MIN>;<UA771> % LATIN SMALL LETTER DUM <U1E9F> <S1E9F>;<BASE>;<MIN>;<U1E9F> % LATIN SMALL LETTER DELTA -<U0065> <S0065>;<BASE>;<MIN>;<U0065> % LATIN SMALL LETTER E <UFF45> <S0065>;<BASE>;<WIDE>;<UFF45> % FULLWIDTH LATIN SMALL LETTER E <U0364> <S0065>;<BASE>;<COMPAT>;<U0364> % COMBINING LATIN SMALL LETTER E <U24A0> <S0065>;<BASE>;<COMPAT>;<U24A0> % PARENTHESIZED LATIN SMALL LETTER E @@ -64641,7 +64668,6 @@ endif <U025E> <S025E>;<BASE>;<MIN>;<U025E> % LATIN SMALL LETTER CLOSED 
REVERSED OPEN E <U029A> <S029A>;<BASE>;<MIN>;<U029A> % LATIN SMALL LETTER CLOSED OPEN E <U0264> <S0264>;<BASE>;<MIN>;<U0264> % LATIN SMALL LETTER RAMS HORN -<U0066> <S0066>;<BASE>;<MIN>;<U0066> % LATIN SMALL LETTER F <UFF46> <S0066>;<BASE>;<WIDE>;<UFF46> % FULLWIDTH LATIN SMALL LETTER F <U1DEB> <S0066>;<BASE>;<COMPAT>;<U1DEB> % COMBINING LATIN SMALL LETTER F <U24A1> <S0066>;<BASE>;<COMPAT>;<U24A1> % PARENTHESIZED LATIN SMALL LETTER F @@ -64680,7 +64706,6 @@ endif <U0192> <S0192>;<BASE>;<MIN>;<U0192> % LATIN SMALL LETTER F WITH HOOK <U214E> <S214E>;<BASE>;<MIN>;<U214E> % TURNED SMALL F <UA7FB> <SA7FB>;<BASE>;<MIN>;<UA7FB> % LATIN EPIGRAPHIC LETTER REVERSED F -<U0067> <S0067>;<BASE>;<MIN>;<U0067> % LATIN SMALL LETTER G <UFF47> <S0067>;<BASE>;<WIDE>;<UFF47> % FULLWIDTH LATIN SMALL LETTER G <U1DDA> <S0067>;<BASE>;<COMPAT>;<U1DDA> % COMBINING LATIN SMALL LETTER G <U24A2> <S0067>;<BASE>;<COMPAT>;<U24A2> % PARENTHESIZED LATIN SMALL LETTER G @@ -64727,7 +64752,6 @@ endif <U0263> <S0263>;<BASE>;<MIN>;<U0263> % LATIN SMALL LETTER GAMMA <U02E0> <S0263>;<BASE>;<MNN>;<U02E0> % MODIFIER LETTER SMALL GAMMA <U01A3> <S01A3>;<BASE>;<MIN>;<U01A3> % LATIN SMALL LETTER OI -<U0068> <S0068>;<BASE>;<MIN>;<U0068> % LATIN SMALL LETTER H <UFF48> <S0068>;<BASE>;<WIDE>;<UFF48> % FULLWIDTH LATIN SMALL LETTER H <U036A> <S0068>;<BASE>;<COMPAT>;<U036A> % COMBINING LATIN SMALL LETTER H <U24A3> <S0068>;<BASE>;<COMPAT>;<U24A3> % PARENTHESIZED LATIN SMALL LETTER H @@ -64780,7 +64804,6 @@ endif <U0267> <S0267>;<BASE>;<MIN>;<U0267> % LATIN SMALL LETTER HENG WITH HOOK <U02BB> <S02BB>;<BASE>;<MIN>;<U02BB> % MODIFIER LETTER TURNED COMMA <U02BD> <S02BD>;<BASE>;<MIN>;<U02BD> % MODIFIER LETTER REVERSED COMMA -<U0069> <S0069>;<BASE>;<MIN>;<U0069> % LATIN SMALL LETTER I <UFF49> <S0069>;<BASE>;<WIDE>;<UFF49> % FULLWIDTH LATIN SMALL LETTER I <U0365> <S0069>;<BASE>;<COMPAT>;<U0365> % COMBINING LATIN SMALL LETTER I <U2170> <S0069>;<BASE>;<COMPAT>;<U2170> % SMALL ROMAN NUMERAL ONE @@ -64844,7 +64867,6 @@ endif 
<U0269> <S0269>;<BASE>;<MIN>;<U0269> % LATIN SMALL LETTER IOTA <U1DA5> <S0269>;<BASE>;<MNN>;<U1DA5> % MODIFIER LETTER SMALL IOTA <U1D7C> <S1D7C>;<BASE>;<MIN>;<U1D7C> % LATIN SMALL LETTER IOTA WITH STROKE -<U006A> <S006A>;<BASE>;<MIN>;<U006A> % LATIN SMALL LETTER J <UFF4A> <S006A>;<BASE>;<WIDE>;<UFF4A> % FULLWIDTH LATIN SMALL LETTER J <U24A5> <S006A>;<BASE>;<COMPAT>;<U24A5> % PARENTHESIZED LATIN SMALL LETTER J <U2149> <S006A>;<BASE>;<FONT>;<U2149> % DOUBLE-STRUCK ITALIC SMALL J @@ -64876,7 +64898,6 @@ endif <U025F> <S025F>;<BASE>;<MIN>;<U025F> % LATIN SMALL LETTER DOTLESS J WITH STROKE <U1DA1> <S025F>;<BASE>;<MNN>;<U1DA1> % MODIFIER LETTER SMALL DOTLESS J WITH STROKE <U0284> <S0284>;<BASE>;<MIN>;<U0284> % LATIN SMALL LETTER DOTLESS J WITH STROKE AND HOOK -<U006B> <S006B>;<BASE>;<MIN>;<U006B> % LATIN SMALL LETTER K <UFF4B> <S006B>;<BASE>;<WIDE>;<UFF4B> % FULLWIDTH LATIN SMALL LETTER K <U1DDC> <S006B>;<BASE>;<COMPAT>;<U1DDC> % COMBINING LATIN SMALL LETTER K <U24A6> <S006B>;<BASE>;<COMPAT>;<U24A6> % PARENTHESIZED LATIN SMALL LETTER K @@ -64926,7 +64947,6 @@ endif <UA743> <SA743>;<BASE>;<MIN>;<UA743> % LATIN SMALL LETTER K WITH DIAGONAL STROKE <UA745> <SA745>;<BASE>;<MIN>;<UA745> % LATIN SMALL LETTER K WITH STROKE AND DIAGONAL STROKE <U029E> <S029E>;<BASE>;<MIN>;<U029E> % LATIN SMALL LETTER TURNED K -<U006C> <S006C>;<BASE>;<MIN>;<U006C> % LATIN SMALL LETTER L <UFF4C> <S006C>;<BASE>;<WIDE>;<UFF4C> % FULLWIDTH LATIN SMALL LETTER L <U1DDD> <S006C>;<BASE>;<COMPAT>;<U1DDD> % COMBINING LATIN SMALL LETTER L <U217C> <S006C>;<BASE>;<COMPAT>;<U217C> % SMALL ROMAN NUMERAL FIFTY @@ -64996,7 +65016,6 @@ endif <UA781> <SA781>;<BASE>;<MIN>;<UA781> % LATIN SMALL LETTER TURNED L <U019B> <S019B>;<BASE>;<MIN>;<U019B> % LATIN SMALL LETTER LAMBDA WITH STROKE <U028E> <S028E>;<BASE>;<MIN>;<U028E> % LATIN SMALL LETTER TURNED Y -<U006D> <S006D>;<BASE>;<MIN>;<U006D> % LATIN SMALL LETTER M <UFF4D> <S006D>;<BASE>;<WIDE>;<UFF4D> % FULLWIDTH LATIN SMALL LETTER M <U036B> 
<S006D>;<BASE>;<COMPAT>;<U036B> % COMBINING LATIN SMALL LETTER M <U217F> <S006D>;<BASE>;<COMPAT>;<U217F> % SMALL ROMAN NUMERAL ONE THOUSAND @@ -65055,7 +65074,6 @@ endif <UA7FD> <SA7FD>;<BASE>;<MIN>;<UA7FD> % LATIN EPIGRAPHIC LETTER INVERTED M <UA7FF> <SA7FF>;<BASE>;<MIN>;<UA7FF> % LATIN EPIGRAPHIC LETTER ARCHAIC M <UA773> <SA773>;<BASE>;<MIN>;<UA773> % LATIN SMALL LETTER MUM -<U006E> <S006E>;<BASE>;<MIN>;<U006E> % LATIN SMALL LETTER N <UFF4E> <S006E>;<BASE>;<WIDE>;<UFF4E> % FULLWIDTH LATIN SMALL LETTER N <U1DE0> <S006E>;<BASE>;<COMPAT>;<U1DE0> % COMBINING LATIN SMALL LETTER N <U24A9> <S006E>;<BASE>;<COMPAT>;<U24A9> % PARENTHESIZED LATIN SMALL LETTER N @@ -65114,7 +65132,6 @@ endif <U014B> <S014B>;<BASE>;<MIN>;<U014B> % LATIN SMALL LETTER ENG <U1D51> <S014B>;<BASE>;<MNN>;<U1D51> % MODIFIER LETTER SMALL ENG <UAB3C> <SAB3C>;<BASE>;<MIN>;<UAB3C> % LATIN SMALL LETTER ENG WITH CROSSED-TAIL -<U006F> <S006F>;<BASE>;<MIN>;<U006F> % LATIN SMALL LETTER O <UFF4F> <S006F>;<BASE>;<WIDE>;<UFF4F> % FULLWIDTH LATIN SMALL LETTER O <U0366> <S006F>;<BASE>;<COMPAT>;<U0366> % COMBINING LATIN SMALL LETTER O <U24AA> <S006F>;<BASE>;<COMPAT>;<U24AA> % PARENTHESIZED LATIN SMALL LETTER O @@ -65213,7 +65230,6 @@ endif <U0223> <S0223>;<BASE>;<MIN>;<U0223> % LATIN SMALL LETTER OU <U1D3D> <S0223>;<BASE>;<MISCCAP>;<U1D3D> % MODIFIER LETTER CAPITAL OU <U1D15> <S1D15>;<BASE>;<MIN>;<U1D15> % LATIN LETTER SMALL CAPITAL OU -<U0070> <S0070>;<BASE>;<MIN>;<U0070> % LATIN SMALL LETTER P <UFF50> <S0070>;<BASE>;<WIDE>;<UFF50> % FULLWIDTH LATIN SMALL LETTER P <U1DEE> <S0070>;<BASE>;<COMPAT>;<U1DEE> % COMBINING LATIN SMALL LETTER P <U24AB> <S0070>;<BASE>;<COMPAT>;<U24AB> % PARENTHESIZED LATIN SMALL LETTER P @@ -65262,7 +65278,6 @@ endif <U0278> <S0278>;<BASE>;<MIN>;<U0278> % LATIN SMALL LETTER PHI <U1DB2> <S0278>;<BASE>;<MNN>;<U1DB2> % MODIFIER LETTER SMALL PHI <U2C77> <S2C77>;<BASE>;<MIN>;<U2C77> % LATIN SMALL LETTER TAILLESS PHI -<U0071> <S0071>;<BASE>;<MIN>;<U0071> % LATIN SMALL LETTER Q <UFF51> 
<S0071>;<BASE>;<WIDE>;<UFF51> % FULLWIDTH LATIN SMALL LETTER Q <U24AC> <S0071>;<BASE>;<COMPAT>;<U24AC> % PARENTHESIZED LATIN SMALL LETTER Q <U0001D42A> <S0071>;<BASE>;<FONT>;<U0001D42A> % MATHEMATICAL BOLD SMALL Q @@ -65285,7 +65300,6 @@ endif <U02A0> <S02A0>;<BASE>;<MIN>;<U02A0> % LATIN SMALL LETTER Q WITH HOOK <U024B> <S024B>;<BASE>;<MIN>;<U024B> % LATIN SMALL LETTER Q WITH HOOK TAIL <U0138> <S0138>;<BASE>;<MIN>;<U0138> % LATIN SMALL LETTER KRA -<U0072> <S0072>;<BASE>;<MIN>;<U0072> % LATIN SMALL LETTER R <UFF52> <S0072>;<BASE>;<WIDE>;<UFF52> % FULLWIDTH LATIN SMALL LETTER R <U036C> <S0072>;<BASE>;<COMPAT>;<U036C> % COMBINING LATIN SMALL LETTER R <U1DCA> <S0072>;<BASE>;<COMPAT>;<U1DCA> % COMBINING LATIN SMALL LETTER R BELOW @@ -65354,7 +65368,6 @@ endif <UA775> <SA775>;<BASE>;<MIN>;<UA775> % LATIN SMALL LETTER RUM <UA776> <SA776>;<BASE>;<MIN>;<UA776> % LATIN LETTER SMALL CAPITAL RUM <UA75D> <SA75D>;<BASE>;<MIN>;<UA75D> % LATIN SMALL LETTER RUM ROTUNDA -<U0073> <S0073>;<BASE>;<MIN>;<U0073> % LATIN SMALL LETTER S <UFF53> <S0073>;<BASE>;<WIDE>;<UFF53> % FULLWIDTH LATIN SMALL LETTER S <U1DE4> <S0073>;<BASE>;<COMPAT>;<U1DE4> % COMBINING LATIN SMALL LETTER S <U24AE> <S0073>;<BASE>;<COMPAT>;<U24AE> % PARENTHESIZED LATIN SMALL LETTER S @@ -65417,7 +65430,6 @@ endif <U0285> <S0285>;<BASE>;<MIN>;<U0285> % LATIN SMALL LETTER SQUAT REVERSED ESH <U1D98> <S1D98>;<BASE>;<MIN>;<U1D98> % LATIN SMALL LETTER ESH WITH RETROFLEX HOOK <U0286> <S0286>;<BASE>;<MIN>;<U0286> % LATIN SMALL LETTER ESH WITH CURL -<U0074> <S0074>;<BASE>;<MIN>;<U0074> % LATIN SMALL LETTER T <UFF54> <S0074>;<BASE>;<WIDE>;<UFF54> % FULLWIDTH LATIN SMALL LETTER T <U036D> <S0074>;<BASE>;<COMPAT>;<U036D> % COMBINING LATIN SMALL LETTER T <U24AF> <S0074>;<BASE>;<COMPAT>;<U24AF> % PARENTHESIZED LATIN SMALL LETTER T @@ -65467,7 +65479,6 @@ endif <U0236> <S0236>;<BASE>;<MIN>;<U0236> % LATIN SMALL LETTER T WITH CURL <UA777> <SA777>;<BASE>;<MIN>;<UA777> % LATIN SMALL LETTER TUM <U0287> <S0287>;<BASE>;<MIN>;<U0287> % LATIN 
SMALL LETTER TURNED T -<U0075> <S0075>;<BASE>;<MIN>;<U0075> % LATIN SMALL LETTER U <UFF55> <S0075>;<BASE>;<WIDE>;<UFF55> % FULLWIDTH LATIN SMALL LETTER U <U0367> <S0075>;<BASE>;<COMPAT>;<U0367> % COMBINING LATIN SMALL LETTER U <U24B0> <S0075>;<BASE>;<COMPAT>;<U24B0> % PARENTHESIZED LATIN SMALL LETTER U @@ -65552,7 +65563,6 @@ endif <U028A> <S028A>;<BASE>;<MIN>;<U028A> % LATIN SMALL LETTER UPSILON <U1DB7> <S028A>;<BASE>;<MNN>;<U1DB7> % MODIFIER LETTER SMALL UPSILON <U1D7F> <S1D7F>;<BASE>;<MIN>;<U1D7F> % LATIN SMALL LETTER UPSILON WITH STROKE -<U0076> <S0076>;<BASE>;<MIN>;<U0076> % LATIN SMALL LETTER V <UFF56> <S0076>;<BASE>;<WIDE>;<UFF56> % FULLWIDTH LATIN SMALL LETTER V <U036E> <S0076>;<BASE>;<COMPAT>;<U036E> % COMBINING LATIN SMALL LETTER V <U2174> <S0076>;<BASE>;<COMPAT>;<U2174> % SMALL ROMAN NUMERAL FIVE @@ -65593,7 +65603,6 @@ endif <U1EFD> <S1EFD>;<BASE>;<MIN>;<U1EFD> % LATIN SMALL LETTER MIDDLE-WELSH V <U028C> <S028C>;<BASE>;<MIN>;<U028C> % LATIN SMALL LETTER TURNED V <U1DBA> <S028C>;<BASE>;<MNN>;<U1DBA> % MODIFIER LETTER SMALL TURNED V -<U0077> <S0077>;<BASE>;<MIN>;<U0077> % LATIN SMALL LETTER W <UFF57> <S0077>;<BASE>;<WIDE>;<UFF57> % FULLWIDTH LATIN SMALL LETTER W <U1DF1> <S0077>;<BASE>;<COMPAT>;<U1DF1> % COMBINING LATIN SMALL LETTER W <U24B2> <S0077>;<BASE>;<COMPAT>;<U24B2> % PARENTHESIZED LATIN SMALL LETTER W @@ -65627,7 +65636,6 @@ endif <U1D21> <S1D21>;<BASE>;<MIN>;<U1D21> % LATIN LETTER SMALL CAPITAL W <U2C73> <S2C73>;<BASE>;<MIN>;<U2C73> % LATIN SMALL LETTER W WITH HOOK <U028D> <S028D>;<BASE>;<MIN>;<U028D> % LATIN SMALL LETTER TURNED W -<U0078> <S0078>;<BASE>;<MIN>;<U0078> % LATIN SMALL LETTER X <UFF58> <S0078>;<BASE>;<WIDE>;<UFF58> % FULLWIDTH LATIN SMALL LETTER X <U036F> <S0078>;<BASE>;<COMPAT>;<U036F> % COMBINING LATIN SMALL LETTER X <U2179> <S0078>;<BASE>;<COMPAT>;<U2179> % SMALL ROMAN NUMERAL TEN @@ -65660,7 +65668,6 @@ endif <UAB53> <SAB53>;<BASE>;<MIN>;<UAB53> % LATIN SMALL LETTER CHI <UAB54> <SAB54>;<BASE>;<MIN>;<UAB54> % LATIN SMALL LETTER 
CHI WITH LOW RIGHT RING <UAB55> <SAB55>;<BASE>;<MIN>;<UAB55> % LATIN SMALL LETTER CHI WITH LOW LEFT SERIF -<U0079> <S0079>;<BASE>;<MIN>;<U0079> % LATIN SMALL LETTER Y <UFF59> <S0079>;<BASE>;<WIDE>;<UFF59> % FULLWIDTH LATIN SMALL LETTER Y <U24B4> <S0079>;<BASE>;<COMPAT>;<U24B4> % PARENTHESIZED LATIN SMALL LETTER Y <U0001D432> <S0079>;<BASE>;<FONT>;<U0001D432> % MATHEMATICAL BOLD SMALL Y @@ -65694,7 +65701,6 @@ endif <U1EFF> <S1EFF>;<BASE>;<MIN>;<U1EFF> % LATIN SMALL LETTER Y WITH LOOP <UAB5A> <SAB5A>;<BASE>;<MIN>;<UAB5A> % LATIN SMALL LETTER Y WITH SHORT RIGHT LEG <U021D> <S021D>;<BASE>;<MIN>;<U021D> % LATIN SMALL LETTER YOGH -<U007A> <S007A>;<BASE>;<MIN>;<U007A> % LATIN SMALL LETTER Z <UFF5A> <S007A>;<BASE>;<WIDE>;<UFF5A> % FULLWIDTH LATIN SMALL LETTER Z <U1DE6> <S007A>;<BASE>;<COMPAT>;<U1DE6> % COMBINING LATIN SMALL LETTER Z <U24B5> <S007A>;<BASE>;<COMPAT>;<U24B5> % PARENTHESIZED LATIN SMALL LETTER Z @@ -65796,7 +65802,35 @@ endif <U0001D736> <S03B1>;<BASE>;<FONT>;<U0001D736> % MATHEMATICAL BOLD ITALIC SMALL ALPHA <U0001D770> <S03B1>;<BASE>;<FONT>;<U0001D770> % MATHEMATICAL SANS-SERIF BOLD SMALL ALPHA <U0001D7AA> <S03B1>;<BASE>;<FONT>;<U0001D7AA> % MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL ALPHA +% Implement rational range for [A-Z] in regular expressions. +% We order the collation element order to support rational ranges. +% Collation is unaffected because the 4-level weights remain the same. 
<U0041> <S0061>;<BASE>;<CAP>;<U0041> % LATIN CAPITAL LETTER A +<U0042> <S0062>;<BASE>;<CAP>;<U0042> % LATIN CAPITAL LETTER B +<U0043> <S0063>;<BASE>;<CAP>;<U0043> % LATIN CAPITAL LETTER C +<U0044> <S0064>;<BASE>;<CAP>;<U0044> % LATIN CAPITAL LETTER D +<U0045> <S0065>;<BASE>;<CAP>;<U0045> % LATIN CAPITAL LETTER E +<U0046> <S0066>;<BASE>;<CAP>;<U0046> % LATIN CAPITAL LETTER F +<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G +<U0048> <S0068>;<BASE>;<CAP>;<U0048> % LATIN CAPITAL LETTER H +<U0049> <S0069>;<BASE>;<CAP>;<U0049> % LATIN CAPITAL LETTER I +<U004A> <S006A>;<BASE>;<CAP>;<U004A> % LATIN CAPITAL LETTER J +<U004B> <S006B>;<BASE>;<CAP>;<U004B> % LATIN CAPITAL LETTER K +<U004C> <S006C>;<BASE>;<CAP>;<U004C> % LATIN CAPITAL LETTER L +<U004D> <S006D>;<BASE>;<CAP>;<U004D> % LATIN CAPITAL LETTER M +<U004E> <S006E>;<BASE>;<CAP>;<U004E> % LATIN CAPITAL LETTER N +<U004F> <S006F>;<BASE>;<CAP>;<U004F> % LATIN CAPITAL LETTER O +<U0050> <S0070>;<BASE>;<CAP>;<U0050> % LATIN CAPITAL LETTER P +<U0051> <S0071>;<BASE>;<CAP>;<U0051> % LATIN CAPITAL LETTER Q +<U0052> <S0072>;<BASE>;<CAP>;<U0052> % LATIN CAPITAL LETTER R +<U0053> <S0073>;<BASE>;<CAP>;<U0053> % LATIN CAPITAL LETTER S +<U0054> <S0074>;<BASE>;<CAP>;<U0054> % LATIN CAPITAL LETTER T +<U0055> <S0075>;<BASE>;<CAP>;<U0055> % LATIN CAPITAL LETTER U +<U0056> <S0076>;<BASE>;<CAP>;<U0056> % LATIN CAPITAL LETTER V +<U0057> <S0077>;<BASE>;<CAP>;<U0057> % LATIN CAPITAL LETTER W +<U0058> <S0078>;<BASE>;<CAP>;<U0058> % LATIN CAPITAL LETTER X +<U0059> <S0079>;<BASE>;<CAP>;<U0059> % LATIN CAPITAL LETTER Y +<U005A> <S007A>;<BASE>;<CAP>;<U005A> % LATIN CAPITAL LETTER Z <UFF21> <S0061>;<BASE>;<WIDECAP>;<UFF21> % FULLWIDTH LATIN CAPITAL LETTER A <U0001F110> <S0061>;<BASE>;<COMPATCAP>;<U0001F110> % PARENTHESIZED LATIN CAPITAL LETTER A <U0001D400> <S0061>;<BASE>;<FONTCAP>;<U0001D400> % MATHEMATICAL BOLD CAPITAL A @@ -65860,7 +65894,6 @@ endif <U2C6F> <S0250>;<BASE>;<CAP>;<U2C6F> % LATIN CAPITAL LETTER TURNED A <U2C6D> 
<S0251>;<BASE>;<CAP>;<U2C6D> % LATIN CAPITAL LETTER ALPHA <U2C70> <S0252>;<BASE>;<CAP>;<U2C70> % LATIN CAPITAL LETTER TURNED ALPHA -<U0042> <S0062>;<BASE>;<CAP>;<U0042> % LATIN CAPITAL LETTER B <UFF22> <S0062>;<BASE>;<WIDECAP>;<UFF22> % FULLWIDTH LATIN CAPITAL LETTER B <U0001F111> <S0062>;<BASE>;<COMPATCAP>;<U0001F111> % PARENTHESIZED LATIN CAPITAL LETTER B <U212C> <S0062>;<BASE>;<FONTCAP>;<U212C> % SCRIPT CAPITAL B @@ -65888,7 +65921,6 @@ endif <U0181> <S0253>;<BASE>;<CAP>;<U0181> % LATIN CAPITAL LETTER B WITH HOOK <U0182> <S0183>;<BASE>;<CAP>;<U0182> % LATIN CAPITAL LETTER B WITH TOPBAR <UA7B4> <SA7B5>;<BASE>;<CAP>;<UA7B4> % LATIN CAPITAL LETTER BETA -<U0043> <S0063>;<BASE>;<CAP>;<U0043> % LATIN CAPITAL LETTER C <UFF23> <S0063>;<BASE>;<WIDECAP>;<UFF23> % FULLWIDTH LATIN CAPITAL LETTER C <U216D> <S0063>;<BASE>;<COMPATCAP>;<U216D> % ROMAN NUMERAL ONE HUNDRED <U0001F112> <S0063>;<BASE>;<COMPATCAP>;<U0001F112> % PARENTHESIZED LATIN CAPITAL LETTER C @@ -65921,7 +65953,6 @@ endif <U0187> <S0188>;<BASE>;<CAP>;<U0187> % LATIN CAPITAL LETTER C WITH HOOK <U2183> <S2184>;<BASE>;<CAP>;<U2183> % ROMAN NUMERAL REVERSED ONE HUNDRED <UA73E> <SA73F>;<BASE>;<CAP>;<UA73E> % LATIN CAPITAL LETTER REVERSED C WITH DOT -<U0044> <S0064>;<BASE>;<CAP>;<U0044> % LATIN CAPITAL LETTER D <UFF24> <S0064>;<BASE>;<WIDECAP>;<UFF24> % FULLWIDTH LATIN CAPITAL LETTER D <U216E> <S0064>;<BASE>;<COMPATCAP>;<U216E> % ROMAN NUMERAL FIVE HUNDRED <U0001F113> <S0064>;<BASE>;<COMPATCAP>;<U0001F113> % PARENTHESIZED LATIN CAPITAL LETTER D @@ -65959,7 +65990,6 @@ endif <U0189> <S0256>;<BASE>;<CAP>;<U0189> % LATIN CAPITAL LETTER AFRICAN D <U018A> <S0257>;<BASE>;<CAP>;<U018A> % LATIN CAPITAL LETTER D WITH HOOK <U018B> <S018C>;<BASE>;<CAP>;<U018B> % LATIN CAPITAL LETTER D WITH TOPBAR -<U0045> <S0065>;<BASE>;<CAP>;<U0045> % LATIN CAPITAL LETTER E <UFF25> <S0065>;<BASE>;<WIDECAP>;<UFF25> % FULLWIDTH LATIN CAPITAL LETTER E <U0001F114> <S0065>;<BASE>;<COMPATCAP>;<U0001F114> % PARENTHESIZED LATIN CAPITAL LETTER E 
<U2130> <S0065>;<BASE>;<FONTCAP>;<U2130> % SCRIPT CAPITAL E @@ -66010,7 +66040,6 @@ endif <U0190> <S025B>;<BASE>;<CAP>;<U0190> % LATIN CAPITAL LETTER OPEN E <U2107> <S025B>;<BASE>;<COMPATCAP>;<U2107> % EULER CONSTANT <UA7AB> <S025C>;<BASE>;<CAP>;<UA7AB> % LATIN CAPITAL LETTER REVERSED OPEN E -<U0046> <S0066>;<BASE>;<CAP>;<U0046> % LATIN CAPITAL LETTER F <UFF26> <S0066>;<BASE>;<WIDECAP>;<UFF26> % FULLWIDTH LATIN CAPITAL LETTER F <U0001F115> <S0066>;<BASE>;<COMPATCAP>;<U0001F115> % PARENTHESIZED LATIN CAPITAL LETTER F <U2131> <S0066>;<BASE>;<FONTCAP>;<U2131> % SCRIPT CAPITAL F @@ -66035,7 +66064,6 @@ endif <UA798> <SA799>;<BASE>;<CAP>;<UA798> % LATIN CAPITAL LETTER F WITH STROKE <U0191> <S0192>;<BASE>;<CAP>;<U0191> % LATIN CAPITAL LETTER F WITH HOOK <U2132> <S214E>;<BASE>;<CAP>;<U2132> % TURNED CAPITAL F -<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G <UFF27> <S0067>;<BASE>;<WIDECAP>;<UFF27> % FULLWIDTH LATIN CAPITAL LETTER G <U0001F116> <S0067>;<BASE>;<COMPATCAP>;<U0001F116> % PARENTHESIZED LATIN CAPITAL LETTER G <U0001D406> <S0067>;<BASE>;<FONTCAP>;<U0001D406> % MATHEMATICAL BOLD CAPITAL G @@ -66071,7 +66099,6 @@ endif <UA77E> <SA77F>;<BASE>;<CAP>;<UA77E> % LATIN CAPITAL LETTER TURNED INSULAR G <U0194> <S0263>;<BASE>;<CAP>;<U0194> % LATIN CAPITAL LETTER GAMMA <U01A2> <S01A3>;<BASE>;<CAP>;<U01A2> % LATIN CAPITAL LETTER OI -<U0048> <S0068>;<BASE>;<CAP>;<U0048> % LATIN CAPITAL LETTER H <UFF28> <S0068>;<BASE>;<WIDECAP>;<UFF28> % FULLWIDTH LATIN CAPITAL LETTER H <U0001F117> <S0068>;<BASE>;<COMPATCAP>;<U0001F117> % PARENTHESIZED LATIN CAPITAL LETTER H <U210B> <S0068>;<BASE>;<FONTCAP>;<U210B> % SCRIPT CAPITAL H @@ -66104,7 +66131,6 @@ endif <U2C67> <S2C68>;<BASE>;<CAP>;<U2C67> % LATIN CAPITAL LETTER H WITH DESCENDER <U2C75> <S2C76>;<BASE>;<CAP>;<U2C75> % LATIN CAPITAL LETTER HALF H <UA726> <SA727>;<BASE>;<CAP>;<UA726> % LATIN CAPITAL LETTER HENG -<U0049> <S0069>;<BASE>;<CAP>;<U0049> % LATIN CAPITAL LETTER I <UFF29> <S0069>;<BASE>;<WIDECAP>;<UFF29> % 
FULLWIDTH LATIN CAPITAL LETTER I <U2160> <S0069>;<BASE>;<COMPATCAP>;<U2160> % ROMAN NUMERAL ONE <U0001F118> <S0069>;<BASE>;<COMPATCAP>;<U0001F118> % PARENTHESIZED LATIN CAPITAL LETTER I @@ -66149,7 +66175,6 @@ endif <UA7AE> <S026A>;<BASE>;<CAP>;<UA7AE> % LATIN CAPITAL LETTER SMALL CAPITAL I <U0197> <S0268>;<BASE>;<CAP>;<U0197> % LATIN CAPITAL LETTER I WITH STROKE <U0196> <S0269>;<BASE>;<CAP>;<U0196> % LATIN CAPITAL LETTER IOTA -<U004A> <S006A>;<BASE>;<CAP>;<U004A> % LATIN CAPITAL LETTER J <UFF2A> <S006A>;<BASE>;<WIDECAP>;<UFF2A> % FULLWIDTH LATIN CAPITAL LETTER J <U0001F119> <S006A>;<BASE>;<COMPATCAP>;<U0001F119> % PARENTHESIZED LATIN CAPITAL LETTER J <U0001D409> <S006A>;<BASE>;<FONTCAP>;<U0001D409> % MATHEMATICAL BOLD CAPITAL J @@ -66172,7 +66197,6 @@ endif <U0134> <S006A>;"<BASE><CIRCF>";"<CAP><MIN>";<U0134> % LATIN CAPITAL LETTER J WITH CIRCUMFLEX <U0248> <S0249>;<BASE>;<CAP>;<U0248> % LATIN CAPITAL LETTER J WITH STROKE <UA7B2> <S029D>;<BASE>;<CAP>;<UA7B2> % LATIN CAPITAL LETTER J WITH CROSSED-TAIL -<U004B> <S006B>;<BASE>;<CAP>;<U004B> % LATIN CAPITAL LETTER K <U212A> <S006B>;<BASE>;<CAP>;<U212A> % KELVIN SIGN <UFF2B> <S006B>;<BASE>;<WIDECAP>;<UFF2B> % FULLWIDTH LATIN CAPITAL LETTER K <U0001F11A> <S006B>;<BASE>;<COMPATCAP>;<U0001F11A> % PARENTHESIZED LATIN CAPITAL LETTER K @@ -66206,7 +66230,6 @@ endif <UA742> <SA743>;<BASE>;<CAP>;<UA742> % LATIN CAPITAL LETTER K WITH DIAGONAL STROKE <UA744> <SA745>;<BASE>;<CAP>;<UA744> % LATIN CAPITAL LETTER K WITH STROKE AND DIAGONAL STROKE <UA7B0> <S029E>;<BASE>;<CAP>;<UA7B0> % LATIN CAPITAL LETTER TURNED K -<U004C> <S006C>;<BASE>;<CAP>;<U004C> % LATIN CAPITAL LETTER L <UFF2C> <S006C>;<BASE>;<WIDECAP>;<UFF2C> % FULLWIDTH LATIN CAPITAL LETTER L <U216C> <S006C>;<BASE>;<COMPATCAP>;<U216C> % ROMAN NUMERAL FIFTY <U0001F11B> <S006C>;<BASE>;<COMPATCAP>;<U0001F11B> % PARENTHESIZED LATIN CAPITAL LETTER L @@ -66249,7 +66272,6 @@ endif <U2C62> <S026B>;<BASE>;<CAP>;<U2C62> % LATIN CAPITAL LETTER L WITH MIDDLE TILDE <UA7AD> 
<S026C>;<BASE>;<CAP>;<UA7AD> % LATIN CAPITAL LETTER L WITH BELT <UA780> <SA781>;<BASE>;<CAP>;<UA780> % LATIN CAPITAL LETTER TURNED L -<U004D> <S006D>;<BASE>;<CAP>;<U004D> % LATIN CAPITAL LETTER M <UFF2D> <S006D>;<BASE>;<WIDECAP>;<UFF2D> % FULLWIDTH LATIN CAPITAL LETTER M <U216F> <S006D>;<BASE>;<COMPATCAP>;<U216F> % ROMAN NUMERAL ONE THOUSAND <U0001F11C> <S006D>;<BASE>;<COMPATCAP>;<U0001F11C> % PARENTHESIZED LATIN CAPITAL LETTER M @@ -66275,7 +66297,6 @@ endif <U1E42> <S006D>;"<BASE><POINS>";"<CAP><MIN>";<U1E42> % LATIN CAPITAL LETTER M WITH DOT BELOW <U1DDF> <S1D0D>;<BASE>;<COMPAT>;<U1DDF> % COMBINING LATIN LETTER SMALL CAPITAL M <U2C6E> <S0271>;<BASE>;<CAP>;<U2C6E> % LATIN CAPITAL LETTER M WITH HOOK -<U004E> <S006E>;<BASE>;<CAP>;<U004E> % LATIN CAPITAL LETTER N <UFF2E> <S006E>;<BASE>;<WIDECAP>;<UFF2E> % FULLWIDTH LATIN CAPITAL LETTER N <U0001F11D> <S006E>;<BASE>;<COMPATCAP>;<U0001F11D> % PARENTHESIZED LATIN CAPITAL LETTER N <U2115> <S006E>;<BASE>;<FONTCAP>;<U2115> % DOUBLE-STRUCK CAPITAL N @@ -66312,7 +66333,6 @@ endif <U0220> <S019E>;<BASE>;<CAP>;<U0220> % LATIN CAPITAL LETTER N WITH LONG RIGHT LEG <UA790> <SA791>;<BASE>;<CAP>;<UA790> % LATIN CAPITAL LETTER N WITH DESCENDER <U014A> <S014B>;<BASE>;<CAP>;<U014A> % LATIN CAPITAL LETTER ENG -<U004F> <S006F>;<BASE>;<CAP>;<U004F> % LATIN CAPITAL LETTER O <UFF2F> <S006F>;<BASE>;<WIDECAP>;<UFF2F> % FULLWIDTH LATIN CAPITAL LETTER O <U0001F11E> <S006F>;<BASE>;<COMPATCAP>;<U0001F11E> % PARENTHESIZED LATIN CAPITAL LETTER O <U0001D40E> <S006F>;<BASE>;<FONTCAP>;<U0001D40E> % MATHEMATICAL BOLD CAPITAL O @@ -66377,7 +66397,6 @@ endif <UA74A> <SA74B>;<BASE>;<CAP>;<UA74A> % LATIN CAPITAL LETTER O WITH LONG STROKE OVERLAY <UA7B6> <SA7B7>;<BASE>;<CAP>;<UA7B6> % LATIN CAPITAL LETTER OMEGA <U0222> <S0223>;<BASE>;<CAP>;<U0222> % LATIN CAPITAL LETTER OU -<U0050> <S0070>;<BASE>;<CAP>;<U0050> % LATIN CAPITAL LETTER P <UFF30> <S0070>;<BASE>;<WIDECAP>;<UFF30> % FULLWIDTH LATIN CAPITAL LETTER P <U0001F11F> 
<S0070>;<BASE>;<COMPATCAP>;<U0001F11F> % PARENTHESIZED LATIN CAPITAL LETTER P <U2119> <S0070>;<BASE>;<FONTCAP>;<U2119> % DOUBLE-STRUCK CAPITAL P @@ -66405,7 +66424,6 @@ endif <U01A4> <S01A5>;<BASE>;<CAP>;<U01A4> % LATIN CAPITAL LETTER P WITH HOOK <UA752> <SA753>;<BASE>;<CAP>;<UA752> % LATIN CAPITAL LETTER P WITH FLOURISH <UA754> <SA755>;<BASE>;<CAP>;<UA754> % LATIN CAPITAL LETTER P WITH SQUIRREL TAIL -<U0051> <S0071>;<BASE>;<CAP>;<U0051> % LATIN CAPITAL LETTER Q <UFF31> <S0071>;<BASE>;<WIDECAP>;<UFF31> % FULLWIDTH LATIN CAPITAL LETTER Q <U0001F120> <S0071>;<BASE>;<COMPATCAP>;<U0001F120> % PARENTHESIZED LATIN CAPITAL LETTER Q <U211A> <S0071>;<BASE>;<FONTCAP>;<U211A> % DOUBLE-STRUCK CAPITAL Q @@ -66428,7 +66446,6 @@ endif <UA756> <SA757>;<BASE>;<CAP>;<UA756> % LATIN CAPITAL LETTER Q WITH STROKE THROUGH DESCENDER <UA758> <SA759>;<BASE>;<CAP>;<UA758> % LATIN CAPITAL LETTER Q WITH DIAGONAL STROKE <U024A> <S024B>;<BASE>;<CAP>;<U024A> % LATIN CAPITAL LETTER SMALL Q WITH HOOK TAIL -<U0052> <S0072>;<BASE>;<CAP>;<U0052> % LATIN CAPITAL LETTER R <UFF32> <S0072>;<BASE>;<WIDECAP>;<UFF32> % FULLWIDTH LATIN CAPITAL LETTER R <U0001F121> <S0072>;<BASE>;<COMPATCAP>;<U0001F121> % PARENTHESIZED LATIN CAPITAL LETTER R <U211B> <S0072>;<BASE>;<FONTCAP>;<U211B> % SCRIPT CAPITAL R @@ -66466,7 +66483,6 @@ endif <U024C> <S024D>;<BASE>;<CAP>;<U024C> % LATIN CAPITAL LETTER R WITH STROKE <U2C64> <S027D>;<BASE>;<CAP>;<U2C64> % LATIN CAPITAL LETTER R WITH TAIL <UA75C> <SA75D>;<BASE>;<CAP>;<UA75C> % LATIN CAPITAL LETTER RUM ROTUNDA -<U0053> <S0073>;<BASE>;<CAP>;<U0053> % LATIN CAPITAL LETTER S <UFF33> <S0073>;<BASE>;<WIDECAP>;<UFF33> % FULLWIDTH LATIN CAPITAL LETTER S <U0001F122> <S0073>;<BASE>;<COMPATCAP>;<U0001F122> % PARENTHESIZED LATIN CAPITAL LETTER S <U0001F12A> <S0073>;<BASE>;<COMPATCAP>;<U0001F12A> % TORTOISE SHELL BRACKETED LATIN CAPITAL LETTER S @@ -66502,7 +66518,6 @@ endif <U1E9E> "<S0073><S0073>";"<BASE><VRNT1><BASE>";"<COMPATCAP><COMPAT><COMPATCAP>";<U1E9E> % LATIN CAPITAL LETTER 
SHARP S <U2C7E> <S023F>;<BASE>;<CAP>;<U2C7E> % LATIN CAPITAL LETTER S WITH SWASH TAIL <U01A9> <S0283>;<BASE>;<CAP>;<U01A9> % LATIN CAPITAL LETTER ESH -<U0054> <S0074>;<BASE>;<CAP>;<U0054> % LATIN CAPITAL LETTER T <UFF34> <S0074>;<BASE>;<WIDECAP>;<UFF34> % FULLWIDTH LATIN CAPITAL LETTER T <U0001F123> <S0074>;<BASE>;<COMPATCAP>;<U0001F123> % PARENTHESIZED LATIN CAPITAL LETTER T <U0001D413> <S0074>;<BASE>;<FONTCAP>;<U0001D413> % MATHEMATICAL BOLD CAPITAL T @@ -66536,7 +66551,6 @@ endif <U01AC> <S01AD>;<BASE>;<CAP>;<U01AC> % LATIN CAPITAL LETTER T WITH HOOK <U01AE> <S0288>;<BASE>;<CAP>;<U01AE> % LATIN CAPITAL LETTER T WITH RETROFLEX HOOK <UA7B1> <S0287>;<BASE>;<CAP>;<UA7B1> % LATIN CAPITAL LETTER TURNED T -<U0055> <S0075>;<BASE>;<CAP>;<U0055> % LATIN CAPITAL LETTER U <UFF35> <S0075>;<BASE>;<WIDECAP>;<UFF35> % FULLWIDTH LATIN CAPITAL LETTER U <U0001F124> <S0075>;<BASE>;<COMPATCAP>;<U0001F124> % PARENTHESIZED LATIN CAPITAL LETTER U <U0001D414> <S0075>;<BASE>;<FONTCAP>;<U0001D414> % MATHEMATICAL BOLD CAPITAL U @@ -66591,7 +66605,6 @@ endif <UA78D> <S0265>;<BASE>;<CAP>;<UA78D> % LATIN CAPITAL LETTER TURNED H <U019C> <S026F>;<BASE>;<CAP>;<U019C> % LATIN CAPITAL LETTER TURNED M <U01B1> <S028A>;<BASE>;<CAP>;<U01B1> % LATIN CAPITAL LETTER UPSILON -<U0056> <S0076>;<BASE>;<CAP>;<U0056> % LATIN CAPITAL LETTER V <UFF36> <S0076>;<BASE>;<WIDECAP>;<UFF36> % FULLWIDTH LATIN CAPITAL LETTER V <U2164> <S0076>;<BASE>;<COMPATCAP>;<U2164> % ROMAN NUMERAL FIVE <U0001F125> <S0076>;<BASE>;<COMPATCAP>;<U0001F125> % PARENTHESIZED LATIN CAPITAL LETTER V @@ -66622,7 +66635,6 @@ endif <U01B2> <S028B>;<BASE>;<CAP>;<U01B2> % LATIN CAPITAL LETTER V WITH HOOK <U1EFC> <S1EFD>;<BASE>;<CAP>;<U1EFC> % LATIN CAPITAL LETTER MIDDLE-WELSH V <U0245> <S028C>;<BASE>;<CAP>;<U0245> % LATIN CAPITAL LETTER TURNED V -<U0057> <S0077>;<BASE>;<CAP>;<U0057> % LATIN CAPITAL LETTER W <UFF37> <S0077>;<BASE>;<WIDECAP>;<UFF37> % FULLWIDTH LATIN CAPITAL LETTER W <U0001F126> <S0077>;<BASE>;<COMPATCAP>;<U0001F126> % PARENTHESIZED 
LATIN CAPITAL LETTER W <U0001D416> <S0077>;<BASE>;<FONTCAP>;<U0001D416> % MATHEMATICAL BOLD CAPITAL W @@ -66649,7 +66661,6 @@ endif <U1E86> <S0077>;"<BASE><POINT>";"<CAP><MIN>";<U1E86> % LATIN CAPITAL LETTER W WITH DOT ABOVE <U1E88> <S0077>;"<BASE><POINS>";"<CAP><MIN>";<U1E88> % LATIN CAPITAL LETTER W WITH DOT BELOW <U2C72> <S2C73>;<BASE>;<CAP>;<U2C72> % LATIN CAPITAL LETTER W WITH HOOK -<U0058> <S0078>;<BASE>;<CAP>;<U0058> % LATIN CAPITAL LETTER X <UFF38> <S0078>;<BASE>;<WIDECAP>;<UFF38> % FULLWIDTH LATIN CAPITAL LETTER X <U2169> <S0078>;<BASE>;<COMPATCAP>;<U2169> % ROMAN NUMERAL TEN <U0001F127> <S0078>;<BASE>;<COMPATCAP>;<U0001F127> % PARENTHESIZED LATIN CAPITAL LETTER X @@ -66675,7 +66686,6 @@ endif <U216A> "<S0078><S0069>";"<BASE><BASE>";"<COMPATCAP><COMPATCAP>";<U216A> % ROMAN NUMERAL ELEVEN <U216B> "<S0078><S0069><S0069>";"<BASE><BASE><BASE>";"<COMPATCAP><COMPATCAP><COMPATCAP>";<U216B> % ROMAN NUMERAL TWELVE <UA7B3> <SAB53>;<BASE>;<CAP>;<UA7B3> % LATIN CAPITAL LETTER CHI -<U0059> <S0079>;<BASE>;<CAP>;<U0059> % LATIN CAPITAL LETTER Y <UFF39> <S0079>;<BASE>;<WIDECAP>;<UFF39> % FULLWIDTH LATIN CAPITAL LETTER Y <U0001F128> <S0079>;<BASE>;<COMPATCAP>;<U0001F128> % PARENTHESIZED LATIN CAPITAL LETTER Y <U0001D418> <S0079>;<BASE>;<FONTCAP>;<U0001D418> % MATHEMATICAL BOLD CAPITAL Y @@ -66708,7 +66718,6 @@ endif <U01B3> <S01B4>;<BASE>;<CAP>;<U01B3> % LATIN CAPITAL LETTER Y WITH HOOK <U1EFE> <S1EFF>;<BASE>;<CAP>;<U1EFE> % LATIN CAPITAL LETTER Y WITH LOOP <U021C> <S021D>;<BASE>;<CAP>;<U021C> % LATIN CAPITAL LETTER YOGH -<U005A> <S007A>;<BASE>;<CAP>;<U005A> % LATIN CAPITAL LETTER Z <UFF3A> <S007A>;<BASE>;<WIDECAP>;<UFF3A> % FULLWIDTH LATIN CAPITAL LETTER Z <U0001F129> <S007A>;<BASE>;<COMPATCAP>;<U0001F129> % PARENTHESIZED LATIN CAPITAL LETTER Z <U2124> <S007A>;<BASE>;<FONTCAP>;<U2124> % DOUBLE-STRUCK CAPITAL Z diff --git a/localedata/locales/tr_TR b/localedata/locales/tr_TR index f7c13ddf4b..7d5c9d878e 100644 --- a/localedata/locales/tr_TR +++ b/localedata/locales/tr_TR 
@@ -81,6 +81,8 @@ copy "iso14651_t1"
 %
 % The following rules implement the same order for glibc.
+% All of these collating symbols are used as primary weights
+% and cause equivalence class problems, see Bug 23437.
 collating-symbol <c-cedilla>
 collating-symbol <g-breve>
 collating-symbol <i-dotless>
@@ -111,8 +113,40 @@ reorder-after <AFTER-U>
 <U011F> <g-breve>;<BASE>;<MIN>;IGNORE % ğ
 <U011E> <g-breve>;<BASE>;<CAP>;IGNORE % Ğ
 <U0131> <i-dotless>;<BASE>;<MIN>;IGNORE % ı
+
+% tr_TR must copy the rational range definition here for CEO:
+% Implement rational range for [A-Z] in regular expressions.
+% We order the collation element order to support rational ranges.
+% Collation is unaffected because the 4-level weights remain the same.
+<U0041> <S0061>;<BASE>;<CAP>;<U0041> % LATIN CAPITAL LETTER A
+<U0042> <S0062>;<BASE>;<CAP>;<U0042> % LATIN CAPITAL LETTER B
+<U0043> <S0063>;<BASE>;<CAP>;<U0043> % LATIN CAPITAL LETTER C
+<U0044> <S0064>;<BASE>;<CAP>;<U0044> % LATIN CAPITAL LETTER D
+<U0045> <S0065>;<BASE>;<CAP>;<U0045> % LATIN CAPITAL LETTER E
+<U0046> <S0066>;<BASE>;<CAP>;<U0046> % LATIN CAPITAL LETTER F
+<U0047> <S0067>;<BASE>;<CAP>;<U0047> % LATIN CAPITAL LETTER G
+<U0048> <S0068>;<BASE>;<CAP>;<U0048> % LATIN CAPITAL LETTER H
+% Turkish sorting of I, but within rational range.
+% FIXME: 'I' is no longer in the equivalence class of i's.
 <U0049> <i-dotless>;<BASE>;<CAP>;IGNORE % I
-<U0069> <S0069>;<BASE>;<MIN>;IGNORE % i
+<U004A> <S006A>;<BASE>;<CAP>;<U004A> % LATIN CAPITAL LETTER J
+<U004B> <S006B>;<BASE>;<CAP>;<U004B> % LATIN CAPITAL LETTER K
+<U004C> <S006C>;<BASE>;<CAP>;<U004C> % LATIN CAPITAL LETTER L
+<U004D> <S006D>;<BASE>;<CAP>;<U004D> % LATIN CAPITAL LETTER M
+<U004E> <S006E>;<BASE>;<CAP>;<U004E> % LATIN CAPITAL LETTER N
+<U004F> <S006F>;<BASE>;<CAP>;<U004F> % LATIN CAPITAL LETTER O
+<U0050> <S0070>;<BASE>;<CAP>;<U0050> % LATIN CAPITAL LETTER P
+<U0051> <S0071>;<BASE>;<CAP>;<U0051> % LATIN CAPITAL LETTER Q
+<U0052> <S0072>;<BASE>;<CAP>;<U0052> % LATIN CAPITAL LETTER R
+<U0053> <S0073>;<BASE>;<CAP>;<U0053> % LATIN CAPITAL LETTER S
+<U0054> <S0074>;<BASE>;<CAP>;<U0054> % LATIN CAPITAL LETTER T
+<U0055> <S0075>;<BASE>;<CAP>;<U0055> % LATIN CAPITAL LETTER U
+<U0056> <S0076>;<BASE>;<CAP>;<U0056> % LATIN CAPITAL LETTER V
+<U0057> <S0077>;<BASE>;<CAP>;<U0057> % LATIN CAPITAL LETTER W
+<U0058> <S0078>;<BASE>;<CAP>;<U0058> % LATIN CAPITAL LETTER X
+<U0059> <S0079>;<BASE>;<CAP>;<U0059> % LATIN CAPITAL LETTER Y
+<U005A> <S007A>;<BASE>;<CAP>;<U005A> % LATIN CAPITAL LETTER Z
+
 <U0130> <S0069>;<BASE>;<CAP>;IGNORE % İ
 <U00F6> <o-diaresis>;<BASE>;<MIN>;IGNORE % ö
 <U00D6> <o-diaresis>;<BASE>;<CAP>;IGNORE % Ö
diff --git a/posix/bug-regex17.c b/posix/bug-regex17.c
index 893b9654b8..341fe4d827 100644
--- a/posix/bug-regex17.c
+++ b/posix/bug-regex17.c
@@ -46,14 +46,25 @@ struct
     { { 2, 10 }, { -1, -1 } } },
   /* Tests for bug 9697:
+     Look for a multibyte sequence in a range.  We pick the range based
+     on collation element order, since a-z is no longer valid since it's
+     a rational range.
+
+     We use U+FF53 FULLWIDTH LATIN SMALL LETTER S as the start of the
+     range, and U+33DC SQUARE SV as the end of the range.  These were
+     chosen by looking at collation element ordering and picking a range
+     in which the matching character was listed.
+
+     U+02E2 \xcb\xa2 MODIFIER LETTER SMALL S
      U+00DF \xc3\x9f LATIN SMALL LETTER SHARP S
      U+02DA \xcb\x9a RING ABOVE
-     U+02E2 \xcb\xa2 MODIFIER LETTER SMALL S */
-  { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
+
+     The U+02DA RING ABOVE is chosen because it's not in [ｓ-㏜].  */
+  { "[ｓ-㏜]|[^ｓ-㏜]", "\xcb\xa2", REG_EXTENDED, 2,
     { { 0, 2 }, { -1, -1 } } },
-  { "[a-z]", "\xc3\x9f", REG_EXTENDED, 2,
+  { "[ｓ-㏜]", "\xc3\x9f", REG_EXTENDED, 2,
     { { 0, 2 }, { -1, -1 } } },
-  { "[^a-z]", "\xcb\x9a", REG_EXTENDED, 2,
+  { "[^ｓ-㏜]", "\xcb\x9a", REG_EXTENDED, 2,
     { { 0, 2 }, { -1, -1 } } },
 };
diff --git a/posix/tst-fnmatch.input b/posix/tst-fnmatch.input
index dc2ca8d01a..2131d1e437 100644
--- a/posix/tst-fnmatch.input
+++ b/posix/tst-fnmatch.input
@@ -67,9 +67,11 @@
 # https://sourceware.org/bugzilla/show_bug.cgi?id=23393
 # https://sourceware.org/bugzilla/show_bug.cgi?id=23420
 #
-# No consensus exists on how best to handle the changes so the
-# iso14651_t1_common collation element order (CEO) has been changed to
-# deinterlace the a-z and A-Z regions.
+# The solution was to implement rational ranges by moving the collation
+# element order to fix this for [a-z], [A-Z], and [0-9].  Likewise the
+# upper and lower case letters are deinterlaced to allow for accented
+# ranges that don't include uppercase e.g. [a-ñ] should not include
+# any uppercase letters but may include a-z and more.
 #
 # With the deinterlacing commit ac3a3b4b0d561d776b60317d6a926050c8541655
 # could be reverted to re-test the correct non-interleaved expectations.
@@ -77,9 +79,7 @@
 # Please note that despite the region being deinterlaced, the ordering
 # of collation remains the same.  In glibc we implement CEO and because of
 # that we can reorder the elements to reorder ranges without impacting
-# collation which depends on weights.  The collation element ordering
-# could have been changed to include just a-z, A-Z, and 0-9 in three
-# distinct blocks, but this needs more discussion by the community.
+# collation which depends on weights.
 #
 B.6 004(C)
 C "!#%+,-./01234567889" "!#%+,-./01234567889" 0
@@ -477,9 +477,9 @@ C "-" "[Z-\\]]" NOMATCH
 # handling of ranges and the recognition of character (vs bytes).
 de_DE.ISO-8859-1 "a" "[a-z]" 0
 de_DE.ISO-8859-1 "z" "[a-z]" 0
-de_DE.ISO-8859-1 "ä" "[a-z]" 0
-de_DE.ISO-8859-1 "ö" "[a-z]" 0
-de_DE.ISO-8859-1 "ü" "[a-z]" 0
+de_DE.ISO-8859-1 "ä" "[a-z]" NOMATCH
+de_DE.ISO-8859-1 "ö" "[a-z]" NOMATCH
+de_DE.ISO-8859-1 "ü" "[a-z]" NOMATCH
 de_DE.ISO-8859-1 "A" "[a-z]" NOMATCH
 de_DE.ISO-8859-1 "Z" "[a-z]" NOMATCH
 de_DE.ISO-8859-1 "Ä" "[a-z]" NOMATCH
@@ -492,9 +492,9 @@ de_DE.ISO-8859-1 "
 de_DE.ISO-8859-1 "ü" "[A-Z]" NOMATCH
 de_DE.ISO-8859-1 "A" "[A-Z]" 0
 de_DE.ISO-8859-1 "Z" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ä" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ö" "[A-Z]" 0
-de_DE.ISO-8859-1 "Ü" "[A-Z]" 0
+de_DE.ISO-8859-1 "Ä" "[A-Z]" NOMATCH
+de_DE.ISO-8859-1 "Ö" "[A-Z]" NOMATCH
+de_DE.ISO-8859-1 "Ü" "[A-Z]" NOMATCH
 de_DE.ISO-8859-1 "a" "[[:lower:]]" 0
 de_DE.ISO-8859-1 "z" "[[:lower:]]" 0
 de_DE.ISO-8859-1 "ä" "[[:lower:]]" 0
@@ -566,22 +566,46 @@ de_DE.ISO-8859-1 "aa" "[[.a.]]a" 0
 de_DE.ISO-8859-1 "ba" "[[.a.]]a" NOMATCH
-# And with a multibyte character set.
+# And with a multibyte character set:
+# Ensure that Turkish reordering rules don't move 'i' out of a-z set,
+# or 'I' out of A-Z set.
+tr_TR.UTF-8 "i" "[a-z]" 0
+tr_TR.UTF-8 "ı" "[a-z]" NOMATCH
+tr_TR.UTF-8 "I" "[A-Z]" 0
+tr_TR.UTF-8 "İ" "[A-Z]" NOMATCH
+tr_TR.ISO-8859-9 "i" "[a-z]" 0
+tr_TR.ISO-8859-9 "I" "[A-Z]" 0
+# See bug 23437 for I not being in [=i=].
+tr_TR.UTF-8 "I" "[=i=]" NOMATCH
 en_US.UTF-8 "a" "[a-z]" 0
+# Test that <U00F1> LATIN SMALL LETTER N WITH TILDE is not in [a-z].
+en_US.UTF-8 "ñ" "[a-z]" NOMATCH
 en_US.UTF-8 "z" "[a-z]" 0
 en_US.UTF-8 "A" "[a-z]" NOMATCH
+# Test that <U00D1> LATIN CAPITAL LETTER N WITH TILDE is not in [a-z].
+en_US.UTF-8 "Ñ" "[a-z]" NOMATCH
 en_US.UTF-8 "Z" "[a-z]" NOMATCH
 en_US.UTF-8 "a" "[A-Z]" NOMATCH
+# Test that <U00F1> LATIN SMALL LETTER N WITH TILDE is not in [A-Z].
+en_US.UTF-8 "ñ" "[A-Z]" NOMATCH
 en_US.UTF-8 "z" "[A-Z]" NOMATCH
 en_US.UTF-8 "A" "[A-Z]" 0
+# Test that <U00D1> LATIN CAPITAL LETTER N WITH TILDE is not in [A-Z].
+en_US.UTF-8 "Ñ" "[A-Z]" NOMATCH
 en_US.UTF-8 "Z" "[A-Z]" 0
 en_US.UTF-8 "0" "[0-9]" 0
+# Test that <UFF10> FULLWIDTH DIGIT ZERO is not in [0-9].
+en_US.UTF-8 "０" "[0-9]" NOMATCH
+# Test that <U00BD> VULGAR FRACTION ONE HALF is not in [0-9].
+en_US.UTF-8 "½" "[0-9]" NOMATCH
 en_US.UTF-8 "9" "[0-9]" 0
+# Test that <UFF19> FULLWIDTH DIGIT NINE is not in [0-9].
+en_US.UTF-8 "９" "[0-9]" NOMATCH
 de_DE.UTF-8 "a" "[a-z]" 0
 de_DE.UTF-8 "z" "[a-z]" 0
-de_DE.UTF-8 "ä" "[a-z]" 0
-de_DE.UTF-8 "ö" "[a-z]" 0
-de_DE.UTF-8 "ü" "[a-z]" 0
+de_DE.UTF-8 "ä" "[a-z]" NOMATCH
+de_DE.UTF-8 "ö" "[a-z]" NOMATCH
+de_DE.UTF-8 "ü" "[a-z]" NOMATCH
 de_DE.UTF-8 "A" "[a-z]" NOMATCH
 de_DE.UTF-8 "Z" "[a-z]" NOMATCH
 de_DE.UTF-8 "Ä" "[a-z]" NOMATCH
@@ -594,9 +618,9 @@ de_DE.UTF-8 "ö" "[A-Z]" NOMATCH
 de_DE.UTF-8 "ü" "[A-Z]" NOMATCH
 de_DE.UTF-8 "A" "[A-Z]" 0
 de_DE.UTF-8 "Z" "[A-Z]" 0
-de_DE.UTF-8 "Ä" "[A-Z]" 0
-de_DE.UTF-8 "Ö" "[A-Z]" 0
-de_DE.UTF-8 "Ü" "[A-Z]" 0
+de_DE.UTF-8 "Ä" "[A-Z]" NOMATCH
+de_DE.UTF-8 "Ö" "[A-Z]" NOMATCH
+de_DE.UTF-8 "Ü" "[A-Z]" NOMATCH
 de_DE.UTF-8 "a" "[[:lower:]]" 0
 de_DE.UTF-8 "z" "[[:lower:]]" 0
 de_DE.UTF-8 "ä" "[[:lower:]]" 0
diff --git a/posix/tst-rxspencer.c b/posix/tst-rxspencer.c
index 9d597ef3e9..a3d836679a 100644
--- a/posix/tst-rxspencer.c
+++ b/posix/tst-rxspencer.c
@@ -155,7 +155,12 @@ mb_frob_pattern (const char *str, const char *letters)
           *dst++ = *src;
           continue;
         }
-      else if (!in_class && strchr (letters, *src))
+      /* We do a replacement, but not for the start of ranges, because
+         mb_replace will create invalid rational ranges.  For example
+         [á-z] is an invalid range because á comes after z, but [a-á]
+         is a valid range.  So we avoid replacing the start of ranges
+         to avoid this problem.  */
+      else if (!in_class && src[1] != '-' && strchr (letters, *src))
         dst = mb_replace (dst, *src);
       else
         {
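For illustration only, the guard added to mb_frob_pattern in the last hunk can be isolated as a small predicate. This is a minimal sketch and not the code from posix/tst-rxspencer.c: the helper name should_replace is hypothetical, and the in_class and letters parameters merely mirror the variables visible in the hunk.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not part of glibc): decide whether the character
   at SRC may be swapped for a multibyte replacement.  A character that
   is immediately followed by '-' may be the start of a range such as
   [a-z]; replacing it could produce a range whose start sorts after
   its end, so such characters are left alone.  */
static bool
should_replace (const char *src, const char *letters, bool in_class)
{
  if (*src == '\0')
    return false;   /* strchr would otherwise match the terminator.  */
  return !in_class && src[1] != '-' && strchr (letters, *src) != NULL;
}
```

With letters set to "a", the leading character of "ab" qualifies for replacement, while the leading character of "a-z" does not.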
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Florian Weimer @ 2018-07-23 15:11 UTC
To: Carlos O'Donell, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/20/2018 11:56 PM, Carlos O'Donell wrote:
> v2
> - Fixed tr_TR by duplicating A-Z rational range.
> - Fixed tst-rxspencer.
> - Fixed bug-regex17.
>
> Tell me how the new version does.

My tester likes it. tr_TR.ISO-8859-9 is now fixed. I added fnmatch
support, too, and initial results look good as well.

Thanks,
Florian
* Rational Ranges - Rafal and Mike's opinion? (Bug 23393).
From: Carlos O'Donell @ 2018-07-23 18:09 UTC
To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers, Rafal Luzynski

On 07/23/2018 11:10 AM, Florian Weimer wrote:
> On 07/20/2018 11:56 PM, Carlos O'Donell wrote:
>> v2
>> - Fixed tr_TR by duplicating A-Z rational range.
>> - Fixed tst-rxspencer.
>> - Fixed bug-regex17.
>>
>> Tell me how the new version does.
>
> My tester likes it. tr_TR.ISO-8859-9 is now fixed. I added fnmatch
> support, too, and initial results look good as well.

OK, so we have the capability to deploy rational ranges.

Florian,

Should we do so in 2.28? Avoiding all possible problems in the future
and making the ranges portable, rational, and safe from a security
perspective?

Rafal,

As localedata maintainer what is your opinion of changing the meaning
of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
which mean exactly the Latin character sequences you would expect
e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
[A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?

Mike,

Same question to you.

For historical context in gawk:
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html

For context from POSIX:
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
(see the section on "RE Bracket Expressions").

Support for rational ranges would make [a-z], [A-Z], [0-9] and other subranges
rational for all locales, and would no longer include mixed case, or accents.

I'd like to hear affirmatives from the localedata maintainers on this issue.

Cheers,
Carlos.
* Re: Rational Ranges - Rafal and Mike's opinion? (Bug 23393).
From: Rafal Luzynski @ 2018-07-24 20:45 UTC
To: GNU C Library, Mike Fabian, Florian Weimer, Joseph S. Myers, Carlos O'Donell

23.07.2018 20:09 Carlos O'Donell <carlos@redhat.com> wrote:
> [...]
> Rafal,
>
> As localedata maintainer what is your opinion of changing the meaning
> of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
> which mean exactly the Latin character sequences you would expect
> e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
> [A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?

Having discussed this off-list my answer is: I'm in favor of implementing
rational ranges treating [a-z], [A-Z], [0-9], and all their subsets as
code-point ranges. But I understand that this is possible only in 2.29.
Therefore for 2.28 I support this data-based solution.

Regards,

Rafal
* Re: Rational Ranges - Rafal and Mike's opinion? (Bug 23393).
From: Carlos O'Donell @ 2018-07-24 20:53 UTC
To: Rafal Luzynski, GNU C Library, Mike Fabian, Florian Weimer, Joseph S. Myers

On 07/24/2018 04:45 PM, Rafal Luzynski wrote:
> 23.07.2018 20:09 Carlos O'Donell <carlos@redhat.com> wrote:
>> [...]
>> Rafal,
>>
>> As localedata maintainer what is your opinion of changing the meaning
>> of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
>> which mean exactly the Latin character sequences you would expect
>> e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
>> [A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
>
> Having discussed this off-list my answer is: I'm in favor of implementing
> rational ranges treating [a-z], [A-Z], [0-9], and all their subsets as
> code-point ranges. But I understand that this is possible only in 2.29.
> Therefore for 2.28 I support this data-based solution.

From the perspective of the user of the library and the locales, the
rational ranges we implement will look as if they were code point ranges
for the ranges in question, e.g. a-z, A-Z, 0-9 and their subranges.

For 2.28 we will implement rational ranges for [a-z], [A-Z], and [0-9],
and all of their subsets via a data-only solution.

Just wanted to make it clear that all subsets will be treated as rational
ranges. It is only for other ranges like [!-~] (ASCII range) where we
will not have a rational range until we switch to making ranges operate on
code points. That will be a 2.29 optimization.

OK, I will prepare a patch to fix this.

Cheers,
Carlos.
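The "as-if code point" behaviour described in this message can be sketched in a few lines of C. This is only an illustration of the user-visible semantics: the 2.28 implementation discussed in this thread reaches the same result through collation element order, not through a comparison like this one, and in_codepoint_range is a hypothetical name.

```c
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical helper: a character is in the rational range
   [first-last] exactly when its code point lies between the code
   points of the two endpoints, independent of the locale.  */
static bool
in_codepoint_range (wchar_t ch, wchar_t first, wchar_t last)
{
  return first <= ch && ch <= last;
}
```

Under this model U+00F1 (n with tilde) and U+FF10 (fullwidth digit zero) fall outside [a-z] and [0-9] respectively, which is what the tests added to tst-fnmatch.input expect.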
* Re: Rational Ranges - Rafal and Mike's opinion? (Bug 23393).
From: Carlos O'Donell @ 2018-07-24 20:59 UTC
To: Rafal Luzynski, GNU C Library, Mike Fabian, Florian Weimer, Joseph S. Myers

On 07/24/2018 04:45 PM, Rafal Luzynski wrote:
> 23.07.2018 20:09 Carlos O'Donell <carlos@redhat.com> wrote:
>> [...]
>> Rafal,
>>
>> As localedata maintainer what is your opinion of changing the meaning
>> of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
>> which mean exactly the Latin character sequences you would expect
>> e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
>> [A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
>
> Having discussed this off-list my answer is: I'm in favor of implementing
> rational ranges treating [a-z], [A-Z], [0-9], and all their subsets as
> code-point ranges. But I understand that this is possible only in 2.29.
> Therefore for 2.28 I support this data-based solution.

I'll put together a final patch ASAP that provides:

* Deinterlace upper/lower
* Group a-z, A-Z, 0-9.
* NEWS entry for rational ranges.

Note: manual/stdio.texi also makes the mistake of saying [a-z] is lowercase
characters, so this will fix the manual bug with no change :-)

Cheers,
Carlos.
* Re: Rational Ranges - Rafal and Mike's opinion? (Bug 23393).
From: Mike FABIAN @ 2018-07-25 15:44 UTC
To: Carlos O'Donell
Cc: Florian Weimer, GNU C Library, Rich Felker, Zorro Lang, Joseph S. Myers, Rafal Luzynski

Carlos O'Donell <carlos@redhat.com> wrote:

> On 07/23/2018 11:10 AM, Florian Weimer wrote:
>> On 07/20/2018 11:56 PM, Carlos O'Donell wrote:
>>> v2
>>> - Fixed tr_TR by duplicating A-Z rational range.
>>> - Fixed tst-rxspencer.
>>> - Fixed bug-regex17.
>>>
>>> Tell me how the new version does.
>>
>> My tester likes it. tr_TR.ISO-8859-9 is now fixed. I added fnmatch
>> support, too, and initial results look good as well.
>
> OK, so we have the capability to deploy rational ranges.
>
> Florian,
>
> Should we do so in 2.28? Avoiding all possible problems in the future
> and making the ranges portable, rational, and safe from a security
> perspective?
>
> Rafal,
>
> As localedata maintainer what is your opinion of changing the meaning
> of [a-z], [A-Z], and [0-9] to be rational ranges for *all* locales
> which mean exactly the Latin character sequences you would expect
> e.g. {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z} for [a-z],
> [A-Z] likewise, and {0,1,2,3,4,5,6,7,8,9}?
>
> Mike,
>
> Same question to you.

I agree that rational ranges are much more useful. I cannot imagine
any use case for [a-z] matching aAbB...z and not Z. One never knows
what [a-z] would match if it uses the locale sort order; it is just too
confusing.

In the long run, I think implementing ranges by code points would be
the best solution and would make updates of the iso14651_t1_common file
easier, because we would then need to make fewer changes to the upstream
version of that file. But for 2.28 this cannot be done.

Therefore, I think the solution by Carlos is very good.

> For historical context in gawk:
> https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
>
> For context from POSIX:
> http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
> (see the section on "RE Bracket Expressions").
>
> Support for rational ranges would make [a-z], [A-Z], [0-9] and other subranges
> rational for all locales, and would no longer include mixed case, or accents.
>
> I'd like to hear affirmatives from the localedata maintainers on this issue.
>
> Cheers,
> Carlos.

--
Mike FABIAN <mfabian@redhat.com>
Lack of sleep is the enemy of good work.
* [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
From: Carlos O'Donell @ 2018-07-25 15:54 UTC
To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 1921 bytes --]

On 07/23/2018 11:10 AM, Florian Weimer wrote:
> On 07/20/2018 11:56 PM, Carlos O'Donell wrote:
>> v2
>> - Fixed tr_TR by duplicating A-Z rational range.
>> - Fixed tst-rxspencer.
>> - Fixed bug-regex17.
>>
>> Tell me how the new version does.
>
> My tester likes it. tr_TR.ISO-8859-9 is now fixed. I added fnmatch
> support, too, and initial results look good as well.

OK, here is v3.

~~~ NEWS ~~~
* The GNU C Library now uses rational ranges for regular expression matching
  of ranges that are within a-z, A-Z, and 0-9 for all locales. This means
  that the range [a-c] will no longer match accented letter a's and will only
  match exactly a, b, and c. Likewise [0-9] will only include the digits 0,
  1, 2, 3, 4, 5, 6, 7, 8, 9, and no other characters. Rational ranges have
  been implemented by several other GNU projects to provide straightforward
  rules for regular expression ranges and to make them portable across
  locales. The current rational ranges are implemented using collation
  element ordering, which may yield unexpected results if the range includes
  accented characters e.g. [a-ñ], since such a range will include a-z since
  ñ comes after the rational range in collation element order. In the future
  the library may implement full rational ranges covering all characters by
  using Unicode code point ordering which will make the sequences faster to
  match and more portable.
~~~

We have approval from Mike and Rafal, the two localedata subsystem maintainers.

This solution matches what you and Rich Felker both think is the correct
solution.

So for 2.28 we would use rational ranges for a-z, A-Z, and 0-9, until
we can implement code point ranges.

v3
- Merged lowercase/uppercase deinterlacing.
- Added NEWS entry.

Please run this through your checker, and ACK this for 2.28 and I'll commit.

Attaching it as swbz23393v3.tar.gz to avoid spam rejection.

Cheers,
Carlos.

[-- Attachment #2: swbz23393v3.tar.gz --]
[-- Type: application/gzip, Size: 50219 bytes --]
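To see why collation element order decides what a range matches, consider a toy model in C. The two order strings below are made-up three-letter stand-ins, not real iso14651_t1_common data, and position and ceo_range_match are hypothetical helpers: a character is treated as being in [first-last] when its position in the order lies between the positions of the two endpoints.

```c
#include <stdbool.h>
#include <string.h>

/* Toy collation element orders, assumed purely for illustration.  */
static const char interleaved[] = "aAbBcC";   /* upper/lower interlaced */
static const char deinterlaced[] = "abcABC";  /* lower block, then upper */

/* Position of CH in ORDER, or -1 if it is not listed.  */
static int
position (const char *order, char ch)
{
  const char *p = ch != '\0' ? strchr (order, ch) : NULL;
  return p != NULL ? (int) (p - order) : -1;
}

/* True if CH sorts between FIRST and LAST in the given element order.  */
static bool
ceo_range_match (const char *order, char first, char last, char ch)
{
  int lo = position (order, first);
  int hi = position (order, last);
  int pos = position (order, ch);
  return lo >= 0 && hi >= 0 && pos >= 0 && lo <= pos && pos <= hi;
}
```

With the interleaved order, [a-c] picks up A and B but not C, which mirrors the "aAbB...z and not Z" behaviour mentioned earlier in the thread; with the deinterlaced order the same range matches only a, b, and c.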
* Re: [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
From: Florian Weimer @ 2018-07-25 20:19 UTC
To: Carlos O'Donell, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/25/2018 05:54 PM, Carlos O'Donell wrote:
> Attaching it as swbz23393v3.tar.gz to avoid spam rejection.

Quick comment. The middle line here adds trailing whitespace:

-  { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
+
+     The U+02DA RING ABOVE is chosen because it's not in [ｓ-㏜]. */

Florian
* Re: [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
From: Carlos O'Donell @ 2018-07-25 20:25 UTC
To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/25/2018 04:18 PM, Florian Weimer wrote:
> On 07/25/2018 05:54 PM, Carlos O'Donell wrote:
>> Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
>
> Quick comment. The middle line here adds trailing whitespace:
>
> -  { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
> +
> +     The U+02DA RING ABOVE is chosen because it's not in [ｓ-㏜]. */

Thanks. I'll fix this with v4.

I had to fix the following locales:

    modified: localedata/locales/ar_SA
    modified: localedata/locales/km_KH
    modified: localedata/locales/lo_LA
    modified: localedata/locales/or_IN
    modified: localedata/locales/sl_SI
    modified: localedata/locales/th_TH

They all re-arranged ASCII character collation element ordering like tr_TR,
and so they needed manual fixing.

Could you please add these locales to your tester?

c.
* Re: [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
From: Florian Weimer @ 2018-07-25 20:31 UTC
To: Carlos O'Donell, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/25/2018 10:25 PM, Carlos O'Donell wrote:
> On 07/25/2018 04:18 PM, Florian Weimer wrote:
>> On 07/25/2018 05:54 PM, Carlos O'Donell wrote:
>>> Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
>>
>> Quick comment. The middle line here adds trailing whitespace:
>>
>> -  { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
>> +
>> +     The U+02DA RING ABOVE is chosen because it's not in [ｓ-㏜]. */
>
> Thanks. I'll fix this with v4.

I have verified that localedata/locales/iso14651_t1_common is just a
reordering (except for the new comments).

localedata/locales/tr_TR is more complicated, but looks like an
order-only change to me too.

> I had to fix the following locales:
>
>     modified: localedata/locales/ar_SA
>     modified: localedata/locales/km_KH
>     modified: localedata/locales/lo_LA
>     modified: localedata/locales/or_IN
>     modified: localedata/locales/sl_SI
>     modified: localedata/locales/th_TH

Do you have the actual locale names handy? localedata/SUPPORTED
contains charsets, but I'm not sure if the translation to locale names
is completely regular.

> They all re-arranged ASCII character collation element ordering like tr_TR,
> and so they needed manual fixing.
>
> Could you please add these locales to your tester?

I will try. I already have an xtests part, and these probably need to
go there as well.

Thanks,
Florian
* [PATCHv4] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
From: Carlos O'Donell @ 2018-07-25 20:57 UTC
To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 2016 bytes --]

On 07/25/2018 04:31 PM, Florian Weimer wrote:
> On 07/25/2018 10:25 PM, Carlos O'Donell wrote:
>> On 07/25/2018 04:18 PM, Florian Weimer wrote:
>>> On 07/25/2018 05:54 PM, Carlos O'Donell wrote:
>>>> Attaching it as swbz23393v3.tar.gz to avoid spam rejection.
>>>
>>> Quick comment. The middle line here adds trailing whitespace:
>>>
>>> -  { "[a-z]|[^a-z]", "\xcb\xa2", REG_EXTENDED, 2,
>>> +
>>> +     The U+02DA RING ABOVE is chosen because it's not in [ｓ-㏜]. */
>>
>> Thanks. I'll fix this with v4.
>
> I have verified that localedata/locales/iso14651_t1_common is just a
> reordering (except for the new comments).
>
> localedata/locales/tr_TR is more complicated, but looks like an
> order-only change for me too.
>
>> I had to fix the following locales:
>>
>>     modified: localedata/locales/ar_SA
>>     modified: localedata/locales/km_KH
>>     modified: localedata/locales/lo_LA
>>     modified: localedata/locales/or_IN
>>     modified: localedata/locales/sl_SI
>>     modified: localedata/locales/th_TH
>
> Do you have the actual locale names handy? localedata/SUPPORTED
> contains charsets, but I'm not sure if the translation to locale names
> is completely regular.

It is completely regular. In that ar_SA => ar_SA.UTF-8. And so forth.

>> They all re-arranged ASCII character collation element ordering like tr_TR,
>> and so they needed manual fixing.
>>
>> Could you please add these locales to your tester?
>
> I will try. I already have an xtests part, and these probably need to
> go there as well.

v4
- Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH.
- Added range checking for a-z, A-Z for all supported UTF-8 locales.

All of my testers are clean.

So the question is now:

Do we commit to rational ranges for a-z, A-Z, 0-9 ... for 2.28.

or

Do we just do the deinterlacing of iso14651_t1_common to fix en_US.UTF-8?

Cheers,
Carlos.

[-- Attachment #2: swbz23393v4.tar.gz --]
[-- Type: application/gzip, Size: 67108 bytes --]
* Re: [PATCHv4a] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393).
From: Carlos O'Donell @ 2018-07-26 2:34 UTC
To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 432 bytes --]

On 07/25/2018 04:57 PM, Carlos O'Donell wrote:
> v4
> - Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH.
> - Added range checking for a-z, A-Z for all supported UTF-8 locales.
>
> All of my testers are clean.

Attaching v4 on top of the current master.

This fixes all the locales.

All locales, even those with tailoring, have rational range support now.

If this passes your tests tomorrow I'm OK to put this into 2.28.

Cheers,
Carlos.

[-- Attachment #2: swbz23393v4a.tar.gz --]
[-- Type: application/gzip, Size: 29142 bytes --]
* Re: [PATCHv4a] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393). 2018-07-26 2:34 ` [PATCHv4a] " Carlos O'Donell @ 2018-07-26 14:51 ` Florian Weimer 2018-07-26 14:59 ` Carlos O'Donell 2018-07-28 1:12 ` [WIPv5] " Carlos O'Donell 0 siblings, 2 replies; 42+ messages in thread From: Florian Weimer @ 2018-07-26 14:51 UTC (permalink / raw) To: libc-alpha [-- Attachment #1: Type: text/plain, Size: 3967 bytes --] On 07/26/2018 04:34 AM, Carlos O'Donell wrote: > On 07/25/2018 04:57 PM, Carlos O'Donell wrote: >> v4 >> - Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH. >> - Added range checking for a-z, A-Z for all supported UTF-8 locales. >> >> All of my testers are clean. > > Attaching v4 on top of the current master. > > This fixes all the locales. I wrote another enumeration tester, this time covering all locales. It found these issues: az_AZ: U+000069 fails to match /[a-z]/ az_AZ: U+000049 fails to match /[A-Z]/ az_AZ.utf8: U+000069 fails to match /[a-z]/ az_AZ.utf8: U+000049 fails to match /[A-Z]/ crh_UA: U+000069 fails to match /[a-z]/ crh_UA: U+000049 fails to match /[A-Z]/ crh_UA.utf8: U+000069 fails to match /[a-z]/ crh_UA.utf8: U+000049 fails to match /[A-Z]/ ku_TR: U+000069 fails to match /[a-z]/ ku_TR: U+000049 fails to match /[A-Z]/ ku_TR.iso88599: U+000069 fails to match /[a-z]/ ku_TR.iso88599: U+000049 fails to match /[A-Z]/ ku_TR.utf8: U+000069 fails to match /[a-z]/ ku_TR.utf8: U+000049 fails to match /[A-Z]/ lv_LV: U+000079 fails to match /[a-z]/ lv_LV: U+000059 fails to match /[A-Z]/ lv_LV.iso885913: U+000079 fails to match /[a-z]/ lv_LV.iso885913: U+000059 fails to match /[A-Z]/ lv_LV.utf8: U+000079 fails to match /[a-z]/ lv_LV.utf8: U+000059 fails to match /[A-Z]/ shs_CA: U+0000E6 matches /[a-z]/ unexpectedly shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly slovene: U+00006A fails to match /[a-z]/ slovene: U+00006B fails to match /[a-z]/ slovene: 
U+00006C fails to match /[a-z]/ slovene: U+00006D fails to match /[a-z]/ slovene: U+00006E fails to match /[a-z]/ slovene: U+00006F fails to match /[a-z]/ slovenian: U+00006A fails to match /[a-z]/ slovenian: U+00006B fails to match /[a-z]/ slovenian: U+00006C fails to match /[a-z]/ slovenian: U+00006D fails to match /[a-z]/ slovenian: U+00006E fails to match /[a-z]/ slovenian: U+00006F fails to match /[a-z]/ sl_SI: U+00006A fails to match /[a-z]/ sl_SI: U+00006B fails to match /[a-z]/ sl_SI: U+00006C fails to match /[a-z]/ sl_SI: U+00006D fails to match /[a-z]/ sl_SI: U+00006E fails to match /[a-z]/ sl_SI: U+00006F fails to match /[a-z]/ sl_SI.iso88592: U+00006A fails to match /[a-z]/ sl_SI.iso88592: U+00006B fails to match /[a-z]/ sl_SI.iso88592: U+00006C fails to match /[a-z]/ sl_SI.iso88592: U+00006D fails to match /[a-z]/ sl_SI.iso88592: U+00006E fails to match /[a-z]/ sl_SI.iso88592: U+00006F fails to match /[a-z]/ sl_SI.utf8: U+00006A fails to match /[a-z]/ sl_SI.utf8: U+00006B fails to match /[a-z]/ sl_SI.utf8: U+00006C fails to match /[a-z]/ sl_SI.utf8: U+00006D fails to match /[a-z]/ sl_SI.utf8: U+00006E fails to match /[a-z]/ sl_SI.utf8: U+00006F fails to match /[a-z]/ sv_FI: U+000077 fails to match /[a-z]/ sv_FI: U+000057 fails to match /[A-Z]/ sv_FI@euro: U+000077 fails to match /[a-z]/ sv_FI@euro: U+000057 fails to match /[A-Z]/ sv_FI.iso88591: U+000077 fails to match /[a-z]/ sv_FI.iso88591: U+000057 fails to match /[A-Z]/ sv_FI.iso885915@euro: U+000077 fails to match /[a-z]/ sv_FI.iso885915@euro: U+000057 fails to match /[A-Z]/ sv_FI.utf8: U+000077 fails to match /[a-z]/ sv_FI.utf8: U+000057 fails to match /[A-Z]/ sv_SE: U+000077 fails to match /[a-z]/ sv_SE: U+000057 fails to match /[A-Z]/ sv_SE.iso88591: U+000077 fails to match /[a-z]/ sv_SE.iso88591: U+000057 fails to match /[A-Z]/ sv_SE.utf8: U+000077 fails to match /[a-z]/ sv_SE.utf8: U+000057 fails to match /[A-Z]/ swedish: U+000077 fails to match /[a-z]/ swedish: U+000057 fails to match 
/[A-Z]/
tt_RU: U+000069 fails to match /[a-z]/
tt_RU: U+000049 fails to match /[A-Z]/
tt_RU@iqtelif: U+000069 fails to match /[a-z]/
tt_RU@iqtelif: U+000049 fails to match /[A-Z]/
tt_RU.utf8: U+000069 fails to match /[a-z]/
tt_RU.utf8: U+000049 fails to match /[A-Z]/
tt_RU.utf8@iqtelif: U+000069 fails to match /[a-z]/
tt_RU.utf8@iqtelif: U+000049 fails to match /[A-Z]/

Thanks,
Florian

[-- Attachment #2: rational-ranges-1.cc --]
[-- Type: text/x-c++src, Size: 2944 bytes --]

#include <err.h>
#include <errno.h>
#include <limits.h>
#include <locale.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

#include <algorithm>
#include <string>
#include <vector>

static std::vector<std::string>
get_locales()
{
  FILE *fp = popen("locale -a", "r");
  if (fp == NULL)
    err(1, "running locale -a");
  std::vector<std::string> result;
  while (!feof(fp))
    {
      char *elem{};
      int ret = fscanf(fp, "%ms", &elem);
      if (ret == 1)
        {
          if (elem == nullptr)
            errx(1, "invalid fscanf result");
          result.emplace_back(elem);
          free(elem);
        }
      else if (ferror(fp))
        err(1, "fscanf failed");
    }
  int ret = pclose(fp);
  if (ret != 0)
    err(1, "locale -a failed with status %d", ret);
  std::sort(result.begin(), result.end());
  return result;
}

static void
test_regexp_range(const char *locale, const char *pattern,
                  std::pair<wchar_t, wchar_t> range)
{
  regex_t reg;
  {
    int ret = regcomp(&reg, pattern, REG_EXTENDED | REG_NOSUB);
    if (ret != 0)
      errx(1, "Cannot compile regular expression /%s/: %d", pattern, ret);
  }

  const wchar_t maximum_character = 0x10FFFF;
  const unsigned maximum_length = 5; /* With NUL. */
  for (wchar_t ch = 1; ch <= maximum_character; ++ch)
    {
      char uch[MB_LEN_MAX];
      mbstate_t ps{};
      {
        size_t ret = wcrtomb(uch, ch, &ps);
        if (ret == static_cast<size_t>(-1))
          {
            if (errno == EILSEQ)
              continue;
            err(1, "wcrtomb(0x%x)", static_cast<unsigned>(ch));
          }
        else if (ret == 0)
          continue; // Some anomaly.
        if (ret >= maximum_length)
          errx(1, "multi-byte length %zu at 0x%x exceeds %u",
               ret, ch, maximum_length);
        uch[ret] = '\0';
      }
      int ret = regexec(&reg, uch, 0, NULL, 0);
      if (ret != 0 && ret != REG_NOMATCH)
        errx(1, "regexec of /%s/ failed with code %d", pattern, ret);
      bool regex_matches = ret == 0;
      bool range_matches = range.first <= ch && ch <= range.second;
      if (regex_matches != range_matches)
        {
          if (regex_matches)
            printf("%s: U+%06X matches /%s/ unexpectedly\n",
                   locale, static_cast<unsigned>(ch), pattern);
          else
            printf("%s: U+%06X fails to match /%s/\n",
                   locale, static_cast<unsigned>(ch), pattern);
        }
    }
  regfree(&reg);
}

int
main()
{
  std::vector<std::string> locales{get_locales()};
  for (const auto &locale : locales)
    {
      if (setlocale(LC_ALL, locale.c_str()) == NULL)
        err(1, "Cannot set locale to %s", locale.c_str());
      test_regexp_range(locale.c_str(), "[0-9]", std::make_pair(L'0', L'9'));
      test_regexp_range(locale.c_str(), "[a-z]", std::make_pair(L'a', L'z'));
      test_regexp_range(locale.c_str(), "[A-Z]", std::make_pair(L'A', L'Z'));
    }
}

^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCHv4a] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393). 2018-07-26 14:51 ` Florian Weimer @ 2018-07-26 14:59 ` Carlos O'Donell 2018-07-28 1:12 ` [WIPv5] " Carlos O'Donell 1 sibling, 0 replies; 42+ messages in thread From: Carlos O'Donell @ 2018-07-26 14:59 UTC (permalink / raw) To: Florian Weimer, libc-alpha On 07/26/2018 10:50 AM, Florian Weimer wrote: > On 07/26/2018 04:34 AM, Carlos O'Donell wrote: >> On 07/25/2018 04:57 PM, Carlos O'Donell wrote: >>> v4 >>> - Fixed ar_SA, km_KH, lo_LA, or_IN, sl_SI, th_TH. >>> - Added range checking for a-z, A-Z for all supported UTF-8 locales. >>> >>> All of my testers are clean. >> >> Attaching v4 on top of the current master. >> >> This fixes all the locales. > > I wrote another enumeration tester, this time covering all locales. It found these issues: > > az_AZ: U+000069 fails to match /[a-z]/ > az_AZ: U+000049 fails to match /[A-Z]/ > az_AZ.utf8: U+000069 fails to match /[a-z]/ > az_AZ.utf8: U+000049 fails to match /[A-Z]/ See it. > crh_UA: U+000069 fails to match /[a-z]/ > crh_UA: U+000049 fails to match /[A-Z]/ > crh_UA.utf8: U+000069 fails to match /[a-z]/ > crh_UA.utf8: U+000049 fails to match /[A-Z]/ See it. > ku_TR: U+000069 fails to match /[a-z]/ > ku_TR: U+000049 fails to match /[A-Z]/ > ku_TR.iso88599: U+000069 fails to match /[a-z]/ > ku_TR.iso88599: U+000049 fails to match /[A-Z]/ > ku_TR.utf8: U+000069 fails to match /[a-z]/ > ku_TR.utf8: U+000049 fails to match /[A-Z]/ See it. > lv_LV: U+000079 fails to match /[a-z]/ > lv_LV: U+000059 fails to match /[A-Z]/ > lv_LV.iso885913: U+000079 fails to match /[a-z]/ > lv_LV.iso885913: U+000059 fails to match /[A-Z]/ > lv_LV.utf8: U+000079 fails to match /[a-z]/ > lv_LV.utf8: U+000059 fails to match /[A-Z]/ See it. > shs_CA: U+0000E6 matches /[a-z]/ unexpectedly > shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly > shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly > shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly Good catch. 
These were the ones I was hoping your finder would catch. > slovene: U+00006A fails to match /[a-z]/ > slovene: U+00006B fails to match /[a-z]/ > slovene: U+00006C fails to match /[a-z]/ > slovene: U+00006D fails to match /[a-z]/ > slovene: U+00006E fails to match /[a-z]/ > slovene: U+00006F fails to match /[a-z]/ This is an alias for sl_SI.ISO-8859-2 and we see it below. > slovenian: U+00006A fails to match /[a-z]/ > slovenian: U+00006B fails to match /[a-z]/ > slovenian: U+00006C fails to match /[a-z]/ > slovenian: U+00006D fails to match /[a-z]/ > slovenian: U+00006E fails to match /[a-z]/ > slovenian: U+00006F fails to match /[a-z]/ This is an alias for sl_SI.ISO-8859-2 and we see it below. > sl_SI: U+00006A fails to match /[a-z]/ > sl_SI: U+00006B fails to match /[a-z]/ > sl_SI: U+00006C fails to match /[a-z]/ > sl_SI: U+00006D fails to match /[a-z]/ > sl_SI: U+00006E fails to match /[a-z]/ > sl_SI: U+00006F fails to match /[a-z]/ See it. > sl_SI.iso88592: U+00006A fails to match /[a-z]/ > sl_SI.iso88592: U+00006B fails to match /[a-z]/ > sl_SI.iso88592: U+00006C fails to match /[a-z]/ > sl_SI.iso88592: U+00006D fails to match /[a-z]/ > sl_SI.iso88592: U+00006E fails to match /[a-z]/ > sl_SI.iso88592: U+00006F fails to match /[a-z]/ See it (aliased above twice). > sl_SI.utf8: U+00006A fails to match /[a-z]/ > sl_SI.utf8: U+00006B fails to match /[a-z]/ > sl_SI.utf8: U+00006C fails to match /[a-z]/ > sl_SI.utf8: U+00006D fails to match /[a-z]/ > sl_SI.utf8: U+00006E fails to match /[a-z]/ > sl_SI.utf8: U+00006F fails to match /[a-z]/ See it. > sv_FI: U+000077 fails to match /[a-z]/ > sv_FI: U+000057 fails to match /[A-Z]/ See it. > sv_FI@euro: U+000077 fails to match /[a-z]/ > sv_FI@euro: U+000057 fails to match /[A-Z]/ Same as sv_FI. > sv_FI.iso88591: U+000077 fails to match /[a-z]/ > sv_FI.iso88591: U+000057 fails to match /[A-Z]/ Likewise. 
> sv_FI.iso885915@euro: U+000077 fails to match /[a-z]/
> sv_FI.iso885915@euro: U+000057 fails to match /[A-Z]/

Likewise.

> sv_FI.utf8: U+000077 fails to match /[a-z]/
> sv_FI.utf8: U+000057 fails to match /[A-Z]/

Likewise.

> sv_SE: U+000077 fails to match /[a-z]/
> sv_SE: U+000057 fails to match /[A-Z]/

See it.

> sv_SE.iso88591: U+000077 fails to match /[a-z]/
> sv_SE.iso88591: U+000057 fails to match /[A-Z]/

Same as above.

> sv_SE.utf8: U+000077 fails to match /[a-z]/
> sv_SE.utf8: U+000057 fails to match /[A-Z]/

Likewise.

> swedish: U+000077 fails to match /[a-z]/
> swedish: U+000057 fails to match /[A-Z]/

Alias for sv_SE.

> tt_RU: U+000069 fails to match /[a-z]/
> tt_RU: U+000049 fails to match /[A-Z]/

See it.

> tt_RU@iqtelif: U+000069 fails to match /[a-z]/
> tt_RU@iqtelif: U+000049 fails to match /[A-Z]/

See it.

> tt_RU.utf8: U+000069 fails to match /[a-z]/
> tt_RU.utf8: U+000049 fails to match /[A-Z]/

See it.

> tt_RU.utf8@iqtelif: U+000069 fails to match /[a-z]/
> tt_RU.utf8@iqtelif: U+000049 fails to match /[A-Z]/

See it.

Thank you!

I increased tst-fnmatch.input coverage and I get this: Line #3699: Test #3548 (az_AZ.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #3751: Test #3600 (az_AZ.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #6819: Test #6668 (crh_UA.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #6871: Test #6720 (crh_UA.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #18675: Test #18524 (ku_TR.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #18727: Test #18576 (ku_TR.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #19835: Test #19684 (lv_LV.UTF-8): fnmatch ("[a-z]", "y", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #19887: Test #19736 (lv_LV.UTF-8): fnmatch ("[A-Z]", "Y", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #26684: Test #26533 (sl_SI.UTF-8): fnmatch ("[a-z]", "j", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #26685: Test #26534 (sl_SI.UTF-8): fnmatch ("[a-z]", "k", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #26686: Test #26535 (sl_SI.UTF-8): fnmatch ("[a-z]", "l", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #26687: Test #26536 (sl_SI.UTF-8): fnmatch ("[a-z]", "m", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #26688: Test #26537 (sl_SI.UTF-8): fnmatch ("[a-z]", "n", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #26689: Test #26538 (sl_SI.UTF-8): fnmatch ("[a-z]", "o", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #28049: Test #27898 (sv_FI.UTF-8): fnmatch ("[a-z]", "w", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #28101: Test #27950 (sv_FI.UTF-8): fnmatch ("[A-Z]", "W", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #28153: Test #28002 (sv_SE.UTF-8): fnmatch ("[a-z]", "w", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #28205: Test #28054 (sv_SE.UTF-8): fnmatch ("[A-Z]", "W", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #30427: Test #30276 (tt_RU.UTF-8): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 
0) *** Line #30479: Test #30328 (tt_RU.UTF-8): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #30531: Test #30380 (tt_RU.UTF-8@iqtelif): fnmatch ("[a-z]", "i", 0) = FNM_NOMATCH (FAIL, expected 0) *** Line #30583: Test #30432 (tt_RU.UTF-8@iqtelif): fnmatch ("[A-Z]", "I", 0) = FNM_NOMATCH (FAIL, expected 0) *** Which matches all the locales you saw failures in except for shs_CA, which is a real bug. I'll fix these up quickly. Cheers, Carlos. ^ permalink raw reply [flat|nested] 42+ messages in thread
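[Editorial sketch, not part of the thread.] The fnmatch results above check ASCII letters against [a-z] and [A-Z] one locale at a time. A minimal sketch of that kind of check follows, assuming nothing beyond POSIX fnmatch() and setlocale(). The helper names glob_matches and count_ascii_range_mismatches are hypothetical and are not taken from tst-fnmatch. Because it only probes ASCII characters, it would flag cases like az_AZ "i" but not the shs_CA U+00E6 case, which is consistent with the remark above that shs_CA did not show up in the fnmatch run.

```cpp
#include <fnmatch.h>
#include <locale.h>

#include <cassert>
#include <string>

// Returns true if `text` matches the glob `pattern` in the current locale.
static bool glob_matches(const std::string &pattern, const std::string &text)
{
  const int ret = fnmatch(pattern.c_str(), text.c_str(), 0);
  // With a well-formed pattern we expect only these two results.
  assert(ret == 0 || ret == FNM_NOMATCH);
  return ret == 0;
}

// Counts the ASCII characters (1..127) for which [a-z] or [A-Z] disagrees
// with a plain code point comparison.  Returns -1 if the locale cannot be
// selected (for example because it is not installed).
static int count_ascii_range_mismatches(const char *locale_name)
{
  if (setlocale(LC_ALL, locale_name) == nullptr)
    return -1;
  int mismatches = 0;
  for (int c = 1; c < 128; ++c)
    {
      const std::string text(1, static_cast<char>(c));
      const bool want_lower = c >= 'a' && c <= 'z';
      const bool want_upper = c >= 'A' && c <= 'Z';
      if (glob_matches("[a-z]", text) != want_lower)
        ++mismatches;
      if (glob_matches("[A-Z]", text) != want_upper)
        ++mismatches;
    }
  return mismatches;
}
```

Calling count_ascii_range_mismatches("az_AZ.UTF-8") on a system with that locale installed would report how many ASCII letters fall outside the expected ranges; a result of 0 is what the rational-range work aims for.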
* [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393). 2018-07-26 14:51 ` Florian Weimer 2018-07-26 14:59 ` Carlos O'Donell @ 2018-07-28 1:12 ` Carlos O'Donell 2018-07-30 17:40 ` Florian Weimer 1 sibling, 1 reply; 42+ messages in thread From: Carlos O'Donell @ 2018-07-28 1:12 UTC (permalink / raw) To: Florian Weimer, libc-alpha [-- Attachment #1: Type: text/plain, Size: 900 bytes --] On 07/26/2018 10:50 AM, Florian Weimer wrote: > shs_CA: U+0000E6 matches /[a-z]/ unexpectedly > shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly > shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly > shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly This is a WIP, because the number of tests now is too big to simply add them to tst-fnmatch.input, and so I'm writing a new tester tst-rational-ranges.c. I'm parsing SUPPORTED, expecting all of the locales to be built for testing, and then running through all the rational ranges to test inclusion of the required datums. How slow is your tester? Should I do what you do to test for the inclusion of characters that shouldn't be in the range? Or will that take too long? v5 - Add ~30k+ tests to tst-fnmatch.input. - Fix broken locales: - Fix shs_CA to not reorder-after for no reason. Could you run this through the tester please? Cheers, Carlos. [-- Attachment #2: swbz23393v5.tar.gz --] [-- Type: application/gzip, Size: 126966 bytes --] ^ permalink raw reply [flat|nested] 42+ messages in thread
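[Editorial sketch, not part of the thread.] The message above mentions parsing SUPPORTED to drive the tests. tst-rational-ranges.c itself is not shown in the thread, so the following is only an illustration of the parsing step. It assumes that entries in localedata/SUPPORTED have the form "locale/charmap" followed by a make-style continuation backslash, and that the header line is a variable assignment containing "=". The type SupportedEntry and the function parse_supported_line are hypothetical names.

```cpp
#include <cassert>
#include <string>

struct SupportedEntry
{
  std::string locale;   // e.g. "sl_SI.UTF-8"
  std::string charmap;  // e.g. "UTF-8"
};

// Parses one line of the assumed form "sl_SI.UTF-8/UTF-8 \".
// Returns false for blank lines, comments, and variable assignments.
static bool parse_supported_line(const std::string &line, SupportedEntry &out)
{
  std::string s = line;
  // Drop trailing whitespace and the continuation backslash.
  while (!s.empty()
         && (s.back() == '\\' || s.back() == ' ' || s.back() == '\t'
             || s.back() == '\r' || s.back() == '\n'))
    s.pop_back();
  const std::string::size_type start = s.find_first_not_of(" \t");
  if (start == std::string::npos)
    return false;  // blank line
  s.erase(0, start);
  if (s[0] == '#' || s.find('=') != std::string::npos)
    return false;  // comment or variable assignment
  const std::string::size_type slash = s.find('/');
  if (slash == std::string::npos || slash == 0 || slash + 1 == s.size())
    return false;  // not of the form locale/charmap
  out.locale = s.substr(0, slash);
  out.charmap = s.substr(slash + 1);
  assert(!out.locale.empty() && !out.charmap.empty());
  return true;
}
```

A driver could read the file line by line, keep the entries for which this returns true, and then run the range checks once per locale name.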
* Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393). 2018-07-28 1:12 ` [WIPv5] " Carlos O'Donell @ 2018-07-30 17:40 ` Florian Weimer 2018-07-30 17:45 ` Carlos O'Donell 2018-07-31 2:18 ` Carlos O'Donell 0 siblings, 2 replies; 42+ messages in thread From: Florian Weimer @ 2018-07-30 17:40 UTC (permalink / raw) To: Carlos O'Donell, libc-alpha On 07/28/2018 03:12 AM, Carlos O'Donell wrote: > On 07/26/2018 10:50 AM, Florian Weimer wrote: >> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly >> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly >> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly >> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly > This is a WIP, because the number of tests now is too big > to simply add them to tst-fnmatch.input, and so I'm writing > a new tester tst-rational-ranges.c. I'm parsing SUPPORTED, > expecting all of the locales to be built for testing, and > then running through all the rational ranges to test > inclusion of the required datums. Let me repeat my suggestion that we should initially fix the locales with the common collation order, where glibc 2.28 regresses. > How slow is your tester? Should I do what you do to test > for the inclusion of characters that shouldn't be in the > range? Or will that take too long? > > v5 > - Add ~30k+ tests to tst-fnmatch.input. > - Fix broken locales: > - Fix shs_CA to not reorder-after for no reason. > > Could you run this through the tester please? It fails installation for me: $ make localedata/install-locales DESTDIR=/tmp/locales sl_SI.UTF-8...locales/sl_SI:1230: order for `U00000061' already defined at locales/sl_SI:998 locales/sl_SI:1231: [error] symbol `S0062' not defined locales/sl_SI:1231: [error] symbol `BASE' not defined /bin/sh: line 17: 4148 Segmentation fault (core dumped) I18NPATH=. 
GCONV_PATH=/home/fweimer/src/gnu/glibc/build/iconvdata LC_ALL=C /home/fweimer/src/gnu/glibc/build/elf/ld-linux-x86-64.so.2 --library-path /home/fweimer/src/gnu/glibc/build:/home/fweimer/src/gnu/glibc/build/math:/home/fweimer/src/gnu/glibc/build/elf:/home/fweimer/src/gnu/glibc/build/dlfcn:/home/fweimer/src/gnu/glibc/build/nss:/home/fweimer/src/gnu/glibc/build/nis:/home/fweimer/src/gnu/glibc/build/rt:/home/fweimer/src/gnu/glibc/build/resolv:/home/fweimer/src/gnu/glibc/build/mathvec:/home/fweimer/src/gnu/glibc/build/support:/home/fweimer/src/gnu/glibc/build/crypt:/home/fweimer/src/gnu/glibc/build/nptl /home/fweimer/src/gnu/glibc/build/locale/localedef $flags --alias-file=../intl/locale.alias -i locales/$input -f charmaps/$charset --prefix=/tmp/locales $locale GDB says this: Core was generated by `/home/fweimer/src/gnu/glibc/build/elf/ld-linux-x86-64.so.2 --library-path /home'. Program terminated with signal SIGSEGV, Segmentation fault. #0 0x0000000000419234 in output_weight (pool=pool@entry=0x7ffdf1550ce0, collate=collate@entry=0x7fd5a8a03240, elem=elem@entry=0x7fd5a8a9b300) at programs/ld-collate.c:1912 1912 len += utf8_encode (&buf[len], (gdb) bt #0 0x0000000000419234 in output_weight (pool=pool@entry=0x7ffdf1550ce0, collate=collate@entry=0x7fd5a8a03240, elem=elem@entry=0x7fd5a8a9b300) at programs/ld-collate.c:1912 #1 0x000000000041dc4a in collate_output () at programs/ld-collate.c:2180 #2 0x000000000042709f in write_all_categories (definitions=0x7ffdf15513c0, charmap=charmap@entry=0x7fd5a71786a0, locname=0x7ffdf1552e33 "sl_SI.UTF-8", output_path=output_path@entry=0x7fd5a7178310 "/tmp/locales/usr/lib64/locale/sl_SI.utf8/") at programs/locfile.c:337 #3 0x0000000000402f69 in main (argc=<optimized out>, argv=0x7ffdf1551630) at programs/localedef.c:300 (gdb) l 1907 int i; 1908 1909 for (i = 0; i < elem->weights[cnt].cnt; ++i) 1910 /* Encode the weight value. We do nothing for IGNORE entries. 
*/ 1911 if (elem->weights[cnt].w[i] != NULL) 1912 len += utf8_encode (&buf[len], 1913 elem->weights[cnt].w[i]->mborder[cnt]); 1914 1915 /* And add the buffer content. */ 1916 obstack_1grow (pool, len); (gdb) print elem->weights[cnt].w[i]->mborder[cnt] Cannot access memory at address 0x0 (gdb) print elem->weights[cnt].w[i]->mborder $3 = (int *) 0x0 (gdb) Any idea what is going on? Thanks, Florian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393). 2018-07-30 17:40 ` Florian Weimer @ 2018-07-30 17:45 ` Carlos O'Donell 2018-07-30 17:54 ` Florian Weimer 2018-07-31 2:18 ` Carlos O'Donell 1 sibling, 1 reply; 42+ messages in thread From: Carlos O'Donell @ 2018-07-30 17:45 UTC (permalink / raw) To: Florian Weimer, libc-alpha

On 07/30/2018 01:39 PM, Florian Weimer wrote:
> On 07/28/2018 03:12 AM, Carlos O'Donell wrote:
>> On 07/26/2018 10:50 AM, Florian Weimer wrote:
>>> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly
>>> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly
>>> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly
>>> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly
>
>> This is a WIP, because the number of tests now is too big
>> to simply add them to tst-fnmatch.input, and so I'm writing
>> a new tester tst-rational-ranges.c. I'm parsing SUPPORTED,
>> expecting all of the locales to be built for testing, and
>> then running through all the rational ranges to test
>> inclusion of the required datums.
>
> Let me repeat my suggestion that we should initially fix the locales
> with the common collation order, where glibc 2.28 regresses.

I do not think it is appropriate to release rational range support on only a subset of the SUPPORTED set of locales. Either we support it on all SUPPORTED locales or we work until we are ready.

At present glibc 2.28 does not regress because of commit 7cd7d36f1feb3ccacf476e909b115b45cdd46e77 to deinterlace lower and uppercase.

In glibc 2.28 we simply have ~2500 characters in the range of a-z, and in 2.27 we had ~250; it's still a large set of non-ASCII characters accepted by the range, all because we caught up to Unicode 9.0.0 with the ISO 14651 collation update (and will soon be updated to Unicode 10.0.0 with the next release, and probably always lagging a bit).

I don't see an urgent need to get rational range support into 2.28.
I was happy to get it in earlier, but now with deeper testing showing that not all locales are working correctly, I'm not happy to see this go out the door. I think it will be ready very shortly, and we can check it in immediately into 2.29, and then continue our work on code point ranges as the next step, which will require even more testing, and internal API cleanup. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393). 2018-07-30 17:45 ` Carlos O'Donell @ 2018-07-30 17:54 ` Florian Weimer 2018-07-30 18:26 ` Carlos O'Donell 0 siblings, 1 reply; 42+ messages in thread From: Florian Weimer @ 2018-07-30 17:54 UTC (permalink / raw) To: Carlos O'Donell, libc-alpha On 07/30/2018 07:45 PM, Carlos O'Donell wrote: > On 07/30/2018 01:39 PM, Florian Weimer wrote: >> On 07/28/2018 03:12 AM, Carlos O'Donell wrote: >>> On 07/26/2018 10:50 AM, Florian Weimer wrote: >>>> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly >>>> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly >>>> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly >>>> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly >> >>> This is a WIP, because the number of tests now is too big >>> to simply add them to tst-fnmatch.input, and so I'm writing >>> a new tester tst-rational-ranges.c. I'm parsing SUPPORTED, >>> expecting all of the locales to be built for testing, and >>> then running through all the rational ranges to test >>> inclusion of the required datums. >> >> Let me repeat my suggestion that we should initially fix the locales >> with the common collation order, where glibc 2.28 regresses. > > I do not think it is appropriate to release rational range support on > only a subset of the SUPPORTED set of locales. Either we support it on > all SUPPORTED locales or we work until we are ready. > > At present glibc 2.28 does not regress because of commit > 7cd7d36f1feb3ccacf476e909b115b45cdd46e77 to deinterlace lower and > uppercase. > > In glibc 2.28 we simply have ~2500 characters in the range of a-z, > and in 2.27 we had ~250, it's still a large set of non-ASCII characters > accepted by the range, all because we caught up to Unicode 9.0.0 with > the ISO 14651 collation update (and will soon updated to Unicode 10.0.0 > with the next release, and probably always lagging a bit). Ahh. So it's more complex and a regression longer in the making. 
> I don't see an urgent need to get rational range support into 2.28. > I was happy to get it in earlier, but now with deeper testing showing > that not all locales are working correctly, I'm not happy to see this > go out the door. I think it will be ready very shortly, and we can check > it in immediately into 2.29, and then continue our work on code point > ranges as the next step, which will require even more testing, and > internal API cleanup. Sounds reasonable. Thanks, Florian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393). 2018-07-30 17:54 ` Florian Weimer @ 2018-07-30 18:26 ` Carlos O'Donell 2018-07-30 18:34 ` Florian Weimer 0 siblings, 1 reply; 42+ messages in thread From: Carlos O'Donell @ 2018-07-30 18:26 UTC (permalink / raw) To: Florian Weimer, libc-alpha On 07/30/2018 01:54 PM, Florian Weimer wrote: > On 07/30/2018 07:45 PM, Carlos O'Donell wrote: >> On 07/30/2018 01:39 PM, Florian Weimer wrote: >>> On 07/28/2018 03:12 AM, Carlos O'Donell wrote: >>>> On 07/26/2018 10:50 AM, Florian Weimer wrote: >>>>> shs_CA: U+0000E6 matches /[a-z]/ unexpectedly >>>>> shs_CA: U+0000C6 matches /[A-Z]/ unexpectedly >>>>> shs_CA.utf8: U+0000E6 matches /[a-z]/ unexpectedly >>>>> shs_CA.utf8: U+0000C6 matches /[A-Z]/ unexpectedly >>> >>>> This is a WIP, because the number of tests now is too big >>>> to simply add them to tst-fnmatch.input, and so I'm writing >>>> a new tester tst-rational-ranges.c. I'm parsing SUPPORTED, >>>> expecting all of the locales to be built for testing, and >>>> then running through all the rational ranges to test >>>> inclusion of the required datums. >>> >>> Let me repeat my suggestion that we should initially fix the locales >>> with the common collation order, where glibc 2.28 regresses. >> >> I do not think it is appropriate to release rational range support on >> only a subset of the SUPPORTED set of locales. Either we support it on >> all SUPPORTED locales or we work until we are ready. >> >> At present glibc 2.28 does not regress because of commit >> 7cd7d36f1feb3ccacf476e909b115b45cdd46e77 to deinterlace lower and >> uppercase. >> >> In glibc 2.28 we simply have ~2500 characters in the range of a-z, >> and in 2.27 we had ~250, it's still a large set of non-ASCII characters >> accepted by the range, all because we caught up to Unicode 9.0.0 with >> the ISO 14651 collation update (and will soon updated to Unicode 10.0.0 >> with the next release, and probably always lagging a bit). > > Ahh. 
So it's more complex and a regression longer in the making. I'm worried I don't quite follow your statement of "longer in the making," but let me summarize what I think you wrote, and tell me if I have it right. The regression, from the perspective of en_US, is that [a-z] in master accepts uppercase ASCII characters, and this breaks user expectations. This is the only regression I'm considering serious enough to block the release for and we've fixed it for now. The regression which you say is "longer in the making" is that at some point in the past the collation data for en_US contained only ASCII ranges for a-z, A-Z, and 0-9. Then at some point in the past the ranges, particularly those from a-z, and A-Z began accepting non-ASCII characters. Thus the regression, from your perspective, happened far in the past. As far as I can tell the regression has existed since the first import for en_US which copied LC_COLLATE from en_DK (showing en_DK): ~~~ f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 967) <A> <A>;<NONE>;<CAPITAL>;IGNORE f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 968) <a> <A>;<NONE>;<SMALL>;IGNORE ... f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1546) <Z> <Z>;<NONE>;<CAPITAL>;IGNORE f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1547) <z> <Z>;<NONE>;<SMALL>;IGNORE f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1548) <Z'> <Z>;<ACUTE>;<CAPITAL>;IGNORE ~~~ Is this what you mean by "longer in the making?" I expect that en_US at some point along the way is switched to use the iso14651_t1 data, and so gains non-interleaved a-z/A-Z CEO, but it's hard to tell exactly if CEO was fully functional, if fnmatch worked as expected, etc. Either way this is all a poorly understood and structured solution at this point, and I hope that in 1 or 2 releases we go from "unusable interface" to "rational ranges (data)" to "full rational ranges (code point ranges)" and end up with a sensible portable solution. 
>> I don't see an urgent need to get rational range support into 2.28. >> I was happy to get it in earlier, but now with deeper testing showing >> that not all locales are working correctly, I'm not happy to see this >> go out the door. I think it will be ready very shortly, and we can check >> it in immediately into 2.29, and then continue our work on code point >> ranges as the next step, which will require even more testing, and >> internal API cleanup. > > Sounds reasonable. That sounds great. I will continue to update this patch set and get some independent checking from your scripts, and my own testing. I also need to add collation tests for all the locales I touch to ensure that the reordering is just that, and that it doesn't materially change the collation sequence (if it does it's a bug). This all adds more coverage to the SUPPORTED set of languages which is a positive thing. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 42+ messages in thread
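[Editorial sketch, not part of the thread.] The discussion above turns on the fact that a range such as [a-z] is resolved by collation element order (CEO), not by code point. The toy model below illustrates that with two hypothetical ASCII-only orders: an interleaved one in the style of the 1997 en_DK excerpt quoted above (A, a, ..., Z, z), and one possible deinterlaced order. It is a simplification under stated assumptions: real collation tables carry many more elements and weights, and the function names here are invented for the example.

```cpp
#include <cassert>
#include <string>

// Builds "AaBb...Zz": upper and lower case interleaved.
static std::string interleaved_order()
{
  std::string order;
  for (char c = 'A'; c <= 'Z'; ++c)
    {
      order += c;
      order += static_cast<char>(c - 'A' + 'a');
    }
  return order;
}

// Builds "ab...zAB...Z": one possible deinterlaced order.
static std::string deinterlaced_order()
{
  std::string order;
  for (char c = 'a'; c <= 'z'; ++c)
    order += c;
  for (char c = 'A'; c <= 'Z'; ++c)
    order += c;
  return order;
}

// A character is in [lo-hi] if its position in the collation element
// order lies between the positions of the two endpoints.
static bool in_ceo_range(const std::string &order, char lo, char hi, char c)
{
  const std::string::size_type pos_lo = order.find(lo);
  const std::string::size_type pos_hi = order.find(hi);
  // The endpoints must exist in the toy table.
  assert(pos_lo != std::string::npos && pos_hi != std::string::npos);
  const std::string::size_type pos_c = order.find(c);
  return pos_c != std::string::npos && pos_lo <= pos_c && pos_c <= pos_hi;
}
```

With the interleaved order, in_ceo_range(interleaved_order(), 'a', 'z', 'B') is true, so [a-z] picks up uppercase letters. With the deinterlaced order the same query is false, which is the effect the deinterlacing commit is after.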
* Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393). 2018-07-30 18:26 ` Carlos O'Donell @ 2018-07-30 18:34 ` Florian Weimer 0 siblings, 0 replies; 42+ messages in thread From: Florian Weimer @ 2018-07-30 18:34 UTC (permalink / raw) To: Carlos O'Donell, libc-alpha On 07/30/2018 08:25 PM, Carlos O'Donell wrote: > As far as I can tell the regression has existed since the first import > for en_US which copied LC_COLLATE from en_DK (showing en_DK): > ~~~ > f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 967) <A> <A>;<NONE>;<CAPITAL>;IGNORE > f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 968) <a> <A>;<NONE>;<SMALL>;IGNORE > ... > f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1546) <Z> <Z>;<NONE>;<CAPITAL>;IGNORE > f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1547) <z> <Z>;<NONE>;<SMALL>;IGNORE > f5f52655ceb (Ulrich Drepper 1997-03-05 00:35:19 +0000 1548) <Z'> <Z>;<ACUTE>;<CAPITAL>;IGNORE > ~~~ > Is this what you mean by "longer in the making?" Yes, that's what I meant. I didn't check whether it went back to 2.17, 2.12, or even earlier. Thanks, Florian ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [WIPv5] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393). 2018-07-30 17:40 ` Florian Weimer 2018-07-30 17:45 ` Carlos O'Donell @ 2018-07-31 2:18 ` Carlos O'Donell 1 sibling, 0 replies; 42+ messages in thread From: Carlos O'Donell @ 2018-07-31 2:18 UTC (permalink / raw) To: Florian Weimer, libc-alpha On 07/30/2018 01:39 PM, Florian Weimer wrote: > It fails installation for me: I'm so sorry to waste your time like this. I apparently failed to test sl_SI. > $ make localedata/install-locales DESTDIR=/tmp/locales > sl_SI.UTF-8...locales/sl_SI:1230: order for `U00000061' already defined at locales/sl_SI:998 > locales/sl_SI:1231: [error] symbol `S0062' not defined > locales/sl_SI:1231: [error] symbol `BASE' not defined ... this is a cascading set of errors. > (gdb) print elem->weights[cnt].w[i]->mborder[cnt] > Cannot access memory at address 0x0 > (gdb) print elem->weights[cnt].w[i]->mborder > $3 = (int *) 0x0 > (gdb) > > Any idea what is going on? The parser should have stopped at the first error IMO, going any further just results in problems. It's very hard to rollback the state of the parser and data structures if there is an error in the source files. It should just have stopped at the duplicate U0061 definition. I'm testing a v6 with the sl_SI fixes, and a new test case. -- Cheers, Carlos. ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393). 2018-07-25 20:25 ` Carlos O'Donell 2018-07-25 20:31 ` Florian Weimer @ 2018-07-25 21:06 ` Rafal Luzynski 2018-07-25 21:12 ` Carlos O'Donell 1 sibling, 1 reply; 42+ messages in thread From: Rafal Luzynski @ 2018-07-25 21:06 UTC (permalink / raw) To: GNU C Library, Mike Fabian, Florian Weimer, Joseph S. Myers, Carlos O'Donell 25.07.2018 22:25 Carlos O'Donell <carlos@redhat.com> wrote: > [...] > I had to fix the following locales: > > modified: localedata/locales/ar_SA > modified: localedata/locales/km_KH > modified: localedata/locales/lo_LA > modified: localedata/locales/or_IN > modified: localedata/locales/sl_SI > modified: localedata/locales/th_TH > > They all re-arranged ASCII character collation element ordering like tr_TR, > and so they needed manual fixing. Please check bg_BG. It also has a large reorder: puts all Cyrillic characters before Latin. (However, this may not be relevant at all.) Regards, Rafal ^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCHv3] Expected behaviour for a-z, A-Z, and 0-9 (Bug 23393). 2018-07-25 21:06 ` [PATCHv3] " Rafal Luzynski @ 2018-07-25 21:12 ` Carlos O'Donell 0 siblings, 0 replies; 42+ messages in thread From: Carlos O'Donell @ 2018-07-25 21:12 UTC (permalink / raw) To: Rafal Luzynski, GNU C Library, Mike Fabian, Florian Weimer, Joseph S. Myers On 07/25/2018 05:06 PM, Rafal Luzynski wrote: > 25.07.2018 22:25 Carlos O'Donell <carlos@redhat.com> wrote: >> [...] >> I had to fix the following locales: >> >> modified: localedata/locales/ar_SA >> modified: localedata/locales/km_KH >> modified: localedata/locales/lo_LA >> modified: localedata/locales/or_IN >> modified: localedata/locales/sl_SI >> modified: localedata/locales/th_TH >> >> They all re-arranged ASCII character collation element ordering like tr_TR, >> and so they needed manual fixing. > > Please check bg_BG. It also has a large reorder: puts all Cyrillic characters > before Latin. (However, this may not be relevant at all.) Right, that won't affect the rational range for ASCII. The new tst-fnmatch.input has this: 886 bg_BG.UTF-8 "a" "[a-z]" 0 887 bg_BG.UTF-8 "z" "[a-z]" 0 888 bg_BG.UTF-8 "A" "[a-z]" NOMATCH 889 bg_BG.UTF-8 "Z" "[a-z]" NOMATCH 890 bg_BG.UTF-8 "A" "[A-Z]" 0 891 bg_BG.UTF-8 "Z" "[A-Z]" 0 892 bg_BG.UTF-8 "a" "[A-Z]" NOMATCH 893 bg_BG.UTF-8 "z" "[A-Z]" NOMATCH Which tests the range extremes, and it passes. It doesn't reorder any actual LATIN characters and so it's safe. Cheers, Carlos. ^ permalink raw reply [flat|nested] 42+ messages in thread
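[Editorial sketch, not part of the thread.] The eight bg_BG.UTF-8 entries quoted above follow a fixed pattern: both range endpoints must match their own range and must not match the other one. A generator for such lines could look like the following. This is a hypothetical helper, not the script used for the patch; it assumes the field layout seen in the excerpt (locale, quoted string, quoted pattern, expected result) and uses single spaces as separators, which may differ from the real file.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Produces the eight range-extreme test lines for one locale, in the
// same order as the bg_BG.UTF-8 excerpt.
static std::vector<std::string> range_extreme_tests(const std::string &locale)
{
  assert(!locale.empty());
  struct Case
  {
    const char *text;
    const char *pattern;
    const char *expected;
  };
  static const Case cases[] = {
    {"a", "[a-z]", "0"},       {"z", "[a-z]", "0"},
    {"A", "[a-z]", "NOMATCH"}, {"Z", "[a-z]", "NOMATCH"},
    {"A", "[A-Z]", "0"},       {"Z", "[A-Z]", "0"},
    {"a", "[A-Z]", "NOMATCH"}, {"z", "[A-Z]", "NOMATCH"},
  };
  std::vector<std::string> lines;
  for (const Case &c : cases)
    lines.push_back(locale + " \"" + c.text + "\" \"" + c.pattern + "\" "
                    + c.expected);
  return lines;
}
```

Running this once per UTF-8 locale would give endpoint coverage only; locales that reorder letters in the middle of the alphabet need every letter tested, not just the extremes.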
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Carlos O'Donell @ 2018-07-25 21:35 UTC
To: GNU C Library, Florian Weimer, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/19/2018 03:43 PM, Carlos O'Donell wrote:
> In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
> the collation data to harmonize with the new version of ISO 14651
> which is derived from Unicode 9.0.0. This collation update brought
> with it some changes to locales which were not desirable by some
> users, in particular it altered the meaning of the
> locale-dependent-range regular expression, namely [a-z] and [A-Z], and
> for en_US it caused uppercase letters to be matched by [a-z] for the
> first time. The matching of uppercase letters by [a-z] is something
> which is already known to users of other locales which have this
> property, but this change could cause significant problems to en_US
> and other similar locales that had never had this change before.
> Whether this behaviour is desirable or not is contentious and GNU Awk
> has this to say on the topic:
> https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
> While the POSIX standard also has this further to say: "RE Bracket
> Expression":
> http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html
> "The current standard leaves unspecified the behavior of a range
> expression outside the POSIX locale. ... As noted above, efforts were
> made to resolve the differences, but no solution has been found that
> would be specific enough to allow for portable software while not
> invalidating existing implementations."
> In glibc we implement the requirement of ISO POSIX-2:1993 and use
> collation element order (CEO) to construct the range expression, the
> API internally is __collseq_table_lookup(). The fact that we use CEO
> and also have 4-level weights on each collation rule means that we can
> in practice reorder the collation rules in iso14651_t1_common (the new
> data) to provide consistent range expression resolution *and* the
> weights should maintain the expected total order. Therefore this
> patch does three things:
>
> * Reorder the collation rules for the LATIN script in
> iso14651_t1_common to deinterlace uppercase and lowercase letters in
> the collation element orders.
>
> * Adds new test data en_US.UTF-8.in for sort-test.sh which exercises
> strcoll* and strxfrm* and ensures the ISO 14651 collation remains.
>
> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
> exercise that [a-z] does not match A or Z.
>
> The reordering of the ISO 14651 data is done in an entirely mechanical
> fashion using the following program attached to the bug:
> https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c28
>
> It is up for discussion if the iso14651_t1_common data should be
> refined further to have 3 very tight collation element ranges that
> include only a-z, A-Z, and 0-9, which would implement the solution
> sought after in:
> https://sourceware.org/bugzilla/show_bug.cgi?id=23393#c12
>
> No regressions on x86_64.
> Verified that removal of the iso14651_t1_common change causes tst-fnmatch
> to regress with:
> 422: fnmatch ("[a-z]", "A", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
> ...
> 425: fnmatch ("[A-Z]", "z", 0) = 0 (FAIL, expected FNM_NOMATCH) ***
> ---
> ChangeLog | 11 +
> localedata/Makefile | 1 +
> localedata/en_US.UTF-8.in | 2159 +++++++++++++++++++++++++++++++++
> localedata/locales/iso14651_t1_common | 1928 ++++++++++++++---------------
> posix/tst-fnmatch.input | 125 +-
> posix/tst-regexloc.c | 8 +-
> 6 files changed, 3224 insertions(+), 1008 deletions(-)
> create mode 100644 localedata/en_US.UTF-8.in
>
> I'm suggesting this change immediately for 2.28 to avoid further
> problems with users expectations and sorting with [a-z] and [A-Z] until
> a clearer consensus can be reached for a final solution.
>
> File attached as .tar.gz to get past spam detectors. There is a lot
> of UTF-8 data in en_US.UTF-8 (every possible character in the LATIN
> set that can be sorted with the existing test case infrastructure).
>

I have committed only the most conservative fix for this issue, which is
to deinterlace the lower and upper case ranges.

I think we are too late to commit rational ranges, and we can do that in
2.29 when it opens. Right now I want to remove the blocker that is causing
regressions for en_US.UTF-8 scripts that use [a-z], and [A-Z].

We have consensus that this is the right direction to take a solution,
and if anyone objects, please speak up before I cut the branch on
August 1st (if we can still achieve that and get good machine coverage).

Cheers,
Carlos.
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Florian Weimer @ 2018-07-25 22:50 UTC
To: Carlos O'Donell, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/25/2018 11:35 PM, Carlos O'Donell wrote:
> I have committed only the most conservative fix for this issue, which is
> to deinterlace the lower and upper case ranges.
>
> I think we are too late to commit rational ranges, and we can do that in
> 2.29 when it opens. Right now I want to remove the blocker that is causing
> regressions for en_US.UTF-8 scripts that use [a-z], and [A-Z].

How is this the most conservative fix, relative to glibc 2.27 upstream?

[a-z] still matches lots of non-ASCII characters, which it did not before.

When I meant that we left regression-fixing territory, I was talking
about the locales which had iso14651_t1_common customizations.

Thanks,
Florian
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Carlos O'Donell @ 2018-07-26 1:20 UTC
To: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/25/2018 06:50 PM, Florian Weimer wrote:
> On 07/25/2018 11:35 PM, Carlos O'Donell wrote:
>> I have committed only the most conservative fix for this issue,
>> which is to deinterlace the lower and upper case ranges.
>>
>> I think we are too late to commit rational ranges, and we can do
>> that in 2.29 when it opens. Right now I want to remove the blocker
>> that is causing regressions for en_US.UTF-8 scripts that use [a-z],
>> and [A-Z].
>
> How is this the most conservative fix, relative to glibc 2.27
> upstream?

We have two solutions to fix the regression:

* Revert the entire ISO 14651 update.
  - This is 13 commits for just the update.
  - Several more commits for Rafal and Mike's work on locales on top of that.

* Fix the key issue of a-z interleaving with A-Z.

My opinion is that it is most conservative to fix the interleaving.

In 2.27 we accepted 297 characters between A-Z.

In 2.28 we accept 2280 characters between A-Z as part of the ISO 14651 update.

> [a-z] still matches lots of non-ASCII characters, which it did not
> before.

This is not true, we were already matching 297 characters between A-Z
in 2.27. It has always been the case that we accepted non-ASCII
characters in the range.

With the ISO 14651 update the *key* issue was that lowercase and
uppercase were now mixed in collation element ordering, resulting in
surprising matches and failures like the reported xfs test failure where
[a-z] matched "Makefile" and broke their test infrastructure.

> When I meant that we left regression-fixing territory, I was talking
> about the locales which had iso14651_t1_common customizations.

OK, so to be clear you think we *should* go forward with rational ranges?

I don't think it's too late, we could commit it tomorrow, it should not
impact machine testing in any way.

My v4 fixes all of the locales that either have customizations on
iso14651_t1_common or have their own custom locales. No more locales
remain to be fixed, I tested all of them with tst-fnmatch.input
additions to catch the ones that needed fixing.

Cheers,
Carlos.
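The interleaving problem described in the message above can be illustrated with a toy model. This is only a sketch: the two orderings below are hypothetical stand-ins for the collation element order (the real order comes from iso14651_t1_common and holds far more than 52 letters), and in_range and matched are illustrative helpers, not glibc functions.

```python
import string

# Hypothetical collation element orders, restricted to ASCII letters.
# Interleaved: a A b B ... z Z (lowercase and uppercase mixed).
INTERLEAVED = [c for pair in zip(string.ascii_lowercase,
                                 string.ascii_uppercase) for c in pair]
# Deinterlaced: a b ... z A B ... Z (each case is contiguous).
DEINTERLACED = list(string.ascii_lowercase + string.ascii_uppercase)


def in_range(order, lo, hi, ch):
    """Return True if ch lies between lo and hi in the given element order."""
    pos = {c: i for i, c in enumerate(order)}
    return pos[lo] <= pos[ch] <= pos[hi]


def matched(order, lo, hi):
    """Return all characters that the range lo-hi accepts, as one string."""
    return "".join(c for c in order if in_range(order, lo, hi, c))


# With interleaving, [a-z] accepts 'M', so "[a-z]*" would match "Makefile".
print(in_range(INTERLEAVED, "a", "z", "M"))
# After deinterlacing, [a-z] accepts only the lowercase letters.
print(in_range(DEINTERLACED, "a", "z", "M"))
print(matched(INTERLEAVED, "a", "z"))
print(matched(DEINTERLACED, "a", "z"))
```

In the interleaved toy order, [a-z] accepts every uppercase letter except Z and [A-Z] accepts every lowercase letter except a, which is the same shape as the tst-fnmatch failures quoted earlier in the thread.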
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Andreas Schwab @ 2018-07-26 8:09 UTC
To: Carlos O'Donell
Cc: Florian Weimer, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On Jul 25 2018, Carlos O'Donell <carlos@redhat.com> wrote:

> surprising matches and failures like the reported xfs test failure where
> [a-z] matched "Makefile"

??? [a-z] has always done that.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Florian Weimer @ 2018-07-26 9:16 UTC
To: Andreas Schwab, Carlos O'Donell
Cc: GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/26/2018 10:08 AM, Andreas Schwab wrote:
> On Jul 25 2018, Carlos O'Donell <carlos@redhat.com> wrote:
>
>> surprising matches and failures like the reported xfs test failure where
>> [a-z] matched "Makefile"
>
> ??? [a-z] has always done that.

It's about the glob/fnmatch pattern "[a-z]*".

Florian
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Jonathan Nieder @ 2018-07-26 1:33 UTC
To: Carlos O'Donell
Cc: GNU C Library, Florian Weimer, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

Hi,

Carlos O'Donell wrote:

> In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
> the collation data to harmonize with the new version of ISO 14651
> which is derived from Unicode 9.0.0. This collation update brought
> with it some changes to locales which were not desirable by some
> users, in particular it altered the meaning of the
> locale-dependent-range regular expression, namely [a-z] and [A-Z], and
> for en_US it caused uppercase letters to be matched by [a-z] for the
> first time.

The Debian system where it is most convenient for me to test has
Debian's libc6 package, version 2.24-12. [a-z] matches uppercase
letters. I've always considered that undesirable but I'm confused
about the described regression. Did one of Debian's patches to
localedata cause it to pick up the regression early (by which I mean,
more than 5 years ago)?

> In glibc we implement the requirement of ISO POSIX-2:1993 and use
> collation element order (CEO) to construct the range expression, the
> API internally is __collseq_table_lookup(). The fact that we use CEO
> and also have 4-level weights on each collation rule means that we can
> in practice reorder the collation rules in iso14651_t1_common (the new
> data) to provide consistent range expression resolution *and* the
> weights should maintain the expected total order.
[...]
> * Adds new test data en_US.UTF-8.in for sort-test.sh which exercises
> strcoll* and strxfrm* and ensures the ISO 14651 collation remains.

Cool! Checking my understanding: does this mean that if I have files

    lll
    MMM
    nnn

that with this patch,

    echo [a-z]*

would no longer match MMM, and

    ls | sort

would continue to sort in the order lll < MMM < nnn?

I wish we had done it 10 years ago. ;-) Thanks for getting it done.

Jonathan
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Carlos O'Donell @ 2018-07-26 1:49 UTC
To: Jonathan Nieder
Cc: GNU C Library, Florian Weimer, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/25/2018 09:33 PM, Jonathan Nieder wrote:
> Hi,
>
> Carlos O'Donell wrote:
>
>> In commit 9479b6d5e08eacce06c6ab60abc9b2f4eb8b71e4 we updated all of
>> the collation data to harmonize with the new version of ISO 14651
>> which is derived from Unicode 9.0.0. This collation update brought
>> with it some changes to locales which were not desirable by some
>> users, in particular it altered the meaning of the
>> locale-dependent-range regular expression, namely [a-z] and [A-Z], and
>> for en_US it caused uppercase letters to be matched by [a-z] for the
>> first time.
>
> The Debian system where it is most convenient for me to test has
> Debian's libc6 package, version 2.24-12. [a-z] matches uppercase
> letters. I've always considered that undesirable but I'm confused
> about the described regression. Did one of Debian's patches to
> localedata cause it to pick up the regression early (by which I mean,
> more than 5 years ago)?

It depends entirely on the locale you use. Some locales already have
[a-z] matching uppercase and have had it for years. The problem is that
this is new for en_US.UTF-8.

Which locale did you use? en_US.UTF-8? If so, then yes, Debian must have
done something different with iso14651_t1_common to change this, or added
something else. I did a quick look at the debian patches for 2.24-12 and
didn't see anything that would change this materially for en_US.

>> In glibc we implement the requirement of ISO POSIX-2:1993 and use
>> collation element order (CEO) to construct the range expression, the
>> API internally is __collseq_table_lookup(). The fact that we use CEO
>> and also have 4-level weights on each collation rule means that we can
>> in practice reorder the collation rules in iso14651_t1_common (the new
>> data) to provide consistent range expression resolution *and* the
>> weights should maintain the expected total order.
> [...]
>> * Adds new test data en_US.UTF-8.in for sort-test.sh which exercises
>> strcoll* and strxfrm* and ensures the ISO 14651 collation remains.
>
> Cool! Checking my understanding: does this mean that if I have files
>
>     lll
>     MMM
>     nnn
>
> that with this patch,
>
>     echo [a-z]*
>
> would no longer match MMM, and

Correct.

>
>     ls | sort
>
> would continue to sort in the order lll < MMM < nnn?

Yes.

>
> I wish we had done it 10 years ago. ;-) Thanks for getting it done.

The rational ranges follow code point order.

The sorting follows collation sequence.

I think this was never an issue because most locales following ISO 14651
were using an old data set which never exhibited this issue. However,
thanks to Mike Fabian's hard work (and no good deed goes unpunished) we
have updated collation all the way to Unicode 9.0.0-era and so
encountered this problem.

Cheers,
Carlos.
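The two answers in the message above (the glob no longer matches MMM, yet the sort order is unchanged) can be sketched in Python. This is an analogy rather than glibc code: Python's fnmatch.fnmatchcase always compares bracket ranges by code point, which resembles the rational-range behaviour, and toy_collation_key is a hypothetical, heavily simplified stand-in for a locale's strxfrm key.

```python
import fnmatch

files = ["nnn", "MMM", "lll"]


def toy_collation_key(name):
    """Hypothetical two-level key: compare letters first, then case."""
    return (name.lower(), name)


# Range matching by code point: 'M' (0x4D) lies outside a-z (0x61-0x7A).
globbed = [f for f in sorted(files) if fnmatch.fnmatchcase(f, "[a-z]*")]
print(globbed)

# Sorting with the toy key keeps MMM between lll and nnn.
print(sorted(files, key=toy_collation_key))
```

The point of the sketch is that range membership and sort order are answered by two different comparisons, so changing one does not have to change the other.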
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Jonathan Nieder @ 2018-07-26 2:16 UTC
To: Carlos O'Donell
Cc: GNU C Library, Florian Weimer, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

Carlos O'Donell wrote:
> On 07/25/2018 09:33 PM, Jonathan Nieder wrote:

>> The Debian system where it is most convenient for me to test has
>> Debian's libc6 package, version 2.24-12. [a-z] matches uppercase
>> letters. I've always considered that undesirable but I'm confused
>> about the described regression. Did one of Debian's patches to
>> localedata cause it to pick up the regression early (by which I mean,
>> more than 5 years ago)?
>
> It depends entirely on the locale you use. Some locales already have
> [a-z] matching uppercase and have had it for years. The problem is that
> this is new for en_US.UTF-8.
>
> Which locale did you use? en_US.UTF-8? If so, then yes, Debian must have
> done something different with iso14651_t1_common to change this, or added
> something else. I did a quick look at the debian patches for 2.24-12 and
> didn't see anything that would change this materially for en_US.

I tried with the following locales:

    en_US: matches (bad)
    en_US.UTF-8: matches (bad)
    C: does not match (good)
    C.UTF-8: does not match (good)
    fr_CH: matches (bad)
    fr_CH.UTF-8: matches (bad)

Looking over
https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/localedata
and https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/locale,
I don't see any obvious culprits. Anyway, please just take this as more
feedback in favor of your approach.

See the user reports merged with https://bugs.debian.org/301717.

Thanks,
Jonathan
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Carlos O'Donell @ 2018-07-26 3:48 UTC
To: Jonathan Nieder
Cc: GNU C Library, Florian Weimer, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/25/2018 10:16 PM, Jonathan Nieder wrote:
> Carlos O'Donell wrote:
>> On 07/25/2018 09:33 PM, Jonathan Nieder wrote:
>
>>> The Debian system where it is most convenient for me to test has
>>> Debian's libc6 package, version 2.24-12. [a-z] matches uppercase
>>> letters. I've always considered that undesirable but I'm confused
>>> about the described regression. Did one of Debian's patches to
>>> localedata cause it to pick up the regression early (by which I mean,
>>> more than 5 years ago)?
>>
>> It depends entirely on the locale you use. Some locales already have
>> [a-z] matching uppercase and have had it for years. The problem is that
>> this is new for en_US.UTF-8.
>>
>> Which locale did you use? en_US.UTF-8? If so, then yes, Debian must have
>> done something different with iso14651_t1_common to change this, or added
>> something else. I did a quick look at the debian patches for 2.24-12 and
>> didn't see anything that would change this materially for en_US.
>
> I tried with the following locales:
>
> en_US: matches (bad)
> en_US.UTF-8: matches (bad)
> C: does not match (good)
> C.UTF-8: does not match (good)
> fr_CH: matches (bad)
> fr_CH.UTF-8: matches (bad)
>
> Looking over
> https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/localedata
> and https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/locale,
> I don't see any obvious culprits. Anyway, please just take this as more
> feedback in favor of your approach.
>
> See the user reports merged with https://bugs.debian.org/301717.

This is your shell doing the expanding, and worse, doing it differently
from glibc.

My bash shell also handles [a-z] expansion differently given the locale
data. It appears to be using collation sequence, i.e. the order in which
the elements sort.

Using grep doesn't result in these matches.

The fix is this: `shopt -s globasciiranges`, and we should make it the
default from now on. The option turns on rational ranges for bash.

Florian found this out when digging into the issue.

We have a lot of cleanup to do to get rational ranges turned on at each
step of expansion.

Cheers,
Carlos.
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Florian Weimer @ 2018-07-26 7:42 UTC
To: Jonathan Nieder, Carlos O'Donell
Cc: GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/26/2018 04:16 AM, Jonathan Nieder wrote:
> Looking over
> https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/localedata
> and https://salsa.debian.org/glibc-team/glibc/tree/sid/debian/patches/locale,
> I don't see any obvious culprits. Anyway, please just take this as more
> feedback in favor of your approach.
>
> See the user reports merged with https://bugs.debian.org/301717.

The bash implementation of glob always uses strcoll/wcscoll ordering
when globasciiranges is not active. It does not use collation element
ordering, so rearranging collation data does not affect it.

This means that the changes discussed here will not affect bash (well,
the glob part at least).

Thanks,
Florian
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Andreas Schwab @ 2018-07-26 8:18 UTC
To: Florian Weimer
Cc: Jonathan Nieder, Carlos O'Donell, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On Jul 26 2018, Florian Weimer <fweimer@redhat.com> wrote:

> The bash implementation of glob always uses strcoll/wcscoll ordering when
> globasciiranges is not active. It does not use collation element ordering,
> so rearranging collation data does not affect it.

Why does strcoll not agree with the collation sequence?

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Florian Weimer @ 2018-07-26 9:15 UTC
To: Andreas Schwab
Cc: Jonathan Nieder, Carlos O'Donell, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/26/2018 10:18 AM, Andreas Schwab wrote:
> On Jul 26 2018, Florian Weimer <fweimer@redhat.com> wrote:
>
>> The bash implementation of glob always uses strcoll/wcscoll ordering when
>> globasciiranges is not active. It does not use collation element ordering,
>> so rearranging collation data does not affect it.
>
> Why does strcoll not agree with the collation sequence?

The collation element ordering is encoded in the _NL_COLLATE_COLLSEQMB
and _NL_COLLATE_COLLSEQWC tables, and not in the weights used by strcoll.

Thanks,
Florian
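The separation described in the message above, with range position in one table and sort weights in another, can be modelled with a small sketch. All data here is invented for illustration and covers four letters only; the real layout of the _NL_COLLATE_COLLSEQMB and _NL_COLLATE_COLLSEQWC tables and of the multi-level weights is not reproduced, and Element, collseq and sort_chars are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    char: str
    weights: tuple  # hypothetical (base letter, case) weights used for sorting


# Hypothetical rules in source order; the weights say a < A < b < B.
RULES = [
    Element("a", (1, 0)),
    Element("A", (1, 1)),
    Element("b", (2, 0)),
    Element("B", (2, 1)),
]


def collseq(rules):
    """Map each character to its position in the rule order."""
    return {e.char: i for i, e in enumerate(rules)}


def sort_chars(rules, chars):
    """Sort characters by weights only; rule order plays no part."""
    weight = {e.char: e.weights for e in rules}
    return sorted(chars, key=weight.__getitem__)


# Deinterlace: move lowercase rules ahead of uppercase, weights untouched.
deinterlaced = sorted(RULES, key=lambda e: e.char.isupper())

print(collseq(RULES))         # rule order a, A, b, B
print(collseq(deinterlaced))  # rule order a, b, A, B
print(sort_chars(RULES, "BbAa") == sort_chars(deinterlaced, "BbAa"))
```

In this model, reordering the rules changes which characters fall inside a range while the weight-based sort result stays identical, which is the property the patch relies on.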
* Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
From: Carlos O'Donell @ 2018-07-26 13:25 UTC
To: Andreas Schwab, Florian Weimer
Cc: Jonathan Nieder, GNU C Library, Rich Felker, Mike Fabian, Zorro Lang, Joseph S. Myers

On 07/26/2018 04:18 AM, Andreas Schwab wrote:
> On Jul 26 2018, Florian Weimer <fweimer@redhat.com> wrote:
>
>> The bash implementation of glob always uses strcoll/wcscoll ordering when
>> globasciiranges is not active. It does not use collation element ordering,
>> so rearranging collation data does not affect it.
>
> Why does strcoll not agree with the collation sequence?

There are two terms that mean very different things.

The strcoll output and collation sequence are the same.

The collation sequence is not the same as the collation element ordering
(the order of the rules in the source file).

POSIX mandated the use of collation element ordering (not sequence) for
regular expression ranges, and then decided this was a bad idea and
instead made it unspecified.

In glibc we continue to implement and support collation element
ordering, not collation sequence, for POSIX regular expression ranges.

Even collation sequence is a bad idea because [a-z] does not include all
the z's that are sorted after z, and you need special collation element
markers like AFTER-Z to find all the z's.

Instead we should use rational ranges and make everything based on code
points to make it portable across all locales.

Cheers,
Carlos.
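A rational range, as proposed in the message above, reduces to a comparison of code points. A minimal sketch, assuming single-character endpoints; rational_range_match is a name made up for this illustration and is not part of glibc or bash.

```python
def rational_range_match(lo, hi, ch):
    """Return True if ch's code point lies in [lo, hi], whatever the locale."""
    return ord(lo) <= ord(ch) <= ord(hi)


# Check a few candidates against the range a-z.
for ch in ["a", "z", "A", "Z", "é", "ž"]:
    print(repr(ch), rational_range_match("a", "z", ch))
```

Under this rule [a-z] covers exactly U+0061 through U+007A, so uppercase letters and accented letters such as é fall outside it in every locale.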