On 07/20/2018 03:19 PM, Florian Weimer wrote:
> On 07/20/2018 08:49 PM, Carlos O'Donell wrote:
>> On 07/19/2018 04:39 PM, Florian Weimer wrote:
>>> On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
>>>> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
>>>> exercise that [a-z] does not match A or Z.
>>>
>>> [a-z] still matches ñ, ð, but not 𝚣, which I doubt is useful.
>>
>> Sorry, I don't follow, it absolutely matches ASCII z.
> 
> The z I wrote above is one of the non-BMP math characters.

Thanks :-}

It was a conservative solution.
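
For anyone who wants to poke at this outside the test suite, here is a
minimal sketch of the kind of check the fnmatch rows quoted below
express. matches_in_locale is a hypothetical helper of mine, not
something in glibc or in the patch, and the result for any given locale
depends on the installed glibc and locale data.

```c
#include <assert.h>
#include <fnmatch.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, not part of glibc or its test suite: select
   LOCALE_NAME for LC_ALL and report whether STRING matches PATTERN.
   Returns 1 on a match, 0 on FNM_NOMATCH, and -1 if the locale is
   not available or fnmatch reports an error.  */
int
matches_in_locale (const char *locale_name, const char *pattern,
                   const char *string)
{
  assert (locale_name != NULL && pattern != NULL && string != NULL);

  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  int rc = fnmatch (pattern, string, 0);
  if (rc == 0)
    return 1;
  return rc == FNM_NOMATCH ? 0 : -1;
}
```

With the patch applied I would expect
matches_in_locale ("en_US.UTF-8", "[a-z]", "A") to return 0, which is
what the NOMATCH rows say, but treat that as an expectation to verify
on your own build rather than a guarantee.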

>> We deinterlace the collation element ordering (not sequence) to get
>> the right range expression resolution.
>>
>> See the added fnmatch tests:
>>
>> +en_US.UTF-8     "a"                    "[a-z]"                0
>> +en_US.UTF-8     "z"                    "[a-z]"                0
>> +en_US.UTF-8     "A"                    "[a-z]"                NOMATCH
>> +en_US.UTF-8     "Z"                    "[a-z]"                NOMATCH
>> +en_US.UTF-8     "a"                    "[A-Z]"                NOMATCH
>> +en_US.UTF-8     "z"                    "[A-Z]"                NOMATCH
>> +en_US.UTF-8     "A"                    "[A-Z]"                0
>> +en_US.UTF-8     "Z"                    "[A-Z]"                0
>> +en_US.UTF-8     "0"                    "[0-9]"                0
>> +en_US.UTF-8     "9"                    "[0-9]"                0
>>
>> [a-z] matches a-z (including z), *and* all the lowercase in between,
>> and so effectively behaves like [:lower:].
> 
> There are characters equivalent to ASCII z (like the z above), but
> which sort after z, so they are not matched.  This is one reason why
> I think this is a bad idea: it looks like [:lower:], but it's not.
> Same for [0-9], I assume.

Again, this is the conservative choice: it is how ranges worked before,
and they now work the same way again, while retaining the improvement
from the added ISO 14651 data.
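
To make the deinterlacing point concrete, here is a toy model of range
resolution by position in a collation sequence. The two sequences and
the helper toy_range_matches are made up for illustration; they are not
the ISO 14651 tables, the glibc locale data, or the code in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy collation sequences for illustration only; these are NOT the
   ISO 14651 tables or the glibc locale data.  */
static const char INTERLACED[] =
  "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ";
static const char DEINTERLACED[] =
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

/* Return 1 if C falls inside the range LO-HI when positions are taken
   from ORDER, 0 otherwise (including when a character is not listed
   in ORDER at all).  */
int
toy_range_matches (const char *order, char lo, char hi, char c)
{
  assert (order != NULL);

  /* strchr would find the terminator for '\0', so reject it early.  */
  if (lo == '\0' || hi == '\0' || c == '\0')
    return 0;

  const char *plo = strchr (order, lo);
  const char *phi = strchr (order, hi);
  const char *pc = strchr (order, c);
  if (plo == NULL || phi == NULL || pc == NULL)
    return 0;

  return plo <= pc && pc <= phi;
}
```

In this model, a-z over INTERLACED picks up A through Y but not Z,
while a-z over DEINTERLACED is exactly the 26 lowercase letters.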
 
>>> It's an improvement, and it may be good enough for glibc 2.28, but I would
>>> rather see us implement the rational ranges interpretation.
>>
>> That requires all ranges behave rationally?
>>
>> We could fix a-z, A-Z, and 0-9 easily.
>>
>> Patch attached.
> 
> (NB: Patch is relative to the previous patch.)
> 
> My enumeration tester likes it much more. 8-)

It was designed exactly for your enumerator ;-)

>   actual:   "abcdefghijklmnopqrstuvwxyz"
>   actual:   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
>   actual:   "0123456789"
> 
> That's for [a-z], [A-Z], [0-9], in en_US.UTF-8 and de_DE.ISO-8859-1. However, I still get this:
> 
> tst-regex-classes.script:85:0: result character set difference in locale tr_TR.ISO-8859-9
> enumerate_chars '[a-z]' "abcdefghijklmnopqrstuvwxyz";
> ^
>   expected: "abcdefghijklmnopqrstuvwxyz"
>   actual:   "abcdefghjklmnopqrstuvwxyz"
>
> tst-regex-classes.script:86:0: result character set difference in locale tr_TR.ISO-8859-9
> enumerate_chars '[A-Z]' "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
> ^
>   expected: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
>   actual:   "ABCDEFGHJKLMNOPQRSTUVWXYZ"
> error: 2 test failures
> 
> Can you fix this with data-only changes, too?

Yes, I need to duplicate the rational range for A-Z in tr_TR and
remove 'i' since it's just fine the way it is, the existing

New patch attached with additional tests in tst-fnmatch.input to
test tr_TR.UTF-8 and tr_TR.ISO-8859-9.
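
If you want a quick cross-check without the enumeration tester, the
following is a rough sketch of the same idea for single-byte
characters only. enumerate_bytes is a hypothetical stand-in that I am
making up here; it is not your tester and it does not attempt
multibyte characters, so in a UTF-8 locale it only says something
about ASCII.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for an enumeration check: collect every
   single-byte character (1..255) that matches the bracket expression
   BRACKET in locale LOCALE_NAME.  OUT must have room for 256 bytes.
   Returns the number of matching bytes, or -1 on failure.  */
int
enumerate_bytes (const char *locale_name, const char *bracket, char *out)
{
  assert (locale_name != NULL && bracket != NULL && out != NULL);

  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  /* Anchor the expression so only a one-character subject matches.  */
  char pattern[128];
  int n = snprintf (pattern, sizeof pattern, "^%s$", bracket);
  if (n < 0 || (size_t) n >= sizeof pattern)
    return -1;

  regex_t re;
  if (regcomp (&re, pattern, REG_NOSUB) != 0)
    return -1;

  int count = 0;
  for (int c = 1; c < 256; c++)
    {
      char subject[2] = { (char) c, '\0' };
      if (regexec (&re, subject, 0, NULL, 0) == 0)
        out[count++] = (char) c;
    }
  out[count] = '\0';

  regfree (&re);
  return count;
}
```

Calling it with "tr_TR.ISO-8859-9" and "[a-z]" should show whether
'i' is still missing on a given build, assuming that locale is
installed.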

I also noticed equivalence class issues, filed a bug, and added an
XFAIL-ish test case in tst-fnmatch.input:
https://sourceware.org/bugzilla/show_bug.cgi?id=23437

> posix/bug-regex17 regresses as well in the test for bug 9697, but I
> can incorporate that into my enumeration tester.  I don't think the
> bug is actually regressing, it's just that the test objective is not
> expressed properly in it.

Fixed.

> 
> posix/tst-rxspencer fails as well, presumably due to this:
> 
> UTF-8 aA FAIL regcomp failed: Invalid range end
> UTF-8 aAcC FAIL regcomp failed: Invalid range end
> 
> I think this happens because the test blindly replaces ASCII
> characters with non-ASCII characters, which causes issues if they are
> not ordered as expected.

Fixed.
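
For reference, "Invalid range end" is the message glibc's regerror
gives for REG_ERANGE, which regcomp returns when the end point of a
range sorts before its start point. A minimal sketch to see that in
isolation follows; compile_status is a hypothetical helper, not
anything in tst-rxspencer.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile PATTERN as a POSIX basic regular
   expression in the current locale and return regcomp's status.
   If MSG is non-NULL, it receives the regerror text on failure and
   an empty string on success.  */
int
compile_status (const char *pattern, char *msg, size_t msg_size)
{
  assert (pattern != NULL);

  regex_t re;
  int rc = regcomp (&re, pattern, REG_NOSUB);

  if (msg != NULL && msg_size > 0)
    {
      if (rc != 0)
        regerror (rc, &re, msg, msg_size);
      else
        msg[0] = '\0';
    }

  /* Only a successfully compiled pattern needs to be freed.  */
  if (rc == 0)
    regfree (&re);
  return rc;
}
```

A reversed range such as "[c-a]" is the simplest way to trigger the
error; the test hit the same path because its substituted non-ASCII
characters were no longer ordered the way it assumed.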

v2
- Fixed tr_TR by duplicating A-Z rational range.
- Fixed tst-rxspencer.
- Fixed bug-regex17.

Tell me how the new version does.

-- 
Cheers,
Carlos.