public inbox for libc-alpha@sourceware.org
From: Florian Weimer <fweimer@redhat.com>
To: Carlos O'Donell <carlos@redhat.com>,
	GNU C Library <libc-alpha@sourceware.org>,
	Rich Felker <dalias@aerifal.cx>, Mike Fabian <mfabian@redhat.com>,
	Zorro Lang <zlang@redhat.com>,
	"Joseph S. Myers" <joseph@codesourcery.com>
Subject: Re: [PATCH] Keep expected behaviour for [a-z] and [A-z] (Bug 23393).
Date: Fri, 20 Jul 2018 19:19:00 -0000
Message-ID: <5bcef059-b928-d2e9-82dd-2ae68be96020@redhat.com>
In-Reply-To: <f905879a-fd42-331e-eac1-46ed54d06d9e@redhat.com>

On 07/20/2018 08:49 PM, Carlos O'Donell wrote:
> On 07/19/2018 04:39 PM, Florian Weimer wrote:
>> On 07/19/2018 09:43 PM, Carlos O'Donell wrote:
>>> * Add back tests to tst-fnmatch.input and tst-regexloc.c which
>>> exercise that [a-z] does not match A or Z.
>>
>> [a-z] still matches ñ, 𝚗, but not 𝚣, which I doubt is useful.
> 
> Sorry, I don't follow, it absolutely matches ASCII z.

The 𝚣 I wrote above is not ASCII z; it is one of the non-BMP math
characters.

> We deinterlace the collation element ordering (not sequence) to get
> the right range expression resolution.
> 
> See the added fnmatch tests:
> 
> +en_US.UTF-8     "a"                    "[a-z]"                0
> +en_US.UTF-8     "z"                    "[a-z]"                0
> +en_US.UTF-8     "A"                    "[a-z]"                NOMATCH
> +en_US.UTF-8     "Z"                    "[a-z]"                NOMATCH
> +en_US.UTF-8     "a"                    "[A-Z]"                NOMATCH
> +en_US.UTF-8     "z"                    "[A-Z]"                NOMATCH
> +en_US.UTF-8     "A"                    "[A-Z]"                0
> +en_US.UTF-8     "Z"                    "[A-Z]"                0
> +en_US.UTF-8     "0"                    "[0-9]"                0
> +en_US.UTF-8     "9"                    "[0-9]"                0
> 
> [a-z] matches a-z (including z), *and* all the lowercase in between,
> and so effectively behaves like [:lower:].

There are characters equivalent to ASCII z (like the 𝚣 above) which
sort after z, so they are not matched.  This is one reason why I think
this is a bad idea: it looks like [:lower:], but it's not.  Same for
[0-9], I assume.

>> It's an improvement, and it may be good enough for glibc 2.28, but I would
>> rather see us implement the rational ranges interpretation.
> 
> That requires all ranges behave rationally?
> 
> We could fix a-z, A-Z, and 0-9 easily.
> 
> Patch attached.

(NB: Patch is relative to the previous patch.)

My enumeration tester likes it much more. 8-)

   actual:   "abcdefghijklmnopqrstuvwxyz"
   actual:   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
   actual:   "0123456789"

That's for [a-z], [A-Z], [0-9], in en_US.UTF-8 and de_DE.ISO-8859-1. 
However, I still get this:

tst-regex-classes.script:85:0: result character set difference in locale tr_TR.ISO-8859-9
enumerate_chars '[a-z]' "abcdefghijklmnopqrstuvwxyz";
^
   expected: "abcdefghijklmnopqrstuvwxyz"
   actual:   "abcdefghjklmnopqrstuvwxyz"
tst-regex-classes.script:86:0: result character set difference in locale tr_TR.ISO-8859-9
enumerate_chars '[A-Z]' "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
^
   expected: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
   actual:   "ABCDEFGHJKLMNOPQRSTUVWXYZ"
error: 2 test failures

Can you fix this with data-only changes, too?

posix/bug-regex17 regresses as well in the test for bug 9697, but I can
incorporate that into my enumeration tester.  I don't think the bug is
actually regressing; the test just does not express its objective
properly.

posix/tst-rxspencer fails as well, presumably due to this:

UTF-8 aA FAIL regcomp failed: Invalid range end
UTF-8 aAcC FAIL regcomp failed: Invalid range end

I think this happens because the test blindly replaces ASCII characters 
with non-ASCII characters, which causes issues if they are not ordered 
as expected.

Thanks,
Florian
