From: Corinna Vinschen <corinna-cygwin@cygwin.com>
To: cygwin@cygwin.com
Subject: Re: Bug in collation functions?
Date: Fri, 30 Oct 2015 19:11:00 -0000
Message-ID: <20151030120320.GO5319@calimero.vinschen.de>
In-Reply-To: <56329BE8.808@cornell.edu>

Hi Ken,

On Oct 29 18:21, Ken Brown wrote:
> On 10/29/2015 5:49 PM, Ken Brown wrote:
> >On 10/29/2015 2:42 PM, Ken Brown wrote:
> >>On 10/29/2015 12:51 PM, Eric Blake wrote:
> >>>Careful.  POSIX is proposing some wording that says that normal locales
> >>>should always implement a fallback of last resort (and that locales that
> >>>do not do so should have a special name including '@', to make it
> >>>obvious).  It is not standardized yet, but worth thinking about.
> >>>
> >>>http://austingroupbugs.net/view.php?id=938
> >>>http://austingroupbugs.net/view.php?id=963
> >>>
> >>>The intent of that wording is that if ignoring punctuation could cause
> >>>two strings to otherwise compare equal, the fallback of a total ordering
> >>>on all characters means that the final result of strcoll() will not be 0
> >>>unless the two strings are identical.
> >>
> >>In that case, I think Cygwin should start by using NORM_IGNORESYMBOLS in
> >>non-POSIX locales, with the goal of eventually moving toward emulating
> >>glibc.  I don't know what fallback glibc uses or how hard it would be to
> >>implement this on Cygwin.
> >
> >I withdraw this suggestion.  I took a look at the glibc code, and I
> >don't see any reasonable way for Cygwin to emulate it precisely.  On the
> >other hand, I have an idea for a simple fallback.  I'll play with it a
> >little and then submit a patch.
> 
> The fallback I had in mind is to order the shorter string first if they
> have different lengths and otherwise to fall back to wcscmp.  Using this,
> both Cygwin and Linux give the following comparisons:
> 
> "11" > "1.1" in POSIX locale
> "11" < "1.1" in en_US.UTF-8 locale
> "11" > "1 2" in POSIX locale
> "11" < "1.2" in en_US.UTF-8 locale
> "1 1" < "1.1" in POSIX locale
> "1 1" < "1.1" in en_US.UTF-8 locale
> 
> If this seems reasonable, I'll test it more extensively and then submit a
> patch.

I took a longer look at this suggestion and the code below, and it took
me some time to figure out what bugged me about it:

What about str/wcsxfrm?

Per POSIX, calling strcmp on the results of strxfrm is equivalent to
calling strcoll on the original strings (and analogously for wcs*).  If
you extend *coll to perform an extra check on the length, you will have
cases in which this rule fails.  You can't perform the length test on the
result of *xfrm and expect the same result as in *coll.
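
To make the rule concrete, here is a minimal sketch that checks it for a
pair of strings in whatever locale is currently set.  It uses only the
standard strcoll, strxfrm and strcmp; the helper names sign, xfrm_key and
coll_matches_xfrm are made up for this illustration and are not part of
Cygwin or any library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Reduce a comparison result to -1, 0 or 1. */
static int sign(int v) { return (v > 0) - (v < 0); }

/* Hypothetical helper: return a malloc'ed strxfrm() key for s,
   or NULL if the allocation fails. */
static char *xfrm_key(const char *s)
{
    size_t n = strxfrm(NULL, s, 0);   /* needed length, excluding the NUL */
    char *key = malloc(n + 1);
    if (key != NULL)
        strxfrm(key, s, n + 1);
    return key;
}

/* Hypothetical check: 1 if strcoll(a, b) and strcmp() on the two keys
   agree in sign, 0 if they disagree, -1 on allocation failure. */
int coll_matches_xfrm(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    char *ka = xfrm_key(a);
    char *kb = xfrm_key(b);
    int ok = -1;
    if (ka != NULL && kb != NULL)
        ok = sign(strcoll(a, b)) == sign(strcmp(ka, kb));
    free(ka);
    free(kb);
    return ok;
}
```

A conforming implementation should return 1 here for every pair of
strings in every locale; an extra length check that exists only in *coll
is exactly the kind of change that would make it return 0.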

In fact, when calling LCMapStringW with NORM_IGNORESYMBOLS (you would
have to do this anyway if we add this flag in *coll), the resulting
transformed strings created from the input strings "11" and "1.1" would
be identical, so a length test on the xfrm string is not meaningful at
all.
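
The mismatch can be sketched without any Windows API.  The toy functions
below are hypothetical stand-ins, not Cygwin code: toy_xfrm imitates the
effect of NORM_IGNORESYMBOLS by simply dropping punctuation and
whitespace (an assumption about which characters count as symbols),
toy_coll compares the resulting keys, and toy_coll_fallback adds the
proposed length/strcmp tie-break on top.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical key function: copy src to dst, dropping punctuation and
   whitespace.  dst must have room for strlen(src) + 1 bytes. */
void toy_xfrm(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    for (; *src != '\0'; src++)
        if (!ispunct((unsigned char)*src) && !isspace((unsigned char)*src))
            *dst++ = *src;
    *dst = '\0';
}

/* Consistent collation: the result depends on the keys only.
   Inputs are limited to 63 bytes to keep the sketch short. */
int toy_coll(const char *a, const char *b)
{
    char ka[64], kb[64];
    assert(strlen(a) < sizeof ka && strlen(b) < sizeof kb);
    toy_xfrm(ka, a);
    toy_xfrm(kb, b);
    return strcmp(ka, kb);
}

/* Collation with the extra tie-break: if the keys compare equal,
   order the shorter string first, then fall back to strcmp. */
int toy_coll_fallback(const char *a, const char *b)
{
    int r = toy_coll(a, b);
    if (r != 0)
        return r;
    size_t la = strlen(a), lb = strlen(b);
    if (la != lb)
        return la < lb ? -1 : 1;
    return strcmp(a, b);
}
```

For "11" and "1.1" both keys are "11", so strcmp on the keys yields 0,
while toy_coll_fallback yields a negative value.  The information the
tie-break relies on is no longer present in the keys, so nothing done
after the transformation can reproduce it.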

The bottom line is, afaics, we must make sure that CompareStringW and
LCMapStringW are called the same way, and that their result/output is
returned to the caller unchanged.  Performing an extra check in *coll
which can't be reliably performed in *xfrm is not feasible.

Does that make sense?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat
