Bug in collation functions?

public inbox for cygwin@cygwin.com

* Bug in collation functions?
@ 2015-10-29  7:41 Ken Brown
  2015-10-29  7:50 ` Eric Blake
  0 siblings, 1 reply; 17+ messages in thread
From: Ken Brown @ 2015-10-29  7:41 UTC (permalink / raw)
  To: cygwin

It's my understanding that collation is supposed to take whitespace and 
punctuation into account in the POSIX locale but not in other locales. 
This doesn't seem to be the case on Cygwin.  Here's a test case using 
wcscoll, but the same problem occurs with strcoll.

$ cat wcscoll_test.c
#include <wchar.h>
#include <stdio.h>
#include <locale.h>

void
compare (const wchar_t *a, const wchar_t *b, const char *loc)
{
   setlocale (LC_COLLATE, loc);
   char res = wcscoll (a, b) < 0 ? '<' : '>';
   printf ("\"%ls\" %c \"%ls\" in %s locale\n", a, res, b, loc);
}

int
main ()
{
   compare (L"11", L"1.1", "POSIX");
   compare (L"11", L"1.1", "en_US.UTF-8");
   compare (L"11", L"1 2", "POSIX");
   compare (L"11", L"1 2", "en_US.UTF-8");
}

$ gcc wcscoll_test.c -o wcscoll_test

$ ./wcscoll_test
"11" > "1.1" in POSIX locale
"11" > "1.1" in en_US.UTF-8 locale
"11" > "1 2" in POSIX locale
"11" > "1 2" in en_US.UTF-8 locale

On Linux, the output from the same program is

"11" > "1.1" in POSIX locale
"11" < "1.1" in en_US.UTF-8 locale
"11" > "1 2" in POSIX locale
"11" < "1 2" in en_US.UTF-8 locale

Ken

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

* Re: Bug in collation functions?
  2015-10-29  7:41 Bug in collation functions? Ken Brown
@ 2015-10-29  7:50 ` Eric Blake
  2015-10-29 12:58   ` Corinna Vinschen
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Blake @ 2015-10-29  7:50 UTC (permalink / raw)
  To: cygwin

On 10/28/2015 04:14 PM, Ken Brown wrote:
> It's my understanding that collation is supposed to take whitespace and
> punctuation into account in the POSIX locale but not in other locales.

Not quite right. It is up to the locale definition whether whitespace
affects collation.  But you are correct that in the POSIX locale,
whitespace must not be ignored in collation.

> This doesn't seem to be the case on Cygwin.  Here's a test case using
> wcscoll, but the same problem occurs with strcoll.

That's because the locale definitions are different in cygwin than they
are in glibc.  But it is not a bug in Cygwin; POSIX allows for different
systems to have different locale definitions while still using the same
locale name like en_US.UTF-8.
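
To illustrate the POSIX-locale half of this: in the "POSIX" (or "C")
locale, strcoll is expected to order strings the same way strcmp does, so
whitespace and punctuation count like any other character.  Below is a
minimal sketch of a check for that.  The helper names `sign` and
`posix_collation_matches_strcmp` are made up for this example and are not
part of any library.

```c
#include <locale.h>
#include <string.h>

/* Reduce any comparison result to -1, 0, or 1.  */
static int
sign (int x)
{
  return (x > 0) - (x < 0);
}

/* Hypothetical helper: returns 1 if strcoll and strcmp order A and B the
   same way in the "POSIX" locale, 0 if they disagree, and -1 if the
   locale cannot be selected.  Note that it changes LC_COLLATE.  */
int
posix_collation_matches_strcmp (const char *a, const char *b)
{
  if (setlocale (LC_COLLATE, "POSIX") == NULL)
    return -1;
  return sign (strcoll (a, b)) == sign (strcmp (a, b));
}
```

With the strings from the test case, posix_collation_matches_strcmp ("11",
"1.1") should return 1, since '.' sorts before '1' by character value and
the POSIX locale is not supposed to ignore it.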

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

* Re: Bug in collation functions?
  2015-10-29  7:50 ` Eric Blake
@ 2015-10-29 12:58   ` Corinna Vinschen
  2015-10-29 15:35     ` Corinna Vinschen
  0 siblings, 1 reply; 17+ messages in thread
From: Corinna Vinschen @ 2015-10-29 12:58 UTC (permalink / raw)
  To: cygwin

On Oct 28 21:58, Eric Blake wrote:
> That's because the locale definitions are different in cygwin than they
> are in glibc.  But it is not a bug in Cygwin; POSIX allows for different
> systems to have different locale definitions while still using the same
> locale name like en_US.UTF-8.

Btw, strcoll and wcscoll in Cygwin are implemented using the Windows
function CompareStringW with the LCID set to the locale matching the
POSIX locale setting.  I'm rather glad I didn't have to implement this
by myself... :}


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

* Re: Bug in collation functions?
  2015-10-29 12:58   ` Corinna Vinschen
@ 2015-10-29 15:35     ` Corinna Vinschen
  2015-10-29 15:51       ` Ken Brown
  0 siblings, 1 reply; 17+ messages in thread
From: Corinna Vinschen @ 2015-10-29 15:35 UTC (permalink / raw)
  To: cygwin

On Oct 29 08:50, Corinna Vinschen wrote:
> Btw, strcoll and wcscoll in Cygwin are implemented using the Windows
> function CompareStringW with the LCID set to the locale matching the
> POSIX locale setting.  I'm rather glad I didn't have to implement this
> by myself... :}

OTOH, CompareString has a couple of flags to control its behaviour, see
https://msdn.microsoft.com/en-us/library/windows/desktop/dd317761%28v=vs.85%29.aspx

Right now Cygwin calls CompareStringW with dwCmpFlags set to 0, but there
are flags like NORM_IGNORENONSPACE, NORM_IGNORESYMBOLS.  I'm open to a
discussion how to change the settings to more closely resemble the rules
on Linux.

E.g.  wcscoll simply calls wcscmp rather than CompareStringW for the
C/POSIX locale anyway.  So, would it make sense to set the flags to
NORM_IGNORESYMBOLS in other locales?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

* Re: Bug in collation functions?
  2015-10-29 15:35     ` Corinna Vinschen
@ 2015-10-29 15:51       ` Ken Brown
  2015-10-29 16:14         ` Corinna Vinschen
  0 siblings, 1 reply; 17+ messages in thread
From: Ken Brown @ 2015-10-29 15:51 UTC (permalink / raw)
  To: cygwin

On 10/29/2015 4:30 AM, Corinna Vinschen wrote:
> Right now Cygwin calls CompareStringW with dwCmpFlags set to 0, but there
> are flags like NORM_IGNORENONSPACE, NORM_IGNORESYMBOLS.  I'm open to a
> discussion how to change the settings to more closely resemble the rules
> on Linux.
>
> E.g.  wcscoll simply calls wcscmp rather than CompareStringW for the
> C/POSIX locale anyway.  So, would it make sense to set the flags to
> NORM_IGNORESYMBOLS in other locales?

I think so.  That's what the native Windows build of emacs does in this 
situation.  (I came across the issue because one of the tests in the 
emacs test suite was failing on Cygwin.)

Ken


* Re: Bug in collation functions?
  2015-10-29 15:51       ` Ken Brown
@ 2015-10-29 16:14         ` Corinna Vinschen
  2015-10-29 16:14           ` Ken Brown
  2015-10-29 16:17           ` Eric Blake
  0 siblings, 2 replies; 17+ messages in thread
From: Corinna Vinschen @ 2015-10-29 16:14 UTC (permalink / raw)
  To: cygwin

On Oct 29 08:59, Ken Brown wrote:
> On 10/29/2015 4:30 AM, Corinna Vinschen wrote:
> >E.g.  wcscoll simply calls wcscmp rather than CompareStringW for the
> >C/POSIX locale anyway.  So, would it make sense to set the flags to
> >NORM_IGNORESYMBOLS in other locales?
> 
> I think so.  That's what the native Windows build of emacs does in this
> situation.

Is that all it's doing?  I'm asking because using NORM_IGNORESYMBOLS
does not exactly resemble the behaviour on Linux on my W10 box:

    "11" > "1.1" in POSIX locale
!!! "11" > "1.1" in en_US.UTF-8 locale
    "11" > "1 2" in POSIX locale
    "11" < "1 2" in en_US.UTF-8 locale


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

* Re: Bug in collation functions?
  2015-10-29 16:14         ` Corinna Vinschen
@ 2015-10-29 16:14           ` Ken Brown
  2015-10-29 16:51             ` Ken Brown
  2015-10-29 16:17           ` Eric Blake
  1 sibling, 1 reply; 17+ messages in thread
From: Ken Brown @ 2015-10-29 16:14 UTC (permalink / raw)
  To: cygwin

On 10/29/2015 11:35 AM, Corinna Vinschen wrote:
> Is that all it's doing?  I'm asking because using NORM_IGNORESYMBOLS
> does not exactly resemble the behaviour on Linux on my W10 box:
>
>      "11" > "1.1" in POSIX locale
> !!! "11" > "1.1" in en_US.UTF-8 locale
>      "11" > "1 2" in POSIX locale
>      "11" < "1 2" in en_US.UTF-8 locale

I just noticed that myself and was going to ask about that difference. 
I don't see anything else that emacs is doing on native Windows.  But in 
the test I referred to above, the locale is set to "enu_USA" in the 
native Windows build.  Does that explain the discrepancy?  If not, I can 
ask on the emacs-devel list whether the test passes on Windows.

Ken

* Re: Bug in collation functions?
  2015-10-29 16:14         ` Corinna Vinschen
  2015-10-29 16:14           ` Ken Brown
@ 2015-10-29 16:17           ` Eric Blake
  1 sibling, 0 replies; 17+ messages in thread
From: Eric Blake @ 2015-10-29 16:17 UTC (permalink / raw)
  To: cygwin

On 10/29/2015 09:35 AM, Corinna Vinschen wrote:

>>> Right now Cygwin calls CompareStringW with dwCmpFlags set to 0, but there
>>> are flags like NORM_IGNORENONSPACE, NORM_IGNORESYMBOLS.  I'm open to a
>>> discussion how to change the settings to more closely resemble the rules
>>> on Linux.
>>>
>>> E.g.  wcscoll simply calls wcscmp rather than CompareStringW for the
>>> C/POSIX locale anyway.  So, would it make sense to set the flags to
>>> NORM_IGNORESYMBOLS in other locales?
>>
>> I think so.  That's what the native Windows build of emacs does in this
>> situation.
> 
> Is that all it's doing?  I'm asking because using NORM_IGNORESYMBOLS
> does not exactly resemble the behaviour on Linux on my W10 box:
> 
>     "11" > "1.1" in POSIX locale
> !!! "11" > "1.1" in en_US.UTF-8 locale
>     "11" > "1 2" in POSIX locale
>     "11" < "1 2" in en_US.UTF-8 locale
> 

I'm not sure if blindly enabling the flags for all locales makes sense,
though.  I haven't audited glibc locales to know for sure, but it is my
impression that it is up to the locale author on whether whitespace
affects collation; and while the author of glibc en_US.UTF-8 may have
chosen that way, I can't guarantee that some other locales in glibc
still treat whitespace as significant.

POSIX has a notion of writing your own locale definition - and glibc
definitely supports that (although I haven't personally tried doing it),
where you can set your OWN collation rules while inheriting the bulk of
the work from an existing locale.   So in glibc, it is possible to have
a locale similar to en_US.UTF-8 but where whitespace IS significant in
collation.  I know cygwin isn't there yet (we expose the Windows locale,
but do not let you define your own).

This seems like the sort of thing where maybe we'd want support for
user-defined locales, compiled into a binary format, and then cygwin
opens the binary locale definition for deciding which flags to use
according to the locale being used.  But that sounds like a LOT of work,
for a questionable amount of gain.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

* Re: Bug in collation functions?
  2015-10-29 16:14           ` Ken Brown
@ 2015-10-29 16:51             ` Ken Brown
  2015-10-29 18:09               ` Eric Blake
  0 siblings, 1 reply; 17+ messages in thread
From: Ken Brown @ 2015-10-29 16:51 UTC (permalink / raw)
  To: cygwin

On 10/29/2015 11:45 AM, Ken Brown wrote:
> I just noticed that myself and was going to ask about that difference. I
> don't see anything else that emacs is doing on native Windows.  But in
> the test I referred to above, the locale is set to "enu_USA" in the
> native Windows build.  Does that explain the discrepancy?  If not, I can
> ask on the emacs-devel list whether the test passes on Windows.

Never mind.  My test case was flawed, because it didn't check for the 
possibility that wcscoll might return 0.  Here's a revised definition of 
the "compare" function:

void
compare (const wchar_t *a, const wchar_t *b, const char *loc)
{
   setlocale (LC_COLLATE, loc);
   int res = wcscoll (a, b);
   char c = res < 0 ? '<' : res > 0 ? '>' : '=';
   printf ("\"%ls\" %c \"%ls\" in %s locale\n", a, c, b, loc);
}

With this change (and the use of NORM_IGNORESYMBOLS) the test returns 
the following on Cygwin:

$ ./wcscoll_test
"11" > "1.1" in POSIX locale
"11" = "1.1" in en_US.UTF-8 locale
"11" > "1 2" in POSIX locale
"11" < "1 2" in en_US.UTF-8 locale

It still differs from Linux, but it's good enough to make the emacs test 
pass.  Moreover, this behavior actually seems more reasonable to me than 
the Linux behavior.  After all, if you're ignoring punctuation, how can 
you decide which of "11" or "1.1" comes first?
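
A rough, portable model of what "ignoring symbols" means may help when
reading these results.  The sketch below skips punctuation and whitespace
in both strings and compares whatever is left by code point.  It is only
an approximation for illustration: it does not call CompareStringW, it
makes no claim to match the exact rules behind NORM_IGNORESYMBOLS, and the
names `skip_symbols` and `model_ignore_symbols` are invented here.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: advance P past punctuation and whitespace.  */
static const wchar_t *
skip_symbols (const wchar_t *p)
{
  while (*p != L'\0'
         && (iswpunct ((wint_t) *p) || iswspace ((wint_t) *p)))
    p++;
  return p;
}

/* Simplified model, not the Windows algorithm: compare A and B by code
   point while ignoring punctuation and whitespace.  Returns -1, 0 or 1.  */
int
model_ignore_symbols (const wchar_t *a, const wchar_t *b)
{
  for (;;)
    {
      a = skip_symbols (a);
      b = skip_symbols (b);
      if (*a != *b)
        return *a < *b ? -1 : 1;
      if (*a == L'\0')
        return 0;               /* both strings exhausted: a tie */
      a++;
      b++;
    }
}
```

Under this model "11" and "1.1" compare equal and "11" sorts before "1 2",
which agrees with the two en_US.UTF-8 lines in the Cygwin output above.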

Ken

* Re: Bug in collation functions?
  2015-10-29 16:51             ` Ken Brown
@ 2015-10-29 18:09               ` Eric Blake
  2015-10-29 21:58                 ` Ken Brown
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Blake @ 2015-10-29 18:09 UTC (permalink / raw)
  To: cygwin

On 10/29/2015 10:13 AM, Ken Brown wrote:

> Never mind.  My test case was flawed, because it didn't check for the
> possibility that wcscoll might return 0.  Here's a revised definition of
> the "compare" function:
> 
> void
> compare (const wchar_t *a, const wchar_t *b, const char *loc)
> {
>   setlocale (LC_COLLATE, loc);
>   int res = wcscoll (a, b);
>   char c = res < 0 ? '<' : res > 0 ? '>' : '=';
>   printf ("\"%ls\" %c \"%ls\" in %s locale\n", a, c, b, loc);
> }
> 
> With this change (and the use of NORM_IGNORESYMBOLS) the test returns
> the following on Cygwin:
> 
> $ ./wcscoll_test
> "11" > "1.1" in POSIX locale
> "11" = "1.1" in en_US.UTF-8 locale
> "11" > "1 2" in POSIX locale
> "11" < "1 2" in en_US.UTF-8 locale
> 
> It still differs from Linux, but it's good enough to make the emacs test
> pass.  Moreover, this behavior actually seems more reasonable to me than
> the Linux behavior.  After all, if you're ignoring punctuation, how can
> you decide which of "11" or "1.1" comes first?

Careful.  POSIX is proposing some wording that says that normal locales
should always implement a fallback of last resort (and that locales that
do not do so should have a special name including '@', to make it
obvious).  It is not standardized yet, but worth thinking about.

http://austingroupbugs.net/view.php?id=938
http://austingroupbugs.net/view.php?id=963

The intent of that wording is that if ignoring punctuation could cause
two strings to otherwise compare equal, the fallback of a total ordering
on all characters means that the final result of strcoll() will not be 0
unless the two strings are identical.
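
One way to picture such a fallback, purely as a sketch: run a
symbol-ignoring comparison first and, only if it reports a tie, break the
tie with a plain code point comparison (wcscmp).  The result is then 0
only for identical strings.  The function names are hypothetical, the
primary comparison is a simplified stand-in rather than any real locale's
rules, and the direction of the tie-break is an arbitrary choice that need
not match glibc.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical primary pass: compare by code point, treating punctuation
   and whitespace as if they were absent.  Returns -1, 0 or 1.  */
static int
primary_compare (const wchar_t *a, const wchar_t *b)
{
  for (;;)
    {
      while (*a && (iswpunct ((wint_t) *a) || iswspace ((wint_t) *a)))
        a++;
      while (*b && (iswpunct ((wint_t) *b) || iswspace ((wint_t) *b)))
        b++;
      if (*a != *b)
        return *a < *b ? -1 : 1;
      if (*a == L'\0')
        return 0;
      a++;
      b++;
    }
}

/* Sketch of a "fallback of last resort": if the primary pass ties, fall
   back to wcscmp so that only identical strings compare equal.  */
int
coll_with_fallback (const wchar_t *a, const wchar_t *b)
{
  int res = primary_compare (a, b);
  return res != 0 ? res : wcscmp (a, b);
}
```

Here coll_with_fallback (L"11", L"1.1") is nonzero.  It happens to be
positive, because wcscmp puts '.' before '1', whereas Linux reports
"11" < "1.1"; the sketch shows the total-ordering idea, not glibc's
actual ordering.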

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


* Re: Bug in collation functions?
  2015-10-29 18:09               ` Eric Blake
@ 2015-10-29 21:58                 ` Ken Brown
  2015-10-30  8:05                   ` Ken Brown
  0 siblings, 1 reply; 17+ messages in thread
From: Ken Brown @ 2015-10-29 21:58 UTC (permalink / raw)
  To: cygwin

On 10/29/2015 12:51 PM, Eric Blake wrote:
> Careful.  POSIX is proposing some wording that says that normal locales
> should always implement a fallback of last resort (and that locales that
> do not do so should have a special name including '@', to make it
> obvious).  It is not standardized yet, but worth thinking about.
>
> http://austingroupbugs.net/view.php?id=938
> http://austingroupbugs.net/view.php?id=963
>
> The intent of that wording is that if ignoring punctuation could cause
> two strings to otherwise compare equal, the fallback of a total ordering
> on all characters means that the final result of strcoll() will not be 0
> unless the two strings are identical.

In that case, I think Cygwin should start by using NORM_IGNORESYMBOLS in 
non-POSIX locales, with the goal of eventually moving toward emulating 
glibc.  I don't know what fallback glibc uses or how hard it would be to 
implement this on Cygwin.

Here's a tangentially related issue, also motivated by a failing emacs 
test: Should setlocale return null to indicate an error if it's given an 
invalid locale name?  This happens on Linux but not on Cygwin, as the 
following modified test case shows:

$ cat wcscoll_test.c
#include <wchar.h>
#include <stdio.h>
#include <locale.h>

void
compare (const wchar_t *a, const wchar_t *b, const char *loc)
{
   if (! setlocale (LC_COLLATE, loc))
     printf ("Unable to set locale to %s\n", loc);
   else
     {
       int res = wcscoll (a, b);
       char c = res < 0 ? '<' : res > 0 ? '>' : '=';
       printf ("\"%ls\" %c \"%ls\" in %s locale\n", a, c, b, loc);
     }
}

int
main ()
{
   compare (L"11", L"1.1", "POSIX");
   compare (L"11", L"1.1", "en_US.UTF-8");
   compare (L"11", L"1 2", "POSIX");
   compare (L"11", L"1 2", "en_US.UTF-8");
   compare (L"11", L"1 2", "en_DE.UTF-8");
}

On Cygwin (with NORM_IGNORESYMBOLS), the output is

"11" > "1.1" in POSIX locale
"11" = "1.1" in en_US.UTF-8 locale
"11" > "1 2" in POSIX locale
"11" < "1 2" in en_US.UTF-8 locale
"11" < "1 2" in en_DE.UTF-8 locale

but on Linux it is

"11" > "1.1" in POSIX locale
"11" < "1.1" in en_US.UTF-8 locale
"11" > "1 2" in POSIX locale
"11" < "1 2" in en_US.UTF-8 locale
Unable to set locale to en_DE.UTF-8
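
Whichever way setlocale behaves for unknown names, a caller can defend
itself by checking the return value and restoring the previous setting
afterwards.  A minimal sketch, assuming a hypothetical helper named
`wcscoll_in_locale` (it is not a library function):

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: compare A and B with wcscoll under the LC_COLLATE
   locale NAME, then restore the previous LC_COLLATE setting.  Returns 1
   and stores the comparison in *RESULT on success; returns 0 and leaves
   *RESULT untouched if the locale cannot be selected.  */
int
wcscoll_in_locale (const wchar_t *a, const wchar_t *b,
                   const char *name, int *result)
{
  const char *cur = setlocale (LC_COLLATE, NULL);  /* query only */
  char *saved;
  int ok = 0;

  if (cur == NULL)
    return 0;
  /* Copy the name: later setlocale calls may overwrite the buffer.  */
  saved = malloc (strlen (cur) + 1);
  if (saved == NULL)
    return 0;
  strcpy (saved, cur);

  if (setlocale (LC_COLLATE, name) != NULL)
    {
      *result = wcscoll (a, b);
      ok = 1;
      setlocale (LC_COLLATE, saved);
    }
  free (saved);
  return ok;
}
```

This only helps where setlocale actually reports the failure.  On a system
that accepts a name such as "en_DE.UTF-8" without complaint, as the Cygwin
output above suggests, the helper has no way to notice that the name is
bogus.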

Ken

* Re: Bug in collation functions?
  2015-10-29 21:58                 ` Ken Brown
@ 2015-10-30  8:05                   ` Ken Brown
  2015-10-30 14:07                     ` Ken Brown
  0 siblings, 1 reply; 17+ messages in thread
From: Ken Brown @ 2015-10-30  8:05 UTC (permalink / raw)
  To: cygwin

On 10/29/2015 2:42 PM, Ken Brown wrote:
> In that case, I think Cygwin should start by using NORM_IGNORESYMBOLS in
> non-POSIX locales, with the goal of eventually moving toward emulating
> glibc.  I don't know what fallback glibc uses or how hard it would be to
> implement this on Cygwin.

I withdraw this suggestion.  I took a look at the glibc code, and I 
don't see any reasonable way for Cygwin to emulate it precisely.  On the 
other hand, I have an idea for a simple fallback.  I'll play with it a 
little and then submit a patch.

Ken

* Re: Bug in collation functions?
  2015-10-30  8:05                   ` Ken Brown
@ 2015-10-30 14:07                     ` Ken Brown
  2015-10-30 19:11                       ` Corinna Vinschen
  0 siblings, 1 reply; 17+ messages in thread
From: Ken Brown @ 2015-10-30 14:07 UTC (permalink / raw)
  To: cygwin

On 10/29/2015 5:49 PM, Ken Brown wrote:
> On 10/29/2015 2:42 PM, Ken Brown wrote:
>> On 10/29/2015 12:51 PM, Eric Blake wrote:
>>> On 10/29/2015 10:13 AM, Ken Brown wrote:
>>>
>>>> Never mind.  My test case was flawed, because it didn't check for the
>>>> possibility that wcscoll might return 0.  Here's a revised
>>>> definition of
>>>> the "compare" function:
>>>>
>>>> void
>>>> compare (const wchar_t *a, const wchar_t *b, const char *loc)
>>>> {
>>>>    setlocale (LC_COLLATE, loc);
>>>>    int res = wcscoll (a, b);
>>>>    char c = res < 0 ? '<' : res > 0 ? '>' : '=';
>>>>    printf ("\"%ls\" %c \"%ls\" in %s locale\n", a, c, b, loc);
>>>> }
>>>>
>>>> With this change (and the use of NORM_IGNORESYMBOLS) the test returns
>>>> the following on Cygwin:
>>>>
>>>> $ ./wcscoll_test
>>>> "11" > "1.1" in POSIX locale
>>>> "11" = "1.1" in en_US.UTF-8 locale
>>>> "11" > "1 2" in POSIX locale
>>>> "11" < "1 2" in en_US.UTF-8 locale
>>>>
>>>> It still differs from Linux, but it's good enough to make the emacs
>>>> test
>>>> pass.  Moreover, this behavior actually seems more reasonable to me
>>>> than
>>>> the Linux behavior.  After all, if you're ignoring punctuation, how can
>>>> you decide which of "11" or "1.1" comes first?
>>>
>>> Careful.  POSIX is proposing some wording that say that normal locales
>>> should always implement a fallback of last resort (and that locales that
>>> do not do so should have a special name including '@', to make it
>>> obvious).  It is not standardized yet, but worth thinking about.
>>>
>>> http://austingroupbugs.net/view.php?id=938
>>> http://austingroupbugs.net/view.php?id=963
>>>
>>> The intent of that wording is that if ignoring punctuation could cause
>>> two strings to otherwise compare equal, the fallback of a total ordering
>>> on all characters means that the final result of strcoll() will not be 0
>>> unless the two strings are identical.
>>
>> In that case, I think Cygwin should start by using NORM_IGNORESYMBOLS in
>> non-POSIX locales, with the goal of eventually moving toward emulating
>> glibc.  I don't know what fallback glibc uses or how hard it would be to
>> implement this on Cygwin.
>
> I withdraw this suggestion.  I took a look at the glibc code, and I
> don't see any reasonable way for Cygwin to emulate it precisely.  On the
> other hand, I have an idea for a simple fallback.  I'll play with it a
> little and then submit a patch.

The fallback I had in mind is to sort the shorter string first if they 
have different lengths and otherwise to revert to wcscmp.  Using this, 
both Cygwin and Linux give the following comparisons:

"11" > "1.1" in POSIX locale
"11" < "1.1" in en_US.UTF-8 locale
"11" > "1 2" in POSIX locale
"11" < "1.2" in en_US.UTF-8 locale
"1 1" < "1.1" in POSIX locale
"1 1" < "1.1" in en_US.UTF-8 locale

If this seems reasonable, I'll test it more extensively and then submit 
a patch.

Ken

P.S. In case others want to test this in different locales, here's the 
patch so far, just for wcscoll:

diff --git a/winsup/cygwin/nlsfuncs.cc b/winsup/cygwin/nlsfuncs.cc
index f7031f9..c33aa24 100644
--- a/winsup/cygwin/nlsfuncs.cc
+++ b/winsup/cygwin/nlsfuncs.cc
@@ -1156,10 +1156,15 @@ wcscoll (const wchar_t *__restrict ws1, const wchar_t *__restrict ws2)

    if (!collate_lcid)
      return wcscmp (ws1, ws2);
-  ret = CompareStringW (collate_lcid, 0, ws1, -1, ws2, -1);
+  ret = CompareStringW (collate_lcid, NORM_IGNORESYMBOLS, ws1, -1, ws2, -1);
    if (!ret)
      set_errno (EINVAL);
-  return ret - CSTR_EQUAL;
+  ret -= CSTR_EQUAL;
+  if (!ret)
+    ret = wcslen (ws1) - wcslen (ws2);
+  if (!ret)
+    ret = wcscmp (ws1, ws2);
+  return ret;
  }

  extern "C" int



* Re: Bug in collation functions?
  2015-10-30 14:07                     ` Ken Brown
@ 2015-10-30 19:11                       ` Corinna Vinschen
  2015-10-30 19:14                         ` Ken Brown
  0 siblings, 1 reply; 17+ messages in thread
From: Corinna Vinschen @ 2015-10-30 19:11 UTC (permalink / raw)
  To: cygwin


Hi Ken,

On Oct 29 18:21, Ken Brown wrote:
> On 10/29/2015 5:49 PM, Ken Brown wrote:
> >On 10/29/2015 2:42 PM, Ken Brown wrote:
> >>On 10/29/2015 12:51 PM, Eric Blake wrote:
> >>>Careful.  POSIX is proposing some wording that say that normal locales
> >>>should always implement a fallback of last resort (and that locales that
> >>>do not do so should have a special name including '@', to make it
> >>>obvious).  It is not standardized yet, but worth thinking about.
> >>>
> >>>http://austingroupbugs.net/view.php?id=938
> >>>http://austingroupbugs.net/view.php?id=963
> >>>
> >>>The intent of that wording is that if ignoring punctuation could cause
> >>>two strings to otherwise compare equal, the fallback of a total ordering
> >>>on all characters means that the final result of strcoll() will not be 0
> >>>unless the two strings are identical.
> >>
> >>In that case, I think Cygwin should start by using NORM_IGNORESYMBOLS in
> >>non-POSIX locales, with the goal of eventually moving toward emulating
> >>glibc.  I don't know what fallback glibc uses or how hard it would be to
> >>implement this on Cygwin.
> >
> >I withdraw this suggestion.  I took a look at the glibc code, and I
> >don't see any reasonable way for Cygwin to emulate it precisely.  On the
> >other hand, I have an idea for a simple fallback.  I'll play with it a
> >little and then submit a patch.
> 
> The fallback I had in mind is to return the shorter string if they have
> different lengths and otherwise to revert to wcscmp.  Using this, both
> Cygwin and Linux give the following comparisons:
> 
> "11" > "1.1" in POSIX locale
> "11" < "1.1" in en_US.UTF-8 locale
> "11" > "1 2" in POSIX locale
> "11" < "1.2" in en_US.UTF-8 locale
> "1 1" < "1.1" in POSIX locale
> "1 1" < "1.1" in en_US.UTF-8 locale
> 
> If this seems reasonable, I'll test it more extensively and then submit a
> patch.

I had a longer look into this suggestion and the below code and it took
me some time to find out what bugged me with it:

What about str/wcsxfrm?

Per POSIX, calling strcmp on the result of strxfrm is equivalent to
calling strcoll (analogue with wcs*).  If you extend *coll to perform an
extra check on the length, you will have cases in which the above rule
fails.  You can't perform the length test on the result of *xfrm and
expect the same result as in *coll.

In fact, when calling LCMapStringW with NORM_IGNORESYMBOLS (you would
have to do this anyway if we add this flag in *coll), the resulting
transformed strings created from the input strings "11" and "1.1" would
be identical, so a length test on the xfrm string is not meaningful at
all.

The bottom line is, afaics, we must make sure that CompareStringW and
LCMapStringW are called the same way, and their result/output has to be
returned to the caller.  Performing an extra check in *coll which can't
be reliably performed in *xfrm is not feasible.

Does that make sense?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat


* Re: Bug in collation functions?
  2015-10-30 19:11                       ` Corinna Vinschen
@ 2015-10-30 19:14                         ` Ken Brown
  2015-10-30 21:13                           ` Corinna Vinschen
       [not found]                           ` <5634F6BA.7070301@cornell.edu>
  0 siblings, 2 replies; 17+ messages in thread
From: Ken Brown @ 2015-10-30 19:14 UTC (permalink / raw)
  To: cygwin

Hi Corinna,

On 10/30/2015 8:03 AM, Corinna Vinschen wrote:
> On Oct 29 18:21, Ken Brown wrote:
>> The fallback I had in mind is to return the shorter string if they have
>> different lengths and otherwise to revert to wcscmp.
>
> I had a longer look into this suggestion and the below code and it took
> me some time to find out what bugged me with it:
>
> What about str/wcsxfrm?
>
> Per POSIX, calling strcmp on the result of strxfrm is equivalent to
> calling strcoll (analogue with wcs*).  If you extend *coll to perform an
> extra check on the length, you will have cases in which the above rule
> fails.  You can't perform the length test on the result of *xfrm and
> expect the same result as in *coll.
>
> In fact, when calling LCMapStringW with NORM_IGNORESYMBOLS (you would
> have to do this anyway if we add this flag in *coll), the resulting
> transformed strings created from the input strings "11" and "1.1" would
> be identical, so a length test on the xfrm string is not meaningful at
> all.
>
> The bottom line is, afaics, we must make sure that CompareStringW and
> LCMapStringW are called the same way, and their result/output has to be
> returned to the caller.  Performing an extra check in *coll which can't
> be reliably performed in *xfrm is not feasible.
>
> Does that make sense?

Yes, I see the problem, and I don't see a good way around it.  So I 
think we probably have to leave things as they are and live with the 
fact that we can't do comparisons that ignore whitespace and punctuation.

The alternative of allowing str/wcscoll to return 0 on unequal strings 
doesn't seem feasible in view of Eric's comments.

What about the other issue I raised: Should setlocale return null to 
indicate an error if it's given an invalid locale name like en_DE.UTF-8?

Ken


* Re: Bug in collation functions?
  2015-10-30 19:14                         ` Ken Brown
@ 2015-10-30 21:13                           ` Corinna Vinschen
       [not found]                           ` <5634F6BA.7070301@cornell.edu>
  1 sibling, 0 replies; 17+ messages in thread
From: Corinna Vinschen @ 2015-10-30 21:13 UTC (permalink / raw)
  To: cygwin


On Oct 30 10:07, Ken Brown wrote:
> Hi Corinna,
> 
> On 10/30/2015 8:03 AM, Corinna Vinschen wrote:
> >On Oct 29 18:21, Ken Brown wrote:
> >>The fallback I had in mind is to return the shorter string if they have
> >>different lengths and otherwise to revert to wcscmp.
> >
> >I had a longer look into this suggestion and the below code and it took
> >me some time to find out what bugged me with it:
> >
> >What about str/wcsxfrm?
> >
> >Per POSIX, calling strcmp on the result of strxfrm is equivalent to
> >calling strcoll (analogue with wcs*).  If you extend *coll to perform an
> >extra check on the length, you will have cases in which the above rule
> >fails.  You can't perform the length test on the result of *xfrm and
> >expect the same result as in *coll.
> >
> >In fact, when calling LCMapStringW with NORM_IGNORESYMBOLS (you would
> >have to do this anyway if we add this flag in *coll), the resulting
> >transformed strings created from the input strings "11" and "1.1" would
> >be identical, so a length test on the xfrm string is not meaningful at
> >all.
> >
> >The bottom line is, afaics, we must make sure that CompareStringW and
> >LCMapStringW are called the same way, and their result/output has to be
> >returned to the caller.  Performing an extra check in *coll which can't
> >be reliably performed in *xfrm is not feasible.
> >
> >Does that make sense?
> 
> Yes, I see the problem, and I don't see a good way around it.  So I think we
> probably have to leave things as they are and live with the fact that we
> can't do comparisons that ignore whitespace and punctuation.
> 
> The alternative of allowing str/wcscoll to return 0 on unequal strings
> doesn't seem feasible in view of Eric's comments.
> 
> What about the other issue I raised: Should setlocale return null to
> indicate an error if it's given an invalid locale name like en_DE.UTF-8?

Huh.  Interesting.  You're running Windows 10, right?  After some digging
it turns out there's a bug in W10.  LocaleNameToLCID() does *not* fail
and return with an error if it doesn't know a locale.  That would be too
simple I guess.  Rather, it returns a value LOCALE_CUSTOM_UNSPECIFIED,
0x1000.  So all unknown locales are now treated as custom locale.  Duh!
I fear the answer when trying to report this.  Probably it's a feature...

I applied a patch to work around this feature.


Thanks for the testcase, btw :)


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat


* Re: Bug in collation functions?
       [not found]                           ` <5634F6BA.7070301@cornell.edu>
@ 2015-11-02 11:14                             ` Corinna Vinschen
  0 siblings, 0 replies; 17+ messages in thread
From: Corinna Vinschen @ 2015-11-02 11:14 UTC (permalink / raw)
  To: cygwin


On Oct 31 13:13, Ken Brown wrote:
> On 10/30/2015 10:07 AM, Ken Brown wrote:
> >Hi Corinna,
> >
> >On 10/30/2015 8:03 AM, Corinna Vinschen wrote:
> >>On Oct 29 18:21, Ken Brown wrote:
> >>>The fallback I had in mind is to return the shorter string if they have
> >>>different lengths and otherwise to revert to wcscmp.
> > >
> >>I had a longer look into this suggestion and the below code and it took
> >>me some time to find out what bugged me with it:
> >>
> >>What about str/wcsxfrm?
> >>
> >>Per POSIX, calling strcmp on the result of strxfrm is equivalent to
> >>calling strcoll (analogue with wcs*).  If you extend *coll to perform an
> >>extra check on the length, you will have cases in which the above rule
> >>fails.  You can't perform the length test on the result of *xfrm and
> >>expect the same result as in *coll.
> >>
> >>In fact, when calling LCMapStringW with NORM_IGNORESYMBOLS (you would
> >>have to do this anyway if we add this flag in *coll), the resulting
> >>transformed strings created from the input strings "11" and "1.1" would
> >>be identical, so a length test on the xfrm string is not meaningful at
> >>all.
> >>
> >>The bottom line is, afaics, we must make sure that CompareStringW and
> >>LCMapStringW are called the same way, and their result/output has to be
> >>returned to the caller.  Performing an extra check in *coll which can't
> >>be reliably performed in *xfrm is not feasible.
> >>
> >>Does that make sense?
> >
> >Yes, I see the problem, and I don't see a good way around it.  So I
> >think we probably have to leave things as they are and live with the
> >fact that we can't do comparisons that ignore whitespace and punctuation.
> >
> >The alternative of allowing str/wcscoll to return 0 on unequal strings
> >doesn't seem feasible in view of Eric's comments.
> 
> I have one other idea.  What would you think of defining a function
> cygwin_strcoll that's like strcoll but with an extra bool parameter
> 'ignoresymbols'?  If ignoresymbols = false, this would be the same as
> strcoll.  If ignoresymbols = true, this would use NORM_IGNORESYMBOLS with
> the fallback I suggested.
> 
> That way applications that prefer to be more glibc-compatible and don't need
> strxfrm could do something like
> 
>   #define strcoll(A,B) cygwin_strcoll ((A), (B), true)
> 
> If you think this is reasonable, I'll submit a patch.  If not, no problem.

No, I don't think this is feasible.  Given Eric's comments, can an
application ever expect that strcoll behaves exactly as on Linux?  For
portability reasons, it has to expect different results on different
platforms.  Only if the result is POSIXly incorrect, it makes sense to
fix the behaviour, IMHO.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat


end of thread, other threads:[~2015-11-02 11:14 UTC | newest]

Thread overview: 17+ messages
2015-10-29  7:41 Bug in collation functions? Ken Brown
2015-10-29  7:50 ` Eric Blake
2015-10-29 12:58   ` Corinna Vinschen
2015-10-29 15:35     ` Corinna Vinschen
2015-10-29 15:51       ` Ken Brown
2015-10-29 16:14         ` Corinna Vinschen
2015-10-29 16:14           ` Ken Brown
2015-10-29 16:51             ` Ken Brown
2015-10-29 18:09               ` Eric Blake
2015-10-29 21:58                 ` Ken Brown
2015-10-30  8:05                   ` Ken Brown
2015-10-30 14:07                     ` Ken Brown
2015-10-30 19:11                       ` Corinna Vinschen
2015-10-30 19:14                         ` Ken Brown
2015-10-30 21:13                           ` Corinna Vinschen
     [not found]                           ` <5634F6BA.7070301@cornell.edu>
2015-11-02 11:14                             ` Corinna Vinschen
2015-10-29 16:17           ` Eric Blake
