public inbox for cygwin@cygwin.com
* newlocale: Linux incompatibility
@ 2023-03-23 19:48 Ken Brown
  2023-03-23 20:09 ` Thomas Wolff
  2023-03-23 21:14 ` Corinna Vinschen
  0 siblings, 2 replies; 11+ messages in thread
From: Ken Brown @ 2023-03-23 19:48 UTC (permalink / raw)
  To: cygwin

I'm reporting this here rather than the newlib list because the behavior 
is compatible with Posix but not Linux, so I think it's a Cygwin issue.

Consider the following test case:

$ cat locale_test.c
#include <stdio.h>
#include <locale.h>

int main ()
{
   const char *locale = "en_DE.UTF-8";
   locale_t loc = newlocale (LC_COLLATE_MASK | LC_CTYPE_MASK, locale, 0);
   if (!loc)
     perror ("newlocale");
   else
     printf ("newlocale succeeded on invalid locale %s\n", locale);
}

$ gcc -o locale_test locale_test.c

$ ./locale_test.exe
newlocale succeeded on invalid locale en_DE.UTF-8

On Linux, the newlocale call fails with ENOENT, as is documented on the 
man page.  Posix doesn't say what should happen on an invalid locale, so 
this is not, strictly speaking, a bug.

Ken

P.S. I noticed this because of a failing Emacs test.  No one else has 
reported this test failure, so it seems that newlocale fails on an 
invalid locale on all platforms supported by Emacs other than Cygwin.


* Re: newlocale: Linux incompatibility
  2023-03-23 19:48 newlocale: Linux incompatibility Ken Brown
@ 2023-03-23 20:09 ` Thomas Wolff
  2023-03-23 21:14 ` Corinna Vinschen
  1 sibling, 0 replies; 11+ messages in thread
From: Thomas Wolff @ 2023-03-23 20:09 UTC (permalink / raw)
  To: cygwin


On 23.03.2023 at 20:48, Ken Brown via Cygwin wrote:
> I'm reporting this here rather than the newlib list because the 
> behavior is compatible with Posix but not Linux, so I think it's a 
> Cygwin issue.
>
> Consider the following test case:
>
> $ cat locale_test.c
> #include <stdio.h>
> #include <locale.h>
>
> int main ()
> {
>   const char *locale = "en_DE.UTF-8";
>   locale_t loc = newlocale (LC_COLLATE_MASK | LC_CTYPE_MASK, locale, 0);
>   if (!loc)
>     perror ("newlocale");
>   else
>     printf ("newlocale succeeded on invalid locale %s\n", locale);
> }
>
> $ gcc -o locale_test locale_test.c
>
> $ ./locale_test.exe
> newlocale succeeded on invalid locale en_DE.UTF-8
>
> On Linux, the newlocale call fails with ENOENT, as is documented on 
> the man page.  Posix doesn't say what should happen on an invalid 
> locale, so this is not, strictly speaking, a bug.
So the question is what counts as an invalid locale. On Linux, a locale is 
valid only if it is explicitly listed somewhere.
This strict behaviour can be a problem. A much better approach is to 
allow any combination of a known language_REGION tag with an encoding 
indication, which is much more flexible and dynamic.
So if such combinations are considered legal, as in Cygwin, this is not 
a bug.

>
> Ken
>
> P.S. I noticed this because of a failing Emacs test.  No one else has 
> reported this test failure, so it seems that newlocale fails on an 
> invalid locale on all platforms supported by Emacs other than Cygwin.


* Re: newlocale: Linux incompatibility
  2023-03-23 19:48 newlocale: Linux incompatibility Ken Brown
  2023-03-23 20:09 ` Thomas Wolff
@ 2023-03-23 21:14 ` Corinna Vinschen
  2023-03-24 12:18   ` Corinna Vinschen
  1 sibling, 1 reply; 11+ messages in thread
From: Corinna Vinschen @ 2023-03-23 21:14 UTC (permalink / raw)
  To: cygwin

On Mar 23 15:48, Ken Brown via Cygwin wrote:
> I'm reporting this here rather than the newlib list because the behavior is
> compatible with Posix but not Linux, so I think it's a Cygwin issue.

Actually, it's a Windows issue :)

> Consider the following test case:
> 
> $ cat locale_test.c
> #include <stdio.h>
> #include <locale.h>
> 
> int main ()
> {
>   const char *locale = "en_DE.UTF-8";
>   locale_t loc = newlocale (LC_COLLATE_MASK | LC_CTYPE_MASK, locale, 0);
>   if (!loc)
>     perror ("newlocale");
>   else
>     printf ("newlocale succeeded on invalid locale %s\n", locale);
> }
> 
> $ gcc -o locale_test locale_test.c
> 
> $ ./locale_test.exe
> newlocale succeeded on invalid locale en_DE.UTF-8
> 
> On Linux, the newlocale call fails with ENOENT, as is documented on the man
> page.  Posix doesn't say what should happen on an invalid locale, so this is
> not, strictly speaking, a bug.

Three bugs in fact.

First, it's a bug in the Emacs testsuite.  The test simply assumes that
there's no en_DE locale on any system, but that's just not true.
Windows supports the RFC 5646 locale "en-DE", which is called "English
(Germany)" in the "Region" settings.

You can also check with `locale -av | less' and search for en_DE.

For the remainder of this mail, I assume you're talking about Cygwin 3.5.
I won't fix this for 3.4 anymore, given how much locale handling has
changed for 3.5.

The second bug is that Cygwin blindly trusts the Windows function
ResolveLocaleName().  That function blatantly converts even vaguely
similar locales into something it supports.  E.g., it converts "en-XY"
to "en-US".  I.e., even if you use "en_XY.utf8" as locale, the above
testcase will wrongly succeed.  So I have to rethink how I resolve POSIX
locales to Windows locales.

And the third bug is that Cygwin fails to set errno if it doesn't
support a locale, but that's a minor inconvenience in comparison.

Thanks for the report, I totally missed the above problem with
ResolveLocaleName.
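
The difference between fuzzy resolution and a strict lookup can be
illustrated with a minimal sketch in C.  This is not the code that went
into Cygwin; the table known_tags and the helper locale_tag_is_known are
hypothetical names made up for illustration, and a real implementation
would take its list from the operating system.  The comparison is also
case-sensitive for simplicity, although RFC 5646 tags are not.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list of supported RFC 5646 tags (illustration only). */
static const char *const known_tags[] = { "de-DE", "en-DE", "en-GB", "en-US" };

/* Return true only if `tag` matches a known tag exactly.  There is no
   fuzzy fallback, so "en-XY" is rejected rather than mapped to "en-US". */
bool
locale_tag_is_known (const char *tag)
{
  for (size_t i = 0; i < sizeof known_tags / sizeof known_tags[0]; i++)
    if (strcmp (tag, known_tags[i]) == 0)
      return true;
  return false;
}
```

With this table, locale_tag_is_known ("en-DE") is true and
locale_tag_is_known ("en-XY") is false.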


Corinna


* Re: newlocale: Linux incompatibility
  2023-03-23 21:14 ` Corinna Vinschen
@ 2023-03-24 12:18   ` Corinna Vinschen
  2023-03-24 13:57     ` Ken Brown
  2023-03-24 22:49     ` Brian Inglis
  0 siblings, 2 replies; 11+ messages in thread
From: Corinna Vinschen @ 2023-03-24 12:18 UTC (permalink / raw)
  To: cygwin

On Mar 23 22:14, Corinna Vinschen via Cygwin wrote:
> On Mar 23 15:48, Ken Brown via Cygwin wrote:
> > I'm reporting this here rather than the newlib list because the behavior is
> > compatible with Posix but not Linux, so I think it's a Cygwin issue.
> 
> Actually, it's a Windows issue :)
> 
> > Consider the following test case:
> > 
> > $ cat locale_test.c
> > #include <stdio.h>
> > #include <locale.h>
> > 
> > int main ()
> > {
> >   const char *locale = "en_DE.UTF-8";
> >   locale_t loc = newlocale (LC_COLLATE_MASK | LC_CTYPE_MASK, locale, 0);
> >   if (!loc)
> >     perror ("newlocale");
> >   else
> >     printf ("newlocale succeeded on invalid locale %s\n", locale);
> > }
> > 
> > $ gcc -o locale_test locale_test.c
> > 
> > $ ./locale_test.exe
> > newlocale succeeded on invalid locale en_DE.UTF-8
> > 
> > On Linux, the newlocale call fails with ENOENT, as is documented on the man
> > page.  Posix doesn't say what should happen on an invalid locale, so this is
> > not, strictly speaking, a bug.
> 
> Three bugs in fact.
> 
> First, it's a bug in the Emacs testsuite.  The test simply assumes that
> there's no en_DE locale on any system, but that's just not true.
> Windows supports the RFC 5646 locale "en-DE", which is called "English
> (Germany)" in the "Region" settings.
> 
> You can also check with `locale -av | less' and search for en_DE.
> 
> For the remainder of this mail, I assume you're talking about Cygwin 3.5.
> I won't fix this for 3.4 anymore, given how much locale handling has
> changed for 3.5.
> 
> The second bug is that Cygwin blindly trusts the Windows function
> ResolveLocaleName().  That function blatantly converts even vaguely
> similar locales into something it supports.  E.g., it converts "en-XY"
> to "en-US".  I.e., even if you use "en_XY.utf8" as locale, the above
> testcase will wrongly succeed.  So I have to rethink how I resolve POSIX
> locales to Windows locales.
> 
> And the third bug is that Cygwin fails to set errno if it doesn't
> support a locale, but that's a minor inconvenience in comparison.
> 
> Thanks for the report, I totally missed the above problem with
> ResolveLocaleName.

I pushed a couple of patches which hopefully clean up the code.  It's
really frustrating how these Windows locale functions work.  Or, rather,
not work.  I mean, come on...

- ResolveLocaleName() resolves "ff-BF" to "ff-Latn-SN", not to
  "ff-Adlm-BF" or "ff-Latn-BF", even though both exist.  

- There's a locale called "sd-Arab-PK" and a locale "sd-Deva-IN".  If
  you ask for the script used in "sd-IN", the result is "Arab", not
  "Deva".

/*facepalm*/

I had to create a replacement function for ResolveLocaleName which
doesn't return totally screwy and unexpected results, and special case
two more locales in /proc/locales output so the output makes sense.

Oh, and I added error handling to the code so newlocale is now able to
set errno to ENOENT if the locale is not supported.
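
For callers who want to tell "failed with an errno value" apart from
"failed without touching errno", here is a minimal sketch in C.  The
helper name check_locale and its return convention are assumptions made
for illustration; only newlocale, freelocale and errno are standard.  It
assumes a libc that declares newlocale in <locale.h>, such as glibc or
Cygwin.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <locale.h>

/* Try to create a locale object for `name`.
   Returns 0 on success, the errno value if newlocale failed and set errno,
   or -1 if newlocale failed but left errno untouched. */
int
check_locale (const char *name)
{
  errno = 0;  /* so a stale value is not mistaken for newlocale's error */
  locale_t loc = newlocale (LC_COLLATE_MASK | LC_CTYPE_MASK, name,
                            (locale_t) 0);
  if (loc == (locale_t) 0)
    return errno != 0 ? errno : -1;
  freelocale (loc);
  return 0;
}
```

A caller can then compare the result against ENOENT.  Which names are
rejected depends on the platform and on the installed locales.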

If you want to test this, the changes are in test release
3.5.0-0.260.gb5b67a65f87c, which is just building.


HTH,
Corinna


* Re: newlocale: Linux incompatibility
  2023-03-24 12:18   ` Corinna Vinschen
@ 2023-03-24 13:57     ` Ken Brown
  2023-03-24 14:44       ` Corinna Vinschen
  2023-03-24 22:49     ` Brian Inglis
  1 sibling, 1 reply; 11+ messages in thread
From: Ken Brown @ 2023-03-24 13:57 UTC (permalink / raw)
  To: cygwin

On 3/24/2023 8:18 AM, Corinna Vinschen via Cygwin wrote:
> On Mar 23 22:14, Corinna Vinschen via Cygwin wrote:
>> On Mar 23 15:48, Ken Brown via Cygwin wrote:
>>> Consider the following test case:
>>>
>>> $ cat locale_test.c
>>> #include <stdio.h>
>>> #include <locale.h>
>>>
>>> int main ()
>>> {
>>>    const char *locale = "en_DE.UTF-8";
>>>    locale_t loc = newlocale (LC_COLLATE_MASK | LC_CTYPE_MASK, locale, 0);
>>>    if (!loc)
>>>      perror ("newlocale");
>>>    else
>>>      printf ("newlocale succeeded on invalid locale %s\n", locale);
>>> }
>>>
>>> $ gcc -o locale_test locale_test.c
>>>
>>> $ ./locale_test.exe
>>> newlocale succeeded on invalid locale en_DE.UTF-8
>>>
>>> On Linux, the newlocale call fails with ENOENT, as is documented on the man
>>> page.
>> Three bugs in fact.
>>
>> First, it's a bug in the Emacs testsuite.  The test simply assumes that
>> there's no en_DE locale on any system, but that's just not true.
>> Windows supports the RFC 5646 locale "en-DE", which is called "English
>> (Germany)" in the "Region" settings.
>>
>> You can also check with `locale -av | less' and search for en_DE.
>>
>> For the remainder of this mail, I assume you're talking about Cygwin 3.5.
>> I won't fix this for 3.4 anymore, given how much locale handling has
>> changed for 3.5.
>>
>> The second bug is that Cygwin blindly trusts the Windows function
>> ResolveLocaleName().  That function blatantly converts even vaguely
>> similar locales into something it supports.  E.g., it converts "en-XY"
>> to "en-US".  I.e., even if you use "en_XY.utf8" as locale, the above
>> testcase will wrongly succeed.  So I have to rethink how I resolve POSIX
>> locales to Windows locales.
>>
>> And the third bug is that Cygwin fails to set errno if it doesn't
>> support a locale, but that's a minor inconvenience in comparison.
>>
>> Thanks for the report, I totally missed the above problem with
>> ResolveLocaleName.
> 
> I pushed a couple of patches which hopefully clean up the code. 
> 
> I had to create a replacement function for ResolveLocaleName which
> doesn't return totally screwy and unexpected results, and special case
> two more locales in /proc/locales output so the output makes sense.
> 
> Oh, and I added error handling to the code so newlocale is now able to
> set errno to ENOENT if the locale is not supported.
> 
> If you want to test this, the changes are in test release
> 3.5.0-0.260.gb5b67a65f87c, which is just building.

That was fast!  I can confirm that newlocale now fails with ENOENT on 
the invalid locale en_XY.utf8.

Thanks.

Ken


* Re: newlocale: Linux incompatibility
  2023-03-24 13:57     ` Ken Brown
@ 2023-03-24 14:44       ` Corinna Vinschen
  0 siblings, 0 replies; 11+ messages in thread
From: Corinna Vinschen @ 2023-03-24 14:44 UTC (permalink / raw)
  To: cygwin

On Mar 24 09:57, Ken Brown via Cygwin wrote:
> On 3/24/2023 8:18 AM, Corinna Vinschen via Cygwin wrote:
> > On Mar 23 22:14, Corinna Vinschen via Cygwin wrote:
> > > On Mar 23 15:48, Ken Brown via Cygwin wrote:
> > > > Consider the following test case:
> > > > 
> > > > $ cat locale_test.c
> > > > #include <stdio.h>
> > > > #include <locale.h>
> > > > 
> > > > int main ()
> > > > {
> > > >    const char *locale = "en_DE.UTF-8";
> > > >    locale_t loc = newlocale (LC_COLLATE_MASK | LC_CTYPE_MASK, locale, 0);
> > > >    if (!loc)
> > > >      perror ("newlocale");
> > > >    else
> > > >      printf ("newlocale succeeded on invalid locale %s\n", locale);
> > > > }
> > > > 
> > > > $ gcc -o locale_test locale_test.c
> > > > 
> > > > $ ./locale_test.exe
> > > > newlocale succeeded on invalid locale en_DE.UTF-8
> > > > 
> > > > On Linux, the newlocale call fails with ENOENT, as is documented on the man
> > > > page.
> > > Three bugs in fact.
> > > [...]
> > I pushed a couple of patches which hopefully clean up the code.
> > [...]
> > If you want to test this, the changes are in test release
> > 3.5.0-0.260.gb5b67a65f87c, which is just building.
> 
> That was fast!  I can confirm that newlocale now fails with ENOENT on the
> invalid locale en_XY.utf8.

Thanks for testing!


Corinna


* Re: newlocale: Linux incompatibility
  2023-03-24 12:18   ` Corinna Vinschen
  2023-03-24 13:57     ` Ken Brown
@ 2023-03-24 22:49     ` Brian Inglis
  2023-03-25 11:49       ` Corinna Vinschen
  1 sibling, 1 reply; 11+ messages in thread
From: Brian Inglis @ 2023-03-24 22:49 UTC (permalink / raw)
  To: cygwin

On 2023-03-24 06:18, Corinna Vinschen via Cygwin wrote:
> On Mar 23 22:14, Corinna Vinschen via Cygwin wrote:
>> On Mar 23 15:48, Ken Brown via Cygwin wrote:
>>> I'm reporting this here rather than the newlib list because the behavior is
>>> compatible with Posix but not Linux, so I think it's a Cygwin issue.
>>
>> Actually, it's a Windows issue :)
>>
>>> Consider the following test case:
>>>
>>> $ cat locale_test.c
>>> #include <stdio.h>
>>> #include <locale.h>
>>>
>>> int main ()
>>> {
>>>    const char *locale = "en_DE.UTF-8";
>>>    locale_t loc = newlocale (LC_COLLATE_MASK | LC_CTYPE_MASK, locale, 0);
>>>    if (!loc)
>>>      perror ("newlocale");
>>>    else
>>>      printf ("newlocale succeeded on invalid locale %s\n", locale);
>>> }
>>>
>>> $ gcc -o locale_test locale_test.c
>>>
>>> $ ./locale_test.exe
>>> newlocale succeeded on invalid locale en_DE.UTF-8
>>>
>>> On Linux, the newlocale call fails with ENOENT, as is documented on the man
>>> page.  Posix doesn't say what should happen on an invalid locale, so this is
>>> not, strictly speaking, a bug.
>>
>> Three bugs in fact.
>>
>> First, it's a bug in the Emacs testsuite.  The test simply assumes that
>> there's no en_DE locale on any system, but that's just not true.
>> Windows supports the RFC 5646 locale "en-DE", which is called "English
>> (Germany)" in the "Region" settings.
>>
>> You can also check with `locale -av | less' and search for en_DE.
>>
>> For the remainder of this mail, I assume you're talking about Cygwin 3.5.
>> I won't fix this for 3.4 anymore, given how much locale handling has
>> changed for 3.5.
>>
>> The second bug is that Cygwin blindly trusts the Windows function
>> ResolveLocaleName().  That function blatantly converts even vaguely
>> similar locales into something it supports.  E.g., it converts "en-XY"
>> to "en-US".  I.e., even if you use "en_XY.utf8" as locale, the above
>> testcase will wrongly succeed.  So I have to rethink how I resolve POSIX
>> locales to Windows locales.

Does Windows even consider https://www.rfc-editor.org/rfc/rfc4647 "Matching of 
Language Tags", part of https://www.rfc-editor.org/info/bcp47 "Language Tags", 
and if POSIX only matches exactly, will LANGUAGE be able to be used for fallback?

I currently define LANGUAGE=en_CA:en_GB:en in case en-CA is unsupported by 
anything.
[I use my own en-CA locale not the glibc default created by https://rap.dk/.]

Will "-" be supported like "_" as a separator in values?

>> And the third bug is that Cygwin fails to set errno if it doesn't
>> support a locale, but that's a minor inconvenience in comparison.
>>
>> Thanks for the report, I totally missed the above problem with
>> ResolveLocaleName.
> 
> I pushed a couple of patches which hopefully clean up the code.  It's
> really frustrating how these Windows locale functions work.  Or, rather,
> not work.  I mean, come on...
> 
> - ResolveLocaleName() resolves "ff-BF" to "ff-Latn-SN", not to
>    "ff-Adlm-BF" or "ff-Latn-BF", even though both exist.
> 
> - There's a locale called "sd-Arab-PK" and a locale "sd-Deva-IN".  If
>    you ask for the script used in "sd-IN", the result is "Arab", not
>    "Deva".
>
> I had to create a replacement function for ResolveLocaleName which
> doesn't return totally screwy and unexpected results, and special case
> two more locales in /proc/locales output so the output makes sense.

Aha, /proc/locales is a nice new 3.5.0 feature, as is /proc/codesets.  Does 
the latter list charsets, e.g. ISO-10646, rather than encodings, e.g. UTF-8?

FYI, Google fixed their English L10N to fall back to en-GB except in US territories:

https://developer.android.com/guide/topics/resources/multilingual-support#postN
https://issuetracker.google.com/issues/64429534#comment6

and there have been similar issues posted for other languages.

> Oh, and I added error handling to the code so newlocale is now able to
> set errno to ENOENT if the locale is not supported.
> 
> If you want to test this, the changes are in test release
> 3.5.0-0.260.gb5b67a65f87c, which is just building.
-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                 -- Antoine de Saint-Exupéry


* Re: newlocale: Linux incompatibility
  2023-03-24 22:49     ` Brian Inglis
@ 2023-03-25 11:49       ` Corinna Vinschen
  2023-03-25 19:03         ` Brian Inglis
  0 siblings, 1 reply; 11+ messages in thread
From: Corinna Vinschen @ 2023-03-25 11:49 UTC (permalink / raw)
  To: cygwin

On Mar 24 16:49, Brian Inglis via Cygwin wrote:
> On 2023-03-24 06:18, Corinna Vinschen via Cygwin wrote:
> > > First, it's a bug in the Emacs testsuite.  The test simply assumes that
> > > there's no en_DE locale on any system, but that's just not true.
> > > Windows supports the RFC 5646 locale "en-DE", which is called "English
> > > (Germany)" in the "Region" settings.
> > > 
> > > You can also check with `locale -av | less' and search for en_DE.
> > > 
> > > For the remainder of this mail, I assume you're talking about Cygwin 3.5.
> > > I won't fix this for 3.4 anymore, given how much locale handling has
> > > changed for 3.5.
> > > 
> > > The second bug is that Cygwin blindly trusts the Windows function
> > > ResolveLocaleName().  That function blatantly converts even vaguely
> > > similar locales into something it supports.  E.g., it converts "en-XY"
> > > to "en-US".  I.e., even if you use "en_XY.utf8" as locale, the above
> > > testcase will wrongly succeed.  So I have to rethink how I resolve POSIX
> > > locales to Windows locales.
> 
> Does Windows even consider https://www.rfc-editor.org/rfc/rfc4647 "Matching
> of Language Tags", part of https://www.rfc-editor.org/info/bcp47 "Language
> Tags", and if POSIX only matches exactly, will LANGUAGE be able to be used
> for fallback?

I never heard about an environment variable called LANGUAGE.  This is
about LANG/LC_ALL/LC_whatever, so POSIX syntax is required...

> I currently define LANGUAGE=en_CA:en_GB:en in case en-CA is unsupported by
> anything.
> [I use my own en-CA locale not the glibc default created by https://rap.dk/.]
> 
> Will "-" be supported like "_" as a separator in values?

In Cygwin?  No.  The POSIX syntax is required, it's converted into
a matching Windows RFC 5646 locale internally.
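
To make the direction of that conversion concrete, here is a rough
sketch in C of the simplest possible mapping from a POSIX locale name to
an RFC 5646 style tag.  The function name posix_to_rfc5646 is
hypothetical and this is not Cygwin's actual conversion: it only drops
the codeset and modifier and turns underscores into hyphens, whereas a
real implementation also has to deal with modifiers that select a script
and with validating the result.

```c
#include <stddef.h>
#include <string.h>

/* Convert "lang_TERRITORY[.codeset][@modifier]" into "lang-TERRITORY".
   Returns 0 on success, or -1 if the name is empty or `out` is too small. */
int
posix_to_rfc5646 (const char *posix_name, char *out, size_t outsize)
{
  /* The language/territory part ends where the codeset or modifier starts. */
  size_t len = strcspn (posix_name, ".@");
  if (len == 0 || len >= outsize)
    return -1;
  for (size_t i = 0; i < len; i++)
    out[i] = posix_name[i] == '_' ? '-' : posix_name[i];
  out[len] = '\0';
  return 0;
}
```

For example, "en_DE.UTF-8" becomes "en-DE".  The reverse direction,
accepting "en-DE" as input in LANG, is exactly what is not supported.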

> > > And the third bug is that Cygwin fails to set errno if it doesn't
> > > support a locale, but that's a minor inconvenience in comparison.
> > > 
> > > Thanks for the report, I totally missed the above problem with
> > > ResolveLocaleName.
> > 
> > I pushed a couple of patches which hopefully clean up the code.  It's
> > really frustrating how these Windows locale functions work.  Or, rather,
> > not work.  I mean, come on...
> > 
> > - ResolveLocaleName() resolves "ff-BF" to "ff-Latn-SN", not to
> >    "ff-Adlm-BF" or "ff-Latn-BF", even though both exist.
> > 
> > - There's a locale called "sd-Arab-PK" and a locale "sd-Deva-IN".  If
> >    you ask for the script used in "sd-IN", the result is "Arab", not
> >    "Deva".
> >
> > I had to create a replacement function for ResolveLocaleName which
> > doesn't return totally screwy and unexpected results, and special case
> > two more locales in /proc/locales output so the output makes sense.
> 
> Aha - a nice new 3.5.0 feature - as well as /proc/codesets - is that
> charsets e.g. ISO-10646, etc. rather than encodings e.g. UTF-8, etc.!

It's a list of what you can use as codeset in $LANG and friends as in

  LC_CTYPE=lang_TERRITORY.codeset@modifier


Corinna


* Re: newlocale: Linux incompatibility
  2023-03-25 11:49       ` Corinna Vinschen
@ 2023-03-25 19:03         ` Brian Inglis
  2023-03-25 21:19           ` Corinna Vinschen
  2023-03-25 21:26           ` Corinna Vinschen
  0 siblings, 2 replies; 11+ messages in thread
From: Brian Inglis @ 2023-03-25 19:03 UTC (permalink / raw)
  To: cygwin

On 2023-03-25 05:49, Corinna Vinschen via Cygwin wrote:
> On Mar 24 16:49, Brian Inglis via Cygwin wrote:
>> On 2023-03-24 06:18, Corinna Vinschen via Cygwin wrote:
>>>> First, it's a bug in the Emacs testsuite.  The test simply assumes that
>>>> there's no en_DE locale on any system, but that's just not true.
>>>> Windows supports the RFC 5646 locale "en-DE", which is called "English
>>>> (Germany)" in the "Region" settings.
>>>> You can also check with `locale -av | less' and search for en_DE.
>>>> For the remainder of this mail, I assume you're talking about Cygwin 3.5.
>>>> I won't fix this for 3.4 anymore, given how much locale handling has
>>>> changed for 3.5.
>>>> The second bug is that Cygwin blindly trusts the Windows function
>>>> ResolveLocaleName().  That function blatantly converts even vaguely
>>>> similar locales into something it supports.  E.g., it converts "en-XY"
>>>> to "en-US".  I.e., even if you use "en_XY.utf8" as locale, the above
>>>> testcase will wrongly succeed.  So I have to rethink how I resolve POSIX
>>>> locales to Windows locales.

>> Does Windows even consider https://www.rfc-editor.org/rfc/rfc4647 "Matching
>> of Language Tags", part of https://www.rfc-editor.org/info/bcp47 "Language
>> Tags", and if POSIX only matches exactly, will LANGUAGE be able to be used
>> for fallback?

> I never heard about an environment variable called LANGUAGE.  This is
> about LANG/LC_ALL/LC_whatever, so POSIX syntax is required...

Used by gettext:

https://www.gnu.org/software/gettext/manual/html_node/The-LANGUAGE-variable.html

and, FYI, LINGUAS, for controlling, documenting, or limiting translations:

https://www.gnu.org/software/gettext/manual/html_node/po_002fLINGUAS.html
https://www.gnu.org/software/gettext/manual/html_node/Installers.html

as POSIX punts a lot of locale handling into the (hand-waving) 
implementation-defined bucket, where gettext is the primary implementation.

>> I currently define LANGUAGE=en_CA:en_GB:en in case en-CA is unsupported by
>> anything.
>> [I use my own en-CA locale not the glibc default created by https://rap.dk/.]
>> Will "-" be supported like "_" as a separator in values?

> In Cygwin?  No.  The POSIX syntax is required, it's converted into
> a matching Windows RFC 5646 locale internally.

>>>> And the third bug is that Cygwin fails to set errno if it doesn't
>>>> support a locale, but that's a minor inconvenience in comparison.
>>>> Thanks for the report, I totally missed the above problem with
>>>> ResolveLocaleName.

>>> I pushed a couple of patches which hopefully clean up the code.  It's
>>> really frustrating how these Windows locale functions work.  Or, rather,
>>> not work.  I mean, come on...
>>> - ResolveLocaleName() resolves "ff-BF" to "ff-Latn-SN", not to
>>>     "ff-Adlm-BF" or "ff-Latn-BF", even though both exist.
>>> - There's a locale called "sd-Arab-PK" and a locale "sd-Deva-IN".  If
>>>     you ask for the script used in "sd-IN", the result is "Arab", not
>>>     "Deva".
>>> I had to create a replacement function for ResolveLocaleName which
>>> doesn't return totally screwy and unexpected results, and special case
>>> two more locales in /proc/locales output so the output makes sense.

>> Aha - a nice new 3.5.0 feature - as well as /proc/codesets - is that
>> charsets e.g. ISO-10646, etc. rather than encodings e.g. UTF-8, etc.!

> It's a list of what you can use as codeset in $LANG and friends as in
>    LC_CTYPE=lang_TERRITORY.codeset@modifier

You are using codeset to mean encoding, whereas in Unicode and W3 it usually 
means coded character set/charset; it can also mean charmap; see iconv(1):

	https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html

Further confused by codeset definition:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_99

linking to:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_02

which says POSIX "provides no means of defining a wide-character codeset", 
implying that encodings such as UCS-2/UTF-16 and UCS-4/UTF-32 cannot be 
specified and require a non-POSIX approach to conversion.

Also IBM uses codeset to distinguish between EBCDIC and ASCII encodings.

Adding to the confusion ISO uses codeset to refer generically to each set of 
codes supported by each part of ISO-639-1/2/3/5, ISO-3166-1/2/3, and ISO-15924, 
as well as ISO-8859-1...16.

I get no hits from RFCs.

To avoid ambiguity and reduce possible confusion, it may be better to name this 
charmaps as used in locale(1), and produced by locale -m with the same apparent 
content?
It looks like /proc/locales contains the same content as produced by locale -a?

JM2c ;^>

-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                 -- Antoine de Saint-Exupéry


* Re: newlocale: Linux incompatibility
  2023-03-25 19:03         ` Brian Inglis
@ 2023-03-25 21:19           ` Corinna Vinschen
  2023-03-25 21:26           ` Corinna Vinschen
  1 sibling, 0 replies; 11+ messages in thread
From: Corinna Vinschen @ 2023-03-25 21:19 UTC (permalink / raw)
  To: cygwin

On Mar 25 13:03, Brian Inglis via Cygwin wrote:
> On 2023-03-25 05:49, Corinna Vinschen via Cygwin wrote:
> > On Mar 24 16:49, Brian Inglis via Cygwin wrote:
> > I never heard about an environment variable called LANGUAGE.  This is
> > about LANG/LC_ALL/LC_whatever, so POSIX syntax is required...
> 
> Used by gettext:
> 
> https://www.gnu.org/software/gettext/manual/html_node/The-LANGUAGE-variable.html

Ok, I'm not using that because I didn't even know that.  But I'm not
sure why you even mention it, it has nothing to do with Cygwin's
locale implementation which is based on the POSIX definitions.
Exception here is where the data comes from since we don't maintain
locale definition files and thus we don't follow
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html
to the letter.

> > > Aha - a nice new 3.5.0 feature - as well as /proc/codesets - is that
> > > charsets e.g. ISO-10646, etc. rather than encodings e.g. UTF-8, etc.!
> 
> > It's a list of what you can use as codeset in $LANG and friends as in
> >    LC_CTYPE=lang_TERRITORY.codeset@modifier
> 
> You are using codeset to mean encoding, whereas in Unicode and W3 it usually
> means coded character set/charset; it can also mean charmap; see iconv(1):

I'm using the POSIX definition here.  Codeset is codeset, as in
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html

Quote:

  If the locale value has the form:

  language[_territory][.codeset]

  it refers to an implementation-provided locale, where settings of
  language, territory, and codeset are implementation-defined.

So I'm using the name "codesets" to follow POSIX documentation for
setting the matching locale environment variables, exactly to avoid
confusion.
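
To show where the codeset sits in such a value, here is a small sketch
in C that splits a locale name of the form
language[_territory][.codeset][@modifier] into its fields.  The struct
locale_parts, the helper copy_part and the function parse_locale_name
are hypothetical names for illustration, and the fixed buffer sizes are
arbitrary assumptions.  The sketch only splits the string; it does not
check whether any field is actually supported.

```c
#include <stddef.h>
#include <string.h>

struct locale_parts
{
  char language[16];
  char territory[16];
  char codeset[32];
  char modifier[16];
};

/* Copy `len` bytes of `src` into `dst` (capacity `cap`); -1 if too long. */
static int
copy_part (char *dst, size_t cap, const char *src, size_t len)
{
  if (len >= cap)
    return -1;
  memcpy (dst, src, len);
  dst[len] = '\0';
  return 0;
}

/* Split `name` into its fields.  Missing fields are left as empty strings.
   Returns 0 on success, -1 if the language is empty or a field is too long. */
int
parse_locale_name (const char *name, struct locale_parts *p)
{
  memset (p, 0, sizeof *p);

  size_t len = strcspn (name, "_.@");
  if (len == 0 || copy_part (p->language, sizeof p->language, name, len) < 0)
    return -1;
  name += len;

  if (*name == '_')
    {
      name++;
      len = strcspn (name, ".@");
      if (copy_part (p->territory, sizeof p->territory, name, len) < 0)
        return -1;
      name += len;
    }
  if (*name == '.')
    {
      name++;
      len = strcspn (name, "@");
      if (copy_part (p->codeset, sizeof p->codeset, name, len) < 0)
        return -1;
      name += len;
    }
  if (*name == '@')
    {
      name++;
      if (copy_part (p->modifier, sizeof p->modifier, name, strlen (name)) < 0)
        return -1;
    }
  return 0;
}
```

Parsing "de_DE.UTF-8@euro" this way gives language "de", territory "DE",
codeset "UTF-8" and modifier "euro"; the codeset field is the part that
would be looked up in a list such as /proc/codesets.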


Corinna


* Re: newlocale: Linux incompatibility
  2023-03-25 19:03         ` Brian Inglis
  2023-03-25 21:19           ` Corinna Vinschen
@ 2023-03-25 21:26           ` Corinna Vinschen
  1 sibling, 0 replies; 11+ messages in thread
From: Corinna Vinschen @ 2023-03-25 21:26 UTC (permalink / raw)
  To: cygwin

On Mar 25 13:03, Brian Inglis via Cygwin wrote:
> On 2023-03-25 05:49, Corinna Vinschen via Cygwin wrote:
> It looks like /proc/locales contains the same content as produced by locale -a?

Yes, locale -a actually opens /proc/locales to read the locales from the
Cygwin core, just as it opens /proc/codesets to implement locale -m.
The idea was to have these definitions collected inside the DLL instead
of having to duplicate code in an external tool.
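
As an illustration of what reading these files from a program could look
like, here is a minimal sketch in C.  The helper name dump_lines is
hypothetical, and the sketch makes no assumption about the line format
of /proc/locales or /proc/codesets; it copies the file verbatim and
counts the lines.  It is not how the locale tool itself is implemented.

```c
#include <stdio.h>

/* Copy every line of the file at `path` to `out`.
   Returns the number of lines copied, or -1 if the file cannot be opened. */
long
dump_lines (const char *path, FILE *out)
{
  FILE *fp = fopen (path, "r");
  if (fp == NULL)
    return -1;

  long lines = 0;
  int c, last = '\n';
  while ((c = getc (fp)) != EOF)
    {
      putc (c, out);
      if (c == '\n')
        lines++;
      last = c;
    }
  if (last != '\n')
    lines++;  /* count a final line that lacks a trailing newline */
  fclose (fp);
  return lines;
}
```

On a Cygwin release that provides these files, dump_lines
("/proc/locales", stdout) would print the list; on systems without them
the call returns -1.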


Corinna

