wctomb() accepts out-of-range character in C-locale

public inbox for newlib@sourceware.org

* wctomb() accepts out-of-range character in C-locale
@ 2024-03-25  7:45 Jun T
  2024-03-25 10:32 ` Corinna Vinschen
  0 siblings, 1 reply; 9+ messages in thread
From: Jun T @ 2024-03-25  7:45 UTC (permalink / raw)
  To: newlib

Dear newlib developers,
(this is the first time I post to this list)

On recent Cygwin, the following C code outputs '1' (i.e., wide character
0x80 can be converted into a valid single-byte character in C-locale):

---------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main() {
    char buf[MB_CUR_MAX];
    setlocale(LC_ALL, "C");
    printf("%d\n", wctomb(buf, 0x80));
    return 0;
}
---------------------------------------

On Linux it outputs '-1'.

It seems this is due to the following commit:

------------------------------------------------
commit 8a4318943875cd922601d34e54ce8a83ad2e733c
Author: Corinna Vinschen <corinna@vinschen.de>
Date:   Mon Jul 31 12:44:16 2023 +0200

    Revert "* libc/stdlib/mbtowc_r.c (__ascii_mbtowc): Disallow conversion of"

    This reverts commit 2b77087a48ea56e77fca5aeab478c922f6473d7c.

    For some reason lost in time, commit 2b77087a48ea5 introduced
    Cygwin-specific code treating single byte characters outside the
    portable character set as illegal chars.  However, Cygwin was
    always alone with this over-correct behaviour and it leads to
    stuff like gnulib replacing functions defined in Cygwin with
    their own implementation just due to that.
------------------------------------------------

Probably the function __ascii_wctomb() is used not only in C-locale
but also in some other locales, and the commit is for "fixing"
some problems in these locales?
But a wide character >= 0x80 can't be converted into a valid
character in C-locale (7bit), I think.

* Re: wctomb() accepts out-of-range character in C-locale
  2024-03-25  7:45 wctomb() accepts out-of-range character in C-locale Jun T
@ 2024-03-25 10:32 ` Corinna Vinschen
  2024-03-25 11:26   ` Bruno Haible
  0 siblings, 1 reply; 9+ messages in thread
From: Corinna Vinschen @ 2024-03-25 10:32 UTC (permalink / raw)
  To: Jun T; +Cc: newlib, Bruno Haible

[CC Bruno Haible, gnulib maintainer, to kick my memory]

Hi Jun,

On Mar 25 16:45, Jun T wrote:
> Dear newlib developers,
> (this is the first time I post to this list)
> 
> On recent Cygwin, the following C code outputs '1' (i.e., wide character
> 0x80 can be converted into a valid single-byte character in C-locale):
> 
> ---------------------------------------
> #include <stdio.h>
> #include <stdlib.h>
> #include <locale.h>
> 
> int main() {
>     char buf[MB_CUR_MAX];
>     setlocale(LC_ALL, "C");
>     printf("%d\n", wctomb(buf, 0x80));
>     return 0;
> }
> ---------------------------------------
> 
> On Linux it outputs '-1'.
> 
> It seems this is due to the following commit:
> 
> ------------------------------------------------
> commit 8a4318943875cd922601d34e54ce8a83ad2e733c
> Author: Corinna Vinschen <corinna@vinschen.de>
> Date:   Mon Jul 31 12:44:16 2023 +0200
> 
>     Revert "* libc/stdlib/mbtowc_r.c (__ascii_mbtowc): Disallow conversion of"
> 
>     This reverts commit 2b77087a48ea56e77fca5aeab478c922f6473d7c.
> 
>     For some reason lost in time, commit 2b77087a48ea5 introduced
>     Cygwin-specific code treating single byte characters outside the
>     portable character set as illegal chars.  However, Cygwin was
>     always alone with this over-correct behaviour and it leads to
>     stuff like gnulib replacing functions defined in Cygwin with
>     their own implementation just due to that.
> ------------------------------------------------
> 
> Probably the function __ascii_wctomb() is used not only in C-locale
> but also in some other locales, and the commit is for "fixing"
> some problems in these locales?

No, __ascii_wctomb is by default used in "C".

> But a wide character >= 0x80 can't be converted into a valid
> character in C-locale (7bit), I think.

Yes, I know, and that was what the original code from 2b77087a48 did.
But at the time I reverted this special handling, Bruno had reported a
change in gnulib in terms of fnmatch starting at
https://cygwin.com/pipermail/cygwin/2023-July/254017.html

During testing I found that gnulib was replacing various functions built
into Cygwin for several reasons, and one of them was that the conversion
of wide char to multibyte in the "C" locale was not transparently
converting chars from 0x80 up to 0xff.

I'm actually puzzled right now that this doesn't work in GLibc either.

Bruno, I really need your input here, because I just don't remember :(

Do you have an idea what gnulib configure test might have been the
trigger for the above revert?

And if GLibc also doesn't let chars >= 0x80 slip through, then Cygwin's
special handling was right.  But then this would introduce gnulib
trouble again...

Can you help us?


Thanks,
Corinna


* Re: wctomb() accepts out-of-range character in C-locale
  2024-03-25 10:32 ` Corinna Vinschen
@ 2024-03-25 11:26   ` Bruno Haible
  2024-03-25 11:34     ` Corinna Vinschen
  2024-03-25 14:07     ` Jun. T
  0 siblings, 2 replies; 9+ messages in thread
From: Bruno Haible @ 2024-03-25 11:26 UTC (permalink / raw)
  To: Jun T, newlib

Hi Corinna,

> Jun T wrote:
> > ---------------------------------------
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <locale.h>
> > 
> > int main() {
> >     char buf[MB_CUR_MAX];
> >     setlocale(LC_ALL, "C");
> >     printf("%d\n", wctomb(buf, 0x80));
> >     return 0;
> > }
> > ---------------------------------------
> > 
> > On Linux it outputs '-1'.

"On Linux" is ambiguous:
  - In glibc, it outputs -1 because of this glibc bug:
    https://sourceware.org/bugzilla/show_bug.cgi?id=19932
    https://sourceware.org/bugzilla/show_bug.cgi?id=29511
  - In musl libc, it outputs -1 because the "C" locale (like all locales)
    uses UTF-8 encoding and the lone byte "\x80" is not an entire character
    in UTF-8.

> > But a wide character >= 0x80 can't be converted into a valid
> > character in C-locale (7bit), I think.

Err. "C" locale, a.k.a. "POSIX" locale, is not 7-bit but 8-bit.
Quoting https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap06.html#tag_06_02 :
  "The POSIX locale shall contain 256 single-byte characters ..."

> During testing I found that gnulib was replacing various functions built
> into Cygwin for several reasons, and one of them was that the conversion
> of wide char to multibyte in the "C" locale was not transparently
> converting chars from 0x80 up to 0xff.

What you did is to make Cygwin POSIX compliant in this aspect, which is
good.

> I'm actually puzzled right now that this doesn't work in GLibc either.

It's the aforementioned glibc bug.

> Do you have an idea what gnulib configure test might have been the
> trigger for the above revert?

It's the "checking whether the C locale is free of encoding errors..." test
(macro gl_MBRTOWC_C_LOCALE in m4/mbrtowc.m4).

Bruno




* Re: wctomb() accepts out-of-range character in C-locale
  2024-03-25 11:26   ` Bruno Haible
@ 2024-03-25 11:34     ` Corinna Vinschen
  2024-03-25 14:07     ` Jun. T
  1 sibling, 0 replies; 9+ messages in thread
From: Corinna Vinschen @ 2024-03-25 11:34 UTC (permalink / raw)
  To: Bruno Haible; +Cc: Jun T, newlib

Hi Bruno,

On Mar 25 12:26, Bruno Haible wrote:
> Hi Corinna,
> 
> > Jun T wrote:
> > > ---------------------------------------
> > > #include <stdio.h>
> > > #include <stdlib.h>
> > > #include <locale.h>
> > > 
> > > int main() {
> > >     char buf[MB_CUR_MAX];
> > >     setlocale(LC_ALL, "C");
> > >     printf("%d\n", wctomb(buf, 0x80));
> > >     return 0;
> > > }
> > > ---------------------------------------
> > > 
> > > On Linux it outputs '-1'.
> 
> "On Linux" is ambiguous:
>   - In glibc, it outputs -1 because of this glibc bug:
>     https://sourceware.org/bugzilla/show_bug.cgi?id=19932
>     https://sourceware.org/bugzilla/show_bug.cgi?id=29511
>   - In musl libc, it outputs -1 because the "C" locale (like all locales)
>     uses UTF-8 encoding and the lone byte "\x80" is not an entire character
>     in UTF-8.
> 
> > > But a wide character >= 0x80 can't be converted into a valid
> > > character in C-locale (7bit), I think.
> 
> Err. "C" locale, a.k.a. "POSIX" locale, is not 7-bit but 8-bit.
> Quoting https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap06.html#tag_06_02 :
>   "The POSIX locale shall contain 256 single-byte characters ..."

Yeah, now that you mention it... I should have thought of this myself :}

> > During testing I found that gnulib was replacing various functions built
> > into Cygwin for several reasons, and one of them was that the conversion
> > of wide char to multibyte in the "C" locale was not transparently
> > converting chars from 0x80 up to 0xff.
> 
> What you did is to make Cygwin POSIX compliant in this aspect, which is
> good.
> 
> > I'm actually puzzled right now that this doesn't work in GLibc either.
> 
> It's the aforementioned glibc bug.
> 
> > Do you have an idea what gnulib configure test might have been the
> > trigger for the above revert?
> 
> It's the "checking whether the C locale is free of encoding errors..." test
> (macro gl_MBRTOWC_C_LOCALE in m4/mbrtowc.m4).
> 
> Bruno

Great, so we're in the clear here.

Thanks a lot for your (as usual 👍) informative input!


Corinna


* Re: wctomb() accepts out-of-range character in C-locale
  2024-03-25 11:26   ` Bruno Haible
  2024-03-25 11:34     ` Corinna Vinschen
@ 2024-03-25 14:07     ` Jun. T
  2024-03-25 20:18       ` brian.inglis
  1 sibling, 1 reply; 9+ messages in thread
From: Jun. T @ 2024-03-25 14:07 UTC (permalink / raw)
  To: newlib


> 2024/03/25 20:26, Bruno Haible <bruno@clisp.org> wrote:
> 
>>> But a wide character >= 0x80 can't be converted into a valid
>>> character in C-locale (7bit), I think.
> 
> Err. "C" locale, a.k.a. "POSIX" locale, is not 7-bit but 8-bit.
> Quoting https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap06.html#tag_06_02 :
>  "The POSIX locale shall contain 256 single-byte characters ..."

I still can't understand why it is useful to convert wide char
in the range 0x80-0xff to an 8bit char in C-locale (for example
convert wide char 0xe1 (U+00e1) = á to an 8bit char 0xe1).

But if you say this is THE correct behavior then it's OK.

* Re: wctomb() accepts out-of-range character in C-locale
  2024-03-25 14:07     ` Jun. T
@ 2024-03-25 20:18       ` brian.inglis
  2024-03-26  1:43         ` Jun. T
  0 siblings, 1 reply; 9+ messages in thread
From: brian.inglis @ 2024-03-25 20:18 UTC (permalink / raw)
  To: newlib

On 2024-03-25 08:07, Jun. T wrote:
> 
>> 2024/03/25 20:26, Bruno Haible <bruno@clisp.org> wrote:
>>
>>>> But a wide character >= 0x80 can't be converted into a valid
>>>> character in C-locale (7bit), I think.
>>
>> Err. "C" locale, a.k.a. "POSIX" locale, is not 7-bit but 8-bit.
>> Quoting https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap06.html#tag_06_02 :
>>   "The POSIX locale shall contain 256 single-byte characters ..."
> 
> I still can't understand why it is useful to convert wide char
> in the range 0x80-0xff to an 8bit char in C-locale (for example
> convert wide char 0xe1 (U+00e1) = á to an 8bit char 0xe1).

Before Unicode, UCS, and the UTF encodings, European Single Byte Character
Sets (SBCS) such as ISO-8859-* were used for Latin script based languages,
including most programming languages. They placed accented characters mainly
in the high half and supported (most of) the POSIX character set.

Arabic, Cyrillic, Greek, Hebrew, other Asian and Indian, and CJK Han script
based languages instead used some local SBCS, fuller featured Double Byte
Character Sets, or Multi Byte Character Sets. Some of these supported (parts
of) the POSIX character set and used shift characters to switch to
characters encoded using the second and other bytes.

For more info see https://en.wikipedia.org/wiki/SBCS and linked articles.

 > But if you say this is THE correct behavior then it's OK.

POSIX says it, so by definition, it's OK! ;^>

-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                 -- Antoine de Saint-Exupéry

* Re: wctomb() accepts out-of-range character in C-locale
  2024-03-25 20:18       ` brian.inglis
@ 2024-03-26  1:43         ` Jun. T
       [not found]           ` <IBDYAS.IT0GDL3WNBOQ@att.net>
  0 siblings, 1 reply; 9+ messages in thread
From: Jun. T @ 2024-03-26  1:43 UTC (permalink / raw)
  To: newlib

> 2024/03/26 5:18, brian.inglis@systematicsw.ab.ca wrote:
> 
> POSIX says it, so by definition, it's OK! ;^>

I think POSIX doesn't say anything about the 8bit part of
the C-locale; it just says it can be implementation dependent.

It is newlib that chooses the implementation in which chars
in 0x80-0xff in C-locale correspond to those chars with
the same wide-char values (virtually equivalent to latin1).

Other systems may choose other implementations, I think.

[parent not found: <IBDYAS.IT0GDL3WNBOQ@att.net>]

* Re: wctomb() accepts out-of-range character in C-locale
       [not found]           ` <IBDYAS.IT0GDL3WNBOQ@att.net>
@ 2024-03-26 11:48             ` Steven J Abner
  2024-03-27  8:01               ` Jun. T
  0 siblings, 1 reply; 9+ messages in thread
From: Steven J Abner @ 2024-03-26 11:48 UTC (permalink / raw)
  To: newlib



> On Tue, Mar 26 2024 at 01:43:57 AM +0000, Jun. T 
> <takimoto-j@kba.biglobe.ne.jp> wrote:
>> I think POSIX doesn't say anything about the 8bit part of
>> the C-locale; it just says it can be implementation dependent.
>> 
>> It is newlib that chooses the implementation in which chars
>> in 0x80-0xff in C-locale correspond to those chars with
>> the same wide-char values (virtually equivalent to latin1).
>> 
>> Other systems may choose other implementations, I think.
> 
The 'C' locale of old did have only 128 codes. These codes represented the
'portable character set' referred to, which you refer to as ASCII. Then
POSIX redefined the 'C' locale, and I quote:
"Conforming systems shall provide a POSIX locale, also known as the C locale."
POSIX also states of these codes: "The POSIX locale shall contain 256
single-byte characters including the characters in Portable Character Set
and Non-Portable Control Characters".
The character codes 0x80-0xFF are not really implementation defined. They
are classified as 'cntl' codes, though not officially stated, and are valid
codes. This makes the POSIX locale as portable as the old 'C' locale.
The 'implementation defined' you seem to be referring to is the defining of
character map encodings for other locales. There, 0x80-0xFF take on meanings
other than in the 'C' (POSIX) locale, but are still defined by a standard in
IANA's list of defined character maps.




* Re: wctomb() accepts out-of-range character in C-locale
  2024-03-26 11:48             ` Steven J Abner
@ 2024-03-27  8:01               ` Jun. T
  0 siblings, 0 replies; 9+ messages in thread
From: Jun. T @ 2024-03-27  8:01 UTC (permalink / raw)
  To: newlib


> 2024/03/26 20:48, Steven J Abner <pheonix.sja@att.net> wrote:
> 
> The character codes 0x80-0xFF are not really implementation defined. They are classified
> as 'cntl' codes, though not officially stated, and valid codes.

In the current newlib, 0xe1 in C-locale corresponds to
the character U+00e1 = á (printable).

Anyway, newlib can do anything it wants.
But I think you can't expect other systems will do the
same thing.
