* wctomb() accepts out-of-range character in C-locale
@ 2024-03-25 7:45 Jun T
2024-03-25 10:32 ` Corinna Vinschen
0 siblings, 1 reply; 9+ messages in thread
From: Jun T @ 2024-03-25 7:45 UTC (permalink / raw)
To: newlib
Dear newlib developers,
(this is the first time I post to this list)
On recent Cygwin, the following C code outputs '1' (i.e., wide character
0x80 is converted into a valid single-byte character in the C locale):
---------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(void) {
    char buf[MB_CUR_MAX];   /* sized for the locale in effect at this point */
    setlocale(LC_ALL, "C");
    printf("%d\n", wctomb(buf, 0x80));
    return 0;
}
---------------------------------------
On Linux it outputs '-1'.
It seems this is due to the following commit:
------------------------------------------------
commit 8a4318943875cd922601d34e54ce8a83ad2e733c
Author: Corinna Vinschen <corinna@vinschen.de>
Date: Mon Jul 31 12:44:16 2023 +0200
Revert "* libc/stdlib/mbtowc_r.c (__ascii_mbtowc): Disallow conversion of"
This reverts commit 2b77087a48ea56e77fca5aeab478c922f6473d7c.
For some reason lost in time, commit 2b77087a48ea5 introduced
Cygwin-specific code treating single byte characters outside the
portable character set as illegal chars. However, Cygwin was
always alone with this over-correct behaviour and it leads to
stuff like gnulib replacing functions defined in Cygwin with
their own implementation just due to that.
------------------------------------------------
Probably the function __ascii_wctomb() is used not only in the C locale
but also in some other locales, and the commit is meant to "fix"
some problems in those locales?
But a wide character >= 0x80 can't be converted into a valid
character in the C locale (7-bit), I think.
* Re: wctomb() accepts out-of-range character in C-locale
2024-03-25 7:45 wctomb() accepts out-of-range character in C-locale Jun T
@ 2024-03-25 10:32 ` Corinna Vinschen
2024-03-25 11:26 ` Bruno Haible
0 siblings, 1 reply; 9+ messages in thread
From: Corinna Vinschen @ 2024-03-25 10:32 UTC (permalink / raw)
To: Jun T; +Cc: newlib, Bruno Haible
[CC Bruno Haible, gnulib maintainer, to kick my memory]
Hi Jun,
On Mar 25 16:45, Jun T wrote:
> Dear newlib developers,
> (this is the first time I post to this list)
>
> On recent Cygwin, the following C code output '1' (i.e., wide character
> 0x80 can be converted into a valid single-byte character in C-locale):
>
> ---------------------------------------
> #include <stdio.h>
> #include <stdlib.h>
> #include <locale.h>
>
> int main() {
> char buf[MB_CUR_MAX];
> setlocale(LC_ALL, "C");
> printf("%d\n", wctomb(buf, 0x80));
> return 0;
> }
> ---------------------------------------
>
> On Linux it outputs '-1'.
>
> It seems this is due to the following commit:
>
> ------------------------------------------------
> commit 8a4318943875cd922601d34e54ce8a83ad2e733c
> Author: Corinna Vinschen <corinna@vinschen.de>
> Date: Mon Jul 31 12:44:16 2023 +0200
>
> Revert "* libc/stdlib/mbtowc_r.c (__ascii_mbtowc): Disallow conversion of"
>
> This reverts commit 2b77087a48ea56e77fca5aeab478c922f6473d7c.
>
> For some reason lost in time, commit 2b77087a48ea5 introduced
> Cygwin-specific code treating single byte characters outside the
> portable character set as illegal chars. However, Cygwin was
> always alone with this over-correct behaviour and it leads to
> stuff like gnulib replacing functions defined in Cygwin with
> their own implementation just due to that.
> ------------------------------------------------
>
> Probably the function __ascii_wctomb() is used not only in C-locale
> but also in some other locales, and the commit is for "fixing"
> some problems in these locales?
No, __ascii_wctomb is by default used in "C".
> But a wide character >= 0x80 can't be converted into a valid
> character in C-locale (7bit), I think.
Yes, I know, and that was what the original code from 2b77087a48 did.
But at the time I reverted this special handling, Bruno had reported a
change in gnulib in terms of fnmatch starting at
https://cygwin.com/pipermail/cygwin/2023-July/254017.html
During testing I found that gnulib was replacing various functions built
into Cygwin for several reasons, and one of them was that the conversion
of wide char to multibyte in the "C" locale was not transparently
converting chars from 0x80 up to 0xff.
I'm actually puzzled right now that this doesn't work in GLibc either.
Bruno, I really need your input here, because I just don't remember :(
Do you have an idea what gnulib configure test might have been the
trigger for the above revert?
And if GLibc also doesn't let chars >= 0x80 slip through, then Cygwin's
special handling was right. But then this would introduce gnulib
trouble again...
Can you help us?
Thanks,
Corinna
* Re: wctomb() accepts out-of-range character in C-locale
2024-03-25 10:32 ` Corinna Vinschen
@ 2024-03-25 11:26 ` Bruno Haible
2024-03-25 11:34 ` Corinna Vinschen
2024-03-25 14:07 ` Jun. T
0 siblings, 2 replies; 9+ messages in thread
From: Bruno Haible @ 2024-03-25 11:26 UTC (permalink / raw)
To: Jun T, newlib
Hi Corinna,
> Jun T wrote:
> > ---------------------------------------
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <locale.h>
> >
> > int main() {
> > char buf[MB_CUR_MAX];
> > setlocale(LC_ALL, "C");
> > printf("%d\n", wctomb(buf, 0x80));
> > return 0;
> > }
> > ---------------------------------------
> >
> > On Linux it outputs '-1'.
"On Linux" is ambiguous:
- In glibc, it outputs -1 because of this glibc bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=19932
https://sourceware.org/bugzilla/show_bug.cgi?id=29511
- In musl libc, it outputs -1 because the "C" locale (like all locales)
uses UTF-8 encoding and the lone byte "\x80" is not an entire character
in UTF-8.
> > But a wide character >= 0x80 can't be converted into a valid
> > character in C-locale (7bit), I think.
Err. "C" locale, a.k.a. "POSIX" locale, is not 7-bit but 8-bit.
Quoting https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap06.html#tag_06_02 :
"The POSIX locale shall contain 256 single-byte characters ..."
> During testing I found that gnulib was replacing various functions built
> into Cygwin for several reasons, and one of them was that the conversion
> of wide char to multibyte in the "C" locale was not transparently
> converting chars from 0x80 up to 0xff.
What you did is to make Cygwin POSIX compliant in this aspect, which is
good.
> I'm actually puzzled right now that this doesn't work in GLibc either.
It's the aforementioned glibc bug.
> Do you have an idea what gnulib configure test might have been the
> trigger for the above revert?
It's the "checking whether the C locale is free of encoding errors..." test
(macro gl_MBRTOWC_C_LOCALE in m4/mbrtowc.m4).
Bruno
* Re: wctomb() accepts out-of-range character in C-locale
2024-03-25 11:26 ` Bruno Haible
@ 2024-03-25 11:34 ` Corinna Vinschen
2024-03-25 14:07 ` Jun. T
1 sibling, 0 replies; 9+ messages in thread
From: Corinna Vinschen @ 2024-03-25 11:34 UTC (permalink / raw)
To: Bruno Haible; +Cc: Jun T, newlib
Hi Bruno,
On Mar 25 12:26, Bruno Haible wrote:
> Hi Corinna,
>
> > Jun T wrote:
> > > ---------------------------------------
> > > #include <stdio.h>
> > > #include <stdlib.h>
> > > #include <locale.h>
> > >
> > > int main() {
> > > char buf[MB_CUR_MAX];
> > > setlocale(LC_ALL, "C");
> > > printf("%d\n", wctomb(buf, 0x80));
> > > return 0;
> > > }
> > > ---------------------------------------
> > >
> > > On Linux it outputs '-1'.
>
> "On Linux" is ambiguous:
> - In glibc, it outputs -1 because of this glibc bug:
> https://sourceware.org/bugzilla/show_bug.cgi?id=19932
> https://sourceware.org/bugzilla/show_bug.cgi?id=29511
> - In musl libc, it outputs -1 because the "C" locale (like all locales)
> uses UTF-8 encoding and the lone byte "\x80" is not an entire character
> in UTF-8.
>
> > > But a wide character >= 0x80 can't be converted into a valid
> > > character in C-locale (7bit), I think.
>
> Err. "C" locale, a.k.a. "POSIX" locale, is not 7-bit but 8-bit.
> Quoting https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap06.html#tag_06_02 :
> "The POSIX locale shall contain 256 single-byte characters ..."
Yeah, now that you mention it... I should have thought of this myself :}
> > During testing I found that gnulib was replacing various functions built
> > into Cygwin for several reasons, and one of them was that the conversion
> > of wide char to multibyte in the "C" locale was not transparently
> > converting chars from 0x80 up to 0xff.
>
> What you did is to make Cygwin POSIX compliant in this aspect, which is
> good.
>
> > I'm actually puzzled right now that this doesn't work in GLibc either.
>
> It's the aforementioned glibc bug.
>
> > Do you have an idea what gnulib configure test might have been the
> > trigger for the above revert?
>
> It's the "checking whether the C locale is free of encoding errors..." test
> (macro gl_MBRTOWC_C_LOCALE in m4/mbrtowc.m4).
>
> Bruno
Great, so we're in the clear here.
Thanks a lot for your (as usual 👍) informative input!
Corinna
* Re: wctomb() accepts out-of-range character in C-locale
2024-03-25 11:26 ` Bruno Haible
2024-03-25 11:34 ` Corinna Vinschen
@ 2024-03-25 14:07 ` Jun. T
2024-03-25 20:18 ` brian.inglis
1 sibling, 1 reply; 9+ messages in thread
From: Jun. T @ 2024-03-25 14:07 UTC (permalink / raw)
To: newlib
> 2024/03/25 20:26, Bruno Haible <bruno@clisp.org> wrote:
>
>>> But a wide character >= 0x80 can't be converted into a valid
>>> character in C-locale (7bit), I think.
>
> Err. "C" locale, a.k.a. "POSIX" locale, is not 7-bit but 8-bit.
> Quoting https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap06.html#tag_06_02 :
> "The POSIX locale shall contain 256 single-byte characters ..."
I still can't understand why it is useful to convert a wide char
in the range 0x80-0xff to an 8-bit char in the C locale (for example,
converting wide char 0xe1 (U+00E1) = á to the 8-bit char 0xe1).
But if you say this is THE correct behavior, then it's OK.
* Re: wctomb() accepts out-of-range character in C-locale
2024-03-25 14:07 ` Jun. T
@ 2024-03-25 20:18 ` brian.inglis
2024-03-26 1:43 ` Jun. T
0 siblings, 1 reply; 9+ messages in thread
From: brian.inglis @ 2024-03-25 20:18 UTC (permalink / raw)
To: newlib
On 2024-03-25 08:07, Jun. T wrote:
>
>> 2024/03/25 20:26, Bruno Haible <bruno@clisp.org> wrote:
>>
>>>> But a wide character >= 0x80 can't be converted into a valid
>>>> character in C-locale (7bit), I think.
>>
>> Err. "C" locale, a.k.a. "POSIX" locale, is not 7-bit but 8-bit.
>> Quoting https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap06.html#tag_06_02 :
>> "The POSIX locale shall contain 256 single-byte characters ..."
>
> I still can't understand why it is useful to convert wide char
> in the range 0x80-0xff to an 8bit char in C-locale (for example
> convert wide char 0xe1 (U+00e1) = á to an 8bit char 0xe1).
Before Unicode, UCS, and the UTF encodings, European Single Byte Character Sets
(SBCS) such as ISO-8859-* were used for Latin-script languages, including most
programming languages. They put accented characters mainly in the high half and
supported (most of) the POSIX character set.
Arabic, Cyrillic, Greek, Hebrew, other Asian and Indian, and CJK Han script
languages used a local SBCS, fuller-featured Double Byte Character Sets, or
Multi Byte Character Sets. Some of these supported (parts of) the POSIX
character set and used shift characters to switch to characters encoded using
the second and other bytes.
For more info see https://en.wikipedia.org/wiki/SBCS and linked articles.
> But if you say this is THE correct behavior then it's OK.
POSIX says it, so by definition, it's OK! ;^>
--
Take care. Thanks, Brian Inglis Calgary, Alberta, Canada
La perfection est atteinte Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer but when there is no more to cut
-- Antoine de Saint-Exupéry
* Re: wctomb() accepts out-of-range character in C-locale
2024-03-25 20:18 ` brian.inglis
@ 2024-03-26 1:43 ` Jun. T
[not found] ` <IBDYAS.IT0GDL3WNBOQ@att.net>
0 siblings, 1 reply; 9+ messages in thread
From: Jun. T @ 2024-03-26 1:43 UTC (permalink / raw)
To: newlib
> 2024/03/26 5:18, brian.inglis@systematicsw.ab.ca wrote:
>
> POSIX says it, so by definition, it's OK! ;^>
I think POSIX doesn't say anything about the 8-bit part of
the C locale; it just says it can be implementation-dependent.
It is newlib that chooses the implementation in which chars
in 0x80-0xff in the C locale correspond to those chars with
the same wide-char values (virtually equivalent to latin1).
Other systems may choose other implementations, I think.
* Re: wctomb() accepts out-of-range character in C-locale
[not found] ` <IBDYAS.IT0GDL3WNBOQ@att.net>
@ 2024-03-26 11:48 ` Steven J Abner
2024-03-27 8:01 ` Jun. T
0 siblings, 1 reply; 9+ messages in thread
From: Steven J Abner @ 2024-03-26 11:48 UTC (permalink / raw)
To: newlib
> On Tue, Mar 26 2024 at 01:43:57 AM +0000, Jun. T
> <takimoto-j@kba.biglobe.ne.jp> wrote:
>> I think POSIX doesn't say anything about the 8bit part of
>> the C-locale; it just says it can be implementation dependent.
>>
>> It is newlib that chooses the implementation in which chars
>> in 0x80-0xff in C-locale correspond to those chars with
>> the same wide-char values (virtually equivalent to latin1).
>>
>> Other systems may choose other implementations, I think.
>
The 'C' locale of old did have only 128 codes. These codes represented the
'portable character set' referred to earlier, which you refer to as ASCII.
Then POSIX redefined the 'C' locale, and I quote:
"Conforming systems shall provide a POSIX locale, also known as the C locale."
POSIX also states: "The POSIX locale shall contain 256 single-byte
characters including the characters in Portable Character Set and
Non-Portable Control Characters".
The character codes 0x80-0xFF are not really implementation defined. They are
classified as 'cntrl' codes, though this is not officially stated, and they are
valid codes. This makes the POSIX locale as portable as the old 'C' locale.
The 'implementation defined' part you seem to be referring to is the definition
of character map encodings for other locales. There, 0x80-0xFF take on meanings
other than in the 'C' (POSIX) locale, but they are still defined by a standard
listed by IANA among its defined character maps.
* Re: wctomb() accepts out-of-range character in C-locale
2024-03-26 11:48 ` Steven J Abner
@ 2024-03-27 8:01 ` Jun. T
0 siblings, 0 replies; 9+ messages in thread
From: Jun. T @ 2024-03-27 8:01 UTC (permalink / raw)
To: newlib
> 2024/03/26 20:48, Steven J Abner <pheonix.sja@att.net> wrote:
>
> The character codes 0x80-0xFF are not really implementation defined. They are classified
> as 'cntrl' codes, though not officially stated, and valid codes.
In the current newlib, 0xe1 in C-locale corresponds to
the character U+00e1 = á (printable).
Anyway, newlib can do anything it wants.
But I think you can't expect other systems will do the
same thing.