* wctomb() accepts out-of-range character in C-locale
From: Jun T @ 2024-03-25 7:45 UTC
To: newlib

Dear newlib developers,
(this is the first time I post to this list)

On recent Cygwin, the following C code outputs '1' (i.e., wide character
0x80 can be converted into a valid single-byte character in C-locale):

---------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main() {
    char buf[MB_CUR_MAX];
    setlocale(LC_ALL, "C");
    printf("%d\n", wctomb(buf, 0x80));
    return 0;
}
---------------------------------------

On Linux it outputs '-1'.

It seems this is due to the following commit:

------------------------------------------------
commit 8a4318943875cd922601d34e54ce8a83ad2e733c
Author: Corinna Vinschen <corinna@vinschen.de>
Date:   Mon Jul 31 12:44:16 2023 +0200

    Revert "* libc/stdlib/mbtowc_r.c (__ascii_mbtowc): Disallow conversion of"

    This reverts commit 2b77087a48ea56e77fca5aeab478c922f6473d7c.

    For some reason lost in time, commit 2b77087a48ea5 introduced
    Cygwin-specific code treating single byte characters outside the
    portable character set as illegal chars. However, Cygwin was
    always alone with this over-correct behaviour and it leads to
    stuff like gnulib replacing functions defined in Cygwin with
    their own implementation just due to that.
------------------------------------------------

Probably the function __ascii_wctomb() is used not only in C-locale
but also in some other locales, and the commit is for "fixing"
some problems in these locales?

But a wide character >= 0x80 can't be converted into a valid
character in C-locale (7bit), I think.
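The reproducer above can be made a little more informative by also showing
which byte, if any, was stored. The following is a minimal sketch, not code
from newlib or Cygwin; the helper names wctomb_in_c_locale and report_wctomb
are made up for illustration. Going by the report, the result for 0x80 would
be 1 on recent Cygwin and -1 on Linux.

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: convert wc with wctomb() in the "C" locale.
   out must have room for MB_LEN_MAX bytes.
   Returns wctomb()'s result: the byte count, or -1 if wc is rejected. */
static int wctomb_in_c_locale(wchar_t wc, char *out)
{
    const char *loc = setlocale(LC_ALL, "C");

    assert(loc != NULL);   /* the "C" locale is always available */
    assert(out != NULL);
    (void)wctomb(NULL, 0); /* reset the conversion state */
    return wctomb(out, wc);
}

/* Hypothetical helper: print what wctomb() did with wc. */
static void report_wctomb(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    int n = wctomb_in_c_locale(wc, buf);

    if (n < 0)
        printf("0x%lx -> rejected (-1)\n", (unsigned long)wc);
    else
        printf("0x%lx -> %d byte(s), first byte 0x%02x\n",
               (unsigned long)wc, n, (unsigned char)buf[0]);
}
```

Calling report_wctomb(0x80) from main prints a single line; which of the two
branches is taken depends on the C library in use.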
* Re: wctomb() accepts out-of-range character in C-locale
From: Corinna Vinschen @ 2024-03-25 10:32 UTC
To: Jun T; +Cc: newlib, Bruno Haible

[CC Bruno Haible, gnulib maintainer, to kick my memory]

Hi Jun,

On Mar 25 16:45, Jun T wrote:
> Dear newlib developers,
> (this is the first time I post to this list)
>
> On recent Cygwin, the following C code outputs '1' (i.e., wide character
> 0x80 can be converted into a valid single-byte character in C-locale):
>
> ---------------------------------------
> #include <stdio.h>
> #include <stdlib.h>
> #include <locale.h>
>
> int main() {
>     char buf[MB_CUR_MAX];
>     setlocale(LC_ALL, "C");
>     printf("%d\n", wctomb(buf, 0x80));
>     return 0;
> }
> ---------------------------------------
>
> On Linux it outputs '-1'.
>
> It seems this is due to the following commit:
>
> ------------------------------------------------
> commit 8a4318943875cd922601d34e54ce8a83ad2e733c
> Author: Corinna Vinschen <corinna@vinschen.de>
> Date:   Mon Jul 31 12:44:16 2023 +0200
>
>     Revert "* libc/stdlib/mbtowc_r.c (__ascii_mbtowc): Disallow conversion of"
>
>     This reverts commit 2b77087a48ea56e77fca5aeab478c922f6473d7c.
>
>     For some reason lost in time, commit 2b77087a48ea5 introduced
>     Cygwin-specific code treating single byte characters outside the
>     portable character set as illegal chars. However, Cygwin was
>     always alone with this over-correct behaviour and it leads to
>     stuff like gnulib replacing functions defined in Cygwin with
>     their own implementation just due to that.
> ------------------------------------------------
>
> Probably the function __ascii_wctomb() is used not only in C-locale
> but also in some other locales, and the commit is for "fixing"
> some problems in these locales?

No, __ascii_wctomb is by default used in "C".

> But a wide character >= 0x80 can't be converted into a valid
> character in C-locale (7bit), I think.

Yes, I know, and that was what the original code from 2b77087a48 did.

But at the time I reverted this special handling, Bruno had reported a
change in gnulib in terms of fnmatch starting at
https://cygwin.com/pipermail/cygwin/2023-July/254017.html

During testing I found that gnulib was replacing various functions built
into Cygwin for several reasons, and one of them was that the conversion
of wide char to multibyte in the "C" locale was not transparently
converting chars from 0x80 up to 0xff.

I'm actually puzzled right now that this doesn't work in GLibc either.

Bruno, I really need your input here, because I just don't remember :(

Do you have an idea what gnulib configure test might have been the
trigger for the above revert?

And if GLibc also doesn't let chars >= 0x80 slip through, then Cygwin's
special handling was right. But then this would introduce gnulib
trouble again...

Can you help us?

Thanks,
Corinna
* Re: wctomb() accepts out-of-range character in C-locale
From: Bruno Haible @ 2024-03-25 11:26 UTC
To: Jun T, newlib

Hi Corinna,

> Jun T wrote:
> > ---------------------------------------
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <locale.h>
> >
> > int main() {
> >     char buf[MB_CUR_MAX];
> >     setlocale(LC_ALL, "C");
> >     printf("%d\n", wctomb(buf, 0x80));
> >     return 0;
> > }
> > ---------------------------------------
> >
> > On Linux it outputs '-1'.

"On Linux" is ambiguous:
  - In glibc, it outputs -1 because of this glibc bug:
    https://sourceware.org/bugzilla/show_bug.cgi?id=19932
    https://sourceware.org/bugzilla/show_bug.cgi?id=29511
  - In musl libc, it outputs -1 because the "C" locale (like all locales)
    uses UTF-8 encoding and the lone byte "\x80" is not an entire character
    in UTF-8.

> > But a wide character >= 0x80 can't be converted into a valid
> > character in C-locale (7bit), I think.

Err. "C" locale, a.k.a. "POSIX" locale, is not 7-bit but 8-bit.
Quoting https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap06.html#tag_06_02 :
  "The POSIX locale shall contain 256 single-byte characters ..."

> During testing I found that gnulib was replacing various functions built
> into Cygwin for several reasons, and one of them was that the conversion
> of wide char to multibyte in the "C" locale was not transparently
> converting chars from 0x80 up to 0xff.

What you did is to make Cygwin POSIX compliant in this aspect, which is
good.

> I'm actually puzzled right now that this doesn't work in GLibc either.

It's the aforementioned glibc bug.

> Do you have an idea what gnulib configure test might have been the
> trigger for the above revert?

It's the "checking whether the C locale is free of encoding errors..." test
(macro gl_MBRTOWC_C_LOCALE in m4/mbrtowc.m4).

Bruno
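The idea behind a check such as "the C locale is free of encoding errors" can
be sketched as follows. This is not the source of gl_MBRTOWC_C_LOCALE, which
is not quoted in this thread; it is only an illustrative assumption of what
such a test might do: decode every possible byte with mbrtowc() in the "C"
locale and count the ones reported as invalid or incomplete. The function
name count_c_locale_decode_errors is hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the byte values 0x00..0xff that mbrtowc()
   refuses to decode as a single character in the "C" locale. */
static int count_c_locale_decode_errors(void)
{
    const char *loc = setlocale(LC_ALL, "C");
    int errors = 0;

    assert(loc != NULL); /* the "C" locale is always available */

    for (int b = 0; b <= 0xff; b++) {
        unsigned char byte = (unsigned char)b;
        wchar_t wc;
        mbstate_t state;
        size_t n;

        memset(&state, 0, sizeof state); /* start from the initial state */
        n = mbrtowc(&wc, (const char *)&byte, 1, &state);

        /* (size_t)-1: invalid sequence; (size_t)-2: incomplete sequence */
        if (n == (size_t)-1 || n == (size_t)-2)
            errors++;
    }
    return errors;
}
```

A result of 0 means every byte decodes as one character, which matches the
reading of POSIX quoted above; a nonzero count means the library rejects some
bytes in the "C" locale.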
* Re: wctomb() accepts out-of-range character in C-locale
From: Corinna Vinschen @ 2024-03-25 11:34 UTC
To: Bruno Haible; +Cc: Jun T, newlib

Hi Bruno,

On Mar 25 12:26, Bruno Haible wrote:
> Hi Corinna,
>
> > Jun T wrote:
> > > ---------------------------------------
> > > #include <stdio.h>
> > > #include <stdlib.h>
> > > #include <locale.h>
> > >
> > > int main() {
> > >     char buf[MB_CUR_MAX];
> > >     setlocale(LC_ALL, "C");
> > >     printf("%d\n", wctomb(buf, 0x80));
> > >     return 0;
> > > }
> > > ---------------------------------------
> > >
> > > On Linux it outputs '-1'.
>
> "On Linux" is ambiguous:
>   - In glibc, it outputs -1 because of this glibc bug:
>     https://sourceware.org/bugzilla/show_bug.cgi?id=19932
>     https://sourceware.org/bugzilla/show_bug.cgi?id=29511
>   - In musl libc, it outputs -1 because the "C" locale (like all locales)
>     uses UTF-8 encoding and the lone byte "\x80" is not an entire character
>     in UTF-8.
>
> > > But a wide character >= 0x80 can't be converted into a valid
> > > character in C-locale (7bit), I think.
>
> Err. "C" locale, a.k.a. "POSIX" locale, is not 7-bit but 8-bit.
> Quoting https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap06.html#tag_06_02 :
>   "The POSIX locale shall contain 256 single-byte characters ..."

Yeah, now that you mention it... I should have thought of this myself :}

> > During testing I found that gnulib was replacing various functions built
> > into Cygwin for several reasons, and one of them was that the conversion
> > of wide char to multibyte in the "C" locale was not transparently
> > converting chars from 0x80 up to 0xff.
>
> What you did is to make Cygwin POSIX compliant in this aspect, which is
> good.
>
> > I'm actually puzzled right now that this doesn't work in GLibc either.
>
> It's the aforementioned glibc bug.
>
> > Do you have an idea what gnulib configure test might have been the
> > trigger for the above revert?
>
> It's the "checking whether the C locale is free of encoding errors..." test
> (macro gl_MBRTOWC_C_LOCALE in m4/mbrtowc.m4).
>
> Bruno

Great, so we're in the clear here.

Thanks a lot for your (as usual 👍) informative input!

Corinna
* Re: wctomb() accepts out-of-range character in C-locale
From: Jun. T @ 2024-03-25 14:07 UTC
To: newlib

> 2024/03/25 20:26, Bruno Haible <bruno@clisp.org> wrote:
>
>>> But a wide character >= 0x80 can't be converted into a valid
>>> character in C-locale (7bit), I think.
>
> Err. "C" locale, a.k.a. "POSIX" locale, is not 7-bit but 8-bit.
> Quoting https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap06.html#tag_06_02 :
> "The POSIX locale shall contain 256 single-byte characters ..."

I still can't understand why it is useful to convert a wide char
in the range 0x80-0xff to an 8bit char in C-locale (for example,
to convert wide char 0xe1 (U+00e1) = á to the 8bit char 0xe1).

But if you say this is THE correct behavior then it's OK.
* Re: wctomb() accepts out-of-range character in C-locale
From: brian.inglis @ 2024-03-25 20:18 UTC
To: newlib

On 2024-03-25 08:07, Jun. T wrote:
>
>> 2024/03/25 20:26, Bruno Haible <bruno@clisp.org> wrote:
>>
>>>> But a wide character >= 0x80 can't be converted into a valid
>>>> character in C-locale (7bit), I think.
>>
>> Err. "C" locale, a.k.a. "POSIX" locale, is not 7-bit but 8-bit.
>> Quoting https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/V1_chap06.html#tag_06_02 :
>> "The POSIX locale shall contain 256 single-byte characters ..."
>
> I still can't understand why it is useful to convert wide char
> in the range 0x80-0xff to an 8bit char in C-locale (for example
> convert wide char 0xe1 (U+00e1) = á to an 8bit char 0xe1).

Before Unicode, UCS, and UTF character sets, European Single Byte
Character Sets such as ISO-8859-* were used for Latin script based
languages, including most programming languages, with accented
characters mainly in the high half, and supported (most of) the POSIX
character set. Arabic, Cyrillic, Greek, Hebrew, other Asian and Indian,
and CJK Han script based languages used some local SBCS, fuller
featured Double Byte Character Sets, and Multi Byte Character Sets,
some of which supported (parts of) the POSIX character set, and used
shift characters to switch to characters encoded using the second and
other bytes.

For more info see https://en.wikipedia.org/wiki/SBCS and linked
articles.

> But if you say this is THE correct behavior then it's OK.

POSIX says it, so by definition, it's OK! ;^>

--
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                -- Antoine de Saint-Exupéry
* Re: wctomb() accepts out-of-range character in C-locale
From: Jun. T @ 2024-03-26 1:43 UTC
To: newlib

> 2024/03/26 5:18, brian.inglis@systematicsw.ab.ca wrote:
>
> POSIX says it, so by definition, it's OK! ;^>

I think POSIX doesn't say anything about the 8bit part of
the C-locale; it just says it can be implementation dependent.

It is newlib that chooses the implementation in which chars
in 0x80-0xff in C-locale correspond to those chars with
the same wide-char values (virtually equivalent to latin1).

Other systems may choose other implementations, I think.
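Whether the high half of the "C" locale behaves like Latin-1, as described
above for newlib, can be probed directly on a given system. The sketch below
counts the bytes in 0x80-0xff for which btowc() returns a wide character with
the same numeric value. The name count_identity_high_bytes is hypothetical,
and no particular result is claimed for any specific C library.

```c
#include <assert.h>
#include <locale.h>
#include <wchar.h>

/* Hypothetical helper: in the "C" locale, count the bytes 0x80..0xff that
   btowc() maps to the wide character with the same numeric value. */
static int count_identity_high_bytes(void)
{
    const char *loc = setlocale(LC_ALL, "C");
    int same = 0;

    assert(loc != NULL); /* the "C" locale is always available */

    for (int b = 0x80; b <= 0xff; b++) {
        wint_t wc = btowc(b); /* WEOF if b is not a valid one-byte char */

        if (wc != WEOF && wc == (wint_t)b)
            same++;
    }
    return same;
}
```

A count of 128 would indicate the Latin-1-like identity mapping; a count of 0
would indicate that the library either rejects these bytes or maps them to
other wide-character values.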
* Re: wctomb() accepts out-of-range character in C-locale
From: Steven J Abner @ 2024-03-26 11:48 UTC
To: newlib

> On Tue, Mar 26 2024 at 01:43:57 AM +0000, Jun. T
> <takimoto-j@kba.biglobe.ne.jp> wrote:
>> I think POSIX doesn't say anything about the 8bit part of
>> the C-locale; it just says it can be implementation dependent.
>>
>> It is newlib that chooses the implementation in which chars
>> in 0x80-0xff in C-locale correspond to those chars with
>> the same wide-char values (virtually equivalent to latin1).
>>
>> Other systems may choose other implementations, I think.
>
> The 'C' locale of old did have only 128 codes. These codes represented
> the referred to 'portable character set', which you refer to as ASCII.
> Then POSIX redefined the 'C' locale and I quote: "Conforming systems
> shall provide a POSIX locale, also known as the C locale.". POSIX also
> states that these codes: "The POSIX locale shall contain 256
> single-byte characters including the characters in Portable Character
> Set and Non-Portable Control Characters". The character codes 0x80-0xFF
> are not really implementation defined. They are classified as 'cntl'
> codes, though not officially stated, and valid codes. This makes the
> POSIX locale as portable as the old 'C' locale. The 'implementation'
> defined you seem to be referring to is the defining of character map
> encodings for other locales. Then 0x80-0xFF take on meanings other than
> in the 'C', POSIX locale, but are still defined by a standard listed by
> IANA of defined character maps.
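The claim that 0x80-0xFF act as control codes in the POSIX locale can be
checked against a given C library with the <ctype.h> classification
functions. This is a minimal sketch with the hypothetical names
struct high_byte_classes and classify_high_bytes; the results are
implementation-specific and none are asserted here.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Hypothetical result type: how the bytes 0x80..0xff are classified. */
struct high_byte_classes {
    int cntrl;   /* iscntrl() is true */
    int print;   /* not cntrl, but isprint() is true */
    int neither; /* neither of the two */
};

/* Hypothetical helper: classify the high bytes in the "C" locale. */
static struct high_byte_classes classify_high_bytes(void)
{
    struct high_byte_classes r = {0, 0, 0};
    const char *loc = setlocale(LC_ALL, "C");

    assert(loc != NULL); /* the "C" locale is always available */

    for (int b = 0x80; b <= 0xff; b++) {
        if (iscntrl(b))
            r.cntrl++;
        else if (isprint(b))
            r.print++;
        else
            r.neither++;
    }
    return r;
}
```

On a library that follows the description above, cntrl would be 128. The
three counters always sum to 128, so any other split shows how that library
treats the high half.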
* Re: wctomb() accepts out-of-range character in C-locale
From: Jun. T @ 2024-03-27 8:01 UTC
To: newlib

> 2024/03/26 20:48, Steven J Abner <pheonix.sja@att.net> wrote:
>
> The character codes 0x80-0xFF are not really implementation defined. They are classified
> as 'cntl' codes, though not officially stated, and valid codes.

In the current newlib, 0xe1 in C-locale corresponds to the
character U+00e1 = á (printable).

Anyway, newlib can do anything it wants. But I think you can't
expect that other systems will do the same thing.