* [Bug localedata/19932] mbrtowc returns (size_t) -1 in C locale
2016-04-09 8:15 [Bug localedata/19932] New: mbrtowc returns (size_t) -1 in C locale eggert at gnu dot org
@ 2016-04-09 20:09 ` jim at meyering dot net
2016-04-09 20:10 ` eggert at gnu dot org
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: jim at meyering dot net @ 2016-04-09 20:09 UTC (permalink / raw)
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=19932
jim at meyering dot net <jim at meyering dot net> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |jim at meyering dot net
--- Comment #1 from jim at meyering dot net <jim at meyering dot net> ---
Thank you for driving this.
--
You are receiving this mail because:
You are on the CC list for the bug.
* [Bug localedata/19932] mbrtowc returns (size_t) -1 in C locale
From: eggert at gnu dot org @ 2016-04-09 20:10 UTC (permalink / raw)
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=19932
--- Comment #3 from Paul Eggert <eggert at gnu dot org> ---
(In reply to Bruno Haible from comment #2)
> Thus the mapping table would
> - map x (0 <= x <= 0x7F) to Unicode x,
> - map x (0x80 <= x <= 0xFF) to Unicode 0xDF80+x (or similar).
Emacs maps the latter to 0x3FFF00+x (that is, 0x3FFF80 through 0x3FFFFF), I
suppose under the theory that these integers are not Unicode code points, and
thus won't be conflated with private-use Unicode characters. I suppose we could
be "compatible" with Emacs.
Are there other examples in the wild of this sort of thing, or is the Emacs
precedent good enough?
> Should we create a new encoding with this property?
> Or change the mapping tables of ANSI_X3.4-1968?
It is a bit of a dilemma. Would it make sense to change iconv so that it
recognizes values like 0x3FFF80 as corresponding to encoding-error bytes? iconv
could then behave the same way as before, even if we change the mapping tables
of ANSI_X3.4-1968.
* [Bug localedata/19932] mbrtowc returns (size_t) -1 in C locale
From: bruno at clisp dot org @ 2016-04-09 20:10 UTC (permalink / raw)
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=19932
Bruno Haible <bruno at clisp dot org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |bruno at clisp dot org
--- Comment #2 from Bruno Haible <bruno at clisp dot org> ---
> glibc mbrtowc reports an encoding error in the C locale when given a byte
> in the range 128-255 decimal
Assume this is indeed to be considered a bug. Then we need to change the
character encoding that glibc associates with the C locale - because the
mbrtowc behaviour depends on (and must remain consistent with) the character
encoding of the locale. This character encoding, nl_langinfo(CODESET) or
equivalently $(locale charmap), currently is defined as
$ LC_ALL=C locale charmap
ANSI_X3.4-1968
ANSI_X3.4-1968, a.k.a. US-ASCII, is a 7-bit encoding.
To fix this bug, this encoding would need to be changed to an 8-bit encoding.
The question is: Which encoding?
> It was always the intent of POSIX that all 256 bytes be valid characters
> in the C locale
On the other hand, it was always the intent of the glibc i18n design (around
1999-2001) that users would use UTF-8 locales and that all plain text would be
encoded in UTF-8. This has come true (around 2005).
The C locale is still used in scripts that need to handle text in unknown
encodings. It is important here that no byte value >= 128 is considered to
have special character properties (per <ctype.h>), because this would have
undesired effects when processing byte sequences in UTF-8 encoding - which,
as said above, is the vast majority of text on current systems.
Therefore, when changing the value of nl_langinfo(CODESET) and
$(locale charmap), it is essential that we preserve the property that
no byte value >= 128 has special character properties. Otherwise we introduce
trouble in user scripts that have been working fine for the last 10 years.
In particular, this excludes the ISO-8859-* encodings.
We need an encoding that formally has 256 characters, but the characters
>= 128 are to be considered non-graphic (and therefore also non-printing).
And the mapping done by mbrtowc should not map these characters to defined
Unicode characters; I think they would best map into Private Use Areas of
Unicode. Thus the mapping table would
- map x (0 <= x <= 0x7F) to Unicode x,
- map x (0x80 <= x <= 0xFF) to Unicode 0xDF80+x (or similar).
There is no such encoding among the list of encodings - $(locale -m) or
http://www.haible.de/bruno/charsets/conversion-tables/index.html.
Should we create a new encoding with this property?
Or change the mapping tables of ANSI_X3.4-1968?
Either approach will create trouble for user programs:
- If we create a new encoding, software like telnet or ssh passes the
encoding to different machines, which will not recognize it.
- If we change the mapping tables of ANSI_X3.4-1968, existing uses of
"iconv -f ANSI_X3.4-1968" will exhibit a behaviour change.
* [Bug localedata/19932] mbrtowc returns (size_t) -1 in C locale
From: fweimer at redhat dot com @ 2016-04-11 11:48 UTC (permalink / raw)
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=19932
Florian Weimer <fweimer at redhat dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |fweimer at redhat dot com
Flags| |security-
* [Bug localedata/19932] C locale: mbrtowc returns (size_t) -1
From: vapier at gentoo dot org @ 2016-04-22 4:46 UTC (permalink / raw)
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=19932
Mike Frysinger <vapier at gentoo dot org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Summary|mbrtowc returns (size_t) -1 |C locale: mbrtowc returns
|in C locale |(size_t) -1
* [Bug localedata/19932] C locale: mbrtowc returns (size_t) -1
From: bruno at clisp dot org @ 2023-03-29 9:42 UTC (permalink / raw)
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=19932
--- Comment #4 from Bruno Haible <bruno at clisp dot org> ---
The POSIX bug has been fixed:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/mbrtowc.html
now says
"[EILSEQ]
An invalid character sequence is detected. [CX] [Option Start] In the
POSIX locale an [EILSEQ] error cannot occur since all byte values are valid
characters. [Option End]"
* [Bug localedata/19932] C locale: mbrtowc returns (size_t) -1
From: carenas at gmail dot com @ 2023-05-23 10:08 UTC (permalink / raw)
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=19932
Carlo Marcelo Arenas Belón <carenas at gmail dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |carenas at gmail dot com
* [Bug localedata/19932] C locale: mbrtowc returns (size_t) -1
From: sam at gentoo dot org @ 2023-06-28 20:12 UTC (permalink / raw)
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=19932
Sam James <sam at gentoo dot org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |sam at gentoo dot org