public inbox for libc-locales@sourceware.org
* [Bug localedata/19932] New: mbrtowc returns (size_t) -1 in C locale
@ 2016-04-09  8:15 eggert at gnu dot org
  2016-04-09 20:09 ` [Bug localedata/19932] " jim at meyering dot net
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: eggert at gnu dot org @ 2016-04-09  8:15 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=19932

            Bug ID: 19932
           Summary: mbrtowc returns (size_t) -1 in C locale
           Product: glibc
           Version: 2.22
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: eggert at gnu dot org
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

Created attachment 9173
  --> https://sourceware.org/bugzilla/attachment.cgi?id=9173&action=edit
test mbrtowc in the C locale

This follows up on a bug reported by Björn Jacke against GNU grep 2.23; see
<http://bugs.gnu.org/23234>. The bug occurs because GNU grep uses mbrtowc to
detect encoding errors, and because glibc mbrtowc reports an encoding error in
the C locale when given a byte in the range 128-255 decimal.

It was always the intent of POSIX that all 256 bytes be valid characters in the
C locale, as that was the traditional behavior. The standard did not state this
clearly, an omission that is planned to be fixed in a future version of POSIX;
see <http://austingroupbugs.net/view.php?id=663#c2738> (2015-07-02). Glibc
should be fixed to conform to this, i.e., mbrtowc should never return
(size_t) -1 in the C locale.

I plan to work around this bug in the gnulib mbrtowc module, which should fix
the grep bug; but this is a hack and will slow grep down a bit. The bug should
be fixed in glibc.
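
A workaround of this kind could look roughly like the sketch below. It is
not the gnulib code; the name c_locale_mbrtowc is hypothetical, and the
caller is assumed to have already determined that the current locale is the
C locale. On an encoding error it treats the offending byte as a one-byte
character instead of reporting failure.

```c
#include <errno.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical wrapper, not gnulib's actual implementation.
   Assumption: the caller only uses it while the C/POSIX locale is active.  */
size_t
c_locale_mbrtowc (wchar_t *pwc, const char *s, size_t n, mbstate_t *ps)
{
  int saved_errno = errno;
  size_t ret = mbrtowc (pwc, s, n, ps);

  if (ret == (size_t) -1 && s != NULL && n > 0)
    {
      /* Treat the rejected byte as a single-byte character.  */
      if (pwc != NULL)
        *pwc = (unsigned char) *s;
      /* The conversion state is unspecified after an error; reset it.  */
      if (ps != NULL)
        memset (ps, 0, sizeof *ps);
      /* Undo the EILSEQ that the failed call stored in errno.  */
      errno = saved_errno;
      return 1;
    }
  return ret;
}
```

The extra branch after every call is the kind of overhead that would slow a
caller such as grep down a bit.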

Please see the attached program for an illustration of the bug. The program
should output nothing and exit with status 0, but on glibc it outputs lines
like the following:

byte 0x80 (0200) encoding error
byte 0x81 (0201) encoding error
...
byte 0xff (0377) encoding error

and exits with status 1.
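
For readers without access to the attachment, the check can be sketched as
follows. This is a reconstruction from the description above, not the
contents of attachment 9173; the function name
count_c_locale_encoding_errors is made up for illustration.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Sketch only; not the attached test program.
   Feed each byte value in [first, last] to mbrtowc in the C locale,
   print a line for every byte that is rejected, and return how many
   bytes were rejected.  */
int
count_c_locale_encoding_errors (int first, int last)
{
  int errors = 0;

  setlocale (LC_ALL, "C");
  for (int b = first; b <= last; b++)
    {
      unsigned char byte = (unsigned char) b;
      wchar_t wc;
      mbstate_t state;

      /* Start every conversion from the initial shift state.  */
      memset (&state, 0, sizeof state);
      if (mbrtowc (&wc, (const char *) &byte, 1, &state) == (size_t) -1)
        {
          printf ("byte 0x%02x (%04o) encoding error\n", b, b);
          errors++;
        }
    }
  return errors;
}
```

A main function that returns count_c_locale_encoding_errors (0, 255) != 0
would give the exit status described above.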

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug localedata/19932] mbrtowc returns (size_t) -1 in C locale
  2016-04-09  8:15 [Bug localedata/19932] New: mbrtowc returns (size_t) -1 in C locale eggert at gnu dot org
@ 2016-04-09 20:09 ` jim at meyering dot net
  2016-04-09 20:10 ` eggert at gnu dot org
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: jim at meyering dot net @ 2016-04-09 20:09 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=19932

jim at meyering dot net <jim at meyering dot net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jim at meyering dot net

--- Comment #1 from jim at meyering dot net <jim at meyering dot net> ---
Thank you for driving this.


* [Bug localedata/19932] mbrtowc returns (size_t) -1 in C locale
  2016-04-09  8:15 [Bug localedata/19932] New: mbrtowc returns (size_t) -1 in C locale eggert at gnu dot org
  2016-04-09 20:09 ` [Bug localedata/19932] " jim at meyering dot net
  2016-04-09 20:10 ` eggert at gnu dot org
@ 2016-04-09 20:10 ` bruno at clisp dot org
  2016-04-11 11:48 ` fweimer at redhat dot com
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: bruno at clisp dot org @ 2016-04-09 20:10 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=19932

Bruno Haible <bruno at clisp dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bruno at clisp dot org

--- Comment #2 from Bruno Haible <bruno at clisp dot org> ---
> glibc mbrtowc reports an encoding error in the C locale when given a byte
> in the range 128-255 decimal

Assume this is indeed to be considered a bug. Then we need to change the
character encoding that glibc associates with the C locale - because the
mbrtowc behaviour depends on (and must remain consistent with) the character
encoding of the locale. This character encoding, nl_langinfo(CODESET) or
equivalently $(locale charmap), currently is defined as

$ LC_ALL=C locale charmap
ANSI_X3.4-1968

ANSI_X3.4-1968, a.k.a. US-ASCII, is a 7-bit encoding.
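
The same value can be queried from C. A minimal sketch, with the helper name
c_locale_codeset chosen here for illustration; the string returned depends
on the C library and its version:

```c
#include <langinfo.h>
#include <locale.h>

/* Illustrative helper: switch to the C locale and return the name of its
   character encoding, i.e. what "LC_ALL=C locale charmap" prints.  */
const char *
c_locale_codeset (void)
{
  setlocale (LC_ALL, "C");
  return nl_langinfo (CODESET);
}
```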

To fix this bug, this encoding would need to be changed to an 8-bit encoding.

The question is: Which encoding?

> It was always the intent of POSIX that all 256 bytes be valid characters
> in the C locale

On the other hand, it was always the intent of the glibc i18n design (around
1999-2001) that users would use UTF-8 locales and that all plain text would be
encoded in UTF-8. This has come true (around 2005).

The C locale is still used in scripts that need to handle text in unknown
encodings. It is important here that no byte value >= 128 is considered to
have special character properties (per <ctype.h>), because this would have
undesired effects when processing byte sequences in UTF-8 encoding - which,
as said above, is the vast majority of text on current systems.

Therefore, when changing the value of nl_langinfo(CODESET) and
$(locale charmap), it is essential that we preserve the property that
no byte value >= 128 has special character properties. Otherwise we introduce
trouble in user scripts that have been working fine for the last 10 years.

In particular, this excludes the ISO-8859-* encodings.

We need an encoding that formally has 256 characters, but the characters
>= 128 are to be considered non-graphic (and therefore also non-printing).
And the mapping done by mbrtowc should not map these characters to defined
Unicode characters; I think they would best map into Private Use Areas of
Unicode. Thus the mapping table would
- map x (0 <= x <= 0x7F) to Unicode x,
- map x (0x80 <= x <= 0xFF) to Unicode 0xDF80+x (or similar).
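
As a sketch, this table amounts to the following function (the name
c_locale_byte_to_ucs is hypothetical). With the 0xDF80 offset, bytes
0x80..0xFF land on U+E000..U+E07F, the start of the Private Use Area:

```c
#include <stdint.h>

/* Illustration of the proposed mapping; not existing glibc code.  */
uint32_t
c_locale_byte_to_ucs (unsigned char x)
{
  if (x <= 0x7F)
    return x;            /* ASCII maps to itself.  */
  return 0xDF80u + x;    /* 0x80..0xFF -> U+E000..U+E07F.  */
}
```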

There is no such encoding among the list of encodings - $(locale -m) or
http://www.haible.de/bruno/charsets/conversion-tables/index.html.
Should we create a new encoding with this property?
Or change the mapping tables of ANSI_X3.4-1968?
Either approach will create trouble for user programs:
- If we create a new encoding, software like telnet or ssh passes the
  encoding to different machines, which will not recognize it.
- If we change the mapping tables of ANSI_X3.4-1968, existing uses of
  "iconv -f ANSI_X3.4-1968" will exhibit a behaviour change.


* [Bug localedata/19932] mbrtowc returns (size_t) -1 in C locale
  2016-04-09  8:15 [Bug localedata/19932] New: mbrtowc returns (size_t) -1 in C locale eggert at gnu dot org
  2016-04-09 20:09 ` [Bug localedata/19932] " jim at meyering dot net
@ 2016-04-09 20:10 ` eggert at gnu dot org
  2016-04-09 20:10 ` bruno at clisp dot org
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: eggert at gnu dot org @ 2016-04-09 20:10 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=19932

--- Comment #3 from Paul Eggert <eggert at gnu dot org> ---
(In reply to Bruno Haible from comment #2)
> Thus the mapping table would
> - map x (0 <= x <= 0x7F) to Unicode x,
> - map x (0x80 <= x <= 0xFF) to Unicode 0xDF80+x (or similar).

Emacs maps the latter to 0x3FFF00+x (that is, 0x3FFF80 through 0x3FFFFF), I
suppose under the theory that these integers are not Unicode code points, and
thus won't be conflated with private-use Unicode characters. I suppose we could
be "compatible" with Emacs. Are there other examples in the wild of this sort
of thing, or is the Emacs precedent good enough?

> Should we create a new encoding with this property?
> Or change the mapping tables of ANSI_X3.4-1968?

It is a bit of a dilemma. Would it make sense to change iconv so that it
recognizes values like 0x3FFF80 as corresponding to encoding-error bytes? iconv
could then behave the same way as before, even if we change the mapping tables
of ANSI_X3.4-1968.
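
A rough sketch of that idea, assuming the Emacs-style layout in which raw
bytes 0x80..0xFF occupy the values 0x3FFF80..0x3FFFFF. All names here are
hypothetical; nothing like this exists in glibc's iconv. A converter could
use code_is_raw_byte to report an encoding error for such values, as it does
today for the bytes themselves.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: raw byte x (0x80..0xFF) is stored as RAW_BYTE_BASE + x.  */
#define RAW_BYTE_BASE 0x3FFF00u

/* Map an input byte to its wide value under the assumed layout.  */
uint32_t
raw_byte_to_code (unsigned char x)
{
  return x < 0x80 ? x : RAW_BYTE_BASE + x;
}

/* True if C stands for an undecodable byte rather than a character.  */
bool
code_is_raw_byte (uint32_t c)
{
  return c >= RAW_BYTE_BASE + 0x80 && c <= RAW_BYTE_BASE + 0xFF;
}

/* Recover the original byte; only meaningful if code_is_raw_byte (c).  */
unsigned char
code_to_raw_byte (uint32_t c)
{
  return (unsigned char) (c - RAW_BYTE_BASE);
}
```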


* [Bug localedata/19932] mbrtowc returns (size_t) -1 in C locale
  2016-04-09  8:15 [Bug localedata/19932] New: mbrtowc returns (size_t) -1 in C locale eggert at gnu dot org
                   ` (2 preceding siblings ...)
  2016-04-09 20:10 ` bruno at clisp dot org
@ 2016-04-11 11:48 ` fweimer at redhat dot com
  2016-04-22  4:46 ` [Bug localedata/19932] C locale: mbrtowc returns (size_t) -1 vapier at gentoo dot org
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: fweimer at redhat dot com @ 2016-04-11 11:48 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=19932

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fweimer at redhat dot com
              Flags|                            |security-


* [Bug localedata/19932] C locale: mbrtowc returns (size_t) -1
  2016-04-09  8:15 [Bug localedata/19932] New: mbrtowc returns (size_t) -1 in C locale eggert at gnu dot org
                   ` (3 preceding siblings ...)
  2016-04-11 11:48 ` fweimer at redhat dot com
@ 2016-04-22  4:46 ` vapier at gentoo dot org
  2023-03-29  9:42 ` bruno at clisp dot org
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: vapier at gentoo dot org @ 2016-04-22  4:46 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=19932

Mike Frysinger <vapier at gentoo dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|mbrtowc returns (size_t) -1 |C locale: mbrtowc returns
                   |in C locale                 |(size_t) -1


* [Bug localedata/19932] C locale: mbrtowc returns (size_t) -1
  2016-04-09  8:15 [Bug localedata/19932] New: mbrtowc returns (size_t) -1 in C locale eggert at gnu dot org
                   ` (4 preceding siblings ...)
  2016-04-22  4:46 ` [Bug localedata/19932] C locale: mbrtowc returns (size_t) -1 vapier at gentoo dot org
@ 2023-03-29  9:42 ` bruno at clisp dot org
  2023-05-23 10:08 ` carenas at gmail dot com
  2023-06-28 20:12 ` sam at gentoo dot org
  7 siblings, 0 replies; 9+ messages in thread
From: bruno at clisp dot org @ 2023-03-29  9:42 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=19932

--- Comment #4 from Bruno Haible <bruno at clisp dot org> ---
The POSIX bug has been fixed:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/mbrtowc.html
now says
"[EILSEQ]
    An invalid character sequence is detected. [CX] [Option Start]  In the
POSIX locale an [EILSEQ] error cannot occur since all byte values are valid
characters. [Option End]"


* [Bug localedata/19932] C locale: mbrtowc returns (size_t) -1
  2016-04-09  8:15 [Bug localedata/19932] New: mbrtowc returns (size_t) -1 in C locale eggert at gnu dot org
                   ` (5 preceding siblings ...)
  2023-03-29  9:42 ` bruno at clisp dot org
@ 2023-05-23 10:08 ` carenas at gmail dot com
  2023-06-28 20:12 ` sam at gentoo dot org
  7 siblings, 0 replies; 9+ messages in thread
From: carenas at gmail dot com @ 2023-05-23 10:08 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=19932

Carlo Marcelo Arenas Belón <carenas at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carenas at gmail dot com


* [Bug localedata/19932] C locale: mbrtowc returns (size_t) -1
  2016-04-09  8:15 [Bug localedata/19932] New: mbrtowc returns (size_t) -1 in C locale eggert at gnu dot org
                   ` (6 preceding siblings ...)
  2023-05-23 10:08 ` carenas at gmail dot com
@ 2023-06-28 20:12 ` sam at gentoo dot org
  7 siblings, 0 replies; 9+ messages in thread
From: sam at gentoo dot org @ 2023-06-28 20:12 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=19932

Sam James <sam at gentoo dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sam at gentoo dot org

