From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (qmail 44882 invoked by alias); 9 Apr 2016 20:10:02 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: libc-locales-owner@sourceware.org
Received: (qmail 37335 invoked by uid 48); 9 Apr 2016 16:07:28 -0000
From: "bruno at clisp dot org"
To: libc-locales@sourceware.org
Subject: [Bug localedata/19932] mbrtowc returns (size_t) -1 in C locale
Date: Sat, 09 Apr 2016 20:10:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: localedata
X-Bugzilla-Version: 2.22
X-Bugzilla-Severity: normal
X-Bugzilla-Who: bruno at clisp dot org
X-Bugzilla-Status: NEW
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at sourceware dot org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Changed-Fields: cc
Content-Type: text/plain; charset="UTF-8"
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2016-q2/txt/msg00019.txt.bz2

https://sourceware.org/bugzilla/show_bug.cgi?id=19932

Bruno Haible changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bruno at clisp dot org

--- Comment #2 from Bruno Haible ---
> glibc mbrtowc reports an encoding error in the C locale when given a byte
> in the range 128-255 decimal

Assume this is indeed to be considered a bug. Then we need to change the
character encoding that glibc associates with the C locale - because the
mbrtowc behaviour depends on (and must remain consistent with) the character
encoding of the locale.
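
A minimal sketch for observing the quoted behaviour, assuming a hosted C
environment with <langinfo.h>. The helper names decode_one_byte and
report_byte are made up for illustration and are not part of the bug report.
What the sketch prints for byte 0x80 depends on the libc and its version, so
no expected output is given.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Illustrative helper (not from the bug report): decode one byte with
   mbrtowc() in the current locale. Returns the raw mbrtowc() result;
   on success *out holds the resulting wide character. */
size_t decode_one_byte(unsigned char byte, wchar_t *out)
{
    mbstate_t state;
    char buf[1];

    memset(&state, 0, sizeof state);   /* initial conversion state */
    buf[0] = (char) byte;
    return mbrtowc(out, buf, 1, &state);
}

/* Illustrative helper: print how mbrtowc() treats a single byte. */
void report_byte(unsigned char byte)
{
    wchar_t wc = 0;
    size_t r = decode_one_byte(byte, &wc);

    if (r == (size_t) -1)
        printf("0x%02X: encoding error\n", (unsigned) byte);
    else if (r == (size_t) -2)
        printf("0x%02X: incomplete sequence\n", (unsigned) byte);
    else
        printf("0x%02X: wide character 0x%lX\n",
               (unsigned) byte, (unsigned long) wc);
}

#ifdef DEMO_MAIN   /* build with -DDEMO_MAIN to get a standalone program */
int main(void)
{
    setlocale(LC_ALL, "C");
    printf("codeset: %s\n", nl_langinfo(CODESET));
    report_byte(0x41);   /* plain ASCII */
    report_byte(0x80);   /* first byte outside the 7-bit range */
    return 0;
}
#endif
```

The codeset line and the result for 0x80 are printed together so that the
two can be checked for consistency with each other.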
This character encoding, nl_langinfo(CODESET) or equivalently
$(locale charmap), currently is defined as

  $ LC_ALL=C locale charmap
  ANSI_X3.4-1968

ANSI_X3.4-1968, a.k.a. US-ASCII, is a 7-bit encoding. To fix this bug, this
encoding would need to be changed to an 8-bit encoding. The question is:
Which encoding?

> It was always the intent of POSIX that all 256 bytes be valid characters
> in the C locale

On the other hand, it was always the intent of the glibc i18n design (around
1999-2001) that users would use UTF-8 locales and that all plain text would
be encoded in UTF-8. This has come true (around 2005).

The C locale is still used in scripts that need to handle text in unknown
encodings. It is important here that no byte value >= 128 is considered to
have special character properties (per ), because this would have undesired
effects when processing byte sequences in UTF-8 encoding - which, as said
above, is the vast majority of text on current systems.

Therefore, when changing the value of nl_langinfo(CODESET) and
$(locale charmap), it is essential that we preserve the property that no
byte value >= 128 has special character properties. Otherwise we introduce
trouble in user scripts that have been working fine for the last 10 years.

In particular, this excludes the ISO-8859-* encodings.

We need an encoding that formally has 256 characters, but the characters
>= 128 are to be considered non-graphic (and therefore also non-printing).
And the mapping done by mbrtowc should not map these characters to defined
Unicode characters; I think they would best map into Private Use Areas of
Unicode.

Thus the mapping table would
  - map x (0 <= x <= 0x7F) to Unicode x,
  - map x (0x80 <= x <= 0xFF) to Unicode 0xDF80+x (or similar).

There is no such encoding among the list of encodings - $(locale -m) or
http://www.haible.de/bruno/charsets/conversion-tables/index.html.

Should we create a new encoding with this property?
Or change the mapping tables of ANSI_X3.4-1968?

Either approach will create trouble for user programs:
  - If we create a new encoding, software like telnet or ssh passes the
    encoding to different machines, which will not recognize it.
  - If we change the mapping tables of ANSI_X3.4-1968, existing uses of
    "iconv -f ANSI_X3.4-1968" will exhibit a behaviour change.
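
The mapping table described earlier in this message (x to Unicode x for
0 <= x <= 0x7F, x to Unicode 0xDF80+x for 0x80 <= x <= 0xFF) can be sketched
as follows. This is only an illustration of that table, not glibc code: the
names HIGH_BYTE_OFFSET, c_locale_byte_to_ucs and c_locale_ucs_to_byte are
hypothetical, and the offset 0xDF80 is the "(or similar)" value from the
message, which places the high bytes at U+E000..U+E07F.

```c
#include <stdint.h>
#include <stdio.h>

/* Offset from the message: bytes 0x80..0xFF map to 0xDF80 + x, that is
   U+E000..U+E07F, the start of the BMP Private Use Area. */
#define HIGH_BYTE_OFFSET 0xDF80u

/* Hypothetical helper: map one byte to a Unicode code point. */
uint32_t c_locale_byte_to_ucs(unsigned char x)
{
    return x <= 0x7F ? (uint32_t) x : HIGH_BYTE_OFFSET + x;
}

/* Hypothetical inverse: return the byte for a code point, or -1 if the
   code point is not in the table. */
int c_locale_ucs_to_byte(uint32_t cp)
{
    if (cp <= 0x7F)
        return (int) cp;
    if (cp >= HIGH_BYTE_OFFSET + 0x80 && cp <= HIGH_BYTE_OFFSET + 0xFF)
        return (int) (cp - HIGH_BYTE_OFFSET);
    return -1;
}

#ifdef DEMO_MAIN   /* build with -DDEMO_MAIN to get a standalone program */
int main(void)
{
    /* Print the mapping at the boundaries of the two ranges. */
    static const unsigned char samples[] = { 0x00, 0x7F, 0x80, 0xFF };
    for (size_t i = 0; i < sizeof samples; i++)
        printf("0x%02X -> U+%04lX\n", (unsigned) samples[i],
               (unsigned long) c_locale_byte_to_ucs(samples[i]));
    return 0;
}
#endif
```

Because every byte maps to a distinct code point and back, arbitrary byte
sequences (including UTF-8 text) survive a round trip unchanged, while none
of the bytes >= 0x80 lands on a defined Unicode character.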