From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <4D46EA2B.1010307@redhat.com>
Date: Mon, 31 Jan 2011 19:16:00 -0000
From: Eric Blake
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.13) Gecko/20101209 Fedora/3.1.7-0.35.b3pre.fc14 Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.7
MIME-Version: 1.0
To: Bruno Haible
CC: bug-gnulib@gnu.org, cygwin, bug-coreutils
Subject: Re: 16-bit wchar_t on Windows and Cygwin
References: <201101310304.42975.bruno@clisp.org>
In-Reply-To: <201101310304.42975.bruno@clisp.org>
OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg
Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="------------enig94CF3FEB4BA742E2A08505A3"
Mailing-List: contact cygwin-help@cygwin.com; run by ezmlm
Sender: cygwin-owner@cygwin.com
Mail-Followup-To: cygwin@cygwin.com
X-SW-Source: 2011-01/txt/msg00457.txt.bz2

--------------enig94CF3FEB4BA742E2A08505A3
Content-Type: text/plain;
 charset=UTF-8

[adding cygwin and coreutils for a wc issue]

On 01/30/2011 07:04 PM, Bruno Haible wrote:
> Hi,
>
> It is known for a long time that on native Windows, the wchar_t[] encoding on
> strings is UTF-16. [1] Now, Corinna Vinschen has confirmed that it is the same
> for Cygwin >= 1.7. [2]

POSIX requires that 1 wchar_t corresponds to 1 character, so any use of
surrogates to get the full benefit of UTF-16 falls outside the bounds of
POSIX.  At that point, the POSIX definition of those functions no longer
applies, and we can try to make the various wc* functions behave as
smartly as possible (as is the case with Cygwin); those smarts are only
needed when you use surrogate pairs.  If cygwin's approach is correct,
then maybe the thing to do is codify those smarts for all
implementations with 16-bit wchar_t as an extension to POSIX that all
gnulib clients can rely on, and thus minimize the #ifdefs in such
clients.

> What consequences does this have?
>
> 1) All code that uses the functions from <wctype.h> (wide character
>    classification and mapping) or wcwidth() malfunctions on strings that
>    contains Unicode characters outside the BMP, i.e. outside the range
>    U+0000..U+FFFF.

Not necessarily.  Such code falls outside of POSIX, but it may still be
a well-behaved extension if given sane behavior for how to deal with
surrogates.

> 2) Code that uses mbrtowc() or wcrtomb() is also likely to malfunction.
>    On Cygwin >= 1.7 mbrtowc() and wcrtomb() is implemented in an intelligent
>    but somewhat surprising way: wcrtomb() may return 0, that is, produce no
>    output bytes when it consumes a wchar_t.
> Now with a chinese character outside the BMP:
>   $
>         1       4
>   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
>         3       6
>
> On Cygwin 1.7.5 (with LANG=C.UTF-8 and 'wc' from GNU coreutils 8.5):
>
>   $ printf 'a\xf0\xa1\x88\xb4b\n' | wc -w -m
>         1       5
>   $ printf 'a \xf0\xa1\x88\xb4 b\n' | wc -w -m
>         2       7
>
> So both the number of characters and the number of words are counted
> wrong as soon as non-BMP characters occur.
>

Does this represent a bug in cygwin's mbrtowc routines that could be
fixed by cygwin?  Or does it represent a bug in coreutils for using
mbrtowc one character at a time instead of something like mbsrtowcs to
do bulk conversions?

And if we decide that cygwin's extensions are sane, how much harder is
it to characterize what a program must do to be portable to both 16-bit
and 32-bit wchar_t, if it is guaranteed the same behavior on all hosts
with the same-size wchar_t?  In other words, would it really require
that many #ifdefs in coreutils to portably and simultaneously support
both sizes of wchar_t?

> I'm more in favour of overriding wchar_t and all functions that depend on it -
> like we did successfully for the socket functions.
>
> In practice, this would mean that on Windows (both native Windows and
> Cygwin >= 1.7) the use of a 'wchar_t' module will
>   - override wchar_t to be 32 bits, like in glibc,
>   - cause functions from mbrtowc() to wcwidth() to be overridden. Since the
>     corresponding system functions are unusable, the replacements will use the
>     modules from libunistring (such as unictype/ctype-alnum and uniwidth/width).

That's a lot of overriding, for anything that uses wchar_t in its API,
and it throws out a lot of what cygwin already provides.  It also means
that compiler primitives, like L"xyz", which result in 16-bit wchar_t
arrays, will be unusable with your 32-bit wchar_t override.  In other
words, I don't think it's a good idea to be doing that.
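
To make the 16-bit case concrete, here is a rough sketch of the
surrogate-aware counting that a program would need when wchar_t holds
UTF-16 code units.  This is not coreutils or Cygwin code: the function
name utf16_count_chars is made up, and uint16_t stands in for a 16-bit
wchar_t so that the sketch behaves the same on any host.

```c
#include <stddef.h>
#include <stdint.h>

/* Count characters in the UTF-16 buffer s[0..n).
   A high surrogate (0xD800..0xDBFF) immediately followed by a low
   surrogate (0xDC00..0xDFFF) is counted as one character.  An unpaired
   surrogate is counted on its own, as one (invalid) character.  */
size_t
utf16_count_chars (const uint16_t *s, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    {
      int is_high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
      if (is_high && i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
        i++;                    /* skip the low half of the pair */
      count++;
    }
  return count;
}
```

The bytes \xf0\xa1\x88\xb4 from the wc example are U+21234, which is the
pair 0xD844 0xDE34 in UTF-16, so 'a\xf0\xa1\x88\xb4b\n' is 5 code units
but 4 characters under this counting.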

C1x will be adding compiler support for mandatory char16_t and char32_t
types for UTF-16 and UTF-32 data, independently of whether wchar_t is
16-bit or 32-bit.  Maybe the better thing is to proactively start
providing the new interfaces in <uchar.h> that will result from C1x
adoption (and convert GNU programs to use them rather than wchar_t for
character operations).  However, without compiler support for u"" and
U"" (and even u8""), we are no better off than if we ditched compiler
support for L"" by forcing a wchar_t size override.

http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1516.pdf lists:

7.27 Unicode utilities <uchar.h>

1 The header <uchar.h> declares types and functions for manipulating
  Unicode characters.

2 The types declared are mbstate_t (described in 7.29.1) and size_t
  (described in 7.19);
    char16_t
  which is an unsigned integer type used for 16-bit characters and is
  the same type as uint_least16_t (described in 7.20.1.2); and
    char32_t
  which is an unsigned integer type used for 32-bit characters and is
  the same type as uint_least32_t (also described in 7.20.1.2).

mbrtoc16
c16rtomb
mbrtoc32
c32rtomb

but no variants for replacing wprintf and friends (convert to multibyte
and use printf and friends instead).
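
For what it's worth, a rough sketch of counting characters with the
proposed mbrtoc32 interface could look like the following.  It assumes
a libc that already ships <uchar.h>, which is not a given today; the
helper name count_chars_c32 is made up for illustration and is not
part of n1516 or of any existing program.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Count the char32_t values obtained by decoding the multibyte buffer
   s[0..n) with mbrtoc32.  The bytes are interpreted according to the
   current locale's LC_CTYPE.  Returns (size_t) -1 if the input holds
   an invalid or truncated multibyte sequence.  */
size_t
count_chars_c32 (const char *s, size_t n)
{
  mbstate_t state;
  size_t count = 0;

  memset (&state, 0, sizeof state);     /* initial conversion state */
  while (n > 0)
    {
      char32_t c;
      size_t r = mbrtoc32 (&c, s, n, &state);

      if (r == (size_t) -1 || r == (size_t) -2)
        return (size_t) -1;             /* invalid or incomplete input */
      if (r == (size_t) -3)
        {
          count++;                      /* extra value, no bytes consumed */
          continue;
        }
      if (r == 0)
        r = 1;                          /* a null byte takes one byte */
      s += r;
      n -= r;
      count++;
    }
  return count;
}
```

After setlocale (LC_ALL, "") in a UTF-8 locale, the four bytes
\xf0\xa1\x88\xb4 should come back as a single char32_t no matter how
wide wchar_t is, which is the point of using char32_t here.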

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

--------------enig94CF3FEB4BA742E2A08505A3
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"
Content-length: 619

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Public key at http://people.redhat.com/eblake/eblake.gpg
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/

iQEcBAEBCAAGBQJNRuorAAoJEKeha0olJ0Nq75oH/RpS/V6+I5kdmDbm3JNIQeS5
SwN7b6/jhycI9Hs5y/MvjSfo0auhwstLyGPutmqtDTAnJ3TRjO/NDUshuBo3vDMg
6jLLzYwqKRAyEFMmSpLygON8UIgrAScJxb5gEmRwzW1m6Y4zZojfVDpO/qRmhXfJ
y+9rSgDhpU4ex3Pevg9IuGFHVNh11ClNEFm96cJjFYLK46zQXyGaY6UrZO6CkcYf
bVwzLD5nWx3btYi75XdBppPvx1hA9q6e291BrAgf6IU1zhq76TX9k9D9HZIu7FEh
bv8gDkYy/T5FCF4+qo2/TtOvAX3H9kbkwPUziH8lQ+fcbbt5euRvCbM/HjkfSN0=
=m8Gr
-----END PGP SIGNATURE-----

--------------enig94CF3FEB4BA742E2A08505A3--