Date: Mon, 07 Aug 2017 10:41:00 -0000
From: Corinna Vinschen
To: cygwin@cygwin.com
Subject: Re: Unicode width data inconsistent/outdated
Message-ID: <20170807104127.GT25551@calimero.vinschen.de>
In-Reply-To: <20170807092820.GQ25551@calimero.vinschen.de>

On Aug 7 11:28, Corinna Vinschen wrote:
> On Aug 5 21:06, Thomas Wolff wrote:
> > On 04.08.2017 at 19:01, Corinna Vinschen wrote:
> > > This shouldn't matter to you, just keep it in place. It's a historical,
> > > low footprint conversion for Japanese characters without pulling in the
> > > unicode stuff. Not used on Cygwin so just ignore.
> > I had noticed meanwhile that this is not active in Cygwin, but it's broken
> > anyway for multiple reasons:
> > * platforms for which wchar_t is not Unicode should be explicitly listed
> > * if used, the transformation needs to be applied to all non-Unicode
> > locales (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
> > * for towupper and towlower, the result must be back-transformed into the
> > respective locale encoding
> > * particularly the locale-specific _l functions inconsistently do not use
> > the transformation but have this note:
>
> No, no, no. The functionality is restricted to certain use-cases and
> always was. It was a paid-for customer extension back in the day and it
> was *sufficient* for the use-cases. It's not clear how many newlib
> users are still using it, but it's not a good idea to remove it without
> checking first. That means, ask on the newlib mailing list how many are
> using the historical jp2uc code, and if we don't get a reply within,
> say, a month, we can probably nuke it.

To clarify where we're coming from:

If you look into newlib/libc/locale/locale.c, function __loadlocale,
you'll notice that outside of Cygwin, only six single-, double-, and
multi-byte codesets are supported at all:

  ASCII
  ISO-8859-1
  EUCJP
  JIS
  SJIS
  UTF-8

The multichar/widechar conversion functions for EUCJP, JIS and SJIS were
implemented to have a low footprint in the first place, see, for
instance, __sjis_wctomb in newlib/libc/stdlib/wctomb_r.c. This is all
about simplification for small targets. There was never a requirement
that converting a UTF-8 char to wchar_t, and converting the equivalent
SJIS char to wchar_t, would result in the same wide char.

Consequently, Cygwin does not use these conversion functions. Rather,
it uses Windows conversion functions, see the conversion functions in
winsup/cygwin/strfuncs.cc, to get a consistent wide char representation
(UTF-16). Another side-effect is that Cygwin does not support JIS at
all, only SJIS, see the comment in strfuncs.cc.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat