From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact cygwin-help@cygwin.com; run by ezmlm
Sender: cygwin-owner@cygwin.com
Date: Fri, 04 Aug 2017 17:02:00 -0000
From: Corinna Vinschen
To: cygwin@cygwin.com
Subject: Re: Unicode width data inconsistent/outdated
Message-ID: <20170804170156.GL25551@calimero.vinschen.de>
Reply-To: cygwin@cygwin.com
Mail-Followup-To: cygwin@cygwin.com
References: <20170726080859.GA24312@calimero.vinschen.de> <5d3cb047-49f8-26a6-d816-387a71486e99@cygwin.com> <20170726095016.GA25666@calimero.vinschen.de> <289bd98b-e644-888d-07f8-8965b6538373@towo.net> <20170728195826.GI24013@calimero.vinschen.de> <1244bd24-bb27-d185-1f24-61beae02c2cd@towo.net>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
    protocol="application/pgp-signature"; boundary="PW0Eas8rCkcu1VkF"
Content-Disposition: inline
In-Reply-To: <1244bd24-bb27-d185-1f24-61beae02c2cd@towo.net>
User-Agent: Mutt/1.8.3 (2017-05-23)
X-SW-Source: 2017-08/txt/msg00049.txt.bz2

--PW0Eas8rCkcu1VkF
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Content-length: 7567

On Aug  3 21:44, Thomas Wolff wrote:
> On 28.07.2017 at 21:58, Corinna Vinschen wrote:
> > On Jul 26 23:43, Thomas Wolff wrote:
> > > On 26.07.2017 at 11:50, Corinna Vinschen wrote:
> > > > On Jul 26 03:16, Yaakov Selkowitz wrote:
> > > > > On 2017-07-26 03:08, Corinna Vinschen wrote:
> > > > > > On Jul 26 08:49, Thomas Wolff wrote:
> > > > > > > It would be good to keep wcwidth/wcswidth in sync with the installed
> > > > > > > Unicode data version (package unicode-ucd).
> > > > > > > Currently it seems to be hard-coded (in newlib/libc/string/wcwidth.c);
> > > > > > > it refers to Unicode 5.0 while installed Unicode data suggest 9.0 would
> > > > > > > be used.
> > > > > > > I can provide some scripts to generate the respective tables if desired.
> > > > > > > Thomas
> > > > > > If you can update the newlib files this way and send matching patches
> > > > > > to the newlib list, this would be highly appreciated.
> > > > > Thomas, I just updated unicode-ucd to 10.0 for this purpose.
> > > Thanks.
> > > > Oh, and, btw, the comment in wcwidth.c isn't quite correct.  The
> > > > cwstate in newlib is on Unicode 5.2, see newlib/libc/ctype/towupper.c.
> > > Oh, a number of other embedded tables. To make the tow* and isw* functions
> > > more easily adaptable to Unicode updates, there will be some revisions to do
> > > here. And the to* and is* ones (without 'w') even refer to locales in a way
> > > I do not understand. Maybe I'll restrict my effort to wcwidth first...
> > The to* and is* ones (without 'w') don't matter at all and you don't
> > have to touch them.
> >
> > The Unicode stuff only affects the tow and isw functions.
> >
> > As for how to fetch the data, you may want to have a look into
> > newlib/libc/ctype/utf8alpha.h and newlib/libc/ctype/utf8print.h.  The
> > header comments contain the awk scripts used to collect the data.
> But there are no instructions to adapt the embedded conditional statements
> referring to those data...

Tables are scanned in order.  Each table handles a range of 256
characters.  Each table entry is the lower 8 bits of a character that
matches the condition.  A 0 entry (except in array position 0) is a
continuation marker, which means that all chars between the previous
and the next value match the condition.

Here's an example from utf8alpha.h:

  static const unsigned char ua7[] = {
    0x17, 0x0, 0x1f, 0x22, 0x0, 0x88, 0x8b, 0x8c, 0xfb, 0x0, 0xff };

ua7 is the array handling the characters in the range 0xa700 up to
0xa7ff.  The first alpha character in this range is 0xa717.  The next
entry in the array is a 0x0, followed by 0x1f.  That means all
characters from 0xa717 up to 0xa71f are alphas.  Then we have a 0x22,
a 0, and a 0x88, so all chars from 0xa722 up to 0xa788 are alphas.
Then we have two entries not followed by a 0, so they just stand for
themselves: 0xa78b and 0xa78c are alpha chars.  The last group 0xfb,
0x0, 0xff of course means that 0xa7fb up to 0xa7ff are alpha chars.

> My attempt would be to base the functions on a common table of character
> categories instead.

Keep in mind that the table is not loaded into memory on demand, as on
Linux.  Rather, it will be part of the Cygwin DLL and, worse, in the
case of newlib, part of any target using the wctype functions.

The idea here is that the tables take less space than a full-fledged
category table.  The tables in utf8print.h and utf8alpha.h and the code
in iswalpha and iswprint combined are 10K, code and data of the
tolower/toupper functions are 7K, wcwidth 3K, so a total of 20K,
covering Unicode 5.2 with 107K codepoints.
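
A minimal sketch of how such a table could be scanned, based on the
encoding described above for the ua7 example.  The helper names
table_match and ua7_is_alpha are hypothetical and only illustrate the
idea; this is not the actual newlib code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Table copied from the utf8alpha.h example above (range 0xa700-0xa7ff). */
static const unsigned char ua7[] = {
  0x17, 0x0, 0x1f, 0x22, 0x0, 0x88, 0x8b, 0x8c, 0xfb, 0x0, 0xff };

/* Hypothetical helper: scan the table in order.  An entry followed by a
   0 marker and another entry denotes an inclusive range; any other
   entry stands for itself. */
static bool
table_match (const unsigned char *tab, size_t len, unsigned char low)
{
  assert (tab != NULL);
  for (size_t i = 0; i < len; i++)
    {
      if (i + 2 < len && tab[i + 1] == 0)
        {
          /* start, 0, end: all low bytes from start to end match. */
          if (low >= tab[i] && low <= tab[i + 2])
            return true;
          i += 2;               /* skip the marker and the end value */
        }
      else if (tab[i] == low)
        return true;            /* single entry standing for itself */
    }
  return false;
}

/* Hypothetical wrapper: only answers for code points 0xa700-0xa7ff. */
static bool
ua7_is_alpha (unsigned int c)
{
  if ((c >> 8) != 0xa7)
    return false;
  return table_match (ua7, sizeof ua7, (unsigned char) (c & 0xff));
}
```

With this sketch, ua7_is_alpha (0xa717) and ua7_is_alpha (0xa78c) are
true, while ua7_is_alpha (0xa720) and ua7_is_alpha (0xa789) are false.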

A category table would have to contain the category bits for the entire
Unicode codepoint range.  The number of potential bits is > 8 as far as
I know, so it needs 2 bytes per char, but let's make that 1 byte for
now.  For Unicode 5.2, the table alone would be at least 107K, and that
would only cover the iswXXX functions.

> > All other isw* files like iswblank.c contain comments explaining
> > what Unicode character categories are covered.
> I'm comparing results based on Unicode 5.2 data. There will be some
> deviations and maybe some things to discuss.
> For example, I wonder why in the current implementation currency symbols are
> considered as punctuation (which can be easily reproduced).

  iswpunct (c) == (!iswalnum (c) && iswgraph (c))

The Linux man page says:

  This function's name is a misnomer when dealing with Unicode characters,
  because the wide-character class "punct" contains both punctuation
  characters and symbol (math, currency, etc.) characters.

> Also, there are 3 other issues:
>
> Issue 1 is about handling non-BMP characters by wcwidth.
> This has been discussed before.
> [...]
> (https://sourceware.org/ml/cygwin/2011-02/msg00040.html)
> Corinna Vinschen wrote:
> > And, please note the wording in SUSv4, for instance in
> > http://calimero.vinschen.de/susv4/functions/iswalpha.html
> (not found)

Oops, good one.  Just see the upstream SUSv4 iswalpha man page.

> > The wc argument is a wint_t, the value of which the application shall
> >                      ^^^^^^                         ^^^^^^^^^^^
> > ensure is a wide-character code corresponding to a valid character in
> > the current locale, or equal to the value of the macro WEOF. If the
> > argument has any other value, the behavior is undefined.
> > I don't see any words in that which would disallow to convert UTF-16
> > wchar_t surrogates to a wint_t UTF-32 value before calling one of the
> > wctype functions.  Just like you have to be careful not to call the
> > ctype functions with a signed char.
>
> While wcswidth works already (using internal __wcwidth), and the isw* and
> tow* functions work as well because they use wint_t, wcwidth is the only
> function (inconsistently insisting on wchar_t) that does not work.

We're trying to stay close to the standard here.

> But note https://linux.die.net/man/3/wcwidth which says
> > Note that glibc before 2.2.5 used the prototype
> > int wcwidth(wint_t c);
> Why not revert to wcwidth(wint_t)?
> I think for cygwin it is the only solution that makes wcwidth work for
> non-BMP characters and is also compatible (unlike some proposals discussed
> later in the quoted thread).

We can do this, but it may result in complaints from the other newlib
consumers.  If in doubt, use #ifdef __CYGWIN__.

> Issue 2 is the handling of titlecase characters (e.g. "Nj" as one Unicode
> character U+01CB). The current implementation considers them to be both
> upper and lower (iswupper: return towlower (c) != c); I'd rather consider
> them as neither upper nor lower (iswalpha (c) && towupper (c) == c).
> https://linux.die.net/man/3/iswupper allows both interpretations:
> > The wide-character class "upper" contains *at least* those characters wc
> > which are equal to towupper(wc) and different from towlower(wc).

SUSv4 says "The iswupper() [...] functions shall test whether wc is a
wide-character code representing a character of class upper."  Whatever
does that correctly with a low footprint is fine.

> Issue 3 is the special conversion jp2uc which seems to be half-bred; there
> is no such handling for Chinese or Korean.

This shouldn't matter to you, just keep it in place.  It's a historical,
low-footprint conversion for Japanese characters without pulling in the
Unicode stuff.  It's not used on Cygwin, so just ignore it.
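
Regarding Issue 1 above: a minimal sketch of converting a UTF-16
surrogate pair to a UTF-32 value before calling one of the wctype
functions.  The helper name utf16_pair_to_utf32 is hypothetical and not
part of newlib, and fixed-width integer types are used here instead of
wchar_t and wint_t so the sketch does not depend on their sizes.  It
assumes the caller already knows the two units form a valid high/low
surrogate pair.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of newlib: combine a UTF-16 surrogate
   pair into the UTF-32 code point it encodes.  hi must be a high
   surrogate (0xd800-0xdbff), lo a low surrogate (0xdc00-0xdfff). */
static uint32_t
utf16_pair_to_utf32 (uint16_t hi, uint16_t lo)
{
  assert (hi >= 0xd800 && hi <= 0xdbff);
  assert (lo >= 0xdc00 && lo <= 0xdfff);
  /* Each surrogate carries 10 bits; the pair encodes cp - 0x10000. */
  uint32_t high_bits = (uint32_t) (hi - 0xd800) << 10;
  uint32_t low_bits = (uint32_t) (lo - 0xdc00);
  return 0x10000 + (high_bits | low_bits);
}
```

The result could then be passed to e.g. iswalpha, provided wint_t on
the target is wide enough to hold a UTF-32 value.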

Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

--PW0Eas8rCkcu1VkF
Content-Type: application/pgp-signature; name="signature.asc"
Content-length: 819

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCAAGBQJZhKiEAAoJEPU2Bp2uRE+ga00P/3hXui9rAnitM02BQ7HkcMnM
UUjoJuNIWjekbjRtWTks5wVs1WOkPiYwDkxqnnmEl48rVBfh1+2ZOETgGtkQVNB7
QAWfUNSyMASpksy1X3B5ctlf+eASaBhFP1ZXf30LIS+AsjHSyBqq1Lb4ZMNRNMAf
IJr35ZfnJ+xasldRbbwKtXv/5k2qzvfPmRFm+FoBJQEEFHOLCE21JxsH3Jc5nHAZ
y07FJeGL7USBfxws6VLVEcjUJdVz7A6y/Pu4Evwt6/4W99aOwmT/NRGwck8YgtY7
3voPIxqfLfjJ9Yub+j7AnGADdpJ+Ubq/CiIkKGU5YW5ofClGLRQe/Fjf2buyRLZW
yilBFM2P2oBUZsO28gpFeeExSSM3R4d4/weRNT1sduS1lLkFY6253+PtaYEKhlZp
lDnRrqaaWQv9OgnNQ4pl1eBuuINoGyu9Y6htRAa7ewA1gQog7qbwKjYB/TVtxmaG
MXh+e0J2Du4zDqEpKzy6I8+9KcAWqeQNE2qbjwuemL+jATyotR0ZdERZQpCuZXEB
49DBTsN1/ZSfR6tpiRg9bertmjXQ+X1PfJ3Zif9zTkqmeituFe3zOTp8tOEZ6Vxf
LccHRuLPldA1y54H6XbwvK1rznIooGGkFFVNl/+08nXoyTf6q7f87CnfvdRTMX5M
6BFHtBRv1+sS5P0//wEQ
=2tex
-----END PGP SIGNATURE-----

--PW0Eas8rCkcu1VkF--