From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <cygwin-return-209330-listarch-cygwin=sourceware.org@cygwin.com>
Received: (qmail 24810 invoked by alias); 3 Aug 2017 19:44:38 -0000
Mailing-List: contact cygwin-help@cygwin.com; run by ezmlm
Precedence: bulk
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe@cygwin.com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin@cygwin.com>
List-Help: <mailto:cygwin-help@cygwin.com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner@cygwin.com
Mail-Followup-To: cygwin@cygwin.com
Received: (qmail 24077 invoked by uid 89); 3 Aug 2017 19:44:37 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=0.5 required=5.0 tests=AWL,BAYES_00,FOREIGN_BODY,GIT_PATCH_2,KAM_LAZY_DOMAIN_SECURITY,RCVD_IN_DNSWL_NONE,RCVD_IN_SORBS_SPAM autolearn=no version=3.3.2 spammy=auf, wurde, H*RU:sk:mrelaye, H*r:sk:mrelaye
X-HELO: mout.kundenserver.de
Received: from mout.kundenserver.de (HELO mout.kundenserver.de) (212.227.126.187) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 03 Aug 2017 19:44:34 +0000
Received: from [192.168.178.24] ([95.91.246.195]) by mrelayeu.kundenserver.de (mreue001 [212.227.15.167]) with ESMTPSA (Nemesis) id 0Lup8L-1dUJYM3nTl-0102yq for <cygwin@cygwin.com>; Thu, 03 Aug 2017 21:44:32 +0200
Subject: Re: Unicode width data inconsistent/outdated
To: cygwin@cygwin.com
References: <f3c1b415-7a26-8bbe-a67f-5619d356f058@towo.net> <20170726080859.GA24312@calimero.vinschen.de> <5d3cb047-49f8-26a6-d816-387a71486e99@cygwin.com> <20170726095016.GA25666@calimero.vinschen.de> <289bd98b-e644-888d-07f8-8965b6538373@towo.net> <20170728195826.GI24013@calimero.vinschen.de>
From: Thomas Wolff <towo@towo.net>
Message-ID: <1244bd24-bb27-d185-1f24-61beae02c2cd@towo.net>
Date: Thu, 03 Aug 2017 19:44:00 -0000
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <20170728195826.GI24013@calimero.vinschen.de>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
X-UI-Out-Filterresults: notjunk:1;V01:K0:i1NGN1T7Rbo=:D406sZuHyo9Q4cN0gyjG6t Bwxgiwv9CHj26XA8h9WmiBCvd91i2x8i6Qnpfk25Ay+CkNcp/G7SG5gUGWQx7v908OtuwbZ1w 4W7DNXaMKwAOTl6swKsj3LxO9vGuUrpZkpHmo8XR4DGlZJr2TXQDnQMZuHHG+Nxi1j9m7ja9E CljtXq2vlQKOgBBeKs4SbH3sOPUD+LzLhdAa8xFKMH6vwkIrFKA7UiVP1SAPVCcYrBksRh44Z XQuizVEsT1eRNpTqhgAhBfYngw4PEGpdUoHrzsdOwrLlTX3cYsRnnPAEERzco3yRxdudBL/zC H+v4vsZLj2h8jcN5MEegu5acvZo2i+aH2GeudN0ktr8A0OTME0AglriX4vn+LijSSWYQdwjaT 0r061jTKjFjde6XiBuwIzUWDFg2hAWxBVazaVrKIcUQIxS71tKIIRJ3HN2eJRNbRaRKXKS0YG GCfhbBZ4rzsxR1pusJhkHCO3ca1XcrZ8kpyWHsUA1ra0in29D+gQxxdfg+sEFRMcRUf43SMjm bRUBt6lzTRoRhskXQhTSUzQUPj7OeogKAtu6OnL8pfjVFiSePYTypbSj0j5IMVeGo3D1wqulh ZF3E6Vwp3lRnHnl6d+Pduslcfw0HzqFDb/HhjRRBaeOtMKmlg4VBvkb0KbSf8acHHk4UMPt3A MR/DtKLb2Mw3kdPNXNO3TOgwsu0be7oXnp60npibCvI+6E3Bw4CGkEw+rLkUO5d800VX5cQGM kCPof1TwLnJnjzkvX/gJUjP/M4VxhPxTqNDvLJlj790odQK6Z399EZyDoCo=
X-IsSubscribed: yes
X-SW-Source: 2017-08/txt/msg00047.txt.bz2

On 28.07.2017 at 21:58, Corinna Vinschen wrote:
> On Jul 26 23:43, Thomas Wolff wrote:
>> On 26.07.2017 at 11:50, Corinna Vinschen wrote:
>>> On Jul 26 03:16, Yaakov Selkowitz wrote:
>>>> On 2017-07-26 03:08, Corinna Vinschen wrote:
>>>>> On Jul 26 08:49, Thomas Wolff wrote:
>>>>>> It would be good to keep wcwidth/wcswidth in sync with the installed
>>>>>> Unicode data version (package unicode-ucd).
>>>>>> Currently it seems to be hard-coded (in newlib/libc/string/wcwidth.c);
>>>>>> it refers to Unicode 5.0 while installed Unicode data suggest 9.0 would
>>>>>> be used.
>>>>>> I can provide some scripts to generate the respective tables if desired.
>>>>>> Thomas
>>>>> If you can update the newlib files this way and send matching patches
>>>>> to the newlib list, this would be highly appreciated.
>>>> Thomas, I just updated unicode-ucd to 10.0 for this purpose.
>> Thanks.
>>> Oh, and, btw, the comment in wcwidth.c isn't quite correct.  The
>>> cwstate in newlib is on Unicode 5.2, see newlib/libc/ctype/towupper.c.
>> Oh, a number of other embedded tables. To make the tow* and isw* functions
>> more easily adaptable to Unicode updates, there will be some revisions to do
>> here. And the to* and is* ones (without 'w') even refer to locales in a way
>> I do not understand. Maybe I'll restrict my effort to wcwidth first...
> The to* and is* ones (without 'w') don't matter at all and you don't
> have to touch them.
>
> The Unicode stuff only affects the tow and isw functions.
>
> As for how to fetch the data, you may want to have a look into
> newlib/libc/ctype/utf8alpha.h and newlib/libc/ctype/utf8print.h.  The
> header comments contain the awk scripts used to collect the data.
But there are no instructions on how to adapt the embedded conditional 
statements that refer to those data...
My approach would be to base the functions on a common table of character 
categories instead.
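
A minimal sketch of what I mean by a common category table, under loud 
assumptions: the names (cat_range, cat_table, category, sketch_isw*) are 
hypothetical and not existing newlib code, the table is a tiny hand-picked 
excerpt rather than data generated from UnicodeData.txt, and CAT_OTHER 
merely stands for "not covered by this excerpt".

```c
#include <stddef.h>
#include <stdint.h>

/* A few Unicode general categories; CAT_OTHER = not in this excerpt. */
enum cat { CAT_OTHER, CAT_LU, CAT_LL, CAT_LT, CAT_ND, CAT_SC };

struct cat_range {
  uint32_t first, last;   /* inclusive code point range */
  enum cat cat;
};

/* Sorted, non-overlapping ranges.  A real table would be generated
   by a script from the installed Unicode data. */
static const struct cat_range cat_table[] = {
  { 0x0024, 0x0024, CAT_SC },   /* DOLLAR SIGN */
  { 0x0030, 0x0039, CAT_ND },   /* DIGIT ZERO..DIGIT NINE */
  { 0x0041, 0x005A, CAT_LU },   /* LATIN CAPITAL LETTER A..Z */
  { 0x0061, 0x007A, CAT_LL },   /* LATIN SMALL LETTER A..Z */
  { 0x01CB, 0x01CB, CAT_LT },   /* titlecase digraph Nj */
  { 0x20AC, 0x20AC, CAT_SC },   /* EURO SIGN */
};

/* Binary search for the range containing c. */
static enum cat category(uint32_t c)
{
  size_t lo = 0, hi = sizeof cat_table / sizeof cat_table[0];
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (c < cat_table[mid].first)
      hi = mid;
    else if (c > cat_table[mid].last)
      lo = mid + 1;
    else
      return cat_table[mid].cat;
  }
  return CAT_OTHER;
}

/* Classification functions become one-line category tests. */
static int sketch_iswupper(uint32_t c) { return category(c) == CAT_LU; }
static int sketch_iswlower(uint32_t c) { return category(c) == CAT_LL; }
static int sketch_iswdigit(uint32_t c) { return category(c) == CAT_ND; }
```

With this layout, a Unicode update would only regenerate cat_table; the 
functions themselves would not need to be touched.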

> All other isw* files like iswblank.c contain comments explaining
> what Unicode character categories are covered.
I'm comparing results based on Unicode 5.2 data. There will be some 
deviations and maybe some things to discuss.
For example, I wonder why the current implementation classifies currency 
symbols as punctuation (this behavior can easily be reproduced).

Also, there are 3 other issues:


Issue 1 is about handling non-BMP characters by wcwidth.
This has been discussed before.

On Mon, 31 Jan 2011 09:58:19 -0700 
(https://sourceware.org/ml/cygwin/2011-01/msg00453.html)
Eric Blake wrote:
> POSIX requires that 1 wchar_t corresponds to 1 character; so any use 
> of surrogates to get the full benefit of UTF-16 falls outside the 
> bounds of POSIX.
> At which point, the POSIX definition of those functions no longer 
> apply, and we can (try) to make the various wc* functions try to 
> behave as smartly as possible (as is the case with Cygwin); where 
> those smarts are only needed when you use surrogate pairs.

On Wed, 2 Feb 2011 12:29:03 +0100 
(https://sourceware.org/ml/cygwin/2011-02/msg00037.html)
Bruno Haible wrote:
> Code that uses <wctype.h> and wcwidth() is written precisely according 
> to POSIX.
> The problem is that this code cannot work correctly when wchar_t[] is 
> in UTF-16 encoding.
> There simply is no way to define these functions in a reasonable way 
> for surrogates.
I don't agree with this, see below.

On Wed, 2 Feb 2011 13:21:02 +0100 
(https://sourceware.org/ml/cygwin/2011-02/msg00040.html)
Corinna Vinschen wrote:
> And, please note the wording in SUSv4, for instance in 
> http://calimero.vinschen.de/susv4/functions/iswalpha.html
(link not found)
>   The wc argument is a wint_t, the value of which the application shall
>                        ^^^^^^                         ^^^^^^^^^^^
>   ensure is a wide-character code corresponding to a valid character 
> in the current locale, or equal to the value of the macro WEOF. If the 
> argument has any other value, the behavior is undefined.
> I don't see any words in that which would disallow to convert UTF-16 
> wchar_t surrogates to a wint_t UTF-32 value before calling one of the 
> wctype functions.  Just like you have to be careful not to call the 
> ctype functions with a signed char.

While wcswidth works already (using internal __wcwidth), and the isw* 
and tow* functions work as well because they use wint_t, wcwidth is the 
only function (inconsistently insisting on wchar_t) that does not work.
But note https://linux.die.net/man/3/wcwidth which says
> Note that glibc before 2.2.5 used the prototype
> int wcwidth(wint_t c);
Why not revert to wcwidth(wint_t)?
I think for Cygwin it is the only solution that makes wcwidth work for 
non-BMP characters and is also compatible (unlike some proposals 
discussed later in the quoted thread).
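
To illustrate the conversion Corinna describes (UTF-16 surrogates to a 
UTF-32 value before calling a wint_t function), here is a small sketch. 
The helper names are hypothetical and not part of newlib or Cygwin; 
uint16_t stands in for a 16-bit wchar_t and uint32_t for wint_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not existing library functions. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid UTF-16 surrogate pair into one UTF-32 code point. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
  assert(is_high_surrogate(hi) && is_low_surrogate(lo));
  return 0x10000u + (((uint32_t)hi - 0xD800u) << 10)
                  + ((uint32_t)lo - 0xDC00u);
}

/* Decode the next code point from a NUL-terminated UTF-16 string.
   *used receives the number of 16-bit units consumed (1 or 2).
   An unpaired surrogate is returned unchanged as a single unit. */
static uint32_t next_code_point(const uint16_t *s, int *used)
{
  if (is_high_surrogate(s[0]) && is_low_surrogate(s[1])) {
    *used = 2;
    return combine_surrogates(s[0], s[1]);
  }
  *used = 1;
  return s[0];
}
```

The value returned by next_code_point is what a caller would pass to a 
wcwidth taking wint_t, in the same way it can already be passed to the 
isw* and tow* functions.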


Issue 2 is the handling of titlecase characters (e.g. "Nj" as one 
Unicode character U+01CB). The current implementation considers them to 
be both upper and lower (iswupper: return towlower (c) != c); I'd rather 
consider them as neither upper nor lower (iswalpha (c) && towupper (c) 
== c).
https://linux.die.net/man/3/iswupper allows both interpretations:
> The wide-character class "upper" contains *at least* those characters 
> wc which are equal to towupper(wc) and different from towlower(wc).
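
The two interpretations can be compared on the digraph triple U+01CA, 
U+01CB, U+01CC. This is only a sketch: toy_toupper and toy_tolower are 
hypothetical stand-ins for towupper and towlower that know just these 
three characters, and the "strict" pair follows the minimal class from 
the man page wording above, which is close to but not the same 
expression as the one I suggested.

```c
#include <stdint.h>

/* Toy case mappings for U+01CA (LATIN CAPITAL LETTER NJ),
   U+01CB (LATIN CAPITAL LETTER N WITH SMALL LETTER J, titlecase)
   and U+01CC (LATIN SMALL LETTER NJ).  Everything else maps to itself. */
static uint32_t toy_toupper(uint32_t c)
{
  return (c == 0x01CB || c == 0x01CC) ? 0x01CA : c;
}

static uint32_t toy_tolower(uint32_t c)
{
  return (c == 0x01CA || c == 0x01CB) ? 0x01CC : c;
}

/* Current style: any character with a distinct lowercase form is
   upper, any character with a distinct uppercase form is lower. */
static int upper_by_mapping(uint32_t c) { return toy_tolower(c) != c; }
static int lower_by_mapping(uint32_t c) { return toy_toupper(c) != c; }

/* Strict style: upper only if the character is its own uppercase
   and differs from its lowercase; lower symmetrically. */
static int upper_strict(uint32_t c)
{
  return toy_toupper(c) == c && toy_tolower(c) != c;
}

static int lower_strict(uint32_t c)
{
  return toy_tolower(c) == c && toy_toupper(c) != c;
}
```

With the mapping style, U+01CB is reported as both upper and lower; with 
the strict style it is neither, while U+01CA stays upper and U+01CC 
stays lower.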


Issue 3 is the special conversion jp2uc, which seems to be half-baked; 
there is no such handling for Chinese or Korean.
If by definition the arguments of isw* functions are not Unicode but 
wide characters according to the current locale (not sure where that is 
defined), they must be transformed for all locales (CJK and also 8-bit 
ones);
also, in towupper and towlower the result must be transformed back to the 
current locale encoding (this is currently missing).


Thomas

