From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <cygwin-return-209351-listarch-cygwin=sourceware.org@cygwin.com>
Received: (qmail 13952 invoked by alias); 7 Aug 2017 09:28:30 -0000
Mailing-List: contact cygwin-help@cygwin.com; run by ezmlm
Precedence: bulk
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe@cygwin.com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin@cygwin.com>
List-Help: <mailto:cygwin-help@cygwin.com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner@cygwin.com
Mail-Followup-To: cygwin@cygwin.com
Received: (qmail 13916 invoked by uid 89); 7 Aug 2017 09:28:28 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-99.4 required=5.0 tests=AWL,BAYES_50,GOOD_FROM_CORINNA_CYGWIN,KAM_LAZY_DOMAIN_SECURITY,RCVD_IN_DNSWL_LOW,SPF_HELO_PASS autolearn=ham version=3.3.2 spammy=estimates, 12010, Categories, 1201
X-HELO: drew.franken.de
Received: from mail-n.franken.de (HELO drew.franken.de) (193.175.24.27) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 07 Aug 2017 09:28:25 +0000
Received: from aqua.hirmke.de (aquarius.franken.de [193.175.24.89])	(Authenticated sender: aquarius)	by mail-n.franken.de (Postfix) with ESMTPSA id 65339721E281A	for <cygwin@cygwin.com>; Mon,  7 Aug 2017 11:28:21 +0200 (CEST)
Received: from calimero.vinschen.de (calimero.vinschen.de [192.168.129.6])	by aqua.hirmke.de (Postfix) with ESMTP id 7BCA65E041E	for <cygwin@cygwin.com>; Mon,  7 Aug 2017 11:28:20 +0200 (CEST)
Received: by calimero.vinschen.de (Postfix, from userid 500)	id 6142AA8056F; Mon,  7 Aug 2017 11:28:20 +0200 (CEST)
Date: Mon, 07 Aug 2017 09:28:00 -0000
From: Corinna Vinschen <corinna-cygwin@cygwin.com>
To: cygwin@cygwin.com
Subject: Re: Unicode width data inconsistent/outdated
Message-ID: <20170807092820.GQ25551@calimero.vinschen.de>
Reply-To: cygwin@cygwin.com
Mail-Followup-To: cygwin@cygwin.com
References: <f3c1b415-7a26-8bbe-a67f-5619d356f058@towo.net> <20170726080859.GA24312@calimero.vinschen.de> <5d3cb047-49f8-26a6-d816-387a71486e99@cygwin.com> <20170726095016.GA25666@calimero.vinschen.de> <289bd98b-e644-888d-07f8-8965b6538373@towo.net> <20170728195826.GI24013@calimero.vinschen.de> <1244bd24-bb27-d185-1f24-61beae02c2cd@towo.net> <20170804170156.GL25551@calimero.vinschen.de> <30486790-c59d-9a78-6000-b3c20fb86d9d@towo.net>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;	protocol="application/pgp-signature"; boundary="UlxN1C6awaFNesUv"
Content-Disposition: inline
In-Reply-To: <30486790-c59d-9a78-6000-b3c20fb86d9d@towo.net>
User-Agent: Mutt/1.8.3 (2017-05-23)
X-SW-Source: 2017-08/txt/msg00068.txt.bz2

--UlxN1C6awaFNesUv
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Content-length: 6525

On Aug  5 21:06, Thomas Wolff wrote:
> Am 04.08.2017 um 19:01 schrieb Corinna Vinschen:
> > On Aug  3 21:44, Thomas Wolff wrote:
> > > My attempt would be to base the functions on a common table of
> > > character categories instead.
> > Keep in mind that the table is not loaded into memory on demand, as on
> > Linux.  Rather it will be part of the Cygwin DLL and, worse, in the case
> > of newlib, part of any target using the wctype functions.
> Maybe we could change that (load on demand, or put them in a shared
> library perhaps), but...

That won't work for embedded targets, especially small ones.

If you want to go that route, you would have to extend struct __locale_t
or lc_ctype_T (in newlib/libc/locale/setlocale.h) to contain pointers to
conversion tables (Cygwin-only), and the __set_lc_ctype_from_win function
or a new function inside Cygwin (but called from __ctype_load_locale)
could load the tables.

Then you could create new iswXXX, towXXX, and wcwidth functions inside
Cygwin using these tables, rather than relying on the newlib code.

Alternatively, if RTEMS is interested as well, we may strive for a
newlib solution which is opt-in.  Loading tables (or even big tables at
all) isn't a good solution for very small targets.
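
To make the opt-in idea concrete, here is a minimal sketch.  All names
(demo_locale, demo_load_tables, demo_iswalpha) are hypothetical, and this
does not reflect the actual layout of struct __locale_t or lc_ctype_T.
It only shows a locale object carrying an optional table pointer, with an
ASCII-only fallback when no table has been loaded:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table type: sorted, inclusive code point ranges. */
struct demo_range { uint32_t first, last; };

/* Hypothetical stand-in for a locale object that carries an optional
   table pointer; this is NOT newlib's struct __locale_t. */
struct demo_locale {
    const struct demo_range *alpha;  /* NULL if no table was loaded */
    size_t n_alpha;
};

/* Excerpt only: Latin-1 letters beyond ASCII, not generated data. */
static const struct demo_range demo_alpha[] = {
    { 0x00C0, 0x00D6 }, { 0x00D8, 0x00F6 }, { 0x00F8, 0x00FF },
};

/* Hypothetical loader, playing the role of a hook that runs while the
   locale is being set up. */
static void demo_load_tables(struct demo_locale *loc)
{
    loc->alpha = demo_alpha;
    loc->n_alpha = sizeof demo_alpha / sizeof demo_alpha[0];
}

static int demo_iswalpha(const struct demo_locale *loc, uint32_t c)
{
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'))
        return 1;                    /* always available */
    if (loc->alpha == NULL)
        return 0;                    /* small target: no table present */
    for (size_t i = 0; i < loc->n_alpha; i++)
        if (c >= loc->alpha[i].first && c <= loc->alpha[i].last)
            return 1;
    return 0;
}
```

A target that never calls demo_load_tables pays only for the pointer and
the ASCII path; U+00E9 is reported as alphabetic only after the table has
been attached.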

> > The idea here is that the tables take less space than a full-fledged
> > category table.  The tables in utf8print.h and utf8alpha.h and the code
> > in iswalpha and iswprint combined are 10K, code and data of the
> > tolower/toupper functions are 7K, wcwidth 3K, so a total of 20K,
> > covering Unicode 5.2 with 107K codepoints.
> >
> > A category table would have to contain the category bits for the entire
> > Unicode codepoint range.  The number of potential bits is > 8 as far as
> > I know so it needs 2 bytes per char, but let's make that 1 byte for now.
> > For Unicode 5.2 only the table would be at least 107K, and that would
> > only cover the iswXXX functions.
> I have a working version now, and it uses much less as the category table
> is range-based.
> Another table is needed for case conversion. Size estimates are as follows
> (based on Unicode 5.2 for a fair comparison, going up a little bit for
> 10.0 of course):
>
> Categories: 2313 entries (10.0: 2715)
> each entry needs 9 bytes, total 20817 bytes
> I don't know whether that expands by some word-alignment.
> I could pack entries to 7 bytes, or even 6 bytes if that helps (total
> 16191 or 13878).
>
> Case conversion: 2062 entries (10.0: 2621)
> each entry needs 12 bytes, total 24744
> packed 8 bytes, total 16496
>
> The Categories table could be boiled down to 1223 entries (penalty: double
> runtime for iswupper and iswlower)
> The Case conversion table could be transformed to a compact form
> Case conversion compact: 1201 entries
> each entry needs 16 bytes, total 19216
> packed 12 or 11 (or even 10), total 14412 (or 12010)
> So I think the increase is acceptable for the benefit of simple and
> automatic generation

So we're at 40K+, plus code, then.

newlib: embedded targets, looking for small-sized solutions.  Simple
and automatic generation is not the main goal.
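
For reference, a range-based category table of the kind discussed above
could look roughly like the sketch below.  The entry layout, the category
codes and the five-entry excerpt are assumptions for illustration, not
the proposed implementation or generated data.  Note that on common ABIs
sizeof (struct cat_range) for this layout is likely 12 rather than 9
because of alignment padding, which is the expansion asked about above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical category codes; a real table would cover all Unicode
   general categories. */
enum category { CAT_NONE = 0, CAT_LU, CAT_LL, CAT_LT, CAT_ND };

/* One entry per run of consecutive code points sharing a category.
   This layout is an assumption for illustration only. */
struct cat_range {
    uint32_t first;  /* first code point of the run */
    uint32_t last;   /* last code point of the run, inclusive */
    uint8_t  cat;    /* category of every code point in the run */
};

/* Tiny hand-written excerpt, sorted by 'first'. */
static const struct cat_range cat_table[] = {
    { 0x0030, 0x0039, CAT_ND },  /* 0-9 */
    { 0x0041, 0x005A, CAT_LU },  /* A-Z */
    { 0x0061, 0x007A, CAT_LL },  /* a-z */
    { 0x01CB, 0x01CB, CAT_LT },  /* titlecase digraph Nj */
    { 0x0660, 0x0669, CAT_ND },  /* Arabic-Indic digits */
};

/* Binary search over the sorted ranges; code points that fall into no
   range get CAT_NONE. */
static enum category lookup_category(uint32_t c)
{
    size_t lo = 0, hi = sizeof cat_table / sizeof cat_table[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (c < cat_table[mid].first)
            hi = mid;
        else if (c > cat_table[mid].last)
            lo = mid + 1;
        else
            return (enum category) cat_table[mid].cat;
    }
    return CAT_NONE;
}
```

With such a table, each iswXXX function reduces to one lookup plus a
comparison against the category code, at the cost of a logarithmic search
per call.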

> and also more efficient processing by some of the
> functions. Also they would apply to more functions, e.g. iswdigit which
> would confirm all Unicode digits, not just the ASCII ones.

Don't do that.  There's a collision with C99 if you define other
characters than ASCII digits to return nonzero from iswdigit.  Comment
from inside Glibc:

% The "digit" class must only contain the BASIC LATIN digits, says ISO C 99
% (sections 7.25.2.1.5 and 5.2.1).
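
In other words, whatever table ends up behind the other iswXXX functions,
the digit test has to stay equivalent to something like this (the helper
name is hypothetical):

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper: nonzero only for the BASIC LATIN digits
   U+0030..U+0039, regardless of any Unicode category data. */
static int demo_iswdigit(wint_t c)
{
    return c >= 0x0030 && c <= 0x0039;
}
```

So U+0037 ('7') is a digit, while U+0665 (ARABIC-INDIC DIGIT FIVE) is not,
even though its Unicode general category is Nd.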

> > > > int wcwidth(wint_t c);
> > > Why not revert to wcwidth(wint_t)?
> > > I think for cygwin it is the only solution that makes wcwidth work for
> > > non-BMP characters and is also compatible (unlike some proposals
> > > discussed later in the quoted thread).
> > We can do this, but it may result in complaints from the other
> > newlib consumers.  If in doubt, use #ifdef __CYGWIN__
> Which other platforms do actually use newlib?

Lots of embedded and bare-metal targets.
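
A sketch of what the #ifdef could look like, using hypothetical names
(demo_wc_t, demo_wcwidth) so as not to clash with the real declaration.
The two wide ranges are a hand-picked excerpt, and treating everything
else as narrow is a simplification:

```c
#include <assert.h>
#include <wchar.h>

#ifdef __CYGWIN__
typedef wint_t demo_wc_t;    /* wide enough for code points above U+FFFF */
#else
typedef wchar_t demo_wc_t;   /* POSIX signature for all other targets */
#endif

/* Hypothetical width function; real width data would be generated. */
static int demo_wcwidth(demo_wc_t c)
{
    unsigned long u = (unsigned long) c;

    if (u >= 0x4E00 && u <= 0x9FFF)
        return 2;            /* CJK Unified Ideographs */
    if (u >= 0x20000 && u <= 0x2A6DF)
        return 2;            /* CJK Extension B, outside the BMP */
    return 1;                /* sketch: everything else is narrow */
}
```

Only the Cygwin branch changes the argument type, so the other newlib
consumers keep the wchar_t signature they have today.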

> > > Issue 2 is the handling of titlecase characters (e.g. "Nj" as one
> > > Unicode character U+01CB). The current implementation considers them
> > > to be both upper and lower (iswupper: return towlower (c) != c); I'd
> > > rather consider them as neither upper nor lower
> > > (iswalpha (c) && towupper (c) == c).
> > > https://linux.die.net/man/3/iswupper allows both interpretations:
> > > > The wide-character class "upper" contains *at least* those
> > > > characters wc which are equal to towupper(wc) and different from
> > > > towlower(wc).
> > SUSv4 says "The iswupper() [...] functions shall test whether wc is a
> > wide-character code representing a character of class upper." Whatever
> > does that correctly with a low footprint is fine.
> The question here is how "character of class upper" is defined, and how to
> interpret pre-Unicode assumptions in a Unicode context.

In theory, do it as glibc does and you're fine.
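
To spell out the two interpretations for U+01CB, here is a small sketch
with hand-written mappings for the NJ digraph only.  All demo_ names are
hypothetical stand-ins for the real functions, and the "lower" variants
are assumed by symmetry with the "upper" expressions quoted above:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for towupper/towlower covering only
   U+01CA (NJ), U+01CB (Nj, titlecase) and U+01CC (nj). */
static uint32_t demo_toupper(uint32_t c)
{
    return (c == 0x01CB || c == 0x01CC) ? 0x01CA : c;
}

static uint32_t demo_tolower(uint32_t c)
{
    return (c == 0x01CA || c == 0x01CB) ? 0x01CC : c;
}

/* In this sketch only the three digraph code points are letters. */
static int demo_isalpha(uint32_t c)
{
    return c >= 0x01CA && c <= 0x01CC;
}

/* Interpretation described as current: a character is upper if
   lowering changes it (and, by assumed symmetry, lower if raising
   changes it). */
static int upper_current(uint32_t c) { return demo_tolower(c) != c; }
static int lower_current(uint32_t c) { return demo_toupper(c) != c; }

/* Proposed interpretation: upper only if the character is alphabetic
   and already its own uppercase (lower assumed symmetric). */
static int upper_proposed(uint32_t c)
{
    return demo_isalpha(c) && demo_toupper(c) == c;
}

static int lower_proposed(uint32_t c)
{
    return demo_isalpha(c) && demo_tolower(c) == c;
}
```

With these definitions U+01CB is both upper and lower under the current
rule and neither under the proposed one, while U+01CA and U+01CC are
classified the same way by both.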

> > > Issue 3 is the special conversion jp2uc which seems to be half-baked;
> > > there is no such handling for Chinese or Korean.
> > This shouldn't matter to you, just keep it in place.  It's a historical,
> > low footprint conversion for Japanese characters without pulling in the
> > Unicode stuff.  Not used on Cygwin so just ignore.
> I had noticed meanwhile that this is not active in Cygwin, but it's broken
> anyway for multiple reasons:
>    * platforms for which wchar_t is not Unicode should be explicitly
>      listed
>    * if used, the transformation needs to be applied to all non-Unicode
>      locales (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
>    * for towupper and towlower, the result must be back-transformed into
>      the respective locale encoding
>    * particularly the locale-specific _l functions inconsistently do not
>      use the transformation but have this note:

No, no, no.  The functionality is restricted to certain use-cases and
always was.  It was a paid-for customer extension back in the day and it
was *sufficient* for the use-cases.  It's not clear how many newlib
users are still using it, but it's not a good idea to remove it without
checking first.  That means, ask on the newlib mailing list how many are
using the historical jp2uc code, and if we don't get a reply within,
say, a month, we can probably nuke it.


Corinna

--=20
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat

--UlxN1C6awaFNesUv
Content-Type: application/pgp-signature; name="signature.asc"
Content-length: 819

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCAAGBQJZiDK0AAoJEPU2Bp2uRE+gJ08P/32J6hoKeNHAoy0U4KZgpktn
iLy997JT+Vg/V1m9PPQJkDYGAKAl8aJJM7iYhCiyMN8wTdABszrWHYzCHGMZZoaL
UNw7dSbF7oFUAoWL8tYbs4yAXCaAXUI8uCQlzs7fllpLmeL//PbHsb5Ma30xmq7L
BGwlPZhzz63E11ABsKKE06JGYDrr23N/mCPMsy7q+1dSgzyPAem4xhiaeSYAQEzD
PhHZMHVfwddSZm4ol+EJbwC4WYDYCMVnZoy1/7kzusZa5/W0yU76rAASKYk0aV+9
noPPdnxYMBrbG0YR4/K7HBhwN49/vcYMCoAaBYMYjLLxNn2yQbAlqjg1uu1mrOae
P80leZC0IOti1rpPCAy8gILHYntEbhJUff83HtvacNUpQ2hS7AeQbgjQqH49GQAe
wLyvjaAsg/QUJ94Zwvw+2vdFadhhoTDEo/XEGjutML5VhndHZ702XjLYfo5Ceu/n
8FMP/UmuPYTXkQFmgPvUFos0K7O3C64Alq11CCAmmoRf4RZBEBcVAuizzCs6xayl
HgVj5wBN7Oq+aWAp0ZO0uZJbYY8AczQPSkqu6lqg9AEGp7UtSpJOE19mt+g0xtLX
WbqFojgdSI8HmeCM82f11Thoay2Uh1kMj0f8LQ6EprchmkPT4plcIfSjrv7IHKYo
OZYjl9KtiCEe1udjLpJA
=boet
-----END PGP SIGNATURE-----

--UlxN1C6awaFNesUv--