From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: Unicode width data inconsistent/outdated
To: cygwin@cygwin.com
References: <20170726080859.GA24312@calimero.vinschen.de>
 <5d3cb047-49f8-26a6-d816-387a71486e99@cygwin.com>
 <20170726095016.GA25666@calimero.vinschen.de>
 <289bd98b-e644-888d-07f8-8965b6538373@towo.net>
 <20170728195826.GI24013@calimero.vinschen.de>
 <1244bd24-bb27-d185-1f24-61beae02c2cd@towo.net>
 <20170804170156.GL25551@calimero.vinschen.de>
 <30486790-c59d-9a78-6000-b3c20fb86d9d@towo.net>
 <20170807092820.GQ25551@calimero.vinschen.de>
From: Thomas Wolff
Message-ID: <3eb4ee2f-f62c-cb19-3e4b-10cc57852ba9@towo.net>
Date: Mon, 07 Aug 2017 19:27:00 -0000
In-Reply-To: <20170807092820.GQ25551@calimero.vinschen.de>

On 07.08.2017 at 11:28, Corinna Vinschen wrote:
> On Aug 5 21:06, Thomas Wolff wrote:
>> On 04.08.2017 at 19:01, Corinna Vinschen wrote:
>>> On Aug 3 21:44, Thomas Wolff wrote:
>>>> My attempt would be to base the functions on a common table of
>>>> character categories instead.
>>> ...Keep in mind that the table is not loaded into memory on demand, as on
>>> Linux. Rather it will be part of the Cygwin DLL, and worse in case
>>> newlib, any target using the wctype functions.
>> Maybe we could change that (load on demand, or put them in a shared library
>> perhaps), but...
> That won't work for embedded targets, especially small ones.
>
> If you want to go that route, you would have to extend struct __locale_t
> or lc_ctype_T (in newlib/libc/locale/setlocale.h) to contain pointers to
> conversion tables (Cygwin-only), and the __set_lc_ctype_from_win function
> or a new function inside Cygwin (but called from __ctype_load_locale)
> could load the tables.
>
> Then you could create new iswXXX, towXXX, and wcwidth functions inside
> Cygwin using these tables, rather than relying on the newlib code.
>
> Alternatively, if RTEMS is interested as well, we may strive for a
> newlib solution which is opt-in. Loading tables (or even big tables at
> all) isn't a good solution for very small targets.
>
>>> The idea here is that the tables take less space than a full-fledged
>>> category table. The tables in utf8print.h and utf8alpha.h and the code
>>> in iswalpha and iswprint combined are 10K, code and data of the
>>> tolower/toupper functions are 7K, wcwidth 3K, so a total of 20K,
>>> covering Unicode 5.2 with 107K codepoints.
>>>
>>> A category table would have to contain the category bits for the entire
>>> Unicode codepoint range. The number of potential bits is > 8 as far as I
>>> know so it needs 2 bytes per char, but let's make that 1 byte for now.
>>> For Unicode 5.2 only the table would be at least 107K, and that would
>>> only cover the iswXXX functions.
>> I have a working version now, and it uses much less as the category table is
>> range-based.
>> Another table is needed for case conversion. Size estimates are as follows
>> (based on Unicode 5.2 for a fair comparison, going up a little bit for 10.0
>> of course):
>>
>> Categories: 2313 entries (10.0: 2715)
>>     each entry needs 9 bytes, total 20817 bytes
>>     I don't know whether that expands by some word-alignment.
>>     I could pack entries to 7 bytes, or even 6 bytes if that helps (total 16191
>>     or 13878).
>>
>> Case conversion: 2062 entries (10.0: 2621)
>>     each entry needs 12 bytes, total 24744
>>     packed 8 bytes, total 16496
>>
>> The Categories table could be boiled down to 1223 entries (penalty: double
>> runtime for iswupper and iswlower)
>> The Case conversion table could be transformed to a compact form
>> Case conversion compact: 1201 entries
>>     each entry needs 16 bytes, total 19216
>>     packed 12 or 11 (or even 10), total 14412 (or 12010)
>> So I think the increase is acceptable for the benefit of simple and
>> automatic generation
> So we're at 40K+ plus code then.
No, if I implement the packed versions, it's 19.3K, so even smaller than
what we have currently.
> newlib: embedded targets, looking for small sized solutions. Simple
> and automatic generation is not the main goal.
>
>> and also more efficient processing by some of the
>> functions. Also they would apply to more functions, e.g. iswdigit which
>> would confirm all Unicode digits, not just the ASCII ones.
> Don't do that. There's a collision with C99 if you define other
> characters than ASCII digits to return nonzero from iswdigit. ...
OK.
>>>> Issue 3 is the special conversion jp2uc which seems to be half-bred; there
>>>> is no such handling for Chinese or Korean.
>>> This shouldn't matter to you, just keep it in place. It's a historical,
>>> low footprint conversion for Japanese characters without pulling in the
>>> Unicode stuff. Not used on Cygwin so just ignore.
>> I had noticed meanwhile that this is not active in Cygwin, but it's broken
>> anyway for multiple reasons:
>> * platforms for which wchar_t is not Unicode should be explicitly listed
>> * if used, the transformation needs to be applied to all non-Unicode
>>   locales (also Chinese, Korean, and even 8-bit locales such as *.CP1252)
>> * for towupper and towlower, the result must be back-transformed into the
>>   respective locale encoding
>> * particularly the locale-specific _l functions inconsistently do not use
>>   the transformation but have this note:
> No, no, no. The functionality is restricted to certain use-cases and
> always was. It was a paid-for customer extension back in the day and it
> was *sufficient* for the use-cases. It's not clear how many newlib
> users are still using it, but it's not a good idea to remove it without
> checking first. That means, ask on the newlib mailing list how many are
> using the historical jp2uc code, and if we don't get a reply within,
> say, a month, we can probably nuke it.
OK, let's make such a request after the holiday season.
But even if this is to persist as a special solution, it's still broken and
should be fixed. Can we then replace the current table with calls to the
iconvdata functions? In that case, as I said, the back-conversion would be
available too, and I could fix that and add the missing handling in the _l
functions, for a consistent solution.

Thomas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
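
To make the range-based category table from the size discussion above more concrete, here is a minimal sketch of what such a lookup could look like. It is not the code from my working version: the names cat_range, cat_table, lookup_cat and table_bytes, the entry layout and the four category codes are all hypothetical, and the table holds only a handful of hand-picked ranges instead of data generated from the Unicode database. The helper table_bytes only redoes the entries-times-bytes arithmetic quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical category codes; a generated table would carry the full
   set of Unicode general categories. */
enum cat { CAT_OTHER = 0, CAT_LU, CAT_LL, CAT_ND };

/* Hypothetical entry layout: one entry covers the inclusive code point
   range [first, first + len]. */
struct cat_range {
    uint32_t first;  /* first code point of the range */
    uint16_t len;    /* number of further code points in the range */
    uint8_t  cat;    /* enum cat value shared by the whole range */
};

/* Illustrative sample only, sorted by first code point. */
static const struct cat_range cat_table[] = {
    { 0x0030,  9, CAT_ND },  /* '0'..'9' */
    { 0x0041, 25, CAT_LU },  /* 'A'..'Z' */
    { 0x0061, 25, CAT_LL },  /* 'a'..'z' */
    { 0x0391, 16, CAT_LU },  /* Greek capital Alpha..Rho */
    { 0x03B1, 24, CAT_LL },  /* Greek small alpha..omega */
};

/* Binary search over the sorted ranges; code points not covered by any
   range fall back to CAT_OTHER. */
static enum cat lookup_cat(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = sizeof cat_table / sizeof cat_table[0];

    assert(cp <= 0x10FFFF);  /* outside the Unicode code space */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct cat_range *r = &cat_table[mid];

        if (cp < r->first)
            hi = mid;
        else if (cp > r->first + r->len)
            lo = mid + 1;
        else
            return (enum cat) r->cat;
    }
    return CAT_OTHER;
}

/* Size estimate for a table of packed entries. */
static size_t table_bytes(size_t entries, size_t bytes_per_entry)
{
    return entries * bytes_per_entry;
}
```

With this sample, lookup_cat(0x03C9) gives CAT_LL, while lookup_cat(0x03A2) falls into the gap between the two Greek ranges and gives CAT_OTHER. One combination that reproduces the 19.3K figure is table_bytes(1223, 6) + table_bytes(1201, 10) = 19348. Note that a plain struct like the one above may be padded by the compiler, so reaching the packed sizes would probably need a byte-array encoding or a packing attribute.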