From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <cygwin-return-209364-listarch-cygwin=sourceware.org@cygwin.com>
Received: (qmail 125093 invoked by alias); 8 Aug 2017 00:29:05 -0000
Mailing-List: contact cygwin-help@cygwin.com; run by ezmlm
Precedence: bulk
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe@cygwin.com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin@cygwin.com>
List-Help: <mailto:cygwin-help@cygwin.com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner@cygwin.com
Mail-Followup-To: cygwin@cygwin.com
Received: (qmail 125077 invoked by uid 89); 8 Aug 2017 00:29:04 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-4.9 required=5.0 tests=AWL,BAYES_00,GIT_PATCH_1,KAM_LAZY_DOMAIN_SECURITY,RCVD_IN_DNSWL_NONE,RCVD_IN_SORBS_SPAM autolearn=ham version=3.3.2 spammy=
X-HELO: mout.kundenserver.de
Received: from mout.kundenserver.de (HELO mout.kundenserver.de) (217.72.192.73) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 08 Aug 2017 00:28:59 +0000
Received: from [192.168.178.45] ([95.91.246.195]) by mrelayeu.kundenserver.de (mreue103 [212.227.15.183]) with ESMTPSA (Nemesis) id 0LkAig-1d4Aza17Uo-00c9y6 for <cygwin@cygwin.com>; Tue, 08 Aug 2017 02:28:56 +0200
Subject: Re: Unicode width data inconsistent/outdated
To: cygwin@cygwin.com
References: <f3c1b415-7a26-8bbe-a67f-5619d356f058@towo.net> <20170726080859.GA24312@calimero.vinschen.de> <5d3cb047-49f8-26a6-d816-387a71486e99@cygwin.com> <20170726095016.GA25666@calimero.vinschen.de> <289bd98b-e644-888d-07f8-8965b6538373@towo.net> <20170728195826.GI24013@calimero.vinschen.de> <1244bd24-bb27-d185-1f24-61beae02c2cd@towo.net> <20170804170156.GL25551@calimero.vinschen.de> <30486790-c59d-9a78-6000-b3c20fb86d9d@towo.net> <20170807092820.GQ25551@calimero.vinschen.de> <401b6d26-35cb-3026-afde-6bd5d09b2d71@SystematicSw.ab.ca> <9f7a8d16-6ebc-52ff-15ae-b1a52d23986b@towo.net> <0f8f1535-ed48-d170-7e57-c554bec23942@SystematicSw.ab.ca>
From: Thomas Wolff <towo@towo.net>
Message-ID: <4c342b2a-25e0-3fc4-a077-be2cc54d117c@towo.net>
Date: Tue, 08 Aug 2017 00:29:00 -0000
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
MIME-Version: 1.0
In-Reply-To: <0f8f1535-ed48-d170-7e57-c554bec23942@SystematicSw.ab.ca>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-UI-Out-Filterresults: notjunk:1;V01:K0:kt0kXfIsxEo=:4LpL4xuR4I+zPWwskEudx4 4FkKtRd3jinRtsqeG3pocl0ekfLfxZ9wCb7II1HAsuiHiIUOZYYpFvMKMVpO8qVypxE/lKg8/ S2kiZfBAFBjoyDtDwGrTf2YmYaXYFBG5YJXwOgb+YHTgw1QzB7fPOBO0B0khuJ4krN+JGJ/yl SCmCUdCoBGUbbRcnObZ+VXU9avTOojOYGaB/4f6JjNg1hWhcmDeCFMej1KocLZLxsWf9/H/h3 0uC6/SRVKs5AQ8AOAzoic+1i/EY7AXmtSFl0QvuTiLJLS99sJfZ+IB15fwlHOi211B/LT2Uqj jPrEMYTS1ifi+dH2dfaFQ3zkvwHJaZw+iRKXUOJJ+oiQWInP8vB2YT8Zb9rG/eqHg0t3OpduB gfArXQ3Q8Ir76OsgtAIf/bQdZOQUt+tlmA+Z3j8dWCY1f//x7SSpyVggrDPdFGSEm1JMJoqvr qP0yGIRTlbOLH3ThwS2gAubjJvQiBU1RemQt2WSboNtzk03zmJhTB35jqpKX7ZIU6LDwCEoDb 5idjxQoJxH7ppRwKlV/zEx7/Pl8y1eJj33Wq8xl72C7vvGsmbh6t1/wXRhNPgRCH1p/13h8ro gpEM8Zu04/orkBItg5IJlxWMOsHb0vVG5c+sLNrb/J9A5+sX34ZFSbjRE8P/hutItss6ue0eH TIk6Bkf4cP9UrHD4Yns3wLCBPtNYIpnXpE0H4VoZXuNKV+DXzHs4aOCEBSUNgaHf25w8zdHlI /blk9OOAydnaARQQHASGrq/YOdR0Rkgu6OuuhmatgCSaF77Ye1wHctE8Psg=
X-IsSubscribed: yes
X-SW-Source: 2017-08/txt/msg00081.txt.bz2

On 07.08.2017 at 23:29, Brian Inglis wrote:
> On 2017-08-07 13:30, Thomas Wolff wrote:
>> On 07.08.2017 at 21:07, Brian Inglis wrote:
>>> Implementation considerations for handling the Unicode tables described in
>>>      http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf
>>> and implemented in
>>>      https://www.strchr.com/multi-stage_tables
>>>
>>> ICU icu4[cj] uses a folded trie of the properties, where the unique property
>>> combinations are indexed, strings of those indices are generated for fixed size
>>> groups of character codes, unique values of those strings are then indexed, and
>>> those indices assigned to each character code group. The result is a multi-level
>>> indexing operation that returns the required property combination for each
>>> character.
>>>
>>> https://slidegur.com/doc/4172411/folded-trie--efficient-data-structure-for-all-of-unicode
>>>
>>>
>>> The FOX Toolkit uses a similar approach, splitting the 21 bit character code
>>> into 7 bit groups, with two higher levels of 7 bit indices, and more tweaks to
>>> eliminate redundancy.
>>>
>>> ftp://ftp.fox-toolkit.org/pub/FOX_Unicode_Tables.pdf
>>>
>> Thanks for the interesting links, I'll check them out.
>> But such multi-level tables don't really help without a given procedure how to
>> update them (that's only available for the lowest level, not for the
>> code-embedded levels).
> Unicode estimates property tables can be reduced to 7-8KB using these
> techniques, including using minimal int sizes for indices and array elements
> (e.g. char or short, if you can keep the indices small) rather than pointers.
>
> Creation scripts used by PCRE and Python projects are linked from the bottom of
> the second link above. Source and docs for these packages and ICU are available
> under Cygwin, and FOX Toolkit is available in some distros and by FTP.
>
>> Also, as I've demonstrated, my more straightforward and more efficient approach
>> will even use less total space than the multi-level approach if packed table
>> entries are used.
> Unicode recommends the double table index approach as a means of eliminating the
> massive redundancy that exists in char property entries and char groups, and
> using small integers instead of pointers, that can be optimized to meet
> conformance levels and platform speed and size limits, at the cost of an annual
> review of properties and rebuild. The amount of redundancy removed by this
> approach is estimated in the FOX Toolkit doc and ranges across orders of
> magnitude. Unfortunately none of these docs or sources quote sizes for any
> Unicode release!
>
> My own first take on these was to use run length encoded bitstrings for each
> binary property, similar to database bitmap indices, but the grouping of
> property blocks in Unicode, and their recommendation, persuaded me their
> approach was likely backed by a bunch of supporting corps' and devs' R&D, and is
> similar to those used for decades in database queries handling (lots of) small
> value set equivalence class columns to reduce memory pressure while speeding up
> selections.
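
For reference, the multi-level indexing described in the quoted text can be sketched roughly as follows. This is a minimal two-stage illustration in Python with toy data, not the ICU or FOX Toolkit implementation: the names build_two_stage and lookup, the 128-entry block size, and the sample width values are all assumptions made for the example.

```python
# Minimal two-stage table sketch; all names and data are hypothetical.
BLOCK_BITS = 7                  # 128 code points per block (assumed)
BLOCK_SIZE = 1 << BLOCK_BITS
BLOCK_MASK = BLOCK_SIZE - 1
MAX_CP = 0x110000               # number of Unicode code points


def build_two_stage(flat):
    """Split a flat per-code-point table into deduplicated blocks."""
    if len(flat) % BLOCK_SIZE:
        raise ValueError("table length must be a multiple of the block size")
    block_ids = {}              # block contents -> index of the unique block
    stage1 = []                 # one unique-block index per block of code points
    stage2 = []                 # unique blocks, concatenated
    for start in range(0, len(flat), BLOCK_SIZE):
        block = tuple(flat[start:start + BLOCK_SIZE])
        if block not in block_ids:
            block_ids[block] = len(block_ids)
            stage2.extend(block)
        stage1.append(block_ids[block])
    return stage1, stage2


def lookup(stage1, stage2, cp):
    """Return the property value for code point cp via two index steps."""
    block = stage1[cp >> BLOCK_BITS]
    return stage2[block * BLOCK_SIZE + (cp & BLOCK_MASK)]


# Toy width table: everything narrow (1) except two sample wide (2) ranges.
flat = [1] * MAX_CP
for first, last in ((0x1100, 0x115F), (0x4E00, 0x9FFF)):
    for cp in range(first, last + 1):
        flat[cp] = 2

stage1, stage2 = build_two_stage(flat)
```

In a C implementation the two stages would typically be arrays of small integer types, which is where the space savings mentioned in the quoted text come from.
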
I am not quite sure what you're trying to suggest or recommend now. The 
thing is, I just wanted to get an update of the width data in the first 
place, which is an easy and undisputed change. Then Corinna pointed out 
that the ctype functions are based on old Unicode data too, so I made an 
attempt to update them as well. I use the approach that I also use for two 
other projects (mined and mintty), and I didn't mean this to become a 
research project for me :/
I am certainly willing to consider specs and all that to achieve a 
suitable result, but I don't feel like implementing any fancy algorithm 
recommended by Unicode with unconvincing rationale, especially after 
I've calculated that my method uses even less memory.
Thomas
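
For comparison, one straightforward alternative to multi-level tables is a sorted table of code point ranges, searched by binary search, with each entry packed into a single integer. The message does not spell out the method actually used, so the following is only a sketch under that assumption. The names pack and is_wide, the 16-bit length field, and the two sample ranges are hypothetical.

```python
from bisect import bisect_right

LEN_BITS = 16                   # bits reserved for (last - first); assumed
LEN_MASK = (1 << LEN_BITS) - 1


def pack(first, last):
    """Pack a range into one integer: start in the high bits, length below."""
    length = last - first
    if not 0 <= length <= LEN_MASK:
        raise ValueError("range too long for the length field")
    return (first << LEN_BITS) | length


# Toy table of wide ranges, sorted by first code point (sample data only).
WIDE = [pack(0x1100, 0x115F), pack(0x4E00, 0x9FFF)]


def is_wide(cp):
    """Binary-search the packed table for a range containing cp."""
    # Count the entries whose start is <= cp; the last of them is the candidate.
    i = bisect_right(WIDE, (cp << LEN_BITS) | LEN_MASK) - 1
    if i < 0:
        return False
    first = WIDE[i] >> LEN_BITS
    length = WIDE[i] & LEN_MASK
    return cp - first <= length
```

Because the start occupies the high bits, sorting the packed integers is the same as sorting by range start, so the table needs one entry per range rather than one per code point or per block.
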

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple