From: Brian Inglis
Reply-To: Brian.Inglis@SystematicSw.ab.ca
To: cygwin@cygwin.com
Subject: Re: Unicode width data inconsistent/outdated
Date: Mon, 07 Aug 2017 21:29:00 -0000
Message-ID: <0f8f1535-ed48-d170-7e57-c554bec23942@SystematicSw.ab.ca>
In-Reply-To: <9f7a8d16-6ebc-52ff-15ae-b1a52d23986b@towo.net>

On 2017-08-07 13:30, Thomas Wolff wrote:
> On 07.08.2017 at 21:07, Brian Inglis wrote:
>> Implementation considerations for handling the Unicode tables described in
>> http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf
>> and implemented in
>> https://www.strchr.com/multi-stage_tables
>>
>> ICU icu4[cj] uses a folded trie of the properties, where the unique property
>> combinations are indexed, strings of those indices are generated for fixed size
>> groups of character codes, unique values of those strings are then indexed, and
>> those indices assigned to each character code group. The result is a multi-level
>> indexing operation that returns the required property combination for each
>> character.
>>
>> https://slidegur.com/doc/4172411/folded-trie--efficient-data-structure-for-all-of-unicode
>>
>> The FOX Toolkit uses a similar approach, splitting the 21 bit character code
>> into 7 bit groups, with two higher levels of 7 bit indices, and more tweaks to
>> eliminate redundancy.
>>
>> ftp://ftp.fox-toolkit.org/pub/FOX_Unicode_Tables.pdf
>>
> Thanks for the interesting links, I'll check them out.
> But such multi-level tables don't really help without a given procedure how to
> update them (that's only available for the lowest level, not for the
> code-embedded levels).

Unicode estimates that property tables can be reduced to 7-8KB using these
techniques, which include using minimal int sizes for indices and array
elements (e.g. char or short, if you can keep the indices small) rather than
pointers. Creation scripts used by the PCRE and Python projects are linked from
the bottom of the second link above. Source and docs for these packages and ICU
are available under Cygwin, and the FOX Toolkit is available in some distros
and by FTP.

> Also, as I've demonstrated, my more straightforward and more efficient approach
> will even use less total space than the multi-level approach if packed table
> entries are used.

Unicode recommends the double table index approach as a means of eliminating
the massive redundancy that exists in char property entries and char groups,
using small integers instead of pointers. The tables can be optimized to meet
conformance levels and platform speed and size limits, at the cost of an annual
review of properties and a rebuild. The amount of redundancy removed by this
approach is estimated in the FOX Toolkit doc and ranges across orders of
magnitude. Unfortunately, none of these docs or sources quote sizes for any
Unicode release!

My own first take on these was to use run-length-encoded bitstrings for each
binary property, similar to database bitmap indices. But the grouping of
property blocks in Unicode, and their recommendation, persuaded me that their
approach was likely backed by a bunch of supporting corps' and devs' R&D. It is
also similar to the techniques used for decades in database query handling of
(lots of) small value set equivalence class columns, to reduce memory pressure
while speeding up selections.

-- 
Take care.
Thanks, Brian Inglis, Calgary, Alberta, Canada
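
PS: To make the double table index idea above concrete, here is a minimal
sketch in Python of how such tables might be generated and queried. It is not
the code used by ICU, FOX, PCRE, Python, or newlib. The names build_two_stage,
lookup, BLOCK_BITS and east_asian_width are made up for illustration, the 7 bit
block size is an assumption borrowed from the FOX description, and the
property source is whatever Unicode version the local Python unicodedata
module ships with.

```python
import unicodedata

BLOCK_BITS = 7                 # assumption: 128 code points per block
BLOCK_SIZE = 1 << BLOCK_BITS
MAX_CP = 0x110000              # one past the last Unicode code point


def build_two_stage(prop):
    """Build (values, stage1, stage2) for a property function prop(cp)."""
    values, value_index = [], {}   # unique property values and their indices
    blocks, block_index = [], {}   # unique blocks of value indices
    stage1 = []                    # one block index per BLOCK_SIZE code points
    for start in range(0, MAX_CP, BLOCK_SIZE):
        block = []
        for cp in range(start, start + BLOCK_SIZE):
            v = prop(cp)
            if v not in value_index:
                value_index[v] = len(values)
                values.append(v)
            block.append(value_index[v])
        key = tuple(block)
        if key not in block_index:     # identical blocks are stored only once
            block_index[key] = len(blocks)
            blocks.append(key)
        stage1.append(block_index[key])
    # Flatten the unique blocks into one array of small integers.
    stage2 = [i for block in blocks for i in block]
    return values, stage1, stage2


def lookup(tables, cp):
    """Two array indexing steps, then a fetch of the property value."""
    values, stage1, stage2 = tables
    block = stage1[cp >> BLOCK_BITS]
    return values[stage2[(block << BLOCK_BITS) | (cp & (BLOCK_SIZE - 1))]]


def east_asian_width(cp):
    # Hypothetical property source for this sketch only.
    return unicodedata.east_asian_width(chr(cp))


tables = build_two_stage(east_asian_width)
print(len(tables[0]), "values,", len(tables[1]), "stage1 entries,",
      len(tables[2]), "stage2 entries")
```

Rerunning the build against newer property data regenerates every level at
once, and the printed counts give a rough idea of how much redundancy the
block sharing removes for one property on the local Unicode version.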