From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (qmail 11795 invoked by alias); 21 Nov 2014 12:39:40 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: libc-locales-owner@sourceware.org
Received: (qmail 8081 invoked by uid 48); 21 Nov 2014 12:36:38 -0000
From: "maiku.fabian at gmail dot com"
To: libc-locales@sourceware.org
Subject: [Bug localedata/17588] Update UTF-8 charmap and width to Unicode 7.0.0
Date: Fri, 21 Nov 2014 12:39:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: localedata
X-Bugzilla-Version: unspecified
X-Bugzilla-Severity: normal
X-Bugzilla-Who: maiku.fabian at gmail dot com
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: pravin.d.s at gmail dot com
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: security-
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-q4/txt/msg00064.txt.bz2

https://sourceware.org/bugzilla/show_bug.cgi?id=17588

--- Comment #4 from Mike FABIAN ---
Here is another one where I still have a little bit of doubt:

changed width: 0x1929 : 0->1 eaw=N category=Mc bidi=L name=LIMBU SUBJOINED LETTER YA

Why is this combining character listed with width 0 in the current
UTF-8 file? In our newly generated UTF-8 file it has width 1 (because
it is no longer listed in the width section of that file, so the
default width applies).

The comment in the existing UTF-8 file in glibc says:

% Character width according to Unicode 5.0.0.
% - Default width is 1.
% - Double-width characters have width 2; generated from
%        "grep '^[^;]*;[WF]' EastAsianWidth.txt"
%   and  "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
% - Non-spacing characters have width 0; generated from PropList.txt or
%   "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
% - Format control characters have width 0; generated from
%   "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
% - Zero width characters have width 0; generated from
%   "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"

This does *not* mention combining characters as needing width 0, and
these grep patterns do not match some combining characters. The
combining characters with category=Mn get width 0 because they also
have bidi=NSM, for example:

changed width: 0x1a1b : 1->0 eaw=N category=Mn bidi=NSM name=BUGINESE VOWEL SIGN AE

but the combining characters with category=Mc are not matched by the
above grep patterns, because they do *not* have bidi=NSM. That seems
correct, considering they have a positive advance width:

Mn  Nonspacing_Mark  a nonspacing combining mark (zero advance width)
Mc  Spacing_Mark     a spacing combining mark (positive advance width)
Me  Enclosing_Mark   an enclosing combining mark

(http://www.unicode.org/reports/tr44)

But how did these get into the existing UTF-8 file in glibc? It looks
like the existing UTF-8 file in glibc was edited manually and not just
created using the grep patterns in the comment.
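
To make the point about the grep patterns concrete, here is a minimal Python sketch that applies the three zero-width patterns from the comment to a few lines written in the UnicodeData.txt field layout. It is not the script used to generate the new UTF-8 file. The sample lines are typed in by hand for illustration (only the name, category, and bidi fields matter here), the names `ZERO_WIDTH_PATTERNS`, `SAMPLE_LINES`, `width_from_patterns`, and `widths` are made up for this example, and the double-width rule from EastAsianWidth.txt is left out on purpose.

```python
import re

# The three zero-width grep patterns quoted in the comment, as Python regexes.
ZERO_WIDTH_PATTERNS = [
    re.compile(r"^[^;]*;[^;]*;[^;]*;[^;]*;NSM;"),  # fifth field: bidi class NSM
    re.compile(r"^[^;]*;[^;]*;Cf;"),               # third field: category Cf
    re.compile(r"^[^;]*;ZERO WIDTH "),             # name starts with ZERO WIDTH
]

# Hand-written sample lines in the UnicodeData.txt layout (illustrative only).
SAMPLE_LINES = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "1929;LIMBU SUBJOINED LETTER YA;Mc;0;L;;;;;N;;;;;",
    "1A1B;BUGINESE VOWEL SIGN AE;Mn;0;NSM;;;;;N;;;;;",
    "200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;",
]


def width_from_patterns(line):
    """Return 0 if any zero-width pattern matches, else the default width 1."""
    return 0 if any(p.search(line) for p in ZERO_WIDTH_PATTERNS) else 1


def widths(lines):
    """Map each code point (first field, hex) to the width the patterns give."""
    return {int(line.split(";")[0], 16): width_from_patterns(line)
            for line in lines}


if __name__ == "__main__":
    for code_point, width in sorted(widths(SAMPLE_LINES).items()):
        print(f"0x{code_point:04x} -> {width}")
```

With these sample lines, 0x1a1b (Mn, NSM) is matched by the NSM pattern and gets width 0, while 0x1929 (Mc, L) is matched by none of the patterns and keeps the default width 1, which is the behaviour described above.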