From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-locales-return-3477-listarch-libc-locales=sources.redhat.com@sourceware.org>
Received: (qmail 11795 invoked by alias); 21 Nov 2014 12:39:40 -0000
Mailing-List: contact libc-locales-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-locales.sourceware.org>
List-Subscribe: <mailto:libc-locales-subscribe@sourceware.org>
List-Post: <mailto:libc-locales@sourceware.org>
List-Help: <mailto:libc-locales-help@sourceware.org>, <http://sourceware.org/lists.html#faqs>
Sender: libc-locales-owner@sourceware.org
Received: (qmail 8081 invoked by uid 48); 21 Nov 2014 12:36:38 -0000
From: "maiku.fabian at gmail dot com" <sourceware-bugzilla@sourceware.org>
To: libc-locales@sourceware.org
Subject: [Bug localedata/17588] Update UTF-8 charmap and width to Unicode
 7.0.0
Date: Fri, 21 Nov 2014 12:39:00 -0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: localedata
X-Bugzilla-Version: unspecified
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: maiku.fabian at gmail dot com
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: pravin.d.s at gmail dot com
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: security-
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-17588-716-KYWH75okov@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-17588-716@http.sourceware.org/bugzilla/>
References: <bug-17588-716@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-SW-Source: 2014-q4/txt/msg00064.txt.bz2

https://sourceware.org/bugzilla/show_bug.cgi?id=17588

--- Comment #4 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Here is another one where I have a little bit of doubt left:

changed width: 0x1929 : 0->1 eaw=N category=Mc bidi=L   name=LIMBU SUBJOINED LETTER YA

Why is this combining character listed with width 0 in the current UTF-8 file?

In our newly generated UTF-8 file it has width 1 (because it is no longer
listed with an explicit width in that file, so it gets the default).

The comment in the existing UTF-8 file in glibc says:

% Character width according to Unicode 5.0.0.
% - Default width is 1.
% - Double-width characters have width 2; generated from
%        "grep '^[^;]*;[WF]' EastAsianWidth.txt"
%   and  "grep '^[^;]*;[^WF]' EastAsianWidth.txt"
% - Non-spacing characters have width 0; generated from PropList.txt or
%   "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
% - Format control characters have width 0; generated from
%   "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
% - Zero width characters have width 0; generated from
%   "grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt"
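
As a rough illustration of the rules in that comment, here is a minimal
Python sketch that applies the three UnicodeData.txt grep patterns to single
records. The function name, the is_wide flag (standing in for the
EastAsianWidth.txt lookup) and the order in which the rules are checked are
assumptions made for this sketch, and the PropList.txt source is left out.
The sample records are rebuilt from the property values quoted in this
comment and cut off after the bidi class; their combining-class field is a
placeholder that the patterns never inspect.

```python
import re

# The three UnicodeData.txt grep patterns from the comment, as Python regexes.
ZERO_WIDTH_PATTERNS = [
    re.compile(r"^[^;]*;[^;]*;[^;]*;[^;]*;NSM;"),  # fifth field: bidi class NSM
    re.compile(r"^[^;]*;[^;]*;Cf;"),               # third field: category Cf
    re.compile(r"^[^;]*;ZERO WIDTH "),             # name starts with ZERO WIDTH
]


def width_from_comment_rules(record, is_wide=False):
    """Return the width the quoted rules assign to one UnicodeData.txt record.

    is_wide is a stand-in (assumption) for "EastAsianWidth.txt lists this
    code point as W or F"; that file is not parsed here.
    """
    if any(pattern.match(record) for pattern in ZERO_WIDTH_PATTERNS):
        return 0
    return 2 if is_wide else 1


# Truncated sample records; the fourth field (combining class) is a
# placeholder value and is not consulted by any pattern.
SAMPLE_RECORDS = [
    "1929;LIMBU SUBJOINED LETTER YA;Mc;0;L;",
    "1A1B;BUGINESE VOWEL SIGN AE;Mn;0;NSM;",
]

if __name__ == "__main__":
    for record in SAMPLE_RECORDS:
        code, name = record.split(";")[:2]
        print(code, name, width_from_comment_rules(record))
```

Under these rules only the record with bidi=NSM matches a zero-width
pattern; the Mc record falls through to the default.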

This does *not* mention combining characters as needing width 0, and
these grep patterns do not match some combining characters.

The combining characters with category=Mn get width 0 because they
also have bidi=NSM, for example:

changed width: 0x1a1b : 1->0 eaw=N category=Mn bidi=NSM name=BUGINESE VOWEL SIGN AE

but the combining characters with category=Mc are not matched by
the above grep patterns, because they do *not* have bidi=NSM.
That seems correct, considering they have a positive advance width:

Mn     Nonspacing_Mark   a nonspacing combining mark (zero advance width)
Mc     Spacing_Mark      a spacing combining mark (positive advance width)
Me     Enclosing_Mark    an enclosing combining mark

(http://www.unicode.org/reports/tr44)
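
One way to cross-check the category and bidi class of individual code points
is Python's standard unicodedata module. This is only a sketch: the helper
names are made up for illustration, and the module reports whatever Unicode
version the interpreter was built with, which need not be 7.0.0.

```python
import unicodedata


def mark_properties(code_point):
    """Return (general category, bidi class) for one code point."""
    ch = chr(code_point)
    return unicodedata.category(ch), unicodedata.bidirectional(ch)


def is_spacing_mark_without_nsm(code_point):
    """True for Mc characters that the NSM grep pattern would not match."""
    category, bidi = mark_properties(code_point)
    return category == "Mc" and bidi != "NSM"


if __name__ == "__main__":
    # The two code points discussed in this comment.
    for cp in (0x1929, 0x1A1B):
        category, bidi = mark_properties(cp)
        name = unicodedata.name(chr(cp), "?")
        print(f"0x{cp:04x} category={category} bidi={bidi} name={name}")
```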

But how did these get into the existing UTF-8 file in glibc?

Looks like the existing UTF-8 file in glibc was edited manually
and not just created using the grep patterns in the comment.
