From: "stlman at poczta dot fm"
To: libc-locales@sourceware.org
Subject: [Bug localedata/24314] New: charmaps: Some of UTF-8 characters have invalid width
Date: Fri, 08 Mar 2019 11:24:00 -0000

https://sourceware.org/bugzilla/show_bug.cgi?id=24314

            Bug ID: 24314
           Summary: charmaps: Some of UTF-8 characters have invalid width
           Product: glibc
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: stlman at poczta dot fm
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

Some characters are assigned an invalid width in the localedata/charmaps/UTF-8
file (lines below 47072). For example, \u2693 (ANCHOR) is described as a
double-width character.

There is a procedure for deriving the data from the standard files, which says
that double-width characters come from the EastAsianWidth.txt file.

    grep '^[^;]*;[WF]' EastAsianWidth.txt | grep 2693

returns no results, which means that line 47261 (as of commit c5f65462a2), which
says 2, is wrong.

At least 28 other characters seem to be improperly classified as double-width
too. Use the following command to find them:

    perl -ne 'next if (1..47080 or /\.\.\./); print if (/2$/);' localedata/charmaps/UTF-8

None of these characters can be found in the output of

    grep '^[^;]*;[WF]' EastAsianWidth.txt

Apparently the localedata/unicode-gen/utf8_gen.py script has failed to filter
out these characters.
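
The two checks above (the grep for W/F entries and the perl filter for
single-character width-2 lines) can be combined into one cross-check. The
following is only a rough sketch, not part of utf8_gen.py: the function names
are hypothetical, the sample lines are made up for illustration rather than
copied from the real files, and it assumes that the charmap keeps its widths
between "WIDTH" and "END WIDTH" markers in <Uxxxx> notation instead of relying
on line numbers.

```python
import re

def parse_wide_ranges(eaw_lines):
    """Return (start, end) code point ranges marked W or F in EastAsianWidth.txt."""
    ranges = []
    for line in eaw_lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] not in ("W", "F"):
            continue
        start, _, end = fields[0].partition("..")
        ranges.append((int(start, 16), int(end or start, 16)))
    return ranges

# Matches single-character entries such as "<U2693> 2". Range entries
# ("<U1100>...<U115F> 2") do not match, mirroring the /\.\.\./ skip in the
# perl one-liner.
WIDTH_ENTRY = re.compile(r"^<U([0-9A-Fa-f]{4,8})>\s+2\s*$")

def single_width2_entries(charmap_lines):
    """Yield code points of single-character width-2 entries (assumed layout)."""
    in_width = False
    for line in charmap_lines:
        stripped = line.strip()
        if stripped == "WIDTH":
            in_width = True
        elif stripped == "END WIDTH":
            in_width = False
        elif in_width:
            match = WIDTH_ENTRY.match(stripped)
            if match:
                yield int(match.group(1), 16)

def unexpected_wide(charmap_lines, eaw_lines):
    """Return width-2 code points that no W/F range covers."""
    ranges = parse_wide_ranges(eaw_lines)
    return [cp for cp in single_width2_entries(charmap_lines)
            if not any(lo <= cp <= hi for lo, hi in ranges)]

# Hypothetical sample data, shaped like the situation the report describes.
sample_eaw = [
    "# illustrative excerpt only",
    "1100..115F;W # Lo HANGUL CHOSEONG",
    "2693;N # So ANCHOR",
    "FF01..FF60;F # FULLWIDTH forms",
]
sample_charmap = [
    "WIDTH",
    "<U0300>\t0",
    "<U1100>...<U115F>\t2",
    "<U2693>\t2",
    "<UFF01>\t2",
    "END WIDTH",
]

if __name__ == "__main__":
    for cp in unexpected_wide(sample_charmap, sample_eaw):
        print(f"U+{cp:04X} has width 2 but is not W or F")
```

To try it on real data, the same function could be called with two open file
objects, for example localedata/charmaps/UTF-8 and the EastAsianWidth.txt copy
used to generate it, since both arguments only need to be iterables of lines.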