* [Bug localedata/24314] New: charmaps: Some of UTF-8 characters have invalid width
@ 2019-03-08 11:24 stlman at poczta dot fm
2019-03-08 13:56 ` [Bug localedata/24314] " egmont at gmail dot com
` (2 more replies)
0 siblings, 3 replies; 4+ messages in thread
From: stlman at poczta dot fm @ 2019-03-08 11:24 UTC (permalink / raw)
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=24314
Bug ID: 24314
Summary: charmaps: Some of UTF-8 characters have invalid width
Product: glibc
Version: unspecified
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: localedata
Assignee: unassigned at sourceware dot org
Reporter: stlman at poczta dot fm
CC: libc-locales at sourceware dot org
Target Milestone: ---
Some characters are assigned an invalid width in the localedata/charmaps/UTF-8
file (lines below 47072). For example, \u2693 (ANCHOR) is described as a
double-width character.
There is a procedure for deriving the data from the standard files, which says
that double-width characters come from the EastAsianWidth.txt file.
grep '^[^;]*;[WF]' EastAsianWidth.txt | grep 2693
returns no results, which means that line 47261 (as of commit c5f65462a2), which
says
<U2693> 2
is wrong. At least 28 other characters seem to be improperly classified as
double-width too. Use the following command to find them:
perl -ne 'next if (1..47080 or /\.\.\./); print if (/2$/);' localedata/charmaps/UTF-8
None of these characters can be found in the output of
grep '^[^;]*;[WF]' EastAsianWidth.txt
Apparently the localedata/unicode-gen/utf8_gen.py script has failed to filter
out these characters.
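The perl one-liner can also be sketched in Python, for anyone who prefers to post-process the list. This is only an illustration and not part of glibc: the function name `double_width_singles` is hypothetical, and the sketch assumes that width entries look like `<U2693> 2` and that range entries use `...` between two code points.

```python
import re

# Matches a single-code-point width entry such as "<U2693> 2".
# Range entries ("<U1100>...<U115F> 2") and ordinary charmap lines do not
# match, which mirrors the range-skipping filter in the perl command.
SINGLE_WIDTH = re.compile(r"^<U([0-9A-Fa-f]{4,8})>\s+(\d+)\s*$")


def double_width_singles(lines):
    """Return the code points that are listed individually with width 2."""
    found = []
    for line in lines:
        match = SINGLE_WIDTH.match(line)
        if match and match.group(2) == "2":
            found.append(int(match.group(1), 16))
    return found


# Hypothetical usage against a glibc checkout:
# with open("localedata/charmaps/UTF-8", encoding="utf-8") as charmap:
#     for cp in double_width_singles(charmap):
#         print(f"U+{cp:04X}")
```

Unlike the perl version, this does not rely on a hard-coded line number, because lines that are not bare width entries simply fail to match.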
--
You are receiving this mail because:
You are on the CC list for the bug.
* [Bug localedata/24314] charmaps: Some of UTF-8 characters have invalid width
From: egmont at gmail dot com @ 2019-03-08 13:56 UTC (permalink / raw)
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=24314
Egmont Koblinger <egmont at gmail dot com> changed:
What       |Removed     |Added
----------------------------------------------------------------------------
CC         |            |egmont at gmail dot com
--- Comment #1 from Egmont Koblinger <egmont at gmail dot com> ---
(In reply to Łukasz Stelmach from comment #0)
> grep '^[^;]*;[WF]' EastAsianWidth.txt | grep 2693
>
> returns no results which means the line 47261 (as of commit c5f65462a2)
This command _does_ print "2693;W" for me, as of the aforementioned commit,
assuming the input file is glibc's localedata/unicode-gen/EastAsianWidth.txt
(line 1210).
Note that the width of many codepoints, including this one, changed from narrow
to wide with Unicode 9.0. Compare these two files:
ftp://ftp.unicode.org/Public/8.0.0/ucd/EastAsianWidth.txt ("2670..269D;N")
ftp://ftp.unicode.org/Public/9.0.0/ucd/EastAsianWidth.txt ("2693;W")
Any chance you worked from a Unicode 8 (or older) EastAsianWidth.txt, rather
than the one in glibc's source?
(Also note that your grep command can easily miss matches, since the file
defines ranges. It's not the case with U+2693 though.)
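The range caveat can be avoided by parsing each entry instead of grepping for the code point. The following is a minimal sketch, assuming entries of the form `2693;W` or `2670..269D;N` with optional `#` comments; the function name `east_asian_width` is hypothetical and is not taken from glibc or utf8_gen.py.

```python
def east_asian_width(lines, codepoint):
    """Return the East Asian Width property for codepoint, or None if unlisted."""
    for line in lines:
        # Drop trailing comments and surrounding whitespace.
        data = line.split("#", 1)[0].strip()
        if not data or ";" not in data:
            continue
        span, prop = (part.strip() for part in data.split(";", 1))
        # An entry is either a single code point or "FIRST..LAST".
        first, _, last = span.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        if lo <= codepoint <= hi:
            return prop
    return None


old = ["2670..269D;N"]  # Unicode 8.0.0 entry quoted above
new = ["2693;W"]        # Unicode 9.0.0 entry quoted above
print(east_asian_width(old, 0x2693), east_asian_width(new, 0x2693))  # N W
```

Run against both versions of EastAsianWidth.txt, a lookup like this would show the narrow-to-wide change for U+2693 even where the older file only lists it inside a range.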
* [Bug localedata/24314] charmaps: Some of UTF-8 characters have invalid width
From: stlman at poczta dot fm @ 2019-03-08 19:03 UTC (permalink / raw)
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=24314
--- Comment #2 from Łukasz Stelmach <stlman at poczta dot fm> ---
TL;DR: Indeed, I was working with an old data file.
As an excuse, I can only say that several fonts provide this character as
normal width rather than wide, which matched my observation of the outdated
data file.
I guess this bug can be closed then. Thank you.
* [Bug localedata/24314] charmaps: Some of UTF-8 characters have invalid width
From: fweimer at redhat dot com @ 2019-03-08 20:03 UTC (permalink / raw)
To: libc-locales
https://sourceware.org/bugzilla/show_bug.cgi?id=24314
Florian Weimer <fweimer at redhat dot com> changed:
What       |Removed     |Added
----------------------------------------------------------------------------
Status     |UNCONFIRMED |RESOLVED
CC         |            |fweimer at redhat dot com
Resolution |---         |INVALID
Flags      |            |security-
--- Comment #3 from Florian Weimer <fweimer at redhat dot com> ---
Thanks, closing as requested.