[Bug localedata/24314] New: charmaps: Some of UTF-8 characters have invalid width

public inbox for libc-locales@sourceware.org

* [Bug localedata/24314] New: charmaps: Some of UTF-8 characters have invalid width
@ 2019-03-08 11:24 stlman at poczta dot fm
  2019-03-08 13:56 ` [Bug localedata/24314] " egmont at gmail dot com
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: stlman at poczta dot fm @ 2019-03-08 11:24 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=24314

            Bug ID: 24314
           Summary: charmaps: Some of UTF-8 characters have invalid width
           Product: glibc
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: stlman at poczta dot fm
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

Some characters are assigned an invalid width in the localedata/charmaps/UTF-8
file (lines below 47072). For example, \u2693 (ANCHOR) is described as a
double-width character.

There is a procedure for deriving the data from the standard files, which says
that double-width characters come from the EastAsianWidth.txt file.

  grep '^[^;]*;[WF]' EastAsianWidth.txt | grep 2693

returns no results which means the line 47261 (as of commit c5f65462a2) which
says

  <U2693> 2

is wrong. At least 28 other characters seem to be improperly classified as
double-width too. Use the following command to find them.

  perl -ne 'next if (1..47080 or /\.\.\./);  print if (/2$/);' \
    localedata/charmaps/UTF-8

None of these characters can be found in the output of 

  grep '^[^;]*;[WF]' EastAsianWidth.txt

Apparently the localedata/unicode-gen/utf8_gen.py script has failed to filter
out these characters.
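
For readers who prefer something more explicit than the perl one-liner, here
is a minimal Python sketch of the same filter. The function name
single_wide_entries and its skip_lines parameter are hypothetical, and the
sample lines are made up for illustration rather than copied from the charmap.
Unlike the perl version, which prints any line ending in "2", this only
accepts lines of the exact form "<UXXXX> 2".

```python
import re

# Matches a single-codepoint width entry such as "<U2693> 2".
# Range entries such as "<UXXXX>...<UYYYY> 2" contain "..." and do not match.
_SINGLE_WIDE = re.compile(r"^<U([0-9A-Fa-f]{4,8})>\s+2$")


def single_wide_entries(lines, skip_lines=0):
    """Return codepoints of single-character entries with width 2.

    skip_lines mirrors the "1..47080" part of the perl command: the first
    skip_lines lines of the input are ignored.
    """
    found = []
    for lineno, line in enumerate(lines, start=1):
        if lineno <= skip_lines:
            continue
        match = _SINGLE_WIDE.match(line.strip())
        if match:
            found.append(int(match.group(1), 16))
    return found


# Illustrative input only, not an excerpt of the real file.
sample = ["<U2693> 2", "<U1100>...<U115F> 2", "<U0300> 0"]
print([f"U+{cp:04X}" for cp in single_wide_entries(sample)])
```

On the real file one would pass the lines of localedata/charmaps/UTF-8 and
skip_lines=47080. The resulting list still has to be compared against the
EastAsianWidth.txt that ships in the same source tree.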

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug localedata/24314] charmaps: Some of UTF-8 characters have invalid width
  2019-03-08 11:24 [Bug localedata/24314] New: charmaps: Some of UTF-8 characters have invalid width stlman at poczta dot fm
@ 2019-03-08 13:56 ` egmont at gmail dot com
  2019-03-08 19:03 ` stlman at poczta dot fm
  2019-03-08 20:03 ` fweimer at redhat dot com
  2 siblings, 0 replies; 4+ messages in thread
From: egmont at gmail dot com @ 2019-03-08 13:56 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=24314

Egmont Koblinger <egmont at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |egmont at gmail dot com

--- Comment #1 from Egmont Koblinger <egmont at gmail dot com> ---
(In reply to Łukasz Stelmach from comment #0)

>   grep '^[^;]*;[WF]' EastAsianWidth.txt | grep 2693
> 
> returns no results which means the line 47261 (as of commit c5f65462a2)

This command _does_ print "2693;W" for me, as of the aforementioned commit,
assuming the input file is glibc's localedata/unicode-gen/EastAsianWidth.txt
(line 1210).

Note that the width of many codepoints, including this one, changed from narrow
to wide with Unicode 9.0. Compare these two files:

ftp://ftp.unicode.org/Public/8.0.0/ucd/EastAsianWidth.txt ("2670..269D;N")
ftp://ftp.unicode.org/Public/9.0.0/ucd/EastAsianWidth.txt ("2693;W")

Any chance you worked from a Unicode 8 (or older) EastAsianWidth.txt, rather
than the one in glibc's source?

(Also note that your grep command can easily miss matches, since the file
defines ranges. It's not the case with U+2693 though.)
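
To illustrate the point about ranges, here is a minimal Python sketch of a
range-aware lookup. The names parse_east_asian_width and width_class are
hypothetical, and the two sample lines are the ones quoted above from the
Unicode 8.0 and 9.0 files. It assumes data lines of the form "XXXX;C" or
"XXXX..YYYY;C", optionally with spaces around the semicolon and a trailing
comment.

```python
import re

# One data line: a codepoint or a "first..last" range, then ";" and the class.
_LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)"
)


def parse_east_asian_width(lines):
    """Return a list of (first, last, width_class) tuples, one per data line."""
    ranges = []
    for line in lines:
        match = _LINE.match(line)
        if match is None:  # comment, blank line, or unrecognised format
            continue
        first = int(match.group(1), 16)
        last = int(match.group(2), 16) if match.group(2) else first
        ranges.append((first, last, match.group(3)))
    return ranges


def width_class(ranges, codepoint):
    """Return the class of the first entry covering codepoint, or None."""
    for first, last, cls in ranges:
        if first <= codepoint <= last:
            return cls
    return None


unicode_8 = ["2670..269D;N"]  # line quoted from the 8.0.0 file
unicode_9 = ["2693;W"]        # line quoted from the 9.0.0 file
print(width_class(parse_east_asian_width(unicode_8), 0x2693))  # N
print(width_class(parse_east_asian_width(unicode_9), 0x2693))  # W
```

A plain "grep 2693" finds nothing in the 8.0 line even though the range
covers U+2693. The lookup above does find it, and it reports the class that
the given file version assigns.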


* [Bug localedata/24314] charmaps: Some of UTF-8 characters have invalid width
  2019-03-08 11:24 [Bug localedata/24314] New: charmaps: Some of UTF-8 characters have invalid width stlman at poczta dot fm
  2019-03-08 13:56 ` [Bug localedata/24314] " egmont at gmail dot com
@ 2019-03-08 19:03 ` stlman at poczta dot fm
  2019-03-08 20:03 ` fweimer at redhat dot com
  2 siblings, 0 replies; 4+ messages in thread
From: stlman at poczta dot fm @ 2019-03-08 19:03 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=24314

--- Comment #2 from Łukasz Stelmach <stlman at poczta dot fm> ---
TL;DR: Indeed, I was working with an old data file.

As an excuse I can only say that several fonts provide this character as
normal rather than wide, which matched my observation of the outdated data
file.

I guess this bug can be closed then. Thank you.


* [Bug localedata/24314] charmaps: Some of UTF-8 characters have invalid width
  2019-03-08 11:24 [Bug localedata/24314] New: charmaps: Some of UTF-8 characters have invalid width stlman at poczta dot fm
  2019-03-08 13:56 ` [Bug localedata/24314] " egmont at gmail dot com
  2019-03-08 19:03 ` stlman at poczta dot fm
@ 2019-03-08 20:03 ` fweimer at redhat dot com
  2 siblings, 0 replies; 4+ messages in thread
From: fweimer at redhat dot com @ 2019-03-08 20:03 UTC
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=24314

Florian Weimer <fweimer at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
                 CC|                            |fweimer at redhat dot com
         Resolution|---                         |INVALID
              Flags|                            |security-

--- Comment #3 from Florian Weimer <fweimer at redhat dot com> ---
Thanks, closing as requested.
