public inbox for libc-locales@sourceware.org
* [Bug localedata/24658] New: wcwidth inconsistencies with Unicode 12.1
@ 2019-06-10 16:49 rob.ross at ymail dot com
  2019-06-10 18:09 ` [Bug localedata/24658] " rob.ross at ymail dot com
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: rob.ross at ymail dot com @ 2019-06-10 16:49 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=24658

            Bug ID: 24658
           Summary: wcwidth inconsistencies with Unicode 12.1
           Product: glibc
           Version: unspecified
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: rob.ross at ymail dot com
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

For "en_US.utf8", the 2019-06-10 trunk closely follows the Unicode standard except
for U+3248 to U+324F (Circled numbers with Ambiguous [A] width) and U+4DC0 to
U+4DFF (Yijing hexagram symbols with Neutral [N] width) where wcwidth returns 2
instead of 1.  Those deviations were intentionally added to
"localedata/unicode-gen/utf8_gen.py" starting at line 262.  The rationale
starting at line 263 refers to
<http://www.unicode.org/mail-arch/unicode-ml/y2017-m08/0023.html> which only
applies to the first range and depends on the definition of "context".  The
interpretation that glibc is a context, regardless of locale, is likely not
what was intended.  In particular, UAX 11
(<http://www.unicode.org/reports/tr11/tr11-36.html>) makes it clear that the
"EastAsianWidth.txt" context is either "East Asian" or "non-East Asian".  It
also states that "narrow characters include N, Na, H, and A (when not in East
Asian context)."
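
A minimal sketch of the rule quoted above, next to the two ranges this report
says are forced to width 2.  The function and table names are hypothetical and
do not reproduce "utf8_gen.py"; the wide side (W, F, and A in East Asian
context) is assumed from the same UAX 11 grouping, and zero-width and control
characters are ignored for brevity.

```python
# Illustrative only: names are hypothetical, not glibc or utf8_gen.py code.

# Ranges that, per this report, glibc forces to width 2.
FORCED_WIDE_RANGES = [
    (0x3248, 0x324F),  # circled numbers, East Asian Width "A"
    (0x4DC0, 0x4DFF),  # Yijing hexagram symbols, "N" per the report
]


def uax11_columns(eaw: str, east_asian_context: bool) -> int:
    """Map an East_Asian_Width value to 1 or 2 columns per the UAX 11 grouping."""
    if eaw in ("W", "F"):
        return 2
    if eaw == "A":
        # Ambiguous: wide only in an East Asian context.
        return 2 if east_asian_context else 1
    if eaw in ("N", "Na", "H"):
        return 1
    raise ValueError(f"unknown East_Asian_Width value: {eaw!r}")


def reported_glibc_columns(codepoint: int, eaw: str,
                           east_asian_context: bool = False) -> int:
    """Same mapping, but with the forced-wide ranges applied first."""
    for low, high in FORCED_WIDE_RANGES:
        if low <= codepoint <= high:
            return 2
    return uax11_columns(eaw, east_asian_context)


if __name__ == "__main__":
    for cp, eaw in ((0x3248, "A"), (0x4DC0, "N"), (0x0041, "Na")):
        print(f"U+{cp:04X} eaw={eaw} "
              f"uax11={uax11_columns(eaw, False)} "
              f"reported={reported_glibc_columns(cp, eaw)}")
```

In a non-East Asian context the two columns differ only for the forced ranges,
which is the deviation this report is about.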

This bug relates to 21750
(<https://sourceware.org/bugzilla/show_bug.cgi?id=21750>) item 5.  Part of the
rationale there for forcing a width of 2 was based on xterm's implementation
but xterm defaults to using wcwidth (unless you set mkWidth) so it's not very
convincing.  Another rationale was "glyphs for these characters are quadratic
in most fonts" which is a good point but lots of characters have this problem. 
Should there be wcwidth bugs for those characters?  Why should some ranges
receive special treatment?  The last rationale related to application
compatibility.  Changing widths to better track the Unicode database will break
old versions of applications, but programs are increasingly tracking that
database themselves so the problem will resolve itself.  A concrete example is
vim, which needs its own table in order to function consistently on platforms
without wcwidth.  Egmont Koblinger provided good rationales for a width of 1
and I don't see why they were discounted.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug localedata/24658] wcwidth inconsistencies with Unicode 12.1
  2019-06-10 16:49 [Bug localedata/24658] New: wcwidth inconsistencies with Unicode 12.1 rob.ross at ymail dot com
@ 2019-06-10 18:09 ` rob.ross at ymail dot com
  2024-05-21 19:00 ` maiku.fabian at gmail dot com
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: rob.ross at ymail dot com @ 2019-06-10 18:09 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=24658

--- Comment #1 from Rob Ross <rob.ross at ymail dot com> ---
I meant to start with the following:

For the "en_US.UTF-8" locale, the "master" branch as of 2019-06-10 closely
follows the Unicode standard except ...


* [Bug localedata/24658] wcwidth inconsistencies with Unicode 12.1
  2019-06-10 16:49 [Bug localedata/24658] New: wcwidth inconsistencies with Unicode 12.1 rob.ross at ymail dot com
  2019-06-10 18:09 ` [Bug localedata/24658] " rob.ross at ymail dot com
@ 2024-05-21 19:00 ` maiku.fabian at gmail dot com
  2024-05-22  0:03 ` tg at mirbsd dot de
  2024-05-22  9:18 ` maiku.fabian at gmail dot com
  3 siblings, 0 replies; 5+ messages in thread
From: maiku.fabian at gmail dot com @ 2024-05-21 19:00 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=24658

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maiku.fabian at gmail dot com,
                   |                            |tg at mirbsd dot de

--- Comment #2 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Do you have any examples where the current way of treating these as width 2
causes problems? 

I checked again with the fonts I have on my system and these are still
quadratic in most fonts. The hexagrams U+4DC0 to U+4DFF also seem to be
quadratic in most fonts, although I found some fonts like "DejaVu Sans" and
"Everson Mono" where they are rectangular, though even there they are still
wider than "single width" (wider than half a square).

I am reluctant to change this unless there is a good reason why it needs to
change, as it was requested by Thorsten Glaser (added to CC:) in

https://sourceware.org/bugzilla/show_bug.cgi?id=21750#c0

It seemed to make sense at the time, and nobody complained about it for quite
a while.

So why should these change now?


* [Bug localedata/24658] wcwidth inconsistencies with Unicode 12.1
  2019-06-10 16:49 [Bug localedata/24658] New: wcwidth inconsistencies with Unicode 12.1 rob.ross at ymail dot com
  2019-06-10 18:09 ` [Bug localedata/24658] " rob.ross at ymail dot com
  2024-05-21 19:00 ` maiku.fabian at gmail dot com
@ 2024-05-22  0:03 ` tg at mirbsd dot de
  2024-05-22  9:18 ` maiku.fabian at gmail dot com
  3 siblings, 0 replies; 5+ messages in thread
From: tg at mirbsd dot de @ 2024-05-22  0:03 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=24658

--- Comment #3 from Thorsten Glaser <tg at mirbsd dot de> ---
The misunderstanding is probably from conflating whatever the UCD says (which
affects Unicode) and wcwidth (which affects character width for fixed-width
terminals and similar use).

Unicode specifically does not say anything about the latter, and while sensible
wcwidth can be derived from UCD for most codepoints, there are some that are
historically fixed (from early mgk versions already), and changing them now
breaks a lot of things (I’d almost call it an API):

- fonts, especially bitmap fonts, that are intended to be used in terminals are
tailored to those widths
- if you run an application over ssh and remote and local wcwidth disagree,
visual corruption ensues
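
The ssh case in the second bullet can be sketched as follows.  The two width
functions are hypothetical stand-ins for a remote and a local wcwidth that
disagree about the hexagram range; they are not the actual tables of any libc
or terminal.

```python
# Illustrative only: both width functions are hypothetical stand-ins.

def width_forced_wide(ch: str) -> int:
    """Stand-in for a table that treats U+4DC0..U+4DFF as 2 columns."""
    return 2 if 0x4DC0 <= ord(ch) <= 0x4DFF else 1


def width_all_narrow(ch: str) -> int:
    """Stand-in for a table that treats the same characters as 1 column."""
    return 1


def cursor_column(text: str, width_of) -> int:
    """Column where each side believes the cursor is after printing text."""
    return sum(width_of(ch) for ch in text)


line = "ab\u4dc0cd"  # two ASCII letters, one hexagram, two ASCII letters
remote = cursor_column(line, width_forced_wide)
local = cursor_column(line, width_all_narrow)

if __name__ == "__main__":
    print(f"remote application thinks the cursor is at column {remote}")
    print(f"local terminal puts the cursor at column {local}")
```

Every such character on a line shifts the two views by one more column, so
anything the application positions after it lands in the wrong cell.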

This is not limited to xterm. Applications such as GNU screen or text editors
*also* need to be able to know how wide a character *in xterm* is, so wcwidth
is *the* interface for that.

“Modern” terminal emulators can display outside of the cell grid anyway (seen
with Emoji in KDE Konsole, for example, which tend to be too wide for the
two-cell width assigned to them), and they can also always display a narrower
glyph.

But the wcwidth has to agree, and changing it later for long-established
codepoints is IMO not acceptable.

Ambiguous width, even more than neutral width, is open to that. This is simply
the decision mgk made back then, and the one that very widely used wcwidth
implementations follow.


* [Bug localedata/24658] wcwidth inconsistencies with Unicode 12.1
  2019-06-10 16:49 [Bug localedata/24658] New: wcwidth inconsistencies with Unicode 12.1 rob.ross at ymail dot com
                   ` (2 preceding siblings ...)
  2024-05-22  0:03 ` tg at mirbsd dot de
@ 2024-05-22  9:18 ` maiku.fabian at gmail dot com
  3 siblings, 0 replies; 5+ messages in thread
From: maiku.fabian at gmail dot com @ 2024-05-22  9:18 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=24658

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |NOTABUG
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #4 from Mike FABIAN <maiku.fabian at gmail dot com> ---
So I think it is OK to close this bug here as NOTABUG?

