public inbox for libc-locales@sourceware.org
From: "tg at mirbsd dot de" <sourceware-bugzilla@sourceware.org>
To: libc-locales@sourceware.org
Subject: [Bug localedata/21750] New: column width of characters incompatible with classical wcwidth
Date: Tue, 11 Jul 2017 14:18:00 -0000
Message-ID: <bug-21750-716@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=21750

            Bug ID: 21750
           Summary: column width of characters incompatible with classical
                    wcwidth
           Product: glibc
           Version: 2.26
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: tg at mirbsd dot de
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

I’ve compared the new autogenerated column widths from
localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth()
implementation from xterm (adjusted to Unicode 10.0.0) and found a few
divergences. I also found, and fixed, bugs on my side (MirBSD, which uses
something based on xterm’s data system-wide).

1. U+00AD is forced to width 1 in xterm, but autodetected as combining in glibc

The likely rationale for forcing it to 1 is that U+0000‥U+00FF are Latin-1,
which had no combining characters at all when displayed as 8-bit on terminals.

Change Request to glibc: force U+00AD to width 1.

2. The UCD has three codepoints that are Me/Mn category but not NSM bidi class:
U+0CBF U+0CC6 U+11C3F

This is likely a bug in the UCD, but glibc can work around it by treating
Me/Mn the same as Cf/NSM, which is what I do on my side.

Change Request to glibc: handle Me/Mn category the same as NSM bidi class.
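
A minimal sketch of what request 2 amounts to, assuming the zero-width test is
made on the general category and the bidi class together. The names
is_zero_width and mismatched_marks are hypothetical and are not taken from
utf8_gen.py; the scan uses Python's unicodedata module, so its result depends
on the Unicode version that the local Python ships.

```python
import unicodedata


def is_zero_width(category, bidi_class):
    """Treat Me/Mn the same as Cf/NSM (hypothetical helper)."""
    return category in ("Me", "Mn", "Cf") or bidi_class == "NSM"


def mismatched_marks():
    """List codepoints that are Me/Mn but do not have bidi class NSM."""
    found = []
    for cp in range(0x110000):
        ch = chr(cp)
        if (unicodedata.category(ch) in ("Me", "Mn")
                and unicodedata.bidirectional(ch) != "NSM"):
            found.append(cp)
    return found
```

Printing ["U+%04X" % cp for cp in mismatched_marks()] shows which codepoints
the local Unicode data still classifies this way.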

3. Hangul Jamo medial vowels and final consonants are set to width 0 by xterm
so they combine on top of the preceding initial ones: U+1160‥U+11FF

Change Request to glibc: force U+1160‥U+11FF to width 0.

4. During parsing, EastAsianWidth data overrides UCD data, more specifically
the NSM property.

This leads to U+302A‥U+302D and – see also
https://sourceware.org/bugzilla/show_bug.cgi?id=19852 – U+3099 and U+309A being
treated as width 2.

Change Request to glibc: read EAW before UCD so the NSM overrides EAW here.
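
The ordering in request 4 can be illustrated with a small sketch, assuming the
generator fills one table in several passes and a later pass overwrites an
earlier one. build_widths and its two input lists are hypothetical and only
stand in for the parsed EastAsianWidth and UCD data.

```python
def build_widths(eaw_wide, zero_width):
    """Apply East Asian Width first, then let zero-width marks override it."""
    widths = {}
    for cp in eaw_wide:      # wide/fullwidth entries, applied first
        widths[cp] = 2
    for cp in zero_width:    # NSM entries, applied last so they win
        widths[cp] = 0
    return widths


# Illustrative input only: U+3042 is wide; U+3099 and U+309A are wide and NSM.
eaw_wide = [0x3042, 0x3099, 0x309A]
zero_width = [0x3099, 0x309A]
widths = build_widths(eaw_wide, zero_width)
```

With the passes in the opposite order, U+3099 and U+309A would end up as
width 2, which is the behaviour described above.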

5. Ambiguous circled numbers and neutral hexagrams changed width

xterm used to set those to width 2, likely because they are ideographs, not
unlike zodiac signs and emoji (which, I notice, are set to width 2 in the UCD
nowadays).

Change Request to glibc: force U+3248‥U+324F and U+4DC0‥U+4DFF to width 2.
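
The forced widths from requests 1, 3 and 5 could be kept in one small override
table that is consulted after the automatic classification. This is only a
sketch; FORCED_WIDTHS and forced_width are hypothetical names, not part of
utf8_gen.py.

```python
# (first, last, width) ranges, inclusive; taken from requests 1, 3 and 5.
FORCED_WIDTHS = [
    (0x00AD, 0x00AD, 1),  # request 1: U+00AD
    (0x1160, 0x11FF, 0),  # request 3: Hangul Jamo medial vowels, final consonants
    (0x3248, 0x324F, 2),  # request 5: ambiguous circled numbers
    (0x4DC0, 0x4DFF, 2),  # request 5: neutral hexagrams
]


def forced_width(cp, computed):
    """Return the forced width for cp, or the computed width if none applies."""
    for first, last, width in FORCED_WIDTHS:
        if first <= cp <= last:
            return width
    return computed
```

For example, forced_width(0x00AD, 0) returns 1, while a codepoint outside all
ranges keeps whatever width was computed for it.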


Note: I initially reported the surprising change to Debian as
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=826256 but redid the
research today against Unicode 10 (using 2.24 in Debian and git master commit
2a91300176a5991d9825eba085e502196a3f47cd in glibc). I double-checked *all*
differences against the MirBSD code and fixed a few bugs there, after making it
possible to compare the results (glibc only puts actually assigned codepoints
into the localedata/charmaps/UTF-8 file).
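
A comparison restricted to assigned codepoints might look like the sketch
below. It assumes the glibc widths have already been parsed into a dict that
contains assigned codepoints only, and that the other implementation is
available as a callable; width_differences and both parameter names are
hypothetical.

```python
def width_differences(glibc_widths, other_width):
    """Return (codepoint, glibc, other) for every disagreement.

    Only codepoints present in glibc_widths are checked, so unassigned
    codepoints never show up as differences.
    """
    diffs = []
    for cp in sorted(glibc_widths):
        other = other_width(cp)
        if glibc_widths[cp] != other:
            diffs.append((cp, glibc_widths[cp], other))
    return diffs
```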

I am requesting the change in glibc so that all systems I have access to use
the same width data. This prevents display artifacts and glitches, which can go
as far as making an editor somewhat unusable with heavy Unicode (I have test
files containing the entire Unicode range). Thank you for listening.

If necessary, I will provide patches (to utf8_gen.py most likely) when asked.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
