public inbox for libc-locales@sourceware.org
* [Bug localedata/31370] New: wcwidth() does not treat DEFAULT_IGNORABLE_CODE_POINTs as zero-width
From: julesbertholet at quoi dot xyz @ 2024-02-11 16:41 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=31370

            Bug ID: 31370
           Summary: wcwidth() does not treat DEFAULT_IGNORABLE_CODE_POINTs
                    as zero-width
           Product: glibc
           Version: 2.40
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: julesbertholet at quoi dot xyz
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that
characters with the `Default_Ignorable_Code_Point` property

> should be rendered as completely invisible (and non advancing, i.e. “zero width”), if not explicitly supported in rendering.

Hence, `wcwidth()` should give them all a width of 0, with two exceptions:

- the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding
precedent
- U+115F HANGUL CHOSEONG FILLER combines with jungseong and jongseong jamo to
form a width-2 syllable block, and should therefore keep its width 2

However, `wcwidth()` currently also incorrectly assigns non-zero width to
U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER.
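
The rule proposed above can be sketched as a small lookup. This is only an
illustration under stated assumptions: `DEFAULT_IGNORABLE_SAMPLE` is a
hand-picked, incomplete subset of the `Default_Ignorable_Code_Point` list, the
names `proposed_width` and `fallback_width` are hypothetical, and the fallback
for all other characters is a crude `East_Asian_Width` approximation rather
than glibc's actual width table.

```python
import unicodedata

# Hand-picked, INCOMPLETE sample of Default_Ignorable_Code_Point ranges
# (inclusive bounds). Illustration only, not the full Unicode list.
DEFAULT_IGNORABLE_SAMPLE = [
    (0x00AD, 0x00AD),  # SOFT HYPHEN
    (0x115F, 0x1160),  # HANGUL CHOSEONG FILLER, HANGUL JUNGSEONG FILLER
    (0x200B, 0x200F),  # zero-width space/joiners, directional marks
    (0x3164, 0x3164),  # HANGUL FILLER
    (0xFE00, 0xFE0F),  # variation selectors
    (0xFEFF, 0xFEFF),  # ZERO WIDTH NO-BREAK SPACE
    (0xFFA0, 0xFFA0),  # HALFWIDTH HANGUL FILLER
]

# The two exceptions named in the report.
EXCEPTIONS = {
    0x00AD: 1,  # soft hyphen: width 1 by longstanding precedent
    0x115F: 2,  # choseong filler: keeps the syllable block at width 2
}


def is_default_ignorable(cp):
    """Return True if cp falls in the (incomplete) sample list above."""
    return any(first <= cp <= last for first, last in DEFAULT_IGNORABLE_SAMPLE)


def fallback_width(cp):
    """Crude stand-in for every other character: Wide/Fullwidth -> 2, else 1."""
    return 2 if unicodedata.east_asian_width(chr(cp)) in ("W", "F") else 1


def proposed_width(cp):
    """Width under the proposed rule: exceptions first, then DI -> 0."""
    if cp in EXCEPTIONS:
        return EXCEPTIONS[cp]
    if is_default_ignorable(cp):
        return 0
    return fallback_width(cp)


if __name__ == "__main__":
    for cp in (0x00AD, 0x115F, 0x3164, 0xFFA0, 0x200B):
        print(f"U+{cp:04X} -> {proposed_width(cp)}")
```

Checking the exceptions before the general rule keeps the two carve-outs
explicit, so they stay visible if the sample list is replaced with the full
property data.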

-- 
You are receiving this mail because:
You are on the CC list for the bug.

* [Bug localedata/31370] wcwidth() does not treat DEFAULT_IGNORABLE_CODE_POINTs as zero-width
From: julesbertholet at quoi dot xyz @ 2024-02-11 16:55 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=31370

--- Comment #1 from Jules Bertholet <julesbertholet at quoi dot xyz> ---
Created attachment 15357
  --> https://sourceware.org/bugzilla/attachment.cgi?id=15357&action=edit
Patch to fix the issue

* [Bug localedata/31370] wcwidth() does not treat DEFAULT_IGNORABLE_CODE_POINTs as zero-width
From: carlos at redhat dot com @ 2024-02-12 13:45 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=31370

Carlos O'Donell <carlos at redhat dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2024-02-12
                 CC|                            |carlos at redhat dot com

--- Comment #2 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Jules Bertholet from comment #0)
> Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that
> characters with the `Default_Ignorable_Code_Point` property
> 
> > should be rendered as completely invisible (and non advancing, i.e. “zero width”), if not explicitly supported in rendering.
> 
> Hence, `wcwidth()` should give them all a width of 0, with two exceptions:

Please provide a patch to libc-alpha@sourceware.org following:
https://sourceware.org/glibc/wiki/Contribution%20checklist

Please also provide justification for the zero width by quoting another
implementation that also provides zero width e.g. CLDR.

The goal is for glibc to harmonize closer to CLDR.

It seems sensible to me that they would be zero width if they are
non-advancing, but that isn't always what an end user needs (as seen below).

> - the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding
> precedent

We use 1 in UTF-8 (the default width), so this matches. The expectation is that
the caller wants the width for the case where the hyphen is actually shown
during the display process.

> - U+115F HANGUL CHOSEONG FILLER combines with jungseong and jongseong jamo
> to form a width-2 syllable block, and should therefore keep its width 2

We use 2 in UTF-8. So this matches.
<U1100>...<U115F>       2

> However, `wcwidth()` currently also incorrectly assigns non-zero width to
> U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER.

This needs justification by highlighting that we are harmonizing the
implementation with CLDR.

Currently we have:
<U3131>...<U318E>       2

While U+FFA0 is default 1.
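
To make the current values concrete, here is a minimal sketch of how range
entries written in the form quoted above could be looked up. It assumes only
the two entries cited in this comment and a default width of 1 for anything
not listed; the helper names `parse_width_table` and `lookup_width` are
hypothetical, and support for a single-code-point entry is an assumption about
the format.

```python
import re

# Only the two entries quoted in this comment; the real table is much longer.
WIDTH_LINES = """\
<U1100>...<U115F>       2
<U3131>...<U318E>       2
"""

# "<Uxxxx>...<Uyyyy>  N", or (assumed) a single "<Uxxxx>  N" entry.
_ENTRY = re.compile(
    r"<U([0-9A-Fa-f]{4,8})>(?:\.\.\.<U([0-9A-Fa-f]{4,8})>)?\s+(\d+)"
)


def parse_width_table(text):
    """Parse range lines into a list of (first, last, width) tuples."""
    table = []
    for line in text.splitlines():
        match = _ENTRY.fullmatch(line.strip())
        if match is None:
            continue  # skip blank or unrecognised lines
        first = int(match.group(1), 16)
        last = int(match.group(2), 16) if match.group(2) else first
        table.append((first, last, int(match.group(3))))
    return table


def lookup_width(table, cp, default=1):
    """Return the listed width for cp, or the default when it is not listed."""
    for first, last, width in table:
        if first <= cp <= last:
            return width
    return default


if __name__ == "__main__":
    table = parse_width_table(WIDTH_LINES)
    for cp in (0x115F, 0x3164, 0xFFA0):
        print(f"U+{cp:04X} -> {lookup_width(table, cp)}")
```

With these two entries, U+115F and U+3164 resolve to 2 and U+FFA0 falls
through to the default of 1, which matches the values stated above.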

Thanks for filing this issue.

* [Bug localedata/31370] wcwidth() does not treat DEFAULT_IGNORABLE_CODE_POINTs as zero-width
From: maiku.fabian at gmail dot com @ 2024-02-13 22:53 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=31370

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at sourceware dot org   |maiku.fabian at gmail dot com
                 CC|                            |maiku.fabian at gmail dot com

* [Bug localedata/31370] wcwidth() does not treat DEFAULT_IGNORABLE_CODE_POINTs as zero-width
From: julesbertholet at quoi dot xyz @ 2024-02-14 18:02 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=31370

--- Comment #3 from Jules Bertholet <julesbertholet at quoi dot xyz> ---
> Please provide a patch to libc-alpha@sourceware.org

https://sourceware.org/pipermail/libc-alpha/2024-February/154574.html

> Please also provide justification for the zero width by quoting another implementation that also provides zero width e.g. CLDR.

CLDR doesn't address width issues at all; this is defined by Unicode itself.
The Unicode Standard, version 15.0, §5.21 - Characters Ignored for Display
<https://www.unicode.org/versions/Unicode15.1.0/ch05.pdf#G40095>:

> The list of characters which should be ignored for display in fallback rendering is given by a character property: Default_Ignorable_Code_Point (DI). Those characters include almost all format characters, all variation selectors, and a few other exceptional characters, such as Hangul fillers. The exact list is defined in DerivedCoreProperties.txt in the Unicode Character Database.

U+115F HANGUL CHOSEONG FILLER needs a carveout due to the unique behavior of
the conjoining Korean jamo characters. One composed Hangul "syllable block"
like 퓛 is made up of two to three individual component characters, or "jamo".
These are all assigned an `East_Asian_Width` of `Wide` by Unicode, which would
normally mean they would all be assigned width 2 by glibc; a combination of
(leading choseong jamo) + (medial jungseong jamo) + (trailing jongseong jamo)
would then have width 2 + 2 + 2 = 6. However, glibc (and other wcwidth
implementations) special-cases jungseong and jongseong, assigning them all
width 0, to ensure that the complete block has width 2 + 0 + 0 = 2 as it
should. U+115F is meant for use in syllable blocks that are intentionally
missing a leading jamo; it must be assigned a width of 2 even though it has no
visible display to ensure that the complete block has width 2.
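
The arithmetic above can be sketched as follows. This is a simplified model,
not glibc's table: it covers only the basic Hangul Jamo block (U+1100 to
U+11FF), the function names `jamo_width` and `block_width` are hypothetical,
and the "naive" figure simply counts every jamo as 2, following the reasoning
in this comment.

```python
import unicodedata


def jamo_width(cp):
    """Width of one conjoining jamo under the special-casing described above."""
    if 0x1100 <= cp <= 0x115F:
        return 2  # leading choseong, including U+115F HANGUL CHOSEONG FILLER
    if 0x1160 <= cp <= 0x11FF:
        return 0  # medial jungseong and trailing jongseong
    raise ValueError(f"U+{cp:04X} is outside the basic Hangul Jamo block")


def block_width(jamo):
    """Total width of a syllable block given as a string of conjoining jamo."""
    return sum(jamo_width(ord(ch)) for ch in jamo)


# Canonical decomposition splits the precomposed syllable U+D4DB into jamo.
syllable = unicodedata.normalize("NFD", "\uD4DB")

# A block that intentionally has no leading consonant starts with the filler.
filler_block = "\u115F\u1161\u11A8"

if __name__ == "__main__":
    print("jamo in block:", [f"U+{ord(ch):04X}" for ch in syllable])
    print("naive width  :", 2 * len(syllable))
    print("block width  :", block_width(syllable))
    print("filler block :", block_width(filler_block))
```

If U+115F were given width 0 like the other default-ignorable characters, the
filler block would sum to 0 + 0 + 0 = 0 instead of 2, which is the reason for
the carveout.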

You can read more about Unicode jamo in the Unicode spec, sections 3.12
<https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G24646> and 18.6
<https://www.unicode.org/versions/Unicode15.0.0/ch18.pdf#G31028>.

* [Bug localedata/31370] wcwidth() does not treat DEFAULT_IGNORABLE_CODE_POINTs as zero-width
From: carlos at redhat dot com @ 2024-02-14 18:27 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=31370

--- Comment #4 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Jules Bertholet from comment #3)
> > Please provide a patch to libc-alpha@sourceware.org
> 
> https://sourceware.org/pipermail/libc-alpha/2024-February/154574.html
> 
> > Please also provide justification for the zero width by quoting another implementation that also provides zero width e.g. CLDR.
> 
> CLDR doesn't address width issues at all; this is defined by Unicode itself.
> The Unicode Standard, version 15.0, §5.21 - Characters Ignored for Display
> <https://www.unicode.org/versions/Unicode15.1.0/ch05.pdf#G40095>:

What do the libicu APIs return for these characters?

> > The list of characters which should be ignored for display in fallback rendering is given by a character property: Default_Ignorable_Code_Point (DI). Those characters include almost all format characters, all variation selectors, and a few other exceptional characters, such as Hangul fillers. The exact list is defined in DerivedCoreProperties.txt in the Unicode Character Database.
> 
> U+115F HANGUL CHOSEONG FILLER needs a carveout due to the unique behavior of
> the conjoining Korean jamo characters. One composed Hangul "syllable block"
> like 퓛 is made up of two to three individual component characters, or
> "jamo". These are all assigned an `East_Asian_Width` of `Wide` by Unicode,
> which would normally mean they would all be assigned width 2 by glibc; a
> combination of (leading choseong jamo) + (medial jungseong jamo) + (trailing
> jongseong jamo) would then have width 2 + 2 + 2 = 6. However, glibc (and
> other wcwidth implementations) special-cases jungseong and jongseong,
> assigning them all width 0, to ensure that the complete block has width 2 +
> 0 + 0 = 2 as it should. U+115F is meant for use in syllable blocks that are
> intentionally missing a leading jamo; it must be assigned a width of 2 even
> though it has no visible display to ensure that the complete block has width
> 2.

Justification like this is *great* to have in the commit message e.g. here in a
v2.
https://patchwork.sourceware.org/project/glibc/patch/20240211175840.228824-2-julesbertholet@quoi.xyz/

> You can read more about Unicode jamo in the Unicode spec, sections 3.12
> <https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G24646> and 18.6
> <https://www.unicode.org/versions/Unicode15.0.0/ch18.pdf#G31028>.

* [Bug localedata/31370] wcwidth() does not treat DEFAULT_IGNORABLE_CODE_POINTs as zero-width
From: julesbertholet at quoi dot xyz @ 2024-02-14 20:47 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=31370

--- Comment #5 from Jules Bertholet <julesbertholet at quoi dot xyz> ---
> What do the libicu APIs return for these characters?

They don't. Unicode does not in general have a concept of how many "cells" a
character should take up in a terminal or with a fixed-width font. Some
characters are specified to be 0-width/non-advancing, and there is also a
limited concept of "East Asian Wide" defined by UTR 11
<https://www.unicode.org/reports/tr11/>. But the displayed width of many
characters depends on the context, surrounding characters, the particular font,
etc. `wcwidth()See <https://www.unicode.org/L2/L2023/23107-terminal-suppt.pdf>
for some more background.

* [Bug localedata/31370] wcwidth() does not treat DEFAULT_IGNORABLE_CODE_POINTs as zero-width
From: julesbertholet at quoi dot xyz @ 2024-02-14 20:49 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=31370

--- Comment #6 from Jules Bertholet <julesbertholet at quoi dot xyz> ---
(Apologies for the previous mangled message, I accidentally submitted while
editing) 

…The displayed width of many characters depends on the context, surrounding
characters, the particular font, etc. `wcwidth()` can only ever be "best
effort." See <https://www.unicode.org/L2/L2023/23107-terminal-suppt.pdf> for
some more background.

* [Bug localedata/31370] wcwidth() does not treat DEFAULT_IGNORABLE_CODE_POINTs as zero-width
From: carlos at redhat dot com @ 2024-02-16 17:43 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=31370

--- Comment #7 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Jules Bertholet from comment #6)
> (Apologies for the previous mangled message, I accidentally submitted while
> editing) 
> 
> …The displayed width of many characters depends on the context, surrounding
> characters, the particular font, etc. `wcwidth()` can only ever be "best
> effort." See <https://www.unicode.org/L2/L2023/23107-terminal-suppt.pdf> for
> some more background.

I agree completely.

One last question: Are we consistent with curses, GNOME, and the
implementations on Windows and Mac OS X?

While we might not be able to achieve this, it would be good to know whether
we can.

At the very least being internally consistent on Linux terminals would be good.
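
One way to collect data points for such a comparison is to ask the local C
library directly. The sketch below is an assumption-laden helper, not part of
any patch: it assumes a POSIX system where `ctypes.CDLL(None)` exposes the C
library's `wcwidth()`, the names `system_wcwidth` and `report` are
hypothetical, and the results depend on the installed libc version and the
active locale.

```python
import ctypes
import locale


def system_wcwidth(cp):
    """Return what the C library's wcwidth() reports for code point cp."""
    libc = ctypes.CDLL(None)  # symbols already loaded into this process (POSIX)
    libc.wcwidth.argtypes = [ctypes.c_wchar]
    libc.wcwidth.restype = ctypes.c_int
    return libc.wcwidth(chr(cp))


def report(code_points):
    """Map 'U+XXXX' labels to the widths reported under the user's locale."""
    try:
        # wcwidth() consults LC_CTYPE; pick it up from the environment.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # keep the current locale if the environment's is unavailable
    return {f"U+{cp:04X}": system_wcwidth(cp) for cp in code_points}


if __name__ == "__main__":
    for label, width in report([0x00AD, 0x115F, 0x3164, 0xFFA0]).items():
        print(label, width)
```

Running the same script on different systems, ideally under a UTF-8 locale,
gives a quick table of where implementations agree and where they differ.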

* [Bug localedata/31370] wcwidth() does not treat DEFAULT_IGNORABLE_CODE_POINTs as zero-width
From: julesbertholet at quoi dot xyz @ 2024-02-18 18:20 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=31370

Jules Bertholet <julesbertholet at quoi dot xyz> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |julesbertholet at quoi dot xyz
