[Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ?

public inbox for libc-locales@sourceware.org

* [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ?
@ 2017-09-03 20:42 vapier at gentoo dot org
  2017-09-03 21:01 ` [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): " vapier at gentoo dot org
                   ` (27 more replies)
  0 siblings, 28 replies; 31+ messages in thread
From: vapier at gentoo dot org @ 2017-09-03 20:42 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

            Bug ID: 22073
           Summary: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ?
           Product: glibc
           Version: 2.26
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: vapier at gentoo dot org
                CC: egmont at gmail dot com, libc-locales at sourceware dot org,
                    maiku.fabian at gmail dot com, tg at mirbsd dot de
        Depends on: 21750
  Target Milestone: ---

+++ This bug was initially created as a clone of Bug #21750 +++

I’ve compared the new autogenerated column width from
localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth()
implementation from xterm (adjusted to Unicode 10.0.0) and found a few
divergences (and bugs on my (MirBSD, which uses something based on xterm’s data
system-wide) side, which I fixed).

U+00AD is forced to width 1 in xterm, autodetected as combining in glibc

Rationale for forcing it to 1 is likely that U+0000‥U+00FF are latin1, which,
when displayed as 8bit on terminals, had no combining characters at all.

Change Request to glibc: force U+00AD to width 1.

more background discussion with different standards can be found here:
  https://www.cs.tut.fi/~jkorpela/shy.html


Referenced Bugs:

https://sourceware.org/bugzilla/show_bug.cgi?id=21750
[Bug 21750] column width of characters incompatible with classical wcwidth
-- 
You are receiving this mail because:
You are on the CC list for the bug.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
@ 2017-09-03 21:01 ` vapier at gentoo dot org
  2017-09-04  8:10 ` vapier at gentoo dot org
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: vapier at gentoo dot org @ 2017-09-03 21:01 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

Mike Frysinger <vapier at gentoo dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|charmaps/UTF-8: wcwidth of  |charmaps/UTF-8: wcwidth of
                   |U+00AD: 0 or 1 ?            |U+00AD (soft hyphen): 0 or
                   |                            |1 ?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
  2017-09-03 21:01 ` [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): " vapier at gentoo dot org
@ 2017-09-04  8:10 ` vapier at gentoo dot org
  2017-09-04  8:10 ` [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: " Troy Korjuslommi
                   ` (25 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: vapier at gentoo dot org @ 2017-09-04  8:10 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

Mike Frysinger <vapier at gentoo dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                URL|                            |https://www.cs.tut.fi/~jkor
                   |                            |pela/shy.html
           See Also|                            |https://github.com/jquast/w
                   |                            |cwidth/issues/8

--- Comment #1 from Mike Frysinger <vapier at gentoo dot org> ---
more discussion:
  https://github.com/jquast/wcwidth/issues/8


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
  2017-09-03 21:01 ` [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): " vapier at gentoo dot org
  2017-09-04  8:10 ` vapier at gentoo dot org
@ 2017-09-04  8:10 ` Troy Korjuslommi
  2017-09-04  8:12 ` [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): " tjk at tksoft dot com
                   ` (24 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: Troy Korjuslommi @ 2017-09-04  8:10 UTC (permalink / raw)
  To: vapier at gentoo dot org; +Cc: libc-locales

I reached a totally different conclusion from reading those links and
thinking of the wcwidth(SHY) situation for wcwidth().

When writing a curses/terminfo (terminal) application, one goes through
input and determines the width of text by iterating through the input
characters. If a word contains multiple U+00AD characters, at the end of
the line or not, the total width of the word ends up wrong if wcwidth is
set to 1. Therefore wcwidth(U+00AD) should return 0.
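
The width-summing loop described above can be sketched as follows. This is only an illustration: `toy_width` and `text_width` are hypothetical helpers, not glibc's wcwidth() or its tables, and every non-combining character is assumed to occupy one column.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"


def toy_width(ch, shy_width):
    """Hypothetical stand-in for wcwidth(); not glibc's actual table."""
    if ch == SOFT_HYPHEN:
        return shy_width          # the value under dispute: 0 or 1
    if unicodedata.combining(ch):
        return 0                  # combining marks take no column
    return 1                      # everything else: one column (simplification)


def text_width(text, shy_width):
    """Sum per-character widths, as a curses-style application would."""
    return sum(toy_width(ch, shy_width) for ch in text)


word = "hy\u00adphen\u00adation"        # 11 visible letters, 2 soft hyphens
print(text_width(word, shy_width=0))  # 11
print(text_width(word, shy_width=1))  # 13
```

With a width of 1 the sum exceeds the number of visible letters by the number of soft hyphens in the word, which is the discrepancy this message is about.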

Also, using a SHY (U+00AD) character as a rendering hint seems to make
sense, since if a word is broken up with SHY characters, then a SHY
aware application can determine where to break the word, adding a
visible hyphen only at that position. A SHY non-aware application can
just ignore the SHY.

The Korpela article shed light on the confusion standard writers have
had with the issue. It seems clear to me that their intention has been
to add a character which can be used as a hint for breaking words
according to hyphenation rules. The imprecise wording used for
describing the solution has led to the current confusion. We should get
past the semantics of the standards' phrases and focus on the intent,
which is to allow authors to add hyphenation hints to text. 

Troy

On Sun, 2017-09-03 at 20:42 +0000, vapier at gentoo dot org wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=22073
> 
>             Bug ID: 22073
>            Summary: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ?
>            Product: glibc
>            Version: 2.26
>             Status: NEW
>           Severity: normal
>           Priority: P2
>          Component: localedata
>           Assignee: unassigned at sourceware dot org
>           Reporter: vapier at gentoo dot org
>                 CC: egmont at gmail dot com, libc-locales at sourceware dot org,
>                     maiku.fabian at gmail dot com, tg at mirbsd dot de
>         Depends on: 21750
>   Target Milestone: ---
> 
> +++ This bug was initially created as a clone of Bug #21750 +++
> 
> I’ve compared the new autogenerated column width from
> localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth()
> implementation from xterm (adjusted to Unicode 10.0.0) and found a few
> divergences (and bugs on my (MirBSD, which uses something based on xterm’s data
> system-wide) side, which I fixed).
> 
> U+00AD is forced to width 1 in xterm, autodetected as combining in glibc
> 
> Rationale for forcing it to 1 is likely that U+0000‥U+00FF are latin1, which,
> when displayed as 8bit on terminals, had no combining characters at all.
> 
> Change Request to glibc: force U+00AD to width 1.
> 
> more background discussion with different standards can be found here:
>   https://www.cs.tut.fi/~jkorpela/shy.html
> 
> 
> Referenced Bugs:
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=21750
> [Bug 21750] column width of characters incompatible with classical wcwidth

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (2 preceding siblings ...)
  2017-09-04  8:10 ` [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: " Troy Korjuslommi
@ 2017-09-04  8:12 ` tjk at tksoft dot com
  2017-09-04  8:54 ` maiku.fabian at gmail dot com
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: tjk at tksoft dot com @ 2017-09-04  8:12 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #2 from Troy Korjuslommi <tjk at tksoft dot com> ---
I reached a totally different conclusion from reading those links and
thinking of the wcwidth(SHY) situation for wcwidth().

When writing a curses/terminfo (terminal) application, one goes through
input and determines the width of text by iterating through the input
characters. If a word contains multiple U+00AD characters, at the end of
the line or not, the total width of the word ends up wrong if wcwidth is
set to 1. Therefore wcwidth(U+00AD) should return 0.

Also, using a SHY (U+00AD) character as a rendering hint seems to make
sense, since if a word is broken up with SHY characters, then a SHY
aware application can determine where to break the word, adding a
visible hyphen only at that position. A SHY non-aware application can
just ignore the SHY.

The Korpela article shed light on the confusion standard writers have
had with the issue. It seems clear to me that their intention has been
to add a character which can be used as a hint for breaking words
according to hyphenation rules. The imprecise wording used for
describing the solution has led to the current confusion. We should get
past the semantics of the standards' phrases and focus on the intent,
which is to allow authors to add hyphenation hints to text. 


Troy




On Sun, 2017-09-03 at 20:42 +0000, vapier at gentoo dot org wrote:
> https://sourceware.org/bugzilla/show_bug.cgi?id=22073
> 
>             Bug ID: 22073
>            Summary: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ?
>            Product: glibc
>            Version: 2.26
>             Status: NEW
>           Severity: normal
>           Priority: P2
>          Component: localedata
>           Assignee: unassigned at sourceware dot org
>           Reporter: vapier at gentoo dot org
>                 CC: egmont at gmail dot com, libc-locales at sourceware dot org,
>                     maiku.fabian at gmail dot com, tg at mirbsd dot de
>         Depends on: 21750
>   Target Milestone: ---
> 
> +++ This bug was initially created as a clone of Bug #21750 +++
> 
> I’ve compared the new autogenerated column width from
> localedata/unicode-gen/utf8_gen.py with the results of the classical wcwidth()
> implementation from xterm (adjusted to Unicode 10.0.0) and found a few
> divergences (and bugs on my (MirBSD, which uses something based on xterm’s data
> system-wide) side, which I fixed).
> 
> U+00AD is forced to width 1 in xterm, autodetected as combining in glibc
> 
> Rationale for forcing it to 1 is likely that U+0000‥U+00FF are latin1, which,
> when displayed as 8bit on terminals, had no combining characters at all.
> 
> Change Request to glibc: force U+00AD to width 1.
> 
> more background discussion with different standards can be found here:
>   https://www.cs.tut.fi/~jkorpela/shy.html
> 
> 
> Referenced Bugs:
> 
> https://sourceware.org/bugzilla/show_bug.cgi?id=21750
> [Bug 21750] column width of characters incompatible with classical wcwidth


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (3 preceding siblings ...)
  2017-09-04  8:12 ` [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): " tjk at tksoft dot com
@ 2017-09-04  8:54 ` maiku.fabian at gmail dot com
  2017-09-05  9:14 ` tg at mirbsd dot de
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-09-04  8:54 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #3 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Currently, in glibc master, we have set the width of U+00AD to 1.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (4 preceding siblings ...)
  2017-09-04  8:54 ` maiku.fabian at gmail dot com
@ 2017-09-05  9:14 ` tg at mirbsd dot de
  2017-09-11 11:06   ` Troy Korjuslommi
  2017-09-05 15:54 ` vapier at gentoo dot org
                   ` (21 subsequent siblings)
  27 siblings, 1 reply; 31+ messages in thread
From: tg at mirbsd dot de @ 2017-09-05  9:14 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #4 from Thorsten Glaser <tg at mirbsd dot de> ---
> When writing a curses/terminfo (terminal) application, one goes through
> input and determines the width of text by iterating through the input
> characters. If a word contains multiple U+00AD characters, at the end of
> the line or not, the total width of the word ends up wrong if wcwidth is
> set to 1. Therefore wcwidth(U+00AD) should return 0.

In your reading, everything but the conclusion is* correct.

*) if the application uses the soft hyphen char as soft hyphen


Basically, if the application decides U+00AD is expanded into a hyphen,
it must send a hyphen, NOT U+00AD, to the terminal, and if not, it must
send no character to the terminal.
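
A minimal sketch of that rule, assuming an application that does its own line breaking: `break_word` is a hypothetical helper, and it counts one column per remaining character via len(), which is a simplification. Whatever it returns contains either a real hyphen at the chosen break or nothing at all in place of U+00AD.

```python
SOFT_HYPHEN = "\u00ad"


def break_word(word, columns):
    """Split a word at the last soft hyphen whose head plus '-' fits.

    Returns (head, tail). Neither part contains U+00AD: the chosen break
    becomes a real hyphen, every other soft hyphen is dropped.
    """
    parts = word.split(SOFT_HYPHEN)
    whole = "".join(parts)
    if len(whole) <= columns:
        return whole, ""              # fits on the line, no hyphen needed
    best = 0
    for i in range(1, len(parts)):
        if len("".join(parts[:i])) + 1 <= columns:   # +1 for the '-'
            best = i
    if best == 0:
        return "", whole              # no usable break point
    return "".join(parts[:best]) + "-", "".join(parts[best:])


print(break_word("hy\u00adphen\u00adation", 8))   # ('hyphen-', 'ation')
print(break_word("hy\u00adphen\u00adation", 5))   # ('hy-', 'phenation')
```

Under this design the terminal never receives U+00AD, so the value wcwidth() reports for it does not affect the layout this particular application computes.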

The reason here is that wcwidth() is the _width of the character ON THE
TERMINAL_ and not for use within the application. No terminal will break
on the soft hyphen, they’ll all break only at the last column in the
line; therefore, wcwidth of U+00AD *must* be 1.

Further reasons: compatibility with previous wcwidth implementations,
and that the first 256 chars are supposed to be latin1 which had a
wcwidth of 1 for all non-control characters (20‥7E, A0‥FF).
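
The latin1 rule stated above can be written out as a small table-free function. This is a sketch of the claim in this message, not of any shipped implementation; `latin1_width` is a hypothetical name, and the 0 for NUL and -1 for control bytes are assumptions borrowed from the usual wcwidth() conventions.

```python
def latin1_width(byte):
    """Column width for one ISO 8859-1 byte under the rule described above."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a single-byte latin1 value")
    if byte == 0x00:
        return 0                      # NUL (assumed, per wcwidth() convention)
    if 0x20 <= byte <= 0x7E or 0xA0 <= byte <= 0xFF:
        return 1                      # every non-control character, including 0xAD
    return -1                         # C0/C1 controls and DEL (assumed)


print(latin1_width(0xAD))  # 1
print(latin1_width(0x1B))  # -1
```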


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (5 preceding siblings ...)
  2017-09-05  9:14 ` tg at mirbsd dot de
@ 2017-09-05 15:54 ` vapier at gentoo dot org
  2017-09-06 14:41 ` tg at mirbsd dot de
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: vapier at gentoo dot org @ 2017-09-05 15:54 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #5 from Mike Frysinger <vapier at gentoo dot org> ---
i don't think we have a choice here.  if the rest of the world is converging on
the unicode standard view of the world, and it says 0, then we should do that
as well.  trying to "take a stand" here won't help as long as the unicode
consortium doesn't change, and i think they've settled the matter in their
eyes.  if you want to deliberate the topic further, it'd probably be better
spent doing so on their lists.

the unicode FAQ includes this entry [1] (which the korpela page called out):
Q: Unicode now treats the SOFT HYPHEN as format control (Cf) character when
formerly it was a punctuation character (Pd). Doesn't this break ISO 8859-1
compatibility?
A: No. The ISO 8859-1 standard defines the SOFT HYPHEN as "[a] graphic
character that is imaged by a graphic symbol identical with, or similar to,
that representing hyphen" (section 6.3.3), but does not specify details of how
or when it is to be displayed, nor other details of its semantics. The soft
hyphen has had a long history of legacy implementation in two or more
incompatible ways.
Unicode clarifies the semantics of this character for Unicode implementations,
but this does not affect its usage in ISO 8859-1 implementations. Processes
that convert back and forth may need to pay attention to semantic differences
between the standards, just as for any other character.
In a terminal emulation environment, particularly in ISO-8859-1 contexts, one
could display the soft hyphen as a hyphen in all circumstances. The change in
semantics of the Unicode character does not require that implementations of
terminal emulators in other environments, such as ISO 8859-1, make any change
in their current behavior.

[1] http://www.unicode.org/faq/casemap_charprop.html#18

i think that answers the question here: in our UTF-8 charmaps, we should mark
U+00AD as 0, but in our ISO 8859-1 (and other applicable legacy) charmaps, we
should mark it as 1.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (6 preceding siblings ...)
  2017-09-05 15:54 ` vapier at gentoo dot org
@ 2017-09-06 14:41 ` tg at mirbsd dot de
  2017-09-06 15:25 ` tg at mirbsd dot de
                   ` (19 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: tg at mirbsd dot de @ 2017-09-06 14:41 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #6 from Thorsten Glaser <tg at mirbsd dot de> ---
Unicode does NOT define the column width of a char in the terminal. This shows
in all those mailing list threads, in which they basically assume all fonts to
be proportional.

wcwidth() however basically *is* the column width of a char in the terminal in
a fixed-width cell layout.

The cōnsēnsus seems to be to ask _users_ to avoid using U+00AD because of the
two different histories in interpretation, and to use something else for the
separate purposes. That leaves us with needing a definition for this char
*should* it appear anywhere still.

I’m arguing for 1 because:

• 0 is for combining characters and NUL only
• the “possible soft hyphen” reading of U+00AD is not a combining character
• compatibility with previous/older/other wcwidth() implementations, most
importantly

The 0 fraction should not be at a loss here because:

• The char should be avoided already *anyway*
• Terminal emulators never implement wrapping at a “possible soft hyphen”, only
at the end of the line
• Unicode data is still available elsewhere, this bugreport is precisely about
wcwidth() which only “almost” aligns with the various Unicode datas (yes, I
know, wrong plural, but I can’t think of anything better to express what I
mean, right now)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (7 preceding siblings ...)
  2017-09-06 14:41 ` tg at mirbsd dot de
@ 2017-09-06 15:25 ` tg at mirbsd dot de
  2017-09-07  7:16 ` vapier at gentoo dot org
                   ` (18 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: tg at mirbsd dot de @ 2017-09-06 15:25 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #7 from Thorsten Glaser <tg at mirbsd dot de> ---
(In reply to Mike Frysinger from comment #5)

> i think that answers the question here: in our UTF-8 charmaps, we should
> mark U+00AD as 0, but in our ISO 8859-1 (and other applicable legacy)
> charmaps, we should mark it as 1.

That could get ugly, assume you have an application displaying latin1 data on a
UTF-8 terminal (GNU screen comes to mind, or luit from XFree86®). Those map
0xAD to U+00AD not U+002D…

Given that mfabian as localedata maintainer of sorts has already accepted the
change, does this really still need to be discussed? (The copyright form
arrived last night btw, I’m sending it back to the FSF ASAP.)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (8 preceding siblings ...)
  2017-09-06 15:25 ` tg at mirbsd dot de
@ 2017-09-07  7:16 ` vapier at gentoo dot org
  2017-09-07 10:20 ` tg at mirbsd dot de
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: vapier at gentoo dot org @ 2017-09-07  7:16 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #8 from Mike Frysinger <vapier at gentoo dot org> ---
(In reply to Thorsten Glaser from comment #6)

i'm aware wcwidth isn't explicitly defined by Unicode standards, but that
doesn't mean they completely ignore it.  they discuss terminal emulators
multiple times (including the SHY FAQ), and it's why things like
EastAsianWidth.txt exist in the first place.  it's also pretty clear what the
current Unicode standard is wrt their intentions to this codepoint.

> • 0 is for combining characters and NUL only

that is incorrect.  you mishandle Prepended_Concatenation_Mark (see bug 22070),
and ignore Format Character (Cf) characters which are all 0 (or you're
incorrectly claiming that Cf's are not combining characters).  U+00AD is
classified as Cf.

> • the “possible soft hyphen” reading of U+00AD is not a combining character

except that it is.  if Unicode wanted it to be an explicit hyphen, they would
have kept its class as Pd (punctuation character), not changed it to Cf (format
control).  they also wouldn't have described it explicitly as:
Soft Hyphen. Despite its name, U+00AD soft hyphen is not a hyphen, but rather
an invisible format character used to indicate optional intraword breaks.
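
The classification referred to here can be checked with Python's standard unicodedata module, shown only as a convenient way to read the Unicode Character Database; it says nothing about what glibc's wcwidth() returns. Note that the canonical combining class of U+00AD is 0, so "combining" in this thread is being used in a looser sense than that property.

```python
import unicodedata

shy = "\u00ad"
print(unicodedata.name(shy))       # SOFT HYPHEN
print(unicodedata.category(shy))   # Cf  (format control)
print(unicodedata.combining(shy))  # 0   (canonical combining class)
print(unicodedata.category("-"))   # Pd  (dash punctuation, for comparison)
```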

> • compatibility with previous/older/other wcwidth() implementations, most
> importantly

appealing to historical wcwidth behavior isn't a great argument.  ones written
to older Unicode standards are def wrong across many codepoints (emoji much?),
and as i already mentioned, implementations converge on the latest Unicode
releases.  all of which say this should be 0.

> • The char should be avoided already *anyway*
> • Terminal emulators never implement wrapping at a “possible soft hyphen”,
> only at the end of the line

then by your own argument, having it follow the Unicode standard is a non-issue

(In reply to Thorsten Glaser from comment #7)

if your terminal and the target application disagree about encoding then you've
already lost.  everything above 0x7F will be wrong (0x80 != U+0080 or 0xc2
0x80).


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (9 preceding siblings ...)
  2017-09-07  7:16 ` vapier at gentoo dot org
@ 2017-09-07 10:20 ` tg at mirbsd dot de
  2017-09-07 10:59 ` egmont at gmail dot com
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: tg at mirbsd dot de @ 2017-09-07 10:20 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #9 from Thorsten Glaser <tg at mirbsd dot de> ---
(In reply to Mike Frysinger from comment #8)

> > • 0 is for combining characters and NUL only
> 
> that is incorrect.  you mishandle Prepended_Concatenation_Mark (see bug
> 22070), and ignore Format Character (Cf) characters which are all 0 (or
> you're incorrectly claiming that Cf's are not combining characters).  and

OK, sorry about that. But xterm handles even those as such, basically
it combines the glyph for it (could be blank or just the dotted square)
over the preceding character, as they have no meaning for a terminal.

> > • compatibility with previous/older/other wcwidth() implementations, most
> > importantly
> 
> appealing to historical wcwidth behavior isn't a great argument.  ones

But this is more important than you make it sound.

> written to older Unicode standards

Sure, which is why I updated it to use the current Unicode data
as base, but there are a few cases which were specifically handled
explicitly different right from the start, and, with the changes
I described, mfabian’s code in glibc and mine in MirBSD come to
the same result modulo implementation differences.

(I also handle Prepended_Concatenation_Mark in MirBSD now in the
way you requested in bz#22070, so compatibility goes both ways.
My focus was on updating mgk25’s code in a compatible way, as to
not introduce any regressions; changes from later Unicode changes
are welcome, as are initial oversights such as this one (if it
existed back then), but as I said, U+00AD was special-handled
right from the beginning.)

> > • The char should be avoided already *anyway*
> > • Terminal emulators never implement wrapping at a “possible soft hyphen”,
> > only at the end of the line
> 
> then by your own argument, having it follow the Unicode standard is a

There is no Unicode standard for wcwidth().

> non-issue

It is an issue, because with 0, applications displaying a simple charmap
for the first page (i.e. latin1) fail on X'AD'.

> if your terminal and the target application disagree about encoding then
> you've already lost.  everything above 0x7F will be wrong (0x80 != U+0080 or
> 0xc2 0x80).

You did not understand what I wrote.

Tools like GNU screen and XFree86® luit can convert between the encodings,
so they’d convert an \xA0 from the program (meaning a 0xA0 in latin1) to
a U+00A0 internally to a \xC2\xA0 in UTF-8 to the screen, and back.

The *definition* of these mappings maps 0xAD from latin1 to U+00AD, not to
U+002D. (Changing _this_ would also be unwise as there’d be no way to type
latin1 0xAD any more.)
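
The conversion chain described above can be reproduced with Python's built-in codecs. This only illustrates the byte-level mapping; it is not how GNU screen or luit are implemented, and `latin1_to_utf8` is a hypothetical helper name.

```python
def latin1_to_utf8(data):
    """Decode ISO 8859-1 bytes to code points, then re-encode as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


print(latin1_to_utf8(b"\xa0"))                 # b'\xc2\xa0'
print(latin1_to_utf8(b"\xad"))                 # b'\xc2\xad'
print(b"\xad".decode("latin-1") == "\u00ad")   # True: 0xAD maps to U+00AD, not U+002D
```

The reverse direction, b"\xc2\xad".decode("utf-8").encode("latin-1"), yields b"\xad" again, so the mapping round-trips.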

Therefore, wcwidth(U+00AD) should stay at 1.

PS: Discussing this is really straining for me, and English is only my third
non-programming language, so please read anything weird as I mean it, not as I
formulated it.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (10 preceding siblings ...)
  2017-09-07 10:20 ` tg at mirbsd dot de
@ 2017-09-07 10:59 ` egmont at gmail dot com
  2017-09-07 16:55 ` egmont at gmail dot com
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: egmont at gmail dot com @ 2017-09-07 10:59 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #10 from Egmont Koblinger <egmont at gmail dot com> ---
(In reply to Thorsten Glaser from comment #6)

> • The char should be avoided already *anyway*

Just wondering, isn't perhaps iswprint(0xAD) = 0, wcwidth(0xAD) = -1 also a
sensible solution worth considering?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (11 preceding siblings ...)
  2017-09-07 10:59 ` egmont at gmail dot com
@ 2017-09-07 16:55 ` egmont at gmail dot com
  2017-09-07 20:19 ` egmont at gmail dot com
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: egmont at gmail dot com @ 2017-09-07 16:55 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #11 from Egmont Koblinger <egmont at gmail dot com> ---
To clarify my previous comment:

If compatibility is a concern then let's go with 1, I'm absolutely fine with
that.

If compatibility is not so much of a concern, it feels to me that -1 is a more
reasonable choice than 0.

Basically, out of the three possibilities 0 is the one I find the least
reasonable. As for the other two, my gut feeling tells me to go with the
backwards-compatible 1; however, you guys have far better arguments pro and con
than a gut feeling, so I cannot join that discussion.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (12 preceding siblings ...)
  2017-09-07 16:55 ` egmont at gmail dot com
@ 2017-09-07 20:19 ` egmont at gmail dot com
  2017-09-07 20:25 ` maiku.fabian at gmail dot com
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: egmont at gmail dot com @ 2017-09-07 20:19 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #12 from Egmont Koblinger <egmont at gmail dot com> ---
To further clarify:

By "compatibility" I meant compatibility with existing legacy apps.

Compatibility with (or let's rather say: proper implementation of) the recent
Unicode standard, if we're okay with dropping backwards compatibility, is where
I feel -1 might perhaps be the best choice.

Either way, I can't really see 0 being justified.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (13 preceding siblings ...)
  2017-09-07 20:19 ` egmont at gmail dot com
@ 2017-09-07 20:25 ` maiku.fabian at gmail dot com
  2017-09-07 20:47 ` egmont at gmail dot com
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-09-07 20:25 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #13 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Egmont Koblinger from comment #11)
> To clarify my previous comment:
> 
> If compatibility is a concern then let's go with 1, I'm absolutely fine with
> that.
> 
> If compatibility is not so much of a concern, it feels to me that -1 is a more
> reasonable choice than 0.


> Basically, out of the three possibilities 0 is the one I find the least
> reasonable. As for the other two, my gut feeling tells me to go with the
> backwards compatible 1; however, you guys have way better arguments pro or
> con than gut feeling, so I cannot join that discussion.

From the man page of wcwidth:

     The wcwidth() function returns  the number of columns needed
     to represent the wide character c.  If c is a printable wide
     character,  the value  is at  least 0.   If c  is null  wide
     character  (L'\0'),  the  value  is  0.   Otherwise,  -1  is
     returned.

The soft hyphen is printable (it is in the section “print” of LC_CTYPE
in localedata/locales/i18n), therefore the value returned by wcwidth()
is at least 0.  So -1 is not possible for the soft hyphen.
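
To make the argument above concrete, here is a minimal sketch of the man-page wording as a predicate. The function name and its parameters are hypothetical and only model the quoted contract; this is not how glibc implements wcwidth().

```python
def satisfies_wcwidth_contract(width, is_printable, is_null=False):
    """Check a candidate return value against the man-page wording (illustrative only)."""
    if is_null:
        return width == 0        # the null wide character must map to 0
    if is_printable:
        return width >= 0        # printable characters: at least 0
    return width == -1           # everything else: -1

# U+00AD is listed as printable in the locale data, as stated above,
# so only non-negative candidates survive.
allowed = [w for w in (-1, 0, 1) if satisfies_wcwidth_contract(w, is_printable=True)]
print(allowed)  # [0, 1]
```

Under this reading, returning -1 would only become possible if the character were also removed from the "print" class.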


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (14 preceding siblings ...)
  2017-09-07 20:25 ` maiku.fabian at gmail dot com
@ 2017-09-07 20:47 ` egmont at gmail dot com
  2017-09-08  3:12 ` egmont at gmail dot com
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: egmont at gmail dot com @ 2017-09-07 20:47 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #14 from Egmont Koblinger <egmont at gmail dot com> ---
My suggestion for consideration was to make wcwidth() return -1 and, at the
same time, mark the character as non-printable. As far as I understand, properly
written apps should never print this character, right?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (15 preceding siblings ...)
  2017-09-07 20:47 ` egmont at gmail dot com
@ 2017-09-08  3:12 ` egmont at gmail dot com
  2017-09-09 10:06 ` maiku.fabian at gmail dot com
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: egmont at gmail dot com @ 2017-09-08  3:12 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #15 from Egmont Koblinger <egmont at gmail dot com> ---
Don't get me wrong... I'm not saying that this is the solution we should go
with. I'm not arguing that -1 is the best choice. I just wanted to make sure
that this possibility is also considered.

If the final decision is 1, I'm absolutely fine with that.

If the final decision is 0, I wouldn't be that happy, because I think -1 would
then be a better choice; however, I'd still accept that decision.

I just wanted to give a heads-up about a third possibility that probably
wouldn't have been considered otherwise. The rest is up to you guys. Thanks for
listening to me! :)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (16 preceding siblings ...)
  2017-09-08  3:12 ` egmont at gmail dot com
@ 2017-09-09 10:06 ` maiku.fabian at gmail dot com
  2017-09-11 11:24 ` tjk at tksoft dot com
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-09-09 10:06 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #16 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Egmont Koblinger from comment #14)
> My suggestion for consideration was to make wcwidth() return -1 and, at the
> same time, mark the character as non-printable. As far as I understand,
> properly written apps should never print this character, right?

https://www.cs.tut.fi/~jkorpela/shy.html quotes the ISO 8859-1 standard as:

>  The ISO 8859-1 standard defines, in section 6.3.3, both the graphic
>  presentation and the usage of soft hyphen, as follows:
> 
>     A graphic character that is imaged by a graphic symbol identical
>     with, or similar to, that representing hyphen, for use when a line
>     break has been established within a word.

So according to this, it should be printable.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (17 preceding siblings ...)
  2017-09-09 10:06 ` maiku.fabian at gmail dot com
@ 2017-09-11 11:24 ` tjk at tksoft dot com
  2017-09-11 20:08 ` tg at mirbsd dot de
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: tjk at tksoft dot com @ 2017-09-11 11:24 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #17 from Troy Korjuslommi <tjk at tksoft dot com> ---
I would like to point out that wcwidth of 1 for SHY would mean that
applications which haven't taken soft hyphens into consideration, as
they are rare in actual input, will display words with SHY in them very
awkwardly. Namely, as "the-os-o-phy" or "the os o phy." The actual
display will of course depend on the font in use. It can resemble a
hyphen or a space. Applications which are SHY-aware will of course
handle it separately, either breaking the word and adding a hyphen or
ignoring it.

I might add that I speculate that the reason SHY is so rarely used is
because of these kinds of disagreements over its display.

I don't see any disagreement over the intent of the SHY, so why not make
the lives of writers (who could then start including SHY in text) and
programmers (who would then find it worthwhile to write special handlers
for SHY) easier?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (18 preceding siblings ...)
  2017-09-11 11:24 ` tjk at tksoft dot com
@ 2017-09-11 20:08 ` tg at mirbsd dot de
  2017-09-14 12:51   ` Troy Korjuslommi
  2017-09-14 13:04 ` tjk at tksoft dot com
                   ` (7 subsequent siblings)
  27 siblings, 1 reply; 31+ messages in thread
From: tg at mirbsd dot de @ 2017-09-11 20:08 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #18 from Thorsten Glaser <tg at mirbsd dot de> ---
(In reply to Troy Korjuslommi from comment #17)

> I don't see any disagreement over the intent of the SHY, so why not make
> the lives of writers (who could then start including SHY in text) and

That is done even when wcwidth(U+00AD) == 1, because the application would
never send U+00AD to the tty but always either no character or one of the
other hyphen-ish codepoints.

In fact, if a font renders U+00AD different from U+002D, an SHY-aware
application might even PREFER it to have wcwidth 1 because then it COULD
send U+00AD to the tty *in the places where it expands to a hyphenation*
(and just omit it where not).

For GUI applications, wcwidth() is of no meaning anyway.

> awkwardly. Namely, as "the-os-o-phy" or "the os o phy." The actual

“the-os-o-phy” (“theo-sophy” is how I’d split it, though) is common for
the soft hyphenation point editing mode of word processors.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (19 preceding siblings ...)
  2017-09-11 20:08 ` tg at mirbsd dot de
@ 2017-09-14 13:04 ` tjk at tksoft dot com
  2017-09-14 13:45 ` egmont at gmail dot com
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: tjk at tksoft dot com @ 2017-09-14 13:04 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #19 from Troy Korjuslommi <tjk at tksoft dot com> ---
I was referring to non-SHY-aware apps. When iterating through input in
curses code, one needs wcwidth() for at least two reasons. One is to
calculate space needed to display a word, and the other is to determine
the position of the cursor (only applicable when input contains 2 column
wide characters). If SHY is wcwidth other than 0, the non-SHY-aware
applications will calculate the width incorrectly.
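
As an illustration of the width calculation described above, here is a rough sketch of what a non-SHY-aware app does when it sums column widths. The helper name `display_columns` is made up for this example, and it assumes every character other than U+00AD occupies exactly one column, which holds for this ASCII word but not in general.

```python
SHY = "\u00ad"  # SOFT HYPHEN

def display_columns(text, shy_width):
    """Columns a naive app would reserve for text, given the width assigned to SHY.

    Assumption: all other characters are one column wide.
    """
    return sum(shy_width if ch == SHY else 1 for ch in text)

word = "the\u00ado\u00adso\u00adphy"   # "theosophy" with three soft hyphens

print(display_columns(word, 0))  # 9
print(display_columns(word, 1))  # 12
```

Whether 9 or 12 is "correct" depends on what the terminal actually does with the three U+00AD characters when the app passes them through unchanged.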

A non-SHY-aware application could easily add the U+00AD to the terminal,
and thus possibly cause cursor movement, and maybe even character
rendering, to occur.

An author who cares about grammar would actually hyphenate theosophy as
"the-o-so-phy." That was kind of my point: words with more than two
syllables have two or more hyphens, and hyphenation is a rule-based
system, non-obvious and hard to guess, which is why SHY can be a useful
tool.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (20 preceding siblings ...)
  2017-09-14 13:04 ` tjk at tksoft dot com
@ 2017-09-14 13:45 ` egmont at gmail dot com
  2017-09-14 13:45 ` maiku.fabian at gmail dot com
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: egmont at gmail dot com @ 2017-09-14 13:45 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #20 from Egmont Koblinger <egmont at gmail dot com> ---
(In reply to Troy Korjuslommi from comment #19)

> A non-SHY-aware application could easily add the U+00AD to the terminal,
> and thus possibly cause cursor movement, and maybe even character
> rendering, to occur.

There's two sides to this story: apps and terminal emulators. You seem to care
about apps here, and forgot that altering wcwidth might have an effect on
terminal emulators' behavior as well.

If all parties respect wcwidth() then either 0 or 1 is okay. In case of 0 the
terminal emulator will not print anything nor advance the cursor, in accordance
with what the app expects. In case of 1 the outcome again will be correct.
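
A small sketch of that agreement requirement, under the simplifying assumption that all characters except U+00AD are one column wide. The names `cursor_column` and `drift` are hypothetical; the point is only that the cursor ends up where the app expects whenever both sides use the same value, and is off by one column per soft hyphen when they do not.

```python
SHY = "\u00ad"  # SOFT HYPHEN

def cursor_column(text, shy_width):
    """Column reached after printing text from column 0 (one column per non-SHY char)."""
    return sum(shy_width if ch == SHY else 1 for ch in text)

def drift(text, app_shy_width, terminal_shy_width):
    """Actual cursor column in the terminal minus the column the app predicted."""
    return cursor_column(text, terminal_shy_width) - cursor_column(text, app_shy_width)

line = "co\u00adop"
print(drift(line, 0, 0))  # 0
print(drift(line, 1, 1))  # 0
print(drift(line, 0, 1))  # 1
print(drift(line, 1, 0))  # -1
```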

The real task is to foresee the impact on apps as well as terminal emulators
(and their combinations) that use hardcoded values rather than wcwidth(). For
example, I don't know whether xterm always uses its built-in table, or only in
certain cases; nor whether its author is open to adjusting the table to follow
what gets decided in this bugreport. There's also vte's (gnome-terminal's)
issue of using glib's method instead, but I can most likely change that if
really needed.

(Plus, again, let's not forget about the case of ssh'ing between different
systems, potentially either one not even glibc-based.)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (21 preceding siblings ...)
  2017-09-14 13:45 ` egmont at gmail dot com
@ 2017-09-14 13:45 ` maiku.fabian at gmail dot com
  2017-09-14 16:38 ` tg at mirbsd dot de
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-09-14 13:45 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073
Bug 22073 depends on bug 21750, which changed state.

Bug 21750 Summary: column width of characters incompatible with classical wcwidth
https://sourceware.org/bugzilla/show_bug.cgi?id=21750

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |FIXED


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (22 preceding siblings ...)
  2017-09-14 13:45 ` maiku.fabian at gmail dot com
@ 2017-09-14 16:38 ` tg at mirbsd dot de
  2017-09-14 19:03 ` egmont at gmail dot com
                   ` (3 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: tg at mirbsd dot de @ 2017-09-14 16:38 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #21 from Thorsten Glaser <tg at mirbsd dot de> ---
(In reply to Egmont Koblinger from comment #20)

> (In reply to Troy Korjuslommi from comment #19)
> 
> > A non-SHY-aware application could easily add the U+00AD to the terminal,
> > and thus possibly cause cursor movement, and maybe even character
> > rendering, to occur.

Yes, that would be correct. The terminal is, in your terminology, *also*
a non-SHY-aware application.

> > wide characters). If SHY is wcwidth other than 0, the non-SHY-aware
> > applications will calculate the width incorrectly.

No, actually, if wcwidth is anything other than *1* they will calculate
it incorrectly, because, to a terminal, the character will always have
a constant width. (If wcwidth were 0 and an SHY-aware application were
to send U+00AD to the terminal in the place where a break DOES occur,
the terminal could NOT emit a space-using glyph otherwise.)

> There's two sides to this story: apps and terminal emulators. You seem to
> care about apps here, and forgot that altering wcwidth might have an effect
> on terminal emulators' behavior as well.
> 
> If all parties respect wcwidth() then either 0 or 1 is okay. In case of 0

Indeed, both use wcwidth() and thus have to agree.

> (Plus, again, let's not forget about the case of ssh'ing between different
> systems, potentially either one not even glibc-based.)

One more point in favour of letting it stay at 1 to stay compatible with
everyone else in the world including previous releases.

> to follow what gets decided in this bugreport. There's also vte's
> (gnome-terminal's) issue of using glib's method instead, but I can most
> likely change that if really needed.

Either that, or add special handling of a couple of characters to vte…
it’ll likely handle stuff like direction changes or so already if it’s
not just a dumb terminal like xterm, so there’s bound to be a correct
place for it.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (23 preceding siblings ...)
  2017-09-14 16:38 ` tg at mirbsd dot de
@ 2017-09-14 19:03 ` egmont at gmail dot com
  2017-09-15  8:22 ` tg at mirbsd dot de
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 31+ messages in thread
From: egmont at gmail dot com @ 2017-09-14 19:03 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #22 from Egmont Koblinger <egmont at gmail dot com> ---
(In reply to Thorsten Glaser from comment #21)

> Yes, that would be correct. The terminal is, in your terminology, *also*
> a non-SHY-aware application.

I'd rather not define this concept for terminals. They cannot make a choice in
the sense client apps can.

Also for conciseness I'd reserve the word "app" or "application" for the client
app that's running inside the terminal emulator, and not the terminal emulator
itself in this discussion.

> No, actually, if wcwidth is anything other than *1* they will calculate
> it incorrectly, because, to a terminal, the character will always have
> a constant width. (If wcwidth were 0 and an SHY-aware application were
> to send U+00AD to the terminal in the place where a break DOES occur,
> the terminal could NOT emit a space-using glyph otherwise.)

Nope. SHY-aware apps by definition never send SHY to the terminal, they either
send a regular hyphen U+2D or nothing at all, that's what makes them SHY-aware.
(Especially since in several fonts the glyph of SHY is empty, it looks like a
space.) If an app ever sends a SHY to the terminal emulator, it is SHY-unaware.

Hence for SHY-aware apps, wcwidth() of SHY is irrelevant.
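
To illustrate what "SHY-aware" means in this sense, here is a hedged sketch of a word splitter that never emits U+00AD: it either drops the soft hyphens or breaks at one of them and prints U+002D. The function name `render_word` and its `room` parameter are invented for this example, and widths are approximated with `len()`, i.e. one column per character.

```python
SHY = "\u00ad"  # SOFT HYPHEN

def render_word(word, room):
    """Split word for a line with `room` free columns.

    Returns (head, tail): head goes on the current line, tail on the next.
    Neither part ever contains U+00AD.
    """
    parts = word.split(SHY)
    plain = "".join(parts)
    if len(plain) <= room:
        return plain, ""                      # fits: soft hyphens simply vanish
    # Try the latest break opportunity first, adding a visible hyphen.
    for i in range(len(parts) - 1, 0, -1):
        head = "".join(parts[:i]) + "-"
        if len(head) <= room:
            return head, "".join(parts[i:])
    return "", plain                          # no break fits: move the whole word

word = "the\u00ado\u00adso\u00adphy"
print(render_word(word, 9))  # ('theosophy', '')
print(render_word(word, 7))  # ('theoso-', 'phy')
print(render_word(word, 5))  # ('theo-', 'sophy')
print(render_word(word, 3))  # ('', 'theosophy')
```

Since such an app only ever sends ordinary letters and U+002D, the value wcwidth() reports for U+00AD never enters its calculation.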

For SHY-unaware ones it's important that what the application thinks will
happen matches with what really happens in the terminal emulator. Both the
application and the terminal emulator may or may not rely on wcwidth(), or the
app may even rely on wcwidth() of a remote system.

> One more point in favour of letting it stay at 1 to stay compatible with
> everyone else in the world including previous releases.

I'm not arguing against 1 at all. In fact, my gut feeling tells me to go with 1
rather than 0. I just wouldn't want 0 to be dismissed on the basis of invalid
arguments.

> Either that, or add special handling of a couple of characters to vte…
> it’ll likely handle stuff like direction changes or so already if it’s
> not just a dumb terminal like xterm, so there’s bound to be a correct
> place for it.

There's no BiDi in VTE; anyway, I wouldn't want to pollute this bugreport with
this.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (24 preceding siblings ...)
  2017-09-14 19:03 ` egmont at gmail dot com
@ 2017-09-15  8:22 ` tg at mirbsd dot de
  2017-09-15  8:24 ` maiku.fabian at gmail dot com
  2017-09-19 10:04 ` maiku.fabian at gmail dot com
  27 siblings, 0 replies; 31+ messages in thread
From: tg at mirbsd dot de @ 2017-09-15  8:22 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #23 from Thorsten Glaser <tg at mirbsd dot de> ---
> Nope. SHY-aware apps by definition never send SHY to the terminal, they
> either send a regular hyphen U+2D or nothing at all, that's what makes them
> SHY-aware. (Especially since in several fonts the glyph of SHY is empty, it
> looks like a space.) If an app ever sends a SHY to the terminal emulator, it
> is SHY-unaware.
>
> Hence for SHY-aware apps, wcwidth() of SHY is irrelevant.

OK, granted, if that is the sense, you are, of course, correct.
(But that also means that, if it’s irrelevant for them, which,
again, if they send U+002D to the terminal instead, it is, then
all the more reason to stick to 1.)


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (25 preceding siblings ...)
  2017-09-15  8:22 ` tg at mirbsd dot de
@ 2017-09-15  8:24 ` maiku.fabian at gmail dot com
  2017-09-19 10:04 ` maiku.fabian at gmail dot com
  27 siblings, 0 replies; 31+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-09-15  8:24 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

--- Comment #24 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Thorsten Glaser from comment #23)
> > Nope. SHY-aware apps by definition never send SHY to the terminal, they
> > either send a regular hyphen U+2D or nothing at all, that's what makes them
> > SHY-aware. (Especially since in several fonts the glyph of SHY is empty, it
> > looks like a space.) If an app ever sends a SHY to the terminal emulator, it
> > is SHY-unaware.
> >
> > Hence for SHY-aware apps, wcwidth() of SHY is irrelevant.
> 
> OK, granted, if that is the sense, you are, of course, correct.
> (But that also means that, if it’s irrelevant for them, which,
> again, if they send U+002D to the terminal instead, it is, then
> all the more reason to stick to 1.)

Yes, that is really a good reason to stick to 1.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): 0 or 1 ?
  2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
                   ` (26 preceding siblings ...)
  2017-09-15  8:24 ` maiku.fabian at gmail dot com
@ 2017-09-19 10:04 ` maiku.fabian at gmail dot com
  27 siblings, 0 replies; 31+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-09-19 10:04 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22073

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #25 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Mike FABIAN from comment #24)
> (In reply to Thorsten Glaser from comment #23)
> > > Nope. SHY-aware apps by definition never send SHY to the terminal, they
> > > either send a regular hyphen U+2D or nothing at all, that's what makes them
> > > SHY-aware. (Especially since in several fonts the glyph of SHY is empty, it
> > > looks like a space.) If an app ever sends a SHY to the terminal emulator, it
> > > is SHY-unaware.
> > >
> > > Hence for SHY-aware apps, wcwidth() of SHY is irrelevant.
> > 
> > OK, granted, if that is the sense, you are, of course, correct.
> > (But that also means that, if it’s irrelevant for them, which,
> > again, if they send U+002D to the terminal instead, it is, then
> > all the more reason to stick to 1.)
> 
> Yes, that is really a good reason to stick to 1.

So it looks like we have reached some agreement that width 1 is OK
for the soft hyphen and I can close this bug as FIXED, right?

Closing as FIXED.


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-09-15  8:24 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-03 20:42 [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: 0 or 1 ? vapier at gentoo dot org
2017-09-03 21:01 ` [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): " vapier at gentoo dot org
2017-09-04  8:10 ` vapier at gentoo dot org
2017-09-04  8:10 ` [Bug localedata/22073] New: charmaps/UTF-8: wcwidth of U+00AD: " Troy Korjuslommi
2017-09-04  8:12 ` [Bug localedata/22073] charmaps/UTF-8: wcwidth of U+00AD (soft hyphen): " tjk at tksoft dot com
2017-09-04  8:54 ` maiku.fabian at gmail dot com
2017-09-05  9:14 ` tg at mirbsd dot de
2017-09-11 11:06   ` Troy Korjuslommi
2017-09-05 15:54 ` vapier at gentoo dot org
2017-09-06 14:41 ` tg at mirbsd dot de
2017-09-06 15:25 ` tg at mirbsd dot de
2017-09-07  7:16 ` vapier at gentoo dot org
2017-09-07 10:20 ` tg at mirbsd dot de
2017-09-07 10:59 ` egmont at gmail dot com
2017-09-07 16:55 ` egmont at gmail dot com
2017-09-07 20:19 ` egmont at gmail dot com
2017-09-07 20:25 ` maiku.fabian at gmail dot com
2017-09-07 20:47 ` egmont at gmail dot com
2017-09-08  3:12 ` egmont at gmail dot com
2017-09-09 10:06 ` maiku.fabian at gmail dot com
2017-09-11 11:24 ` tjk at tksoft dot com
2017-09-11 20:08 ` tg at mirbsd dot de
2017-09-14 12:51   ` Troy Korjuslommi
2017-09-14 13:04 ` tjk at tksoft dot com
2017-09-14 13:45 ` egmont at gmail dot com
2017-09-14 13:45 ` maiku.fabian at gmail dot com
2017-09-14 16:38 ` tg at mirbsd dot de
2017-09-14 19:03 ` egmont at gmail dot com
2017-09-15  8:22 ` tg at mirbsd dot de
2017-09-15  8:24 ` maiku.fabian at gmail dot com
2017-09-19 10:04 ` maiku.fabian at gmail dot com
