public inbox for libc-locales@sourceware.org
* [Bug localedata/22070] New: charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1)
@ 2017-09-03 20:36 vapier at gentoo dot org
  2017-09-04 14:44 ` [Bug localedata/22070] " maiku.fabian at gmail dot com
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: vapier at gentoo dot org @ 2017-09-03 20:36 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22070

            Bug ID: 22070
           Summary: charmaps/UTF-8: wcwidth for
                    Prepended_Concatenation_Mark codepoints set to 0
                    (should be 1)
           Product: glibc
           Version: unspecified
            Status: NEW
          Severity: normal
          Priority: P2
         Component: localedata
          Assignee: unassigned at sourceware dot org
          Reporter: vapier at gentoo dot org
                CC: libc-locales at sourceware dot org
  Target Milestone: ---

we currently mark all Cf (Format Character) as width 0, but this ignores
Prepended_Concatenation_Mark codepoints.  specifically these should all have a
wcwidth of 1:
0600..0605 ; Prepended_Concatenation_Mark # Cf  ARABIC NUMBER SIGN..ARABIC NUMBER MARK ABOVE
06DD       ; Prepended_Concatenation_Mark # Cf  ARABIC END OF AYAH
070F       ; Prepended_Concatenation_Mark # Cf  SYRIAC ABBREVIATION MARK
08E2       ; Prepended_Concatenation_Mark # Cf  ARABIC DISPUTED END OF AYAH
110BD      ; Prepended_Concatenation_Mark # Cf  KAITHI NUMBER SIGN

Unicode 10.0.0 chapter 9 section 2 page 377-378 [1] states:
Signs Spanning Numbers. Several other special signs are written in association
with numbers in the Arabic script. All of these signs can span multiple-digit
numbers, rather than just a single digit. They are not formally considered
combining marks in the sense used by the Unicode Standard, although they
clearly interact graphically with their associated sequence of digits. In the
text representation they precede the sequence of digits that they span, rather
than follow a base character, as would be the case for a combining mark. Their
General_Category value is Cf (format character). Unlike most other format
characters, however, they should be rendered with a visible glyph, even in
circumstances where no suitable digit or sequence of digits follows them in
logical order. The characters have the Bidi_Class value of Arabic_Number to
make them appear in the same run as the numbers following them.

A few similar signs spanning numbers or letters are associated with scripts
other than Arabic. See the discussion of U+070F SYRIAC ABBREVIATION MARK in
Section 9.3, Syriac, and the discussion of U+110BD KAITHI NUMBER SIGN in
Section 15.2, Kaithi. All of these prefixed format controls, including the
non-Arabic ones, are given the property value
Prepended_Concatenation_Mark=True, to identify them as a class. They also have
special behavior in text segmentation. (See Unicode Standard Annex #29,
“Unicode Text Segmentation.”)

[1] http://unicode.org/versions/Unicode10.0.0/ch09.pdf
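
The rule proposed in this report can be sketched in a few lines of Python. This is only an illustration of the intended width assignment, not the code in utf8_gen.py: the names PREPENDED_CONCATENATION_MARKS and format_char_width are hypothetical, the code point list is copied by hand from the report above, and the category lookup relies on whatever Unicode version the local Python's unicodedata module ships.

```python
import unicodedata

# Code points with Prepended_Concatenation_Mark=True, copied from the
# list in this report (hypothetical constant name, hand-written data).
PREPENDED_CONCATENATION_MARKS = (
    set(range(0x0600, 0x0605 + 1)) | {0x06DD, 0x070F, 0x08E2, 0x110BD}
)


def format_char_width(code_point):
    """Return the proposed width for a Cf character, or None for non-Cf."""
    if unicodedata.category(chr(code_point)) != 'Cf':
        return None  # this rule only covers format characters
    if code_point in PREPENDED_CONCATENATION_MARKS:
        return 1  # rendered with a visible glyph, so it takes a column
    return 0  # ordinary format characters stay zero-width
```

With this sketch, U+0600 ARABIC NUMBER SIGN gets width 1, while a format character outside the list, such as U+200E LEFT-TO-RIGHT MARK, keeps width 0.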

-- 
You are receiving this mail because:
You are on the CC list for the bug.


* [Bug localedata/22070] charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1)
  2017-09-03 20:36 [Bug localedata/22070] New: charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1) vapier at gentoo dot org
@ 2017-09-04 14:44 ` maiku.fabian at gmail dot com
  2017-09-04 16:48 ` maiku.fabian at gmail dot com
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-09-04 14:44 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22070

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |maiku.fabian at gmail dot com
           Assignee|unassigned at sourceware dot org   |maiku.fabian at gmail dot com


* [Bug localedata/22070] charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1)
  2017-09-03 20:36 [Bug localedata/22070] New: charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1) vapier at gentoo dot org
  2017-09-04 14:44 ` [Bug localedata/22070] " maiku.fabian at gmail dot com
@ 2017-09-04 16:48 ` maiku.fabian at gmail dot com
  2017-09-06  0:20 ` vapier at gentoo dot org
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-09-04 16:48 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22070

--- Comment #1 from Mike FABIAN <maiku.fabian at gmail dot com> ---
(In reply to Mike Frysinger from comment #0)
> we currently mark all Cf (Format Character) as width 0, but this ignores
> Prepended_Concatenation_Mark codepoints.  specifically these should all have
> a wcwidth of 1:
> 0600..0605 ; Prepended_Concatenation_Mark # Cf  ARABIC NUMBER SIGN..ARABIC NUMBER MARK ABOVE
> 06DD       ; Prepended_Concatenation_Mark # Cf  ARABIC END OF AYAH
> 070F       ; Prepended_Concatenation_Mark # Cf  SYRIAC ABBREVIATION MARK
> 08E2       ; Prepended_Concatenation_Mark # Cf  ARABIC DISPUTED END OF AYAH
> 110BD      ; Prepended_Concatenation_Mark # Cf  KAITHI NUMBER SIGN

This list is from

ftp://ftp.unicode.org/Public/10.0.0/ucd/PropList.txt

So maybe I should add that file as well to glibc/localedata/unicode-gen/
and parse it in glibc/localedata/unicode-gen/utf8_gen.py ?

(If this list never changes I could also hardcode it in utf8_gen.py).


* [Bug localedata/22070] charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1)
  2017-09-03 20:36 [Bug localedata/22070] New: charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1) vapier at gentoo dot org
  2017-09-04 14:44 ` [Bug localedata/22070] " maiku.fabian at gmail dot com
  2017-09-04 16:48 ` maiku.fabian at gmail dot com
@ 2017-09-06  0:20 ` vapier at gentoo dot org
  2017-09-06  9:49 ` maiku.fabian at gmail dot com
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: vapier at gentoo dot org @ 2017-09-06  0:20 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22070

--- Comment #2 from Mike Frysinger <vapier at gentoo dot org> ---
(In reply to Mike FABIAN from comment #1)

right, thought i had mentioned PropList.txt here, but guess not.

we should update the python code to parse that db directly instead of
hardcoding anything.  the latest update was in Unicode 5.2 (110BD), but it's
trivial to parse, so might as well be future proof.
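
A minimal sketch of what parsing PropList.txt for this property could look like is given below. It is not the patch attached to this bug: the names PROP_LINE, parse_property and SAMPLE are hypothetical, and the sample is a short excerpt in the PropList.txt layout (the entries quoted in comment #0 plus one White_Space line to show filtering) rather than the real file.

```python
import re

# Matches PropList.txt data lines such as
#   0600..0605 ; Prepended_Concatenation_Mark # Cf ...
#   06DD       ; Prepended_Concatenation_Mark # Cf ...
PROP_LINE = re.compile(
    r'^(?P<first>[0-9A-F]{4,6})(?:\.\.(?P<last>[0-9A-F]{4,6}))?'
    r'\s*;\s*(?P<prop>\w+)'
)


def parse_property(lines, wanted='Prepended_Concatenation_Mark'):
    """Collect every code point that carries the property `wanted`."""
    code_points = set()
    for line in lines:
        match = PROP_LINE.match(line)
        if not match or match.group('prop') != wanted:
            continue  # comment, blank line, or a different property
        first = int(match.group('first'), 16)
        # A single code point has no ".." part, so reuse `first`.
        last = int(match.group('last') or match.group('first'), 16)
        code_points.update(range(first, last + 1))
    return code_points


# Illustrative excerpt only; the real input would be the full file.
SAMPLE = """\
# excerpt in the PropList.txt layout
0009..000D ; White_Space # Cc  <control-0009>..<control-000D>
0600..0605 ; Prepended_Concatenation_Mark # Cf  ARABIC NUMBER SIGN..ARABIC NUMBER MARK ABOVE
06DD       ; Prepended_Concatenation_Mark # Cf  ARABIC END OF AYAH
070F       ; Prepended_Concatenation_Mark # Cf  SYRIAC ABBREVIATION MARK
08E2       ; Prepended_Concatenation_Mark # Cf  ARABIC DISPUTED END OF AYAH
110BD      ; Prepended_Concatenation_Mark # Cf  KAITHI NUMBER SIGN
""".splitlines()

marks = parse_property(SAMPLE)
```

Reading the property from the data file rather than hardcoding the list means a later Unicode release that adds entries would be picked up by regenerating, assuming the file layout stays the same.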


* [Bug localedata/22070] charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1)
  2017-09-03 20:36 [Bug localedata/22070] New: charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1) vapier at gentoo dot org
                   ` (2 preceding siblings ...)
  2017-09-06  0:20 ` vapier at gentoo dot org
@ 2017-09-06  9:49 ` maiku.fabian at gmail dot com
  2017-09-06 11:14 ` cvs-commit at gcc dot gnu.org
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-09-06  9:49 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22070

--- Comment #3 from Mike FABIAN <maiku.fabian at gmail dot com> ---
Created attachment 10397
  --> https://sourceware.org/bugzilla/attachment.cgi?id=10397&action=edit
0001-Improve-utf8_gen.py-to-set-the-width-for-characters-.patch

This improvement to the script generating charmaps/UTF-8 
fixes the problem.


* [Bug localedata/22070] charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1)
  2017-09-03 20:36 [Bug localedata/22070] New: charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1) vapier at gentoo dot org
                   ` (4 preceding siblings ...)
  2017-09-06 11:14 ` cvs-commit at gcc dot gnu.org
@ 2017-09-06 11:14 ` maiku.fabian at gmail dot com
  2017-09-06 13:14 ` maiku.fabian at gmail dot com
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-09-06 11:14 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22070

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED


* [Bug localedata/22070] charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1)
  2017-09-03 20:36 [Bug localedata/22070] New: charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1) vapier at gentoo dot org
                   ` (3 preceding siblings ...)
  2017-09-06  9:49 ` maiku.fabian at gmail dot com
@ 2017-09-06 11:14 ` cvs-commit at gcc dot gnu.org
  2017-09-06 11:14 ` maiku.fabian at gmail dot com
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2017-09-06 11:14 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22070

--- Comment #4 from cvs-commit at gcc dot gnu.org <cvs-commit at gcc dot gnu.org> ---
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "GNU C Library master sources".

The branch, master has been updated
       via  2ae5be041d9ea89cdd0f37734d72051e8f773947 (commit)
       via  af83ed5c4647bda196fc1a7efebbe8019aa83f4a (commit)
      from  4f3647e46e3f645c6516faa299efc6e89d520d7b (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=2ae5be041d9ea89cdd0f37734d72051e8f773947

commit 2ae5be041d9ea89cdd0f37734d72051e8f773947
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Wed Sep 6 11:19:33 2017 +0200

    Improve utf8_gen.py to set the width for characters with
    Prepended_Concatenation_Mark property to 1

        [BZ #22070]
        * localedata/unicode-gen/utf8_gen.py: Set the width for
        characters with Prepended_Concatenation_Mark property to 1
        * localedata/charmaps/UTF-8: Updated using the improved script.

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=af83ed5c4647bda196fc1a7efebbe8019aa83f4a

commit af83ed5c4647bda196fc1a7efebbe8019aa83f4a
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Fri Aug 18 10:12:29 2017 +0200

    Write all ranges of neighbouring characters with the same width using the
    range notation in charmaps/UTF-8

    Writing ranges of neighbouring characters with the same width like this

        <U000E0100>...<U000E01EF>       0

    in charmaps/UTF-8 is more efficient than writing many single character
    lines like:

        <U000E0100>     0
        <U000E0101>     0
        ...

        [BZ #21750]
        * unicode-gen/utf8_gen.py: Write all ranges of neighbouring characters
        with the same width using the range notation in charmaps/UTF-8.

-----------------------------------------------------------------------

Summary of changes:
 ChangeLog                           |   14 +
 localedata/charmaps/UTF-8           |   10 +-
 localedata/unicode-gen/Makefile     |    4 +-
 localedata/unicode-gen/PropList.txt | 1618 +++++++++++++++++++++++++++++++++++
 localedata/unicode-gen/utf8_gen.py  |   84 ++-
 5 files changed, 1704 insertions(+), 26 deletions(-)
 create mode 100644 localedata/unicode-gen/PropList.txt
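
The range notation described in the second commit log above can be sketched as follows. This is an illustration under stated assumptions, not the code from utf8_gen.py: the helper names ucs_symbol, format_entry and width_lines are hypothetical, the symbol format (four hex digits in the BMP, eight above it) is assumed from the example in the log, and a tab is used as the separator.

```python
def ucs_symbol(code_point):
    """Format a code point as <UXXXX> (BMP) or <UXXXXXXXX> (above the BMP)."""
    if code_point < 0x10000:
        return '<U{:04X}>'.format(code_point)
    return '<U{:08X}>'.format(code_point)


def format_entry(first, last, width):
    """Format one width line, using range notation when first != last."""
    if first == last:
        return '{}\t{}'.format(ucs_symbol(first), width)
    return '{}...{}\t{}'.format(ucs_symbol(first), ucs_symbol(last), width)


def width_lines(widths):
    """Merge neighbouring code points that share a width into ranges.

    `widths` maps code point (int) to width (int).
    """
    lines = []
    start = prev = None
    for code_point in sorted(widths):
        same_run = (
            start is not None
            and code_point == prev + 1
            and widths[code_point] == widths[start]
        )
        if same_run:
            prev = code_point  # extend the current range
            continue
        if start is not None:
            lines.append(format_entry(start, prev, widths[start]))
        start = prev = code_point  # begin a new range
    if start is not None:
        lines.append(format_entry(start, prev, widths[start]))
    return lines
```

Feeding it the 240 code points U+E0100 to U+E01EF, all with width 0, yields the single range line shown in the commit log instead of 240 separate lines.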


* [Bug localedata/22070] charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1)
  2017-09-03 20:36 [Bug localedata/22070] New: charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1) vapier at gentoo dot org
                   ` (5 preceding siblings ...)
  2017-09-06 11:14 ` maiku.fabian at gmail dot com
@ 2017-09-06 13:14 ` maiku.fabian at gmail dot com
  2017-09-07  3:07 ` maiku.fabian at gmail dot com
  2017-09-07  7:53 ` maiku.fabian at gmail dot com
  8 siblings, 0 replies; 10+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-09-06 13:14 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22070

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #5 from Mike FABIAN <maiku.fabian at gmail dot com> ---
FIXED.


* [Bug localedata/22070] charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1)
  2017-09-03 20:36 [Bug localedata/22070] New: charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1) vapier at gentoo dot org
                   ` (6 preceding siblings ...)
  2017-09-06 13:14 ` maiku.fabian at gmail dot com
@ 2017-09-07  3:07 ` maiku.fabian at gmail dot com
  2017-09-07  7:53 ` maiku.fabian at gmail dot com
  8 siblings, 0 replies; 10+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-09-07  3:07 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22070

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tg at mirbsd dot de


* [Bug localedata/22070] charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1)
  2017-09-03 20:36 [Bug localedata/22070] New: charmaps/UTF-8: wcwidth for Prepended_Concatenation_Mark codepoints set to 0 (should be 1) vapier at gentoo dot org
                   ` (7 preceding siblings ...)
  2017-09-07  3:07 ` maiku.fabian at gmail dot com
@ 2017-09-07  7:53 ` maiku.fabian at gmail dot com
  8 siblings, 0 replies; 10+ messages in thread
From: maiku.fabian at gmail dot com @ 2017-09-07  7:53 UTC (permalink / raw)
  To: libc-locales

https://sourceware.org/bugzilla/show_bug.cgi?id=22070

Mike FABIAN <maiku.fabian at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |2.27


end of thread, other threads:[~2017-09-07  7:51 UTC | newest]
