From: Jules Bertholet <julesbertholet@quoi.xyz>
To: libc-alpha@sourceware.org
Cc: Carlos O'Donnell <carlos@redhat.com>,
Mike Fabian <maiku.fabian@gmail.com>,
libc-locales@sourceware.org,
Jules Bertholet <julesbertholet@quoi.xyz>
Subject: [PATCH][v2] localedata: Fix several issues with the set of characters considered 0-width [BZ #31370]
Date: Sun, 18 Feb 2024 18:54:09 +0000 (UTC)
Message-ID: <20240218185326.16663-1-julesbertholet@quoi.xyz>
In-Reply-To: <20240211190202.414300-2-julesbertholet@quoi.xyz>
This new version of the patch has a more detailed commit message,
and includes one more related fix.
---
Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that characters with the `Default_Ignorable_Code_Point` property
> should be rendered as completely invisible (and non advancing, i.e. “zero width”), if not explicitly supported in rendering.
Hence, `wcwidth()` should give them all a width of 0, with two exceptions:
- the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding precedent
- U+115F HANGUL CHOSEONG FILLER needs a carveout due to the unique behavior of the conjoining Korean jamo characters.
One composed Hangul "syllable block" like 퓛 is made up of two to three individual component characters, or "jamo".
These are all assigned an `East_Asian_Width` of `Wide` by Unicode, which would normally mean they would all be assigned width 2 by glibc;
a combination of (leading choseong jamo) + (medial jungseong jamo) + (trailing jongseong jamo) would then have width 2 + 2 + 2 = 6.
However, glibc (and other wcwidth implementations) special-cases jungseong and jongseong, assigning them all width 0,
to ensure that the complete block has width 2 + 0 + 0 = 2 as it should.
U+115F is meant for use in syllable blocks that are intentionally missing a leading jamo;
it must be assigned a width of 2 even though it has no visible display to ensure that the complete block has width 2.
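
The arithmetic above can be sketched in Python. This is only an illustration, not glibc's implementation: `sketch_width` and `block_width` are hypothetical helpers, and the sketch assumes that the conjoining jungseong and jongseong are exactly U+1160..U+11FF (it covers only the main Hangul Jamo block).

```python
import unicodedata

def sketch_width(cp):
    """Hypothetical per-code-point width, Hangul Jamo block only."""
    if cp == 0x115F:             # HANGUL CHOSEONG FILLER: invisible, but width 2
        return 2
    if 0x1100 <= cp <= 0x115E:   # leading choseong: East_Asian_Width = Wide
        return 2
    if 0x1160 <= cp <= 0x11FF:   # jungseong and jongseong: special-cased to 0
        return 0
    return 1                     # default width (enough for this sketch)

def block_width(text):
    """Sum of the per-code-point widths of a string."""
    return sum(sketch_width(ord(ch)) for ch in text)

# NFD splits the precomposed syllable U+D4DB (퓛) into its conjoining jamo.
jamo = unicodedata.normalize("NFD", "\uD4DB")
print(len(jamo), block_width(jamo))      # three jamo, 2 + 0 + 0

# A block that intentionally lacks a leading jamo starts with U+115F instead.
filler_block = "\u115F" + jamo[1:]
print(block_width(filler_block))         # still 2 + 0 + 0
```

If U+115F were given width 0 like the other default-ignorable code points, the second block would sum to 0 rather than 2.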
U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER, by contrast, are not conjoining jamo and need no such carveout, yet `wcwidth()` currently (before this patch) incorrectly assigns them non-zero width;
this commit gives them width 0.
You can read more about Unicode jamo in the Unicode spec, sections 3.12 <https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G24646> and 18.6 <https://www.unicode.org/versions/Unicode15.0.0/ch18.pdf#G31028>,
and about `Default_Ignorable_Code_Point` in §5.21 <https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095>.
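
As a rough sketch of the rule and its two exceptions, the following Python derives widths from a few lines in `DerivedCoreProperties.txt` format. The sample lines are hand-picked for illustration, and `default_ignorable_widths` is a hypothetical helper, not the function in `utf8_gen.py`.

```python
import re

# Sample lines in DerivedCoreProperties.txt format (illustrative subset).
SAMPLE = """\
00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN
115F..1160    ; Default_Ignorable_Code_Point # Lo   [2] HANGUL CHOSEONG FILLER..HANGUL JUNGSEONG FILLER
2065          ; Default_Ignorable_Code_Point # Cn       <reserved-2065>
3164          ; Default_Ignorable_Code_Point # Lo       HANGUL FILLER
FFA0          ; Default_Ignorable_Code_Point # Lo       HALFWIDTH HANGUL FILLER
""".splitlines()

def default_ignorable_widths(lines):
    """Map each assigned Default_Ignorable_Code_Point to a width."""
    widths = {}
    for line in lines:
        if re.match(r'.*<reserved-.+>', line):
            continue  # skip unassigned (reserved) code points
        if not re.match(r'^[^;]*;\s*Default_Ignorable_Code_Point', line):
            continue
        # The first field is either "XXXX" or a range "XXXX..YYYY".
        first, _, last = line.split(";")[0].strip().partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            widths[cp] = 0
    widths[0x115F] = 2         # exception: HANGUL CHOSEONG FILLER
    widths.pop(0x00AD, None)   # exception: SOFT HYPHEN keeps default width 1
    return widths

for cp, width in sorted(default_ignorable_widths(SAMPLE).items()):
    print(f"U+{cp:04X} -> {width}")
```

Code points missing from the returned mapping fall back to the default width of 1, which is how the soft hyphen ends up with width 1 in this sketch.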
---
The Unicode Standard, §5.21 - Characters Ignored for Display <https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G40095> says the following:
> A small number of format characters (General_Category = Cf ) are also not given the Default_Ignorable_Code_Point property.
> This may surprise implementers, who often assume that all format characters are generally ignored in fallback display.
> The exact list of these exceptional format characters can be found in the Unicode Character Database.
> There are, however, three important sets of such format characters to note:
>
> - prepended concatenation marks
> - interlinear annotation characters
> - Egyptian hieroglyph format controls
>
> The prepended concatenation marks always have a visible display.
> See “Prepended Concatenation Marks” in [*Section 23.2, Layout Controls*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.35858.HeadingBreak.132.Layout.Controls)
> for more discussion of the use and display of these signs.
>
> The other two notable sets of format characters that exceptionally are not ignored in fallback display consist of the interlinear annotation characters,
> U+FFF9 INTERLINEAR ANNOTATION ANCHOR through U+FFFB INTERLINEAR ANNOTATION TERMINATOR,
> and the Egyptian hieroglyph format controls, U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE.
> These characters should have a visible glyph display for fallback rendering, because if they are not displayed,
> it is too easy to misread the resulting displayed text.
> See “Annotation Characters” in [*Section 23.8, Specials*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.21335.Heading.133.Specials),
> as well as [*Section 11.4, Egyptian Hieroglyphs*](https://www.unicode.org/versions/Unicode15.1.0/ch11.pdf#M9.73291.Heading.1418.Egyptian.Hieroglyphs)
> for more discussion of the use and display of these characters.
glibc currently correctly assigns non-zero width to the prepended concatenation marks,
but it incorrectly gives zero width to the interlinear annotation characters (which a generic terminal cannot interpret)
and the Egyptian hieroglyph format controls (which are not widely supported in rendering implementations at present).
This commit fixes both these issues as well.
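
A minimal sketch of the resulting rule for format characters follows. The property values in `EXAMPLES` were copied by hand for illustration and are an assumption of this sketch, `format_char_width` is a hypothetical helper, and East Asian width is ignored entirely.

```python
# (code point, General_Category, has Default_Ignorable_Code_Point?)
EXAMPLES = [
    (0x200B, "Cf", True),    # ZERO WIDTH SPACE
    (0x0600, "Cf", False),   # ARABIC NUMBER SIGN (prepended concatenation mark)
    (0xFFF9, "Cf", False),   # INTERLINEAR ANNOTATION ANCHOR
    (0x13430, "Cf", False),  # EGYPTIAN HIEROGLYPH VERTICAL JOINER
    (0x13440, "Mn", False),  # EGYPTIAN HIEROGLYPH MIRROR HORIZONTALLY
]

def format_char_width(general_category, default_ignorable):
    """Width 0 for default-ignorables and Mn/Me marks; Cf alone is not enough."""
    if default_ignorable or general_category in ("Mn", "Me"):
        return 0
    return 1

widths = {cp: format_char_width(gc, di) for cp, gc, di in EXAMPLES}
for cp, width in widths.items():
    print(f"U+{cp:04X} -> {width}")
```

Under the old rule, which tested for `Cf` directly, the three non-ignorable format characters in this table would all have come out as width 0.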
Signed-off-by: Jules Bertholet <julesbertholet@quoi.xyz>
---
localedata/charmaps/UTF-8 | 21 ++++++----
localedata/unicode-gen/Makefile | 2 +
localedata/unicode-gen/utf8_gen.py | 67 +++++++++++++++++-------------
3 files changed, 53 insertions(+), 37 deletions(-)
diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index bd8075f20d..f3fcd64fce 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -49842,12 +49842,17 @@ END CHARMAP
% Character width according to Unicode 15.0.0.
% - Default width is 1.
+% - U+115F HANGUL CHOSEONG FILLER has width 2.
+% - Combining jungseong and jongseong Hangul jamo have width 0.
+% - U+00AD SOFT HYPHEN has width 1.
% - Double-width characters have width 2; generated from
% "grep '^[^;]*;[WF]' EastAsianWidth.txt"
-% - Non-spacing characters have width 0; generated from PropList.txt or
-% "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
-% - Format control characters have width 0; generated from
-% "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
+% - Non-spacing marks have width 0; generated from
+% "grep '^[^;]*;[^;]*;Mn;' UnicodeData.txt"
+% - Enclosing marks have width 0; generated from
+% "grep '^[^;]*;[^;]*;Me;' UnicodeData.txt"
+% - "Default_Ignorable_Code_Point"s have width 0; generated from
+% "grep '^[^;]*;\s*Default_Ignorable_Code_Point' DerivedCoreProperties.txt"
WIDTH
<U0300>...<U036F> 0
<U0483>...<U0489> 0
@@ -50069,7 +50074,9 @@ WIDTH
<U3099>...<U309A> 0
<U309B>...<U30FF> 2
<U3105>...<U312F> 2
-<U3131>...<U318E> 2
+<U3131>...<U3163> 2
+<U3164> 0
+<U3165>...<U318E> 2
<U3190>...<U31E3> 2
<U31F0>...<U321E> 2
<U3220>...<UA48C> 2
@@ -50124,8 +50131,8 @@ WIDTH
<UFE68>...<UFE6B> 2
<UFEFF> 0
<UFF01>...<UFF60> 2
+<UFFA0> 0
<UFFE0>...<UFFE6> 2
-<UFFF9>...<UFFFB> 0
<U000101FD> 0
<U000102E0> 0
<U00010376>...<U0001037A> 0
@@ -50226,7 +50233,7 @@ WIDTH
<U00011F36>...<U00011F3A> 0
<U00011F40> 0
<U00011F42> 0
-<U00013430>...<U00013440> 0
+<U00013440> 0
<U00013447>...<U00013455> 0
<U00016AF0>...<U00016AF4> 0
<U00016B30>...<U00016B36> 0
diff --git a/localedata/unicode-gen/Makefile b/localedata/unicode-gen/Makefile
index fd0c732ac4..1975065679 100644
--- a/localedata/unicode-gen/Makefile
+++ b/localedata/unicode-gen/Makefile
@@ -1,4 +1,5 @@
# Copyright (C) 2015-2023 Free Software Foundation, Inc.
+# Copyright (C) 2024 The GNU Toolchain Authors.
# This file is part of the GNU C Library.
# The GNU C Library is free software; you can redistribute it and/or
@@ -94,6 +95,7 @@ UTF-8: UnicodeData.txt EastAsianWidth.txt
UTF-8: utf8_gen.py
$(PYTHON3) utf8_gen.py -u UnicodeData.txt \
-e EastAsianWidth.txt -p PropList.txt \
+ -d DerivedCoreProperties.txt \
--unicode_version $(UNICODE_VERSION)
UTF-8-report: UTF-8 ../charmaps/UTF-8
diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index b48dc2aaa4..eedf6eadb0 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -1,6 +1,7 @@
#!/usr/bin/python3
# -*- coding: utf-8 -*-
# Copyright (C) 2014-2023 Free Software Foundation, Inc.
+# Copyright (C) 2024 The GNU Toolchain Authors.
# This file is part of the GNU C Library.
#
# The GNU C Library is free software; you can redistribute it and/or
@@ -28,7 +29,6 @@ It will output UTF-8 file
'''
import argparse
-import sys
import re
import unicode_utils
@@ -203,25 +203,24 @@ def write_header_width(outfile, unicode_version):
     outfile.write('% Character width according to Unicode '
                   + '{:s}.\n'.format(unicode_version))
     outfile.write('% - Default width is 1.\n')
+    outfile.write('% - U+115F HANGUL CHOSEONG FILLER has width 2.\n')
+    outfile.write('% - Combining jungseong and jongseong Hangul jamo have width 0.\n')
+    outfile.write('% - U+00AD SOFT HYPHEN has width 1.\n')
     outfile.write('% - Double-width characters have width 2; generated from\n')
     outfile.write('% "grep \'^[^;]*;[WF]\' EastAsianWidth.txt"\n')
-    outfile.write('% - Non-spacing characters have width 0; '
-                  + 'generated from PropList.txt or\n')
-    outfile.write('% "grep \'^[^;]*;[^;]*;[^;]*;[^;]*;NSM;\' '
-                  + 'UnicodeData.txt"\n')
-    outfile.write('% - Format control characters have width 0; '
-                  + 'generated from\n')
-    outfile.write("% \"grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt\"\n")
-# Not needed covered by Cf
-# outfile.write("% - Zero width characters have width 0; generated from\n")
-# outfile.write("% \"grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt\"\n")
+    outfile.write('% - Non-spacing marks have width 0; generated from\n')
+    outfile.write('% "grep \'^[^;]*;[^;]*;Mn;\' UnicodeData.txt"\n')
+    outfile.write('% - Enclosing marks have width 0; generated from\n')
+    outfile.write('% "grep \'^[^;]*;[^;]*;Me;\' UnicodeData.txt"\n')
+    outfile.write('% - "Default_Ignorable_Code_Point"s have width 0; generated from\n')
+    outfile.write("% \"grep '^[^;]*;\\s*Default_Ignorable_Code_Point' DerivedCoreProperties.txt\"\n")
     outfile.write("WIDTH\n")
-def process_width(outfile, ulines, elines, plines):
+def process_width(outfile, ulines, elines, dlines):
     '''ulines are lines from UnicodeData.txt, elines are lines from
-    EastAsianWidth.txt containing characters with width “W” or “F”,
-    plines are lines from PropList.txt which contain characters
-    with the property “Prepended_Concatenation_Mark”.
+    EastAsianWidth.txt containing characters with width “W” or “F”.
+    dlines are lines from DerivedCoreProperties.txt which contain
+    characters with the property “Default_Ignorable_Code_Point”.
     '''
width_dict = {}
@@ -237,12 +236,12 @@ def process_width(outfile, ulines, elines, plines):
     for line in ulines:
         fields = line.split(";")
-        if fields[4] == "NSM" or fields[2] in ("Cf", "Me", "Mn"):
+        if fields[4] == "NSM" or fields[2] in ("Me", "Mn"):
             width_dict[int(fields[0], 16)] = 0
-    for line in plines:
-        # Characters with the property “Prepended_Concatenation_Mark”
-        # should have the width 1:
+    for line in dlines:
+        # Characters with the property “Default_Ignorable_Code_Point”
+        # should have the width 0:
         fields = line.split(";")
         if not '..' in fields[0]:
             code_points = (fields[0], fields[0])
@@ -250,7 +249,13 @@ def process_width(outfile, ulines, elines, plines):
             code_points = fields[0].split("..")
         for key in range(int(code_points[0], 16),
                          int(code_points[1], 16)+1):
-            del width_dict[key] # default width is 1
+            width_dict[key] = 0 # Default_Ignorable_Code_Point: width 0
+
+    # special case: U+115F HANGUL CHOSEONG FILLER
+    # combines with other Hangul jamo to form a width-2
+    # syllable block, so treat it as width 2
+    # despite it being a `Default_Ignorable_Code_Point`
+    width_dict[0x115F] = 2
     # handle special cases for compatibility
     for key in list((0x00AD,)):
@@ -302,7 +307,7 @@ def process_width(outfile, ulines, elines, plines):
 if __name__ == "__main__":
     PARSER = argparse.ArgumentParser(
         description='''
-        Generate a UTF-8 file from UnicodeData.txt, EastAsianWidth.txt, and PropList.txt.
+        Generate a UTF-8 file from UnicodeData.txt, DerivedCoreProperties.txt, and EastAsianWidth.txt
         ''')
     PARSER.add_argument(
         '-u', '--unicode_data_file',
@@ -319,11 +324,11 @@ if __name__ == "__main__":
         help=('The EastAsianWidth.txt file to read, '
               + 'default: %(default)s'))
     PARSER.add_argument(
-        '-p', '--prop_list_file',
+        '-d', '--derived_core_properties_file',
         nargs='?',
         type=str,
-        default='PropList.txt',
-        help=('The PropList.txt file to read, '
+        default='DerivedCoreProperties.txt',
+        help=('The DerivedCoreProperties.txt file to read, '
               + 'default: %(default)s'))
     PARSER.add_argument(
         '--unicode_version',
@@ -352,11 +357,13 @@ if __name__ == "__main__":
                 continue
             if re.match(r'^[^;]*;[WF]', LINE):
                 EAST_ASIAN_WIDTH_LINES.append(LINE.strip())
-    with open(ARGS.prop_list_file, mode='r') as PROP_LIST_FILE:
-        PROP_LIST_LINES = []
-        for LINE in PROP_LIST_FILE:
-            if re.match(r'^[^;]*;[\s]*Prepended_Concatenation_Mark', LINE):
-                PROP_LIST_LINES.append(LINE.strip())
+    with open(ARGS.derived_core_properties_file, mode='r') as DERIVED_CORE_PROPERTIES_FILE:
+        DERIVED_CORE_PROPERTIES_LINES = []
+        for LINE in DERIVED_CORE_PROPERTIES_FILE:
+            if re.match(r'.*<reserved-.+>', LINE):
+                continue
+            if re.match(r'^[^;]*;\s*Default_Ignorable_Code_Point', LINE):
+                DERIVED_CORE_PROPERTIES_LINES.append(LINE.strip())
     with open('UTF-8', mode='w') as OUTFILE:
         # Processing UnicodeData.txt and write CHARMAP to UTF-8 file
         write_header_charmap(OUTFILE)
@@ -367,5 +374,5 @@ if __name__ == "__main__":
         process_width(OUTFILE,
                       UNICODE_DATA_LINES,
                       EAST_ASIAN_WIDTH_LINES,
-                      PROP_LIST_LINES)
+                      DERIVED_CORE_PROPERTIES_LINES)
         OUTFILE.write("END WIDTH\n")
--
2.43.1
Thread overview (4+ messages):
[not found] <20240211190202.414300-2-julesbertholet@quoi.xyz>
2024-02-18 18:54 ` Jules Bertholet [this message]
2024-02-20 12:57 ` Arjun Shankar
2024-02-23 20:54 ` [PATCH][v3] " Jules Bertholet
2024-02-28 17:21 ` Mike FABIAN