MIME-Version: 1.0
References: <20240211190202.414300-2-julesbertholet@quoi.xyz> <20240218185326.16663-1-julesbertholet@quoi.xyz>
In-Reply-To:
 <20240218185326.16663-1-julesbertholet@quoi.xyz>
From: Arjun Shankar
Date: Tue, 20 Feb 2024 13:57:26 +0100
Subject: Re: [PATCH][v2] localedata: Fix several issues with the set of characters considered 0-width [BZ #31370]
To: Jules Bertholet
Cc: libc-alpha@sourceware.org, "Carlos O'Donnell", Mike Fabian, libc-locales@sourceware.org

Hi Jules,

> This new version of the patch has a more detailed commit message,
> and includes one more related fix.

Thanks for working on this!

It looks like your patch doesn't currently apply to master, due to a
couple of intervening changes to utf8_gen.py (and the generated UTF-8
file) along with some copyright line changes, so it will need an update.
Anyway, I resolved the conflict by hand and continued reviewing so I
could offer some feedback in this iteration itself. I've found some
issues that I've mentioned below, inline with the patch content.

>
> ---

The first occurrence of "---" makes git-am drop the rest of the body
from the commit message. I usually put all my non-commit-message-relevant
notes after the first "---" printed by git's email tools when composing a
patch post.

> Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that characters with the `Default_Ignorable_Code_Point` property
>
> > should be rendered as completely invisible (and non advancing, i.e. “zero width”), if not explicitly supported in rendering.
>
> Hence, `wcwidth()` should give them all a width of 0, with two exceptions:
>
> - the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding precedent
> - U+115F HANGUL CHOSEONG FILLER needs a carveout due to the unique behavior of the conjoining Korean jamo characters.
>   One composed Hangul "syllable block" like 퓛 is made up of two to three individual component characters, or "jamo".
>   These are all assigned an `East_Asian_Width` of `Wide` by Unicode, which would normally mean they would all be assigned width 2 by glibc;
>   a combination of (leading choseong jamo) + (medial jungseong jamo) + (trailing jongseong jamo) would then have width 2 + 2 + 2 = 6.
>   However, glibc (and other wcwidth implementations) special-cases jungseong and jongseong, assigning them all width 0,
>   to ensure that the complete block has width 2 + 0 + 0 = 2 as it should.
>   U+115F is meant for use in syllable blocks that are intentionally missing a leading jamo;
>   it must be assigned a width of 2 even though it has no visible display to ensure that the complete block has width 2.

OK. I assume this simply explains current and correct behaviour. I'm
wondering if some of this could instead be used to expand the existing
comments in the `write_header_width' function of utf8_gen.py.

> However, `wcwidth()` currently (before this patch) incorrectly assigns non-zero width to U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER;
> this commit fixes that.

OK. I'll look for this change below.

> You can read more about Unicode jamo in the Unicode spec, sections 3.12 <https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf#G24646> and 18.6,
> and about `Default_Ignorable_Code_Point` in §5.21.

I suggest replacing the "You can read more" with a bulleted list of
references.
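
The width arithmetic in the quoted commit message can be checked with a
small sketch. This is not glibc or utf8_gen.py code: `sketch_width` and
`block_width` are hypothetical helpers, and the table only covers the
conjoining jamo in the Hangul Jamo block (U+1100..U+11FF), using the
widths the commit message describes. The sample block is the one from
the commit message (U+1111 U+1171 U+11B6).

```python
def sketch_width(cp: int) -> int:
    """Width this sketch assigns to one conjoining jamo code point.

    Assumption: a hand-written table for U+1100..U+11FF only, following
    the rule described in the commit message (leading jamo are wide,
    medial and trailing jamo are zero width).
    """
    if 0x1100 <= cp <= 0x115F:   # choseong (leading), including the U+115F filler
        return 2
    if 0x1160 <= cp <= 0x11FF:   # jungseong (medial) and jongseong (trailing)
        return 0
    raise ValueError(f"U+{cp:04X} is outside this sketch's table")


def block_width(text: str) -> int:
    """Sum the per-code-point widths of a syllable block."""
    return sum(sketch_width(ord(ch)) for ch in text)


full_block = "\u1111\u1171\u11b6"     # leading + medial + trailing
filler_block = "\u115f\u1171\u11b6"   # U+115F stands in for the missing leading jamo

print(2 * len(full_block))        # naive East_Asian_Width=Wide for all three: 6
print(block_width(full_block))    # 2 + 0 + 0 = 2
print(block_width(filler_block))  # 2 + 0 + 0 = 2, only because U+115F keeps width 2
```

If U+115F were given width 0 like the other default-ignorable code
points, the second block would sum to 0 columns, which is the case the
carveout is meant to avoid.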
>
> ---
> The Unicode Standard, §5.21 - Characters Ignored for Display says the following:
>
> > A small number of format characters (General_Category = Cf ) are also not given the Default_Ignorable_Code_Point property.
> > This may surprise implementers, who often assume that all format characters are generally ignored in fallback display.
> > The exact list of these exceptional format characters can be found in the Unicode Character Database.
> > There are, however, three important sets of such format characters to note:
> >
> > - prepended concatenation marks
> > - interlinear annotation characters
> > - Egyptian hieroglyph format controls
> >
> > The prepended concatenation marks always have a visible display.
> > See “Prepended Concatenation Marks” in [*Section 23.2, Layout Controls*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.35858.HeadingBreak.132.Layout.Controls)
> > for more discussion of the use and display of these signs.
> >
> > The other two notable sets of format characters that exceptionally are not ignored in fallback display consist of the interlinear annotation characters,
> > U+FFF9 INTERLINEAR ANNOTATION ANCHOR through U+FFFB INTERLINEAR ANNOTATION TERMINATOR,
> > and the Egyptian hieroglyph format controls, U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE.
> > These characters should have a visible glyph display for fallback rendering, because if they are not displayed,
> > it is too easy to misread the resulting displayed text.
> > See “Annotation Characters” in [*Section 23.8, Specials*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.21335.Heading.133.Specials),
> > as well as [*Section 11.4, Egyptian Hieroglyphs*](https://www.unicode.org/versions/Unicode15.1.0/ch11.pdf#M9.73291.Heading.1418.Egyptian.Hieroglyphs)
> > for more discussion of the use and display of these characters.

OK.
A direct quote from the chapter.

> glibc currently correctly assigns non-zero width to the prepended concatenation marks,
> but it incorrectly gives zero width to the interlinear annotation characters (which a generic terminal cannot interpret)
> and the Egyptian hieroglyph format controls (which are not widely supported in rendering implementations at present).
> This commit fixes both these issues as well.

OK. I'll look for this change below.

> Signed-off-by: Jules Bertholet

A minor nit: it would be great if some of these long lines were split
across multiple lines. Of course, it's only a nit and `git log' shows
that many commit messages do have long lines in them.

> ---
>  localedata/charmaps/UTF-8          | 21 ++++++----
>  localedata/unicode-gen/Makefile    |  2 +
>  localedata/unicode-gen/utf8_gen.py | 67 +++++++++++++++++-------------
>  3 files changed, 53 insertions(+), 37 deletions(-)
>
> diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
> index bd8075f20d..f3fcd64fce 100644
> --- a/localedata/charmaps/UTF-8
> +++ b/localedata/charmaps/UTF-8
> @@ -49842,12 +49842,17 @@ END CHARMAP
>
>  % Character width according to Unicode 15.0.0.
>  % - Default width is 1.
> +% - U+115F HANGUL CHOSEONG FILLER has width 2.
> +% - Combining jungseong and jongseong Hangul jamo have with 0.
> +% - U+00AD SOFT HYPHEN has width 1.
>  % - Double-width characters have width 2; generated from
>  % "grep '^[^;]*;[WF]' EastAsianWidth.txt"
> -% - Non-spacing characters have width 0; generated from PropList.txt or
> -% "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
> -% - Format control characters have width 0; generated from
> -% "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
> +% - Non-spacing marks have width 0; generated from
> +% "grep '^[^;]*;[^;]*;Mn;' UnicodeData.txt"
> +% - Enclosing marks have width 0; generated from
> +% "grep '^[^;]*;[^;]*;Me;' UnicodeData.txt"
> +% - "Default_Ignorable_Code_Point"s have width 0; generated from
> +% "grep '^[^;]*;\s*Default_Ignorable_Code_Point' UnicodeData.txt"

This bit doesn't apply due to the conflict I mentioned earlier.

>  WIDTH
>  ... 0
>  ... 0
> @@ -50069,7 +50074,9 @@ WIDTH
>  ... 0
>  ... 2
>  ... 2
> -... 2
> +... 2
> + 0
> +... 2

OK. HANGUL FILLER.

>  ... 2
>  ... 2
>  ... 2
> @@ -50124,8 +50131,8 @@ WIDTH
>  ... 2
>  0
>  ... 2
> + 0

OK. HALFWIDTH HANGUL FILLER.

>  ... 2
> -... 0

OK. "U+FFF9 INTERLINEAR ANNOTATION ANCHOR through U+FFFB INTERLINEAR
ANNOTATION TERMINATOR" should not be ignored. You quoted this in the
commit message.

>  0
>  0
>  ... 0
> @@ -50226,7 +50233,7 @@ WIDTH
>  ... 0
>  0
>  0
> -... 0
> + 0

OK. "U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through U+1343F
EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE" should not be ignored.

>  ... 0
>  ... 0
>  ... 0
> diff --git a/localedata/unicode-gen/Makefile b/localedata/unicode-gen/Makefile
> index fd0c732ac4..1975065679 100644
> --- a/localedata/unicode-gen/Makefile
> +++ b/localedata/unicode-gen/Makefile
> @@ -1,4 +1,5 @@
>  # Copyright (C) 2015-2023 Free Software Foundation, Inc.
> +# Copyright (C) 2024 The GNU Toolchain Authors.

This bit doesn't apply due to a recent copyright line change in master.

>  # This file is part of the GNU C Library.
>
>  # The GNU C Library is free software; you can redistribute it and/or
> @@ -94,6 +95,7 @@ UTF-8: UnicodeData.txt EastAsianWidth.txt
>  UTF-8: utf8_gen.py
>      $(PYTHON3) utf8_gen.py -u UnicodeData.txt \
>        -e EastAsianWidth.txt -p PropList.txt \
> +      -d DerivedCoreProperties.txt \

OK. Adds a new parameter.

>        --unicode_version $(UNICODE_VERSION)
>
>  UTF-8-report: UTF-8 ../charmaps/UTF-8
> diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
> index b48dc2aaa4..eedf6eadb0 100755
> --- a/localedata/unicode-gen/utf8_gen.py
> +++ b/localedata/unicode-gen/utf8_gen.py
> @@ -1,6 +1,7 @@
>  #!/usr/bin/python3
>  # -*- coding: utf-8 -*-
>  # Copyright (C) 2014-2023 Free Software Foundation, Inc.
> +# Copyright (C) 2024 The GNU Toolchain Authors.

Again, needs to be rebased due to a copyright line change.

>  # This file is part of the GNU C Library.
>  #
>  # The GNU C Library is free software; you can redistribute it and/or
> @@ -28,7 +29,6 @@ It will output UTF-8 file
>  '''
>
>  import argparse
> -import sys

OK. As long as the script continues to run.

>  import re
>  import unicode_utils
>
> @@ -203,25 +203,24 @@ def write_header_width(outfile, unicode_version):
>      outfile.write('% Character width according to Unicode '
>                    + '{:s}.\n'.format(unicode_version))
>      outfile.write('% - Default width is 1.\n')
> +    outfile.write('% - U+115F HANGUL CHOSEONG FILLER has width 2.\n')
> +    outfile.write('% - Combining jungseong and jongseong Hangul jamo have with 0.\n')
> +    outfile.write('% - U+00AD SOFT HYPHEN has width 1.\n')

OK. You did add comments about the change.
>      outfile.write('% - Double-width characters have width 2; generated from\n')
>      outfile.write('% "grep \'^[^;]*;[WF]\' EastAsianWidth.txt"\n')
> -    outfile.write('% - Non-spacing characters have width 0; '
> -                  + 'generated from PropList.txt or\n')
> -    outfile.write('% "grep \'^[^;]*;[^;]*;[^;]*;[^;]*;NSM;\' '
> -                  + 'UnicodeData.txt"\n')
> -    outfile.write('% - Format control characters have width 0; '
> -                  + 'generated from\n')
> -    outfile.write("% \"grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt\"\n")
> -# Not needed covered by Cf
> -# outfile.write("% - Zero width characters have width 0; generated from\n")
> -# outfile.write("% \"grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt\"\n")
> +    outfile.write('% - Non-spacing marks have width 0; generated from\n')
> +    outfile.write('% "grep \'^[^;]*;[^;]*;Mn;\' UnicodeData.txt"\n')
> +    outfile.write('% - Enclosing marks have width 0; generated from\n')
> +    outfile.write('% "grep \'^[^;]*;[^;]*;Me;\' UnicodeData.txt"\n')
> +    outfile.write('% - "Default_Ignorable_Code_Point"s have width 0; generated from\n')
> +    outfile.write("% \"grep '^[^;]*;\\s*Default_Ignorable_Code_Point' UnicodeData.txt\"\n")

This doesn't apply due to conflicts.

>      outfile.write("WIDTH\n")
>
> -def process_width(outfile, ulines, elines, plines):
> +def process_width(outfile, ulines, elines, dlines):
>      '''ulines are lines from UnicodeData.txt, elines are lines from
> -    EastAsianWidth.txt containing characters with width “W” or “F”,
> -    plines are lines from PropList.txt which contain characters
> -    with the property “Prepended_Concatenation_Mark”.
> +    EastAsianWidth.txt containing characters with width “W” or “F”.
> +    dlines are lines from DerivedCoreProperties.txt which contain
> +    characters with the property “Default_Ignorable_Code_Point”.
>
>      '''
>      width_dict = {}
> @@ -237,12 +236,12 @@ def process_width(outfile, ulines, elines, plines):
>
>      for line in ulines:
>          fields = line.split(";")
> -        if fields[4] == "NSM" or fields[2] in ("Cf", "Me", "Mn"):
> +        if fields[4] == "NSM" or fields[2] in ("Me", "Mn"):
>              width_dict[int(fields[0], 16)] = 0
>
> -    for line in plines:
> -        # Characters with the property “Prepended_Concatenation_Mark”
> -        # should have the width 1:
> +    for line in dlines:
> +        # Characters with the property “Default_Ignorable_Code_Point”
> +        # should have the width 0:
>          fields = line.split(";")
>          if not '..' in fields[0]:
>              code_points = (fields[0], fields[0])
> @@ -250,7 +249,13 @@ def process_width(outfile, ulines, elines, plines):
>              code_points = fields[0].split("..")
>          for key in range(int(code_points[0], 16),
>                           int(code_points[1], 16)+1):
> -            del width_dict[key] # default width is 1
> +            width_dict[key] = 0 # default width is 1
> +
> +    # special case: U+115F HANGUL CHOSEONG FILLER
> +    # combines with other Hangul jamo to form a width-2
> +    # syllable block, so treat it as width 2
> +    # despite it being a `Default_Ignorable_Code_Point`
> +    width_dict[0x115F] = 2
>
>      # handle special cases for compatibility
>      for key in list((0x00AD,)):
> @@ -302,7 +307,7 @@ def process_width(outfile, ulines, elines, plines):
>  if __name__ == "__main__":
>      PARSER = argparse.ArgumentParser(
>          description='''
> -        Generate a UTF-8 file from UnicodeData.txt, EastAsianWidth.txt, and PropList.txt.
> +        Generate a UTF-8 file from UnicodeData.txt, DerivedCoreProperties.txt, and EastAsianWidth.txt
>          ''')
>      PARSER.add_argument(
>          '-u', '--unicode_data_file',
> @@ -319,11 +324,11 @@ if __name__ == "__main__":
>          help=('The EastAsianWidth.txt file to read, '
>                + 'default: %(default)s'))
>      PARSER.add_argument(
> -        '-p', '--prop_list_file',
> +        '-d', '--derived_core_properties_file',

This seems problematic.
Running `make UTF-8' in localedata/unicode-gen errors out:

"utf8_gen.py: error: unrecognized arguments: -p PropList.txt"

The Makefile hunk above still passes `-p PropList.txt', but with this
change the script no longer defines that option.

I didn't get around to reviewing the changes to the script but I'll look
out for a v3.

Cheers!

-- 
Arjun Shankar
he/him/his
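
For reference, the `make UTF-8' failure reported above can be reproduced
outside the build with a small argparse sketch. It is only an
approximation of the patched script's option set: `build_parser` and
`leftover_args` are hypothetical helpers, the long name of `-e' is left
out because it is not shown in the quoted hunks, and "15.0.0" is a
placeholder for $(UNICODE_VERSION).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Approximate the options the patched script defines (assumption:
    only the options visible in the quoted diff, with -p replaced by -d)."""
    parser = argparse.ArgumentParser(prog="utf8_gen.py")
    parser.add_argument("-u", "--unicode_data_file")
    parser.add_argument("-e")  # long option name not shown in the diff
    parser.add_argument("-d", "--derived_core_properties_file")
    parser.add_argument("--unicode_version")
    return parser


def leftover_args(argv):
    """Return the arguments the parser does not recognise.

    parse_known_args() collects them instead of exiting, whereas
    parse_args() would report them as "unrecognized arguments".
    """
    _, extras = build_parser().parse_known_args(argv)
    return extras


# Command line as the patched Makefile rule builds it: -p is still passed.
STALE_ARGV = ["-u", "UnicodeData.txt", "-e", "EastAsianWidth.txt",
              "-p", "PropList.txt", "-d", "DerivedCoreProperties.txt",
              "--unicode_version", "15.0.0"]

# Same command line with the stale option dropped.
FIXED_ARGV = [arg for arg in STALE_ARGV if arg not in ("-p", "PropList.txt")]

print(leftover_args(STALE_ARGV))  # ['-p', 'PropList.txt']
print(leftover_args(FIXED_ARGV))  # []
```

Dropping `-p PropList.txt' from the Makefile rule is one plausible way to
make the two sides agree; whether PropList.txt is still needed elsewhere
in the script is not something the quoted hunks show.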