From: Jonathan Wakely <jwakely@redhat.com>
To: libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org
Subject: Re: [committed V3] libstdc++: Add Unicode-aware width estimation for std::format
Date: Mon, 8 Jan 2024 22:56:06 +0000
Message-ID: <CACb0b4=n+SM+-G5zwWiasxYvP72LXTs-PbsNLf2i_bHrUnU34w@mail.gmail.com>
In-Reply-To: <20240108011829.3670492-1-jwakely@redhat.com>

On Mon, 8 Jan 2024 at 01:19, Jonathan Wakely <jwakely@redhat.com> wrote:
>
> I decided to push this now, not wait for the morning.
>
> This is mostly the same as V2, but adds to the contrib/unicode/README as
> suggested by Lewis, and avoids a trailing whitespace character in the
> generated header.
>
> Tested x86_64-linux and aarch64-linux. Pushed to trunk.
>
> -- >8 --
>
>
> This implements the requirements in the following proposals, which
> dictate how std::format deals with non-ASCII strings:
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1868r1.html
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2572r1.html
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2675r1.pdf
>
> There are two parts to this. The width estimation for strings must only
> count the width of the first character in an extended grapheme cluster.
> That requires implementing the algorithm for detecting cluster breaks,
> which requires a number of lookup tables of the grapheme cluster break
> properties (and Indic_Conjunct_Break and Extended_Pictographic
> properties) of every code point. Additionally, some characters have a
> field width of 2, which requires another lookup table of field widths
> for every code point. The tables added in this commit do not contain
> entries for every code point from 0 to 0x10FFFF as that would be very
> inefficient and use too much memory. Instead the tables only contain the
> code points that form an "edge" for a property, omitting all the code
> points that have the same property as the preceding one. We can use a
> binary search to find the closest code point in the table that is not
> greater than the one we're looking for.
>
> The tables are generated by a new Python script added to the
> contrib/unicode directory, and a new data file downloaded from the
> Unicode Consortium website.
>
> The rules for extended grapheme cluster breaking are implemented for the
> latest Unicode standard, version 15.1.0.
>
> libstdc++-v3/ChangeLog:
>
> * include/Makefile.am: Add new headers.
> * include/Makefile.in: Regenerate.
> * include/bits/unicode.h: New file.
> * include/bits/unicode-data.h: New file.
> * include/std/format: Include <bits/unicode.h>.
> (__literal_encoding_is_utf8): Move to <bits/unicode.h>.
> (_Spec::_M_fill): Change type to char32_t.
> (_Spec::_M_parse_fill_and_align): Read a Unicode scalar value
> instead of a single character.
> (__write_padded): Change __fill_char parameter to char32_t and
> encode it into the output.
> (__formatter_str::format): Use new __unicode::__field_width and
> __unicode::__truncate functions.
> * include/std/ostream: Adjust namespace qualification for
> __literal_encoding_is_utf8.
> * include/std/print: Likewise.
> * src/c++23/print.cc: Add [[unlikely]] attribute to error path.
> * testsuite/ext/unicode/view.cc: New test.
> * testsuite/std/format/functions/format.cc: Add missing examples
> from the standard demonstrating alignment with non-ASCII
> characters. Add examples checking correct handling of extended
> grapheme clusters.
>
> contrib/ChangeLog:
>
> * unicode/README: Add notes about generating libstdc++ tables.
> * unicode/GraphemeBreakProperty.txt: New file.
> * unicode/emoji-data.txt: New file.
> * unicode/gen_libstdcxx_unicode_data.py: New file.
> ---
While writing some more tests I realised I'd forgotten to finish this
function, and had left it as a copy-and-paste of __field_width(char32_t)
above:

> +  constexpr bool
> +  __is_extended_pictographic(char32_t __c)
> +  {
> +    if (__c < __xpicto_edges[0]) [[likely]]
> +      return 1;
> +
> +    auto* __p = std::upper_bound(__xpicto_edges, std::end(__xpicto_edges), __c);
> +    return (__p - __xpicto_edges) % 2 + 1;
> +  }

It should be:

  constexpr bool
  __is_extended_pictographic(char32_t __c)
  {
    if (__c < __xpicto_edges[0]) [[likely]]
      return false;

    auto* __p = std::upper_bound(__xpicto_edges, std::end(__xpicto_edges), __c);
    return (__p - __xpicto_edges) % 2;
  }

I'll push a fix for that (and add my new tests) tomorrow.
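
For anyone following along, here is a minimal standalone sketch of the
edge-table lookup and of why the copied version was wrong for a bool
result. Everything in it is an assumption made for illustration: the
table contents are made up (they are not the generated libstdc++ data),
the names example_edges, edges_not_after, has_property and
width_style_result are hypothetical, and a hand-written loop stands in
for std::upper_bound so that the functions are usable in constant
expressions before C++20.

```cpp
#include <cstddef>

// Hypothetical edge table, NOT the real libstdc++ data. Each entry is the
// first code point of a run. Runs alternate: code points below the first
// edge lack the property, the next run has it, the next lacks it, and so on.
constexpr char32_t example_edges[] = {
  0x00A9, 0x00AA,     // property holds for U+00A9 only
  0x1F300, 0x1F400,   // property holds for U+1F300..U+1F3FF
};

// Count the edges that are <= c. This is the offset that
// std::upper_bound(begin, end, c) - begin would give.
constexpr std::size_t
edges_not_after(char32_t c)
{
  std::size_t lo = 0;
  std::size_t hi = sizeof(example_edges) / sizeof(example_edges[0]);
  while (lo < hi)
    {
      std::size_t mid = lo + (hi - lo) / 2;
      if (example_edges[mid] <= c)
        lo = mid + 1;   // edge is not after c, look to the right
      else
        hi = mid;       // edge is after c, look to the left
    }
  return lo;
}

// Boolean property: an odd number of edges passed means "inside a run
// that has the property".
constexpr bool
has_property(char32_t c)
{ return edges_not_after(c) % 2 == 1; }

// Width-style result, as in the copied code: the parity is mapped to 1 or 2.
// Both values convert to true, so this is wrong for a bool-returning function.
constexpr int
width_style_result(char32_t c)
{ return static_cast<int>(edges_not_after(c) % 2) + 1; }

static_assert(!has_property(0x0041), "below the first edge");
static_assert(has_property(0x00A9), "inside the first run");
static_assert(!has_property(0x00AA), "just past the first run");
static_assert(static_cast<bool>(width_style_result(0x0041)),
              "the width-style value is truthy even without the property");
```

The parity trick only works if the table strictly alternates between
"absent" and "present" runs, which is what the sketch assumes. A property
with more than two values would need the value stored alongside each
edge instead.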