Re: [committed V3] libstdc++: Add Unicode-aware width estimation for std::format

From: Jonathan Wakely <jwakely@redhat.com>
To: libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org
Subject: Re: [committed V3] libstdc++: Add Unicode-aware width estimation for std::format
Date: Mon, 8 Jan 2024 22:56:06 +0000
Message-ID: <CACb0b4=n+SM+-G5zwWiasxYvP72LXTs-PbsNLf2i_bHrUnU34w@mail.gmail.com>
In-Reply-To: <20240108011829.3670492-1-jwakely@redhat.com>

On Mon, 8 Jan 2024 at 01:19, Jonathan Wakely <jwakely@redhat.com> wrote:
>
> I decided to push this now, not wait for the morning.
>
> This is mostly the same as V2, but adds to the contrib/unicode/README as
> suggested by Lewis, and avoids a trailing whitespace character in the
> generated header.
>
> Tested x86_64-linux and aarch64-linux. Pushed to trunk.
>
> -- >8 --
>
>
> This implements the requirements in the following proposals, which
> dictate how std::format deals with non-ASCII strings:
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1868r1.html
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2572r1.html
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2675r1.pdf
>
> There are two parts to this. The width estimation for strings must only
> count the width of the first character in an extended grapheme cluster.
> That requires implementing the algorithm for detecting cluster breaks,
> which requires a number of lookup tables of the grapheme cluster break
> properties (and Indic_Conjunct_Break and Extended_Pictographic
> properties) of every code point. Additionally, some characters have a
> field width of 2, which requires another lookup table of field widths
> for every code point.  The tables added in this commit do not contain
> entries for every code point from 0 to 0x10FFFF as that would be very
> inefficient and use too much memory. Instead the tables only contain the
> code points that form an "edge" for a property, omitting all the code
> points that have the same property as the preceding one. We can use a
> binary search to find the closest code point in the table that is not
> greater than the one we're looking for.
>
> The tables are generated by a new Python script added to the
> contrib/unicode directory, and a new data file downloaded from the
> Unicode Consortium website.
>
> The rules for extended grapheme cluster breaking are implemented for the
> latest Unicode standard, version 15.1.0.
>
> libstdc++-v3/ChangeLog:
>
>         * include/Makefile.am: Add new headers.
>         * include/Makefile.in: Regenerate.
>         * include/bits/unicode.h: New file.
>         * include/bits/unicode-data.h: New file.
>         * include/std/format: Include <bits/unicode.h>.
>         (__literal_encoding_is_utf8): Move to <bits/unicode.h>.
>         (_Spec::_M_fill): Change type to char32_t.
>         (_Spec::_M_parse_fill_and_align): Read a Unicode scalar value
>         instead of a single character.
>         (__write_padded): Change __fill_char parameter to char32_t and
>         encode it into the output.
>         (__formatter_str::format): Use new __unicode::__field_width and
>         __unicode::__truncate functions.
>         * include/std/ostream: Adjust namespace qualification for
>         __literal_encoding_is_utf8.
>         * include/std/print: Likewise.
>         * src/c++23/print.cc: Add [[unlikely]] attribute to error path.
>         * testsuite/ext/unicode/view.cc: New test.
>         * testsuite/std/format/functions/format.cc: Add missing examples
>         from the standard demonstrating alignment with non-ASCII
>         characters. Add examples checking correct handling of extended
>         grapheme clusters.
>
> contrib/ChangeLog:
>
>         * unicode/README: Add notes about generating libstdc++ tables.
>         * unicode/GraphemeBreakProperty.txt: New file.
>         * unicode/emoji-data.txt: New file.
>         * unicode/gen_libstdcxx_unicode_data.py: New file.
> ---


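While writing some more tests I realised I'd forgotten to finish this
function, and had left it as a copy&paste from __field_width(char32_t)
above. It still returns a width (1 or 2) from a function declared to
return bool, so the result is always true: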

> +  constexpr bool
> +  __is_extended_pictographic(char32_t __c)
> +  {
> +    if (__c < __xpicto_edges[0]) [[likely]]
> +      return 1;
> +
> +    auto* __p = std::upper_bound(__xpicto_edges, std::end(__xpicto_edges), __c);
> +    return (__p - __xpicto_edges) % 2 + 1;
> +  }

It should be:

  constexpr bool
  __is_extended_pictographic(char32_t __c)
  {
    if (__c < __xpicto_edges[0]) [[likely]]
      return false;

    auto* __p = std::upper_bound(__xpicto_edges, std::end(__xpicto_edges), __c);
    return (__p - __xpicto_edges) % 2;
  }
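
To illustrate why the parity of the upper_bound index gives the answer,
here is a minimal standalone sketch. The table demo_edges and both
functions are hypothetical names invented for this example, and the table
contents are not the generated libstdc++ data. It assumes the table
alternates between "property starts here" and "property stops here"
entries, as the commit message describes for the edge tables.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical edge table for illustration only (not the real
// __xpicto_edges data). Entries alternate: the first code point of a
// run that has the property, then the first code point after that run.
inline constexpr char32_t demo_edges[] = {
  0x00A9,  0x00AA,    // U+00A9 only
  0x1F600, 0x1F650,   // U+1F600 through U+1F64F
};

// Corrected shape: an odd number of edges at or below c means we are
// inside a run that has the property.
inline bool
demo_has_property(char32_t c)
{
  if (c < demo_edges[0])
    return false;

  // upper_bound returns the first edge greater than c, so the distance
  // from the start of the table is the number of edges <= c.
  auto* p = std::upper_bound(demo_edges, std::end(demo_edges), c);
  return (p - demo_edges) % 2;
}

// Copy&paste shape: the "+ 1" turns 0/1 into 1/2, which is a field
// width, and both values convert to true.
inline bool
demo_has_property_buggy(char32_t c)
{
  if (c < demo_edges[0])
    return 1;

  auto* p = std::upper_bound(demo_edges, std::end(demo_edges), c);
  return (p - demo_edges) % 2 + 1;
}
```

With this table, demo_has_property(U'A') and demo_has_property(0x00AA)
are false while demo_has_property(0x1F600) is true, whereas
demo_has_property_buggy returns true for all three.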

I'll push a fix for that (and add my new tests) tomorrow.
