From: Jonathan Wakely
Date: Mon, 8 Jan 2024 22:56:06 +0000
Subject: Re: [committed V3] libstdc++: Add Unicode-aware width estimation for std::format
To: libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org
References: <20240108011829.3670492-1-jwakely@redhat.com>
In-Reply-To: <20240108011829.3670492-1-jwakely@redhat.com>

On Mon, 8 Jan 2024 at 01:19, Jonathan Wakely wrote:
>
> I decided to push this now, not wait for the morning.
>
> This is mostly the same as V2, but adds to the contrib/unicode/README as
> suggested by Lewis, and avoids a trailing whitespace character in the
> generated header.
>
> Tested x86_64-linux and aarch64-linux. Pushed to trunk.
>
> -- >8 --
>
>
> This implements the requirements in the following proposals, which
> dictate how std::format deals with non-ASCII strings:
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1868r1.html
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2572r1.html
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2675r1.pdf
>
> There are two parts to this. The width estimation for strings must only
> count the width of the first character in an extended grapheme cluster.
> That requires implementing the algorithm for detecting cluster breaks,
> which requires a number of lookup tables of the grapheme cluster break
> properties (and Indic_Conjunct_Break and Extended_Pictographic
> properties) of every code point. Additionally, some characters have a
> field width of 2, which requires another lookup table of field widths
> for every code point. The tables added in this commit do not contain
> entries for every code point from 0 to 0x10FFFF as that would be very
> inefficient and use too much memory. Instead the tables only contain the
> code points that form an "edge" for a property, omitting all the code
> points that have the same property as the preceding one. We can use a
> binary search to find the closest code point in the table that is not
> greater than the one we're looking for.
>
> The tables are generated by a new Python script added to the
> contrib/unicode directory, and a new data file downloaded from the
> Unicode Consortium website.
>
> The rules for extended grapheme cluster breaking are implemented for the
> latest Unicode standard, version 15.1.0.
>
> libstdc++-v3/ChangeLog:
>
>	* include/Makefile.am: Add new headers.
>	* include/Makefile.in: Regenerate.
>	* include/bits/unicode.h: New file.
>	* include/bits/unicode-data.h: New file.
>	* include/std/format: Include .
>	(__literal_encoding_is_utf8): Move to .
>	(_Spec::_M_fill): Change type to char32_t.
>	(_Spec::_M_parse_fill_and_align): Read a Unicode scalar value
>	instead of a single character.
>	(__write_padded): Change __fill_char parameter to char32_t and
>	encode it into the output.
>	(__formatter_str::format): Use new __unicode::__field_width and
>	__unicode::__truncate functions.
>	* include/std/ostream: Adjust namespace qualification for
>	__literal_encoding_is_utf8.
>	* include/std/print: Likewise.
>	* src/c++23/print.cc: Add [[unlikely]] attribute to error path.
>	* testsuite/ext/unicode/view.cc: New test.
>	* testsuite/std/format/functions/format.cc: Add missing examples
>	from the standard demonstrating alignment with non-ASCII
>	characters. Add examples checking correct handling of extended
>	grapheme clusters.
>
> contrib/ChangeLog:
>
>	* unicode/README: Add notes about generating libstdc++ tables.
>	* unicode/GraphemeBreakProperty.txt: New file.
>	* unicode/emoji-data.txt: New file.
>	* unicode/gen_libstdcxx_unicode_data.py: New file.
> ---

While writing some more tests I realised I'd forgotten to finish this
function, and had left it as a copy&paste from __field_width(char32_t)
above:

> +  constexpr bool
> +  __is_extended_pictographic(char32_t __c)
> +  {
> +    if (__c < __xpicto_edges[0]) [[likely]]
> +      return 1;
> +
> +    auto* __p = std::upper_bound(__xpicto_edges, std::end(__xpicto_edges), __c);
> +    return (__p - __xpicto_edges) % 2 + 1;
> +  }

It should be:

  constexpr bool
  __is_extended_pictographic(char32_t __c)
  {
    if (__c < __xpicto_edges[0]) [[likely]]
      return false;

    auto* __p = std::upper_bound(__xpicto_edges, std::end(__xpicto_edges), __c);
    return (__p - __xpicto_edges) % 2;
  }

I'll push a fix for that (and add my new tests) tomorrow.
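
For anyone following along, here is a rough standalone sketch of how the "edge" tables and the parity of the upper_bound result fit together, and why the copy&pasted version was wrong for a bool. The tables and the toy_* names below are made up for illustration; they are not the generated data in unicode-data.h, and the code assumes the same layout as the quoted functions (entries alternate between the start of a run with the property and the start of a run without it). The functions are plain inline rather than constexpr so the sketch does not depend on C++20's constexpr std::upper_bound.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical edge table (NOT the generated libstdc++ data). Each entry is
// the first code point of a run; runs alternate between "has the property"
// and "does not have it", and everything below toy_xpicto_edges[0] does not.
inline constexpr char32_t toy_xpicto_edges[] = {
  0x00A9, 0x00AA,   // a run containing only U+00A9
  0x2600, 0x2606,   // a run covering U+2600..U+2605
};

// Hypothetical width table with the same layout: one run of width 2.
inline constexpr char32_t toy_width_edges[] = {
  0x1100, 0x1160,   // U+1100..U+115F treated as width 2 in this toy table
};

// Boolean property lookup, mirroring the shape of the corrected function.
inline bool
toy_is_extended_pictographic(char32_t c)
{
  if (c < toy_xpicto_edges[0])
    return false;  // below the first edge: property is off

  // p points one past the last edge that is <= c, so (p - table) is the
  // number of edges at or below c.
  auto* p = std::upper_bound(toy_xpicto_edges, std::end(toy_xpicto_edges), c);
  // An odd count means c lies inside a run that has the property.
  return (p - toy_xpicto_edges) % 2 != 0;
}

// Width lookup: the same parity, shifted into the range 1..2. Reusing this
// "% 2 + 1" form for a bool result yields 1 or 2, both of which convert
// to true, which is the copy&paste mistake described above.
inline int
toy_field_width(char32_t c)
{
  if (c < toy_width_edges[0])
    return 1;

  auto* p = std::upper_bound(toy_width_edges, std::end(toy_width_edges), c);
  return (p - toy_width_edges) % 2 + 1;
}
```

With these toy tables, the two edges of a run are its inclusive start and exclusive end, so a single-code-point run costs two entries and a contiguous block of any length also costs only two.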