From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp113.iad3b.emailsrvr.com (smtp113.iad3b.emailsrvr.com [146.20.161.113]) by sourceware.org (Postfix) with ESMTPS id 0BBD4386F0D2 for ; Thu, 30 Jun 2022 12:52:34 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 0BBD4386F0D2 X-Auth-ID: tom@honermann.net Received: by smtp15.relay.iad3b.emailsrvr.com (Authenticated sender: tom-AT-honermann.net) with ESMTPSA id 9162AC00B4; Thu, 30 Jun 2022 08:52:28 -0400 (EDT) From: Tom Honermann To: libc-alpha@sourceware.org Cc: Tom Honermann Subject: [PATCH v4 0/3] C++20 P0482R6 and C2X N2653: char8_t, mbrtoc8(), and c8rtomb() Date: Thu, 30 Jun 2022 08:52:12 -0400 Message-Id: <20220630125215.6052-1-tom@honermann.net> X-Mailer: git-send-email 2.32.0 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Classification-ID: bd4e2ac9-54ec-4bf3-bea9-66e0e087bd86-1-1 X-Spam-Status: No, score=-3.7 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jun 2022 12:52:35 -0000 This series of patches provides the following: - A fix for BZ #25744 [1]. - Implementations of the mbrtoc8 and c8rtomb functions adopted for C++20 via WG21 P0482R6 [2] and for C2X via WG14 N2653 [3]. - A char8_t typedef as adopted for C2X via WG14 N2653 [3]. The fix for BZ #25744 [1] is included in this patch series because the tests for mbrtoc8 and c8rtomb depend on it for exercising the special case where a pair of Unicode code points is converted to/from a single double byte sequence. Such conversion cases exist for Big5-HKSCS. N2653 was adopted by WG14 for C2X and wording is present in the N2912 revision of the C2X working draft [4]. This patch series enables the new declarations in C2X mode and when _GNU_SOURCE is defined. This patch series addresses feedback provided in response to the prior submission series starting at [5]. All feedback was addressed except as noted below: - I removed the previously proposed wcsmbs/test-char8-type.c test as redundant per feedback. I prototyped extending the c++-types.data test as suggested, but that would have introduced a C++20 dependency. The test demonstrated that the char16_t and char32_t types were mapped to 'Ds' and 'Di' as expected, but char8_t got mapped to 'h' rather than 'Du'. It might be worth extending this test in the future if/when C++20 support becomes a minimal requirement. - I did not switch the iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c test to use TEST_COMPARE in favor of the existing custom error handling; the current error messages are customized to provide a more detailed explanation of the error, so removing that seemed like an arguable regression. My changes retain consistency with the existing code. I think it is reasonable to switch to TEST_COMPARE, but I think that should be done separately from this change and separately motivated. - I did not change the c8rtomb() and mbrtoc8() implementations to use thread local storage when PS is null; the implementations continue to use static storage consistent with c16rtomb(), mbrtoc16(), c32rtomb(), and mbrtoc32(). I think it is reasonable to switch all of these functions to use thread local storage, but I think such a change should be done consistently across all of them and should be separately motivated. Thank you to Joseph Myers, Carlos O'Donell, Adhemerval Zanella, and Florian Weimer for prior reviews of this patch series. Tested on Linux x86_64. [1]: Bug 25744 "mbrtowc with Big5-HKSCS returns 2 instead of 1 when consuming the second byte of certain double byte characters" https://sourceware.org/bugzilla/show_bug.cgi?id=25744 [2]: WG21 P0482R6 "char8_t: A type for UTF-8 characters and strings (Revision 6)" https://wg21.link/p0482r6 [3]: WG14 N2653 "char8_t: A type for UTF-8 characters and strings (Revision 1)" http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm [4]: WG14 N2912 "Programming languages — C working draft — June 8, 2022" https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2912.pdf [5]: libc-alpha mailing list "[PATCH 0/3]: C++20 P0482R6 and C2X N2653: support for char8_t, mbrtoc8(), and c8rtomb()." https://sourceware.org/pipermail/libc-alpha/2022-February/136723.html