From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=Wj/B=7M=redhat.com=jwakely@sourceware.org>
Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124])
	by sourceware.org (Postfix) with ESMTPS id F0BD63858413
	for <gcc-patches@gcc.gnu.org>; Mon, 20 Mar 2023 15:21:19 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org F0BD63858413
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=redhat.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=redhat.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1679325679;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=xrtiMu7W2EXFrLARmFUeYQP+GadMZxrZZhAx66XWR0E=;
	b=KmkJXw3+VnSCLVTSdAKX/U27g/mkSLsR6huv7EL0SQa+k7UpvOdjkFmVdkUfL5opU006mE
	7U+5Xt93REQX/FpzGzo0+S+6k3Y57WGCr2FVhONCVahHZHwv22YIjjAgDMTm30nPT3FAz/
	Htj0H1sZ3qbnaSxouNbn4nic3onVLNE=
Received: from mail-lj1-f199.google.com (mail-lj1-f199.google.com
 [209.85.208.199]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id
 us-mta-382-IS2GqH-wO-Wl87ask2CcYg-1; Mon, 20 Mar 2023 11:21:17 -0400
X-MC-Unique: IS2GqH-wO-Wl87ask2CcYg-1
Received: by mail-lj1-f199.google.com with SMTP id k5-20020a05651c0a0500b00296611bbca8so3165308ljq.1
        for <gcc-patches@gcc.gnu.org>; Mon, 20 Mar 2023 08:21:16 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112; t=1679325675;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=xrtiMu7W2EXFrLARmFUeYQP+GadMZxrZZhAx66XWR0E=;
        b=1bgOB4iHYx91N8by3FYyJM7lgr1/gmZkg1W3XyQES5W3bRGx3qle6KcQbzIVLQKrhZ
         1J+6VAFsr9XfPjvp396dHKRs6h8at8hxb/Qy2ayI3TXY8A4KaOgrEqAfcaYr5UmMQ3I2
         bstM4YS1IKcKTs9GeT+ocimsPb1VDMyCwJOXbvPgBzCr3u+ysTt/6Cxt9jF9aflbZNzS
         6oiPyESiQzcXW9Zsi2HSiJ8333+Wq1sZb9DSDIvMXZOsWgvjroHP00gvwffWx0JiY2Ft
         Xl48f7+WrB1rY46qaNuwu3vls/LJbYvdyNSf1rKSb2PB8PCNYAgL13ovqRufHO6604vM
         /b/g==
X-Gm-Message-State: AO0yUKUP0csmN6MI5v6mAvOjl2a6E+IvgiC/Khc0rMO1sIq1KWo8TJUd
	xC/EzthEq/7Fa/zeCq3P1RWvU6YGA1uDs554I+RmiuICi10v/9+EbBfl5nu5C9/UwzgxHEPMirE
	inR5ATp22QkFH8bG5fNyZlStz+9WLqaoiKA==
X-Received: by 2002:a2e:9a8b:0:b0:299:ac1c:d8b3 with SMTP id p11-20020a2e9a8b000000b00299ac1cd8b3mr115285lji.9.1679325675605;
        Mon, 20 Mar 2023 08:21:15 -0700 (PDT)
X-Google-Smtp-Source: AK7set/aVQI9afwRC17MDS30pv4Sr3h86IbO0wE31usBqv3HaZ6liAxHhoZG4/Z8g6H8Zw7HcawrJMIpwCyBry/wCYQ=
X-Received: by 2002:a2e:9a8b:0:b0:299:ac1c:d8b3 with SMTP id
 p11-20020a2e9a8b000000b00299ac1cd8b3mr115280lji.9.1679325675313; Mon, 20 Mar
 2023 08:21:15 -0700 (PDT)
MIME-Version: 1.0
References: <AS1P192MB16201866FEA701E17850E8D3ACB49@AS1P192MB1620.EURP192.PROD.OUTLOOK.COM>
In-Reply-To: <AS1P192MB16201866FEA701E17850E8D3ACB49@AS1P192MB1620.EURP192.PROD.OUTLOOK.COM>
From: Jonathan Wakely <jwakely@redhat.com>
Date: Mon, 20 Mar 2023 15:21:04 +0000
Message-ID: <CACb0b4m-5r7aStZY=yc+75nf5bX18Q-nMgsktjqWK7-VT8WQXg@mail.gmail.com>
Subject: Re: [PATCH] libstdc++: Fix handling of surrogate CP in codecvt [PR108976]
To: Dimitrij Mijoski <dmjpp@hotmail.com>
Cc: gcc-patches@gcc.gnu.org, libstdc++@gcc.gnu.org
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: multipart/alternative; boundary="00000000000056f27005f7567a70"
X-Spam-Status: No, score=-5.9 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,HTML_MESSAGE,RCVD_IN_DNSWL_NONE,RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

--00000000000056f27005f7567a70
Content-Type: text/plain; charset="UTF-8"

On Wed, 8 Mar 2023 at 14:09, Dimitrij Mijoski via Libstdc++ <
libstdc++@gcc.gnu.org> wrote:

> This patch fixes the handling of surrogate code points in all standard
> facets for transcoding Unicode that are based on std::codecvt. Surrogate
> code points should always be treated as an error. On the other hand,
> surrogate code units can only appear in UTF-16, and only when they come
> in a proper pair.
>
> Additionally, it fixes a bug in std::codecvt_utf16::in(): when an odd
> number of bytes was given in the range [from, from_end), error was always
> returned. The last byte in such a range does not form a full UTF-16 code
> unit, so we cannot decide that it is an error; partial should be returned
> instead.
>
> The testsuite for testing these facets was updated in the following
> order:
>
> 1. All functions that test codecvts that work with UTF-8 were refactored
>    and made more generic so they accept codecvt that works with the char
>    type char8_t.
> 2. The same functions were updated with new test cases for transcoding
>    errors and now additionally test for surrogates, overlong UTF-8
>    sequences, code points out of the Unicode range, and more tests for
>    missing leading and trailing code units.
> 3. New tests were added to test codecvt_utf16 in both of its variants,
>    UTF-16 <-> UTF-32/UCS-4 and UTF-16 <-> UCS-2.
>

Thanks, the patch looks OK to my uninformed eye, but I'm seeing a new
regression:

/home/jwakely/src/gcc/gcc/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_utf16/79980.cc:86:
void test06(): Assertion 'result == u"from_bytes failed"' failed.
FAIL: 22_locale/codecvt/codecvt_utf16/79980.cc execution test


Also, I see that libc++ fails some of your new tests the same way as
current libstdc++ does:

unicode:
/home/jwakely/src/gcc/gcc/libstdc++-v3/testsuite/22_locale/codecvt/codecvt_unicode.h:298:
void utf8_to_utf32_in_error(const std::codecvt<InternT, ExternT, mbstate_t>
&) [InternT = char32_t, ExternT = char]: Assertion `res == cvt.error'
failed.
Aborted (core dumped)

Does that mean they have the same problem? Or is the test wrong? Or is your
patch implementing something that contradicts the requirements of the
standard? My guess is that libc++ simply handles surrogates the same way
current libstdc++ does, but I'd like to be sure that's right.
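
To make the surrogate question concrete, here is a minimal sketch of a
probe. It is not taken from the patch or the testsuite, and the names
is_surrogate, encodes_surrogate and probe_surrogate_utf8 are hypothetical
helpers; only std::codecvt, std::use_facet and std::locale::classic() are
standard facilities. It feeds the bytes ED A0 80 (which would decode to the
surrogate U+D800) to codecvt<char32_t, char, mbstate_t>::in() and returns
whatever the library reports. As I read the patch description, the intended
result is error; the sketch deliberately does not assume what any particular
library version returns.

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>   // std::codecvt, std::use_facet, std::locale

// True if c lies in the surrogate range U+D800..U+DFFF.
// (Hypothetical helper, not part of the patch.)
constexpr bool is_surrogate(char32_t c)
{
  return c >= 0xD800 && c <= 0xDFFF;
}

// True if the first two bytes of a three-byte UTF-8-style sequence would
// decode to a surrogate: lead byte ED followed by A0..BF.
// (Hypothetical helper, not part of the patch.)
constexpr bool encodes_surrogate(unsigned char b0, unsigned char b1)
{
  return b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF;
}

// Ask the library to convert ED A0 80 to UTF-32 and return its verdict.
// The result depends on the library under test, so it is only reported.
inline std::codecvt_base::result probe_surrogate_utf8()
{
  using cvt_t = std::codecvt<char32_t, char, std::mbstate_t>;
  const cvt_t& cvt = std::use_facet<cvt_t>(std::locale::classic());

  const char in[] = "\xED\xA0\x80";   // would decode to U+D800
  char32_t out[4] = {};
  std::mbstate_t state{};
  const char* in_next = nullptr;
  char32_t* out_next = nullptr;

  return cvt.in(state, in, in + 3, in_next, out, out + 4, out_next);
}
```

Calling probe_surrogate_utf8() from a small main() and comparing the result
against std::codecvt_base::error on both libstdc++ and libc++ would show
whether the two libraries agree on this input.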

--00000000000056f27005f7567a70--