From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <41f58edf-9f30-442b-b321-16f30496cec6@redhat.com>
Date: Fri, 27 Oct 2023 19:05:34 -0400
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH] c++: Implement C++26 P1854R4 - Making non-encodable
 string literals ill-formed [PR110341]
To: Jakub Jelinek <jakub@redhat.com>
Cc: gcc-patches@gcc.gnu.org
References: <ZOkT1OwRjZWBntFR@tucnak>
From: Jason Merrill <jason@redhat.com>
In-Reply-To: <ZOkT1OwRjZWBntFR@tucnak>
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
List-Id: <gcc-patches.gcc.gnu.org>

On 8/25/23 16:49, Jakub Jelinek wrote:
> Hi!
> 
> This paper, voted in as a DR, makes some multi-character literals ill-formed.
> 'abcd' stays valid, but e.g. 'á' is newly invalid in UTF-8 exec charset
> while valid e.g. in ISO-8859-1, because it is a single character which needs
> 2 bytes to be encoded.
> 
> The following patch does that by checking (only pedantically, especially
> because it is a DR), whenever we'd emit a -Wmultichar warning because the
> character constant has more than one byte in it, whether the number of
> bytes in the narrow string matches the number of bytes in CPP_STRING32
> divided by the char32_t size in bytes.  If it does, it is a normal
> multi-character literal constant and is diagnosed normally with
> -Wmultichar; if the number of bytes is larger, at least one of the c-chars
> in the sequence was encoded as 2+ bytes.
> 
> Now, doing it this way has 2 drawbacks.  Some of the diagnostics which don't
> result in cpp_interpret_string_1 failures can be printed twice, once
> when calling cpp_interpret_string_1 for CPP_CHAR, once for CPP_STRING32.
> And, functionally I think it must work 100% correctly if host source
> character set is UTF-8 (because all valid UTF-8 chars are encodable in
> UTF-32), but might not work for some control codes in UTF-EBCDIC if
> that is the source character set (though I don't know if we really actually
> support it, e.g. Linux iconv certainly doesn't).
> All we actually need is to count the number of c-chars in the literal; an
> alternative would be to write a custom character counter which would quietly
> interpret/skip over + count escape sequences and decode UTF-8 characters
> in between those escape sequences.  But we'd need to have something similar
> also for UTF-EBCDIC if it works at all, and from what I've looked, we don't
> have anything like that implemented in libcpp nor anywhere else in GCC.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> Or ok with some tweaks to avoid the second round of diagnostics from
> cpp_interpret_string_1/convert_escape?  Or reimplement that second time and
> count manually?
> 
> 2023-08-25  Jakub Jelinek  <jakub@redhat.com>
> 
> 	PR c++/110341
> libcpp/
> 	* charset.cc: Implement C++ 26 P1854R4 - Making non-encodable string
> 	literals ill-formed.
> 	(narrow_str_to_charconst): Change last type from cpp_ttype to
> 	const cpp_token *.  For C++ if pedantic and i > 1 in CPP_CHAR
> 	interpret token also as CPP_STRING32 and if number of characters
> 	in the CPP_STRING32 is smaller than number of bytes in CPP_CHAR,
> 	pedwarn on it.
> 	(cpp_interpret_charconst): Adjust narrow_str_to_charconst caller.
> gcc/testsuite/
> 	* g++.dg/cpp26/literals1.C: New test.
> 	* g++.dg/cpp26/literals2.C: New test.
> 	* g++.dg/cpp23/wchar-multi1.C (c, d): Expect an error rather than
> 	warning.
> 
> --- gcc/testsuite/g++.dg/cpp26/literals1.C.jj	2023-08-25 17:23:06.662878355 +0200
> +++ gcc/testsuite/g++.dg/cpp26/literals1.C	2023-08-25 17:37:03.085132304 +0200
> @@ -0,0 +1,65 @@
> +// C++26 P1854R4 - Making non-encodable string literals ill-formed
> +// { dg-do compile { target c++11 } }
> +// { dg-require-effective-target int32 }
> +// { dg-options "-pedantic-errors -finput-charset=UTF-8 -fexec-charset=UTF-8" }
> +
> +int d = '😁';						// { dg-error "character too large for character literal type" }
...
> +char16_t m = u'😁';					// { dg-error "character constant too long for its type" }

Why are these different diagnostics?  Why doesn't the first line already 
hit the existing diagnostic that the second gets?

Both could be clearer that the problem is that the single source 
character can't be encoded as a single execution character.
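
To illustrate what "a single source character can't be encoded as a single
execution character" means in practice, here is a rough sketch.  It is not
code from the patch or from libcpp: the names count_code_points, literal_kind
and classify are made up for illustration, and it assumes a UTF-8 execution
charset with escape sequences already resolved.  It counts code points
directly rather than going through a CPP_STRING32 interpretation as the
patch does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not GCC code): count the code points in a UTF-8 byte
// sequence by counting every byte that is not a continuation byte (10xxxxxx).
// Assumes well-formed UTF-8 with escape sequences already resolved.
inline std::size_t count_code_points(const std::string &bytes)
{
  std::size_t n = 0;
  for (unsigned char c : bytes)
    if ((c & 0xC0) != 0x80)
      ++n;
  return n;
}

enum class literal_kind { single_char, multi_char, non_encodable };

// Classify the execution-charset bytes of an ordinary character literal.
inline literal_kind classify(const std::string &exec_bytes)
{
  assert(!exec_bytes.empty());          // an empty character literal is invalid
  const std::size_t nbytes = exec_bytes.size();
  const std::size_t nchars = count_code_points(exec_bytes);
  if (nbytes == 1)
    return literal_kind::single_char;   // fits in one execution character
  if (nbytes == nchars)
    return literal_kind::multi_char;    // e.g. 'abcd', the -Wmultichar case
  return literal_kind::non_encodable;   // some c-char needed 2+ bytes
}
```

With these definitions, classify("abcd") gives multi_char, while
classify("\xC3\xA1") (the UTF-8 bytes of 'á') and
classify("\xF0\x9F\x98\x81") (the UTF-8 bytes of '😁') both give
non_encodable, which is the case a clearer diagnostic would describe.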

Jason