From: Jason Merrill
To: Jakub Jelinek
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] c++: Implement C++26 P1854R4 - Making non-encodable string literals ill-formed [PR110341]
Date: Fri, 27 Oct 2023 19:05:34 -0400
Message-ID: <41f58edf-9f30-442b-b321-16f30496cec6@redhat.com>

On 8/25/23 16:49, Jakub Jelinek wrote:
> Hi!
>
> This paper voted in as DR makes some multi-character literals ill-formed.
> 'abcd' stays valid, but e.g. 'á' is newly invalid in UTF-8 exec charset
> while valid e.g. in ISO-8859-1, because it is a single character which needs
> 2 bytes to be encoded.
>
> The following patch does that by checking (only pedantically, especially
> because it is a DR) if we'd emit a -Wmultichar warning because character
> constant has more than one byte in it whether the number of bytes in the
> narrow string matches number of bytes in CPP_STRING32 divided by char32_t
> size in bytes. If it is, it is normal multi-character literal constant
> and is diagnosed normally with -Wmultichar, if the number of bytes is
> larger, at least one of the c-chars in the sequence was encoded as 2+
> bytes.
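
For readers following along, the byte-count versus character-count comparison described above can be illustrated with a minimal standalone sketch. This is not the libcpp code: `count_code_points`, `literal_kind` and `classify` are hypothetical names, the input is assumed to be valid UTF-8 with a UTF-8 execution charset, and escape sequences are ignored entirely.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is not
// a continuation byte (10xxxxxx). Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string &utf8)
{
  std::size_t n = 0;
  for (unsigned char c : utf8)
    if ((c & 0xC0) != 0x80)
      ++n;
  return n;
}

// Hypothetical classification, for illustration only.
enum class literal_kind { single, multichar, non_encodable };

// `body` is the text between the quotes of a narrow character literal,
// already in UTF-8 and containing no escape sequences (an assumption of
// this sketch; the patch handles escapes via cpp_interpret_string_1).
literal_kind classify(const std::string &body)
{
  const std::size_t bytes = body.size();
  const std::size_t chars = count_code_points(body);
  if (bytes <= 1)
    return literal_kind::single;        // ordinary one-byte literal
  if (bytes == chars)
    return literal_kind::multichar;     // e.g. 'abcd': -Wmultichar case
  return literal_kind::non_encodable;   // some c-char needed 2+ bytes
}
```

With this sketch, "abcd" has 4 bytes and 4 code points, so it stays an ordinary multi-character literal, while "á" has 2 bytes but 1 code point and falls into the newly ill-formed case.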
>
> Now, doing this way has 2 drawbacks, some of the diagnostics which don't
> result in cpp_interpret_string_1 failures can be printed twice, once
> when calling cpp_interpret_string_1 for CPP_CHAR, once for CPP_STRING32.
> And, functionally I think it must work 100% correctly if host source
> character set is UTF-8 (because all valid UTF-8 chars are encodable in
> UTF-32), but might not work for some control codes in UTF-EBCDIC if
> that is the source character set (though I don't know if we really actually
> support it, e.g. Linux iconv certainly doesn't).
> All we actually need is count the number of c-chars in the literal,
> alternative would be to write custom character counter which would quietly
> interpret/skip over + count escape sequences and decode UTF-8 characters
> in between those escape sequences. But we'd need to have something similar
> also for UTF-EBCDIC if it works at all, and from what I've looked, we don't
> have anything like that implemented in libcpp nor anywhere else in GCC.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> Or ok with some tweaks to avoid the second round of diagnostics from
> cpp_interpret_string_1/convert_escape? Or reimplement that second time and
> count manually?
>
> 2023-08-25 Jakub Jelinek
>
> PR c++/110341
> libcpp/
> * charset.cc: Implement C++ 26 P1854R4 - Making non-encodable string
> literals ill-formed.
> (narrow_str_to_charconst): Change last type from cpp_ttype to
> const cpp_token *. For C++ if pedantic and i > 1 in CPP_CHAR
> interpret token also as CPP_STRING32 and if number of characters
> in the CPP_STRING32 is larger than number of bytes in CPP_CHAR,
> pedwarn on it.
> (cpp_interpret_charconst): Adjust narrow_str_to_charconst caller.
> gcc/testsuite/
> * g++.dg/cpp26/literals1.C: New test.
> * g++.dg/cpp26/literals2.C: New test.
> * g++.dg/cpp23/wchar-multi1.C (c, d): Expect an error rather than
> warning.
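
As a rough illustration of the "count manually" alternative mentioned above, here is a minimal sketch of a counter that skips over escape sequences and UTF-8 sequences without interpreting them. Nothing here exists in libcpp: `count_c_chars` is a hypothetical name, the source charset is assumed to be UTF-8, only simple, octal, \x, \u and \U escapes are handled (no delimited \x{...}, \u{...}, \o{...} or \N{...} forms), and malformed input is not diagnosed.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: count the c-chars in the body of a character
// literal, given as UTF-8 source text with the quotes already stripped.
std::size_t count_c_chars(const std::string &body)
{
  auto is_hex = [](char ch) {
    return std::isxdigit(static_cast<unsigned char>(ch)) != 0;
  };
  auto is_oct = [](char ch) { return ch >= '0' && ch <= '7'; };

  const std::size_t n = body.size();
  std::size_t count = 0;
  std::size_t i = 0;
  while (i < n)
    {
      if (body[i] == '\\' && i + 1 < n)
        {
          const char e = body[i + 1];
          i += 2;  // skip the backslash and the escape introducer
          if (e == 'x')
            {
              // \x takes as many hex digits as follow.
              while (i < n && is_hex(body[i]))
                ++i;
            }
          else if (e == 'u' || e == 'U')
            {
              // \u takes 4 hex digits, \U takes 8.
              const std::size_t digits = (e == 'u') ? 4 : 8;
              for (std::size_t k = 0; k < digits && i < n && is_hex(body[i]); ++k)
                ++i;
            }
          else if (is_oct(e))
            {
              // Octal escapes have 1 to 3 digits; one is already consumed.
              for (int k = 0; k < 2 && i < n && is_oct(body[i]); ++k)
                ++i;
            }
          // Simple escapes such as \n or \' need nothing more skipped.
        }
      else
        {
          // Skip one UTF-8 sequence: a lead byte plus continuation bytes.
          ++i;
          while (i < n && (static_cast<unsigned char>(body[i]) & 0xC0) == 0x80)
            ++i;
        }
      ++count;  // each escape or UTF-8 sequence is one c-char
    }
  return count;
}
```

A UTF-EBCDIC source charset would need a different decoding branch, which is the gap the quoted text points out.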
>
> --- gcc/testsuite/g++.dg/cpp26/literals1.C.jj 2023-08-25 17:23:06.662878355 +0200
> +++ gcc/testsuite/g++.dg/cpp26/literals1.C 2023-08-25 17:37:03.085132304 +0200
> @@ -0,0 +1,65 @@
> +// C++26 P1854R4 - Making non-encodable string literals ill-formed
> +// { dg-do compile { target c++11 } }
> +// { dg-require-effective-target int32 }
> +// { dg-options "-pedantic-errors -finput-charset=UTF-8 -fexec-charset=UTF-8" }
> +
> +int d = '😁'; // { dg-error "character too large for character literal type" }
...
> +char16_t m = u'😁'; // { dg-error "character constant too long for its type" }

Why are these different diagnostics? Why doesn't the first line already
hit the existing diagnostic that the second gets?

Both could be clearer that the problem is that the single source
character can't be encoded as a single execution character.

Jason