From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <f125824b-1632-9a44-2fd6-33021d911613@redhat.com>
Date: Tue, 18 Jul 2023 16:33:24 -0400
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.13.0
From: Jason Merrill <jason@redhat.com>
Subject: Re: [PATCH v2] libcpp: Handle extended characters in user-defined
 literal suffix [PR103902]
To: Lewis Hyatt <lhyatt@gmail.com>, gcc-patches@gcc.gnu.org
References: <CAA_5UQ7-Sz4OAEB_qqAyswt6JGCOpiRxACCUjWoGPpwG4dEdVQ@mail.gmail.com>
 <20230302230703.2234902-1-lhyatt@gmail.com>
In-Reply-To: <20230302230703.2234902-1-lhyatt@gmail.com>
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
List-Id: <gcc-patches.gcc.gnu.org>

On 3/2/23 18:07, Lewis Hyatt wrote:
> The PR complains that we do not handle UTF-8 in the suffix for a user-defined
> literal, such as:
> 
> bool operator ""_π (unsigned long long);
> 
> In fact we don't handle any extended identifier characters there, whether
> UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
> the "" tokens is included, since then the identifier is lexed in the "normal"
> way as its own token. But when it is lexed as part of the string token, this
> is handled in lex_string() with a one-off loop that is not aware of extended
> characters.
> 
> This patch fixes it by adding a new function scan_cur_identifier() that can be
> used to lex an identifier while in the middle of lexing another token.
> 
> BTW, the other place that has been mis-lexing identifiers is
> lex_identifier_intern(), which is used to implement #pragma push_macro
> and #pragma pop_macro. This does not support extended characters either.
> I will add that in a subsequent patch, because it can't directly reuse the
> new function, but rather needs to lex from a string instead of a cpp_buffer.
> 
> With scan_cur_identifier(), we do also correctly warn about bidi and
> normalization issues in the extended identifiers comprising the suffix.
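
To make the failure mode described above concrete, here is a minimal sketch of an
ASCII-only suffix scan next to one that also accepts extended characters. It is not
the libcpp code: the helper names are hypothetical, UCNs are not handled, and any
byte >= 0x80 is assumed to belong to a valid UTF-8 identifier character, whereas a
real lexer decodes and validates the code point.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helpers for illustration only; these are not libcpp functions.
static bool is_ascii_id_start(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

static bool is_ascii_id_cont(unsigned char c) {
    return is_ascii_id_start(c) || (c >= '0' && c <= '9');
}

// Assumption: '$' and every non-ASCII byte count as identifier characters.
static bool is_extended_id_byte(unsigned char c) {
    return c == '$' || c >= 0x80;
}

// ASCII-only scan, in the spirit of the one-off loop the patch replaces.
// Returns the number of bytes consumed as the suffix.
std::size_t scan_suffix_ascii(std::string_view s) {
    if (s.empty() || !is_ascii_id_start(s[0]))
        return 0;
    std::size_t n = 1;
    while (n < s.size() && is_ascii_id_cont(s[n]))
        ++n;
    return n;
}

// Scan that is aware of extended characters (simplified, see assumptions above).
std::size_t scan_suffix_extended(std::string_view s) {
    if (s.empty())
        return 0;
    unsigned char first = s[0];
    if (!is_ascii_id_start(first) && !is_extended_id_byte(first))
        return 0;
    std::size_t n = 1;
    while (n < s.size()
           && (is_ascii_id_cont(s[n]) || is_extended_id_byte(s[n])))
        ++n;
    return n;
}
```

For the bytes that follow the "" in the example declaration, the ASCII-only scan
stops after the underscore, while the extended scan consumes the underscore plus
the two UTF-8 bytes of π.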
> 
> libcpp/ChangeLog:
> 
> 	PR preprocessor/103902
> 	* lex.cc (identifier_diagnostics_on_lex): New function refactoring
> 	some common code.
> 	(lex_identifier_intern): Use the new function.
> 	(lex_identifier): Don't run identifier diagnostics here, rather let
> 	the call site do it when needed.
> 	(_cpp_lex_direct): Adjust the call sites of lex_identifier ()
> 	accordingly.
> 	(struct scan_id_result): New struct.
> 	(scan_cur_identifier): New function.
> 	(create_literal2): New function.
> 	(lit_accum::create_literal2): New function.
> 	(is_macro): Folded into new function...
> 	(maybe_ignore_udl_macro_suffix): ...here.
> 	(is_macro_not_literal_suffix): Folded likewise.
> 	(lex_raw_string): Handle UTF-8 in UDL suffix via scan_cur_identifier ().
> 	(lex_string): Likewise.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR preprocessor/103902
> 	* g++.dg/cpp0x/udlit-extended-id-1.C: New test.
> 	* g++.dg/cpp0x/udlit-extended-id-2.C: New test.
> 	* g++.dg/cpp0x/udlit-extended-id-3.C: New test.
> 	* g++.dg/cpp0x/udlit-extended-id-4.C: New test.
> ---
> 
> Notes:
>      Hello-
>      
>      This is the updated version of the patch, incorporating feedback from Jakub
>      and Jason, most recently discussed here:
>      
>      https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html
>      
>      Please let me know how it looks? It is simpler than before with the new
>      approach. Thanks!
>      
>      One thing to note. As Jason clarified for me, a usage like this:
>      
>       #pragma GCC poison _x
>      const char * operator "" _x (const char *, unsigned long);
>      
>      The space between the "" and the _x is currently allowed but will be
>      deprecated in C++23. GCC currently will complain about the poisoned use of
>      _x in this case, and this patch, which is just focused on handling UTF-8
>      properly, does not change this. But it seems that it would be correct
>      not to apply poison in this case. I can try to follow up with a patch to do
>      so, if it seems worthwhile? Given the syntax is deprecated, maybe it's not
>      worth it...

Right, the deprecation is intended to avoid this problem; it's fine for 
us to complain.

>      For the time being, this patch does add a testcase for the above and xfails
>      it. For the case where no space is present, which is the part touched by the
>      present patch, existing behavior is preserved correctly and no diagnostics
>      such as poison are issued for the UDL suffix. (Contrary to v1 of this
>      patch.)
>      
>      Thanks! bootstrap + regtested all languages on x86-64 Linux with
>      no regressions.
>      

> -/* Returns true if a macro has been defined.
> -   This might not work if compile with -save-temps,
> -   or preprocess separately from compilation.  */
> +/* Helper function to check if a string format macro, say from inttypes.h, is
> +   placed touching a string literal, in which case it could be parsed as a C++11
> +   user-defined string literal thus breaking the program.

>     User-defined literals
> +   outside of namespace std must start with a single underscore, so assume
> +   anything of that form really is a UDL suffix.  We don't need to worry about
> +   UDLs defined inside namespace std because their names are reserved, so cannot
> +   be used as macro names in valid programs.

I'd prefer to leave this rationale comment on the _[^_] check rather 
than hoist it out of the function; OK with that change.  Thank you very 
much for your persistence.

>     Return TRUE if the UDL should be
> +   ignored for now and preserved for potential macro expansion.  */
>   
>   static bool
> -is_macro(cpp_reader *pfile, const uchar *base)
> +maybe_ignore_udl_macro_suffix (cpp_reader *pfile, location_t src_loc,
> +			       const uchar *suffix_begin, cpp_hashnode *node)
>   {
> -  const uchar *cur = base;
> -  if (! ISIDST (*cur))
> +  if ((suffix_begin[0] == '_' && suffix_begin[1] != '_')
> +      || !cpp_macro_p (node))
>       return false;
> -  unsigned int hash = HT_HASHSTEP (0, *cur);
> -  ++cur;
> -  while (ISIDNUM (*cur))
> -    {
> -      hash = HT_HASHSTEP (hash, *cur);
> -      ++cur;
> -    }
> -  hash = HT_HASHFINISH (hash, cur - base);
>   
> -  cpp_hashnode *result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
> -					base, cur - base, hash, HT_NO_INSERT));
> -
> -  return result && cpp_macro_p (result);
> +  /* Maybe raise a warning here; caller should arrange not to consume
> +     the tokens.  */
> +  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
> +    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, src_loc, 0,
> +			   "invalid suffix on literal; C++11 requires a space "
> +			   "between literal and string macro");
> +  return true;
>   }
>   
> -/* Returns true if a literal suffix does not have the expected form
> -   and is defined as a macro.  */
> -
> -static bool
> -is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base)
>   {
> -  /* User-defined literals outside of namespace std must start with a single
> -     underscore, so assume anything of that form really is a UDL suffix.
> -     We don't need to worry about UDLs defined inside namespace std because
> -     their names are reserved, so cannot be used as macro names in valid
> -     programs.  */
> -  if (base[0] == '_' && base[1] != '_')
> -    return false;
> -  return is_macro (pfile, base);