Re: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Jason Merrill <jason@redhat.com>
To: Lewis Hyatt <lhyatt@gmail.com>, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
Date: Tue, 18 Jul 2023 16:33:24 -0400	[thread overview]
Message-ID: <f125824b-1632-9a44-2fd6-33021d911613@redhat.com> (raw)
In-Reply-To: <20230302230703.2234902-1-lhyatt@gmail.com>

On 3/2/23 18:07, Lewis Hyatt wrote:
> The PR complains that we do not handle UTF-8 in the suffix for a user-defined
> literal, such as:
> 
> bool operator ""_π (unsigned long long);
> 
> In fact we don't handle any extended identifier characters there, whether
> UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
> the "" tokens is included, since then the identifier is lexed in the "normal"
> way as its own token. But when it is lexed as part of the string token, this
> is handled in lex_string() with a one-off loop that is not aware of extended
> characters.
> 
> This patch fixes it by adding a new function scan_cur_identifier() that can be
> used to lex an identifier while in the middle of lexing another token.
> 
> BTW, the other place that has been mis-lexing identifiers is
> lex_identifier_intern(), which is used to implement #pragma push_macro
> and #pragma pop_macro. This does not support extended characters either.
> I will add that in a subsequent patch, because it can't directly reuse the
> new function, but rather needs to lex from a string instead of a cpp_buffer.
> 
> With scan_cur_identifier(), we do also correctly warn about bidi and
> normalization issues in the extended identifiers comprising the suffix.
> 
> libcpp/ChangeLog:
> 
> 	PR preprocessor/103902
> 	* lex.cc (identifier_diagnostics_on_lex): New function refactoring
> 	some common code.
> 	(lex_identifier_intern): Use the new function.
> 	(lex_identifier): Don't run identifier diagnostics here, rather let
> 	the call site do it when needed.
> 	(_cpp_lex_direct): Adjust the call sites of lex_identifier ()
> 	acccordingly.
> 	(struct scan_id_result): New struct.
> 	(scan_cur_identifier): New function.
> 	(create_literal2): New function.
> 	(lit_accum::create_literal2): New function.
> 	(is_macro): Folded into new function...
> 	(maybe_ignore_udl_macro_suffix): ...here.
> 	(is_macro_not_literal_suffix): Folded likewise.
> 	(lex_raw_string): Handle UTF-8 in UDL suffix via scan_cur_identifier ().
> 	(lex_string): Likewise.
> 
> gcc/testsuite/ChangeLog:
> 
> 	PR preprocessor/103902
> 	* g++.dg/cpp0x/udlit-extended-id-1.C: New test.
> 	* g++.dg/cpp0x/udlit-extended-id-2.C: New test.
> 	* g++.dg/cpp0x/udlit-extended-id-3.C: New test.
> 	* g++.dg/cpp0x/udlit-extended-id-4.C: New test.
> ---
> 
> Notes:
>      Hello-
>      
>      This is the updated version of the patch, incorporating feedback from Jakub
>      and Jason, most recently discussed here:
>      
>      https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html
>      
>      Please let me know how it looks? It is simpler than before with the new
>      approach. Thanks!
>      
>      One thing to note. As Jason clarified for me, a usage like this:
>      
>       #pragma GCC poison _x
>      const char * operator "" _x (const char *, unsigned long);
>      
>      The space between the "" and the _x is currently allowed but will be
>      deprecated in C++23. GCC currently will complain about the poisoned use of
>      _x in this case, and this patch, which is just focused on handling UTF-8
>      properly, does not change this. But it seems that it would be correct
>      not to apply poison in this case. I can try to follow up with a patch to do
>      so, if it seems worthwhile? Given the syntax is deprecated, maybe it's not
>      worth it...

Right, the deprecation is intended to avoid this problem; it's fine for 
us to complain.

>      For the time being, this patch does add a testcase for the above and xfails
>      it. For the case where no space is present, which is the part touched by the
>      present patch, existing behavior is preserved correctly and no diagnostics
>      such as poison are issued for the UDL suffix. (Contrary to v1 of this
>      patch.)
>      
>      Thanks! bootstrap + regtested all languages on x86-64 Linux with
>      no regressions.
>      

> -/* Returns true if a macro has been defined.
> -   This might not work if compile with -save-temps,
> -   or preprocess separately from compilation.  */
> +/* Helper function to check if a string format macro, say from inttypes.h, is
> +   placed touching a string literal, in which case it could be parsed as a C++11
> +   user-defined string literal thus breaking the program.

>     User-defined literals
> +   outside of namespace std must start with a single underscore, so assume
> +   anything of that form really is a UDL suffix.  We don't need to worry about
> +   UDLs defined inside namespace std because their names are reserved, so cannot
> +   be used as macro names in valid programs.

I'd prefer to leave this rationale comment on the _[^_] check rather 
than hoist it out of the function; OK with that change.  Thank you very 
much for your persistence.

>     Return TRUE if the UDL should be
> +   ignored for now and preserved for potential macro expansion.  */
>   
>   static bool
> -is_macro(cpp_reader *pfile, const uchar *base)
> +maybe_ignore_udl_macro_suffix (cpp_reader *pfile, location_t src_loc,
> +			       const uchar *suffix_begin, cpp_hashnode *node)
>   {
> -  const uchar *cur = base;
> -  if (! ISIDST (*cur))
> +  if ((suffix_begin[0] == '_' && suffix_begin[1] != '_')
> +      || !cpp_macro_p (node))
>       return false;
> -  unsigned int hash = HT_HASHSTEP (0, *cur);
> -  ++cur;
> -  while (ISIDNUM (*cur))
> -    {
> -      hash = HT_HASHSTEP (hash, *cur);
> -      ++cur;
> -    }
> -  hash = HT_HASHFINISH (hash, cur - base);
>   
> -  cpp_hashnode *result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
> -					base, cur - base, hash, HT_NO_INSERT));
> -
> -  return result && cpp_macro_p (result);
> +  /* Maybe raise a warning here; caller should arrange not to consume
> +     the tokens.  */
> +  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
> +    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, src_loc, 0,
> +			   "invalid suffix on literal; C++11 requires a space "
> +			   "between literal and string macro");
> +  return true;
>   }
>   
> -/* Returns true if a literal suffix does not have the expected form
> -   and is defined as a macro.  */
> -
> -static bool
> -is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base)
>   {
> -  /* User-defined literals outside of namespace std must start with a single
> -     underscore, so assume anything of that form really is a UDL suffix.
> -     We don't need to worry about UDLs defined inside namespace std because
> -     their names are reserved, so cannot be used as macro names in valid
> -     programs.  */
> -  if (base[0] == '_' && base[1] != '_')
> -    return false;
> -  return is_macro (pfile, base);

     prev parent reply	other threads:[~2023-07-18 20:33 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-14 21:26 [PATCH] " Lewis Hyatt
2022-06-15 19:06 ` Lewis Hyatt
2022-09-26 22:27   ` Ping^3: " Lewis Hyatt
2023-02-10 16:30     ` Jakub Jelinek
2023-02-10 16:58       ` Lewis Hyatt
2023-02-10 17:10         ` Jakub Jelinek
2023-02-15 18:39     ` Jason Merrill
2023-02-15 23:18       ` Lewis Hyatt
2023-03-02 23:07         ` [PATCH v2] " Lewis Hyatt
2023-07-18 20:33           ` Jason Merrill [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f125824b-1632-9a44-2fd6-33021d911613@redhat.com \
    --to=jason@redhat.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=lhyatt@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).