From: Jason Merrill <jason@redhat.com>
To: Lewis Hyatt <lhyatt@gmail.com>, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]
Date: Tue, 18 Jul 2023 16:33:24 -0400 [thread overview]
Message-ID: <f125824b-1632-9a44-2fd6-33021d911613@redhat.com> (raw)
In-Reply-To: <20230302230703.2234902-1-lhyatt@gmail.com>
On 3/2/23 18:07, Lewis Hyatt wrote:
> The PR complains that we do not handle UTF-8 in the suffix for a user-defined
> literal, such as:
>
> bool operator ""_π (unsigned long long);
>
> In fact we don't handle any extended identifier characters there, whether
> UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
> the "" tokens is included, since then the identifier is lexed in the "normal"
> way as its own token. But when it is lexed as part of the string token, this
> is handled in lex_string() with a one-off loop that is not aware of extended
> characters.
>
> This patch fixes it by adding a new function scan_cur_identifier() that can be
> used to lex an identifier while in the middle of lexing another token.
>
> BTW, the other place that has been mis-lexing identifiers is
> lex_identifier_intern(), which is used to implement #pragma push_macro
> and #pragma pop_macro. This does not support extended characters either.
> I will add that in a subsequent patch, because it can't directly reuse the
> new function, but rather needs to lex from a string instead of a cpp_buffer.
>
> With scan_cur_identifier(), we do also correctly warn about bidi and
> normalization issues in the extended identifiers comprising the suffix.
>
> libcpp/ChangeLog:
>
> PR preprocessor/103902
> * lex.cc (identifier_diagnostics_on_lex): New function refactoring
> some common code.
> (lex_identifier_intern): Use the new function.
> (lex_identifier): Don't run identifier diagnostics here, rather let
> the call site do it when needed.
> (_cpp_lex_direct): Adjust the call sites of lex_identifier ()
> acccordingly.
> (struct scan_id_result): New struct.
> (scan_cur_identifier): New function.
> (create_literal2): New function.
> (lit_accum::create_literal2): New function.
> (is_macro): Folded into new function...
> (maybe_ignore_udl_macro_suffix): ...here.
> (is_macro_not_literal_suffix): Folded likewise.
> (lex_raw_string): Handle UTF-8 in UDL suffix via scan_cur_identifier ().
> (lex_string): Likewise.
>
> gcc/testsuite/ChangeLog:
>
> PR preprocessor/103902
> * g++.dg/cpp0x/udlit-extended-id-1.C: New test.
> * g++.dg/cpp0x/udlit-extended-id-2.C: New test.
> * g++.dg/cpp0x/udlit-extended-id-3.C: New test.
> * g++.dg/cpp0x/udlit-extended-id-4.C: New test.
> ---
>
> Notes:
> Hello-
>
> This is the updated version of the patch, incorporating feedback from Jakub
> and Jason, most recently discussed here:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html
>
> Please let me know how it looks? It is simpler than before with the new
> approach. Thanks!
>
> One thing to note. As Jason clarified for me, a usage like this:
>
> #pragma GCC poison _x
> const char * operator "" _x (const char *, unsigned long);
>
> The space between the "" and the _x is currently allowed but will be
> deprecated in C++23. GCC currently will complain about the poisoned use of
> _x in this case, and this patch, which is just focused on handling UTF-8
> properly, does not change this. But it seems that it would be correct
> not to apply poison in this case. I can try to follow up with a patch to do
> so, if it seems worthwhile? Given the syntax is deprecated, maybe it's not
> worth it...
Right, the deprecation is intended to avoid this problem; it's fine for
us to complain.
> For the time being, this patch does add a testcase for the above and xfails
> it. For the case where no space is present, which is the part touched by the
> present patch, existing behavior is preserved correctly and no diagnostics
> such as poison are issued for the UDL suffix. (Contrary to v1 of this
> patch.)
>
> Thanks! bootstrap + regtested all languages on x86-64 Linux with
> no regressions.
>
> -/* Returns true if a macro has been defined.
> - This might not work if compile with -save-temps,
> - or preprocess separately from compilation. */
> +/* Helper function to check if a string format macro, say from inttypes.h, is
> + placed touching a string literal, in which case it could be parsed as a C++11
> + user-defined string literal thus breaking the program.
> User-defined literals
> + outside of namespace std must start with a single underscore, so assume
> + anything of that form really is a UDL suffix. We don't need to worry about
> + UDLs defined inside namespace std because their names are reserved, so cannot
> + be used as macro names in valid programs.
I'd prefer to leave this rationale comment on the _[^_] check rather
than hoist it out of the function; OK with that change. Thank you very
much for your persistence.
> Return TRUE if the UDL should be
> + ignored for now and preserved for potential macro expansion. */
>
> static bool
> -is_macro(cpp_reader *pfile, const uchar *base)
> +maybe_ignore_udl_macro_suffix (cpp_reader *pfile, location_t src_loc,
> + const uchar *suffix_begin, cpp_hashnode *node)
> {
> - const uchar *cur = base;
> - if (! ISIDST (*cur))
> + if ((suffix_begin[0] == '_' && suffix_begin[1] != '_')
> + || !cpp_macro_p (node))
> return false;
> - unsigned int hash = HT_HASHSTEP (0, *cur);
> - ++cur;
> - while (ISIDNUM (*cur))
> - {
> - hash = HT_HASHSTEP (hash, *cur);
> - ++cur;
> - }
> - hash = HT_HASHFINISH (hash, cur - base);
>
> - cpp_hashnode *result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
> - base, cur - base, hash, HT_NO_INSERT));
> -
> - return result && cpp_macro_p (result);
> + /* Maybe raise a warning here; caller should arrange not to consume
> + the tokens. */
> + if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
> + cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, src_loc, 0,
> + "invalid suffix on literal; C++11 requires a space "
> + "between literal and string macro");
> + return true;
> }
>
> -/* Returns true if a literal suffix does not have the expected form
> - and is defined as a macro. */
> -
> -static bool
> -is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base)
> {
> - /* User-defined literals outside of namespace std must start with a single
> - underscore, so assume anything of that form really is a UDL suffix.
> - We don't need to worry about UDLs defined inside namespace std because
> - their names are reserved, so cannot be used as macro names in valid
> - programs. */
> - if (base[0] == '_' && base[1] != '_')
> - return false;
> - return is_macro (pfile, base);
prev parent reply other threads:[~2023-07-18 20:33 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-06-14 21:26 [PATCH] " Lewis Hyatt
2022-06-15 19:06 ` Lewis Hyatt
2022-09-26 22:27 ` Ping^3: " Lewis Hyatt
2023-02-10 16:30 ` Jakub Jelinek
2023-02-10 16:58 ` Lewis Hyatt
2023-02-10 17:10 ` Jakub Jelinek
2023-02-15 18:39 ` Jason Merrill
2023-02-15 23:18 ` Lewis Hyatt
2023-03-02 23:07 ` [PATCH v2] " Lewis Hyatt
2023-07-18 20:33 ` Jason Merrill [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f125824b-1632-9a44-2fd6-33021d911613@redhat.com \
--to=jason@redhat.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=lhyatt@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).