From: Jason Merrill <jason@redhat.com>
To: Jakub Jelinek <jakub@redhat.com>
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] libcpp: Add -Winvalid-utf8 warning [PR106655]
Date: Mon, 29 Aug 2022 17:15:26 -0400 [thread overview]
Message-ID: <e93528c7-efa8-a328-2af5-a2c2ce59dfe4@redhat.com> (raw)
In-Reply-To: <Ywx1jiBRWLmv2NOZ@tucnak>
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8; format=flowed, Size: 14897 bytes --]
On 8/29/22 04:15, Jakub Jelinek wrote:
> Hi!
>
> The following patch introduces a new warning - -Winvalid-utf8 similarly
> to what clang now has - to diagnose invalid UTF-8 byte sequences in
> comments. In identifiers and in string literals it should be diagnosed
> already but comment content hasn't been really verified.
>
> I'm not sure if this is enough to say P2295R6 is implemented or not.
>
> The problem is that in the most common case, people don't use
> -finput-charset= option and the sources often are UTF-8, but sometimes
> could be some ASCII compatible single byte encoding where non-ASCII
> characters only appear in comments. So having the warning off by default
> is IMO desirable. Now, if people use explicit -finput-charset=UTF-8,
> perhaps we could make the warning on by default for C++23 and use pedwarn
> instead of warning, because then the user told us explicitly that the source
> is UTF-8. From the paper I understood one of the implementation options
> is to claim that the implementation supports 2 encodings, UTF-8 and UTF-8
> like encodings where invalid UTF-8 characters in comments are replaced say
> by spaces, where the latter could be the default and the former only
> used if -finput-charset=UTF-8 -Werror=invalid-utf8 options are used.
>
> Thoughts on this?
That sounds good to me.
> 2022-08-29 Jakub Jelinek <jakub@redhat.com>
>
> PR c++/106655
> libcpp/
> * include/cpplib.h (struct cpp_options): Implement C++23
> P2295R6 - Support for UTF-8 as a portable source file encoding.
> Add cpp_warn_invalid_utf8 field.
> (enum cpp_warning_reason): Add CPP_W_INVALID_UTF8 enumerator.
> * init.cc (cpp_create_reader): Initialize cpp_warn_invalid_utf8.
> * lex.cc (utf8_continuation): New const variable.
> (utf8_signifier): Move earlier in the file.
> (_cpp_warn_invalid_utf8): New function.
> (_cpp_skip_block_comment): Handle -Winvalid-utf8 warning.
> (skip_line_comment): Likewise.
> gcc/
> * doc/invoke.texi (-Winvalid-utf8): Document it.
> gcc/c-family/
> * c.opt (-Winvalid-utf8): New warning.
> gcc/testsuite/
> * c-c++-common/cpp/Winvalid-utf8-1.c: New test.
>
> --- libcpp/include/cpplib.h.jj 2022-08-25 14:25:23.866912426 +0200
> +++ libcpp/include/cpplib.h 2022-08-27 12:17:55.185022807 +0200
> @@ -560,6 +560,9 @@ struct cpp_options
> cpp_bidirectional_level. */
> unsigned char cpp_warn_bidirectional;
>
> + /* True if libcpp should warn about invalid UTF-8 characters in comments. */
> + bool cpp_warn_invalid_utf8;
> +
> /* Dependency generation. */
> struct
> {
> @@ -666,7 +669,8 @@ enum cpp_warning_reason {
> CPP_W_CXX11_COMPAT,
> CPP_W_CXX20_COMPAT,
> CPP_W_EXPANSION_TO_DEFINED,
> - CPP_W_BIDIRECTIONAL
> + CPP_W_BIDIRECTIONAL,
> + CPP_W_INVALID_UTF8
> };
>
> /* Callback for header lookup for HEADER, which is the name of a
> --- libcpp/init.cc.jj 2022-08-24 09:55:44.571876638 +0200
> +++ libcpp/init.cc 2022-08-27 12:18:54.559246323 +0200
> @@ -227,6 +227,7 @@ cpp_create_reader (enum c_lang lang, cpp
> CPP_OPTION (pfile, ext_numeric_literals) = 1;
> CPP_OPTION (pfile, warn_date_time) = 0;
> CPP_OPTION (pfile, cpp_warn_bidirectional) = bidirectional_unpaired;
> + CPP_OPTION (pfile, cpp_warn_invalid_utf8) = 0;
>
> /* Default CPP arithmetic to something sensible for the host for the
> benefit of dumb users like fix-header. */
> --- libcpp/lex.cc.jj 2022-08-26 09:24:12.089615949 +0200
> +++ libcpp/lex.cc 2022-08-27 13:43:40.560769087 +0200
> @@ -1704,6 +1704,59 @@ maybe_warn_bidi_on_char (cpp_reader *pfi
> bidi::on_char (kind, ucn_p, loc);
> }
>
> +static const cppchar_t utf8_continuation = 0x80;
> +static const cppchar_t utf8_signifier = 0xC0; >
> +/* Emit -Winvalid-utf8 warning on invalid UTF-8 character starting
> + at PFILE->buffer->cur. Return a pointer after the diagnosed
> + invalid character. */
> +
> +static const uchar *
> +_cpp_warn_invalid_utf8 (cpp_reader *pfile)
> +{
> + cpp_buffer *buffer = pfile->buffer;
> + const uchar *cur = buffer->cur;
> +
> + if (cur[0] < utf8_signifier
> + || cur[1] < utf8_continuation || cur[1] >= utf8_signifier)
> + {
> + cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> + pfile->line_table->highest_line,
> + CPP_BUF_COL (buffer),
> + "invalid UTF-8 character <%x> in comment",
> + cur[0]);
> + return cur + 1;
> + }
> + else if (cur[2] < utf8_continuation || cur[2] >= utf8_signifier)
Unicode table 3-7 says that the second byte is sometimes restricted to
less than this range. Hmm, it looks like one_utf8_to_cppchar doesn't
check that, either.
Did you consider adding error reporting to one_utf8_to_cppchar? It
might be better to localize the utf8 logic.
> + {
> + cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> + pfile->line_table->highest_line,
> + CPP_BUF_COL (buffer),
> + "invalid UTF-8 character <%x><%x> in comment",
> + cur[0], cur[1]);
> + return cur + 2;
> + }
> + else if (cur[3] < utf8_continuation || cur[3] >= utf8_signifier)
> + {
> + cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> + pfile->line_table->highest_line,
> + CPP_BUF_COL (buffer),
> + "invalid UTF-8 character <%x><%x><%x> in comment",
> + cur[0], cur[1], cur[2]);
> + return cur + 3;
> + }
> + else
> + {
> + cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> + pfile->line_table->highest_line,
> + CPP_BUF_COL (buffer),
> + "invalid UTF-8 character "
> + "<%x><%x><%x><%x> in comment",
> + cur[0], cur[1], cur[2], cur[3]);
> + return cur + 4;
> + }
> +}
> +
> /* Skip a C-style block comment. We find the end of the comment by
> seeing if an asterisk is before every '/' we encounter. Returns
> nonzero if comment terminated by EOF, zero otherwise.
> @@ -1716,6 +1769,8 @@ _cpp_skip_block_comment (cpp_reader *pfi
> const uchar *cur = buffer->cur;
> uchar c;
> const bool warn_bidi_p = pfile->warn_bidi_p ();
> + const bool warn_invalid_utf8_p = CPP_OPTION (pfile, cpp_warn_invalid_utf8);
> + const bool warn_bidi_or_invalid_utf8_p = warn_bidi_p | warn_invalid_utf8_p;
>
> cur++;
> if (*cur == '/')
> @@ -1765,13 +1820,32 @@ _cpp_skip_block_comment (cpp_reader *pfi
>
> cur = buffer->cur;
> }
> - /* If this is a beginning of a UTF-8 encoding, it might be
> - a bidirectional control character. */
> - else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p)
> - {
> - location_t loc;
> - bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc);
> - maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
> + else if (__builtin_expect (c >= utf8_continuation, 0)
> + && warn_bidi_or_invalid_utf8_p)
> + {
> + /* If this is a beginning of a UTF-8 encoding, it might be
> + a bidirectional control character. */
> + if (c == bidi::utf8_start && warn_bidi_p)
> + {
> + location_t loc;
> + bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc);
> + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
> + }
> + if (!warn_invalid_utf8_p)
> + continue;
> + if (c >= utf8_signifier)
> + {
> + cppchar_t s;
> + const uchar *pstr = cur - 1;
> + if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
> + && s <= 0x0010FFFF)
> + {
> + cur = pstr;
> + continue;
> + }
> + }
> + buffer->cur = cur - 1;
> + cur = _cpp_warn_invalid_utf8 (pfile);
> }
> }
>
> @@ -1789,11 +1863,13 @@ skip_line_comment (cpp_reader *pfile)
> cpp_buffer *buffer = pfile->buffer;
> location_t orig_line = pfile->line_table->highest_line;
> const bool warn_bidi_p = pfile->warn_bidi_p ();
> + const bool warn_invalid_utf8_p = CPP_OPTION (pfile, cpp_warn_invalid_utf8);
> + const bool warn_bidi_or_invalid_utf8_p = warn_bidi_p | warn_invalid_utf8_p;
>
> - if (!warn_bidi_p)
> + if (!warn_bidi_or_invalid_utf8_p)
> while (*buffer->cur != '\n')
> buffer->cur++;
> - else
> + else if (!warn_invalid_utf8_p)
> {
> while (*buffer->cur != '\n'
> && *buffer->cur != bidi::utf8_start)
> @@ -1813,6 +1889,38 @@ skip_line_comment (cpp_reader *pfile)
> maybe_warn_bidi_on_close (pfile, buffer->cur);
> }
> }
> + else
> + {
> + while (*buffer->cur != '\n')
> + {
> + if (*buffer->cur < utf8_continuation)
> + {
> + buffer->cur++;
> + continue;
> + }
> + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)
> + && warn_bidi_p)
> + {
> + location_t loc;
> + bidi::kind kind = get_bidi_utf8 (pfile, buffer->cur, &loc);
> + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
> + }
> + if (*buffer->cur >= utf8_signifier)
> + {
> + cppchar_t s;
> + const uchar *pstr = buffer->cur;
> + if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
> + && s <= 0x0010FFFF)
> + {
> + buffer->cur = pstr;
> + continue;
> + }
> + }
> + buffer->cur = _cpp_warn_invalid_utf8 (pfile);
> + }
> + if (warn_bidi_p)
> + maybe_warn_bidi_on_close (pfile, buffer->cur);
> + }
>
> _cpp_process_line_notes (pfile, true);
> return orig_line != pfile->line_table->highest_line;
> @@ -1919,8 +2027,6 @@ warn_about_normalization (cpp_reader *pf
> }
> }
>
> -static const cppchar_t utf8_signifier = 0xC0;
> -
> /* Returns TRUE if the sequence starting at buffer->cur is valid in
> an identifier. FIRST is TRUE if this starts an identifier. */
>
> --- gcc/doc/invoke.texi.jj 2022-08-27 09:14:43.047696028 +0200
> +++ gcc/doc/invoke.texi 2022-08-27 14:05:22.417755406 +0200
> @@ -365,9 +365,9 @@ Objective-C and Objective-C++ Dialects}.
> -Winfinite-recursion @gol
> -Winit-self -Winline -Wno-int-conversion -Wint-in-bool-context @gol
> -Wno-int-to-pointer-cast -Wno-invalid-memory-model @gol
> --Winvalid-pch -Wjump-misses-init -Wlarger-than=@var{byte-size} @gol
> --Wlogical-not-parentheses -Wlogical-op -Wlong-long @gol
> --Wno-lto-type-mismatch -Wmain -Wmaybe-uninitialized @gol
> +-Winvalid-pch -Winvalid-utf8 -Wjump-misses-init @gol
> +-Wlarger-than=@var{byte-size} -Wlogical-not-parentheses -Wlogical-op @gol
> +-Wlong-long -Wno-lto-type-mismatch -Wmain -Wmaybe-uninitialized @gol
> -Wmemset-elt-size -Wmemset-transposed-args @gol
> -Wmisleading-indentation -Wmissing-attributes -Wmissing-braces @gol
> -Wmissing-field-initializers -Wmissing-format-attribute @gol
> @@ -9569,6 +9569,11 @@ different size.
> Warn if a precompiled header (@pxref{Precompiled Headers}) is found in
> the search path but cannot be used.
>
> +@item -Winvalid-utf8
> +@opindex Winvalid-utf8
> +@opindex Wno-invalid-utf8
> +Warn if an invalid UTF-8 character is inside of a comment.
> +
> @item -Wlong-long
> @opindex Wlong-long
> @opindex Wno-long-long
> --- gcc/c-family/c.opt.jj 2022-08-27 09:14:43.036696173 +0200
> +++ gcc/c-family/c.opt 2022-08-27 14:03:06.328534617 +0200
> @@ -821,6 +821,10 @@ Winvalid-pch
> C ObjC C++ ObjC++ CPP(warn_invalid_pch) CppReason(CPP_W_INVALID_PCH) Var(cpp_warn_invalid_pch) Init(0) Warning
> Warn about PCH files that are found but not used.
>
> +Winvalid-utf8
> +C objC C++ ObjC++ CPP(cpp_warn_invalid_utf8) CppReason(CPP_W_INVALID_UTF8) Var(warn_invalid_utf8) Init(0) Warning
> +Warn about invalid UTF-8 characters in comments.
> +
> Wjump-misses-init
> C ObjC Var(warn_jump_misses_init) Warning LangEnabledby(C ObjC,Wc++-compat)
> Warn when a jump misses a variable initialization.
> --- gcc/testsuite/c-c++-common/cpp/Winvalid-utf8-1.c.jj 2022-08-27 14:01:51.115517571 +0200
> +++ gcc/testsuite/c-c++-common/cpp/Winvalid-utf8-1.c 2022-08-27 14:33:21.466802817 +0200
> @@ -0,0 +1,39 @@
> +// P2295R6 - Support for UTF-8 as a portable source file encoding
> +// This test intentionally contains various byte sequences which are not valid UTF-8
> +// { dg-do preprocess }
> +// { dg-options "-finput-charset=UTF-8 -Winvalid-utf8" }
> +
> +// aÂß¿à í¿îðô¿¿a { dg-bogus "invalid UTF-8 character" }
> +// a�a { dg-warning "invalid UTF-8 character <80> in comment" }
> +// a�a { dg-warning "invalid UTF-8 character <bf> in comment" }
> +// a�a { dg-warning "invalid UTF-8 character <c0> in comment" }
> +// a�a { dg-warning "invalid UTF-8 character <c1> in comment" }
> +// a�a { dg-warning "invalid UTF-8 character <f5> in comment" }
> +// a�a { dg-warning "invalid UTF-8 character <ff> in comment" }
> +// a�a { dg-warning "invalid UTF-8 character <c2> in comment" }
> +// a�a { dg-warning "invalid UTF-8 character <e0> in comment" }
> +// a���a { dg-warning "invalid UTF-8 character <e0><80><bf> in comment" }
> +// a���a { dg-warning "invalid UTF-8 character <e0><9f><80> in comment" }
> +// a��a { dg-warning "invalid UTF-8 character <e0><bf> in comment" }
> +// a��a { dg-warning "invalid UTF-8 character <ec><80> in comment" }
> +// a���a { dg-warning "invalid UTF-8 character <ed><a0><80> in comment" }
> +// a����a { dg-warning "invalid UTF-8 character <f0><80><80><80> in comment" }
> +// a����a { dg-warning "invalid UTF-8 character <f0><8f><bf><bf> in comment" }
> +// a����a { dg-warning "invalid UTF-8 character <f4><90><80><80> in comment" }
> +/* aÂß¿à í¿îðô¿¿a { dg-bogus "invalid UTF-8 character" } */
> +/* a�a { dg-warning "invalid UTF-8 character <80> in comment" } */
> +/* a�a { dg-warning "invalid UTF-8 character <bf> in comment" } */
> +/* a�a { dg-warning "invalid UTF-8 character <c0> in comment" } */
> +/* a�a { dg-warning "invalid UTF-8 character <c1> in comment" } */
> +/* a�a { dg-warning "invalid UTF-8 character <f5> in comment" } */
> +/* a�a { dg-warning "invalid UTF-8 character <ff> in comment" } */
> +/* a�a { dg-warning "invalid UTF-8 character <c2> in comment" } */
> +/* a�a { dg-warning "invalid UTF-8 character <e0> in comment" } */
> +/* a���a { dg-warning "invalid UTF-8 character <e0><80><bf> in comment" } */
> +/* a���a { dg-warning "invalid UTF-8 character <e0><9f><80> in comment" } */
> +/* a��a { dg-warning "invalid UTF-8 character <e0><bf> in comment" } */
> +/* a��a { dg-warning "invalid UTF-8 character <ec><80> in comment" } */
> +/* a���a { dg-warning "invalid UTF-8 character <ed><a0><80> in comment" } */
> +/* a����a { dg-warning "invalid UTF-8 character <f0><80><80><80> in comment" } */
> +/* a����a { dg-warning "invalid UTF-8 character <f0><8f><bf><bf> in comment" } */
> +/* a����a { dg-warning "invalid UTF-8 character <f4><90><80><80> in comment" } */
>
> Jakub
>
next prev parent reply other threads:[~2022-08-29 21:15 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-08-29 8:15 Jakub Jelinek
2022-08-29 21:15 ` Jason Merrill [this message]
2022-08-29 21:35 ` Jakub Jelinek
2022-08-29 21:54 ` Jakub Jelinek
2022-08-30 3:31 ` Jason Merrill
2022-08-30 15:20 ` [PATCH] libcpp, v2: " Jakub Jelinek
2022-08-30 21:51 ` Jason Merrill
2022-08-31 11:14 ` [PATCH] libcpp, v3: " Jakub Jelinek
2022-08-31 13:55 ` Jason Merrill
2022-08-31 14:15 ` [PATCH] libcpp, v4: " Jakub Jelinek
2022-08-31 14:24 ` Jason Merrill
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=e93528c7-efa8-a328-2af5-a2c2ce59dfe4@redhat.com \
--to=jason@redhat.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=jakub@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).