Re: [PATCH] libcpp: Add -Winvalid-utf8 warning [PR106655]

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Jason Merrill <jason@redhat.com>
To: Jakub Jelinek <jakub@redhat.com>
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] libcpp: Add -Winvalid-utf8 warning [PR106655]
Date: Mon, 29 Aug 2022 17:15:26 -0400	[thread overview]
Message-ID: <e93528c7-efa8-a328-2af5-a2c2ce59dfe4@redhat.com> (raw)
In-Reply-To: <Ywx1jiBRWLmv2NOZ@tucnak>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8; format=flowed, Size: 14897 bytes --]

On 8/29/22 04:15, Jakub Jelinek wrote:
> Hi!
> 
> The following patch introduces a new warning - -Winvalid-utf8 similarly
> to what clang now has - to diagnose invalid UTF-8 byte sequences in
> comments.  In identifiers and in string literals it should be diagnosed
> already but comment content hasn't been really verified.
> 
> I'm not sure if this is enough to say P2295R6 is implemented or not.
> 
> The problem is that in the most common case, people don't use
> -finput-charset= option and the sources often are UTF-8, but sometimes
> could be some ASCII compatible single byte encoding where non-ASCII
> characters only appear in comments.  So having the warning off by default
> is IMO desirable.  Now, if people use explicit -finput-charset=UTF-8,
> perhaps we could make the warning on by default for C++23 and use pedwarn
> instead of warning, because then the user told us explicitly that the source
> is UTF-8.  From the paper I understood one of the implementation options
> is to claim that the implementation supports 2 encodings, UTF-8 and UTF-8
> like encodings where invalid UTF-8 characters in comments are replaced say
> by spaces, where the latter could be the default and the former only
> used if -finput-charset=UTF-8 -Werror=invalid-utf8 options are used.
> 
> Thoughts on this?

That sounds good to me.

> 2022-08-29  Jakub Jelinek  <jakub@redhat.com>
> 
> 	PR c++/106655
> libcpp/
> 	* include/cpplib.h (struct cpp_options): Implement C++23
> 	P2295R6 - Support for UTF-8 as a portable source file encoding.
> 	Add cpp_warn_invalid_utf8 field.
> 	(enum cpp_warning_reason): Add CPP_W_INVALID_UTF8 enumerator.
> 	* init.cc (cpp_create_reader): Initialize cpp_warn_invalid_utf8.
> 	* lex.cc (utf8_continuation): New const variable.
> 	(utf8_signifier): Move earlier in the file.
> 	(_cpp_warn_invalid_utf8): New function.
> 	(_cpp_skip_block_comment): Handle -Winvalid-utf8 warning.
> 	(skip_line_comment): Likewise.
> gcc/
> 	* doc/invoke.texi (-Winvalid-utf8): Document it.
> gcc/c-family/
> 	* c.opt (-Winvalid-utf8): New warning.
> gcc/testsuite/
> 	* c-c++-common/cpp/Winvalid-utf8-1.c: New test.
> 
> --- libcpp/include/cpplib.h.jj	2022-08-25 14:25:23.866912426 +0200
> +++ libcpp/include/cpplib.h	2022-08-27 12:17:55.185022807 +0200
> @@ -560,6 +560,9 @@ struct cpp_options
>        cpp_bidirectional_level.  */
>     unsigned char cpp_warn_bidirectional;
>   
> +  /* True if libcpp should warn about invalid UTF-8 characters in comments.  */
> +  bool cpp_warn_invalid_utf8;
> +
>     /* Dependency generation.  */
>     struct
>     {
> @@ -666,7 +669,8 @@ enum cpp_warning_reason {
>     CPP_W_CXX11_COMPAT,
>     CPP_W_CXX20_COMPAT,
>     CPP_W_EXPANSION_TO_DEFINED,
> -  CPP_W_BIDIRECTIONAL
> +  CPP_W_BIDIRECTIONAL,
> +  CPP_W_INVALID_UTF8
>   };
>   
>   /* Callback for header lookup for HEADER, which is the name of a
> --- libcpp/init.cc.jj	2022-08-24 09:55:44.571876638 +0200
> +++ libcpp/init.cc	2022-08-27 12:18:54.559246323 +0200
> @@ -227,6 +227,7 @@ cpp_create_reader (enum c_lang lang, cpp
>     CPP_OPTION (pfile, ext_numeric_literals) = 1;
>     CPP_OPTION (pfile, warn_date_time) = 0;
>     CPP_OPTION (pfile, cpp_warn_bidirectional) = bidirectional_unpaired;
> +  CPP_OPTION (pfile, cpp_warn_invalid_utf8) = 0;
>   
>     /* Default CPP arithmetic to something sensible for the host for the
>        benefit of dumb users like fix-header.  */
> --- libcpp/lex.cc.jj	2022-08-26 09:24:12.089615949 +0200
> +++ libcpp/lex.cc	2022-08-27 13:43:40.560769087 +0200
> @@ -1704,6 +1704,59 @@ maybe_warn_bidi_on_char (cpp_reader *pfi
>     bidi::on_char (kind, ucn_p, loc);
>   }
>   
> +static const cppchar_t utf8_continuation = 0x80;
> +static const cppchar_t utf8_signifier = 0xC0; >
> +/* Emit -Winvalid-utf8 warning on invalid UTF-8 character starting
> +   at PFILE->buffer->cur.  Return a pointer after the diagnosed
> +   invalid character.  */
> +
> +static const uchar *
> +_cpp_warn_invalid_utf8 (cpp_reader *pfile)
> +{
> +  cpp_buffer *buffer = pfile->buffer;
> +  const uchar *cur = buffer->cur;
> +
> +  if (cur[0] < utf8_signifier
> +      || cur[1] < utf8_continuation || cur[1] >= utf8_signifier)
> +    {
> +      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> +			     pfile->line_table->highest_line,
> +			     CPP_BUF_COL (buffer),
> +			     "invalid UTF-8 character <%x> in comment",
> +			     cur[0]);
> +      return cur + 1;
> +    }
> +  else if (cur[2] < utf8_continuation || cur[2] >= utf8_signifier)

Unicode table 3-7 says that the second byte is sometimes restricted to 
less than this range.  Hmm, it looks like one_utf8_to_cppchar doesn't 
check that, either.

Did you consider adding error reporting to one_utf8_to_cppchar?  It 
might be better to localize the utf8 logic.

> +    {
> +      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> +			     pfile->line_table->highest_line,
> +			     CPP_BUF_COL (buffer),
> +			     "invalid UTF-8 character <%x><%x> in comment",
> +			     cur[0], cur[1]);
> +      return cur + 2;
> +    }
> +  else if (cur[3] < utf8_continuation || cur[3] >= utf8_signifier)
> +    {
> +      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> +			     pfile->line_table->highest_line,
> +			     CPP_BUF_COL (buffer),
> +			     "invalid UTF-8 character <%x><%x><%x> in comment",
> +			     cur[0], cur[1], cur[2]);
> +      return cur + 3;
> +    }
> +  else
> +    {
> +      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> +			     pfile->line_table->highest_line,
> +			     CPP_BUF_COL (buffer),
> +			     "invalid UTF-8 character "
> +			     "<%x><%x><%x><%x> in comment",
> +			     cur[0], cur[1], cur[2], cur[3]);
> +      return cur + 4;
> +    }
> +}
> +
>   /* Skip a C-style block comment.  We find the end of the comment by
>      seeing if an asterisk is before every '/' we encounter.  Returns
>      nonzero if comment terminated by EOF, zero otherwise.
> @@ -1716,6 +1769,8 @@ _cpp_skip_block_comment (cpp_reader *pfi
>     const uchar *cur = buffer->cur;
>     uchar c;
>     const bool warn_bidi_p = pfile->warn_bidi_p ();
> +  const bool warn_invalid_utf8_p = CPP_OPTION (pfile, cpp_warn_invalid_utf8);
> +  const bool warn_bidi_or_invalid_utf8_p = warn_bidi_p | warn_invalid_utf8_p;
>   
>     cur++;
>     if (*cur == '/')
> @@ -1765,13 +1820,32 @@ _cpp_skip_block_comment (cpp_reader *pfi
>   
>   	  cur = buffer->cur;
>   	}
> -      /* If this is a beginning of a UTF-8 encoding, it might be
> -	 a bidirectional control character.  */
> -      else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p)
> -	{
> -	  location_t loc;
> -	  bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc);
> -	  maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
> +      else if (__builtin_expect (c >= utf8_continuation, 0)
> +	       && warn_bidi_or_invalid_utf8_p)
> +	{
> +	  /* If this is a beginning of a UTF-8 encoding, it might be
> +	     a bidirectional control character.  */
> +	  if (c == bidi::utf8_start && warn_bidi_p)
> +	    {
> +	      location_t loc;
> +	      bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc);
> +	      maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
> +	    }
> +	  if (!warn_invalid_utf8_p)
> +	    continue;
> +	  if (c >= utf8_signifier)
> +	    {
> +	      cppchar_t s;
> +	      const uchar *pstr = cur - 1;
> +	      if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
> +		  && s <= 0x0010FFFF)
> +		{
> +		  cur = pstr;
> +		  continue;
> +		}
> +	    }
> +	  buffer->cur = cur - 1;
> +	  cur = _cpp_warn_invalid_utf8 (pfile);
>   	}
>       }
>   
> @@ -1789,11 +1863,13 @@ skip_line_comment (cpp_reader *pfile)
>     cpp_buffer *buffer = pfile->buffer;
>     location_t orig_line = pfile->line_table->highest_line;
>     const bool warn_bidi_p = pfile->warn_bidi_p ();
> +  const bool warn_invalid_utf8_p = CPP_OPTION (pfile, cpp_warn_invalid_utf8);
> +  const bool warn_bidi_or_invalid_utf8_p = warn_bidi_p | warn_invalid_utf8_p;
>   
> -  if (!warn_bidi_p)
> +  if (!warn_bidi_or_invalid_utf8_p)
>       while (*buffer->cur != '\n')
>         buffer->cur++;
> -  else
> +  else if (!warn_invalid_utf8_p)
>       {
>         while (*buffer->cur != '\n'
>   	     && *buffer->cur != bidi::utf8_start)
> @@ -1813,6 +1889,38 @@ skip_line_comment (cpp_reader *pfile)
>   	  maybe_warn_bidi_on_close (pfile, buffer->cur);
>   	}
>       }
> +  else
> +    {
> +      while (*buffer->cur != '\n')
> +	{
> +	  if (*buffer->cur < utf8_continuation)
> +	    {
> +	      buffer->cur++;
> +	      continue;
> +	    }
> +	  if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0)
> +	      && warn_bidi_p)
> +	    {
> +	      location_t loc;
> +	      bidi::kind kind = get_bidi_utf8 (pfile, buffer->cur, &loc);
> +	      maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc);
> +	    }
> +	  if (*buffer->cur >= utf8_signifier)
> +	    {
> +	      cppchar_t s;
> +	      const uchar *pstr = buffer->cur;
> +	      if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
> +		  && s <= 0x0010FFFF)
> +		{
> +		  buffer->cur = pstr;
> +		  continue;
> +		}
> +	    }
> +	  buffer->cur = _cpp_warn_invalid_utf8 (pfile);
> +	}
> +      if (warn_bidi_p)
> +	maybe_warn_bidi_on_close (pfile, buffer->cur);
> +    }
>   
>     _cpp_process_line_notes (pfile, true);
>     return orig_line != pfile->line_table->highest_line;
> @@ -1919,8 +2027,6 @@ warn_about_normalization (cpp_reader *pf
>       }
>   }
>   
> -static const cppchar_t utf8_signifier = 0xC0;
> -
>   /* Returns TRUE if the sequence starting at buffer->cur is valid in
>      an identifier.  FIRST is TRUE if this starts an identifier.  */
>   
> --- gcc/doc/invoke.texi.jj	2022-08-27 09:14:43.047696028 +0200
> +++ gcc/doc/invoke.texi	2022-08-27 14:05:22.417755406 +0200
> @@ -365,9 +365,9 @@ Objective-C and Objective-C++ Dialects}.
>   -Winfinite-recursion @gol
>   -Winit-self  -Winline  -Wno-int-conversion  -Wint-in-bool-context @gol
>   -Wno-int-to-pointer-cast  -Wno-invalid-memory-model @gol
> --Winvalid-pch  -Wjump-misses-init  -Wlarger-than=@var{byte-size} @gol
> --Wlogical-not-parentheses  -Wlogical-op  -Wlong-long @gol
> --Wno-lto-type-mismatch -Wmain  -Wmaybe-uninitialized @gol
> +-Winvalid-pch  -Winvalid-utf8 -Wjump-misses-init  @gol
> +-Wlarger-than=@var{byte-size}  -Wlogical-not-parentheses  -Wlogical-op  @gol
> +-Wlong-long  -Wno-lto-type-mismatch -Wmain  -Wmaybe-uninitialized @gol
>   -Wmemset-elt-size  -Wmemset-transposed-args @gol
>   -Wmisleading-indentation  -Wmissing-attributes  -Wmissing-braces @gol
>   -Wmissing-field-initializers  -Wmissing-format-attribute @gol
> @@ -9569,6 +9569,11 @@ different size.
>   Warn if a precompiled header (@pxref{Precompiled Headers}) is found in
>   the search path but cannot be used.
>   
> +@item -Winvalid-utf8
> +@opindex Winvalid-utf8
> +@opindex Wno-invalid-utf8
> +Warn if an invalid UTF-8 character is inside of a comment.
> +
>   @item -Wlong-long
>   @opindex Wlong-long
>   @opindex Wno-long-long
> --- gcc/c-family/c.opt.jj	2022-08-27 09:14:43.036696173 +0200
> +++ gcc/c-family/c.opt	2022-08-27 14:03:06.328534617 +0200
> @@ -821,6 +821,10 @@ Winvalid-pch
>   C ObjC C++ ObjC++ CPP(warn_invalid_pch) CppReason(CPP_W_INVALID_PCH) Var(cpp_warn_invalid_pch) Init(0) Warning
>   Warn about PCH files that are found but not used.
>   
> +Winvalid-utf8
> +C objC C++ ObjC++ CPP(cpp_warn_invalid_utf8) CppReason(CPP_W_INVALID_UTF8) Var(warn_invalid_utf8) Init(0) Warning
> +Warn about invalid UTF-8 characters in comments.
> +
>   Wjump-misses-init
>   C ObjC Var(warn_jump_misses_init) Warning LangEnabledby(C ObjC,Wc++-compat)
>   Warn when a jump misses a variable initialization.
> --- gcc/testsuite/c-c++-common/cpp/Winvalid-utf8-1.c.jj	2022-08-27 14:01:51.115517571 +0200
> +++ gcc/testsuite/c-c++-common/cpp/Winvalid-utf8-1.c	2022-08-27 14:33:21.466802817 +0200
> @@ -0,0 +1,39 @@
> +// P2295R6 - Support for UTF-8 as a portable source file encoding
> +// This test intentionally contains various byte sequences which are not valid UTF-8
> +// { dg-do preprocess }
> +// { dg-options "-finput-charset=UTF-8 -Winvalid-utf8" }
> +
> +// aÂ€ß¿à €íŸ¿î€€ð€€ô¿¿a		{ dg-bogus "invalid UTF-8 character" }
> +// aï¿½a					{ dg-warning "invalid UTF-8 character <80> in comment" }
> +// aï¿½a					{ dg-warning "invalid UTF-8 character <bf> in comment" }
> +// aï¿½a					{ dg-warning "invalid UTF-8 character <c0> in comment" }
> +// aï¿½a					{ dg-warning "invalid UTF-8 character <c1> in comment" }
> +// aï¿½a					{ dg-warning "invalid UTF-8 character <f5> in comment" }
> +// aï¿½a					{ dg-warning "invalid UTF-8 character <ff> in comment" }
> +// aï¿½a					{ dg-warning "invalid UTF-8 character <c2> in comment" }
> +// aï¿½a					{ dg-warning "invalid UTF-8 character <e0> in comment" }
> +// aï¿½ï¿½ï¿½a				{ dg-warning "invalid UTF-8 character <e0><80><bf> in comment" }
> +// aï¿½ï¿½ï¿½a				{ dg-warning "invalid UTF-8 character <e0><9f><80> in comment" }
> +// aï¿½ï¿½a					{ dg-warning "invalid UTF-8 character <e0><bf> in comment" }
> +// aï¿½ï¿½a					{ dg-warning "invalid UTF-8 character <ec><80> in comment" }
> +// aï¿½ï¿½ï¿½a				{ dg-warning "invalid UTF-8 character <ed><a0><80> in comment" }
> +// aï¿½ï¿½ï¿½ï¿½a				{ dg-warning "invalid UTF-8 character <f0><80><80><80> in comment" }
> +// aï¿½ï¿½ï¿½ï¿½a				{ dg-warning "invalid UTF-8 character <f0><8f><bf><bf> in comment" }
> +// aï¿½ï¿½ï¿½ï¿½a				{ dg-warning "invalid UTF-8 character <f4><90><80><80> in comment" }
> +/* aÂ€ß¿à €íŸ¿î€€ð€€ô¿¿a		{ dg-bogus "invalid UTF-8 character" } */
> +/* aï¿½a					{ dg-warning "invalid UTF-8 character <80> in comment" } */
> +/* aï¿½a					{ dg-warning "invalid UTF-8 character <bf> in comment" } */
> +/* aï¿½a					{ dg-warning "invalid UTF-8 character <c0> in comment" } */
> +/* aï¿½a					{ dg-warning "invalid UTF-8 character <c1> in comment" } */
> +/* aï¿½a					{ dg-warning "invalid UTF-8 character <f5> in comment" } */
> +/* aï¿½a					{ dg-warning "invalid UTF-8 character <ff> in comment" } */
> +/* aï¿½a					{ dg-warning "invalid UTF-8 character <c2> in comment" } */
> +/* aï¿½a					{ dg-warning "invalid UTF-8 character <e0> in comment" } */
> +/* aï¿½ï¿½ï¿½a				{ dg-warning "invalid UTF-8 character <e0><80><bf> in comment" } */
> +/* aï¿½ï¿½ï¿½a				{ dg-warning "invalid UTF-8 character <e0><9f><80> in comment" } */
> +/* aï¿½ï¿½a					{ dg-warning "invalid UTF-8 character <e0><bf> in comment" } */
> +/* aï¿½ï¿½a					{ dg-warning "invalid UTF-8 character <ec><80> in comment" } */
> +/* aï¿½ï¿½ï¿½a				{ dg-warning "invalid UTF-8 character <ed><a0><80> in comment" } */
> +/* aï¿½ï¿½ï¿½ï¿½a				{ dg-warning "invalid UTF-8 character <f0><80><80><80> in comment" } */
> +/* aï¿½ï¿½ï¿½ï¿½a				{ dg-warning "invalid UTF-8 character <f0><8f><bf><bf> in comment" } */
> +/* aï¿½ï¿½ï¿½ï¿½a				{ dg-warning "invalid UTF-8 character <f4><90><80><80> in comment" } */
> 
> 	Jakub
>

next prev parent reply	other threads:[~2022-08-29 21:15 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-29  8:15 Jakub Jelinek
2022-08-29 21:15 ` Jason Merrill [this message]
2022-08-29 21:35   ` Jakub Jelinek
2022-08-29 21:54     ` Jakub Jelinek
2022-08-30  3:31     ` Jason Merrill
2022-08-30 15:20       ` [PATCH] libcpp, v2: " Jakub Jelinek
2022-08-30 21:51         ` Jason Merrill
2022-08-31 11:14           ` [PATCH] libcpp, v3: " Jakub Jelinek
2022-08-31 13:55             ` Jason Merrill
2022-08-31 14:15               ` [PATCH] libcpp, v4: " Jakub Jelinek
2022-08-31 14:24                 ` Jason Merrill

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e93528c7-efa8-a328-2af5-a2c2ce59dfe4@redhat.com \
    --to=jason@redhat.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jakub@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).