From: Jakub Jelinek <jakub@redhat.com>
To: Jason Merrill <jason@redhat.com>
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] libcpp: Add -Winvalid-utf8 warning [PR106655]
Date: Mon, 29 Aug 2022 23:35:43 +0200 [thread overview]
Message-ID: <Yw0xL4sOmYH08VP0@tucnak> (raw)
In-Reply-To: <e93528c7-efa8-a328-2af5-a2c2ce59dfe4@redhat.com>
On Mon, Aug 29, 2022 at 05:15:26PM -0400, Jason Merrill wrote:
> On 8/29/22 04:15, Jakub Jelinek wrote:
> > Hi!
> >
> > The following patch introduces a new warning - -Winvalid-utf8 similarly
> > to what clang now has - to diagnose invalid UTF-8 byte sequences in
> > comments. In identifiers and in string literals it should be diagnosed
> > already but comment content hasn't been really verified.
> >
> > I'm not sure if this is enough to say P2295R6 is implemented or not.
> >
> > The problem is that in the most common case, people don't use
> > -finput-charset= option and the sources often are UTF-8, but sometimes
> > could be some ASCII compatible single byte encoding where non-ASCII
> > characters only appear in comments. So having the warning off by default
> > is IMO desirable. Now, if people use explicit -finput-charset=UTF-8,
> > perhaps we could make the warning on by default for C++23 and use pedwarn
> > instead of warning, because then the user told us explicitly that the source
> > is UTF-8. From the paper I understood one of the implementation options
> > is to claim that the implementation supports 2 encodings, UTF-8 and UTF-8
> > like encodings where invalid UTF-8 characters in comments are replaced say
> > by spaces, where the latter could be the default and the former only
> > used if -finput-charset=UTF-8 -Werror=invalid-utf8 options are used.
> >
> > Thoughts on this?
>
> That sounds good to me.
The pedwarn on -std=c++23 -finput-charset=UTF-8 or just documenting that
"conforming" UTF-8 is only -finput-charset=UTF-8 -Werror=invalid-utf8 ?
> > +static const uchar *
> > +_cpp_warn_invalid_utf8 (cpp_reader *pfile)
> > +{
> > + cpp_buffer *buffer = pfile->buffer;
> > + const uchar *cur = buffer->cur;
> > +
> > + if (cur[0] < utf8_signifier
> > + || cur[1] < utf8_continuation || cur[1] >= utf8_signifier)
> > + {
> > + cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> > + pfile->line_table->highest_line,
> > + CPP_BUF_COL (buffer),
> > + "invalid UTF-8 character <%x> in comment",
> > + cur[0]);
> > + return cur + 1;
> > + }
> > + else if (cur[2] < utf8_continuation || cur[2] >= utf8_signifier)
>
> Unicode table 3-7 says that the second byte is sometimes restricted to less
> than this range.
That is true and I've tried to include tests for all of those cases in the
testcase and all of them get a warning. Some of them are through:
/* Make sure the shortest possible encoding was used. */
if (c <= 0x7F && nbytes > 1) return EILSEQ;
if (c <= 0x7FF && nbytes > 2) return EILSEQ;
if (c <= 0xFFFF && nbytes > 3) return EILSEQ;
if (c <= 0x1FFFFF && nbytes > 4) return EILSEQ;
if (c <= 0x3FFFFFF && nbytes > 5) return EILSEQ;
and others are through:
/* Make sure the character is valid. */
if (c > 0x7FFFFFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
All I had to do outside of what one_utf8_to_cppchar already implements was:
> > + if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
> > + && s <= 0x0010FFFF)
the <= 0x0010FFFF check, because one_utf8_to_cppchar as written happily
supports up to 6 bytes long UTF-8 which can encode up to 7FFFFFFF:
00000000-0000007F 0xxxxxxx
00000080-000007FF 110xxxxx 10xxxxxx
00000800-0000FFFF 1110xxxx 10xxxxxx 10xxxxxx
00010000-001FFFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
00200000-03FFFFFF 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
04000000-7FFFFFFF 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
while 3-7 only talks about encoding 0..D7FF and D800..10FFFF in up to 4
bytes.
I guess I should try what happens with 0x110000 and 0x7fffffff in
identifiers and string literals.
Jakub
next prev parent reply other threads:[~2022-08-29 21:35 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-08-29 8:15 Jakub Jelinek
2022-08-29 21:15 ` Jason Merrill
2022-08-29 21:35 ` Jakub Jelinek [this message]
2022-08-29 21:54 ` Jakub Jelinek
2022-08-30 3:31 ` Jason Merrill
2022-08-30 15:20 ` [PATCH] libcpp, v2: " Jakub Jelinek
2022-08-30 21:51 ` Jason Merrill
2022-08-31 11:14 ` [PATCH] libcpp, v3: " Jakub Jelinek
2022-08-31 13:55 ` Jason Merrill
2022-08-31 14:15 ` [PATCH] libcpp, v4: " Jakub Jelinek
2022-08-31 14:24 ` Jason Merrill
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Yw0xL4sOmYH08VP0@tucnak \
--to=jakub@redhat.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=jason@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).