From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by sourceware.org (Postfix) with ESMTPS id 3EE563856959 for ; Mon, 29 Aug 2022 21:15:32 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 3EE563856959 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1661807730; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=SzCqKSuZUkz9QsHr2jzzTsg3H71T+k8dd1c5vkfNyi8=; b=e1eJ2ZAf9P7MWLc7r07TQSP1v+2qFXxdQo265u+7AIkRQhpZOUutw33/nUqmqTZlMPbqLR sxT/uXuimLjUJu18R0QY7ZSkNEADHBTc53jXdVaRD+xlmdqGcb2TtgLXeu/iuUGaoIuy5+ eVGUE4pNbcHxpD93E1M0GRM4Jg8SiR4= Received: from mail-qk1-f197.google.com (mail-qk1-f197.google.com [209.85.222.197]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-41-94CxetsXMuGxIZuSU4I3ag-1; Mon, 29 Aug 2022 17:15:29 -0400 X-MC-Unique: 94CxetsXMuGxIZuSU4I3ag-1 Received: by mail-qk1-f197.google.com with SMTP id bj2-20020a05620a190200b006bba055ab6eso7562036qkb.12 for ; Mon, 29 Aug 2022 14:15:29 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc; bh=SzCqKSuZUkz9QsHr2jzzTsg3H71T+k8dd1c5vkfNyi8=; b=t2J368IbIqcjD1ygXAp3t2QhOChiOhhlk/49UFJd9b/HKTstRIknKPrgIprfCy3ZRV CrfSv3bvIibp6BvDmMqUDJaFJOqVKlYCntDh3IYiltGcQ8zhAe80CcCj0UYFWJyH+Jqy mDrOM5PXKMNKfBeRX06jq2Kyp0Uq40gadIeFRhrUkYQ3KNfwRcnnrJ80eTSNeM/PzsED H+aRbrFtspQvFWlG+5pNVD+wxl6wr593kjISXSrYVB8hfDeZ5xTeR/Dfv6XHVtwTmHJ6 ie5pJ0j57JkYEmRFLDaqkLePzoO5noT0Y4mb9gIv3ihRWO1YZMHoGdfzwAzGxYbKCX3+ nxYw== X-Gm-Message-State: ACgBeo3vXU3083VtLvNEP1rxaCxLeEDY5P9xhflA0lrpX4toSysrEM5i qYZXbCk8o7I0UyEnhkjVugcdMFwsGGuFa2yRohHQYmSusNpS2VtQOblFsYNdlWS4AIyD9yQrXJ0 HQoI4pKdZSZYEVLN6Uw== X-Received: by 2002:a05:622a:413:b0:343:73b9:5cc6 with SMTP id n19-20020a05622a041300b0034373b95cc6mr11987023qtx.655.1661807728778; Mon, 29 Aug 2022 14:15:28 -0700 (PDT) X-Google-Smtp-Source: AA6agR5KgpRG65UbnCmkuRN2W/1HbivrroYLicgQUdhKb81CziyffN5fHLHw3YkOs5PQRqWLYdvrTg== X-Received: by 2002:a05:622a:413:b0:343:73b9:5cc6 with SMTP id n19-20020a05622a041300b0034373b95cc6mr11986984qtx.655.1661807728232; Mon, 29 Aug 2022 14:15:28 -0700 (PDT) Received: from [192.168.1.101] (130-44-159-43.s15913.c3-0.arl-cbr1.sbo-arl.ma.cable.rcncustomer.com. [130.44.159.43]) by smtp.gmail.com with ESMTPSA id hf11-20020a05622a608b00b00344b807bb95sm5705344qtb.74.2022.08.29.14.15.27 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 29 Aug 2022 14:15:27 -0700 (PDT) Message-ID: Date: Mon, 29 Aug 2022 17:15:26 -0400 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.13.0 Subject: Re: [PATCH] libcpp: Add -Winvalid-utf8 warning [PR106655] To: Jakub Jelinek Cc: gcc-patches@gcc.gnu.org References: From: Jason Merrill In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-5.5 required=5.0 tests=BAYES_00,BODY_8BITS,DKIMWL_WL_HIGH,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,NICE_REPLY_A,RCVD_IN_DNSWL_LOW,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On 8/29/22 04:15, Jakub Jelinek wrote: > Hi! > > The following patch introduces a new warning - -Winvalid-utf8 similarly > to what clang now has - to diagnose invalid UTF-8 byte sequences in > comments. In identifiers and in string literals it should be diagnosed > already but comment content hasn't been really verified. > > I'm not sure if this is enough to say P2295R6 is implemented or not. > > The problem is that in the most common case, people don't use > -finput-charset= option and the sources often are UTF-8, but sometimes > could be some ASCII compatible single byte encoding where non-ASCII > characters only appear in comments. So having the warning off by default > is IMO desirable. Now, if people use explicit -finput-charset=UTF-8, > perhaps we could make the warning on by default for C++23 and use pedwarn > instead of warning, because then the user told us explicitly that the source > is UTF-8. From the paper I understood one of the implementation options > is to claim that the implementation supports 2 encodings, UTF-8 and UTF-8 > like encodings where invalid UTF-8 characters in comments are replaced say > by spaces, where the latter could be the default and the former only > used if -finput-charset=UTF-8 -Werror=invalid-utf8 options are used. > > Thoughts on this? That sounds good to me. > 2022-08-29 Jakub Jelinek > > PR c++/106655 > libcpp/ > * include/cpplib.h (struct cpp_options): Implement C++23 > P2295R6 - Support for UTF-8 as a portable source file encoding. > Add cpp_warn_invalid_utf8 field. > (enum cpp_warning_reason): Add CPP_W_INVALID_UTF8 enumerator. > * init.cc (cpp_create_reader): Initialize cpp_warn_invalid_utf8. > * lex.cc (utf8_continuation): New const variable. > (utf8_signifier): Move earlier in the file. > (_cpp_warn_invalid_utf8): New function. > (_cpp_skip_block_comment): Handle -Winvalid-utf8 warning. > (skip_line_comment): Likewise. > gcc/ > * doc/invoke.texi (-Winvalid-utf8): Document it. > gcc/c-family/ > * c.opt (-Winvalid-utf8): New warning. > gcc/testsuite/ > * c-c++-common/cpp/Winvalid-utf8-1.c: New test. > > --- libcpp/include/cpplib.h.jj 2022-08-25 14:25:23.866912426 +0200 > +++ libcpp/include/cpplib.h 2022-08-27 12:17:55.185022807 +0200 > @@ -560,6 +560,9 @@ struct cpp_options > cpp_bidirectional_level. */ > unsigned char cpp_warn_bidirectional; > > + /* True if libcpp should warn about invalid UTF-8 characters in comments. */ > + bool cpp_warn_invalid_utf8; > + > /* Dependency generation. */ > struct > { > @@ -666,7 +669,8 @@ enum cpp_warning_reason { > CPP_W_CXX11_COMPAT, > CPP_W_CXX20_COMPAT, > CPP_W_EXPANSION_TO_DEFINED, > - CPP_W_BIDIRECTIONAL > + CPP_W_BIDIRECTIONAL, > + CPP_W_INVALID_UTF8 > }; > > /* Callback for header lookup for HEADER, which is the name of a > --- libcpp/init.cc.jj 2022-08-24 09:55:44.571876638 +0200 > +++ libcpp/init.cc 2022-08-27 12:18:54.559246323 +0200 > @@ -227,6 +227,7 @@ cpp_create_reader (enum c_lang lang, cpp > CPP_OPTION (pfile, ext_numeric_literals) = 1; > CPP_OPTION (pfile, warn_date_time) = 0; > CPP_OPTION (pfile, cpp_warn_bidirectional) = bidirectional_unpaired; > + CPP_OPTION (pfile, cpp_warn_invalid_utf8) = 0; > > /* Default CPP arithmetic to something sensible for the host for the > benefit of dumb users like fix-header. */ > --- libcpp/lex.cc.jj 2022-08-26 09:24:12.089615949 +0200 > +++ libcpp/lex.cc 2022-08-27 13:43:40.560769087 +0200 > @@ -1704,6 +1704,59 @@ maybe_warn_bidi_on_char (cpp_reader *pfi > bidi::on_char (kind, ucn_p, loc); > } > > +static const cppchar_t utf8_continuation = 0x80; > +static const cppchar_t utf8_signifier = 0xC0; > > +/* Emit -Winvalid-utf8 warning on invalid UTF-8 character starting > + at PFILE->buffer->cur. Return a pointer after the diagnosed > + invalid character. */ > + > +static const uchar * > +_cpp_warn_invalid_utf8 (cpp_reader *pfile) > +{ > + cpp_buffer *buffer = pfile->buffer; > + const uchar *cur = buffer->cur; > + > + if (cur[0] < utf8_signifier > + || cur[1] < utf8_continuation || cur[1] >= utf8_signifier) > + { > + cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8, > + pfile->line_table->highest_line, > + CPP_BUF_COL (buffer), > + "invalid UTF-8 character <%x> in comment", > + cur[0]); > + return cur + 1; > + } > + else if (cur[2] < utf8_continuation || cur[2] >= utf8_signifier) Unicode table 3-7 says that the second byte is sometimes restricted to less than this range. Hmm, it looks like one_utf8_to_cppchar doesn't check that, either. Did you consider adding error reporting to one_utf8_to_cppchar? It might be better to localize the utf8 logic. > + { > + cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8, > + pfile->line_table->highest_line, > + CPP_BUF_COL (buffer), > + "invalid UTF-8 character <%x><%x> in comment", > + cur[0], cur[1]); > + return cur + 2; > + } > + else if (cur[3] < utf8_continuation || cur[3] >= utf8_signifier) > + { > + cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8, > + pfile->line_table->highest_line, > + CPP_BUF_COL (buffer), > + "invalid UTF-8 character <%x><%x><%x> in comment", > + cur[0], cur[1], cur[2]); > + return cur + 3; > + } > + else > + { > + cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8, > + pfile->line_table->highest_line, > + CPP_BUF_COL (buffer), > + "invalid UTF-8 character " > + "<%x><%x><%x><%x> in comment", > + cur[0], cur[1], cur[2], cur[3]); > + return cur + 4; > + } > +} > + > /* Skip a C-style block comment. We find the end of the comment by > seeing if an asterisk is before every '/' we encounter. Returns > nonzero if comment terminated by EOF, zero otherwise. > @@ -1716,6 +1769,8 @@ _cpp_skip_block_comment (cpp_reader *pfi > const uchar *cur = buffer->cur; > uchar c; > const bool warn_bidi_p = pfile->warn_bidi_p (); > + const bool warn_invalid_utf8_p = CPP_OPTION (pfile, cpp_warn_invalid_utf8); > + const bool warn_bidi_or_invalid_utf8_p = warn_bidi_p | warn_invalid_utf8_p; > > cur++; > if (*cur == '/') > @@ -1765,13 +1820,32 @@ _cpp_skip_block_comment (cpp_reader *pfi > > cur = buffer->cur; > } > - /* If this is a beginning of a UTF-8 encoding, it might be > - a bidirectional control character. */ > - else if (__builtin_expect (c == bidi::utf8_start, 0) && warn_bidi_p) > - { > - location_t loc; > - bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc); > - maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); > + else if (__builtin_expect (c >= utf8_continuation, 0) > + && warn_bidi_or_invalid_utf8_p) > + { > + /* If this is a beginning of a UTF-8 encoding, it might be > + a bidirectional control character. */ > + if (c == bidi::utf8_start && warn_bidi_p) > + { > + location_t loc; > + bidi::kind kind = get_bidi_utf8 (pfile, cur - 1, &loc); > + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); > + } > + if (!warn_invalid_utf8_p) > + continue; > + if (c >= utf8_signifier) > + { > + cppchar_t s; > + const uchar *pstr = cur - 1; > + if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s) > + && s <= 0x0010FFFF) > + { > + cur = pstr; > + continue; > + } > + } > + buffer->cur = cur - 1; > + cur = _cpp_warn_invalid_utf8 (pfile); > } > } > > @@ -1789,11 +1863,13 @@ skip_line_comment (cpp_reader *pfile) > cpp_buffer *buffer = pfile->buffer; > location_t orig_line = pfile->line_table->highest_line; > const bool warn_bidi_p = pfile->warn_bidi_p (); > + const bool warn_invalid_utf8_p = CPP_OPTION (pfile, cpp_warn_invalid_utf8); > + const bool warn_bidi_or_invalid_utf8_p = warn_bidi_p | warn_invalid_utf8_p; > > - if (!warn_bidi_p) > + if (!warn_bidi_or_invalid_utf8_p) > while (*buffer->cur != '\n') > buffer->cur++; > - else > + else if (!warn_invalid_utf8_p) > { > while (*buffer->cur != '\n' > && *buffer->cur != bidi::utf8_start) > @@ -1813,6 +1889,38 @@ skip_line_comment (cpp_reader *pfile) > maybe_warn_bidi_on_close (pfile, buffer->cur); > } > } > + else > + { > + while (*buffer->cur != '\n') > + { > + if (*buffer->cur < utf8_continuation) > + { > + buffer->cur++; > + continue; > + } > + if (__builtin_expect (*buffer->cur == bidi::utf8_start, 0) > + && warn_bidi_p) > + { > + location_t loc; > + bidi::kind kind = get_bidi_utf8 (pfile, buffer->cur, &loc); > + maybe_warn_bidi_on_char (pfile, kind, /*ucn_p=*/false, loc); > + } > + if (*buffer->cur >= utf8_signifier) > + { > + cppchar_t s; > + const uchar *pstr = buffer->cur; > + if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s) > + && s <= 0x0010FFFF) > + { > + buffer->cur = pstr; > + continue; > + } > + } > + buffer->cur = _cpp_warn_invalid_utf8 (pfile); > + } > + if (warn_bidi_p) > + maybe_warn_bidi_on_close (pfile, buffer->cur); > + } > > _cpp_process_line_notes (pfile, true); > return orig_line != pfile->line_table->highest_line; > @@ -1919,8 +2027,6 @@ warn_about_normalization (cpp_reader *pf > } > } > > -static const cppchar_t utf8_signifier = 0xC0; > - > /* Returns TRUE if the sequence starting at buffer->cur is valid in > an identifier. FIRST is TRUE if this starts an identifier. */ > > --- gcc/doc/invoke.texi.jj 2022-08-27 09:14:43.047696028 +0200 > +++ gcc/doc/invoke.texi 2022-08-27 14:05:22.417755406 +0200 > @@ -365,9 +365,9 @@ Objective-C and Objective-C++ Dialects}. > -Winfinite-recursion @gol > -Winit-self -Winline -Wno-int-conversion -Wint-in-bool-context @gol > -Wno-int-to-pointer-cast -Wno-invalid-memory-model @gol > --Winvalid-pch -Wjump-misses-init -Wlarger-than=@var{byte-size} @gol > --Wlogical-not-parentheses -Wlogical-op -Wlong-long @gol > --Wno-lto-type-mismatch -Wmain -Wmaybe-uninitialized @gol > +-Winvalid-pch -Winvalid-utf8 -Wjump-misses-init @gol > +-Wlarger-than=@var{byte-size} -Wlogical-not-parentheses -Wlogical-op @gol > +-Wlong-long -Wno-lto-type-mismatch -Wmain -Wmaybe-uninitialized @gol > -Wmemset-elt-size -Wmemset-transposed-args @gol > -Wmisleading-indentation -Wmissing-attributes -Wmissing-braces @gol > -Wmissing-field-initializers -Wmissing-format-attribute @gol > @@ -9569,6 +9569,11 @@ different size. > Warn if a precompiled header (@pxref{Precompiled Headers}) is found in > the search path but cannot be used. > > +@item -Winvalid-utf8 > +@opindex Winvalid-utf8 > +@opindex Wno-invalid-utf8 > +Warn if an invalid UTF-8 character is inside of a comment. > + > @item -Wlong-long > @opindex Wlong-long > @opindex Wno-long-long > --- gcc/c-family/c.opt.jj 2022-08-27 09:14:43.036696173 +0200 > +++ gcc/c-family/c.opt 2022-08-27 14:03:06.328534617 +0200 > @@ -821,6 +821,10 @@ Winvalid-pch > C ObjC C++ ObjC++ CPP(warn_invalid_pch) CppReason(CPP_W_INVALID_PCH) Var(cpp_warn_invalid_pch) Init(0) Warning > Warn about PCH files that are found but not used. > > +Winvalid-utf8 > +C objC C++ ObjC++ CPP(cpp_warn_invalid_utf8) CppReason(CPP_W_INVALID_UTF8) Var(warn_invalid_utf8) Init(0) Warning > +Warn about invalid UTF-8 characters in comments. > + > Wjump-misses-init > C ObjC Var(warn_jump_misses_init) Warning LangEnabledby(C ObjC,Wc++-compat) > Warn when a jump misses a variable initialization. > --- gcc/testsuite/c-c++-common/cpp/Winvalid-utf8-1.c.jj 2022-08-27 14:01:51.115517571 +0200 > +++ gcc/testsuite/c-c++-common/cpp/Winvalid-utf8-1.c 2022-08-27 14:33:21.466802817 +0200 > @@ -0,0 +1,39 @@ > +// P2295R6 - Support for UTF-8 as a portable source file encoding > +// This test intentionally contains various byte sequences which are not valid UTF-8 > +// { dg-do preprocess } > +// { dg-options "-finput-charset=UTF-8 -Winvalid-utf8" } > + > +// a€߿ࠀ퟿𐀀􏿿a { dg-bogus "invalid UTF-8 character" } > +// a�a { dg-warning "invalid UTF-8 character <80> in comment" } > +// a�a { dg-warning "invalid UTF-8 character in comment" } > +// a�a { dg-warning "invalid UTF-8 character in comment" } > +// a�a { dg-warning "invalid UTF-8 character in comment" } > +// a�a { dg-warning "invalid UTF-8 character in comment" } > +// a�a { dg-warning "invalid UTF-8 character in comment" } > +// a�a { dg-warning "invalid UTF-8 character in comment" } > +// a�a { dg-warning "invalid UTF-8 character in comment" } > +// a���a { dg-warning "invalid UTF-8 character <80> in comment" } > +// a���a { dg-warning "invalid UTF-8 character <9f><80> in comment" } > +// a��a { dg-warning "invalid UTF-8 character in comment" } > +// a��a { dg-warning "invalid UTF-8 character <80> in comment" } > +// a���a { dg-warning "invalid UTF-8 character <80> in comment" } > +// a����a { dg-warning "invalid UTF-8 character <80><80><80> in comment" } > +// a����a { dg-warning "invalid UTF-8 character <8f> in comment" } > +// a����a { dg-warning "invalid UTF-8 character <90><80><80> in comment" } > +/* a€߿ࠀ퟿𐀀􏿿a { dg-bogus "invalid UTF-8 character" } */ > +/* a�a { dg-warning "invalid UTF-8 character <80> in comment" } */ > +/* a�a { dg-warning "invalid UTF-8 character in comment" } */ > +/* a�a { dg-warning "invalid UTF-8 character in comment" } */ > +/* a�a { dg-warning "invalid UTF-8 character in comment" } */ > +/* a�a { dg-warning "invalid UTF-8 character in comment" } */ > +/* a�a { dg-warning "invalid UTF-8 character in comment" } */ > +/* a�a { dg-warning "invalid UTF-8 character in comment" } */ > +/* a�a { dg-warning "invalid UTF-8 character in comment" } */ > +/* a���a { dg-warning "invalid UTF-8 character <80> in comment" } */ > +/* a���a { dg-warning "invalid UTF-8 character <9f><80> in comment" } */ > +/* a��a { dg-warning "invalid UTF-8 character in comment" } */ > +/* a��a { dg-warning "invalid UTF-8 character <80> in comment" } */ > +/* a���a { dg-warning "invalid UTF-8 character <80> in comment" } */ > +/* a����a { dg-warning "invalid UTF-8 character <80><80><80> in comment" } */ > +/* a����a { dg-warning "invalid UTF-8 character <8f> in comment" } */ > +/* a����a { dg-warning "invalid UTF-8 character <90><80><80> in comment" } */ > > Jakub >