Date: Mon, 29 Aug 2022 23:35:43 +0200
From: Jakub Jelinek
To: Jason Merrill
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] libcpp: Add -Winvalid-utf8 warning [PR106655]

On Mon, Aug 29, 2022 at 05:15:26PM -0400, Jason Merrill wrote:
> On 8/29/22 04:15, Jakub Jelinek wrote:
> > Hi!
> >
> > The following patch introduces a new warning - -Winvalid-utf8 similarly
> > to what clang now has - to diagnose invalid UTF-8 byte sequences in
> > comments.  In identifiers and in string literals it should be diagnosed
> > already but comment content hasn't been really verified.
> >
> > I'm not sure if this is enough to say P2295R6 is implemented or not.
> >
> > The problem is that in the most common case, people don't use
> > -finput-charset= option and the sources often are UTF-8, but sometimes
> > could be some ASCII compatible single byte encoding where non-ASCII
> > characters only appear in comments.  So having the warning off by default
> > is IMO desirable.  Now, if people use explicit -finput-charset=UTF-8,
> > perhaps we could make the warning on by default for C++23 and use pedwarn
> > instead of warning, because then the user told us explicitly that the source
> > is UTF-8.
> > From the paper I understood one of the implementation options
> > is to claim that the implementation supports 2 encodings, UTF-8 and UTF-8
> > like encodings where invalid UTF-8 characters in comments are replaced say
> > by spaces, where the latter could be the default and the former only
> > used if -finput-charset=UTF-8 -Werror=invalid-utf8 options are used.
> >
> > Thoughts on this?
>
> That sounds good to me.

Which one: the pedwarn on -std=c++23 -finput-charset=UTF-8, or just
documenting that "conforming" UTF-8 is only
-finput-charset=UTF-8 -Werror=invalid-utf8?

> > +static const uchar *
> > +_cpp_warn_invalid_utf8 (cpp_reader *pfile)
> > +{
> > +  cpp_buffer *buffer = pfile->buffer;
> > +  const uchar *cur = buffer->cur;
> > +
> > +  if (cur[0] < utf8_signifier
> > +      || cur[1] < utf8_continuation || cur[1] >= utf8_signifier)
> > +    {
> > +      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
> > +                             pfile->line_table->highest_line,
> > +                             CPP_BUF_COL (buffer),
> > +                             "invalid UTF-8 character <%x> in comment",
> > +                             cur[0]);
> > +      return cur + 1;
> > +    }
> > +  else if (cur[2] < utf8_continuation || cur[2] >= utf8_signifier)
>
> Unicode table 3-7 says that the second byte is sometimes restricted to less
> than this range.

That is true, and I've tried to include tests for all of those cases in the
testcase; all of them get a warning.  Some of them are caught by:

  /* Make sure the shortest possible encoding was used.  */
  if (c <=      0x7F && nbytes > 1) return EILSEQ;
  if (c <=     0x7FF && nbytes > 2) return EILSEQ;
  if (c <=    0xFFFF && nbytes > 3) return EILSEQ;
  if (c <=  0x1FFFFF && nbytes > 4) return EILSEQ;
  if (c <= 0x3FFFFFF && nbytes > 5) return EILSEQ;

and others by:

  /* Make sure the character is valid.  */
  if (c > 0x7FFFFFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;

The only thing I had to add on top of what one_utf8_to_cppchar already
implements was the <= 0x0010FFFF check in:

> > +  if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
> > +      && s <= 0x0010FFFF)

because one_utf8_to_cppchar as written happily accepts UTF-8 sequences up to
6 bytes long, which can encode up to 7FFFFFFF:

 00000000-0000007F   0xxxxxxx
 00000080-000007FF   110xxxxx 10xxxxxx
 00000800-0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
 00010000-001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 00200000-03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
 04000000-7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

while table 3-7 only talks about encoding 0..D7FF and E000..10FFFF in up to
4 bytes.

I guess I should try what happens with 0x110000 and 0x7fffffff in
identifiers and string literals.

        Jakub
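
To make the interplay of those checks concrete, here is a minimal standalone
sketch.  It is not the libcpp code: decode_utf8_legacy, first_seq_is_unicode
and utf8_examples are hypothetical names used only for illustration, and the
decoder just mimics the 1-6 byte scheme plus the shortest-encoding and
surrogate checks described above.  The extra <= 0x10FFFF comparison is then
what separates "decodable" from "valid per table 3-7".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of libcpp.  Decode one sequence from
   s[0..len) using the historical 1-6 byte scheme.  Returns the number of
   bytes consumed and stores the code point in *out, or returns 0 if the
   sequence is truncated, malformed, overlong, or encodes a surrogate.  */
size_t
decode_utf8_legacy (const unsigned char *s, size_t len, uint32_t *out)
{
  /* Smallest code point that genuinely needs N bytes, indexed by N.  */
  static const uint32_t min_for_len[7]
    = { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
  size_t nbytes, i;
  uint32_t c;

  if (len == 0)
    return 0;
  if (s[0] < 0x80)
    {
      *out = s[0];
      return 1;
    }
  if (s[0] < 0xC0 || s[0] >= 0xFE)
    return 0;                   /* Stray continuation byte, or 0xFE/0xFF.  */

  if (s[0] < 0xE0)      { nbytes = 2; c = s[0] & 0x1F; }
  else if (s[0] < 0xF0) { nbytes = 3; c = s[0] & 0x0F; }
  else if (s[0] < 0xF8) { nbytes = 4; c = s[0] & 0x07; }
  else if (s[0] < 0xFC) { nbytes = 5; c = s[0] & 0x03; }
  else                  { nbytes = 6; c = s[0] & 0x01; }

  if (len < nbytes)
    return 0;                   /* Truncated sequence.  */
  for (i = 1; i < nbytes; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;               /* Not a 10xxxxxx continuation byte.  */
      c = (c << 6) | (s[i] & 0x3F);
    }

  if (c < min_for_len[nbytes])
    return 0;                   /* Shortest possible encoding not used.  */
  if (c >= 0xD800 && c <= 0xDFFF)
    return 0;                   /* Surrogates are never valid.  */

  *out = c;
  return nbytes;
}

/* Nonzero if the first sequence in s[0..len) decodes and also fits the
   Unicode range, i.e. the additional <= 0x10FFFF check.  */
int
first_seq_is_unicode (const unsigned char *s, size_t len)
{
  uint32_t c;
  return decode_utf8_legacy (s, len, &c) != 0 && c <= 0x10FFFF;
}

/* Usage: U+110000 and 0x7FFFFFFF decode fine under the legacy scheme and
   are only rejected by the extra range check.  */
void
utf8_examples (void)
{
  const unsigned char max_ok[] = { 0xF4, 0x8F, 0xBF, 0xBF };  /* U+10FFFF */
  const unsigned char beyond[] = { 0xF4, 0x90, 0x80, 0x80 };  /* 0x110000 */
  const unsigned char six[] = { 0xFD, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF };
  uint32_t c;

  assert (first_seq_is_unicode (max_ok, sizeof max_ok));
  assert (decode_utf8_legacy (beyond, sizeof beyond, &c) == 4
          && c == 0x110000);
  assert (!first_seq_is_unicode (beyond, sizeof beyond));
  assert (decode_utf8_legacy (six, sizeof six, &c) == 6 && c == 0x7FFFFFFF);
  assert (!first_seq_is_unicode (six, sizeof six));
}
```

In this sketch the overlong and surrogate cases from table 3-7 never reach
the range comparison, because the decoder already rejects them; only values
above 10FFFF depend on it.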