From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <53c4b971-4f14-848c-e921-e10d6f18407f@redhat.com>
Date: Mon, 29 Aug 2022 23:31:54 -0400
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.13.0
Subject: Re: [PATCH] libcpp: Add -Winvalid-utf8 warning [PR106655]
To: Jakub Jelinek <jakub@redhat.com>
Cc: gcc-patches@gcc.gnu.org
References: <Ywx1jiBRWLmv2NOZ@tucnak>
 <e93528c7-efa8-a328-2af5-a2c2ce59dfe4@redhat.com> <Yw0xL4sOmYH08VP0@tucnak>
From: Jason Merrill <jason@redhat.com>
In-Reply-To: <Yw0xL4sOmYH08VP0@tucnak>
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: <gcc-patches.gcc.gnu.org>

On 8/29/22 17:35, Jakub Jelinek wrote:
> On Mon, Aug 29, 2022 at 05:15:26PM -0400, Jason Merrill wrote:
>> On 8/29/22 04:15, Jakub Jelinek wrote:
>>> Hi!
>>>
>>> The following patch introduces a new warning, -Winvalid-utf8, similar to
>>> the one clang now has, to diagnose invalid UTF-8 byte sequences in
>>> comments.  Invalid sequences in identifiers and string literals should
>>> already be diagnosed, but comment contents have not really been verified.
>>>
>>> I'm not sure whether this is enough to say P2295R6 is implemented.
>>>
>>> The problem is that in the most common case, people don't use the
>>> -finput-charset= option and the sources are often UTF-8, but sometimes
>>> they could be in some ASCII-compatible single-byte encoding where non-ASCII
>>> characters only appear in comments.  So having the warning off by default
>>> is IMO desirable.  Now, if people use an explicit -finput-charset=UTF-8,
>>> perhaps we could make the warning on by default for C++23 and use pedwarn
>>> instead of warning, because then the user has told us explicitly that the
>>> source is UTF-8.  From the paper I understood that one of the
>>> implementation options is to claim that the implementation supports 2
>>> encodings: UTF-8, and a UTF-8-like encoding where invalid UTF-8 characters
>>> in comments are replaced, say, by spaces.  The latter could be the default
>>> and the former only used if the -finput-charset=UTF-8
>>> -Werror=invalid-utf8 options are used.
>>>
>>> Thoughts on this?
>>
>> That sounds good to me.
> 
> The pedwarn on -std=c++23 -finput-charset=UTF-8, or just documenting that
> "conforming" UTF-8 is only -finput-charset=UTF-8 -Werror=invalid-utf8?

The former: the pedwarn on -std=c++23 -finput-charset=UTF-8.

>>> +static const uchar *
>>> +_cpp_warn_invalid_utf8 (cpp_reader *pfile)
>>> +{
>>> +  cpp_buffer *buffer = pfile->buffer;
>>> +  const uchar *cur = buffer->cur;
>>> +
>>> +  if (cur[0] < utf8_signifier
>>> +      || cur[1] < utf8_continuation || cur[1] >= utf8_signifier)
>>> +    {
>>> +      cpp_warning_with_line (pfile, CPP_W_INVALID_UTF8,
>>> +			     pfile->line_table->highest_line,
>>> +			     CPP_BUF_COL (buffer),
>>> +			     "invalid UTF-8 character <%x> in comment",
>>> +			     cur[0]);
>>> +      return cur + 1;
>>> +    }
>>> +  else if (cur[2] < utf8_continuation || cur[2] >= utf8_signifier)
>>
>> Unicode Table 3-7 says that the second byte is sometimes restricted to a
>> narrower range than this.
> 
> That is true, and I've tried to include tests for all of those cases in the
> testcase; all of them get a warning.  Some of them are caught by:
>    /* Make sure the shortest possible encoding was used.  */
> 
>    if (c <=      0x7F && nbytes > 1) return EILSEQ;
>    if (c <=     0x7FF && nbytes > 2) return EILSEQ;
>    if (c <=    0xFFFF && nbytes > 3) return EILSEQ;
>    if (c <=  0x1FFFFF && nbytes > 4) return EILSEQ;
>    if (c <= 0x3FFFFFF && nbytes > 5) return EILSEQ;
> and others are caught by:
>    /* Make sure the character is valid.  */
>    if (c > 0x7FFFFFFF || (c >= 0xD800 && c <= 0xDFFF)) return EILSEQ;
> All I had to do outside of what one_utf8_to_cppchar already implements was:
> 
>>> +	      if (_cpp_valid_utf8 (pfile, &pstr, buffer->rlimit, 0, NULL, &s)
>>> +		  && s <= 0x0010FFFF)
> 
> the <= 0x0010FFFF check, because one_utf8_to_cppchar as written happily
> accepts UTF-8 sequences up to 6 bytes long, which can encode up to 7FFFFFFF:
>     00000000-0000007F   0xxxxxxx
>     00000080-000007FF   110xxxxx 10xxxxxx
>     00000800-0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
>     00010000-001FFFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
>     00200000-03FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>     04000000-7FFFFFFF   1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> while Table 3-7 only covers encoding 0..D7FF and E000..10FFFF in up to 4
> bytes.
> 
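
To make the comparison above concrete, here is a minimal sketch in C of a
checker that follows Unicode Table 3-7 directly, with the narrower
second-byte ranges after E0, ED, F0 and F4.  The name utf8_wellformed_len
is hypothetical and this is not the libcpp code; it is only meant to
illustrate that, as far as I can tell, the per-lead-byte ranges reject the
same inputs as the shortest-encoding, surrogate and <= 0x10FFFF checks
combined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libcpp.  Returns the length (1..4) of
   the well-formed UTF-8 sequence starting at S, or 0 if the bytes are
   ill-formed according to Unicode Table 3-7.  N is the number of bytes
   available at S.  */
static size_t
utf8_wellformed_len (const unsigned char *s, size_t n)
{
  assert (s != NULL);
  if (n == 0)
    return 0;

  unsigned char b0 = s[0];
  if (b0 <= 0x7F)
    return 1;

  size_t len;
  unsigned char lo = 0x80, hi = 0xBF;	/* Default second-byte range.  */
  if (b0 >= 0xC2 && b0 <= 0xDF)
    len = 2;				/* C0 and C1 would be overlong.  */
  else if (b0 >= 0xE0 && b0 <= 0xEF)
    {
      len = 3;
      if (b0 == 0xE0)
	lo = 0xA0;			/* Excludes overlong forms < U+0800.  */
      else if (b0 == 0xED)
	hi = 0x9F;			/* Excludes surrogates D800..DFFF.  */
    }
  else if (b0 >= 0xF0 && b0 <= 0xF4)
    {
      len = 4;
      if (b0 == 0xF0)
	lo = 0x90;			/* Excludes overlong forms < U+10000.  */
      else if (b0 == 0xF4)
	hi = 0x8F;			/* Excludes values above U+10FFFF.  */
    }
  else
    return 0;				/* 80..C1 and F5..FF never lead.  */

  if (n < len || s[1] < lo || s[1] > hi)
    return 0;
  for (size_t i = 2; i < len; i++)
    if (s[i] < 0x80 || s[i] > 0xBF)
      return 0;
  return len;
}
```

With this sketch, F4 8F BF BF (U+10FFFF) is accepted with length 4, while
F4 90 80 80 (which would decode to 0x110000) and ED A0 80 (U+D800) both
return 0.
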
> I guess I should try what happens with 0x110000 and 0x7fffffff in
> identifiers and string literals.
> 
> 	Jakub
>
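
For the experiment with 0x110000 and 0x7fffffff, the byte sequences have to
be produced by hand, since a conforming encoder will not emit them.  A small
sketch, assuming the original 1-to-6-byte scheme from the table quoted
above; encode_utf8_extended is a hypothetical helper, not something that
exists in GCC:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical test helper.  Encodes C (at most 0x7FFFFFFF) using the
   original 1..6-byte UTF-8 scheme and stores the bytes in OUT.  Returns
   the number of bytes written.  Surrogates and values above 0x10FFFF are
   deliberately not rejected, since the point is to build invalid input.  */
static size_t
encode_utf8_extended (uint32_t c, unsigned char out[6])
{
  assert (c <= 0x7FFFFFFF);

  size_t len;
  if (c <= 0x7F)
    {
      out[0] = (unsigned char) c;
      return 1;
    }
  else if (c <= 0x7FF)
    len = 2;
  else if (c <= 0xFFFF)
    len = 3;
  else if (c <= 0x1FFFFF)
    len = 4;
  else if (c <= 0x3FFFFFF)
    len = 5;
  else
    len = 6;

  /* Fill the continuation bytes from the end, 6 payload bits each.  */
  for (size_t i = len - 1; i > 0; i--)
    {
      out[i] = (unsigned char) (0x80 | (c & 0x3F));
      c >>= 6;
    }

  /* Lead byte: LEN one bits, a zero bit, then the remaining payload.  */
  out[0] = (unsigned char) ((0xFF << (8 - len)) | c);
  return len;
}
```

This gives F4 90 80 80 for 0x110000 and FD BF BF BF BF BF for 0x7FFFFFFF,
which can then be placed as raw bytes in a test source.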