From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jakub@redhat.com>
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [170.10.133.124])
 by sourceware.org (Postfix) with ESMTPS id 0F28D385843E
 for <gcc-patches@gcc.gnu.org>; Wed, 17 Aug 2022 21:19:23 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 0F28D385843E
Received: from mimecast-mx02.redhat.com (mx3-rdu2.redhat.com
 [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS
 (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 us-mta-626--H_HEo1PM4GNbInjksGGpg-1; Wed, 17 Aug 2022 17:19:19 -0400
X-MC-Unique: -H_HEo1PM4GNbInjksGGpg-1
Received: from smtp.corp.redhat.com (int-mx09.intmail.prod.int.rdu2.redhat.com
 [10.11.54.9])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 7DB4C1C051A6;
 Wed, 17 Aug 2022 21:19:19 +0000 (UTC)
Received: from tucnak.zalov.cz (unknown [10.39.192.41])
 by smtp.corp.redhat.com (Postfix) with ESMTPS id 263A8492C3B;
 Wed, 17 Aug 2022 21:19:19 +0000 (UTC)
Received: from tucnak.zalov.cz (localhost [127.0.0.1])
 by tucnak.zalov.cz (8.17.1/8.17.1) with ESMTPS id 27HLJGAm3135993
 (version=TLSv1.3 cipher=TLS_AES_256_GCM_SHA384 bits=256 verify=NOT);
 Wed, 17 Aug 2022 23:19:16 +0200
Received: (from jakub@localhost)
 by tucnak.zalov.cz (8.17.1/8.17.1/Submit) id 27HLJFKF3135992;
 Wed, 17 Aug 2022 23:19:15 +0200
Date: Wed, 17 Aug 2022 23:19:14 +0200
From: Jakub Jelinek <jakub@redhat.com>
To: Jason Merrill <jason@redhat.com>
Cc: Marek Polacek <polacek@redhat.com>,
 "Joseph S. Myers" <joseph@codesourcery.com>, gcc-patches@gcc.gnu.org
Subject: Re: [PATCH] libcpp: Implement C++23 P2290R3 - Delimited escape
 sequences [PR106645]
Message-ID: <Yv1bUmjit6zH+Jw0@tucnak>
Reply-To: Jakub Jelinek <jakub@redhat.com>
References: <YvyWH6JZYXjkPO49@tucnak>
 <d6c2553d-8ed3-b0e3-6d1d-0445103f1a0a@redhat.com>
MIME-Version: 1.0
In-Reply-To: <d6c2553d-8ed3-b0e3-6d1d-0445103f1a0a@redhat.com>
X-Scanned-By: MIMEDefang 2.85 on 10.11.54.9
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
X-Spam-Status: No, score=-4.2 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, RCVD_IN_DNSWL_NONE,
 SPF_HELO_NONE, SPF_NONE, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Wed, 17 Aug 2022 21:19:24 -0000

On Wed, Aug 17, 2022 at 04:47:19PM -0400, Jason Merrill via Gcc-patches wrote:
> > +	  length = 32;
> 
> /* Magic value to indicate no digits seen.  */

Indeed, will add the comment.

> > +	  delimited = true;
> > +	  if (loc_reader)
> > +	    char_range->m_finish = loc_reader->get_next ().m_finish;
> > +	}
> > +    }
> >     else if (str[-1] == 'U')
> >       length = 8;
> >     else
> > @@ -1107,6 +1118,8 @@ _cpp_valid_ucn (cpp_reader *pfile, const
> >     result = 0;
> >     do
> >       {
> > +      if (str == limit)
> > +	break;
> >         c = *str;
> >         if (!ISXDIGIT (c))
> >   	break;
> > @@ -1116,9 +1129,41 @@ _cpp_valid_ucn (cpp_reader *pfile, const
> >   	  gcc_assert (char_range);
> >   	  char_range->m_finish = loc_reader->get_next ().m_finish;
> >   	}
> > +      if (delimited)
> > +	{
> > +	  if (!result)
> > +	    /* Accept arbitrary number of leading zeros.  */
> > +	    length = 16;
> > +	  else if (length == 8)
> > +	    {
> > +	      /* Make sure we detect overflows.  */
> > +	      result |= 0x8000000;
> > +	      ++length;
> > +	    }
> 
> 16 above so that this case happens after we read 8 digits after leading
> zeroes?

Another magic value less than the no digits seen one and
> 8, so that it can count 8 digits with the first non-zero one after
which to or in the overflow flag.  The intent is not to break the loop
if there are further digits, just that there will be overflow.
Another option would be those overflow |= n ^ (n << 4 >> 4);
tests that convert_hex does and just making sure length is never decremented
(except we need a way to distinguish between \u{} and at least one digit).

> > +      if (loc_reader)
> > +	char_range->m_finish = loc_reader->get_next ().m_finish;
> 
> Here and in other functions, the pattern of increment the input pointer and
> update m_finish seems like it should be a macro?

Perhaps or inline function.  Before my patch, there are 5 such ifs
(some with char_range.m_finish and others char_range->m_finish),
the patch adds another 7 such spots.

> > @@ -2119,15 +2255,23 @@ _cpp_interpret_identifier (cpp_reader *p
> >   	cppchar_t value = 0;
> >   	size_t bufleft = len - (bufp - buf);
> >   	int rval;
> > +	bool delimited = false;
> >   	idp += 2;
> > +	if (length == 4 && id[idp] == '{')
> > +	  {
> > +	    delimited = true;
> > +	    idp++;
> > +	  }
> >   	while (length && idp < len && ISXDIGIT (id[idp]))
> >   	  {
> >   	    value = (value << 4) + hex_value (id[idp]);
> >   	    idp++;
> > -	    length--;
> > +	    if (!delimited)
> > +	      length--;
> >   	  }
> > -	idp--;
> > +	if (!delimited)
> > +	  idp--;
> 
> Don't we need to check that the first non-xdigit is a }?

The comments and my understanding of the code say that we first
check what is a valid identifier and the above is only called on
a valid identifier.  So, if it would be delimited \u{ not terminated
with }, then it would fail forms_identifier_p and wouldn't be included
in the range.  Thus e.g. the ISXDIGIT (id[id]) test is probably not needed
unless delimited is true because we've checked earlier that it has 4 or 8
hex digits.
But sure, if you want a id[idp] == '}' test or assertion, it can be
added.

	Jakub