public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Lewis Hyatt <lhyatt@gmail.com>
To: Thomas Schwinge <thomas@codesourcery.com>
Cc: gcc-patches@gcc.gnu.org,
	Richard Sandiford <richard.sandiford@arm.com>,
	 Jakub Jelinek <jakub@redhat.com>,
	David Malcolm <dmalcolm@redhat.com>
Subject: Re: 'unsigned int len' field in 'libcpp/include/symtab.h:struct ht_identifier' (was: [PATCH] pch: Fix streaming of strings with embedded null bytes)
Date: Tue, 4 Jul 2023 15:56:23 -0400	[thread overview]
Message-ID: <CAA_5UQ6VTXUjPBykS0YJaus8MdVkCJSm+cE2RR4844Af8S=t6w@mail.gmail.com> (raw)
In-Reply-To: <87h6qjvfp1.fsf@euler.schwinge.homeip.net>

On Tue, Jul 4, 2023 at 11:50 AM Thomas Schwinge <thomas@codesourcery.com> wrote:
>
> Hi!
>
> I came across this one here on my way working through another (somewhat
> related) GTY issue.  I generally do understand the issue here, but do
> have a question about 'unsigned int len' field in
> 'libcpp/include/symtab.h:struct ht_identifier':
>
> On 2022-10-18T18:14:54-0400, Lewis Hyatt via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > When a GTY'ed struct is streamed to PCH, any plain char* pointers it contains
> > (whether they live in GC-controlled memory or not) will be marked for PCH
> > output by the routine gt_pch_note_object in ggc-common.cc. This routine
> > special-cases plain char* strings, and in particular it uses strlen() to get
> > their length.
>
> Oh, wow, this special casing for strings...  8-|
>
> > Thus it does not handle strings with embedded null bytes, but it
> > is possible for something PCH cares about (such as a string literal token in a
> > macro definition) to contain such embedded nulls. To fix that up, add a new
> > GTY option "string_length" so that gt_pch_note_object can be informed the
> > actual length it ought to use, and use it in the relevant libcpp structs
> > (cpp_string and ht_identifier) accordingly.
>
> For your test case:
>
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.C
> > @@ -0,0 +1,3 @@
> > +// { dg-do compile { target c++11 } }
> > +#include "pch-string-nulls.H"
> > +static_assert (X[4] == '[' && X[5] == '!' && X[6] == ']', "error");
>
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.Hs
> > @@ -0,0 +1,2 @@
> > +/* Note that there is a null byte following "ABC".  */
> > +#define X R"(ABC^@[!])"
>
> ..., I understand how the following is necessary:
>
> > --- a/libcpp/include/cpplib.h
> > +++ b/libcpp/include/cpplib.h
> > @@ -179,7 +179,11 @@ enum c_lang {CLK_GNUC89 = 0, CLK_GNUC99, CLK_GNUC11, CLK_GNUC17, CLK_GNUC2X,
> >  /* Payload of a NUMBER, STRING, CHAR or COMMENT token.  */
> >  struct GTY(()) cpp_string {
> >    unsigned int len;
> > -  const unsigned char *text;
> > +
> > +  /* TEXT is always null terminated (terminator not included in len); but this
> > +     GTY markup arranges that PCH streaming works properly even if there is a
> > +     null byte in the middle of the string.  */
> > +  const unsigned char * GTY((string_length ("1 + %h.len"))) text;
> >  };
>
> (That is, the test case FAILs with that one reverted.)
>
> However, this one did confuse me:
>
> > --- a/libcpp/include/symtab.h
> > +++ b/libcpp/include/symtab.h
> > @@ -29,7 +29,10 @@ along with this program; see the file COPYING3.  If not see
> >  typedef struct ht_identifier ht_identifier;
> >  typedef struct ht_identifier *ht_identifier_ptr;
> >  struct GTY(()) ht_identifier {
> > -  const unsigned char *str;
> > +  /* This GTY markup arranges that the null-terminated identifier would still
> > +     stream to PCH correctly, if a null byte were to make its way into an
> > +     identifier somehow.  */
> > +  const unsigned char * GTY((string_length ("1 + %h.len"))) str;
> >    unsigned int len;
> >    unsigned int hash_value;
> >  };
>
> I did wonder whether that's an actual or just a theoretical concern, to
> have 'ht_identifier's with embedded NULs?  If an actual concern, can we
> get a test case constructed?  Otherwise, should we revert this hunk,
> given that we have this auto-'strlen' handling, ignorant of embedded
> NULs, in a lot of other places?
>
> But then I realized that possibly we do maintain 'len' here not for
> correctness but as an optimization (trading an 'unsigned int' for
> repeated 'strlen' calls)?  My quick testing with the attached
> "[RFC] Verify no embedded NULs in 'struct ht_identifier'" might confirm
> this theory: no regressions (..., but I didn't bootstrap, and ran only
> parts of the testsuite).  (Not proposing that RFC for 'git push', of
> course.)
>
> If that's indeed the intention here, I shall change/add source code
> commentary to describe this rationale for this dedicated 'len' field
> (plus, that handling of embedded NULs falls out of that, automatically).
>
> For reference, this 'len' field has existed "forever".  Before
> 'struct ht_identifier' was added in
> commit 2a967f3d3a45294640e155381ef549e0b8090ad4 (Subversion r42334), we
> had in 'gcc/cpplib.h:struct cpp_hashnode': 'unsigned short len', or
> earlier 'length', earlier in 'gcc/cpphash.h:struct hashnode':
> 'unsigned short length', earlier 'size_t length' with comment:
> "length of token, for quick comparison", erlier 'int length', ever since
> the 'gcc/cpp*' files were added in
> commit 7f2935c734c36f84ab62b20a04de465e19061333 (Subversion r9191).
>

I don't think there is currently any possibility for a null byte to
end up in an ht_identifier's string. I assumed that ht_identifier
stores the length as an optimization (especially since it doesn't take
up any extra space on 64-bit platforms, given the 32-bit hash code is
stored as well there.) I created the string_length GTY markup mainly
to support another patch that I have still pending review, which I
thought would increase the likelihood of PCH needing to handle null
bytes in general. When I did that, I added the markup to ht_identifier
simply because the length was already there, so there was no reason
not to add it. It does save a few cycles when streaming out the PCH,
but I doubt it is meaningful.

-Lewis

  reply	other threads:[~2023-07-04 19:56 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-18 22:14 [PATCH] pch: Fix streaming of strings with embedded null bytes Lewis Hyatt
2022-10-19 11:54 ` Richard Sandiford
2022-10-19 12:08   ` Jakub Jelinek
2022-10-19 12:17     ` Richard Sandiford
2022-10-19 12:23       ` Jakub Jelinek
2022-10-19 12:47         ` Lewis Hyatt
2023-07-04 15:50 ` 'unsigned int len' field in 'libcpp/include/symtab.h:struct ht_identifier' (was: [PATCH] pch: Fix streaming of strings with embedded null bytes) Thomas Schwinge
2023-07-04 19:56   ` Lewis Hyatt [this message]
2023-07-05  7:56     ` GTY: Enhance 'string_length' option documentation (was: 'unsigned int len' field in 'libcpp/include/symtab.h:struct ht_identifier' (was: [PATCH] pch: Fix streaming of strings with embedded null bytes)) Thomas Schwinge
2023-07-05  8:15       ` Richard Biener
2023-07-05  7:50 ` GTY: Explicitly reject 'string_length' option for (fields in) global variables (was: [PATCH] pch: Fix streaming of strings with embedded null bytes) Thomas Schwinge
2023-07-05  8:13   ` Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAA_5UQ6VTXUjPBykS0YJaus8MdVkCJSm+cE2RR4844Af8S=t6w@mail.gmail.com' \
    --to=lhyatt@gmail.com \
    --cc=dmalcolm@redhat.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jakub@redhat.com \
    --cc=richard.sandiford@arm.com \
    --cc=thomas@codesourcery.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).