From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lf1-x136.google.com (mail-lf1-x136.google.com [IPv6:2a00:1450:4864:20::136]) by sourceware.org (Postfix) with ESMTPS id BD3B83858D35 for ; Tue, 4 Jul 2023 19:56:36 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org BD3B83858D35 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com Received: by mail-lf1-x136.google.com with SMTP id 2adb3069b0e04-4fb96e2b573so9688794e87.3 for ; Tue, 04 Jul 2023 12:56:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1688500595; x=1691092595; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=DA6/WwnDkZ1WOH9PGZU8ZoV5kCn8YdTvWC1OFdZSrL8=; b=RZw8VfQIkBdP67/iaA8Xyl520Dpa5BbFq7TWMk2ZyinrJTtPzseW0GdMUY1aC6mVk2 JJVLStm7U5dh3EdbPLFiFktBkyB0NbO7QuPqufgUIcY9mXmsMO8DLQdK+/UIv6hHiPUp pANURu3uWZWfP44REiSkc2+oZjXggc1SvEFFriIQT3+gWfYjl77ff8uzdxyAl9SuL3EK /1329AVzn8jaSJQj2nbwRRGYNMnN7Tfy/F+sKG5dqbiFtOaAHn6pXlliKjKHzYx5CAw0 SkEvhr7H8TNkAWpigLl2I6VrjRkVyAEuFQNDpVlLN5znLHs8e+3/83GPsFB7w5QQEOIY i2Xw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1688500595; x=1691092595; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=DA6/WwnDkZ1WOH9PGZU8ZoV5kCn8YdTvWC1OFdZSrL8=; b=lGRqPt6OJeF+EfQdSrlwQ6nzVmjv98oOu1wZemg24v9mW1+VjSwB3ok7VLBQVyWXu0 N/qfP3phgZJUceEy+wF/9Enf0i/yzZfv/ObuFCFy80OwFcJFTOm7kBg+yxqtrw6XpDHl Og+M9077Mhr36kW/BF6Vg5XOwVRnATzAFI3VJ1b/aMiXo5qlNDlMSQ2uuptHqponeU8D /r/FvX6/EU73cGhKvFpAHWhU+OXSg44VIJY7iqYNNXK6UgWthYqgLd9Qvp9OU3h6Wk+M qZ9iEORGGyqnJBdLPUBRA8T1W9xU9G+hBCXfN69hh7BfF0gREDuQMVJ1Mh/aLGk/+rpE s39Q== X-Gm-Message-State: ABy/qLYj5IrBfTl+aJUPGKEyPUV8m26Wd7JRwT+PjWUBV9SSRlWUkZTC byQAP+7eFOTsJknVN0oBkz+s6XcPgwCYcqKDShgMhOZA X-Google-Smtp-Source: APBJJlGhgVA6pUmTfTT8qn2EAtSR922ZNxZtepfuSWvP98oq3BsACXCiwb6L3YdrpOgk4ECapdBlyQ0PcRnGpQYC3fQ= X-Received: by 2002:a19:7105:0:b0:4fa:6d62:9219 with SMTP id m5-20020a197105000000b004fa6d629219mr9526424lfc.62.1688500594700; Tue, 04 Jul 2023 12:56:34 -0700 (PDT) MIME-Version: 1.0 References: <87h6qjvfp1.fsf@euler.schwinge.homeip.net> In-Reply-To: <87h6qjvfp1.fsf@euler.schwinge.homeip.net> From: Lewis Hyatt Date: Tue, 4 Jul 2023 15:56:23 -0400 Message-ID: Subject: Re: 'unsigned int len' field in 'libcpp/include/symtab.h:struct ht_identifier' (was: [PATCH] pch: Fix streaming of strings with embedded null bytes) To: Thomas Schwinge Cc: gcc-patches@gcc.gnu.org, Richard Sandiford , Jakub Jelinek , David Malcolm Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-3029.6 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: On Tue, Jul 4, 2023 at 11:50=E2=80=AFAM Thomas Schwinge wrote: > > Hi! > > I came across this one here on my way working through another (somewhat > related) GTY issue. I generally do understand the issue here, but do > have a question about 'unsigned int len' field in > 'libcpp/include/symtab.h:struct ht_identifier': > > On 2022-10-18T18:14:54-0400, Lewis Hyatt via Gcc-patches wrote: > > When a GTY'ed struct is streamed to PCH, any plain char* pointers it co= ntains > > (whether they live in GC-controlled memory or not) will be marked for P= CH > > output by the routine gt_pch_note_object in ggc-common.cc. This routine > > special-cases plain char* strings, and in particular it uses strlen() t= o get > > their length. > > Oh, wow, this special casing for strings... 8-| > > > Thus it does not handle strings with embedded null bytes, but it > > is possible for something PCH cares about (such as a string literal tok= en in a > > macro definition) to contain such embedded nulls. To fix that up, add a= new > > GTY option "string_length" so that gt_pch_note_object can be informed t= he > > actual length it ought to use, and use it in the relevant libcpp struct= s > > (cpp_string and ht_identifier) accordingly. > > For your test case: > > > --- /dev/null > > +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.C > > @@ -0,0 +1,3 @@ > > +// { dg-do compile { target c++11 } } > > +#include "pch-string-nulls.H" > > +static_assert (X[4] =3D=3D '[' && X[5] =3D=3D '!' && X[6] =3D=3D ']', = "error"); > > > --- /dev/null > > +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.Hs > > @@ -0,0 +1,2 @@ > > +/* Note that there is a null byte following "ABC". */ > > +#define X R"(ABC^@[!])" > > ..., I understand how the following is necessary: > > > --- a/libcpp/include/cpplib.h > > +++ b/libcpp/include/cpplib.h > > @@ -179,7 +179,11 @@ enum c_lang {CLK_GNUC89 =3D 0, CLK_GNUC99, CLK_GNU= C11, CLK_GNUC17, CLK_GNUC2X, > > /* Payload of a NUMBER, STRING, CHAR or COMMENT token. */ > > struct GTY(()) cpp_string { > > unsigned int len; > > - const unsigned char *text; > > + > > + /* TEXT is always null terminated (terminator not included in len); = but this > > + GTY markup arranges that PCH streaming works properly even if the= re is a > > + null byte in the middle of the string. */ > > + const unsigned char * GTY((string_length ("1 + %h.len"))) text; > > }; > > (That is, the test case FAILs with that one reverted.) > > However, this one did confuse me: > > > --- a/libcpp/include/symtab.h > > +++ b/libcpp/include/symtab.h > > @@ -29,7 +29,10 @@ along with this program; see the file COPYING3. If = not see > > typedef struct ht_identifier ht_identifier; > > typedef struct ht_identifier *ht_identifier_ptr; > > struct GTY(()) ht_identifier { > > - const unsigned char *str; > > + /* This GTY markup arranges that the null-terminated identifier woul= d still > > + stream to PCH correctly, if a null byte were to make its way into= an > > + identifier somehow. */ > > + const unsigned char * GTY((string_length ("1 + %h.len"))) str; > > unsigned int len; > > unsigned int hash_value; > > }; > > I did wonder whether that's an actual or just a theoretical concern, to > have 'ht_identifier's with embedded NULs? If an actual concern, can we > get a test case constructed? Otherwise, should we revert this hunk, > given that we have this auto-'strlen' handling, ignorant of embedded > NULs, in a lot of other places? > > But then I realized that possibly we do maintain 'len' here not for > correctness but as an optimization (trading an 'unsigned int' for > repeated 'strlen' calls)? My quick testing with the attached > "[RFC] Verify no embedded NULs in 'struct ht_identifier'" might confirm > this theory: no regressions (..., but I didn't bootstrap, and ran only > parts of the testsuite). (Not proposing that RFC for 'git push', of > course.) > > If that's indeed the intention here, I shall change/add source code > commentary to describe this rationale for this dedicated 'len' field > (plus, that handling of embedded NULs falls out of that, automatically). > > For reference, this 'len' field has existed "forever". Before > 'struct ht_identifier' was added in > commit 2a967f3d3a45294640e155381ef549e0b8090ad4 (Subversion r42334), we > had in 'gcc/cpplib.h:struct cpp_hashnode': 'unsigned short len', or > earlier 'length', earlier in 'gcc/cpphash.h:struct hashnode': > 'unsigned short length', earlier 'size_t length' with comment: > "length of token, for quick comparison", erlier 'int length', ever since > the 'gcc/cpp*' files were added in > commit 7f2935c734c36f84ab62b20a04de465e19061333 (Subversion r9191). > I don't think there is currently any possibility for a null byte to end up in an ht_identifier's string. I assumed that ht_identifier stores the length as an optimization (especially since it doesn't take up any extra space on 64-bit platforms, given the 32-bit hash code is stored as well there.) I created the string_length GTY markup mainly to support another patch that I have still pending review, which I thought would increase the likelihood of PCH needing to handle null bytes in general. When I did that, I added the markup to ht_identifier simply because the length was already there, so there was no reason not to add it. It does save a few cycles when streaming out the PCH, but I doubt it is meaningful. -Lewis