public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
From: Thomas Schwinge <thomas@codesourcery.com>
To: Lewis Hyatt <lhyatt@gmail.com>, <gcc-patches@gcc.gnu.org>,
	"Richard Sandiford" <richard.sandiford@arm.com>,
	Jakub Jelinek <jakub@redhat.com>,
	David Malcolm <dmalcolm@redhat.com>
Subject: 'unsigned int len' field in 'libcpp/include/symtab.h:struct ht_identifier' (was: [PATCH] pch: Fix streaming of strings with embedded null bytes)
Date: Tue, 4 Jul 2023 17:50:34 +0200	[thread overview]
Message-ID: <87h6qjvfp1.fsf@euler.schwinge.homeip.net> (raw)
In-Reply-To: <aa605ce17fbe4783b46a2cea7b3fa6d99d2cbfe6.1666131048.git.lhyatt@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4817 bytes --]

Hi!

I came across this one here on my way working through another (somewhat
related) GTY issue.  I generally do understand the issue here, but do
have a question about 'unsigned int len' field in
'libcpp/include/symtab.h:struct ht_identifier':

On 2022-10-18T18:14:54-0400, Lewis Hyatt via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> When a GTY'ed struct is streamed to PCH, any plain char* pointers it contains
> (whether they live in GC-controlled memory or not) will be marked for PCH
> output by the routine gt_pch_note_object in ggc-common.cc. This routine
> special-cases plain char* strings, and in particular it uses strlen() to get
> their length.

Oh, wow, this special casing for strings...  8-|

> Thus it does not handle strings with embedded null bytes, but it
> is possible for something PCH cares about (such as a string literal token in a
> macro definition) to contain such embedded nulls. To fix that up, add a new
> GTY option "string_length" so that gt_pch_note_object can be informed the
> actual length it ought to use, and use it in the relevant libcpp structs
> (cpp_string and ht_identifier) accordingly.

For your test case:

> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.C
> @@ -0,0 +1,3 @@
> +// { dg-do compile { target c++11 } }
> +#include "pch-string-nulls.H"
> +static_assert (X[4] == '[' && X[5] == '!' && X[6] == ']', "error");

> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/pch/pch-string-nulls.Hs
> @@ -0,0 +1,2 @@
> +/* Note that there is a null byte following "ABC".  */
> +#define X R"(ABC^@[!])"

..., I understand how the following is necessary:

> --- a/libcpp/include/cpplib.h
> +++ b/libcpp/include/cpplib.h
> @@ -179,7 +179,11 @@ enum c_lang {CLK_GNUC89 = 0, CLK_GNUC99, CLK_GNUC11, CLK_GNUC17, CLK_GNUC2X,
>  /* Payload of a NUMBER, STRING, CHAR or COMMENT token.  */
>  struct GTY(()) cpp_string {
>    unsigned int len;
> -  const unsigned char *text;
> +
> +  /* TEXT is always null terminated (terminator not included in len); but this
> +     GTY markup arranges that PCH streaming works properly even if there is a
> +     null byte in the middle of the string.  */
> +  const unsigned char * GTY((string_length ("1 + %h.len"))) text;
>  };

(That is, the test case FAILs with that one reverted.)

However, this one did confuse me:

> --- a/libcpp/include/symtab.h
> +++ b/libcpp/include/symtab.h
> @@ -29,7 +29,10 @@ along with this program; see the file COPYING3.  If not see
>  typedef struct ht_identifier ht_identifier;
>  typedef struct ht_identifier *ht_identifier_ptr;
>  struct GTY(()) ht_identifier {
> -  const unsigned char *str;
> +  /* This GTY markup arranges that the null-terminated identifier would still
> +     stream to PCH correctly, if a null byte were to make its way into an
> +     identifier somehow.  */
> +  const unsigned char * GTY((string_length ("1 + %h.len"))) str;
>    unsigned int len;
>    unsigned int hash_value;
>  };

I did wonder whether that's an actual or just a theoretical concern, to
have 'ht_identifier's with embedded NULs?  If an actual concern, can we
get a test case constructed?  Otherwise, should we revert this hunk,
given that we have this auto-'strlen' handling, ignorant of embedded
NULs, in a lot of other places?

But then I realized that possibly we do maintain 'len' here not for
correctness but as an optimization (trading an 'unsigned int' for
repeated 'strlen' calls)?  My quick testing with the attached
"[RFC] Verify no embedded NULs in 'struct ht_identifier'" might confirm
this theory: no regressions (..., but I didn't bootstrap, and ran only
parts of the testsuite).  (Not proposing that RFC for 'git push', of
course.)

If that's indeed the intention here, I shall change/add source code
commentary to describe this rationale for this dedicated 'len' field
(plus, that handling of embedded NULs falls out of that, automatically).

For reference, this 'len' field has existed "forever".  Before
'struct ht_identifier' was added in
commit 2a967f3d3a45294640e155381ef549e0b8090ad4 (Subversion r42334), we
had in 'gcc/cpplib.h:struct cpp_hashnode': 'unsigned short len', or
earlier 'length', earlier in 'gcc/cpphash.h:struct hashnode':
'unsigned short length', earlier 'size_t length' with comment:
"length of token, for quick comparison", erlier 'int length', ever since
the 'gcc/cpp*' files were added in
commit 7f2935c734c36f84ab62b20a04de465e19061333 (Subversion r9191).


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-RFC-Verify-no-embedded-NULs-in-struct-ht_identifier.patch --]
[-- Type: text/x-diff, Size: 3729 bytes --]

From 75b54da2f30b7ffa0c69c1659e3f3538bfdbd339 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 4 Jul 2023 14:07:47 +0200
Subject: [PATCH] [RFC] Verify no embedded NULs in 'struct ht_identifier'

---
 gcc/analyzer/pending-diagnostic.cc | 2 +-
 gcc/c-family/c-spellcheck.h        | 2 +-
 gcc/c/c-decl.cc                    | 2 +-
 gcc/tree.h                         | 2 +-
 libcpp/include/symtab.h            | 4 ++--
 libcpp/symtab.cc                   | 2 +-
 6 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/gcc/analyzer/pending-diagnostic.cc b/gcc/analyzer/pending-diagnostic.cc
index e36ed4fd9c1..9452719a908 100644
--- a/gcc/analyzer/pending-diagnostic.cc
+++ b/gcc/analyzer/pending-diagnostic.cc
@@ -128,7 +128,7 @@ pending_diagnostic::same_tree_p (tree t1, tree t2)
 static bool
 ht_ident_eq (ht_identifier ident, const char *str)
 {
-  return (strlen (str) == ident.len
+  return (strlen (str) == HT_LEN (&ident)
 	  && 0 == strcmp (str, (const char *)ident.str));
 }
 
diff --git a/gcc/c-family/c-spellcheck.h b/gcc/c-family/c-spellcheck.h
index e3748430b18..0f0c264ef01 100644
--- a/gcc/c-family/c-spellcheck.h
+++ b/gcc/c-family/c-spellcheck.h
@@ -31,7 +31,7 @@ struct edit_distance_traits<cpp_hashnode *>
 {
   static size_t get_length (cpp_hashnode *hashnode)
   {
-    return hashnode->ident.len;
+    return HT_LEN (&hashnode->ident);
   }
 
   static const char *get_string (cpp_hashnode *hashnode)
diff --git a/gcc/c/c-decl.cc b/gcc/c/c-decl.cc
index 1af51c4acfc..c67ae5be514 100644
--- a/gcc/c/c-decl.cc
+++ b/gcc/c/c-decl.cc
@@ -4499,7 +4499,7 @@ lookup_name_fuzzy (tree name, enum lookup_name_fuzzy_kind kind, location_t loc)
     {
       const char *id = (const char *)best_macro->ident.str;
       tree macro_as_identifier
-	= get_identifier_with_length (id, best_macro->ident.len);
+	= get_identifier_with_length (id, HT_LEN (&best_macro->ident));
       bm.set_best_so_far (macro_as_identifier,
 			  bmm.get_best_distance (),
 			  bmm.get_best_candidate_length ());
diff --git a/gcc/tree.h b/gcc/tree.h
index 1854fe4a7d4..20169e7a467 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -1174,7 +1174,7 @@ extern void omp_clause_range_check_failed (const_tree, const char *, int,
 
 /* Unlike STRING_CST, in C terms this is strlen, not sizeof.  */
 #define IDENTIFIER_LENGTH(NODE) \
-  (IDENTIFIER_NODE_CHECK (NODE)->identifier.id.len)
+  (HT_LEN (&IDENTIFIER_NODE_CHECK (NODE)->identifier.id))
 #define IDENTIFIER_POINTER(NODE) \
   ((const char *) IDENTIFIER_NODE_CHECK (NODE)->identifier.id.str)
 #define IDENTIFIER_HASH_VALUE(NODE) \
diff --git a/libcpp/include/symtab.h b/libcpp/include/symtab.h
index c7ccc6db9f0..ea23e3dbedb 100644
--- a/libcpp/include/symtab.h
+++ b/libcpp/include/symtab.h
@@ -33,11 +33,11 @@ struct GTY(()) ht_identifier {
      stream to PCH correctly, if a null byte were to make its way into an
      identifier somehow.  */
   const unsigned char * GTY((string_length ("1 + %h.len"))) str;
-  unsigned int len;
+  unsigned int len_;
   unsigned int hash_value;
 };
 
-#define HT_LEN(NODE) ((NODE)->len)
+#define HT_LEN(NODE) ({ gcc_assert ((NODE)->len_ == strlen ((const char *) HT_STR (NODE))); (NODE)->len_; })
 #define HT_STR(NODE) ((NODE)->str)
 
 typedef struct ht cpp_hash_table;
diff --git a/libcpp/symtab.cc b/libcpp/symtab.cc
index ea9f26905a9..be5c1da9d67 100644
--- a/libcpp/symtab.cc
+++ b/libcpp/symtab.cc
@@ -155,7 +155,7 @@ ht_lookup_with_hash (cpp_hash_table *table, const unsigned char *str,
   node = (*table->alloc_node) (table);
   table->entries[index] = node;
 
-  HT_LEN (node) = (unsigned int) len;
+  node->len_ = (unsigned int) len;
   node->hash_value = hash;
 
   if (table->alloc_subobject)
-- 
2.34.1


  parent reply	other threads:[~2023-07-04 15:50 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-18 22:14 [PATCH] pch: Fix streaming of strings with embedded null bytes Lewis Hyatt
2022-10-19 11:54 ` Richard Sandiford
2022-10-19 12:08   ` Jakub Jelinek
2022-10-19 12:17     ` Richard Sandiford
2022-10-19 12:23       ` Jakub Jelinek
2022-10-19 12:47         ` Lewis Hyatt
2023-07-04 15:50 ` Thomas Schwinge [this message]
2023-07-04 19:56   ` 'unsigned int len' field in 'libcpp/include/symtab.h:struct ht_identifier' (was: [PATCH] pch: Fix streaming of strings with embedded null bytes) Lewis Hyatt
2023-07-05  7:56     ` GTY: Enhance 'string_length' option documentation (was: 'unsigned int len' field in 'libcpp/include/symtab.h:struct ht_identifier' (was: [PATCH] pch: Fix streaming of strings with embedded null bytes)) Thomas Schwinge
2023-07-05  8:15       ` Richard Biener
2023-07-05  7:50 ` GTY: Explicitly reject 'string_length' option for (fields in) global variables (was: [PATCH] pch: Fix streaming of strings with embedded null bytes) Thomas Schwinge
2023-07-05  8:13   ` Richard Biener

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87h6qjvfp1.fsf@euler.schwinge.homeip.net \
    --to=thomas@codesourcery.com \
    --cc=dmalcolm@redhat.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jakub@redhat.com \
    --cc=lhyatt@gmail.com \
    --cc=richard.sandiford@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).