From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <lhyatt@gmail.com>
Received: from mail-lj1-x229.google.com (mail-lj1-x229.google.com [IPv6:2a00:1450:4864:20::229])
	by sourceware.org (Postfix) with ESMTPS id A5A7F3858C2D
	for <gcc-patches@gcc.gnu.org>; Sat,  5 Nov 2022 17:28:35 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org A5A7F3858C2D
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
Received: by mail-lj1-x229.google.com with SMTP id h12so10518389ljg.9
        for <gcc-patches@gcc.gnu.org>; Sat, 05 Nov 2022 10:28:35 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:from:to:cc:subject:date:message-id:reply-to;
        bh=E+3xsgBy3155zhRxBvPNE2psexkw4L1i9Bc+JIzC+kY=;
        b=NLiYbwbITw8ayJNOcdkf6Q7Tet/tGFZpn71qBov8cRky4G5okfyTvLcHhhh50HdPtO
         WoTNnNNWIYDNe2jGuhHcKImPpFzRoWg4+zPMoX6h8lgVVkZg5lKhuWi5qO7M/yZaV/6U
         PfcU5JcCDrC998TjdqIlJ5fcWPQAqk6iAciZEuOXuBroMZQ9sxrHKM3rz7jfOM7zmL8c
         SbI+FAjF5swa0FexXelKCW66cE9V4ivtrQtbGX1r7xI3p/+H2amWbPrZtUn4v2wZ+gfp
         Ve4SxAwBBdeN2m9cK8dA7MlSSN02sTykUdwUEIOpqqdaG2nSCjMoqh51rOEge+F905lX
         GFzA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=E+3xsgBy3155zhRxBvPNE2psexkw4L1i9Bc+JIzC+kY=;
        b=4vq2vZiPsIueZrElyWPkrKF+Pi/Qkw0hXHIFFgfbexGXsYnzy3zAM7kaP2zNecK6HP
         O0MtaZwZ6MXfbIwpcMtp7eT6Z0EJt2J2wvj58fdFET3LGTxdEPrn66nWwT6WopOxZ6wR
         FkL9sQlavKnZqDQNNRboY3hzuB8Qoho/pBEZxM/htZokijBM4HFBlvoew76Rd1xCAvha
         vVJXvXGOBFbn6UeW2HTXRF/aB3zgEbZkcl2wyRSNcNg2yqmmiU1tmW7aDMpbv1iBUEKq
         v64oxhCOzR4I+bVwnP4aG6+/e+d9wGu5m1sAOFwkZTIRdz0HwhsWAmyhY+43tJdNQISL
         YmtQ==
X-Gm-Message-State: ACrzQf0sel7wDId5LL0LYtcuPMxIJh7GUUXrRQE7O+k7SEag1cvCdDHl
	+yDQA6ACpHQVH3y+7eNkW0OghU3tJTeU7WhNlE8=
X-Google-Smtp-Source: AMsMyM5RfZJpS71CXzg/sfA8J7Ifwa6KSStBp51oroaDxWGdV0zXkYxef5rPR9Khpika8013Hl7o4JHijge+oCM8u0I=
X-Received: by 2002:a2e:8ecc:0:b0:277:34bb:ea2 with SMTP id
 e12-20020a2e8ecc000000b0027734bb0ea2mr3235231ljl.427.1667669313405; Sat, 05
 Nov 2022 10:28:33 -0700 (PDT)
MIME-Version: 1.0
References: <cover.1667514153.git.lhyatt@gmail.com> <dbe5ba2a7d067eb725c0733ca3960fab969cf139.1667514153.git.lhyatt@gmail.com>
 <298deff4873718eb8b5e366007e8f2d4baa83bdb.camel@redhat.com>
In-Reply-To: <298deff4873718eb8b5e366007e8f2d4baa83bdb.camel@redhat.com>
From: Lewis Hyatt <lhyatt@gmail.com>
Date: Sat, 5 Nov 2022 13:28:21 -0400
Message-ID: <CAA_5UQ4YKOsLZUXV-Puv261mN3aBMNjsU3RsMxTfd=Se0DGE4Q@mail.gmail.com>
Subject: Re: [PATCH 4/6] diagnostics: libcpp: Add LC_GEN linemaps to support
 in-memory buffers
To: David Malcolm <dmalcolm@redhat.com>
Cc: gcc-patches@gcc.gnu.org
Content-Type: text/plain; charset="UTF-8"
X-Spam-Status: No, score=-3035.4 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,GIT_PATCH_0,KAM_SHORT,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Thanks for the comments! I have some replies below.

On Sat, Nov 5, 2022 at 12:23 PM David Malcolm <dmalcolm@redhat.com> wrote:
> [...snip...]
> >
> > diff --git a/gcc/c-family/c-common.cc b/gcc/c-family/c-common.cc
> > index 5890c18bdc3..2935d7fb236 100644
> > --- a/gcc/c-family/c-common.cc
> > +++ b/gcc/c-family/c-common.cc
> > @@ -9183,11 +9183,14 @@ try_to_locate_new_include_insertion_point (const char *file, location_t loc)
> >        const line_map_ordinary *ord_map
> >         = LINEMAPS_ORDINARY_MAP_AT (line_table, i);
> >
> > +      if (ord_map->reason == LC_GEN)
> > +       continue;
> > +
> >        if (const line_map_ordinary *from
> >           = linemap_included_from_linemap (line_table, ord_map))
> >         /* We cannot use pointer equality, because with preprocessed
> >            input all filename strings are unique.  */
> > -       if (0 == strcmp (from->to_file, file))
> > +       if (from->reason != LC_GEN && 0 == strcmp (from->to_file, file))
> >           {
> >             last_include_ord_map = from;
> >             last_ord_map_after_include = NULL;
>
> [...snip...]
>
> I'm not a fan of having the "to_file" field change meaning based on
> whether reason is LC_GEN.
>
> How involved would it be to split line_map_ordinary into two
> subclasses, so that we'd have this hierarchy (with indentation showing
> inheritance):
>
> line_map
>   line_map_ordinary
>     line_map_ordinary_file
>     line_map_ordinary_generated
>   line_map_macro
>
> Alternatively, how about renaming "to_file" to be "data" (or "m_data"),
> to emphasize that it might not be a filename, and that we have to check
> everywhere we access that field.
>

Yeah, there were definitely a lot of ways to go about this. I settled
on the approach of minimizing the changes to libcpp for a couple
reasons. One is that I didn't want to add any extra overhead to
handling of non-_Pragma lexing, which is of course most of the time. I
think it's nice that lex.cpp was not touched at all for this change,
for example. The reason I re-used the to_file field was that this
class seems to be very concerned about minimizing space overhead (c.f.
all the comments about pointer alignment boundaries, etc.) I feel like
the reason for that attention was that the addition of macro location
tracking added a lot of overhead when it was implemented and the
authors wanted to minimize that. Nowadays, perhaps the RAM usage is
not as much of a concern. We do create a lot of line_map instances,
though. The other reason is that the line-maps API is already pretty
error-prone to use. A given location_t could be an ordinary location,
or a virtual location, or an ad-hoc location. Going through the
_Pragma location-related bugs that have been fixed over the years, it
seems like most of them stemmed from failing to check one or the other
of these cases when needed. So I was worried that adding yet another
type of location would make things worse.

But I see your point certainly. I feel like adding a new subclass will
require touching many more call sites, so not sure how it will look. I
guess I would be concerned about adding too many new conditional
branches. There are already very many, since almost every use of
line-maps API has to check for ad-hoc location first, etc. At some
point, if there are too many branches, it makes more sense to use
virtual functions instead and would perform better. I guess the
fundamental issue is that it's really a C-like API that has had C++
features added on to it over time, probably redesigning the API from
scratch would yield something cleaner. Given I wasn't proposing that
for now, I thought making the minimal possible change here would be
the way to go.

What do you think about making to_file a union and adjusting the
handful of places that would care? That could be a good improvement
that's in the right direction.

> Please can all those checks for LC_GEN go into an inline function so we
> can write e.g.
>   map->generated_p ()
> or somesuch.
>

Sure. I guess for consistency it has to look something like
LINEMAP_ORDINARY_GENERATED_P (map).

> If I reading things right, patch 6 adds the sole usage of this in
> destringize_and_run.  Would we ever want to discriminate between
> different kinds of generated buffers?
>

One other possible use case I had in mind was for builtin macros, e.g.
right now for something like

const char* line = __LINE__;

the diagnostic points just to the __LINE__ token. With an LC_GEN map
it could show the user that __LINE__ has expanded to an integer rather
than a string. Something like that. But anyway that was just an aside,
the way I was envisioning it, just one type of LC_GEN map is needed,
although I can see it might be nice to know further what it was made
for.

I could imagine eventually the static analyzer finding a use of it
also. For instance, you had a recent patch that asks libcpp to lex a
buffer containing a macro token, to get the expanded value. If a
diagnostic could be generated during that process for some reason,
then an LC_GEN map could be used to get a reasonable location for it.

> [...snip...]
>
> > @@ -796,10 +798,13 @@ diagnostic_report_current_module (diagnostic_context *context, location_t where)
> >                  N_("of module"),
> >                  N_("In module imported at"),   /* 6 */
> >                  N_("imported at"),
> > +                N_("In buffer generated from"),   /* 8 */
> >                 };
>
> We use the wording "destringized" in:
>
> so maybe this should be "In buffer destringized from" ???  (I'm not
> sure)
>

I was definitely not sure about what would be the best wording here
either. BTW clang has a similar concept, I think they call it "scratch
space". I didn't think to use destringized here, because it could be a
different type of generated data perhaps, although we could also check
specifically what it was for and output a custom string like you
suggest, I guess that's what your previous question was getting at
too.

> [...snip...]
>
> > diff --git a/gcc/input.cc b/gcc/input.cc
> > index 483cb6e940d..3cf5480551d 100644
> > --- a/gcc/input.cc
> > +++ b/gcc/input.cc
>
> [..snip...]
>
> > @@ -58,7 +64,7 @@ public:
> >    ~file_cache_slot ();
>
> My initial thought reading the input.cc part of this patch was that I
> want it to be very clear when a file_cache_slot is for a real file vs
> when we're replaying generated data.  I'd hoped that this could have
> been expressed via inheritance, but we preallocate all the cache slots
> once in an array in file_cache's ctor and the slots get reused over
> time.  So instead of that, can we please have some kind of:
>
>    bool file_slot_p () const;
>    bool generated_slot_p () const;
>
> or somesuch, so that we can have clear assertions and conditionals
> about the current state of a slot (I think the discriminating condition
> is that generated_data_len > 0, right?)
>
> If I'm reading things right, it looks like file_cache_slot::m_file_path
> does double duty after this patch, and is either a filename, or a
> pointer to the generated data.  If so, please can the patch rename it,
> and have all usage guarded appropriately.  Can it be a union? (or does
> the ctor prevent that?)
>

Yes, I did reuse it this way, I will make it better.

> [...snip...]
>
> > @@ -445,16 +461,23 @@ file_cache::evicted_cache_tab_entry (unsigned *highest_use_count)
> >     num_file_slots files are cached.  */
> >
> >  file_cache_slot*
> > -file_cache::add_file (const char *file_path)
> > +file_cache::add_file (const char *file_path, unsigned int generated_data_len)
>
> Can we split this into two functions: one for files, and one for
> generated data?  (add_file vs add_generated_data?)
>

> >  {
> >
> > -  FILE *fp = fopen (file_path, "r");
> > -  if (fp == NULL)
> > -    return NULL;
> > +  FILE *fp;
> > +  if (generated_data_len)
> > +    fp = NULL;
> > +  else
> > +    {
> > +      fp = fopen (file_path, "r");
> > +      if (fp == NULL)
> > +       return NULL;
> > +    }
> >
> >    unsigned highest_use_count = 0;
> >    file_cache_slot *r = evicted_cache_tab_entry (&highest_use_count);
> > -  if (!r->create (in_context, file_path, fp, highest_use_count))
> > +  if (!r->create (in_context, file_path, fp, highest_use_count,
> > +                 generated_data_len))
> >      return NULL;
> >    return r;
> >  }
>
> [...snip...]
>
> > @@ -535,11 +571,12 @@ file_cache::~file_cache ()
> >     it.  */
> >
> >  file_cache_slot*
> > -file_cache::lookup_or_add_file (const char *file_path)
> > +file_cache::lookup_or_add_file (const char *file_path,
> > +                               unsigned int generated_data_len)
>
> Likewise, could this be split into:
>   lookup_or_add_file
> and
>   lookup_or_add_generated
> or somesuch?
>
> >  {
> >    file_cache_slot *r = lookup_file (file_path);
>
> The patch doesn't seem to touch file_cache::lookup_file.  Is the
> current implementation of that ideal (it looks like we're going to be
> doing strcmp of generated buffers, when presumably for those we could
> simply be doing pointer comparisons).
>
> Maybe rename it to lookup_slot?
>

Yeah, so I guess there is also the potential for problems here, if a
generated data buffer were to have the same content as a filename.
Unlike the case of the line-maps API, there is not really as much
concern for optimizing the overhead here, since it only comes into
play when a diagnostic will be issued. So perhaps, I should take a
different approach along the lines of your initial suggestion, and
separate generated data into its own class entirely... I think that
will address most or all of your concerns here.

> >    if (r == NULL)
> > -    r = add_file (file_path);
> > +    r = add_file (file_path, generated_data_len);
> >    return r;
> >  }
> >
> > @@ -547,7 +584,8 @@ file_cache::lookup_or_add_file (const char *file_path)
> >     diagnostic.  */
> >
> >  file_cache_slot::file_cache_slot ()
> > -: m_use_count (0), m_file_path (NULL), m_fp (NULL), m_data (0),
> > +: m_use_count (0), m_file_path (NULL), m_fp (NULL),
> > +  m_data (0), m_data_active (0),
> >    m_alloc_offset (0), m_size (0), m_nb_read (0), m_line_start_idx (0),
> >    m_line_num (0), m_total_lines (0), m_missing_trailing_newline (true)
> >  {
>
> [...snip...]
>
>
>
> > diff --git a/libcpp/include/line-map.h b/libcpp/include/line-map.h
> > index 50207cacc12..eb281809cbd 100644
> > --- a/libcpp/include/line-map.h
> > +++ b/libcpp/include/line-map.h
> > @@ -75,6 +75,8 @@ enum lc_reason
> >    LC_RENAME_VERBATIM,  /* Likewise, but "" != stdin.  */
> >    LC_ENTER_MACRO,      /* Begin macro expansion.  */
> >    LC_MODULE,           /* A (C++) Module.  */
> > +  LC_GEN,              /* Internally generated source.  */
> > +
> >    /* FIXME: add support for stringize and paste.  */
> >    LC_HWM /* High Water Mark.  */
> >  };
> > @@ -437,7 +439,11 @@ struct GTY((tag ("1"))) line_map_ordinary : public line_map {
> >
> >    /* Pointer alignment boundary on both 32 and 64-bit systems.  */
> >
> > -  const char *to_file;
> > +  /* This GTY markup is in case this is an LC_GEN map, in which case
> > +     to_file actually points to the generated data, which we do not
> > +     want to require to be free of null bytes.  */
> > +  const char * GTY((string_length ("%h.to_file_len"))) to_file;
> > +  unsigned int to_file_len;
> >    linenum_type to_line;
>
> What's the intended interaction between this, the garbage-collector,
> and PCH?  Is to_file to be allocated in the GC-managed heap, or can it
> be outside of it?  Looking at patch 6 I see that this seems to be
> allocated (in destringize_and_run) by _cpp_unaligned_alloc.  I don't
> remember off the top of my head if that's valid.
>

As a small preparation for this patch, I recently did r13-3380, that
added support for GTY((string_length)). There was some discussion
here:
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/603885.html .
The short answer is, char* is a special case for GGC, unlike other
pointer types. It will silently ignore anything that wasn't actually
GGC-allocated during the marking phase. This makes it convenient to
use char* pointers in classes like this, where the GTY markup is only
there for PCH purposes (since the line map instance otherwise lives
for the whole process, other than some isolated selftests). The PCH
mechanism still arranges to get them into the PCH even when they were
not allocated by GGC. (The existing to_file that's a filename, for
instance, is allocated by libcpp and not by GGC already.) So the only
constraint on to_file is that is should outlive the usage of the
line-map.

> >
> >    /* Location from whence this line map was included.  For regular
> > @@ -1101,13 +1107,15 @@ extern line_map *line_map_new_raw (line_maps *, bool, unsigned);
> >     at least as long as the lifetime of SET.  An empty
> >     TO_FILE means standard input.  If reason is LC_LEAVE, and
> >     TO_FILE is NULL, then TO_FILE, TO_LINE and SYSP are given their
> > -   natural values considering the file we are returning to.
> > +   natural values considering the file we are returning to.  If reason
> > +   is LC_GEN, then TO_FILE is not a file name, but rather the actual
> > +   content, and TO_FILE_LEN>0 is the length of it.
> >
> >     A call to this function can relocate the previous set of
> >     maps, so any stored line_map pointers should not be used.  */
> >  extern const line_map *linemap_add
> >    (class line_maps *, enum lc_reason, unsigned int sysp,
> > -   const char *to_file, linenum_type to_line);
> > +   const char *to_file, linenum_type to_line, unsigned int to_file_len = 0);
> >
> >  /* Create a macro map.  A macro map encodes source locations of tokens
> >     that are part of a macro replacement-list, at a macro expansion
> > @@ -1304,7 +1312,8 @@ linemap_location_before_p (class line_maps *set,
> >
> >  typedef struct
> >  {
> > -  /* The name of the source file involved.  */
> > +  /* The name of the source file involved, or NULL if
> > +     generated_data is non-NULL.  */
> >    const char *file;
> >
> >    /* The line-location in the source file.  */
> > @@ -1316,6 +1325,10 @@ typedef struct
> >
> >    /* In a system header?. */
> >    bool sysp;
> > +
> > +  /* If generated data, the data and its length.  */
> > +  unsigned int generated_data_len;
> > +  const char *generated_data;
> >  } expanded_location;
>
> Probably worth noting that generated_data can contain NUL bytes, and
> isn't necessarily NUL-terminated.
>

Sure.

>
> Thanks again for the patch; hope this is constructive

Of course, thanks for going through it. I will work on improving the
way input.cc handles this and send a new one. For the line-maps part,
I'll try something like the union for to_file, unless you would prefer
one of the more comprehensive routes there.

-Lewis