tree decl stored during LGEN does not map to a symtab

public inbox for gcc@gcc.gnu.org

* tree decl stored during LGEN does not map to a symtab_node during WPA
@ 2021-07-07  9:27 Erick Ochoa
  2021-07-09  7:51 ` Erick Ochoa
  0 siblings, 1 reply; 20+ messages in thread
From: Erick Ochoa @ 2021-07-07  9:27 UTC (permalink / raw)
  To: gcc

Hi,

I am saving some tree declarations during LGEN that I will be later
analyzing at WPA time. I am able to read the decl from my summaries
and print it at WPA time. It corresponds to a global variable.
However, whenever I use symtab_node::get (decl) during WPA time I keep
getting NULL.

Does anyone know why that might be the case? Is it possible that other
optimizations are rewriting global variables during LGEN (or prior to
WPA)? The variable I am looking at is a static const char typeinfo
name for a class in the program I am analyzing. I don't think this is
an issue since other type info names have an associated symtab_node.

Thanks!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-07  9:27 tree decl stored during LGEN does not map to a symtab_node during WPA Erick Ochoa
@ 2021-07-09  7:51 ` Erick Ochoa
  2021-07-09  9:49   ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Erick Ochoa @ 2021-07-09  7:51 UTC (permalink / raw)
  To: gcc

Hi, I noticed this is also happening for local variables. Again,
storing tree declarations on a summary during LGEN and then at WPA
time reading from those summaries. I can print the declaration, but
when I try to look for its node in the symtab I get NULL as the return
value.

Any help is appreciated. Thanks!

On Wed, 7 Jul 2021 at 11:27, Erick Ochoa <eochoa@gcc.gnu.org> wrote:
>
> Hi,
>
> I am saving some tree declarations during LGEN that I will be later
> analyzing at WPA time. I am able to read the decl from my summaries
> and print it at WPA time. It corresponds to a global variable.
> However, whenever I use symtab_node::get (decl) during WPA time I keep
> getting NULL.
>
> Does anyone know why that might be the case? Is it possible that other
> optimizations are rewriting global variables during LGEN (or prior to
> WPA)? The variable I am looking at is a static const char typeinfo
> name for a class in the program I am analyzing. I don't think this is
> an issue since other type info names have an associated symtab_node.
>
> Thanks!


* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-09  7:51 ` Erick Ochoa
@ 2021-07-09  9:49   ` Richard Biener
  2021-07-12 10:55     ` Erick Ochoa
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2021-07-09  9:49 UTC (permalink / raw)
  To: Erick Ochoa, Jan Hubicka; +Cc: GCC Development

On Fri, Jul 9, 2021 at 9:52 AM Erick Ochoa via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hi, I noticed this is also happening for local variables. Again,
> storing tree declarations on a summary during LGEN and then at WPA
> time reading from those summaries. I can print the declaration, but
> when I try to look for its node in the symtab I get NULL as the return
> value.
>
> Any help is appreciated. Thanks!

I'm not too familiar with it but I think you're supposed to stream encoded
symtab references during LGEN/WPA, the decls are subject to symtab
merging and I'm not quite sure that happens when you read your
summaries.

Note locals do not have varpool nodes.

Richard.

> On Wed, 7 Jul 2021 at 11:27, Erick Ochoa <eochoa@gcc.gnu.org> wrote:
> >
> > Hi,
> >
> > I am saving some tree declarations during LGEN that I will be later
> > analyzing at WPA time. I am able to read the decl from my summaries
> > and print it at WPA time. It corresponds to a global variable.
> > However, whenever I use symtab_node::get (decl) during WPA time I keep
> > getting NULL.
> >
> > Does anyone know why that might be the case? Is it possible that other
> > optimizations are rewriting global variables during LGEN (or prior to
> > WPA)? The variable I am looking at is a static const char typeinfo
> > name for a class in the program I am analyzing. I don't think this is
> > an issue since other type info names have an associated symtab_node.
> >
> > Thanks!


* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-09  9:49   ` Richard Biener
@ 2021-07-12 10:55     ` Erick Ochoa
  2021-07-13  9:21       ` Erick Ochoa
  0 siblings, 1 reply; 20+ messages in thread
From: Erick Ochoa @ 2021-07-12 10:55 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Development

> I'm not too familiar with it but I think you're supposed to stream encoded
> symtab references during LGEN/WPA,

Thanks Richard, this happened to be the solution. I am now using
lto_symtab_encoder_t to encode the declarations during LGEN and decode
them during WPA.
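
As an illustration of the encode/decode idea only, here is a minimal
self-contained sketch. ToyNode and ToyEncoder are hypothetical stand-ins,
not GCC types; in GCC the corresponding calls would be
lto_symtab_encoder_encode when writing and lto_symtab_encoder_deref when
reading, against the encoder that belongs to the file being streamed.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for a symbol table entry; NOT GCC's symtab_node.
struct ToyNode {
  std::string asm_name;
};

// Toy per-file encoder: hands out dense indices to nodes on first use.
// It only models the role lto_symtab_encoder_t plays, not its code.
class ToyEncoder {
public:
  // Return the index for NODE, adding it to the table if needed.
  int encode(const ToyNode *node) {
    auto it = index_of_.find(node);
    if (it != index_of_.end())
      return it->second;
    int idx = static_cast<int>(nodes_.size());
    nodes_.push_back(node);
    index_of_.emplace(node, idx);
    return idx;
  }

  // Map an index read back from a summary to its node, or nullptr.
  const ToyNode *deref(int idx) const {
    if (idx < 0 || idx >= static_cast<int>(nodes_.size()))
      return nullptr;
    return nodes_[idx];
  }

private:
  std::vector<const ToyNode *> nodes_;
  std::unordered_map<const ToyNode *, int> index_of_;
};
```

The point of the sketch is that the summary stores only the integer; the
tree (or node) is looked up again on the reading side instead of being
streamed as part of the pass data.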

Are there any more limitations of using stream_write_tree that one
should be aware of? Now I am looking into storing trees of the type
STRING_CST and I think this might be causing me a problem at WPA time.
I think it segfaults at the moment of creating the process, but I
still need more time to investigate. Perhaps you might know if storing
STRING_CST trees has to be handled in a special way? Not sure if it
also has something to do with LTO file sections. The tree is used to
initialize a global static variable.

Thanks!


* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-12 10:55     ` Erick Ochoa
@ 2021-07-13  9:21       ` Erick Ochoa
  2021-07-13  9:41         ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Erick Ochoa @ 2021-07-13  9:21 UTC (permalink / raw)
  Cc: Richard Biener, Jan Hubicka, GCC Development

Hi,

Just to clarify a similar question: I am using stream_write_tree and
looking at the comments it says that it is assumed that the tree T is
already in the encoder cache. Does this mean that I have to use
lto_symtab_encoder_t for all trees I want to store in summaries? I
thought the encoder only works for trees which are stored on the
symbol table. Would this mean that the only trees that can be written
out to summaries are those that are declarations? Or are there any
other encoders?

I am trying to store SSA trees at LGEN and read them back during WPA.

Thanks! Any help is appreciated.

On Mon, 12 Jul 2021 at 12:55, Erick Ochoa <eochoa@gcc.gnu.org> wrote:
>
> > I'm not too familiar with it but I think you're supposed to stream encoded
> > symtab references during LGEN/WPA,
>
> Thanks Richard, this happened to be the solution. I am now using
> lto_symtab_encoder_t to encode the declarations during LGEN and decode
> them during WPA.
>
> Are there any more limitations of using stream_write_tree that one
> should be aware of? Now I am looking into storing trees of the type
> STRING_CST and I think this might be causing me a problem at WPA time.
> I think it segfaults at the moment of creating the process, but I
> still need more time to investigate. Perhaps you might know if storing
> STRING_CST trees has to be handled in a special way? Not sure if it
> also has something to do with LTO file sections. The tree is used to
> initialize a global static variable.
>
> Thanks!


* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-13  9:21       ` Erick Ochoa
@ 2021-07-13  9:41         ` Richard Biener
  2021-07-13 10:49           ` Erick Ochoa
  2021-07-13 11:56           ` Erick Ochoa
  0 siblings, 2 replies; 20+ messages in thread
From: Richard Biener @ 2021-07-13  9:41 UTC (permalink / raw)
  To: Erick Ochoa; +Cc: Jan Hubicka, GCC Development

On Tue, Jul 13, 2021 at 11:21 AM Erick Ochoa <eochoa@gcc.gnu.org> wrote:
>
> Hi,
>
> Just to clarify a similar question: I am using stream_write_tree and
> looking at the comments it says that it is assumed that the tree T is
> already in the encoder cache. Does this mean that I have to use
> lto_symtab_encoder_t for all trees I want to store in summaries? I
> thought the encoder only works for trees which are stored on the
> symbol table. Would this mean that the only trees that can be written
> out to summaries are those that are declarations? Or are there any
> other encoders?
>
> I am trying to store SSA trees at LGEN and read them back during WPA.

There are entities, like SSA names and STRING_CSTs which are specially
encoded and if you stream those in your LGEN data you have to set up
appropriate encoders.  In general streaming arbitrary trees isn't the
best thing to do, usually you're interested in specific pieces only.  That's
especially true for things "local" to a function (like SSA names), I think
existing IPA passes only stream encoded references to global entities
(or parameters) and all "local" info is some pass specific data streamed
as raw data, not trees.
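
To make "pass specific data streamed as raw data" concrete, the sketch
below writes a small record as plain unsigned integers. Everything in it
is an assumption for illustration: SummaryRecord and its fields are
hypothetical, and the variable-length integer format (ULEB128) is a
generic choice, not a claim about GCC's on-disk layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append VALUE as a ULEB128 variable-length integer (7 bits per byte).
inline void write_uleb128(std::vector<std::uint8_t> &out, std::uint64_t value) {
  do {
    std::uint8_t byte = value & 0x7f;
    value >>= 7;
    if (value != 0)
      byte |= 0x80;  // more bytes follow
    out.push_back(byte);
  } while (value != 0);
}

// Read one ULEB128 integer starting at POS and advance POS past it.
inline std::uint64_t read_uleb128(const std::vector<std::uint8_t> &in,
                                  std::size_t &pos) {
  std::uint64_t result = 0;
  unsigned shift = 0;
  while (pos < in.size() && shift < 64) {
    std::uint8_t byte = in[pos++];
    result |= static_cast<std::uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0)
      break;
    shift += 7;
  }
  return result;
}

// Hypothetical pass record: integers only, no trees.
struct SummaryRecord {
  std::uint64_t fn_index;      // encoded reference to the function
  std::uint64_t ssa_version;   // SSA version inside that function
  std::uint64_t target_index;  // encoded reference to a global entity
};

inline void write_record(std::vector<std::uint8_t> &out, const SummaryRecord &r) {
  write_uleb128(out, r.fn_index);
  write_uleb128(out, r.ssa_version);
  write_uleb128(out, r.target_index);
}

inline SummaryRecord read_record(const std::vector<std::uint8_t> &in,
                                 std::size_t &pos) {
  SummaryRecord r{};
  r.fn_index = read_uleb128(in, pos);
  r.ssa_version = read_uleb128(in, pos);
  r.target_index = read_uleb128(in, pos);
  return r;
}
```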

Why do you want to stream SSA trees for example?  There can be no
references to those from other IPA entities?

> Thanks! Any help is appreciated.
>
> On Mon, 12 Jul 2021 at 12:55, Erick Ochoa <eochoa@gcc.gnu.org> wrote:
> >
> > > I'm not too familiar with it but I think you're supposed to stream encoded
> > > symtab references during LGEN/WPA,
> >
> > Thanks Richard, this happened to be the solution. I am now using
> > lto_symtab_encoder_t to encode the declarations during LGEN and decode
> > them during WPA.
> >
> > Are there any more limitations of using stream_write_tree that one
> > should be aware of? Now I am looking into storing trees of the type
> > STRING_CST and I think this might be causing me a problem at WPA time.
> > I think it segfaults at the moment of creating the process, but I
> > still need more time to investigate. Perhaps you might know if storing
> > STRING_CST trees has to be handled in a special way? Not sure if it
> > also has something to do with LTO file sections. The tree is used to
> > initialize a global static variable.
> >
> > Thanks!


* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-13  9:41         ` Richard Biener
@ 2021-07-13 10:49           ` Erick Ochoa
  2021-07-13 12:55             ` Richard Biener
  2021-07-13 11:56           ` Erick Ochoa
  1 sibling, 1 reply; 20+ messages in thread
From: Erick Ochoa @ 2021-07-13 10:49 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Development

> There are entities, like SSA names and STRING_CSTs which are specially
> encoded and if you stream those in your LGEN data you have to set up
> appropriate encoders.  In general streaming arbitrary trees isn't the
> best thing to do, usually you're interested in specific pieces only.  That's
> especially true for things "local" to a function (like SSA names), I think
> existing IPA passes only stream encoded references to global entities
> (or parameters) and all "local" info is some pass specific data streamed
> as raw data, not trees.

Thanks!

>
> Why do you want to stream SSA trees for example?
>

I am working on a prototype for a points-to analysis where the
implementation is an IPA_PASS as opposed to a SIMPLE_IPA_PASS. It is
still somewhat early in the prototype implementation stages but I
would like to have SSA trees at WPA time as a way to map constraint
variables back to Gimple. The idea is to generate most constraints at
LGEN time (some will have to be updated) and solve them at WPA time.

Do you have any other suggestions on how to map constraint variables
(similar to the index in the varinfo_t array) back to Gimple (but now
as an IPA_PASS)? We have had similar discussions before, but now I am
looking at things more concretely, from the point of view of: how
exactly does one map the analysis back to Gimple while staying within
what is possible in the LTO framework?

I do have a pass-specific data structure, but it contains references
to trees. It is similar to varinfo_t, which has a decl. I also have
another tree for expressions (the concrete use case right now is
having a global variable being assigned to a string literal), and a
field for ssa variables, which was intended to store a reference to
the corresponding tree. Again, as a way to map back the constraint
variable to Gimple.

> There can be no
> references to those from other IPA entities?

I don't follow this question. I think this may be a statement and not
a question. Please clarify this for me, but perhaps the following
might answer: Since I am working on an interprocedural analysis, SSA
variables may point to abstract memory locations allocated in
different functions. So I need to have a link between SSA variables
and whatever memory locations they might point to.


* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-13  9:41         ` Richard Biener
  2021-07-13 10:49           ` Erick Ochoa
@ 2021-07-13 11:56           ` Erick Ochoa
  1 sibling, 0 replies; 20+ messages in thread
From: Erick Ochoa @ 2021-07-13 11:56 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Development

On Tue, 13 Jul 2021 at 11:41, Richard Biener <richard.guenther@gmail.com> wrote:

> There are entities, like SSA names and STRING_CSTs which are specially
> encoded and if you stream those in your LGEN data you have to set up
> appropriate encoders.

I forgot to ask, is there an example of these appropriate encoders
being used somewhere? I only see references to lto_symtab_encoder_t.

Thanks!


* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-13 10:49           ` Erick Ochoa
@ 2021-07-13 12:55             ` Richard Biener
  2021-07-14 13:56               ` Erick Ochoa
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2021-07-13 12:55 UTC (permalink / raw)
  To: Erick Ochoa; +Cc: Jan Hubicka, GCC Development

On Tue, Jul 13, 2021 at 12:50 PM Erick Ochoa <eochoa@gcc.gnu.org> wrote:
>
> > There are entities, like SSA names and STRING_CSTs which are specially
> > encoded and if you stream those in your LGEN data you have to set up
> > appropriate encoders.  In general streaming arbitrary trees isn't the
> > best thing to do, usually you're interested in specific pieces only.  That's
> > especially true for things "local" to a function (like SSA names), I think
> > existing IPA passes only stream encoded references to global entities
> > (or parameters) and all "local" info is some pass specific data streamed
> > as raw data, not trees.
>
> Thanks!
>
> >
> > Why do you want to stream SSA trees for example?
> >
>
> I am working on a prototype for a points-to analysis where the
> implementation is an IPA_PASS as opposed to a SIMPLE_IPA_PASS. It is
> still somewhat early in the prototype implementation stages but I
> would like to have SSA trees at WPA time as a way to map constraint
> variables back to Gimple. The idea is to generate most constraints at
> LGEN time (some will have to be updated) and solve them at WPA time.
>
> Do you have any other suggestions on how to map constraint variables
> (similar to the index in the varinfo_t array) back to Gimple (but now
> as an IPA_PASS)? We have had similar discussions before, but now I am
> looking at things more concretely, from the point of view of: how
> exactly does one map the analysis back to Gimple while staying within
> what is possible in the LTO framework?

I guess the way to encode SSA trees would be to use sth like a
<function-encoder>, SSA-version tuple much like PTA internally
uses the varinfo array index as identifier for the variables in the
constraints.  For local decls (as opposed to SSA names) it's a bit
more difficult - you'd have to devise your own encoding here.

What you can rely on I think is that for local variables UID relations
are preserved, so you could sort cfun->local_decls and use the
position in this array as encoding (in fact I see local_decls is
streamed literally, so you don't even need to sort that for the
start - but we could likely do that without harm to make searching
for a UID O(log n)).
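
A minimal sketch of both encodings, under stated assumptions: SsaRef and
local_decl_position are hypothetical names, functions are identified by a
plain integer standing in for an encoder index, and local declarations
are represented only by their UIDs, already sorted in ascending order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical reference to an SSA name: the function it lives in
// (as an encoder index) plus its SSA version within that function.
struct SsaRef {
  int fn_index;
  unsigned ssa_version;
};

inline bool operator==(const SsaRef &a, const SsaRef &b) {
  return a.fn_index == b.fn_index && a.ssa_version == b.ssa_version;
}

// Given the UIDs of a function's local decls sorted in ascending order,
// return the position of UID, or -1 if it is absent.  The binary search
// is what makes the lookup O(log n).
inline int local_decl_position(const std::vector<unsigned> &sorted_uids,
                               unsigned uid) {
  auto it = std::lower_bound(sorted_uids.begin(), sorted_uids.end(), uid);
  if (it == sorted_uids.end() || *it != uid)
    return -1;
  return static_cast<int>(it - sorted_uids.begin());
}
```

The position returned for a local decl is only meaningful together with
the function reference, in the same way the SSA version is.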

> I do have a pass-specific data structure, but it contains references
> to trees. It is similar to varinfo_t, which has a decl. I also have
> another tree for expressions (the concrete use case right now is
> having a global variable being assigned to a string literal), and a
> field for ssa variables, which was intended to store a reference to
> the corresponding tree. Again, as a way to map back the constraint
> variable to Gimple.
>
> > There can be no
> > references to those from other IPA entities?
>
> I don't follow this question. I think this may be a statement and not
> a question. Please clarify this for me, but perhaps the following
> might answer: Since I am working on an interprocedural analysis, SSA
> variables may point to abstract memory locations allocated in
> different functions. So I need to have a link between SSA variables
> and whatever memory locations they might point to.

So as said above - you need to devise your own "encoder" then
and at WPA time hopefully not need to look at the actual trees
but can treat them as opaque entities, only mapping back the
result at LTRANS time (and only for the interesting functions / globals).

Richard.


* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-13 12:55             ` Richard Biener
@ 2021-07-14 13:56               ` Erick Ochoa
  2021-07-15  7:23                 ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Erick Ochoa @ 2021-07-14 13:56 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Development

> I guess the way to encode SSA trees would be to use sth like a
> <function-encoder>, SSA-version tuple much like PTA internally
> uses the varinfo array index as identifier for the variables in the
> constraints.  For local decls (as opposed to SSA names) it's a bit
> more difficult - you'd have to devise your own encoding here.
>
> What you can rely on I think is that for local variables UID relations
> are preserved, so you could sort cfun->local_decls and use the
> position in this array as encoding (in fact I see local_decls is
> streamed literally, so you don't even need to sort that for the
> start - but we could likely do that without harm to make searching
> for a UID O(log n)).

At the moment I am generating a unique id for each constraint variable
generated. I have assigned a unique LGEN number to each variable and
during WPA I have merged duplicates. The duplication of equivalent
gimple variables in distinct LGEN partitions happens for global
variables (as we have discussed before). Do you know if there are
other cases of duplication that might happen? For example, could a
single function be analyzed in different LGEN partitions?

I followed your example here and I am "encoding" the constraint
variables that relate to SSA variables by looking at the cgraph_node
and the SSA-version. The tree is not stored but at WPA we know the
SSA-version and the cgraph_node and I think this is enough to relate
back to the SSA variable in the gimple source.

You mention that I need to devise my own "encoder", but I am not sure
if we are conflating two notions:

1. encoding tree variables to constraint variables (i.e., a mapping of
some tuple (cgraph_node x symtab_node x ssa-version) to an integer
that represents the constraint variable)
2. encoding as an implementation of a data structure used during LTO
to stream in and stream out trees/symbols to and from partitions.
(e.g., lto_symtab_encoder_t).

So to be clear, when you say I need to devise my own "encoder" you are
referring to definition number 1, not definition number 2, right? And
at LTRANS using the relation (cgraph_node x symtab_node x ssa-version)
x constraint-variable-id one should be able to map to the interesting
pointer/pointee from the constraint variable id.
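
A small sketch of the first notion (mapping a tuple to a constraint
variable id and back), assuming the function is already identified by an
integer. SsaKey and ConstraintVarTable are hypothetical names used only
for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Key: (function index, SSA version).  Both are plain integers here.
using SsaKey = std::pair<int, unsigned>;

// Hypothetical table assigning one constraint variable id per key and
// remembering the key so the id can be mapped back later.
class ConstraintVarTable {
public:
  // Return the id for KEY, allocating the next free id on first use.
  int id_for(const SsaKey &key) {
    auto it = ids_.find(key);
    if (it != ids_.end())
      return it->second;
    int id = static_cast<int>(keys_.size());
    keys_.push_back(key);
    ids_.emplace(key, id);
    return id;
  }

  // Reverse lookup; ID must have been returned by id_for.
  const SsaKey &key_of(int id) const {
    return keys_.at(static_cast<std::size_t>(id));
  }

private:
  std::map<SsaKey, int> ids_;
  std::vector<SsaKey> keys_;
};
```

The reverse lookup is the part that would matter at LTRANS time, when a
solved constraint variable has to be related to a function and an SSA
version again.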

I am thinking a little bit ahead, but I will need a way to relate
memory allocation sites (e.g., malloc's) to some constraint variable
and perhaps generalize this to expressions (I would like to say that a
variable is pointing to a STRING_CST for example). Do you have an idea
on how to go and encode using the first definition of encoding tree
expressions? I have seen some papers that use instruction-id's
(potentially an integer that corresponds as a unique identifier for
the instruction) but I am unsure if there is something similar to this
in GCC. If what you meant is the second definition, can someone
elaborate on the precise steps for making my own encoder? While I am
somewhat familiar with using the LTO framework I am unfamiliar with
potentially extending it in these sorts of ways.

Thanks! Any help is appreciated.


* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-14 13:56               ` Erick Ochoa
@ 2021-07-15  7:23                 ` Richard Biener
  2021-07-21 16:55                   ` Erick Ochoa
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2021-07-15  7:23 UTC (permalink / raw)
  To: Erick Ochoa; +Cc: Jan Hubicka, GCC Development

On Wed, Jul 14, 2021 at 3:56 PM Erick Ochoa <eochoa@gcc.gnu.org> wrote:
>
> > I guess the way to encode SSA trees would be to use sth like a
> > <function-encoder>, SSA-version tuple much like PTA internally
> > uses the varinfo array index as identifier for the variables in the
> > constraints.  For local decls (as opposed to SSA names) it's a bit
> > more difficult - you'd have to devise your own encoding here.
> >
> > What you can rely on I think is that for local variables UID relations
> > are preserved, so you could sort cfun->local_decls and use the
> > position in this array as encoding (in fact I see local_decls is
> > streamed literally, so you don't even need to sort that for the
> > start - but we could likely do that without harm to make searching
> > for a UID O(log n)).
>
> At the moment I am generating a unique id for each constraint variable
> generated. I have assigned a unique LGEN number to each variable and
> during WPA I have merged duplicates. The duplication of equivalent
> gimple variables in distinct LGEN partitions happens for global
> variables (as we have discussed before). Do you know if there are
> other cases of duplication that might happen? For example, could a
> single function be analyzed in different LGEN partitions?

A single source representation of inline functions and template
instantiations can be analyzed in different LGEN partitions, yes.
Those are merged as well.

> I followed your example here and I am "encoding" the constraint
> variables that relate to SSA variables by looking at the cgraph_node
> and the SSA-version. The tree is not stored but at WPA we know the
> SSA-version and the cgraph_node and I think this is enough to relate
> back to the SSA variable in the gimple source.

Yes, I think so.

> You mention that I need to devise my own "encoder", but I am not sure
> if we are conflating two notions:
>
> 1. encoding tree variables to constraint variables (i.e., a mapping of
> some tuple (cgraph_node x symtab_node x ssa-version) to an integer
> that represents the constraint variable)
> 2. encoding as an implementation of a data structure used during LTO
> to stream in and stream out trees/symbols to and from partitions.
> (e.g., lto_symtab_encoder_t).

I meant 1) and streaming using the LTO cgraph encoder for the cgraph
part and simply using the SSA version for the second part.

> So to be clear, when you say I need to devise my own "encoder" you are
> referring to definition number 1, not definition number 2, right? And
> at LTRANS using the relation (cgraph_node x symtab_node x ssa-version)
> x constraint-variable-id one should be able to map to the interesting
> pointer/pointee from the constraint variable id.
>
> I am thinking a little bit ahead, but I will need a way to relate
> memory allocation sites (e.g., malloc's) to some constraint variable
> and perhaps generalize this to expressions (I would like to say that a
> variable is pointing to a STRING_CST for example). Do you have an idea
> on how to go and encode using the first definition of encoding tree
> expressions?

The easiest is probably to hook it up to things you already encode,
like for malloc it would be the SSA def of the resulting pointer.

We do preserve the order of stmts in basic-blocks and basic-block
indices, we also use stmt UIDs (but re-number them at streaming
time, so you can't directly use them), so using a "stmt number" would
be possible as well.  The LTO streaming uses this to map back
the callgraph edge -> gimple stmt reference
(lto-streamer-in.c:fixup_call_stmt_edges), the code also calls
execute_all_ipa_stmt_fixups which presumably is more "generic"
code - I'm not familiar with it but you can dig whether it would suit
your needs.  It's not that the generic LTO / IPA mechanisms cannot
be extended.
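
To illustrate the "stmt number" idea, here is a toy model that numbers
statements by basic-block order and position within the block. It is a
sketch under assumptions: ToyFunction, StmtRef, stmt_number and stmt_at
are hypothetical, statements are plain strings, and nothing here is
taken from fixup_call_stmt_edges.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy function body: a list of basic blocks, each a list of statements.
using ToyFunction = std::vector<std::vector<std::string>>;

// Hypothetical statement reference: basic-block index plus position.
struct StmtRef {
  std::size_t bb_index;
  std::size_t pos;
};

// Flatten a StmtRef into a single statement number by counting the
// statements of all earlier blocks, in block order.
inline std::size_t stmt_number(const ToyFunction &fn, StmtRef ref) {
  std::size_t n = 0;
  for (std::size_t bb = 0; bb < ref.bb_index && bb < fn.size(); ++bb)
    n += fn[bb].size();
  return n + ref.pos;
}

// Inverse mapping: walk the blocks in the same order and return the
// statement with that number, or nullptr if the number is out of range.
inline const std::string *stmt_at(const ToyFunction &fn, std::size_t number) {
  for (const auto &bb : fn) {
    if (number < bb.size())
      return &bb[number];
    number -= bb.size();
  }
  return nullptr;
}
```

The numbering is stable only as long as block order and statement order
are the same on both sides, which is the property described above.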

> I have seen some papers that use instruction-id's
> (potentially an integer that corresponds as a unique identifier for
> the instruction) but I am unsure if there is something similar to this
> in GCC. If what you meant is the second definition, can someone
> elaborate on the precise steps for making my own encoder? While I am
> somewhat familiar with using the LTO framework I am unfamiliar with
> potentially extending it in these sorts of ways.
>
> Thanks! Any help is appreciated.


* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-15  7:23                 ` Richard Biener
@ 2021-07-21 16:55                   ` Erick Ochoa
  2021-07-22 11:40                     ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Erick Ochoa @ 2021-07-21 16:55 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Development

Hello Richard, I need a little bit more help. In our previous messages
you mentioned "<function-encoder>"

> >
> > > I guess the way to encode SSA trees would be to use sth like a
> > > <function-encoder>, SSA-version tuple much like PTA internally
> > > uses the varinfo array index as identifier for the variables in the
> > > constraints.  For local decls (as opposed to SSA names) it's a bit
> > > more difficult - you'd have to devise your own encoding here.
> > >

There was a little confusion on my part about what "encoder" meant

>
> > You mention that I need to devise my own "encoder", but I am not sure
> > if we are conflating two notions:
> >
> > 1. encoding tree variables to constraint variables (i.e., a mapping of
> > some tuple (cgraph_node x symtab_node x ssa-version) to an integer
> > that represents the constraint variable)
> > 2. encoding as an implementation of a data structure used during LTO
> > to stream in and stream out trees/symbols to and from partitions.
> > (e.g., lto_symtab_encoder_t).
>

And you proceeded with the following answer

> I meant 1) and streaming using the LTO cgraph encoder for the cgraph
> part and simply using the SSA version for the second part.
>

From this exchange I understood that I should develop my own mapping
for ssa variables and local declarations. However, when dealing with
encoding a node which is available in the symbol table, I could use
the function lto_symtab_encoder_encode to map symbols to an integer
which would later make the symbol available at WPA time. Implicitly,
for me, this meant that this integer is the same for a given symbol
across different LGEN partitions. For example, if we encode symbol
X from partition Y and we get the number N, then encoding symbol X in
partition Z should also yield N.

I believe this is not the case. During WPA time I am printing:
1. pid of lgen process that generated the encoding
2. index returned by lto_symtab_encoder_encode
3. varpool_node->name ()
4. the pointer address of the varpool node

I think we had a previous discussion where it was mentioned that the
only way to distinguish between these cases is to look at the varpool_node
or cgraph_node:

(From a different email edited for brevity)

On Wed, 30 Jun 2021 at 19:38, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On June 30, 2021 6:28:29 PM GMT+02:00, Erick Ochoa <eochoa@gcc.gnu.org> wrote:
> >So how would I be able to say that two different declarations are the
> >same variable?
>
> By looking at the associated varpool node.
>
> Richard.
>

If this is the case, I can indeed get the varpool nodes at WPA time
(as shown above), but their pointer addresses are distinct.
How can one find out that two varpool nodes/cgraph nodes are
the same at WPA time? Is just looking at the assembler name enough? I
of course want this to be safe.

Another question, how is this currently handled in other IPA passes?
Alternatively, do you have suggestions for encoding functions and
global variables in a similar way to how you suggested encoding ssa
variables and local declarations?


* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-21 16:55                   ` Erick Ochoa
@ 2021-07-22 11:40                     ` Richard Biener
  2021-07-22 12:04                       ` Erick Ochoa
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2021-07-22 11:40 UTC (permalink / raw)
  To: Erick Ochoa; +Cc: Jan Hubicka, GCC Development

On Wed, Jul 21, 2021 at 6:55 PM Erick Ochoa <eochoa@gcc.gnu.org> wrote:
>
> Hello Richard, I need a little bit more help. In our previous messages
> you mentioned "<function-encoder>"
>
> > >
> > > > I guess the way to encode SSA trees would be to use sth like a
> > > > <function-encoder>, SSA-version tuple much like PTA internally
> > > > uses the varinfo array index as identifier for the variables in the
> > > > constraints.  For local decls (as opposed to SSA names) it's a bit
> > > > more difficult - you'd have to devise your own encoding here.
> > > >
>
> There was a little confusion on my part about what "encoder" meant
>
> >
> > > You mention that I need to devise my own "encoder", but I am not sure
> > > if we are conflating two notions:
> > >
> > > 1. encoding tree variables to constraint variables (i.e., a mapping of
> > > some tuple (cgraph_node x symtab_node x ssa-version) to an integer
> > > that represents the constraint variable)
> > > 2. encoding as an implementation of a data structure used during LTO
> > > to stream in and stream out trees/symbols to and from partitions.
> > > (e.g., lto_symtab_encoder_t).
> >
>
> And you proceeded with the following answer
>
> > I meant 1) and streaming using the LTO cgraph encoder for the cgraph
> > part and simply using the SSA version for the second part.
> >
>
> From this exchange I understood that I should develop my own mapping
> for ssa variables and local declarations. However, when dealing with
> encoding a node which is available in the symbol table, I could use
> the function lto_symtab_encoder_encode to map symbols to an integer
> which would later make the symbol available at WPA time. Implicitly,
> for me, this meant that this integer is the same for a given symbol
> across different LGEN partitions. For example, if we encode symbol
> X from partition Y and we get the number N, then encoding symbol X in
> partition Z should also yield N.
>
> I believe this is not the case. During WPA time I am printing:
> 1. pid of lgen process that generated the encoding
> 2. index returned by lto_symtab_encoder_encode
> 3. varpool_node->name ()
> 4. the pointer address of the varpool node
>
> I think we had a previous discussion where it was mentioned that the
> only way to distinguish between these cases is to look at varpool_node
> cgraph_node:
>
> (From a different email edited for brevity)
>
> On Wed, 30 Jun 2021 at 19:38, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On June 30, 2021 6:28:29 PM GMT+02:00, Erick Ochoa <eochoa@gcc.gnu.org> wrote:
> > >So how would I be able to say that two different declarations are the
> > >same variable?
> >
> > By looking at the associated varpool node.
> >
> > Richard.
> >
>
> If this is the case, I can indeed get the varpool node's at WPA time
> (as shown above), but comparing their pointer addresses will be
> distinct. How can one find out that two varpool nodes/cgraph nodes are
> the same at WPA time? Is just looking at the assembler name enough? I
> of course want this to be safe.

If they are the same they are merged by the symtab merging process done
at WPA time.

> Another question, how is this currently handled in other IPA passes?
> Alternatively, do you have suggestions for encoding functions and
> global variables in a similar way to how you suggested encoding ssa
> variables and local declarations?

I don't think any other pass has to encode SSA vars because those are
strictly function local.  They only handle IP invariants aka addresses of
globals or so.

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-22 11:40                     ` Richard Biener
@ 2021-07-22 12:04                       ` Erick Ochoa
  2021-07-22 12:08                         ` Erick Ochoa
  2021-07-22 12:23                         ` Richard Biener
  0 siblings, 2 replies; 20+ messages in thread
From: Erick Ochoa @ 2021-07-22 12:04 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Development

> > If this is the case, I can indeed get the varpool node's at WPA time
> > (as shown above), but comparing their pointer addresses will be
> > distinct. How can one find out that two varpool nodes/cgraph nodes are
> > the same at WPA time? Is just looking at the assembler name enough? I
> > of course want this to be safe.
>
> If they are the same they are merged by the symtab merging process done
> at WPA time.

I trust you, but how would I verify it myself? For example, I
mentioned that I printed:

> 1. pid of lgen process that generated the encoding
> 2. index returned by lto_symtab_encoder_encode
> 3. varpool_node->name ()
> 4. the pointer address being pointed by varpool node

and I got the same "name" but different indices and different varpool
node addresses for each of the LGEN partitions, even though there is
only one global variable with that name. In other words, what exactly
does it mean that they are "merged", and in which cases would they not
get merged?

What I'm seeing is for example:

fopen $PID1 8 $ADDR1
fopen $PID2 7 $ADDR2

where $PID1 != $PID2 (expected, since the symbol was seen in two LGEN
partitions). It was encoded as "8" in one partition and as "7" in the
other. When reading the encodings back and printing the address of the
decoded cgraph_node, $ADDR1 != $ADDR2.

I previously thought that "merged" implied $ADDR1 == $ADDR2, but this
print shows that not to be the case. I also thought that merging
implied the same encoder number, but the different encodings show that
not to be the case either. What I essentially want is the following:

fopen $PID1 $ID
fopen $PID2 $ID

Where $ID can either be derived at WPA time or is encoded during LGEN.

I'm wondering if I should "merge" these two constraint variables by
checking their asm_name? I was just about to run an experiment to see
different cases for this (i.e., same named static function and print
their assembler name, etc.). This would be an example of $ID being
derived at WPA time by doing an O(n^2) comparison in the number of
functions, partitioning them into equivalent functions and then
assigning each of the distinct partitions an incremental ID.
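
A minimal sketch of that derivation, assuming (hypothetically) that each
record read at WPA time carries the assembler name as a string. A hash
map keyed by the name replaces the O(n^2) pairwise comparison. The names
lgen_record and assign_wpa_ids are made up for illustration and are not
GCC APIs, and file-local statics that share a name across units would
need extra care before this could be called safe.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record read from one LGEN summary: the partition that wrote
// it, the partition-local encoder index, and the symbol's assembler name.
struct lgen_record
{
  unsigned long pid;
  unsigned long local_id;
  std::string asm_name;
};

// Assign one dense WPA-level id per distinct assembler name.
// The returned vector is parallel to RECORDS.
std::vector<unsigned>
assign_wpa_ids (const std::vector<lgen_record> &records)
{
  std::unordered_map<std::string, unsigned> id_of_name;
  std::vector<unsigned> ids;
  ids.reserve (records.size ());
  for (const lgen_record &r : records)
    {
      // emplace leaves an existing entry untouched, so the first record
      // seen for a name fixes the id for all later records.
      unsigned next = static_cast<unsigned> (id_of_name.size ());
      auto it = id_of_name.emplace (r.asm_name, next).first;
      ids.push_back (it->second);
    }
  return ids;
}
```

With the two fopen records above, both entries would receive the same
$ID regardless of their partition-local indices 8 and 7.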

>
> > Another question, how is this currently handled in other IPA passes?
> > Alternatively, do you have suggestions for encoding functions and
> > global variables in a similar way to how you suggested encoding ssa
> > variables and local declarations?
>
> I don't think any other pass has to encode SSA vars because those are
> strictly function local.  They only handle IP invariants aka addresses of
> globals or so.
>

For SSA vars I followed your suggestion of having a function encoding
and then using the SSA_VERSION number. Essentially <function encoding>
x ssa version. However, since varpool_nodes and cgraph_nodes are both
encoded in the same way (the part I am having problems with), the issue
is with the function encoding.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-22 12:04                       ` Erick Ochoa
@ 2021-07-22 12:08                         ` Erick Ochoa
  2021-07-22 12:23                         ` Richard Biener
  1 sibling, 0 replies; 20+ messages in thread
From: Erick Ochoa @ 2021-07-22 12:08 UTC (permalink / raw)
  Cc: Richard Biener, Jan Hubicka, GCC Development

>
> fopen $PID1 8 $ADDR1
> fopen $PID2 7 $ADDR2
>

Just to clarify a bit further. $PID is generated and stored during
LGEN. The encoding is obviously generated during LGEN.
These are read during WPA. And the encoding is decoded and dyn_casted
into a cgraph_node at WPA time.
All these are printed during WPA.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-22 12:04                       ` Erick Ochoa
  2021-07-22 12:08                         ` Erick Ochoa
@ 2021-07-22 12:23                         ` Richard Biener
  2021-07-22 12:33                           ` Erick Ochoa
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Biener @ 2021-07-22 12:23 UTC (permalink / raw)
  To: Erick Ochoa; +Cc: Jan Hubicka, GCC Development

On Thu, Jul 22, 2021 at 2:04 PM Erick Ochoa <eochoa@gcc.gnu.org> wrote:
>
> > > If this is the case, I can indeed get the varpool node's at WPA time
> > > (as shown above), but comparing their pointer addresses will be
> > > distinct. How can one find out that two varpool nodes/cgraph nodes are
> > > the same at WPA time? Is just looking at the assembler name enough? I
> > > of course want this to be safe.
> >
> > If they are the same they are merged by the symtab merging process done
> > at WPA time.
>
> I trust you, but how would I verify it myself? For example, I
> mentioned that I printed:
>
> > 1. pid of lgen process that generated the encoding
> > 2. index returned by lto_symtab_encoder_encode
> > 3. varpool_node->name ()
> > 4. the pointer address being pointed by varpool node

Well, yes, during LGEN no WPA has run.  Do you mean LTRANS after WPA?
Sure, the encoder numbers do not have to match up between different
LTRANS units but then they don't speak to each other so that shouldn't
matter, no?

> and I got the same "name" but different indices, and different varpool
> node's addresses for each of the lgen partitions. And of course
> there's only one global variable with that name. In other words, what
> exactly does it mean that they are "merged" and in which cases would
> they not get merged?
>
> What I'm seeing is for example:
>
> fopen $PID1 8 $ADDR1
> fopen $PID2 7 $ADDR2
>
> where $PID1 != $PID2 (expected since it was seen in two LGEN
> partitions). They were encoded as "8" and "7" in each of their LGEN
> partitions. And when reading them and printing the address of
> cgraph_node $ADDR1 != $ADDR2.
>
> So, previously when I thought that merged implied that $ADDR1 ==
> $ADDR2 this print shows that not to be the case. Also when I thought
> that merging implied they would have the same number, the different
> encoding showed that not to be the case. What I essentially want is
> the following:
>
> fopen $PID1 $ID
> fopen $PID2 $ID
>
> Where $ID can either be derived at WPA time or is encoded during LGEN.
>
> I'm wondering if I should "merge" these two constraint variables by
> checking their asm_name? I was just about to run an experiment to see
> different cases for this (i.e., same named static function and print
> their assembler name, etc.). This would be an example of $ID being
> derived at WPA time by doing an O(n^2) comparison in the number of
> functions, partitioning them into equivalent functions and then
> assigning each of the distinct partitions an incremental ID.

I _think_ that it should work if you stream at LGEN constraint variables
as their varpool node (using the varpool encoder), get the nodes merged
at WPA time, and thus your constraints from different LGEN runs "merged"
properly, then stream them the same way to the LTRANS units?

>
> >
> > > Another question, how is this currently handled in other IPA passes?
> > > Alternatively, do you have suggestions for encoding functions and
> > > global variables in a similar way to how you suggested encoding ssa
> > > variables and local declarations?
> >
> > I don't think any other pass has to encode SSA vars because those are
> > strictly function local.  They only handle IP invariants aka addresses of
> > globals or so.
> >
>
> For SSA vars I followed your suggestion of having a function encoding
> and then using the SSA_VERSION number. Essentially <function encoding>
> x ssa version. However, since both varpool and cgraph_node are encoded
> in the same way (for which I am having problems), then the issue is
> with the function encoding.

And the same solution should exist.  For "merged" function definitions
(like multiple same inline definitions) you'd simply drop all but one set of
constraints.

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-22 12:23                         ` Richard Biener
@ 2021-07-22 12:33                           ` Erick Ochoa
  2021-07-22 12:48                             ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Erick Ochoa @ 2021-07-22 12:33 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Development

> > > 1. pid of lgen process that generated the encoding
> > > 2. index returned by lto_symtab_encoder_encode
> > > 3. varpool_node->name ()
> > > 4. the pointer address being pointed by varpool node
>
> Well, yes, during LGEN no WPA has run.  Do you mean LTRANS after WPA?
> Sure, the encoder numbers do not have to match up between different
> LTRANS units but then they don't speak to each other so that shouldn't
> matter, no?

No. I mean during WPA. On a different e-mail I clarified the following:

```
>
> fopen $PID1 8 $ADDR1
> fopen $PID2 7 $ADDR2
>

Just to clarify a bit further. $PID is generated and stored during
LGEN. The encoding is obviously generated during LGEN.
These are read during WPA. And the encoding is decoded and dyn_casted
into a cgraph_node at WPA time.
All these are printed during WPA.
```

>
> I _think_ that it should work if you stream at LGEN constraint variables
> as their varpool node (using the varpool encoder), get the nodes merged
> at WPA time, and thus your constraints from different LGEN runs "merged"
> properly, then stream them the same way to the LTRANS units?
>

The only reference to a varpool encoder is in the ChangeLog. The only
encoder I know of is the symtab_encoder (which I believe serves the
same purpose, since varpool_nodes and cgraph_nodes are both
symtab_nodes). But again, I do not know what you mean by "merged" here,
since they have different addresses.

>
> And the same solution should exist.  For "merged" function definitions
> (like multiple same inline definitions) you'd simply drop all but one set of
> constraints.
>

Yes, this is what I would like, but I don't see how to detect "merged"
function definitions. I can get their cgraph_nodes, but as I mentioned,
for every encoding I decode and dyn_cast, all I get is a cgraph_node at
a different address. What does "merged" concretely mean?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-22 12:33                           ` Erick Ochoa
@ 2021-07-22 12:48                             ` Richard Biener
  2021-07-22 14:32                               ` Erick Ochoa
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Biener @ 2021-07-22 12:48 UTC (permalink / raw)
  To: Erick Ochoa; +Cc: Jan Hubicka, GCC Development

On Thu, Jul 22, 2021 at 2:33 PM Erick Ochoa <eochoa@gcc.gnu.org> wrote:
>
> > > > 1. pid of lgen process that generated the encoding
> > > > 2. index returned by lto_symtab_encoder_encode
> > > > 3. varpool_node->name ()
> > > > 4. the pointer address being pointed by varpool node
> >
> > Well, yes, during LGEN no WPA has run.  Do you mean LTRANS after WPA?
> > Sure, the encoder numbers do not have to match up between different
> > LTRANS units but then they don't speak to each other so that shouldn't
> > matter, no?
>
> No. I mean during WPA. On a different e-mail I clarified the following:
>
> ```
> >
> > fopen $PID1 8 $ADDR1
> > fopen $PID2 7 $ADDR2
> >
>
> Just to clarify a bit further. $PID is generated and stored during
> LGEN. The encoding is obviously generated during LGEN.
> These are read during WPA. And the encoding is decoded and dyn_casted
> into a cgraph_node at WPA time.
> All these are printed during WPA.
> ```
>
> >
> > I _think_ that it should work if you stream at LGEN constraint variables
> > as their varpool node (using the varpool encoder), get the nodes merged
> > at WPA time, and thus your constraints from different LGEN runs "merged"
> > properly, then stream them the same way to the LTRANS units?
> >
>
> The only reference to a varpool encoder is on the Changelog. The only
> encoder I know of is the symtab_encoder (which I believe should be the
> same as varpool_nodes and cgraph_nodes are both symtab_nodes). But
> again, I do not know what you mean by "merged" here, since they have
> different addresses.

But the addresses are at LGEN time?  Note the nodes are actually
streamed to different instances by input_symtab, then decls are merged
(lto_symtab_merge_decls), then I think the IPA
pass summaries are read in (to different unmerged instances!), _then_
the symtab merging process starts (lto_symtab_merge_symbols).
I think the last step eventually calls the cgraph/varpool removal hook
IPA passes registered.

It might be that you need a replace hook to do what you want.  I think
that, for example, IPA CP encodes references to global vars aka &global
as IPA_REF and those are transparently re-written.

As said, I think it can be made to work, but since this is the first IPA
pass needing this, the infrastructure may be incomplete in the details.

Basically you have summaries like

 'global = <fn::1>_3'

where the <fn::1> should eventually be implicit and the constraints
grouped into constraints generated from the respective function body
and constraints generated by call stmts (not sure here), and constraints
for global variable init.  But for the above constraint the point is to
make the 'global' references from different LGEN units the same by
some means (but not streaming and comparing the actual assembler name).

> >
> > And the same solution should exist.  For "merged" function definitions
> > (like multiple same inline definitions) you'd simply drop all but one set of
> > constraints.
> >
>
> Yes, this is what I would like, but I don't see how to detect "merged"
> function definitions. I can get their cgraphs but as I mentioned for
> every encoding I decode and dyn_cast all I get is a cgraph holding a
> different address. What does "merged" concretely mean?

One node is dropped and all references are adjusted.  And somehow
IPA passes are notified about this _after_ they have read their
summaries.
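
To illustrate what "dropped and all references adjusted" could look like
from the pass's side, here is a small self-contained sketch. It is not
GCC code: sym_node, merge_tracker and note_merge are hypothetical
stand-ins for a symtab node and for a merge/replace notification, which
may not exist in this form. The sketch only uses node addresses as keys
and never dereferences a dropped node.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a symtab node; only its address is used as a key.
struct sym_node
{
  std::string name;
};

// Records "DROPPED was merged into KEPT" events and resolves any node
// pointer saved while reading summaries to the node that survived.
class merge_tracker
{
public:
  // Hypothetical notification, called once per merged-away node.
  void note_merge (const sym_node *dropped, const sym_node *kept)
  {
    assert (dropped != kept);
    replaced_by_[dropped] = kept;
  }

  // Follow replacement links until a node that was never dropped is reached.
  const sym_node *resolve (const sym_node *n) const
  {
    for (auto it = replaced_by_.find (n); it != replaced_by_.end ();
         it = replaced_by_.find (n))
      n = it->second;
    return n;
  }

private:
  std::unordered_map<const sym_node *, const sym_node *> replaced_by_;
};
```

A pass that saved node pointers during read_summary would call resolve
on each of them before comparing, so two summaries that referred to
different pre-merge instances end up naming the same survivor.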

Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-22 12:48                             ` Richard Biener
@ 2021-07-22 14:32                               ` Erick Ochoa
  2021-07-28 10:35                                 ` Richard Biener
  0 siblings, 1 reply; 20+ messages in thread
From: Erick Ochoa @ 2021-07-22 14:32 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jan Hubicka, GCC Development

>
> But the addresses are at LGEN time?

The following is what runs at WPA time

/* Read back the two values written by write_summary.  */
unsigned long pid = streamer_read_uhwi (&ib);
unsigned long id = streamer_read_uhwi (&ib);
lto_symtab_encoder_t encoder = file_data->symtab_node_encoder;
/* Decode the partition-local index into a symtab node.  */
cgraph_node *cnode
  = dyn_cast<cgraph_node *> (lto_symtab_encoder_deref (encoder, id));
logger ("%s %lu %lu %p\n", cnode->name (), pid, id, (void *) cnode);

> Note the nodes are actually
> streamed to different instances by input_symtab, then decls are merged
> (lto_symtab_merge_decls), then I think the IPA
> pass summaries are read in (to different unmerged instances!), _then_
> the symtab merging process starts (lto_symtab_merge_symbols).
> I think the last step eventually calls the cgraph/varpool removal hook
> IPA passes registered.

Ah, so what you are saying is that during the read_summary stage they
will still be different, but during execute or
write_optimization_summary (), will they be finally merged? I think
maybe the terminology of LGEN/WPA/LTRANS should be expanded to be
lgen_gen, lgen_write, lwpa_read, lwpa_exec/lwpa_write, ltrans_read,
ltrans_exec?

So, just to be a bit more concrete, when initializing the
ipa_opt_pass_d instance one has to write functions which will be
called by a parent process. Normally I see the following comments with
them:

generate_summary
write_summary
read_summary
write_optimization_summary
read_optimization_summary

and finally there's the execute function that gets called.

I am doing the following:

generate_summary, /* generating pid */
write_summary /* generating id and writing pid and id */
read_summary /* reading and printing the info I described above */
write_optimization_summary /* nothing yet */
read_optimization_summary /* nothing yet */
execute /* nothing yet */

And I think these correspond to the following "LGEN/WPA/LTRANS" stages

1. lgen (multiple processes) generate_summary
2. lgen (multiple processes) write_summary
3. wpa (single process) read_summary
4. wpa (single process) execute
5. wpa? (single process?) write_optimization_summary
6. ltrans (multiple processes) read_optimization_summary

And you are telling me that cgraph_node and varpool_nodes will have
the same address only after the beginning of the execute stage but not
before that?

Is the above correct?

<OPEN EDIT>

I did try printing cnode->name() during execute and it segfaulted, so
perhaps those function bodies were merged into something else? Note
that some names were printed successfully. I'm wondering, can I
use the function lto_symtab_encoder_deref during execute? I think this
is unlikely, because in the past I've tried to use
lto_symtab_encoder_encode during generate_summary and it caused
segfaults. I'll still give it a try.

Perhaps this is still a bit of progress? But now I'm wondering: if I
can't use lto_symtab_encoder_deref and the nodes were indeed merged,
are some of the varpool_node* I saved during read_summary now pointing
to random memory? How can I tell which ones survived?

<CLOSE EDIT>

>
> It might be that you need a replace hook to do what you want, I think
> that for example IPA CP encodes references to global vars aka &global
> as IPA_REF and those are transparently re-written.
>
> As said, I think it can be made work but the details, since this is the
> first IPA pass needing this, can be incomplete infra-structure-wise.
>
> Basically you have summaries like
>
>  'global = <fn::1>_3'
>
> where the <fn::1> should eventually be implicit and the constraints
> grouped into constraints generated from the respective function body
> and constraints generated by call stmts (not sure here), and constraints
> for global variable init.  But for the above constraint the point is to
> make the 'global' references from different LGEN units the same by
> some means (but not streaming and comparing the actual assembler name).
>

I'll need some more time to read through how ipa-cp encodes references
to global variables. Thanks for the suggestion!

I don't really follow the paragraph that details what you think my
summaries look like. I'm thinking that for

global = <fn::1>_3

global is a variable? and <fn::1>_3 states that it is an SSA variable
in function 1? I think that can be a possible notation. I prefer to
just use integers.

What do you mean by implicit?

But the idea is to essentially "compile" down all
variables/functions/locals/ssa/etc into integers. And have the program
represented as integers and relation of integers. For example:

int* a;

extern void foo (int* c);

int main ()
{
  int b;
  a = &b;
  foo (a); // defined in a different partition
}

Should have the following at LGEN time (more specifically write_summary)

variable -> long integer encoding
--------------------------------------------
abstract null -> $null_id
cgraph main -> 0
cgraph foo -> 1
varpool a -> 2
tree b -> 0 x 0  // corresponds to main position 0
real arg c -> 1 x 0 // corresponds to foo position 0

Technically we can also map the other way around, we just need to know
in which "table" the information is stored. (I.e., the symbol_table,
the local_decl table or the ssa_table...)

Then, we give them a unique id

id for lgen <-> variable <-> long integer encoding
--------------------------------------------------------------
$null_id <-> abstract null -> $null_id
0 <-> cgraph main -> 0
1 <-> cgraph foo -> 1
2 <-> varpool a -> 2
3 <-> tree b -> 0 x 0
4 <-> real arg c -> 1 x 0

Then we can generate the constraints

2 = &3 // a = &b
4 = 2   // parm c = a
call foo

The problem is that because this is happening in parallel the other
partition might generate the following constraints:

void foo(int *c)
{
  c = NULL;
}

abstract null -> $null_id
cgraph foo -> 0
formal arg c -> 0 x 0

Give the following global id:

$null_id <-> abstract null -> $null_id
0 <-> cgraph foo -> 0
1 <-> formal arg c -> 0 x 0

And have the following constraint:

1 = $null_id

and so if we were to merge the constraints from both partitions
naively, we would get that 0 and 1 refer to different parts of the
program.

I am trying to get the primary IDs to match at WPA time to be something like:

FROM PARTITION pid 1
0 <-> cgraph main -> 0
1 <-> cgraph foo -> 1
2 <-> varpool a -> 2
3 <-> tree b -> 0 x 0
4 <-> real arg c -> 1 x 0

2 = &3 // a = &b
4 = 2   // parm c = a
call 1

FROM PARTITION pid 2
$null_id <-> abstract null -> $null_id
0 <-> cgraph foo -> 0
1 <-> formal arg c -> 0 x 0

1 = $null_id

MERGED with a map back to their old PID
wpa id, pid x lgen id, var,
0 <-> 1 x 0 <-> cgraph main -> 0
1 <-> 1 x 1 <-> cgraph foo -> 1
1 <-> 2 x 0 <-> cgraph foo -> 0
2 <-> 1 x 2 <-> varpool a -> 2
3 <-> 1 x 3 <-> tree b -> 0 x 0
4 <-> 1 x 4 <-> real arg c -> 1 x 0
5 <-> 2 x 1 <-> formal arg c -> 0 x 0

2 = &3 // a = &b
4 = 2   // real arg c = a
call 1  //  call foo
5 = $null_id  // formal arg c = NULL

Finally, with this information we can run points-to analysis using
integers standing in for memory locations and can output a pointer
pointee relationship also as integers.

I don't want to go through the whole derivation (and I already omitted
details and probably have made some silly mistakes here) but in the
end, for example we should at least have:

Pointer, pointee
---------------------
2, 3  // a may-points-to b
4, 3  // real arg c may-points-to b
2, $null_id // a may-points-to NULL
5, $null_id // formal arg c may-points-to NULL
5, 3 // formal arg c may-points-to b

And we can use these numbers to map back to the gimple source.
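
As a rough illustration of the last step, the following is a naive
inclusion-based solver over integer ids. It is only a sketch under
assumptions that are not in the tables above: null_id is picked as -1,
"5 = $null_id" is modeled as an address-of constraint on a null object,
and the call is modeled by a hypothetical copy constraint "5 = 4" that
binds the real argument to the formal one.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical integer standing in for $null_id.
const int null_id = -1;

enum class kind { addr_of, copy };

// addr_of: lhs = &rhs.  copy: lhs = rhs.  Operands are integer ids.
struct constraint
{
  kind k;
  int lhs;
  int rhs;
};

using points_to = std::map<int, std::set<int>>;

// Naive fixed-point iteration: re-apply every constraint until no set grows.
points_to
solve (const std::vector<constraint> &cs)
{
  points_to pts;
  bool changed = true;
  while (changed)
    {
      changed = false;
      for (const constraint &c : cs)
        {
          // Build the source set as a copy so lhs == rhs cannot alias.
          std::set<int> src;
          if (c.k == kind::addr_of)
            src.insert (c.rhs);
          else
            src = pts[c.rhs];
          std::set<int> &dst = pts[c.lhs];
          std::size_t before = dst.size ();
          dst.insert (src.begin (), src.end ());
          if (dst.size () != before)
            changed = true;
        }
    }
  return pts;
}
```

Fed with 2 = &3, 4 = 2, 5 = 4 and 5 = &null, this yields 3 for ids 2
and 4, and both 3 and null_id for id 5. It does not derive the
"2, $null_id" row, because nothing in this model copies back into a.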

This might be inefficient and there's room for removing some
redundancy, but that's kinda what I'm thinking about.

>
> One node is dropped and all references are adjusted.  And somehow
> IPA passes are notified about this _after_they have read their
> summaries.
>
> Richard.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: tree decl stored during LGEN does not map to a symtab_node during WPA
  2021-07-22 14:32                               ` Erick Ochoa
@ 2021-07-28 10:35                                 ` Richard Biener
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Biener @ 2021-07-28 10:35 UTC (permalink / raw)
  To: Erick Ochoa; +Cc: Jan Hubicka, GCC Development

On Thu, Jul 22, 2021 at 4:33 PM Erick Ochoa <eochoa@gcc.gnu.org> wrote:
>
> >
> > But the addresses are at LGEN time?
>
> The following is what runs at WPA time
>
> unsigned long pid = streamer_read_uhwi (&ib);
> unsigned long id = streamer_read_uhwi (&ib);
> lto_symtab_encoder_t encoder = file_data->symtab_node_encoder;
> cgraph_node *cnode =
> dyn_cast<cgraph_node*>(lto_symtab_encoder_deref(encoder, id));
> logger ("%s %ld %ld %p\n", cnode->name (), pid, id, cnode);
>
> > Note the nodes are actually
> > streamed to different instances by input_symtab, then decls are merged
> > (lto_symtab_merge_decls), then I think the IPA
> > pass summaries are read in (to different unmerged instances!), _then_
> > the symtab merging process starts (lto_symtab_merge_symbols).
> > I think the last step eventually calls the cgraph/varpool removal hook
> > IPA passes registered.
>
> Ah, so what you are saying is that during the read_summary stage they
> will still be different, but during execute or
> write_optimization_summary (), will they be finally merged? I think
> maybe the terminology of LGEN/WPA/LTRANS should be expanded to be
> lgen_gen, lgen_write, lwpa_read, lwpa_exec/lwpa_write, ltrans_read,
> ltrans_exec?
>
> So, just to be a bit more concrete, when initializing the
> ipa_opt_pass_d instance one has to write functions which will be
> called by a parent process. Normally I see the following comments with
> them:
>
> generate_summary
> write_summary
> read_summary
> write_optimization_summary
> read_optimization_summary
>
> and finally there's the execute function that gets called.
>
> I am doing the following:
>
> generate_summary, /* generating pid */
> write_summary /* generating id and writing pid and id */
> read_summary /* reading and printing the info I told about */
> write_optimization_summary /* nothing yet */
> read_optimization_summary /* nothing yet */
> execute /* nothing yet */
>
> And I think these correspond to the following "LGEN/WPA/LTRANS" stages
>
> 1. lgen (multiple processes) generate_summary
> 2. lgen (multiple process) write_summary
> 3. wpa (single process) read_summary
> 4. wpa (single process) execute
> 5. wpa? (single process?) write_optimization_summary
> 6  ltrans (multiple processes) read_optimization_summary
>
>
> And you are telling me that cgraph_node and varpool_nodes will have
> the same address only after the beginning of the execute stage but not
> before that?
>
> Is the above correct?
>
> <OPEN EDIT>
>
> I did try printing cnode->name() during execute and it segfaulted, so
> perhaps those function bodies where merged to something else? Note,
> that some names were successfully printed out. I'm wondering, can I
> use the function lto_symtab_encoder_deref during execute? I think this
> is unlikely... because in the past I've tried to use
> lto_symtab_encoder_encode during generate_summary and it caused
> segfaults. I'll still give it a try.
>
> Perhaps this is still a bit of progress? But now I'm wondering, if I
> can't use lto_symtab_encoder_deref and the nodes were indeed merged,
> do some of the varpool_node* I saved during read_summary are pointing
> to random memory? How am I able to tell which ones survived?

As said there are modification hooks and there's likely one missing for
your case (merge-A-and-B or at least B removal).

> <CLOSE EDIT>
>
> >
> > It might be that you need a replace hook to do what you want, I think
> > that for example IPA CP encodes references to global vars aka &global
> > as IPA_REF and those are transparently re-written.
> >
> > As said, I think it can be made work but the details, since this is the
> > first IPA pass needing this, can be incomplete infra-structure-wise.
> >
> > Basically you have summaries like
> >
> >  'global = <fn::1>_3'
> >
> > where the <fn::1> should eventually be implicit and the constraints
> > grouped into constraints generated from the respective function body
> > and constraints generated by call stmts (not sure here), and constraints
> > for global variable init.  But for the above constraint the point is to
> > make the 'global' references from different LGEN units the same by
> > some means (but not streaming and comparing the actual assembler name).
> >
>
> I'll need some more time to read through how ipa-cp encodes references
> to global variables. Thanks for the suggestion!
>
> I don't really follow the paragraph that details what you think my
> summaries look like. I'm thinking that for
>
> global = <fn::1>_3
>
> global is a variable? and <fn::1>_3 states that it is an SSA variable
> in function 1? I think that can be a possible notation. I prefer to
> just use integers.
>
> What do you mean by implicit?

The <fn::1> should be implicit; the data can just contain '3'.  This
assumes that you have a set of constraints recorded for each
function definition (which then naturally only refers to SSA vars in this
function).
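
A possible shape for such a per-function record, as a sketch only: the
types operand, constraint and function_constraints are hypothetical, and
"sym" followed by a number is an invented notation for a global
referenced by symbol index. The function id is stored once per record
and each SSA operand carries just its version.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Either a global (ID is a symbol index) or an SSA name local to the
// enclosing function (ID is the SSA version).
struct operand
{
  bool is_global;
  unsigned id;
};

struct constraint
{
  operand lhs;
  operand rhs;
};

// All constraints generated from one function body; FN is stored once.
struct function_constraints
{
  unsigned fn;
  std::vector<constraint> items;
};

// Spell an operand out in full, re-attaching the implicit function id.
std::string
render_operand (unsigned fn, const operand &op)
{
  if (op.is_global)
    return "sym" + std::to_string (op.id);
  return "<fn::" + std::to_string (fn) + ">_" + std::to_string (op.id);
}

// Render constraint number I of FC in the notation used above.
std::string
render (const function_constraints &fc, std::size_t i)
{
  const constraint &c = fc.items.at (i);
  return render_operand (fc.fn, c.lhs) + " = " + render_operand (fc.fn, c.rhs);
}
```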

> But the idea is to essentially "compile" down all
> variables/functions/locals/ssa/etc into integers. And have the program
> represented as integers and relation of integers. For example:
>
> int* a
>
> extern void foo (int* c);
>
> int main ()
> {
>   int b;
>   a = &b;
>   foo (a) // defined in a different function
> }
>
> Should have the following at LGEN time (more specifically write_summary)
>
> variable -> long integer encoding
> --------------------------------------------
> abstract null -> $null_id
> cgraph main -> 0
> cgraph foo -> 1
> varpool a -> 2
> tree b -> 0 x 0  // corresponds to main position 0
> real arg c -> 1 x 0 // corresponds to foo position 0
>
> Technically we can also map the other way around, we just need to know
> in which "table" the information is stored. (I.e., the symbol_table,
> the local_decl table or the ssa_table...)
>
> Then, we give them a unique id
>
> id for lgen <-> variable <-> long integer encoding
> --------------------------------------------------------------
> $null_id <-> abstract null -> $null_id
> 0 <-> cgraph main -> 0
> 1 <-> cgraph foo -> 1
> 2 <-> varpool a -> 2
> 3 <-> tree b -> 0 x 0
> 4 <-> real arg c -> 1 x 0
>
> Then we can generate the constraints
>
> 2 = &3 // a = &b
> 4 = 2   // parm c = a
> call foo
>
> The problem is that because this is happening in parallel the other
> partition might generate the following constraints:
>
> void foo(int *c)
> {
>   c = NULL;
> }
>
> abstract null -> $null_id
> cgraph foo -> 0
> formal arg c -> 0 x 0
>
> Give the following global id:
>
> $null_id <-> abstract null -> $null_id
> 0 <-> cgraph foo -> 0
> 1 <-> formal arg c -> 0 x 0
>
> And have the following constraint:
>
> 1 = $null_id
>
> and so if we were to merge the constraints from both partitions
> naively, we would get that 0 and 1 refer to different parts of the
> program.

Well, yeah - you have to remember and stream this mapping to
WPA and then produce a new merged mapping and rewrite the
integers.  So I'd complexify the initial items to not be all integers
but tuples of pieces that remap naturally with the LGEN -> WPA
merging process.
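
One way to picture such tuples, purely as a sketch with hypothetical
names (lgen_item, wpa_item, remapper): each item keeps its symbol piece
separate from its position within that symbol, so only the symbol piece
has to be rewritten once WPA knows which partition-local symbols are
the same.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical LGEN-side item: which partition wrote it, the partition-local
// symbol index, and a position within that symbol (-1 for the symbol itself).
struct lgen_item
{
  unsigned pid;
  unsigned sym;
  int sub;
};

// WPA-side item after remapping: merged symbol id plus the same position.
struct wpa_item
{
  unsigned sym;
  int sub;
  bool operator== (const wpa_item &o) const
  { return sym == o.sym && sub == o.sub; }
};

class remapper
{
public:
  // Record that symbol LGEN_SYM of partition PID is merged symbol WPA_SYM.
  void bind (unsigned pid, unsigned lgen_sym, unsigned wpa_sym)
  { sym_map_[{pid, lgen_sym}] = wpa_sym; }

  // Rewrite only the symbol piece; throws std::out_of_range if unbound.
  wpa_item remap (const lgen_item &v) const
  { return wpa_item{sym_map_.at ({v.pid, v.sym}), v.sub}; }

private:
  std::map<std::pair<unsigned, unsigned>, unsigned> sym_map_;
};
```

In the example above, foo is symbol 1 in partition 1 and symbol 0 in
partition 2. Once both are bound to the same merged id, "foo position 0"
from either partition remaps to the same WPA item.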

> I am trying to get the primary ID's to match at WPA time to be something like:
>
> FROM PARTITON pid 1
> 0 <-> cgraph main -> 0
> 1 <-> cgraph foo -> 1
> 2 <-> varpool a -> 2
> 3 <-> tree b -> 0 x 0
> 4 <-> real arg c -> 1 x 0
>
> 2 = &3 // a = &b
> 4 = 2   // parm c = a
> call 1
>
> FROM PARTITION pid 2
> $null_id <-> abstract null -> $null_id
> 0 <-> cgraph foo -> 0
> 1 <-> formal arg c -> 0 x 0
>
> 1 = $null_id
>
> MERGED with a map back to their old PID
> wpa id, pid x lgen id, var,
> 0 <-> 1 x 0 <-> cgraph main -> 0
> 1 <-> 1 x 1 <-> cgraph foo -> 1
> 1 <-> 2 x 0 <-> cgraph foo -> 0
> 2 <-> 1 x 2 <-> varpool a -> 2
> 3 <-> 1 x 3 <-> tree b -> 0 x 0
> 4 <-> 1 x 4 <-> real arg c -> 1 x 0
> 5 <-> 2 x 1 <-> formal arg c -> 1 x 0
>
> 2 = &3 // a = &b
> 4 = 2   // real arg c = a
> call 1  //  call foo
> 5 = $null_id  // formal arg c = NULL
>
> Finally, with this information we can run points-to analysis using
> integers standing in for memory locations and can output a pointer
> pointee relationship also as integers.
>
> I don't want to go through the whole derivation (and I already omitted
> details and probably have made some silly mistakes here) but in the
> end, for example we should at least have:
>
> Pointer, pointee
> ---------------------
> 2, 3  // a may-points-to b
> 4, 3  // real arg c may-points-to b
> 2, $null_id // a may-points-to NULL
> 5, $null_id // formal arg c may-points-to NULL
> 5, 3 // formal arg c may-points-to b
>
> And we can use these numbers to map back to the gimple source.
>
> This might be inefficient and there's room for removing some
> redundancy, but that's kinda what I'm thinking about.
>
>
> >
> > One node is dropped and all references are adjusted.  And somehow
> > IPA passes are notified about this _after_ they have read their
> > summaries.
> >
> > Richard.
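
For the pointer/pointee table quoted earlier, here is a rough sketch of
solving the merged constraints purely over integers. It is a naive
fixed-point loop for illustration, not GCC's points-to solver, and all
names (ckind, constraint, solve, null_id) are hypothetical. Two
assumptions are made that the quoted mail does not spell out: the call
is modelled as a copy "5 = 4" (formal arg = real arg), and "5 = $null_id"
is modelled as taking the address of an abstract null location.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical sentinel for the abstract null location.
const long null_id = -1;

// addr_of: lhs = &rhs     copy: lhs = rhs
enum class ckind { addr_of, copy };

struct constraint {
  ckind k;
  long lhs;
  long rhs;
};

using points_to = std::map<long, std::set<long>>;

// Iterates over all constraints until no points-to set grows any more.
points_to solve(const std::vector<constraint> &cs) {
  points_to pts;
  bool changed = true;
  while (changed) {
    changed = false;
    for (const constraint &c : cs) {
      std::set<long> &dst = pts[c.lhs];
      std::size_t before = dst.size();
      if (c.k == ckind::addr_of) {
        dst.insert(c.rhs);
      } else {
        // Copy the source set first so that lhs == rhs stays well defined.
        const std::set<long> src = pts[c.rhs];
        dst.insert(src.begin(), src.end());
      }
      if (dst.size() != before)
        changed = true;
    }
  }
  return pts;
}
```

With the constraints 2 = &3, 4 = 2, 5 = 4 and 5 = &$null_id this gives
2 -> {3}, 4 -> {3} and 5 -> {3, $null_id}. Under these assumptions the
row "2, $null_id" from the quoted table does not arise; deriving it
would need some additional constraint flowing from the formal argument
back to a.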


end of thread, other threads:[~2021-07-28 10:35 UTC | newest]

Thread overview: 20+ messages
2021-07-07  9:27 tree decl stored during LGEN does not map to a symtab_node during WPA Erick Ochoa
2021-07-09  7:51 ` Erick Ochoa
2021-07-09  9:49   ` Richard Biener
2021-07-12 10:55     ` Erick Ochoa
2021-07-13  9:21       ` Erick Ochoa
2021-07-13  9:41         ` Richard Biener
2021-07-13 10:49           ` Erick Ochoa
2021-07-13 12:55             ` Richard Biener
2021-07-14 13:56               ` Erick Ochoa
2021-07-15  7:23                 ` Richard Biener
2021-07-21 16:55                   ` Erick Ochoa
2021-07-22 11:40                     ` Richard Biener
2021-07-22 12:04                       ` Erick Ochoa
2021-07-22 12:08                         ` Erick Ochoa
2021-07-22 12:23                         ` Richard Biener
2021-07-22 12:33                           ` Erick Ochoa
2021-07-22 12:48                             ` Richard Biener
2021-07-22 14:32                               ` Erick Ochoa
2021-07-28 10:35                                 ` Richard Biener
2021-07-13 11:56           ` Erick Ochoa
