Question on special constraint variables in points-to analysis

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Question on special constraint variables in points-to analysis
@ 2021-03-16 12:14 Erick Ochoa
  2021-03-16 12:27 ` Richard Biener
  0 siblings, 1 reply; 2+ messages in thread
From: Erick Ochoa @ 2021-03-16 12:14 UTC (permalink / raw)
  To: gcc

Hello,

I'm currently working on improving my understanding of the implementation
of the intraprocedural points-to analysis in GCC. I have already read the
papers by Daniel Berlin and have been looking at the source for some time,
but I understand that those references are a little bit old and might not
fully reflect the status of the current points-to analysis.

In order to familiarize myself with the analysis I decided to take a look
at the constraints dumped in the logs and input them into a different
solver which also implements a points-to analysis. I do this during GCC's
run time. In other words, before GCC computes the points-to analysis using
the regular intraprocedural points-to analysis, I take a look at the
constraint vector and translate them to an interface the other solver can
understand. Therefore, after this other solver finished, there are two sets
of solutions for the points-to analysis.

In order to verify the correctness of my analysis I simply iterate over
GCC's solution and place an assertion to verify that both points-to sets
for each variable are equal. The solution computed by my implementation
solves the constraints by abstracting the varinfo into integers by using
the varinfo id. And then the relationship between points-to variables are
expressed as variable A id points to variable B id. Since the variable info
ids are just offsets into the varinfo array, then there should be a one to
one mapping between the solutions in this solver to the solutions computed
by GCC.

Its implemented something like the following:

```
    for (auto output : points_to_analysis) {
       int from_i; // pointer variable index in get_varinfo(from_i)
       int to_i; // point-ee variable index in get_varinfo(toi)

       if(!get_varinfo(from_i)->may_have_pointers) continue;
       bitmap vars = get_varinfo(from_i)->solution;
       if (!vars) continue;

       if (dump_file) fprintf(dump_file, "%s -> %s\n",
get_varinfo(from_i)->name, get_varinfo(to_i)->name);

       bool same = bitmap_bit_p(vars, to_i);
       if (dump_file) fprintf(dump_file, "SAME = %s\n", same ? "true" :
"false");
       gcc_assert(same);

    }
```

This means that if the two analyses were equal, all test cases testing
-ftree-pta should succeed. At the moment, the assertion is being triggered.
Now, I would like to understand why the assertion is triggered without
asking you to debug my code, so instead I'm trying to understand why GCC's
solution is correct. (I gave all this exposition because I may be making
assumptions that are incorrect and that can quickly be verified by someone
more knowledgeable.) Now, I'll focus on the test case:

The code in question is a slightly modified version of the test case
ipa-pta-16.c
```
struct X
{
  long l1;
  struct Y
    {
      long l2;
      int *p;
    } y;
};
int i;
static int __attribute__((noinline))
foo (struct X *x)
{
  struct Y y = x->y;
  *y.p = 0;
  i = 1;
  return *y.p;
}
extern void abort (void);
int main()
{
  struct X x;
  struct X t;
  x.y.p = &i;
  long* e = &t.l1;
  if (foo(&x) != 1)
    abort ();
  return e;
}
```

and is being compiled with the following flags:

gcc -O2 -ftree-pta -ftree-pta-external-solver -fdump-tree-alias
-ftree-salias ipa-pta-16.c

I would like to focus on the foo function, which is translated to the
following gimple:
```
__attribute__((noinline))
int foo.constprop.isra (int * ISRA.4)
{
  int * y$p;
  struct X * x;
  struct X * x;
  int _2;

  <bb 2> [local count: 1073741824]:
  *ISRA.4_3(D) = 0;
  i = 1;
  _2 = *ISRA.4_3(D);
  return _2;

}
```

GCC generates the following constraints (as seen on the dump):

```
ANYTHING = &ANYTHING
ESCAPED = *ESCAPED
ESCAPED = ESCAPED + UNKNOWN
*ESCAPED = NONLOCAL
NONLOCAL = &NONLOCAL
NONLOCAL = &ESCAPED
INTEGER = &ANYTHING
ISRA.4 = &NONLOCAL
derefaddrtmp(9) = &NULL
*ISRA.4 = derefaddrtmp(9)
i = NONLOCAL
i = &NONLOCAL
ESCAPED = &NONLOCAL
_2 = *ISRA.4
```

The points-to set generated by GCC are the following:

```
ANYTHING = { ANYTHING }
ESCAPED = { NULL ESCAPED NONLOCAL }
NONLOCAL = { ESCAPED NONLOCAL } same as i
STOREDANYTHING = { }
INTEGER = { ANYTHING }
ISRA.4 = { NONLOCAL }
derefaddrtmp(9) = { NULL }
i = { ESCAPED NONLOCAL }
_2 = { ESCAPED NONLOCAL }
foo.constprop.0.isra.0 = { }
```

I also noticed that there is more information about ISRA.4 which is not
printed in the points-to sets printed above. Instead, the dump continues
and adds this line:

```
Flow-insensitive points-to information

ISRA.4_3(D), points-to non-local, points-to NULL, points-to vars: { }
```

Now, I understand that the first set of solutions (i.e, `{ NONLOCAL }`) is
the solution of the constraint variable. That is, the constraint variable
ISRA.4 when assigned the set `{ NONLOCAL }` is a solution to the
constraints... I also understand that ISRA.4_3(D) is the SSA variable on
the gimple code. There is technically a mapping between the two, but I
guess there's some distinction. I also understand that the first dump is
generated from the function dump_solution_for_var and just looks for a set
bit.

```
  EXECUTE_IF_SET_IN_BITMAP (vi->solution, 0, i, bi)
    fprintf (file, "%s ", get_varinfo (i)->name);
```

The second dump is generated from dump_points_to_solution
```
  if (pt->null)
    fprintf (file, ", points-to NULL");
```

which looks into the points to solution struct.

My question here is, why is NULL not in the solution in the bitmap but
instead is in this special field? Probably for optimizations somehow. But,
wouldn't removing this bit from the constraint variable could somehow
affect what solution gets propagated to other variables?

But let's go with the idea that ISRA.4 does not point to NULL as given by
the points-to set dumped by GCC. Is this correct? We can try to compute the
analysis by hand by solving the constraints according to Anderson's
points-to analysis... I think we get that NONLOCAL points to NULL (which is
not in the dump).

derefaddrtmp(9) = &NULL
ISRA.4 = &NONLOCAL
*ISRA.4 = derefaddrtmp(9)

So, how I understand it, there are 4 nodes in this subgraph of the
constraint graph and the first two facts gives us the following relations:

1. derefaddrtmp(9) points to NULL
2. ISRA.4 points to NONLOCAL

The third constraint says to propagate the subset constraint to all
variables which are in the ISRA.4 set. So effectively we are creating a new
edge from NONLOCAL to deref that says that whatever deref is pointing to,
now NONLOCAL is pointing to as well. Basically, NONLOCAL now points to NULL.

But as I mentioned, the NULL does not appear on the first dump for the
variable NONLOCAL.

```
NONLOCAL = { ESCAPED NONLOCAL } same as i
```

I think all of this is not a bug, it is just some conventions that GCC
makes because NULL is a special variable, but:

* I would like some reassurance that this is indeed the case of treating
some constraint variables in a special way
* Some guidance on how to extend my solver to also treat the special
variables in the same way (in order to generate exactly the same solution).
* Some idea of what else I can expect to be handled somewhat differently
that I cannot foresee at the moment.

Thanks!

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Question on special constraint variables in points-to analysis
  2021-03-16 12:14 Question on special constraint variables in points-to analysis Erick Ochoa
@ 2021-03-16 12:27 ` Richard Biener
  0 siblings, 0 replies; 2+ messages in thread
From: Richard Biener @ 2021-03-16 12:27 UTC (permalink / raw)
  To: Erick Ochoa; +Cc: GCC Development

On Tue, Mar 16, 2021 at 1:16 PM Erick Ochoa via Gcc <gcc@gcc.gnu.org> wrote:
>
> Hello,
>
> I'm currently working on improving my understanding of the implementation
> of the intraprocedural points-to analysis in GCC. I have already read the
> papers by Daniel Berlin and have been looking at the source for some time,
> but I understand that those references are a little bit old and might not
> fully reflect the status of the current points-to analysis.
>
> In order to familiarize myself with the analysis I decided to take a look
> at the constraints dumped in the logs and input them into a different
> solver which also implements a points-to analysis. I do this during GCC's
> run time. In other words, before GCC computes the points-to analysis using
> the regular intraprocedural points-to analysis, I take a look at the
> constraint vector and translate them to an interface the other solver can
> understand. Therefore, after this other solver finished, there are two sets
> of solutions for the points-to analysis.
>
> In order to verify the correctness of my analysis I simply iterate over
> GCC's solution and place an assertion to verify that both points-to sets
> for each variable are equal. The solution computed by my implementation
> solves the constraints by abstracting the varinfo into integers by using
> the varinfo id. And then the relationship between points-to variables are
> expressed as variable A id points to variable B id. Since the variable info
> ids are just offsets into the varinfo array, then there should be a one to
> one mapping between the solutions in this solver to the solutions computed
> by GCC.
>
> Its implemented something like the following:
>
> ```
>     for (auto output : points_to_analysis) {
>        int from_i; // pointer variable index in get_varinfo(from_i)
>        int to_i; // point-ee variable index in get_varinfo(toi)
>
>        if(!get_varinfo(from_i)->may_have_pointers) continue;
>        bitmap vars = get_varinfo(from_i)->solution;
>        if (!vars) continue;
>
>        if (dump_file) fprintf(dump_file, "%s -> %s\n",
> get_varinfo(from_i)->name, get_varinfo(to_i)->name);
>
>        bool same = bitmap_bit_p(vars, to_i);
>        if (dump_file) fprintf(dump_file, "SAME = %s\n", same ? "true" :
> "false");
>        gcc_assert(same);
>
>     }
> ```
>
> This means that if the two analyses were equal, all test cases testing
> -ftree-pta should succeed. At the moment, the assertion is being triggered.
> Now, I would like to understand why the assertion is triggered without
> asking you to debug my code, so instead I'm trying to understand why GCC's
> solution is correct. (I gave all this exposition because I may be making
> assumptions that are incorrect and that can quickly be verified by someone
> more knowledgeable.) Now, I'll focus on the test case:
>
> The code in question is a slightly modified version of the test case
> ipa-pta-16.c
> ```
> struct X
> {
>   long l1;
>   struct Y
>     {
>       long l2;
>       int *p;
>     } y;
> };
> int i;
> static int __attribute__((noinline))
> foo (struct X *x)
> {
>   struct Y y = x->y;
>   *y.p = 0;
>   i = 1;
>   return *y.p;
> }
> extern void abort (void);
> int main()
> {
>   struct X x;
>   struct X t;
>   x.y.p = &i;
>   long* e = &t.l1;
>   if (foo(&x) != 1)
>     abort ();
>   return e;
> }
> ```
>
> and is being compiled with the following flags:
>
> gcc -O2 -ftree-pta -ftree-pta-external-solver -fdump-tree-alias
> -ftree-salias ipa-pta-16.c
>
> I would like to focus on the foo function, which is translated to the
> following gimple:
> ```
> __attribute__((noinline))
> int foo.constprop.isra (int * ISRA.4)
> {
>   int * y$p;
>   struct X * x;
>   struct X * x;
>   int _2;
>
>   <bb 2> [local count: 1073741824]:
>   *ISRA.4_3(D) = 0;
>   i = 1;
>   _2 = *ISRA.4_3(D);
>   return _2;
>
> }
> ```
>
> GCC generates the following constraints (as seen on the dump):
>
> ```
> ANYTHING = &ANYTHING
> ESCAPED = *ESCAPED
> ESCAPED = ESCAPED + UNKNOWN
> *ESCAPED = NONLOCAL
> NONLOCAL = &NONLOCAL
> NONLOCAL = &ESCAPED
> INTEGER = &ANYTHING
> ISRA.4 = &NONLOCAL
> derefaddrtmp(9) = &NULL
> *ISRA.4 = derefaddrtmp(9)
> i = NONLOCAL
> i = &NONLOCAL
> ESCAPED = &NONLOCAL
> _2 = *ISRA.4
> ```
>
> The points-to set generated by GCC are the following:
>
> ```
> ANYTHING = { ANYTHING }
> ESCAPED = { NULL ESCAPED NONLOCAL }
> NONLOCAL = { ESCAPED NONLOCAL } same as i
> STOREDANYTHING = { }
> INTEGER = { ANYTHING }
> ISRA.4 = { NONLOCAL }
> derefaddrtmp(9) = { NULL }
> i = { ESCAPED NONLOCAL }
> _2 = { ESCAPED NONLOCAL }
> foo.constprop.0.isra.0 = { }
> ```
>
> I also noticed that there is more information about ISRA.4 which is not
> printed in the points-to sets printed above. Instead, the dump continues
> and adds this line:
>
> ```
> Flow-insensitive points-to information
>
> ISRA.4_3(D), points-to non-local, points-to NULL, points-to vars: { }
> ```
>
> Now, I understand that the first set of solutions (i.e, `{ NONLOCAL }`) is
> the solution of the constraint variable. That is, the constraint variable
> ISRA.4 when assigned the set `{ NONLOCAL }` is a solution to the
> constraints... I also understand that ISRA.4_3(D) is the SSA variable on
> the gimple code. There is technically a mapping between the two, but I
> guess there's some distinction. I also understand that the first dump is
> generated from the function dump_solution_for_var and just looks for a set
> bit.
>
> ```
>   EXECUTE_IF_SET_IN_BITMAP (vi->solution, 0, i, bi)
>     fprintf (file, "%s ", get_varinfo (i)->name);
> ```
>
> The second dump is generated from dump_points_to_solution
> ```
>   if (pt->null)
>     fprintf (file, ", points-to NULL");
> ```
>
> which looks into the points to solution struct.
>
> My question here is, why is NULL not in the solution in the bitmap but
> instead is in this special field? Probably for optimizations somehow. But,
> wouldn't removing this bit from the constraint variable could somehow
> affect what solution gets propagated to other variables?
>
> But let's go with the idea that ISRA.4 does not point to NULL as given by
> the points-to set dumped by GCC. Is this correct? We can try to compute the
> analysis by hand by solving the constraints according to Anderson's
> points-to analysis... I think we get that NONLOCAL points to NULL (which is
> not in the dump).
>
> derefaddrtmp(9) = &NULL
> ISRA.4 = &NONLOCAL
> *ISRA.4 = derefaddrtmp(9)
>
> So, how I understand it, there are 4 nodes in this subgraph of the
> constraint graph and the first two facts gives us the following relations:
>
> 1. derefaddrtmp(9) points to NULL
> 2. ISRA.4 points to NONLOCAL
>
> The third constraint says to propagate the subset constraint to all
> variables which are in the ISRA.4 set. So effectively we are creating a new
> edge from NONLOCAL to deref that says that whatever deref is pointing to,
> now NONLOCAL is pointing to as well. Basically, NONLOCAL now points to NULL.
>
> But as I mentioned, the NULL does not appear on the first dump for the
> variable NONLOCAL.
>
> ```
> NONLOCAL = { ESCAPED NONLOCAL } same as i
> ```
>
> I think all of this is not a bug, it is just some conventions that GCC
> makes because NULL is a special variable, but:
>
> * I would like some reassurance that this is indeed the case of treating
> some constraint variables in a special way
> * Some guidance on how to extend my solver to also treat the special
> variables in the same way (in order to generate exactly the same solution).
> * Some idea of what else I can expect to be handled somewhat differently
> that I cannot foresee at the moment.

The issue with NULL is that it is not conservatively correctly represented in
the constraints but that doesn't do any harm for correctness of the rest of
the solutions.

Now, at some point we decided to encode optimistic info into pt->null
which means points-to now has to compute a conservatively correct
pt->null.  We chose to leave the constraints imprecise and fixup during
the to SSA transition which is why you see more pt->null than NULL.
See find_what_p_points_to.

Richard.

> Thanks!

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2021-03-16 12:27 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-16 12:14 Question on special constraint variables in points-to analysis Erick Ochoa
2021-03-16 12:27 ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).