* More aggressive threading causing loop-interchange-9.c regression
@ 2021-09-07 11:49 Aldy Hernandez
  2021-09-07 14:45 ` Michael Matz
  0 siblings, 1 reply; 33+ messages in thread
From: Aldy Hernandez @ 2021-09-07 11:49 UTC (permalink / raw)
To: Jeff Law, Richard Biener; +Cc: Andrew MacLeod, GCC Mailing List

[-- Attachment #1: Type: text/plain, Size: 2301 bytes --]

Hi folks.

I have a pending patch for the path solver that pulls in global ranges
when available (the stuff in SSA_NAME_RANGE_INFO).  In doing so, I have
run into a regression I was hoping the loop experts could pontificate on.

The regression comes from the simple_reduc_1() function in
tree-ssa/loop-interchange-9.c, and it involves the following path:

=========== BB 5 ============
Imports: n_10(D)  j_19
Exports: n_10(D)  j_13  j_19
         j_13 : j_19(I)
n_10(D)	int [1, 257]
j_19	int [0, 256]
Relational : (j_13 > j_19)
    <bb 5> [local count: 118111600]:
    # sum_21 = PHI <sum_14(4), sum_11(3)>
    c[j_19] = sum_21;
    j_13 = j_19 + 1;
    if (n_10(D) > j_13)
      goto <bb 9>; [89.00%]
    else
      goto <bb 6>; [11.00%]

j_13 : int [1, 257]
5->9  (T) n_10(D) :	int [2, 257]
5->9  (T) j_13 :	int [1, 256]
5->9  (T) j_19 :	int [0, 255]
5->6  (F) n_10(D) :	int [1, 257]
5->6  (F) j_13 :	int [1, 257]
5->6  (F) j_19 :	int [0, 256]

=========== BB 9 ============
n_10(D)	int [2, 257]
j_13	int [1, 256]
Relational : (n_10(D) > j_19)
Relational : (n_10(D) > j_13)
    <bb 9> [local count: 105119324]:
    goto <bb 3>; [100.00%]

=========== BB 3 ============
Imports: n_10(D)
Exports: n_10(D)
n_10(D)	int [1, +INF]
    <bb 3> [local count: 118111600]:
    # j_19 = PHI <j_13(9), 0(7)>
    sum_11 = c[j_19];
    if (n_10(D) > 0)
      goto <bb 8>; [89.00%]
    else
      goto <bb 5>; [11.00%]

With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that
the loop cannot legally iterate outside the size of c[256].  So, j_13
lies within [1, 257] and n_10 is [2, +INF] at the end of the path.
This allows us to thread BB3 to BB8.
Doing so causes the test to fail here:

/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange" } } */

All the blocks lie within the same loop, and the path passes all the
tests in path_profitable_p().

Is there something about the shape of this path that should make it
invalid in the backward threader, or should the test be marked with
-fdisable-tree-threadN (or something else entirely)?  Note that the
backward threader is allowed to peel through loop headers.

I am attaching the gimple dump as well as the graphviz output for
easier analysis.

Thanks.
Aldy

[-- Attachment #2: il-dump --]
[-- Type: text/plain, Size: 1125 bytes --]

__attribute__((noinline))
void simple_reduc_1 (int n)
{
  int i;
  int sum;
  int j;
  int _1;
  int _2;
  int _3;

  <bb 2> [local count: 14598063]:
  if (n_10(D) > 0)
    goto <bb 7>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 7> [local count: 12992276]:

  <bb 3> [local count: 118111600]:
  # j_19 = PHI <j_13(9), 0(7)>
  sum_11 = c[j_19];
  if (n_10(D) > 0)
    goto <bb 8>; [89.00%]
  else
    goto <bb 5>; [11.00%]

  <bb 8> [local count: 105119324]:

  <bb 4> [local count: 955630225]:
  # sum_20 = PHI <sum_14(10), sum_11(8)>
  # i_22 = PHI <i_15(10), 0(8)>
  _1 = a[i_22][j_19];
  _2 = b[i_22][j_19];
  _3 = _1 * _2;
  sum_14 = _3 + sum_20;
  i_15 = i_22 + 1;
  if (n_10(D) > i_15)
    goto <bb 10>; [89.00%]
  else
    goto <bb 5>; [11.00%]

  <bb 10> [local count: 850510901]:
  goto <bb 4>; [100.00%]

  <bb 5> [local count: 118111600]:
  # sum_21 = PHI <sum_14(4), sum_11(3)>
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 9>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 9> [local count: 105119324]:
  goto <bb 3>; [100.00%]

  <bb 6> [local count: 14598063]:
  return;
}

[-- Attachment #3: graph.ps --]
[-- Type: application/postscript, Size: 22338 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression
  2021-09-07 11:49 More aggressive threading causing loop-interchange-9.c regression Aldy Hernandez
@ 2021-09-07 14:45 ` Michael Matz
  2021-09-08 10:44   ` Aldy Hernandez
  0 siblings, 1 reply; 33+ messages in thread
From: Michael Matz @ 2021-09-07 14:45 UTC (permalink / raw)
To: Aldy Hernandez; +Cc: Jeff Law, Richard Biener, GCC Mailing List

Hello,

On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote:

> The regression comes from the simple_reduc_1() function in
> tree-ssa/loop-interchange-9.c, and it involves the following path:
>
> =========== BB 5 ============
> Imports: n_10(D)  j_19
> Exports: n_10(D)  j_13  j_19
>          j_13 : j_19(I)
> n_10(D)	int [1, 257]
> j_19	int [0, 256]
> Relational : (j_13 > j_19)
>     <bb 5> [local count: 118111600]:
>     # sum_21 = PHI <sum_14(4), sum_11(3)>
>     c[j_19] = sum_21;
>     j_13 = j_19 + 1;
>     if (n_10(D) > j_13)
>       goto <bb 9>; [89.00%]
>     else
>       goto <bb 6>; [11.00%]

So, this is the outer loop's exit block ...

> =========== BB 9 ============
> n_10(D)	int [2, 257]
> j_13	int [1, 256]
> Relational : (n_10(D) > j_19)
> Relational : (n_10(D) > j_13)
>     <bb 9> [local count: 105119324]:
>     goto <bb 3>; [100.00%]

... this is the outer loop's latch block ...

> =========== BB 3 ============
> Imports: n_10(D)
> Exports: n_10(D)
> n_10(D)	int [1, +INF]
>     <bb 3> [local count: 118111600]:
>     # j_19 = PHI <j_13(9), 0(7)>
>     sum_11 = c[j_19];
>     if (n_10(D) > 0)
>       goto <bb 8>; [89.00%]
>     else
>       goto <bb 5>; [11.00%]

... and this is the outer loop's header, as well as the inner loop's
pre-header.

The attempted thread hence involves a back edge (of the outer loop) and
a loop-entry edge (bb3->bb8) of the inner loop.  Doing that threading
would destroy some properties that our loop optimizers rely on, e.g.
that the loop header of a natural loop is only reached by two edges
(entry edge and back edge), that latch blocks are empty, and that
there's only a single latch.
(What exactly we require depends on several flags in
loops_state_satisfies_p.)

> With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that
> the loop cannot legally iterate outside the size of c[256].  So, j_13
> lies within [1, 257] and n_10 is [2, +INF] at the end of the path.
> This allows us to thread BB3 to BB8.

So, IIUC doing this threading would create a new entry to BB8: it would
then be entered by its natural entry (bb3), by its natural back edge
(whatever bb that is now) and the new entry from the threaded outer back
edge (bb9 or bb5?).

The inner loop wouldn't then be recognized as natural anymore, and the
whole nest not as perfect, and hence loop interchange can't easily
happen anymore.  Most of our other structural loop optimizers would have
problems with that situation as well.

> All the blocks lie within the same loop, and the path passes all the
> tests in path_profitable_p().
>
> Is there something about the shape of this path that should make it
> invalid in the backward threader, or should the test be marked with
> -fdisable-tree-threadN (or something else entirely)?

This is a functionality test checking that the very necessary
interchange in this test does happen with default flags plus
-floop-interchange (with the intention of it being enabled by -O3 or
with profile info).  So no additional flags can be added without
changing the meaning of this test.

> Note that the backward threader is allowed to peel through loop
> headers.

Something needs to give way in the path threaders before loop
optimizations: either threading through back edges, through loop
latches, or through loop headers needs to be disabled.  I think
traditionally at least threading through latches should be disabled,
because doing so usually destroys simple loop structure.  I see that
profitable_path_p() of the backward threader wants to be careful in some
situations involving loops and latches; possibly it's not careful enough
yet for the additional power brought by ranger.

See also the additional tests tree-cfgcleanup.c:tree_forwarder_block_p
is doing when loops are active.  After the structural loop optimizers
the threader can go wild and thread everything it finds.

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-07 14:45 ` Michael Matz @ 2021-09-08 10:44 ` Aldy Hernandez 2021-09-08 13:13 ` Richard Biener 0 siblings, 1 reply; 33+ messages in thread From: Aldy Hernandez @ 2021-09-08 10:44 UTC (permalink / raw) To: Michael Matz; +Cc: Jeff Law, Richard Biener, GCC Mailing List, Andrew MacLeod [-- Attachment #1: Type: text/plain, Size: 6823 bytes --] First of all. Thanks so much for this incredibly useful explanation. It helps a lot. I'm still trying to wrap my head around this, so please bear with me. On 9/7/21 4:45 PM, Michael Matz wrote: > Hello, > > On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote: > >> The regression comes from the simple_reduc_1() function in >> tree-ssa/loop-interchange-9.c, and it involves the following path: >> >> =========== BB 5 ============ >> Imports: n_10(D) j_19 >> Exports: n_10(D) j_13 j_19 >> j_13 : j_19(I) >> n_10(D) int [1, 257] >> j_19 int [0, 256] >> Relational : (j_13 > j_19) >> <bb 5> [local count: 118111600]: >> # sum_21 = PHI <sum_14(4), sum_11(3)> >> c[j_19] = sum_21; >> j_13 = j_19 + 1; >> if (n_10(D) > j_13) >> goto <bb 9>; [89.00%] >> else >> goto <bb 6>; [11.00%] > > So, this is the outer loops exit block ... > >> =========== BB 9 ============ >> n_10(D) int [2, 257] >> j_13 int [1, 256] >> Relational : (n_10(D) > j_19) >> Relational : (n_10(D) > j_13) >> <bb 9> [local count: 105119324]: >> goto <bb 3>; [100.00%] > > ... this the outer loops latch block ... > >> =========== BB 3 ============ >> Imports: n_10(D) >> Exports: n_10(D) >> n_10(D) int [1, +INF] >> <bb 3> [local count: 118111600]: >> # j_19 = PHI <j_13(9), 0(7)> >> sum_11 = c[j_19]; >> if (n_10(D) > 0) >> goto <bb 8>; [89.00%] >> else >> goto <bb 5>; [11.00%] > > ... and this the outer loops header, as well as inner loops pre-header. Actually, the inner loop's pre-header seems to be BB8, which is empty, and leads into the BB4 which is the header of the inner loop. 
> The attempted thread hence involves a back-edge (of the outer loop) and a
> loop-entry edge (bb3->bb8) of the inner loop.  Doing that threading would
> destroy some properties that our loop optimizers rely on, e.g. that the
> loop header of a natural loop is only reached by two edges: entry edge and
> back edge, and that the latch blocks are empty and that there's only a
> single latch.  (What exactly we require depends on several flags in
> loops_state_satisfies_p).

But threading 3->8 doesn't involve a loop-entry edge.  It involves a new
edge into the *pre-header* of the inner loop.

I am attaching the IL right after threading and before we clean up the
CFG (il-after-threading.txt).

After threading, we have rewritten the 5->9 edge into 5->11:

  <bb 5> [local count: 118111600]:
  # sum_21 = PHI <sum_14(4), sum_11(3)>
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 11>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 11> [local count: 105119324]:

  <bb 12> [local count: 105119324]:
  # j_7 = PHI <j_13(11)>
  sum_6 = c[j_19];
  goto <bb 8>; [100.00%]

  <bb 9> [local count: 0]:   ;; This becomes unreachable.
  goto <bb 3>; [100.00%]

Notice BB9 becomes unreachable, so we're basically eliding the latch in
BB9.

Question: could BB12 not be considered the loop latch now, and BB8 the
outer loop's entry?  This would also mean that BB3 is now outside of the
outer loop, which means the threader peeled off the first iteration of
the loop.  Or is it a requirement that the latch be empty?

>> With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that
>> the loop cannot legally iterate outside the size of c[256].  So, j_13
>> lies within [1, 257] and n_10 is [2, +INF] at the end of the path.
>> This allows us to thread BB3 to BB8.
>
> So, IIUC doing this threading would create a new entry to BB8: it would
> then be entered by its natural entry (bb3), by its natural back edge
> (whatever bb that is now) and the new entry from the threaded outer back
> edge (bb9 or bb5?).
As I mentioned, BB8 looks like the pre-header to the inner loop. But yes, it now has multiple entries: the edge from BB3, and the back edge from BB12 (which seems like it could be a new latch, since it's the only back edge). > > The inner loop wouldn't then be recognized as natural anymore and the > whole nest not as perfect, and hence loop interchange can't easily happen > anymore. Most other structural loop optimizers of us would have problems > with that situation as well. If I clean up the CFG and call loop_optimizer_init() again at this point, what is destroyed is the outer loop, not the inner loop. If you look at the IL at this point (il-after-cleanup-and-rerunning-loop-init.txt), we have a latch in BB7 going back up to BB4, though the conditional is now in BB6 (eeech): <bb 4> ... ... ... <bb 6> [local count: 118111600]: # sum_21 = PHI <sum_14(5), sum_11(3)> # j_18 = PHI <j_17(5), j_19(3)> c[j_18] = sum_21; j_13 = j_18 + 1; if (n_10(D) > j_13) goto <bb 7>; [89.00%] else goto <bb 8>; [11.00%] <bb 7> [local count: 105119324]: # j_7 = PHI <j_13(6)> sum_6 = c[j_7]; goto <bb 4>; [100.00%] Perhaps, this looks sufficiently different that the loop optimizer can't recognize it as a loop? > >> All the blocks lie within the same loop, and the path passes all the >> tests in path_profitable_p(). >> >> Is there something about the shape of this path that should make it >> invalid in the backward threader, or should the test be marked with >> -fdisable-tree-threadN (or something else entirely)? > > This is a functionality test checking that the very necessary interchange > in this test does happen with default plus -floop-interchange (with the > intention of it being enabled by O3 or with profile info). So no > additional flags can be added without changing the meaning of this test. Agreed. We're shuffling things around. What I'm missing here is why we're unable to recognize this new form as a loop. 
That being said, I am not above figuring out what the magic incantation is to keep the threader from doing this transformation. However, I just want to make sure we're not papering over something else. > >> Note that the >> backward threader is allowed to peel through loop headers. > > Something needs to give way in the path threaders before loop > optimizations: either threading through back edges, through loop latches > or through loop headers needs to be disabled. I think traditionally at > least threading through latches should be disabled, because doing so > usually destroys simple loop structure. I see that profitable_path_p() of > the backward threader wants to be careful in some situations involving > loops and latches; possibly it's not careful enough yet for the additional > power brought by ranger. > > See also the additional tests tree-cfgcleanup.c:tree_forwarder_block_p is > doing when loops are active. > > After the structural loop optimizers the threader can go wild and thread > everything it finds. Thanks again. This is very helpful. 
Aldy [-- Attachment #2: il-after-threading.txt --] [-- Type: text/plain, Size: 1260 bytes --] __attribute__((noinline)) void simple_reduc_1 (int n) { int i; int sum; int j; int _1; int _2; int _3; <bb 2> [local count: 14598063]: if (n_10(D) > 0) goto <bb 7>; [89.00%] else goto <bb 6>; [11.00%] <bb 7> [local count: 12992276]: <bb 3> [local count: 12992276]: # j_19 = PHI <j_13(9), 0(7)> sum_11 = c[j_19]; if (n_10(D) > 0) goto <bb 8>; [66.92%] else goto <bb 5>; [33.08%] <bb 8> [local count: 105119324]: <bb 4> [local count: 955630225]: # sum_20 = PHI <sum_14(10), sum_11(8)> # i_22 = PHI <i_15(10), 0(8)> _1 = a[i_22][j_19]; _2 = b[i_22][j_19]; _3 = _1 * _2; sum_14 = _3 + sum_20; i_15 = i_22 + 1; if (n_10(D) > i_15) goto <bb 10>; [89.00%] else goto <bb 5>; [11.00%] <bb 10> [local count: 850510901]: goto <bb 4>; [100.00%] <bb 5> [local count: 118111600]: # sum_21 = PHI <sum_14(4), sum_11(3)> c[j_19] = sum_21; j_13 = j_19 + 1; if (n_10(D) > j_13) goto <bb 11>; [89.00%] else goto <bb 6>; [11.00%] <bb 11> [local count: 105119324]: <bb 12> [local count: 105119324]: # j_7 = PHI <j_13(11)> sum_6 = c[j_19]; goto <bb 8>; [100.00%] <bb 9> [local count: 0]: goto <bb 3>; [100.00%] <bb 6> [local count: 14598063]: return; } [-- Attachment #3: il-after-cleanup-and-rerunning-loop-init.txt --] [-- Type: text/plain, Size: 1196 bytes --] void simple_reduc_1 (int n) { int i; int sum; int j; int _1; int _2; int _3; <bb 2> [local count: 14598063]: if (n_10(D) > 0) goto <bb 3>; [89.00%] else goto <bb 8>; [11.00%] <bb 3> [local count: 12992276]: # j_19 = PHI <0(2)> sum_11 = c[j_19]; if (n_10(D) > 0) goto <bb 4>; [66.92%] else goto <bb 6>; [33.08%] <bb 4> [local count: 105119324]: # sum_5 = PHI <sum_11(3), sum_6(7)> # j_17 = PHI <j_19(3), j_7(7)> <bb 5> [local count: 955630225]: # sum_20 = PHI <sum_14(9), sum_5(4)> # i_22 = PHI <i_15(9), 0(4)> _1 = a[i_22][j_17]; _2 = b[i_22][j_17]; _3 = _1 * _2; sum_14 = _3 + sum_20; i_15 = i_22 + 1; if (n_10(D) > i_15) goto <bb 9>; [89.00%] else goto <bb 6>; 
[11.00%] <bb 9> [local count: 850510901]: goto <bb 5>; [100.00%] <bb 6> [local count: 118111600]: # sum_21 = PHI <sum_14(5), sum_11(3)> # j_18 = PHI <j_17(5), j_19(3)> c[j_18] = sum_21; j_13 = j_18 + 1; if (n_10(D) > j_13) goto <bb 7>; [89.00%] else goto <bb 8>; [11.00%] <bb 7> [local count: 105119324]: # j_7 = PHI <j_13(6)> sum_6 = c[j_7]; goto <bb 4>; [100.00%] <bb 8> [local count: 14598063]: return; } ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-08 10:44 ` Aldy Hernandez @ 2021-09-08 13:13 ` Richard Biener 2021-09-08 13:25 ` Aldy Hernandez 0 siblings, 1 reply; 33+ messages in thread From: Richard Biener @ 2021-09-08 13:13 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On Wed, Sep 8, 2021 at 12:44 PM Aldy Hernandez <aldyh@redhat.com> wrote: > > First of all. Thanks so much for this incredibly useful explanation. > It helps a lot. > > I'm still trying to wrap my head around this, so please bear with me. > > On 9/7/21 4:45 PM, Michael Matz wrote: > > Hello, > > > > On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote: > > > >> The regression comes from the simple_reduc_1() function in > >> tree-ssa/loop-interchange-9.c, and it involves the following path: > >> > >> =========== BB 5 ============ > >> Imports: n_10(D) j_19 > >> Exports: n_10(D) j_13 j_19 > >> j_13 : j_19(I) > >> n_10(D) int [1, 257] > >> j_19 int [0, 256] > >> Relational : (j_13 > j_19) > >> <bb 5> [local count: 118111600]: > >> # sum_21 = PHI <sum_14(4), sum_11(3)> > >> c[j_19] = sum_21; > >> j_13 = j_19 + 1; > >> if (n_10(D) > j_13) > >> goto <bb 9>; [89.00%] > >> else > >> goto <bb 6>; [11.00%] > > > > So, this is the outer loops exit block ... > > > >> =========== BB 9 ============ > >> n_10(D) int [2, 257] > >> j_13 int [1, 256] > >> Relational : (n_10(D) > j_19) > >> Relational : (n_10(D) > j_13) > >> <bb 9> [local count: 105119324]: > >> goto <bb 3>; [100.00%] > > > > ... this the outer loops latch block ... > > > >> =========== BB 3 ============ > >> Imports: n_10(D) > >> Exports: n_10(D) > >> n_10(D) int [1, +INF] > >> <bb 3> [local count: 118111600]: > >> # j_19 = PHI <j_13(9), 0(7)> > >> sum_11 = c[j_19]; > >> if (n_10(D) > 0) > >> goto <bb 8>; [89.00%] > >> else > >> goto <bb 5>; [11.00%] > > > > ... and this the outer loops header, as well as inner loops pre-header. 
> > Actually, the inner loop's pre-header seems to be BB8, which is empty, > and leads into the BB4 which is the header of the inner loop. > > > The attempted thread hence involves a back-edge (of the outer loop) and a > > loop-entry edge (bb3->bb8) of the inner loop. Doing that threading would > > destroy some properties that our loop optimizers rely on, e.g. that the > > loop header of a natural loop is only reached by two edges: entry edge and > > back edge, and that the latch blocks are empty and that there's only a > > single latch. (What exactly we require depends on several flags in > > loops_state_satisfies_p). > > But threading 3->8 doesn't involve a loop-entry edge. It involves a new > edge into the *pre-header* of the inner loop. > > I am attaching the IL right after threading and before we clean up the > CFG (il-after-threading.txt). > > After threading we have have rewritten the 5->9 edge into 5->11: > > <bb 5> [local count: 118111600]: > # sum_21 = PHI <sum_14(4), sum_11(3)> > c[j_19] = sum_21; > j_13 = j_19 + 1; > if (n_10(D) > j_13) > goto <bb 11>; [89.00%] > else > goto <bb 6>; [11.00%] > > <bb 11> [local count: 105119324]: > > <bb 12> [local count: 105119324]: > # j_7 = PHI <j_13(11)> > sum_6 = c[j_19]; > goto <bb 8>; [100.00%] > > <bb 9> [local count: 0]: ;; This becomes unreachable. > goto <bb 3>; [100.00%] > > Notice BB9 becomes unreachable, so we're basically eliding the latch in BB9. > > Question: could BB12 not be considered the loop latch now and BB8 be the > outer loop's entry? This would also mean, that BB3 is now outside of > the outer loop, which means the threader peeled off the first iteration > of the loop. Or is it a requirement that the latch be empty? > > > > >> With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that > >> the loop cannot legally iterate outside the size of c[256]. So, j_13 > >> lies within [1, 257] and n_10 is [2, +INF] at the end of the path. > >> This allows us to thread BB3 to BB8. 
> > > > So, IIUC doing this threading would create a new entry to BB8: it would > > then be entered by its natural entry (bb3), by its natural back edge > > (whatever bb that is now) and the new entry from the threaded outer back > > edge (bb9 or bb5?). > > As I mentioned, BB8 looks like the pre-header to the inner loop. But > yes, it now has multiple entries: the edge from BB3, and the back edge > from BB12 (which seems like it could be a new latch, since it's the only > back edge). It would be helpful to have the patch causing the issue to look at the IL. But as Micha said, there needs to be a perfect loop nest for interchange to work. Richard. > > > > The inner loop wouldn't then be recognized as natural anymore and the > > whole nest not as perfect, and hence loop interchange can't easily happen > > anymore. Most other structural loop optimizers of us would have problems > > with that situation as well. > > If I clean up the CFG and call loop_optimizer_init() again at this > point, what is destroyed is the outer loop, not the inner loop. If you > look at the IL at this point > (il-after-cleanup-and-rerunning-loop-init.txt), we have a latch in BB7 > going back up to BB4, though the conditional is now in BB6 (eeech): > > <bb 4> > ... > ... > ... > > > <bb 6> [local count: 118111600]: > # sum_21 = PHI <sum_14(5), sum_11(3)> > # j_18 = PHI <j_17(5), j_19(3)> > c[j_18] = sum_21; > j_13 = j_18 + 1; > if (n_10(D) > j_13) > goto <bb 7>; [89.00%] > else > goto <bb 8>; [11.00%] > > <bb 7> [local count: 105119324]: > # j_7 = PHI <j_13(6)> > sum_6 = c[j_7]; > goto <bb 4>; [100.00%] > > Perhaps, this looks sufficiently different that the loop optimizer can't > recognize it as a loop? > > > > >> All the blocks lie within the same loop, and the path passes all the > >> tests in path_profitable_p(). 
> >> > >> Is there something about the shape of this path that should make it > >> invalid in the backward threader, or should the test be marked with > >> -fdisable-tree-threadN (or something else entirely)? > > > > This is a functionality test checking that the very necessary interchange > > in this test does happen with default plus -floop-interchange (with the > > intention of it being enabled by O3 or with profile info). So no > > additional flags can be added without changing the meaning of this test. > > Agreed. We're shuffling things around. What I'm missing here is why > we're unable to recognize this new form as a loop. That being said, I > am not above figuring out what the magic incantation is to keep the > threader from doing this transformation. However, I just want to make > sure we're not papering over something else. > > > > >> Note that the > >> backward threader is allowed to peel through loop headers. > > > > Something needs to give way in the path threaders before loop > > optimizations: either threading through back edges, through loop latches > > or through loop headers needs to be disabled. I think traditionally at > > least threading through latches should be disabled, because doing so > > usually destroys simple loop structure. I see that profitable_path_p() of > > the backward threader wants to be careful in some situations involving > > loops and latches; possibly it's not careful enough yet for the additional > > power brought by ranger. > > > > See also the additional tests tree-cfgcleanup.c:tree_forwarder_block_p is > > doing when loops are active. > > > > After the structural loop optimizers the threader can go wild and thread > > everything it finds. > > Thanks again. This is very helpful. > > Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-08 13:13 ` Richard Biener @ 2021-09-08 13:25 ` Aldy Hernandez 2021-09-08 13:49 ` Richard Biener 0 siblings, 1 reply; 33+ messages in thread From: Aldy Hernandez @ 2021-09-08 13:25 UTC (permalink / raw) To: Richard Biener; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod [-- Attachment #1: Type: text/plain, Size: 535 bytes --] > It would be helpful to have the patch causing the issue to look at the IL. > But as Micha said, there needs to be a perfect loop nest for interchange > to work. > > Richard. Absolutely! I'm attaching the reduced testcase, as well as the patch. The problematic thread shows up in the thread2 dump: Checking profitability of path (backwards): bb:3 (4 insns) bb:9 (0 insns) bb:5 Control statement insns: 2 Overall: 2 insns Registering FSM jump thread: (5, 9) incoming edge; (9, 3) (3, 8) nocopy; (3, 8) Thanks. Aldy [-- Attachment #2: p.patch --] [-- Type: text/x-patch, Size: 1312 bytes --] commit 1bf3f76a5ff075396b5b9f5f88d6b18649dac2ce Author: Aldy Hernandez <aldyh@redhat.com> Date: Sun Sep 5 16:54:00 2021 +0200 Take into account global ranges when folding statements in path solver. The path solver used by the backwards threader can refine a folded range by taking into account any global range known. This patch intersects the folded range with whatever global information we have. gcc/ChangeLog: * gimple-range-path.cc (path_range_query::internal_range_of_expr): Intersect with global ranges. 
> FAIL: gcc.dg/tree-ssa/loop-interchange-9.c scan-tree-dump-times linterchange "Loop_pair<outer:., inner:.> is interchanged" 1 > FAIL: gcc.dg/tree-ssa/cunroll-15.c scan-tree-dump optimized "return 1;" diff --git a/gcc/gimple-range-path.cc b/gcc/gimple-range-path.cc index a4fa3b296ff..c616b65756f 100644 --- a/gcc/gimple-range-path.cc +++ b/gcc/gimple-range-path.cc @@ -127,6 +127,9 @@ path_range_query::internal_range_of_expr (irange &r, tree name, gimple *stmt) basic_block bb = stmt ? gimple_bb (stmt) : exit_bb (); if (stmt && range_defined_in_block (r, name, bb)) { + if (TREE_CODE (name) == SSA_NAME) + r.intersect (gimple_range_global (name)); + set_cache (r, name); return true; } [-- Attachment #3: a.c --] [-- Type: text/x-csrc, Size: 461 bytes --] /* { dg-do run } */ /* { dg-options "-O2 -floop-interchange -fdump-tree-linterchange-details" } */ /* { dg-require-effective-target size20plus } */ /* { dg-skip-if "too big data segment" { visium-*-* } } */ #define M 256 int a[M][M], b[M][M], c[M], d[M]; void __attribute__((noinline)) simple_reduc_1 (int n) { for (int j = 0; j < n; j++) { int sum = c[j]; for (int i = 0; i < n; i++) sum = sum + a[i][j]*b[i][j]; c[j] = sum; } } ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression
  2021-09-08 13:25 ` Aldy Hernandez
@ 2021-09-08 13:49 ` Richard Biener
  2021-09-08 16:19   ` Aldy Hernandez
  0 siblings, 1 reply; 33+ messages in thread
From: Richard Biener @ 2021-09-08 13:49 UTC (permalink / raw)
To: Aldy Hernandez; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod

On Wed, Sep 8, 2021 at 3:25 PM Aldy Hernandez <aldyh@redhat.com> wrote:
>
> > It would be helpful to have the patch causing the issue to look at the IL.
> > But as Micha said, there needs to be a perfect loop nest for interchange
> > to work.
> >
> > Richard.
>
> Absolutely!  I'm attaching the reduced testcase, as well as the patch.
>
> The problematic thread shows up in the thread2 dump:
>
> Checking profitability of path (backwards):  bb:3 (4 insns) bb:9 (0
> insns) bb:5
>   Control statement insns: 2
>   Overall: 2 insns
>   Registering FSM jump thread: (5, 9) incoming edge;  (9, 3) (3, 8)
> nocopy; (3, 8)

Well, so as Micha said, the threading destroys the outer loop by making
it have two entries - we actually _do_ recover from this, but it results
in a non-empty latch of the outer loop, and thus the loop is no longer a
perfect nest.  The interchange pass has special sauce to recover from
the loop store motion applied, but not to handle the kind of loop
rotation the jump threading caused.

The forward threader guards against this by simply disallowing
threadings that involve different loops.  As I see it, the threading
done here should be 7->3 (outer loop entry) to bb 8 rather than one
involving the back edge.  Also note the condition is invariant in the
loop, and in fact subsumed by the condition outside of the loop, so it
should have been simplified by VRP after pass_ch; but I see there's a
jump threading pass in between pass_ch and the next VRP, which is likely
the problem.

Why does jump threading not try to simplify a condition before
attempting to thread through it?  After all, ranger should be around?

Richard.

> Thanks.
> Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-08 13:49 ` Richard Biener @ 2021-09-08 16:19 ` Aldy Hernandez 2021-09-08 16:39 ` Michael Matz 0 siblings, 1 reply; 33+ messages in thread From: Aldy Hernandez @ 2021-09-08 16:19 UTC (permalink / raw) To: Richard Biener; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On 9/8/21 3:49 PM, Richard Biener wrote: > On Wed, Sep 8, 2021 at 3:25 PM Aldy Hernandez <aldyh@redhat.com> wrote: >> >>> It would be helpful to have the patch causing the issue to look at the IL. >>> But as Micha said, there needs to be a perfect loop nest for interchange >>> to work. >>> >>> Richard. >> >> Absolutely! I'm attaching the reduced testcase, as well as the patch. >> >> The problematic thread shows up in the thread2 dump: >> >> Checking profitability of path (backwards): bb:3 (4 insns) bb:9 (0 >> insns) bb:5 >> Control statement insns: 2 >> Overall: 2 insns >> Registering FSM jump thread: (5, 9) incoming edge; (9, 3) (3, 8) >> nocopy; (3, 8) > > Well, so as Micha said, the threading destroys the outer loop by > making it have two entries - we actually _do_ recover from this > but it results in a non-empty latch of the outer loop and thus > the loop is no longer a perfect nest. The interchange pass has > special-sauce to recover from the loop store motion applied > but not to handle the kind of loop rotation the jump threading > caused. Understood. The backedge into BB8 and the non-empty latch seem to cause problems. > > The forward threader guards against this by simply disallowing > threadings that involve different loops. As I see The thread in question (5->9->3) is all within the same outer loop, though. BTW, the backward threader also disallows threading across different loops (see path_crosses_loops variable). > the threading done here should be 7->3 (outer loop entry) to bb 8 > rather than one involving the backedge. 
Also note the condition > is invariant in the loop and in fact subsumed by the condition > outside of the loop and it should have been simplified by > VRP after pass_ch but I see there's a jump threading pass > inbetween pass_ch and the next VRP which is likely the > problem. A 7->3->8 thread would cross loops though, because 7 is outside the outer loop. > > Why does jump threading not try to simplify a condition before > attempting to thread through it? After all ranger should be around? That's a very good question ;-). The current code does not use the ranger to solve unknowns. Anything further back than the first block in a path will return varying. The plan was to add this ranger solving functionality as a follow-up. I have a whole slew of pending patches adding precisely this functionality. We should be able to solve ranges outside the path with ranger, as well as relationals occurring in a path. However, even if there are alternate ways of threading this IL, something like 5->9->3 could still happen. We need a way to disallow this. I'm having a hard time determining the hammer for this. I would vote for disabling threading through latches, but it seems the backward threader is aware of this scenario and allows it anyhow (see threaded_through_latch variable). Ughh. Thanks for looking into this. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-08 16:19 ` Aldy Hernandez @ 2021-09-08 16:39 ` Michael Matz 2021-09-08 18:13 ` Michael Matz 0 siblings, 1 reply; 33+ messages in thread From: Michael Matz @ 2021-09-08 16:39 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod Hello, On Wed, 8 Sep 2021, Aldy Hernandez wrote: > > The forward threader guards against this by simply disallowing > > threadings that involve different loops. As I see > > The thread in question (5->9->3) is all within the same outer loop, > though. BTW, the backward threader also disallows threading across > different loops (see path_crosses_loops variable). > > > the threading done here should be 7->3 (outer loop entry) to bb 8 > > rather than one involving the backedge. Also note the condition is > > invariant in the loop and in fact subsumed by the condition outside of > > the loop and it should have been simplified by VRP after pass_ch but I > > see there's a jump threading pass inbetween pass_ch and the next VRP > > which is likely the problem. > > A 7->3->8 thread would cross loops though, because 7 is outside the > outer loop. ... > However, even if there are alternate ways of threading this IL, > something like 5->9->3 could still happen. We need a way to disallow > this. I'm having a hard time determining the hammer for this. I would > vote for disabling threading through latches, but it seems the backward > threader is aware of this scenario and allows it anyhow (see > threaded_through_latch variable). Ughh. The backward threader seems to want to be careful with latches, but still allow it in some situations, in particular when doing so doesn't create a loop with non-simple latches (which is basically a single and empty latch block). If this improvement under discussion leads to creating a non-empty latch then those checks aren't restrictive enough (anymore). 
I think threading through a latch is always dubious regarding the loop structure, it's basically either loop rotation or iteration peeling, even if it doesn't cause non-simple latches. Those transformations should probably be left to a loop optimizer, or be only done when destroying loop structure is fine (i.e. late). Maybe it's possible to not disable threading over latches altogether in the backward threader (like it's tried now), but I haven't looked at the specific situation here in depth, so take my view only as opinion from a large distance :-) Does anything break if you brutally disable any backward threading when any of the involved blocks is a latch when current_loops is set? (I guess for that latter test to be effective you want to disable the loop_optimizer_init() for the "late" jump thread passes) Ciao, Michael. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-08 16:39 ` Michael Matz @ 2021-09-08 18:13 ` Michael Matz 2021-09-09 6:57 ` Richard Biener 2021-09-09 8:14 ` Aldy Hernandez 0 siblings, 2 replies; 33+ messages in thread From: Michael Matz @ 2021-09-08 18:13 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod Hello, [lame answer to self] On Wed, 8 Sep 2021, Michael Matz wrote: > > > The forward threader guards against this by simply disallowing > > > threadings that involve different loops. As I see > > > > The thread in question (5->9->3) is all within the same outer loop, > > though. BTW, the backward threader also disallows threading across > > different loops (see path_crosses_loops variable). ... > Maybe it's possible to not disable threading over latches alltogether in > the backward threader (like it's tried now), but I haven't looked at the > specific situation here in depth, so take my view only as opinion from a > large distance :-) I've now looked at the concrete situation. So yeah, the whole path is in the same loop, crosses the latch, _and there's code following the latch on that path_. (I.e. the latch isn't the last block in the path). In particular, after loop_optimizer_init() (before any threading) we have: <bb 3> [local count: 118111600]: # j_19 = PHI <j_13(9), 0(7)> sum_11 = c[j_19]; if (n_10(D) > 0) goto <bb 8>; [89.00%] else goto <bb 5>; [11.00%] <bb 8> [local count: 105119324]: ... <bb 5> [local count: 118111600]: # sum_21 = PHI <sum_14(4), sum_11(3)> c[j_19] = sum_21; j_13 = j_19 + 1; if (n_10(D) > j_13) goto <bb 9>; [89.00%] else goto <bb 6>; [11.00%] <bb 9> [local count: 105119324]: goto <bb 3>; [100.00%] With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the pre-header of inner loop, but more importantly something that's not at the start of the outer loop. Now, any thread that includes the backedge 9->3 _including_ its destination (i.e. 
where the backedge isn't the last to-be-redirected edge) necessarily duplicates all code from that destination onto the back edge. Here it's the load from c[j] into sum_11. The important part is the code is emitted onto the back edge, conceptually; in reality it's simply included into the (new) latch block (the duplicate of bb9, which is bb12 intermediately, then named bb7 after cfg_cleanup). That's what we can't have for some of our structural loop optimizers: there must be no code executed after the exit test (e.g. in the latch block). (This requirement makes reasoning about which code is or isn't executed completely for an iteration trivial; simply everything in the body is always executed; e.g. loop interchange uses this to check that there are no memory references after the exit test, because those would then be only conditional and hence make loop interchange very awkward). Note that this situation can't be later rectified anymore: the duplicated instructions (because they are memory refs) must remain after the exit test. Only by rerolling/unrotating the loop (i.e. noticing that the memory refs on the loop-entry path and on the back edge are equivalent) would that be possible, but that's something we aren't capable of. Even if we were that would simply just revert the whole work that the threader did, so it's better to not even do that to start with. I believe something like below would be appropriate, it disables threading if the path contains a latch at the non-last position (due to being backwards on the non-first position in the array). I.e. it disables rotating the loop if there's danger of polluting the back edge. It might be improved if the blocks following (preceding!) the latch are themselves empty because then no code is duplicated. It might also be improved if the latch is already non-empty. That code should probably only be active before the loop optimizers, but currently the backward threader isn't differentiating between before/after loop-optims.
I haven't tested this patch at all, except that it fixes the testcase :) Ciao, Michael. diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c index 449232c7715..528a753b886 100644 --- a/gcc/tree-ssa-threadbackward.c +++ b/gcc/tree-ssa-threadbackward.c @@ -600,6 +600,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, loop_p loop = m_path[0]->loop_father; bool path_crosses_loops = false; bool threaded_through_latch = false; + bool latch_within_path = false; bool multiway_branch_in_path = false; bool threaded_multiway_branch = false; bool contains_hot_bb = false; @@ -725,7 +726,13 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, the last entry in the array when determining if we thread through the loop latch. */ if (loop->latch == bb) - threaded_through_latch = true; + { + threaded_through_latch = true; + if (j != 0) + latch_within_path = true; + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, " (latch)"); + } } gimple *stmt = get_gimple_control_stmt (m_path[0]); @@ -845,6 +852,15 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, "a multiway branch.\n"); return false; } + + if (latch_within_path) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + " FAIL: FSM Thread through latch would create non-empty latch\n"); + return false; + + } return true; } ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-08 18:13 ` Michael Matz @ 2021-09-09 6:57 ` Richard Biener 2021-09-09 7:37 ` Aldy Hernandez 2021-09-09 12:47 ` Michael Matz 2021-09-09 8:14 ` Aldy Hernandez 1 sibling, 2 replies; 33+ messages in thread From: Richard Biener @ 2021-09-09 6:57 UTC (permalink / raw) To: Michael Matz; +Cc: Aldy Hernandez, Jeff Law, GCC Mailing List, Andrew MacLeod On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: > > Hello, > > [lame answer to self] > > On Wed, 8 Sep 2021, Michael Matz wrote: > > > > > The forward threader guards against this by simply disallowing > > > > threadings that involve different loops. As I see > > > > > > The thread in question (5->9->3) is all within the same outer loop, > > > though. BTW, the backward threader also disallows threading across > > > different loops (see path_crosses_loops variable). > ... > > Maybe it's possible to not disable threading over latches alltogether in > > the backward threader (like it's tried now), but I haven't looked at the > > specific situation here in depth, so take my view only as opinion from a > > large distance :-) > > I've now looked at the concrete situation. So yeah, the whole path is in > the same loop, crosses the latch, _and there's code following the latch > on that path_. (I.e. the latch isn't the last block in the path). In > particular, after loop_optimizer_init() (before any threading) we have: > > <bb 3> [local count: 118111600]: > # j_19 = PHI <j_13(9), 0(7)> > sum_11 = c[j_19]; > if (n_10(D) > 0) > goto <bb 8>; [89.00%] > else > goto <bb 5>; [11.00%] > > <bb 8> [local count: 105119324]: > ... 
> > <bb 5> [local count: 118111600]: > # sum_21 = PHI <sum_14(4), sum_11(3)> > c[j_19] = sum_21; > j_13 = j_19 + 1; > if (n_10(D) > j_13) > goto <bb 9>; [89.00%] > else > goto <bb 6>; [11.00%] > > <bb 9> [local count: 105119324]: > goto <bb 3>; [100.00%] > > With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the > pre-header of inner loop, but more importantly something that's not at the > start of the outer loop. > > Now, any thread that includes the backedge 9->3 _including_ its > destination (i.e. where the backedge isn't the last to-be-redirected edge) > necessarily duplicates all code from that destination onto the back edge. > Here it's the load from c[j] into sum_11. > > The important part is the code is emitted onto the back edge, > conceptually; in reality it's simply included into the (new) latch block > (the duplicate of bb9, which is bb12 intermediately, then named bb7 after > cfg_cleanup). > > That's what we can't have for some of our structural loop optimizers: > there must be no code executed after the exit test (e.g. in the latch > block). (This requirement makes reasoning about which code is or isn't > executed completely for an iteration trivial; simply everything in the > body is always executed; e.g. loop interchange uses this to check that > there are no memory references after the exit test, because those would > then be only conditional and hence make loop interchange very awkward). > > Note that this situation can't be later rectified anymore: the duplicated > instructions (because they are memory refs) must remain after the exit > test. Only by rerolling/unrotating the loop (i.e. noticing that the > memory refs on the loop-entry path and on the back edge are equivalent) > would that be possible, but that's something we aren't capable of. Even > if we were that would simply just revert the whole work that the threader > did, so it's better to not even do that to start with. 
> > I believe something like below would be appropriate, it disables threading > if the path contains a latch at the non-last position (due to being > backwards on the non-first position in the array). I.e. it disables > rotating the loop if there's danger of polluting the back edge. It might > be improved if the blocks following (preceding!) the latch are themself > empty because then no code is duplicated. It might also be improved if > the latch is already non-empty. That code should probably only be active > before the loop optimizers, but currently the backward threader isn't > differentiating between before/after loop-optims. > > I haven't tested this patch at all, except that it fixes the testcase :) Lame comment at the current end of the thread - it's not threading through the latch but threading through the loop header that's problematic, at least if the end of the threading path ends within the loop (threading through the header to the loop exit is fine). Because in that situation you effectively created an alternate loop entry. Threading through the latch into the loop header is fine but with simple latches that likely will never happen (if there are no simple latches then the latch can contain the loop exit test). See tree-ssa-threadupdate.c:thread_block_1 e2 = path->last ()->e; if (!e2 || noloop_only) { /* If NOLOOP_ONLY is true, we only allow threading through the header of a loop to exit edges. */ /* One case occurs when there was loop header buried in a jump threading path that crosses loop boundaries. We do not try and thread this elsewhere, so just cancel the jump threading request by clearing the AUX field now. */ if (bb->loop_father != e2->src->loop_father && (!loop_exit_edge_p (e2->src->loop_father, e2) || flow_loop_nested_p (bb->loop_father, e2->dest->loop_father))) { /* Since this case is not handled by our special code to thread through a loop header, we must explicitly cancel the threading request here. 
*/ delete_jump_thread_path (path); e->aux = NULL; continue; } there are a lot of "useful" checks in this function and the backwards threader should adopt those. Note the backwards threader originally only did FSM style threadings which are exactly those possibly "harmful" ones, forming irreducible regions at worst or sub-loops at best. That might explain the lack of those checks. Richard. > > Ciao, > Michael. > > diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c > index 449232c7715..528a753b886 100644 > --- a/gcc/tree-ssa-threadbackward.c > +++ b/gcc/tree-ssa-threadbackward.c > @@ -600,6 +600,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, > loop_p loop = m_path[0]->loop_father; > bool path_crosses_loops = false; > bool threaded_through_latch = false; > + bool latch_within_path = false; > bool multiway_branch_in_path = false; > bool threaded_multiway_branch = false; > bool contains_hot_bb = false; > @@ -725,7 +726,13 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, > the last entry in the array when determining if we thread > through the loop latch. */ > if (loop->latch == bb) > - threaded_through_latch = true; > + { > + threaded_through_latch = true; > + if (j != 0) > + latch_within_path = true; > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, " (latch)"); > + } > } > > gimple *stmt = get_gimple_control_stmt (m_path[0]); > @@ -845,6 +852,15 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, > "a multiway branch.\n"); > return false; > } > + > + if (latch_within_path) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + " FAIL: FSM Thread through latch would create non-empty latch\n"); > + return false; > + > + } > return true; > } > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 6:57 ` Richard Biener @ 2021-09-09 7:37 ` Aldy Hernandez 2021-09-09 7:45 ` Richard Biener 2021-09-09 12:47 ` Michael Matz 1 sibling, 1 reply; 33+ messages in thread From: Aldy Hernandez @ 2021-09-09 7:37 UTC (permalink / raw) To: Richard Biener, Michael Matz; +Cc: Jeff Law, GCC Mailing List, Andrew MacLeod On 9/9/21 8:57 AM, Richard Biener wrote: > On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: >> >> Hello, >> >> [lame answer to self] >> >> On Wed, 8 Sep 2021, Michael Matz wrote: >> >>>>> The forward threader guards against this by simply disallowing >>>>> threadings that involve different loops. As I see >>>> >>>> The thread in question (5->9->3) is all within the same outer loop, >>>> though. BTW, the backward threader also disallows threading across >>>> different loops (see path_crosses_loops variable). >> ... >>> Maybe it's possible to not disable threading over latches alltogether in >>> the backward threader (like it's tried now), but I haven't looked at the >>> specific situation here in depth, so take my view only as opinion from a >>> large distance :-) >> >> I've now looked at the concrete situation. So yeah, the whole path is in >> the same loop, crosses the latch, _and there's code following the latch >> on that path_. (I.e. the latch isn't the last block in the path). In >> particular, after loop_optimizer_init() (before any threading) we have: >> >> <bb 3> [local count: 118111600]: >> # j_19 = PHI <j_13(9), 0(7)> >> sum_11 = c[j_19]; >> if (n_10(D) > 0) >> goto <bb 8>; [89.00%] >> else >> goto <bb 5>; [11.00%] >> >> <bb 8> [local count: 105119324]: >> ... 
>> >> <bb 5> [local count: 118111600]: >> # sum_21 = PHI <sum_14(4), sum_11(3)> >> c[j_19] = sum_21; >> j_13 = j_19 + 1; >> if (n_10(D) > j_13) >> goto <bb 9>; [89.00%] >> else >> goto <bb 6>; [11.00%] >> >> <bb 9> [local count: 105119324]: >> goto <bb 3>; [100.00%] >> >> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the >> pre-header of inner loop, but more importantly something that's not at the >> start of the outer loop. >> >> Now, any thread that includes the backedge 9->3 _including_ its >> destination (i.e. where the backedge isn't the last to-be-redirected edge) >> necessarily duplicates all code from that destination onto the back edge. >> Here it's the load from c[j] into sum_11. >> >> The important part is the code is emitted onto the back edge, >> conceptually; in reality it's simply included into the (new) latch block >> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after >> cfg_cleanup). >> >> That's what we can't have for some of our structural loop optimizers: >> there must be no code executed after the exit test (e.g. in the latch >> block). (This requirement makes reasoning about which code is or isn't >> executed completely for an iteration trivial; simply everything in the >> body is always executed; e.g. loop interchange uses this to check that >> there are no memory references after the exit test, because those would >> then be only conditional and hence make loop interchange very awkward). >> >> Note that this situation can't be later rectified anymore: the duplicated >> instructions (because they are memory refs) must remain after the exit >> test. Only by rerolling/unrotating the loop (i.e. noticing that the >> memory refs on the loop-entry path and on the back edge are equivalent) >> would that be possible, but that's something we aren't capable of. Even >> if we were that would simply just revert the whole work that the threader >> did, so it's better to not even do that to start with. 
>> >> I believe something like below would be appropriate, it disables threading >> if the path contains a latch at the non-last position (due to being >> backwards on the non-first position in the array). I.e. it disables >> rotating the loop if there's danger of polluting the back edge. It might >> be improved if the blocks following (preceding!) the latch are themself >> empty because then no code is duplicated. It might also be improved if >> the latch is already non-empty. That code should probably only be active >> before the loop optimizers, but currently the backward threader isn't >> differentiating between before/after loop-optims. >> >> I haven't tested this patch at all, except that it fixes the testcase :) > > Lame comment at the current end of the thread - it's not threading through the I don't know why y'all keep using the word "lame". On the contrary, these are incredibly useful explanations. Thanks. > latch but threading through the loop header that's problematic, at least if the > end of the threading path ends within the loop (threading through the header > to the loop exit is fine). Because in that situation you effectively created an > alternate loop entry. Threading through the latch into the loop header is > fine but with simple latches that likely will never happen (if there are no > simple latches then the latch can contain the loop exit test). > > See tree-ssa-threadupdate.c:thread_block_1 > > e2 = path->last ()->e; > if (!e2 || noloop_only) > { > /* If NOLOOP_ONLY is true, we only allow threading through the > header of a loop to exit edges. */ > > /* One case occurs when there was loop header buried in a jump > threading path that crosses loop boundaries. We do not try > and thread this elsewhere, so just cancel the jump threading > request by clearing the AUX field now. 
*/ > if (bb->loop_father != e2->src->loop_father > && (!loop_exit_edge_p (e2->src->loop_father, e2) > || flow_loop_nested_p (bb->loop_father, > e2->dest->loop_father))) > { > /* Since this case is not handled by our special code > to thread through a loop header, we must explicitly > cancel the threading request here. */ > delete_jump_thread_path (path); > e->aux = NULL; > continue; > } But this is for a threading path that crosses loop boundaries, which is not the case. Perhaps we should restrict this further to threads within a loop? > > there are a lot of "useful" checks in this function and the backwards threader > should adopt those. Note the backwards threader originally only did > FSM style threadings which are exactly those possibly "harmful" ones, forming > irreducible regions at worst or sub-loops at best. That might explain the > lack of those checks. Also, the aforementioned checks are in jump_thread_path_registry, which is also shared by the backward threader. These are thread discards _after_ a thread has been registered. The backward threader should also be using these restrictions. Unless, I'm missing some interaction with the FSM/etc threading types as per the preamble to the snippet you provided: if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners) || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners)) continue; You are right though, there are a lot of checks throughout the entire forward threader that should be audited and somehow shared. It's on my back burner, but I'm running out of cycles here :-/. Thanks. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 7:37 ` Aldy Hernandez @ 2021-09-09 7:45 ` Richard Biener 2021-09-09 8:36 ` Aldy Hernandez 0 siblings, 1 reply; 33+ messages in thread From: Richard Biener @ 2021-09-09 7:45 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <aldyh@redhat.com> wrote: > > > > On 9/9/21 8:57 AM, Richard Biener wrote: > > On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: > >> > >> Hello, > >> > >> [lame answer to self] > >> > >> On Wed, 8 Sep 2021, Michael Matz wrote: > >> > >>>>> The forward threader guards against this by simply disallowing > >>>>> threadings that involve different loops. As I see > >>>> > >>>> The thread in question (5->9->3) is all within the same outer loop, > >>>> though. BTW, the backward threader also disallows threading across > >>>> different loops (see path_crosses_loops variable). > >> ... > >>> Maybe it's possible to not disable threading over latches alltogether in > >>> the backward threader (like it's tried now), but I haven't looked at the > >>> specific situation here in depth, so take my view only as opinion from a > >>> large distance :-) > >> > >> I've now looked at the concrete situation. So yeah, the whole path is in > >> the same loop, crosses the latch, _and there's code following the latch > >> on that path_. (I.e. the latch isn't the last block in the path). In > >> particular, after loop_optimizer_init() (before any threading) we have: > >> > >> <bb 3> [local count: 118111600]: > >> # j_19 = PHI <j_13(9), 0(7)> > >> sum_11 = c[j_19]; > >> if (n_10(D) > 0) > >> goto <bb 8>; [89.00%] > >> else > >> goto <bb 5>; [11.00%] > >> > >> <bb 8> [local count: 105119324]: > >> ... 
> >> > >> <bb 5> [local count: 118111600]: > >> # sum_21 = PHI <sum_14(4), sum_11(3)> > >> c[j_19] = sum_21; > >> j_13 = j_19 + 1; > >> if (n_10(D) > j_13) > >> goto <bb 9>; [89.00%] > >> else > >> goto <bb 6>; [11.00%] > >> > >> <bb 9> [local count: 105119324]: > >> goto <bb 3>; [100.00%] > >> > >> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the > >> pre-header of inner loop, but more importantly something that's not at the > >> start of the outer loop. > >> > >> Now, any thread that includes the backedge 9->3 _including_ its > >> destination (i.e. where the backedge isn't the last to-be-redirected edge) > >> necessarily duplicates all code from that destination onto the back edge. > >> Here it's the load from c[j] into sum_11. > >> > >> The important part is the code is emitted onto the back edge, > >> conceptually; in reality it's simply included into the (new) latch block > >> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after > >> cfg_cleanup). > >> > >> That's what we can't have for some of our structural loop optimizers: > >> there must be no code executed after the exit test (e.g. in the latch > >> block). (This requirement makes reasoning about which code is or isn't > >> executed completely for an iteration trivial; simply everything in the > >> body is always executed; e.g. loop interchange uses this to check that > >> there are no memory references after the exit test, because those would > >> then be only conditional and hence make loop interchange very awkward). > >> > >> Note that this situation can't be later rectified anymore: the duplicated > >> instructions (because they are memory refs) must remain after the exit > >> test. Only by rerolling/unrotating the loop (i.e. noticing that the > >> memory refs on the loop-entry path and on the back edge are equivalent) > >> would that be possible, but that's something we aren't capable of. 
Even > >> if we were that would simply just revert the whole work that the threader > >> did, so it's better to not even do that to start with. > >> > >> I believe something like below would be appropriate, it disables threading > >> if the path contains a latch at the non-last position (due to being > >> backwards on the non-first position in the array). I.e. it disables > >> rotating the loop if there's danger of polluting the back edge. It might > >> be improved if the blocks following (preceding!) the latch are themself > >> empty because then no code is duplicated. It might also be improved if > >> the latch is already non-empty. That code should probably only be active > >> before the loop optimizers, but currently the backward threader isn't > >> differentiating between before/after loop-optims. > >> > >> I haven't tested this patch at all, except that it fixes the testcase :) > > > > Lame comment at the current end of the thread - it's not threading through the > > I don't know why y'all keep using the word "lame". On the contrary, > these are incredibly useful explanations. Thanks. > > > latch but threading through the loop header that's problematic, at least if the > > end of the threading path ends within the loop (threading through the header > > to the loop exit is fine). Because in that situation you effectively created an > > alternate loop entry. Threading through the latch into the loop header is > > fine but with simple latches that likely will never happen (if there are no > > simple latches then the latch can contain the loop exit test). > > > > See tree-ssa-threadupdate.c:thread_block_1 > > > > e2 = path->last ()->e; > > if (!e2 || noloop_only) > > { > > /* If NOLOOP_ONLY is true, we only allow threading through the > > header of a loop to exit edges. */ > > > > /* One case occurs when there was loop header buried in a jump > > threading path that crosses loop boundaries. 
We do not try > > and thread this elsewhere, so just cancel the jump threading > > request by clearing the AUX field now. */ > > if (bb->loop_father != e2->src->loop_father > > && (!loop_exit_edge_p (e2->src->loop_father, e2) > > || flow_loop_nested_p (bb->loop_father, > > e2->dest->loop_father))) > > { > > /* Since this case is not handled by our special code > > to thread through a loop header, we must explicitly > > cancel the threading request here. */ > > delete_jump_thread_path (path); > > e->aux = NULL; > > continue; > > } > > But this is for a threading path that crosses loop boundaries, which is > not the case. Perhaps we should restrict this further to threads within > a loop? > > > > > there are a lot of "useful" checks in this function and the backwards threader > > should adopt those. Note the backwards threader originally only did > > FSM style threadings which are exactly those possibly "harmful" ones, forming > > irreducible regions at worst or sub-loops at best. That might explain the > > lack of those checks. > > Also, the aforementioned checks are in jump_thread_path_registry, which > is also shared by the backward threader. These are thread discards > _after_ a thread has been registered. Yeah, that's indeed unfortunate. > The backward threader should also > be using these restrictions. Unless, I'm missing some interaction with > the FSM/etc threading types as per the preamble to the snippet you provided: > > if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners) > || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners)) > continue; Indeed. But I understand the backwards threader does not (only) do FSM threading now. > You are right though, there are a lot of checks throughout the entire > forward threader that should be audited and somehow shared. It's on my > back burner, but I'm running out of cycles here :-/. 
yeah, it's quite a mess indeed and merging the path validity/costing/code-transform paths of both threaders would be incredibly useful. Richard. > > Thanks. > Aldy > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 7:45 ` Richard Biener @ 2021-09-09 8:36 ` Aldy Hernandez 2021-09-09 8:58 ` Richard Biener 0 siblings, 1 reply; 33+ messages in thread From: Aldy Hernandez @ 2021-09-09 8:36 UTC (permalink / raw) To: Richard Biener; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On 9/9/21 9:45 AM, Richard Biener wrote: > On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <aldyh@redhat.com> wrote: >> >> >> >> On 9/9/21 8:57 AM, Richard Biener wrote: >>> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: >>>> >>>> Hello, >>>> >>>> [lame answer to self] >>>> >>>> On Wed, 8 Sep 2021, Michael Matz wrote: >>>> >>>>>>> The forward threader guards against this by simply disallowing >>>>>>> threadings that involve different loops. As I see >>>>>> >>>>>> The thread in question (5->9->3) is all within the same outer loop, >>>>>> though. BTW, the backward threader also disallows threading across >>>>>> different loops (see path_crosses_loops variable). >>>> ... >>>>> Maybe it's possible to not disable threading over latches alltogether in >>>>> the backward threader (like it's tried now), but I haven't looked at the >>>>> specific situation here in depth, so take my view only as opinion from a >>>>> large distance :-) >>>> >>>> I've now looked at the concrete situation. So yeah, the whole path is in >>>> the same loop, crosses the latch, _and there's code following the latch >>>> on that path_. (I.e. the latch isn't the last block in the path). In >>>> particular, after loop_optimizer_init() (before any threading) we have: >>>> >>>> <bb 3> [local count: 118111600]: >>>> # j_19 = PHI <j_13(9), 0(7)> >>>> sum_11 = c[j_19]; >>>> if (n_10(D) > 0) >>>> goto <bb 8>; [89.00%] >>>> else >>>> goto <bb 5>; [11.00%] >>>> >>>> <bb 8> [local count: 105119324]: >>>> ... 
>>>> >>>> <bb 5> [local count: 118111600]: >>>> # sum_21 = PHI <sum_14(4), sum_11(3)> >>>> c[j_19] = sum_21; >>>> j_13 = j_19 + 1; >>>> if (n_10(D) > j_13) >>>> goto <bb 9>; [89.00%] >>>> else >>>> goto <bb 6>; [11.00%] >>>> >>>> <bb 9> [local count: 105119324]: >>>> goto <bb 3>; [100.00%] >>>> >>>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the >>>> pre-header of inner loop, but more importantly something that's not at the >>>> start of the outer loop. >>>> >>>> Now, any thread that includes the backedge 9->3 _including_ its >>>> destination (i.e. where the backedge isn't the last to-be-redirected edge) >>>> necessarily duplicates all code from that destination onto the back edge. >>>> Here it's the load from c[j] into sum_11. >>>> >>>> The important part is the code is emitted onto the back edge, >>>> conceptually; in reality it's simply included into the (new) latch block >>>> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after >>>> cfg_cleanup). >>>> >>>> That's what we can't have for some of our structural loop optimizers: >>>> there must be no code executed after the exit test (e.g. in the latch >>>> block). (This requirement makes reasoning about which code is or isn't >>>> executed completely for an iteration trivial; simply everything in the >>>> body is always executed; e.g. loop interchange uses this to check that >>>> there are no memory references after the exit test, because those would >>>> then be only conditional and hence make loop interchange very awkward). >>>> >>>> Note that this situation can't be later rectified anymore: the duplicated >>>> instructions (because they are memory refs) must remain after the exit >>>> test. Only by rerolling/unrotating the loop (i.e. noticing that the >>>> memory refs on the loop-entry path and on the back edge are equivalent) >>>> would that be possible, but that's something we aren't capable of. 
Even >>>> if we were that would simply just revert the whole work that the threader >>>> did, so it's better to not even do that to start with. >>>> >>>> I believe something like below would be appropriate, it disables threading >>>> if the path contains a latch at the non-last position (due to being >>>> backwards on the non-first position in the array). I.e. it disables >>>> rotating the loop if there's danger of polluting the back edge. It might >>>> be improved if the blocks following (preceding!) the latch are themself >>>> empty because then no code is duplicated. It might also be improved if >>>> the latch is already non-empty. That code should probably only be active >>>> before the loop optimizers, but currently the backward threader isn't >>>> differentiating between before/after loop-optims. >>>> >>>> I haven't tested this patch at all, except that it fixes the testcase :) >>> >>> Lame comment at the current end of the thread - it's not threading through the >> >> I don't know why y'all keep using the word "lame". On the contrary, >> these are incredibly useful explanations. Thanks. >> >>> latch but threading through the loop header that's problematic, at least if the >>> end of the threading path ends within the loop (threading through the header >>> to the loop exit is fine). Because in that situation you effectively created an >>> alternate loop entry. Threading through the latch into the loop header is >>> fine but with simple latches that likely will never happen (if there are no >>> simple latches then the latch can contain the loop exit test). >>> >>> See tree-ssa-threadupdate.c:thread_block_1 >>> >>> e2 = path->last ()->e; >>> if (!e2 || noloop_only) >>> { >>> /* If NOLOOP_ONLY is true, we only allow threading through the >>> header of a loop to exit edges. */ >>> >>> /* One case occurs when there was loop header buried in a jump >>> threading path that crosses loop boundaries. 
We do not try >>> and thread this elsewhere, so just cancel the jump threading >>> request by clearing the AUX field now. */ >>> if (bb->loop_father != e2->src->loop_father >>> && (!loop_exit_edge_p (e2->src->loop_father, e2) >>> || flow_loop_nested_p (bb->loop_father, >>> e2->dest->loop_father))) >>> { >>> /* Since this case is not handled by our special code >>> to thread through a loop header, we must explicitly >>> cancel the threading request here. */ >>> delete_jump_thread_path (path); >>> e->aux = NULL; >>> continue; >>> } >> >> But this is for a threading path that crosses loop boundaries, which is >> not the case. Perhaps we should restrict this further to threads within >> a loop? >> >>> >>> there are a lot of "useful" checks in this function and the backwards threader >>> should adopt those. Note the backwards threader originally only did >>> FSM style threadings which are exactly those possibly "harmful" ones, forming >>> irreducible regions at worst or sub-loops at best. That might explain the >>> lack of those checks. >> >> Also, the aforementioned checks are in jump_thread_path_registry, which >> is also shared by the backward threader. These are thread discards >> _after_ a thread has been registered. > > Yeah, that's indeed unfortunate. > >> The backward threader should also >> be using these restrictions. Unless, I'm missing some interaction with >> the FSM/etc threading types as per the preamble to the snippet you provided: >> >> if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners) >> || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners)) >> continue; > > Indeed. But I understand the backwards threader does not (only) do FSM > threading now. If it does, it was not part of my rewrite. I was careful to not touch anything dealing with either path profitability or low-level path registering. The path registering is in back_threader_registry::register_path(). We only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY. 
ISTM that those are only EDGE_FSM_THREADs?? > >> You are right though, there are a lot of checks throughout the entire >> forward threader that should be audited and somehow shared. It's on my >> back burner, but I'm running out of cycles here :-/. > > yeah, it's quite a mess indeed and merging the path > validity/costing/code-transform > paths of both threaders would be incredibly useful. For the VRP threader replacement work I am doing, I was hoping to use an enhanced version of the backward threader, but it was incredibly hard to compare forward and backward threads because of the difference in the cost models. I've opted for a hybrid approach using the forward threader, but with the path solver to resolve conditionals/etc. I hope to return to the merging of the threaders for the next release, after the dust settles. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 8:36 ` Aldy Hernandez @ 2021-09-09 8:58 ` Richard Biener 2021-09-09 9:21 ` Aldy Hernandez 2021-09-09 16:59 ` Jeff Law 0 siblings, 2 replies; 33+ messages in thread From: Richard Biener @ 2021-09-09 8:58 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez <aldyh@redhat.com> wrote: > > > > On 9/9/21 9:45 AM, Richard Biener wrote: > > On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <aldyh@redhat.com> wrote: > >> > >> > >> > >> On 9/9/21 8:57 AM, Richard Biener wrote: > >>> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: > >>>> > >>>> Hello, > >>>> > >>>> [lame answer to self] > >>>> > >>>> On Wed, 8 Sep 2021, Michael Matz wrote: > >>>> > >>>>>>> The forward threader guards against this by simply disallowing > >>>>>>> threadings that involve different loops. As I see > >>>>>> > >>>>>> The thread in question (5->9->3) is all within the same outer loop, > >>>>>> though. BTW, the backward threader also disallows threading across > >>>>>> different loops (see path_crosses_loops variable). > >>>> ... > >>>>> Maybe it's possible to not disable threading over latches alltogether in > >>>>> the backward threader (like it's tried now), but I haven't looked at the > >>>>> specific situation here in depth, so take my view only as opinion from a > >>>>> large distance :-) > >>>> > >>>> I've now looked at the concrete situation. So yeah, the whole path is in > >>>> the same loop, crosses the latch, _and there's code following the latch > >>>> on that path_. (I.e. the latch isn't the last block in the path). 
In > >>>> particular, after loop_optimizer_init() (before any threading) we have: > >>>> > >>>> <bb 3> [local count: 118111600]: > >>>> # j_19 = PHI <j_13(9), 0(7)> > >>>> sum_11 = c[j_19]; > >>>> if (n_10(D) > 0) > >>>> goto <bb 8>; [89.00%] > >>>> else > >>>> goto <bb 5>; [11.00%] > >>>> > >>>> <bb 8> [local count: 105119324]: > >>>> ... > >>>> > >>>> <bb 5> [local count: 118111600]: > >>>> # sum_21 = PHI <sum_14(4), sum_11(3)> > >>>> c[j_19] = sum_21; > >>>> j_13 = j_19 + 1; > >>>> if (n_10(D) > j_13) > >>>> goto <bb 9>; [89.00%] > >>>> else > >>>> goto <bb 6>; [11.00%] > >>>> > >>>> <bb 9> [local count: 105119324]: > >>>> goto <bb 3>; [100.00%] > >>>> > >>>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the > >>>> pre-header of inner loop, but more importantly something that's not at the > >>>> start of the outer loop. > >>>> > >>>> Now, any thread that includes the backedge 9->3 _including_ its > >>>> destination (i.e. where the backedge isn't the last to-be-redirected edge) > >>>> necessarily duplicates all code from that destination onto the back edge. > >>>> Here it's the load from c[j] into sum_11. > >>>> > >>>> The important part is the code is emitted onto the back edge, > >>>> conceptually; in reality it's simply included into the (new) latch block > >>>> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after > >>>> cfg_cleanup). > >>>> > >>>> That's what we can't have for some of our structural loop optimizers: > >>>> there must be no code executed after the exit test (e.g. in the latch > >>>> block). (This requirement makes reasoning about which code is or isn't > >>>> executed completely for an iteration trivial; simply everything in the > >>>> body is always executed; e.g. loop interchange uses this to check that > >>>> there are no memory references after the exit test, because those would > >>>> then be only conditional and hence make loop interchange very awkward). 
> >>>> > >>>> Note that this situation can't be later rectified anymore: the duplicated > >>>> instructions (because they are memory refs) must remain after the exit > >>>> test. Only by rerolling/unrotating the loop (i.e. noticing that the > >>>> memory refs on the loop-entry path and on the back edge are equivalent) > >>>> would that be possible, but that's something we aren't capable of. Even > >>>> if we were that would simply just revert the whole work that the threader > >>>> did, so it's better to not even do that to start with. > >>>> > >>>> I believe something like below would be appropriate, it disables threading > >>>> if the path contains a latch at the non-last position (due to being > >>>> backwards on the non-first position in the array). I.e. it disables > >>>> rotating the loop if there's danger of polluting the back edge. It might > >>>> be improved if the blocks following (preceding!) the latch are themself > >>>> empty because then no code is duplicated. It might also be improved if > >>>> the latch is already non-empty. That code should probably only be active > >>>> before the loop optimizers, but currently the backward threader isn't > >>>> differentiating between before/after loop-optims. > >>>> > >>>> I haven't tested this patch at all, except that it fixes the testcase :) > >>> > >>> Lame comment at the current end of the thread - it's not threading through the > >> > >> I don't know why y'all keep using the word "lame". On the contrary, > >> these are incredibly useful explanations. Thanks. > >> > >>> latch but threading through the loop header that's problematic, at least if the > >>> end of the threading path ends within the loop (threading through the header > >>> to the loop exit is fine). Because in that situation you effectively created an > >>> alternate loop entry. 
Threading through the latch into the loop header is > >>> fine but with simple latches that likely will never happen (if there are no > >>> simple latches then the latch can contain the loop exit test). > >>> > >>> See tree-ssa-threadupdate.c:thread_block_1 > >>> > >>> e2 = path->last ()->e; > >>> if (!e2 || noloop_only) > >>> { > >>> /* If NOLOOP_ONLY is true, we only allow threading through the > >>> header of a loop to exit edges. */ > >>> > >>> /* One case occurs when there was loop header buried in a jump > >>> threading path that crosses loop boundaries. We do not try > >>> and thread this elsewhere, so just cancel the jump threading > >>> request by clearing the AUX field now. */ > >>> if (bb->loop_father != e2->src->loop_father > >>> && (!loop_exit_edge_p (e2->src->loop_father, e2) > >>> || flow_loop_nested_p (bb->loop_father, > >>> e2->dest->loop_father))) > >>> { > >>> /* Since this case is not handled by our special code > >>> to thread through a loop header, we must explicitly > >>> cancel the threading request here. */ > >>> delete_jump_thread_path (path); > >>> e->aux = NULL; > >>> continue; > >>> } > >> > >> But this is for a threading path that crosses loop boundaries, which is > >> not the case. Perhaps we should restrict this further to threads within > >> a loop? > >> > >>> > >>> there are a lot of "useful" checks in this function and the backwards threader > >>> should adopt those. Note the backwards threader originally only did > >>> FSM style threadings which are exactly those possibly "harmful" ones, forming > >>> irreducible regions at worst or sub-loops at best. That might explain the > >>> lack of those checks. > >> > >> Also, the aforementioned checks are in jump_thread_path_registry, which > >> is also shared by the backward threader. These are thread discards > >> _after_ a thread has been registered. > > > > Yeah, that's indeed unfortunate. > > > >> The backward threader should also > >> be using these restrictions. 
Unless, I'm missing some interaction with > >> the FSM/etc threading types as per the preamble to the snippet you provided: > >> > >> if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners) > >> || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners)) > >> continue; > > > > Indeed. But I understand the backwards threader does not (only) do FSM > > threading now. > > If it does, it was not part of my rewrite. I was careful to not touch > anything dealing with either path profitability or low-level path > registering. > > The path registering is in back_threader_registry::register_path(). We > only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY. ISTM that > those are only EDGE_FSM_THREADs?? Well, if the backwards threader classifies everything as FSM that's probably inaccurate since only threads through the loop latch are "FSM". There is the comment /* If this path does not thread through the loop latch, then we are using the FSM threader to find old style jump threads. This is good, except the FSM threader does not re-use an existing threading path to reduce code duplication. So for that case, drastically reduce the number of statements we are allowed to copy. */ so these cases should use the "old style" validity/costing metrics and thus classify threading opportunities in a different way? I think today "backwards" vs, "forwards" only refers to the way we find threading opportunities. > > > >> You are right though, there are a lot of checks throughout the entire > >> forward threader that should be audited and somehow shared. It's on my > >> back burner, but I'm running out of cycles here :-/. > > > > yeah, it's quite a mess indeed and merging the path > > validity/costing/code-transform > > paths of both threaders would be incredibly useful. 
> > For the VRP threader replacement work I am doing, I was hoping to use an > enhanced version of the backward threader, but it was incredibly hard to > compare forward and backward threads, because the difference in the cost > models. I've opted for a hybrid approach using the forward threader, > but with the path solver to resolve conditionals/etc. > > I hope to return to the merging of the threaders for the next release, > after the dust settles. Hmm, I see. Richard. > Aldy > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 8:58 ` Richard Biener @ 2021-09-09 9:21 ` Aldy Hernandez 2021-09-09 10:15 ` Richard Biener 2021-09-10 15:43 ` Jeff Law 2021-09-09 16:59 ` Jeff Law 1 sibling, 2 replies; 33+ messages in thread From: Aldy Hernandez @ 2021-09-09 9:21 UTC (permalink / raw) To: Richard Biener; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On 9/9/21 10:58 AM, Richard Biener wrote: > On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez <aldyh@redhat.com> wrote: >> >> >> >> On 9/9/21 9:45 AM, Richard Biener wrote: >>> On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <aldyh@redhat.com> wrote: >>>> >>>> >>>> >>>> On 9/9/21 8:57 AM, Richard Biener wrote: >>>>> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: >>>>>> >>>>>> Hello, >>>>>> >>>>>> [lame answer to self] >>>>>> >>>>>> On Wed, 8 Sep 2021, Michael Matz wrote: >>>>>> >>>>>>>>> The forward threader guards against this by simply disallowing >>>>>>>>> threadings that involve different loops. As I see >>>>>>>> >>>>>>>> The thread in question (5->9->3) is all within the same outer loop, >>>>>>>> though. BTW, the backward threader also disallows threading across >>>>>>>> different loops (see path_crosses_loops variable). >>>>>> ... >>>>>>> Maybe it's possible to not disable threading over latches alltogether in >>>>>>> the backward threader (like it's tried now), but I haven't looked at the >>>>>>> specific situation here in depth, so take my view only as opinion from a >>>>>>> large distance :-) >>>>>> >>>>>> I've now looked at the concrete situation. So yeah, the whole path is in >>>>>> the same loop, crosses the latch, _and there's code following the latch >>>>>> on that path_. (I.e. the latch isn't the last block in the path). 
In >>>>>> particular, after loop_optimizer_init() (before any threading) we have: >>>>>> >>>>>> <bb 3> [local count: 118111600]: >>>>>> # j_19 = PHI <j_13(9), 0(7)> >>>>>> sum_11 = c[j_19]; >>>>>> if (n_10(D) > 0) >>>>>> goto <bb 8>; [89.00%] >>>>>> else >>>>>> goto <bb 5>; [11.00%] >>>>>> >>>>>> <bb 8> [local count: 105119324]: >>>>>> ... >>>>>> >>>>>> <bb 5> [local count: 118111600]: >>>>>> # sum_21 = PHI <sum_14(4), sum_11(3)> >>>>>> c[j_19] = sum_21; >>>>>> j_13 = j_19 + 1; >>>>>> if (n_10(D) > j_13) >>>>>> goto <bb 9>; [89.00%] >>>>>> else >>>>>> goto <bb 6>; [11.00%] >>>>>> >>>>>> <bb 9> [local count: 105119324]: >>>>>> goto <bb 3>; [100.00%] >>>>>> >>>>>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the >>>>>> pre-header of inner loop, but more importantly something that's not at the >>>>>> start of the outer loop. >>>>>> >>>>>> Now, any thread that includes the backedge 9->3 _including_ its >>>>>> destination (i.e. where the backedge isn't the last to-be-redirected edge) >>>>>> necessarily duplicates all code from that destination onto the back edge. >>>>>> Here it's the load from c[j] into sum_11. >>>>>> >>>>>> The important part is the code is emitted onto the back edge, >>>>>> conceptually; in reality it's simply included into the (new) latch block >>>>>> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after >>>>>> cfg_cleanup). >>>>>> >>>>>> That's what we can't have for some of our structural loop optimizers: >>>>>> there must be no code executed after the exit test (e.g. in the latch >>>>>> block). (This requirement makes reasoning about which code is or isn't >>>>>> executed completely for an iteration trivial; simply everything in the >>>>>> body is always executed; e.g. loop interchange uses this to check that >>>>>> there are no memory references after the exit test, because those would >>>>>> then be only conditional and hence make loop interchange very awkward). 
>>>>>> >>>>>> Note that this situation can't be later rectified anymore: the duplicated >>>>>> instructions (because they are memory refs) must remain after the exit >>>>>> test. Only by rerolling/unrotating the loop (i.e. noticing that the >>>>>> memory refs on the loop-entry path and on the back edge are equivalent) >>>>>> would that be possible, but that's something we aren't capable of. Even >>>>>> if we were that would simply just revert the whole work that the threader >>>>>> did, so it's better to not even do that to start with. >>>>>> >>>>>> I believe something like below would be appropriate, it disables threading >>>>>> if the path contains a latch at the non-last position (due to being >>>>>> backwards on the non-first position in the array). I.e. it disables >>>>>> rotating the loop if there's danger of polluting the back edge. It might >>>>>> be improved if the blocks following (preceding!) the latch are themself >>>>>> empty because then no code is duplicated. It might also be improved if >>>>>> the latch is already non-empty. That code should probably only be active >>>>>> before the loop optimizers, but currently the backward threader isn't >>>>>> differentiating between before/after loop-optims. >>>>>> >>>>>> I haven't tested this patch at all, except that it fixes the testcase :) >>>>> >>>>> Lame comment at the current end of the thread - it's not threading through the >>>> >>>> I don't know why y'all keep using the word "lame". On the contrary, >>>> these are incredibly useful explanations. Thanks. >>>> >>>>> latch but threading through the loop header that's problematic, at least if the >>>>> end of the threading path ends within the loop (threading through the header >>>>> to the loop exit is fine). Because in that situation you effectively created an >>>>> alternate loop entry. 
Threading through the latch into the loop header is >>>>> fine but with simple latches that likely will never happen (if there are no >>>>> simple latches then the latch can contain the loop exit test). >>>>> >>>>> See tree-ssa-threadupdate.c:thread_block_1 >>>>> >>>>> e2 = path->last ()->e; >>>>> if (!e2 || noloop_only) >>>>> { >>>>> /* If NOLOOP_ONLY is true, we only allow threading through the >>>>> header of a loop to exit edges. */ >>>>> >>>>> /* One case occurs when there was loop header buried in a jump >>>>> threading path that crosses loop boundaries. We do not try >>>>> and thread this elsewhere, so just cancel the jump threading >>>>> request by clearing the AUX field now. */ >>>>> if (bb->loop_father != e2->src->loop_father >>>>> && (!loop_exit_edge_p (e2->src->loop_father, e2) >>>>> || flow_loop_nested_p (bb->loop_father, >>>>> e2->dest->loop_father))) >>>>> { >>>>> /* Since this case is not handled by our special code >>>>> to thread through a loop header, we must explicitly >>>>> cancel the threading request here. */ >>>>> delete_jump_thread_path (path); >>>>> e->aux = NULL; >>>>> continue; >>>>> } >>>> >>>> But this is for a threading path that crosses loop boundaries, which is >>>> not the case. Perhaps we should restrict this further to threads within >>>> a loop? >>>> >>>>> >>>>> there are a lot of "useful" checks in this function and the backwards threader >>>>> should adopt those. Note the backwards threader originally only did >>>>> FSM style threadings which are exactly those possibly "harmful" ones, forming >>>>> irreducible regions at worst or sub-loops at best. That might explain the >>>>> lack of those checks. >>>> >>>> Also, the aforementioned checks are in jump_thread_path_registry, which >>>> is also shared by the backward threader. These are thread discards >>>> _after_ a thread has been registered. >>> >>> Yeah, that's indeed unfortunate. >>> >>>> The backward threader should also >>>> be using these restrictions. 
Unless, I'm missing some interaction with >>>> the FSM/etc threading types as per the preamble to the snippet you provided: >>>> >>>> if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners) >>>> || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners)) >>>> continue; >>> >>> Indeed. But I understand the backwards threader does not (only) do FSM >>> threading now. >> >> If it does, it was not part of my rewrite. I was careful to not touch >> anything dealing with either path profitability or low-level path >> registering. >> >> The path registering is in back_threader_registry::register_path(). We >> only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY. ISTM that >> those are only EDGE_FSM_THREADs?? > > Well, if the backwards threader classifies everything as FSM that's probably > inaccurate since only threads through the loop latch are "FSM". There is > the comment > > /* If this path does not thread through the loop latch, then we are > using the FSM threader to find old style jump threads. This > is good, except the FSM threader does not re-use an existing > threading path to reduce code duplication. > > So for that case, drastically reduce the number of statements > we are allowed to copy. */ *blink* Woah. The backward threader has been using FSM threads indiscriminately as far as I can remember. I wonder what would break if we "fixed it". > > so these cases should use the "old style" validity/costing metrics and thus > classify threading opportunities in a different way? Jeff, do you have any insight here? > > I think today "backwards" vs, "forwards" only refers to the way we find > threading opportunities. Yes, it's a mess. I ran some experiments a while back, and my current work on the enhanced solver/threader, can fold virtually everything the DOM/threader gets (even with its use of const_and_copies, avail_exprs, and evrp_range_analyzer), while getting 5% more DOM threads and 1% more overall threads. 
That is, I've been testing whether the path solver can solve everything the DOM threader needs (the hybrid approach I mentioned). Unfortunately, replacing the forward threader right now is not feasible for a few reasons: a) The const_and_copies/avail_exprs relation framework can do floats, and that's next year's ranger work. b) Even though we can seemingly fold everything DOM/threader does, in order to replace it with a backward threader instance we'd have to merge the cost/profitability code scattered throughout the forward threader, as well as the EDGE_FSM* / EDGE_COPY* business. c) DOM changes the IL as it goes, though we could conceivably do the threading after DOM is done. But I digress... Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 9:21 ` Aldy Hernandez @ 2021-09-09 10:15 ` Richard Biener 2021-09-09 11:28 ` Aldy Hernandez 2021-09-10 15:51 ` Jeff Law 2021-09-10 15:43 ` Jeff Law 1 sibling, 2 replies; 33+ messages in thread From: Richard Biener @ 2021-09-09 10:15 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On Thu, Sep 9, 2021 at 11:21 AM Aldy Hernandez <aldyh@redhat.com> wrote: > > > > On 9/9/21 10:58 AM, Richard Biener wrote: > > On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez <aldyh@redhat.com> wrote: > >> > >> > >> > >> On 9/9/21 9:45 AM, Richard Biener wrote: > >>> On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <aldyh@redhat.com> wrote: > >>>> > >>>> > >>>> > >>>> On 9/9/21 8:57 AM, Richard Biener wrote: > >>>>> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: > >>>>>> > >>>>>> Hello, > >>>>>> > >>>>>> [lame answer to self] > >>>>>> > >>>>>> On Wed, 8 Sep 2021, Michael Matz wrote: > >>>>>> > >>>>>>>>> The forward threader guards against this by simply disallowing > >>>>>>>>> threadings that involve different loops. As I see > >>>>>>>> > >>>>>>>> The thread in question (5->9->3) is all within the same outer loop, > >>>>>>>> though. BTW, the backward threader also disallows threading across > >>>>>>>> different loops (see path_crosses_loops variable). > >>>>>> ... > >>>>>>> Maybe it's possible to not disable threading over latches alltogether in > >>>>>>> the backward threader (like it's tried now), but I haven't looked at the > >>>>>>> specific situation here in depth, so take my view only as opinion from a > >>>>>>> large distance :-) > >>>>>> > >>>>>> I've now looked at the concrete situation. So yeah, the whole path is in > >>>>>> the same loop, crosses the latch, _and there's code following the latch > >>>>>> on that path_. (I.e. the latch isn't the last block in the path). 
In > >>>>>> particular, after loop_optimizer_init() (before any threading) we have: > >>>>>> > >>>>>> <bb 3> [local count: 118111600]: > >>>>>> # j_19 = PHI <j_13(9), 0(7)> > >>>>>> sum_11 = c[j_19]; > >>>>>> if (n_10(D) > 0) > >>>>>> goto <bb 8>; [89.00%] > >>>>>> else > >>>>>> goto <bb 5>; [11.00%] > >>>>>> > >>>>>> <bb 8> [local count: 105119324]: > >>>>>> ... > >>>>>> > >>>>>> <bb 5> [local count: 118111600]: > >>>>>> # sum_21 = PHI <sum_14(4), sum_11(3)> > >>>>>> c[j_19] = sum_21; > >>>>>> j_13 = j_19 + 1; > >>>>>> if (n_10(D) > j_13) > >>>>>> goto <bb 9>; [89.00%] > >>>>>> else > >>>>>> goto <bb 6>; [11.00%] > >>>>>> > >>>>>> <bb 9> [local count: 105119324]: > >>>>>> goto <bb 3>; [100.00%] > >>>>>> > >>>>>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the > >>>>>> pre-header of inner loop, but more importantly something that's not at the > >>>>>> start of the outer loop. > >>>>>> > >>>>>> Now, any thread that includes the backedge 9->3 _including_ its > >>>>>> destination (i.e. where the backedge isn't the last to-be-redirected edge) > >>>>>> necessarily duplicates all code from that destination onto the back edge. > >>>>>> Here it's the load from c[j] into sum_11. > >>>>>> > >>>>>> The important part is the code is emitted onto the back edge, > >>>>>> conceptually; in reality it's simply included into the (new) latch block > >>>>>> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after > >>>>>> cfg_cleanup). > >>>>>> > >>>>>> That's what we can't have for some of our structural loop optimizers: > >>>>>> there must be no code executed after the exit test (e.g. in the latch > >>>>>> block). (This requirement makes reasoning about which code is or isn't > >>>>>> executed completely for an iteration trivial; simply everything in the > >>>>>> body is always executed; e.g. 
loop interchange uses this to check that > >>>>>> there are no memory references after the exit test, because those would > >>>>>> then be only conditional and hence make loop interchange very awkward). > >>>>>> > >>>>>> Note that this situation can't be later rectified anymore: the duplicated > >>>>>> instructions (because they are memory refs) must remain after the exit > >>>>>> test. Only by rerolling/unrotating the loop (i.e. noticing that the > >>>>>> memory refs on the loop-entry path and on the back edge are equivalent) > >>>>>> would that be possible, but that's something we aren't capable of. Even > >>>>>> if we were that would simply just revert the whole work that the threader > >>>>>> did, so it's better to not even do that to start with. > >>>>>> > >>>>>> I believe something like below would be appropriate, it disables threading > >>>>>> if the path contains a latch at the non-last position (due to being > >>>>>> backwards on the non-first position in the array). I.e. it disables > >>>>>> rotating the loop if there's danger of polluting the back edge. It might > >>>>>> be improved if the blocks following (preceding!) the latch are themself > >>>>>> empty because then no code is duplicated. It might also be improved if > >>>>>> the latch is already non-empty. That code should probably only be active > >>>>>> before the loop optimizers, but currently the backward threader isn't > >>>>>> differentiating between before/after loop-optims. > >>>>>> > >>>>>> I haven't tested this patch at all, except that it fixes the testcase :) > >>>>> > >>>>> Lame comment at the current end of the thread - it's not threading through the > >>>> > >>>> I don't know why y'all keep using the word "lame". On the contrary, > >>>> these are incredibly useful explanations. Thanks. 
> >>>> > >>>>> latch but threading through the loop header that's problematic, at least if the > >>>>> end of the threading path ends within the loop (threading through the header > >>>>> to the loop exit is fine). Because in that situation you effectively created an > >>>>> alternate loop entry. Threading through the latch into the loop header is > >>>>> fine but with simple latches that likely will never happen (if there are no > >>>>> simple latches then the latch can contain the loop exit test). > >>>>> > >>>>> See tree-ssa-threadupdate.c:thread_block_1 > >>>>> > >>>>> e2 = path->last ()->e; > >>>>> if (!e2 || noloop_only) > >>>>> { > >>>>> /* If NOLOOP_ONLY is true, we only allow threading through the > >>>>> header of a loop to exit edges. */ > >>>>> > >>>>> /* One case occurs when there was loop header buried in a jump > >>>>> threading path that crosses loop boundaries. We do not try > >>>>> and thread this elsewhere, so just cancel the jump threading > >>>>> request by clearing the AUX field now. */ > >>>>> if (bb->loop_father != e2->src->loop_father > >>>>> && (!loop_exit_edge_p (e2->src->loop_father, e2) > >>>>> || flow_loop_nested_p (bb->loop_father, > >>>>> e2->dest->loop_father))) > >>>>> { > >>>>> /* Since this case is not handled by our special code > >>>>> to thread through a loop header, we must explicitly > >>>>> cancel the threading request here. */ > >>>>> delete_jump_thread_path (path); > >>>>> e->aux = NULL; > >>>>> continue; > >>>>> } > >>>> > >>>> But this is for a threading path that crosses loop boundaries, which is > >>>> not the case. Perhaps we should restrict this further to threads within > >>>> a loop? > >>>> > >>>>> > >>>>> there are a lot of "useful" checks in this function and the backwards threader > >>>>> should adopt those. Note the backwards threader originally only did > >>>>> FSM style threadings which are exactly those possibly "harmful" ones, forming > >>>>> irreducible regions at worst or sub-loops at best. 
That might explain the > >>>>> lack of those checks. > >>>> > >>>> Also, the aforementioned checks are in jump_thread_path_registry, which > >>>> is also shared by the backward threader. These are thread discards > >>>> _after_ a thread has been registered. > >>> > >>> Yeah, that's indeed unfortunate. > >>> > >>>> The backward threader should also > >>>> be using these restrictions. Unless, I'm missing some interaction with > >>>> the FSM/etc threading types as per the preamble to the snippet you provided: > >>>> > >>>> if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners) > >>>> || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners)) > >>>> continue; > >>> > >>> Indeed. But I understand the backwards threader does not (only) do FSM > >>> threading now. > >> > >> If it does, it was not part of my rewrite. I was careful to not touch > >> anything dealing with either path profitability or low-level path > >> registering. > >> > >> The path registering is in back_threader_registry::register_path(). We > >> only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY. ISTM that > >> those are only EDGE_FSM_THREADs?? > > > > Well, if the backwards threader classifies everything as FSM that's probably > > inaccurate since only threads through the loop latch are "FSM". There is > > the comment > > > > /* If this path does not thread through the loop latch, then we are > > using the FSM threader to find old style jump threads. This > > is good, except the FSM threader does not re-use an existing > > threading path to reduce code duplication. > > > > So for that case, drastically reduce the number of statements > > we are allowed to copy. */ > > *blink* > > Woah. The backward threader has been using FSM threads indiscriminately > as far as I can remember. I wonder what would break if we "fixed it". > > > > > so these cases should use the "old style" validity/costing metrics and thus > > classify threading opportunities in a different way? 
> > Jeff, do you have any insight here? > > > I think today "backwards" vs. "forwards" only refers to the way we find > threading opportunities. > > Yes, it's a mess. > > I ran some experiments a while back, and my current work on the enhanced > solver/threader can fold virtually everything the DOM/threader gets > (even with its use of const_and_copies, avail_exprs, and > evrp_range_analyzer), while getting 5% more DOM threads and 1% more > overall threads. That is, I've been testing if the path solver can > solve everything the DOM threader needs (the hybrid approach I mentioned). > > Unfortunately, replacing the forward threader right now is not feasible > for a few reasons: > > a) The const_and_copies/avail_exprs relation framework can do floats, > and that's next year's ranger work. Hmm, it sounds like relying fully on ranger is a bit premature then. > b) Even though we can seemingly fold everything DOM/threader does, in > order to replace it with a backward threader instance we'd have to merge > the cost/profitability code scattered throughout the forward threader, > as well as the EDGE_FSM* / EDGE_COPY* business. > > c) DOM changes the IL as it goes. Though we could conceivably divorce > the two and do the threading after DOM is done. Yeah, it does not actually process/simplify the blocks copied by threading. In fact I think it only registers jump threading opportunities during the DOM walk and commits them only later. But it of course uses its avail / copies stack to find them - that part you cannot easily divorce. DOM is also yet another value-numbering framework - one could think of stripping it down from that side, keeping the threading bits only (but you'd still have the avail / copies bits). That said, it has one nice property it can leverage due to its incredibly simple memory redundancy handling, in that it usually performs far fewer alias queries than FRE (even when you run the latter in non-iterative mode).
But the same way DOM can register jump threading opportunities FRE could do as well. Richard. > But I digress... > Aldy > ^ permalink raw reply [flat|nested] 33+ messages in thread
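[Editorial aside: Matz's structural restriction from earlier in the thread — reject any candidate path in which the loop latch appears at a non-last position, because everything following the latch would be duplicated onto the back edge — can be sketched with a toy block type. This is an illustration of the rule only, not the actual GCC patch; the `block` struct and function name are invented.]

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model, not GCC code: a candidate threading path is a sequence of
// basic blocks, one of which may be the latch of the loop being threaded.
struct block
{
  int id;
  bool is_latch;
};

// Matz's proposed restriction, restated: reject any path in which the
// loop latch appears at a non-last position.  Every block that follows
// the latch on the path would be duplicated onto the back edge, leaving
// code (here, the load from c[j]) executing after the exit test.
static bool
latch_position_ok (const std::vector<block> &path)
{
  for (std::size_t i = 0; i + 1 < path.size (); ++i)
    if (path[i].is_latch)
      return false;
  return true;
}
```

For the loop-interchange-9.c regression, the path starts with the 9->3 backedge and includes its destination: bb9 (the latch) sits first, so the rule rejects it.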
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 10:15 ` Richard Biener @ 2021-09-09 11:28 ` Aldy Hernandez 2021-09-10 15:51 ` Jeff Law 1 sibling, 0 replies; 33+ messages in thread From: Aldy Hernandez @ 2021-09-09 11:28 UTC (permalink / raw) To: Richard Biener; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On 9/9/21 12:15 PM, Richard Biener wrote: > On Thu, Sep 9, 2021 at 11:21 AM Aldy Hernandez <aldyh@redhat.com> wrote: >> >> >> >> On 9/9/21 10:58 AM, Richard Biener wrote: >> I ran some experiments a while back, and my current work on the enhanced >> solver/threader, can fold virtually everything the DOM/threader gets >> (even with its use of const_and_copies, avail_exprs, and >> evrp_range_analyzer), while getting 5% more DOM threads and 1% more >> overall threads. That is, I've been testing if the path solver can >> solve everything the DOM threader needs (the hybrid approach I mentioned). >> >> Unfortunately, replacing the forward threader right now is not feasible >> for a few reasons: >> >> a) The const_and_copies/avail_exprs relation framework can do floats, >> and that's next year's ranger work. > > Hmm, it sounds like relying fully on ranger is a bit premature then. For floats, most definitely. Ranger doesn't do them. They're for the next release. However, const_and_copies/avail_exprs for integers we can do just fine, that's why evrp/VRP are much easier targets. > >> b) Even though we can seemingly fold everything DOM/threader does, in >> order to replace it with a backward threader instance we'd have to merge >> the cost/profitability code scattered throughout the forward threader, >> as well as the EDGE_FSM* / EDGE_COPY* business. >> >> c) DOM changes the IL as it goes. Though we could conceivably divorce >> do the threading after DOM is done. > > Yeah, it does not actually process/simplify the blocks copied by threading. 
> In fact I think it only registers jump threading opportunities during the DOM > walk and commits them only later. But it of course uses its avail / copies > stack to find them - that part you cannot easily divorce. Yes, all the threaders register paths ahead of time and only do the actual CFG shuffling at the end with thread_through_all_blocks(). > > DOM is also yet another value-numbering framework - one could think > of stripping it down from that side, keeping the threading bits only > (but you'd still have the avail / copies bits). Ughh, yeah. I've seen some random missed cases because of the value-numbering it does :-/. It's very frustrating that we have various ad-hoc VN methods. > > That said, it has one nice property it can leverage due to its incredibly > simple memory redundancy handling, in that it usually performs way less > alias queries than FRE (even when you run the latter in non-iterative mode). > > But the same way DOM can register jump threading opportunities FRE > could do as well. My plan for DOM/threading for this release is to replace its evrp analyzer with a path solver (path_range_query). We can still use the avail/copies, and everything else should just work. I'm hand waiving on a few missed cases due to IL changing, but last time I checked we got way more in return. Andrew even thought we should get most things even in the presence of changing IL, but I haven't distilled testcases for him. For VRP/threading, the same hybrid thing, but replacing the ASSERT_EXPRs, and the avail/copies with a path solver. There are no floats here. Things are looking good in my tests. Once we do floats, if we could merge the EDGE_*_THREAD and the cost/profit bits between both threaders, we should be able to replace the entire forward threader with a backward client. Hopefully I don't run out of steam by then ;-). Thanks for your insight on DOM. It's very helpful. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
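[Editorial aside: the two-phase scheme described above — every threader registers candidate paths up front and the CFG is only rewritten at the end via thread_through_all_blocks() — reduces to something like the following toy registry. The class and its members are invented for illustration and are far simpler than the real jump_thread_path_registry.]

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy two-phase jump-thread registry.  Phase one: passes register
// candidate paths (here just sequences of block ids) while walking the
// IL.  Phase two: all registered paths are committed at once, so the
// CFG is never rewritten underneath an ongoing analysis.
class thread_registry
{
  std::vector<std::vector<int>> m_paths;

public:
  void
  register_path (std::vector<int> path)
  {
    m_paths.push_back (std::move (path));
  }

  // Commit every registered path; returns how many were threaded.
  std::size_t
  thread_through_all_blocks ()
  {
    std::size_t n = m_paths.size ();
    // Real code would duplicate blocks and redirect edges here.
    m_paths.clear ();
    return n;
  }
};
```

The separation is what lets DOM keep walking an unchanged IL while threading opportunities accumulate.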
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 10:15 ` Richard Biener 2021-09-09 11:28 ` Aldy Hernandez @ 2021-09-10 15:51 ` Jeff Law 2021-09-10 16:11 ` Aldy Hernandez 1 sibling, 1 reply; 33+ messages in thread From: Jeff Law @ 2021-09-10 15:51 UTC (permalink / raw) To: Richard Biener, Aldy Hernandez Cc: Michael Matz, GCC Mailing List, Andrew MacLeod On 9/9/2021 4:15 AM, Richard Biener wrote: > >> b) Even though we can seemingly fold everything DOM/threader does, in >> order to replace it with a backward threader instance we'd have to merge >> the cost/profitability code scattered throughout the forward threader, >> as well as the EDGE_FSM* / EDGE_COPY* business. >> >> c) DOM changes the IL as it goes. Though we could conceivably divorce >> do the threading after DOM is done. > Yeah, it does not actually process/simplify the blocks copied by threading. > In fact I think it only registers jump threading opportunities during the DOM > walk and commits them only later. But it of course uses its avail / copies > stack to find them - that part you cannot easily divorce. Well, divorcing from using the context sensitive avail/copies is part of what Aldy & Andrew have been working on. All indications I've seen are they're on track to be able to do that. And yes, it only registers the threads and waits until after DOM is done to transform the CFG. That in and of itself introduces all kinds of complexity. If we can get to the point where we don't need the context sensitive avail/copies, then we've got a real shot at untangling DOM and threading which would be a huge maintainability win in my mind. > > DOM is also yet another value-numbering framework - one could think > of stripping it down from that side, keeping the threading bits only > (but you'd still have the avail / copies bits). Yes. I think you and I touched on this a while back. 
At a high level I'd prefer to have FRE rather than DOM doing the bulk of the redundant expression elimination. The big blocker there was the tight integration of DOM and threading. But if Aldy can untangle that we can then evaluate replacing DOM with FRE. > > That said, it has one nice property it can leverage due to its incredibly > simple memory redundancy handling, in that it usually performs way less > alias queries than FRE (even when you run the latter in non-iterative mode). DOM as an infrastructure for optimization is probably reaching the end of its useful life. FRE has a lot more going for it. > > But the same way DOM can register jump threading opportunities FRE > could do as well. I'd advise against that and instead look towards a model where no pass has integrated jump threading and the only jump threading module we have is the backwards threader. jeff ^ permalink raw reply [flat|nested] 33+ messages in thread
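[Editorial aside: the redundancy elimination that both DOM and FRE perform boils down to hash-based value numbering — two expressions over the same operand values get the same value number, so the later one is redundant. A minimal sketch, purely illustrative and far simpler than either pass; the class and method names are invented.]

```cpp
#include <cassert>
#include <map>
#include <tuple>

// Minimal hash-based value numbering: two expressions with the same
// opcode and the same operand value numbers receive the same value
// number, so the second computation is redundant and can be replaced
// by the first one's result.
class value_table
{
  std::map<std::tuple<char, int, int>, int> m_table;
  int m_next_vn = 0;

public:
  // Returns the value number for OP (VN1, VN2); an already-present
  // entry means the expression is redundant.
  int
  lookup_or_add (char op, int vn1, int vn2, bool *redundant)
  {
    auto key = std::make_tuple (op, vn1, vn2);
    auto it = m_table.find (key);
    *redundant = (it != m_table.end ());
    if (*redundant)
      return it->second;
    int vn = m_next_vn++;
    m_table.emplace (key, vn);
    return vn;
  }
};
```

DOM builds such a table along the dominator walk; FRE does it over the whole function with proper alias handling, which is why it catches more.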
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-10 15:51 ` Jeff Law @ 2021-09-10 16:11 ` Aldy Hernandez 0 siblings, 0 replies; 33+ messages in thread From: Aldy Hernandez @ 2021-09-10 16:11 UTC (permalink / raw) To: Jeff Law, Richard Biener; +Cc: Michael Matz, GCC Mailing List, Andrew MacLeod On 9/10/21 5:51 PM, Jeff Law wrote: > > > On 9/9/2021 4:15 AM, Richard Biener wrote: >> >>> b) Even though we can seemingly fold everything DOM/threader does, in >>> order to replace it with a backward threader instance we'd have to merge >>> the cost/profitability code scattered throughout the forward threader, >>> as well as the EDGE_FSM* / EDGE_COPY* business. >>> >>> c) DOM changes the IL as it goes. Though we could conceivably divorce >>> do the threading after DOM is done. >> Yeah, it does not actually process/simplify the blocks copied by >> threading. >> In fact I think it only registers jump threading opportunities during >> the DOM >> walk and commits them only later. But it of course uses its avail / >> copies >> stack to find them - that part you cannot easily divorce. > Well, divorcing from using the context sensitive avail/copies is part of > what Aldy & Andrew have been working on. All indications I've seen are > they're on track to be able to do that. > > And yes, it only registers the threads and waits until after DOM is done > to transform the CFG. That in and of itself introduces all kinds of > complexity. If we can get to the point where we don't need the context > sensitive avail/copies, then we've got a real shot at untangling DOM and > threading which would be a huge maintainability win in my mind. > >> >> DOM is also yet another value-numbering framework - one could think >> of stripping it down from that side, keeping the threading bits only >> (but you'd still have the avail / copies bits). > Yes. I think you and I touched on this a while back. 
At a high level > I'd prefer to have FRE rather than DOM doing the bulk of the redundant > expression elimination. The big blocker there was the tight integration > of DOM and threading. But if Aldy can untangle that we can then > evaluate replacing DOM with FRE. Once ranger does floats, I can't think of anything the forward threader could get that the backward threader couldn't. > > >> >> That said, it has one nice property it can leverage due to its incredibly >> simple memory redundancy handling, in that it usually performs way less >> alias queries than FRE (even when you run the latter in non-iterative >> mode). > DOM as an infrastructure for optimization is probably reaching the end > of its useful life. FRE has a lot more going for it. > >> >> But the same way DOM can register jump threading opportunities FRE >> could do as well. > I'd advise against that and instead look towards a model where no pass > has integrated jump threading and the only jump threading module we have > is the backwards threader. Yes, please. We need to separate jump threading from all passes. One thing, and do it well. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 9:21 ` Aldy Hernandez 2021-09-09 10:15 ` Richard Biener @ 2021-09-10 15:43 ` Jeff Law 2021-09-10 16:05 ` Aldy Hernandez 1 sibling, 1 reply; 33+ messages in thread From: Jeff Law @ 2021-09-10 15:43 UTC (permalink / raw) To: Aldy Hernandez, Richard Biener Cc: Michael Matz, GCC Mailing List, Andrew MacLeod On 9/9/2021 3:21 AM, Aldy Hernandez wrote: > >> >> /* If this path does not thread through the loop latch, then we are >> using the FSM threader to find old style jump threads. This >> is good, except the FSM threader does not re-use an existing >> threading path to reduce code duplication. >> >> So for that case, drastically reduce the number of statements >> we are allowed to copy. */ > > *blink* > > Woah. The backward threader has been using FSM threads > indiscriminately as far as I can remember. I wonder what would break > if we "fixed it". ?!? I'm not sure what you're suggesting here. If you s/FSM threader/backwards threader/ in the comment does it make more sense? The term FSM really should largely have been dropped as the backwards threader was improved to handle more cases. > >> >> so these cases should use the "old style" validity/costing metrics >> and thus >> classify threading opportunities in a different way? > > Jeff, do you have any insight here? This is precisely what you're cleaning up. > >> >> I think today "backwards" vs, "forwards" only refers to the way we find >> threading opportunities. > > Yes, it's a mess. > > I ran some experiments a while back, and my current work on the > enhanced solver/threader, can fold virtually everything the > DOM/threader gets (even with its use of const_and_copies, avail_exprs, > and evrp_range_analyzer), while getting 5% more DOM threads and 1% > more overall threads. That is, I've been testing if the path solver > can solve everything the DOM threader needs (the hybrid approach I > mentioned). 
> > Unfortunately, replacing the forward threader right now is not > feasible for a few reasons: Right. But I thought the short term goal was to replace/remove the forward threading from VRP. Dropping from DOM is going to be tougher. > > a) The const_and_copies/avail_exprs relation framework can do floats, > and that's next year's ranger work. Right. I'd actually run into this as well when I wanted to drop all the range bits out of DOM and rely exclusively on EVRP. It'd still be a step forward to rip out the EVRP engine from DOM and simplify all the code that derives one equivalence from another so that it's only working on FP values. > > b) Even though we can seemingly fold everything DOM/threader does, in > order to replace it with a backward threader instance we'd have to > merge the cost/profitability code scattered throughout the forward > threader, as well as the EDGE_FSM* / EDGE_COPY* business. Right. This is a prerequisite. Though some of the costing will need to be conditional on the threader being used. Refer back to the discussion around how the forward threader can commonize thread paths that lead to the same destination while the backwards threader can not. > > c) DOM changes the IL as it goes. Though we could conceivably divorce > do the threading after DOM is done. The only reason threading runs in parallel with DOM is so that it can use the context sensitive equivalences. With the infrastructure you're building, there's a reasonable chance we can move to a model where we run DOM (and in the long term a simpler DOM) and threading as distinct, independent passes. Jeff ^ permalink raw reply [flat|nested] 33+ messages in thread
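[Editorial aside: the "context sensitive avail/copies" referred to above is essentially an equivalence table with an undo log — facts recorded while the dominator walk descends into a block are popped when the walk backs out. A stripped-down sketch with invented names, nothing like the real data structures in scope:]

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Context-sensitive constant table with an unwind stack, in the spirit
// of DOM's const_and_copies: facts recorded after a marker is pushed
// are undone when the dominator walk leaves the corresponding block.
class scoped_consts
{
  std::map<std::string, long> m_vals;
  std::vector<std::pair<std::string, std::optional<long>>> m_undo;
  std::vector<std::size_t> m_marks;

public:
  void push_marker () { m_marks.push_back (m_undo.size ()); }

  void
  record (const std::string &name, long value)
  {
    auto it = m_vals.find (name);
    m_undo.emplace_back (name, it == m_vals.end ()
				 ? std::optional<long> ()
				 : std::optional<long> (it->second));
    m_vals[name] = value;
  }

  std::optional<long>
  lookup (const std::string &name) const
  {
    auto it = m_vals.find (name);
    if (it == m_vals.end ())
      return std::nullopt;
    return it->second;
  }

  void
  pop_to_marker ()
  {
    std::size_t mark = m_marks.back ();
    m_marks.pop_back ();
    while (m_undo.size () > mark)
      {
	auto [name, old] = m_undo.back ();
	m_undo.pop_back ();
	if (old)
	  m_vals[name] = *old;
	else
	  m_vals.erase (name);
      }
  }
};
```

Because the table's contents depend on where in the walk you are, any consumer (like the threader) has to run in lockstep with DOM — which is exactly the coupling being discussed.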
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-10 15:43 ` Jeff Law @ 2021-09-10 16:05 ` Aldy Hernandez 2021-09-10 16:21 ` Jeff Law 0 siblings, 1 reply; 33+ messages in thread From: Aldy Hernandez @ 2021-09-10 16:05 UTC (permalink / raw) To: Jeff Law, Richard Biener; +Cc: Michael Matz, GCC Mailing List, Andrew MacLeod On 9/10/21 5:43 PM, Jeff Law wrote: > > > On 9/9/2021 3:21 AM, Aldy Hernandez wrote: >> >>> >>> /* If this path does not thread through the loop latch, then we are >>> using the FSM threader to find old style jump threads. This >>> is good, except the FSM threader does not re-use an existing >>> threading path to reduce code duplication. >>> >>> So for that case, drastically reduce the number of statements >>> we are allowed to copy. */ >> >> *blink* >> >> Woah. The backward threader has been using FSM threads >> indiscriminately as far as I can remember. I wonder what would break >> if we "fixed it". > ?!? I'm not sure what you're suggesting here. If you s/FSM > threader/backwards threader/ in the comment does it make more sense? The > term FSM really should largely have been dropped as the backwards > threader was improved to handle more cases. back_threader_registry::register_path() uses EDGE_FSM_THREAD as the thread type to register threads. I was wondering if it should have been some combination of EDGE_START_JUMP_THREAD / EDGE_*_COPY_SRC_BLOCK, etc. I (purposely) know nothing about the underlying threading types ;-). But if the backwards threader has been improved, then perhaps we should just remove the confusing FSM references. > > > > >> >>> >>> so these cases should use the "old style" validity/costing metrics >>> and thus >>> classify threading opportunities in a different way? >> >> Jeff, do you have any insight here? > This is precisely what you're cleaning up. > > >> >>> >>> I think today "backwards" vs, "forwards" only refers to the way we find >>> threading opportunities. >> >> Yes, it's a mess. 
>> >> I ran some experiments a while back, and my current work on the >> enhanced solver/threader, can fold virtually everything the >> DOM/threader gets (even with its use of const_and_copies, avail_exprs, >> and evrp_range_analyzer), while getting 5% more DOM threads and 1% >> more overall threads. That is, I've been testing if the path solver >> can solve everything the DOM threader needs (the hybrid approach I >> mentioned). >> >> Unfortunately, replacing the forward threader right now is not >> feasible for a few reasons: > Right. But I thought the short term goal was to replace/remove the > forward threading from VRP. Dropping from DOM is going to be tougher. My current thinking is that replacing the forward VRP threader with a hybrid one is a gentler approach to the longer term goal of replacing the forward threader altogether. However, all the work I've been doing could go either way-- we could try the forward/VRP replacement or a hybrid approach. It will all use the path solver underneath. My main problem with replacing the forward/VRP with a backward client is that the cost models are so different that it was difficult to compare how we fared. I constantly ran into threads the solver could handle just fine, but profitable_path_p was holding it back. FWIW, we get virtually everything the forward threader gets, minus a very few things. At least when I plug in the solver to the DOM/forwarder threader, it can solve everything it can (minus noise and floats). If you prefer a backward threader instance to replace the VRP/forward threader, I'm game. It's just harder to compare. Either way (backward threader or a hybrid forward+solver) uses the same underlying solver which is solid. >> a) The const_and_copies/avail_exprs relation framework can do floats, >> and that's next year's ranger work. > Right. I'd actually run into this as well when I wanted to drop all the > range bits out of DOM and rely exclusively on EVRP. 
It'd still be a > step forward to rip out the EVRP engine from DOM and simplify all the > code that derives one equivalence from another so that it's only working > on FP values. Sure. > >> >> b) Even though we can seemingly fold everything DOM/threader does, in >> order to replace it with a backward threader instance we'd have to >> merge the cost/profitability code scattered throughout the forward >> threader, as well as the EDGE_FSM* / EDGE_COPY* business. > Right. This is a prerequisite. Though some of the costing will need to > be conditional on the threader being used. Refer back to the discussion > around how the forward threader can commonize thread paths that lead to > the same destination while the backwards threader can not. Yup, yup. > >> >> c) DOM changes the IL as it goes. Though we could conceivably divorce >> do the threading after DOM is done. > The only reason threading runs in parallel with DOM is so that it can > use the context sensitive equivalences. With the infrastructure you're > building, there's a reasonable chance we can move to a model where we > run DOM (and in the long term a simpler DOM) and threading as distinct, > independent passes. Andrew mumbled something about replacing all of DOM eventually :-). Well, except that value-numbering business I bet. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
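[Editorial aside: the hybrid idea — letting a path solver compute what the avail/copies tables used to provide — rests on intersecting ranges along the edges of a path, as in the BB5->BB9->BB3 trace at the top of the thread. A toy version over integer intervals; the names are invented and this resembles neither irange nor the real path_range_query API.]

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy integer interval standing in for a value range.
struct irange
{
  long lo, hi;
  bool undefined () const { return lo > hi; }  // empty: path unreachable
};

// Intersecting a starting range with the constraint recorded on each
// edge of the path yields the range of the name at the path's end.
static irange
solve_along_path (irange r, const std::vector<irange> &edge_constraints)
{
  for (const irange &c : edge_constraints)
    r = irange{std::max (r.lo, c.lo), std::min (r.hi, c.hi)};
  return r;
}
```

In the dump at the top, j_13 is globally [1, 257] after the increment and the 5->9 edge narrows it to [1, 256] — the same intersection this sketch performs.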
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-10 16:05 ` Aldy Hernandez @ 2021-09-10 16:21 ` Jeff Law 2021-09-10 16:38 ` Aldy Hernandez 0 siblings, 1 reply; 33+ messages in thread From: Jeff Law @ 2021-09-10 16:21 UTC (permalink / raw) To: Aldy Hernandez, Richard Biener Cc: Michael Matz, GCC Mailing List, Andrew MacLeod On 9/10/2021 10:05 AM, Aldy Hernandez wrote: > > > On 9/10/21 5:43 PM, Jeff Law wrote: >> >> >> On 9/9/2021 3:21 AM, Aldy Hernandez wrote: >>> >>>> >>>> /* If this path does not thread through the loop latch, then we are >>>> using the FSM threader to find old style jump threads. This >>>> is good, except the FSM threader does not re-use an existing >>>> threading path to reduce code duplication. >>>> >>>> So for that case, drastically reduce the number of statements >>>> we are allowed to copy. */ >>> >>> *blink* >>> >>> Woah. The backward threader has been using FSM threads >>> indiscriminately as far as I can remember. I wonder what would >>> break if we "fixed it". >> ?!? I'm not sure what you're suggesting here. If you s/FSM >> threader/backwards threader/ in the comment does it make more sense? >> The term FSM really should largely have been dropped as the backwards >> threader was improved to handle more cases. > > back_threader_registry::register_path() uses EDGE_FSM_THREAD as the > thread type to register threads. I was wondering if it should have > been some combination of EDGE_START_JUMP_THREAD / > EDGE_*_COPY_SRC_BLOCK, etc. I (purposely) know nothing about the > underlying threading types ;-). But if the backwards threader has been > improved, then perhaps we should just remove the confusing FSM > references. No we shouldn't change it to any of the other types. EDGE_FSM_THREAD means a thread found by the backwards threader and it's the key used to determine which of the two CFG updating mechanisms should be used, the generic copier in the case of EDGE_FSM_THREAD. Changing the name, yes, absolutely. 
I probably should have done that when the original FSM threader was tweaked to handle generic threading. As you've probably heard me mention before, all the EDGE_FSM_THREAD stuff in the registry really could be pulled out. The registry's purpose is to deal with the two stage nature of jump threading in DOM (find threads, then optimize them later). I don't think any of the backwards threading bits need that two stage handling. > My current thinking is that replacing the forward VRP threader with a > hybrid one is a gentler approach to the longer term goal of replacing > the forward threader altogether. However, all the work I've been > doing could go either way-- we could try the forward/VRP replacement > or a hybrid approach. It will all use the path solver underneath. And that's probably a reasonable intermediate step on the way towards removing the VRP threading. > > My main problem with replacing the forward/VRP with a backward client > is that the cost models are so different that it was difficult to > compare how we fared. I constantly ran into threads the solver could > handle just fine, but profitable_path_p was holding it back. Yea. Sorry about that tangle of insanity > > FWIW, we get virtually everything the forward threader gets, minus a > very few things. At least when I plug in the solver to the > DOM/forwarder threader, it can solve everything it can (minus noise > and floats). So once you plug those bits in, we don't have to carry around the avail/copies tables for the threader anymore, right? That's a nice cleanup in and of itself. > > If you prefer a backward threader instance to replace the VRP/forward > threader, I'm game. It's just harder to compare. Either way (backward > threader or a hybrid forward+solver) uses the same underlying solver > which is solid. I think we can go hybrid, then look at the next step, which could well be bringing some consistency into the costing models. >> >>> >>> c) DOM changes the IL as it goes. 
Though we could conceivably >>> divorce do the threading after DOM is done. >> The only reason threading runs in parallel with DOM is so that it can >> use the context sensitive equivalences. With the infrastructure >> you're building, there's a reasonable chance we can move to a model >> where we run DOM (and in the long term a simpler DOM) and threading >> as distinct, independent passes. > > Andrew mumbled something about replacing all of DOM eventually :-). > Well, except that value-numbering business I bet. Essentially a realization of Bodik's work in GCC. The nugget in there is it's a path sensitive optimizer. That's kindof what I've envisioned DOM turning into. 1. We separate out jump threading from DOM. 2. We replace the bulk of DOM with FRE 3. The remnants of DOM turn into a path sensitive optimizer (and for god's sake we don't want to call it DOM anymore :-) Jeff ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-10 16:21 ` Jeff Law @ 2021-09-10 16:38 ` Aldy Hernandez 0 siblings, 0 replies; 33+ messages in thread From: Aldy Hernandez @ 2021-09-10 16:38 UTC (permalink / raw) To: Jeff Law, Richard Biener; +Cc: Michael Matz, GCC Mailing List, Andrew MacLeod On 9/10/21 6:21 PM, Jeff Law wrote: > > > On 9/10/2021 10:05 AM, Aldy Hernandez wrote: >> >> >> On 9/10/21 5:43 PM, Jeff Law wrote: >>> >>> >>> On 9/9/2021 3:21 AM, Aldy Hernandez wrote: >>>> >>>>> >>>>> /* If this path does not thread through the loop latch, then we are >>>>> using the FSM threader to find old style jump threads. This >>>>> is good, except the FSM threader does not re-use an existing >>>>> threading path to reduce code duplication. >>>>> >>>>> So for that case, drastically reduce the number of statements >>>>> we are allowed to copy. */ >>>> >>>> *blink* >>>> >>>> Woah. The backward threader has been using FSM threads >>>> indiscriminately as far as I can remember. I wonder what would >>>> break if we "fixed it". >>> ?!? I'm not sure what you're suggesting here. If you s/FSM >>> threader/backwards threader/ in the comment does it make more sense? >>> The term FSM really should largely have been dropped as the backwards >>> threader was improved to handle more cases. >> >> back_threader_registry::register_path() uses EDGE_FSM_THREAD as the >> thread type to register threads. I was wondering if it should have >> been some combination of EDGE_START_JUMP_THREAD / >> EDGE_*_COPY_SRC_BLOCK, etc. I (purposely) know nothing about the >> underlying threading types ;-). But if the backwards threader has been >> improved, then perhaps we should just remove the confusing FSM >> references. > No we shouldn't change it to any of the other types. 
EDGE_FSM_THREAD > means a thread found by the backwards threader and it's the key used to > determine which of the two CFG updating mechanisms should be used, the > generic copier in the case of EDGE_FSM_THREAD. > > > Changing the name, yes, absolutely. I probably should have done that > when the original FSM threader was tweaked to handle generic threading. I'll put that on my list. > > As you've probably heard me mention before, all the EDGE_FSM_THREAD > stuff in the registry really could be pulled out. The registry's > purpose is to deal with the two stage nature of jump threading in DOM > (find threads, then optimize them later). I don't think any of the > backwards threading bits need that two stage handling. Yeah yeah. But you said that a year ago, and all I heard was *mumble mumble/complicated things*. That was before I had much exposure to the code. Now I feel a lot more comfortable ;-). I'll also put this on my list, but it may not get done this release, cause I'm running against the clock with the VRP/threader replacement, which is what's keeping us back from replacing VRP with an evrp instance right now :). > >> My current thinking is that replacing the forward VRP threader with a >> hybrid one is a gentler approach to the longer term goal of replacing >> the forward threader altogether. However, all the work I've been >> doing could go either way-- we could try the forward/VRP replacement >> or a hybrid approach. It will all use the path solver underneath. > And that's probably a reasonable intermediate step on the way towards > removing the VRP threading. > >> >> My main problem with replacing the forward/VRP with a backward client >> is that the cost models are so different that it was difficult to >> compare how we fared. I constantly ran into threads the solver could >> handle just fine, but profitable_path_p was holding it back. > Yea. 
Sorry about that tangle of insanity > >> >> FWIW, we get virtually everything the forward threader gets, minus a >> very few things. At least when I plug in the solver to the >> DOM/forwarder threader, it can solve everything it can (minus noise >> and floats). > So once you plug those bits in, we don't have to carry around the > avail/copies tables for the threader anymore, right? That's a nice > cleanup in and of itself. Correct. For the VRP/hybrid approach I'm working on, there are no copies/avails. The solver can do everything they did. After all, it's an easier target, since VRP threading only happens on ints and without the IL changing constantly. > >> >> If you prefer a backward threader instance to replace the VRP/forward >> threader, I'm game. It's just harder to compare. Either way (backward >> threader or a hybrid forward+solver) uses the same underlying solver >> which is solid. > I think we can go hybrid, then look at the next step, which could well > be bringing some consistency into the costing models. >>> >>>> >>>> c) DOM changes the IL as it goes. Though we could conceivably >>>> divorce do the threading after DOM is done. >>> The only reason threading runs in parallel with DOM is so that it can >>> use the context sensitive equivalences. With the infrastructure >>> you're building, there's a reasonable chance we can move to a model >>> where we run DOM (and in the long term a simpler DOM) and threading >>> as distinct, independent passes. >> >> Andrew mumbled something about replacing all of DOM eventually :-). >> Well, except that value-numbering business I bet. > Essentially a realization of Bodik's work in GCC. The nugget in there > is it's a path sensitive optimizer. That's kindof what I've envisioned > DOM turning into. > > 1. We separate out jump threading from DOM. > 2. We replace the bulk of DOM with FRE > 3. 
The remnants of DOM turn into a path sensitive optimizer (and for > god's sake we don't want to call it DOM anymore :-) Well, my tree has improvements to the solver for full path sensitive ranges (using ranger to resolve definitions outside of the path). And it also does path sensitive relations, though Andrew is overhauling them next week. So yeah, given a path, we could tell you all sorts of interesting things about it :). Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
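As a minimal sketch of the path-sensitive range reasoning discussed in this exchange (assuming nothing about GCC's actual irange/path-solver API — the types and helpers below are illustrative stand-ins), the j_19 -> j_13 step and the refinement of n_10 on the taken edge of the 5->9->3 path look roughly like:

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// Toy model of the path solver's range reasoning; "irange" here is a
// stand-in, not GCC's irange class.
struct irange { int lo, hi; };

// j_13 = j_19 + 1 under simple interval arithmetic.
irange add_one (irange r) { return { r.lo + 1, r.hi + 1 }; }

// On the true edge of "if (n > j)", n must exceed the smallest value
// j can take on this path, so n's lower bound is raised accordingly.
irange refine_gt (irange n, irange j)
{
  return { std::max (n.lo, j.lo + 1), n.hi };
}
```

With j_19 in [0, 256] this yields j_13 in [1, 257], and the true edge of `n_10 > j_13` raises the lower bound of n_10 to 2 — matching the "n_10 is [2, +INF] at the end of the path" observation that started the thread.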
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 8:58 ` Richard Biener 2021-09-09 9:21 ` Aldy Hernandez @ 2021-09-09 16:59 ` Jeff Law 1 sibling, 0 replies; 33+ messages in thread From: Jeff Law @ 2021-09-09 16:59 UTC (permalink / raw) To: Richard Biener, Aldy Hernandez Cc: Michael Matz, GCC Mailing List, Andrew MacLeod On 9/9/2021 2:58 AM, Richard Biener wrote: > On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez <aldyh@redhat.com> wrote: >> >> >> On 9/9/21 9:45 AM, Richard Biener wrote: >>> On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <aldyh@redhat.com> wrote: >>>> >>>> >>>> On 9/9/21 8:57 AM, Richard Biener wrote: >>>>> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: >>>>>> Hello, >>>>>> >>>>>> [lame answer to self] >>>>>> >>>>>> On Wed, 8 Sep 2021, Michael Matz wrote: >>>>>> >>>>>>>>> The forward threader guards against this by simply disallowing >>>>>>>>> threadings that involve different loops. As I see >>>>>>>> The thread in question (5->9->3) is all within the same outer loop, >>>>>>>> though. BTW, the backward threader also disallows threading across >>>>>>>> different loops (see path_crosses_loops variable). >>>>>> ... >>>>>>> Maybe it's possible to not disable threading over latches alltogether in >>>>>>> the backward threader (like it's tried now), but I haven't looked at the >>>>>>> specific situation here in depth, so take my view only as opinion from a >>>>>>> large distance :-) >>>>>> I've now looked at the concrete situation. So yeah, the whole path is in >>>>>> the same loop, crosses the latch, _and there's code following the latch >>>>>> on that path_. (I.e. the latch isn't the last block in the path). 
In >>>>>> particular, after loop_optimizer_init() (before any threading) we have: >>>>>> >>>>>> <bb 3> [local count: 118111600]: >>>>>> # j_19 = PHI <j_13(9), 0(7)> >>>>>> sum_11 = c[j_19]; >>>>>> if (n_10(D) > 0) >>>>>> goto <bb 8>; [89.00%] >>>>>> else >>>>>> goto <bb 5>; [11.00%] >>>>>> >>>>>> <bb 8> [local count: 105119324]: >>>>>> ... >>>>>> >>>>>> <bb 5> [local count: 118111600]: >>>>>> # sum_21 = PHI <sum_14(4), sum_11(3)> >>>>>> c[j_19] = sum_21; >>>>>> j_13 = j_19 + 1; >>>>>> if (n_10(D) > j_13) >>>>>> goto <bb 9>; [89.00%] >>>>>> else >>>>>> goto <bb 6>; [11.00%] >>>>>> >>>>>> <bb 9> [local count: 105119324]: >>>>>> goto <bb 3>; [100.00%] >>>>>> >>>>>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the >>>>>> pre-header of inner loop, but more importantly something that's not at the >>>>>> start of the outer loop. >>>>>> >>>>>> Now, any thread that includes the backedge 9->3 _including_ its >>>>>> destination (i.e. where the backedge isn't the last to-be-redirected edge) >>>>>> necessarily duplicates all code from that destination onto the back edge. >>>>>> Here it's the load from c[j] into sum_11. >>>>>> >>>>>> The important part is the code is emitted onto the back edge, >>>>>> conceptually; in reality it's simply included into the (new) latch block >>>>>> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after >>>>>> cfg_cleanup). >>>>>> >>>>>> That's what we can't have for some of our structural loop optimizers: >>>>>> there must be no code executed after the exit test (e.g. in the latch >>>>>> block). (This requirement makes reasoning about which code is or isn't >>>>>> executed completely for an iteration trivial; simply everything in the >>>>>> body is always executed; e.g. loop interchange uses this to check that >>>>>> there are no memory references after the exit test, because those would >>>>>> then be only conditional and hence make loop interchange very awkward). 
>>>>>> >>>>>> Note that this situation can't be later rectified anymore: the duplicated >>>>>> instructions (because they are memory refs) must remain after the exit >>>>>> test. Only by rerolling/unrotating the loop (i.e. noticing that the >>>>>> memory refs on the loop-entry path and on the back edge are equivalent) >>>>>> would that be possible, but that's something we aren't capable of. Even >>>>>> if we were that would simply just revert the whole work that the threader >>>>>> did, so it's better to not even do that to start with. >>>>>> >>>>>> I believe something like below would be appropriate, it disables threading >>>>>> if the path contains a latch at the non-last position (due to being >>>>>> backwards on the non-first position in the array). I.e. it disables >>>>>> rotating the loop if there's danger of polluting the back edge. It might >>>>>> be improved if the blocks following (preceding!) the latch are themself >>>>>> empty because then no code is duplicated. It might also be improved if >>>>>> the latch is already non-empty. That code should probably only be active >>>>>> before the loop optimizers, but currently the backward threader isn't >>>>>> differentiating between before/after loop-optims. >>>>>> >>>>>> I haven't tested this patch at all, except that it fixes the testcase :) >>>>> Lame comment at the current end of the thread - it's not threading through the >>>> I don't know why y'all keep using the word "lame". On the contrary, >>>> these are incredibly useful explanations. Thanks. >>>> >>>>> latch but threading through the loop header that's problematic, at least if the >>>>> end of the threading path ends within the loop (threading through the header >>>>> to the loop exit is fine). Because in that situation you effectively created an >>>>> alternate loop entry. 
Threading through the latch into the loop header is >>>>> fine but with simple latches that likely will never happen (if there are no >>>>> simple latches then the latch can contain the loop exit test). >>>>> >>>>> See tree-ssa-threadupdate.c:thread_block_1 >>>>> >>>>> e2 = path->last ()->e; >>>>> if (!e2 || noloop_only) >>>>> { >>>>> /* If NOLOOP_ONLY is true, we only allow threading through the >>>>> header of a loop to exit edges. */ >>>>> >>>>> /* One case occurs when there was loop header buried in a jump >>>>> threading path that crosses loop boundaries. We do not try >>>>> and thread this elsewhere, so just cancel the jump threading >>>>> request by clearing the AUX field now. */ >>>>> if (bb->loop_father != e2->src->loop_father >>>>> && (!loop_exit_edge_p (e2->src->loop_father, e2) >>>>> || flow_loop_nested_p (bb->loop_father, >>>>> e2->dest->loop_father))) >>>>> { >>>>> /* Since this case is not handled by our special code >>>>> to thread through a loop header, we must explicitly >>>>> cancel the threading request here. */ >>>>> delete_jump_thread_path (path); >>>>> e->aux = NULL; >>>>> continue; >>>>> } >>>> But this is for a threading path that crosses loop boundaries, which is >>>> not the case. Perhaps we should restrict this further to threads within >>>> a loop? >>>> >>>>> there are a lot of "useful" checks in this function and the backwards threader >>>>> should adopt those. Note the backwards threader originally only did >>>>> FSM style threadings which are exactly those possibly "harmful" ones, forming >>>>> irreducible regions at worst or sub-loops at best. That might explain the >>>>> lack of those checks. >>>> Also, the aforementioned checks are in jump_thread_path_registry, which >>>> is also shared by the backward threader. These are thread discards >>>> _after_ a thread has been registered. >>> Yeah, that's indeed unfortunate. >>> >>>> The backward threader should also >>>> be using these restrictions. 
Unless, I'm missing some interaction with >>>> the FSM/etc threading types as per the preamble to the snippet you provided: >>>> >>>> if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners) >>>> || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners)) >>>> continue; >>> Indeed. But I understand the backwards threader does not (only) do FSM >>> threading now. >> If it does, it was not part of my rewrite. I was careful to not touch >> anything dealing with either path profitability or low-level path >> registering. >> >> The path registering is in back_threader_registry::register_path(). We >> only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY. ISTM that >> those are only EDGE_FSM_THREADs?? > Well, if the backwards threader classifies everything as FSM that's probably > inaccurate since only threads through the loop latch are "FSM". There is > the comment FSM is used all over the place, but it really just applies to threading through the latch and through an indirect jump/switch. Many of the places that say FSM should probably just say "backwards" or something similar. > > /* If this path does not thread through the loop latch, then we are > using the FSM threader to find old style jump threads. This > is good, except the FSM threader does not re-use an existing > threading path to reduce code duplication. > > So for that case, drastically reduce the number of statements > we are allowed to copy. */ > > so these cases should use the "old style" validity/costing metrics and thus > classify threading opportunities in a different way? It's not that simple. The backwards threader uses a different block copier and CFG update mechanism that does not support commonizing threads from multiple paths to the same target. So each jump thread ends up with a copy of the path which can lead to excessive code bloat. 
The forward threader knows how to commonize those paths to cut down on the duplicates and thus can accept larger blocks without having such a negative impact on code size. > > I think today "backwards" vs. "forwards" only refers to the way we find > threading opportunities. Correct. jeff ^ permalink raw reply [flat|nested] 33+ messages in thread
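The naming point Jeff makes — that only threads through the loop latch are genuinely "FSM" — can be sketched as a tiny predicate (toy types, not GCC's; the real classification lives in the threader and registry code):

```cpp
#include <cassert>
#include <vector>

// Toy model: a path found by the backwards threader is only truly "FSM"
// if it goes through the loop latch; everything else is an old-style jump
// thread that merely reuses the FSM label.
struct block { int id; bool is_latch; };

bool fsm_style_p (const std::vector<block> &path)
{
  for (const block &b : path)
    if (b.is_latch)
      return true;
  return false;
}
```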
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 6:57 ` Richard Biener 2021-09-09 7:37 ` Aldy Hernandez @ 2021-09-09 12:47 ` Michael Matz 1 sibling, 0 replies; 33+ messages in thread From: Michael Matz @ 2021-09-09 12:47 UTC (permalink / raw) To: Richard Biener; +Cc: Aldy Hernandez, Jeff Law, GCC Mailing List, Andrew MacLeod Hello, On Thu, 9 Sep 2021, Richard Biener wrote: > > I believe something like below would be appropriate, it disables > > threading if the path contains a latch at the non-last position (due > > to being backwards on the non-first position in the array). I.e. it > > disables rotating the loop if there's danger of polluting the back > > edge. It might be improved if the blocks following (preceding!) the > > latch are themself empty because then no code is duplicated. It might > > also be improved if the latch is already non-empty. That code should > > probably only be active before the loop optimizers, but currently the > > backward threader isn't differentiating between before/after > > loop-optims. > > > > I haven't tested this patch at all, except that it fixes the testcase > > :) > > Lame comment at the current end of the thread - it's not threading > through the latch but threading through the loop header that's > problematic, I beg to differ, but that's probably because of the ambiguity of the word "through" (does it or does it not include the ends of the path :) ). If you thread through the loop header from the entry block (i.e. duplicate code from header to entry) all would be fine (or not, in case you created multiple entries from outside). If you thread through the latch, then through an empty header and then through another non-empty basic block within the loop, you still wouldn't be fine: you've just created code on the back edge (and hence into the new latch block). If you thread through the latch and through an empty header (and stop there), you also are fine. 
Also note that in this situation you do _not_ create new entries into the loop, not even intermediately. The old back edge is the one that goes away due to the threading, the old entry edge is moved completely out of the loop, the edge header->thread-dest becomes the new entry edge, and the edge new-latch->thread-dest becomes the back edge. No additional entries. So, no, it's not the threading through the loop header that is problematic but creating a situation that fills the (new) latch with code, and that can only happen if the candidate thread contains the latch block. (Of course it's somewhat related to the header block as well, because that is simply the only destination the latch has, and hence is usually included in any thread that also includes the latch; but it's not the header that indicates the situation). > See tree-ssa-threadupdate.c:thread_block_1 > > e2 = path->last ()->e; > if (!e2 || noloop_only) > { > /* If NOLOOP_ONLY is true, we only allow threading through the > header of a loop to exit edges. */ > > /* One case occurs when there was loop header buried in a jump > threading path that crosses loop boundaries. We do not try > and thread this elsewhere, so just cancel the jump threading > request by clearing the AUX field now. */ > if (bb->loop_father != e2->src->loop_father > && (!loop_exit_edge_p (e2->src->loop_father, e2) > || flow_loop_nested_p (bb->loop_father, > e2->dest->loop_father))) > { > /* Since this case is not handled by our special code > to thread through a loop header, we must explicitly > cancel the threading request here. */ > delete_jump_thread_path (path); > e->aux = NULL; > continue; > } Yeah, sure, but I'm not sure if the above check is _really_ testing what it wants to test or if the effects it achieves are side effects.
Like in my proposed patch: I could also test for the existence of a loop header in the thread and reject that; it would work as well, except that it works because any useful thread including a latch (which is the problematic one) also includes the header. I'm not sure if the above check is along the same lines, or tests for yet another situation. Ciao, Michael. ^ permalink raw reply [flat|nested] 33+ messages in thread
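The rejection test Michael's patch proposes can be sketched as follows (toy types, not GCC's — the real change sits in back_threader_profitability::profitable_path_p, which iterates the path array backwards):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The path array is stored backwards, so index 0 holds the *last* block
// of the thread. A latch at any other index means code from the blocks
// that execute after it would be duplicated onto the back edge.
bool latch_at_nonlast_position_p (const std::vector<int> &m_path, int latch_id)
{
  for (std::size_t j = 1; j < m_path.size (); ++j)  // index 0 == last block
    if (m_path[j] == latch_id)
      return true;
  return false;
}
```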
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-08 18:13 ` Michael Matz 2021-09-09 6:57 ` Richard Biener @ 2021-09-09 8:14 ` Aldy Hernandez 2021-09-09 8:24 ` Richard Biener ` (2 more replies) 1 sibling, 3 replies; 33+ messages in thread From: Aldy Hernandez @ 2021-09-09 8:14 UTC (permalink / raw) To: Michael Matz; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod On 9/8/21 8:13 PM, Michael Matz wrote: > Hello, > > [lame answer to self] > > On Wed, 8 Sep 2021, Michael Matz wrote: > >>>> The forward threader guards against this by simply disallowing >>>> threadings that involve different loops. As I see >>> >>> The thread in question (5->9->3) is all within the same outer loop, >>> though. BTW, the backward threader also disallows threading across >>> different loops (see path_crosses_loops variable). > ... >> Maybe it's possible to not disable threading over latches alltogether in >> the backward threader (like it's tried now), but I haven't looked at the >> specific situation here in depth, so take my view only as opinion from a >> large distance :-) > > I've now looked at the concrete situation. So yeah, the whole path is in > the same loop, crosses the latch, _and there's code following the latch > on that path_. (I.e. the latch isn't the last block in the path). In > particular, after loop_optimizer_init() (before any threading) we have: > > <bb 3> [local count: 118111600]: > # j_19 = PHI <j_13(9), 0(7)> > sum_11 = c[j_19]; > if (n_10(D) > 0) > goto <bb 8>; [89.00%] > else > goto <bb 5>; [11.00%] > > <bb 8> [local count: 105119324]: > ... 
> > <bb 5> [local count: 118111600]: > # sum_21 = PHI <sum_14(4), sum_11(3)> > c[j_19] = sum_21; > j_13 = j_19 + 1; > if (n_10(D) > j_13) > goto <bb 9>; [89.00%] > else > goto <bb 6>; [11.00%] > > <bb 9> [local count: 105119324]: > goto <bb 3>; [100.00%] > > With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the > pre-header of inner loop, but more importantly something that's not at the > start of the outer loop. > > Now, any thread that includes the backedge 9->3 _including_ its > destination (i.e. where the backedge isn't the last to-be-redirected edge) > necessarily duplicates all code from that destination onto the back edge. > Here it's the load from c[j] into sum_11. > > The important part is the code is emitted onto the back edge, > conceptually; in reality it's simply included into the (new) latch block > (the duplicate of bb9, which is bb12 intermediately, then named bb7 after > cfg_cleanup). > > That's what we can't have for some of our structural loop optimizers: > there must be no code executed after the exit test (e.g. in the latch > block). (This requirement makes reasoning about which code is or isn't > executed completely for an iteration trivial; simply everything in the > body is always executed; e.g. loop interchange uses this to check that > there are no memory references after the exit test, because those would > then be only conditional and hence make loop interchange very awkward). > > Note that this situation can't be later rectified anymore: the duplicated > instructions (because they are memory refs) must remain after the exit > test. Only by rerolling/unrotating the loop (i.e. noticing that the > memory refs on the loop-entry path and on the back edge are equivalent) > would that be possible, but that's something we aren't capable of. Even > if we were that would simply just revert the whole work that the threader > did, so it's better to not even do that to start with. 
> > I believe something like below would be appropriate, it disables threading > if the path contains a latch at the non-last position (due to being > backwards on the non-first position in the array). I.e. it disables > rotating the loop if there's danger of polluting the back edge. It might > be improved if the blocks following (preceding!) the latch are themself > empty because then no code is duplicated. It might also be improved if > the latch is already non-empty. That code should probably only be active > before the loop optimizers, but currently the backward threader isn't > differentiating between before/after loop-optims. > > I haven't tested this patch at all, except that it fixes the testcase :) Thanks for looking at this. I think you're onto something with this approach. Perhaps in addition to the loop header threading Richard mentions. Your patch causes some regressions, but I think most are noise from FSM tests that must be adjusted. However, there are some other ones that are curious: > FAIL: gcc.dg/tree-ssa/ldist-22.c scan-tree-dump ldist "generated memset zero" > FAIL: gcc.dg/tree-ssa/pr66752-3.c scan-tree-dump-not dce2 "if .flag" < XFAIL: gcc.dg/shrink-wrap-loop.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping" < XFAIL: gcc.dg/Warray-bounds-87.c pr101671 (test for bogus messages, line 36) > FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times graphite "1 loops carried no dependency" 1 > FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times optimized "loopfn.1" 4 > FAIL: libgomp.graphite/force-parallel-8.c scan-tree-dump-times graphite "5 loops carried no dependency" 1 Interestingly your patch is fixing shrink-wrap-loop.c and Warray-bounds-87, both of which were introduced by the backward threader rewrite. At least the Warray-bounds was the threader peeling off an iteration that caused a bogus warning. 
The ldist-22 regression is interesting though: void foo () { int i; <bb 2> : goto <bb 6>; [INV] <bb 3> : a[i_1] = 0; if (i_1 > 100) goto <bb 4>; [INV] else goto <bb 5>; [INV] <bb 4> : b[i_1] = i_1; <bb 5> : i_8 = i_1 + 1; <bb 6> : # i_1 = PHI <0(2), i_8(5)> if (i_1 <= 1023) goto <bb 3>; [INV] else goto <bb 7>; [INV] <bb 7> : return; } Here we fail to look past 5->6 because BB5 is the latch and is not the last block in the path. So we fail to thread 3->5->6->3. Doing so would have split the function into two loops, one of which could use a memset: void foo () { int i; <bb 2> : goto <bb 6>; [INV] <bb 3> : # i_12 = PHI <i_1(6), i_9(4)> a[i_12] = 0; if (i_12 > 100) goto <bb 5>; [INV] else goto <bb 4>; [INV] <bb 4> : i_9 = i_12 + 1; goto <bb 3>; [100.00%] <bb 5> : b[i_12] = i_12; i_8 = i_12 + 1; <bb 6> : # i_1 = PHI <0(2), i_8(5)> if (i_1 <= 1023) goto <bb 3>; [INV] else goto <bb 7>; [INV] <bb 7> : return; } I would have to agree that threading through latches is problematic. For that matter, the ldist-22 test shows that we're depending on the threader to do work that seems to belong in the loop optimizer world. Would it be crazy to suggest that we disable threading through latches altogether, and do whatever we're missing in the loop world? It seems loop has all the tools, cost model, and framework to do so. Of course, I know 0 about loop, and would hate to add work to other's plates. Thanks. Aldy BTW, I haven't looked at the graphite regressions. ^ permalink raw reply [flat|nested] 33+ messages in thread
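The split Aldy describes for ldist-22 can be rendered as plain C-style code (simplified: the two a[]-clearing loops of the threaded form are folded into one memset over the whole array, which is the end state loop distribution is after; the i > 100 condition follows the quoted gimple):

```cpp
#include <cassert>
#include <cstring>

enum { N = 1024 };
static int a1[N], b1[N], a2[N], b2[N];

// The original fused loop from ldist-22.
static void foo_orig (void)
{
  for (int i = 0; i < N; i++)
    {
      a1[i] = 0;
      if (i > 100)
        b1[i] = i;
    }
}

// After splitting: the a[] initialization becomes the "generated memset
// zero" the test scans for, and only the i > 100 iterations touch b[].
static void foo_split (void)
{
  std::memset (a2, 0, sizeof a2);
  for (int i = 101; i < N; i++)
    b2[i] = i;
}
```

Both versions leave identical contents in the arrays, which is what makes the distribution legal.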
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 8:14 ` Aldy Hernandez @ 2021-09-09 8:24 ` Richard Biener 2021-09-09 12:52 ` Michael Matz 2021-09-09 16:54 ` Jeff Law 2 siblings, 0 replies; 33+ messages in thread From: Richard Biener @ 2021-09-09 8:24 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On Thu, Sep 9, 2021 at 10:14 AM Aldy Hernandez <aldyh@redhat.com> wrote: > > > > On 9/8/21 8:13 PM, Michael Matz wrote: > > Hello, > > > > [lame answer to self] > > > > On Wed, 8 Sep 2021, Michael Matz wrote: > > > >>>> The forward threader guards against this by simply disallowing > >>>> threadings that involve different loops. As I see > >>> > >>> The thread in question (5->9->3) is all within the same outer loop, > >>> though. BTW, the backward threader also disallows threading across > >>> different loops (see path_crosses_loops variable). > > ... > >> Maybe it's possible to not disable threading over latches alltogether in > >> the backward threader (like it's tried now), but I haven't looked at the > >> specific situation here in depth, so take my view only as opinion from a > >> large distance :-) > > > > I've now looked at the concrete situation. So yeah, the whole path is in > > the same loop, crosses the latch, _and there's code following the latch > > on that path_. (I.e. the latch isn't the last block in the path). In > > particular, after loop_optimizer_init() (before any threading) we have: > > > > <bb 3> [local count: 118111600]: > > # j_19 = PHI <j_13(9), 0(7)> > > sum_11 = c[j_19]; > > if (n_10(D) > 0) > > goto <bb 8>; [89.00%] > > else > > goto <bb 5>; [11.00%] > > > > <bb 8> [local count: 105119324]: > > ... 
> > > > <bb 5> [local count: 118111600]: > > # sum_21 = PHI <sum_14(4), sum_11(3)> > > c[j_19] = sum_21; > > j_13 = j_19 + 1; > > if (n_10(D) > j_13) > > goto <bb 9>; [89.00%] > > else > > goto <bb 6>; [11.00%] > > > > <bb 9> [local count: 105119324]: > > goto <bb 3>; [100.00%] > > > > With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the > > pre-header of inner loop, but more importantly something that's not at the > > start of the outer loop. > > > > Now, any thread that includes the backedge 9->3 _including_ its > > destination (i.e. where the backedge isn't the last to-be-redirected edge) > > necessarily duplicates all code from that destination onto the back edge. > > Here it's the load from c[j] into sum_11. > > > > The important part is the code is emitted onto the back edge, > > conceptually; in reality it's simply included into the (new) latch block > > (the duplicate of bb9, which is bb12 intermediately, then named bb7 after > > cfg_cleanup). > > > > That's what we can't have for some of our structural loop optimizers: > > there must be no code executed after the exit test (e.g. in the latch > > block). (This requirement makes reasoning about which code is or isn't > > executed completely for an iteration trivial; simply everything in the > > body is always executed; e.g. loop interchange uses this to check that > > there are no memory references after the exit test, because those would > > then be only conditional and hence make loop interchange very awkward). > > > > Note that this situation can't be later rectified anymore: the duplicated > > instructions (because they are memory refs) must remain after the exit > > test. Only by rerolling/unrotating the loop (i.e. noticing that the > > memory refs on the loop-entry path and on the back edge are equivalent) > > would that be possible, but that's something we aren't capable of. 
Even > > if we were that would simply just revert the whole work that the threader > > did, so it's better to not even do that to start with. > > > > I believe something like below would be appropriate, it disables threading > > if the path contains a latch at the non-last position (due to being > > backwards on the non-first position in the array). I.e. it disables > > rotating the loop if there's danger of polluting the back edge. It might > > be improved if the blocks following (preceding!) the latch are themself > > empty because then no code is duplicated. It might also be improved if > > the latch is already non-empty. That code should probably only be active > > before the loop optimizers, but currently the backward threader isn't > > differentiating between before/after loop-optims. > > > > I haven't tested this patch at all, except that it fixes the testcase :) > > Thanks for looking at this. > > I think you're onto something with this approach. Perhaps in addition > to the loop header threading Richard mentions. > > Your patch causes some regressions, but I think most are noise from FSM > tests that must be adjusted. 
However, there are some other ones that > are curious: > > > FAIL: gcc.dg/tree-ssa/ldist-22.c scan-tree-dump ldist "generated > memset zero" > > FAIL: gcc.dg/tree-ssa/pr66752-3.c scan-tree-dump-not dce2 "if .flag" > < XFAIL: gcc.dg/shrink-wrap-loop.c scan-rtl-dump pro_and_epilogue > "Performing shrink-wrapping" > < XFAIL: gcc.dg/Warray-bounds-87.c pr101671 (test for bogus messages, > line 36) > > FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times > graphite "1 loops carried no dependency" 1 > > FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times > optimized "loopfn.1" 4 > > FAIL: libgomp.graphite/force-parallel-8.c scan-tree-dump-times > graphite "5 loops carried no dependency" 1 > > Interestingly your patch is fixing shrink-wrap-loop.c and > Warray-bounds-87, both of which were introduced by the backward threader > rewrite. At least the Warray-bounds was the threader peeling off an > iteration that caused a bogus warning. > > The ldist-22 regression is interesting though: > > void foo () > { > int i; > > <bb 2> : > goto <bb 6>; [INV] > > <bb 3> : > a[i_1] = 0; > if (i_1 > 100) > goto <bb 4>; [INV] > else > goto <bb 5>; [INV] > > <bb 4> : > b[i_1] = i_1; > > <bb 5> : > i_8 = i_1 + 1; > > <bb 6> : > # i_1 = PHI <0(2), i_8(5)> > if (i_1 <= 1023) > goto <bb 3>; [INV] > else > goto <bb 7>; [INV] > > <bb 7> : > return; > > } > > Here we fail to look past 5->6 because BB5 is the latch and is not the > last block in the path. So we fail to thread 3->5->6->3. Doing so > would have split the function into two loops, one of which could use a > memset: The loop splitting pass should handle this case though but I guess it never gets to see this original loop form? 
> void foo () > { > int i; > > <bb 2> : > goto <bb 6>; [INV] > > <bb 3> : > # i_12 = PHI <i_1(6), i_9(4)> > a[i_12] = 0; > if (i_12 > 100) > goto <bb 5>; [INV] > else > goto <bb 4>; [INV] > > <bb 4> : > i_9 = i_12 + 1; > goto <bb 3>; [100.00%] > > <bb 5> : > b[i_12] = i_12; > i_8 = i_12 + 1; > > <bb 6> : > # i_1 = PHI <0(2), i_8(5)> > if (i_1 <= 1023) > goto <bb 3>; [INV] > else > goto <bb 7>; [INV] > > <bb 7> : > return; > > } > > I would have to agree that threading through latches is problematic. > For that matter, the ldist-22 test shows that we're depending on the > threader to do work that seems to belong in the loop optimizer world. > > Would it be crazy to suggest that we disable threading through latches > altogether, and do whatever we're missing in the loop world? It seems > loop has all the tools, cost model, and framework to do so. Of course, > I know 0 about loop, and would hate to add work to other's plates. At least FSM threading should be done before loop opts when it introduces sub-loops (but not when it creates irreducible regions). Otherwise I agree that jump threading should try very hard to not disrupt loops before loop optimizations but after those it can (and should) go wild. Yes, we have RTL loop optimizations, but ... Richard. > Thanks. > Aldy > > BTW, I haven't looked at the graphite regressions. > ^ permalink raw reply [flat|nested] 33+ messages in thread
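The structural property this whole exchange revolves around — no code executed after the exit test — can be seen in miniature in source form. In the clean shape every statement of the body runs before the exit test, which is what passes like loop interchange rely on; in the rotated shape a load has effectively been copied onto the back edge, so it executes conditionally after the exit test (this is an illustrative analogy, not the exact testcase):

```cpp
#include <cassert>

int sum_clean (const int *c, int n)
{
  int sum = 0;
  for (int j = 0; j < n; j++)
    sum += c[j];                // whole body precedes the exit test
  return sum;
}

int sum_rotated (const int *c, int n)
{
  if (n <= 0)
    return 0;
  int sum = 0;
  int v = c[0];                 // load peeled onto the entry path
  int j = 0;
  for (;;)
    {
      sum += v;
      if (++j >= n)             // exit test ...
        break;
      v = c[j];                 // ... followed by a memory ref on the back edge
    }
  return sum;
}
```

Both compute the same sum, but recovering the clean shape from the rotated one would require exactly the rerolling/unrotating Michael notes GCC cannot do.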
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 8:14 ` Aldy Hernandez 2021-09-09 8:24 ` Richard Biener @ 2021-09-09 12:52 ` Michael Matz 2021-09-09 13:37 ` Aldy Hernandez 2021-09-09 16:54 ` Jeff Law 2 siblings, 1 reply; 33+ messages in thread From: Michael Matz @ 2021-09-09 12:52 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod Hello, On Thu, 9 Sep 2021, Aldy Hernandez wrote: > The ldist-22 regression is interesting though: > > void foo () > { > int i; > > <bb 2> : > goto <bb 6>; [INV] > > <bb 3> : > a[i_1] = 0; > if (i_1 > 100) > goto <bb 4>; [INV] > else > goto <bb 5>; [INV] > > <bb 4> : > b[i_1] = i_1; > > <bb 5> : > i_8 = i_1 + 1; > > <bb 6> : > # i_1 = PHI <0(2), i_8(5)> > if (i_1 <= 1023) > goto <bb 3>; [INV] > else > goto <bb 7>; [INV] Here there's no simple latch block to start with (the backedge comes directly out of the loop exit block). So my suggested improvement (testing if the latch was empty and only then reject the thread), would solve this. > Would it be crazy to suggest that we disable threading through latches > altogether, I think it wouldn't be crazy, but we can do a bit better as suggested above (only reject empty latches, and reject it only for the threaders coming before the loop optims). Ciao, Michael. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 12:52 ` Michael Matz @ 2021-09-09 13:37 ` Aldy Hernandez 2021-09-09 14:44 ` Michael Matz 0 siblings, 1 reply; 33+ messages in thread From: Aldy Hernandez @ 2021-09-09 13:37 UTC (permalink / raw) To: Michael Matz; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod On 9/9/21 2:52 PM, Michael Matz wrote: > Hello, > > On Thu, 9 Sep 2021, Aldy Hernandez wrote: > >> The ldist-22 regression is interesting though: >> >> void foo () >> { >> int i; >> >> <bb 2> : >> goto <bb 6>; [INV] >> >> <bb 3> : >> a[i_1] = 0; >> if (i_1 > 100) >> goto <bb 4>; [INV] >> else >> goto <bb 5>; [INV] >> >> <bb 4> : >> b[i_1] = i_1; >> >> <bb 5> : >> i_8 = i_1 + 1; >> >> <bb 6> : >> # i_1 = PHI <0(2), i_8(5)> >> if (i_1 <= 1023) >> goto <bb 3>; [INV] >> else >> goto <bb 7>; [INV] > > Here there's no simple latch block to start with (the backedge comes > directly out of the loop exit block). So my suggested improvement > (testing if the latch was empty and only then reject the thread), would > solve this. Well, there's the thing. Loop discovery is marking BB5 as the latch, so it's not getting threaded: Checking profitability of path (backwards): bb:6 (2 insns) bb:5 (latch) > >> Would it be crazy to suggest that we disable threading through latches >> altogether, > > I think it wouldn't be crazy, but we can do a bit better as suggested > above (only reject empty latches, and reject it only for the threaders > coming before the loop optims). 
BTW, I'm not sure your check for the non-last position makes a difference: > diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c > index 449232c7715..528a753b886 100644 > --- a/gcc/tree-ssa-threadbackward.c > +++ b/gcc/tree-ssa-threadbackward.c > @@ -600,6 +600,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, > loop_p loop = m_path[0]->loop_father; > bool path_crosses_loops = false; > bool threaded_through_latch = false; > + bool latch_within_path = false; > bool multiway_branch_in_path = false; > bool threaded_multiway_branch = false; > bool contains_hot_bb = false; > @@ -725,7 +726,13 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, > the last entry in the array when determining if we thread > through the loop latch. */ > if (loop->latch == bb) > - threaded_through_latch = true; > + { > + threaded_through_latch = true; > + if (j != 0) > + latch_within_path = true; > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, " (latch)"); > + } > } If the last position being considered is a simple latch, it only has a simple outgoing jump. There's no need to thread that. You need a block with >= 2 succ edges to thread anything. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 13:37 ` Aldy Hernandez @ 2021-09-09 14:44 ` Michael Matz 2021-09-09 15:07 ` Aldy Hernandez 2021-09-10 7:04 ` Aldy Hernandez 0 siblings, 2 replies; 33+ messages in thread From: Michael Matz @ 2021-09-09 14:44 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod Hello, On Thu, 9 Sep 2021, Aldy Hernandez wrote: > > Here there's no simple latch block to start with (the backedge comes > > directly out of the loop exit block). So my suggested improvement > > (testing if the latch was empty and only then reject the thread), would > > solve this. > > Well, there's the thing. Loop discovery is marking BB5 as the latch, so > it's not getting threaded: Yes, it's a latch, but not an empty one. So the thread would make it just even more non-empty, which might not be a problem anymore. So amending my patch somewhere with a strategic && empty_block_p (latch) and only then rejecting the thread should make this testcase work again. (There's still a catch, though: if this non-empty latch, which is also the exit test block, is threaded through and is followed by actual code, then that code will be inserted onto the back edge, not into the latch block before the exit test, and so also create a (new) non-empty latch. 
That might or might not create problems downstream, but not as many as transforming an empty into a non-empty latch would do; but it could create for instance a second back edge (not in this testcase) and suchlike) > BTW, I'm not sure your check for the non-last position makes a difference: > > diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c > > index 449232c7715..528a753b886 100644 > > --- a/gcc/tree-ssa-threadbackward.c > > +++ b/gcc/tree-ssa-threadbackward.c > > - threaded_through_latch = true; > > + { > > + threaded_through_latch = true; > > + if (j != 0) > > + latch_within_path = true; > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > + fprintf (dump_file, " (latch)"); > > + } > > } > > If the last position being considered is a simple latch, it only has a simple > outgoing jump. There's no need to thread that. You need a block with >= 2 > succ edges to thread anything. So, you are saying that any candidate thread path wouldn't have the latch in the last position if it were just an empty forwarder? I simply wasn't sure about that, so was conservative and only wanted to reject things I knew were positively bad (namely code in the path following the latch that is in danger of being moved into the latch). Ciao, Michael. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 14:44 ` Michael Matz @ 2021-09-09 15:07 ` Aldy Hernandez 2021-09-10 7:04 ` Aldy Hernandez 1 sibling, 0 replies; 33+ messages in thread From: Aldy Hernandez @ 2021-09-09 15:07 UTC (permalink / raw) To: Michael Matz; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod On 9/9/21 4:44 PM, Michael Matz wrote: > Hello, > > On Thu, 9 Sep 2021, Aldy Hernandez wrote: > >>> Here there's no simple latch block to start with (the backedge comes >>> directly out of the loop exit block). So my suggested improvement >>> (testing if the latch was empty and only then reject the thread), would >>> solve this. >> >> Well, there's the thing. Loop discovery is marking BB5 as the latch, so >> it's not getting threaded: > > Yes, it's a latch, but not an empty one. So the thread would make it just > even more non-empty, which might not be a problem anymore. So amending my > patch somewhere with a strategic > > && empty_block_p (latch) > > and only then rejecting the thread should make this testcase work again. Sweet. > > (There's still a catch, though: if this non-empty latch, which is also the > exit test block, is threaded through and is followed by actual code, then > that code will be inserted onto the back edge, not into the latch block > before the exit test, and so also create a (new) non-empty latch. 
That > might or might not create problems downstreams, but not as many as > transformaing an empty into a non-empty latch would do; but it could > create for instance a second back edge (not in this testcase) and > suchlike) > >> BTW, I'm not sure your check for the non-last position makes a difference: >> >>> diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c >>> index 449232c7715..528a753b886 100644 >>> --- a/gcc/tree-ssa-threadbackward.c >>> +++ b/gcc/tree-ssa-threadbackward.c >>> - threaded_through_latch = true; >>> + { >>> + threaded_through_latch = true; >>> + if (j != 0) >>> + latch_within_path = true; >>> + if (dump_file && (dump_flags & TDF_DETAILS)) >>> + fprintf (dump_file, " (latch)"); >>> + } >>> } >> >> If the last position being considered is a simple latch, it only has a simple >> outgoing jump. There's no need to thread that. You need a block with >= 2 >> succ edges to thread anything. > > So, you are saying that any candidate thread path wouldn't have the latch > in the last position if it were just an empty forwarder? I simply wasn't > sure about that, so was conservative and only wanted to reject things I > knew where positively bad (namely code in the path following the latch > that is in danger of being moved into the latch). Correct. In the backwards threader, we don't start the party if the last block has < 2 succs: FOR_EACH_BB_FN (bb, fun) { if (EDGE_COUNT (bb->succs) > 1) threader.maybe_thread_block (bb); } Thanks. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 14:44 ` Michael Matz 2021-09-09 15:07 ` Aldy Hernandez @ 2021-09-10 7:04 ` Aldy Hernandez 1 sibling, 0 replies; 33+ messages in thread From: Aldy Hernandez @ 2021-09-10 7:04 UTC (permalink / raw) To: Michael Matz Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod, gcc-patches [-- Attachment #1: Type: text/plain, Size: 2187 bytes --] I think things are clear enough to propose a patch. Thanks for everyone's input. I have added some comments and tweaked Michael's patch to include the final BB where we're threading to, in the check. Without this last check, we fail in tree-ssa/cunroll-15.c because the latch becomes polluted with PHI nodes. OK for trunk? Aldy commit e735a2cd01485773635bd66c97af21059992d5a7 (HEAD -> pending/global-ranges) Author: Aldy Hernandez <aldyh@redhat.com> Date: Thu Sep 9 20:30:28 2021 +0200 Disable threading through latches until after loop optimizations. The motivation for this patch was enabling the use of global ranges in the path solver, but this caused certain properties of loops to be destroyed, which made subsequent loop optimizations fail. Consequently, this patch's main goal is to disable jump threading involving the latch until after loop optimizations have run. I have added a bit in cfun to indicate when the "loopdone" optimization has completed. This helps the path threader determine when it's safe to thread through loop structures. I can adapt the patch if there is an alternate way of determining this. As can be seen in the test adjustments, we mostly shift the threading from the early threaders (ethread, thread[12]) to the late threaders (thread[34]). I have nuked some of the early notes in the testcases that came as part of the jump threader rewrite. They're mostly noise now. 
Note that we could probably relax some other restrictions in profitable_path_p when loop optimizations have completed, but it would require more testing, and I'm hesitant to touch more things than needed at this point. I have added a reminder to the function to keep this in mind. Finally, perhaps as a follow-up, we should apply the same restrictions to the forward threader. At some point I'd like to combine the cost models. Tested on x86-64 Linux. p.s. There is a thorough discussion involving the limitations of jump threading involving loops here: https://gcc.gnu.org/pipermail/gcc/2021-September/237247.html [-- Attachment #2: 0001-Disable-threading-through-latches-until-after-loop-o.patch --] [-- Type: text/x-patch, Size: 11395 bytes --] From e735a2cd01485773635bd66c97af21059992d5a7 Mon Sep 17 00:00:00 2001 From: Aldy Hernandez <aldyh@redhat.com> Date: Thu, 9 Sep 2021 20:30:28 +0200 Subject: [PATCH] Disable threading through latches until after loop optimizations. The motivation for this patch was enabling the use of global ranges in the path solver, but this caused certain properties of loops to be destroyed, which made subsequent loop optimizations fail. Consequently, this patch's main goal is to disable jump threading involving the latch until after loop optimizations have run. I have added a bit in cfun to indicate when the "loopdone" optimization has completed. This helps the path threader determine when it's safe to thread through loop structures. I can adapt the patch if there is an alternate way of determining this. As can be seen in the test adjustments, we mostly shift the threading from the early threaders (ethread, thread[12]) to the late threaders (thread[34]). I have nuked some of the early notes in the testcases that came as part of the jump threader rewrite. They're mostly noise now. 
Note that we could probably relax some other restrictions in profitable_path_p when loop optimizations have completed, but it would require more testing, and I'm hesitant to touch more things than needed at this point. I have added a reminder to the function to keep this in mind. Finally, perhaps as a follow-up, we should apply the same restrictions to the forward threader. At some point I'd like to combine the cost models. Tested on x86-64 Linux. p.s. There is a thorough discussion involving the limitations of jump threading involving loops here: https://gcc.gnu.org/pipermail/gcc/2021-September/237247.html gcc/ChangeLog: * function.h (struct function): Add loop_optimizers_done. (set_loop_optimizers_done): New. (loop_optimizers_done_p): New. * gimple-range-path.cc (path_range_query::internal_range_of_expr): Intersect with global range. * tree-ssa-loop.c (tree_ssa_loop_done): Call set_loop_optimizers_done. * tree-ssa-threadbackward.c (back_threader_profitability::profitable_path_p): Disable threading through latches until after loop optimizations have run. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/ssa-dom-thread-2b.c: Adjust for disabling of threading through latches. * gcc.dg/tree-ssa/ssa-dom-thread-6.c: Same. * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Same. Co-authored-by: Michael Matz <matz@suse.de> --- gcc/function.h | 15 ++++++++ gcc/gimple-range-path.cc | 3 ++ .../gcc.dg/tree-ssa/ssa-dom-thread-2b.c | 4 +- .../gcc.dg/tree-ssa/ssa-dom-thread-6.c | 37 +------------------ .../gcc.dg/tree-ssa/ssa-dom-thread-7.c | 17 +-------- gcc/tree-ssa-loop.c | 1 + gcc/tree-ssa-threadbackward.c | 28 +++++++++++++- 7 files changed, 50 insertions(+), 55 deletions(-) diff --git a/gcc/function.h b/gcc/function.h index 36003e7576a..fb99b6a44fc 100644 --- a/gcc/function.h +++ b/gcc/function.h @@ -438,6 +438,9 @@ struct GTY(()) function { /* Set if there are any OMP_TARGET regions in the function. */ unsigned int has_omp_target : 1; + + /* Set if SSA loop optimizers have completed. 
*/ + unsigned int loop_optimizers_done : 1; }; /* Add the decl D to the local_decls list of FUN. */ @@ -514,6 +517,18 @@ set_loops_for_fn (struct function *fn, struct loops *loops) fn->x_current_loops = loops; } +inline void +set_loop_optimizers_done (struct function *fn) +{ + fn->loop_optimizers_done = 1; +} + +inline bool +loop_optimizers_done_p (struct function *fn) +{ + return fn->loop_optimizers_done; +} + /* For backward compatibility... eventually these should all go away. */ #define current_function_funcdef_no (cfun->funcdef_no) diff --git a/gcc/gimple-range-path.cc b/gcc/gimple-range-path.cc index a4fa3b296ff..c616b65756f 100644 --- a/gcc/gimple-range-path.cc +++ b/gcc/gimple-range-path.cc @@ -127,6 +127,9 @@ path_range_query::internal_range_of_expr (irange &r, tree name, gimple *stmt) basic_block bb = stmt ? gimple_bb (stmt) : exit_bb (); if (stmt && range_defined_in_block (r, name, bb)) { + if (TREE_CODE (name) == SSA_NAME) + r.intersect (gimple_range_global (name)); + set_cache (r, name); return true; } diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c index e1c33e86cd7..823ada982ff 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-dom2-stats -fdisable-tree-ethread" } */ +/* { dg-options "-O2 -fdump-tree-thread3-stats -fdump-tree-dom2-stats -fdisable-tree-ethread" } */ void foo(); void bla(); @@ -26,4 +26,4 @@ void thread_latch_through_header (void) case. And we want to thread through the header as well. These are both caught by threading in DOM. 
*/ /* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2"} } */ -/* { dg-final { scan-tree-dump-times "Jumps threaded: 1" 1 "thread1"} } */ +/* { dg-final { scan-tree-dump-times "Jumps threaded: 1" 1 "thread3"} } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c index c7bf867b084..ee46759bacc 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c @@ -1,41 +1,8 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread2-details" } */ +/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3-details" } */ -/* All the threads in the thread1 dump start on a X->BB12 edge, as can - be seen in the dump: - - Registering FSM jump thread: (x, 12) incoming edge; ... - etc - etc - - Before the new evrp, we were threading paths that started at the - following edges: - - Registering FSM jump thread: (10, 12) incoming edge - Registering FSM jump thread: (6, 12) incoming edge - Registering FSM jump thread: (9, 12) incoming edge - - This was because the PHI at BB12 had constant values coming in from - BB10, BB6, and BB9: - - # state_10 = PHI <state_11(7), 0(10), state_11(5), 1(6), state_11(8), 2(9), state_11(11)> - - Now with the new evrp, we get: - - # state_10 = PHI <0(7), 0(10), state_11(5), 1(6), 0(8), 2(9), 1(11)> - - Thus, we have 3 more paths that are known to be constant and can be - threaded. Which means that by the second threading pass, we can - only find one profitable path. - - For the record, all these extra constants are better paths coming - out of switches. For example: - - SWITCH_BB -> BBx -> BBy -> BBz -> PHI - - We now know the value of the switch index at PHI. 
*/ /* { dg-final { scan-tree-dump-times "Registering FSM jump" 6 "thread1" } } */ -/* { dg-final { scan-tree-dump-times "Registering FSM jump" 1 "thread2" } } */ +/* { dg-final { scan-tree-dump-times "Registering FSM jump" 1 "thread3" } } */ int sum0, sum1, sum2, sum3; int foo (char *s, char **ret) diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c index 5fc2145a432..ba07942f9dd 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c @@ -1,23 +1,8 @@ /* { dg-do compile } */ /* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */ -/* Here we have the same issue as was commented in ssa-dom-thread-6.c. - The PHI coming into the threader has a lot more constants, so the - threader can thread more paths. - -$ diff clean/a.c.105t.mergephi2 a.c.105t.mergephi2 -252c252 -< # s_50 = PHI <s_49(10), 5(14), s_51(18), s_51(22), 1(26), 1(29), 1(31), s_51(5), 4(12), 1(15), 5(17), 1(19), 3(21), 1(23), 6(25), 7(28), s_51(30)> ---- -> # s_50 = PHI <s_49(10), 5(14), 4(18), 5(22), 1(26), 1(29), 1(31), s_51(5), 4(12), 1(15), 5(17), 1(19), 3(21), 1(23), 6(25), 7(28), 7(30)> -272a273 - - I spot checked a few and they all have the same pattern. We are - basically tracking the switch index better through multiple - paths. */ - /* { dg-final { scan-tree-dump "Jumps threaded: 18" "thread1" } } */ -/* { dg-final { scan-tree-dump "Jumps threaded: 8" "thread2" } } */ +/* { dg-final { scan-tree-dump "Jumps threaded: 8" "thread3" } } */ /* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2" } } */ /* aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. 
It's high enough diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c index 0cc4b3bbccf..c827a51d3ce 100644 --- a/gcc/tree-ssa-loop.c +++ b/gcc/tree-ssa-loop.c @@ -528,6 +528,7 @@ tree_ssa_loop_done (void) free_numbers_of_iterations_estimates (cfun); scev_finalize (); loop_optimizer_finalize (cfun, true); + set_loop_optimizers_done (cfun); return 0; } diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c index 449232c7715..ded31a9e2de 100644 --- a/gcc/tree-ssa-threadbackward.c +++ b/gcc/tree-ssa-threadbackward.c @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3. If not see #include "ssa.h" #include "tree-cfgcleanup.h" #include "tree-pretty-print.h" +#include "cfghooks.h" // Path registry for the backwards threader. After all paths have been // registered with register_path(), thread_through_all_blocks() is called @@ -564,7 +565,10 @@ back_threader_registry::thread_through_all_blocks (bool may_peel_loop_headers) TAKEN_EDGE, otherwise it is NULL. CREATES_IRREDUCIBLE_LOOP, if non-null is set to TRUE if threading this path - would create an irreducible loop. */ + would create an irreducible loop. + + ?? It seems we should be able to loosen some of the restrictions in + this function after loop optimizations have run. */ bool back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, @@ -725,7 +729,11 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, the last entry in the array when determining if we thread through the loop latch. */ if (loop->latch == bb) - threaded_through_latch = true; + { + threaded_through_latch = true; + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, " (latch)"); + } } gimple *stmt = get_gimple_control_stmt (m_path[0]); @@ -845,6 +853,22 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, "a multiway branch.\n"); return false; } + + /* Threading through a non-empty latch would cause code to be added + to the latch. 
This could alter the loop form sufficiently to + cause loop optimizations to fail. Disable these threads until + after loop optimizations have run. */ + if ((threaded_through_latch + || (taken_edge && taken_edge->dest == loop->latch)) + && !loop_optimizers_done_p (cfun) + && empty_block_p (loop->latch)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + " FAIL: FSM Thread through latch before loop opts would create non-empty latch\n"); + return false; + + } return true; } -- 2.31.1 ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 8:14 ` Aldy Hernandez 2021-09-09 8:24 ` Richard Biener 2021-09-09 12:52 ` Michael Matz @ 2021-09-09 16:54 ` Jeff Law 2 siblings, 0 replies; 33+ messages in thread From: Jeff Law @ 2021-09-09 16:54 UTC (permalink / raw) To: Aldy Hernandez, Michael Matz Cc: Richard Biener, GCC Mailing List, Andrew MacLeod On 9/9/2021 2:14 AM, Aldy Hernandez wrote: > > > On 9/8/21 8:13 PM, Michael Matz wrote: >> Hello, >> >> [lame answer to self] >> >> On Wed, 8 Sep 2021, Michael Matz wrote: >> >>>>> The forward threader guards against this by simply disallowing >>>>> threadings that involve different loops. As I see >>>> >>>> The thread in question (5->9->3) is all within the same outer loop, >>>> though. BTW, the backward threader also disallows threading across >>>> different loops (see path_crosses_loops variable). >> ... >>> Maybe it's possible to not disable threading over latches >>> alltogether in >>> the backward threader (like it's tried now), but I haven't looked at >>> the >>> specific situation here in depth, so take my view only as opinion >>> from a >>> large distance :-) >> >> I've now looked at the concrete situation. So yeah, the whole path >> is in >> the same loop, crosses the latch, _and there's code following the latch >> on that path_. (I.e. the latch isn't the last block in the path). In >> particular, after loop_optimizer_init() (before any threading) we have: >> >> <bb 3> [local count: 118111600]: >> # j_19 = PHI <j_13(9), 0(7)> >> sum_11 = c[j_19]; >> if (n_10(D) > 0) >> goto <bb 8>; [89.00%] >> else >> goto <bb 5>; [11.00%] >> >> <bb 8> [local count: 105119324]: >> ... 
>> >> <bb 5> [local count: 118111600]: >> # sum_21 = PHI <sum_14(4), sum_11(3)> >> c[j_19] = sum_21; >> j_13 = j_19 + 1; >> if (n_10(D) > j_13) >> goto <bb 9>; [89.00%] >> else >> goto <bb 6>; [11.00%] >> >> <bb 9> [local count: 105119324]: >> goto <bb 3>; [100.00%] >> >> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the >> pre-header of inner loop, but more importantly something that's not >> at the >> start of the outer loop. >> >> Now, any thread that includes the backedge 9->3 _including_ its >> destination (i.e. where the backedge isn't the last to-be-redirected >> edge) >> necessarily duplicates all code from that destination onto the back >> edge. >> Here it's the load from c[j] into sum_11. >> >> The important part is the code is emitted onto the back edge, >> conceptually; in reality it's simply included into the (new) latch block >> (the duplicate of bb9, which is bb12 intermediately, then named bb7 >> after >> cfg_cleanup). >> >> That's what we can't have for some of our structural loop optimizers: >> there must be no code executed after the exit test (e.g. in the latch >> block). (This requirement makes reasoning about which code is or isn't >> executed completely for an iteration trivial; simply everything in the >> body is always executed; e.g. loop interchange uses this to check that >> there are no memory references after the exit test, because those would >> then be only conditional and hence make loop interchange very awkward). >> >> Note that this situation can't be later rectified anymore: the >> duplicated >> instructions (because they are memory refs) must remain after the exit >> test. Only by rerolling/unrotating the loop (i.e. noticing that the >> memory refs on the loop-entry path and on the back edge are equivalent) >> would that be possible, but that's something we aren't capable of. 
Even >> if we were that would simply just revert the whole work that the >> threader >> did, so it's better to not even do that to start with. >> >> I believe something like below would be appropriate, it disables >> threading >> if the path contains a latch at the non-last position (due to being >> backwards on the non-first position in the array). I.e. it disables >> rotating the loop if there's danger of polluting the back edge. It might >> be improved if the blocks following (preceding!) the latch are themself >> empty because then no code is duplicated. It might also be improved if >> the latch is already non-empty. That code should probably only be >> active >> before the loop optimizers, but currently the backward threader isn't >> differentiating between before/after loop-optims. >> >> I haven't tested this patch at all, except that it fixes the testcase :) > > Thanks for looking at this. > > I think you're onto something with this approach. Perhaps in addition > to the loop header threading Richard mentions. > > Your patch causes some regressions, but I think most are noise from > FSM tests that must be adjusted. However, there are some other ones > that are curious: > > > FAIL: gcc.dg/tree-ssa/ldist-22.c scan-tree-dump ldist "generated > memset zero" > > FAIL: gcc.dg/tree-ssa/pr66752-3.c scan-tree-dump-not dce2 "if .flag" > < XFAIL: gcc.dg/shrink-wrap-loop.c scan-rtl-dump pro_and_epilogue > "Performing shrink-wrapping" > < XFAIL: gcc.dg/Warray-bounds-87.c pr101671 (test for bogus messages, > line 36) > > FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times > graphite "1 loops carried no dependency" 1 > > FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times > optimized "loopfn.1" 4 > > FAIL: libgomp.graphite/force-parallel-8.c scan-tree-dump-times > graphite "5 loops carried no dependency" 1 > > Interestingly your patch is fixing shrink-wrap-loop.c and > Warray-bounds-87, both of which were introduced by the backward > threader rewrite. 
At least the Warray-bounds was the threader peeling > off an iteration that caused a bogus warning. > > The ldist-22 regression is interesting though: > > void foo () > { > int i; > > <bb 2> : > goto <bb 6>; [INV] > > <bb 3> : > a[i_1] = 0; > if (i_1 > 100) > goto <bb 4>; [INV] > else > goto <bb 5>; [INV] > > <bb 4> : > b[i_1] = i_1; > > <bb 5> : > i_8 = i_1 + 1; > > <bb 6> : > # i_1 = PHI <0(2), i_8(5)> > if (i_1 <= 1023) > goto <bb 3>; [INV] > else > goto <bb 7>; [INV] > > <bb 7> : > return; > > } > > Here we fail to look past 5->6 because BB5 is the latch and is not the > last block in the path. So we fail to thread 3->5->6->3. Doing so > would have split the function into two loops, one of which could use a > memset: > > void foo () > { > int i; > > <bb 2> : > goto <bb 6>; [INV] > > <bb 3> : > # i_12 = PHI <i_1(6), i_9(4)> > a[i_12] = 0; > if (i_12 > 100) > goto <bb 5>; [INV] > else > goto <bb 4>; [INV] > > <bb 4> : > i_9 = i_12 + 1; > goto <bb 3>; [100.00%] > > <bb 5> : > b[i_12] = i_12; > i_8 = i_12 + 1; > > <bb 6> : > # i_1 = PHI <0(2), i_8(5)> > if (i_1 <= 1023) > goto <bb 3>; [INV] > else > goto <bb 7>; [INV] > > <bb 7> : > return; > > } > > I would have to agree that threading through latches is problematic. > For that matter, the ldist-22 test shows that we're depending on the > threader to do work that seems to belong in the loop optimizer world. > > Would it be crazy to suggest that we disable threading through latches > altogether, and do whatever we're missing in the loop world? It seems > loop has all the tools, cost model, and framework to do so. Of > course, I know 0 about loop, and would hate to add work to other's > plates. Threading through latches can be extremely problematical as they can change the underlying loop structure. For example, it can take a simple loop and turn it into a loop nest. 
I thought we'd already disabled threading through latches except for the case where doing so allows us to thread through an indirect jump (i.e., the original motivation for the FSM threader). As I mentioned in our private email, threading through headers, through latches, across loops, etc. has a number of negative effects and we generally want to avoid that, at least in the instances before loop optimization and vectorization. The threader doesn't have any cost model for what is effectively loop peeling, loop rotation, loop header copying and the like. Zdenek argued those belong in the loop optimizers, not jump threading, and I broadly agree with that assessment. Jeff ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2021-09-10 16:38 UTC | newest] Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-09-07 11:49 More aggressive threading causing loop-interchange-9.c regression Aldy Hernandez 2021-09-07 14:45 ` Michael Matz 2021-09-08 10:44 ` Aldy Hernandez 2021-09-08 13:13 ` Richard Biener 2021-09-08 13:25 ` Aldy Hernandez 2021-09-08 13:49 ` Richard Biener 2021-09-08 16:19 ` Aldy Hernandez 2021-09-08 16:39 ` Michael Matz 2021-09-08 18:13 ` Michael Matz 2021-09-09 6:57 ` Richard Biener 2021-09-09 7:37 ` Aldy Hernandez 2021-09-09 7:45 ` Richard Biener 2021-09-09 8:36 ` Aldy Hernandez 2021-09-09 8:58 ` Richard Biener 2021-09-09 9:21 ` Aldy Hernandez 2021-09-09 10:15 ` Richard Biener 2021-09-09 11:28 ` Aldy Hernandez 2021-09-10 15:51 ` Jeff Law 2021-09-10 16:11 ` Aldy Hernandez 2021-09-10 15:43 ` Jeff Law 2021-09-10 16:05 ` Aldy Hernandez 2021-09-10 16:21 ` Jeff Law 2021-09-10 16:38 ` Aldy Hernandez 2021-09-09 16:59 ` Jeff Law 2021-09-09 12:47 ` Michael Matz 2021-09-09 8:14 ` Aldy Hernandez 2021-09-09 8:24 ` Richard Biener 2021-09-09 12:52 ` Michael Matz 2021-09-09 13:37 ` Aldy Hernandez 2021-09-09 14:44 ` Michael Matz 2021-09-09 15:07 ` Aldy Hernandez 2021-09-10 7:04 ` Aldy Hernandez 2021-09-09 16:54 ` Jeff Law