* [PATCH] Fix incorrect computation in fill_always_executed_in_1
@ 2021-08-16  8:46 Xiong Hu Luo
  2021-08-16 11:46 ` Richard Biener
  2021-08-17 20:59 ` [PATCH] Fix incorrect " Segher Boessenkool
  0 siblings, 2 replies; 17+ messages in thread
From: Xiong Hu Luo @ 2021-08-16  8:46 UTC (permalink / raw)
  To: gcc-patches
  Cc: segher, dje.gcc, wschmidt, guojiufu, linkw, rguenther, Xiong Hu Luo

It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for
nested loops: inn_loop is updated to the inner loop, so it needs to be
restored when exiting from the innermost loop.  With this patch, the
store instruction in the outer loop can also be moved out of the outer
loop by store motion.  Any comments?  Thanks.

gcc/ChangeLog:

	* tree-ssa-loop-im.c (fill_always_executed_in_1): Restore
	inn_loop when exiting from innermost loop.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++
 gcc/tree-ssa-loop-im.c                     |  6 +++++-
 2 files changed, 29 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
new file mode 100644
index 00000000000..097a5ee4a4b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
@@ -0,0 +1,24 @@
+/* PR/101293 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+struct X { int i; int j; int k;};
+
+void foo(struct X *x, int n, int l)
+{
+  for (int j = 0; j < l; j++)
+    {
+      for (int i = 0; i < n; ++i)
+	{
+	  int *p = &x->j;
+	  int tem = *p;
+	  x->j += tem * i;
+	}
+      int *r = &x->k;
+      int tem2 = *r;
+      x->k += tem2 * j;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */
+
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index b24bc64f2a7..5ca4738b20e 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
       if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
	last = bb;

+      if (inn_loop != loop
+	  && flow_loop_nested_p (bb->loop_father, inn_loop))
+	inn_loop = bb->loop_father;
+
       if (bitmap_bit_p (contains_call, bb->index))
	break;

@@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)

       if (bb->loop_father->header == bb)
	{
-	  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
+	  if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, bb))
	    break;

	  /* In a loop that is always entered we may proceed anyway.
-- 
2.27.0.90.geebb51ba8c

^ permalink raw reply	[flat|nested] 17+ messages in thread
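To make the intent of the patch concrete, here is a hand-written sketch of the transform the testcase hopes for: with the fix, store motion can carry x->j in a scalar across the inner loop and x->k across the outer loop, each written back once on loop exit. This is an illustration of the expected effect, not actual compiler output; the `j_lsm`/`k_lsm` names merely mimic LIM's `x__lsm.*` temporaries.

```c
#include <assert.h>

struct X { int i; int j; int k; };

/* Original loop nest from the testcase.  */
static void foo (struct X *x, int n, int l)
{
  for (int j = 0; j < l; j++)
    {
      for (int i = 0; i < n; ++i)
	{
	  int *p = &x->j;
	  int tem = *p;
	  x->j += tem * i;
	}
      int *r = &x->k;
      int tem2 = *r;
      x->k += tem2 * j;
    }
}

/* Sketch of the code after both store motions: loads hoisted before,
   and stores sunk after, the loop each ref is invariant in.  */
static void foo_sm (struct X *x, int n, int l)
{
  if (l > 0)
    {
      int k_lsm = x->k;		/* load hoisted out of loop 1 */
      for (int j = 0; j < l; j++)
	{
	  if (n > 0)
	    {
	      int j_lsm = x->j;	/* load hoisted out of loop 2 */
	      for (int i = 0; i < n; ++i)
		j_lsm += j_lsm * i;
	      x->j = j_lsm;	/* store sunk after loop 2 */
	    }
	  k_lsm += k_lsm * j;
	}
      x->k = k_lsm;		/* store sunk after loop 1 */
    }
}
```

Note that the x->j load/store stays guarded by n > 0 (loop 2 is conditional), which is exactly why the x->j ref is not always executed in loop 1 while the x->k ref is.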
* Re: [PATCH] Fix incorrect computation in fill_always_executed_in_1
  2021-08-16  8:46 [PATCH] Fix incorrect computation in fill_always_executed_in_1 Xiong Hu Luo
@ 2021-08-16 11:46 ` Richard Biener
  2021-08-17  5:17   ` Xionghu Luo
  2021-08-17 20:59 ` [PATCH] Fix incorrect " Segher Boessenkool
  1 sibling, 1 reply; 17+ messages in thread
From: Richard Biener @ 2021-08-16 11:46 UTC (permalink / raw)
  To: Xiong Hu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw

On Mon, 16 Aug 2021, Xiong Hu Luo wrote:

> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for
> nested loops. inn_loop is updated to inner loop, so it need be restored
> when exiting from innermost loop. With this patch, the store instruction
> in outer loop could also be moved out of outer loop by store motion.
> Any comments? Thanks.

> gcc/ChangeLog:
>
> 	* tree-ssa-loop-im.c (fill_always_executed_in_1): Restore
> 	inn_loop when exiting from innermost loop.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++
>  gcc/tree-ssa-loop-im.c                     |  6 +++++-
>  2 files changed, 29 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> new file mode 100644
> index 00000000000..097a5ee4a4b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> @@ -0,0 +1,24 @@
> +/* PR/101293 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +struct X { int i; int j; int k;};
> +
> +void foo(struct X *x, int n, int l)
> +{
> +  for (int j = 0; j < l; j++)
> +    {
> +      for (int i = 0; i < n; ++i)
> +	{
> +	  int *p = &x->j;
> +	  int tem = *p;
> +	  x->j += tem * i;
> +	}
> +      int *r = &x->k;
> +      int tem2 = *r;
> +      x->k += tem2 * j;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */
> +
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index b24bc64f2a7..5ca4738b20e 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>        if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> 	 last = bb;
>
> +      if (inn_loop != loop
> +	  && flow_loop_nested_p (bb->loop_father, inn_loop))
> +	inn_loop = bb->loop_father;
> +

The comment says

      /* In a loop that is always entered we may proceed anyway.
	 But record that we entered it and stop once we leave it.  */
      inn_loop = bb->loop_father;

and your change would defeat that early return, no?
>        if (bitmap_bit_p (contains_call, bb->index))
> 	 break;
>
> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>
>        if (bb->loop_father->header == bb)
> 	 {
> -	   if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> +	   if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, bb))
> 	     break;

That's now an always-false condition - a loop's latch is always dominated
by its header.  The condition as written tries to verify whether the
loop is always entered - mind we visit all blocks, not only those
always executed.

In fact for your testcase the x->j ref is _not_ always executed
since the inner loop is conditional on n > 0.

Richard.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH] Fix incorrect computation in fill_always_executed_in_1
  2021-08-16 11:46 ` Richard Biener
@ 2021-08-17  5:17   ` Xionghu Luo
  2021-08-17  5:24     ` Xionghu Luo
  2021-08-17  7:12     ` Richard Biener
  0 siblings, 2 replies; 17+ messages in thread
From: Xionghu Luo @ 2021-08-17  5:17 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw

[-- Attachment #1: Type: text/plain, Size: 5254 bytes --]

Hi,

On 2021/8/16 19:46, Richard Biener wrote:
> On Mon, 16 Aug 2021, Xiong Hu Luo wrote:
>
>> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for
>> nested loops. inn_loop is updated to inner loop, so it need be restored
>> when exiting from innermost loop. With this patch, the store instruction
>> in outer loop could also be moved out of outer loop by store motion.
>> Any comments? Thanks.
>
>> gcc/ChangeLog:
>>
>> 	* tree-ssa-loop-im.c (fill_always_executed_in_1): Restore
>> 	inn_loop when exiting from innermost loop.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>> --- >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++ >> gcc/tree-ssa-loop-im.c | 6 +++++- >> 2 files changed, 29 insertions(+), 1 deletion(-) >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >> >> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >> new file mode 100644 >> index 00000000000..097a5ee4a4b >> --- /dev/null >> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >> @@ -0,0 +1,24 @@ >> +/* PR/101293 */ >> +/* { dg-do compile } */ >> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ >> + >> +struct X { int i; int j; int k;}; >> + >> +void foo(struct X *x, int n, int l) >> +{ >> + for (int j = 0; j < l; j++) >> + { >> + for (int i = 0; i < n; ++i) >> + { >> + int *p = &x->j; >> + int tem = *p; >> + x->j += tem * i; >> + } >> + int *r = &x->k; >> + int tem2 = *r; >> + x->k += tem2 * j; >> + } >> +} >> + >> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ >> + >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >> index b24bc64f2a7..5ca4738b20e 100644 >> --- a/gcc/tree-ssa-loop-im.c >> +++ b/gcc/tree-ssa-loop-im.c >> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) >> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >> last = bb; >> >> + if (inn_loop != loop >> + && flow_loop_nested_p (bb->loop_father, inn_loop)) >> + inn_loop = bb->loop_father; >> + > > The comment says > > /* In a loop that is always entered we may proceed anyway. > But record that we entered it and stop once we leave it. > */ > inn_loop = bb->loop_father; > > and your change would defeat that early return, no? The issue is the search method exits too early when iterating the outer loop. For example of a nested loop, loop 1 includes 5,8,3,10,4,9 and loop2 includes 3,10. Currently, it breaks when bb is 3 as bb 3 doesn't dominate bb 9 of loop 1. 
But actually, both bb 5 and bb 4 are
ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4
they won't be processed by store motion again.


5<----
|\    |
8 \   9
|  \  |
--->3--->4
|   |
10---|


SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now.  With this
patch, the search continues when it meets bb 3, on to bb 4; last is then
updated to bb 4, and the walk breaks once the exit edge is found at bb 4 by
"if (!flow_bb_inside_loop_p (loop, e->dest))".  The loop below then sets
bb 4 as ALWAYS_EXECUTED, together with all of its immediate dominators up
to bb 5:

  while (1)
    {
      SET_ALWAYS_EXECUTED_IN (last, loop);
      if (last == loop->header)
	break;
      last = get_immediate_dominator (CDI_DOMINATORS, last);
    }

After further discussion with Kewen, we found that the inn_loop variable
is totally useless and could be removed.

>
>>        if (bitmap_bit_p (contains_call, bb->index))
>> 	 break;
>>
>> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>>
>>        if (bb->loop_father->header == bb)
>> 	 {
>> -	   if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>> +	   if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, bb))
>> 	     break;
>
> That's now a always false condition - a loops latch is always dominated
> by its header.  The condition as written tries to verify whether the
> loop is always entered - mind we visit all blocks, not only those
> always executed.

Thanks for the catch!  I am afraid that piece of code should be removed,
since it stops the search for potential ALWAYS_EXECUTED bbs after the
inner loop...

>
> In fact for your testcase the x->j ref is _not_ always executed
> since the inner loop is conditional on n > 0.

Yes.  But I want to move x->k (not x->j) out of loop 1 when l > 0 in
store motion.
Attached the diff file without and with my patch to show the extra
optimization.

x->j is already moved out of loop 2 on master code.
If n and l are changed to constant numbers like 100, master code could
also do 2 store motions as expected.
The edge from bb 5 to bb 4 doesn't exist now, so bb 4, bb 3 and bb 5 are ALWAYS EXECUTED for loop 1. struct X { int i; int j; int k;}; void foo(struct X *x, int n, int l) { for (int j = 0; j < l; j++) // loop 1 { for (int i = 0; i < n; ++i) // loop 2 { int *p = &x->j; int tem = *p; x->j += tem * i; } int *r = &x->k; int tem2 = *r; x->k += tem2 * j; } } > > Richard. > -- Thanks, Xionghu [-- Attachment #2: ssa-lim-19.c.137t.lim2.diff --] [-- Type: text/plain, Size: 6665 bytes --] --- ssa-lim-19.c.137t.lim2 2021-08-16 03:27:04.173709326 -0500 +++ ../debug/ssa-lim-19.c.137t.lim2 2021-08-16 03:26:47.866261053 -0500 @@ -34,141 +34,172 @@ Basic block 5 (loop 1 -- depth 1): if (n_11(D) > 0) invariant up to level 1, cost 20. Basic block 8 (loop 1 -- depth 1): Basic block 3 (loop 2 -- depth 2): Basic block 4 (loop 1 -- depth 1): Basic block 9 (loop 1 -- depth 1): Basic block 10 (loop 2 -- depth 2): +Querying dependency of refs 2 and 1: independent. +Querying RAW dependencies of ref 2 in loop 2: independent +Querying RAW dependencies of ref 2 in loop 1: independent +Querying dependency of refs 2 and 1: independent. +Querying SM WAR dependencies of ref 2 in loop 2: independent +Querying SM WAR dependencies of ref 2 in loop 1: independent +Executing store motion of MEM[(int *)x_12(D) + 8B] from loop 1 Querying RAW dependencies of ref 1 in loop 2: independent Querying SM WAR dependencies of ref 1 in loop 2: independent Executing store motion of MEM[(int *)x_12(D) + 4B] from loop 2 Moving statement -x__lsm.4 = MEM[(int *)x_12(D) + 4B]; +x__lsm.5 = MEM[(int *)x_12(D) + 4B]; (cost 0) out of loop 2. +Moving statement +x__lsm.4 = MEM[(int *)x_12(D) + 8B]; +(cost 0) out of loop 1. 
+ Updating SSA: -creating PHI node in block #3 for x__lsm.4 +creating PHI node in block #5 for x__lsm.4 +creating PHI node in block #3 for x__lsm.5 Registering new PHI nodes in block #0 Registering new PHI nodes in block #2 Registering new PHI nodes in block #7 +Updating SSA information for statement x__lsm.4 = MEM[(int *)x_12(D) + 8B]; Registering new PHI nodes in block #5 Registering new PHI nodes in block #8 -Updating SSA information for statement x__lsm.4 = MEM[(int *)x_12(D) + 4B]; +Updating SSA information for statement x__lsm.5 = MEM[(int *)x_12(D) + 4B]; Registering new PHI nodes in block #3 -Updating SSA information for statement tem_16 = x__lsm.4; -Updating SSA information for statement x__lsm.4 = _2; -Registering new PHI nodes in block #11 -Updating SSA information for statement MEM[(int *)x_12(D) + 4B] = x__lsm.4; +Updating SSA information for statement tem_16 = x__lsm.5; +Updating SSA information for statement x__lsm.5 = _2; +Registering new PHI nodes in block #12 +Updating SSA information for statement MEM[(int *)x_12(D) + 4B] = x__lsm.5; Registering new PHI nodes in block #10 Registering new PHI nodes in block #4 -Updating SSA information for statement tem2_13 = MEM[(int *)x_12(D) + 8B]; -Updating SSA information for statement x_12(D)->k = _4; +Updating SSA information for statement tem2_13 = x__lsm.4; +Updating SSA information for statement x__lsm.4 = _4; +Registering new PHI nodes in block #11 +Updating SSA information for statement MEM[(int *)x_12(D) + 8B] = x__lsm.4; Registering new PHI nodes in block #9 Registering new PHI nodes in block #6 Updating SSA information for statement return; Symbols to be put in SSA form -{ D.3324 D.3329 } +{ D.3324 D.3329 D.3330 } Incremental SSA update started at block: 0 -Number of blocks in CFG: 12 -Number of blocks to update: 11 ( 92%) -Affected blocks: 0 2 3 4 5 6 7 8 9 10 11 +Number of blocks in CFG: 13 +Number of blocks to update: 12 ( 92%) +Affected blocks: 0 2 3 4 5 6 7 8 9 10 11 12 -;; Created LCSSA PHI: 
x__lsm.4_6 = PHI <x__lsm.4_5(3)> +;; Created LCSSA PHI: x__lsm.5_29 = PHI <x__lsm.5_6(3)> +;; Created LCSSA PHI: x__lsm.4_30 = PHI <x__lsm.4_21(4)> Updating SSA: +Registering new PHI nodes in block #5 +Registering new PHI nodes in block #8 Registering new PHI nodes in block #3 -Updating SSA information for statement x__lsm.4_5 = _2; -Registering new PHI nodes in block #11 -Updating SSA information for statement MEM[(int *)x_12(D) + 4B] = x__lsm.4_5; +Updating SSA information for statement x__lsm.5_6 = _2; +Registering new PHI nodes in block #12 +Updating SSA information for statement MEM[(int *)x_12(D) + 4B] = x__lsm.5_6; Registering new PHI nodes in block #10 +Registering new PHI nodes in block #4 +Updating SSA information for statement x__lsm.4_21 = _4; +Registering new PHI nodes in block #11 +Updating SSA information for statement MEM[(int *)x_12(D) + 8B] = x__lsm.4_21; +Registering new PHI nodes in block #9 SSA replacement table N_i -> { O_1 ... O_j } means that N_i replaces O_1, ..., O_j -x__lsm.4_6 -> { x__lsm.4_5 } -Incremental SSA update started at block: 3 -Number of blocks in CFG: 12 -Number of blocks to update: 3 ( 25%) -Affected blocks: 3 10 11 +x__lsm.5_29 -> { x__lsm.5_6 } +x__lsm.4_30 -> { x__lsm.4_21 } +Incremental SSA update started at block: 5 +Number of blocks in CFG: 13 +Number of blocks to update: 7 ( 54%) +Affected blocks: 3 4 5 9 10 11 12 void foo (struct X * x, int n, int l) { + int x__lsm.5; int x__lsm.4; int tem; int i; int tem2; int j; int _1; int _2; int _3; int _4; <bb 2> [local count: 14598063]: if (l_10(D) > 0) goto <bb 7>; [89.00%] else goto <bb 6>; [11.00%] <bb 7> [local count: 12992276]: + x__lsm.4_5 = MEM[(int *)x_12(D) + 8B]; goto <bb 5>; [100.00%] <bb 8> [local count: 105119324]: - x__lsm.4_8 = MEM[(int *)x_12(D) + 4B]; + x__lsm.5_7 = MEM[(int *)x_12(D) + 4B]; <bb 3> [local count: 955630225]: # i_24 = PHI <i_18(10), 0(8)> - # x__lsm.4_20 = PHI <x__lsm.4_5(10), x__lsm.4_8(8)> - tem_16 = x__lsm.4_20; + # x__lsm.5_8 = PHI 
<x__lsm.5_6(10), x__lsm.5_7(8)> + tem_16 = x__lsm.5_8; _1 = tem_16 * i_24; _2 = _1 + tem_16; - x__lsm.4_5 = _2; + x__lsm.5_6 = _2; i_18 = i_24 + 1; if (n_11(D) > i_18) goto <bb 10>; [89.00%] else - goto <bb 11>; [11.00%] + goto <bb 12>; [11.00%] <bb 10> [local count: 850510901]: goto <bb 3>; [100.00%] - <bb 11> [local count: 105119324]: - # x__lsm.4_6 = PHI <x__lsm.4_5(3)> - MEM[(int *)x_12(D) + 4B] = x__lsm.4_6; + <bb 12> [local count: 105119324]: + # x__lsm.5_29 = PHI <x__lsm.5_6(3)> + MEM[(int *)x_12(D) + 4B] = x__lsm.5_29; <bb 4> [local count: 118111600]: - tem2_13 = MEM[(int *)x_12(D) + 8B]; + tem2_13 = x__lsm.4_20; _3 = tem2_13 * j_23; _4 = _3 + tem2_13; - x_12(D)->k = _4; + x__lsm.4_21 = _4; j_15 = j_23 + 1; if (l_10(D) > j_15) goto <bb 9>; [89.00%] else - goto <bb 6>; [11.00%] + goto <bb 11>; [11.00%] <bb 9> [local count: 105119324]: <bb 5> [local count: 118111600]: # j_23 = PHI <j_15(9), 0(7)> + # x__lsm.4_20 = PHI <x__lsm.4_21(9), x__lsm.4_5(7)> if (n_11(D) > 0) goto <bb 8>; [89.00%] else goto <bb 4>; [11.00%] + <bb 11> [local count: 12992276]: + # x__lsm.4_30 = PHI <x__lsm.4_21(4)> + MEM[(int *)x_12(D) + 8B] = x__lsm.4_30; + <bb 6> [local count: 14598063]: return; } ^ permalink raw reply [flat|nested] 17+ messages in thread
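The marking loop discussed in the message above walks the immediate-dominator chain from `last` up to the loop header. Here is a standalone model of that walk; the block numbers and `idom` array follow the CFG figure in the discussion, and `always_executed` stands in for GCC's ALWAYS_EXECUTED_IN annotation (this is an illustrative sketch, not GCC code).

```c
#include <assert.h>

#define N_BLOCKS 11

/* Immediate dominators for the CFG in the discussion: loop 1's header
   bb 5 dominates bb 8, bb 9 and bb 3; the inner-loop header bb 3
   dominates bb 10 and bb 4.  Unlisted entries stay 0/unused.  */
static int idom[N_BLOCKS];
static int always_executed[N_BLOCKS];

/* Model of the final loop in fill_always_executed_in_1: mark LAST and
   every immediate dominator of it, up to and including HEADER.  */
static void
mark_always_executed (int last, int header)
{
  while (1)
    {
      always_executed[last] = 1;
      if (last == header)
	break;
      last = idom[last];
    }
}
```

With the patch the walk reaches bb 4 before breaking on the exit edge, so marking starts from `last = 4` and flags bb 4, bb 3 and bb 5; without it, only the header bb 5 is flagged.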
* Re: [PATCH] Fix incorrect computation in fill_always_executed_in_1 2021-08-17 5:17 ` Xionghu Luo @ 2021-08-17 5:24 ` Xionghu Luo 2021-08-17 7:12 ` Richard Biener 1 sibling, 0 replies; 17+ messages in thread From: Xionghu Luo @ 2021-08-17 5:24 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On 2021/8/17 13:17, Xionghu Luo via Gcc-patches wrote: > Hi, > > On 2021/8/16 19:46, Richard Biener wrote: >> On Mon, 16 Aug 2021, Xiong Hu Luo wrote: >> >>> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for >>> nested loops. inn_loop is updated to inner loop, so it need be restored >>> when exiting from innermost loop. With this patch, the store instruction >>> in outer loop could also be moved out of outer loop by store motion. >>> Any comments? Thanks. >> >>> gcc/ChangeLog: >>> >>> * tree-ssa-loop-im.c (fill_always_executed_in_1): Restore >>> inn_loop when exiting from innermost loop. >>> >>> gcc/testsuite/ChangeLog: >>> >>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. 
>>> --- >>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++ >>> gcc/tree-ssa-loop-im.c | 6 +++++- >>> 2 files changed, 29 insertions(+), 1 deletion(-) >>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>> >>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>> new file mode 100644 >>> index 00000000000..097a5ee4a4b >>> --- /dev/null >>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>> @@ -0,0 +1,24 @@ >>> +/* PR/101293 */ >>> +/* { dg-do compile } */ >>> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ >>> + >>> +struct X { int i; int j; int k;}; >>> + >>> +void foo(struct X *x, int n, int l) >>> +{ >>> + for (int j = 0; j < l; j++) >>> + { >>> + for (int i = 0; i < n; ++i) >>> + { >>> + int *p = &x->j; >>> + int tem = *p; >>> + x->j += tem * i; >>> + } >>> + int *r = &x->k; >>> + int tem2 = *r; >>> + x->k += tem2 * j; >>> + } >>> +} >>> + >>> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 >>> "lim2" } } */ >>> + >>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >>> index b24bc64f2a7..5ca4738b20e 100644 >>> --- a/gcc/tree-ssa-loop-im.c >>> +++ b/gcc/tree-ssa-loop-im.c >>> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, >>> sbitmap contains_call) >>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>> last = bb; >>> + if (inn_loop != loop >>> + && flow_loop_nested_p (bb->loop_father, inn_loop)) >>> + inn_loop = bb->loop_father; >>> + >> >> The comment says >> >> /* In a loop that is always entered we may proceed anyway. >> But record that we entered it and stop once we leave >> it. >> */ >> inn_loop = bb->loop_father; >> >> and your change would defeat that early return, no? > > The issue is the search method exits too early when iterating the outer > loop. For example of a nested loop, loop 1 includes 5,8,3,10,4,9 > and loop2 includes 3,10. 
Currently, it breaks when bb is 3
> doesn't dominate bb 9 of loop 1.  But actually, both bb 5 and bb 4 are
> ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4
> they won't be processed by store motion again.
>
>
> 5<----
> |\    |
> 8 \   9
> |  \  |
> --->3--->4
> |   |
> 10---|

Correct the graph display:

5<----
|\    |
8 \   9
|  \  |
--->3--->4
 |   |
 ---10

> SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now, with this
> patch, it will continue search when meet bb 3 until bb 4, then last is
> updated to bb 4, it will break until exit edge is found at bb 4 by
> "if (!flow_bb_inside_loop_p (loop, e->dest))". Then the followed loop
> code will set bb 4 as ALWAYS_EXEUCTED and all it's idoms bb 5.
>
>  while (1)
>    {
>      SET_ALWAYS_EXECUTED_IN (last, loop);
>      if (last == loop->header)
>        break;
>      last = get_immediate_dominator (CDI_DOMINATORS, last);
>    }
>
> After further discussion with Kewen, we found that the inn_loop variable
> is totally useless and could be removed.
>
>>
>>>        if (bitmap_bit_p (contains_call, bb->index))
>>> 	 break;
>>>
>>> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>>>
>>>        if (bb->loop_father->header == bb)
>>> 	 {
>>> -	   if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>>> +	   if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, bb))
>>> 	     break;
>>
>> That's now a always false condition - a loops latch is always dominated
>> by its header.  The condition as written tries to verify whether the
>> loop is always entered - mind we visit all blocks, not only those
>> always executed.
>
> Thanks for the catch!  I am afraid the piece of code should be removed
> since it stops search of potential ALWAYS EXECUTED bb after inner loop...
>
>>
>> In fact for your testcase the x->j ref is _not_ always executed
>> since the inner loop is conditional on n > 0.
>
> Yes.  But I want to move x->k (not x->j) out of loop 1 when l > 0 in
> store-motion.
> Attached the diff file without and with my patch to show the extra > optimization. > > x->j is already moved out of loop 2 on master code. > If change n and l to constant numbers like 100, master code could also > do 2 store > motions as expected. The edge from bb 5 to bb 4 doesn't exist now, so bb > 4, bb 3 > and bb 5 are ALWAYS EXECUTED for loop 1. > > > struct X { int i; int j; int k;}; > > void foo(struct X *x, int n, int l) > { > for (int j = 0; j < l; j++) // loop 1 > { > for (int i = 0; i < n; ++i) // loop 2 > { > int *p = &x->j; > int tem = *p; > x->j += tem * i; > } > int *r = &x->k; > int tem2 = *r; > x->k += tem2 * j; > } > } > >> >> Richard. >> > -- Thanks, Xionghu ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Fix incorrect computation in fill_always_executed_in_1 2021-08-17 5:17 ` Xionghu Luo 2021-08-17 5:24 ` Xionghu Luo @ 2021-08-17 7:12 ` Richard Biener 2021-08-17 9:10 ` [PATCH v2] Fix incomplete " Xionghu Luo 1 sibling, 1 reply; 17+ messages in thread From: Richard Biener @ 2021-08-17 7:12 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On Tue, 17 Aug 2021, Xionghu Luo wrote: > Hi, > > On 2021/8/16 19:46, Richard Biener wrote: > > On Mon, 16 Aug 2021, Xiong Hu Luo wrote: > > > >> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for > >> nested loops. inn_loop is updated to inner loop, so it need be restored > >> when exiting from innermost loop. With this patch, the store instruction > >> in outer loop could also be moved out of outer loop by store motion. > >> Any comments? Thanks. > > > >> gcc/ChangeLog: > >> > >> * tree-ssa-loop-im.c (fill_always_executed_in_1): Restore > >> inn_loop when exiting from innermost loop. > >> > >> gcc/testsuite/ChangeLog: > >> > >> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. 
> >> --- > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++ > >> gcc/tree-ssa-loop-im.c | 6 +++++- > >> 2 files changed, 29 insertions(+), 1 deletion(-) > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> > >> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> new file mode 100644 > >> index 00000000000..097a5ee4a4b > >> --- /dev/null > >> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> @@ -0,0 +1,24 @@ > >> +/* PR/101293 */ > >> +/* { dg-do compile } */ > >> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > >> + > >> +struct X { int i; int j; int k;}; > >> + > >> +void foo(struct X *x, int n, int l) > >> +{ > >> + for (int j = 0; j < l; j++) > >> + { > >> + for (int i = 0; i < n; ++i) > >> + { > >> + int *p = &x->j; > >> + int tem = *p; > >> + x->j += tem * i; > >> + } > >> + int *r = &x->k; > >> + int tem2 = *r; > >> + x->k += tem2 * j; > >> + } > >> +} > >> + > >> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } > >> */ > >> + > >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > >> index b24bc64f2a7..5ca4738b20e 100644 > >> --- a/gcc/tree-ssa-loop-im.c > >> +++ b/gcc/tree-ssa-loop-im.c > >> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap > >> @@ contains_call) > >> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >> last = bb; > >> + if (inn_loop != loop > >> + && flow_loop_nested_p (bb->loop_father, inn_loop)) > >> + inn_loop = bb->loop_father; > >> + > > > > The comment says > > > > /* In a loop that is always entered we may proceed anyway. > > But record that we entered it and stop once we leave it. > > */ > > inn_loop = bb->loop_father; > > > > and your change would defeat that early return, no? > > The issue is the search method exits too early when iterating the outer > loop. For example of a nested loop, loop 1 includes 5,8,3,10,4,9 > and loop2 includes 3,10. 
Currently, it breaks when bb is 3 as bb 3 > doesn't dominate bb 9 of loop 1. But actually, both bb 5 and bb 4 are > ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4 > they won't be processed by store motion again. > > > 5<---- > |\ | > 8 \ 9 > | \ | > --->3--->4 > | | > 10---| > > > SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now, with this > patch, it will continue search when meet bb 3 until bb 4, then last is updated > to bb 4, it will break until exit edge is found at bb 4 by > "if (!flow_bb_inside_loop_p (loop, e->dest))". Then the followed loop code > will > set bb 4 as ALWAYS_EXEUCTED and all it's idoms bb 5. > > > while (1) > { > SET_ALWAYS_EXECUTED_IN (last, loop); > if (last == loop->header) > break; > last = get_immediate_dominator (CDI_DOMINATORS, last); > } > > After further discussion with Kewen, we found that the inn_loop variable is > totally useless and could be removed. > > > > > >> if (bitmap_bit_p (contains_call, bb->index)) > >> break; > >> > >> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap > >> @@ contains_call) > >> > >> if (bb->loop_father->header == bb) > >> { > >> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >> + if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, > >> bb)) > >> break; > > > > That's now a always false condition - a loops latch is always dominated > > by its header. The condition as written tries to verify whether the > > loop is always entered - mind we visit all blocks, not only those > > always executed. > > Thanks for the catch! I am afraid the piece of code should be removed since > it stops > search of potential ALWAYS EXECUTED bb after inner loop... But the code says: /* In a loop that is always entered we may proceed anyway. But record that we entered it and stop once we leave it. */ and you do not remove this comment still it doesn't hold anymore after your patch. 
I don't say the current code is optimal - I just
say it does exactly what is documented.

> >
> > In fact for your testcase the x->j ref is _not_ always executed
> > since the inner loop is conditional on n > 0.
>
> Yes.  But I want to move x->k (not x->j) out of loop 1 when l > 0 in
> store-motion.

Sorry, that wasn't clear.  Yes, I agree the code fails to see always
executed blocks after inner loops.  But since the code simply walks
all blocks in a loop instead of greedily following edges it has
to do that since the inner loop might exit the outer loop as well,
sth your change wouldn't honor.  Consider

  while (--n)
    {
      do
	{
	  if (do_exit)
	    goto out;
	}
      while (1);
      p->x += 1;
    }
out:;

you'll see a CFG where the inner loop exits the outer loop as well.

So I'm saying to improve the code you'll likely have to do more.
And add a testcase like the following

void __attribute__((noipa))
foo (int n, int m, int f, int *p, int *q)
{
  while (--n)
    {
      do
	{
	  *q += 1;
	  if (f)
	    goto out;
	}
      while (--m);
      *p += 1;
    }
out:;
}

int main()
{
  int i = 0;
  foo (10, 10, 1, (void *)0, &i);
  if (i != 1)
    __builtin_abort ();
  return 0;
}

Richard.

> Attached the diff file without and with my patch to show the extra
> optimization.
>
> x->j is already moved out of loop 2 on master code.
> If change n and l to constant numbers like 100, master code could also
> do 2 store motions as expected.  The edge from bb 5 to bb 4 doesn't
> exist now, so bb 4, bb 3 and bb 5 are ALWAYS EXECUTED for loop 1.
>
>
> struct X { int i; int j; int k;};
>
> void foo(struct X *x, int n, int l)
> {
>   for (int j = 0; j < l; j++) // loop 1
>     {
>       for (int i = 0; i < n; ++i) // loop 2
> 	{
> 	  int *p = &x->j;
> 	  int tem = *p;
> 	  x->j += tem * i;
> 	}
>       int *r = &x->k;
>       int tem2 = *r;
>       x->k += tem2 * j;
>     }
> }
>
> >
> > Richard.
> >
> >

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 17+ messages in thread
* [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 2021-08-17 7:12 ` Richard Biener @ 2021-08-17 9:10 ` Xionghu Luo 2021-08-19 5:23 ` Xionghu Luo 2021-08-19 12:11 ` Richard Biener 0 siblings, 2 replies; 17+ messages in thread From: Xionghu Luo @ 2021-08-17 9:10 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On 2021/8/17 15:12, Richard Biener wrote: > On Tue, 17 Aug 2021, Xionghu Luo wrote: > >> Hi, >> >> On 2021/8/16 19:46, Richard Biener wrote: >>> On Mon, 16 Aug 2021, Xiong Hu Luo wrote: >>> >>>> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for >>>> nested loops. inn_loop is updated to inner loop, so it need be restored >>>> when exiting from innermost loop. With this patch, the store instruction >>>> in outer loop could also be moved out of outer loop by store motion. >>>> Any comments? Thanks. >>> >>>> gcc/ChangeLog: >>>> >>>> * tree-ssa-loop-im.c (fill_always_executed_in_1): Restore >>>> inn_loop when exiting from innermost loop. >>>> >>>> gcc/testsuite/ChangeLog: >>>> >>>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. 
>>>> --- >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++ >>>> gcc/tree-ssa-loop-im.c | 6 +++++- >>>> 2 files changed, 29 insertions(+), 1 deletion(-) >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>> >>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>> new file mode 100644 >>>> index 00000000000..097a5ee4a4b >>>> --- /dev/null >>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>> @@ -0,0 +1,24 @@ >>>> +/* PR/101293 */ >>>> +/* { dg-do compile } */ >>>> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ >>>> + >>>> +struct X { int i; int j; int k;}; >>>> + >>>> +void foo(struct X *x, int n, int l) >>>> +{ >>>> + for (int j = 0; j < l; j++) >>>> + { >>>> + for (int i = 0; i < n; ++i) >>>> + { >>>> + int *p = &x->j; >>>> + int tem = *p; >>>> + x->j += tem * i; >>>> + } >>>> + int *r = &x->k; >>>> + int tem2 = *r; >>>> + x->k += tem2 * j; >>>> + } >>>> +} >>>> + >>>> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } >>>> */ >>>> + >>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >>>> index b24bc64f2a7..5ca4738b20e 100644 >>>> --- a/gcc/tree-ssa-loop-im.c >>>> +++ b/gcc/tree-ssa-loop-im.c >>>> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap >>>> @@ contains_call) >>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>> last = bb; >>>> + if (inn_loop != loop >>>> + && flow_loop_nested_p (bb->loop_father, inn_loop)) >>>> + inn_loop = bb->loop_father; >>>> + >>> >>> The comment says >>> >>> /* In a loop that is always entered we may proceed anyway. >>> But record that we entered it and stop once we leave it. >>> */ >>> inn_loop = bb->loop_father; >>> >>> and your change would defeat that early return, no? >> >> The issue is the search method exits too early when iterating the outer >> loop. For example of a nested loop, loop 1 includes 5,8,3,10,4,9 >> and loop2 includes 3,10. 
Currently, it breaks when bb is 3 as bb 3 >> doesn't dominate bb 9 of loop 1. But actually, both bb 5 and bb 4 are >> ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4 >> they won't be processed by store motion again. >> >> >> 5<---- >> |\ | >> 8 \ 9 >> | \ | >> --->3--->4 >> | | >> 10---| >> >> >> SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now, with this >> patch, it will continue search when meet bb 3 until bb 4, then last is updated >> to bb 4, it will break until exit edge is found at bb 4 by >> "if (!flow_bb_inside_loop_p (loop, e->dest))". Then the followed loop code >> will >> set bb 4 as ALWAYS_EXEUCTED and all it's idoms bb 5. >> >> >> while (1) >> { >> SET_ALWAYS_EXECUTED_IN (last, loop); >> if (last == loop->header) >> break; >> last = get_immediate_dominator (CDI_DOMINATORS, last); >> } >> >> After further discussion with Kewen, we found that the inn_loop variable is >> totally useless and could be removed. >> >> >>> >>>> if (bitmap_bit_p (contains_call, bb->index)) >>>> break; >>>> >>>> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap >>>> @@ contains_call) >>>> >>>> if (bb->loop_father->header == bb) >>>> { >>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>> + if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, >>>> bb)) >>>> break; >>> >>> That's now a always false condition - a loops latch is always dominated >>> by its header. The condition as written tries to verify whether the >>> loop is always entered - mind we visit all blocks, not only those >>> always executed. >> >> Thanks for the catch! I am afraid the piece of code should be removed since >> it stops >> search of potential ALWAYS EXECUTED bb after inner loop... > > But the code says: > > /* In a loop that is always entered we may proceed anyway. > But record that we entered it and stop once we leave it. > */ > > and you do not remove this comment still it doesn't hold anymore > after your patch. 
I don't say the current code is optimal - I just > say it does exactly what is documented. Removed. > > >>> >>> In fact for your testcase the x->j ref is _not_ always executed >>> since the inner loop is conditional on n > 0. >> >> Yes. But I want to move x->k (not x->j) out of loop 1 when l > 0 in >> store-motion. > > Sorry, that wasn't clear. Yes, I agree the code fails to see always > executed blocks after inner loops. But since the code simply walks > all blocks in a loop instead of greedily following edges it has > to do that since the inner loop might exit the outer loop as well, > sth your change wouldn't honor. Consider This is already handled now: if there is an edge exiting out of the outer loop directly, it will also break early; I've verified it with your case. FOR_EACH_EDGE (e, ei, bb->succs) { /* If there is an exit from this BB. */ if (!flow_bb_inside_loop_p (loop, e->dest)) break; > > while (--n) > { > do > { > if (do_exit) > goto out; > } > while (1); > p->x += 1; > } > out:; > > you'll see a CFG where the inner loop exits the outer loop as well. > > So I'm saying to improve the code you'll likely have to do more. > And add a testcase like the following > > void __attribute__((noipa)) > foo (int n, int m, int f, int *p, int *q) > { > while (--n) > { > do > { > *q += 1; > if (f) > goto out; > } > while (--m); > *p += 1; > } > out:; > } > > int main() > { > int i = 0; > foo (10, 10, 1, (void *)0, &i); > if (i != 1) > __builtin_abort (); > return 0; > } > > Richard. > > Updated the patch: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for nested loops. The current design will exit if an inner loop doesn't dominate the outer loop's latch, or if an exit is seen after exiting from the inner loop, which caused an early return from the outer loop, so ALWAYS_EXECUTED blocks after inner loops are skipped.
This patch removes the redundant inner loop check, but keep the check of exiting to out of outer loop from inner loop, then the store instruction in outer loop could also be moved out of outer loop by store motion. gcc/ChangeLog: * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove inn_loop check. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/ssa-lim-19.c: New test. * gcc.dg/tree-ssa/ssa-lim-20.c: New test. --- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 +++++++++++++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 ++++++++++++++++++++++ gcc/tree-ssa-loop-im.c | 14 ---------- 3 files changed, 54 insertions(+), 14 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c new file mode 100644 index 00000000000..097a5ee4a4b --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c @@ -0,0 +1,24 @@ +/* PR/101293 */ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +struct X { int i; int j; int k;}; + +void foo(struct X *x, int n, int l) +{ + for (int j = 0; j < l; j++) + { + for (int i = 0; i < n; ++i) + { + int *p = &x->j; + int tem = *p; + x->j += tem * i; + } + int *r = &x->k; + int tem2 = *r; + x->k += tem2 * j; + } +} + +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c new file mode 100644 index 00000000000..6e180de528e --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c @@ -0,0 +1,30 @@ +/* PR/101293 */ +/* { dg-do run } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q) +{ + while (--n) + { + do + { + *q += 1; + if (f) + goto out; + } + while (--m); + *p += 1; + } +out:; +} + +int main() +{ + int i = 0; + foo (10, 10, 1, 
(void *) 0, &i); + if (i != 1) + __builtin_abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */ diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index b24bc64f2a7..c44f4390ce2 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -3197,7 +3197,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) basic_block bb = NULL, *bbs, last = NULL; unsigned i; edge e; - class loop *inn_loop = loop; if (ALWAYS_EXECUTED_IN (loop->header) == NULL) { @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) to disprove this if possible). */ if (bb->flags & BB_IRREDUCIBLE_LOOP) break; - - if (!flow_bb_inside_loop_p (inn_loop, bb)) - break; - - if (bb->loop_father->header == bb) - { - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) - break; - - /* In a loop that is always entered we may proceed anyway. - But record that we entered it and stop once we leave it. */ - inn_loop = bb->loop_father; - } } while (1) -- 2.27.0.90.geebb51ba8c ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 2021-08-17 9:10 ` [PATCH v2] Fix incomplete " Xionghu Luo @ 2021-08-19 5:23 ` Xionghu Luo 2021-08-19 12:11 ` Richard Biener 1 sibling, 0 replies; 17+ messages in thread From: Xionghu Luo @ 2021-08-19 5:23 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On 2021/8/17 17:10, Xionghu Luo via Gcc-patches wrote: > > > On 2021/8/17 15:12, Richard Biener wrote: >> On Tue, 17 Aug 2021, Xionghu Luo wrote: >> >>> Hi, >>> >>> On 2021/8/16 19:46, Richard Biener wrote: >>>> On Mon, 16 Aug 2021, Xiong Hu Luo wrote: >>>> >>>>> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for >>>>> nested loops. inn_loop is updated to inner loop, so it need be restored >>>>> when exiting from innermost loop. With this patch, the store instruction >>>>> in outer loop could also be moved out of outer loop by store motion. >>>>> Any comments? Thanks. >>>> >>>>> gcc/ChangeLog: >>>>> >>>>> * tree-ssa-loop-im.c (fill_always_executed_in_1): Restore >>>>> inn_loop when exiting from innermost loop. >>>>> >>>>> gcc/testsuite/ChangeLog: >>>>> >>>>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. 
>>>>> --- >>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++ >>>>> gcc/tree-ssa-loop-im.c | 6 +++++- >>>>> 2 files changed, 29 insertions(+), 1 deletion(-) >>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>>> >>>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>>> new file mode 100644 >>>>> index 00000000000..097a5ee4a4b >>>>> --- /dev/null >>>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>>> @@ -0,0 +1,24 @@ >>>>> +/* PR/101293 */ >>>>> +/* { dg-do compile } */ >>>>> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ >>>>> + >>>>> +struct X { int i; int j; int k;}; >>>>> + >>>>> +void foo(struct X *x, int n, int l) >>>>> +{ >>>>> + for (int j = 0; j < l; j++) >>>>> + { >>>>> + for (int i = 0; i < n; ++i) >>>>> + { >>>>> + int *p = &x->j; >>>>> + int tem = *p; >>>>> + x->j += tem * i; >>>>> + } >>>>> + int *r = &x->k; >>>>> + int tem2 = *r; >>>>> + x->k += tem2 * j; >>>>> + } >>>>> +} >>>>> + >>>>> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } >>>>> */ >>>>> + >>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >>>>> index b24bc64f2a7..5ca4738b20e 100644 >>>>> --- a/gcc/tree-ssa-loop-im.c >>>>> +++ b/gcc/tree-ssa-loop-im.c >>>>> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap >>>>> @@ contains_call) >>>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>>> last = bb; >>>>> + if (inn_loop != loop >>>>> + && flow_loop_nested_p (bb->loop_father, inn_loop)) >>>>> + inn_loop = bb->loop_father; >>>>> + >>>> >>>> The comment says >>>> >>>> /* In a loop that is always entered we may proceed anyway. >>>> But record that we entered it and stop once we leave it. >>>> */ >>>> inn_loop = bb->loop_father; >>>> >>>> and your change would defeat that early return, no? >>> >>> The issue is the search method exits too early when iterating the outer >>> loop. 
For example of a nested loop, loop 1 includes 5,8,3,10,4,9 >>> and loop2 includes 3,10. Currently, it breaks when bb is 3 as bb 3 >>> doesn't dominate bb 9 of loop 1. But actually, both bb 5 and bb 4 are >>> ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4 >>> they won't be processed by store motion again. >>> >>> >>> 5<---- >>> |\ | >>> 8 \ 9 >>> | \ | >>> --->3--->4 >>> | | >>> 10---| >>> >>> >>> SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now, with this >>> patch, it will continue search when meet bb 3 until bb 4, then last is updated >>> to bb 4, it will break until exit edge is found at bb 4 by >>> "if (!flow_bb_inside_loop_p (loop, e->dest))". Then the followed loop code >>> will >>> set bb 4 as ALWAYS_EXEUCTED and all it's idoms bb 5. >>> >>> >>> while (1) >>> { >>> SET_ALWAYS_EXECUTED_IN (last, loop); >>> if (last == loop->header) >>> break; >>> last = get_immediate_dominator (CDI_DOMINATORS, last); >>> } >>> >>> After further discussion with Kewen, we found that the inn_loop variable is >>> totally useless and could be removed. >>> >>> >>>> >>>>> if (bitmap_bit_p (contains_call, bb->index)) >>>>> break; >>>>> >>>>> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap >>>>> @@ contains_call) >>>>> >>>>> if (bb->loop_father->header == bb) >>>>> { >>>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>>> + if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, >>>>> bb)) >>>>> break; >>>> >>>> That's now a always false condition - a loops latch is always dominated >>>> by its header. The condition as written tries to verify whether the >>>> loop is always entered - mind we visit all blocks, not only those >>>> always executed. >>> >>> Thanks for the catch! I am afraid the piece of code should be removed since >>> it stops >>> search of potential ALWAYS EXECUTED bb after inner loop... >> >> But the code says: >> >> /* In a loop that is always entered we may proceed anyway. 
>> But record that we entered it and stop once we leave it. >> */ >> >> and you do not remove this comment still it doesn't hold anymore >> after your patch. I don't say the current code is optimal - I just >> say it does exactly what is documented. > > Removed. > >> >> >>>> >>>> In fact for your testcase the x->j ref is _not_ always executed >>>> since the inner loop is conditional on n > 0. >>> >>> Yes. But I want to move x->k (not x->j) out of loop 1 when l > 0 in >>> store-motion. >> >> Sorry, that wasn't clear. Yes, I agree the code fails to see always >> executed blocks after inner loops. But since the code simply walks >> all blocks in a loop instead of greedily following edges it has >> to do that since the inner loop might exit the outer loop as well, >> sth your change wouldn't honor. Consider > > This is already handled now, if there is an edge exiting out of outer > loop directly, it will also early break, I've verified it with your case. > > > FOR_EACH_EDGE (e, ei, bb->succs) > { > /* If there is an exit from this BB. */ > if (!flow_bb_inside_loop_p (loop, e->dest)) > break; >> >> while (--n) >> { >> do >> { >> if (do_exit) >> goto out; >> } >> while (1); >> p->x += 1; >> } >> out:; >> >> you'll see a CFG where the inner loop exits the outer loop as well. >> >> So I'm saying to improve the code you'll likely have to do more. >> And add a testcase like the following >> >> void __attribute__((noipa)) >> foo (int n, int m, int f, int *p, int *q) >> { >> while (--n) >> { >> do >> { >> *q += 1; >> if (f) >> goto out; >> } >> while (--m); >> *p += 1; >> } >> out:; >> } >> >> int main() >> { >> int i = 0; >> foo (10, 10, 1, (void *)0, &i); >> if (i != 1) >> __builtin_abort (); >> return 0; >> } >> >> Richard. >> > > Updated the patch: > > > [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 > > It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for > nested loops. 
Current design will exit if an inner loop doesn't > dominate outer loop's latch or exit after exiting from inner loop, which > caused early return from outer loop, then ALWAYS EXECUTED blocks after > inner loops is skipped. > This patch removes the redundant inner loop check, but keep the check of > exiting to out of outer loop from inner loop, then the store instruction > in outer loop could also be moved out of outer loop by store motion. This patch passes regression testing on Power, and shows NO SPEC2017 performance change. > > gcc/ChangeLog: > > * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove > inn_loop check. > > gcc/testsuite/ChangeLog: > > * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > * gcc.dg/tree-ssa/ssa-lim-20.c: New test. > --- > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 +++++++++++++++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 ++++++++++++++++++++++ > gcc/tree-ssa-loop-im.c | 14 ---------- > 3 files changed, 54 insertions(+), 14 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > new file mode 100644 > index 00000000000..097a5ee4a4b > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > @@ -0,0 +1,24 @@ > +/* PR/101293 */ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +struct X { int i; int j; int k;}; > + > +void foo(struct X *x, int n, int l) > +{ > + for (int j = 0; j < l; j++) > + { > + for (int i = 0; i < n; ++i) > + { > + int *p = &x->j; > + int tem = *p; > + x->j += tem * i; > + } > + int *r = &x->k; > + int tem2 = *r; > + x->k += tem2 * j; > + } > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ > + > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > new file mode 100644 > index
00000000000..6e180de528e > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > @@ -0,0 +1,30 @@ > +/* PR/101293 */ > +/* { dg-do run } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q) > +{ > + while (--n) > + { > + do > + { > + *q += 1; > + if (f) > + goto out; > + } > + while (--m); > + *p += 1; > + } > +out:; > +} > + > +int main() > +{ > + int i = 0; > + foo (10, 10, 1, (void *) 0, &i); > + if (i != 1) > + __builtin_abort (); > + return 0; > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */ > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > index b24bc64f2a7..c44f4390ce2 100644 > --- a/gcc/tree-ssa-loop-im.c > +++ b/gcc/tree-ssa-loop-im.c > @@ -3197,7 +3197,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > basic_block bb = NULL, *bbs, last = NULL; > unsigned i; > edge e; > - class loop *inn_loop = loop; > > if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > { > @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > to disprove this if possible). */ > if (bb->flags & BB_IRREDUCIBLE_LOOP) > break; > - > - if (!flow_bb_inside_loop_p (inn_loop, bb)) > - break; > - > - if (bb->loop_father->header == bb) > - { > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > - break; > - > - /* In a loop that is always entered we may proceed anyway. > - But record that we entered it and stop once we leave it. */ > - inn_loop = bb->loop_father; > - } > } > > while (1) > -- Thanks, Xionghu ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 2021-08-17 9:10 ` [PATCH v2] Fix incomplete " Xionghu Luo 2021-08-19 5:23 ` Xionghu Luo @ 2021-08-19 12:11 ` Richard Biener 2021-08-24 7:44 ` Xionghu Luo 1 sibling, 1 reply; 17+ messages in thread From: Richard Biener @ 2021-08-19 12:11 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On Tue, 17 Aug 2021, Xionghu Luo wrote: > > > On 2021/8/17 15:12, Richard Biener wrote: > > On Tue, 17 Aug 2021, Xionghu Luo wrote: > > > >> Hi, > >> > >> On 2021/8/16 19:46, Richard Biener wrote: > >>> On Mon, 16 Aug 2021, Xiong Hu Luo wrote: > >>> > >>>> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for > >>>> nested loops. inn_loop is updated to inner loop, so it need be restored > >>>> when exiting from innermost loop. With this patch, the store instruction > >>>> in outer loop could also be moved out of outer loop by store motion. > >>>> Any comments? Thanks. > >>> > >>>> gcc/ChangeLog: > >>>> > >>>> * tree-ssa-loop-im.c (fill_always_executed_in_1): Restore > >>>> inn_loop when exiting from innermost loop. > >>>> > >>>> gcc/testsuite/ChangeLog: > >>>> > >>>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. 
> >>>> --- > >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++ > >>>> gcc/tree-ssa-loop-im.c | 6 +++++- > >>>> 2 files changed, 29 insertions(+), 1 deletion(-) > >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >>>> > >>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >>>> new file mode 100644 > >>>> index 00000000000..097a5ee4a4b > >>>> --- /dev/null > >>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >>>> @@ -0,0 +1,24 @@ > >>>> +/* PR/101293 */ > >>>> +/* { dg-do compile } */ > >>>> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > >>>> + > >>>> +struct X { int i; int j; int k;}; > >>>> + > >>>> +void foo(struct X *x, int n, int l) > >>>> +{ > >>>> + for (int j = 0; j < l; j++) > >>>> + { > >>>> + for (int i = 0; i < n; ++i) > >>>> + { > >>>> + int *p = &x->j; > >>>> + int tem = *p; > >>>> + x->j += tem * i; > >>>> + } > >>>> + int *r = &x->k; > >>>> + int tem2 = *r; > >>>> + x->k += tem2 * j; > >>>> + } > >>>> +} > >>>> + > >>>> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } > >>>> */ > >>>> + > >>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > >>>> index b24bc64f2a7..5ca4738b20e 100644 > >>>> --- a/gcc/tree-ssa-loop-im.c > >>>> +++ b/gcc/tree-ssa-loop-im.c > >>>> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap > >>>> @@ contains_call) > >>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>> last = bb; > >>>> + if (inn_loop != loop > >>>> + && flow_loop_nested_p (bb->loop_father, inn_loop)) > >>>> + inn_loop = bb->loop_father; > >>>> + > >>> > >>> The comment says > >>> > >>> /* In a loop that is always entered we may proceed anyway. > >>> But record that we entered it and stop once we leave it. > >>> */ > >>> inn_loop = bb->loop_father; > >>> > >>> and your change would defeat that early return, no? 
> >> > >> The issue is the search method exits too early when iterating the outer > >> loop. For example of a nested loop, loop 1 includes 5,8,3,10,4,9 > >> and loop2 includes 3,10. Currently, it breaks when bb is 3 as bb 3 > >> doesn't dominate bb 9 of loop 1. But actually, both bb 5 and bb 4 are > >> ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4 > >> they won't be processed by store motion again. > >> > >> > >> 5<---- > >> |\ | > >> 8 \ 9 > >> | \ | > >> --->3--->4 > >> | | > >> 10---| > >> > >> > >> SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now, with this > >> patch, it will continue search when meet bb 3 until bb 4, then last is updated > >> to bb 4, it will break until exit edge is found at bb 4 by > >> "if (!flow_bb_inside_loop_p (loop, e->dest))". Then the followed loop code > >> will > >> set bb 4 as ALWAYS_EXEUCTED and all it's idoms bb 5. > >> > >> > >> while (1) > >> { > >> SET_ALWAYS_EXECUTED_IN (last, loop); > >> if (last == loop->header) > >> break; > >> last = get_immediate_dominator (CDI_DOMINATORS, last); > >> } > >> > >> After further discussion with Kewen, we found that the inn_loop variable is > >> totally useless and could be removed. > >> > >> > >>> > >>>> if (bitmap_bit_p (contains_call, bb->index)) > >>>> break; > >>>> > >>>> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap > >>>> @@ contains_call) > >>>> > >>>> if (bb->loop_father->header == bb) > >>>> { > >>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>> + if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, > >>>> bb)) > >>>> break; > >>> > >>> That's now a always false condition - a loops latch is always dominated > >>> by its header. The condition as written tries to verify whether the > >>> loop is always entered - mind we visit all blocks, not only those > >>> always executed. > >> > >> Thanks for the catch! 
I am afraid the piece of code should be removed since > >> it stops > >> search of potential ALWAYS EXECUTED bb after inner loop... > > > > But the code says: > > > > /* In a loop that is always entered we may proceed anyway. > > But record that we entered it and stop once we leave it. > > */ > > > > and you do not remove this comment still it doesn't hold anymore > > after your patch. I don't say the current code is optimal - I just > > say it does exactly what is documented. > > Removed. > > > > > > >>> > >>> In fact for your testcase the x->j ref is _not_ always executed > >>> since the inner loop is conditional on n > 0. > >> > >> Yes. But I want to move x->k (not x->j) out of loop 1 when l > 0 in > >> store-motion. > > > > Sorry, that wasn't clear. Yes, I agree the code fails to see always > > executed blocks after inner loops. But since the code simply walks > > all blocks in a loop instead of greedily following edges it has > > to do that since the inner loop might exit the outer loop as well, > > sth your change wouldn't honor. Consider > > This is already handled now, if there is an edge exiting out of outer > loop directly, it will also early break, I've verified it with your case. > > > FOR_EACH_EDGE (e, ei, bb->succs) > { > /* If there is an exit from this BB. */ > if (!flow_bb_inside_loop_p (loop, e->dest)) > break; > > > > while (--n) > > { > > do > > { > > if (do_exit) > > goto out; > > } > > while (1); > > p->x += 1; > > } > > out:; > > > > you'll see a CFG where the inner loop exits the outer loop as well. > > > > So I'm saying to improve the code you'll likely have to do more. 
> > And add a testcase like the following > > > > void __attribute__((noipa)) > > foo (int n, int m, int f, int *p, int *q) > > { > > while (--n) > > { > > do > > { > > *q += 1; > > if (f) > > goto out; > > } > > while (--m); > > *p += 1; > > } > > out:; > > } > > > > int main() > > { > > int i = 0; > > foo (10, 10, 1, (void *)0, &i); > > if (i != 1) > > __builtin_abort (); > > return 0; > > } > > > > Richard. > > > > Updated the patch: > > > [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 > > It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for > nested loops. Current design will exit if an inner loop doesn't > dominate outer loop's latch or exit after exiting from inner loop, which > caused early return from outer loop, then ALWAYS EXECUTED blocks after > inner loops is skipped. > This patch removes the redundant inner loop check, but keep the check of > exiting to out of outer loop from inner loop, then the store instruction > in outer loop could also be moved out of outer loop by store motion. > > gcc/ChangeLog: > > * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove > inn_loop check. > > gcc/testsuite/ChangeLog: > > * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > * gcc.dg/tree-ssa/ssa-lim-20.c: New test. 
> --- > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 +++++++++++++++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 ++++++++++++++++++++++ > gcc/tree-ssa-loop-im.c | 14 ---------- > 3 files changed, 54 insertions(+), 14 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > new file mode 100644 > index 00000000000..097a5ee4a4b > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > @@ -0,0 +1,24 @@ > +/* PR/101293 */ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +struct X { int i; int j; int k;}; > + > +void foo(struct X *x, int n, int l) > +{ > + for (int j = 0; j < l; j++) > + { > + for (int i = 0; i < n; ++i) > + { > + int *p = &x->j; > + int tem = *p; > + x->j += tem * i; > + } > + int *r = &x->k; > + int tem2 = *r; > + x->k += tem2 * j; > + } > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ > + > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > new file mode 100644 > index 00000000000..6e180de528e > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > @@ -0,0 +1,30 @@ > +/* PR/101293 */ > +/* { dg-do run } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q) > +{ > + while (--n) > + { > + do > + { > + *q += 1; > + if (f) > + goto out; > + } > + while (--m); > + *p += 1; > + } > +out:; > +} > + > +int main() > +{ > + int i = 0; > + foo (10, 10, 1, (void *) 0, &i); > + if (i != 1) > + __builtin_abort (); > + return 0; > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */ > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > index b24bc64f2a7..c44f4390ce2 100644 > --- 
a/gcc/tree-ssa-loop-im.c > +++ b/gcc/tree-ssa-loop-im.c > @@ -3197,7 +3197,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > basic_block bb = NULL, *bbs, last = NULL; > unsigned i; > edge e; > - class loop *inn_loop = loop; > > if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > { > @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > to disprove this if possible). */ > if (bb->flags & BB_IRREDUCIBLE_LOOP) > break; > - > - if (!flow_bb_inside_loop_p (inn_loop, bb)) > - break; > - > - if (bb->loop_father->header == bb) > - { > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > - break; > - > - /* In a loop that is always entered we may proceed anyway. > - But record that we entered it and stop once we leave it. */ > - inn_loop = bb->loop_father; > - } > } > > while (1) I'm not sure this will work correctly (I'm not sure how the existing code makes it so either...). That said, I can't poke any hole into the change. What I see is that definitely if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) last = bb; if (bitmap_bit_p (contains_call, bb->index)) break; doesn't work reliably since the DOM ordering will process blocks A, B and C in random order for for (;;) { if (cond) { A: foo (); } else B:; C:; } and thus we can end up setting 'last' to C _before_ processing 'A' and thus arriving at the call foo () ... get_loop_body_in_dom_order does some "special sauce" but not to address the above problem - but it might be that a subtle issue like the above is the reason for the inner loop handling. The inner loop block order does _not_ adhere to this "special sauce", that is - the "Additionally, if a basic block s dominates the latch, then only blocks dominated by s are be after it." guarantee holds for the outer loop latch, not for the inner.
Digging into the history of fill_always_executed_in_1 doesn't reveal anything - the inner loop handling has been present since introduction by Zdenek - but usually Zdenek has a reason for doing things as he does ;) Note it might be simply a measure against quadratic complexity, esp. since with your patch we also dive into not always executed subloops as you remove the if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) break; check. I suggest to evaluate behavior of the patch on a testcase like void foo (int n, int **k) { for (int i = 0; i < n; ++i) if (k[0][i]) for (int j = 0; j < n; ++j) if (k[1][j]) for (int l = 0; l < n; ++l) if (k[2][l]) ... } I suspect you'll see quadratic behavior with your patch. You should be at least able to preserve a check like /* Do not process not always executed subloops to avoid quadratic behavior. */ if (bb->loop_father->header == bb && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) break; which is of course not optimistic for cases like for (..) { if (cond) for (..) x = 1; // this is always executed if the inner loop is finite } but we need to have an eye on the complexity of this function. I would have suggested to do greedy visiting of the loop header successors, processing further blocks if all entries (ignoring backedges) are processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist is empty proceed to inner loops as the current code does. For bookkeeping simply keep a to-visit-incoming-edges counter per BB. Pseudo-code: bitmap_set_bit (worklist, loop-header-bb); while (!bitmap_empty_p (worklist)) { bb = pop (worklist); SET_ALWAYS_EXECUTED_IN (bb, loop); if (bitmap_bit_p (contains_call, bb->index)) continue; FOR_EACH_EDGE (e, ei, bb->succs) { if (!flow_bb_inside_loop_p (loop, e->dest)) continue; if (incoming_count[e->dest->index]-- == 0) push (worklist, e->dest); } } iterate over inner loops (incoming_count can be retained, we just force the inner loop header onto the worklist). 
That would avoid the quadraticness with respect to get_loop_body_in_dom_order and also fix the dominator issue mentioned above. Richard. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 2021-08-19 12:11 ` Richard Biener @ 2021-08-24 7:44 ` Xionghu Luo 2021-08-24 8:20 ` Richard Biener 0 siblings, 1 reply; 17+ messages in thread From: Xionghu Luo @ 2021-08-24 7:44 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On 2021/8/19 20:11, Richard Biener wrote: >> - class loop *inn_loop = loop; >> >> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) >> { >> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) >> to disprove this if possible). */ >> if (bb->flags & BB_IRREDUCIBLE_LOOP) >> break; >> - >> - if (!flow_bb_inside_loop_p (inn_loop, bb)) >> - break; >> - >> - if (bb->loop_father->header == bb) >> - { >> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >> - break; >> - >> - /* In a loop that is always entered we may proceed anyway. >> - But record that we entered it and stop once we leave it. */ >> - inn_loop = bb->loop_father; >> - } >> } >> >> while (1) > I'm not sure this will work correct (I'm not sure how the existing > code makes it so either...). That said, I can't poke any hole > into the change. What I see is that definitely > > if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > last = bb; > > if (bitmap_bit_p (contains_call, bb->index)) > break; > > doesn't work reliably since the DOM ordering will process blocks > A B and C in random order for > > for (;;) > { > if (cond) > { > A: foo (); > } > else B:; > C:; > } > > and thus we can end up setting 'last' to C_before_ processing > 'A' and thus arriving at the call foo () ... > > get_loop_body_in_dom_order does some "special sauce" but not > to address the above problem - but it might be that a subtle > issue like the above is the reason for the inner loop handling. 
> The inner loop block order does_not_ adhere to this "special sauce", > that is - the "Additionally, if a basic block s dominates > the latch, then only blocks dominated by s are be after it." > guarantee holds for the outer loop latch, not for the inner. > > Digging into the history of fill_always_executed_in_1 doesn't > reveal anything - the inner loop handling has been present > since introduction by Zdenek - but usually Zdenek has a reason > for doing things as he does;) Yes, this is really complicated usage, thanks for pointing it out. :) I constructed two cases to verify this, with an inner loop that includes "if A; else B; C". I found that fill_sons_in_loop in get_loop_body_in_dom_order also checks whether the bb dominates the outer loop's latch: if C dominates the outer loop's latch, C is postponed and the visit order is ABC, so 'last' won't be set to C if A or B contains a call. Otherwise, if C doesn't dominate the outer loop's latch in fill_sons_in_loop, the visit order is CAB, but 'last' still won't be updated to C in fill_always_executed_in_1, since there is a dominance check there as well; then if A or B contains a call, the walk can break out correctly. C won't be set to ALWAYS_EXECUTED in either circumstance. > > Note it might be simply a measure against quadratic complexity, > esp. since with your patch we also dive into not always executed > subloops as you remove the > > if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > break; > > check. I suggest to evaluate behavior of the patch on a testcase > like > > void foo (int n, int **k) > { > for (int i = 0; i < n; ++i) > if (k[0][i]) > for (int j = 0; j < n; ++j) > if (k[1][j]) > for (int l = 0; l < n; ++l) > if (k[2][l]) > ... > } Theoretically the complexity changes from L1(bbs) to L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), so fill_always_executed_in_1's execution time is supposed to increase from O(n) to O(n^2)? The time should depend on loop depth and bb counts.
I also drafted a test case that has a 73-deep loop function with 25 no-ipa function copies, each compiled in lim2 and lim4 independently. The total execution time of fill_always_executed_in_1 increased from 32ms to 58ms, almost doubled but not quadratic? It seems reasonable to see compile time getting longer, since most bbs are checked more often, but that is a MUST to ensure we break early correctly at every loop level... Though loop nodes could be huge, loop depth will never be so large in actual code? > > I suspect you'll see quadratic behavior with your patch. You > should be at least able to preserve a check like > > /* Do not process not always executed subloops to avoid > quadratic behavior. */ > if (bb->loop_father->header == bb > && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > break; > > which is of course not optimistic for cases like > > for (..) > { > if (cond) > for (..) > x = 1; // this is always executed if the inner loop is finite > } > > but we need to have an eye on the complexity of this function. I would > have suggested to do greedy visiting of the loop header successors, > processing further blocks if all entries (ignoring backedges) are > processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist > is empty proceed to inner loops as the current code does. For > bookkeeping simply keep a to-visit-incoming-edges counter per BB. > Pseudo-code: > > bitmap_set_bit (worklist, loop-header-bb); > while (!bitmap_empty_p (worklist)) > { > bb = pop (worklist); Do we need to check whether bb dominates the latch before SET_ALWAYS_EXECUTED_IN? > SET_ALWAYS_EXECUTED_IN (bb, loop); > if (bitmap_bit_p (contains_call, bb->index)) > continue; > FOR_EACH_EDGE (e, ei, bb->succs) > { > if (!flow_bb_inside_loop_p (loop, e->dest)) > continue; > if (incoming_count[e->dest->index]-- == 0) > push (worklist, e->dest); > } > } Sorry, I don't fully understand your algorithm. Shouldn't worklist be an auto_vec<basic_block>, which doesn't support bitmap operations?
Is incoming_count the bb's predecessor count? Why does it need to be retained once it has decreased to 0? > > iterate over inner loops (incoming_count can be retained, > we just force the inner loop header onto the worklist). Is this the same as: for (loop = loop->inner; loop; loop = loop->next) fill_always_executed_in_1 (loop, contains_call) > > that would avoid the quadraticness with respect to > get_loop_body_in_dom_order and also fix the dominator issue > mentioned above. > > Richard. -- Thanks, Xionghu ^ permalink raw reply [flat|nested] 17+ messages in thread
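For what it's worth, the incoming-count worklist scheme discussed above can be sketched standalone as follows. This is a toy, not GCC code: the CFG is a hand-written successor table (a diamond inside a loop body, back edges omitted), dominates_latch is computed by hand, and all helper names are made up. The count bookkeeping is adapted slightly from the pseudo-code (counts start at the full non-back-edge predecessor count and are pre-decremented). Note how the call in block 2 keeps the join block and everything after it from being marked, mirroring the conservative stop-at-call behavior of fill_always_executed_in_1:

```c
#include <assert.h>
#include <string.h>

#define NBB 6

/* Toy loop body (back edges omitted from the successor lists):
   0 = header -> 1;  1 -> 2 (then), 3 (else);  2 -> 4;  3 -> 4;
   4 -> 5 = latch.  Block 2 contains a call.  */
static const int nsucc[NBB] = { 1, 2, 1, 1, 1, 0 };
static const int succ[NBB][2]
  = { {1, 0}, {2, 3}, {4, 0}, {4, 0}, {5, 0}, {0, 0} };
static const int contains_call[NBB] = { 0, 0, 1, 0, 0, 0 };
/* Hand-computed: which blocks dominate the latch (block 5).  */
static const int dominates_latch[NBB] = { 1, 1, 0, 0, 1, 1 };

/* Greedy worklist walk: a block is pushed only once all of its
   non-back-edge predecessors have been processed, so a call on any
   path into a join block keeps the join (and everything after it)
   from being reached and marked.  */
static void
mark_always_executed (int always_executed[NBB])
{
  int incoming[NBB];
  memset (incoming, 0, sizeof incoming);
  for (int b = 0; b < NBB; b++)
    for (int s = 0; s < nsucc[b]; s++)
      incoming[succ[b][s]]++;

  memset (always_executed, 0, NBB * sizeof (int));
  int worklist[NBB], top = 0;
  worklist[top++] = 0;          /* start at the loop header */

  while (top > 0)
    {
      int bb = worklist[--top];
      if (dominates_latch[bb])
        always_executed[bb] = 1;
      if (contains_call[bb])
        continue;               /* do not walk past a call */
      for (int s = 0; s < nsucc[bb]; s++)
        if (--incoming[succ[bb][s]] == 0)
          worklist[top++] = succ[bb][s];
    }
}
```

On this toy CFG only the header and block 1 get marked: block 2 contains the call, so the edge count of join block 4 never drops to zero and blocks 4 and 5 are never pushed, even though they dominate the latch. That is exactly the property the DFS-order 'last' bookkeeping cannot guarantee for the ABC case.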
* Re: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 2021-08-24 7:44 ` Xionghu Luo @ 2021-08-24 8:20 ` Richard Biener 2021-08-26 5:50 ` [PATCH v3] " Xionghu Luo 0 siblings, 1 reply; 17+ messages in thread From: Richard Biener @ 2021-08-24 8:20 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On Tue, 24 Aug 2021, Xionghu Luo wrote: > > > On 2021/8/19 20:11, Richard Biener wrote: > >> - class loop *inn_loop = loop; > >> > >> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > >> { > >> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > >> to disprove this if possible). */ > >> if (bb->flags & BB_IRREDUCIBLE_LOOP) > >> break; > >> - > >> - if (!flow_bb_inside_loop_p (inn_loop, bb)) > >> - break; > >> - > >> - if (bb->loop_father->header == bb) > >> - { > >> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >> - break; > >> - > >> - /* In a loop that is always entered we may proceed anyway. > >> - But record that we entered it and stop once we leave it. */ > >> - inn_loop = bb->loop_father; > >> - } > >> } > >> > >> while (1) > > I'm not sure this will work correct (I'm not sure how the existing > > code makes it so either...). That said, I can't poke any hole > > into the change. What I see is that definitely > > > > if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > > last = bb; > > > > if (bitmap_bit_p (contains_call, bb->index)) > > break; > > > > doesn't work reliably since the DOM ordering will process blocks > > A B and C in random order for > > > > for (;;) > > { > > if (cond) > > { > > A: foo (); > > } > > else B:; > > C:; > > } > > > > and thus we can end up setting 'last' to C_before_ processing > > 'A' and thus arriving at the call foo () ... 
> > > > get_loop_body_in_dom_order does some "special sauce" but not > > to address the above problem - but it might be that a subtle > > issue like the above is the reason for the inner loop handling. > > The inner loop block order does_not_ adhere to this "special sauce", > > that is - the "Additionally, if a basic block s dominates > > the latch, then only blocks dominated by s are be after it." > > guarantee holds for the outer loop latch, not for the inner. > > > > Digging into the history of fill_always_executed_in_1 doesn't > > reveal anything - the inner loop handling has been present > > since introduction by Zdenek - but usually Zdenek has a reason > > for doing things as he does;) > > Yes, this is really complicated usage, thanks for point it out. :) > I constructed two cases to verify this with inner loop includes "If A; else B; C". > Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also checks > whether the bb domintes outer loop’s latch, if C dominate outer loop’s latch, > C is postponed, the access order is ABC, 'last' won’t be set to C if A or B contains call; But it depends on the order of visiting ABC and that's hard to put into a testcase since it depends on the order of edges and the processing of the dominance computation. ABC are simply unordered with respect to a dominator walk. > Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop, > the access order is CAB, but 'last' also won’t be updated to C in fill_always_executed_in_1 > since there is also dominate check, then if A or B contains call, it could break > successfully. > > C won't be set to ALWAYS EXECUTED for both circumstance. > > > > > Note it might be simply a measure against quadratic complexity, > > esp. since with your patch we also dive into not always executed > > subloops as you remove the > > > > if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > > break; > > > > check. 
I suggest to evaluate behavior of the patch on a testcase > > like > > > > void foo (int n, int **k) > > { > > for (int i = 0; i < n; ++i) > > if (k[0][i]) > > for (int j = 0; j < n; ++j) > > if (k[1][j]) > > for (int l = 0; l < n; ++l) > > if (k[2][l]) > > ... > > } > > Theoretically the complexity is changing from L1(bbs) to L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), > so fill_always_executed_in_1's execution time is supposed to be increase from > O(n) to O(n2)? The time should depend on loop depth and bb counts. I also drafted a > test case has 73-depth loop function with 25 no-ipa function copies each compiled > in lim2 and lim4 dependently. Total execution time of fill_always_executed_in_1 is > increased from 32ms to 58ms, almost doubled but not quadratic? It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still O(n^2). > It seems reasonable to see compiling time getting longer since most bbs are checked > more but a MUST to ensure early break correctly in every loop level... > Though loop nodes could be huge, loop depth will never be so large in actual code? The "in practice" argument is almost always defeated by automatic program generators ;) > > > > I suspect you'll see quadratic behavior with your patch. You > > should be at least able to preserve a check like > > > > /* Do not process not always executed subloops to avoid > > quadratic behavior. */ > > if (bb->loop_father->header == bb > > && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > > break; > > > > which is of course not optimistic for cases like > > > > for (..) > > { > > if (cond) > > for (..) > > x = 1; // this is always executed if the inner loop is finite > > } > > > > but we need to have an eye on the complexity of this function. I would > > have suggested to do greedy visiting of the loop header successors, > > processing further blocks if all entries (ignoring backedges) are > > processed, setting SET_ALWAYS_EXECUTED_IN. 
When the worklist > > is empty proceed to inner loops as the current code does. For > > bookkeeping simply keep a to-visit-incoming-edges counter per BB. > > Pseudo-code: > > > > bitmap_set_bit (worklist, loop-header-bb); > > while (!bitmap_empty_p (worklist)) > > { > > bb = pop (worklist); > > Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN? Ah, sure. > > SET_ALWAYS_EXECUTED_IN (bb, loop); > > if (bitmap_bit_p (contains_call, bb->index)) > > continue; > > FOR_EACH_EDGE (e, ei, bb->succs) > > { > > if (!flow_bb_inside_loop_p (loop, e->dest)) > > continue; > > if (incoming_count[e->dest->index]-- == 0) > > push (worklist, e->dest); > > } > > } > > Sorry I don't fully understand your algorithm. worklist should be > auto_vec<basic_block> don't support bitmap operations? Is incoming_count > the bb's preds count, why need retain it since it it decreased to 0? the worklist is a auto_bitmap, pop () would be bitmap_first_set_bit/clear_bit. One could use a vector with the caveat of eventually adding duplicate members to the worklist. But as said, it's pseudo-code ;) > > > > iterate over inner loops (incoming_count can be retained, > > we just force the inner loop header onto the worklist). > > Is this same to ? > > for (loop = loop->inner; loop; loop = loop->next) > fill_always_executed_in_1 (loop, contains_call) Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which should iterate over loops in PRE order. Richard. ^ permalink raw reply [flat|nested] 17+ messages in thread
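Concretely, the bitmap-backed pop () can be sketched on a single 64-bit word standing in for auto_bitmap (purely illustrative, and using the GCC/Clang __builtin_ctzll builtin): push sets a bit, pop takes the lowest set bit and clears it, and a duplicate push is a no-op, which is exactly the caveat a plain vector worklist has.

```c
#include <assert.h>
#include <stdint.h>

/* Worklist as a bit set over at most 64 block indices.  */
static void
push (uint64_t *worklist, int bb)
{
  *worklist |= UINT64_C (1) << bb;   /* already-present bits stay put */
}

/* Pop the lowest-numbered pending block.  Caller must ensure the
   worklist is non-empty (ctzll of 0 is undefined).  */
static int
pop (uint64_t *worklist)
{
  int bb = __builtin_ctzll (*worklist); /* index of lowest set bit */
  *worklist &= *worklist - 1;           /* clear that bit */
  return bb;
}
```

Pushing blocks 5, 3, and then 5 again leaves exactly two pending entries, popped in index order 3 then 5.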
* Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 2021-08-24 8:20 ` Richard Biener @ 2021-08-26 5:50 ` Xionghu Luo 2021-08-27 7:45 ` Richard Biener 0 siblings, 1 reply; 17+ messages in thread From: Xionghu Luo @ 2021-08-26 5:50 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On 2021/8/24 16:20, Richard Biener wrote: > On Tue, 24 Aug 2021, Xionghu Luo wrote: > >> >> >> On 2021/8/19 20:11, Richard Biener wrote: >>>> - class loop *inn_loop = loop; >>>> >>>> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) >>>> { >>>> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) >>>> to disprove this if possible). */ >>>> if (bb->flags & BB_IRREDUCIBLE_LOOP) >>>> break; >>>> - >>>> - if (!flow_bb_inside_loop_p (inn_loop, bb)) >>>> - break; >>>> - >>>> - if (bb->loop_father->header == bb) >>>> - { >>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>> - break; >>>> - >>>> - /* In a loop that is always entered we may proceed anyway. >>>> - But record that we entered it and stop once we leave it. */ >>>> - inn_loop = bb->loop_father; >>>> - } >>>> } >>>> >>>> while (1) >>> I'm not sure this will work correct (I'm not sure how the existing >>> code makes it so either...). That said, I can't poke any hole >>> into the change. What I see is that definitely >>> >>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>> last = bb; >>> >>> if (bitmap_bit_p (contains_call, bb->index)) >>> break; >>> >>> doesn't work reliably since the DOM ordering will process blocks >>> A B and C in random order for >>> >>> for (;;) >>> { >>> if (cond) >>> { >>> A: foo (); >>> } >>> else B:; >>> C:; >>> } >>> >>> and thus we can end up setting 'last' to C_before_ processing >>> 'A' and thus arriving at the call foo () ... 
>>> >>> get_loop_body_in_dom_order does some "special sauce" but not >>> to address the above problem - but it might be that a subtle >>> issue like the above is the reason for the inner loop handling. >>> The inner loop block order does_not_ adhere to this "special sauce", >>> that is - the "Additionally, if a basic block s dominates >>> the latch, then only blocks dominated by s are be after it." >>> guarantee holds for the outer loop latch, not for the inner. >>> >>> Digging into the history of fill_always_executed_in_1 doesn't >>> reveal anything - the inner loop handling has been present >>> since introduction by Zdenek - but usually Zdenek has a reason >>> for doing things as he does;) >> >> Yes, this is really complicated usage, thanks for point it out. :) >> I constructed two cases to verify this with inner loop includes "If A; else B; C". >> Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also checks >> whether the bb domintes outer loop’s latch, if C dominate outer loop’s latch, >> C is postponed, the access order is ABC, 'last' won’t be set to C if A or B contains call; > > But it depends on the order of visiting ABC and that's hard to put into > a testcase since it depends on the order of edges and the processing > of the dominance computation. ABC are simply unordered with respect > to a dominator walk. > >> Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop, >> the access order is CAB, but 'last' also won’t be updated to C in fill_always_executed_in_1 >> since there is also dominate check, then if A or B contains call, it could break >> successfully. >> >> C won't be set to ALWAYS EXECUTED for both circumstance. >> >>> >>> Note it might be simply a measure against quadratic complexity, >>> esp. since with your patch we also dive into not always executed >>> subloops as you remove the >>> >>> if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>> break; >>> >>> check. 
I suggest to evaluate behavior of the patch on a testcase >>> like >>> >>> void foo (int n, int **k) >>> { >>> for (int i = 0; i < n; ++i) >>> if (k[0][i]) >>> for (int j = 0; j < n; ++j) >>> if (k[1][j]) >>> for (int l = 0; l < n; ++l) >>> if (k[2][l]) >>> ... >>> } >> >> Theoretically the complexity is changing from L1(bbs) to L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), >> so fill_always_executed_in_1's execution time is supposed to be increase from >> O(n) to O(n2)? The time should depend on loop depth and bb counts. I also drafted a >> test case has 73-depth loop function with 25 no-ipa function copies each compiled >> in lim2 and lim4 dependently. Total execution time of fill_always_executed_in_1 is >> increased from 32ms to 58ms, almost doubled but not quadratic? > > It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still > O(n^2). > >> It seems reasonable to see compiling time getting longer since most bbs are checked >> more but a MUST to ensure early break correctly in every loop level... >> Though loop nodes could be huge, loop depth will never be so large in actual code? > > The "in practice" argument is almost always defeated by automatic > program generators ;) > >>> >>> I suspect you'll see quadratic behavior with your patch. You >>> should be at least able to preserve a check like >>> >>> /* Do not process not always executed subloops to avoid >>> quadratic behavior. */ >>> if (bb->loop_father->header == bb >>> && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>> break; >>> >>> which is of course not optimistic for cases like >>> >>> for (..) >>> { >>> if (cond) >>> for (..) >>> x = 1; // this is always executed if the inner loop is finite >>> } >>> >>> but we need to have an eye on the complexity of this function. I would >>> have suggested to do greedy visiting of the loop header successors, >>> processing further blocks if all entries (ignoring backedges) are >>> processed, setting SET_ALWAYS_EXECUTED_IN. 
When the worklist >>> is empty proceed to inner loops as the current code does. For >>> bookkeeping simply keep a to-visit-incoming-edges counter per BB. >>> Pseudo-code: >>> >>> bitmap_set_bit (worklist, loop-header-bb); >>> while (!bitmap_empty_p (worklist)) >>> { >>> bb = pop (worklist); >> >> Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN? > > Ah, sure. > >>> SET_ALWAYS_EXECUTED_IN (bb, loop); >>> if (bitmap_bit_p (contains_call, bb->index)) >>> continue; >>> FOR_EACH_EDGE (e, ei, bb->succs) >>> { >>> if (!flow_bb_inside_loop_p (loop, e->dest)) >>> continue; >>> if (incoming_count[e->dest->index]-- == 0) >>> push (worklist, e->dest); >>> } >>> } >> >> Sorry I don't fully understand your algorithm. worklist should be >> auto_vec<basic_block> don't support bitmap operations? Is incoming_count >> the bb's preds count, why need retain it since it it decreased to 0? > > the worklist is a auto_bitmap, pop () would be > bitmap_first_set_bit/clear_bit. One could use a vector with the > caveat of eventually adding duplicate members to the worklist. > But as said, it's pseudo-code ;) Thanks a lot! > >>> >>> iterate over inner loops (incoming_count can be retained, >>> we just force the inner loop header onto the worklist). >> >> Is this same to ? >> >> for (loop = loop->inner; loop; loop = loop->next) >> fill_always_executed_in_1 (loop, contains_call) > > Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which > should iterate over loops in PRE order. Tried implementation with your pseudo-code, it works like the previous v2 solution, bb after inner loops could be marked as ALWAYS EXECUTED. You are really so strong! ;) Regression tested pass on P8LE. Assume, A: GCC Base B: inner loop check removal C: incoming count Execution time of fill_always_executed_in_1 for the 73-depth loop (with five function copies) is (A vs. B vs. C): 32ms vs 58ms vs 38ms. Much better than O(n^2)? Bumped the patch to v3 with TODOs: 1. 
The time measurement code will be removed if the v3 is OK; 2. loops_list is not used as my code is not rebased to upstream yet. 3. Function comments are not updated. [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 ALWAYS_EXECUTED_IN is not computed completely for nested loops. The current design exits if an inner loop doesn't dominate the outer loop's latch, or exits right after leaving the inner loop, which causes an early return from the outer loop; ALWAYS_EXECUTED blocks after inner loops are then skipped. Another problem is that it has to call get_loop_body_in_dom_order in each recursive call, which is also inefficient. This patch does greedy visiting of the loop header successors, processing further blocks if all entries (ignoring backedges) are processed, setting ALWAYS_EXECUTED_IN on blocks that dominate the loop's latch. When the worklist is empty, proceed to inner loops as the current code does. For bookkeeping, simply keep a to-visit-incoming-edges counter per BB, and pass it through the iteration over inner loops. Detailed discussion: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html gcc/ChangeLog: * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove inn_loop check. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/ssa-lim-19.c: New test. * gcc.dg/tree-ssa/ssa-lim-20.c: New test. --- gcc/tree-ssa-loop-im.c | 95 +++++++++++++--------- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 +++++++ 3 files changed, 111 insertions(+), 38 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index b24bc64f2a7..72be6b6bfac 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -3192,39 +3192,52 @@ do_store_motion (void) blocks that contain a nonpure call.
*/ static void -fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) +fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int *bbi) { - basic_block bb = NULL, *bbs, last = NULL; - unsigned i; + basic_block bb = NULL; edge e; - class loop *inn_loop = loop; if (ALWAYS_EXECUTED_IN (loop->header) == NULL) { - bbs = get_loop_body_in_dom_order (loop); + auto_bitmap work_set; + bitmap_clear (work_set); + bitmap_set_bit (work_set, loop->header->index); + unsigned bb_index; - for (i = 0; i < loop->num_nodes; i++) - { - edge_iterator ei; - bb = bbs[i]; + unsigned array_size = last_basic_block_for_fn (cfun) + 1; + int *bbd = XNEWVEC (int, array_size); + bbd = XDUPVEC (int, bbi, array_size); + while (!bitmap_empty_p (work_set)) + { + bb_index = bitmap_first_set_bit (work_set); + bitmap_clear_bit (work_set, bb_index); + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) - last = bb; - + SET_ALWAYS_EXECUTED_IN (bb, loop); if (bitmap_bit_p (contains_call, bb->index)) break; - + edge_iterator ei; FOR_EACH_EDGE (e, ei, bb->succs) { - /* If there is an exit from this BB. */ if (!flow_bb_inside_loop_p (loop, e->dest)) break; + /* Or we enter a possibly non-finite loop. */ if (flow_loop_nested_p (bb->loop_father, e->dest->loop_father) && ! finite_loop_p (e->dest->loop_father)) break; + + if (bbd[e->dest->index] == 1) + { + bitmap_set_bit (work_set, e->dest->index); + bbd[e->dest->index]--; + } + else if (bbd[e->dest->index] > 1) + bbd[e->dest->index]--; } + if (e) break; @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) to disprove this if possible). */ if (bb->flags & BB_IRREDUCIBLE_LOOP) break; - - if (!flow_bb_inside_loop_p (inn_loop, bb)) - break; - - if (bb->loop_father->header == bb) - { - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) - break; - - /* In a loop that is always entered we may proceed anyway. 
- But record that we entered it and stop once we leave it. */ - inn_loop = bb->loop_father; - } - } - - while (1) - { - SET_ALWAYS_EXECUTED_IN (last, loop); - if (last == loop->header) - break; - last = get_immediate_dominator (CDI_DOMINATORS, last); } - - free (bbs); + free (bbd); } for (loop = loop->inner; loop; loop = loop->next) - fill_always_executed_in_1 (loop, contains_call); + fill_always_executed_in_1 (loop, contains_call, bbi); } /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. @@ -3287,8 +3278,36 @@ fill_always_executed_in (void) bitmap_set_bit (contains_call, bb->index); } + mark_dfs_back_edges (); + unsigned array_size = last_basic_block_for_fn (cfun) + 1; + int *bb_incoming_edges = XNEWVEC (int, array_size); + memset (bb_incoming_edges, 0, array_size * sizeof (int)); + edge_iterator ei; + edge e; + + FOR_EACH_BB_FN (bb, cfun) + { + int len = bb->preds->length (); + FOR_EACH_EDGE (e, ei, bb->preds) + if (e->flags & EDGE_DFS_BACK) + len--; + bb_incoming_edges[bb->index] = len; + } + + static unsigned long diff = 0; + struct timeval tv; + gettimeofday (&tv, NULL); + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; + + //for (loop : loops_list (cfun, 0)) for (loop = current_loops->tree_root->inner; loop; loop = loop->next) - fill_always_executed_in_1 (loop, contains_call); + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); + + gettimeofday (&tv, NULL); + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; + diff += end - begin; + //printf ("%s,diff:%ld\n", current_function_name (), diff); + free (bb_incoming_edges); } diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c new file mode 100644 index 00000000000..097a5ee4a4b --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c @@ -0,0 +1,24 @@ +/* PR/101293 */ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +struct X { int i; int j; int k;}; + 
+void foo(struct X *x, int n, int l) +{ + for (int j = 0; j < l; j++) + { + for (int i = 0; i < n; ++i) + { + int *p = &x->j; + int tem = *p; + x->j += tem * i; + } + int *r = &x->k; + int tem2 = *r; + x->k += tem2 * j; + } +} + +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c new file mode 100644 index 00000000000..6e180de528e --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c @@ -0,0 +1,30 @@ +/* PR/101293 */ +/* { dg-do run } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q) +{ + while (--n) + { + do + { + *q += 1; + if (f) + goto out; + } + while (--m); + *p += 1; + } +out:; +} + +int main() +{ + int i = 0; + foo (10, 10, 1, (void *) 0, &i); + if (i != 1) + __builtin_abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */ -- 2.27.0.90.geebb51ba8c ^ permalink raw reply [flat|nested] 17+ messages in thread
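The mark_dfs_back_edges + predecessor-counting step used by the v3 patch can be mimicked standalone like this (a toy CFG with invented arrays; EDGE_DFS_BACK becomes a parallel flag table, all names hypothetical): a DFS flags edges that lead back into a block still on the DFS stack, and only the remaining edges count toward a block's to-visit counter.

```c
#include <assert.h>

#define NBB 4

/* Toy CFG with one loop: 0 -> 1; 1 -> 2; 2 -> 1 (back edge), 2 -> 3.  */
static const int nsucc[NBB] = { 1, 1, 2, 0 };
static const int succ[NBB][2] = { {1, 0}, {2, 0}, {1, 3}, {0, 0} };
static int back_edge[NBB][2];   /* parallel to succ[][], like EDGE_DFS_BACK */

/* DFS flagging edges into a block that is still on the DFS stack,
   in the spirit of mark_dfs_back_edges.
   state: 0 = unvisited, 1 = on stack, 2 = finished.  */
static void
dfs (int bb, int *state)
{
  state[bb] = 1;
  for (int s = 0; s < nsucc[bb]; s++)
    {
      int dest = succ[bb][s];
      if (state[dest] == 1)
        back_edge[bb][s] = 1;   /* closes a cycle: a back edge */
      else if (state[dest] == 0)
        dfs (dest, state);
    }
  state[bb] = 2;
}

/* Count incoming edges per block, ignoring the flagged back edges;
   this is the per-BB counter the worklist walk would start from.  */
static void
count_incoming (int incoming[NBB])
{
  int state[NBB] = { 0 };
  dfs (0, state);
  for (int b = 0; b < NBB; b++)
    incoming[b] = 0;
  for (int b = 0; b < NBB; b++)
    for (int s = 0; s < nsucc[b]; s++)
      if (!back_edge[b][s])
        incoming[succ[b][s]]++;
}
```

Here block 1 ends up with an incoming count of 1: the 2 -> 1 back edge is flagged and excluded, so the loop header does not wait on its own latch before being processed.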
* Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 2021-08-26 5:50 ` [PATCH v3] " Xionghu Luo @ 2021-08-27 7:45 ` Richard Biener 2021-08-30 8:49 ` Xionghu Luo 0 siblings, 1 reply; 17+ messages in thread From: Richard Biener @ 2021-08-27 7:45 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On Thu, 26 Aug 2021, Xionghu Luo wrote: > > > On 2021/8/24 16:20, Richard Biener wrote: > > On Tue, 24 Aug 2021, Xionghu Luo wrote: > > > >> > >> > >> On 2021/8/19 20:11, Richard Biener wrote: > >>>> - class loop *inn_loop = loop; > >>>> > >>>> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > >>>> { > >>>> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > >>>> to disprove this if possible). */ > >>>> if (bb->flags & BB_IRREDUCIBLE_LOOP) > >>>> break; > >>>> - > >>>> - if (!flow_bb_inside_loop_p (inn_loop, bb)) > >>>> - break; > >>>> - > >>>> - if (bb->loop_father->header == bb) > >>>> - { > >>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>> - break; > >>>> - > >>>> - /* In a loop that is always entered we may proceed anyway. > >>>> - But record that we entered it and stop once we leave it. */ > >>>> - inn_loop = bb->loop_father; > >>>> - } > >>>> } > >>>> > >>>> while (1) > >>> I'm not sure this will work correct (I'm not sure how the existing > >>> code makes it so either...). That said, I can't poke any hole > >>> into the change. 
What I see is that definitely > >>> > >>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>> last = bb; > >>> > >>> if (bitmap_bit_p (contains_call, bb->index)) > >>> break; > >>> > >>> doesn't work reliably since the DOM ordering will process blocks > >>> A B and C in random order for > >>> > >>> for (;;) > >>> { > >>> if (cond) > >>> { > >>> A: foo (); > >>> } > >>> else B:; > >>> C:; > >>> } > >>> > >>> and thus we can end up setting 'last' to C_before_ processing > >>> 'A' and thus arriving at the call foo () ... > >>> > >>> get_loop_body_in_dom_order does some "special sauce" but not > >>> to address the above problem - but it might be that a subtle > >>> issue like the above is the reason for the inner loop handling. > >>> The inner loop block order does_not_ adhere to this "special sauce", > >>> that is - the "Additionally, if a basic block s dominates > >>> the latch, then only blocks dominated by s are be after it." > >>> guarantee holds for the outer loop latch, not for the inner. > >>> > >>> Digging into the history of fill_always_executed_in_1 doesn't > >>> reveal anything - the inner loop handling has been present > >>> since introduction by Zdenek - but usually Zdenek has a reason > >>> for doing things as he does;) > >> > >> Yes, this is really complicated usage, thanks for point it out. :) > >> I constructed two cases to verify this with inner loop includes "If A; else B; C". > >> Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also checks > >> whether the bb domintes outer loop’s latch, if C dominate outer loop’s latch, > >> C is postponed, the access order is ABC, 'last' won’t be set to C if A or B contains call; > > > > But it depends on the order of visiting ABC and that's hard to put into > > a testcase since it depends on the order of edges and the processing > > of the dominance computation. ABC are simply unordered with respect > > to a dominator walk. 
> > > >> Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop, > >> the access order is CAB, but 'last' also won’t be updated to C in fill_always_executed_in_1 > >> since there is also dominate check, then if A or B contains call, it could break > >> successfully. > >> > >> C won't be set to ALWAYS EXECUTED for both circumstance. > >> > >>> > >>> Note it might be simply a measure against quadratic complexity, > >>> esp. since with your patch we also dive into not always executed > >>> subloops as you remove the > >>> > >>> if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>> break; > >>> > >>> check. I suggest to evaluate behavior of the patch on a testcase > >>> like > >>> > >>> void foo (int n, int **k) > >>> { > >>> for (int i = 0; i < n; ++i) > >>> if (k[0][i]) > >>> for (int j = 0; j < n; ++j) > >>> if (k[1][j]) > >>> for (int l = 0; l < n; ++l) > >>> if (k[2][l]) > >>> ... > >>> } > >> > >> Theoretically the complexity is changing from L1(bbs) to L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), > >> so fill_always_executed_in_1's execution time is supposed to be increase from > >> O(n) to O(n2)? The time should depend on loop depth and bb counts. I also drafted a > >> test case has 73-depth loop function with 25 no-ipa function copies each compiled > >> in lim2 and lim4 dependently. Total execution time of fill_always_executed_in_1 is > >> increased from 32ms to 58ms, almost doubled but not quadratic? > > > > It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still > > O(n^2). > > > >> It seems reasonable to see compiling time getting longer since most bbs are checked > >> more but a MUST to ensure early break correctly in every loop level... > >> Though loop nodes could be huge, loop depth will never be so large in actual code? > > > > The "in practice" argument is almost always defeated by automatic > > program generators ;) > > > >>> > >>> I suspect you'll see quadratic behavior with your patch. 
You > >>> should be at least able to preserve a check like > >>> > >>> /* Do not process not always executed subloops to avoid > >>> quadratic behavior. */ > >>> if (bb->loop_father->header == bb > >>> && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>> break; > >>> > >>> which is of course not optimistic for cases like > >>> > >>> for (..) > >>> { > >>> if (cond) > >>> for (..) > >>> x = 1; // this is always executed if the inner loop is finite > >>> } > >>> > >>> but we need to have an eye on the complexity of this function. I would > >>> have suggested to do greedy visiting of the loop header successors, > >>> processing further blocks if all entries (ignoring backedges) are > >>> processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist > >>> is empty proceed to inner loops as the current code does. For > >>> bookkeeping simply keep a to-visit-incoming-edges counter per BB. > >>> Pseudo-code: > >>> > >>> bitmap_set_bit (worklist, loop-header-bb); > >>> while (!bitmap_empty_p (worklist)) > >>> { > >>> bb = pop (worklist); > >> > >> Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN? > > > > Ah, sure. > > > >>> SET_ALWAYS_EXECUTED_IN (bb, loop); > >>> if (bitmap_bit_p (contains_call, bb->index)) > >>> continue; > >>> FOR_EACH_EDGE (e, ei, bb->succs) > >>> { > >>> if (!flow_bb_inside_loop_p (loop, e->dest)) > >>> continue; > >>> if (incoming_count[e->dest->index]-- == 0) > >>> push (worklist, e->dest); > >>> } > >>> } > >> > >> Sorry I don't fully understand your algorithm. worklist should be > >> auto_vec<basic_block> don't support bitmap operations? Is incoming_count > >> the bb's preds count, why need retain it since it it decreased to 0? > > > > the worklist is a auto_bitmap, pop () would be > > bitmap_first_set_bit/clear_bit. One could use a vector with the > > caveat of eventually adding duplicate members to the worklist. > > But as said, it's pseudo-code ;) > > Thanks a lot! 
> > > > >>> > >>> iterate over inner loops (incoming_count can be retained, > >>> we just force the inner loop header onto the worklist). > >> > >> Is this same to ? > >> > >> for (loop = loop->inner; loop; loop = loop->next) > >> fill_always_executed_in_1 (loop, contains_call) > > > > Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which > > should iterate over loops in PRE order. > > Tried implementation with your pseudo-code, it works like the > previous v2 solution, bb after inner loops could be marked > as ALWAYS EXECUTED. You are really so strong! ;) > Regression tested pass on P8LE. > > Assume, > > A: GCC Base > B: inner loop check removal > C: incoming count > > Execution time of fill_always_executed_in_1 for the 73-depth loop > (with five function copies) is > (A vs. B vs. C): 32ms vs 58ms vs 38ms. Much better than O(n^2)? > > > Bumped the patch to v3 with TODOs: > 1. The time measurement code will be removed if the v3 is OK; > 2. loops_list is not used as my code is not rebased to upstream yet. > 3. Function comments is not updated. > > > > [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 > > > ALWAYS_EXECUTED_IN is not computed completely for nested loops. > Current design will exit if an inner loop doesn't dominate outer > loop's latch or exit after exiting from inner loop, which > caused early return from outer loop, then ALWAYS EXECUTED blocks after > inner loops are skipped. Another problem is it has to call > get_loop_body_in_dom_order in each recursive call which is also > inefficient. > > This patch does greedy visiting of the loop header successors, > processing further blocks if all entries (ignoring backedges) are > processed, setting SET_ALWAYS_EXECUTED_IN of dominates loop's latch. > When the worklist is empty proceed to inner loops as the current > code does. For bookkeeping simply keep a to-visit-incoming-edges > counter per BB, and pass it through iteration over inner loops. 
> > Details discusions: > https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html > > gcc/ChangeLog: > > * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove > inn_loop check. > > gcc/testsuite/ChangeLog: > > * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > * gcc.dg/tree-ssa/ssa-lim-20.c: New test. > --- > gcc/tree-ssa-loop-im.c | 95 +++++++++++++--------- > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 +++++++ > 3 files changed, 111 insertions(+), 38 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > index b24bc64f2a7..72be6b6bfac 100644 > --- a/gcc/tree-ssa-loop-im.c > +++ b/gcc/tree-ssa-loop-im.c > @@ -3192,39 +3192,52 @@ do_store_motion (void) > blocks that contain a nonpure call. */ > > static void > -fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > +fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int *bbi) > { > - basic_block bb = NULL, *bbs, last = NULL; > - unsigned i; > + basic_block bb = NULL; > edge e; > - class loop *inn_loop = loop; > > if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > { > - bbs = get_loop_body_in_dom_order (loop); > + auto_bitmap work_set; > + bitmap_clear (work_set); bitmap_clear isn't necessary > + bitmap_set_bit (work_set, loop->header->index); > + unsigned bb_index; > > - for (i = 0; i < loop->num_nodes; i++) > - { > - edge_iterator ei; > - bb = bbs[i]; > + unsigned array_size = last_basic_block_for_fn (cfun) + 1; > + int *bbd = XNEWVEC (int, array_size); > + bbd = XDUPVEC (int, bbi, array_size); I don't think you need to copy 'bbi' but you can re-use the state from the outer loop processing. Did you run into any issues with that? 
> + while (!bitmap_empty_p (work_set)) > + { > + bb_index = bitmap_first_set_bit (work_set); > + bitmap_clear_bit (work_set, bb_index); > + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); > if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > - last = bb; > - > + SET_ALWAYS_EXECUTED_IN (bb, loop); > if (bitmap_bit_p (contains_call, bb->index)) > break; I think you want to continue; here (process remaining worklist but not continue greedy walking this block) > - > + edge_iterator ei; > FOR_EACH_EDGE (e, ei, bb->succs) > { > - /* If there is an exit from this BB. */ > if (!flow_bb_inside_loop_p (loop, e->dest)) > break; in particular this should keep the outer 'bbi' valid to re-use. But again, you want 'continue;' the greedy walk to other edges. If that's not valid (I'd need to think about this) then with your patch whether we process an edge depends on the order of the edge visit so you'd have to walk successors twice, once to determine whether we can greedily walk any of it and once to actually do the greedy walk. So thinking about it an exit edge is like a not returning call and thus we indeed should not process any outgoing edges of this block. > + > /* Or we enter a possibly non-finite loop. */ > if (flow_loop_nested_p (bb->loop_father, > e->dest->loop_father) > && ! finite_loop_p (e->dest->loop_father)) > break; I think this is no longer necessary? In any case it would again be 'continue;', see also above. I'm missing skipping of backedges in the walk. > + > + if (bbd[e->dest->index] == 1) > + { > + bitmap_set_bit (work_set, e->dest->index); > + bbd[e->dest->index]--; > + } > + else if (bbd[e->dest->index] > 1) > + bbd[e->dest->index]--; maybe just bbd[e->dest->index]--; if (bbd[e->dest->index] == 0) bitmap_set_bit (work_set, e->dest->index); > } > + > if (e) > break; continue; processing the worklist > > @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > to disprove this if possible). 
*/ > if (bb->flags & BB_IRREDUCIBLE_LOOP) > break; continue; processing the worklist. I think this has to come early, before processing successors, next to the contains_call processing. We could think of ignoring entering of irreducible regions by tracking exits from them, but I'm not sure we're identifying the regions in a good enough way to allow this. Likewise possibly infinite inner loops can be processed when we track exits from those which would be sth like /* When this is an exit from an inner loop that we cannot prove as finite do not follow this edge. */ if (bb->loop_father != loop && loop_exit_edge_p (bb->loop_father, e) && ! finite_loop_p (bb->loop_father)) continue; > - > - if (!flow_bb_inside_loop_p (inn_loop, bb)) > - break; > - > - if (bb->loop_father->header == bb) > - { > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > - break; > - > - /* In a loop that is always entered we may proceed anyway. > - But record that we entered it and stop once we leave it. */ > - inn_loop = bb->loop_father; > - } > - } > - > - while (1) > - { > - SET_ALWAYS_EXECUTED_IN (last, loop); > - if (last == loop->header) > - break; > - last = get_immediate_dominator (CDI_DOMINATORS, last); > } > - > - free (bbs); > + free (bbd); > } > > for (loop = loop->inner; loop; loop = loop->next) > - fill_always_executed_in_1 (loop, contains_call); > + fill_always_executed_in_1 (loop, contains_call, bbi); > } > > /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. > @@ -3287,8 +3278,36 @@ fill_always_executed_in (void) > bitmap_set_bit (contains_call, bb->index); > } > > + mark_dfs_back_edges (); Please put this to loop_invariant_motion_in_fun right before the call to fill_always_executed_in. > + unsigned array_size = last_basic_block_for_fn (cfun) + 1; > + int *bb_incoming_edges = XNEWVEC (int, array_size); There's XCNEWVEC that also zeros the array. 
> + memset (bb_incoming_edges, 0, array_size * sizeof (int)); > + edge_iterator ei; > + edge e; > + > + FOR_EACH_BB_FN (bb, cfun) > + { > + int len = bb->preds->length (); > + FOR_EACH_EDGE (e, ei, bb->preds) > + if (e->flags & EDGE_DFS_BACK) > + len--; > + bb_incoming_edges[bb->index] = len; > + } it would be nice to combine this with the preceding loop over all blocks. > + static unsigned long diff = 0; > + struct timeval tv; > + gettimeofday (&tv, NULL); > + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; > + > + //for (loop : loops_list (cfun, 0)) > for (loop = current_loops->tree_root->inner; loop; loop = loop->next) > - fill_always_executed_in_1 (loop, contains_call); > + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); > + > + gettimeofday (&tv, NULL); > + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; > + diff += end - begin; > + //printf ("%s,diff:%ld\n", current_function_name (), diff); > + free (bb_incoming_edges); Eh, looks very much like work in progress. From the O() complexity point of view, duplicating bb_incoming_edges for each loop makes it quadratic again, but as said I think we don't need this duplication. I'd like to see the recursion removed, did the for (loop : ...) not work out? Thanks, Richard. 
> } > > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > new file mode 100644 > index 00000000000..097a5ee4a4b > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > @@ -0,0 +1,24 @@ > +/* PR/101293 */ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +struct X { int i; int j; int k;}; > + > +void foo(struct X *x, int n, int l) > +{ > + for (int j = 0; j < l; j++) > + { > + for (int i = 0; i < n; ++i) > + { > + int *p = &x->j; > + int tem = *p; > + x->j += tem * i; > + } > + int *r = &x->k; > + int tem2 = *r; > + x->k += tem2 * j; > + } > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ > + > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > new file mode 100644 > index 00000000000..6e180de528e > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > @@ -0,0 +1,30 @@ > +/* PR/101293 */ > +/* { dg-do run } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q) > +{ > + while (--n) > + { > + do > + { > + *q += 1; > + if (f) > + goto out; > + } > + while (--m); > + *p += 1; > + } > +out:; > +} > + > +int main() > +{ > + int i = 0; > + foo (10, 10, 1, (void *) 0, &i); > + if (i != 1) > + __builtin_abort (); > + return 0; > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */ > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
* Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 2021-08-27 7:45 ` Richard Biener @ 2021-08-30 8:49 ` Xionghu Luo 2021-08-30 9:19 ` Richard Biener 0 siblings, 1 reply; 17+ messages in thread From: Xionghu Luo @ 2021-08-30 8:49 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw [-- Attachment #1: Type: text/plain, Size: 19611 bytes --] On 2021/8/27 15:45, Richard Biener wrote: > On Thu, 26 Aug 2021, Xionghu Luo wrote: > >> >> >> On 2021/8/24 16:20, Richard Biener wrote: >>> On Tue, 24 Aug 2021, Xionghu Luo wrote: >>> >>>> >>>> >>>> On 2021/8/19 20:11, Richard Biener wrote: >>>>>> - class loop *inn_loop = loop; >>>>>> >>>>>> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) >>>>>> { >>>>>> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) >>>>>> to disprove this if possible). */ >>>>>> if (bb->flags & BB_IRREDUCIBLE_LOOP) >>>>>> break; >>>>>> - >>>>>> - if (!flow_bb_inside_loop_p (inn_loop, bb)) >>>>>> - break; >>>>>> - >>>>>> - if (bb->loop_father->header == bb) >>>>>> - { >>>>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>>>> - break; >>>>>> - >>>>>> - /* In a loop that is always entered we may proceed anyway. >>>>>> - But record that we entered it and stop once we leave it. */ >>>>>> - inn_loop = bb->loop_father; >>>>>> - } >>>>>> } >>>>>> >>>>>> while (1) >>>>> I'm not sure this will work correct (I'm not sure how the existing >>>>> code makes it so either...). That said, I can't poke any hole >>>>> into the change. 
What I see is that definitely >>>>> >>>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>>> last = bb; >>>>> >>>>> if (bitmap_bit_p (contains_call, bb->index)) >>>>> break; >>>>> >>>>> doesn't work reliably since the DOM ordering will process blocks >>>>> A B and C in random order for >>>>> >>>>> for (;;) >>>>> { >>>>> if (cond) >>>>> { >>>>> A: foo (); >>>>> } >>>>> else B:; >>>>> C:; >>>>> } >>>>> >>>>> and thus we can end up setting 'last' to C_before_ processing >>>>> 'A' and thus arriving at the call foo () ... >>>>> >>>>> get_loop_body_in_dom_order does some "special sauce" but not >>>>> to address the above problem - but it might be that a subtle >>>>> issue like the above is the reason for the inner loop handling. >>>>> The inner loop block order does_not_ adhere to this "special sauce", >>>>> that is - the "Additionally, if a basic block s dominates >>>>> the latch, then only blocks dominated by s are be after it." >>>>> guarantee holds for the outer loop latch, not for the inner. >>>>> >>>>> Digging into the history of fill_always_executed_in_1 doesn't >>>>> reveal anything - the inner loop handling has been present >>>>> since introduction by Zdenek - but usually Zdenek has a reason >>>>> for doing things as he does;) >>>> >>>> Yes, this is really complicated usage, thanks for point it out. :) >>>> I constructed two cases to verify this with inner loop includes "If A; else B; C". >>>> Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also checks >>>> whether the bb domintes outer loop’s latch, if C dominate outer loop’s latch, >>>> C is postponed, the access order is ABC, 'last' won’t be set to C if A or B contains call; >>> >>> But it depends on the order of visiting ABC and that's hard to put into >>> a testcase since it depends on the order of edges and the processing >>> of the dominance computation. ABC are simply unordered with respect >>> to a dominator walk. 
>>> >>>> Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop, >>>> the access order is CAB, but 'last' also won’t be updated to C in fill_always_executed_in_1 >>>> since there is also dominate check, then if A or B contains call, it could break >>>> successfully. >>>> >>>> C won't be set to ALWAYS EXECUTED for both circumstance. >>>> >>>>> >>>>> Note it might be simply a measure against quadratic complexity, >>>>> esp. since with your patch we also dive into not always executed >>>>> subloops as you remove the >>>>> >>>>> if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>>> break; >>>>> >>>>> check. I suggest to evaluate behavior of the patch on a testcase >>>>> like >>>>> >>>>> void foo (int n, int **k) >>>>> { >>>>> for (int i = 0; i < n; ++i) >>>>> if (k[0][i]) >>>>> for (int j = 0; j < n; ++j) >>>>> if (k[1][j]) >>>>> for (int l = 0; l < n; ++l) >>>>> if (k[2][l]) >>>>> ... >>>>> } >>>> >>>> Theoretically the complexity is changing from L1(bbs) to L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), >>>> so fill_always_executed_in_1's execution time is supposed to be increase from >>>> O(n) to O(n2)? The time should depend on loop depth and bb counts. I also drafted a >>>> test case has 73-depth loop function with 25 no-ipa function copies each compiled >>>> in lim2 and lim4 dependently. Total execution time of fill_always_executed_in_1 is >>>> increased from 32ms to 58ms, almost doubled but not quadratic? >>> >>> It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still >>> O(n^2). >>> >>>> It seems reasonable to see compiling time getting longer since most bbs are checked >>>> more but a MUST to ensure early break correctly in every loop level... >>>> Though loop nodes could be huge, loop depth will never be so large in actual code? >>> >>> The "in practice" argument is almost always defeated by automatic >>> program generators ;) >>> >>>>> >>>>> I suspect you'll see quadratic behavior with your patch. 
You >>>>> should be at least able to preserve a check like >>>>> >>>>> /* Do not process not always executed subloops to avoid >>>>> quadratic behavior. */ >>>>> if (bb->loop_father->header == bb >>>>> && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>>> break; >>>>> >>>>> which is of course not optimistic for cases like >>>>> >>>>> for (..) >>>>> { >>>>> if (cond) >>>>> for (..) >>>>> x = 1; // this is always executed if the inner loop is finite >>>>> } >>>>> >>>>> but we need to have an eye on the complexity of this function. I would >>>>> have suggested to do greedy visiting of the loop header successors, >>>>> processing further blocks if all entries (ignoring backedges) are >>>>> processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist >>>>> is empty proceed to inner loops as the current code does. For >>>>> bookkeeping simply keep a to-visit-incoming-edges counter per BB. >>>>> Pseudo-code: >>>>> >>>>> bitmap_set_bit (worklist, loop-header-bb); >>>>> while (!bitmap_empty_p (worklist)) >>>>> { >>>>> bb = pop (worklist); >>>> >>>> Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN? >>> >>> Ah, sure. >>> >>>>> SET_ALWAYS_EXECUTED_IN (bb, loop); >>>>> if (bitmap_bit_p (contains_call, bb->index)) >>>>> continue; >>>>> FOR_EACH_EDGE (e, ei, bb->succs) >>>>> { >>>>> if (!flow_bb_inside_loop_p (loop, e->dest)) >>>>> continue; >>>>> if (incoming_count[e->dest->index]-- == 0) >>>>> push (worklist, e->dest); >>>>> } >>>>> } >>>> >>>> Sorry I don't fully understand your algorithm. worklist should be >>>> auto_vec<basic_block> don't support bitmap operations? Is incoming_count >>>> the bb's preds count, why need retain it since it it decreased to 0? >>> >>> the worklist is a auto_bitmap, pop () would be >>> bitmap_first_set_bit/clear_bit. One could use a vector with the >>> caveat of eventually adding duplicate members to the worklist. >>> But as said, it's pseudo-code ;) >> >> Thanks a lot! 
>> >>> >>>>> >>>>> iterate over inner loops (incoming_count can be retained, >>>>> we just force the inner loop header onto the worklist). >>>> >>>> Is this same to ? >>>> >>>> for (loop = loop->inner; loop; loop = loop->next) >>>> fill_always_executed_in_1 (loop, contains_call) >>> >>> Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which >>> should iterate over loops in PRE order. >> >> Tried implementation with your pseudo-code, it works like the >> previous v2 solution, bb after inner loops could be marked >> as ALWAYS EXECUTED. You are really so strong! ;) >> Regression tested pass on P8LE. >> >> Assume, >> >> A: GCC Base >> B: inner loop check removal >> C: incoming count >> >> Execution time of fill_always_executed_in_1 for the 73-depth loop >> (with five function copies) is >> (A vs. B vs. C): 32ms vs 58ms vs 38ms. Much better than O(n^2)? Sorry didn't get C's execution time accurate enough due to forgot to remove dump file code in it. C's execution time is 25 ms after code refine. It's even better than A. Attached the test case ssa-lim-22.c used for time measurement. >> >> >> Bumped the patch to v3 with TODOs: >> 1. The time measurement code will be removed if the v3 is OK; >> 2. loops_list is not used as my code is not rebased to upstream yet. >> 3. Function comments is not updated. >> >> >> >> [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 >> >> >> ALWAYS_EXECUTED_IN is not computed completely for nested loops. >> Current design will exit if an inner loop doesn't dominate outer >> loop's latch or exit after exiting from inner loop, which >> caused early return from outer loop, then ALWAYS EXECUTED blocks after >> inner loops are skipped. Another problem is it has to call >> get_loop_body_in_dom_order in each recursive call which is also >> inefficient. 
>> >> This patch does greedy visiting of the loop header successors, >> processing further blocks if all entries (ignoring backedges) are >> processed, setting SET_ALWAYS_EXECUTED_IN of dominates loop's latch. >> When the worklist is empty proceed to inner loops as the current >> code does. For bookkeeping simply keep a to-visit-incoming-edges >> counter per BB, and pass it through iteration over inner loops. >> >> Details discusions: >> https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html >> >> gcc/ChangeLog: >> >> * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove >> inn_loop check. >> >> gcc/testsuite/ChangeLog: >> >> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. >> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. >> --- >> gcc/tree-ssa-loop-im.c | 95 +++++++++++++--------- >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++ >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 +++++++ >> 3 files changed, 111 insertions(+), 38 deletions(-) >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c >> >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >> index b24bc64f2a7..72be6b6bfac 100644 >> --- a/gcc/tree-ssa-loop-im.c >> +++ b/gcc/tree-ssa-loop-im.c >> @@ -3192,39 +3192,52 @@ do_store_motion (void) >> blocks that contain a nonpure call. */ >> >> static void >> -fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) >> +fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int *bbi) >> { >> - basic_block bb = NULL, *bbs, last = NULL; >> - unsigned i; >> + basic_block bb = NULL; >> edge e; >> - class loop *inn_loop = loop; >> >> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) >> { >> - bbs = get_loop_body_in_dom_order (loop); >> + auto_bitmap work_set; >> + bitmap_clear (work_set); > > bitmap_clear isn't necessary Removed. 
> >> + bitmap_set_bit (work_set, loop->header->index); >> + unsigned bb_index; >> >> - for (i = 0; i < loop->num_nodes; i++) >> - { >> - edge_iterator ei; >> - bb = bbs[i]; >> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; >> + int *bbd = XNEWVEC (int, array_size); >> + bbd = XDUPVEC (int, bbi, array_size); > > I don't think you need to copy 'bbi' but you can re-use the > state from the outer loop processing. Did you run into any > issues with that? Yes. For example, after adding a small if-else block to ssa-lim-19.c, the block "x->j += tem * i;" (bb 6) is always executed for loop 2. When fill_always_executed_in_1 is called for loop 1, bbi[6] is decreased from 2 to 1 to 0; if fill_always_executed_in_1 is then called again for loop 2, the counter is not reset, so bb 6 won't be marked ALWAYS_EXECUTED, which is wrong. struct X { int i; int j; int k;}; void foo(struct X *x, int n, int l, int m) { for (int j = 0; j < l; j++) // loop 1 { for (int i = 0; i < n; ++i) // loop 2 { if (m) x->j++; else x->j = m+n+l; int *p = &x->j; // bb 6 int tem = *p; x->j += tem * i; } int *r = &x->k; int tem2 = *r; x->k += tem2 * j; } } > >> + while (!bitmap_empty_p (work_set)) >> + { >> + bb_index = bitmap_first_set_bit (work_set); >> + bitmap_clear_bit (work_set, bb_index); >> + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); >> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >> - last = bb; >> - >> + SET_ALWAYS_EXECUTED_IN (bb, loop); >> if (bitmap_bit_p (contains_call, bb->index)) >> break; > > I think you want to continue; here (process remaining worklist > but not continue greedy walking this block) Same as above: if 'continue' is used instead of 'break', the algorithm doesn't work either. If an inner loop contains a jump to the outermost loop, the blocks after the jump block will be marked ALWAYS_EXECUTED incorrectly. > >> - >> + edge_iterator ei; >> FOR_EACH_EDGE (e, ei, bb->succs) >> { >> - /* If there is an exit from this BB. 
*/ >> if (!flow_bb_inside_loop_p (loop, e->dest)) >> break; > > in particular this should keep the outer 'bbi' valid to re-use. > > But again, you want 'continue;' the greedy walk to other edges. > If that's not valid (I'd need to think about this) then with > your patch whether we process an edge depends on the order > of the edge visit so you'd have to walk successors twice, > once to determine whether we can greedily walk any of it > and once to actually do the greedy walk. > > So thinking about it an exit edge is like a not returning call > and thus we indeed should not process any outgoing edges of > this block. > >> + >> /* Or we enter a possibly non-finite loop. */ >> if (flow_loop_nested_p (bb->loop_father, >> e->dest->loop_father) >> && ! finite_loop_p (e->dest->loop_father)) >> break; > > I think this is no longer necessary? In any case it would > again be 'continue;', see also above. > > I'm missing skipping of backedges in the walk. > >> + >> + if (bbd[e->dest->index] == 1) >> + { >> + bitmap_set_bit (work_set, e->dest->index); >> + bbd[e->dest->index]--; >> + } >> + else if (bbd[e->dest->index] > 1) >> + bbd[e->dest->index]--; > > maybe just > > bbd[e->dest->index]--; > if (bbd[e->dest->index] == 0) > bitmap_set_bit (work_set, e->dest->index); They are not equivalent. Former won't decrease if bbd[e->dest->index] is 0, but the latter does. > >> } >> + >> if (e) >> break; > > continue; processing the worklist > >> >> @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) >> to disprove this if possible). */ >> if (bb->flags & BB_IRREDUCIBLE_LOOP) >> break; > > continue; processing the worklist. I think this has to > come early, before processing successors, next to > the contains_call processing. We could think of ignoring > entering of irreducible regions by tracking exits from them, > but I'm not sure we're identifying the regions in a good > enough way to allow this. 
> > Likewise possibly infinite inner loops can be processed > when we track exits from those which would be sth like > > /* When this is an exit from an inner loop that we cannot > prove as finite do not follow this edge. */ > if (bb->loop_father != loop > && loop_exit_edge_p (bb->loop_father, e) > && ! finite_loop_p (bb->loop_father)) > continue; > >> - >> - if (!flow_bb_inside_loop_p (inn_loop, bb)) >> - break; >> - >> - if (bb->loop_father->header == bb) >> - { >> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >> - break; >> - >> - /* In a loop that is always entered we may proceed anyway. >> - But record that we entered it and stop once we leave it. */ >> - inn_loop = bb->loop_father; >> - } >> - } >> - >> - while (1) >> - { >> - SET_ALWAYS_EXECUTED_IN (last, loop); >> - if (last == loop->header) >> - break; >> - last = get_immediate_dominator (CDI_DOMINATORS, last); >> } >> - >> - free (bbs); >> + free (bbd); >> } >> >> for (loop = loop->inner; loop; loop = loop->next) >> - fill_always_executed_in_1 (loop, contains_call); >> + fill_always_executed_in_1 (loop, contains_call, bbi); >> } >> >> /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. >> @@ -3287,8 +3278,36 @@ fill_always_executed_in (void) >> bitmap_set_bit (contains_call, bb->index); >> } >> >> + mark_dfs_back_edges (); > > Please put this to loop_invariant_motion_in_fun right before > the call to fill_always_executed_in. Done. > >> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; >> + int *bb_incoming_edges = XNEWVEC (int, array_size); > > There's XCNEWVEC that also zeros the array. OK, thanks. 
> >> + memset (bb_incoming_edges, 0, array_size * sizeof (int)); >> + edge_iterator ei; >> + edge e; >> + >> + FOR_EACH_BB_FN (bb, cfun) >> + { >> + int len = bb->preds->length (); >> + FOR_EACH_EDGE (e, ei, bb->preds) >> + if (e->flags & EDGE_DFS_BACK) >> + len--; >> + bb_incoming_edges[bb->index] = len; >> + } > > it would be nice to combine this with the preceding loop over > all blocks. Done. > >> + static unsigned long diff = 0; >> + struct timeval tv; >> + gettimeofday (&tv, NULL); >> + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; >> + >> + //for (loop : loops_list (cfun, 0)) >> for (loop = current_loops->tree_root->inner; loop; loop = loop->next) >> - fill_always_executed_in_1 (loop, contains_call); >> + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); >> + >> + gettimeofday (&tv, NULL); >> + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; >> + diff += end - begin; >> + //printf ("%s,diff:%ld\n", current_function_name (), diff); >> + free (bb_incoming_edges); > > Eh, looks very much like work in progress. From the O() complexity > point duplicating bb_incoming_edges for each loop makes it quadratic > again but as said I think we don't need this duplication. Sorry, I didn't quite follow your comments above about reusing bb_incoming_edges. We search the BBs that dominate loop->latch loop by loop, independently; if the outer loop changes the inner loops' bb_incoming_edges, won't that make the inner loop's ALWAYS_EXECUTED computation incorrect? It seems duplicating bb_incoming_edges doesn't increase the complexity, but the recursive call does? > > I'd like to see the recursion removed, did the for (loop : ...) > not work out? The recursion in fill_always_executed_in_1 could be removed, since loops_list can iterate over all inner loops, but "for (loop = current_loops->tree_root->inner; loop; loop = loop->next)" only iterates over sibling loops at one level? 
-- Thanks, Xionghu [-- Attachment #2: v4-0001-Fix-incomplete-computation-in-fill_always_execute.patch --] [-- Type: text/plain, Size: 8499 bytes --] From 6047a7b8bbef48794bc3d1608c64ae82e5716430 Mon Sep 17 00:00:00 2001 From: Xiong Hu Luo <luoxhu@linux.ibm.com> Date: Tue, 17 Aug 2021 20:33:26 -0500 Subject: [PATCH v4] Fix incomplete computation in fill_always_executed_in_1 ALWAYS_EXECUTED_IN is not computed completely for nested loops. The current design exits if an inner loop doesn't dominate the outer loop's latch, or if an exit is taken after leaving an inner loop; this causes an early return from the outer loop, so ALWAYS_EXECUTED blocks after inner loops are skipped. Another problem is that it has to call get_loop_body_in_dom_order in each recursive call, which is inefficient. This patch does greedy visiting of the loop header successors, processing further blocks once all of their entries (ignoring backedges) are processed, and calling SET_ALWAYS_EXECUTED_IN on blocks that dominate the loop's latch. When the worklist is empty, proceed to inner loops. For bookkeeping, simply keep a to-visit-incoming-edges counter per BB and pass it through to inner loops. Detailed discussion: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html gcc/ChangeLog: * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove inn_loop check. Greedily visit loop header successors and update edge counts. (fill_always_executed_in): Compute bb_incoming_edges and pass it down. (loop_invariant_motion_in_fun): Call mark_dfs_back_edges. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/ssa-lim-19.c: New test. * gcc.dg/tree-ssa/ssa-lim-20.c: New test. 
--- gcc/tree-ssa-loop-im.c | 109 ++++++++++++--------- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 31 ++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 ++++++ 3 files changed, 125 insertions(+), 45 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index b24bc64f2a7..5d69d47eaa6 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -3192,30 +3192,48 @@ do_store_motion (void) blocks that contain a nonpure call. */ static void -fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) +fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int *bbi) { - basic_block bb = NULL, *bbs, last = NULL; - unsigned i; + basic_block bb = NULL; edge e; - class loop *inn_loop = loop; if (ALWAYS_EXECUTED_IN (loop->header) == NULL) { - bbs = get_loop_body_in_dom_order (loop); + auto_bitmap work_set; + bitmap_set_bit (work_set, loop->header->index); + unsigned bb_index; - for (i = 0; i < loop->num_nodes; i++) - { - edge_iterator ei; - bb = bbs[i]; + unsigned array_size = last_basic_block_for_fn (cfun) + 1; + int *bbd = XCNEWVEC (int, array_size); + bbd = XDUPVEC (int, bbi, array_size); + while (!bitmap_empty_p (work_set)) + { + bb_index = bitmap_first_set_bit (work_set); + bitmap_clear_bit (work_set, bb_index); + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) - last = bb; - + { + SET_ALWAYS_EXECUTED_IN (bb, loop); + printf ("always executed: bb->index:%d, loop->num: %d\n", bb->index, loop->num); + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + "always executed: bb->index:%d, loop->num: %d\n", + bb->index, loop->num); + } if (bitmap_bit_p (contains_call, bb->index)) break; + /* A loop might be infinite (TODO use simple loop analysis + to disprove this if possible). 
*/ + if (bb->flags & BB_IRREDUCIBLE_LOOP) + break; + + edge_iterator ei; FOR_EACH_EDGE (e, ei, bb->succs) { + if (e->dest == loop->latch) + break; /* If there is an exit from this BB. */ if (!flow_bb_inside_loop_p (loop, e->dest)) break; @@ -3224,42 +3242,20 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) e->dest->loop_father) && ! finite_loop_p (e->dest->loop_father)) break; - } - if (e) - break; - - /* A loop might be infinite (TODO use simple loop analysis - to disprove this if possible). */ - if (bb->flags & BB_IRREDUCIBLE_LOOP) - break; - - if (!flow_bb_inside_loop_p (inn_loop, bb)) - break; - - if (bb->loop_father->header == bb) - { - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) - break; - /* In a loop that is always entered we may proceed anyway. - But record that we entered it and stop once we leave it. */ - inn_loop = bb->loop_father; + if (bbd[e->dest->index] == 1) + { + bitmap_set_bit (work_set, e->dest->index); + bbd[e->dest->index]--; + } + else if (bbd[e->dest->index] > 1) + bbd[e->dest->index]--; } - } - - while (1) - { - SET_ALWAYS_EXECUTED_IN (last, loop); - if (last == loop->header) + if (e) break; - last = get_immediate_dominator (CDI_DOMINATORS, last); } - - free (bbs); + free (bbd); } - - for (loop = loop->inner; loop; loop = loop->next) - fill_always_executed_in_1 (loop, contains_call); } /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. 
@@ -3270,12 +3266,22 @@ static void fill_always_executed_in (void) { basic_block bb; - class loop *loop; + edge_iterator ei; + edge e; + + unsigned array_size = last_basic_block_for_fn (cfun) + 1; + int *bb_incoming_edges = XCNEWVEC (int, array_size); auto_sbitmap contains_call (last_basic_block_for_fn (cfun)); bitmap_clear (contains_call); FOR_EACH_BB_FN (bb, cfun) { + int len = bb->preds->length (); + FOR_EACH_EDGE (e, ei, bb->preds) + if (e->flags & EDGE_DFS_BACK) + len--; + bb_incoming_edges[bb->index] = len; + gimple_stmt_iterator gsi; for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi)) { @@ -3287,8 +3293,19 @@ fill_always_executed_in (void) bitmap_set_bit (contains_call, bb->index); } - for (loop = current_loops->tree_root->inner; loop; loop = loop->next) - fill_always_executed_in_1 (loop, contains_call); + static unsigned long diff = 0; + struct timeval tv; + gettimeofday (&tv, NULL); + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; + + for (auto loop : loops_list (cfun, 0)) + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); + + gettimeofday (&tv, NULL); + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; + diff += end - begin; + printf ("%s,diff:%ld\n", current_function_name (), diff); + free (bb_incoming_edges); } @@ -3391,6 +3408,8 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion) /* Gathers information about memory accesses in the loops. */ analyze_memory_references (store_motion); + mark_dfs_back_edges (); + /* Fills ALWAYS_EXECUTED_IN information for basic blocks. 
*/ fill_always_executed_in (); diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c new file mode 100644 index 00000000000..3d5fe6f0314 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c @@ -0,0 +1,31 @@ +/* PR/101293 */ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +struct X { int i; int j; int k;}; + +void foo(struct X *x, int n, int l, int m) +{ + for (int j = 0; j < l; j++) + { + for (int i = 0; i < n; ++i) + { +#if 1 + if (m) + x->j++; + else + x->j = m+n+l; +#endif + + int *p = &x->j; + int tem = *p; + x->j += tem * i; + } + int *r = &x->k; + int tem2 = *r; + x->k += tem2 * j; + } +} + +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c new file mode 100644 index 00000000000..6e180de528e --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c @@ -0,0 +1,30 @@ +/* PR/101293 */ +/* { dg-do run } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q) +{ + while (--n) + { + do + { + *q += 1; + if (f) + goto out; + } + while (--m); + *p += 1; + } +out:; +} + +int main() +{ + int i = 0; + foo (10, 10, 1, (void *) 0, &i); + if (i != 1) + __builtin_abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */ -- 2.27.0.90.geebb51ba8c [-- Attachment #3: ssa-lim-22.c --] [-- Type: text/plain, Size: 4837 bytes --] #include <stdlib.h> #include <stdio.h> int n; int **k; int m; #define FOOAB(A, B) void __attribute__((noipa)) foo##A##B () { \ for (int i = 0; i < n; ++i) \ { \ if (k[0][i]) \ { \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; 
a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if 
(k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ { \ if (k[5][c]) \ { \ int *r = &k[2][m]; \ int tem2 = *r; \ k[2][m] += tem2 * i; \ } \ else \ { \ int *r = &k[5][m]; \ int tem2 = *r; \ k[5][m] += tem2 * i; \ } \ if (k[7][c]) \ { \ int *r = &k[3][m]; \ int tem2 = *r; \ k[3][m] += tem2 * i; \ } \ else \ { \ int *r = &k[3][m]; \ int tem2 = *r; \ k[3][m] += tem2 * i; \ } \ int *t = &k[8][m]; \ int tem2 = *t; \ k[8][m] += tem2 * i; \ } \ } \ int *s = &k[4][m]; \ int tem2 = *s; \ k[4][m] += tem2 * i; \ } \ } \ #define FOOB(B) FOOAB (1,B)\ FOOAB (2,B) \ FOOAB (3,B) \ FOOAB (4,B) \ FOOAB (5,B) FOOB (0) #if 1 FOOB (1) FOOB (2) FOOB (3) FOOB (4) #if 0 FOOB (5) FOOB (6) FOOB (7) FOOB (8) FOOB (9) #endif #endif ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 2021-08-30 8:49 ` Xionghu Luo @ 2021-08-30 9:19 ` Richard Biener 2021-08-31 7:43 ` Xionghu Luo 0 siblings, 1 reply; 17+ messages in thread From: Richard Biener @ 2021-08-30 9:19 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On Mon, 30 Aug 2021, Xionghu Luo wrote: > > > On 2021/8/27 15:45, Richard Biener wrote: > > On Thu, 26 Aug 2021, Xionghu Luo wrote: > > > >> > >> > >> On 2021/8/24 16:20, Richard Biener wrote: > >>> On Tue, 24 Aug 2021, Xionghu Luo wrote: > >>> > >>>> > >>>> > >>>> On 2021/8/19 20:11, Richard Biener wrote: > >>>>>> - class loop *inn_loop = loop; > >>>>>> > >>>>>> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > >>>>>> { > >>>>>> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, > >>>>>> @@ sbitmap contains_call) > >>>>>> to disprove this if possible). */ > >>>>>> if (bb->flags & BB_IRREDUCIBLE_LOOP) > >>>>>> break; > >>>>>> - > >>>>>> - if (!flow_bb_inside_loop_p (inn_loop, bb)) > >>>>>> - break; > >>>>>> - > >>>>>> - if (bb->loop_father->header == bb) > >>>>>> - { > >>>>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>>>> - break; > >>>>>> - > >>>>>> - /* In a loop that is always entered we may proceed > >>>>>> anyway. > >>>>>> - But record that we entered it and stop once we leave > >>>>>> it. */ > >>>>>> - inn_loop = bb->loop_father; > >>>>>> - } > >>>>>> } > >>>>>> > >>>>>> while (1) > >>>>> I'm not sure this will work correct (I'm not sure how the existing > >>>>> code makes it so either...). That said, I can't poke any hole > >>>>> into the change. 
What I see is that definitely > >>>>> > >>>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>>> last = bb; > >>>>> > >>>>> if (bitmap_bit_p (contains_call, bb->index)) > >>>>> break; > >>>>> > >>>>> doesn't work reliably since the DOM ordering will process blocks > >>>>> A B and C in random order for > >>>>> > >>>>> for (;;) > >>>>> { > >>>>> if (cond) > >>>>> { > >>>>> A: foo (); > >>>>> } > >>>>> else B:; > >>>>> C:; > >>>>> } > >>>>> > >>>>> and thus we can end up setting 'last' to C_before_ processing > >>>>> 'A' and thus arriving at the call foo () ... > >>>>> > >>>>> get_loop_body_in_dom_order does some "special sauce" but not > >>>>> to address the above problem - but it might be that a subtle > >>>>> issue like the above is the reason for the inner loop handling. > >>>>> The inner loop block order does_not_ adhere to this "special sauce", > >>>>> that is - the "Additionally, if a basic block s dominates > >>>>> the latch, then only blocks dominated by s are be after it." > >>>>> guarantee holds for the outer loop latch, not for the inner. > >>>>> > >>>>> Digging into the history of fill_always_executed_in_1 doesn't > >>>>> reveal anything - the inner loop handling has been present > >>>>> since introduction by Zdenek - but usually Zdenek has a reason > >>>>> for doing things as he does;) > >>>> > >>>> Yes, this is really complicated usage, thanks for point it out. :) > >>>> I constructed two cases to verify this with inner loop includes "If A; > >>>> else B; C". > >>>> Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also > >>>> checks > >>>> whether the bb domintes outer loop’s latch, if C dominate outer loop’s > >>>> latch, > >>>> C is postponed, the access order is ABC, 'last' won’t be set to C if A or > >>>> B contains call; > >>> > >>> But it depends on the order of visiting ABC and that's hard to put into > >>> a testcase since it depends on the order of edges and the processing > >>> of the dominance computation. 
ABC are simply unordered with respect > >>> to a dominator walk. > >>> > >>>> Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop, > >>>> the access order is CAB, but 'last' also won’t be updated to C in > >>>> fill_always_executed_in_1 > >>>> since there is also dominate check, then if A or B contains call, it > >>>> could break > >>>> successfully. > >>>> > >>>> C won't be set to ALWAYS EXECUTED for both circumstance. > >>>> > >>>>> > >>>>> Note it might be simply a measure against quadratic complexity, > >>>>> esp. since with your patch we also dive into not always executed > >>>>> subloops as you remove the > >>>>> > >>>>> if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>>> break; > >>>>> > >>>>> check. I suggest to evaluate behavior of the patch on a testcase > >>>>> like > >>>>> > >>>>> void foo (int n, int **k) > >>>>> { > >>>>> for (int i = 0; i < n; ++i) > >>>>> if (k[0][i]) > >>>>> for (int j = 0; j < n; ++j) > >>>>> if (k[1][j]) > >>>>> for (int l = 0; l < n; ++l) > >>>>> if (k[2][l]) > >>>>> ... > >>>>> } > >>>> > >>>> Theoretically the complexity is changing from L1(bbs) to > >>>> L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), > >>>> so fill_always_executed_in_1's execution time is supposed to be increase > >>>> from > >>>> O(n) to O(n2)? The time should depend on loop depth and bb counts. I > >>>> also drafted a > >>>> test case has 73-depth loop function with 25 no-ipa function copies each > >>>> compiled > >>>> in lim2 and lim4 dependently. Total execution time of > >>>> fill_always_executed_in_1 is > >>>> increased from 32ms to 58ms, almost doubled but not quadratic? > >>> > >>> It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still > >>> O(n^2). > >>> > >>>> It seems reasonable to see compiling time getting longer since most bbs > >>>> are checked > >>>> more but a MUST to ensure early break correctly in every loop level... 
> >>>> Though loop nodes could be huge, loop depth will never be so large in > >>>> actual code? > >>> > >>> The "in practice" argument is almost always defeated by automatic > >>> program generators ;) > >>> > >>>>> I suspect you'll see quadratic behavior with your patch. You > >>>>> should be at least able to preserve a check like > >>>>> > >>>>> /* Do not process not always executed subloops to avoid > >>>>> quadratic behavior. */ > >>>>> if (bb->loop_father->header == bb > >>>>> && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>>> break; > >>>>> > >>>>> which is of course not optimistic for cases like > >>>>> > >>>>> for (..) > >>>>> { > >>>>> if (cond) > >>>>> for (..) > >>>>> x = 1; // this is always executed if the inner loop is finite > >>>>> } > >>>>> > >>>>> but we need to have an eye on the complexity of this function. I would > >>>>> have suggested to do greedy visiting of the loop header successors, > >>>>> processing further blocks if all entries (ignoring backedges) are > >>>>> processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist > >>>>> is empty proceed to inner loops as the current code does. For > >>>>> bookkeeping simply keep a to-visit-incoming-edges counter per BB. > >>>>> Pseudo-code: > >>>>> > >>>>> bitmap_set_bit (worklist, loop-header-bb); > >>>>> while (!bitmap_empty_p (worklist)) > >>>>> { > >>>>> bb = pop (worklist); > >>>> > >>>> Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN? > >>> > >>> Ah, sure. > >>> > >>>>> SET_ALWAYS_EXECUTED_IN (bb, loop); > >>>>> if (bitmap_bit_p (contains_call, bb->index)) > >>>>> continue; > >>>>> FOR_EACH_EDGE (e, ei, bb->succs) > >>>>> { > >>>>> if (!flow_bb_inside_loop_p (loop, e->dest)) > >>>>> continue; > >>>>> if (incoming_count[e->dest->index]-- == 0) > >>>>> push (worklist, e->dest); > >>>>> } > >>>>> } > >>>> > >>>> Sorry I don't fully understand your algorithm. worklist should be > >>>> auto_vec<basic_block> don't support bitmap operations? 
Is incoming_count > >>>> the bb's preds count, why need retain it since it it decreased to 0? > >>> > >>> the worklist is a auto_bitmap, pop () would be > >>> bitmap_first_set_bit/clear_bit. One could use a vector with the > >>> caveat of eventually adding duplicate members to the worklist. > >>> But as said, it's pseudo-code ;) > >> > >> Thanks a lot! > >> > >>> > >>>>> > >>>>> iterate over inner loops (incoming_count can be retained, > >>>>> we just force the inner loop header onto the worklist). > >>>> > >>>> Is this same to ? > >>>> > >>>> for (loop = loop->inner; loop; loop = loop->next) > >>>> fill_always_executed_in_1 (loop, contains_call) > >>> > >>> Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which > >>> should iterate over loops in PRE order. > >> > >> Tried implementation with your pseudo-code, it works like the > >> previous v2 solution, bb after inner loops could be marked > >> as ALWAYS EXECUTED. You are really so strong! ;) > >> Regression tested pass on P8LE. > >> > >> Assume, > >> > >> A: GCC Base > >> B: inner loop check removal > >> C: incoming count > >> > >> Execution time of fill_always_executed_in_1 for the 73-depth loop > >> (with five function copies) is > >> (A vs. B vs. C): 32ms vs 58ms vs 38ms. Much better than O(n^2)? > > Sorry didn't get C's execution time accurate enough due to forgot to remove > dump file code in it. > C's execution time is 25 ms after code refine. It's even better than A. > Attached the test case ssa-lim-22.c used for time measurement. > > >> > >> > >> Bumped the patch to v3 with TODOs: > >> 1. The time measurement code will be removed if the v3 is OK; > >> 2. loops_list is not used as my code is not rebased to upstream yet. > >> 3. Function comments is not updated. > >> > >> > >> > >> [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 > >> > >> > >> ALWAYS_EXECUTED_IN is not computed completely for nested loops. 
> >> Current design will exit if an inner loop doesn't dominate outer > >> loop's latch or exit after exiting from inner loop, which > >> caused early return from outer loop, then ALWAYS EXECUTED blocks after > >> inner loops are skipped. Another problem is it has to call > >> get_loop_body_in_dom_order in each recursive call which is also > >> inefficient. > >> > >> This patch does greedy visiting of the loop header successors, > >> processing further blocks if all entries (ignoring backedges) are > >> processed, setting SET_ALWAYS_EXECUTED_IN of dominates loop's latch. > >> When the worklist is empty proceed to inner loops as the current > >> code does. For bookkeeping simply keep a to-visit-incoming-edges > >> counter per BB, and pass it through iteration over inner loops. > >> > >> Details discusions: > >> https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html > >> > >> gcc/ChangeLog: > >> > >> * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove > >> inn_loop check. > >> > >> gcc/testsuite/ChangeLog: > >> > >> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > >> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. > >> --- > >> gcc/tree-ssa-loop-im.c | 95 +++++++++++++--------- > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++ > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 +++++++ > >> 3 files changed, 111 insertions(+), 38 deletions(-) > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > >> > >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > >> index b24bc64f2a7..72be6b6bfac 100644 > >> --- a/gcc/tree-ssa-loop-im.c > >> +++ b/gcc/tree-ssa-loop-im.c > >> @@ -3192,39 +3192,52 @@ do_store_motion (void) > >> blocks that contain a nonpure call. 
*/ > >> > >> static void > >> -fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > >> +fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int > >> *bbi) > >> { > >> - basic_block bb = NULL, *bbs, last = NULL; > >> - unsigned i; > >> + basic_block bb = NULL; > >> edge e; > >> - class loop *inn_loop = loop; > >> > >> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > >> { > >> - bbs = get_loop_body_in_dom_order (loop); > >> + auto_bitmap work_set; > >> + bitmap_clear (work_set); > > > > bitmap_clear isn't necessary > > Removed. > > > > >> + bitmap_set_bit (work_set, loop->header->index); > >> + unsigned bb_index; > >> - for (i = 0; i < loop->num_nodes; i++) > >> - { > >> - edge_iterator ei; > >> - bb = bbs[i]; > >> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; > >> + int *bbd = XNEWVEC (int, array_size); > >> + bbd = XDUPVEC (int, bbi, array_size); > > > > I don't think you need to copy 'bbi' but you can re-use the > > state from the outer loop processing. Did you run into any > > issues with that? > > Yes. For example, adding a small if-else block to ssa-lim-19.c, > Then block "x->j += tem * i;" of bb 6 is always executed for loop 2, when call > fill_always_executed_in_1 for loop 1, bbi[6] is decreased from 2 to 1 to 0, > then if fill_always_executed_in_1 is called again for loop 2, it's value is > not > reset so bbi[6] won't be set ALWAYS EXECUTE, this is wrong. > > > struct X { int i; int j; int k;}; > > void foo(struct X *x, int n, int l, int m) > { > for (int j = 0; j < l; j++) // loop 1 > { > for (int i = 0; i < n; ++i) // loop 2 > { > if (m) > x->j++; > else > x->j = m+n+l; > > int *p = &x->j; // bb 6 > int tem = *p; > x->j += tem * i; > } > int *r = &x->k; > int tem2 = *r; > x->k += tem2 * j; > } > } Hmm, but if the outer loop processing reaches bb 6 then it should have set it ALWAYS_EXECUTED in loop 1 already? 
> > > > > >> + while (!bitmap_empty_p (work_set)) > >> + { > >> + bb_index = bitmap_first_set_bit (work_set); > >> + bitmap_clear_bit (work_set, bb_index); > >> + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); > >> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >> - last = bb; > >> - > >> + SET_ALWAYS_EXECUTED_IN (bb, loop); > >> if (bitmap_bit_p (contains_call, bb->index)) > >> break; > > > > I think you want to continue; here (process remaining worklist > > but not continue greedy walking this block) > > Same as above, if use 'continue' instead of 'break', the algorithm > seems also not work again. If inner loop contains a jump to outmost > loop, the blocks after the jump block will be set to ALWAYS EXECUTE > incorrectly. > > > > >> - > >> + edge_iterator ei; > >> FOR_EACH_EDGE (e, ei, bb->succs) > >> { > >> - /* If there is an exit from this BB. */ > >> if (!flow_bb_inside_loop_p (loop, e->dest)) > >> break; > > > > in particular this should keep the outer 'bbi' valid to re-use. > > > > But again, you want 'continue;' the greedy walk to other edges. > > If that's not valid (I'd need to think about this) then with > > your patch whether we process an edge depends on the order > > of the edge visit so you'd have to walk successors twice, > > once to determine whether we can greedily walk any of it > > and once to actually do the greedy walk. > > > > So thinking about it an exit edge is like a not returning call > > and thus we indeed should not process any outgoing edges of > > this block. > > > >> + > >> /* Or we enter a possibly non-finite loop. */ > >> if (flow_loop_nested_p (bb->loop_father, > >> e->dest->loop_father) > >> && ! finite_loop_p (e->dest->loop_father)) > >> break; > > > > I think this is no longer necessary? In any case it would > > again be 'continue;', see also above. > > > > I'm missing skipping of backedges in the walk. 
> > > >> + > >> + if (bbd[e->dest->index] == 1) > >> + { > >> + bitmap_set_bit (work_set, e->dest->index); > >> + bbd[e->dest->index]--; > >> + } > >> + else if (bbd[e->dest->index] > 1) > >> + bbd[e->dest->index]--; > > > > maybe just > > > > bbd[e->dest->index]--; > > if (bbd[e->dest->index] == 0) > > bitmap_set_bit (work_set, e->dest->index); > > They are not equivalent. Former won't decrease if bbd[e->dest->index] > is 0, but the latter does. we should be never visiting in this case, so sth indeed doesn't work as I expected. > > > >> } > >> + > >> if (e) > >> break; > > > > continue; processing the worklist > > > >> > >> @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, > >> @@ sbitmap contains_call) > >> to disprove this if possible). */ > >> if (bb->flags & BB_IRREDUCIBLE_LOOP) > >> break; > > > > continue; processing the worklist. I think this has to > > come early, before processing successors, next to > > the contains_call processing. We could think of ignoring > > entering of irreducible regions by tracking exits from them, > > but I'm not sure we're identifying the regions in a good > > enough way to allow this. > > > > Likewise possibly infinite inner loops can be processed > > when we track exits from those which would be sth like > > > > /* When this is an exit from an inner loop that we cannot > > prove as finite do not follow this edge. */ > > if (bb->loop_father != loop > > && loop_exit_edge_p (bb->loop_father, e) > > && ! finite_loop_p (bb->loop_father)) > > continue; > > > >> - > >> - if (!flow_bb_inside_loop_p (inn_loop, bb)) > >> - break; > >> - > >> - if (bb->loop_father->header == bb) > >> - { > >> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >> - break; > >> - > >> - /* In a loop that is always entered we may proceed anyway. > >> - But record that we entered it and stop once we leave it. 
*/ > >> - inn_loop = bb->loop_father; > >> - } > >> - } > >> - > >> - while (1) > >> - { > >> - SET_ALWAYS_EXECUTED_IN (last, loop); > >> - if (last == loop->header) > >> - break; > >> - last = get_immediate_dominator (CDI_DOMINATORS, last); > >> } > >> - > >> - free (bbs); > >> + free (bbd); > >> } > >> > >> for (loop = loop->inner; loop; loop = loop->next) > >> - fill_always_executed_in_1 (loop, contains_call); > >> + fill_always_executed_in_1 (loop, contains_call, bbi); > >> } > >> > >> /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. > >> @@ -3287,8 +3278,36 @@ fill_always_executed_in (void) > >> bitmap_set_bit (contains_call, bb->index); > >> } > >> + mark_dfs_back_edges (); > > > > Please put this to loop_invariant_motion_in_fun right before > > the call to fill_always_executed_in. > > Done. > > > > >> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; > >> + int *bb_incoming_edges = XNEWVEC (int, array_size); > > > > There's XCNEWVEC that also zeros the array. > > OK, thanks. > > > > >> + memset (bb_incoming_edges, 0, array_size * sizeof (int)); > >> + edge_iterator ei; > >> + edge e; > >> + > >> + FOR_EACH_BB_FN (bb, cfun) > >> + { > >> + int len = bb->preds->length (); > >> + FOR_EACH_EDGE (e, ei, bb->preds) > >> + if (e->flags & EDGE_DFS_BACK) > >> + len--; > >> + bb_incoming_edges[bb->index] = len; > >> + } > > > > it would be nice to combine this with the preceeding loop over > > all blocks. > > Done. 
> > > > >> + static unsigned long diff = 0; > >> + struct timeval tv; > >> + gettimeofday (&tv, NULL); > >> + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; > >> + > >> + //for (loop : loops_list (cfun, 0)) > >> for (loop = current_loops->tree_root->inner; loop; loop = loop->next) > >> - fill_always_executed_in_1 (loop, contains_call); > >> + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); > >> + > >> + gettimeofday (&tv, NULL); > >> + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; > >> + diff += end - begin; > >> + //printf ("%s,diff:%ld\n", current_function_name (), diff); > >> + free (bb_incoming_edges); > > > > Eh, looks very much like work in progress. From the O() complexity > > point duplicating bb_incoming_edges for each loop makes it quadratic > > again but as said I think we don't need this duplication. > > Sorry I didn't quite follow your above comments about reusing > bb_incoming_edges, > we are searching the bbs that dominate loop->latch loop by loop independently, > if outer loop changes the inner loops bb_incoming_edges, it will make the > inner > loop's ALWAYS EXECUTE computation incorrect? We're only supposed to greedily walk into always excuted blocks. I see this probably breaks down for conditionally executed inner loops like for () { if () for (); A; } where we want to process the inner loop separately but we still want to make sure A is marked as always executed in the outer loop. > Seems duplicating bb_incoming_edges not increases the complexity but the > recursive call does? Since the bb_incoming_edges array has a size of N and we copy it number_of_loop times (which is O(N) bound) the copying effectively has O(N^2) cost. So yes, it's the iteration over all loops, whether by recursion or iteration. 
I wonder if we can more cleverly handle this by picking up conditionally executed inner loops on the fly for example by recording an 'outermost unconditionally executed loop' for each loop header we visit and using that like SET_ALWAYS_EXECUTED_IN (bb, outermost_always_exec_in (bb->loop_father)) So make this a single greedy CFG walk starting at the outermost loops. Loop exits will need to be processed specially then. Richard. > > > > I'd like to see the recursion removed, did the for (loop : ...) > not work out? > The recursion in fill_always_executed_in_1 could be removed since loops_list > could iterate all inner loops but > "for (loop = current_loops->tree_root->inner; loop; loop = loop->next)" > only iterate sibling level loops? > > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg) ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 2021-08-30 9:19 ` Richard Biener @ 2021-08-31 7:43 ` Xionghu Luo 2021-08-31 11:29 ` Richard Biener 0 siblings, 1 reply; 17+ messages in thread From: Xionghu Luo @ 2021-08-31 7:43 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw [-- Attachment #1: Type: text/plain, Size: 15548 bytes --] On 2021/8/30 17:19, Richard Biener wrote: >>>> bitmap_set_bit (work_set, loop->header->index); >>>> + unsigned bb_index; >>>> - for (i = 0; i < loop->num_nodes; i++) >>>> - { >>>> - edge_iterator ei; >>>> - bb = bbs[i]; >>>> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; >>>> + int *bbd = XNEWVEC (int, array_size); >>>> + bbd = XDUPVEC (int, bbi, array_size); >>> I don't think you need to copy 'bbi' but you can re-use the >>> state from the outer loop processing. Did you run into any >>> issues with that? >> Yes. For example, adding a small if-else block to ssa-lim-19.c, >> Then block "x->j += tem * i;" of bb 6 is always executed for loop 2, when call >> fill_always_executed_in_1 for loop 1, bbi[6] is decreased from 2 to 1 to 0, >> then if fill_always_executed_in_1 is called again for loop 2, it's value is >> not >> reset so bbi[6] won't be set ALWAYS EXECUTE, this is wrong. >> >> >> struct X { int i; int j; int k;}; >> >> void foo(struct X *x, int n, int l, int m) >> { >> for (int j = 0; j < l; j++) // loop 1 >> { >> for (int i = 0; i < n; ++i) // loop 2 >> { >> if (m) >> x->j++; >> else >> x->j = m+n+l; >> >> int *p = &x->j; // bb 6 >> int tem = *p; >> x->j += tem * i; >> } >> int *r = &x->k; >> int tem2 = *r; >> x->k += tem2 * j; >> } >> } > Hmm, but if the outer loop processing reaches bb 6 then > it should have set it ALWAYS_EXECUTED in loop 1 already? But bb 6 is NOT ALWAYS_EXECUTED for loop 1, it is only ALWAYS_EXECUTED for loop 2 as it requires n>0. Please refer to the attached file ssa-lim-19.c.138t.lim2. 
;; ;; Loop 1 ;; header 8, latch 12 ;; depth 1, outer 0 ;; nodes: 8 12 7 6 4 5 3 13 11 ;; ;; Loop 2 ;; header 3, latch 13 ;; depth 2, outer 1 ;; nodes: 3 13 6 4 5 ;; 2 succs { 10 9 } ;; 10 succs { 8 } ;; 11 succs { 3 } ;; 3 succs { 4 5 } ;; 4 succs { 6 } ;; 5 succs { 6 } ;; 6 succs { 13 7 } ;; 13 succs { 3 } ;; 7 succs { 12 9 } ;; 12 succs { 8 } ;; 8 succs { 11 7 } ;; 9 succs { 1 } always executed: bb->index:8, loop->num: 1 always executed: bb->index:7, loop->num: 1 always executed: bb->index:3, loop->num: 2 always executed: bb->index:6, loop->num: 2 8<--------------- / \ | 11 \ | / \ | 3<--- \ | /\ | \ | 4 5 | \ | \/ | \ | 6 | \ | |-->13 \ | |----------> 7 | /\ | 9 12--- (gdb) x /15x bbd 0x1354c9b0: 0x00000000 0x00000000 0x00000001 0x00000001 0x1354c9c0: 0x00000001 0x00000001 0x00000002 0x00000002 0x1354c9d0: 0x00000001 0x00000002 0x00000001 0x00000001 0x1354c9e0: 0x00000001 0x00000001 0x00000000 our algorithm will walk through 8->11->3->4->5->6->7, for loop 1, exit at edge 7->9. (gdb) x /15x bbd 0x1354c9b0: 0x00000000 0x00000000 0x00000001 0x00000000 0x1354c9c0: 0x00000000 0x00000000 0x00000000 0x00000000 0x1354c9d0: 0x00000001 0x00000002 0x00000001 0x00000000 0x1354c9e0: 0x00000001 0x00000000 0x00000000 If we don't reset bbd to incoming_edge by memcpy, bbd[3],bbd[4],bbd[5] and bbd[6] is 0 now for loop 2, fill_always_executed_in_1 couldn't set ALWAYS_EXECUTED correctly for loop 2 at bb 3 and bb 6. 
> >> >>> >>>> + while (!bitmap_empty_p (work_set)) >>>> + { >>>> + bb_index = bitmap_first_set_bit (work_set); >>>> + bitmap_clear_bit (work_set, bb_index); >>>> + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); >>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>> - last = bb; >>>> - >>>> + SET_ALWAYS_EXECUTED_IN (bb, loop); >>>> if (bitmap_bit_p (contains_call, bb->index)) >>>> break; >>> I think you want to continue; here (process remaining worklist >>> but not continue greedy walking this block) >> Same as above, if use 'continue' instead of 'break', the algorithm >> seems also not work again. If inner loop contains a jump to outmost >> loop, the blocks after the jump block will be set to ALWAYS EXECUTE >> incorrectly. >> >>>> - >>>> + edge_iterator ei; >>>> FOR_EACH_EDGE (e, ei, bb->succs) >>>> { >>>> - /* If there is an exit from this BB. */ >>>> if (!flow_bb_inside_loop_p (loop, e->dest)) >>>> break; >>> in particular this should keep the outer 'bbi' valid to re-use. >>> >>> But again, you want 'continue;' the greedy walk to other edges. >>> If that's not valid (I'd need to think about this) then with >>> your patch whether we process an edge depends on the order >>> of the edge visit so you'd have to walk successors twice, >>> once to determine whether we can greedily walk any of it >>> and once to actually do the greedy walk. >>> >>> So thinking about it an exit edge is like a not returning call >>> and thus we indeed should not process any outgoing edges of >>> this block. >>> >>>> + >>>> /* Or we enter a possibly non-finite loop. */ >>>> if (flow_loop_nested_p (bb->loop_father, >>>> e->dest->loop_father) >>>> && ! finite_loop_p (e->dest->loop_father)) >>>> break; >>> I think this is no longer necessary? In any case it would >>> again be 'continue;', see also above. >>> >>> I'm missing skipping of backedges in the walk. 
>>> >>>> + >>>> + if (bbd[e->dest->index] == 1) >>>> + { >>>> + bitmap_set_bit (work_set, e->dest->index); >>>> + bbd[e->dest->index]--; >>>> + } >>>> + else if (bbd[e->dest->index] > 1) >>>> + bbd[e->dest->index]--; >>> maybe just >>> >>> bbd[e->dest->index]--; >>> if (bbd[e->dest->index] == 0) >>> bitmap_set_bit (work_set, e->dest->index); >> They are not equivalent. Former won't decrease if bbd[e->dest->index] >> is 0, but the latter does. > we should be never visiting in this case, so sth indeed doesn't > work as I expected. I've verified this won't happen if duplicate the bb_incoming_edges in fill_always_executed_in_1, so could replace it. > >>>> } >>>> + >>>> if (e) >>>> break; >>> continue; processing the worklist >>> >>>> >>>> @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, >>>> @@ sbitmap contains_call) >>>> to disprove this if possible). */ >>>> if (bb->flags & BB_IRREDUCIBLE_LOOP) >>>> break; >>> continue; processing the worklist. I think this has to >>> come early, before processing successors, next to >>> the contains_call processing. We could think of ignoring >>> entering of irreducible regions by tracking exits from them, >>> but I'm not sure we're identifying the regions in a good >>> enough way to allow this. >>> >>> Likewise possibly infinite inner loops can be processed >>> when we track exits from those which would be sth like >>> >>> /* When this is an exit from an inner loop that we cannot >>> prove as finite do not follow this edge. */ >>> if (bb->loop_father != loop >>> && loop_exit_edge_p (bb->loop_father, e) >>> && ! finite_loop_p (bb->loop_father)) >>> continue; >>> >>>> - >>>> - if (!flow_bb_inside_loop_p (inn_loop, bb)) >>>> - break; >>>> - >>>> - if (bb->loop_father->header == bb) >>>> - { >>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>> - break; >>>> - >>>> - /* In a loop that is always entered we may proceed anyway. >>>> - But record that we entered it and stop once we leave it. 
*/ >>>> - inn_loop = bb->loop_father; >>>> - } >>>> - } >>>> - >>>> - while (1) >>>> - { >>>> - SET_ALWAYS_EXECUTED_IN (last, loop); >>>> - if (last == loop->header) >>>> - break; >>>> - last = get_immediate_dominator (CDI_DOMINATORS, last); >>>> } >>>> - >>>> - free (bbs); >>>> + free (bbd); >>>> } >>>> >>>> for (loop = loop->inner; loop; loop = loop->next) >>>> - fill_always_executed_in_1 (loop, contains_call); >>>> + fill_always_executed_in_1 (loop, contains_call, bbi); >>>> } >>>> >>>> /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. >>>> @@ -3287,8 +3278,36 @@ fill_always_executed_in (void) >>>> bitmap_set_bit (contains_call, bb->index); >>>> } >>>> + mark_dfs_back_edges (); >>> Please put this to loop_invariant_motion_in_fun right before >>> the call to fill_always_executed_in. >> Done. >> >>>> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; >>>> + int *bb_incoming_edges = XNEWVEC (int, array_size); >>> There's XCNEWVEC that also zeros the array. >> OK, thanks. >> >>>> + memset (bb_incoming_edges, 0, array_size * sizeof (int)); >>>> + edge_iterator ei; >>>> + edge e; >>>> + >>>> + FOR_EACH_BB_FN (bb, cfun) >>>> + { >>>> + int len = bb->preds->length (); >>>> + FOR_EACH_EDGE (e, ei, bb->preds) >>>> + if (e->flags & EDGE_DFS_BACK) >>>> + len--; >>>> + bb_incoming_edges[bb->index] = len; >>>> + } >>> it would be nice to combine this with the preceeding loop over >>> all blocks. >> Done. 
>> >>>> + static unsigned long diff = 0; >>>> + struct timeval tv; >>>> + gettimeofday (&tv, NULL); >>>> + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; >>>> + >>>> + //for (loop : loops_list (cfun, 0)) >>>> for (loop = current_loops->tree_root->inner; loop; loop = loop->next) >>>> - fill_always_executed_in_1 (loop, contains_call); >>>> + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); >>>> + >>>> + gettimeofday (&tv, NULL); >>>> + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; >>>> + diff += end - begin; >>>> + //printf ("%s,diff:%ld\n", current_function_name (), diff); >>>> + free (bb_incoming_edges); >>> Eh, looks very much like work in progress. From the O() complexity >>> point duplicating bb_incoming_edges for each loop makes it quadratic >>> again but as said I think we don't need this duplication. >> Sorry I didn't quite follow your above comments about reusing >> bb_incoming_edges, >> we are searching the bbs that dominate loop->latch loop by loop independently, >> if outer loop changes the inner loops bb_incoming_edges, it will make the >> inner >> loop's ALWAYS EXECUTE computation incorrect? > We're only supposed to greedily walk into always executed blocks. I see > this probably breaks down for conditionally executed inner loops like > > for () > { > if () > for (); > A; > } > > where we want to process the inner loop separately but we still want > to make sure A is marked as always executed in the outer loop. Yes, the problem is that ALWAYS_EXECUTED blocks are not contiguous in the CFG, so we have to check every bb and maintain bb_incoming_edges in each loop. If any block jumps to an outer loop, we have to 'break' instead of 'continue', as the following bbs are not ALWAYS_EXECUTED anymore. > >> Seems duplicating bb_incoming_edges not increases the complexity but the >> recursive call does?
> Since the bb_incoming_edges array has a size of N and we copy it > number_of_loop times (which is O(N) bound) the copying effectively > has O(N^2) cost. So yes, it's the iteration over all loops, > whether by recursion or iteration. Actually, the copying is outside of the greedy CFG walk loop. I guess there will be some optimizations in the memcpy of bb_incoming_edges by libraries? But maybe that only works when most of the values are the same, via copy_by_pieces? And I modified the time-consuming case with many bbs to simulate extreme usage and didn't observe quadratic behavior from it; it shows an almost linear build time increase for both the base and the V3 patch, but the V3 patch generates more accurate ALWAYS_EXECUTED results. The function ratios in the perf tool also look reasonable. At least, it looks like an improvement compared with the base code. A function with 10K bbs would cost more than 8 minutes to build, so it seems duplicating bb_incoming_edges for much fewer than 1000*1000*1000 bbs will not be a performance bottleneck? (While most values are different, the copying will indeed slow down when the bb_incoming_edges array is quite large, like 1000*1000*1000; if so, that's a really crazy automatic program generator...) Base: number of bbs(lim2+lim4) total build time (sec) tree loop invariant motion (usr/sys/wall/Mem) 7850+15615 123.99 13.52 ( 11%) 2.16 ( 52%) 15.68 ( 11%) 60M ( 21%) 15338+30591 208.98 24.55 ( 12%) 4.04 ( 52%) 28.57 ( 12%) 121M ( 21%) 30314+60543 511.54 48.96 ( 10%) 7.70 ( 52%) 56.90 ( 10%) 241M ( 21%) V3 patch: number of bbs(lim2+lim4) total build time (sec) tree loop invariant motion (usr/sys/wall/Mem) 7850+15615 121.18 12.26 ( 10%) 1.99 ( 49%) 14.42 ( 11%) 60M ( 21%) 15338+30591 204.99 24.72 ( 12%) 3.98 ( 52%) 28.85 ( 13%) 121M ( 21%) 30314+60543 500.76 48.26 ( 10%) 8.27 ( 51%) 57.12 ( 11%) 241M ( 21%) Samples: 455K of event 'cycles', Event count (approx.): 426772700334 Overhead Command Shared Object Symbol 7.19% cc1 cc1 [.] bitmap_set_bit 5.86% cc1 cc1 [.]
get_ref_base_and_extent 4.08% cc1 cc1 [.] get_continuation_for_phi 3.48% cc1 cc1 [.] hash_table<vn_ssa_aux_hasher, false, xcallocator>::find 3.41% cc1 cc1 [.] dominated_by_p 3.28% cc1 cc1 [.] compute_transp 2.98% cc1 cc1 [.] bitmap_set_range 2.60% cc1 cc1 [.] bitmap_list_insert_element_after 2.44% cc1 cc1 [.] stmt_may_clobber_ref_p_1 2.27% cc1 cc1 [.] refs_may_alias_p_2 2.26% cc1 cc1 [.] hash_table<vn_reference_hasher, false, xcallocator>::fi 2.26% cc1 cc1 [.] df_rd_transfer_function 2.07% cc1 cc1 [.] bitmap_ior_into 1.84% cc1 cc1 [.] find_base_term 1.82% cc1 cc1 [.] get_alias_set 1.79% cc1 cc1 [.] tree_strip_nop_conversions 1.73% cc1 cc1 [.] dfs_enumerate_from 1.73% cc1 cc1 [.] reference_alias_ptr_type_1 1.53% cc1 cc1 [.] alias_sets_conflict_p 1.30% cc1 cc1 [.] c_get_alias_set 1.16% cc1 cc1 [.] bitmap_and_into 1.12% cc1 cc1 [.] indirect_ref_may_alias_decl_p 1.04% cc1 cc1 [.] et_splay 0.99% cc1 libc-2.27.so [.] __memset_power8 0.95% cc1 cc1 [.] bitmap_ior_and_compl 0.86% cc1 cc1 [.] bitmap_copy 0.78% cc1 cc1 [.] ao_ref_from_mem > > I wonder if we can more cleverly handle this by picking up > conditionally executed inner loops on the fly for example > by recording a 'outermost unconditionally executed loop' > for each loop header we visit and using that like > SET_ALWAYS_EXECUTED_IN (bb, outermost_always_exec_in (bb->loop_father)) > > So make this a single greedy CFG walk starting at the outermost > loops. loop exits will need to be processed specially then. I'm not sure: do you mean we only check loop headers and exits? As said above, the ALWAYS_EXECUTED block chains are not contiguous; there may be blocks that are ALWAYS_EXECUTED, or blocks that jump out to an outer loop, in the middle of the loop. But could they form only two chains? One from loop->header, and another from loop->exit->src for loops with a single exit?
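The `outermost_always_exec_in` idea quoted above might be modeled, very roughly, like this (a speculative standalone sketch with an invented `toy_loop` record; the actual design was still under discussion and the real code would use GCC's loop tree):

```cpp
#include <cassert>
#include <vector>

// Speculative model of the suggestion: for each loop record whether its
// header is always executed in its parent loop; a block visited by a
// single greedy walk can then be marked always executed in the outermost
// such enclosing loop, not just in its immediate loop_father.
struct toy_loop
{
  int parent;                   // index of the enclosing loop, -1 for none
  bool header_always_executed;  // header always executed in the parent?
};

int outermost_always_exec_in (const std::vector<toy_loop> &loops, int loop)
{
  // Climb the loop tree while the current loop is unconditionally
  // entered from its parent.
  while (loops[loop].parent != -1 && loops[loop].header_always_executed)
    loop = loops[loop].parent;
  return loop;
}
```

With this, an unconditionally entered inner loop propagates ALWAYS_EXECUTED marks to the outer loop, while a conditionally entered one (like the `if () for ();` example earlier in the thread) keeps its marks local.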
-- Thanks, Xionghu [-- Attachment #2: ssa-lim-19.c.138t.lim2 --] [-- Type: text/plain, Size: 6863 bytes --] ;; Function foo (foo, funcdef_no=0, decl_uid=3306, cgraph_uid=1, symbol_order=0) Created preheader block for loop 1 Created preheader block for loop 2 ;; 3 loops found ;; ;; Loop 0 ;; header 0, latch 1 ;; depth 0, outer -1 ;; nodes: 0 1 2 10 11 3 4 5 6 13 7 12 8 9 ;; ;; Loop 1 ;; header 8, latch 12 ;; depth 1, outer 0 ;; nodes: 8 12 7 6 4 5 3 13 11 ;; ;; Loop 2 ;; header 3, latch 13 ;; depth 2, outer 1 ;; nodes: 3 13 6 4 5 ;; 2 succs { 10 9 } ;; 10 succs { 8 } ;; 11 succs { 3 } ;; 3 succs { 4 5 } ;; 4 succs { 6 } ;; 5 succs { 6 } ;; 6 succs { 13 7 } ;; 13 succs { 3 } ;; 7 succs { 12 9 } ;; 12 succs { 8 } ;; 8 succs { 11 7 } ;; 9 succs { 1 } Memory reference 1: x_17(D)->j Memory reference 2: MEM[(int *)x_17(D) + 8B] always executed: bb->index:8, loop->num: 1 Found loop 2 to be finite: upper bound found. always executed: bb->index:7, loop->num: 1 always executed: bb->index:3, loop->num: 2 always executed: bb->index:6, loop->num: 2 Basic block 8 (loop 1 -- depth 1): if (n_16(D) > 0) invariant up to level 1, cost 20. Basic block 11 (loop 1 -- depth 1): Basic block 3 (loop 2 -- depth 2): if (m_21(D) != 0) invariant up to level 1, cost 20. Basic block 5 (loop 2 -- depth 2): _4 = l_15(D) + n_16(D); invariant up to level 1, cost 1. Basic block 4 (loop 2 -- depth 2): Basic block 6 (loop 2 -- depth 2): Basic block 7 (loop 1 -- depth 1): Basic block 12 (loop 1 -- depth 1): Basic block 13 (loop 2 -- depth 2): Querying dependency of refs 2 and 1: independent. Querying RAW dependencies of ref 2 in loop 2: independent Querying RAW dependencies of ref 2 in loop 1: independent Querying dependency of refs 2 and 1: independent. 
Querying SM WAR dependencies of ref 2 in loop 2: independent Querying SM WAR dependencies of ref 2 in loop 1: independent Executing store motion of MEM[(int *)x_17(D) + 8B] from loop 1 Querying RAW dependencies of ref 1 in loop 2: independent Querying SM WAR dependencies of ref 1 in loop 2: independent Executing store motion of MEM[(int *)x_17(D) + 4B] from loop 2 Moving statement x__lsm.5 = MEM[(int *)x_17(D) + 4B]; (cost 0) out of loop 2. Moving statement x__lsm.4 = MEM[(int *)x_17(D) + 8B]; (cost 0) out of loop 1. Updating SSA: creating PHI node in block #8 for x__lsm.4 creating PHI node in block #3 for x__lsm.5 Registering new PHI nodes in block #0 Registering new PHI nodes in block #2 Registering new PHI nodes in block #10 Updating SSA information for statement x__lsm.4 = MEM[(int *)x_17(D) + 8B]; Registering new PHI nodes in block #8 Registering new PHI nodes in block #11 Updating SSA information for statement x__lsm.5 = MEM[(int *)x_17(D) + 4B]; Registering new PHI nodes in block #3 Registering new PHI nodes in block #5 Registering new PHI nodes in block #4 Updating SSA information for statement _1 = x__lsm.5; Registering new PHI nodes in block #6 Updating SSA information for statement x__lsm.5 = cstore_28; Updating SSA information for statement tem_24 = x__lsm.5; Updating SSA information for statement x__lsm.5 = _6; Registering new PHI nodes in block #15 Updating SSA information for statement MEM[(int *)x_17(D) + 4B] = x__lsm.5; Registering new PHI nodes in block #13 Registering new PHI nodes in block #7 Updating SSA information for statement tem2_18 = x__lsm.4; Updating SSA information for statement x__lsm.4 = _8; Registering new PHI nodes in block #14 Updating SSA information for statement MEM[(int *)x_17(D) + 8B] = x__lsm.4; Registering new PHI nodes in block #12 Registering new PHI nodes in block #9 Updating SSA information for statement return; Symbols to be put in SSA form { D.3326 D.3331 D.3332 } Incremental SSA update started at block: 0 Number of 
blocks in CFG: 16 Number of blocks to update: 15 ( 94%) Affected blocks: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ;; Created LCSSA PHI: x__lsm.5_36 = PHI <x__lsm.5_10(6)> ;; Created LCSSA PHI: x__lsm.4_37 = PHI <x__lsm.4_33(7)> Updating SSA: Registering new PHI nodes in block #8 Registering new PHI nodes in block #11 Registering new PHI nodes in block #3 Registering new PHI nodes in block #5 Registering new PHI nodes in block #4 Registering new PHI nodes in block #6 Updating SSA information for statement x__lsm.5_10 = _6; Registering new PHI nodes in block #15 Updating SSA information for statement MEM[(int *)x_17(D) + 4B] = x__lsm.5_10; Registering new PHI nodes in block #13 Registering new PHI nodes in block #7 Updating SSA information for statement x__lsm.4_33 = _8; Registering new PHI nodes in block #14 Updating SSA information for statement MEM[(int *)x_17(D) + 8B] = x__lsm.4_33; Registering new PHI nodes in block #12 SSA replacement table N_i -> { O_1 ... O_j } means that N_i replaces O_1, ..., O_j x__lsm.5_36 -> { x__lsm.5_10 } x__lsm.4_37 -> { x__lsm.4_33 } Incremental SSA update started at block: 8 Number of blocks in CFG: 16 Number of blocks to update: 8 ( 50%) Affected blocks: 3 6 7 8 12 13 14 15 void foo (struct X * x, int n, int l, int m) { int x__lsm.5; int x__lsm.4; int tem; int i; int tem2; int j; int _1; int _2; int _4; int _5; int _6; int _7; int _8; int cstore_28; <bb 2> [local count: 14598063]: if (l_15(D) > 0) goto <bb 10>; [89.00%] else goto <bb 9>; [11.00%] <bb 10> [local count: 12992276]: x__lsm.4_13 = MEM[(int *)x_17(D) + 8B]; goto <bb 8>; [100.00%] <bb 11> [local count: 105119324]: x__lsm.5_9 = MEM[(int *)x_17(D) + 4B]; <bb 3> [local count: 955630225]: # i_30 = PHI <i_26(13), 0(11)> # x__lsm.5_32 = PHI <x__lsm.5_10(13), x__lsm.5_9(11)> if (m_21(D) != 0) goto <bb 4>; [50.00%] else goto <bb 5>; [50.00%] <bb 4> [local count: 477815112]: _1 = x__lsm.5_32; _2 = _1 + 1; goto <bb 6>; [100.00%] <bb 5> [local count: 477815112]: _4 = l_15(D) + n_16(D); 
<bb 6> [local count: 955630225]: # cstore_28 = PHI <_2(4), _4(5)> x__lsm.5_12 = cstore_28; tem_24 = x__lsm.5_12; _5 = tem_24 * i_30; _6 = _5 + tem_24; x__lsm.5_10 = _6; i_26 = i_30 + 1; if (n_16(D) > i_26) goto <bb 13>; [89.00%] else goto <bb 15>; [11.00%] <bb 13> [local count: 850510901]: goto <bb 3>; [100.00%] <bb 15> [local count: 105119324]: # x__lsm.5_36 = PHI <x__lsm.5_10(6)> MEM[(int *)x_17(D) + 4B] = x__lsm.5_36; <bb 7> [local count: 118111600]: tem2_18 = x__lsm.4_11; _7 = tem2_18 * j_31; _8 = _7 + tem2_18; x__lsm.4_33 = _8; j_20 = j_31 + 1; if (l_15(D) > j_20) goto <bb 12>; [89.00%] else goto <bb 14>; [11.00%] <bb 12> [local count: 105119324]: <bb 8> [local count: 118111600]: # j_31 = PHI <j_20(12), 0(10)> # x__lsm.4_11 = PHI <x__lsm.4_33(12), x__lsm.4_13(10)> if (n_16(D) > 0) goto <bb 11>; [89.00%] else goto <bb 7>; [11.00%] <bb 14> [local count: 12992276]: # x__lsm.4_37 = PHI <x__lsm.4_33(7)> MEM[(int *)x_17(D) + 8B] = x__lsm.4_37; <bb 9> [local count: 14598063]: return; } ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 2021-08-31 7:43 ` Xionghu Luo @ 2021-08-31 11:29 ` Richard Biener 0 siblings, 0 replies; 17+ messages in thread From: Richard Biener @ 2021-08-31 11:29 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On Tue, 31 Aug 2021, Xionghu Luo wrote: > > > On 2021/8/30 17:19, Richard Biener wrote: > >>>> bitmap_set_bit (work_set, loop->header->index); > >>>> + unsigned bb_index; > >>>> - for (i = 0; i < loop->num_nodes; i++) > >>>> - { > >>>> - edge_iterator ei; > >>>> - bb = bbs[i]; > >>>> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; > >>>> + int *bbd = XNEWVEC (int, array_size); > >>>> + bbd = XDUPVEC (int, bbi, array_size); > >>> I don't think you need to copy 'bbi' but you can re-use the > >>> state from the outer loop processing. Did you run into any > >>> issues with that? > >> Yes. For example, adding a small if-else block to ssa-lim-19.c, > >> Then block "x->j += tem * i;" of bb 6 is always executed for loop 2, when > >> call > >> fill_always_executed_in_1 for loop 1, bbi[6] is decreased from 2 to 1 to 0, > >> then if fill_always_executed_in_1 is called again for loop 2, it's value is > >> not > >> reset so bbi[6] won't be set ALWAYS EXECUTE, this is wrong. > >> > >> > >> struct X { int i; int j; int k;}; > >> > >> void foo(struct X *x, int n, int l, int m) > >> { > >> for (int j = 0; j < l; j++) // loop 1 > >> { > >> for (int i = 0; i < n; ++i) // loop 2 > >> { > >> if (m) > >> x->j++; > >> else > >> x->j = m+n+l; > >> > >> int *p = &x->j; // bb 6 > >> int tem = *p; > >> x->j += tem * i; > >> } > >> int *r = &x->k; > >> int tem2 = *r; > >> x->k += tem2 * j; > >> } > >> } > > Hmm, but if the outer loop processing reaches bb 6 then > > it should have set it ALWAYS_EXECUTED in loop 1 already? > > But bb 6 is NOT ALWAYS_EXECUTED for loop 1, it is only ALWAYS_EXECUTED for > loop 2 as it requires n>0. 
Please refer to the attached file > ssa-lim-19.c.138t.lim2. > > ;; > ;; Loop 1 > ;; header 8, latch 12 > ;; depth 1, outer 0 > ;; nodes: 8 12 7 6 4 5 3 13 11 > ;; > ;; Loop 2 > ;; header 3, latch 13 > ;; depth 2, outer 1 > ;; nodes: 3 13 6 4 5 > ;; 2 succs { 10 9 } > ;; 10 succs { 8 } > ;; 11 succs { 3 } > ;; 3 succs { 4 5 } > ;; 4 succs { 6 } > ;; 5 succs { 6 } > ;; 6 succs { 13 7 } > ;; 13 succs { 3 } > ;; 7 succs { 12 9 } > ;; 12 succs { 8 } > ;; 8 succs { 11 7 } > ;; 9 succs { 1 } > > always executed: bb->index:8, loop->num: 1 > always executed: bb->index:7, loop->num: 1 > always executed: bb->index:3, loop->num: 2 > always executed: bb->index:6, loop->num: 2 > > 8<--------------- > / \ | > 11 \ | > / \ | > 3<--- \ | > /\ | \ | > 4 5 | \ | > \/ | \ | > 6 | \ | > |-->13 \ | > |----------> 7 | > /\ | > 9 12--- > > (gdb) x /15x bbd > 0x1354c9b0: 0x00000000 0x00000000 0x00000001 0x00000001 > 0x1354c9c0: 0x00000001 0x00000001 0x00000002 0x00000002 > 0x1354c9d0: 0x00000001 0x00000002 0x00000001 0x00000001 > 0x1354c9e0: 0x00000001 0x00000001 0x00000000 > > our algorithm will walk through 8->11->3->4->5->6->7, > for loop 1, exit at edge 7->9. > > (gdb) x /15x bbd > 0x1354c9b0: 0x00000000 0x00000000 0x00000001 0x00000000 > 0x1354c9c0: 0x00000000 0x00000000 0x00000000 0x00000000 > 0x1354c9d0: 0x00000001 0x00000002 0x00000001 0x00000000 > 0x1354c9e0: 0x00000001 0x00000000 0x00000000 > > If we don't reset bbd to incoming_edge by memcpy, bbd[3],bbd[4],bbd[5] > and bbd[6] is 0 now for loop 2, fill_always_executed_in_1 couldn't set > ALWAYS_EXECUTED correctly for loop 2 at bb 3 and bb 6. 
> > > > >> > >>> > >>>> + while (!bitmap_empty_p (work_set)) > >>>> + { > >>>> + bb_index = bitmap_first_set_bit (work_set); > >>>> + bitmap_clear_bit (work_set, bb_index); > >>>> + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); > >>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>> - last = bb; > >>>> - > >>>> + SET_ALWAYS_EXECUTED_IN (bb, loop); > >>>> if (bitmap_bit_p (contains_call, bb->index)) > >>>> break; > >>> I think you want to continue; here (process remaining worklist > >>> but not continue greedy walking this block) > >> Same as above, if use 'continue' instead of 'break', the algorithm > >> seems also not work again. If inner loop contains a jump to outmost > >> loop, the blocks after the jump block will be set to ALWAYS EXECUTE > >> incorrectly. > >> > >>>> - > >>>> + edge_iterator ei; > >>>> FOR_EACH_EDGE (e, ei, bb->succs) > >>>> { > >>>> - /* If there is an exit from this BB. */ > >>>> if (!flow_bb_inside_loop_p (loop, e->dest)) > >>>> break; > >>> in particular this should keep the outer 'bbi' valid to re-use. > >>> > >>> But again, you want 'continue;' the greedy walk to other edges. > >>> If that's not valid (I'd need to think about this) then with > >>> your patch whether we process an edge depends on the order > >>> of the edge visit so you'd have to walk successors twice, > >>> once to determine whether we can greedily walk any of it > >>> and once to actually do the greedy walk. > >>> > >>> So thinking about it an exit edge is like a not returning call > >>> and thus we indeed should not process any outgoing edges of > >>> this block. > >>> > >>>> + > >>>> /* Or we enter a possibly non-finite loop. */ > >>>> if (flow_loop_nested_p (bb->loop_father, > >>>> e->dest->loop_father) > >>>> && ! finite_loop_p (e->dest->loop_father)) > >>>> break; > >>> I think this is no longer necessary? In any case it would > >>> again be 'continue;', see also above. > >>> > >>> I'm missing skipping of backedges in the walk. 
> >>> > >>>> + > >>>> + if (bbd[e->dest->index] == 1) > >>>> + { > >>>> + bitmap_set_bit (work_set, e->dest->index); > >>>> + bbd[e->dest->index]--; > >>>> + } > >>>> + else if (bbd[e->dest->index] > 1) > >>>> + bbd[e->dest->index]--; > >>> maybe just > >>> > >>> bbd[e->dest->index]--; > >>> if (bbd[e->dest->index] == 0) > >>> bitmap_set_bit (work_set, e->dest->index); > >> They are not equivalent. Former won't decrease if bbd[e->dest->index] > >> is 0, but the latter does. > > we should be never visiting in this case, so sth indeed doesn't > > work as I expected. > > I've verified this won't happen if duplicate the bb_incoming_edges > in fill_always_executed_in_1, so could replace it. > > > > >>>> } > >>>> + > >>>> if (e) > >>>> break; > >>> continue; processing the worklist > >>> > >>>> > >>>> @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, > >>>> @@ sbitmap contains_call) > >>>> to disprove this if possible). */ > >>>> if (bb->flags & BB_IRREDUCIBLE_LOOP) > >>>> break; > >>> continue; processing the worklist. I think this has to > >>> come early, before processing successors, next to > >>> the contains_call processing. We could think of ignoring > >>> entering of irreducible regions by tracking exits from them, > >>> but I'm not sure we're identifying the regions in a good > >>> enough way to allow this. > >>> > >>> Likewise possibly infinite inner loops can be processed > >>> when we track exits from those which would be sth like > >>> > >>> /* When this is an exit from an inner loop that we cannot > >>> prove as finite do not follow this edge. */ > >>> if (bb->loop_father != loop > >>> && loop_exit_edge_p (bb->loop_father, e) > >>> && ! 
finite_loop_p (bb->loop_father)) > >>> continue; > >>> > >>>> - > >>>> - if (!flow_bb_inside_loop_p (inn_loop, bb)) > >>>> - break; > >>>> - > >>>> - if (bb->loop_father->header == bb) > >>>> - { > >>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>> - break; > >>>> - > >>>> - /* In a loop that is always entered we may proceed anyway. > >>>> - But record that we entered it and stop once we leave it. */ > >>>> - inn_loop = bb->loop_father; > >>>> - } > >>>> - } > >>>> - > >>>> - while (1) > >>>> - { > >>>> - SET_ALWAYS_EXECUTED_IN (last, loop); > >>>> - if (last == loop->header) > >>>> - break; > >>>> - last = get_immediate_dominator (CDI_DOMINATORS, last); > >>>> } > >>>> - > >>>> - free (bbs); > >>>> + free (bbd); > >>>> } > >>>> > >>>> for (loop = loop->inner; loop; loop = loop->next) > >>>> - fill_always_executed_in_1 (loop, contains_call); > >>>> + fill_always_executed_in_1 (loop, contains_call, bbi); > >>>> } > >>>> > >>>> /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. > >>>> @@ -3287,8 +3278,36 @@ fill_always_executed_in (void) > >>>> bitmap_set_bit (contains_call, bb->index); > >>>> } > >>>> + mark_dfs_back_edges (); > >>> Please put this to loop_invariant_motion_in_fun right before > >>> the call to fill_always_executed_in. > >> Done. > >> > >>>> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; > >>>> + int *bb_incoming_edges = XNEWVEC (int, array_size); > >>> There's XCNEWVEC that also zeros the array. > >> OK, thanks. > >> > >>>> + memset (bb_incoming_edges, 0, array_size * sizeof (int)); > >>>> + edge_iterator ei; > >>>> + edge e; > >>>> + > >>>> + FOR_EACH_BB_FN (bb, cfun) > >>>> + { > >>>> + int len = bb->preds->length (); > >>>> + FOR_EACH_EDGE (e, ei, bb->preds) > >>>> + if (e->flags & EDGE_DFS_BACK) > >>>> + len--; > >>>> + bb_incoming_edges[bb->index] = len; > >>>> + } > >>> it would be nice to combine this with the preceeding loop over > >>> all blocks. > >> Done. 
> >> > >>>> + static unsigned long diff = 0; > >>>> + struct timeval tv; > >>>> + gettimeofday (&tv, NULL); > >>>> + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + > >>>> tv.tv_usec; > >>>> + > >>>> + //for (loop : loops_list (cfun, 0)) > >>>> for (loop = current_loops->tree_root->inner; loop; loop = > >>>> loop->next) > >>>> - fill_always_executed_in_1 (loop, contains_call); > >>>> + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); > >>>> + > >>>> + gettimeofday (&tv, NULL); > >>>> + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; > >>>> + diff += end - begin; > >>>> + //printf ("%s,diff:%ld\n", current_function_name (), diff); > >>>> + free (bb_incoming_edges); > >>> Eh, looks very much like work in progress. From the O() complexity > >>> point duplicating bb_incoming_edges for each loop makes it quadratic > >>> again but as said I think we don't need this duplication. > >> Sorry I didn't quite follow your above comments about reusing > >> bb_incoming_edges, > >> we are searching the bbs that dominate loop->latch loop by loop > >> independently, > >> if outer loop changes the inner loops bb_incoming_edges, it will make the > >> inner > >> loop's ALWAYS EXECUTE computation incorrect? > > We're only supposed to greedily walk into always excuted blocks. I see > > this probably breaks down for conditionally executed inner loops like > > > > for () > > { > > if () > > for (); > > A; > > } > > > > where we want to process the inner loop separately but we still want > > to make sure A is marked as always executed in the outer loop. > > Yes, the problem is ALWAYS_EXECUTE blocks are not SUCCESSIVE in CFG. > so we have to check every bb and maintain bb_incoming_edges in each loop. If > there is any block goes to outer loop, we have to 'break' instead of > 'continue' as followed bbs are not ALWAYS_EXECUTE again. 
But we're using a worklist (since the CFG branches) - if we don't continue to pick items from the worklist then the outcome depends on the order of the worklist processing. That can't be correct either (similarly when walking successor edges). > >> Seems duplicating bb_incoming_edges not increases the complexity but the > >> recursive call does? > > Since the bb_incoming_edges array has a size of N and we copy it > > number_of_loop times (which is O(N) bound) the copying effectively > > has O(N^2) cost. So yes, it's the iteration over all loops, > > whether by recursion or iteration. > > > Actually, the copying is outside of the greedy CFG walk loop. I guess there > will be some optimizations in memcpy bb_incoming_edges by libraries? But > maybe only works when most of the value are same then copy_by_pieces? > > And I modified the time-consuming case with many bbs to simulate the extreme > usage, didn't observe quadratic behavior from it, it shows almost linear > build time increase for both base and V3 patch, but V3 patch generates more > accurate ALWAYS_EXECUTED result. The function ratio in perf tool also looks > not bad. At least, it looks like an improvement compared with the base code. > > A function with 10K bbs would cost more than 8 minutes to build, so it seems > duplicating bb_incoming_edges much less then 1000*1000*1000 number of bbs will > not be a performance bottle neck? (While most values are different, the > copying > will exactly slow down when bb_incoming_edges array is quite large like > 1000*1000*1000, > if so, that's really crazy automatic program generators...)
>
> Base:
> number of bbs (lim2+lim4)  total build time (sec)  tree loop invariant motion (usr/sys/wall/Mem)
> 7850+15615   123.99  13.52 ( 11%)  2.16 ( 52%)  15.68 ( 11%)   60M ( 21%)
> 15338+30591  208.98  24.55 ( 12%)  4.04 ( 52%)  28.57 ( 12%)  121M ( 21%)
> 30314+60543  511.54  48.96 ( 10%)  7.70 ( 52%)  56.90 ( 10%)  241M ( 21%)
>
> V3 patch:
> number of bbs (lim2+lim4)  total build time (sec)  tree loop invariant motion (usr/sys/wall/Mem)
> 7850+15615   121.18  12.26 ( 10%)  1.99 ( 49%)  14.42 ( 11%)   60M ( 21%)
> 15338+30591  204.99  24.72 ( 12%)  3.98 ( 52%)  28.85 ( 13%)  121M ( 21%)
> 30314+60543  500.76  48.26 ( 10%)  8.27 ( 51%)  57.12 ( 11%)  241M ( 21%)
>
> Samples: 455K of event 'cycles', Event count (approx.): 426772700334
> Overhead  Command  Shared Object  Symbol
>   7.19%   cc1      cc1            [.] bitmap_set_bit
>   5.86%   cc1      cc1            [.] get_ref_base_and_extent
>   4.08%   cc1      cc1            [.] get_continuation_for_phi
>   3.48%   cc1      cc1            [.] hash_table<vn_ssa_aux_hasher, false, xcallocator>::find
>   3.41%   cc1      cc1            [.] dominated_by_p
>   3.28%   cc1      cc1            [.] compute_transp
>   2.98%   cc1      cc1            [.] bitmap_set_range
>   2.60%   cc1      cc1            [.] bitmap_list_insert_element_after
>   2.44%   cc1      cc1            [.] stmt_may_clobber_ref_p_1
>   2.27%   cc1      cc1            [.] refs_may_alias_p_2
>   2.26%   cc1      cc1            [.] hash_table<vn_reference_hasher, false, xcallocator>::find
>   2.26%   cc1      cc1            [.] df_rd_transfer_function
>   2.07%   cc1      cc1            [.] bitmap_ior_into
>   1.84%   cc1      cc1            [.] find_base_term
>   1.82%   cc1      cc1            [.] get_alias_set
>   1.79%   cc1      cc1            [.] tree_strip_nop_conversions
>   1.73%   cc1      cc1            [.] dfs_enumerate_from
>   1.73%   cc1      cc1            [.] reference_alias_ptr_type_1
>   1.53%   cc1      cc1            [.] alias_sets_conflict_p
>   1.30%   cc1      cc1            [.] c_get_alias_set
>   1.16%   cc1      cc1            [.] bitmap_and_into
>   1.12%   cc1      cc1            [.] indirect_ref_may_alias_decl_p
>   1.04%   cc1      cc1            [.] et_splay
>   0.99%   cc1      libc-2.27.so   [.] __memset_power8
>   0.95%   cc1      cc1            [.] bitmap_ior_and_compl
>   0.86%   cc1      cc1            [.] bitmap_copy
>   0.78%   cc1      cc1            [.] ao_ref_from_mem
>
> > I wonder if we can more cleverly handle this by picking up
> > conditionally executed inner loops on the fly, for example
> > by recording an 'outermost unconditionally executed loop'
> > for each loop header we visit and using that like
> > SET_ALWAYS_EXECUTED_IN (bb, outermost_always_exec_in (bb->loop_father))
> >
> > So make this a single greedy CFG walk starting at the outermost
> > loops.  Loop exits will need to be processed specially then.
>
> I'm not sure: do you mean we only check loop headers and exits?  As
> said above, the ALWAYS_EXECUTED block chains are not contiguous;
> there may be blocks in the middle of the loop that are
> ALWAYS_EXECUTED or that jump out of the loop.  But could there only
> be two such chains?  One starting from loop->header, and another from
> loop->exit->src for loops that have a single exit?

For setting ALWAYS_EXECUTED we rely on dominance.  The greedy walk
should determine whether there are paths in the CFG that skip
execution of the block.  I made the assumption that if there exist
"uninterrupted" paths covering all incoming edges of a block, this
ensures the block is always executed.  And since we only trace such
"always executed" blocks, this holds transitively.

I'll see if I can find some time this week to spend on this problem
and maybe also create a testcase for the existing issue with
contains_call and the dominator order of BB processing.

Richard.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH] Fix incorrect computation in fill_always_executed_in_1
  2021-08-16  8:46 [PATCH] Fix incorrect computation in fill_always_executed_in_1 Xiong Hu Luo
  2021-08-16 11:46 ` Richard Biener
@ 2021-08-17 20:59 ` Segher Boessenkool
  1 sibling, 0 replies; 17+ messages in thread
From: Segher Boessenkool @ 2021-08-17 20:59 UTC (permalink / raw)
  To: Xiong Hu Luo; +Cc: gcc-patches, dje.gcc, wschmidt, guojiufu, linkw, rguenther

Hi!

As an aside...

On Mon, Aug 16, 2021 at 03:46:12AM -0500, Xiong Hu Luo wrote:
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c

> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c

You can make a saner order for your diffs by putting the testsuite
changes after the rest.  A very minimal example would use

[diff]
	orderfile = .gitorder

with that file containing something like

gcc/*.*

and nothing else even.  Yeah, this is minimal ;-)


Segher

^ permalink raw reply	[flat|nested] 17+ messages in thread
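Concretely, the setup Segher describes uses git's order-file mechanism (the config key is diff.orderFile, also reachable via the -O option of git diff; .gitorder is just his example file name). As I understand the git documentation, each line of the order file is a glob pattern, and files matching earlier patterns are emitted first by git diff and git format-patch:

```
$ printf 'gcc/*.*\n' > .gitorder
$ git config diff.orderFile .gitorder
```

With that in place, hunks for gcc/tree-ssa-loop-im.c would appear before the gcc/testsuite/ changes in the generated patch.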
end of thread, other threads:[~2021-08-31 11:29 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-16  8:46 [PATCH] Fix incorrect computation in fill_always_executed_in_1 Xiong Hu Luo
2021-08-16 11:46 ` Richard Biener
2021-08-17  5:17 ` Xionghu Luo
2021-08-17  5:24 ` Xionghu Luo
2021-08-17  7:12 ` Richard Biener
2021-08-17  9:10 ` [PATCH v2] Fix incomplete " Xionghu Luo
2021-08-19  5:23 ` Xionghu Luo
2021-08-19 12:11 ` Richard Biener
2021-08-24  7:44 ` Xionghu Luo
2021-08-24  8:20 ` Richard Biener
2021-08-26  5:50 ` [PATCH v3] " Xionghu Luo
2021-08-27  7:45 ` Richard Biener
2021-08-30  8:49 ` Xionghu Luo
2021-08-30  9:19 ` Richard Biener
2021-08-31  7:43 ` Xionghu Luo
2021-08-31 11:29 ` Richard Biener
2021-08-17 20:59 ` [PATCH] Fix incorrect " Segher Boessenkool