* [PATCH] Fix incorrect computation in fill_always_executed_in_1
@ 2021-08-16  8:46 Xiong Hu Luo
  2021-08-16 11:46 ` Richard Biener
  2021-08-17 20:59 ` [PATCH] Fix incorrect " Segher Boessenkool
  0 siblings, 2 replies; 17+ messages in thread
From: Xiong Hu Luo @ 2021-08-16  8:46 UTC (permalink / raw)
  To: gcc-patches
  Cc: segher, dje.gcc, wschmidt, guojiufu, linkw, rguenther, Xiong Hu Luo

It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for
nested loops: inn_loop is updated to the inner loop, so it needs to be
restored when exiting from the innermost loop.  With this patch, the
store instruction in the outer loop can also be moved out of the outer
loop by store motion.  Any comments?  Thanks.

gcc/ChangeLog:

	* tree-ssa-loop-im.c (fill_always_executed_in_1): Restore
	inn_loop when exiting from innermost loop.

gcc/testsuite/ChangeLog:

	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
---
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++
 gcc/tree-ssa-loop-im.c                     |  6 +++++-
 2 files changed, 29 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
new file mode 100644
index 00000000000..097a5ee4a4b
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
@@ -0,0 +1,24 @@
+/* PR/101293 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-lim2-details" } */
+
+struct X { int i; int j; int k;};
+
+void foo(struct X *x, int n, int l)
+{
+  for (int j = 0; j < l; j++)
+    {
+      for (int i = 0; i < n; ++i)
+	{
+	  int *p = &x->j;
+	  int tem = *p;
+	  x->j += tem * i;
+	}
+      int *r = &x->k;
+      int tem2 = *r;
+      x->k += tem2 * j;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */
+
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index b24bc64f2a7..5ca4738b20e 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
       if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
	last = bb;

+      if (inn_loop != loop
+	  && flow_loop_nested_p (bb->loop_father, inn_loop))
+	inn_loop = bb->loop_father;
+
       if (bitmap_bit_p (contains_call, bb->index))
	break;

@@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)

       if (bb->loop_father->header == bb)
	{
-	  if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
+	  if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, bb))
	    break;

	  /* In a loop that is always entered we may proceed anyway.
-- 
2.27.0.90.geebb51ba8c

^ permalink raw reply	[flat|nested] 17+ messages in thread
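To make the intent of the patch concrete, here is a hand-written sketch of the transform the testcase hopes for: with the fix, store motion can carry x->j in a scalar across the inner loop and x->k across the outer loop, each written back once on loop exit. This is an illustration of the expected effect, not actual compiler output; the `j_lsm`/`k_lsm` names merely mimic LIM's `x__lsm.*` temporaries.

```c
#include <assert.h>

struct X { int i; int j; int k; };

/* Original loop nest from the testcase.  */
static void foo (struct X *x, int n, int l)
{
  for (int j = 0; j < l; j++)
    {
      for (int i = 0; i < n; ++i)
	{
	  int *p = &x->j;
	  int tem = *p;
	  x->j += tem * i;
	}
      int *r = &x->k;
      int tem2 = *r;
      x->k += tem2 * j;
    }
}

/* Sketch of the code after both store motions: loads hoisted before,
   and stores sunk after, the loop each ref is invariant in.  */
static void foo_sm (struct X *x, int n, int l)
{
  if (l > 0)
    {
      int k_lsm = x->k;		/* load hoisted out of loop 1 */
      for (int j = 0; j < l; j++)
	{
	  if (n > 0)
	    {
	      int j_lsm = x->j;	/* load hoisted out of loop 2 */
	      for (int i = 0; i < n; ++i)
		j_lsm += j_lsm * i;
	      x->j = j_lsm;	/* store sunk after loop 2 */
	    }
	  k_lsm += k_lsm * j;
	}
      x->k = k_lsm;		/* store sunk after loop 1 */
    }
}
```

Note that the x->j load/store stays guarded by n > 0 (loop 2 is conditional), which is exactly why the x->j ref is not always executed in loop 1 while the x->k ref is.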
* Re: [PATCH] Fix incorrect computation in fill_always_executed_in_1
  2021-08-16  8:46 [PATCH] Fix incorrect computation in fill_always_executed_in_1 Xiong Hu Luo
@ 2021-08-16 11:46 ` Richard Biener
  2021-08-17  5:17   ` Xionghu Luo
  2021-08-17 20:59 ` [PATCH] Fix incorrect " Segher Boessenkool
  1 sibling, 1 reply; 17+ messages in thread
From: Richard Biener @ 2021-08-16 11:46 UTC (permalink / raw)
  To: Xiong Hu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw

On Mon, 16 Aug 2021, Xiong Hu Luo wrote:

> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for
> nested loops. inn_loop is updated to inner loop, so it need be restored
> when exiting from innermost loop. With this patch, the store instruction
> in outer loop could also be moved out of outer loop by store motion.
> Any comments? Thanks.

> gcc/ChangeLog:
>
> 	* tree-ssa-loop-im.c (fill_always_executed_in_1): Restore
> 	inn_loop when exiting from innermost loop.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++
>  gcc/tree-ssa-loop-im.c                     |  6 +++++-
>  2 files changed, 29 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> new file mode 100644
> index 00000000000..097a5ee4a4b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c
> @@ -0,0 +1,24 @@
> +/* PR/101293 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */
> +
> +struct X { int i; int j; int k;};
> +
> +void foo(struct X *x, int n, int l)
> +{
> +  for (int j = 0; j < l; j++)
> +    {
> +      for (int i = 0; i < n; ++i)
> +	{
> +	  int *p = &x->j;
> +	  int tem = *p;
> +	  x->j += tem * i;
> +	}
> +      int *r = &x->k;
> +      int tem2 = *r;
> +      x->k += tem2 * j;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */
> +
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index b24bc64f2a7..5ca4738b20e 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>        if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> 	 last = bb;
>
> +      if (inn_loop != loop
> +	  && flow_loop_nested_p (bb->loop_father, inn_loop))
> +	inn_loop = bb->loop_father;
> +

The comment says

      /* In a loop that is always entered we may proceed anyway.
	 But record that we entered it and stop once we leave it.  */
      inn_loop = bb->loop_father;

and your change would defeat that early return, no?
>        if (bitmap_bit_p (contains_call, bb->index))
> 	 break;
>
> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>
>        if (bb->loop_father->header == bb)
> 	 {
> -	   if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
> +	   if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, bb))
> 	     break;

That's now an always-false condition - a loop's latch is always dominated
by its header.  The condition as written tries to verify whether the
loop is always entered - mind we visit all blocks, not only those
always executed.

In fact for your testcase the x->j ref is _not_ always executed
since the inner loop is conditional on n > 0.

Richard.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH] Fix incorrect computation in fill_always_executed_in_1
  2021-08-16 11:46 ` Richard Biener
@ 2021-08-17  5:17   ` Xionghu Luo
  2021-08-17  5:24     ` Xionghu Luo
  2021-08-17  7:12     ` Richard Biener
  0 siblings, 2 replies; 17+ messages in thread
From: Xionghu Luo @ 2021-08-17  5:17 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw

[-- Attachment #1: Type: text/plain, Size: 5254 bytes --]

Hi,

On 2021/8/16 19:46, Richard Biener wrote:
> On Mon, 16 Aug 2021, Xiong Hu Luo wrote:
>
>> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for
>> nested loops. inn_loop is updated to inner loop, so it need be restored
>> when exiting from innermost loop. With this patch, the store instruction
>> in outer loop could also be moved out of outer loop by store motion.
>> Any comments? Thanks.
>
>> gcc/ChangeLog:
>>
>> 	* tree-ssa-loop-im.c (fill_always_executed_in_1): Restore
>> 	inn_loop when exiting from innermost loop.
>>
>> gcc/testsuite/ChangeLog:
>>
>> 	* gcc.dg/tree-ssa/ssa-lim-19.c: New test.
>> --- >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++ >> gcc/tree-ssa-loop-im.c | 6 +++++- >> 2 files changed, 29 insertions(+), 1 deletion(-) >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >> >> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >> new file mode 100644 >> index 00000000000..097a5ee4a4b >> --- /dev/null >> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >> @@ -0,0 +1,24 @@ >> +/* PR/101293 */ >> +/* { dg-do compile } */ >> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ >> + >> +struct X { int i; int j; int k;}; >> + >> +void foo(struct X *x, int n, int l) >> +{ >> + for (int j = 0; j < l; j++) >> + { >> + for (int i = 0; i < n; ++i) >> + { >> + int *p = &x->j; >> + int tem = *p; >> + x->j += tem * i; >> + } >> + int *r = &x->k; >> + int tem2 = *r; >> + x->k += tem2 * j; >> + } >> +} >> + >> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ >> + >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >> index b24bc64f2a7..5ca4738b20e 100644 >> --- a/gcc/tree-ssa-loop-im.c >> +++ b/gcc/tree-ssa-loop-im.c >> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) >> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >> last = bb; >> >> + if (inn_loop != loop >> + && flow_loop_nested_p (bb->loop_father, inn_loop)) >> + inn_loop = bb->loop_father; >> + > > The comment says > > /* In a loop that is always entered we may proceed anyway. > But record that we entered it and stop once we leave it. > */ > inn_loop = bb->loop_father; > > and your change would defeat that early return, no? The issue is the search method exits too early when iterating the outer loop. For example of a nested loop, loop 1 includes 5,8,3,10,4,9 and loop2 includes 3,10. Currently, it breaks when bb is 3 as bb 3 doesn't dominate bb 9 of loop 1. 
But actually, both bb 5 and bb 4 are
ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4
they won't be processed by store motion again.


5<----
|\    |
8 \   9
|  \  |
--->3--->4
|   |
10---|


SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now.  With this
patch, the search continues when it meets bb 3, on to bb 4; last is then
updated to bb 4, and the walk breaks once the exit edge is found at bb 4 by
"if (!flow_bb_inside_loop_p (loop, e->dest))".  The loop below then sets
bb 4 as ALWAYS_EXECUTED, together with all of its immediate dominators up
to bb 5:

  while (1)
    {
      SET_ALWAYS_EXECUTED_IN (last, loop);
      if (last == loop->header)
	break;
      last = get_immediate_dominator (CDI_DOMINATORS, last);
    }

After further discussion with Kewen, we found that the inn_loop variable
is totally useless and could be removed.

>
>>        if (bitmap_bit_p (contains_call, bb->index))
>> 	 break;
>>
>> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>>
>>        if (bb->loop_father->header == bb)
>> 	 {
>> -	   if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>> +	   if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, bb))
>> 	     break;
>
> That's now a always false condition - a loops latch is always dominated
> by its header.  The condition as written tries to verify whether the
> loop is always entered - mind we visit all blocks, not only those
> always executed.

Thanks for the catch!  I am afraid that piece of code should be removed,
since it stops the search for potential ALWAYS_EXECUTED bbs after the
inner loop...

>
> In fact for your testcase the x->j ref is _not_ always executed
> since the inner loop is conditional on n > 0.

Yes.  But I want to move x->k (not x->j) out of loop 1 when l > 0 in
store motion.
Attached the diff file without and with my patch to show the extra
optimization.

x->j is already moved out of loop 2 on master code.
If n and l are changed to constant numbers like 100, master code could
also do 2 store motions as expected.
The edge from bb 5 to bb 4 doesn't exist now, so bb 4, bb 3 and bb 5 are ALWAYS EXECUTED for loop 1. struct X { int i; int j; int k;}; void foo(struct X *x, int n, int l) { for (int j = 0; j < l; j++) // loop 1 { for (int i = 0; i < n; ++i) // loop 2 { int *p = &x->j; int tem = *p; x->j += tem * i; } int *r = &x->k; int tem2 = *r; x->k += tem2 * j; } } > > Richard. > -- Thanks, Xionghu [-- Attachment #2: ssa-lim-19.c.137t.lim2.diff --] [-- Type: text/plain, Size: 6665 bytes --] --- ssa-lim-19.c.137t.lim2 2021-08-16 03:27:04.173709326 -0500 +++ ../debug/ssa-lim-19.c.137t.lim2 2021-08-16 03:26:47.866261053 -0500 @@ -34,141 +34,172 @@ Basic block 5 (loop 1 -- depth 1): if (n_11(D) > 0) invariant up to level 1, cost 20. Basic block 8 (loop 1 -- depth 1): Basic block 3 (loop 2 -- depth 2): Basic block 4 (loop 1 -- depth 1): Basic block 9 (loop 1 -- depth 1): Basic block 10 (loop 2 -- depth 2): +Querying dependency of refs 2 and 1: independent. +Querying RAW dependencies of ref 2 in loop 2: independent +Querying RAW dependencies of ref 2 in loop 1: independent +Querying dependency of refs 2 and 1: independent. +Querying SM WAR dependencies of ref 2 in loop 2: independent +Querying SM WAR dependencies of ref 2 in loop 1: independent +Executing store motion of MEM[(int *)x_12(D) + 8B] from loop 1 Querying RAW dependencies of ref 1 in loop 2: independent Querying SM WAR dependencies of ref 1 in loop 2: independent Executing store motion of MEM[(int *)x_12(D) + 4B] from loop 2 Moving statement -x__lsm.4 = MEM[(int *)x_12(D) + 4B]; +x__lsm.5 = MEM[(int *)x_12(D) + 4B]; (cost 0) out of loop 2. +Moving statement +x__lsm.4 = MEM[(int *)x_12(D) + 8B]; +(cost 0) out of loop 1. 
+ Updating SSA: -creating PHI node in block #3 for x__lsm.4 +creating PHI node in block #5 for x__lsm.4 +creating PHI node in block #3 for x__lsm.5 Registering new PHI nodes in block #0 Registering new PHI nodes in block #2 Registering new PHI nodes in block #7 +Updating SSA information for statement x__lsm.4 = MEM[(int *)x_12(D) + 8B]; Registering new PHI nodes in block #5 Registering new PHI nodes in block #8 -Updating SSA information for statement x__lsm.4 = MEM[(int *)x_12(D) + 4B]; +Updating SSA information for statement x__lsm.5 = MEM[(int *)x_12(D) + 4B]; Registering new PHI nodes in block #3 -Updating SSA information for statement tem_16 = x__lsm.4; -Updating SSA information for statement x__lsm.4 = _2; -Registering new PHI nodes in block #11 -Updating SSA information for statement MEM[(int *)x_12(D) + 4B] = x__lsm.4; +Updating SSA information for statement tem_16 = x__lsm.5; +Updating SSA information for statement x__lsm.5 = _2; +Registering new PHI nodes in block #12 +Updating SSA information for statement MEM[(int *)x_12(D) + 4B] = x__lsm.5; Registering new PHI nodes in block #10 Registering new PHI nodes in block #4 -Updating SSA information for statement tem2_13 = MEM[(int *)x_12(D) + 8B]; -Updating SSA information for statement x_12(D)->k = _4; +Updating SSA information for statement tem2_13 = x__lsm.4; +Updating SSA information for statement x__lsm.4 = _4; +Registering new PHI nodes in block #11 +Updating SSA information for statement MEM[(int *)x_12(D) + 8B] = x__lsm.4; Registering new PHI nodes in block #9 Registering new PHI nodes in block #6 Updating SSA information for statement return; Symbols to be put in SSA form -{ D.3324 D.3329 } +{ D.3324 D.3329 D.3330 } Incremental SSA update started at block: 0 -Number of blocks in CFG: 12 -Number of blocks to update: 11 ( 92%) -Affected blocks: 0 2 3 4 5 6 7 8 9 10 11 +Number of blocks in CFG: 13 +Number of blocks to update: 12 ( 92%) +Affected blocks: 0 2 3 4 5 6 7 8 9 10 11 12 -;; Created LCSSA PHI: 
x__lsm.4_6 = PHI <x__lsm.4_5(3)> +;; Created LCSSA PHI: x__lsm.5_29 = PHI <x__lsm.5_6(3)> +;; Created LCSSA PHI: x__lsm.4_30 = PHI <x__lsm.4_21(4)> Updating SSA: +Registering new PHI nodes in block #5 +Registering new PHI nodes in block #8 Registering new PHI nodes in block #3 -Updating SSA information for statement x__lsm.4_5 = _2; -Registering new PHI nodes in block #11 -Updating SSA information for statement MEM[(int *)x_12(D) + 4B] = x__lsm.4_5; +Updating SSA information for statement x__lsm.5_6 = _2; +Registering new PHI nodes in block #12 +Updating SSA information for statement MEM[(int *)x_12(D) + 4B] = x__lsm.5_6; Registering new PHI nodes in block #10 +Registering new PHI nodes in block #4 +Updating SSA information for statement x__lsm.4_21 = _4; +Registering new PHI nodes in block #11 +Updating SSA information for statement MEM[(int *)x_12(D) + 8B] = x__lsm.4_21; +Registering new PHI nodes in block #9 SSA replacement table N_i -> { O_1 ... O_j } means that N_i replaces O_1, ..., O_j -x__lsm.4_6 -> { x__lsm.4_5 } -Incremental SSA update started at block: 3 -Number of blocks in CFG: 12 -Number of blocks to update: 3 ( 25%) -Affected blocks: 3 10 11 +x__lsm.5_29 -> { x__lsm.5_6 } +x__lsm.4_30 -> { x__lsm.4_21 } +Incremental SSA update started at block: 5 +Number of blocks in CFG: 13 +Number of blocks to update: 7 ( 54%) +Affected blocks: 3 4 5 9 10 11 12 void foo (struct X * x, int n, int l) { + int x__lsm.5; int x__lsm.4; int tem; int i; int tem2; int j; int _1; int _2; int _3; int _4; <bb 2> [local count: 14598063]: if (l_10(D) > 0) goto <bb 7>; [89.00%] else goto <bb 6>; [11.00%] <bb 7> [local count: 12992276]: + x__lsm.4_5 = MEM[(int *)x_12(D) + 8B]; goto <bb 5>; [100.00%] <bb 8> [local count: 105119324]: - x__lsm.4_8 = MEM[(int *)x_12(D) + 4B]; + x__lsm.5_7 = MEM[(int *)x_12(D) + 4B]; <bb 3> [local count: 955630225]: # i_24 = PHI <i_18(10), 0(8)> - # x__lsm.4_20 = PHI <x__lsm.4_5(10), x__lsm.4_8(8)> - tem_16 = x__lsm.4_20; + # x__lsm.5_8 = PHI 
<x__lsm.5_6(10), x__lsm.5_7(8)> + tem_16 = x__lsm.5_8; _1 = tem_16 * i_24; _2 = _1 + tem_16; - x__lsm.4_5 = _2; + x__lsm.5_6 = _2; i_18 = i_24 + 1; if (n_11(D) > i_18) goto <bb 10>; [89.00%] else - goto <bb 11>; [11.00%] + goto <bb 12>; [11.00%] <bb 10> [local count: 850510901]: goto <bb 3>; [100.00%] - <bb 11> [local count: 105119324]: - # x__lsm.4_6 = PHI <x__lsm.4_5(3)> - MEM[(int *)x_12(D) + 4B] = x__lsm.4_6; + <bb 12> [local count: 105119324]: + # x__lsm.5_29 = PHI <x__lsm.5_6(3)> + MEM[(int *)x_12(D) + 4B] = x__lsm.5_29; <bb 4> [local count: 118111600]: - tem2_13 = MEM[(int *)x_12(D) + 8B]; + tem2_13 = x__lsm.4_20; _3 = tem2_13 * j_23; _4 = _3 + tem2_13; - x_12(D)->k = _4; + x__lsm.4_21 = _4; j_15 = j_23 + 1; if (l_10(D) > j_15) goto <bb 9>; [89.00%] else - goto <bb 6>; [11.00%] + goto <bb 11>; [11.00%] <bb 9> [local count: 105119324]: <bb 5> [local count: 118111600]: # j_23 = PHI <j_15(9), 0(7)> + # x__lsm.4_20 = PHI <x__lsm.4_21(9), x__lsm.4_5(7)> if (n_11(D) > 0) goto <bb 8>; [89.00%] else goto <bb 4>; [11.00%] + <bb 11> [local count: 12992276]: + # x__lsm.4_30 = PHI <x__lsm.4_21(4)> + MEM[(int *)x_12(D) + 8B] = x__lsm.4_30; + <bb 6> [local count: 14598063]: return; } ^ permalink raw reply [flat|nested] 17+ messages in thread
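The marking loop discussed in the message above walks the immediate-dominator chain from `last` up to the loop header. Here is a standalone model of that walk; the block numbers and `idom` array follow the CFG figure in the discussion, and `always_executed` stands in for GCC's ALWAYS_EXECUTED_IN annotation (this is an illustrative sketch, not GCC code).

```c
#include <assert.h>

#define N_BLOCKS 11

/* Immediate dominators for the CFG in the discussion: loop 1's header
   bb 5 dominates bb 8, bb 9 and bb 3; the inner-loop header bb 3
   dominates bb 10 and bb 4.  Unlisted entries stay 0/unused.  */
static int idom[N_BLOCKS];
static int always_executed[N_BLOCKS];

/* Model of the final loop in fill_always_executed_in_1: mark LAST and
   every immediate dominator of it, up to and including HEADER.  */
static void
mark_always_executed (int last, int header)
{
  while (1)
    {
      always_executed[last] = 1;
      if (last == header)
	break;
      last = idom[last];
    }
}
```

With the patch the walk reaches bb 4 before breaking on the exit edge, so marking starts from `last = 4` and flags bb 4, bb 3 and bb 5; without it, only the header bb 5 is flagged.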
* Re: [PATCH] Fix incorrect computation in fill_always_executed_in_1 2021-08-17 5:17 ` Xionghu Luo @ 2021-08-17 5:24 ` Xionghu Luo 2021-08-17 7:12 ` Richard Biener 1 sibling, 0 replies; 17+ messages in thread From: Xionghu Luo @ 2021-08-17 5:24 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On 2021/8/17 13:17, Xionghu Luo via Gcc-patches wrote: > Hi, > > On 2021/8/16 19:46, Richard Biener wrote: >> On Mon, 16 Aug 2021, Xiong Hu Luo wrote: >> >>> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for >>> nested loops. inn_loop is updated to inner loop, so it need be restored >>> when exiting from innermost loop. With this patch, the store instruction >>> in outer loop could also be moved out of outer loop by store motion. >>> Any comments? Thanks. >> >>> gcc/ChangeLog: >>> >>> * tree-ssa-loop-im.c (fill_always_executed_in_1): Restore >>> inn_loop when exiting from innermost loop. >>> >>> gcc/testsuite/ChangeLog: >>> >>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. 
>>> --- >>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++ >>> gcc/tree-ssa-loop-im.c | 6 +++++- >>> 2 files changed, 29 insertions(+), 1 deletion(-) >>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>> >>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>> new file mode 100644 >>> index 00000000000..097a5ee4a4b >>> --- /dev/null >>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>> @@ -0,0 +1,24 @@ >>> +/* PR/101293 */ >>> +/* { dg-do compile } */ >>> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ >>> + >>> +struct X { int i; int j; int k;}; >>> + >>> +void foo(struct X *x, int n, int l) >>> +{ >>> + for (int j = 0; j < l; j++) >>> + { >>> + for (int i = 0; i < n; ++i) >>> + { >>> + int *p = &x->j; >>> + int tem = *p; >>> + x->j += tem * i; >>> + } >>> + int *r = &x->k; >>> + int tem2 = *r; >>> + x->k += tem2 * j; >>> + } >>> +} >>> + >>> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 >>> "lim2" } } */ >>> + >>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >>> index b24bc64f2a7..5ca4738b20e 100644 >>> --- a/gcc/tree-ssa-loop-im.c >>> +++ b/gcc/tree-ssa-loop-im.c >>> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, >>> sbitmap contains_call) >>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>> last = bb; >>> + if (inn_loop != loop >>> + && flow_loop_nested_p (bb->loop_father, inn_loop)) >>> + inn_loop = bb->loop_father; >>> + >> >> The comment says >> >> /* In a loop that is always entered we may proceed anyway. >> But record that we entered it and stop once we leave >> it. >> */ >> inn_loop = bb->loop_father; >> >> and your change would defeat that early return, no? > > The issue is the search method exits too early when iterating the outer > loop. For example of a nested loop, loop 1 includes 5,8,3,10,4,9 > and loop2 includes 3,10. 
Currently, it breaks when bb is 3
> doesn't dominate bb 9 of loop 1.  But actually, both bb 5 and bb 4 are
> ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4
> they won't be processed by store motion again.
>
>
> 5<----
> |\    |
> 8 \   9
> |  \  |
> --->3--->4
> |   |
> 10---|

Correct the graph display:

5<----
|\    |
8 \   9
|  \  |
--->3--->4
 |   |
 ---10

> SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now, with this
> patch, it will continue search when meet bb 3 until bb 4, then last is
> updated to bb 4, it will break until exit edge is found at bb 4 by
> "if (!flow_bb_inside_loop_p (loop, e->dest))". Then the followed loop
> code will set bb 4 as ALWAYS_EXEUCTED and all it's idoms bb 5.
>
>  while (1)
>    {
>      SET_ALWAYS_EXECUTED_IN (last, loop);
>      if (last == loop->header)
>        break;
>      last = get_immediate_dominator (CDI_DOMINATORS, last);
>    }
>
> After further discussion with Kewen, we found that the inn_loop variable
> is totally useless and could be removed.
>
>>
>>>        if (bitmap_bit_p (contains_call, bb->index))
>>> 	 break;
>>>
>>> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>>>
>>>        if (bb->loop_father->header == bb)
>>> 	 {
>>> -	   if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb))
>>> +	   if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, bb))
>>> 	     break;
>>
>> That's now a always false condition - a loops latch is always dominated
>> by its header.  The condition as written tries to verify whether the
>> loop is always entered - mind we visit all blocks, not only those
>> always executed.
>
> Thanks for the catch!  I am afraid the piece of code should be removed
> since it stops search of potential ALWAYS EXECUTED bb after inner loop...
>
>>
>> In fact for your testcase the x->j ref is _not_ always executed
>> since the inner loop is conditional on n > 0.
>
> Yes.  But I want to move x->k (not x->j) out of loop 1 when l > 0 in
> store-motion.
> Attached the diff file without and with my patch to show the extra > optimization. > > x->j is already moved out of loop 2 on master code. > If change n and l to constant numbers like 100, master code could also > do 2 store > motions as expected. The edge from bb 5 to bb 4 doesn't exist now, so bb > 4, bb 3 > and bb 5 are ALWAYS EXECUTED for loop 1. > > > struct X { int i; int j; int k;}; > > void foo(struct X *x, int n, int l) > { > for (int j = 0; j < l; j++) // loop 1 > { > for (int i = 0; i < n; ++i) // loop 2 > { > int *p = &x->j; > int tem = *p; > x->j += tem * i; > } > int *r = &x->k; > int tem2 = *r; > x->k += tem2 * j; > } > } > >> >> Richard. >> > -- Thanks, Xionghu ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Fix incorrect computation in fill_always_executed_in_1 2021-08-17 5:17 ` Xionghu Luo 2021-08-17 5:24 ` Xionghu Luo @ 2021-08-17 7:12 ` Richard Biener 2021-08-17 9:10 ` [PATCH v2] Fix incomplete " Xionghu Luo 1 sibling, 1 reply; 17+ messages in thread From: Richard Biener @ 2021-08-17 7:12 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On Tue, 17 Aug 2021, Xionghu Luo wrote: > Hi, > > On 2021/8/16 19:46, Richard Biener wrote: > > On Mon, 16 Aug 2021, Xiong Hu Luo wrote: > > > >> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for > >> nested loops. inn_loop is updated to inner loop, so it need be restored > >> when exiting from innermost loop. With this patch, the store instruction > >> in outer loop could also be moved out of outer loop by store motion. > >> Any comments? Thanks. > > > >> gcc/ChangeLog: > >> > >> * tree-ssa-loop-im.c (fill_always_executed_in_1): Restore > >> inn_loop when exiting from innermost loop. > >> > >> gcc/testsuite/ChangeLog: > >> > >> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. 
> >> --- > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++ > >> gcc/tree-ssa-loop-im.c | 6 +++++- > >> 2 files changed, 29 insertions(+), 1 deletion(-) > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> > >> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> new file mode 100644 > >> index 00000000000..097a5ee4a4b > >> --- /dev/null > >> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> @@ -0,0 +1,24 @@ > >> +/* PR/101293 */ > >> +/* { dg-do compile } */ > >> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > >> + > >> +struct X { int i; int j; int k;}; > >> + > >> +void foo(struct X *x, int n, int l) > >> +{ > >> + for (int j = 0; j < l; j++) > >> + { > >> + for (int i = 0; i < n; ++i) > >> + { > >> + int *p = &x->j; > >> + int tem = *p; > >> + x->j += tem * i; > >> + } > >> + int *r = &x->k; > >> + int tem2 = *r; > >> + x->k += tem2 * j; > >> + } > >> +} > >> + > >> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } > >> */ > >> + > >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > >> index b24bc64f2a7..5ca4738b20e 100644 > >> --- a/gcc/tree-ssa-loop-im.c > >> +++ b/gcc/tree-ssa-loop-im.c > >> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap > >> @@ contains_call) > >> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >> last = bb; > >> + if (inn_loop != loop > >> + && flow_loop_nested_p (bb->loop_father, inn_loop)) > >> + inn_loop = bb->loop_father; > >> + > > > > The comment says > > > > /* In a loop that is always entered we may proceed anyway. > > But record that we entered it and stop once we leave it. > > */ > > inn_loop = bb->loop_father; > > > > and your change would defeat that early return, no? > > The issue is the search method exits too early when iterating the outer > loop. For example of a nested loop, loop 1 includes 5,8,3,10,4,9 > and loop2 includes 3,10. 
Currently, it breaks when bb is 3 as bb 3 > doesn't dominate bb 9 of loop 1. But actually, both bb 5 and bb 4 are > ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4 > they won't be processed by store motion again. > > > 5<---- > |\ | > 8 \ 9 > | \ | > --->3--->4 > | | > 10---| > > > SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now, with this > patch, it will continue search when meet bb 3 until bb 4, then last is updated > to bb 4, it will break until exit edge is found at bb 4 by > "if (!flow_bb_inside_loop_p (loop, e->dest))". Then the followed loop code > will > set bb 4 as ALWAYS_EXEUCTED and all it's idoms bb 5. > > > while (1) > { > SET_ALWAYS_EXECUTED_IN (last, loop); > if (last == loop->header) > break; > last = get_immediate_dominator (CDI_DOMINATORS, last); > } > > After further discussion with Kewen, we found that the inn_loop variable is > totally useless and could be removed. > > > > > >> if (bitmap_bit_p (contains_call, bb->index)) > >> break; > >> > >> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap > >> @@ contains_call) > >> > >> if (bb->loop_father->header == bb) > >> { > >> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >> + if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, > >> bb)) > >> break; > > > > That's now a always false condition - a loops latch is always dominated > > by its header. The condition as written tries to verify whether the > > loop is always entered - mind we visit all blocks, not only those > > always executed. > > Thanks for the catch! I am afraid the piece of code should be removed since > it stops > search of potential ALWAYS EXECUTED bb after inner loop... But the code says: /* In a loop that is always entered we may proceed anyway. But record that we entered it and stop once we leave it. */ and you do not remove this comment still it doesn't hold anymore after your patch. 
I don't say the current code is optimal - I just
say it does exactly what is documented.

> >
> > In fact for your testcase the x->j ref is _not_ always executed
> > since the inner loop is conditional on n > 0.
>
> Yes.  But I want to move x->k (not x->j) out of loop 1 when l > 0 in
> store-motion.

Sorry, that wasn't clear.  Yes, I agree the code fails to see always
executed blocks after inner loops.  But since the code simply walks
all blocks in a loop instead of greedily following edges it has
to do that since the inner loop might exit the outer loop as well,
sth your change wouldn't honor.  Consider

  while (--n)
    {
      do
	{
	  if (do_exit)
	    goto out;
	}
      while (1);
      p->x += 1;
    }
out:;

you'll see a CFG where the inner loop exits the outer loop as well.

So I'm saying to improve the code you'll likely have to do more.
And add a testcase like the following

void __attribute__((noipa))
foo (int n, int m, int f, int *p, int *q)
{
  while (--n)
    {
      do
	{
	  *q += 1;
	  if (f)
	    goto out;
	}
      while (--m);
      *p += 1;
    }
out:;
}

int main()
{
  int i = 0;
  foo (10, 10, 1, (void *)0, &i);
  if (i != 1)
    __builtin_abort ();
  return 0;
}

Richard.

> Attached the diff file without and with my patch to show the extra
> optimization.
>
> x->j is already moved out of loop 2 on master code.
> If change n and l to constant numbers like 100, master code could also
> do 2 store motions as expected.  The edge from bb 5 to bb 4 doesn't
> exist now, so bb 4, bb 3 and bb 5 are ALWAYS EXECUTED for loop 1.
>
>
> struct X { int i; int j; int k;};
>
> void foo(struct X *x, int n, int l)
> {
>   for (int j = 0; j < l; j++) // loop 1
>     {
>       for (int i = 0; i < n; ++i) // loop 2
> 	{
> 	  int *p = &x->j;
> 	  int tem = *p;
> 	  x->j += tem * i;
> 	}
>       int *r = &x->k;
>       int tem2 = *r;
>       x->k += tem2 * j;
>     }
> }
>
> >
> > Richard.
> >
> >

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 17+ messages in thread
* [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 2021-08-17 7:12 ` Richard Biener @ 2021-08-17 9:10 ` Xionghu Luo 2021-08-19 5:23 ` Xionghu Luo 2021-08-19 12:11 ` Richard Biener 0 siblings, 2 replies; 17+ messages in thread From: Xionghu Luo @ 2021-08-17 9:10 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On 2021/8/17 15:12, Richard Biener wrote: > On Tue, 17 Aug 2021, Xionghu Luo wrote: > >> Hi, >> >> On 2021/8/16 19:46, Richard Biener wrote: >>> On Mon, 16 Aug 2021, Xiong Hu Luo wrote: >>> >>>> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for >>>> nested loops. inn_loop is updated to inner loop, so it need be restored >>>> when exiting from innermost loop. With this patch, the store instruction >>>> in outer loop could also be moved out of outer loop by store motion. >>>> Any comments? Thanks. >>> >>>> gcc/ChangeLog: >>>> >>>> * tree-ssa-loop-im.c (fill_always_executed_in_1): Restore >>>> inn_loop when exiting from innermost loop. >>>> >>>> gcc/testsuite/ChangeLog: >>>> >>>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. 
>>>> --- >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++ >>>> gcc/tree-ssa-loop-im.c | 6 +++++- >>>> 2 files changed, 29 insertions(+), 1 deletion(-) >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>> >>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>> new file mode 100644 >>>> index 00000000000..097a5ee4a4b >>>> --- /dev/null >>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>> @@ -0,0 +1,24 @@ >>>> +/* PR/101293 */ >>>> +/* { dg-do compile } */ >>>> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ >>>> + >>>> +struct X { int i; int j; int k;}; >>>> + >>>> +void foo(struct X *x, int n, int l) >>>> +{ >>>> + for (int j = 0; j < l; j++) >>>> + { >>>> + for (int i = 0; i < n; ++i) >>>> + { >>>> + int *p = &x->j; >>>> + int tem = *p; >>>> + x->j += tem * i; >>>> + } >>>> + int *r = &x->k; >>>> + int tem2 = *r; >>>> + x->k += tem2 * j; >>>> + } >>>> +} >>>> + >>>> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } >>>> */ >>>> + >>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >>>> index b24bc64f2a7..5ca4738b20e 100644 >>>> --- a/gcc/tree-ssa-loop-im.c >>>> +++ b/gcc/tree-ssa-loop-im.c >>>> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap >>>> @@ contains_call) >>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>> last = bb; >>>> + if (inn_loop != loop >>>> + && flow_loop_nested_p (bb->loop_father, inn_loop)) >>>> + inn_loop = bb->loop_father; >>>> + >>> >>> The comment says >>> >>> /* In a loop that is always entered we may proceed anyway. >>> But record that we entered it and stop once we leave it. >>> */ >>> inn_loop = bb->loop_father; >>> >>> and your change would defeat that early return, no? >> >> The issue is the search method exits too early when iterating the outer >> loop. For example of a nested loop, loop 1 includes 5,8,3,10,4,9 >> and loop2 includes 3,10. 
Currently, it breaks when bb is 3 as bb 3 >> doesn't dominate bb 9 of loop 1. But actually, both bb 5 and bb 4 are >> ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4 >> they won't be processed by store motion again. >> >> >> 5<---- >> |\ | >> 8 \ 9 >> | \ | >> --->3--->4 >> | | >> 10---| >> >> >> SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now, with this >> patch, it will continue search when meet bb 3 until bb 4, then last is updated >> to bb 4, it will break until exit edge is found at bb 4 by >> "if (!flow_bb_inside_loop_p (loop, e->dest))". Then the followed loop code >> will >> set bb 4 as ALWAYS_EXEUCTED and all it's idoms bb 5. >> >> >> while (1) >> { >> SET_ALWAYS_EXECUTED_IN (last, loop); >> if (last == loop->header) >> break; >> last = get_immediate_dominator (CDI_DOMINATORS, last); >> } >> >> After further discussion with Kewen, we found that the inn_loop variable is >> totally useless and could be removed. >> >> >>> >>>> if (bitmap_bit_p (contains_call, bb->index)) >>>> break; >>>> >>>> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap >>>> @@ contains_call) >>>> >>>> if (bb->loop_father->header == bb) >>>> { >>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>> + if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, >>>> bb)) >>>> break; >>> >>> That's now a always false condition - a loops latch is always dominated >>> by its header. The condition as written tries to verify whether the >>> loop is always entered - mind we visit all blocks, not only those >>> always executed. >> >> Thanks for the catch! I am afraid the piece of code should be removed since >> it stops >> search of potential ALWAYS EXECUTED bb after inner loop... > > But the code says: > > /* In a loop that is always entered we may proceed anyway. > But record that we entered it and stop once we leave it. > */ > > and you do not remove this comment still it doesn't hold anymore > after your patch. 
I don't say the current code is optimal - I just > say it does exactly what is documented. Removed. > > >>> >>> In fact for your testcase the x->j ref is _not_ always executed >>> since the inner loop is conditional on n > 0. >> >> Yes. But I want to move x->k (not x->j) out of loop 1 when l > 0 in >> store-motion. > > Sorry, that wasn't clear. Yes, I agree the code fails to see always > executed blocks after inner loops. But since the code simply walks > all blocks in a loop instead of greedily following edges it has > to do that since the inner loop might exit the outer loop as well, > sth your change wouldn't honor. Consider This is already handled now: if there is an edge exiting out of the outer loop directly, it will also break early; I've verified it with your case. FOR_EACH_EDGE (e, ei, bb->succs) { /* If there is an exit from this BB. */ if (!flow_bb_inside_loop_p (loop, e->dest)) break; > > while (--n) > { > do > { > if (do_exit) > goto out; > } > while (1); > p->x += 1; > } > out:; > > you'll see a CFG where the inner loop exits the outer loop as well. > > So I'm saying to improve the code you'll likely have to do more. > And add a testcase like the following > > void __attribute__((noipa)) > foo (int n, int m, int f, int *p, int *q) > { > while (--n) > { > do > { > *q += 1; > if (f) > goto out; > } > while (--m); > *p += 1; > } > out:; > } > > int main() > { > int i = 0; > foo (10, 10, 1, (void *)0, &i); > if (i != 1) > __builtin_abort (); > return 0; > } > > Richard. > > Updated the patch: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for nested loops. The current design will exit if an inner loop doesn't dominate the outer loop's latch, or if an exit is seen after exiting from the inner loop, which caused an early return from the outer loop, so ALWAYS_EXECUTED blocks after inner loops are skipped.
This patch removes the redundant inner loop check, but keep the check of exiting to out of outer loop from inner loop, then the store instruction in outer loop could also be moved out of outer loop by store motion. gcc/ChangeLog: * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove inn_loop check. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/ssa-lim-19.c: New test. * gcc.dg/tree-ssa/ssa-lim-20.c: New test. --- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 +++++++++++++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 ++++++++++++++++++++++ gcc/tree-ssa-loop-im.c | 14 ---------- 3 files changed, 54 insertions(+), 14 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c new file mode 100644 index 00000000000..097a5ee4a4b --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c @@ -0,0 +1,24 @@ +/* PR/101293 */ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +struct X { int i; int j; int k;}; + +void foo(struct X *x, int n, int l) +{ + for (int j = 0; j < l; j++) + { + for (int i = 0; i < n; ++i) + { + int *p = &x->j; + int tem = *p; + x->j += tem * i; + } + int *r = &x->k; + int tem2 = *r; + x->k += tem2 * j; + } +} + +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c new file mode 100644 index 00000000000..6e180de528e --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c @@ -0,0 +1,30 @@ +/* PR/101293 */ +/* { dg-do run } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q) +{ + while (--n) + { + do + { + *q += 1; + if (f) + goto out; + } + while (--m); + *p += 1; + } +out:; +} + +int main() +{ + int i = 0; + foo (10, 10, 1, 
(void *) 0, &i); + if (i != 1) + __builtin_abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */ diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index b24bc64f2a7..c44f4390ce2 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -3197,7 +3197,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) basic_block bb = NULL, *bbs, last = NULL; unsigned i; edge e; - class loop *inn_loop = loop; if (ALWAYS_EXECUTED_IN (loop->header) == NULL) { @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) to disprove this if possible). */ if (bb->flags & BB_IRREDUCIBLE_LOOP) break; - - if (!flow_bb_inside_loop_p (inn_loop, bb)) - break; - - if (bb->loop_father->header == bb) - { - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) - break; - - /* In a loop that is always entered we may proceed anyway. - But record that we entered it and stop once we leave it. */ - inn_loop = bb->loop_father; - } } while (1) -- 2.27.0.90.geebb51ba8c ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 2021-08-17 9:10 ` [PATCH v2] Fix incomplete " Xionghu Luo @ 2021-08-19 5:23 ` Xionghu Luo 2021-08-19 12:11 ` Richard Biener 1 sibling, 0 replies; 17+ messages in thread From: Xionghu Luo @ 2021-08-19 5:23 UTC (permalink / raw) To: Richard Biener; +Cc: segher, wschmidt, linkw, gcc-patches, dje.gcc On 2021/8/17 17:10, Xionghu Luo via Gcc-patches wrote: > > > On 2021/8/17 15:12, Richard Biener wrote: >> On Tue, 17 Aug 2021, Xionghu Luo wrote: >> >>> Hi, >>> >>> On 2021/8/16 19:46, Richard Biener wrote: >>>> On Mon, 16 Aug 2021, Xiong Hu Luo wrote: >>>> >>>>> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for >>>>> nested loops. inn_loop is updated to inner loop, so it need be restored >>>>> when exiting from innermost loop. With this patch, the store instruction >>>>> in outer loop could also be moved out of outer loop by store motion. >>>>> Any comments? Thanks. >>>> >>>>> gcc/ChangeLog: >>>>> >>>>> * tree-ssa-loop-im.c (fill_always_executed_in_1): Restore >>>>> inn_loop when exiting from innermost loop. >>>>> >>>>> gcc/testsuite/ChangeLog: >>>>> >>>>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. 
>>>>> --- >>>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++ >>>>> gcc/tree-ssa-loop-im.c | 6 +++++- >>>>> 2 files changed, 29 insertions(+), 1 deletion(-) >>>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>>> >>>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>>> new file mode 100644 >>>>> index 00000000000..097a5ee4a4b >>>>> --- /dev/null >>>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >>>>> @@ -0,0 +1,24 @@ >>>>> +/* PR/101293 */ >>>>> +/* { dg-do compile } */ >>>>> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ >>>>> + >>>>> +struct X { int i; int j; int k;}; >>>>> + >>>>> +void foo(struct X *x, int n, int l) >>>>> +{ >>>>> + for (int j = 0; j < l; j++) >>>>> + { >>>>> + for (int i = 0; i < n; ++i) >>>>> + { >>>>> + int *p = &x->j; >>>>> + int tem = *p; >>>>> + x->j += tem * i; >>>>> + } >>>>> + int *r = &x->k; >>>>> + int tem2 = *r; >>>>> + x->k += tem2 * j; >>>>> + } >>>>> +} >>>>> + >>>>> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } >>>>> */ >>>>> + >>>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >>>>> index b24bc64f2a7..5ca4738b20e 100644 >>>>> --- a/gcc/tree-ssa-loop-im.c >>>>> +++ b/gcc/tree-ssa-loop-im.c >>>>> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap >>>>> @@ contains_call) >>>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>>> last = bb; >>>>> + if (inn_loop != loop >>>>> + && flow_loop_nested_p (bb->loop_father, inn_loop)) >>>>> + inn_loop = bb->loop_father; >>>>> + >>>> >>>> The comment says >>>> >>>> /* In a loop that is always entered we may proceed anyway. >>>> But record that we entered it and stop once we leave it. >>>> */ >>>> inn_loop = bb->loop_father; >>>> >>>> and your change would defeat that early return, no? >>> >>> The issue is the search method exits too early when iterating the outer >>> loop. 
For example of a nested loop, loop 1 includes 5,8,3,10,4,9 >>> and loop2 includes 3,10. Currently, it breaks when bb is 3 as bb 3 >>> doesn't dominate bb 9 of loop 1. But actually, both bb 5 and bb 4 are >>> ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4 >>> they won't be processed by store motion again. >>> >>> >>> 5<---- >>> |\ | >>> 8 \ 9 >>> | \ | >>> --->3--->4 >>> | | >>> 10---| >>> >>> >>> SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now, with this >>> patch, it will continue search when meet bb 3 until bb 4, then last is updated >>> to bb 4, it will break until exit edge is found at bb 4 by >>> "if (!flow_bb_inside_loop_p (loop, e->dest))". Then the followed loop code >>> will >>> set bb 4 as ALWAYS_EXEUCTED and all it's idoms bb 5. >>> >>> >>> while (1) >>> { >>> SET_ALWAYS_EXECUTED_IN (last, loop); >>> if (last == loop->header) >>> break; >>> last = get_immediate_dominator (CDI_DOMINATORS, last); >>> } >>> >>> After further discussion with Kewen, we found that the inn_loop variable is >>> totally useless and could be removed. >>> >>> >>>> >>>>> if (bitmap_bit_p (contains_call, bb->index)) >>>>> break; >>>>> >>>>> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap >>>>> @@ contains_call) >>>>> >>>>> if (bb->loop_father->header == bb) >>>>> { >>>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>>> + if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, >>>>> bb)) >>>>> break; >>>> >>>> That's now a always false condition - a loops latch is always dominated >>>> by its header. The condition as written tries to verify whether the >>>> loop is always entered - mind we visit all blocks, not only those >>>> always executed. >>> >>> Thanks for the catch! I am afraid the piece of code should be removed since >>> it stops >>> search of potential ALWAYS EXECUTED bb after inner loop... >> >> But the code says: >> >> /* In a loop that is always entered we may proceed anyway. 
>> But record that we entered it and stop once we leave it. >> */ >> >> and you do not remove this comment still it doesn't hold anymore >> after your patch. I don't say the current code is optimal - I just >> say it does exactly what is documented. > > Removed. > >> >> >>>> >>>> In fact for your testcase the x->j ref is _not_ always executed >>>> since the inner loop is conditional on n > 0. >>> >>> Yes. But I want to move x->k (not x->j) out of loop 1 when l > 0 in >>> store-motion. >> >> Sorry, that wasn't clear. Yes, I agree the code fails to see always >> executed blocks after inner loops. But since the code simply walks >> all blocks in a loop instead of greedily following edges it has >> to do that since the inner loop might exit the outer loop as well, >> sth your change wouldn't honor. Consider > > This is already handled now, if there is an edge exiting out of outer > loop directly, it will also early break, I've verified it with your case. > > > FOR_EACH_EDGE (e, ei, bb->succs) > { > /* If there is an exit from this BB. */ > if (!flow_bb_inside_loop_p (loop, e->dest)) > break; >> >> while (--n) >> { >> do >> { >> if (do_exit) >> goto out; >> } >> while (1); >> p->x += 1; >> } >> out:; >> >> you'll see a CFG where the inner loop exits the outer loop as well. >> >> So I'm saying to improve the code you'll likely have to do more. >> And add a testcase like the following >> >> void __attribute__((noipa)) >> foo (int n, int m, int f, int *p, int *q) >> { >> while (--n) >> { >> do >> { >> *q += 1; >> if (f) >> goto out; >> } >> while (--m); >> *p += 1; >> } >> out:; >> } >> >> int main() >> { >> int i = 0; >> foo (10, 10, 1, (void *)0, &i); >> if (i != 1) >> __builtin_abort (); >> return 0; >> } >> >> Richard. >> > > Updated the patch: > > > [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 > > It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for > nested loops. 
Current design will exit if an inner loop doesn't > dominate outer loop's latch or exit after exiting from inner loop, which > caused early return from outer loop, then ALWAYS EXECUTED blocks after > inner loops is skipped. > This patch removes the redundant inner loop check, but keep the check of > exiting to out of outer loop from inner loop, then the store instruction > in outer loop could also be moved out of outer loop by store motion. This patch passes regression testing on Power, and shows NO SPEC2017 performance change. > > gcc/ChangeLog: > > * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove > inn_loop check. > > gcc/testsuite/ChangeLog: > > * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > * gcc.dg/tree-ssa/ssa-lim-20.c: New test. > --- > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 +++++++++++++++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 ++++++++++++++++++++++ > gcc/tree-ssa-loop-im.c | 14 ---------- > 3 files changed, 54 insertions(+), 14 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > new file mode 100644 > index 00000000000..097a5ee4a4b > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > @@ -0,0 +1,24 @@ > +/* PR/101293 */ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +struct X { int i; int j; int k;}; > + > +void foo(struct X *x, int n, int l) > +{ > + for (int j = 0; j < l; j++) > + { > + for (int i = 0; i < n; ++i) > + { > + int *p = &x->j; > + int tem = *p; > + x->j += tem * i; > + } > + int *r = &x->k; > + int tem2 = *r; > + x->k += tem2 * j; > + } > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ > + > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > new file mode 100644 > index
00000000000..6e180de528e > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > @@ -0,0 +1,30 @@ > +/* PR/101293 */ > +/* { dg-do run } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q) > +{ > + while (--n) > + { > + do > + { > + *q += 1; > + if (f) > + goto out; > + } > + while (--m); > + *p += 1; > + } > +out:; > +} > + > +int main() > +{ > + int i = 0; > + foo (10, 10, 1, (void *) 0, &i); > + if (i != 1) > + __builtin_abort (); > + return 0; > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */ > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > index b24bc64f2a7..c44f4390ce2 100644 > --- a/gcc/tree-ssa-loop-im.c > +++ b/gcc/tree-ssa-loop-im.c > @@ -3197,7 +3197,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > basic_block bb = NULL, *bbs, last = NULL; > unsigned i; > edge e; > - class loop *inn_loop = loop; > > if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > { > @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > to disprove this if possible). */ > if (bb->flags & BB_IRREDUCIBLE_LOOP) > break; > - > - if (!flow_bb_inside_loop_p (inn_loop, bb)) > - break; > - > - if (bb->loop_father->header == bb) > - { > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > - break; > - > - /* In a loop that is always entered we may proceed anyway. > - But record that we entered it and stop once we leave it. */ > - inn_loop = bb->loop_father; > - } > } > > while (1) > -- Thanks, Xionghu ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 2021-08-17 9:10 ` [PATCH v2] Fix incomplete " Xionghu Luo 2021-08-19 5:23 ` Xionghu Luo @ 2021-08-19 12:11 ` Richard Biener 2021-08-24 7:44 ` Xionghu Luo 1 sibling, 1 reply; 17+ messages in thread From: Richard Biener @ 2021-08-19 12:11 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On Tue, 17 Aug 2021, Xionghu Luo wrote: > > > On 2021/8/17 15:12, Richard Biener wrote: > > On Tue, 17 Aug 2021, Xionghu Luo wrote: > > > >> Hi, > >> > >> On 2021/8/16 19:46, Richard Biener wrote: > >>> On Mon, 16 Aug 2021, Xiong Hu Luo wrote: > >>> > >>>> It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for > >>>> nested loops. inn_loop is updated to inner loop, so it need be restored > >>>> when exiting from innermost loop. With this patch, the store instruction > >>>> in outer loop could also be moved out of outer loop by store motion. > >>>> Any comments? Thanks. > >>> > >>>> gcc/ChangeLog: > >>>> > >>>> * tree-ssa-loop-im.c (fill_always_executed_in_1): Restore > >>>> inn_loop when exiting from innermost loop. > >>>> > >>>> gcc/testsuite/ChangeLog: > >>>> > >>>> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. 
> >>>> --- > >>>> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++++++++++++++++++ > >>>> gcc/tree-ssa-loop-im.c | 6 +++++- > >>>> 2 files changed, 29 insertions(+), 1 deletion(-) > >>>> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >>>> > >>>> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >>>> b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >>>> new file mode 100644 > >>>> index 00000000000..097a5ee4a4b > >>>> --- /dev/null > >>>> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >>>> @@ -0,0 +1,24 @@ > >>>> +/* PR/101293 */ > >>>> +/* { dg-do compile } */ > >>>> +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > >>>> + > >>>> +struct X { int i; int j; int k;}; > >>>> + > >>>> +void foo(struct X *x, int n, int l) > >>>> +{ > >>>> + for (int j = 0; j < l; j++) > >>>> + { > >>>> + for (int i = 0; i < n; ++i) > >>>> + { > >>>> + int *p = &x->j; > >>>> + int tem = *p; > >>>> + x->j += tem * i; > >>>> + } > >>>> + int *r = &x->k; > >>>> + int tem2 = *r; > >>>> + x->k += tem2 * j; > >>>> + } > >>>> +} > >>>> + > >>>> +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } > >>>> */ > >>>> + > >>>> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > >>>> index b24bc64f2a7..5ca4738b20e 100644 > >>>> --- a/gcc/tree-ssa-loop-im.c > >>>> +++ b/gcc/tree-ssa-loop-im.c > >>>> @@ -3211,6 +3211,10 @@ fill_always_executed_in_1 (class loop *loop, sbitmap > >>>> @@ contains_call) > >>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>> last = bb; > >>>> + if (inn_loop != loop > >>>> + && flow_loop_nested_p (bb->loop_father, inn_loop)) > >>>> + inn_loop = bb->loop_father; > >>>> + > >>> > >>> The comment says > >>> > >>> /* In a loop that is always entered we may proceed anyway. > >>> But record that we entered it and stop once we leave it. > >>> */ > >>> inn_loop = bb->loop_father; > >>> > >>> and your change would defeat that early return, no? 
> >> > >> The issue is the search method exits too early when iterating the outer > >> loop. For example of a nested loop, loop 1 includes 5,8,3,10,4,9 > >> and loop2 includes 3,10. Currently, it breaks when bb is 3 as bb 3 > >> doesn't dominate bb 9 of loop 1. But actually, both bb 5 and bb 4 are > >> ALWAYS_EXECUTED for loop 1, so if there are store instructions in bb 4 > >> they won't be processed by store motion again. > >> > >> > >> 5<---- > >> |\ | > >> 8 \ 9 > >> | \ | > >> --->3--->4 > >> | | > >> 10---| > >> > >> > >> SET_ALWAYS_EXECUTED_IN is only set to bb 5 on master code now, with this > >> patch, it will continue search when meet bb 3 until bb 4, then last is updated > >> to bb 4, it will break until exit edge is found at bb 4 by > >> "if (!flow_bb_inside_loop_p (loop, e->dest))". Then the followed loop code > >> will > >> set bb 4 as ALWAYS_EXEUCTED and all it's idoms bb 5. > >> > >> > >> while (1) > >> { > >> SET_ALWAYS_EXECUTED_IN (last, loop); > >> if (last == loop->header) > >> break; > >> last = get_immediate_dominator (CDI_DOMINATORS, last); > >> } > >> > >> After further discussion with Kewen, we found that the inn_loop variable is > >> totally useless and could be removed. > >> > >> > >>> > >>>> if (bitmap_bit_p (contains_call, bb->index)) > >>>> break; > >>>> > >>>> @@ -3238,7 +3242,7 @@ fill_always_executed_in_1 (class loop *loop, sbitmap > >>>> @@ contains_call) > >>>> > >>>> if (bb->loop_father->header == bb) > >>>> { > >>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>> + if (!dominated_by_p (CDI_DOMINATORS, bb->loop_father->latch, > >>>> bb)) > >>>> break; > >>> > >>> That's now a always false condition - a loops latch is always dominated > >>> by its header. The condition as written tries to verify whether the > >>> loop is always entered - mind we visit all blocks, not only those > >>> always executed. > >> > >> Thanks for the catch! 
I am afraid the piece of code should be removed since > >> it stops > >> search of potential ALWAYS EXECUTED bb after inner loop... > > > > But the code says: > > > > /* In a loop that is always entered we may proceed anyway. > > But record that we entered it and stop once we leave it. > > */ > > > > and you do not remove this comment still it doesn't hold anymore > > after your patch. I don't say the current code is optimal - I just > > say it does exactly what is documented. > > Removed. > > > > > > >>> > >>> In fact for your testcase the x->j ref is _not_ always executed > >>> since the inner loop is conditional on n > 0. > >> > >> Yes. But I want to move x->k (not x->j) out of loop 1 when l > 0 in > >> store-motion. > > > > Sorry, that wasn't clear. Yes, I agree the code fails to see always > > executed blocks after inner loops. But since the code simply walks > > all blocks in a loop instead of greedily following edges it has > > to do that since the inner loop might exit the outer loop as well, > > sth your change wouldn't honor. Consider > > This is already handled now, if there is an edge exiting out of outer > loop directly, it will also early break, I've verified it with your case. > > > FOR_EACH_EDGE (e, ei, bb->succs) > { > /* If there is an exit from this BB. */ > if (!flow_bb_inside_loop_p (loop, e->dest)) > break; > > > > while (--n) > > { > > do > > { > > if (do_exit) > > goto out; > > } > > while (1); > > p->x += 1; > > } > > out:; > > > > you'll see a CFG where the inner loop exits the outer loop as well. > > > > So I'm saying to improve the code you'll likely have to do more. 
> > And add a testcase like the following > > > > void __attribute__((noipa)) > > foo (int n, int m, int f, int *p, int *q) > > { > > while (--n) > > { > > do > > { > > *q += 1; > > if (f) > > goto out; > > } > > while (--m); > > *p += 1; > > } > > out:; > > } > > > > int main() > > { > > int i = 0; > > foo (10, 10, 1, (void *)0, &i); > > if (i != 1) > > __builtin_abort (); > > return 0; > > } > > > > Richard. > > > > Updated the patch: > > > [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 > > It seems to me that ALWAYS_EXECUTED_IN is not computed correctly for > nested loops. Current design will exit if an inner loop doesn't > dominate outer loop's latch or exit after exiting from inner loop, which > caused early return from outer loop, then ALWAYS EXECUTED blocks after > inner loops is skipped. > This patch removes the redundant inner loop check, but keep the check of > exiting to out of outer loop from inner loop, then the store instruction > in outer loop could also be moved out of outer loop by store motion. > > gcc/ChangeLog: > > * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove > inn_loop check. > > gcc/testsuite/ChangeLog: > > * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > * gcc.dg/tree-ssa/ssa-lim-20.c: New test. 
> --- > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 +++++++++++++++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 ++++++++++++++++++++++ > gcc/tree-ssa-loop-im.c | 14 ---------- > 3 files changed, 54 insertions(+), 14 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > new file mode 100644 > index 00000000000..097a5ee4a4b > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > @@ -0,0 +1,24 @@ > +/* PR/101293 */ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +struct X { int i; int j; int k;}; > + > +void foo(struct X *x, int n, int l) > +{ > + for (int j = 0; j < l; j++) > + { > + for (int i = 0; i < n; ++i) > + { > + int *p = &x->j; > + int tem = *p; > + x->j += tem * i; > + } > + int *r = &x->k; > + int tem2 = *r; > + x->k += tem2 * j; > + } > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ > + > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > new file mode 100644 > index 00000000000..6e180de528e > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > @@ -0,0 +1,30 @@ > +/* PR/101293 */ > +/* { dg-do run } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q) > +{ > + while (--n) > + { > + do > + { > + *q += 1; > + if (f) > + goto out; > + } > + while (--m); > + *p += 1; > + } > +out:; > +} > + > +int main() > +{ > + int i = 0; > + foo (10, 10, 1, (void *) 0, &i); > + if (i != 1) > + __builtin_abort (); > + return 0; > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */ > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > index b24bc64f2a7..c44f4390ce2 100644 > --- 
a/gcc/tree-ssa-loop-im.c > +++ b/gcc/tree-ssa-loop-im.c > @@ -3197,7 +3197,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > basic_block bb = NULL, *bbs, last = NULL; > unsigned i; > edge e; > - class loop *inn_loop = loop; > > if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > { > @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > to disprove this if possible). */ > if (bb->flags & BB_IRREDUCIBLE_LOOP) > break; > - > - if (!flow_bb_inside_loop_p (inn_loop, bb)) > - break; > - > - if (bb->loop_father->header == bb) > - { > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > - break; > - > - /* In a loop that is always entered we may proceed anyway. > - But record that we entered it and stop once we leave it. */ > - inn_loop = bb->loop_father; > - } > } > > while (1) I'm not sure this will work correctly (I'm not sure how the existing code makes it so either...). That said, I can't poke any hole into the change. What I see is that definitely if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) last = bb; if (bitmap_bit_p (contains_call, bb->index)) break; doesn't work reliably since the DOM ordering will process blocks A, B and C in random order for for (;;) { if (cond) { A: foo (); } else B:; C:; } and thus we can end up setting 'last' to C _before_ processing 'A' and thus arriving at the call foo () ... get_loop_body_in_dom_order does some "special sauce" but not to address the above problem - but it might be that a subtle issue like the above is the reason for the inner loop handling. The inner loop block order does _not_ adhere to this "special sauce", that is - the "Additionally, if a basic block s dominates the latch, then only blocks dominated by s are be after it." guarantee holds for the outer loop latch, not for the inner.
Digging into the history of fill_always_executed_in_1 doesn't reveal anything - the inner loop handling has been present since introduction by Zdenek - but usually Zdenek has a reason for doing things as he does ;) Note it might be simply a measure against quadratic complexity, esp. since with your patch we also dive into not always executed subloops as you remove the if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) break; check. I suggest to evaluate behavior of the patch on a testcase like void foo (int n, int **k) { for (int i = 0; i < n; ++i) if (k[0][i]) for (int j = 0; j < n; ++j) if (k[1][j]) for (int l = 0; l < n; ++l) if (k[2][l]) ... } I suspect you'll see quadratic behavior with your patch. You should be at least able to preserve a check like /* Do not process not always executed subloops to avoid quadratic behavior. */ if (bb->loop_father->header == bb && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) break; which is of course not optimistic for cases like for (..) { if (cond) for (..) x = 1; // this is always executed if the inner loop is finite } but we need to have an eye on the complexity of this function. I would have suggested to do greedy visiting of the loop header successors, processing further blocks if all entries (ignoring backedges) are processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist is empty proceed to inner loops as the current code does. For bookkeeping simply keep a to-visit-incoming-edges counter per BB. Pseudo-code: bitmap_set_bit (worklist, loop-header-bb); while (!bitmap_empty_p (worklist)) { bb = pop (worklist); SET_ALWAYS_EXECUTED_IN (bb, loop); if (bitmap_bit_p (contains_call, bb->index)) continue; FOR_EACH_EDGE (e, ei, bb->succs) { if (!flow_bb_inside_loop_p (loop, e->dest)) continue; if (incoming_count[e->dest->index]-- == 0) push (worklist, e->dest); } } iterate over inner loops (incoming_count can be retained, we just force the inner loop header onto the worklist). 
That would avoid the quadraticness with respect to get_loop_body_in_dom_order and also fix the dominator issue mentioned above. Richard. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 2021-08-19 12:11 ` Richard Biener @ 2021-08-24 7:44 ` Xionghu Luo 2021-08-24 8:20 ` Richard Biener 0 siblings, 1 reply; 17+ messages in thread From: Xionghu Luo @ 2021-08-24 7:44 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On 2021/8/19 20:11, Richard Biener wrote: >> - class loop *inn_loop = loop; >> >> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) >> { >> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) >> to disprove this if possible). */ >> if (bb->flags & BB_IRREDUCIBLE_LOOP) >> break; >> - >> - if (!flow_bb_inside_loop_p (inn_loop, bb)) >> - break; >> - >> - if (bb->loop_father->header == bb) >> - { >> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >> - break; >> - >> - /* In a loop that is always entered we may proceed anyway. >> - But record that we entered it and stop once we leave it. */ >> - inn_loop = bb->loop_father; >> - } >> } >> >> while (1) > I'm not sure this will work correct (I'm not sure how the existing > code makes it so either...). That said, I can't poke any hole > into the change. What I see is that definitely > > if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > last = bb; > > if (bitmap_bit_p (contains_call, bb->index)) > break; > > doesn't work reliably since the DOM ordering will process blocks > A B and C in random order for > > for (;;) > { > if (cond) > { > A: foo (); > } > else B:; > C:; > } > > and thus we can end up setting 'last' to C_before_ processing > 'A' and thus arriving at the call foo () ... > > get_loop_body_in_dom_order does some "special sauce" but not > to address the above problem - but it might be that a subtle > issue like the above is the reason for the inner loop handling. 
> The inner loop block order does_not_ adhere to this "special sauce", > that is - the "Additionally, if a basic block s dominates > the latch, then only blocks dominated by s are be after it." > guarantee holds for the outer loop latch, not for the inner. > > Digging into the history of fill_always_executed_in_1 doesn't > reveal anything - the inner loop handling has been present > since introduction by Zdenek - but usually Zdenek has a reason > for doing things as he does;) Yes, this is really complicated usage, thanks for pointing it out. :) I constructed two cases to verify this, with an inner loop that includes "if A; else B; C". I found that fill_sons_in_loop in get_loop_body_in_dom_order also checks whether the bb dominates the outer loop's latch: if C dominates the outer loop's latch, C is postponed and the visit order is ABC, so 'last' won't be set to C if A or B contains a call. Otherwise, if C doesn't dominate the outer loop's latch in fill_sons_in_loop, the visit order is CAB, but 'last' still won't be updated to C in fill_always_executed_in_1, since there is a dominance check there as well; then if A or B contains a call, the walk can break out correctly. C won't be set to ALWAYS_EXECUTED in either circumstance. > > Note it might be simply a measure against quadratic complexity, > esp. since with your patch we also dive into not always executed > subloops as you remove the > > if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > break; > > check. I suggest to evaluate behavior of the patch on a testcase > like > > void foo (int n, int **k) > { > for (int i = 0; i < n; ++i) > if (k[0][i]) > for (int j = 0; j < n; ++j) > if (k[1][j]) > for (int l = 0; l < n; ++l) > if (k[2][l]) > ... > } Theoretically the complexity changes from L1(bbs) to L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), so fill_always_executed_in_1's execution time is supposed to increase from O(n) to O(n^2)? The time should depend on loop depth and bb counts.
I also drafted a test case that has a 73-deep loop function with 25 no-ipa function copies, each compiled in lim2 and lim4 independently. The total execution time of fill_always_executed_in_1 increased from 32ms to 58ms, almost doubled but not quadratic? It seems reasonable to see compile time getting longer, since most bbs are checked more often, but that is a MUST to ensure we break early correctly at every loop level... Though loop nodes could be huge, loop depth will never be so large in actual code? > > I suspect you'll see quadratic behavior with your patch. You > should be at least able to preserve a check like > > /* Do not process not always executed subloops to avoid > quadratic behavior. */ > if (bb->loop_father->header == bb > && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > break; > > which is of course not optimistic for cases like > > for (..) > { > if (cond) > for (..) > x = 1; // this is always executed if the inner loop is finite > } > > but we need to have an eye on the complexity of this function. I would > have suggested to do greedy visiting of the loop header successors, > processing further blocks if all entries (ignoring backedges) are > processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist > is empty proceed to inner loops as the current code does. For > bookkeeping simply keep a to-visit-incoming-edges counter per BB. > Pseudo-code: > > bitmap_set_bit (worklist, loop-header-bb); > while (!bitmap_empty_p (worklist)) > { > bb = pop (worklist); Do we need to check whether bb dominates the latch before SET_ALWAYS_EXECUTED_IN? > SET_ALWAYS_EXECUTED_IN (bb, loop); > if (bitmap_bit_p (contains_call, bb->index)) > continue; > FOR_EACH_EDGE (e, ei, bb->succs) > { > if (!flow_bb_inside_loop_p (loop, e->dest)) > continue; > if (incoming_count[e->dest->index]-- == 0) > push (worklist, e->dest); > } > } Sorry, I don't fully understand your algorithm. Shouldn't worklist be an auto_vec<basic_block>, which doesn't support bitmap operations?
Is incoming_count the bb's predecessor count? Why does it need to be retained once it has decreased to 0? > > iterate over inner loops (incoming_count can be retained, > we just force the inner loop header onto the worklist). Is this the same as: for (loop = loop->inner; loop; loop = loop->next) fill_always_executed_in_1 (loop, contains_call) > > that would avoid the quadraticness with respect to > get_loop_body_in_dom_order and also fix the dominator issue > mentioned above. > > Richard. -- Thanks, Xionghu ^ permalink raw reply [flat|nested] 17+ messages in thread
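For what it's worth, the incoming-count worklist scheme discussed above can be sketched standalone as follows. This is a toy, not GCC code: the CFG is a hand-written successor table (a diamond inside a loop body, back edges omitted), dominates_latch is computed by hand, and all helper names are made up. The count bookkeeping is adapted slightly from the pseudo-code (counts start at the full non-back-edge predecessor count and are pre-decremented). Note how the call in block 2 keeps the join block and everything after it from being marked, mirroring the conservative stop-at-call behavior of fill_always_executed_in_1:

```c
#include <assert.h>
#include <string.h>

#define NBB 6

/* Toy loop body (back edges omitted from the successor lists):
   0 = header -> 1;  1 -> 2 (then), 3 (else);  2 -> 4;  3 -> 4;
   4 -> 5 = latch.  Block 2 contains a call.  */
static const int nsucc[NBB] = { 1, 2, 1, 1, 1, 0 };
static const int succ[NBB][2]
  = { {1, 0}, {2, 3}, {4, 0}, {4, 0}, {5, 0}, {0, 0} };
static const int contains_call[NBB] = { 0, 0, 1, 0, 0, 0 };
/* Hand-computed: which blocks dominate the latch (block 5).  */
static const int dominates_latch[NBB] = { 1, 1, 0, 0, 1, 1 };

/* Greedy worklist walk: a block is pushed only once all of its
   non-back-edge predecessors have been processed, so a call on any
   path into a join block keeps the join (and everything after it)
   from being reached and marked.  */
static void
mark_always_executed (int always_executed[NBB])
{
  int incoming[NBB];
  memset (incoming, 0, sizeof incoming);
  for (int b = 0; b < NBB; b++)
    for (int s = 0; s < nsucc[b]; s++)
      incoming[succ[b][s]]++;

  memset (always_executed, 0, NBB * sizeof (int));
  int worklist[NBB], top = 0;
  worklist[top++] = 0;          /* start at the loop header */

  while (top > 0)
    {
      int bb = worklist[--top];
      if (dominates_latch[bb])
        always_executed[bb] = 1;
      if (contains_call[bb])
        continue;               /* do not walk past a call */
      for (int s = 0; s < nsucc[bb]; s++)
        if (--incoming[succ[bb][s]] == 0)
          worklist[top++] = succ[bb][s];
    }
}
```

On this toy CFG only the header and block 1 get marked: block 2 contains the call, so the edge count of join block 4 never drops to zero and blocks 4 and 5 are never pushed, even though they dominate the latch. That is exactly the property the DFS-order 'last' bookkeeping cannot guarantee for the ABC case.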
* Re: [PATCH v2] Fix incomplete computation in fill_always_executed_in_1 2021-08-24 7:44 ` Xionghu Luo @ 2021-08-24 8:20 ` Richard Biener 2021-08-26 5:50 ` [PATCH v3] " Xionghu Luo 0 siblings, 1 reply; 17+ messages in thread From: Richard Biener @ 2021-08-24 8:20 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On Tue, 24 Aug 2021, Xionghu Luo wrote: > > > On 2021/8/19 20:11, Richard Biener wrote: > >> - class loop *inn_loop = loop; > >> > >> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > >> { > >> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > >> to disprove this if possible). */ > >> if (bb->flags & BB_IRREDUCIBLE_LOOP) > >> break; > >> - > >> - if (!flow_bb_inside_loop_p (inn_loop, bb)) > >> - break; > >> - > >> - if (bb->loop_father->header == bb) > >> - { > >> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >> - break; > >> - > >> - /* In a loop that is always entered we may proceed anyway. > >> - But record that we entered it and stop once we leave it. */ > >> - inn_loop = bb->loop_father; > >> - } > >> } > >> > >> while (1) > > I'm not sure this will work correct (I'm not sure how the existing > > code makes it so either...). That said, I can't poke any hole > > into the change. What I see is that definitely > > > > if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > > last = bb; > > > > if (bitmap_bit_p (contains_call, bb->index)) > > break; > > > > doesn't work reliably since the DOM ordering will process blocks > > A B and C in random order for > > > > for (;;) > > { > > if (cond) > > { > > A: foo (); > > } > > else B:; > > C:; > > } > > > > and thus we can end up setting 'last' to C_before_ processing > > 'A' and thus arriving at the call foo () ... 
> > > > get_loop_body_in_dom_order does some "special sauce" but not > > to address the above problem - but it might be that a subtle > > issue like the above is the reason for the inner loop handling. > > The inner loop block order does_not_ adhere to this "special sauce", > > that is - the "Additionally, if a basic block s dominates > > the latch, then only blocks dominated by s are be after it." > > guarantee holds for the outer loop latch, not for the inner. > > > > Digging into the history of fill_always_executed_in_1 doesn't > > reveal anything - the inner loop handling has been present > > since introduction by Zdenek - but usually Zdenek has a reason > > for doing things as he does;) > > Yes, this is really complicated usage, thanks for point it out. :) > I constructed two cases to verify this with inner loop includes "If A; else B; C". > Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also checks > whether the bb domintes outer loop’s latch, if C dominate outer loop’s latch, > C is postponed, the access order is ABC, 'last' won’t be set to C if A or B contains call; But it depends on the order of visiting ABC and that's hard to put into a testcase since it depends on the order of edges and the processing of the dominance computation. ABC are simply unordered with respect to a dominator walk. > Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop, > the access order is CAB, but 'last' also won’t be updated to C in fill_always_executed_in_1 > since there is also dominate check, then if A or B contains call, it could break > successfully. > > C won't be set to ALWAYS EXECUTED for both circumstance. > > > > > Note it might be simply a measure against quadratic complexity, > > esp. since with your patch we also dive into not always executed > > subloops as you remove the > > > > if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > > break; > > > > check. 
I suggest to evaluate behavior of the patch on a testcase > > like > > > > void foo (int n, int **k) > > { > > for (int i = 0; i < n; ++i) > > if (k[0][i]) > > for (int j = 0; j < n; ++j) > > if (k[1][j]) > > for (int l = 0; l < n; ++l) > > if (k[2][l]) > > ... > > } > > Theoretically the complexity is changing from L1(bbs) to L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), > so fill_always_executed_in_1's execution time is supposed to be increase from > O(n) to O(n2)? The time should depend on loop depth and bb counts. I also drafted a > test case has 73-depth loop function with 25 no-ipa function copies each compiled > in lim2 and lim4 dependently. Total execution time of fill_always_executed_in_1 is > increased from 32ms to 58ms, almost doubled but not quadratic? It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still O(n^2). > It seems reasonable to see compiling time getting longer since most bbs are checked > more but a MUST to ensure early break correctly in every loop level... > Though loop nodes could be huge, loop depth will never be so large in actual code? The "in practice" argument is almost always defeated by automatic program generators ;) > > > > I suspect you'll see quadratic behavior with your patch. You > > should be at least able to preserve a check like > > > > /* Do not process not always executed subloops to avoid > > quadratic behavior. */ > > if (bb->loop_father->header == bb > > && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > > break; > > > > which is of course not optimistic for cases like > > > > for (..) > > { > > if (cond) > > for (..) > > x = 1; // this is always executed if the inner loop is finite > > } > > > > but we need to have an eye on the complexity of this function. I would > > have suggested to do greedy visiting of the loop header successors, > > processing further blocks if all entries (ignoring backedges) are > > processed, setting SET_ALWAYS_EXECUTED_IN. 
When the worklist > > is empty proceed to inner loops as the current code does. For > > bookkeeping simply keep a to-visit-incoming-edges counter per BB. > > Pseudo-code: > > > > bitmap_set_bit (worklist, loop-header-bb); > > while (!bitmap_empty_p (worklist)) > > { > > bb = pop (worklist); > > Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN? Ah, sure. > > SET_ALWAYS_EXECUTED_IN (bb, loop); > > if (bitmap_bit_p (contains_call, bb->index)) > > continue; > > FOR_EACH_EDGE (e, ei, bb->succs) > > { > > if (!flow_bb_inside_loop_p (loop, e->dest)) > > continue; > > if (incoming_count[e->dest->index]-- == 0) > > push (worklist, e->dest); > > } > > } > > Sorry I don't fully understand your algorithm. worklist should be > auto_vec<basic_block> don't support bitmap operations? Is incoming_count > the bb's preds count, why need retain it since it it decreased to 0? the worklist is a auto_bitmap, pop () would be bitmap_first_set_bit/clear_bit. One could use a vector with the caveat of eventually adding duplicate members to the worklist. But as said, it's pseudo-code ;) > > > > iterate over inner loops (incoming_count can be retained, > > we just force the inner loop header onto the worklist). > > Is this same to ? > > for (loop = loop->inner; loop; loop = loop->next) > fill_always_executed_in_1 (loop, contains_call) Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which should iterate over loops in PRE order. Richard. ^ permalink raw reply [flat|nested] 17+ messages in thread
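Concretely, the bitmap-backed pop () can be sketched on a single 64-bit word standing in for auto_bitmap (purely illustrative, and using the GCC/Clang __builtin_ctzll builtin): push sets a bit, pop takes the lowest set bit and clears it, and a duplicate push is a no-op, which is exactly the caveat a plain vector worklist has.

```c
#include <assert.h>
#include <stdint.h>

/* Worklist as a bit set over at most 64 block indices.  */
static void
push (uint64_t *worklist, int bb)
{
  *worklist |= UINT64_C (1) << bb;   /* already-present bits stay put */
}

/* Pop the lowest-numbered pending block.  Caller must ensure the
   worklist is non-empty (ctzll of 0 is undefined).  */
static int
pop (uint64_t *worklist)
{
  int bb = __builtin_ctzll (*worklist); /* index of lowest set bit */
  *worklist &= *worklist - 1;           /* clear that bit */
  return bb;
}
```

Pushing blocks 5, 3, and then 5 again leaves exactly two pending entries, popped in index order 3 then 5.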
* Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 2021-08-24 8:20 ` Richard Biener @ 2021-08-26 5:50 ` Xionghu Luo 2021-08-27 7:45 ` Richard Biener 0 siblings, 1 reply; 17+ messages in thread From: Xionghu Luo @ 2021-08-26 5:50 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On 2021/8/24 16:20, Richard Biener wrote: > On Tue, 24 Aug 2021, Xionghu Luo wrote: > >> >> >> On 2021/8/19 20:11, Richard Biener wrote: >>>> - class loop *inn_loop = loop; >>>> >>>> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) >>>> { >>>> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) >>>> to disprove this if possible). */ >>>> if (bb->flags & BB_IRREDUCIBLE_LOOP) >>>> break; >>>> - >>>> - if (!flow_bb_inside_loop_p (inn_loop, bb)) >>>> - break; >>>> - >>>> - if (bb->loop_father->header == bb) >>>> - { >>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>> - break; >>>> - >>>> - /* In a loop that is always entered we may proceed anyway. >>>> - But record that we entered it and stop once we leave it. */ >>>> - inn_loop = bb->loop_father; >>>> - } >>>> } >>>> >>>> while (1) >>> I'm not sure this will work correct (I'm not sure how the existing >>> code makes it so either...). That said, I can't poke any hole >>> into the change. What I see is that definitely >>> >>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>> last = bb; >>> >>> if (bitmap_bit_p (contains_call, bb->index)) >>> break; >>> >>> doesn't work reliably since the DOM ordering will process blocks >>> A B and C in random order for >>> >>> for (;;) >>> { >>> if (cond) >>> { >>> A: foo (); >>> } >>> else B:; >>> C:; >>> } >>> >>> and thus we can end up setting 'last' to C_before_ processing >>> 'A' and thus arriving at the call foo () ... 
>>> >>> get_loop_body_in_dom_order does some "special sauce" but not >>> to address the above problem - but it might be that a subtle >>> issue like the above is the reason for the inner loop handling. >>> The inner loop block order does_not_ adhere to this "special sauce", >>> that is - the "Additionally, if a basic block s dominates >>> the latch, then only blocks dominated by s are be after it." >>> guarantee holds for the outer loop latch, not for the inner. >>> >>> Digging into the history of fill_always_executed_in_1 doesn't >>> reveal anything - the inner loop handling has been present >>> since introduction by Zdenek - but usually Zdenek has a reason >>> for doing things as he does;) >> >> Yes, this is really complicated usage, thanks for point it out. :) >> I constructed two cases to verify this with inner loop includes "If A; else B; C". >> Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also checks >> whether the bb domintes outer loop’s latch, if C dominate outer loop’s latch, >> C is postponed, the access order is ABC, 'last' won’t be set to C if A or B contains call; > > But it depends on the order of visiting ABC and that's hard to put into > a testcase since it depends on the order of edges and the processing > of the dominance computation. ABC are simply unordered with respect > to a dominator walk. > >> Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop, >> the access order is CAB, but 'last' also won’t be updated to C in fill_always_executed_in_1 >> since there is also dominate check, then if A or B contains call, it could break >> successfully. >> >> C won't be set to ALWAYS EXECUTED for both circumstance. >> >>> >>> Note it might be simply a measure against quadratic complexity, >>> esp. since with your patch we also dive into not always executed >>> subloops as you remove the >>> >>> if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>> break; >>> >>> check. 
I suggest to evaluate behavior of the patch on a testcase >>> like >>> >>> void foo (int n, int **k) >>> { >>> for (int i = 0; i < n; ++i) >>> if (k[0][i]) >>> for (int j = 0; j < n; ++j) >>> if (k[1][j]) >>> for (int l = 0; l < n; ++l) >>> if (k[2][l]) >>> ... >>> } >> >> Theoretically the complexity is changing from L1(bbs) to L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), >> so fill_always_executed_in_1's execution time is supposed to be increase from >> O(n) to O(n2)? The time should depend on loop depth and bb counts. I also drafted a >> test case has 73-depth loop function with 25 no-ipa function copies each compiled >> in lim2 and lim4 dependently. Total execution time of fill_always_executed_in_1 is >> increased from 32ms to 58ms, almost doubled but not quadratic? > > It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still > O(n^2). > >> It seems reasonable to see compiling time getting longer since most bbs are checked >> more but a MUST to ensure early break correctly in every loop level... >> Though loop nodes could be huge, loop depth will never be so large in actual code? > > The "in practice" argument is almost always defeated by automatic > program generators ;) > >>> >>> I suspect you'll see quadratic behavior with your patch. You >>> should be at least able to preserve a check like >>> >>> /* Do not process not always executed subloops to avoid >>> quadratic behavior. */ >>> if (bb->loop_father->header == bb >>> && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>> break; >>> >>> which is of course not optimistic for cases like >>> >>> for (..) >>> { >>> if (cond) >>> for (..) >>> x = 1; // this is always executed if the inner loop is finite >>> } >>> >>> but we need to have an eye on the complexity of this function. I would >>> have suggested to do greedy visiting of the loop header successors, >>> processing further blocks if all entries (ignoring backedges) are >>> processed, setting SET_ALWAYS_EXECUTED_IN. 
When the worklist >>> is empty proceed to inner loops as the current code does. For >>> bookkeeping simply keep a to-visit-incoming-edges counter per BB. >>> Pseudo-code: >>> >>> bitmap_set_bit (worklist, loop-header-bb); >>> while (!bitmap_empty_p (worklist)) >>> { >>> bb = pop (worklist); >> >> Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN? > > Ah, sure. > >>> SET_ALWAYS_EXECUTED_IN (bb, loop); >>> if (bitmap_bit_p (contains_call, bb->index)) >>> continue; >>> FOR_EACH_EDGE (e, ei, bb->succs) >>> { >>> if (!flow_bb_inside_loop_p (loop, e->dest)) >>> continue; >>> if (incoming_count[e->dest->index]-- == 0) >>> push (worklist, e->dest); >>> } >>> } >> >> Sorry I don't fully understand your algorithm. worklist should be >> auto_vec<basic_block> don't support bitmap operations? Is incoming_count >> the bb's preds count, why need retain it since it it decreased to 0? > > the worklist is a auto_bitmap, pop () would be > bitmap_first_set_bit/clear_bit. One could use a vector with the > caveat of eventually adding duplicate members to the worklist. > But as said, it's pseudo-code ;) Thanks a lot! > >>> >>> iterate over inner loops (incoming_count can be retained, >>> we just force the inner loop header onto the worklist). >> >> Is this same to ? >> >> for (loop = loop->inner; loop; loop = loop->next) >> fill_always_executed_in_1 (loop, contains_call) > > Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which > should iterate over loops in PRE order. Tried implementation with your pseudo-code, it works like the previous v2 solution, bb after inner loops could be marked as ALWAYS EXECUTED. You are really so strong! ;) Regression tested pass on P8LE. Assume, A: GCC Base B: inner loop check removal C: incoming count Execution time of fill_always_executed_in_1 for the 73-depth loop (with five function copies) is (A vs. B vs. C): 32ms vs 58ms vs 38ms. Much better than O(n^2)? Bumped the patch to v3 with TODOs: 1. 
The time measurement code will be removed if the v3 is OK; 2. loops_list is not used as my code is not rebased to upstream yet. 3. Function comments are not updated. [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 ALWAYS_EXECUTED_IN is not computed completely for nested loops. The current design exits if an inner loop doesn't dominate the outer loop's latch, or exits right after leaving the inner loop, which causes an early return from the outer loop; ALWAYS_EXECUTED blocks after inner loops are then skipped. Another problem is that it has to call get_loop_body_in_dom_order in each recursive call, which is also inefficient. This patch does greedy visiting of the loop header successors, processing further blocks if all entries (ignoring backedges) are processed, setting ALWAYS_EXECUTED_IN on blocks that dominate the loop's latch. When the worklist is empty, proceed to inner loops as the current code does. For bookkeeping, simply keep a to-visit-incoming-edges counter per BB, and pass it through the iteration over inner loops. Detailed discussion: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html gcc/ChangeLog: * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove inn_loop check. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/ssa-lim-19.c: New test. * gcc.dg/tree-ssa/ssa-lim-20.c: New test. --- gcc/tree-ssa-loop-im.c | 95 +++++++++++++--------- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 +++++++ 3 files changed, 111 insertions(+), 38 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index b24bc64f2a7..72be6b6bfac 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -3192,39 +3192,52 @@ do_store_motion (void) blocks that contain a nonpure call.
*/ static void -fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) +fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int *bbi) { - basic_block bb = NULL, *bbs, last = NULL; - unsigned i; + basic_block bb = NULL; edge e; - class loop *inn_loop = loop; if (ALWAYS_EXECUTED_IN (loop->header) == NULL) { - bbs = get_loop_body_in_dom_order (loop); + auto_bitmap work_set; + bitmap_clear (work_set); + bitmap_set_bit (work_set, loop->header->index); + unsigned bb_index; - for (i = 0; i < loop->num_nodes; i++) - { - edge_iterator ei; - bb = bbs[i]; + unsigned array_size = last_basic_block_for_fn (cfun) + 1; + int *bbd = XNEWVEC (int, array_size); + bbd = XDUPVEC (int, bbi, array_size); + while (!bitmap_empty_p (work_set)) + { + bb_index = bitmap_first_set_bit (work_set); + bitmap_clear_bit (work_set, bb_index); + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) - last = bb; - + SET_ALWAYS_EXECUTED_IN (bb, loop); if (bitmap_bit_p (contains_call, bb->index)) break; - + edge_iterator ei; FOR_EACH_EDGE (e, ei, bb->succs) { - /* If there is an exit from this BB. */ if (!flow_bb_inside_loop_p (loop, e->dest)) break; + /* Or we enter a possibly non-finite loop. */ if (flow_loop_nested_p (bb->loop_father, e->dest->loop_father) && ! finite_loop_p (e->dest->loop_father)) break; + + if (bbd[e->dest->index] == 1) + { + bitmap_set_bit (work_set, e->dest->index); + bbd[e->dest->index]--; + } + else if (bbd[e->dest->index] > 1) + bbd[e->dest->index]--; } + if (e) break; @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) to disprove this if possible). */ if (bb->flags & BB_IRREDUCIBLE_LOOP) break; - - if (!flow_bb_inside_loop_p (inn_loop, bb)) - break; - - if (bb->loop_father->header == bb) - { - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) - break; - - /* In a loop that is always entered we may proceed anyway. 
- But record that we entered it and stop once we leave it. */ - inn_loop = bb->loop_father; - } - } - - while (1) - { - SET_ALWAYS_EXECUTED_IN (last, loop); - if (last == loop->header) - break; - last = get_immediate_dominator (CDI_DOMINATORS, last); } - - free (bbs); + free (bbd); } for (loop = loop->inner; loop; loop = loop->next) - fill_always_executed_in_1 (loop, contains_call); + fill_always_executed_in_1 (loop, contains_call, bbi); } /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. @@ -3287,8 +3278,36 @@ fill_always_executed_in (void) bitmap_set_bit (contains_call, bb->index); } + mark_dfs_back_edges (); + unsigned array_size = last_basic_block_for_fn (cfun) + 1; + int *bb_incoming_edges = XNEWVEC (int, array_size); + memset (bb_incoming_edges, 0, array_size * sizeof (int)); + edge_iterator ei; + edge e; + + FOR_EACH_BB_FN (bb, cfun) + { + int len = bb->preds->length (); + FOR_EACH_EDGE (e, ei, bb->preds) + if (e->flags & EDGE_DFS_BACK) + len--; + bb_incoming_edges[bb->index] = len; + } + + static unsigned long diff = 0; + struct timeval tv; + gettimeofday (&tv, NULL); + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; + + //for (loop : loops_list (cfun, 0)) for (loop = current_loops->tree_root->inner; loop; loop = loop->next) - fill_always_executed_in_1 (loop, contains_call); + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); + + gettimeofday (&tv, NULL); + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; + diff += end - begin; + //printf ("%s,diff:%ld\n", current_function_name (), diff); + free (bb_incoming_edges); } diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c new file mode 100644 index 00000000000..097a5ee4a4b --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c @@ -0,0 +1,24 @@ +/* PR/101293 */ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +struct X { int i; int j; int k;}; + 
+void foo(struct X *x, int n, int l) +{ + for (int j = 0; j < l; j++) + { + for (int i = 0; i < n; ++i) + { + int *p = &x->j; + int tem = *p; + x->j += tem * i; + } + int *r = &x->k; + int tem2 = *r; + x->k += tem2 * j; + } +} + +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c new file mode 100644 index 00000000000..6e180de528e --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c @@ -0,0 +1,30 @@ +/* PR/101293 */ +/* { dg-do run } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q) +{ + while (--n) + { + do + { + *q += 1; + if (f) + goto out; + } + while (--m); + *p += 1; + } +out:; +} + +int main() +{ + int i = 0; + foo (10, 10, 1, (void *) 0, &i); + if (i != 1) + __builtin_abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */ -- 2.27.0.90.geebb51ba8c ^ permalink raw reply [flat|nested] 17+ messages in thread
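The mark_dfs_back_edges + predecessor-counting step used by the v3 patch can be mimicked standalone like this (a toy CFG with invented arrays; EDGE_DFS_BACK becomes a parallel flag table, all names hypothetical): a DFS flags edges that lead back into a block still on the DFS stack, and only the remaining edges count toward a block's to-visit counter.

```c
#include <assert.h>

#define NBB 4

/* Toy CFG with one loop: 0 -> 1; 1 -> 2; 2 -> 1 (back edge), 2 -> 3.  */
static const int nsucc[NBB] = { 1, 1, 2, 0 };
static const int succ[NBB][2] = { {1, 0}, {2, 0}, {1, 3}, {0, 0} };
static int back_edge[NBB][2];   /* parallel to succ[][], like EDGE_DFS_BACK */

/* DFS flagging edges into a block that is still on the DFS stack,
   in the spirit of mark_dfs_back_edges.
   state: 0 = unvisited, 1 = on stack, 2 = finished.  */
static void
dfs (int bb, int *state)
{
  state[bb] = 1;
  for (int s = 0; s < nsucc[bb]; s++)
    {
      int dest = succ[bb][s];
      if (state[dest] == 1)
        back_edge[bb][s] = 1;   /* closes a cycle: a back edge */
      else if (state[dest] == 0)
        dfs (dest, state);
    }
  state[bb] = 2;
}

/* Count incoming edges per block, ignoring the flagged back edges;
   this is the per-BB counter the worklist walk would start from.  */
static void
count_incoming (int incoming[NBB])
{
  int state[NBB] = { 0 };
  dfs (0, state);
  for (int b = 0; b < NBB; b++)
    incoming[b] = 0;
  for (int b = 0; b < NBB; b++)
    for (int s = 0; s < nsucc[b]; s++)
      if (!back_edge[b][s])
        incoming[succ[b][s]]++;
}
```

Here block 1 ends up with an incoming count of 1: the 2 -> 1 back edge is flagged and excluded, so the loop header does not wait on its own latch before being processed.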
* Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 2021-08-26 5:50 ` [PATCH v3] " Xionghu Luo @ 2021-08-27 7:45 ` Richard Biener 2021-08-30 8:49 ` Xionghu Luo 0 siblings, 1 reply; 17+ messages in thread From: Richard Biener @ 2021-08-27 7:45 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On Thu, 26 Aug 2021, Xionghu Luo wrote: > > > On 2021/8/24 16:20, Richard Biener wrote: > > On Tue, 24 Aug 2021, Xionghu Luo wrote: > > > >> > >> > >> On 2021/8/19 20:11, Richard Biener wrote: > >>>> - class loop *inn_loop = loop; > >>>> > >>>> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > >>>> { > >>>> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > >>>> to disprove this if possible). */ > >>>> if (bb->flags & BB_IRREDUCIBLE_LOOP) > >>>> break; > >>>> - > >>>> - if (!flow_bb_inside_loop_p (inn_loop, bb)) > >>>> - break; > >>>> - > >>>> - if (bb->loop_father->header == bb) > >>>> - { > >>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>> - break; > >>>> - > >>>> - /* In a loop that is always entered we may proceed anyway. > >>>> - But record that we entered it and stop once we leave it. */ > >>>> - inn_loop = bb->loop_father; > >>>> - } > >>>> } > >>>> > >>>> while (1) > >>> I'm not sure this will work correct (I'm not sure how the existing > >>> code makes it so either...). That said, I can't poke any hole > >>> into the change. 
What I see is that definitely > >>> > >>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>> last = bb; > >>> > >>> if (bitmap_bit_p (contains_call, bb->index)) > >>> break; > >>> > >>> doesn't work reliably since the DOM ordering will process blocks > >>> A B and C in random order for > >>> > >>> for (;;) > >>> { > >>> if (cond) > >>> { > >>> A: foo (); > >>> } > >>> else B:; > >>> C:; > >>> } > >>> > >>> and thus we can end up setting 'last' to C_before_ processing > >>> 'A' and thus arriving at the call foo () ... > >>> > >>> get_loop_body_in_dom_order does some "special sauce" but not > >>> to address the above problem - but it might be that a subtle > >>> issue like the above is the reason for the inner loop handling. > >>> The inner loop block order does_not_ adhere to this "special sauce", > >>> that is - the "Additionally, if a basic block s dominates > >>> the latch, then only blocks dominated by s are be after it." > >>> guarantee holds for the outer loop latch, not for the inner. > >>> > >>> Digging into the history of fill_always_executed_in_1 doesn't > >>> reveal anything - the inner loop handling has been present > >>> since introduction by Zdenek - but usually Zdenek has a reason > >>> for doing things as he does;) > >> > >> Yes, this is really complicated usage, thanks for point it out. :) > >> I constructed two cases to verify this with inner loop includes "If A; else B; C". > >> Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also checks > >> whether the bb domintes outer loop’s latch, if C dominate outer loop’s latch, > >> C is postponed, the access order is ABC, 'last' won’t be set to C if A or B contains call; > > > > But it depends on the order of visiting ABC and that's hard to put into > > a testcase since it depends on the order of edges and the processing > > of the dominance computation. ABC are simply unordered with respect > > to a dominator walk. 
> > > >> Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop, > >> the access order is CAB, but 'last' also won’t be updated to C in fill_always_executed_in_1 > >> since there is also dominate check, then if A or B contains call, it could break > >> successfully. > >> > >> C won't be set to ALWAYS EXECUTED for both circumstance. > >> > >>> > >>> Note it might be simply a measure against quadratic complexity, > >>> esp. since with your patch we also dive into not always executed > >>> subloops as you remove the > >>> > >>> if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>> break; > >>> > >>> check. I suggest to evaluate behavior of the patch on a testcase > >>> like > >>> > >>> void foo (int n, int **k) > >>> { > >>> for (int i = 0; i < n; ++i) > >>> if (k[0][i]) > >>> for (int j = 0; j < n; ++j) > >>> if (k[1][j]) > >>> for (int l = 0; l < n; ++l) > >>> if (k[2][l]) > >>> ... > >>> } > >> > >> Theoretically the complexity is changing from L1(bbs) to L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), > >> so fill_always_executed_in_1's execution time is supposed to be increase from > >> O(n) to O(n2)? The time should depend on loop depth and bb counts. I also drafted a > >> test case has 73-depth loop function with 25 no-ipa function copies each compiled > >> in lim2 and lim4 dependently. Total execution time of fill_always_executed_in_1 is > >> increased from 32ms to 58ms, almost doubled but not quadratic? > > > > It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still > > O(n^2). > > > >> It seems reasonable to see compiling time getting longer since most bbs are checked > >> more but a MUST to ensure early break correctly in every loop level... > >> Though loop nodes could be huge, loop depth will never be so large in actual code? > > > > The "in practice" argument is almost always defeated by automatic > > program generators ;) > > > >>> > >>> I suspect you'll see quadratic behavior with your patch. 
You > >>> should be at least able to preserve a check like > >>> > >>> /* Do not process not always executed subloops to avoid > >>> quadratic behavior. */ > >>> if (bb->loop_father->header == bb > >>> && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>> break; > >>> > >>> which is of course not optimistic for cases like > >>> > >>> for (..) > >>> { > >>> if (cond) > >>> for (..) > >>> x = 1; // this is always executed if the inner loop is finite > >>> } > >>> > >>> but we need to have an eye on the complexity of this function. I would > >>> have suggested to do greedy visiting of the loop header successors, > >>> processing further blocks if all entries (ignoring backedges) are > >>> processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist > >>> is empty proceed to inner loops as the current code does. For > >>> bookkeeping simply keep a to-visit-incoming-edges counter per BB. > >>> Pseudo-code: > >>> > >>> bitmap_set_bit (worklist, loop-header-bb); > >>> while (!bitmap_empty_p (worklist)) > >>> { > >>> bb = pop (worklist); > >> > >> Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN? > > > > Ah, sure. > > > >>> SET_ALWAYS_EXECUTED_IN (bb, loop); > >>> if (bitmap_bit_p (contains_call, bb->index)) > >>> continue; > >>> FOR_EACH_EDGE (e, ei, bb->succs) > >>> { > >>> if (!flow_bb_inside_loop_p (loop, e->dest)) > >>> continue; > >>> if (incoming_count[e->dest->index]-- == 0) > >>> push (worklist, e->dest); > >>> } > >>> } > >> > >> Sorry I don't fully understand your algorithm. worklist should be > >> auto_vec<basic_block> don't support bitmap operations? Is incoming_count > >> the bb's preds count, why need retain it since it it decreased to 0? > > > > the worklist is a auto_bitmap, pop () would be > > bitmap_first_set_bit/clear_bit. One could use a vector with the > > caveat of eventually adding duplicate members to the worklist. > > But as said, it's pseudo-code ;) > > Thanks a lot! 
> > > > >>> > >>> iterate over inner loops (incoming_count can be retained, > >>> we just force the inner loop header onto the worklist). > >> > >> Is this same to ? > >> > >> for (loop = loop->inner; loop; loop = loop->next) > >> fill_always_executed_in_1 (loop, contains_call) > > > > Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which > > should iterate over loops in PRE order. > > Tried implementation with your pseudo-code, it works like the > previous v2 solution, bb after inner loops could be marked > as ALWAYS EXECUTED. You are really so strong! ;) > Regression tested pass on P8LE. > > Assume, > > A: GCC Base > B: inner loop check removal > C: incoming count > > Execution time of fill_always_executed_in_1 for the 73-depth loop > (with five function copies) is > (A vs. B vs. C): 32ms vs 58ms vs 38ms. Much better than O(n^2)? > > > Bumped the patch to v3 with TODOs: > 1. The time measurement code will be removed if the v3 is OK; > 2. loops_list is not used as my code is not rebased to upstream yet. > 3. Function comments is not updated. > > > > [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 > > > ALWAYS_EXECUTED_IN is not computed completely for nested loops. > Current design will exit if an inner loop doesn't dominate outer > loop's latch or exit after exiting from inner loop, which > caused early return from outer loop, then ALWAYS EXECUTED blocks after > inner loops are skipped. Another problem is it has to call > get_loop_body_in_dom_order in each recursive call which is also > inefficient. > > This patch does greedy visiting of the loop header successors, > processing further blocks if all entries (ignoring backedges) are > processed, setting SET_ALWAYS_EXECUTED_IN of dominates loop's latch. > When the worklist is empty proceed to inner loops as the current > code does. For bookkeeping simply keep a to-visit-incoming-edges > counter per BB, and pass it through iteration over inner loops. 
> > Details discusions: > https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html > > gcc/ChangeLog: > > * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove > inn_loop check. > > gcc/testsuite/ChangeLog: > > * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > * gcc.dg/tree-ssa/ssa-lim-20.c: New test. > --- > gcc/tree-ssa-loop-im.c | 95 +++++++++++++--------- > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++ > gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 +++++++ > 3 files changed, 111 insertions(+), 38 deletions(-) > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > > diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > index b24bc64f2a7..72be6b6bfac 100644 > --- a/gcc/tree-ssa-loop-im.c > +++ b/gcc/tree-ssa-loop-im.c > @@ -3192,39 +3192,52 @@ do_store_motion (void) > blocks that contain a nonpure call. */ > > static void > -fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > +fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int *bbi) > { > - basic_block bb = NULL, *bbs, last = NULL; > - unsigned i; > + basic_block bb = NULL; > edge e; > - class loop *inn_loop = loop; > > if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > { > - bbs = get_loop_body_in_dom_order (loop); > + auto_bitmap work_set; > + bitmap_clear (work_set); bitmap_clear isn't necessary > + bitmap_set_bit (work_set, loop->header->index); > + unsigned bb_index; > > - for (i = 0; i < loop->num_nodes; i++) > - { > - edge_iterator ei; > - bb = bbs[i]; > + unsigned array_size = last_basic_block_for_fn (cfun) + 1; > + int *bbd = XNEWVEC (int, array_size); > + bbd = XDUPVEC (int, bbi, array_size); I don't think you need to copy 'bbi' but you can re-use the state from the outer loop processing. Did you run into any issues with that? 
> + while (!bitmap_empty_p (work_set)) > + { > + bb_index = bitmap_first_set_bit (work_set); > + bitmap_clear_bit (work_set, bb_index); > + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); > if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > - last = bb; > - > + SET_ALWAYS_EXECUTED_IN (bb, loop); > if (bitmap_bit_p (contains_call, bb->index)) > break; I think you want to continue; here (process remaining worklist but not continue greedy walking this block) > - > + edge_iterator ei; > FOR_EACH_EDGE (e, ei, bb->succs) > { > - /* If there is an exit from this BB. */ > if (!flow_bb_inside_loop_p (loop, e->dest)) > break; in particular this should keep the outer 'bbi' valid to re-use. But again, you want 'continue;' the greedy walk to other edges. If that's not valid (I'd need to think about this) then with your patch whether we process an edge depends on the order of the edge visit so you'd have to walk successors twice, once to determine whether we can greedily walk any of it and once to actually do the greedy walk. So thinking about it an exit edge is like a not returning call and thus we indeed should not process any outgoing edges of this block. > + > /* Or we enter a possibly non-finite loop. */ > if (flow_loop_nested_p (bb->loop_father, > e->dest->loop_father) > && ! finite_loop_p (e->dest->loop_father)) > break; I think this is no longer necessary? In any case it would again be 'continue;', see also above. I'm missing skipping of backedges in the walk. > + > + if (bbd[e->dest->index] == 1) > + { > + bitmap_set_bit (work_set, e->dest->index); > + bbd[e->dest->index]--; > + } > + else if (bbd[e->dest->index] > 1) > + bbd[e->dest->index]--; maybe just bbd[e->dest->index]--; if (bbd[e->dest->index] == 0) bitmap_set_bit (work_set, e->dest->index); > } > + > if (e) > break; continue; processing the worklist > > @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > to disprove this if possible). 
*/ > if (bb->flags & BB_IRREDUCIBLE_LOOP) > break; continue; processing the worklist. I think this has to come early, before processing successors, next to the contains_call processing. We could think of ignoring entering of irreducible regions by tracking exits from them, but I'm not sure we're identifying the regions in a good enough way to allow this. Likewise possibly infinite inner loops can be processed when we track exits from those which would be sth like /* When this is an exit from an inner loop that we cannot prove as finite do not follow this edge. */ if (bb->loop_father != loop && loop_exit_edge_p (bb->loop_father, e) && ! finite_loop_p (bb->loop_father)) continue; > - > - if (!flow_bb_inside_loop_p (inn_loop, bb)) > - break; > - > - if (bb->loop_father->header == bb) > - { > - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > - break; > - > - /* In a loop that is always entered we may proceed anyway. > - But record that we entered it and stop once we leave it. */ > - inn_loop = bb->loop_father; > - } > - } > - > - while (1) > - { > - SET_ALWAYS_EXECUTED_IN (last, loop); > - if (last == loop->header) > - break; > - last = get_immediate_dominator (CDI_DOMINATORS, last); > } > - > - free (bbs); > + free (bbd); > } > > for (loop = loop->inner; loop; loop = loop->next) > - fill_always_executed_in_1 (loop, contains_call); > + fill_always_executed_in_1 (loop, contains_call, bbi); > } > > /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. > @@ -3287,8 +3278,36 @@ fill_always_executed_in (void) > bitmap_set_bit (contains_call, bb->index); > } > > + mark_dfs_back_edges (); Please put this to loop_invariant_motion_in_fun right before the call to fill_always_executed_in. > + unsigned array_size = last_basic_block_for_fn (cfun) + 1; > + int *bb_incoming_edges = XNEWVEC (int, array_size); There's XCNEWVEC that also zeros the array. 
> + memset (bb_incoming_edges, 0, array_size * sizeof (int)); > + edge_iterator ei; > + edge e; > + > + FOR_EACH_BB_FN (bb, cfun) > + { > + int len = bb->preds->length (); > + FOR_EACH_EDGE (e, ei, bb->preds) > + if (e->flags & EDGE_DFS_BACK) > + len--; > + bb_incoming_edges[bb->index] = len; > + } it would be nice to combine this with the preceding loop over all blocks. > + static unsigned long diff = 0; > + struct timeval tv; > + gettimeofday (&tv, NULL); > + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; > + > + //for (loop : loops_list (cfun, 0)) > for (loop = current_loops->tree_root->inner; loop; loop = loop->next) > - fill_always_executed_in_1 (loop, contains_call); > + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); > + > + gettimeofday (&tv, NULL); > + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; > + diff += end - begin; > + //printf ("%s,diff:%ld\n", current_function_name (), diff); > + free (bb_incoming_edges); Eh, looks very much like work in progress. From the O() complexity point of view, duplicating bb_incoming_edges for each loop makes it quadratic again, but as said I think we don't need this duplication. I'd like to see the recursion removed, did the for (loop : ...) not work out? Thanks, Richard. 
> } > > > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > new file mode 100644 > index 00000000000..097a5ee4a4b > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > @@ -0,0 +1,24 @@ > +/* PR/101293 */ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +struct X { int i; int j; int k;}; > + > +void foo(struct X *x, int n, int l) > +{ > + for (int j = 0; j < l; j++) > + { > + for (int i = 0; i < n; ++i) > + { > + int *p = &x->j; > + int tem = *p; > + x->j += tem * i; > + } > + int *r = &x->k; > + int tem2 = *r; > + x->k += tem2 * j; > + } > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ > + > diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > new file mode 100644 > index 00000000000..6e180de528e > --- /dev/null > +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > @@ -0,0 +1,30 @@ > +/* PR/101293 */ > +/* { dg-do run } */ > +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ > + > +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q) > +{ > + while (--n) > + { > + do > + { > + *q += 1; > + if (f) > + goto out; > + } > + while (--m); > + *p += 1; > + } > +out:; > +} > + > +int main() > +{ > + int i = 0; > + foo (10, 10, 1, (void *) 0, &i); > + if (i != 1) > + __builtin_abort (); > + return 0; > +} > + > +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */ > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
* Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 2021-08-27 7:45 ` Richard Biener @ 2021-08-30 8:49 ` Xionghu Luo 2021-08-30 9:19 ` Richard Biener 0 siblings, 1 reply; 17+ messages in thread From: Xionghu Luo @ 2021-08-30 8:49 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw [-- Attachment #1: Type: text/plain, Size: 19611 bytes --] On 2021/8/27 15:45, Richard Biener wrote: > On Thu, 26 Aug 2021, Xionghu Luo wrote: > >> >> >> On 2021/8/24 16:20, Richard Biener wrote: >>> On Tue, 24 Aug 2021, Xionghu Luo wrote: >>> >>>> >>>> >>>> On 2021/8/19 20:11, Richard Biener wrote: >>>>>> - class loop *inn_loop = loop; >>>>>> >>>>>> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) >>>>>> { >>>>>> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) >>>>>> to disprove this if possible). */ >>>>>> if (bb->flags & BB_IRREDUCIBLE_LOOP) >>>>>> break; >>>>>> - >>>>>> - if (!flow_bb_inside_loop_p (inn_loop, bb)) >>>>>> - break; >>>>>> - >>>>>> - if (bb->loop_father->header == bb) >>>>>> - { >>>>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>>>> - break; >>>>>> - >>>>>> - /* In a loop that is always entered we may proceed anyway. >>>>>> - But record that we entered it and stop once we leave it. */ >>>>>> - inn_loop = bb->loop_father; >>>>>> - } >>>>>> } >>>>>> >>>>>> while (1) >>>>> I'm not sure this will work correct (I'm not sure how the existing >>>>> code makes it so either...). That said, I can't poke any hole >>>>> into the change. 
What I see is that definitely >>>>> >>>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>>> last = bb; >>>>> >>>>> if (bitmap_bit_p (contains_call, bb->index)) >>>>> break; >>>>> >>>>> doesn't work reliably since the DOM ordering will process blocks >>>>> A B and C in random order for >>>>> >>>>> for (;;) >>>>> { >>>>> if (cond) >>>>> { >>>>> A: foo (); >>>>> } >>>>> else B:; >>>>> C:; >>>>> } >>>>> >>>>> and thus we can end up setting 'last' to C_before_ processing >>>>> 'A' and thus arriving at the call foo () ... >>>>> >>>>> get_loop_body_in_dom_order does some "special sauce" but not >>>>> to address the above problem - but it might be that a subtle >>>>> issue like the above is the reason for the inner loop handling. >>>>> The inner loop block order does_not_ adhere to this "special sauce", >>>>> that is - the "Additionally, if a basic block s dominates >>>>> the latch, then only blocks dominated by s are be after it." >>>>> guarantee holds for the outer loop latch, not for the inner. >>>>> >>>>> Digging into the history of fill_always_executed_in_1 doesn't >>>>> reveal anything - the inner loop handling has been present >>>>> since introduction by Zdenek - but usually Zdenek has a reason >>>>> for doing things as he does;) >>>> >>>> Yes, this is really complicated usage, thanks for point it out. :) >>>> I constructed two cases to verify this with inner loop includes "If A; else B; C". >>>> Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also checks >>>> whether the bb domintes outer loop’s latch, if C dominate outer loop’s latch, >>>> C is postponed, the access order is ABC, 'last' won’t be set to C if A or B contains call; >>> >>> But it depends on the order of visiting ABC and that's hard to put into >>> a testcase since it depends on the order of edges and the processing >>> of the dominance computation. ABC are simply unordered with respect >>> to a dominator walk. 
>>> >>>> Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop, >>>> the access order is CAB, but 'last' also won’t be updated to C in fill_always_executed_in_1 >>>> since there is also dominate check, then if A or B contains call, it could break >>>> successfully. >>>> >>>> C won't be set to ALWAYS EXECUTED for both circumstance. >>>> >>>>> >>>>> Note it might be simply a measure against quadratic complexity, >>>>> esp. since with your patch we also dive into not always executed >>>>> subloops as you remove the >>>>> >>>>> if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>>> break; >>>>> >>>>> check. I suggest to evaluate behavior of the patch on a testcase >>>>> like >>>>> >>>>> void foo (int n, int **k) >>>>> { >>>>> for (int i = 0; i < n; ++i) >>>>> if (k[0][i]) >>>>> for (int j = 0; j < n; ++j) >>>>> if (k[1][j]) >>>>> for (int l = 0; l < n; ++l) >>>>> if (k[2][l]) >>>>> ... >>>>> } >>>> >>>> Theoretically the complexity is changing from L1(bbs) to L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), >>>> so fill_always_executed_in_1's execution time is supposed to be increase from >>>> O(n) to O(n2)? The time should depend on loop depth and bb counts. I also drafted a >>>> test case has 73-depth loop function with 25 no-ipa function copies each compiled >>>> in lim2 and lim4 dependently. Total execution time of fill_always_executed_in_1 is >>>> increased from 32ms to 58ms, almost doubled but not quadratic? >>> >>> It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still >>> O(n^2). >>> >>>> It seems reasonable to see compiling time getting longer since most bbs are checked >>>> more but a MUST to ensure early break correctly in every loop level... >>>> Though loop nodes could be huge, loop depth will never be so large in actual code? >>> >>> The "in practice" argument is almost always defeated by automatic >>> program generators ;) >>> >>>>> >>>>> I suspect you'll see quadratic behavior with your patch. 
You >>>>> should be at least able to preserve a check like >>>>> >>>>> /* Do not process not always executed subloops to avoid >>>>> quadratic behavior. */ >>>>> if (bb->loop_father->header == bb >>>>> && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>>> break; >>>>> >>>>> which is of course not optimistic for cases like >>>>> >>>>> for (..) >>>>> { >>>>> if (cond) >>>>> for (..) >>>>> x = 1; // this is always executed if the inner loop is finite >>>>> } >>>>> >>>>> but we need to have an eye on the complexity of this function. I would >>>>> have suggested to do greedy visiting of the loop header successors, >>>>> processing further blocks if all entries (ignoring backedges) are >>>>> processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist >>>>> is empty proceed to inner loops as the current code does. For >>>>> bookkeeping simply keep a to-visit-incoming-edges counter per BB. >>>>> Pseudo-code: >>>>> >>>>> bitmap_set_bit (worklist, loop-header-bb); >>>>> while (!bitmap_empty_p (worklist)) >>>>> { >>>>> bb = pop (worklist); >>>> >>>> Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN? >>> >>> Ah, sure. >>> >>>>> SET_ALWAYS_EXECUTED_IN (bb, loop); >>>>> if (bitmap_bit_p (contains_call, bb->index)) >>>>> continue; >>>>> FOR_EACH_EDGE (e, ei, bb->succs) >>>>> { >>>>> if (!flow_bb_inside_loop_p (loop, e->dest)) >>>>> continue; >>>>> if (incoming_count[e->dest->index]-- == 0) >>>>> push (worklist, e->dest); >>>>> } >>>>> } >>>> >>>> Sorry I don't fully understand your algorithm. worklist should be >>>> auto_vec<basic_block> don't support bitmap operations? Is incoming_count >>>> the bb's preds count, why need retain it since it it decreased to 0? >>> >>> the worklist is a auto_bitmap, pop () would be >>> bitmap_first_set_bit/clear_bit. One could use a vector with the >>> caveat of eventually adding duplicate members to the worklist. >>> But as said, it's pseudo-code ;) >> >> Thanks a lot! 
>> >>> >>>>> >>>>> iterate over inner loops (incoming_count can be retained, >>>>> we just force the inner loop header onto the worklist). >>>> >>>> Is this same to ? >>>> >>>> for (loop = loop->inner; loop; loop = loop->next) >>>> fill_always_executed_in_1 (loop, contains_call) >>> >>> Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which >>> should iterate over loops in PRE order. >> >> Tried implementation with your pseudo-code, it works like the >> previous v2 solution, bb after inner loops could be marked >> as ALWAYS EXECUTED. You are really so strong! ;) >> Regression tested pass on P8LE. >> >> Assume, >> >> A: GCC Base >> B: inner loop check removal >> C: incoming count >> >> Execution time of fill_always_executed_in_1 for the 73-depth loop >> (with five function copies) is >> (A vs. B vs. C): 32ms vs 58ms vs 38ms. Much better than O(n^2)? Sorry didn't get C's execution time accurate enough due to forgot to remove dump file code in it. C's execution time is 25 ms after code refine. It's even better than A. Attached the test case ssa-lim-22.c used for time measurement. >> >> >> Bumped the patch to v3 with TODOs: >> 1. The time measurement code will be removed if the v3 is OK; >> 2. loops_list is not used as my code is not rebased to upstream yet. >> 3. Function comments is not updated. >> >> >> >> [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 >> >> >> ALWAYS_EXECUTED_IN is not computed completely for nested loops. >> Current design will exit if an inner loop doesn't dominate outer >> loop's latch or exit after exiting from inner loop, which >> caused early return from outer loop, then ALWAYS EXECUTED blocks after >> inner loops are skipped. Another problem is it has to call >> get_loop_body_in_dom_order in each recursive call which is also >> inefficient. 
>> >> This patch does greedy visiting of the loop header successors, >> processing further blocks if all entries (ignoring backedges) are >> processed, setting SET_ALWAYS_EXECUTED_IN of dominates loop's latch. >> When the worklist is empty proceed to inner loops as the current >> code does. For bookkeeping simply keep a to-visit-incoming-edges >> counter per BB, and pass it through iteration over inner loops. >> >> Details discusions: >> https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html >> >> gcc/ChangeLog: >> >> * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove >> inn_loop check. >> >> gcc/testsuite/ChangeLog: >> >> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. >> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. >> --- >> gcc/tree-ssa-loop-im.c | 95 +++++++++++++--------- >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++ >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 +++++++ >> 3 files changed, 111 insertions(+), 38 deletions(-) >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c >> >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c >> index b24bc64f2a7..72be6b6bfac 100644 >> --- a/gcc/tree-ssa-loop-im.c >> +++ b/gcc/tree-ssa-loop-im.c >> @@ -3192,39 +3192,52 @@ do_store_motion (void) >> blocks that contain a nonpure call. */ >> >> static void >> -fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) >> +fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int *bbi) >> { >> - basic_block bb = NULL, *bbs, last = NULL; >> - unsigned i; >> + basic_block bb = NULL; >> edge e; >> - class loop *inn_loop = loop; >> >> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) >> { >> - bbs = get_loop_body_in_dom_order (loop); >> + auto_bitmap work_set; >> + bitmap_clear (work_set); > > bitmap_clear isn't necessary Removed. 
> >> + bitmap_set_bit (work_set, loop->header->index); >> + unsigned bb_index; >> >> - for (i = 0; i < loop->num_nodes; i++) >> - { >> - edge_iterator ei; >> - bb = bbs[i]; >> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; >> + int *bbd = XNEWVEC (int, array_size); >> + bbd = XDUPVEC (int, bbi, array_size); > > I don't think you need to copy 'bbi' but you can re-use the > state from the outer loop processing. Did you run into any > issues with that? Yes. For example, after adding a small if-else block to ssa-lim-19.c, the block "x->j += tem * i;" (bb 6) is always executed for loop 2. When fill_always_executed_in_1 is called for loop 1, bbi[6] is decreased from 2 to 1 to 0; if fill_always_executed_in_1 is then called again for loop 2, the counter is not reset, so bb 6 won't be marked ALWAYS_EXECUTED, which is wrong. struct X { int i; int j; int k;}; void foo(struct X *x, int n, int l, int m) { for (int j = 0; j < l; j++) // loop 1 { for (int i = 0; i < n; ++i) // loop 2 { if (m) x->j++; else x->j = m+n+l; int *p = &x->j; // bb 6 int tem = *p; x->j += tem * i; } int *r = &x->k; int tem2 = *r; x->k += tem2 * j; } } > >> + while (!bitmap_empty_p (work_set)) >> + { >> + bb_index = bitmap_first_set_bit (work_set); >> + bitmap_clear_bit (work_set, bb_index); >> + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); >> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >> - last = bb; >> - >> + SET_ALWAYS_EXECUTED_IN (bb, loop); >> if (bitmap_bit_p (contains_call, bb->index)) >> break; > > I think you want to continue; here (process remaining worklist > but not continue greedy walking this block) Same as above: if 'continue' is used instead of 'break', the algorithm doesn't work either. If an inner loop contains a jump to the outermost loop, the blocks after the jump block will be marked ALWAYS_EXECUTED incorrectly. > >> - >> + edge_iterator ei; >> FOR_EACH_EDGE (e, ei, bb->succs) >> { >> - /* If there is an exit from this BB. 
*/ >> if (!flow_bb_inside_loop_p (loop, e->dest)) >> break; > > in particular this should keep the outer 'bbi' valid to re-use. > > But again, you want 'continue;' the greedy walk to other edges. > If that's not valid (I'd need to think about this) then with > your patch whether we process an edge depends on the order > of the edge visit so you'd have to walk successors twice, > once to determine whether we can greedily walk any of it > and once to actually do the greedy walk. > > So thinking about it an exit edge is like a not returning call > and thus we indeed should not process any outgoing edges of > this block. > >> + >> /* Or we enter a possibly non-finite loop. */ >> if (flow_loop_nested_p (bb->loop_father, >> e->dest->loop_father) >> && ! finite_loop_p (e->dest->loop_father)) >> break; > > I think this is no longer necessary? In any case it would > again be 'continue;', see also above. > > I'm missing skipping of backedges in the walk. > >> + >> + if (bbd[e->dest->index] == 1) >> + { >> + bitmap_set_bit (work_set, e->dest->index); >> + bbd[e->dest->index]--; >> + } >> + else if (bbd[e->dest->index] > 1) >> + bbd[e->dest->index]--; > > maybe just > > bbd[e->dest->index]--; > if (bbd[e->dest->index] == 0) > bitmap_set_bit (work_set, e->dest->index); They are not equivalent. Former won't decrease if bbd[e->dest->index] is 0, but the latter does. > >> } >> + >> if (e) >> break; > > continue; processing the worklist > >> >> @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) >> to disprove this if possible). */ >> if (bb->flags & BB_IRREDUCIBLE_LOOP) >> break; > > continue; processing the worklist. I think this has to > come early, before processing successors, next to > the contains_call processing. We could think of ignoring > entering of irreducible regions by tracking exits from them, > but I'm not sure we're identifying the regions in a good > enough way to allow this. 
> > Likewise possibly infinite inner loops can be processed > when we track exits from those which would be sth like > > /* When this is an exit from an inner loop that we cannot > prove as finite do not follow this edge. */ > if (bb->loop_father != loop > && loop_exit_edge_p (bb->loop_father, e) > && ! finite_loop_p (bb->loop_father)) > continue; > >> - >> - if (!flow_bb_inside_loop_p (inn_loop, bb)) >> - break; >> - >> - if (bb->loop_father->header == bb) >> - { >> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >> - break; >> - >> - /* In a loop that is always entered we may proceed anyway. >> - But record that we entered it and stop once we leave it. */ >> - inn_loop = bb->loop_father; >> - } >> - } >> - >> - while (1) >> - { >> - SET_ALWAYS_EXECUTED_IN (last, loop); >> - if (last == loop->header) >> - break; >> - last = get_immediate_dominator (CDI_DOMINATORS, last); >> } >> - >> - free (bbs); >> + free (bbd); >> } >> >> for (loop = loop->inner; loop; loop = loop->next) >> - fill_always_executed_in_1 (loop, contains_call); >> + fill_always_executed_in_1 (loop, contains_call, bbi); >> } >> >> /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. >> @@ -3287,8 +3278,36 @@ fill_always_executed_in (void) >> bitmap_set_bit (contains_call, bb->index); >> } >> >> + mark_dfs_back_edges (); > > Please put this to loop_invariant_motion_in_fun right before > the call to fill_always_executed_in. Done. > >> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; >> + int *bb_incoming_edges = XNEWVEC (int, array_size); > > There's XCNEWVEC that also zeros the array. OK, thanks. 
> >> + memset (bb_incoming_edges, 0, array_size * sizeof (int)); >> + edge_iterator ei; >> + edge e; >> + >> + FOR_EACH_BB_FN (bb, cfun) >> + { >> + int len = bb->preds->length (); >> + FOR_EACH_EDGE (e, ei, bb->preds) >> + if (e->flags & EDGE_DFS_BACK) >> + len--; >> + bb_incoming_edges[bb->index] = len; >> + } > > it would be nice to combine this with the preceding loop over > all blocks. Done. > >> + static unsigned long diff = 0; >> + struct timeval tv; >> + gettimeofday (&tv, NULL); >> + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; >> + >> + //for (loop : loops_list (cfun, 0)) >> for (loop = current_loops->tree_root->inner; loop; loop = loop->next) >> - fill_always_executed_in_1 (loop, contains_call); >> + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); >> + >> + gettimeofday (&tv, NULL); >> + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; >> + diff += end - begin; >> + //printf ("%s,diff:%ld\n", current_function_name (), diff); >> + free (bb_incoming_edges); > > Eh, looks very much like work in progress. From the O() complexity > point duplicating bb_incoming_edges for each loop makes it quadratic > again but as said I think we don't need this duplication. Sorry, I didn't quite follow your comments above about reusing bb_incoming_edges. We search the BBs that dominate loop->latch loop by loop, independently; if the outer loop changes the inner loops' bb_incoming_edges, won't that make the inner loop's ALWAYS_EXECUTED computation incorrect? It seems duplicating bb_incoming_edges doesn't increase the complexity, but the recursive call does? > > I'd like to see the recursion removed, did the for (loop : ...) > not work out? The recursion in fill_always_executed_in_1 could be removed, since loops_list can iterate over all inner loops, but "for (loop = current_loops->tree_root->inner; loop; loop = loop->next)" only iterates over sibling loops at one level? 
-- Thanks, Xionghu [-- Attachment #2: v4-0001-Fix-incomplete-computation-in-fill_always_execute.patch --] [-- Type: text/plain, Size: 8499 bytes --] From 6047a7b8bbef48794bc3d1608c64ae82e5716430 Mon Sep 17 00:00:00 2001 From: Xiong Hu Luo <luoxhu@linux.ibm.com> Date: Tue, 17 Aug 2021 20:33:26 -0500 Subject: [PATCH v4] Fix incomplete computation in fill_always_executed_in_1 ALWAYS_EXECUTED_IN is not computed completely for nested loops. The current design exits if an inner loop doesn't dominate the outer loop's latch, or if an exit is taken after leaving an inner loop; this causes an early return from the outer loop, so ALWAYS_EXECUTED blocks after inner loops are skipped. Another problem is that it has to call get_loop_body_in_dom_order in each recursive call, which is inefficient. This patch does greedy visiting of the loop header successors, processing further blocks once all of their entries (ignoring backedges) are processed, and calling SET_ALWAYS_EXECUTED_IN on blocks that dominate the loop's latch. When the worklist is empty, proceed to inner loops. For bookkeeping, simply keep a to-visit-incoming-edges counter per BB and pass it through to inner loops. Detailed discussion: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html gcc/ChangeLog: * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove inn_loop check. Greedily visit loop header successors and update edge counts. (fill_always_executed_in): Compute bb_incoming_edges and pass it down. (loop_invariant_motion_in_fun): Call mark_dfs_back_edges. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/ssa-lim-19.c: New test. * gcc.dg/tree-ssa/ssa-lim-20.c: New test. 
--- gcc/tree-ssa-loop-im.c | 109 ++++++++++++--------- gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 31 ++++++ gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 ++++++ 3 files changed, 125 insertions(+), 45 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index b24bc64f2a7..5d69d47eaa6 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -3192,30 +3192,48 @@ do_store_motion (void) blocks that contain a nonpure call. */ static void -fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) +fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int *bbi) { - basic_block bb = NULL, *bbs, last = NULL; - unsigned i; + basic_block bb = NULL; edge e; - class loop *inn_loop = loop; if (ALWAYS_EXECUTED_IN (loop->header) == NULL) { - bbs = get_loop_body_in_dom_order (loop); + auto_bitmap work_set; + bitmap_set_bit (work_set, loop->header->index); + unsigned bb_index; - for (i = 0; i < loop->num_nodes; i++) - { - edge_iterator ei; - bb = bbs[i]; + unsigned array_size = last_basic_block_for_fn (cfun) + 1; + int *bbd = XCNEWVEC (int, array_size); + bbd = XDUPVEC (int, bbi, array_size); + while (!bitmap_empty_p (work_set)) + { + bb_index = bitmap_first_set_bit (work_set); + bitmap_clear_bit (work_set, bb_index); + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) - last = bb; - + { + SET_ALWAYS_EXECUTED_IN (bb, loop); + printf ("always executed: bb->index:%d, loop->num: %d\n", bb->index, loop->num); + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + "always executed: bb->index:%d, loop->num: %d\n", + bb->index, loop->num); + } if (bitmap_bit_p (contains_call, bb->index)) break; + /* A loop might be infinite (TODO use simple loop analysis + to disprove this if possible). 
*/ + if (bb->flags & BB_IRREDUCIBLE_LOOP) + break; + + edge_iterator ei; FOR_EACH_EDGE (e, ei, bb->succs) { + if (e->dest == loop->latch) + break; /* If there is an exit from this BB. */ if (!flow_bb_inside_loop_p (loop, e->dest)) break; @@ -3224,42 +3242,20 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) e->dest->loop_father) && ! finite_loop_p (e->dest->loop_father)) break; - } - if (e) - break; - - /* A loop might be infinite (TODO use simple loop analysis - to disprove this if possible). */ - if (bb->flags & BB_IRREDUCIBLE_LOOP) - break; - - if (!flow_bb_inside_loop_p (inn_loop, bb)) - break; - - if (bb->loop_father->header == bb) - { - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) - break; - /* In a loop that is always entered we may proceed anyway. - But record that we entered it and stop once we leave it. */ - inn_loop = bb->loop_father; + if (bbd[e->dest->index] == 1) + { + bitmap_set_bit (work_set, e->dest->index); + bbd[e->dest->index]--; + } + else if (bbd[e->dest->index] > 1) + bbd[e->dest->index]--; } - } - - while (1) - { - SET_ALWAYS_EXECUTED_IN (last, loop); - if (last == loop->header) + if (e) break; - last = get_immediate_dominator (CDI_DOMINATORS, last); } - - free (bbs); + free (bbd); } - - for (loop = loop->inner; loop; loop = loop->next) - fill_always_executed_in_1 (loop, contains_call); } /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. 
@@ -3270,12 +3266,22 @@ static void fill_always_executed_in (void) { basic_block bb; - class loop *loop; + edge_iterator ei; + edge e; + + unsigned array_size = last_basic_block_for_fn (cfun) + 1; + int *bb_incoming_edges = XCNEWVEC (int, array_size); auto_sbitmap contains_call (last_basic_block_for_fn (cfun)); bitmap_clear (contains_call); FOR_EACH_BB_FN (bb, cfun) { + int len = bb->preds->length (); + FOR_EACH_EDGE (e, ei, bb->preds) + if (e->flags & EDGE_DFS_BACK) + len--; + bb_incoming_edges[bb->index] = len; + gimple_stmt_iterator gsi; for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi)) { @@ -3287,8 +3293,19 @@ fill_always_executed_in (void) bitmap_set_bit (contains_call, bb->index); } - for (loop = current_loops->tree_root->inner; loop; loop = loop->next) - fill_always_executed_in_1 (loop, contains_call); + static unsigned long diff = 0; + struct timeval tv; + gettimeofday (&tv, NULL); + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; + + for (auto loop : loops_list (cfun, 0)) + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); + + gettimeofday (&tv, NULL); + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; + diff += end - begin; + printf ("%s,diff:%ld\n", current_function_name (), diff); + free (bb_incoming_edges); } @@ -3391,6 +3408,8 @@ loop_invariant_motion_in_fun (function *fun, bool store_motion) /* Gathers information about memory accesses in the loops. */ analyze_memory_references (store_motion); + mark_dfs_back_edges (); + /* Fills ALWAYS_EXECUTED_IN information for basic blocks. 
*/ fill_always_executed_in (); diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c new file mode 100644 index 00000000000..3d5fe6f0314 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c @@ -0,0 +1,31 @@ +/* PR/101293 */ +/* { dg-do compile } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +struct X { int i; int j; int k;}; + +void foo(struct X *x, int n, int l, int m) +{ + for (int j = 0; j < l; j++) + { + for (int i = 0; i < n; ++i) + { +#if 1 + if (m) + x->j++; + else + x->j = m+n+l; +#endif + + int *p = &x->j; + int tem = *p; + x->j += tem * i; + } + int *r = &x->k; + int tem2 = *r; + x->k += tem2 * j; + } +} + +/* { dg-final { scan-tree-dump-times "Executing store motion" 2 "lim2" } } */ + diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c new file mode 100644 index 00000000000..6e180de528e --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c @@ -0,0 +1,30 @@ +/* PR/101293 */ +/* { dg-do run } */ +/* { dg-options "-O2 -fdump-tree-lim2-details" } */ + +void __attribute__ ((noipa)) foo (int n, int m, int f, int *p, int *q) +{ + while (--n) + { + do + { + *q += 1; + if (f) + goto out; + } + while (--m); + *p += 1; + } +out:; +} + +int main() +{ + int i = 0; + foo (10, 10, 1, (void *) 0, &i); + if (i != 1) + __builtin_abort (); + return 0; +} + +/* { dg-final { scan-tree-dump-times "Executing store motion" 1 "lim2" } } */ -- 2.27.0.90.geebb51ba8c [-- Attachment #3: ssa-lim-22.c --] [-- Type: text/plain, Size: 4837 bytes --] #include <stdlib.h> #include <stdio.h> int n; int **k; int m; #define FOOAB(A, B) void __attribute__((noipa)) foo##A##B () { \ for (int i = 0; i < n; ++i) \ { \ if (k[0][i]) \ { \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; 
a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if 
(k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ for (int j = 0; j < n; ++j) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[1][j]) \ for (int l = 0; l < n; ++l) \ if (k[2][l]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ if (k[5][c]) \ for (int a = 0; a < n; ++a) \ if (k[3][a]) \ for (int b = 0; b < n; ++b) \ if (k[4][b]) \ for (int c = 0; c < n; ++c) \ { \ if (k[5][c]) \ { \ int *r = &k[2][m]; \ int tem2 = *r; \ k[2][m] += tem2 * i; \ } \ else \ { \ int *r = &k[5][m]; \ int tem2 = *r; \ k[5][m] += tem2 * i; \ } \ if (k[7][c]) \ { \ int *r = &k[3][m]; \ int tem2 = *r; \ k[3][m] += tem2 * i; \ } \ else \ { \ int *r = &k[3][m]; \ int tem2 = *r; \ k[3][m] += tem2 * i; \ } \ int *t = &k[8][m]; \ int tem2 = *t; \ k[8][m] += tem2 * i; \ } \ } \ int *s = &k[4][m]; \ int tem2 = *s; \ k[4][m] += tem2 * i; \ } \ } \ #define FOOB(B) FOOAB (1,B)\ FOOAB (2,B) \ FOOAB (3,B) \ FOOAB (4,B) \ FOOAB (5,B) FOOB (0) #if 1 FOOB (1) FOOB (2) FOOB (3) FOOB (4) #if 0 FOOB (5) FOOB (6) FOOB (7) FOOB (8) FOOB (9) #endif #endif ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 2021-08-30 8:49 ` Xionghu Luo @ 2021-08-30 9:19 ` Richard Biener 2021-08-31 7:43 ` Xionghu Luo 0 siblings, 1 reply; 17+ messages in thread From: Richard Biener @ 2021-08-30 9:19 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On Mon, 30 Aug 2021, Xionghu Luo wrote: > > > On 2021/8/27 15:45, Richard Biener wrote: > > On Thu, 26 Aug 2021, Xionghu Luo wrote: > > > >> > >> > >> On 2021/8/24 16:20, Richard Biener wrote: > >>> On Tue, 24 Aug 2021, Xionghu Luo wrote: > >>> > >>>> > >>>> > >>>> On 2021/8/19 20:11, Richard Biener wrote: > >>>>>> - class loop *inn_loop = loop; > >>>>>> > >>>>>> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > >>>>>> { > >>>>>> @@ -3232,19 +3231,6 @@ fill_always_executed_in_1 (class loop *loop, > >>>>>> @@ sbitmap contains_call) > >>>>>> to disprove this if possible). */ > >>>>>> if (bb->flags & BB_IRREDUCIBLE_LOOP) > >>>>>> break; > >>>>>> - > >>>>>> - if (!flow_bb_inside_loop_p (inn_loop, bb)) > >>>>>> - break; > >>>>>> - > >>>>>> - if (bb->loop_father->header == bb) > >>>>>> - { > >>>>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>>>> - break; > >>>>>> - > >>>>>> - /* In a loop that is always entered we may proceed > >>>>>> anyway. > >>>>>> - But record that we entered it and stop once we leave > >>>>>> it. */ > >>>>>> - inn_loop = bb->loop_father; > >>>>>> - } > >>>>>> } > >>>>>> > >>>>>> while (1) > >>>>> I'm not sure this will work correct (I'm not sure how the existing > >>>>> code makes it so either...). That said, I can't poke any hole > >>>>> into the change. 
What I see is that definitely > >>>>> > >>>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>>> last = bb; > >>>>> > >>>>> if (bitmap_bit_p (contains_call, bb->index)) > >>>>> break; > >>>>> > >>>>> doesn't work reliably since the DOM ordering will process blocks > >>>>> A B and C in random order for > >>>>> > >>>>> for (;;) > >>>>> { > >>>>> if (cond) > >>>>> { > >>>>> A: foo (); > >>>>> } > >>>>> else B:; > >>>>> C:; > >>>>> } > >>>>> > >>>>> and thus we can end up setting 'last' to C_before_ processing > >>>>> 'A' and thus arriving at the call foo () ... > >>>>> > >>>>> get_loop_body_in_dom_order does some "special sauce" but not > >>>>> to address the above problem - but it might be that a subtle > >>>>> issue like the above is the reason for the inner loop handling. > >>>>> The inner loop block order does_not_ adhere to this "special sauce", > >>>>> that is - the "Additionally, if a basic block s dominates > >>>>> the latch, then only blocks dominated by s are be after it." > >>>>> guarantee holds for the outer loop latch, not for the inner. > >>>>> > >>>>> Digging into the history of fill_always_executed_in_1 doesn't > >>>>> reveal anything - the inner loop handling has been present > >>>>> since introduction by Zdenek - but usually Zdenek has a reason > >>>>> for doing things as he does;) > >>>> > >>>> Yes, this is really complicated usage, thanks for point it out. :) > >>>> I constructed two cases to verify this with inner loop includes "If A; > >>>> else B; C". > >>>> Finding that fill_sons_in_loop in get_loop_body_in_dom_order will also > >>>> checks > >>>> whether the bb domintes outer loop’s latch, if C dominate outer loop’s > >>>> latch, > >>>> C is postponed, the access order is ABC, 'last' won’t be set to C if A or > >>>> B contains call; > >>> > >>> But it depends on the order of visiting ABC and that's hard to put into > >>> a testcase since it depends on the order of edges and the processing > >>> of the dominance computation. 
ABC are simply unordered with respect > >>> to a dominator walk. > >>> > >>>> Otherwise if C doesn’t dominate outer loop’s latch in fill_sons_in_loop, > >>>> the access order is CAB, but 'last' also won’t be updated to C in > >>>> fill_always_executed_in_1 > >>>> since there is also dominate check, then if A or B contains call, it > >>>> could break > >>>> successfully. > >>>> > >>>> C won't be set to ALWAYS EXECUTED for both circumstance. > >>>> > >>>>> > >>>>> Note it might be simply a measure against quadratic complexity, > >>>>> esp. since with your patch we also dive into not always executed > >>>>> subloops as you remove the > >>>>> > >>>>> if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>>> break; > >>>>> > >>>>> check. I suggest to evaluate behavior of the patch on a testcase > >>>>> like > >>>>> > >>>>> void foo (int n, int **k) > >>>>> { > >>>>> for (int i = 0; i < n; ++i) > >>>>> if (k[0][i]) > >>>>> for (int j = 0; j < n; ++j) > >>>>> if (k[1][j]) > >>>>> for (int l = 0; l < n; ++l) > >>>>> if (k[2][l]) > >>>>> ... > >>>>> } > >>>> > >>>> Theoretically the complexity is changing from L1(bbs) to > >>>> L1(bbs)+L2(bbs)+L3(bbs)+…+Ln(bbs), > >>>> so fill_always_executed_in_1's execution time is supposed to be increase > >>>> from > >>>> O(n) to O(n2)? The time should depend on loop depth and bb counts. I > >>>> also drafted a > >>>> test case has 73-depth loop function with 25 no-ipa function copies each > >>>> compiled > >>>> in lim2 and lim4 dependently. Total execution time of > >>>> fill_always_executed_in_1 is > >>>> increased from 32ms to 58ms, almost doubled but not quadratic? > >>> > >>> It's more like n + (n-1) + (n-2) + ... + 1 which is n^2/2 but that's still > >>> O(n^2). > >>> > >>>> It seems reasonable to see compiling time getting longer since most bbs > >>>> are checked > >>>> more but a MUST to ensure early break correctly in every loop level... 
> >>>> Though loop nodes could be huge, loop depth will never be so large in > >>>> actual code? > >>> > >>> The "in practice" argument is almost always defeated by automatic > >>> program generators ;) > >>> > >>>>> I suspect you'll see quadratic behavior with your patch. You > >>>>> should be at least able to preserve a check like > >>>>> > >>>>> /* Do not process not always executed subloops to avoid > >>>>> quadratic behavior. */ > >>>>> if (bb->loop_father->header == bb > >>>>> && !dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>>> break; > >>>>> > >>>>> which is of course not optimistic for cases like > >>>>> > >>>>> for (..) > >>>>> { > >>>>> if (cond) > >>>>> for (..) > >>>>> x = 1; // this is always executed if the inner loop is finite > >>>>> } > >>>>> > >>>>> but we need to have an eye on the complexity of this function. I would > >>>>> have suggested to do greedy visiting of the loop header successors, > >>>>> processing further blocks if all entries (ignoring backedges) are > >>>>> processed, setting SET_ALWAYS_EXECUTED_IN. When the worklist > >>>>> is empty proceed to inner loops as the current code does. For > >>>>> bookkeeping simply keep a to-visit-incoming-edges counter per BB. > >>>>> Pseudo-code: > >>>>> > >>>>> bitmap_set_bit (worklist, loop-header-bb); > >>>>> while (!bitmap_empty_p (worklist)) > >>>>> { > >>>>> bb = pop (worklist); > >>>> > >>>> Need check whether bb dominates latch before SET_ALWAYS_EXECUTED_IN? > >>> > >>> Ah, sure. > >>> > >>>>> SET_ALWAYS_EXECUTED_IN (bb, loop); > >>>>> if (bitmap_bit_p (contains_call, bb->index)) > >>>>> continue; > >>>>> FOR_EACH_EDGE (e, ei, bb->succs) > >>>>> { > >>>>> if (!flow_bb_inside_loop_p (loop, e->dest)) > >>>>> continue; > >>>>> if (incoming_count[e->dest->index]-- == 0) > >>>>> push (worklist, e->dest); > >>>>> } > >>>>> } > >>>> > >>>> Sorry I don't fully understand your algorithm. worklist should be > >>>> auto_vec<basic_block> don't support bitmap operations? 
Is incoming_count > >>>> the bb's preds count, why need retain it since it it decreased to 0? > >>> > >>> the worklist is a auto_bitmap, pop () would be > >>> bitmap_first_set_bit/clear_bit. One could use a vector with the > >>> caveat of eventually adding duplicate members to the worklist. > >>> But as said, it's pseudo-code ;) > >> > >> Thanks a lot! > >> > >>> > >>>>> > >>>>> iterate over inner loops (incoming_count can be retained, > >>>>> we just force the inner loop header onto the worklist). > >>>> > >>>> Is this same to ? > >>>> > >>>> for (loop = loop->inner; loop; loop = loop->next) > >>>> fill_always_executed_in_1 (loop, contains_call) > >>> > >>> Yes, but I'd simply use for (loop : loops_list (cfun, 0)) which > >>> should iterate over loops in PRE order. > >> > >> Tried implementation with your pseudo-code, it works like the > >> previous v2 solution, bb after inner loops could be marked > >> as ALWAYS EXECUTED. You are really so strong! ;) > >> Regression tested pass on P8LE. > >> > >> Assume, > >> > >> A: GCC Base > >> B: inner loop check removal > >> C: incoming count > >> > >> Execution time of fill_always_executed_in_1 for the 73-depth loop > >> (with five function copies) is > >> (A vs. B vs. C): 32ms vs 58ms vs 38ms. Much better than O(n^2)? > > Sorry didn't get C's execution time accurate enough due to forgot to remove > dump file code in it. > C's execution time is 25 ms after code refine. It's even better than A. > Attached the test case ssa-lim-22.c used for time measurement. > > >> > >> > >> Bumped the patch to v3 with TODOs: > >> 1. The time measurement code will be removed if the v3 is OK; > >> 2. loops_list is not used as my code is not rebased to upstream yet. > >> 3. Function comments is not updated. > >> > >> > >> > >> [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 > >> > >> > >> ALWAYS_EXECUTED_IN is not computed completely for nested loops. 
> >> Current design will exit if an inner loop doesn't dominate outer > >> loop's latch or exit after exiting from inner loop, which > >> caused early return from outer loop, then ALWAYS EXECUTED blocks after > >> inner loops are skipped. Another problem is it has to call > >> get_loop_body_in_dom_order in each recursive call which is also > >> inefficient. > >> > >> This patch does greedy visiting of the loop header successors, > >> processing further blocks if all entries (ignoring backedges) are > >> processed, setting SET_ALWAYS_EXECUTED_IN of dominates loop's latch. > >> When the worklist is empty proceed to inner loops as the current > >> code does. For bookkeeping simply keep a to-visit-incoming-edges > >> counter per BB, and pass it through iteration over inner loops. > >> > >> Details discusions: > >> https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577743.html > >> > >> gcc/ChangeLog: > >> > >> * tree-ssa-loop-im.c (fill_always_executed_in_1): Remove > >> inn_loop check. > >> > >> gcc/testsuite/ChangeLog: > >> > >> * gcc.dg/tree-ssa/ssa-lim-19.c: New test. > >> * gcc.dg/tree-ssa/ssa-lim-20.c: New test. > >> --- > >> gcc/tree-ssa-loop-im.c | 95 +++++++++++++--------- > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c | 24 ++++++ > >> gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c | 30 +++++++ > >> 3 files changed, 111 insertions(+), 38 deletions(-) > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c > >> create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-20.c > >> > >> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c > >> index b24bc64f2a7..72be6b6bfac 100644 > >> --- a/gcc/tree-ssa-loop-im.c > >> +++ b/gcc/tree-ssa-loop-im.c > >> @@ -3192,39 +3192,52 @@ do_store_motion (void) > >> blocks that contain a nonpure call. 
*/ > >> > >> static void > >> -fill_always_executed_in_1 (class loop *loop, sbitmap contains_call) > >> +fill_always_executed_in_1 (class loop *loop, sbitmap contains_call, int > >> *bbi) > >> { > >> - basic_block bb = NULL, *bbs, last = NULL; > >> - unsigned i; > >> + basic_block bb = NULL; > >> edge e; > >> - class loop *inn_loop = loop; > >> > >> if (ALWAYS_EXECUTED_IN (loop->header) == NULL) > >> { > >> - bbs = get_loop_body_in_dom_order (loop); > >> + auto_bitmap work_set; > >> + bitmap_clear (work_set); > > > > bitmap_clear isn't necessary > > Removed. > > > > >> + bitmap_set_bit (work_set, loop->header->index); > >> + unsigned bb_index; > >> - for (i = 0; i < loop->num_nodes; i++) > >> - { > >> - edge_iterator ei; > >> - bb = bbs[i]; > >> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; > >> + int *bbd = XNEWVEC (int, array_size); > >> + bbd = XDUPVEC (int, bbi, array_size); > > > > I don't think you need to copy 'bbi' but you can re-use the > > state from the outer loop processing. Did you run into any > > issues with that? > > Yes. For example, adding a small if-else block to ssa-lim-19.c, > Then block "x->j += tem * i;" of bb 6 is always executed for loop 2, when call > fill_always_executed_in_1 for loop 1, bbi[6] is decreased from 2 to 1 to 0, > then if fill_always_executed_in_1 is called again for loop 2, it's value is > not > reset so bbi[6] won't be set ALWAYS EXECUTE, this is wrong. > > > struct X { int i; int j; int k;}; > > void foo(struct X *x, int n, int l, int m) > { > for (int j = 0; j < l; j++) // loop 1 > { > for (int i = 0; i < n; ++i) // loop 2 > { > if (m) > x->j++; > else > x->j = m+n+l; > > int *p = &x->j; // bb 6 > int tem = *p; > x->j += tem * i; > } > int *r = &x->k; > int tem2 = *r; > x->k += tem2 * j; > } > } Hmm, but if the outer loop processing reaches bb 6 then it should have set it ALWAYS_EXECUTED in loop 1 already? 
> > > > > >> + while (!bitmap_empty_p (work_set)) > >> + { > >> + bb_index = bitmap_first_set_bit (work_set); > >> + bitmap_clear_bit (work_set, bb_index); > >> + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); > >> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >> - last = bb; > >> - > >> + SET_ALWAYS_EXECUTED_IN (bb, loop); > >> if (bitmap_bit_p (contains_call, bb->index)) > >> break; > > > > I think you want to continue; here (process remaining worklist > > but not continue greedy walking this block) > > Same as above, if use 'continue' instead of 'break', the algorithm > seems also not work again. If inner loop contains a jump to outmost > loop, the blocks after the jump block will be set to ALWAYS EXECUTE > incorrectly. > > > > >> - > >> + edge_iterator ei; > >> FOR_EACH_EDGE (e, ei, bb->succs) > >> { > >> - /* If there is an exit from this BB. */ > >> if (!flow_bb_inside_loop_p (loop, e->dest)) > >> break; > > > > in particular this should keep the outer 'bbi' valid to re-use. > > > > But again, you want 'continue;' the greedy walk to other edges. > > If that's not valid (I'd need to think about this) then with > > your patch whether we process an edge depends on the order > > of the edge visit so you'd have to walk successors twice, > > once to determine whether we can greedily walk any of it > > and once to actually do the greedy walk. > > > > So thinking about it an exit edge is like a not returning call > > and thus we indeed should not process any outgoing edges of > > this block. > > > >> + > >> /* Or we enter a possibly non-finite loop. */ > >> if (flow_loop_nested_p (bb->loop_father, > >> e->dest->loop_father) > >> && ! finite_loop_p (e->dest->loop_father)) > >> break; > > > > I think this is no longer necessary? In any case it would > > again be 'continue;', see also above. > > > > I'm missing skipping of backedges in the walk. 
> > > >> + > >> + if (bbd[e->dest->index] == 1) > >> + { > >> + bitmap_set_bit (work_set, e->dest->index); > >> + bbd[e->dest->index]--; > >> + } > >> + else if (bbd[e->dest->index] > 1) > >> + bbd[e->dest->index]--; > > > > maybe just > > > > bbd[e->dest->index]--; > > if (bbd[e->dest->index] == 0) > > bitmap_set_bit (work_set, e->dest->index); > > They are not equivalent. Former won't decrease if bbd[e->dest->index] > is 0, but the latter does. we should be never visiting in this case, so sth indeed doesn't work as I expected. > > > >> } > >> + > >> if (e) > >> break; > > > > continue; processing the worklist > > > >> > >> @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, > >> @@ sbitmap contains_call) > >> to disprove this if possible). */ > >> if (bb->flags & BB_IRREDUCIBLE_LOOP) > >> break; > > > > continue; processing the worklist. I think this has to > > come early, before processing successors, next to > > the contains_call processing. We could think of ignoring > > entering of irreducible regions by tracking exits from them, > > but I'm not sure we're identifying the regions in a good > > enough way to allow this. > > > > Likewise possibly infinite inner loops can be processed > > when we track exits from those which would be sth like > > > > /* When this is an exit from an inner loop that we cannot > > prove as finite do not follow this edge. */ > > if (bb->loop_father != loop > > && loop_exit_edge_p (bb->loop_father, e) > > && ! finite_loop_p (bb->loop_father)) > > continue; > > > >> - > >> - if (!flow_bb_inside_loop_p (inn_loop, bb)) > >> - break; > >> - > >> - if (bb->loop_father->header == bb) > >> - { > >> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >> - break; > >> - > >> - /* In a loop that is always entered we may proceed anyway. > >> - But record that we entered it and stop once we leave it. 
*/ > >> - inn_loop = bb->loop_father; > >> - } > >> - } > >> - > >> - while (1) > >> - { > >> - SET_ALWAYS_EXECUTED_IN (last, loop); > >> - if (last == loop->header) > >> - break; > >> - last = get_immediate_dominator (CDI_DOMINATORS, last); > >> } > >> - > >> - free (bbs); > >> + free (bbd); > >> } > >> > >> for (loop = loop->inner; loop; loop = loop->next) > >> - fill_always_executed_in_1 (loop, contains_call); > >> + fill_always_executed_in_1 (loop, contains_call, bbi); > >> } > >> > >> /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. > >> @@ -3287,8 +3278,36 @@ fill_always_executed_in (void) > >> bitmap_set_bit (contains_call, bb->index); > >> } > >> + mark_dfs_back_edges (); > > > > Please put this to loop_invariant_motion_in_fun right before > > the call to fill_always_executed_in. > > Done. > > > > >> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; > >> + int *bb_incoming_edges = XNEWVEC (int, array_size); > > > > There's XCNEWVEC that also zeros the array. > > OK, thanks. > > > > >> + memset (bb_incoming_edges, 0, array_size * sizeof (int)); > >> + edge_iterator ei; > >> + edge e; > >> + > >> + FOR_EACH_BB_FN (bb, cfun) > >> + { > >> + int len = bb->preds->length (); > >> + FOR_EACH_EDGE (e, ei, bb->preds) > >> + if (e->flags & EDGE_DFS_BACK) > >> + len--; > >> + bb_incoming_edges[bb->index] = len; > >> + } > > > > it would be nice to combine this with the preceeding loop over > > all blocks. > > Done. 
> > > > >> + static unsigned long diff = 0; > >> + struct timeval tv; > >> + gettimeofday (&tv, NULL); > >> + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; > >> + > >> + //for (loop : loops_list (cfun, 0)) > >> for (loop = current_loops->tree_root->inner; loop; loop = loop->next) > >> - fill_always_executed_in_1 (loop, contains_call); > >> + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); > >> + > >> + gettimeofday (&tv, NULL); > >> + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; > >> + diff += end - begin; > >> + //printf ("%s,diff:%ld\n", current_function_name (), diff); > >> + free (bb_incoming_edges); > > > > Eh, looks very much like work in progress. From the O() complexity > > point duplicating bb_incoming_edges for each loop makes it quadratic > > again but as said I think we don't need this duplication. > > Sorry I didn't quite follow your above comments about reusing > bb_incoming_edges, > we are searching the bbs that dominate loop->latch loop by loop independently, > if outer loop changes the inner loops bb_incoming_edges, it will make the > inner > loop's ALWAYS EXECUTE computation incorrect? We're only supposed to greedily walk into always excuted blocks. I see this probably breaks down for conditionally executed inner loops like for () { if () for (); A; } where we want to process the inner loop separately but we still want to make sure A is marked as always executed in the outer loop. > Seems duplicating bb_incoming_edges not increases the complexity but the > recursive call does? Since the bb_incoming_edges array has a size of N and we copy it number_of_loop times (which is O(N) bound) the copying effectively has O(N^2) cost. So yes, it's the iteration over all loops, whether by recursion or iteration. 
I wonder if we can more cleverly handle this by picking up conditionally executed inner loops on the fly for example by recording an 'outermost unconditionally executed loop' for each loop header we visit and using that like SET_ALWAYS_EXECUTED_IN (bb, outermost_always_exec_in (bb->loop_father)) So make this a single greedy CFG walk starting at the outermost loops. Loop exits will need to be processed specially then. Richard. > > > > I'd like to see the recursion removed, did the for (loop : ...) > not work out? > The recursion in fill_always_executed_in_1 could be removed since loops_list > could iterate all inner loops but > "for (loop = current_loops->tree_root->inner; loop; loop = loop->next)" > only iterate sibling level loops? > > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg) ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 2021-08-30 9:19 ` Richard Biener @ 2021-08-31 7:43 ` Xionghu Luo 2021-08-31 11:29 ` Richard Biener 0 siblings, 1 reply; 17+ messages in thread From: Xionghu Luo @ 2021-08-31 7:43 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw [-- Attachment #1: Type: text/plain, Size: 15548 bytes --] On 2021/8/30 17:19, Richard Biener wrote: >>>> bitmap_set_bit (work_set, loop->header->index); >>>> + unsigned bb_index; >>>> - for (i = 0; i < loop->num_nodes; i++) >>>> - { >>>> - edge_iterator ei; >>>> - bb = bbs[i]; >>>> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; >>>> + int *bbd = XNEWVEC (int, array_size); >>>> + bbd = XDUPVEC (int, bbi, array_size); >>> I don't think you need to copy 'bbi' but you can re-use the >>> state from the outer loop processing. Did you run into any >>> issues with that? >> Yes. For example, adding a small if-else block to ssa-lim-19.c, >> Then block "x->j += tem * i;" of bb 6 is always executed for loop 2, when call >> fill_always_executed_in_1 for loop 1, bbi[6] is decreased from 2 to 1 to 0, >> then if fill_always_executed_in_1 is called again for loop 2, it's value is >> not >> reset so bbi[6] won't be set ALWAYS EXECUTE, this is wrong. >> >> >> struct X { int i; int j; int k;}; >> >> void foo(struct X *x, int n, int l, int m) >> { >> for (int j = 0; j < l; j++) // loop 1 >> { >> for (int i = 0; i < n; ++i) // loop 2 >> { >> if (m) >> x->j++; >> else >> x->j = m+n+l; >> >> int *p = &x->j; // bb 6 >> int tem = *p; >> x->j += tem * i; >> } >> int *r = &x->k; >> int tem2 = *r; >> x->k += tem2 * j; >> } >> } > Hmm, but if the outer loop processing reaches bb 6 then > it should have set it ALWAYS_EXECUTED in loop 1 already? But bb 6 is NOT ALWAYS_EXECUTED for loop 1, it is only ALWAYS_EXECUTED for loop 2 as it requires n>0. Please refer to the attached file ssa-lim-19.c.138t.lim2. 
;; ;; Loop 1 ;; header 8, latch 12 ;; depth 1, outer 0 ;; nodes: 8 12 7 6 4 5 3 13 11 ;; ;; Loop 2 ;; header 3, latch 13 ;; depth 2, outer 1 ;; nodes: 3 13 6 4 5 ;; 2 succs { 10 9 } ;; 10 succs { 8 } ;; 11 succs { 3 } ;; 3 succs { 4 5 } ;; 4 succs { 6 } ;; 5 succs { 6 } ;; 6 succs { 13 7 } ;; 13 succs { 3 } ;; 7 succs { 12 9 } ;; 12 succs { 8 } ;; 8 succs { 11 7 } ;; 9 succs { 1 } always executed: bb->index:8, loop->num: 1 always executed: bb->index:7, loop->num: 1 always executed: bb->index:3, loop->num: 2 always executed: bb->index:6, loop->num: 2 8<--------------- / \ | 11 \ | / \ | 3<--- \ | /\ | \ | 4 5 | \ | \/ | \ | 6 | \ | |-->13 \ | |----------> 7 | /\ | 9 12--- (gdb) x /15x bbd 0x1354c9b0: 0x00000000 0x00000000 0x00000001 0x00000001 0x1354c9c0: 0x00000001 0x00000001 0x00000002 0x00000002 0x1354c9d0: 0x00000001 0x00000002 0x00000001 0x00000001 0x1354c9e0: 0x00000001 0x00000001 0x00000000 our algorithm will walk through 8->11->3->4->5->6->7, for loop 1, exit at edge 7->9. (gdb) x /15x bbd 0x1354c9b0: 0x00000000 0x00000000 0x00000001 0x00000000 0x1354c9c0: 0x00000000 0x00000000 0x00000000 0x00000000 0x1354c9d0: 0x00000001 0x00000002 0x00000001 0x00000000 0x1354c9e0: 0x00000001 0x00000000 0x00000000 If we don't reset bbd to incoming_edge by memcpy, bbd[3],bbd[4],bbd[5] and bbd[6] is 0 now for loop 2, fill_always_executed_in_1 couldn't set ALWAYS_EXECUTED correctly for loop 2 at bb 3 and bb 6. 
> >> >>> >>>> + while (!bitmap_empty_p (work_set)) >>>> + { >>>> + bb_index = bitmap_first_set_bit (work_set); >>>> + bitmap_clear_bit (work_set, bb_index); >>>> + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); >>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>> - last = bb; >>>> - >>>> + SET_ALWAYS_EXECUTED_IN (bb, loop); >>>> if (bitmap_bit_p (contains_call, bb->index)) >>>> break; >>> I think you want to continue; here (process remaining worklist >>> but not continue greedy walking this block) >> Same as above, if use 'continue' instead of 'break', the algorithm >> seems also not work again. If inner loop contains a jump to outmost >> loop, the blocks after the jump block will be set to ALWAYS EXECUTE >> incorrectly. >> >>>> - >>>> + edge_iterator ei; >>>> FOR_EACH_EDGE (e, ei, bb->succs) >>>> { >>>> - /* If there is an exit from this BB. */ >>>> if (!flow_bb_inside_loop_p (loop, e->dest)) >>>> break; >>> in particular this should keep the outer 'bbi' valid to re-use. >>> >>> But again, you want 'continue;' the greedy walk to other edges. >>> If that's not valid (I'd need to think about this) then with >>> your patch whether we process an edge depends on the order >>> of the edge visit so you'd have to walk successors twice, >>> once to determine whether we can greedily walk any of it >>> and once to actually do the greedy walk. >>> >>> So thinking about it an exit edge is like a not returning call >>> and thus we indeed should not process any outgoing edges of >>> this block. >>> >>>> + >>>> /* Or we enter a possibly non-finite loop. */ >>>> if (flow_loop_nested_p (bb->loop_father, >>>> e->dest->loop_father) >>>> && ! finite_loop_p (e->dest->loop_father)) >>>> break; >>> I think this is no longer necessary? In any case it would >>> again be 'continue;', see also above. >>> >>> I'm missing skipping of backedges in the walk. 
>>> >>>> + >>>> + if (bbd[e->dest->index] == 1) >>>> + { >>>> + bitmap_set_bit (work_set, e->dest->index); >>>> + bbd[e->dest->index]--; >>>> + } >>>> + else if (bbd[e->dest->index] > 1) >>>> + bbd[e->dest->index]--; >>> maybe just >>> >>> bbd[e->dest->index]--; >>> if (bbd[e->dest->index] == 0) >>> bitmap_set_bit (work_set, e->dest->index); >> They are not equivalent. Former won't decrease if bbd[e->dest->index] >> is 0, but the latter does. > we should be never visiting in this case, so sth indeed doesn't > work as I expected. I've verified this won't happen if duplicate the bb_incoming_edges in fill_always_executed_in_1, so could replace it. > >>>> } >>>> + >>>> if (e) >>>> break; >>> continue; processing the worklist >>> >>>> >>>> @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, >>>> @@ sbitmap contains_call) >>>> to disprove this if possible). */ >>>> if (bb->flags & BB_IRREDUCIBLE_LOOP) >>>> break; >>> continue; processing the worklist. I think this has to >>> come early, before processing successors, next to >>> the contains_call processing. We could think of ignoring >>> entering of irreducible regions by tracking exits from them, >>> but I'm not sure we're identifying the regions in a good >>> enough way to allow this. >>> >>> Likewise possibly infinite inner loops can be processed >>> when we track exits from those which would be sth like >>> >>> /* When this is an exit from an inner loop that we cannot >>> prove as finite do not follow this edge. */ >>> if (bb->loop_father != loop >>> && loop_exit_edge_p (bb->loop_father, e) >>> && ! finite_loop_p (bb->loop_father)) >>> continue; >>> >>>> - >>>> - if (!flow_bb_inside_loop_p (inn_loop, bb)) >>>> - break; >>>> - >>>> - if (bb->loop_father->header == bb) >>>> - { >>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) >>>> - break; >>>> - >>>> - /* In a loop that is always entered we may proceed anyway. >>>> - But record that we entered it and stop once we leave it. 
*/ >>>> - inn_loop = bb->loop_father; >>>> - } >>>> - } >>>> - >>>> - while (1) >>>> - { >>>> - SET_ALWAYS_EXECUTED_IN (last, loop); >>>> - if (last == loop->header) >>>> - break; >>>> - last = get_immediate_dominator (CDI_DOMINATORS, last); >>>> } >>>> - >>>> - free (bbs); >>>> + free (bbd); >>>> } >>>> >>>> for (loop = loop->inner; loop; loop = loop->next) >>>> - fill_always_executed_in_1 (loop, contains_call); >>>> + fill_always_executed_in_1 (loop, contains_call, bbi); >>>> } >>>> >>>> /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. >>>> @@ -3287,8 +3278,36 @@ fill_always_executed_in (void) >>>> bitmap_set_bit (contains_call, bb->index); >>>> } >>>> + mark_dfs_back_edges (); >>> Please put this to loop_invariant_motion_in_fun right before >>> the call to fill_always_executed_in. >> Done. >> >>>> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; >>>> + int *bb_incoming_edges = XNEWVEC (int, array_size); >>> There's XCNEWVEC that also zeros the array. >> OK, thanks. >> >>>> + memset (bb_incoming_edges, 0, array_size * sizeof (int)); >>>> + edge_iterator ei; >>>> + edge e; >>>> + >>>> + FOR_EACH_BB_FN (bb, cfun) >>>> + { >>>> + int len = bb->preds->length (); >>>> + FOR_EACH_EDGE (e, ei, bb->preds) >>>> + if (e->flags & EDGE_DFS_BACK) >>>> + len--; >>>> + bb_incoming_edges[bb->index] = len; >>>> + } >>> it would be nice to combine this with the preceeding loop over >>> all blocks. >> Done. 
>> >>>> + static unsigned long diff = 0; >>>> + struct timeval tv; >>>> + gettimeofday (&tv, NULL); >>>> + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; >>>> + >>>> + //for (loop : loops_list (cfun, 0)) >>>> for (loop = current_loops->tree_root->inner; loop; loop = loop->next) >>>> - fill_always_executed_in_1 (loop, contains_call); >>>> + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); >>>> + >>>> + gettimeofday (&tv, NULL); >>>> + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; >>>> + diff += end - begin; >>>> + //printf ("%s,diff:%ld\n", current_function_name (), diff); >>>> + free (bb_incoming_edges); >>> Eh, looks very much like work in progress. From the O() complexity >>> point duplicating bb_incoming_edges for each loop makes it quadratic >>> again but as said I think we don't need this duplication. >> Sorry I didn't quite follow your above comments about reusing >> bb_incoming_edges, >> we are searching the bbs that dominate loop->latch loop by loop independently, >> if outer loop changes the inner loops bb_incoming_edges, it will make the >> inner >> loop's ALWAYS EXECUTE computation incorrect? > We're only supposed to greedily walk into always executed blocks. I see > this probably breaks down for conditionally executed inner loops like > > for () > { > if () > for (); > A; > } > > where we want to process the inner loop separately but we still want > to make sure A is marked as always executed in the outer loop. Yes, the problem is that ALWAYS_EXECUTED blocks are not contiguous in the CFG, so we have to check every bb and maintain bb_incoming_edges in each loop. If any block jumps to an outer loop, we have to 'break' instead of 'continue', as the following bbs are not ALWAYS_EXECUTED anymore. > >> Seems duplicating bb_incoming_edges not increases the complexity but the >> recursive call does?
> Since the bb_incoming_edges array has a size of N and we copy it > number_of_loop times (which is O(N) bound) the copying effectively > has O(N^2) cost. So yes, it's the iteration over all loops, > whether by recursion or iteration. Actually, the copying is outside of the greedy CFG walk loop. I guess there will be some optimizations in the memcpy of bb_incoming_edges by libraries? But maybe that only works when most of the values are the same, via copy_by_pieces? And I modified the time-consuming case with many bbs to simulate extreme usage and didn't observe quadratic behavior from it; it shows an almost linear build time increase for both the base and the V3 patch, but the V3 patch generates more accurate ALWAYS_EXECUTED results. The function ratios in the perf tool also look reasonable. At least, it looks like an improvement compared with the base code. A function with 10K bbs would cost more than 8 minutes to build, so it seems duplicating bb_incoming_edges for much fewer than 1000*1000*1000 bbs will not be a performance bottleneck? (While most values are different, the copying will indeed slow down when the bb_incoming_edges array is quite large, like 1000*1000*1000; if so, that's a really crazy automatic program generator...) Base: number of bbs(lim2+lim4) total build time (sec) tree loop invariant motion (usr/sys/wall/Mem) 7850+15615 123.99 13.52 ( 11%) 2.16 ( 52%) 15.68 ( 11%) 60M ( 21%) 15338+30591 208.98 24.55 ( 12%) 4.04 ( 52%) 28.57 ( 12%) 121M ( 21%) 30314+60543 511.54 48.96 ( 10%) 7.70 ( 52%) 56.90 ( 10%) 241M ( 21%) V3 patch: number of bbs(lim2+lim4) total build time (sec) tree loop invariant motion (usr/sys/wall/Mem) 7850+15615 121.18 12.26 ( 10%) 1.99 ( 49%) 14.42 ( 11%) 60M ( 21%) 15338+30591 204.99 24.72 ( 12%) 3.98 ( 52%) 28.85 ( 13%) 121M ( 21%) 30314+60543 500.76 48.26 ( 10%) 8.27 ( 51%) 57.12 ( 11%) 241M ( 21%) Samples: 455K of event 'cycles', Event count (approx.): 426772700334 Overhead Command Shared Object Symbol 7.19% cc1 cc1 [.] bitmap_set_bit 5.86% cc1 cc1 [.]
get_ref_base_and_extent 4.08% cc1 cc1 [.] get_continuation_for_phi 3.48% cc1 cc1 [.] hash_table<vn_ssa_aux_hasher, false, xcallocator>::find 3.41% cc1 cc1 [.] dominated_by_p 3.28% cc1 cc1 [.] compute_transp 2.98% cc1 cc1 [.] bitmap_set_range 2.60% cc1 cc1 [.] bitmap_list_insert_element_after 2.44% cc1 cc1 [.] stmt_may_clobber_ref_p_1 2.27% cc1 cc1 [.] refs_may_alias_p_2 2.26% cc1 cc1 [.] hash_table<vn_reference_hasher, false, xcallocator>::fi 2.26% cc1 cc1 [.] df_rd_transfer_function 2.07% cc1 cc1 [.] bitmap_ior_into 1.84% cc1 cc1 [.] find_base_term 1.82% cc1 cc1 [.] get_alias_set 1.79% cc1 cc1 [.] tree_strip_nop_conversions 1.73% cc1 cc1 [.] dfs_enumerate_from 1.73% cc1 cc1 [.] reference_alias_ptr_type_1 1.53% cc1 cc1 [.] alias_sets_conflict_p 1.30% cc1 cc1 [.] c_get_alias_set 1.16% cc1 cc1 [.] bitmap_and_into 1.12% cc1 cc1 [.] indirect_ref_may_alias_decl_p 1.04% cc1 cc1 [.] et_splay 0.99% cc1 libc-2.27.so [.] __memset_power8 0.95% cc1 cc1 [.] bitmap_ior_and_compl 0.86% cc1 cc1 [.] bitmap_copy 0.78% cc1 cc1 [.] ao_ref_from_mem > > I wonder if we can more cleverly handle this by picking up > conditionally executed inner loops on the fly for example > by recording a 'outermost unconditionally executed loop' > for each loop header we visit and using that like > SET_ALWAYS_EXECUTED_IN (bb, outermost_always_exec_in (bb->loop_father)) > > So make this a single greedy CFG walk starting at the outermost > loops. loop exits will need to be processed specially then. I'm not sure: do you mean we only check loop headers and exits? As said above, the ALWAYS_EXECUTED block chains are not contiguous; there may be blocks that are ALWAYS_EXECUTED, or blocks that jump out to an outer loop, in the middle of the loop. But could they form only two chains? One from loop->header, and another from loop->exit->src for loops with a single exit?
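The `outermost_always_exec_in` idea quoted above might be modeled, very roughly, like this (a speculative standalone sketch with an invented `toy_loop` record; the actual design was still under discussion and the real code would use GCC's loop tree):

```cpp
#include <cassert>
#include <vector>

// Speculative model of the suggestion: for each loop record whether its
// header is always executed in its parent loop; a block visited by a
// single greedy walk can then be marked always executed in the outermost
// such enclosing loop, not just in its immediate loop_father.
struct toy_loop
{
  int parent;                   // index of the enclosing loop, -1 for none
  bool header_always_executed;  // header always executed in the parent?
};

int outermost_always_exec_in (const std::vector<toy_loop> &loops, int loop)
{
  // Climb the loop tree while the current loop is unconditionally
  // entered from its parent.
  while (loops[loop].parent != -1 && loops[loop].header_always_executed)
    loop = loops[loop].parent;
  return loop;
}
```

With this, an unconditionally entered inner loop propagates ALWAYS_EXECUTED marks to the outer loop, while a conditionally entered one (like the `if () for ();` example earlier in the thread) keeps its marks local.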
-- Thanks, Xionghu [-- Attachment #2: ssa-lim-19.c.138t.lim2 --] [-- Type: text/plain, Size: 6863 bytes --] ;; Function foo (foo, funcdef_no=0, decl_uid=3306, cgraph_uid=1, symbol_order=0) Created preheader block for loop 1 Created preheader block for loop 2 ;; 3 loops found ;; ;; Loop 0 ;; header 0, latch 1 ;; depth 0, outer -1 ;; nodes: 0 1 2 10 11 3 4 5 6 13 7 12 8 9 ;; ;; Loop 1 ;; header 8, latch 12 ;; depth 1, outer 0 ;; nodes: 8 12 7 6 4 5 3 13 11 ;; ;; Loop 2 ;; header 3, latch 13 ;; depth 2, outer 1 ;; nodes: 3 13 6 4 5 ;; 2 succs { 10 9 } ;; 10 succs { 8 } ;; 11 succs { 3 } ;; 3 succs { 4 5 } ;; 4 succs { 6 } ;; 5 succs { 6 } ;; 6 succs { 13 7 } ;; 13 succs { 3 } ;; 7 succs { 12 9 } ;; 12 succs { 8 } ;; 8 succs { 11 7 } ;; 9 succs { 1 } Memory reference 1: x_17(D)->j Memory reference 2: MEM[(int *)x_17(D) + 8B] always executed: bb->index:8, loop->num: 1 Found loop 2 to be finite: upper bound found. always executed: bb->index:7, loop->num: 1 always executed: bb->index:3, loop->num: 2 always executed: bb->index:6, loop->num: 2 Basic block 8 (loop 1 -- depth 1): if (n_16(D) > 0) invariant up to level 1, cost 20. Basic block 11 (loop 1 -- depth 1): Basic block 3 (loop 2 -- depth 2): if (m_21(D) != 0) invariant up to level 1, cost 20. Basic block 5 (loop 2 -- depth 2): _4 = l_15(D) + n_16(D); invariant up to level 1, cost 1. Basic block 4 (loop 2 -- depth 2): Basic block 6 (loop 2 -- depth 2): Basic block 7 (loop 1 -- depth 1): Basic block 12 (loop 1 -- depth 1): Basic block 13 (loop 2 -- depth 2): Querying dependency of refs 2 and 1: independent. Querying RAW dependencies of ref 2 in loop 2: independent Querying RAW dependencies of ref 2 in loop 1: independent Querying dependency of refs 2 and 1: independent. 
Querying SM WAR dependencies of ref 2 in loop 2: independent Querying SM WAR dependencies of ref 2 in loop 1: independent Executing store motion of MEM[(int *)x_17(D) + 8B] from loop 1 Querying RAW dependencies of ref 1 in loop 2: independent Querying SM WAR dependencies of ref 1 in loop 2: independent Executing store motion of MEM[(int *)x_17(D) + 4B] from loop 2 Moving statement x__lsm.5 = MEM[(int *)x_17(D) + 4B]; (cost 0) out of loop 2. Moving statement x__lsm.4 = MEM[(int *)x_17(D) + 8B]; (cost 0) out of loop 1. Updating SSA: creating PHI node in block #8 for x__lsm.4 creating PHI node in block #3 for x__lsm.5 Registering new PHI nodes in block #0 Registering new PHI nodes in block #2 Registering new PHI nodes in block #10 Updating SSA information for statement x__lsm.4 = MEM[(int *)x_17(D) + 8B]; Registering new PHI nodes in block #8 Registering new PHI nodes in block #11 Updating SSA information for statement x__lsm.5 = MEM[(int *)x_17(D) + 4B]; Registering new PHI nodes in block #3 Registering new PHI nodes in block #5 Registering new PHI nodes in block #4 Updating SSA information for statement _1 = x__lsm.5; Registering new PHI nodes in block #6 Updating SSA information for statement x__lsm.5 = cstore_28; Updating SSA information for statement tem_24 = x__lsm.5; Updating SSA information for statement x__lsm.5 = _6; Registering new PHI nodes in block #15 Updating SSA information for statement MEM[(int *)x_17(D) + 4B] = x__lsm.5; Registering new PHI nodes in block #13 Registering new PHI nodes in block #7 Updating SSA information for statement tem2_18 = x__lsm.4; Updating SSA information for statement x__lsm.4 = _8; Registering new PHI nodes in block #14 Updating SSA information for statement MEM[(int *)x_17(D) + 8B] = x__lsm.4; Registering new PHI nodes in block #12 Registering new PHI nodes in block #9 Updating SSA information for statement return; Symbols to be put in SSA form { D.3326 D.3331 D.3332 } Incremental SSA update started at block: 0 Number of 
blocks in CFG: 16 Number of blocks to update: 15 ( 94%) Affected blocks: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ;; Created LCSSA PHI: x__lsm.5_36 = PHI <x__lsm.5_10(6)> ;; Created LCSSA PHI: x__lsm.4_37 = PHI <x__lsm.4_33(7)> Updating SSA: Registering new PHI nodes in block #8 Registering new PHI nodes in block #11 Registering new PHI nodes in block #3 Registering new PHI nodes in block #5 Registering new PHI nodes in block #4 Registering new PHI nodes in block #6 Updating SSA information for statement x__lsm.5_10 = _6; Registering new PHI nodes in block #15 Updating SSA information for statement MEM[(int *)x_17(D) + 4B] = x__lsm.5_10; Registering new PHI nodes in block #13 Registering new PHI nodes in block #7 Updating SSA information for statement x__lsm.4_33 = _8; Registering new PHI nodes in block #14 Updating SSA information for statement MEM[(int *)x_17(D) + 8B] = x__lsm.4_33; Registering new PHI nodes in block #12 SSA replacement table N_i -> { O_1 ... O_j } means that N_i replaces O_1, ..., O_j x__lsm.5_36 -> { x__lsm.5_10 } x__lsm.4_37 -> { x__lsm.4_33 } Incremental SSA update started at block: 8 Number of blocks in CFG: 16 Number of blocks to update: 8 ( 50%) Affected blocks: 3 6 7 8 12 13 14 15 void foo (struct X * x, int n, int l, int m) { int x__lsm.5; int x__lsm.4; int tem; int i; int tem2; int j; int _1; int _2; int _4; int _5; int _6; int _7; int _8; int cstore_28; <bb 2> [local count: 14598063]: if (l_15(D) > 0) goto <bb 10>; [89.00%] else goto <bb 9>; [11.00%] <bb 10> [local count: 12992276]: x__lsm.4_13 = MEM[(int *)x_17(D) + 8B]; goto <bb 8>; [100.00%] <bb 11> [local count: 105119324]: x__lsm.5_9 = MEM[(int *)x_17(D) + 4B]; <bb 3> [local count: 955630225]: # i_30 = PHI <i_26(13), 0(11)> # x__lsm.5_32 = PHI <x__lsm.5_10(13), x__lsm.5_9(11)> if (m_21(D) != 0) goto <bb 4>; [50.00%] else goto <bb 5>; [50.00%] <bb 4> [local count: 477815112]: _1 = x__lsm.5_32; _2 = _1 + 1; goto <bb 6>; [100.00%] <bb 5> [local count: 477815112]: _4 = l_15(D) + n_16(D); 
<bb 6> [local count: 955630225]: # cstore_28 = PHI <_2(4), _4(5)> x__lsm.5_12 = cstore_28; tem_24 = x__lsm.5_12; _5 = tem_24 * i_30; _6 = _5 + tem_24; x__lsm.5_10 = _6; i_26 = i_30 + 1; if (n_16(D) > i_26) goto <bb 13>; [89.00%] else goto <bb 15>; [11.00%] <bb 13> [local count: 850510901]: goto <bb 3>; [100.00%] <bb 15> [local count: 105119324]: # x__lsm.5_36 = PHI <x__lsm.5_10(6)> MEM[(int *)x_17(D) + 4B] = x__lsm.5_36; <bb 7> [local count: 118111600]: tem2_18 = x__lsm.4_11; _7 = tem2_18 * j_31; _8 = _7 + tem2_18; x__lsm.4_33 = _8; j_20 = j_31 + 1; if (l_15(D) > j_20) goto <bb 12>; [89.00%] else goto <bb 14>; [11.00%] <bb 12> [local count: 105119324]: <bb 8> [local count: 118111600]: # j_31 = PHI <j_20(12), 0(10)> # x__lsm.4_11 = PHI <x__lsm.4_33(12), x__lsm.4_13(10)> if (n_16(D) > 0) goto <bb 11>; [89.00%] else goto <bb 7>; [11.00%] <bb 14> [local count: 12992276]: # x__lsm.4_37 = PHI <x__lsm.4_33(7)> MEM[(int *)x_17(D) + 8B] = x__lsm.4_37; <bb 9> [local count: 14598063]: return; } ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v3] Fix incomplete computation in fill_always_executed_in_1 2021-08-31 7:43 ` Xionghu Luo @ 2021-08-31 11:29 ` Richard Biener 0 siblings, 0 replies; 17+ messages in thread From: Richard Biener @ 2021-08-31 11:29 UTC (permalink / raw) To: Xionghu Luo; +Cc: gcc-patches, segher, dje.gcc, wschmidt, guojiufu, linkw On Tue, 31 Aug 2021, Xionghu Luo wrote: > > > On 2021/8/30 17:19, Richard Biener wrote: > >>>> bitmap_set_bit (work_set, loop->header->index); > >>>> + unsigned bb_index; > >>>> - for (i = 0; i < loop->num_nodes; i++) > >>>> - { > >>>> - edge_iterator ei; > >>>> - bb = bbs[i]; > >>>> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; > >>>> + int *bbd = XNEWVEC (int, array_size); > >>>> + bbd = XDUPVEC (int, bbi, array_size); > >>> I don't think you need to copy 'bbi' but you can re-use the > >>> state from the outer loop processing. Did you run into any > >>> issues with that? > >> Yes. For example, adding a small if-else block to ssa-lim-19.c, > >> Then block "x->j += tem * i;" of bb 6 is always executed for loop 2, when > >> call > >> fill_always_executed_in_1 for loop 1, bbi[6] is decreased from 2 to 1 to 0, > >> then if fill_always_executed_in_1 is called again for loop 2, it's value is > >> not > >> reset so bbi[6] won't be set ALWAYS EXECUTE, this is wrong. > >> > >> > >> struct X { int i; int j; int k;}; > >> > >> void foo(struct X *x, int n, int l, int m) > >> { > >> for (int j = 0; j < l; j++) // loop 1 > >> { > >> for (int i = 0; i < n; ++i) // loop 2 > >> { > >> if (m) > >> x->j++; > >> else > >> x->j = m+n+l; > >> > >> int *p = &x->j; // bb 6 > >> int tem = *p; > >> x->j += tem * i; > >> } > >> int *r = &x->k; > >> int tem2 = *r; > >> x->k += tem2 * j; > >> } > >> } > > Hmm, but if the outer loop processing reaches bb 6 then > > it should have set it ALWAYS_EXECUTED in loop 1 already? > > But bb 6 is NOT ALWAYS_EXECUTED for loop 1, it is only ALWAYS_EXECUTED for > loop 2 as it requires n>0. 
Please refer to the attached file > ssa-lim-19.c.138t.lim2. > > ;; > ;; Loop 1 > ;; header 8, latch 12 > ;; depth 1, outer 0 > ;; nodes: 8 12 7 6 4 5 3 13 11 > ;; > ;; Loop 2 > ;; header 3, latch 13 > ;; depth 2, outer 1 > ;; nodes: 3 13 6 4 5 > ;; 2 succs { 10 9 } > ;; 10 succs { 8 } > ;; 11 succs { 3 } > ;; 3 succs { 4 5 } > ;; 4 succs { 6 } > ;; 5 succs { 6 } > ;; 6 succs { 13 7 } > ;; 13 succs { 3 } > ;; 7 succs { 12 9 } > ;; 12 succs { 8 } > ;; 8 succs { 11 7 } > ;; 9 succs { 1 } > > always executed: bb->index:8, loop->num: 1 > always executed: bb->index:7, loop->num: 1 > always executed: bb->index:3, loop->num: 2 > always executed: bb->index:6, loop->num: 2 > > 8<--------------- > / \ | > 11 \ | > / \ | > 3<--- \ | > /\ | \ | > 4 5 | \ | > \/ | \ | > 6 | \ | > |-->13 \ | > |----------> 7 | > /\ | > 9 12--- > > (gdb) x /15x bbd > 0x1354c9b0: 0x00000000 0x00000000 0x00000001 0x00000001 > 0x1354c9c0: 0x00000001 0x00000001 0x00000002 0x00000002 > 0x1354c9d0: 0x00000001 0x00000002 0x00000001 0x00000001 > 0x1354c9e0: 0x00000001 0x00000001 0x00000000 > > our algorithm will walk through 8->11->3->4->5->6->7, > for loop 1, exit at edge 7->9. > > (gdb) x /15x bbd > 0x1354c9b0: 0x00000000 0x00000000 0x00000001 0x00000000 > 0x1354c9c0: 0x00000000 0x00000000 0x00000000 0x00000000 > 0x1354c9d0: 0x00000001 0x00000002 0x00000001 0x00000000 > 0x1354c9e0: 0x00000001 0x00000000 0x00000000 > > If we don't reset bbd to incoming_edge by memcpy, bbd[3],bbd[4],bbd[5] > and bbd[6] is 0 now for loop 2, fill_always_executed_in_1 couldn't set > ALWAYS_EXECUTED correctly for loop 2 at bb 3 and bb 6. 
> > > > >> > >>> > >>>> + while (!bitmap_empty_p (work_set)) > >>>> + { > >>>> + bb_index = bitmap_first_set_bit (work_set); > >>>> + bitmap_clear_bit (work_set, bb_index); > >>>> + bb = BASIC_BLOCK_FOR_FN (cfun, bb_index); > >>>> if (dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>> - last = bb; > >>>> - > >>>> + SET_ALWAYS_EXECUTED_IN (bb, loop); > >>>> if (bitmap_bit_p (contains_call, bb->index)) > >>>> break; > >>> I think you want to continue; here (process remaining worklist > >>> but not continue greedy walking this block) > >> Same as above, if use 'continue' instead of 'break', the algorithm > >> seems also not work again. If inner loop contains a jump to outmost > >> loop, the blocks after the jump block will be set to ALWAYS EXECUTE > >> incorrectly. > >> > >>>> - > >>>> + edge_iterator ei; > >>>> FOR_EACH_EDGE (e, ei, bb->succs) > >>>> { > >>>> - /* If there is an exit from this BB. */ > >>>> if (!flow_bb_inside_loop_p (loop, e->dest)) > >>>> break; > >>> in particular this should keep the outer 'bbi' valid to re-use. > >>> > >>> But again, you want 'continue;' the greedy walk to other edges. > >>> If that's not valid (I'd need to think about this) then with > >>> your patch whether we process an edge depends on the order > >>> of the edge visit so you'd have to walk successors twice, > >>> once to determine whether we can greedily walk any of it > >>> and once to actually do the greedy walk. > >>> > >>> So thinking about it an exit edge is like a not returning call > >>> and thus we indeed should not process any outgoing edges of > >>> this block. > >>> > >>>> + > >>>> /* Or we enter a possibly non-finite loop. */ > >>>> if (flow_loop_nested_p (bb->loop_father, > >>>> e->dest->loop_father) > >>>> && ! finite_loop_p (e->dest->loop_father)) > >>>> break; > >>> I think this is no longer necessary? In any case it would > >>> again be 'continue;', see also above. > >>> > >>> I'm missing skipping of backedges in the walk. 
> >>> > >>>> + > >>>> + if (bbd[e->dest->index] == 1) > >>>> + { > >>>> + bitmap_set_bit (work_set, e->dest->index); > >>>> + bbd[e->dest->index]--; > >>>> + } > >>>> + else if (bbd[e->dest->index] > 1) > >>>> + bbd[e->dest->index]--; > >>> maybe just > >>> > >>> bbd[e->dest->index]--; > >>> if (bbd[e->dest->index] == 0) > >>> bitmap_set_bit (work_set, e->dest->index); > >> They are not equivalent. Former won't decrease if bbd[e->dest->index] > >> is 0, but the latter does. > > we should be never visiting in this case, so sth indeed doesn't > > work as I expected. > > I've verified this won't happen if duplicate the bb_incoming_edges > in fill_always_executed_in_1, so could replace it. > > > > >>>> } > >>>> + > >>>> if (e) > >>>> break; > >>> continue; processing the worklist > >>> > >>>> > >>>> @@ -3232,34 +3245,12 @@ fill_always_executed_in_1 (class loop *loop, > >>>> @@ sbitmap contains_call) > >>>> to disprove this if possible). */ > >>>> if (bb->flags & BB_IRREDUCIBLE_LOOP) > >>>> break; > >>> continue; processing the worklist. I think this has to > >>> come early, before processing successors, next to > >>> the contains_call processing. We could think of ignoring > >>> entering of irreducible regions by tracking exits from them, > >>> but I'm not sure we're identifying the regions in a good > >>> enough way to allow this. > >>> > >>> Likewise possibly infinite inner loops can be processed > >>> when we track exits from those which would be sth like > >>> > >>> /* When this is an exit from an inner loop that we cannot > >>> prove as finite do not follow this edge. */ > >>> if (bb->loop_father != loop > >>> && loop_exit_edge_p (bb->loop_father, e) > >>> && ! 
finite_loop_p (bb->loop_father)) > >>> continue; > >>> > >>>> - > >>>> - if (!flow_bb_inside_loop_p (inn_loop, bb)) > >>>> - break; > >>>> - > >>>> - if (bb->loop_father->header == bb) > >>>> - { > >>>> - if (!dominated_by_p (CDI_DOMINATORS, loop->latch, bb)) > >>>> - break; > >>>> - > >>>> - /* In a loop that is always entered we may proceed anyway. > >>>> - But record that we entered it and stop once we leave it. */ > >>>> - inn_loop = bb->loop_father; > >>>> - } > >>>> - } > >>>> - > >>>> - while (1) > >>>> - { > >>>> - SET_ALWAYS_EXECUTED_IN (last, loop); > >>>> - if (last == loop->header) > >>>> - break; > >>>> - last = get_immediate_dominator (CDI_DOMINATORS, last); > >>>> } > >>>> - > >>>> - free (bbs); > >>>> + free (bbd); > >>>> } > >>>> > >>>> for (loop = loop->inner; loop; loop = loop->next) > >>>> - fill_always_executed_in_1 (loop, contains_call); > >>>> + fill_always_executed_in_1 (loop, contains_call, bbi); > >>>> } > >>>> > >>>> /* Fills ALWAYS_EXECUTED_IN information for basic blocks, i.e. > >>>> @@ -3287,8 +3278,36 @@ fill_always_executed_in (void) > >>>> bitmap_set_bit (contains_call, bb->index); > >>>> } > >>>> + mark_dfs_back_edges (); > >>> Please put this to loop_invariant_motion_in_fun right before > >>> the call to fill_always_executed_in. > >> Done. > >> > >>>> + unsigned array_size = last_basic_block_for_fn (cfun) + 1; > >>>> + int *bb_incoming_edges = XNEWVEC (int, array_size); > >>> There's XCNEWVEC that also zeros the array. > >> OK, thanks. > >> > >>>> + memset (bb_incoming_edges, 0, array_size * sizeof (int)); > >>>> + edge_iterator ei; > >>>> + edge e; > >>>> + > >>>> + FOR_EACH_BB_FN (bb, cfun) > >>>> + { > >>>> + int len = bb->preds->length (); > >>>> + FOR_EACH_EDGE (e, ei, bb->preds) > >>>> + if (e->flags & EDGE_DFS_BACK) > >>>> + len--; > >>>> + bb_incoming_edges[bb->index] = len; > >>>> + } > >>> it would be nice to combine this with the preceeding loop over > >>> all blocks. > >> Done. 
> >> > >>>> + static unsigned long diff = 0; > >>>> + struct timeval tv; > >>>> + gettimeofday (&tv, NULL); > >>>> + unsigned long begin = (unsigned long) tv.tv_sec * 1000000 + > >>>> tv.tv_usec; > >>>> + > >>>> + //for (loop : loops_list (cfun, 0)) > >>>> for (loop = current_loops->tree_root->inner; loop; loop = > >>>> loop->next) > >>>> - fill_always_executed_in_1 (loop, contains_call); > >>>> + fill_always_executed_in_1 (loop, contains_call, bb_incoming_edges); > >>>> + > >>>> + gettimeofday (&tv, NULL); > >>>> + unsigned long end = (unsigned long) tv.tv_sec * 1000000 + tv.tv_usec; > >>>> + diff += end - begin; > >>>> + //printf ("%s,diff:%ld\n", current_function_name (), diff); > >>>> + free (bb_incoming_edges); > >>> Eh, looks very much like work in progress. From the O() complexity > >>> point duplicating bb_incoming_edges for each loop makes it quadratic > >>> again but as said I think we don't need this duplication. > >> Sorry I didn't quite follow your above comments about reusing > >> bb_incoming_edges, > >> we are searching the bbs that dominate loop->latch loop by loop > >> independently, > >> if outer loop changes the inner loops bb_incoming_edges, it will make the > >> inner > >> loop's ALWAYS EXECUTE computation incorrect? > > We're only supposed to greedily walk into always excuted blocks. I see > > this probably breaks down for conditionally executed inner loops like > > > > for () > > { > > if () > > for (); > > A; > > } > > > > where we want to process the inner loop separately but we still want > > to make sure A is marked as always executed in the outer loop. > > Yes, the problem is ALWAYS_EXECUTE blocks are not SUCCESSIVE in CFG. > so we have to check every bb and maintain bb_incoming_edges in each loop. If > there is any block goes to outer loop, we have to 'break' instead of > 'continue' as followed bbs are not ALWAYS_EXECUTE again. 
But we're using a worklist (since the CFG branches) - if we don't continue to pick items from the worklist then the outcome depends on the order of the worklist processing. That can't be correct either (similarly when walking successor edges). > >> Seems duplicating bb_incoming_edges not increases the complexity but the > >> recursive call does? > > Since the bb_incoming_edges array has a size of N and we copy it > > number_of_loop times (which is O(N) bound) the copying effectively > > has O(N^2) cost. So yes, it's the iteration over all loops, > > whether by recursion or iteration. > > > Actually, the copying is outside of the greedy CFG walk loop. I guess there > will be some optimizations in memcpy bb_incoming_edges by libraries? But > maybe only works when most of the value are same then copy_by_pieces? > > And I modified the time-consuming case with many bbs to simulate the extreme > usage, didn't observe quadratic behavior from it, it shows almost linear > build time increase for both base and V3 patch, but V3 patch generates more > accurate ALWAYS_EXECUTED result. The function ratio in perf tool also looks > not bad. At least, it looks like an improvement compared with the base code. > > A function with 10K bbs would cost more than 8 minutes to build, so it seems > duplicating bb_incoming_edges much less then 1000*1000*1000 number of bbs will > not be a performance bottle neck? (While most values are different, the > copying > will exactly slow down when bb_incoming_edges array is quite large like > 1000*1000*1000, > if so, that's really crazy automatic program generators...)
>
> Base:
> number of bbs (lim2+lim4)  total build time (sec)  tree loop invariant motion (usr/sys/wall/Mem)
> 7850+15615   123.99  13.52 ( 11%)  2.16 ( 52%)  15.68 ( 11%)   60M ( 21%)
> 15338+30591  208.98  24.55 ( 12%)  4.04 ( 52%)  28.57 ( 12%)  121M ( 21%)
> 30314+60543  511.54  48.96 ( 10%)  7.70 ( 52%)  56.90 ( 10%)  241M ( 21%)
>
> V3 patch:
> number of bbs (lim2+lim4)  total build time (sec)  tree loop invariant motion (usr/sys/wall/Mem)
> 7850+15615   121.18  12.26 ( 10%)  1.99 ( 49%)  14.42 ( 11%)   60M ( 21%)
> 15338+30591  204.99  24.72 ( 12%)  3.98 ( 52%)  28.85 ( 13%)  121M ( 21%)
> 30314+60543  500.76  48.26 ( 10%)  8.27 ( 51%)  57.12 ( 11%)  241M ( 21%)
>
> Samples: 455K of event 'cycles', Event count (approx.): 426772700334
> Overhead  Command  Shared Object  Symbol
>   7.19%   cc1      cc1            [.] bitmap_set_bit
>   5.86%   cc1      cc1            [.] get_ref_base_and_extent
>   4.08%   cc1      cc1            [.] get_continuation_for_phi
>   3.48%   cc1      cc1            [.] hash_table<vn_ssa_aux_hasher, false, xcallocator>::find
>   3.41%   cc1      cc1            [.] dominated_by_p
>   3.28%   cc1      cc1            [.] compute_transp
>   2.98%   cc1      cc1            [.] bitmap_set_range
>   2.60%   cc1      cc1            [.] bitmap_list_insert_element_after
>   2.44%   cc1      cc1            [.] stmt_may_clobber_ref_p_1
>   2.27%   cc1      cc1            [.] refs_may_alias_p_2
>   2.26%   cc1      cc1            [.] hash_table<vn_reference_hasher, false, xcallocator>::find
>   2.26%   cc1      cc1            [.] df_rd_transfer_function
>   2.07%   cc1      cc1            [.] bitmap_ior_into
>   1.84%   cc1      cc1            [.] find_base_term
>   1.82%   cc1      cc1            [.] get_alias_set
>   1.79%   cc1      cc1            [.] tree_strip_nop_conversions
>   1.73%   cc1      cc1            [.] dfs_enumerate_from
>   1.73%   cc1      cc1            [.] reference_alias_ptr_type_1
>   1.53%   cc1      cc1            [.] alias_sets_conflict_p
>   1.30%   cc1      cc1            [.] c_get_alias_set
>   1.16%   cc1      cc1            [.] bitmap_and_into
>   1.12%   cc1      cc1            [.] indirect_ref_may_alias_decl_p
>   1.04%   cc1      cc1            [.] et_splay
>   0.99%   cc1      libc-2.27.so   [.] __memset_power8
>   0.95%   cc1      cc1            [.] bitmap_ior_and_compl
>   0.86%   cc1      cc1            [.] bitmap_copy
>   0.78%   cc1      cc1            [.] ao_ref_from_mem
>
> > I wonder if we can more cleverly handle this by picking up
> > conditionally executed inner loops on the fly, for example
> > by recording an 'outermost unconditionally executed loop'
> > for each loop header we visit and using that like
> > SET_ALWAYS_EXECUTED_IN (bb, outermost_always_exec_in (bb->loop_father))
> >
> > So make this a single greedy CFG walk starting at the outermost
> > loops.  Loop exits will need to be processed specially then.
>
> I'm not sure: do you mean we only check loop headers and exits?  As
> said above, the ALWAYS_EXECUTED block chains are not contiguous;
> there may be blocks in the middle of the loop that are
> ALWAYS_EXECUTED or that jump out of the loop.  But could there only
> be two such chains?  One starting from loop->header, and another from
> loop->exit->src for loops that have a single exit?

For setting ALWAYS_EXECUTED we rely on dominance.  The greedy walk
should determine whether there are paths in the CFG that skip
execution of the block.  I made the assumption that if there exist
"uninterrupted" paths covering all incoming edges of a block, this
ensures the block is always executed.  And since we only trace such
"always executed" blocks, this holds transitively.

I'll see if I can find some time this week to spend on this problem
and maybe also create a testcase for the existing issue with
contains_call and the dominator order of BB processing.

Richard.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [PATCH] Fix incorrect computation in fill_always_executed_in_1
  2021-08-16  8:46 [PATCH] Fix incorrect computation in fill_always_executed_in_1 Xiong Hu Luo
  2021-08-16 11:46 ` Richard Biener
@ 2021-08-17 20:59 ` Segher Boessenkool
  1 sibling, 0 replies; 17+ messages in thread
From: Segher Boessenkool @ 2021-08-17 20:59 UTC (permalink / raw)
  To: Xiong Hu Luo; +Cc: gcc-patches, dje.gcc, wschmidt, guojiufu, linkw, rguenther

Hi!

As an aside...

On Mon, Aug 16, 2021 at 03:46:12AM -0500, Xiong Hu Luo wrote:
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-19.c

> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c

You can make a saner order for your diffs by putting the testsuite
changes after the rest.  A very minimal example would use

[diff]
	orderfile = .gitorder

with that file containing something like

gcc/*.*

and nothing else even.  Yeah, this is minimal ;-)


Segher

^ permalink raw reply	[flat|nested] 17+ messages in thread
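Concretely, the setup Segher describes uses git's order-file mechanism (the config key is diff.orderFile, also reachable via the -O option of git diff; .gitorder is just his example file name). As I understand the git documentation, each line of the order file is a glob pattern, and files matching earlier patterns are emitted first by git diff and git format-patch:

```
$ printf 'gcc/*.*\n' > .gitorder
$ git config diff.orderFile .gitorder
```

With that in place, hunks for gcc/tree-ssa-loop-im.c would appear before the gcc/testsuite/ changes in the generated patch.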
end of thread, other threads:[~2021-08-31 11:29 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-16  8:46 [PATCH] Fix incorrect computation in fill_always_executed_in_1 Xiong Hu Luo
2021-08-16 11:46 ` Richard Biener
2021-08-17  5:17 ` Xionghu Luo
2021-08-17  5:24 ` Xionghu Luo
2021-08-17  7:12 ` Richard Biener
2021-08-17  9:10 ` [PATCH v2] Fix incomplete " Xionghu Luo
2021-08-19  5:23 ` Xionghu Luo
2021-08-19 12:11 ` Richard Biener
2021-08-24  7:44 ` Xionghu Luo
2021-08-24  8:20 ` Richard Biener
2021-08-26  5:50 ` [PATCH v3] " Xionghu Luo
2021-08-27  7:45 ` Richard Biener
2021-08-30  8:49 ` Xionghu Luo
2021-08-30  9:19 ` Richard Biener
2021-08-31  7:43 ` Xionghu Luo
2021-08-31 11:29 ` Richard Biener
2021-08-17 20:59 ` [PATCH] Fix incorrect " Segher Boessenkool