* More aggressive threading causing loop-interchange-9.c regression
@ 2021-09-07 11:49 Aldy Hernandez
  2021-09-07 14:45 ` Michael Matz
  0 siblings, 1 reply; 33+ messages in thread
From: Aldy Hernandez @ 2021-09-07 11:49 UTC (permalink / raw)
To: Jeff Law, Richard Biener; +Cc: Andrew MacLeod, GCC Mailing List

[-- Attachment #1: Type: text/plain, Size: 2301 bytes --]

Hi folks.

I have a pending patch for the path solver that pulls in global ranges
when available (the stuff in SSA_NAME_RANGE_INFO).  In doing so, I have
run into a regression I was hoping the loop experts could pontificate on.

The regression comes from the simple_reduc_1() function in
tree-ssa/loop-interchange-9.c, and it involves the following path:

=========== BB 5 ============
Imports: n_10(D)  j_19
Exports: n_10(D)  j_13  j_19
         j_13 : j_19(I)
n_10(D)	int [1, 257]
j_19	int [0, 256]
Relational : (j_13 > j_19)
    <bb 5> [local count: 118111600]:
    # sum_21 = PHI <sum_14(4), sum_11(3)>
    c[j_19] = sum_21;
    j_13 = j_19 + 1;
    if (n_10(D) > j_13)
      goto <bb 9>; [89.00%]
    else
      goto <bb 6>; [11.00%]

j_13 : int [1, 257]
5->9  (T) n_10(D) :	int [2, 257]
5->9  (T) j_13 :	int [1, 256]
5->9  (T) j_19 :	int [0, 255]
5->6  (F) n_10(D) :	int [1, 257]
5->6  (F) j_13 :	int [1, 257]
5->6  (F) j_19 :	int [0, 256]

=========== BB 9 ============
n_10(D)	int [2, 257]
j_13	int [1, 256]
Relational : (n_10(D) > j_19)
Relational : (n_10(D) > j_13)
    <bb 9> [local count: 105119324]:
    goto <bb 3>; [100.00%]

=========== BB 3 ============
Imports: n_10(D)
Exports: n_10(D)
n_10(D)	int [1, +INF]
    <bb 3> [local count: 118111600]:
    # j_19 = PHI <j_13(9), 0(7)>
    sum_11 = c[j_19];
    if (n_10(D) > 0)
      goto <bb 8>; [89.00%]
    else
      goto <bb 5>; [11.00%]

With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that
the loop cannot legally iterate outside the size of c[256].  So, j_13
lies within [1, 257] and n_10 is [2, +INF] at the end of the path.
This allows us to thread BB3 to BB8.
Doing so causes the test to fail here:

/* { dg-final { scan-tree-dump-times "Loop_pair<outer:., inner:.> is interchanged" 1 "linterchange" } } */

All the blocks lie within the same loop, and the path passes all the
tests in path_profitable_p().

Is there something about the shape of this path that should make it
invalid in the backward threader, or should the test be marked with
-fdisable-tree-threadN (or something else entirely)?  Note that the
backward threader is allowed to peel through loop headers.

I am attaching the gimple dump as well as the graphviz output for
easier analysis.

Thanks.
Aldy

[-- Attachment #2: il-dump --]
[-- Type: text/plain, Size: 1125 bytes --]

__attribute__((noinline))
void simple_reduc_1 (int n)
{
  int i;
  int sum;
  int j;
  int _1;
  int _2;
  int _3;

  <bb 2> [local count: 14598063]:
  if (n_10(D) > 0)
    goto <bb 7>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 7> [local count: 12992276]:

  <bb 3> [local count: 118111600]:
  # j_19 = PHI <j_13(9), 0(7)>
  sum_11 = c[j_19];
  if (n_10(D) > 0)
    goto <bb 8>; [89.00%]
  else
    goto <bb 5>; [11.00%]

  <bb 8> [local count: 105119324]:

  <bb 4> [local count: 955630225]:
  # sum_20 = PHI <sum_14(10), sum_11(8)>
  # i_22 = PHI <i_15(10), 0(8)>
  _1 = a[i_22][j_19];
  _2 = b[i_22][j_19];
  _3 = _1 * _2;
  sum_14 = _3 + sum_20;
  i_15 = i_22 + 1;
  if (n_10(D) > i_15)
    goto <bb 10>; [89.00%]
  else
    goto <bb 5>; [11.00%]

  <bb 10> [local count: 850510901]:
  goto <bb 4>; [100.00%]

  <bb 5> [local count: 118111600]:
  # sum_21 = PHI <sum_14(4), sum_11(3)>
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 9>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 9> [local count: 105119324]:
  goto <bb 3>; [100.00%]

  <bb 6> [local count: 14598063]:
  return;
}

[-- Attachment #3: graph.ps --]
[-- Type: application/postscript, Size: 22338 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression
  2021-09-07 11:49 More aggressive threading causing loop-interchange-9.c regression Aldy Hernandez
@ 2021-09-07 14:45 ` Michael Matz
  2021-09-08 10:44   ` Aldy Hernandez
  0 siblings, 1 reply; 33+ messages in thread
From: Michael Matz @ 2021-09-07 14:45 UTC (permalink / raw)
To: Aldy Hernandez; +Cc: Jeff Law, Richard Biener, GCC Mailing List

Hello,

On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote:

> The regression comes from the simple_reduc_1() function in
> tree-ssa/loop-interchange-9.c, and it involves the following path:
>
> =========== BB 5 ============
> Imports: n_10(D)  j_19
> Exports: n_10(D)  j_13  j_19
>          j_13 : j_19(I)
> n_10(D)	int [1, 257]
> j_19	int [0, 256]
> Relational : (j_13 > j_19)
>     <bb 5> [local count: 118111600]:
>     # sum_21 = PHI <sum_14(4), sum_11(3)>
>     c[j_19] = sum_21;
>     j_13 = j_19 + 1;
>     if (n_10(D) > j_13)
>       goto <bb 9>; [89.00%]
>     else
>       goto <bb 6>; [11.00%]

So, this is the outer loop's exit block ...

> =========== BB 9 ============
> n_10(D)	int [2, 257]
> j_13	int [1, 256]
> Relational : (n_10(D) > j_19)
> Relational : (n_10(D) > j_13)
>     <bb 9> [local count: 105119324]:
>     goto <bb 3>; [100.00%]

... this is the outer loop's latch block ...

> =========== BB 3 ============
> Imports: n_10(D)
> Exports: n_10(D)
> n_10(D)	int [1, +INF]
>     <bb 3> [local count: 118111600]:
>     # j_19 = PHI <j_13(9), 0(7)>
>     sum_11 = c[j_19];
>     if (n_10(D) > 0)
>       goto <bb 8>; [89.00%]
>     else
>       goto <bb 5>; [11.00%]

... and this is the outer loop's header, as well as the inner loop's
pre-header.

The attempted thread hence involves a back edge (of the outer loop) and
a loop-entry edge (bb3->bb8) of the inner loop.  Doing that threading
would destroy some properties that our loop optimizers rely on, e.g.
that the loop header of a natural loop is only reached by two edges
(entry edge and back edge), that latch blocks are empty, and that
there's only a single latch.
(What exactly we require depends on several flags in
loops_state_satisfies_p.)

> With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that
> the loop cannot legally iterate outside the size of c[256].  So, j_13
> lies within [1, 257] and n_10 is [2, +INF] at the end of the path.
> This allows us to thread BB3 to BB8.

So, IIUC doing this threading would create a new entry to BB8: it would
then be entered by its natural entry (bb3), by its natural back edge
(whatever bb that is now) and the new entry from the threaded outer back
edge (bb9 or bb5?).

The inner loop wouldn't then be recognized as natural anymore, and the
whole nest not as perfect, and hence loop interchange can't easily
happen anymore.  Most of our other structural loop optimizers would have
problems with that situation as well.

> All the blocks lie within the same loop, and the path passes all the
> tests in path_profitable_p().
>
> Is there something about the shape of this path that should make it
> invalid in the backward threader, or should the test be marked with
> -fdisable-tree-threadN (or something else entirely)?

This is a functionality test checking that the very necessary
interchange in this test does happen with default flags plus
-floop-interchange (with the intention of it being enabled by -O3 or
with profile info).  So no additional flags can be added without
changing the meaning of this test.

> Note that the backward threader is allowed to peel through loop
> headers.

Something needs to give way in the path threaders before loop
optimizations: either threading through back edges, through loop
latches, or through loop headers needs to be disabled.  I think
traditionally at least threading through latches should be disabled,
because doing so usually destroys simple loop structure.  I see that
profitable_path_p() of the backward threader wants to be careful in some
situations involving loops and latches; possibly it's not careful enough
yet for the additional power brought by ranger.

See also the additional tests tree-cfgcleanup.c:tree_forwarder_block_p
is doing when loops are active.  After the structural loop optimizers
the threader can go wild and thread everything it finds.

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-07 14:45 ` Michael Matz @ 2021-09-08 10:44 ` Aldy Hernandez 2021-09-08 13:13 ` Richard Biener 0 siblings, 1 reply; 33+ messages in thread From: Aldy Hernandez @ 2021-09-08 10:44 UTC (permalink / raw) To: Michael Matz; +Cc: Jeff Law, Richard Biener, GCC Mailing List, Andrew MacLeod [-- Attachment #1: Type: text/plain, Size: 6823 bytes --] First of all. Thanks so much for this incredibly useful explanation. It helps a lot. I'm still trying to wrap my head around this, so please bear with me. On 9/7/21 4:45 PM, Michael Matz wrote: > Hello, > > On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote: > >> The regression comes from the simple_reduc_1() function in >> tree-ssa/loop-interchange-9.c, and it involves the following path: >> >> =========== BB 5 ============ >> Imports: n_10(D) j_19 >> Exports: n_10(D) j_13 j_19 >> j_13 : j_19(I) >> n_10(D) int [1, 257] >> j_19 int [0, 256] >> Relational : (j_13 > j_19) >> <bb 5> [local count: 118111600]: >> # sum_21 = PHI <sum_14(4), sum_11(3)> >> c[j_19] = sum_21; >> j_13 = j_19 + 1; >> if (n_10(D) > j_13) >> goto <bb 9>; [89.00%] >> else >> goto <bb 6>; [11.00%] > > So, this is the outer loops exit block ... > >> =========== BB 9 ============ >> n_10(D) int [2, 257] >> j_13 int [1, 256] >> Relational : (n_10(D) > j_19) >> Relational : (n_10(D) > j_13) >> <bb 9> [local count: 105119324]: >> goto <bb 3>; [100.00%] > > ... this the outer loops latch block ... > >> =========== BB 3 ============ >> Imports: n_10(D) >> Exports: n_10(D) >> n_10(D) int [1, +INF] >> <bb 3> [local count: 118111600]: >> # j_19 = PHI <j_13(9), 0(7)> >> sum_11 = c[j_19]; >> if (n_10(D) > 0) >> goto <bb 8>; [89.00%] >> else >> goto <bb 5>; [11.00%] > > ... and this the outer loops header, as well as inner loops pre-header. Actually, the inner loop's pre-header seems to be BB8, which is empty, and leads into the BB4 which is the header of the inner loop. 
> The attempted thread hence involves a back-edge (of the outer loop) and a
> loop-entry edge (bb3->bb8) of the inner loop.  Doing that threading would
> destroy some properties that our loop optimizers rely on, e.g. that the
> loop header of a natural loop is only reached by two edges: entry edge and
> back edge, and that the latch blocks are empty and that there's only a
> single latch.  (What exactly we require depends on several flags in
> loops_state_satisfies_p).

But threading 3->8 doesn't involve a loop-entry edge.  It involves a new
edge into the *pre-header* of the inner loop.

I am attaching the IL right after threading and before we clean up the
CFG (il-after-threading.txt).

After threading, we have rewritten the 5->9 edge into 5->11:

  <bb 5> [local count: 118111600]:
  # sum_21 = PHI <sum_14(4), sum_11(3)>
  c[j_19] = sum_21;
  j_13 = j_19 + 1;
  if (n_10(D) > j_13)
    goto <bb 11>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 11> [local count: 105119324]:

  <bb 12> [local count: 105119324]:
  # j_7 = PHI <j_13(11)>
  sum_6 = c[j_19];
  goto <bb 8>; [100.00%]

  <bb 9> [local count: 0]:   ;; This becomes unreachable.
  goto <bb 3>; [100.00%]

Notice BB9 becomes unreachable, so we're basically eliding the latch in
BB9.

Question: could BB12 not be considered the loop latch now, and BB8 the
outer loop's entry?  This would also mean that BB3 is now outside of the
outer loop, which means the threader peeled off the first iteration of
the loop.  Or is it a requirement that the latch be empty?

>> With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that
>> the loop cannot legally iterate outside the size of c[256].  So, j_13
>> lies within [1, 257] and n_10 is [2, +INF] at the end of the path.
>> This allows us to thread BB3 to BB8.
>
> So, IIUC doing this threading would create a new entry to BB8: it would
> then be entered by its natural entry (bb3), by its natural back edge
> (whatever bb that is now) and the new entry from the threaded outer back
> edge (bb9 or bb5?).
As I mentioned, BB8 looks like the pre-header to the inner loop. But yes, it now has multiple entries: the edge from BB3, and the back edge from BB12 (which seems like it could be a new latch, since it's the only back edge). > > The inner loop wouldn't then be recognized as natural anymore and the > whole nest not as perfect, and hence loop interchange can't easily happen > anymore. Most other structural loop optimizers of us would have problems > with that situation as well. If I clean up the CFG and call loop_optimizer_init() again at this point, what is destroyed is the outer loop, not the inner loop. If you look at the IL at this point (il-after-cleanup-and-rerunning-loop-init.txt), we have a latch in BB7 going back up to BB4, though the conditional is now in BB6 (eeech): <bb 4> ... ... ... <bb 6> [local count: 118111600]: # sum_21 = PHI <sum_14(5), sum_11(3)> # j_18 = PHI <j_17(5), j_19(3)> c[j_18] = sum_21; j_13 = j_18 + 1; if (n_10(D) > j_13) goto <bb 7>; [89.00%] else goto <bb 8>; [11.00%] <bb 7> [local count: 105119324]: # j_7 = PHI <j_13(6)> sum_6 = c[j_7]; goto <bb 4>; [100.00%] Perhaps, this looks sufficiently different that the loop optimizer can't recognize it as a loop? > >> All the blocks lie within the same loop, and the path passes all the >> tests in path_profitable_p(). >> >> Is there something about the shape of this path that should make it >> invalid in the backward threader, or should the test be marked with >> -fdisable-tree-threadN (or something else entirely)? > > This is a functionality test checking that the very necessary interchange > in this test does happen with default plus -floop-interchange (with the > intention of it being enabled by O3 or with profile info). So no > additional flags can be added without changing the meaning of this test. Agreed. We're shuffling things around. What I'm missing here is why we're unable to recognize this new form as a loop. 
That being said, I am not above figuring out what the magic incantation is to keep the threader from doing this transformation. However, I just want to make sure we're not papering over something else. > >> Note that the >> backward threader is allowed to peel through loop headers. > > Something needs to give way in the path threaders before loop > optimizations: either threading through back edges, through loop latches > or through loop headers needs to be disabled. I think traditionally at > least threading through latches should be disabled, because doing so > usually destroys simple loop structure. I see that profitable_path_p() of > the backward threader wants to be careful in some situations involving > loops and latches; possibly it's not careful enough yet for the additional > power brought by ranger. > > See also the additional tests tree-cfgcleanup.c:tree_forwarder_block_p is > doing when loops are active. > > After the structural loop optimizers the threader can go wild and thread > everything it finds. Thanks again. This is very helpful. 
Aldy [-- Attachment #2: il-after-threading.txt --] [-- Type: text/plain, Size: 1260 bytes --] __attribute__((noinline)) void simple_reduc_1 (int n) { int i; int sum; int j; int _1; int _2; int _3; <bb 2> [local count: 14598063]: if (n_10(D) > 0) goto <bb 7>; [89.00%] else goto <bb 6>; [11.00%] <bb 7> [local count: 12992276]: <bb 3> [local count: 12992276]: # j_19 = PHI <j_13(9), 0(7)> sum_11 = c[j_19]; if (n_10(D) > 0) goto <bb 8>; [66.92%] else goto <bb 5>; [33.08%] <bb 8> [local count: 105119324]: <bb 4> [local count: 955630225]: # sum_20 = PHI <sum_14(10), sum_11(8)> # i_22 = PHI <i_15(10), 0(8)> _1 = a[i_22][j_19]; _2 = b[i_22][j_19]; _3 = _1 * _2; sum_14 = _3 + sum_20; i_15 = i_22 + 1; if (n_10(D) > i_15) goto <bb 10>; [89.00%] else goto <bb 5>; [11.00%] <bb 10> [local count: 850510901]: goto <bb 4>; [100.00%] <bb 5> [local count: 118111600]: # sum_21 = PHI <sum_14(4), sum_11(3)> c[j_19] = sum_21; j_13 = j_19 + 1; if (n_10(D) > j_13) goto <bb 11>; [89.00%] else goto <bb 6>; [11.00%] <bb 11> [local count: 105119324]: <bb 12> [local count: 105119324]: # j_7 = PHI <j_13(11)> sum_6 = c[j_19]; goto <bb 8>; [100.00%] <bb 9> [local count: 0]: goto <bb 3>; [100.00%] <bb 6> [local count: 14598063]: return; } [-- Attachment #3: il-after-cleanup-and-rerunning-loop-init.txt --] [-- Type: text/plain, Size: 1196 bytes --] void simple_reduc_1 (int n) { int i; int sum; int j; int _1; int _2; int _3; <bb 2> [local count: 14598063]: if (n_10(D) > 0) goto <bb 3>; [89.00%] else goto <bb 8>; [11.00%] <bb 3> [local count: 12992276]: # j_19 = PHI <0(2)> sum_11 = c[j_19]; if (n_10(D) > 0) goto <bb 4>; [66.92%] else goto <bb 6>; [33.08%] <bb 4> [local count: 105119324]: # sum_5 = PHI <sum_11(3), sum_6(7)> # j_17 = PHI <j_19(3), j_7(7)> <bb 5> [local count: 955630225]: # sum_20 = PHI <sum_14(9), sum_5(4)> # i_22 = PHI <i_15(9), 0(4)> _1 = a[i_22][j_17]; _2 = b[i_22][j_17]; _3 = _1 * _2; sum_14 = _3 + sum_20; i_15 = i_22 + 1; if (n_10(D) > i_15) goto <bb 9>; [89.00%] else goto <bb 6>; 
[11.00%] <bb 9> [local count: 850510901]: goto <bb 5>; [100.00%] <bb 6> [local count: 118111600]: # sum_21 = PHI <sum_14(5), sum_11(3)> # j_18 = PHI <j_17(5), j_19(3)> c[j_18] = sum_21; j_13 = j_18 + 1; if (n_10(D) > j_13) goto <bb 7>; [89.00%] else goto <bb 8>; [11.00%] <bb 7> [local count: 105119324]: # j_7 = PHI <j_13(6)> sum_6 = c[j_7]; goto <bb 4>; [100.00%] <bb 8> [local count: 14598063]: return; } ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-08 10:44 ` Aldy Hernandez @ 2021-09-08 13:13 ` Richard Biener 2021-09-08 13:25 ` Aldy Hernandez 0 siblings, 1 reply; 33+ messages in thread From: Richard Biener @ 2021-09-08 13:13 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On Wed, Sep 8, 2021 at 12:44 PM Aldy Hernandez <aldyh@redhat.com> wrote: > > First of all. Thanks so much for this incredibly useful explanation. > It helps a lot. > > I'm still trying to wrap my head around this, so please bear with me. > > On 9/7/21 4:45 PM, Michael Matz wrote: > > Hello, > > > > On Tue, 7 Sep 2021, Aldy Hernandez via Gcc wrote: > > > >> The regression comes from the simple_reduc_1() function in > >> tree-ssa/loop-interchange-9.c, and it involves the following path: > >> > >> =========== BB 5 ============ > >> Imports: n_10(D) j_19 > >> Exports: n_10(D) j_13 j_19 > >> j_13 : j_19(I) > >> n_10(D) int [1, 257] > >> j_19 int [0, 256] > >> Relational : (j_13 > j_19) > >> <bb 5> [local count: 118111600]: > >> # sum_21 = PHI <sum_14(4), sum_11(3)> > >> c[j_19] = sum_21; > >> j_13 = j_19 + 1; > >> if (n_10(D) > j_13) > >> goto <bb 9>; [89.00%] > >> else > >> goto <bb 6>; [11.00%] > > > > So, this is the outer loops exit block ... > > > >> =========== BB 9 ============ > >> n_10(D) int [2, 257] > >> j_13 int [1, 256] > >> Relational : (n_10(D) > j_19) > >> Relational : (n_10(D) > j_13) > >> <bb 9> [local count: 105119324]: > >> goto <bb 3>; [100.00%] > > > > ... this the outer loops latch block ... > > > >> =========== BB 3 ============ > >> Imports: n_10(D) > >> Exports: n_10(D) > >> n_10(D) int [1, +INF] > >> <bb 3> [local count: 118111600]: > >> # j_19 = PHI <j_13(9), 0(7)> > >> sum_11 = c[j_19]; > >> if (n_10(D) > 0) > >> goto <bb 8>; [89.00%] > >> else > >> goto <bb 5>; [11.00%] > > > > ... and this the outer loops header, as well as inner loops pre-header. 
> > Actually, the inner loop's pre-header seems to be BB8, which is empty, > and leads into the BB4 which is the header of the inner loop. > > > The attempted thread hence involves a back-edge (of the outer loop) and a > > loop-entry edge (bb3->bb8) of the inner loop. Doing that threading would > > destroy some properties that our loop optimizers rely on, e.g. that the > > loop header of a natural loop is only reached by two edges: entry edge and > > back edge, and that the latch blocks are empty and that there's only a > > single latch. (What exactly we require depends on several flags in > > loops_state_satisfies_p). > > But threading 3->8 doesn't involve a loop-entry edge. It involves a new > edge into the *pre-header* of the inner loop. > > I am attaching the IL right after threading and before we clean up the > CFG (il-after-threading.txt). > > After threading we have have rewritten the 5->9 edge into 5->11: > > <bb 5> [local count: 118111600]: > # sum_21 = PHI <sum_14(4), sum_11(3)> > c[j_19] = sum_21; > j_13 = j_19 + 1; > if (n_10(D) > j_13) > goto <bb 11>; [89.00%] > else > goto <bb 6>; [11.00%] > > <bb 11> [local count: 105119324]: > > <bb 12> [local count: 105119324]: > # j_7 = PHI <j_13(11)> > sum_6 = c[j_19]; > goto <bb 8>; [100.00%] > > <bb 9> [local count: 0]: ;; This becomes unreachable. > goto <bb 3>; [100.00%] > > Notice BB9 becomes unreachable, so we're basically eliding the latch in BB9. > > Question: could BB12 not be considered the loop latch now and BB8 be the > outer loop's entry? This would also mean, that BB3 is now outside of > the outer loop, which means the threader peeled off the first iteration > of the loop. Or is it a requirement that the latch be empty? > > > > >> With knowledge from previous passes (SSA_NAME_RANGE_INFO), we know that > >> the loop cannot legally iterate outside the size of c[256]. So, j_13 > >> lies within [1, 257] and n_10 is [2, +INF] at the end of the path. > >> This allows us to thread BB3 to BB8. 
> > > > So, IIUC doing this threading would create a new entry to BB8: it would > > then be entered by its natural entry (bb3), by its natural back edge > > (whatever bb that is now) and the new entry from the threaded outer back > > edge (bb9 or bb5?). > > As I mentioned, BB8 looks like the pre-header to the inner loop. But > yes, it now has multiple entries: the edge from BB3, and the back edge > from BB12 (which seems like it could be a new latch, since it's the only > back edge). It would be helpful to have the patch causing the issue to look at the IL. But as Micha said, there needs to be a perfect loop nest for interchange to work. Richard. > > > > The inner loop wouldn't then be recognized as natural anymore and the > > whole nest not as perfect, and hence loop interchange can't easily happen > > anymore. Most other structural loop optimizers of us would have problems > > with that situation as well. > > If I clean up the CFG and call loop_optimizer_init() again at this > point, what is destroyed is the outer loop, not the inner loop. If you > look at the IL at this point > (il-after-cleanup-and-rerunning-loop-init.txt), we have a latch in BB7 > going back up to BB4, though the conditional is now in BB6 (eeech): > > <bb 4> > ... > ... > ... > > > <bb 6> [local count: 118111600]: > # sum_21 = PHI <sum_14(5), sum_11(3)> > # j_18 = PHI <j_17(5), j_19(3)> > c[j_18] = sum_21; > j_13 = j_18 + 1; > if (n_10(D) > j_13) > goto <bb 7>; [89.00%] > else > goto <bb 8>; [11.00%] > > <bb 7> [local count: 105119324]: > # j_7 = PHI <j_13(6)> > sum_6 = c[j_7]; > goto <bb 4>; [100.00%] > > Perhaps, this looks sufficiently different that the loop optimizer can't > recognize it as a loop? > > > > >> All the blocks lie within the same loop, and the path passes all the > >> tests in path_profitable_p(). 
> >> > >> Is there something about the shape of this path that should make it > >> invalid in the backward threader, or should the test be marked with > >> -fdisable-tree-threadN (or something else entirely)? > > > > This is a functionality test checking that the very necessary interchange > > in this test does happen with default plus -floop-interchange (with the > > intention of it being enabled by O3 or with profile info). So no > > additional flags can be added without changing the meaning of this test. > > Agreed. We're shuffling things around. What I'm missing here is why > we're unable to recognize this new form as a loop. That being said, I > am not above figuring out what the magic incantation is to keep the > threader from doing this transformation. However, I just want to make > sure we're not papering over something else. > > > > >> Note that the > >> backward threader is allowed to peel through loop headers. > > > > Something needs to give way in the path threaders before loop > > optimizations: either threading through back edges, through loop latches > > or through loop headers needs to be disabled. I think traditionally at > > least threading through latches should be disabled, because doing so > > usually destroys simple loop structure. I see that profitable_path_p() of > > the backward threader wants to be careful in some situations involving > > loops and latches; possibly it's not careful enough yet for the additional > > power brought by ranger. > > > > See also the additional tests tree-cfgcleanup.c:tree_forwarder_block_p is > > doing when loops are active. > > > > After the structural loop optimizers the threader can go wild and thread > > everything it finds. > > Thanks again. This is very helpful. > > Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-08 13:13 ` Richard Biener @ 2021-09-08 13:25 ` Aldy Hernandez 2021-09-08 13:49 ` Richard Biener 0 siblings, 1 reply; 33+ messages in thread From: Aldy Hernandez @ 2021-09-08 13:25 UTC (permalink / raw) To: Richard Biener; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod [-- Attachment #1: Type: text/plain, Size: 535 bytes --] > It would be helpful to have the patch causing the issue to look at the IL. > But as Micha said, there needs to be a perfect loop nest for interchange > to work. > > Richard. Absolutely! I'm attaching the reduced testcase, as well as the patch. The problematic thread shows up in the thread2 dump: Checking profitability of path (backwards): bb:3 (4 insns) bb:9 (0 insns) bb:5 Control statement insns: 2 Overall: 2 insns Registering FSM jump thread: (5, 9) incoming edge; (9, 3) (3, 8) nocopy; (3, 8) Thanks. Aldy [-- Attachment #2: p.patch --] [-- Type: text/x-patch, Size: 1312 bytes --] commit 1bf3f76a5ff075396b5b9f5f88d6b18649dac2ce Author: Aldy Hernandez <aldyh@redhat.com> Date: Sun Sep 5 16:54:00 2021 +0200 Take into account global ranges when folding statements in path solver. The path solver used by the backwards threader can refine a folded range by taking into account any global range known. This patch intersects the folded range with whatever global information we have. gcc/ChangeLog: * gimple-range-path.cc (path_range_query::internal_range_of_expr): Intersect with global ranges. 
> FAIL: gcc.dg/tree-ssa/loop-interchange-9.c scan-tree-dump-times linterchange "Loop_pair<outer:., inner:.> is interchanged" 1 > FAIL: gcc.dg/tree-ssa/cunroll-15.c scan-tree-dump optimized "return 1;" diff --git a/gcc/gimple-range-path.cc b/gcc/gimple-range-path.cc index a4fa3b296ff..c616b65756f 100644 --- a/gcc/gimple-range-path.cc +++ b/gcc/gimple-range-path.cc @@ -127,6 +127,9 @@ path_range_query::internal_range_of_expr (irange &r, tree name, gimple *stmt) basic_block bb = stmt ? gimple_bb (stmt) : exit_bb (); if (stmt && range_defined_in_block (r, name, bb)) { + if (TREE_CODE (name) == SSA_NAME) + r.intersect (gimple_range_global (name)); + set_cache (r, name); return true; } [-- Attachment #3: a.c --] [-- Type: text/x-csrc, Size: 461 bytes --] /* { dg-do run } */ /* { dg-options "-O2 -floop-interchange -fdump-tree-linterchange-details" } */ /* { dg-require-effective-target size20plus } */ /* { dg-skip-if "too big data segment" { visium-*-* } } */ #define M 256 int a[M][M], b[M][M], c[M], d[M]; void __attribute__((noinline)) simple_reduc_1 (int n) { for (int j = 0; j < n; j++) { int sum = c[j]; for (int i = 0; i < n; i++) sum = sum + a[i][j]*b[i][j]; c[j] = sum; } } ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression
  2021-09-08 13:25 ` Aldy Hernandez
@ 2021-09-08 13:49 ` Richard Biener
  2021-09-08 16:19   ` Aldy Hernandez
  0 siblings, 1 reply; 33+ messages in thread
From: Richard Biener @ 2021-09-08 13:49 UTC (permalink / raw)
To: Aldy Hernandez; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod

On Wed, Sep 8, 2021 at 3:25 PM Aldy Hernandez <aldyh@redhat.com> wrote:
>
> > It would be helpful to have the patch causing the issue to look at the IL.
> > But as Micha said, there needs to be a perfect loop nest for interchange
> > to work.
> >
> > Richard.
>
> Absolutely!  I'm attaching the reduced testcase, as well as the patch.
>
> The problematic thread shows up in the thread2 dump:
>
> Checking profitability of path (backwards):  bb:3 (4 insns) bb:9 (0
> insns) bb:5
>   Control statement insns: 2
>   Overall: 2 insns
>   Registering FSM jump thread: (5, 9) incoming edge;  (9, 3) (3, 8)
> nocopy; (3, 8)

Well, so as Micha said, the threading destroys the outer loop by making
it have two entries - we actually _do_ recover from this, but it results
in a non-empty latch of the outer loop, and thus the loop is no longer a
perfect nest.  The interchange pass has special sauce to recover from
the loop store motion applied, but not to handle the kind of loop
rotation the jump threading caused.

The forward threader guards against this by simply disallowing
threadings that involve different loops.  As I see it, the threading
done here should be 7->3 (outer loop entry) to bb 8 rather than one
involving the back edge.  Also note the condition is invariant in the
loop, and in fact subsumed by the condition outside of the loop, so it
should have been simplified by VRP after pass_ch; but I see there's a
jump threading pass in between pass_ch and the next VRP, which is likely
the problem.

Why does jump threading not try to simplify a condition before
attempting to thread through it?  After all, ranger should be around?

Richard.

> Thanks.
> Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-08 13:49 ` Richard Biener @ 2021-09-08 16:19 ` Aldy Hernandez 2021-09-08 16:39 ` Michael Matz 0 siblings, 1 reply; 33+ messages in thread From: Aldy Hernandez @ 2021-09-08 16:19 UTC (permalink / raw) To: Richard Biener; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On 9/8/21 3:49 PM, Richard Biener wrote: > On Wed, Sep 8, 2021 at 3:25 PM Aldy Hernandez <aldyh@redhat.com> wrote: >> >>> It would be helpful to have the patch causing the issue to look at the IL. >>> But as Micha said, there needs to be a perfect loop nest for interchange >>> to work. >>> >>> Richard. >> >> Absolutely! I'm attaching the reduced testcase, as well as the patch. >> >> The problematic thread shows up in the thread2 dump: >> >> Checking profitability of path (backwards): bb:3 (4 insns) bb:9 (0 >> insns) bb:5 >> Control statement insns: 2 >> Overall: 2 insns >> Registering FSM jump thread: (5, 9) incoming edge; (9, 3) (3, 8) >> nocopy; (3, 8) > > Well, so as Micha said, the threading destroys the outer loop by > making it have two entries - we actually _do_ recover from this > but it results in a non-empty latch of the outer loop and thus > the loop is no longer a perfect nest. The interchange pass has > special-sauce to recover from the loop store motion applied > but not to handle the kind of loop rotation the jump threading > caused. Understood. The backedge into BB8 and the non-empty latch seem to cause problems. > > The forward threader guards against this by simply disallowing > threadings that involve different loops. As I see The thread in question (5->9->3) is all within the same outer loop, though. BTW, the backward threader also disallows threading across different loops (see path_crosses_loops variable). > the threading done here should be 7->3 (outer loop entry) to bb 8 > rather than one involving the backedge. 
Also note the condition > is invariant in the loop and in fact subsumed by the condition > outside of the loop and it should have been simplified by > VRP after pass_ch but I see there's a jump threading pass > inbetween pass_ch and the next VRP which is likely the > problem. A 7->3->8 thread would cross loops though, because 7 is outside the outer loop. > > Why does jump threading not try to simplify a condition before > attempting to thread through it? After all ranger should be around? That's a very good question ;-). The current code does not use the ranger to solve unknowns. Anything further back than the first block in a path will return varying. The plan was to add this ranger solving functionality as a follow-up. I have a whole slew of pending patches adding precisely this functionality. We should be able to solve ranges outside the path with ranger, as well as relationals occurring in a path. However, even if there are alternate ways of threading this IL, something like 5->9->3 could still happen. We need a way to disallow this. I'm having a hard time determining the hammer for this. I would vote for disabling threading through latches, but it seems the backward threader is aware of this scenario and allows it anyhow (see threaded_through_latch variable). Ughh. Thanks for looking into this. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-08 16:19 ` Aldy Hernandez @ 2021-09-08 16:39 ` Michael Matz 2021-09-08 18:13 ` Michael Matz 0 siblings, 1 reply; 33+ messages in thread From: Michael Matz @ 2021-09-08 16:39 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod Hello, On Wed, 8 Sep 2021, Aldy Hernandez wrote: > > The forward threader guards against this by simply disallowing > > threadings that involve different loops. As I see > > The thread in question (5->9->3) is all within the same outer loop, > though. BTW, the backward threader also disallows threading across > different loops (see path_crosses_loops variable). > > > the threading done here should be 7->3 (outer loop entry) to bb 8 > > rather than one involving the backedge. Also note the condition is > > invariant in the loop and in fact subsumed by the condition outside of > > the loop and it should have been simplified by VRP after pass_ch but I > > see there's a jump threading pass inbetween pass_ch and the next VRP > > which is likely the problem. > > A 7->3->8 thread would cross loops though, because 7 is outside the > outer loop. ... > However, even if there are alternate ways of threading this IL, > something like 5->9->3 could still happen. We need a way to disallow > this. I'm having a hard time determining the hammer for this. I would > vote for disabling threading through latches, but it seems the backward > threader is aware of this scenario and allows it anyhow (see > threaded_through_latch variable). Ughh. The backward threader seems to want to be careful with latches, but still allow it in some situations, in particular when doing so doesn't create a loop with non-simple latches (which is basically a single and empty latch block). If this improvement under discussion leads to creating a non-empty latch then those checks aren't restrictive enough (anymore). 
I think threading through a latch is always dubious regarding the loop structure, it's basically either loop rotation or iteration peeling, even if it doesn't cause non-simple latches. Those transformations should probably be left to a loop optimizer, or be only done when destroying loop structure is fine (i.e. late). Maybe it's possible to not disable threading over latches altogether in the backward threader (like it's tried now), but I haven't looked at the specific situation here in depth, so take my view only as opinion from a large distance :-) Does anything break if you brutally disable any backward threading when any of the involved blocks is a latch when current_loops is set? (I guess for that latter test to be effective you want to disable the loop_optimizer_init() for the "late" jump thread passes) Ciao, Michael. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-08 16:39 ` Michael Matz @ 2021-09-08 18:13 ` Michael Matz 2021-09-09 6:57 ` Richard Biener 2021-09-09 8:14 ` Aldy Hernandez 0 siblings, 2 replies; 33+ messages in thread From: Michael Matz @ 2021-09-08 18:13 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod Hello, [lame answer to self] On Wed, 8 Sep 2021, Michael Matz wrote: > > > The forward threader guards against this by simply disallowing > > > threadings that involve different loops. As I see > > > > The thread in question (5->9->3) is all within the same outer loop, > > though. BTW, the backward threader also disallows threading across > > different loops (see path_crosses_loops variable). ... > Maybe it's possible to not disable threading over latches alltogether in > the backward threader (like it's tried now), but I haven't looked at the > specific situation here in depth, so take my view only as opinion from a > large distance :-) I've now looked at the concrete situation. So yeah, the whole path is in the same loop, crosses the latch, _and there's code following the latch on that path_. (I.e. the latch isn't the last block in the path). In particular, after loop_optimizer_init() (before any threading) we have: <bb 3> [local count: 118111600]: # j_19 = PHI <j_13(9), 0(7)> sum_11 = c[j_19]; if (n_10(D) > 0) goto <bb 8>; [89.00%] else goto <bb 5>; [11.00%] <bb 8> [local count: 105119324]: ... <bb 5> [local count: 118111600]: # sum_21 = PHI <sum_14(4), sum_11(3)> c[j_19] = sum_21; j_13 = j_19 + 1; if (n_10(D) > j_13) goto <bb 9>; [89.00%] else goto <bb 6>; [11.00%] <bb 9> [local count: 105119324]: goto <bb 3>; [100.00%] With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the pre-header of inner loop, but more importantly something that's not at the start of the outer loop. Now, any thread that includes the backedge 9->3 _including_ its destination (i.e. 
where the backedge isn't the last to-be-redirected edge) necessarily duplicates all code from that destination onto the back edge. Here it's the load from c[j] into sum_11. The important part is the code is emitted onto the back edge, conceptually; in reality it's simply included into the (new) latch block (the duplicate of bb9, which is bb12 intermediately, then named bb7 after cfg_cleanup). That's what we can't have for some of our structural loop optimizers: there must be no code executed after the exit test (e.g. in the latch block). (This requirement makes reasoning about which code is or isn't executed completely for an iteration trivial; simply everything in the body is always executed; e.g. loop interchange uses this to check that there are no memory references after the exit test, because those would then be only conditional and hence make loop interchange very awkward). Note that this situation can't be later rectified anymore: the duplicated instructions (because they are memory refs) must remain after the exit test. Only by rerolling/unrotating the loop (i.e. noticing that the memory refs on the loop-entry path and on the back edge are equivalent) would that be possible, but that's something we aren't capable of. Even if we were that would simply just revert the whole work that the threader did, so it's better to not even do that to start with. I believe something like below would be appropriate, it disables threading if the path contains a latch at the non-last position (due to being backwards on the non-first position in the array). I.e. it disables rotating the loop if there's danger of polluting the back edge. It might be improved if the blocks following (preceding!) the latch are themselves empty because then no code is duplicated. It might also be improved if the latch is already non-empty. That code should probably only be active before the loop optimizers, but currently the backward threader isn't differentiating between before/after loop-optims.
I haven't tested this patch at all, except that it fixes the testcase :) Ciao, Michael. diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c index 449232c7715..528a753b886 100644 --- a/gcc/tree-ssa-threadbackward.c +++ b/gcc/tree-ssa-threadbackward.c @@ -600,6 +600,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, loop_p loop = m_path[0]->loop_father; bool path_crosses_loops = false; bool threaded_through_latch = false; + bool latch_within_path = false; bool multiway_branch_in_path = false; bool threaded_multiway_branch = false; bool contains_hot_bb = false; @@ -725,7 +726,13 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, the last entry in the array when determining if we thread through the loop latch. */ if (loop->latch == bb) - threaded_through_latch = true; + { + threaded_through_latch = true; + if (j != 0) + latch_within_path = true; + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, " (latch)"); + } } gimple *stmt = get_gimple_control_stmt (m_path[0]); @@ -845,6 +852,15 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, "a multiway branch.\n"); return false; } + + if (latch_within_path) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + " FAIL: FSM Thread through latch would create non-empty latch\n"); + return false; + + } return true; } ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-08 18:13 ` Michael Matz @ 2021-09-09 6:57 ` Richard Biener 2021-09-09 7:37 ` Aldy Hernandez 2021-09-09 12:47 ` Michael Matz 2021-09-09 8:14 ` Aldy Hernandez 1 sibling, 2 replies; 33+ messages in thread From: Richard Biener @ 2021-09-09 6:57 UTC (permalink / raw) To: Michael Matz; +Cc: Aldy Hernandez, Jeff Law, GCC Mailing List, Andrew MacLeod On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: > > Hello, > > [lame answer to self] > > On Wed, 8 Sep 2021, Michael Matz wrote: > > > > > The forward threader guards against this by simply disallowing > > > > threadings that involve different loops. As I see > > > > > > The thread in question (5->9->3) is all within the same outer loop, > > > though. BTW, the backward threader also disallows threading across > > > different loops (see path_crosses_loops variable). > ... > > Maybe it's possible to not disable threading over latches alltogether in > > the backward threader (like it's tried now), but I haven't looked at the > > specific situation here in depth, so take my view only as opinion from a > > large distance :-) > > I've now looked at the concrete situation. So yeah, the whole path is in > the same loop, crosses the latch, _and there's code following the latch > on that path_. (I.e. the latch isn't the last block in the path). In > particular, after loop_optimizer_init() (before any threading) we have: > > <bb 3> [local count: 118111600]: > # j_19 = PHI <j_13(9), 0(7)> > sum_11 = c[j_19]; > if (n_10(D) > 0) > goto <bb 8>; [89.00%] > else > goto <bb 5>; [11.00%] > > <bb 8> [local count: 105119324]: > ... 
> > <bb 5> [local count: 118111600]: > # sum_21 = PHI <sum_14(4), sum_11(3)> > c[j_19] = sum_21; > j_13 = j_19 + 1; > if (n_10(D) > j_13) > goto <bb 9>; [89.00%] > else > goto <bb 6>; [11.00%] > > <bb 9> [local count: 105119324]: > goto <bb 3>; [100.00%] > > With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the > pre-header of inner loop, but more importantly something that's not at the > start of the outer loop. > > Now, any thread that includes the backedge 9->3 _including_ its > destination (i.e. where the backedge isn't the last to-be-redirected edge) > necessarily duplicates all code from that destination onto the back edge. > Here it's the load from c[j] into sum_11. > > The important part is the code is emitted onto the back edge, > conceptually; in reality it's simply included into the (new) latch block > (the duplicate of bb9, which is bb12 intermediately, then named bb7 after > cfg_cleanup). > > That's what we can't have for some of our structural loop optimizers: > there must be no code executed after the exit test (e.g. in the latch > block). (This requirement makes reasoning about which code is or isn't > executed completely for an iteration trivial; simply everything in the > body is always executed; e.g. loop interchange uses this to check that > there are no memory references after the exit test, because those would > then be only conditional and hence make loop interchange very awkward). > > Note that this situation can't be later rectified anymore: the duplicated > instructions (because they are memory refs) must remain after the exit > test. Only by rerolling/unrotating the loop (i.e. noticing that the > memory refs on the loop-entry path and on the back edge are equivalent) > would that be possible, but that's something we aren't capable of. Even > if we were that would simply just revert the whole work that the threader > did, so it's better to not even do that to start with. 
> > I believe something like below would be appropriate, it disables threading > if the path contains a latch at the non-last position (due to being > backwards on the non-first position in the array). I.e. it disables > rotating the loop if there's danger of polluting the back edge. It might > be improved if the blocks following (preceding!) the latch are themself > empty because then no code is duplicated. It might also be improved if > the latch is already non-empty. That code should probably only be active > before the loop optimizers, but currently the backward threader isn't > differentiating between before/after loop-optims. > > I haven't tested this patch at all, except that it fixes the testcase :) Lame comment at the current end of the thread - it's not threading through the latch but threading through the loop header that's problematic, at least if the end of the threading path ends within the loop (threading through the header to the loop exit is fine). Because in that situation you effectively created an alternate loop entry. Threading through the latch into the loop header is fine but with simple latches that likely will never happen (if there are no simple latches then the latch can contain the loop exit test). See tree-ssa-threadupdate.c:thread_block_1 e2 = path->last ()->e; if (!e2 || noloop_only) { /* If NOLOOP_ONLY is true, we only allow threading through the header of a loop to exit edges. */ /* One case occurs when there was loop header buried in a jump threading path that crosses loop boundaries. We do not try and thread this elsewhere, so just cancel the jump threading request by clearing the AUX field now. */ if (bb->loop_father != e2->src->loop_father && (!loop_exit_edge_p (e2->src->loop_father, e2) || flow_loop_nested_p (bb->loop_father, e2->dest->loop_father))) { /* Since this case is not handled by our special code to thread through a loop header, we must explicitly cancel the threading request here. 
*/ delete_jump_thread_path (path); e->aux = NULL; continue; } there are a lot of "useful" checks in this function and the backwards threader should adopt those. Note the backwards threader originally only did FSM style threadings which are exactly those possibly "harmful" ones, forming irreducible regions at worst or sub-loops at best. That might explain the lack of those checks. Richard. > > Ciao, > Michael. > > diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c > index 449232c7715..528a753b886 100644 > --- a/gcc/tree-ssa-threadbackward.c > +++ b/gcc/tree-ssa-threadbackward.c > @@ -600,6 +600,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, > loop_p loop = m_path[0]->loop_father; > bool path_crosses_loops = false; > bool threaded_through_latch = false; > + bool latch_within_path = false; > bool multiway_branch_in_path = false; > bool threaded_multiway_branch = false; > bool contains_hot_bb = false; > @@ -725,7 +726,13 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, > the last entry in the array when determining if we thread > through the loop latch. */ > if (loop->latch == bb) > - threaded_through_latch = true; > + { > + threaded_through_latch = true; > + if (j != 0) > + latch_within_path = true; > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, " (latch)"); > + } > } > > gimple *stmt = get_gimple_control_stmt (m_path[0]); > @@ -845,6 +852,15 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, > "a multiway branch.\n"); > return false; > } > + > + if (latch_within_path) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, > + " FAIL: FSM Thread through latch would create non-empty latch\n"); > + return false; > + > + } > return true; > } > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 6:57 ` Richard Biener @ 2021-09-09 7:37 ` Aldy Hernandez 2021-09-09 7:45 ` Richard Biener 2021-09-09 12:47 ` Michael Matz 1 sibling, 1 reply; 33+ messages in thread From: Aldy Hernandez @ 2021-09-09 7:37 UTC (permalink / raw) To: Richard Biener, Michael Matz; +Cc: Jeff Law, GCC Mailing List, Andrew MacLeod On 9/9/21 8:57 AM, Richard Biener wrote: > On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: >> >> Hello, >> >> [lame answer to self] >> >> On Wed, 8 Sep 2021, Michael Matz wrote: >> >>>>> The forward threader guards against this by simply disallowing >>>>> threadings that involve different loops. As I see >>>> >>>> The thread in question (5->9->3) is all within the same outer loop, >>>> though. BTW, the backward threader also disallows threading across >>>> different loops (see path_crosses_loops variable). >> ... >>> Maybe it's possible to not disable threading over latches alltogether in >>> the backward threader (like it's tried now), but I haven't looked at the >>> specific situation here in depth, so take my view only as opinion from a >>> large distance :-) >> >> I've now looked at the concrete situation. So yeah, the whole path is in >> the same loop, crosses the latch, _and there's code following the latch >> on that path_. (I.e. the latch isn't the last block in the path). In >> particular, after loop_optimizer_init() (before any threading) we have: >> >> <bb 3> [local count: 118111600]: >> # j_19 = PHI <j_13(9), 0(7)> >> sum_11 = c[j_19]; >> if (n_10(D) > 0) >> goto <bb 8>; [89.00%] >> else >> goto <bb 5>; [11.00%] >> >> <bb 8> [local count: 105119324]: >> ... 
>> >> <bb 5> [local count: 118111600]: >> # sum_21 = PHI <sum_14(4), sum_11(3)> >> c[j_19] = sum_21; >> j_13 = j_19 + 1; >> if (n_10(D) > j_13) >> goto <bb 9>; [89.00%] >> else >> goto <bb 6>; [11.00%] >> >> <bb 9> [local count: 105119324]: >> goto <bb 3>; [100.00%] >> >> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the >> pre-header of inner loop, but more importantly something that's not at the >> start of the outer loop. >> >> Now, any thread that includes the backedge 9->3 _including_ its >> destination (i.e. where the backedge isn't the last to-be-redirected edge) >> necessarily duplicates all code from that destination onto the back edge. >> Here it's the load from c[j] into sum_11. >> >> The important part is the code is emitted onto the back edge, >> conceptually; in reality it's simply included into the (new) latch block >> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after >> cfg_cleanup). >> >> That's what we can't have for some of our structural loop optimizers: >> there must be no code executed after the exit test (e.g. in the latch >> block). (This requirement makes reasoning about which code is or isn't >> executed completely for an iteration trivial; simply everything in the >> body is always executed; e.g. loop interchange uses this to check that >> there are no memory references after the exit test, because those would >> then be only conditional and hence make loop interchange very awkward). >> >> Note that this situation can't be later rectified anymore: the duplicated >> instructions (because they are memory refs) must remain after the exit >> test. Only by rerolling/unrotating the loop (i.e. noticing that the >> memory refs on the loop-entry path and on the back edge are equivalent) >> would that be possible, but that's something we aren't capable of. Even >> if we were that would simply just revert the whole work that the threader >> did, so it's better to not even do that to start with. 
>> >> I believe something like below would be appropriate, it disables threading >> if the path contains a latch at the non-last position (due to being >> backwards on the non-first position in the array). I.e. it disables >> rotating the loop if there's danger of polluting the back edge. It might >> be improved if the blocks following (preceding!) the latch are themself >> empty because then no code is duplicated. It might also be improved if >> the latch is already non-empty. That code should probably only be active >> before the loop optimizers, but currently the backward threader isn't >> differentiating between before/after loop-optims. >> >> I haven't tested this patch at all, except that it fixes the testcase :) > > Lame comment at the current end of the thread - it's not threading through the I don't know why y'all keep using the word "lame". On the contrary, these are incredibly useful explanations. Thanks. > latch but threading through the loop header that's problematic, at least if the > end of the threading path ends within the loop (threading through the header > to the loop exit is fine). Because in that situation you effectively created an > alternate loop entry. Threading through the latch into the loop header is > fine but with simple latches that likely will never happen (if there are no > simple latches then the latch can contain the loop exit test). > > See tree-ssa-threadupdate.c:thread_block_1 > > e2 = path->last ()->e; > if (!e2 || noloop_only) > { > /* If NOLOOP_ONLY is true, we only allow threading through the > header of a loop to exit edges. */ > > /* One case occurs when there was loop header buried in a jump > threading path that crosses loop boundaries. We do not try > and thread this elsewhere, so just cancel the jump threading > request by clearing the AUX field now. 
*/ > if (bb->loop_father != e2->src->loop_father > && (!loop_exit_edge_p (e2->src->loop_father, e2) > || flow_loop_nested_p (bb->loop_father, > e2->dest->loop_father))) > { > /* Since this case is not handled by our special code > to thread through a loop header, we must explicitly > cancel the threading request here. */ > delete_jump_thread_path (path); > e->aux = NULL; > continue; > } But this is for a threading path that crosses loop boundaries, which is not the case. Perhaps we should restrict this further to threads within a loop? > > there are a lot of "useful" checks in this function and the backwards threader > should adopt those. Note the backwards threader originally only did > FSM style threadings which are exactly those possibly "harmful" ones, forming > irreducible regions at worst or sub-loops at best. That might explain the > lack of those checks. Also, the aforementioned checks are in jump_thread_path_registry, which is also shared by the backward threader. These are thread discards _after_ a thread has been registered. The backward threader should also be using these restrictions. Unless, I'm missing some interaction with the FSM/etc threading types as per the preamble to the snippet you provided: if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners) || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners)) continue; You are right though, there are a lot of checks throughout the entire forward threader that should be audited and somehow shared. It's on my back burner, but I'm running out of cycles here :-/. Thanks. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 7:37 ` Aldy Hernandez @ 2021-09-09 7:45 ` Richard Biener 2021-09-09 8:36 ` Aldy Hernandez 0 siblings, 1 reply; 33+ messages in thread From: Richard Biener @ 2021-09-09 7:45 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <aldyh@redhat.com> wrote: > > > > On 9/9/21 8:57 AM, Richard Biener wrote: > > On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: > >> > >> Hello, > >> > >> [lame answer to self] > >> > >> On Wed, 8 Sep 2021, Michael Matz wrote: > >> > >>>>> The forward threader guards against this by simply disallowing > >>>>> threadings that involve different loops. As I see > >>>> > >>>> The thread in question (5->9->3) is all within the same outer loop, > >>>> though. BTW, the backward threader also disallows threading across > >>>> different loops (see path_crosses_loops variable). > >> ... > >>> Maybe it's possible to not disable threading over latches alltogether in > >>> the backward threader (like it's tried now), but I haven't looked at the > >>> specific situation here in depth, so take my view only as opinion from a > >>> large distance :-) > >> > >> I've now looked at the concrete situation. So yeah, the whole path is in > >> the same loop, crosses the latch, _and there's code following the latch > >> on that path_. (I.e. the latch isn't the last block in the path). In > >> particular, after loop_optimizer_init() (before any threading) we have: > >> > >> <bb 3> [local count: 118111600]: > >> # j_19 = PHI <j_13(9), 0(7)> > >> sum_11 = c[j_19]; > >> if (n_10(D) > 0) > >> goto <bb 8>; [89.00%] > >> else > >> goto <bb 5>; [11.00%] > >> > >> <bb 8> [local count: 105119324]: > >> ... 
> >> > >> <bb 5> [local count: 118111600]: > >> # sum_21 = PHI <sum_14(4), sum_11(3)> > >> c[j_19] = sum_21; > >> j_13 = j_19 + 1; > >> if (n_10(D) > j_13) > >> goto <bb 9>; [89.00%] > >> else > >> goto <bb 6>; [11.00%] > >> > >> <bb 9> [local count: 105119324]: > >> goto <bb 3>; [100.00%] > >> > >> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the > >> pre-header of inner loop, but more importantly something that's not at the > >> start of the outer loop. > >> > >> Now, any thread that includes the backedge 9->3 _including_ its > >> destination (i.e. where the backedge isn't the last to-be-redirected edge) > >> necessarily duplicates all code from that destination onto the back edge. > >> Here it's the load from c[j] into sum_11. > >> > >> The important part is the code is emitted onto the back edge, > >> conceptually; in reality it's simply included into the (new) latch block > >> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after > >> cfg_cleanup). > >> > >> That's what we can't have for some of our structural loop optimizers: > >> there must be no code executed after the exit test (e.g. in the latch > >> block). (This requirement makes reasoning about which code is or isn't > >> executed completely for an iteration trivial; simply everything in the > >> body is always executed; e.g. loop interchange uses this to check that > >> there are no memory references after the exit test, because those would > >> then be only conditional and hence make loop interchange very awkward). > >> > >> Note that this situation can't be later rectified anymore: the duplicated > >> instructions (because they are memory refs) must remain after the exit > >> test. Only by rerolling/unrotating the loop (i.e. noticing that the > >> memory refs on the loop-entry path and on the back edge are equivalent) > >> would that be possible, but that's something we aren't capable of. 
Even > >> if we were that would simply just revert the whole work that the threader > >> did, so it's better to not even do that to start with. > >> > >> I believe something like below would be appropriate, it disables threading > >> if the path contains a latch at the non-last position (due to being > >> backwards on the non-first position in the array). I.e. it disables > >> rotating the loop if there's danger of polluting the back edge. It might > >> be improved if the blocks following (preceding!) the latch are themself > >> empty because then no code is duplicated. It might also be improved if > >> the latch is already non-empty. That code should probably only be active > >> before the loop optimizers, but currently the backward threader isn't > >> differentiating between before/after loop-optims. > >> > >> I haven't tested this patch at all, except that it fixes the testcase :) > > > > Lame comment at the current end of the thread - it's not threading through the > > I don't know why y'all keep using the word "lame". On the contrary, > these are incredibly useful explanations. Thanks. > > > latch but threading through the loop header that's problematic, at least if the > > end of the threading path ends within the loop (threading through the header > > to the loop exit is fine). Because in that situation you effectively created an > > alternate loop entry. Threading through the latch into the loop header is > > fine but with simple latches that likely will never happen (if there are no > > simple latches then the latch can contain the loop exit test). > > > > See tree-ssa-threadupdate.c:thread_block_1 > > > > e2 = path->last ()->e; > > if (!e2 || noloop_only) > > { > > /* If NOLOOP_ONLY is true, we only allow threading through the > > header of a loop to exit edges. */ > > > > /* One case occurs when there was loop header buried in a jump > > threading path that crosses loop boundaries. 
We do not try > > and thread this elsewhere, so just cancel the jump threading > > request by clearing the AUX field now. */ > > if (bb->loop_father != e2->src->loop_father > > && (!loop_exit_edge_p (e2->src->loop_father, e2) > > || flow_loop_nested_p (bb->loop_father, > > e2->dest->loop_father))) > > { > > /* Since this case is not handled by our special code > > to thread through a loop header, we must explicitly > > cancel the threading request here. */ > > delete_jump_thread_path (path); > > e->aux = NULL; > > continue; > > } > > But this is for a threading path that crosses loop boundaries, which is > not the case. Perhaps we should restrict this further to threads within > a loop? > > > > > there are a lot of "useful" checks in this function and the backwards threader > > should adopt those. Note the backwards threader originally only did > > FSM style threadings which are exactly those possibly "harmful" ones, forming > > irreducible regions at worst or sub-loops at best. That might explain the > > lack of those checks. > > Also, the aforementioned checks are in jump_thread_path_registry, which > is also shared by the backward threader. These are thread discards > _after_ a thread has been registered. Yeah, that's indeed unfortunate. > The backward threader should also > be using these restrictions. Unless, I'm missing some interaction with > the FSM/etc threading types as per the preamble to the snippet you provided: > > if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners) > || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners)) > continue; Indeed. But I understand the backwards threader does not (only) do FSM threading now. > You are right though, there are a lot of checks throughout the entire > forward threader that should be audited and somehow shared. It's on my > back burner, but I'm running out of cycles here :-/. 
yeah, it's quite a mess indeed and merging the path validity/costing/code-transform paths of both threaders would be incredibly useful. Richard. > > Thanks. > Aldy > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 7:45 ` Richard Biener @ 2021-09-09 8:36 ` Aldy Hernandez 2021-09-09 8:58 ` Richard Biener 0 siblings, 1 reply; 33+ messages in thread From: Aldy Hernandez @ 2021-09-09 8:36 UTC (permalink / raw) To: Richard Biener; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On 9/9/21 9:45 AM, Richard Biener wrote: > On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <aldyh@redhat.com> wrote: >> >> >> >> On 9/9/21 8:57 AM, Richard Biener wrote: >>> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: >>>> >>>> Hello, >>>> >>>> [lame answer to self] >>>> >>>> On Wed, 8 Sep 2021, Michael Matz wrote: >>>> >>>>>>> The forward threader guards against this by simply disallowing >>>>>>> threadings that involve different loops. As I see >>>>>> >>>>>> The thread in question (5->9->3) is all within the same outer loop, >>>>>> though. BTW, the backward threader also disallows threading across >>>>>> different loops (see path_crosses_loops variable). >>>> ... >>>>> Maybe it's possible to not disable threading over latches alltogether in >>>>> the backward threader (like it's tried now), but I haven't looked at the >>>>> specific situation here in depth, so take my view only as opinion from a >>>>> large distance :-) >>>> >>>> I've now looked at the concrete situation. So yeah, the whole path is in >>>> the same loop, crosses the latch, _and there's code following the latch >>>> on that path_. (I.e. the latch isn't the last block in the path). In >>>> particular, after loop_optimizer_init() (before any threading) we have: >>>> >>>> <bb 3> [local count: 118111600]: >>>> # j_19 = PHI <j_13(9), 0(7)> >>>> sum_11 = c[j_19]; >>>> if (n_10(D) > 0) >>>> goto <bb 8>; [89.00%] >>>> else >>>> goto <bb 5>; [11.00%] >>>> >>>> <bb 8> [local count: 105119324]: >>>> ... 
>>>> >>>> <bb 5> [local count: 118111600]: >>>> # sum_21 = PHI <sum_14(4), sum_11(3)> >>>> c[j_19] = sum_21; >>>> j_13 = j_19 + 1; >>>> if (n_10(D) > j_13) >>>> goto <bb 9>; [89.00%] >>>> else >>>> goto <bb 6>; [11.00%] >>>> >>>> <bb 9> [local count: 105119324]: >>>> goto <bb 3>; [100.00%] >>>> >>>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the >>>> pre-header of inner loop, but more importantly something that's not at the >>>> start of the outer loop. >>>> >>>> Now, any thread that includes the backedge 9->3 _including_ its >>>> destination (i.e. where the backedge isn't the last to-be-redirected edge) >>>> necessarily duplicates all code from that destination onto the back edge. >>>> Here it's the load from c[j] into sum_11. >>>> >>>> The important part is the code is emitted onto the back edge, >>>> conceptually; in reality it's simply included into the (new) latch block >>>> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after >>>> cfg_cleanup). >>>> >>>> That's what we can't have for some of our structural loop optimizers: >>>> there must be no code executed after the exit test (e.g. in the latch >>>> block). (This requirement makes reasoning about which code is or isn't >>>> executed completely for an iteration trivial; simply everything in the >>>> body is always executed; e.g. loop interchange uses this to check that >>>> there are no memory references after the exit test, because those would >>>> then be only conditional and hence make loop interchange very awkward). >>>> >>>> Note that this situation can't be later rectified anymore: the duplicated >>>> instructions (because they are memory refs) must remain after the exit >>>> test. Only by rerolling/unrotating the loop (i.e. noticing that the >>>> memory refs on the loop-entry path and on the back edge are equivalent) >>>> would that be possible, but that's something we aren't capable of. 
Even >>>> if we were that would simply just revert the whole work that the threader >>>> did, so it's better to not even do that to start with. >>>> >>>> I believe something like below would be appropriate, it disables threading >>>> if the path contains a latch at the non-last position (due to being >>>> backwards on the non-first position in the array). I.e. it disables >>>> rotating the loop if there's danger of polluting the back edge. It might >>>> be improved if the blocks following (preceding!) the latch are themself >>>> empty because then no code is duplicated. It might also be improved if >>>> the latch is already non-empty. That code should probably only be active >>>> before the loop optimizers, but currently the backward threader isn't >>>> differentiating between before/after loop-optims. >>>> >>>> I haven't tested this patch at all, except that it fixes the testcase :) >>> >>> Lame comment at the current end of the thread - it's not threading through the >> >> I don't know why y'all keep using the word "lame". On the contrary, >> these are incredibly useful explanations. Thanks. >> >>> latch but threading through the loop header that's problematic, at least if the >>> end of the threading path ends within the loop (threading through the header >>> to the loop exit is fine). Because in that situation you effectively created an >>> alternate loop entry. Threading through the latch into the loop header is >>> fine but with simple latches that likely will never happen (if there are no >>> simple latches then the latch can contain the loop exit test). >>> >>> See tree-ssa-threadupdate.c:thread_block_1 >>> >>> e2 = path->last ()->e; >>> if (!e2 || noloop_only) >>> { >>> /* If NOLOOP_ONLY is true, we only allow threading through the >>> header of a loop to exit edges. */ >>> >>> /* One case occurs when there was loop header buried in a jump >>> threading path that crosses loop boundaries. 
We do not try >>> and thread this elsewhere, so just cancel the jump threading >>> request by clearing the AUX field now. */ >>> if (bb->loop_father != e2->src->loop_father >>> && (!loop_exit_edge_p (e2->src->loop_father, e2) >>> || flow_loop_nested_p (bb->loop_father, >>> e2->dest->loop_father))) >>> { >>> /* Since this case is not handled by our special code >>> to thread through a loop header, we must explicitly >>> cancel the threading request here. */ >>> delete_jump_thread_path (path); >>> e->aux = NULL; >>> continue; >>> } >> >> But this is for a threading path that crosses loop boundaries, which is >> not the case. Perhaps we should restrict this further to threads within >> a loop? >> >>> >>> there are a lot of "useful" checks in this function and the backwards threader >>> should adopt those. Note the backwards threader originally only did >>> FSM style threadings which are exactly those possibly "harmful" ones, forming >>> irreducible regions at worst or sub-loops at best. That might explain the >>> lack of those checks. >> >> Also, the aforementioned checks are in jump_thread_path_registry, which >> is also shared by the backward threader. These are thread discards >> _after_ a thread has been registered. > > Yeah, that's indeed unfortunate. > >> The backward threader should also >> be using these restrictions. Unless, I'm missing some interaction with >> the FSM/etc threading types as per the preamble to the snippet you provided: >> >> if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners) >> || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners)) >> continue; > > Indeed. But I understand the backwards threader does not (only) do FSM > threading now. If it does, it was not part of my rewrite. I was careful to not touch anything dealing with either path profitability or low-level path registering. The path registering is in back_threader_registry::register_path(). We only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY. 
ISTM that those are only EDGE_FSM_THREADs?? > >> You are right though, there are a lot of checks throughout the entire >> forward threader that should be audited and somehow shared. It's on my >> back burner, but I'm running out of cycles here :-/. > > yeah, it's quite a mess indeed and merging the path > validity/costing/code-transform > paths of both threaders would be incredibly useful. For the VRP threader replacement work I am doing, I was hoping to use an enhanced version of the backward threader, but it was incredibly hard to compare forward and backward threads because of the difference in the cost models. I've opted for a hybrid approach using the forward threader, but with the path solver to resolve conditionals/etc. I hope to return to the merging of the threaders for the next release, after the dust settles. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 8:36 ` Aldy Hernandez @ 2021-09-09 8:58 ` Richard Biener 2021-09-09 9:21 ` Aldy Hernandez 2021-09-09 16:59 ` Jeff Law 0 siblings, 2 replies; 33+ messages in thread From: Richard Biener @ 2021-09-09 8:58 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez <aldyh@redhat.com> wrote: > > > > On 9/9/21 9:45 AM, Richard Biener wrote: > > On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <aldyh@redhat.com> wrote: > >> > >> > >> > >> On 9/9/21 8:57 AM, Richard Biener wrote: > >>> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: > >>>> > >>>> Hello, > >>>> > >>>> [lame answer to self] > >>>> > >>>> On Wed, 8 Sep 2021, Michael Matz wrote: > >>>> > >>>>>>> The forward threader guards against this by simply disallowing > >>>>>>> threadings that involve different loops. As I see > >>>>>> > >>>>>> The thread in question (5->9->3) is all within the same outer loop, > >>>>>> though. BTW, the backward threader also disallows threading across > >>>>>> different loops (see path_crosses_loops variable). > >>>> ... > >>>>> Maybe it's possible to not disable threading over latches alltogether in > >>>>> the backward threader (like it's tried now), but I haven't looked at the > >>>>> specific situation here in depth, so take my view only as opinion from a > >>>>> large distance :-) > >>>> > >>>> I've now looked at the concrete situation. So yeah, the whole path is in > >>>> the same loop, crosses the latch, _and there's code following the latch > >>>> on that path_. (I.e. the latch isn't the last block in the path). 
In > >>>> particular, after loop_optimizer_init() (before any threading) we have: > >>>> > >>>> <bb 3> [local count: 118111600]: > >>>> # j_19 = PHI <j_13(9), 0(7)> > >>>> sum_11 = c[j_19]; > >>>> if (n_10(D) > 0) > >>>> goto <bb 8>; [89.00%] > >>>> else > >>>> goto <bb 5>; [11.00%] > >>>> > >>>> <bb 8> [local count: 105119324]: > >>>> ... > >>>> > >>>> <bb 5> [local count: 118111600]: > >>>> # sum_21 = PHI <sum_14(4), sum_11(3)> > >>>> c[j_19] = sum_21; > >>>> j_13 = j_19 + 1; > >>>> if (n_10(D) > j_13) > >>>> goto <bb 9>; [89.00%] > >>>> else > >>>> goto <bb 6>; [11.00%] > >>>> > >>>> <bb 9> [local count: 105119324]: > >>>> goto <bb 3>; [100.00%] > >>>> > >>>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the > >>>> pre-header of inner loop, but more importantly something that's not at the > >>>> start of the outer loop. > >>>> > >>>> Now, any thread that includes the backedge 9->3 _including_ its > >>>> destination (i.e. where the backedge isn't the last to-be-redirected edge) > >>>> necessarily duplicates all code from that destination onto the back edge. > >>>> Here it's the load from c[j] into sum_11. > >>>> > >>>> The important part is the code is emitted onto the back edge, > >>>> conceptually; in reality it's simply included into the (new) latch block > >>>> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after > >>>> cfg_cleanup). > >>>> > >>>> That's what we can't have for some of our structural loop optimizers: > >>>> there must be no code executed after the exit test (e.g. in the latch > >>>> block). (This requirement makes reasoning about which code is or isn't > >>>> executed completely for an iteration trivial; simply everything in the > >>>> body is always executed; e.g. loop interchange uses this to check that > >>>> there are no memory references after the exit test, because those would > >>>> then be only conditional and hence make loop interchange very awkward). 
> >>>> > >>>> Note that this situation can't be later rectified anymore: the duplicated > >>>> instructions (because they are memory refs) must remain after the exit > >>>> test. Only by rerolling/unrotating the loop (i.e. noticing that the > >>>> memory refs on the loop-entry path and on the back edge are equivalent) > >>>> would that be possible, but that's something we aren't capable of. Even > >>>> if we were that would simply just revert the whole work that the threader > >>>> did, so it's better to not even do that to start with. > >>>> > >>>> I believe something like below would be appropriate, it disables threading > >>>> if the path contains a latch at the non-last position (due to being > >>>> backwards on the non-first position in the array). I.e. it disables > >>>> rotating the loop if there's danger of polluting the back edge. It might > >>>> be improved if the blocks following (preceding!) the latch are themself > >>>> empty because then no code is duplicated. It might also be improved if > >>>> the latch is already non-empty. That code should probably only be active > >>>> before the loop optimizers, but currently the backward threader isn't > >>>> differentiating between before/after loop-optims. > >>>> > >>>> I haven't tested this patch at all, except that it fixes the testcase :) > >>> > >>> Lame comment at the current end of the thread - it's not threading through the > >> > >> I don't know why y'all keep using the word "lame". On the contrary, > >> these are incredibly useful explanations. Thanks. > >> > >>> latch but threading through the loop header that's problematic, at least if the > >>> end of the threading path ends within the loop (threading through the header > >>> to the loop exit is fine). Because in that situation you effectively created an > >>> alternate loop entry. 
Threading through the latch into the loop header is > >>> fine but with simple latches that likely will never happen (if there are no > >>> simple latches then the latch can contain the loop exit test). > >>> > >>> See tree-ssa-threadupdate.c:thread_block_1 > >>> > >>> e2 = path->last ()->e; > >>> if (!e2 || noloop_only) > >>> { > >>> /* If NOLOOP_ONLY is true, we only allow threading through the > >>> header of a loop to exit edges. */ > >>> > >>> /* One case occurs when there was loop header buried in a jump > >>> threading path that crosses loop boundaries. We do not try > >>> and thread this elsewhere, so just cancel the jump threading > >>> request by clearing the AUX field now. */ > >>> if (bb->loop_father != e2->src->loop_father > >>> && (!loop_exit_edge_p (e2->src->loop_father, e2) > >>> || flow_loop_nested_p (bb->loop_father, > >>> e2->dest->loop_father))) > >>> { > >>> /* Since this case is not handled by our special code > >>> to thread through a loop header, we must explicitly > >>> cancel the threading request here. */ > >>> delete_jump_thread_path (path); > >>> e->aux = NULL; > >>> continue; > >>> } > >> > >> But this is for a threading path that crosses loop boundaries, which is > >> not the case. Perhaps we should restrict this further to threads within > >> a loop? > >> > >>> > >>> there are a lot of "useful" checks in this function and the backwards threader > >>> should adopt those. Note the backwards threader originally only did > >>> FSM style threadings which are exactly those possibly "harmful" ones, forming > >>> irreducible regions at worst or sub-loops at best. That might explain the > >>> lack of those checks. > >> > >> Also, the aforementioned checks are in jump_thread_path_registry, which > >> is also shared by the backward threader. These are thread discards > >> _after_ a thread has been registered. > > > > Yeah, that's indeed unfortunate. > > > >> The backward threader should also > >> be using these restrictions. 
Unless, I'm missing some interaction with > >> the FSM/etc threading types as per the preamble to the snippet you provided: > >> > >> if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners) > >> || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners)) > >> continue; > > > > Indeed. But I understand the backwards threader does not (only) do FSM > > threading now. > > If it does, it was not part of my rewrite. I was careful to not touch > anything dealing with either path profitability or low-level path > registering. > > The path registering is in back_threader_registry::register_path(). We > only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY. ISTM that > those are only EDGE_FSM_THREADs?? Well, if the backwards threader classifies everything as FSM that's probably inaccurate since only threads through the loop latch are "FSM". There is the comment /* If this path does not thread through the loop latch, then we are using the FSM threader to find old style jump threads. This is good, except the FSM threader does not re-use an existing threading path to reduce code duplication. So for that case, drastically reduce the number of statements we are allowed to copy. */ so these cases should use the "old style" validity/costing metrics and thus classify threading opportunities in a different way? I think today "backwards" vs, "forwards" only refers to the way we find threading opportunities. > > > >> You are right though, there are a lot of checks throughout the entire > >> forward threader that should be audited and somehow shared. It's on my > >> back burner, but I'm running out of cycles here :-/. > > > > yeah, it's quite a mess indeed and merging the path > > validity/costing/code-transform > > paths of both threaders would be incredibly useful. 
> > For the VRP threader replacement work I am doing, I was hoping to use an > enhanced version of the backward threader, but it was incredibly hard to > compare forward and backward threads, because the difference in the cost > models. I've opted for a hybrid approach using the forward threader, > but with the path solver to resolve conditionals/etc. > > I hope to return to the merging of the threaders for the next release, > after the dust settles. Hmm, I see. Richard. > Aldy > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 8:58 ` Richard Biener @ 2021-09-09 9:21 ` Aldy Hernandez 2021-09-09 10:15 ` Richard Biener 2021-09-10 15:43 ` Jeff Law 2021-09-09 16:59 ` Jeff Law 1 sibling, 2 replies; 33+ messages in thread From: Aldy Hernandez @ 2021-09-09 9:21 UTC (permalink / raw) To: Richard Biener; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On 9/9/21 10:58 AM, Richard Biener wrote: > On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez <aldyh@redhat.com> wrote: >> >> >> >> On 9/9/21 9:45 AM, Richard Biener wrote: >>> On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <aldyh@redhat.com> wrote: >>>> >>>> >>>> >>>> On 9/9/21 8:57 AM, Richard Biener wrote: >>>>> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: >>>>>> >>>>>> Hello, >>>>>> >>>>>> [lame answer to self] >>>>>> >>>>>> On Wed, 8 Sep 2021, Michael Matz wrote: >>>>>> >>>>>>>>> The forward threader guards against this by simply disallowing >>>>>>>>> threadings that involve different loops. As I see >>>>>>>> >>>>>>>> The thread in question (5->9->3) is all within the same outer loop, >>>>>>>> though. BTW, the backward threader also disallows threading across >>>>>>>> different loops (see path_crosses_loops variable). >>>>>> ... >>>>>>> Maybe it's possible to not disable threading over latches alltogether in >>>>>>> the backward threader (like it's tried now), but I haven't looked at the >>>>>>> specific situation here in depth, so take my view only as opinion from a >>>>>>> large distance :-) >>>>>> >>>>>> I've now looked at the concrete situation. So yeah, the whole path is in >>>>>> the same loop, crosses the latch, _and there's code following the latch >>>>>> on that path_. (I.e. the latch isn't the last block in the path). 
In >>>>>> particular, after loop_optimizer_init() (before any threading) we have: >>>>>> >>>>>> <bb 3> [local count: 118111600]: >>>>>> # j_19 = PHI <j_13(9), 0(7)> >>>>>> sum_11 = c[j_19]; >>>>>> if (n_10(D) > 0) >>>>>> goto <bb 8>; [89.00%] >>>>>> else >>>>>> goto <bb 5>; [11.00%] >>>>>> >>>>>> <bb 8> [local count: 105119324]: >>>>>> ... >>>>>> >>>>>> <bb 5> [local count: 118111600]: >>>>>> # sum_21 = PHI <sum_14(4), sum_11(3)> >>>>>> c[j_19] = sum_21; >>>>>> j_13 = j_19 + 1; >>>>>> if (n_10(D) > j_13) >>>>>> goto <bb 9>; [89.00%] >>>>>> else >>>>>> goto <bb 6>; [11.00%] >>>>>> >>>>>> <bb 9> [local count: 105119324]: >>>>>> goto <bb 3>; [100.00%] >>>>>> >>>>>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the >>>>>> pre-header of inner loop, but more importantly something that's not at the >>>>>> start of the outer loop. >>>>>> >>>>>> Now, any thread that includes the backedge 9->3 _including_ its >>>>>> destination (i.e. where the backedge isn't the last to-be-redirected edge) >>>>>> necessarily duplicates all code from that destination onto the back edge. >>>>>> Here it's the load from c[j] into sum_11. >>>>>> >>>>>> The important part is the code is emitted onto the back edge, >>>>>> conceptually; in reality it's simply included into the (new) latch block >>>>>> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after >>>>>> cfg_cleanup). >>>>>> >>>>>> That's what we can't have for some of our structural loop optimizers: >>>>>> there must be no code executed after the exit test (e.g. in the latch >>>>>> block). (This requirement makes reasoning about which code is or isn't >>>>>> executed completely for an iteration trivial; simply everything in the >>>>>> body is always executed; e.g. loop interchange uses this to check that >>>>>> there are no memory references after the exit test, because those would >>>>>> then be only conditional and hence make loop interchange very awkward). 
>>>>>> >>>>>> Note that this situation can't be later rectified anymore: the duplicated >>>>>> instructions (because they are memory refs) must remain after the exit >>>>>> test. Only by rerolling/unrotating the loop (i.e. noticing that the >>>>>> memory refs on the loop-entry path and on the back edge are equivalent) >>>>>> would that be possible, but that's something we aren't capable of. Even >>>>>> if we were that would simply just revert the whole work that the threader >>>>>> did, so it's better to not even do that to start with. >>>>>> >>>>>> I believe something like below would be appropriate, it disables threading >>>>>> if the path contains a latch at the non-last position (due to being >>>>>> backwards on the non-first position in the array). I.e. it disables >>>>>> rotating the loop if there's danger of polluting the back edge. It might >>>>>> be improved if the blocks following (preceding!) the latch are themself >>>>>> empty because then no code is duplicated. It might also be improved if >>>>>> the latch is already non-empty. That code should probably only be active >>>>>> before the loop optimizers, but currently the backward threader isn't >>>>>> differentiating between before/after loop-optims. >>>>>> >>>>>> I haven't tested this patch at all, except that it fixes the testcase :) >>>>> >>>>> Lame comment at the current end of the thread - it's not threading through the >>>> >>>> I don't know why y'all keep using the word "lame". On the contrary, >>>> these are incredibly useful explanations. Thanks. >>>> >>>>> latch but threading through the loop header that's problematic, at least if the >>>>> end of the threading path ends within the loop (threading through the header >>>>> to the loop exit is fine). Because in that situation you effectively created an >>>>> alternate loop entry. 
Threading through the latch into the loop header is >>>>> fine but with simple latches that likely will never happen (if there are no >>>>> simple latches then the latch can contain the loop exit test). >>>>> >>>>> See tree-ssa-threadupdate.c:thread_block_1 >>>>> >>>>> e2 = path->last ()->e; >>>>> if (!e2 || noloop_only) >>>>> { >>>>> /* If NOLOOP_ONLY is true, we only allow threading through the >>>>> header of a loop to exit edges. */ >>>>> >>>>> /* One case occurs when there was loop header buried in a jump >>>>> threading path that crosses loop boundaries. We do not try >>>>> and thread this elsewhere, so just cancel the jump threading >>>>> request by clearing the AUX field now. */ >>>>> if (bb->loop_father != e2->src->loop_father >>>>> && (!loop_exit_edge_p (e2->src->loop_father, e2) >>>>> || flow_loop_nested_p (bb->loop_father, >>>>> e2->dest->loop_father))) >>>>> { >>>>> /* Since this case is not handled by our special code >>>>> to thread through a loop header, we must explicitly >>>>> cancel the threading request here. */ >>>>> delete_jump_thread_path (path); >>>>> e->aux = NULL; >>>>> continue; >>>>> } >>>> >>>> But this is for a threading path that crosses loop boundaries, which is >>>> not the case. Perhaps we should restrict this further to threads within >>>> a loop? >>>> >>>>> >>>>> there are a lot of "useful" checks in this function and the backwards threader >>>>> should adopt those. Note the backwards threader originally only did >>>>> FSM style threadings which are exactly those possibly "harmful" ones, forming >>>>> irreducible regions at worst or sub-loops at best. That might explain the >>>>> lack of those checks. >>>> >>>> Also, the aforementioned checks are in jump_thread_path_registry, which >>>> is also shared by the backward threader. These are thread discards >>>> _after_ a thread has been registered. >>> >>> Yeah, that's indeed unfortunate. >>> >>>> The backward threader should also >>>> be using these restrictions. 
Unless, I'm missing some interaction with >>>> the FSM/etc threading types as per the preamble to the snippet you provided: >>>> >>>> if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners) >>>> || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners)) >>>> continue; >>> >>> Indeed. But I understand the backwards threader does not (only) do FSM >>> threading now. >> >> If it does, it was not part of my rewrite. I was careful to not touch >> anything dealing with either path profitability or low-level path >> registering. >> >> The path registering is in back_threader_registry::register_path(). We >> only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY. ISTM that >> those are only EDGE_FSM_THREADs?? > > Well, if the backwards threader classifies everything as FSM that's probably > inaccurate since only threads through the loop latch are "FSM". There is > the comment > > /* If this path does not thread through the loop latch, then we are > using the FSM threader to find old style jump threads. This > is good, except the FSM threader does not re-use an existing > threading path to reduce code duplication. > > So for that case, drastically reduce the number of statements > we are allowed to copy. */ *blink* Woah. The backward threader has been using FSM threads indiscriminately as far as I can remember. I wonder what would break if we "fixed it". > > so these cases should use the "old style" validity/costing metrics and thus > classify threading opportunities in a different way? Jeff, do you have any insight here? > > I think today "backwards" vs, "forwards" only refers to the way we find > threading opportunities. Yes, it's a mess. I ran some experiments a while back, and my current work on the enhanced solver/threader, can fold virtually everything the DOM/threader gets (even with its use of const_and_copies, avail_exprs, and evrp_range_analyzer), while getting 5% more DOM threads and 1% more overall threads. 
That is, I've been testing whether the path solver can solve everything the DOM threader needs (the hybrid approach I mentioned). Unfortunately, replacing the forward threader right now is not feasible for a few reasons: a) The const_and_copies/avail_exprs relation framework can do floats, and that's next year's ranger work. b) Even though we can seemingly fold everything DOM/threader does, in order to replace it with a backward threader instance we'd have to merge the cost/profitability code scattered throughout the forward threader, as well as the EDGE_FSM* / EDGE_COPY* business. c) DOM changes the IL as it goes, though we could conceivably do the threading after DOM is done. But I digress... Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 9:21 ` Aldy Hernandez @ 2021-09-09 10:15 ` Richard Biener 2021-09-09 11:28 ` Aldy Hernandez 2021-09-10 15:51 ` Jeff Law 2021-09-10 15:43 ` Jeff Law 1 sibling, 2 replies; 33+ messages in thread From: Richard Biener @ 2021-09-09 10:15 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On Thu, Sep 9, 2021 at 11:21 AM Aldy Hernandez <aldyh@redhat.com> wrote: > > > > On 9/9/21 10:58 AM, Richard Biener wrote: > > On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez <aldyh@redhat.com> wrote: > >> > >> > >> > >> On 9/9/21 9:45 AM, Richard Biener wrote: > >>> On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <aldyh@redhat.com> wrote: > >>>> > >>>> > >>>> > >>>> On 9/9/21 8:57 AM, Richard Biener wrote: > >>>>> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: > >>>>>> > >>>>>> Hello, > >>>>>> > >>>>>> [lame answer to self] > >>>>>> > >>>>>> On Wed, 8 Sep 2021, Michael Matz wrote: > >>>>>> > >>>>>>>>> The forward threader guards against this by simply disallowing > >>>>>>>>> threadings that involve different loops. As I see > >>>>>>>> > >>>>>>>> The thread in question (5->9->3) is all within the same outer loop, > >>>>>>>> though. BTW, the backward threader also disallows threading across > >>>>>>>> different loops (see path_crosses_loops variable). > >>>>>> ... > >>>>>>> Maybe it's possible to not disable threading over latches alltogether in > >>>>>>> the backward threader (like it's tried now), but I haven't looked at the > >>>>>>> specific situation here in depth, so take my view only as opinion from a > >>>>>>> large distance :-) > >>>>>> > >>>>>> I've now looked at the concrete situation. So yeah, the whole path is in > >>>>>> the same loop, crosses the latch, _and there's code following the latch > >>>>>> on that path_. (I.e. the latch isn't the last block in the path). 
In > >>>>>> particular, after loop_optimizer_init() (before any threading) we have: > >>>>>> > >>>>>> <bb 3> [local count: 118111600]: > >>>>>> # j_19 = PHI <j_13(9), 0(7)> > >>>>>> sum_11 = c[j_19]; > >>>>>> if (n_10(D) > 0) > >>>>>> goto <bb 8>; [89.00%] > >>>>>> else > >>>>>> goto <bb 5>; [11.00%] > >>>>>> > >>>>>> <bb 8> [local count: 105119324]: > >>>>>> ... > >>>>>> > >>>>>> <bb 5> [local count: 118111600]: > >>>>>> # sum_21 = PHI <sum_14(4), sum_11(3)> > >>>>>> c[j_19] = sum_21; > >>>>>> j_13 = j_19 + 1; > >>>>>> if (n_10(D) > j_13) > >>>>>> goto <bb 9>; [89.00%] > >>>>>> else > >>>>>> goto <bb 6>; [11.00%] > >>>>>> > >>>>>> <bb 9> [local count: 105119324]: > >>>>>> goto <bb 3>; [100.00%] > >>>>>> > >>>>>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the > >>>>>> pre-header of inner loop, but more importantly something that's not at the > >>>>>> start of the outer loop. > >>>>>> > >>>>>> Now, any thread that includes the backedge 9->3 _including_ its > >>>>>> destination (i.e. where the backedge isn't the last to-be-redirected edge) > >>>>>> necessarily duplicates all code from that destination onto the back edge. > >>>>>> Here it's the load from c[j] into sum_11. > >>>>>> > >>>>>> The important part is the code is emitted onto the back edge, > >>>>>> conceptually; in reality it's simply included into the (new) latch block > >>>>>> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after > >>>>>> cfg_cleanup). > >>>>>> > >>>>>> That's what we can't have for some of our structural loop optimizers: > >>>>>> there must be no code executed after the exit test (e.g. in the latch > >>>>>> block). (This requirement makes reasoning about which code is or isn't > >>>>>> executed completely for an iteration trivial; simply everything in the > >>>>>> body is always executed; e.g. 
loop interchange uses this to check that > >>>>>> there are no memory references after the exit test, because those would > >>>>>> then be only conditional and hence make loop interchange very awkward). > >>>>>> > >>>>>> Note that this situation can't be later rectified anymore: the duplicated > >>>>>> instructions (because they are memory refs) must remain after the exit > >>>>>> test. Only by rerolling/unrotating the loop (i.e. noticing that the > >>>>>> memory refs on the loop-entry path and on the back edge are equivalent) > >>>>>> would that be possible, but that's something we aren't capable of. Even > >>>>>> if we were that would simply just revert the whole work that the threader > >>>>>> did, so it's better to not even do that to start with. > >>>>>> > >>>>>> I believe something like below would be appropriate, it disables threading > >>>>>> if the path contains a latch at the non-last position (due to being > >>>>>> backwards on the non-first position in the array). I.e. it disables > >>>>>> rotating the loop if there's danger of polluting the back edge. It might > >>>>>> be improved if the blocks following (preceding!) the latch are themself > >>>>>> empty because then no code is duplicated. It might also be improved if > >>>>>> the latch is already non-empty. That code should probably only be active > >>>>>> before the loop optimizers, but currently the backward threader isn't > >>>>>> differentiating between before/after loop-optims. > >>>>>> > >>>>>> I haven't tested this patch at all, except that it fixes the testcase :) > >>>>> > >>>>> Lame comment at the current end of the thread - it's not threading through the > >>>> > >>>> I don't know why y'all keep using the word "lame". On the contrary, > >>>> these are incredibly useful explanations. Thanks. 
> >>>> > >>>>> latch but threading through the loop header that's problematic, at least if the > >>>>> end of the threading path ends within the loop (threading through the header > >>>>> to the loop exit is fine). Because in that situation you effectively created an > >>>>> alternate loop entry. Threading through the latch into the loop header is > >>>>> fine but with simple latches that likely will never happen (if there are no > >>>>> simple latches then the latch can contain the loop exit test). > >>>>> > >>>>> See tree-ssa-threadupdate.c:thread_block_1 > >>>>> > >>>>> e2 = path->last ()->e; > >>>>> if (!e2 || noloop_only) > >>>>> { > >>>>> /* If NOLOOP_ONLY is true, we only allow threading through the > >>>>> header of a loop to exit edges. */ > >>>>> > >>>>> /* One case occurs when there was loop header buried in a jump > >>>>> threading path that crosses loop boundaries. We do not try > >>>>> and thread this elsewhere, so just cancel the jump threading > >>>>> request by clearing the AUX field now. */ > >>>>> if (bb->loop_father != e2->src->loop_father > >>>>> && (!loop_exit_edge_p (e2->src->loop_father, e2) > >>>>> || flow_loop_nested_p (bb->loop_father, > >>>>> e2->dest->loop_father))) > >>>>> { > >>>>> /* Since this case is not handled by our special code > >>>>> to thread through a loop header, we must explicitly > >>>>> cancel the threading request here. */ > >>>>> delete_jump_thread_path (path); > >>>>> e->aux = NULL; > >>>>> continue; > >>>>> } > >>>> > >>>> But this is for a threading path that crosses loop boundaries, which is > >>>> not the case. Perhaps we should restrict this further to threads within > >>>> a loop? > >>>> > >>>>> > >>>>> there are a lot of "useful" checks in this function and the backwards threader > >>>>> should adopt those. Note the backwards threader originally only did > >>>>> FSM style threadings which are exactly those possibly "harmful" ones, forming > >>>>> irreducible regions at worst or sub-loops at best. 
That might explain the > >>>>> lack of those checks. > >>>> > >>>> Also, the aforementioned checks are in jump_thread_path_registry, which > >>>> is also shared by the backward threader. These are thread discards > >>>> _after_ a thread has been registered. > >>> > >>> Yeah, that's indeed unfortunate. > >>> > >>>> The backward threader should also > >>>> be using these restrictions. Unless, I'm missing some interaction with > >>>> the FSM/etc threading types as per the preamble to the snippet you provided: > >>>> > >>>> if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners) > >>>> || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners)) > >>>> continue; > >>> > >>> Indeed. But I understand the backwards threader does not (only) do FSM > >>> threading now. > >> > >> If it does, it was not part of my rewrite. I was careful to not touch > >> anything dealing with either path profitability or low-level path > >> registering. > >> > >> The path registering is in back_threader_registry::register_path(). We > >> only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY. ISTM that > >> those are only EDGE_FSM_THREADs?? > > > > Well, if the backwards threader classifies everything as FSM that's probably > > inaccurate since only threads through the loop latch are "FSM". There is > > the comment > > > > /* If this path does not thread through the loop latch, then we are > > using the FSM threader to find old style jump threads. This > > is good, except the FSM threader does not re-use an existing > > threading path to reduce code duplication. > > > > So for that case, drastically reduce the number of statements > > we are allowed to copy. */ > > *blink* > > Woah. The backward threader has been using FSM threads indiscriminately > as far as I can remember. I wonder what would break if we "fixed it". > > > > > so these cases should use the "old style" validity/costing metrics and thus > > classify threading opportunities in a different way? 
> > Jeff, do you have any insight here? > > > I think today "backwards" vs. "forwards" only refers to the way we find > threading opportunities. > > Yes, it's a mess. > > I ran some experiments a while back, and my current work on the enhanced > solver/threader can fold virtually everything the DOM/threader gets > (even with its use of const_and_copies, avail_exprs, and > evrp_range_analyzer), while getting 5% more DOM threads and 1% more > overall threads. That is, I've been testing if the path solver can > solve everything the DOM threader needs (the hybrid approach I mentioned). > > Unfortunately, replacing the forward threader right now is not feasible > for a few reasons: > > a) The const_and_copies/avail_exprs relation framework can do floats, > and that's next year's ranger work. Hmm, it sounds like relying fully on ranger is a bit premature then. > b) Even though we can seemingly fold everything DOM/threader does, in > order to replace it with a backward threader instance we'd have to merge > the cost/profitability code scattered throughout the forward threader, > as well as the EDGE_FSM* / EDGE_COPY* business. > > c) DOM changes the IL as it goes. Though we could conceivably divorce > the two and do the threading after DOM is done. Yeah, it does not actually process/simplify the blocks copied by threading. In fact I think it only registers jump threading opportunities during the DOM walk and commits them only later. But it of course uses its avail / copies stack to find them - that part you cannot easily divorce. DOM is also yet another value-numbering framework - one could think of stripping it down from that side, keeping the threading bits only (but you'd still have the avail / copies bits). That said, it has one nice property it can leverage due to its incredibly simple memory redundancy handling, in that it usually performs far fewer alias queries than FRE (even when you run the latter in non-iterative mode).
But the same way DOM can register jump threading opportunities FRE could do as well. Richard. > But I digress... > Aldy > ^ permalink raw reply [flat|nested] 33+ messages in thread
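[Editorial aside: Matz's structural restriction from earlier in the thread — reject any candidate path in which the loop latch appears at a non-last position, because everything following the latch would be duplicated onto the back edge — can be sketched with a toy block type. This is an illustration of the rule only, not the actual GCC patch; the `block` struct and function name are invented.]

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model, not GCC code: a candidate threading path is a sequence of
// basic blocks, one of which may be the latch of the loop being threaded.
struct block
{
  int id;
  bool is_latch;
};

// Matz's proposed restriction, restated: reject any path in which the
// loop latch appears at a non-last position.  Every block that follows
// the latch on the path would be duplicated onto the back edge, leaving
// code (here, the load from c[j]) executing after the exit test.
static bool
latch_position_ok (const std::vector<block> &path)
{
  for (std::size_t i = 0; i + 1 < path.size (); ++i)
    if (path[i].is_latch)
      return false;
  return true;
}
```

For the loop-interchange-9.c regression, the path starts with the 9->3 backedge and includes its destination: bb9 (the latch) sits first, so the rule rejects it.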
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 10:15 ` Richard Biener @ 2021-09-09 11:28 ` Aldy Hernandez 2021-09-10 15:51 ` Jeff Law 1 sibling, 0 replies; 33+ messages in thread From: Aldy Hernandez @ 2021-09-09 11:28 UTC (permalink / raw) To: Richard Biener; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On 9/9/21 12:15 PM, Richard Biener wrote: > On Thu, Sep 9, 2021 at 11:21 AM Aldy Hernandez <aldyh@redhat.com> wrote: >> >> >> >> On 9/9/21 10:58 AM, Richard Biener wrote: >> I ran some experiments a while back, and my current work on the enhanced >> solver/threader, can fold virtually everything the DOM/threader gets >> (even with its use of const_and_copies, avail_exprs, and >> evrp_range_analyzer), while getting 5% more DOM threads and 1% more >> overall threads. That is, I've been testing if the path solver can >> solve everything the DOM threader needs (the hybrid approach I mentioned). >> >> Unfortunately, replacing the forward threader right now is not feasible >> for a few reasons: >> >> a) The const_and_copies/avail_exprs relation framework can do floats, >> and that's next year's ranger work. > > Hmm, it sounds like relying fully on ranger is a bit premature then. For floats, most definitely. Ranger doesn't do them. They're for the next release. However, const_and_copies/avail_exprs for integers we can do just fine, that's why evrp/VRP are much easier targets. > >> b) Even though we can seemingly fold everything DOM/threader does, in >> order to replace it with a backward threader instance we'd have to merge >> the cost/profitability code scattered throughout the forward threader, >> as well as the EDGE_FSM* / EDGE_COPY* business. >> >> c) DOM changes the IL as it goes. Though we could conceivably divorce >> do the threading after DOM is done. > > Yeah, it does not actually process/simplify the blocks copied by threading. 
> In fact I think it only registers jump threading opportunities during the DOM > walk and commits them only later. But it of course uses its avail / copies > stack to find them - that part you cannot easily divorce. Yes, all the threaders register paths ahead of time and only do the actual CFG shuffling at the end with thread_through_all_blocks(). > > DOM is also yet another value-numbering framework - one could think > of stripping it down from that side, keeping the threading bits only > (but you'd still have the avail / copies bits). Ughh, yeah. I've seen some random missed cases because of the value-numbering it does :-/. It's very frustrating that we have various ad-hoc VN methods. > > That said, it has one nice property it can leverage due to its incredibly > simple memory redundancy handling, in that it usually performs way less > alias queries than FRE (even when you run the latter in non-iterative mode). > > But the same way DOM can register jump threading opportunities FRE > could do as well. My plan for DOM/threading for this release is to replace its evrp analyzer with a path solver (path_range_query). We can still use the avail/copies, and everything else should just work. I'm hand waiving on a few missed cases due to IL changing, but last time I checked we got way more in return. Andrew even thought we should get most things even in the presence of changing IL, but I haven't distilled testcases for him. For VRP/threading, the same hybrid thing, but replacing the ASSERT_EXPRs, and the avail/copies with a path solver. There are no floats here. Things are looking good in my tests. Once we do floats, if we could merge the EDGE_*_THREAD and the cost/profit bits between both threaders, we should be able to replace the entire forward threader with a backward client. Hopefully I don't run out of steam by then ;-). Thanks for your insight on DOM. It's very helpful. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
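[Editorial aside: the two-phase scheme described above — every threader registers candidate paths up front and the CFG is only rewritten at the end via thread_through_all_blocks() — reduces to something like the following toy registry. The class and its members are invented for illustration and are far simpler than the real jump_thread_path_registry.]

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy two-phase jump-thread registry.  Phase one: passes register
// candidate paths (here just sequences of block ids) while walking the
// IL.  Phase two: all registered paths are committed at once, so the
// CFG is never rewritten underneath an ongoing analysis.
class thread_registry
{
  std::vector<std::vector<int>> m_paths;

public:
  void
  register_path (std::vector<int> path)
  {
    m_paths.push_back (std::move (path));
  }

  // Commit every registered path; returns how many were threaded.
  std::size_t
  thread_through_all_blocks ()
  {
    std::size_t n = m_paths.size ();
    // Real code would duplicate blocks and redirect edges here.
    m_paths.clear ();
    return n;
  }
};
```

The separation is what lets DOM keep walking an unchanged IL while threading opportunities accumulate.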
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 10:15 ` Richard Biener 2021-09-09 11:28 ` Aldy Hernandez @ 2021-09-10 15:51 ` Jeff Law 2021-09-10 16:11 ` Aldy Hernandez 1 sibling, 1 reply; 33+ messages in thread From: Jeff Law @ 2021-09-10 15:51 UTC (permalink / raw) To: Richard Biener, Aldy Hernandez Cc: Michael Matz, GCC Mailing List, Andrew MacLeod On 9/9/2021 4:15 AM, Richard Biener wrote: > >> b) Even though we can seemingly fold everything DOM/threader does, in >> order to replace it with a backward threader instance we'd have to merge >> the cost/profitability code scattered throughout the forward threader, >> as well as the EDGE_FSM* / EDGE_COPY* business. >> >> c) DOM changes the IL as it goes. Though we could conceivably divorce >> do the threading after DOM is done. > Yeah, it does not actually process/simplify the blocks copied by threading. > In fact I think it only registers jump threading opportunities during the DOM > walk and commits them only later. But it of course uses its avail / copies > stack to find them - that part you cannot easily divorce. Well, divorcing from using the context sensitive avail/copies is part of what Aldy & Andrew have been working on. All indications I've seen are they're on track to be able to do that. And yes, it only registers the threads and waits until after DOM is done to transform the CFG. That in and of itself introduces all kinds of complexity. If we can get to the point where we don't need the context sensitive avail/copies, then we've got a real shot at untangling DOM and threading which would be a huge maintainability win in my mind. > > DOM is also yet another value-numbering framework - one could think > of stripping it down from that side, keeping the threading bits only > (but you'd still have the avail / copies bits). Yes. I think you and I touched on this a while back. 
At a high level I'd prefer to have FRE rather than DOM doing the bulk of the redundant expression elimination. The big blocker there was the tight integration of DOM and threading. But if Aldy can untangle that we can then evaluate replacing DOM with FRE. > > That said, it has one nice property it can leverage due to its incredibly > simple memory redundancy handling, in that it usually performs way less > alias queries than FRE (even when you run the latter in non-iterative mode). DOM as an infrastructure for optimization is probably reaching the end of its useful life. FRE has a lot more going for it. > > But the same way DOM can register jump threading opportunities FRE > could do as well. I'd advise against that and instead look towards a model where no pass has integrated jump threading and the only jump threading module we have is the backwards threader. jeff ^ permalink raw reply [flat|nested] 33+ messages in thread
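[Editorial aside: the redundancy elimination that both DOM and FRE perform boils down to hash-based value numbering — two expressions over the same operand values get the same value number, so the later one is redundant. A minimal sketch, purely illustrative and far simpler than either pass; the class and method names are invented.]

```cpp
#include <cassert>
#include <map>
#include <tuple>

// Minimal hash-based value numbering: two expressions with the same
// opcode and the same operand value numbers receive the same value
// number, so the second computation is redundant and can be replaced
// by the first one's result.
class value_table
{
  std::map<std::tuple<char, int, int>, int> m_table;
  int m_next_vn = 0;

public:
  // Returns the value number for OP (VN1, VN2); an already-present
  // entry means the expression is redundant.
  int
  lookup_or_add (char op, int vn1, int vn2, bool *redundant)
  {
    auto key = std::make_tuple (op, vn1, vn2);
    auto it = m_table.find (key);
    *redundant = (it != m_table.end ());
    if (*redundant)
      return it->second;
    int vn = m_next_vn++;
    m_table.emplace (key, vn);
    return vn;
  }
};
```

DOM builds such a table along the dominator walk; FRE does it over the whole function with proper alias handling, which is why it catches more.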
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-10 15:51 ` Jeff Law @ 2021-09-10 16:11 ` Aldy Hernandez 0 siblings, 0 replies; 33+ messages in thread From: Aldy Hernandez @ 2021-09-10 16:11 UTC (permalink / raw) To: Jeff Law, Richard Biener; +Cc: Michael Matz, GCC Mailing List, Andrew MacLeod On 9/10/21 5:51 PM, Jeff Law wrote: > > > On 9/9/2021 4:15 AM, Richard Biener wrote: >> >>> b) Even though we can seemingly fold everything DOM/threader does, in >>> order to replace it with a backward threader instance we'd have to merge >>> the cost/profitability code scattered throughout the forward threader, >>> as well as the EDGE_FSM* / EDGE_COPY* business. >>> >>> c) DOM changes the IL as it goes. Though we could conceivably divorce >>> do the threading after DOM is done. >> Yeah, it does not actually process/simplify the blocks copied by >> threading. >> In fact I think it only registers jump threading opportunities during >> the DOM >> walk and commits them only later. But it of course uses its avail / >> copies >> stack to find them - that part you cannot easily divorce. > Well, divorcing from using the context sensitive avail/copies is part of > what Aldy & Andrew have been working on. All indications I've seen are > they're on track to be able to do that. > > And yes, it only registers the threads and waits until after DOM is done > to transform the CFG. That in and of itself introduces all kinds of > complexity. If we can get to the point where we don't need the context > sensitive avail/copies, then we've got a real shot at untangling DOM and > threading which would be a huge maintainability win in my mind. > >> >> DOM is also yet another value-numbering framework - one could think >> of stripping it down from that side, keeping the threading bits only >> (but you'd still have the avail / copies bits). > Yes. I think you and I touched on this a while back. 
At a high level > I'd prefer to have FRE rather than DOM doing the bulk of the redundant > expression elimination. The big blocker there was the tight integration > of DOM and threading. But if Aldy can untangle that we can then > evaluate replacing DOM with FRE. Once ranger does floats, I can't think of anything the forward threader could get that the backward threader couldn't. > > >> >> That said, it has one nice property it can leverage due to its incredibly >> simple memory redundancy handling, in that it usually performs way less >> alias queries than FRE (even when you run the latter in non-iterative >> mode). > DOM as an infrastructure for optimization is probably reaching the end > of its useful life. FRE has a lot more going for it. > >> >> But the same way DOM can register jump threading opportunities FRE >> could do as well. > I'd advise against that and instead look towards a model where no pass > has integrated jump threading and the only jump threading module we have > is the backwards threader. Yes, please. We need to separate jump threading from all passes. One thing, and do it well. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 9:21 ` Aldy Hernandez 2021-09-09 10:15 ` Richard Biener @ 2021-09-10 15:43 ` Jeff Law 2021-09-10 16:05 ` Aldy Hernandez 1 sibling, 1 reply; 33+ messages in thread From: Jeff Law @ 2021-09-10 15:43 UTC (permalink / raw) To: Aldy Hernandez, Richard Biener Cc: Michael Matz, GCC Mailing List, Andrew MacLeod On 9/9/2021 3:21 AM, Aldy Hernandez wrote: > >> >> /* If this path does not thread through the loop latch, then we are >> using the FSM threader to find old style jump threads. This >> is good, except the FSM threader does not re-use an existing >> threading path to reduce code duplication. >> >> So for that case, drastically reduce the number of statements >> we are allowed to copy. */ > > *blink* > > Woah. The backward threader has been using FSM threads > indiscriminately as far as I can remember. I wonder what would break > if we "fixed it". ?!? I'm not sure what you're suggesting here. If you s/FSM threader/backwards threader/ in the comment does it make more sense? The term FSM really should largely have been dropped as the backwards threader was improved to handle more cases. > >> >> so these cases should use the "old style" validity/costing metrics >> and thus >> classify threading opportunities in a different way? > > Jeff, do you have any insight here? This is precisely what you're cleaning up. > >> >> I think today "backwards" vs, "forwards" only refers to the way we find >> threading opportunities. > > Yes, it's a mess. > > I ran some experiments a while back, and my current work on the > enhanced solver/threader, can fold virtually everything the > DOM/threader gets (even with its use of const_and_copies, avail_exprs, > and evrp_range_analyzer), while getting 5% more DOM threads and 1% > more overall threads. That is, I've been testing if the path solver > can solve everything the DOM threader needs (the hybrid approach I > mentioned). 
> > Unfortunately, replacing the forward threader right now is not > feasible for a few reasons: Right. But I thought the short term goal was to replace/remove the forward threading from VRP. Dropping from DOM is going to be tougher. > > a) The const_and_copies/avail_exprs relation framework can do floats, > and that's next year's ranger work. Right. I'd actually run into this as well when I wanted to drop all the range bits out of DOM and rely exclusively on EVRP. It'd still be a step forward to rip out the EVRP engine from DOM and simplify all the code that derives one equivalence from another so that it's only working on FP values. > > b) Even though we can seemingly fold everything DOM/threader does, in > order to replace it with a backward threader instance we'd have to > merge the cost/profitability code scattered throughout the forward > threader, as well as the EDGE_FSM* / EDGE_COPY* business. Right. This is a prerequisite. Though some of the costing will need to be conditional on the threader being used. Refer back to the discussion around how the forward threader can commonize thread paths that lead to the same destination while the backwards threader can not. > > c) DOM changes the IL as it goes. Though we could conceivably divorce > do the threading after DOM is done. The only reason threading runs in parallel with DOM is so that it can use the context sensitive equivalences. With the infrastructure you're building, there's a reasonable chance we can move to a model where we run DOM (and in the long term a simpler DOM) and threading as distinct, independent passes. Jeff ^ permalink raw reply [flat|nested] 33+ messages in thread
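[Editorial aside: the "context sensitive avail/copies" referred to above is essentially an equivalence table with an undo log — facts recorded while the dominator walk descends into a block are popped when the walk backs out. A stripped-down sketch with invented names, nothing like the real data structures in scope:]

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Context-sensitive constant table with an unwind stack, in the spirit
// of DOM's const_and_copies: facts recorded after a marker is pushed
// are undone when the dominator walk leaves the corresponding block.
class scoped_consts
{
  std::map<std::string, long> m_vals;
  std::vector<std::pair<std::string, std::optional<long>>> m_undo;
  std::vector<std::size_t> m_marks;

public:
  void push_marker () { m_marks.push_back (m_undo.size ()); }

  void
  record (const std::string &name, long value)
  {
    auto it = m_vals.find (name);
    m_undo.emplace_back (name, it == m_vals.end ()
				 ? std::optional<long> ()
				 : std::optional<long> (it->second));
    m_vals[name] = value;
  }

  std::optional<long>
  lookup (const std::string &name) const
  {
    auto it = m_vals.find (name);
    if (it == m_vals.end ())
      return std::nullopt;
    return it->second;
  }

  void
  pop_to_marker ()
  {
    std::size_t mark = m_marks.back ();
    m_marks.pop_back ();
    while (m_undo.size () > mark)
      {
	auto [name, old] = m_undo.back ();
	m_undo.pop_back ();
	if (old)
	  m_vals[name] = *old;
	else
	  m_vals.erase (name);
      }
  }
};
```

Because the table's contents depend on where in the walk you are, any consumer (like the threader) has to run in lockstep with DOM — which is exactly the coupling being discussed.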
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-10 15:43 ` Jeff Law @ 2021-09-10 16:05 ` Aldy Hernandez 2021-09-10 16:21 ` Jeff Law 0 siblings, 1 reply; 33+ messages in thread From: Aldy Hernandez @ 2021-09-10 16:05 UTC (permalink / raw) To: Jeff Law, Richard Biener; +Cc: Michael Matz, GCC Mailing List, Andrew MacLeod On 9/10/21 5:43 PM, Jeff Law wrote: > > > On 9/9/2021 3:21 AM, Aldy Hernandez wrote: >> >>> >>> /* If this path does not thread through the loop latch, then we are >>> using the FSM threader to find old style jump threads. This >>> is good, except the FSM threader does not re-use an existing >>> threading path to reduce code duplication. >>> >>> So for that case, drastically reduce the number of statements >>> we are allowed to copy. */ >> >> *blink* >> >> Woah. The backward threader has been using FSM threads >> indiscriminately as far as I can remember. I wonder what would break >> if we "fixed it". > ?!? I'm not sure what you're suggesting here. If you s/FSM > threader/backwards threader/ in the comment does it make more sense? The > term FSM really should largely have been dropped as the backwards > threader was improved to handle more cases. back_threader_registry::register_path() uses EDGE_FSM_THREAD as the thread type to register threads. I was wondering if it should have been some combination of EDGE_START_JUMP_THREAD / EDGE_*_COPY_SRC_BLOCK, etc. I (purposely) know nothing about the underlying threading types ;-). But if the backwards threader has been improved, then perhaps we should just remove the confusing FSM references. > > > > >> >>> >>> so these cases should use the "old style" validity/costing metrics >>> and thus >>> classify threading opportunities in a different way? >> >> Jeff, do you have any insight here? > This is precisely what you're cleaning up. > > >> >>> >>> I think today "backwards" vs, "forwards" only refers to the way we find >>> threading opportunities. >> >> Yes, it's a mess. 
>> >> I ran some experiments a while back, and my current work on the >> enhanced solver/threader, can fold virtually everything the >> DOM/threader gets (even with its use of const_and_copies, avail_exprs, >> and evrp_range_analyzer), while getting 5% more DOM threads and 1% >> more overall threads. That is, I've been testing if the path solver >> can solve everything the DOM threader needs (the hybrid approach I >> mentioned). >> >> Unfortunately, replacing the forward threader right now is not >> feasible for a few reasons: > Right. But I thought the short term goal was to replace/remove the > forward threading from VRP. Dropping from DOM is going to be tougher. My current thinking is that replacing the forward VRP threader with a hybrid one is a gentler approach to the longer term goal of replacing the forward threader altogether. However, all the work I've been doing could go either way-- we could try the forward/VRP replacement or a hybrid approach. It will all use the path solver underneath. My main problem with replacing the forward/VRP with a backward client is that the cost models are so different that it was difficult to compare how we fared. I constantly ran into threads the solver could handle just fine, but profitable_path_p was holding it back. FWIW, we get virtually everything the forward threader gets, minus a very few things. At least when I plug in the solver to the DOM/forwarder threader, it can solve everything it can (minus noise and floats). If you prefer a backward threader instance to replace the VRP/forward threader, I'm game. It's just harder to compare. Either way (backward threader or a hybrid forward+solver) uses the same underlying solver which is solid. >> a) The const_and_copies/avail_exprs relation framework can do floats, >> and that's next year's ranger work. > Right. I'd actually run into this as well when I wanted to drop all the > range bits out of DOM and rely exclusively on EVRP. 
It'd still be a > step forward to rip out the EVRP engine from DOM and simplify all the > code that derives one equivalence from another so that it's only working > on FP values. Sure. > >> >> b) Even though we can seemingly fold everything DOM/threader does, in >> order to replace it with a backward threader instance we'd have to >> merge the cost/profitability code scattered throughout the forward >> threader, as well as the EDGE_FSM* / EDGE_COPY* business. > Right. This is a prerequisite. Though some of the costing will need to > be conditional on the threader being used. Refer back to the discussion > around how the forward threader can commonize thread paths that lead to > the same destination while the backwards threader can not. Yup, yup. > >> >> c) DOM changes the IL as it goes. Though we could conceivably divorce >> do the threading after DOM is done. > The only reason threading runs in parallel with DOM is so that it can > use the context sensitive equivalences. With the infrastructure you're > building, there's a reasonable chance we can move to a model where we > run DOM (and in the long term a simpler DOM) and threading as distinct, > independent passes. Andrew mumbled something about replacing all of DOM eventually :-). Well, except that value-numbering business I bet. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
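[Editorial aside: the hybrid idea — letting a path solver compute what the avail/copies tables used to provide — rests on intersecting ranges along the edges of a path, as in the BB5->BB9->BB3 trace at the top of the thread. A toy version over integer intervals; the names are invented and this resembles neither irange nor the real path_range_query API.]

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy integer interval standing in for a value range.
struct irange
{
  long lo, hi;
  bool undefined () const { return lo > hi; }  // empty: path unreachable
};

// Intersecting a starting range with the constraint recorded on each
// edge of the path yields the range of the name at the path's end.
static irange
solve_along_path (irange r, const std::vector<irange> &edge_constraints)
{
  for (const irange &c : edge_constraints)
    r = irange{std::max (r.lo, c.lo), std::min (r.hi, c.hi)};
  return r;
}
```

In the dump at the top, j_13 is globally [1, 257] after the increment and the 5->9 edge narrows it to [1, 256] — the same intersection this sketch performs.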
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-10 16:05 ` Aldy Hernandez @ 2021-09-10 16:21 ` Jeff Law 2021-09-10 16:38 ` Aldy Hernandez 0 siblings, 1 reply; 33+ messages in thread From: Jeff Law @ 2021-09-10 16:21 UTC (permalink / raw) To: Aldy Hernandez, Richard Biener Cc: Michael Matz, GCC Mailing List, Andrew MacLeod On 9/10/2021 10:05 AM, Aldy Hernandez wrote: > > > On 9/10/21 5:43 PM, Jeff Law wrote: >> >> >> On 9/9/2021 3:21 AM, Aldy Hernandez wrote: >>> >>>> >>>> /* If this path does not thread through the loop latch, then we are >>>> using the FSM threader to find old style jump threads. This >>>> is good, except the FSM threader does not re-use an existing >>>> threading path to reduce code duplication. >>>> >>>> So for that case, drastically reduce the number of statements >>>> we are allowed to copy. */ >>> >>> *blink* >>> >>> Woah. The backward threader has been using FSM threads >>> indiscriminately as far as I can remember. I wonder what would >>> break if we "fixed it". >> ?!? I'm not sure what you're suggesting here. If you s/FSM >> threader/backwards threader/ in the comment does it make more sense? >> The term FSM really should largely have been dropped as the backwards >> threader was improved to handle more cases. > > back_threader_registry::register_path() uses EDGE_FSM_THREAD as the > thread type to register threads. I was wondering if it should have > been some combination of EDGE_START_JUMP_THREAD / > EDGE_*_COPY_SRC_BLOCK, etc. I (purposely) know nothing about the > underlying threading types ;-). But if the backwards threader has been > improved, then perhaps we should just remove the confusing FSM > references. No we shouldn't change it to any of the other types. EDGE_FSM_THREAD means a thread found by the backwards threader and it's the key used to determine which of the two CFG updating mechanisms should be used, the generic copier in the case of EDGE_FSM_THREAD. Changing the name, yes, absolutely. 
I probably should have done that when the original FSM threader was tweaked to handle generic threading. As you've probably heard me mention before, all the EDGE_FSM_THREAD stuff in the registry really could be pulled out. The registry's purpose is to deal with the two stage nature of jump threading in DOM (find threads, then optimize them later). I don't think any of the backwards threading bits need that two stage handling. > My current thinking is that replacing the forward VRP threader with a > hybrid one is a gentler approach to the longer term goal of replacing > the forward threader altogether. However, all the work I've been > doing could go either way-- we could try the forward/VRP replacement > or a hybrid approach. It will all use the path solver underneath. And that's probably a reasonable intermediate step on the way towards removing the VRP threading. > > My main problem with replacing the forward/VRP with a backward client > is that the cost models are so different that it was difficult to > compare how we fared. I constantly ran into threads the solver could > handle just fine, but profitable_path_p was holding it back. Yea. Sorry about that tangle of insanity > > FWIW, we get virtually everything the forward threader gets, minus a > very few things. At least when I plug in the solver to the > DOM/forwarder threader, it can solve everything it can (minus noise > and floats). So once you plug those bits in, we don't have to carry around the avail/copies tables for the threader anymore, right? That's a nice cleanup in and of itself. > > If you prefer a backward threader instance to replace the VRP/forward > threader, I'm game. It's just harder to compare. Either way (backward > threader or a hybrid forward+solver) uses the same underlying solver > which is solid. I think we can go hybrid, then look at the next step, which could well be bringing some consistency into the costing models. >> >>> >>> c) DOM changes the IL as it goes. 
Though we could conceivably >>> divorce do the threading after DOM is done. >> The only reason threading runs in parallel with DOM is so that it can >> use the context sensitive equivalences. With the infrastructure >> you're building, there's a reasonable chance we can move to a model >> where we run DOM (and in the long term a simpler DOM) and threading >> as distinct, independent passes. > > Andrew mumbled something about replacing all of DOM eventually :-). > Well, except that value-numbering business I bet. Essentially a realization of Bodik's work in GCC. The nugget in there is it's a path sensitive optimizer. That's kindof what I've envisioned DOM turning into. 1. We separate out jump threading from DOM. 2. We replace the bulk of DOM with FRE 3. The remnants of DOM turn into a path sensitive optimizer (and for god's sake we don't want to call it DOM anymore :-) Jeff ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-10 16:21 ` Jeff Law @ 2021-09-10 16:38 ` Aldy Hernandez 0 siblings, 0 replies; 33+ messages in thread From: Aldy Hernandez @ 2021-09-10 16:38 UTC (permalink / raw) To: Jeff Law, Richard Biener; +Cc: Michael Matz, GCC Mailing List, Andrew MacLeod On 9/10/21 6:21 PM, Jeff Law wrote: > > > On 9/10/2021 10:05 AM, Aldy Hernandez wrote: >> >> >> On 9/10/21 5:43 PM, Jeff Law wrote: >>> >>> >>> On 9/9/2021 3:21 AM, Aldy Hernandez wrote: >>>> >>>>> >>>>> /* If this path does not thread through the loop latch, then we are >>>>> using the FSM threader to find old style jump threads. This >>>>> is good, except the FSM threader does not re-use an existing >>>>> threading path to reduce code duplication. >>>>> >>>>> So for that case, drastically reduce the number of statements >>>>> we are allowed to copy. */ >>>> >>>> *blink* >>>> >>>> Woah. The backward threader has been using FSM threads >>>> indiscriminately as far as I can remember. I wonder what would >>>> break if we "fixed it". >>> ?!? I'm not sure what you're suggesting here. If you s/FSM >>> threader/backwards threader/ in the comment does it make more sense? >>> The term FSM really should largely have been dropped as the backwards >>> threader was improved to handle more cases. >> >> back_threader_registry::register_path() uses EDGE_FSM_THREAD as the >> thread type to register threads. I was wondering if it should have >> been some combination of EDGE_START_JUMP_THREAD / >> EDGE_*_COPY_SRC_BLOCK, etc. I (purposely) know nothing about the >> underlying threading types ;-). But if the backwards threader has been >> improved, then perhaps we should just remove the confusing FSM >> references. > No we shouldn't change it to any of the other types. 
EDGE_FSM_THREAD > means a thread found by the backwards threader and it's the key used to > determine which of the two CFG updating mechanisms should be used, the > generic copier in the case of EDGE_FSM_THREAD. > > > Changing the name, yes, absolutely. I probably should have done that > when the original FSM threader was tweaked to handle generic threading. I'll put that on my list. > > As you've probably heard me mention before, all the EDGE_FSM_THREAD > stuff in the registry really could be pulled out. The registry's > purpose is to deal with the two stage nature of jump threading in DOM > (find threads, then optimize them later). I don't think any of the > backwards threading bits need that two stage handling. Yeah yeah. But you said that a year ago, and all I heard was *mumble mumble/complicated things*. That was before I had much exposure to the code. Now I feel a lot more comfortable ;-). I'll also put this on my list, but it may not get done this release, cause I'm running against the clock with the VRP/threader replacement, which is what's keeping us back from replacing VRP with an evrp instance right now :). > >> My current thinking is that replacing the forward VRP threader with a >> hybrid one is a gentler approach to the longer term goal of replacing >> the forward threader altogether. However, all the work I've been >> doing could go either way-- we could try the forward/VRP replacement >> or a hybrid approach. It will all use the path solver underneath. > And that's probably a reasonable intermediate step on the way towards > removing the VRP threading. > >> >> My main problem with replacing the forward/VRP with a backward client >> is that the cost models are so different that it was difficult to >> compare how we fared. I constantly ran into threads the solver could >> handle just fine, but profitable_path_p was holding it back. > Yea. 
Sorry about that tangle of insanity > >> >> FWIW, we get virtually everything the forward threader gets, minus a >> very few things. At least when I plug in the solver to the >> DOM/forwarder threader, it can solve everything it can (minus noise >> and floats). > So once you plug those bits in, we don't have to carry around the > avail/copies tables for the threader anymore, right? That's a nice > cleanup in and of itself. Correct. For the VRP/hybrid approach I'm working on, there are no copies/avails. The solver can do everything they did. After all, it's an easier target, since VRP threading only happens on ints and without the IL changing constantly. > >> >> If you prefer a backward threader instance to replace the VRP/forward >> threader, I'm game. It's just harder to compare. Either way (backward >> threader or a hybrid forward+solver) uses the same underlying solver >> which is solid. > I think we can go hybrid, then look at the next step, which could well > be bringing some consistency into the costing models. >>> >>>> >>>> c) DOM changes the IL as it goes. Though we could conceivably >>>> divorce do the threading after DOM is done. >>> The only reason threading runs in parallel with DOM is so that it can >>> use the context sensitive equivalences. With the infrastructure >>> you're building, there's a reasonable chance we can move to a model >>> where we run DOM (and in the long term a simpler DOM) and threading >>> as distinct, independent passes. >> >> Andrew mumbled something about replacing all of DOM eventually :-). >> Well, except that value-numbering business I bet. > Essentially a realization of Bodik's work in GCC. The nugget in there > is it's a path sensitive optimizer. That's kindof what I've envisioned > DOM turning into. > > 1. We separate out jump threading from DOM. > 2. We replace the bulk of DOM with FRE > 3. 
The remnants of DOM turn into a path sensitive optimizer (and for > god's sake we don't want to call it DOM anymore :-) Well, my tree has improvements to the solver for full path sensitive ranges (using ranger to resolve definitions outside of the path). And it also does path sensitive relations, though Andrew is overhauling them next week. So yeah, given a path, we could tell you all sorts of interesting things about it :). Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
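As a minimal sketch of the path-sensitive range reasoning discussed in this exchange (assuming nothing about GCC's actual irange/path-solver API — the types and helpers below are illustrative stand-ins), the j_19 -> j_13 step and the refinement of n_10 on the taken edge of the 5->9->3 path look roughly like:

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// Toy model of the path solver's range reasoning; "irange" here is a
// stand-in, not GCC's irange class.
struct irange { int lo, hi; };

// j_13 = j_19 + 1 under simple interval arithmetic.
irange add_one (irange r) { return { r.lo + 1, r.hi + 1 }; }

// On the true edge of "if (n > j)", n must exceed the smallest value
// j can take on this path, so n's lower bound is raised accordingly.
irange refine_gt (irange n, irange j)
{
  return { std::max (n.lo, j.lo + 1), n.hi };
}
```

With j_19 in [0, 256] this yields j_13 in [1, 257], and the true edge of `n_10 > j_13` raises the lower bound of n_10 to 2 — matching the "n_10 is [2, +INF] at the end of the path" observation that started the thread.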
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 8:58 ` Richard Biener 2021-09-09 9:21 ` Aldy Hernandez @ 2021-09-09 16:59 ` Jeff Law 1 sibling, 0 replies; 33+ messages in thread From: Jeff Law @ 2021-09-09 16:59 UTC (permalink / raw) To: Richard Biener, Aldy Hernandez Cc: Michael Matz, GCC Mailing List, Andrew MacLeod On 9/9/2021 2:58 AM, Richard Biener wrote: > On Thu, Sep 9, 2021 at 10:36 AM Aldy Hernandez <aldyh@redhat.com> wrote: >> >> >> On 9/9/21 9:45 AM, Richard Biener wrote: >>> On Thu, Sep 9, 2021 at 9:37 AM Aldy Hernandez <aldyh@redhat.com> wrote: >>>> >>>> >>>> On 9/9/21 8:57 AM, Richard Biener wrote: >>>>> On Wed, Sep 8, 2021 at 8:13 PM Michael Matz <matz@suse.de> wrote: >>>>>> Hello, >>>>>> >>>>>> [lame answer to self] >>>>>> >>>>>> On Wed, 8 Sep 2021, Michael Matz wrote: >>>>>> >>>>>>>>> The forward threader guards against this by simply disallowing >>>>>>>>> threadings that involve different loops. As I see >>>>>>>> The thread in question (5->9->3) is all within the same outer loop, >>>>>>>> though. BTW, the backward threader also disallows threading across >>>>>>>> different loops (see path_crosses_loops variable). >>>>>> ... >>>>>>> Maybe it's possible to not disable threading over latches alltogether in >>>>>>> the backward threader (like it's tried now), but I haven't looked at the >>>>>>> specific situation here in depth, so take my view only as opinion from a >>>>>>> large distance :-) >>>>>> I've now looked at the concrete situation. So yeah, the whole path is in >>>>>> the same loop, crosses the latch, _and there's code following the latch >>>>>> on that path_. (I.e. the latch isn't the last block in the path). 
In >>>>>> particular, after loop_optimizer_init() (before any threading) we have: >>>>>> >>>>>> <bb 3> [local count: 118111600]: >>>>>> # j_19 = PHI <j_13(9), 0(7)> >>>>>> sum_11 = c[j_19]; >>>>>> if (n_10(D) > 0) >>>>>> goto <bb 8>; [89.00%] >>>>>> else >>>>>> goto <bb 5>; [11.00%] >>>>>> >>>>>> <bb 8> [local count: 105119324]: >>>>>> ... >>>>>> >>>>>> <bb 5> [local count: 118111600]: >>>>>> # sum_21 = PHI <sum_14(4), sum_11(3)> >>>>>> c[j_19] = sum_21; >>>>>> j_13 = j_19 + 1; >>>>>> if (n_10(D) > j_13) >>>>>> goto <bb 9>; [89.00%] >>>>>> else >>>>>> goto <bb 6>; [11.00%] >>>>>> >>>>>> <bb 9> [local count: 105119324]: >>>>>> goto <bb 3>; [100.00%] >>>>>> >>>>>> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the >>>>>> pre-header of inner loop, but more importantly something that's not at the >>>>>> start of the outer loop. >>>>>> >>>>>> Now, any thread that includes the backedge 9->3 _including_ its >>>>>> destination (i.e. where the backedge isn't the last to-be-redirected edge) >>>>>> necessarily duplicates all code from that destination onto the back edge. >>>>>> Here it's the load from c[j] into sum_11. >>>>>> >>>>>> The important part is the code is emitted onto the back edge, >>>>>> conceptually; in reality it's simply included into the (new) latch block >>>>>> (the duplicate of bb9, which is bb12 intermediately, then named bb7 after >>>>>> cfg_cleanup). >>>>>> >>>>>> That's what we can't have for some of our structural loop optimizers: >>>>>> there must be no code executed after the exit test (e.g. in the latch >>>>>> block). (This requirement makes reasoning about which code is or isn't >>>>>> executed completely for an iteration trivial; simply everything in the >>>>>> body is always executed; e.g. loop interchange uses this to check that >>>>>> there are no memory references after the exit test, because those would >>>>>> then be only conditional and hence make loop interchange very awkward). 
>>>>>> >>>>>> Note that this situation can't be later rectified anymore: the duplicated >>>>>> instructions (because they are memory refs) must remain after the exit >>>>>> test. Only by rerolling/unrotating the loop (i.e. noticing that the >>>>>> memory refs on the loop-entry path and on the back edge are equivalent) >>>>>> would that be possible, but that's something we aren't capable of. Even >>>>>> if we were that would simply just revert the whole work that the threader >>>>>> did, so it's better to not even do that to start with. >>>>>> >>>>>> I believe something like below would be appropriate, it disables threading >>>>>> if the path contains a latch at the non-last position (due to being >>>>>> backwards on the non-first position in the array). I.e. it disables >>>>>> rotating the loop if there's danger of polluting the back edge. It might >>>>>> be improved if the blocks following (preceding!) the latch are themself >>>>>> empty because then no code is duplicated. It might also be improved if >>>>>> the latch is already non-empty. That code should probably only be active >>>>>> before the loop optimizers, but currently the backward threader isn't >>>>>> differentiating between before/after loop-optims. >>>>>> >>>>>> I haven't tested this patch at all, except that it fixes the testcase :) >>>>> Lame comment at the current end of the thread - it's not threading through the >>>> I don't know why y'all keep using the word "lame". On the contrary, >>>> these are incredibly useful explanations. Thanks. >>>> >>>>> latch but threading through the loop header that's problematic, at least if the >>>>> end of the threading path ends within the loop (threading through the header >>>>> to the loop exit is fine). Because in that situation you effectively created an >>>>> alternate loop entry. 
Threading through the latch into the loop header is >>>>> fine but with simple latches that likely will never happen (if there are no >>>>> simple latches then the latch can contain the loop exit test). >>>>> >>>>> See tree-ssa-threadupdate.c:thread_block_1 >>>>> >>>>> e2 = path->last ()->e; >>>>> if (!e2 || noloop_only) >>>>> { >>>>> /* If NOLOOP_ONLY is true, we only allow threading through the >>>>> header of a loop to exit edges. */ >>>>> >>>>> /* One case occurs when there was loop header buried in a jump >>>>> threading path that crosses loop boundaries. We do not try >>>>> and thread this elsewhere, so just cancel the jump threading >>>>> request by clearing the AUX field now. */ >>>>> if (bb->loop_father != e2->src->loop_father >>>>> && (!loop_exit_edge_p (e2->src->loop_father, e2) >>>>> || flow_loop_nested_p (bb->loop_father, >>>>> e2->dest->loop_father))) >>>>> { >>>>> /* Since this case is not handled by our special code >>>>> to thread through a loop header, we must explicitly >>>>> cancel the threading request here. */ >>>>> delete_jump_thread_path (path); >>>>> e->aux = NULL; >>>>> continue; >>>>> } >>>> But this is for a threading path that crosses loop boundaries, which is >>>> not the case. Perhaps we should restrict this further to threads within >>>> a loop? >>>> >>>>> there are a lot of "useful" checks in this function and the backwards threader >>>>> should adopt those. Note the backwards threader originally only did >>>>> FSM style threadings which are exactly those possibly "harmful" ones, forming >>>>> irreducible regions at worst or sub-loops at best. That might explain the >>>>> lack of those checks. >>>> Also, the aforementioned checks are in jump_thread_path_registry, which >>>> is also shared by the backward threader. These are thread discards >>>> _after_ a thread has been registered. >>> Yeah, that's indeed unfortunate. >>> >>>> The backward threader should also >>>> be using these restrictions. 
Unless, I'm missing some interaction with >>>> the FSM/etc threading types as per the preamble to the snippet you provided: >>>> >>>> if (((*path)[1]->type == EDGE_COPY_SRC_JOINER_BLOCK && !joiners) >>>> || ((*path)[1]->type == EDGE_COPY_SRC_BLOCK && joiners)) >>>> continue; >>> Indeed. But I understand the backwards threader does not (only) do FSM >>> threading now. >> If it does, it was not part of my rewrite. I was careful to not touch >> anything dealing with either path profitability or low-level path >> registering. >> >> The path registering is in back_threader_registry::register_path(). We >> only use EDGE_FSM_THREADs and then a final EDGE_NO_COPY. ISTM that >> those are only EDGE_FSM_THREADs?? > Well, if the backwards threader classifies everything as FSM that's probably > inaccurate since only threads through the loop latch are "FSM". There is > the comment FSM is used all over the place, but it really just applies to threading through the latch and through an indirect jump/switch. Many of the places that say FSM should probably just say "backwards" or something similar. > > /* If this path does not thread through the loop latch, then we are > using the FSM threader to find old style jump threads. This > is good, except the FSM threader does not re-use an existing > threading path to reduce code duplication. > > So for that case, drastically reduce the number of statements > we are allowed to copy. */ > > so these cases should use the "old style" validity/costing metrics and thus > classify threading opportunities in a different way? It's not that simple. The backwards threader uses a different block copier and CFG update mechanism that does not support commonizing threads from multiple paths to the same target. So each jump thread ends up with a copy of the path which can lead to excessive code bloat. 
The forward threader knows how to commonize those paths to cut down on the duplicates and thus can accept larger blocks without having such a negative impact on code size. > > I think today "backwards" vs. "forwards" only refers to the way we find > threading opportunities. Correct. jeff ^ permalink raw reply [flat|nested] 33+ messages in thread
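The naming point Jeff makes — that only threads through the loop latch are genuinely "FSM" — can be sketched as a tiny predicate (toy types, not GCC's; the real classification lives in the threader and registry code):

```cpp
#include <cassert>
#include <vector>

// Toy model: a path found by the backwards threader is only truly "FSM"
// if it goes through the loop latch; everything else is an old-style jump
// thread that merely reuses the FSM label.
struct block { int id; bool is_latch; };

bool fsm_style_p (const std::vector<block> &path)
{
  for (const block &b : path)
    if (b.is_latch)
      return true;
  return false;
}
```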
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 6:57 ` Richard Biener 2021-09-09 7:37 ` Aldy Hernandez @ 2021-09-09 12:47 ` Michael Matz 1 sibling, 0 replies; 33+ messages in thread From: Michael Matz @ 2021-09-09 12:47 UTC (permalink / raw) To: Richard Biener; +Cc: Aldy Hernandez, Jeff Law, GCC Mailing List, Andrew MacLeod Hello, On Thu, 9 Sep 2021, Richard Biener wrote: > > I believe something like below would be appropriate, it disables > > threading if the path contains a latch at the non-last position (due > > to being backwards on the non-first position in the array). I.e. it > > disables rotating the loop if there's danger of polluting the back > > edge. It might be improved if the blocks following (preceding!) the > > latch are themself empty because then no code is duplicated. It might > > also be improved if the latch is already non-empty. That code should > > probably only be active before the loop optimizers, but currently the > > backward threader isn't differentiating between before/after > > loop-optims. > > > > I haven't tested this patch at all, except that it fixes the testcase > > :) > > Lame comment at the current end of the thread - it's not threading > through the latch but threading through the loop header that's > problematic, I beg to differ, but that's probably because of the ambiguity of the word "through" (does it or does it not include the ends of the path :) ). If you thread through the loop header from the entry block (i.e. duplicate code from header to entry) all would be fine (or not, in case you created multiple entries from outside). If you thread through the latch, then through an empty header and then through another non-empty basic block within the loop, you still wouldn't be fine: you've just created code on the back edge (and hence into the new latch block). If you thread through the latch and through an empty header (and stop there), you also are fine. 
Also note that in this situation you do _not_ create new entries into the loop, not even intermediately. The old back edge is the one that goes away due to the threading, the old entry edge is moved completely out of the loop, the edge header->thread-dest becomes the new entry edge, and the edge new-latch->thread-dest becomes the back edge. No additional entries. So, no, it's not the threading through the loop header that is problematic but creating a situation that fills the (new) latch with code, and that can only happen if the candidate thread contains the latch block. (Of course it's somewhat related to the header block as well, because that is simply the only destination the latch has, and hence is usually included in any thread that also includes the latch; but it's not the header that indicates the situation). > See tree-ssa-threadupdate.c:thread_block_1 > > e2 = path->last ()->e; > if (!e2 || noloop_only) > { > /* If NOLOOP_ONLY is true, we only allow threading through the > header of a loop to exit edges. */ > > /* One case occurs when there was loop header buried in a jump > threading path that crosses loop boundaries. We do not try > and thread this elsewhere, so just cancel the jump threading > request by clearing the AUX field now. */ > if (bb->loop_father != e2->src->loop_father > && (!loop_exit_edge_p (e2->src->loop_father, e2) > || flow_loop_nested_p (bb->loop_father, > e2->dest->loop_father))) > { > /* Since this case is not handled by our special code > to thread through a loop header, we must explicitly > cancel the threading request here. */ > delete_jump_thread_path (path); > e->aux = NULL; > continue; > } Yeah, sure, but I'm not sure if the above check is _really_ testing what it wants to test or if the effects it achieves are side effects.
Like in my proposed patch: I could also test for the existence of a loop header in the thread and reject that; it would work as well, except that it works because any useful thread including a latch (which is the problematic one) also includes the header. I'm not sure if the above check is along the same lines, or tests for yet another situation. Ciao, Michael. ^ permalink raw reply [flat|nested] 33+ messages in thread
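The rejection test Michael's patch proposes can be sketched as follows (toy types, not GCC's — the real change sits in back_threader_profitability::profitable_path_p, which iterates the path array backwards):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The path array is stored backwards, so index 0 holds the *last* block
// of the thread. A latch at any other index means code from the blocks
// that execute after it would be duplicated onto the back edge.
bool latch_at_nonlast_position_p (const std::vector<int> &m_path, int latch_id)
{
  for (std::size_t j = 1; j < m_path.size (); ++j)  // index 0 == last block
    if (m_path[j] == latch_id)
      return true;
  return false;
}
```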
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-08 18:13 ` Michael Matz 2021-09-09 6:57 ` Richard Biener @ 2021-09-09 8:14 ` Aldy Hernandez 2021-09-09 8:24 ` Richard Biener ` (2 more replies) 1 sibling, 3 replies; 33+ messages in thread From: Aldy Hernandez @ 2021-09-09 8:14 UTC (permalink / raw) To: Michael Matz; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod On 9/8/21 8:13 PM, Michael Matz wrote: > Hello, > > [lame answer to self] > > On Wed, 8 Sep 2021, Michael Matz wrote: > >>>> The forward threader guards against this by simply disallowing >>>> threadings that involve different loops. As I see >>> >>> The thread in question (5->9->3) is all within the same outer loop, >>> though. BTW, the backward threader also disallows threading across >>> different loops (see path_crosses_loops variable). > ... >> Maybe it's possible to not disable threading over latches alltogether in >> the backward threader (like it's tried now), but I haven't looked at the >> specific situation here in depth, so take my view only as opinion from a >> large distance :-) > > I've now looked at the concrete situation. So yeah, the whole path is in > the same loop, crosses the latch, _and there's code following the latch > on that path_. (I.e. the latch isn't the last block in the path). In > particular, after loop_optimizer_init() (before any threading) we have: > > <bb 3> [local count: 118111600]: > # j_19 = PHI <j_13(9), 0(7)> > sum_11 = c[j_19]; > if (n_10(D) > 0) > goto <bb 8>; [89.00%] > else > goto <bb 5>; [11.00%] > > <bb 8> [local count: 105119324]: > ... 
> > <bb 5> [local count: 118111600]: > # sum_21 = PHI <sum_14(4), sum_11(3)> > c[j_19] = sum_21; > j_13 = j_19 + 1; > if (n_10(D) > j_13) > goto <bb 9>; [89.00%] > else > goto <bb 6>; [11.00%] > > <bb 9> [local count: 105119324]: > goto <bb 3>; [100.00%] > > With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the > pre-header of inner loop, but more importantly something that's not at the > start of the outer loop. > > Now, any thread that includes the backedge 9->3 _including_ its > destination (i.e. where the backedge isn't the last to-be-redirected edge) > necessarily duplicates all code from that destination onto the back edge. > Here it's the load from c[j] into sum_11. > > The important part is the code is emitted onto the back edge, > conceptually; in reality it's simply included into the (new) latch block > (the duplicate of bb9, which is bb12 intermediately, then named bb7 after > cfg_cleanup). > > That's what we can't have for some of our structural loop optimizers: > there must be no code executed after the exit test (e.g. in the latch > block). (This requirement makes reasoning about which code is or isn't > executed completely for an iteration trivial; simply everything in the > body is always executed; e.g. loop interchange uses this to check that > there are no memory references after the exit test, because those would > then be only conditional and hence make loop interchange very awkward). > > Note that this situation can't be later rectified anymore: the duplicated > instructions (because they are memory refs) must remain after the exit > test. Only by rerolling/unrotating the loop (i.e. noticing that the > memory refs on the loop-entry path and on the back edge are equivalent) > would that be possible, but that's something we aren't capable of. Even > if we were that would simply just revert the whole work that the threader > did, so it's better to not even do that to start with. 
> > I believe something like below would be appropriate, it disables threading > if the path contains a latch at the non-last position (due to being > backwards on the non-first position in the array). I.e. it disables > rotating the loop if there's danger of polluting the back edge. It might > be improved if the blocks following (preceding!) the latch are themself > empty because then no code is duplicated. It might also be improved if > the latch is already non-empty. That code should probably only be active > before the loop optimizers, but currently the backward threader isn't > differentiating between before/after loop-optims. > > I haven't tested this patch at all, except that it fixes the testcase :) Thanks for looking at this. I think you're onto something with this approach. Perhaps in addition to the loop header threading Richard mentions. Your patch causes some regressions, but I think most are noise from FSM tests that must be adjusted. However, there are some other ones that are curious: > FAIL: gcc.dg/tree-ssa/ldist-22.c scan-tree-dump ldist "generated memset zero" > FAIL: gcc.dg/tree-ssa/pr66752-3.c scan-tree-dump-not dce2 "if .flag" < XFAIL: gcc.dg/shrink-wrap-loop.c scan-rtl-dump pro_and_epilogue "Performing shrink-wrapping" < XFAIL: gcc.dg/Warray-bounds-87.c pr101671 (test for bogus messages, line 36) > FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times graphite "1 loops carried no dependency" 1 > FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times optimized "loopfn.1" 4 > FAIL: libgomp.graphite/force-parallel-8.c scan-tree-dump-times graphite "5 loops carried no dependency" 1 Interestingly your patch is fixing shrink-wrap-loop.c and Warray-bounds-87, both of which were introduced by the backward threader rewrite. At least the Warray-bounds was the threader peeling off an iteration that caused a bogus warning. 
The ldist-22 regression is interesting though: void foo () { int i; <bb 2> : goto <bb 6>; [INV] <bb 3> : a[i_1] = 0; if (i_1 > 100) goto <bb 4>; [INV] else goto <bb 5>; [INV] <bb 4> : b[i_1] = i_1; <bb 5> : i_8 = i_1 + 1; <bb 6> : # i_1 = PHI <0(2), i_8(5)> if (i_1 <= 1023) goto <bb 3>; [INV] else goto <bb 7>; [INV] <bb 7> : return; } Here we fail to look past 5->6 because BB5 is the latch and is not the last block in the path. So we fail to thread 3->5->6->3. Doing so would have split the function into two loops, one of which could use a memset: void foo () { int i; <bb 2> : goto <bb 6>; [INV] <bb 3> : # i_12 = PHI <i_1(6), i_9(4)> a[i_12] = 0; if (i_12 > 100) goto <bb 5>; [INV] else goto <bb 4>; [INV] <bb 4> : i_9 = i_12 + 1; goto <bb 3>; [100.00%] <bb 5> : b[i_12] = i_12; i_8 = i_12 + 1; <bb 6> : # i_1 = PHI <0(2), i_8(5)> if (i_1 <= 1023) goto <bb 3>; [INV] else goto <bb 7>; [INV] <bb 7> : return; } I would have to agree that threading through latches is problematic. For that matter, the ldist-22 test shows that we're depending on the threader to do work that seems to belong in the loop optimizer world. Would it be crazy to suggest that we disable threading through latches altogether, and do whatever we're missing in the loop world? It seems loop has all the tools, cost model, and framework to do so. Of course, I know 0 about loop, and would hate to add work to other's plates. Thanks. Aldy BTW, I haven't looked at the graphite regressions. ^ permalink raw reply [flat|nested] 33+ messages in thread
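The split Aldy describes for ldist-22 can be rendered as plain C-style code (simplified: the two a[]-clearing loops of the threaded form are folded into one memset over the whole array, which is the end state loop distribution is after; the i > 100 condition follows the quoted gimple):

```cpp
#include <cassert>
#include <cstring>

enum { N = 1024 };
static int a1[N], b1[N], a2[N], b2[N];

// The original fused loop from ldist-22.
static void foo_orig (void)
{
  for (int i = 0; i < N; i++)
    {
      a1[i] = 0;
      if (i > 100)
        b1[i] = i;
    }
}

// After splitting: the a[] initialization becomes the "generated memset
// zero" the test scans for, and only the i > 100 iterations touch b[].
static void foo_split (void)
{
  std::memset (a2, 0, sizeof a2);
  for (int i = 101; i < N; i++)
    b2[i] = i;
}
```

Both versions leave identical contents in the arrays, which is what makes the distribution legal.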
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 8:14 ` Aldy Hernandez @ 2021-09-09 8:24 ` Richard Biener 2021-09-09 12:52 ` Michael Matz 2021-09-09 16:54 ` Jeff Law 2 siblings, 0 replies; 33+ messages in thread From: Richard Biener @ 2021-09-09 8:24 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Michael Matz, Jeff Law, GCC Mailing List, Andrew MacLeod On Thu, Sep 9, 2021 at 10:14 AM Aldy Hernandez <aldyh@redhat.com> wrote: > > > > On 9/8/21 8:13 PM, Michael Matz wrote: > > Hello, > > > > [lame answer to self] > > > > On Wed, 8 Sep 2021, Michael Matz wrote: > > > >>>> The forward threader guards against this by simply disallowing > >>>> threadings that involve different loops. As I see > >>> > >>> The thread in question (5->9->3) is all within the same outer loop, > >>> though. BTW, the backward threader also disallows threading across > >>> different loops (see path_crosses_loops variable). > > ... > >> Maybe it's possible to not disable threading over latches alltogether in > >> the backward threader (like it's tried now), but I haven't looked at the > >> specific situation here in depth, so take my view only as opinion from a > >> large distance :-) > > > > I've now looked at the concrete situation. So yeah, the whole path is in > > the same loop, crosses the latch, _and there's code following the latch > > on that path_. (I.e. the latch isn't the last block in the path). In > > particular, after loop_optimizer_init() (before any threading) we have: > > > > <bb 3> [local count: 118111600]: > > # j_19 = PHI <j_13(9), 0(7)> > > sum_11 = c[j_19]; > > if (n_10(D) > 0) > > goto <bb 8>; [89.00%] > > else > > goto <bb 5>; [11.00%] > > > > <bb 8> [local count: 105119324]: > > ... 
> > > > <bb 5> [local count: 118111600]: > > # sum_21 = PHI <sum_14(4), sum_11(3)> > > c[j_19] = sum_21; > > j_13 = j_19 + 1; > > if (n_10(D) > j_13) > > goto <bb 9>; [89.00%] > > else > > goto <bb 6>; [11.00%] > > > > <bb 9> [local count: 105119324]: > > goto <bb 3>; [100.00%] > > > > With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the > > pre-header of inner loop, but more importantly something that's not at the > > start of the outer loop. > > > > Now, any thread that includes the backedge 9->3 _including_ its > > destination (i.e. where the backedge isn't the last to-be-redirected edge) > > necessarily duplicates all code from that destination onto the back edge. > > Here it's the load from c[j] into sum_11. > > > > The important part is the code is emitted onto the back edge, > > conceptually; in reality it's simply included into the (new) latch block > > (the duplicate of bb9, which is bb12 intermediately, then named bb7 after > > cfg_cleanup). > > > > That's what we can't have for some of our structural loop optimizers: > > there must be no code executed after the exit test (e.g. in the latch > > block). (This requirement makes reasoning about which code is or isn't > > executed completely for an iteration trivial; simply everything in the > > body is always executed; e.g. loop interchange uses this to check that > > there are no memory references after the exit test, because those would > > then be only conditional and hence make loop interchange very awkward). > > > > Note that this situation can't be later rectified anymore: the duplicated > > instructions (because they are memory refs) must remain after the exit > > test. Only by rerolling/unrotating the loop (i.e. noticing that the > > memory refs on the loop-entry path and on the back edge are equivalent) > > would that be possible, but that's something we aren't capable of. 
Even > > if we were that would simply just revert the whole work that the threader > > did, so it's better to not even do that to start with. > > > > I believe something like below would be appropriate, it disables threading > > if the path contains a latch at the non-last position (due to being > > backwards on the non-first position in the array). I.e. it disables > > rotating the loop if there's danger of polluting the back edge. It might > > be improved if the blocks following (preceding!) the latch are themself > > empty because then no code is duplicated. It might also be improved if > > the latch is already non-empty. That code should probably only be active > > before the loop optimizers, but currently the backward threader isn't > > differentiating between before/after loop-optims. > > > > I haven't tested this patch at all, except that it fixes the testcase :) > > Thanks for looking at this. > > I think you're onto something with this approach. Perhaps in addition > to the loop header threading Richard mentions. > > Your patch causes some regressions, but I think most are noise from FSM > tests that must be adjusted. 
However, there are some other ones that > are curious: > > > FAIL: gcc.dg/tree-ssa/ldist-22.c scan-tree-dump ldist "generated > memset zero" > > FAIL: gcc.dg/tree-ssa/pr66752-3.c scan-tree-dump-not dce2 "if .flag" > < XFAIL: gcc.dg/shrink-wrap-loop.c scan-rtl-dump pro_and_epilogue > "Performing shrink-wrapping" > < XFAIL: gcc.dg/Warray-bounds-87.c pr101671 (test for bogus messages, > line 36) > > FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times > graphite "1 loops carried no dependency" 1 > > FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times > optimized "loopfn.1" 4 > > FAIL: libgomp.graphite/force-parallel-8.c scan-tree-dump-times > graphite "5 loops carried no dependency" 1 > > Interestingly your patch is fixing shrink-wrap-loop.c and > Warray-bounds-87, both of which were introduced by the backward threader > rewrite. At least the Warray-bounds was the threader peeling off an > iteration that caused a bogus warning. > > The ldist-22 regression is interesting though: > > void foo () > { > int i; > > <bb 2> : > goto <bb 6>; [INV] > > <bb 3> : > a[i_1] = 0; > if (i_1 > 100) > goto <bb 4>; [INV] > else > goto <bb 5>; [INV] > > <bb 4> : > b[i_1] = i_1; > > <bb 5> : > i_8 = i_1 + 1; > > <bb 6> : > # i_1 = PHI <0(2), i_8(5)> > if (i_1 <= 1023) > goto <bb 3>; [INV] > else > goto <bb 7>; [INV] > > <bb 7> : > return; > > } > > Here we fail to look past 5->6 because BB5 is the latch and is not the > last block in the path. So we fail to thread 3->5->6->3. Doing so > would have split the function into two loops, one of which could use a > memset: The loop splitting pass should handle this case though but I guess it never gets to see this original loop form? 
> void foo () > { > int i; > > <bb 2> : > goto <bb 6>; [INV] > > <bb 3> : > # i_12 = PHI <i_1(6), i_9(4)> > a[i_12] = 0; > if (i_12 > 100) > goto <bb 5>; [INV] > else > goto <bb 4>; [INV] > > <bb 4> : > i_9 = i_12 + 1; > goto <bb 3>; [100.00%] > > <bb 5> : > b[i_12] = i_12; > i_8 = i_12 + 1; > > <bb 6> : > # i_1 = PHI <0(2), i_8(5)> > if (i_1 <= 1023) > goto <bb 3>; [INV] > else > goto <bb 7>; [INV] > > <bb 7> : > return; > > } > > I would have to agree that threading through latches is problematic. > For that matter, the ldist-22 test shows that we're depending on the > threader to do work that seems to belong in the loop optimizer world. > > Would it be crazy to suggest that we disable threading through latches > altogether, and do whatever we're missing in the loop world? It seems > loop has all the tools, cost model, and framework to do so. Of course, > I know 0 about loop, and would hate to add work to other's plates. At least FSM threading should be done before loop opts when it introduces sub-loops (but not when it creates irreducible regions). Otherwise I agree that jump threading should try very hard to not disrupt loops before loop optimizations but after those it can (and should) go wild. Yes, we have RTL loop optimizations, but ... Richard. > Thanks. > Aldy > > BTW, I haven't looked at the graphite regressions. > ^ permalink raw reply [flat|nested] 33+ messages in thread
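The structural property this whole exchange revolves around — no code executed after the exit test — can be seen in miniature in source form. In the clean shape every statement of the body runs before the exit test, which is what passes like loop interchange rely on; in the rotated shape a load has effectively been copied onto the back edge, so it executes conditionally after the exit test (this is an illustrative analogy, not the exact testcase):

```cpp
#include <cassert>

int sum_clean (const int *c, int n)
{
  int sum = 0;
  for (int j = 0; j < n; j++)
    sum += c[j];                // whole body precedes the exit test
  return sum;
}

int sum_rotated (const int *c, int n)
{
  if (n <= 0)
    return 0;
  int sum = 0;
  int v = c[0];                 // load peeled onto the entry path
  int j = 0;
  for (;;)
    {
      sum += v;
      if (++j >= n)             // exit test ...
        break;
      v = c[j];                 // ... followed by a memory ref on the back edge
    }
  return sum;
}
```

Both compute the same sum, but recovering the clean shape from the rotated one would require exactly the rerolling/unrotating Michael notes GCC cannot do.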
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 8:14 ` Aldy Hernandez 2021-09-09 8:24 ` Richard Biener @ 2021-09-09 12:52 ` Michael Matz 2021-09-09 13:37 ` Aldy Hernandez 2021-09-09 16:54 ` Jeff Law 2 siblings, 1 reply; 33+ messages in thread From: Michael Matz @ 2021-09-09 12:52 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod Hello, On Thu, 9 Sep 2021, Aldy Hernandez wrote: > The ldist-22 regression is interesting though: > > void foo () > { > int i; > > <bb 2> : > goto <bb 6>; [INV] > > <bb 3> : > a[i_1] = 0; > if (i_1 > 100) > goto <bb 4>; [INV] > else > goto <bb 5>; [INV] > > <bb 4> : > b[i_1] = i_1; > > <bb 5> : > i_8 = i_1 + 1; > > <bb 6> : > # i_1 = PHI <0(2), i_8(5)> > if (i_1 <= 1023) > goto <bb 3>; [INV] > else > goto <bb 7>; [INV] Here there's no simple latch block to start with (the backedge comes directly out of the loop exit block). So my suggested improvement (testing if the latch was empty and only then reject the thread), would solve this. > Would it be crazy to suggest that we disable threading through latches > altogether, I think it wouldn't be crazy, but we can do a bit better as suggested above (only reject empty latches, and reject it only for the threaders coming before the loop optims). Ciao, Michael. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 12:52 ` Michael Matz @ 2021-09-09 13:37 ` Aldy Hernandez 2021-09-09 14:44 ` Michael Matz 0 siblings, 1 reply; 33+ messages in thread From: Aldy Hernandez @ 2021-09-09 13:37 UTC (permalink / raw) To: Michael Matz; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod On 9/9/21 2:52 PM, Michael Matz wrote: > Hello, > > On Thu, 9 Sep 2021, Aldy Hernandez wrote: > >> The ldist-22 regression is interesting though: >> >> void foo () >> { >> int i; >> >> <bb 2> : >> goto <bb 6>; [INV] >> >> <bb 3> : >> a[i_1] = 0; >> if (i_1 > 100) >> goto <bb 4>; [INV] >> else >> goto <bb 5>; [INV] >> >> <bb 4> : >> b[i_1] = i_1; >> >> <bb 5> : >> i_8 = i_1 + 1; >> >> <bb 6> : >> # i_1 = PHI <0(2), i_8(5)> >> if (i_1 <= 1023) >> goto <bb 3>; [INV] >> else >> goto <bb 7>; [INV] > > Here there's no simple latch block to start with (the backedge comes > directly out of the loop exit block). So my suggested improvement > (testing if the latch was empty and only then reject the thread), would > solve this. Well, there's the thing. Loop discovery is marking BB5 as the latch, so it's not getting threaded: Checking profitability of path (backwards): bb:6 (2 insns) bb:5 (latch) > >> Would it be crazy to suggest that we disable threading through latches >> altogether, > > I think it wouldn't be crazy, but we can do a bit better as suggested > above (only reject empty latches, and reject it only for the threaders > coming before the loop optims). 
BTW, I'm not sure your check for the non-last position makes a difference: > diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c > index 449232c7715..528a753b886 100644 > --- a/gcc/tree-ssa-threadbackward.c > +++ b/gcc/tree-ssa-threadbackward.c > @@ -600,6 +600,7 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, > loop_p loop = m_path[0]->loop_father; > bool path_crosses_loops = false; > bool threaded_through_latch = false; > + bool latch_within_path = false; > bool multiway_branch_in_path = false; > bool threaded_multiway_branch = false; > bool contains_hot_bb = false; > @@ -725,7 +726,13 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, > the last entry in the array when determining if we thread > through the loop latch. */ > if (loop->latch == bb) > - threaded_through_latch = true; > + { > + threaded_through_latch = true; > + if (j != 0) > + latch_within_path = true; > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, " (latch)"); > + } > } If the last position being considered is a simple latch, it only has a simple outgoing jump. There's no need to thread that. You need a block with >= 2 succ edges to thread anything. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 13:37 ` Aldy Hernandez @ 2021-09-09 14:44 ` Michael Matz 2021-09-09 15:07 ` Aldy Hernandez 2021-09-10 7:04 ` Aldy Hernandez 0 siblings, 2 replies; 33+ messages in thread From: Michael Matz @ 2021-09-09 14:44 UTC (permalink / raw) To: Aldy Hernandez; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod Hello, On Thu, 9 Sep 2021, Aldy Hernandez wrote: > > Here there's no simple latch block to start with (the backedge comes > > directly out of the loop exit block). So my suggested improvement > > (testing if the latch was empty and only then reject the thread), would > > solve this. > > Well, there's the thing. Loop discovery is marking BB5 as the latch, so > it's not getting threaded: Yes, it's a latch, but not an empty one. So the thread would make it just even more non-empty, which might not be a problem anymore. So amending my patch somewhere with a strategic && empty_block_p (latch) and only then rejecting the thread should make this testcase work again. (There's still a catch, though: if this non-empty latch, which is also the exit test block, is threaded through and is followed by actual code, then that code will be inserted onto the back edge, not into the latch block before the exit test, and so also create a (new) non-empty latch. 
That might or might not create problems downstream, but not as many as transforming an empty into a non-empty latch would do; but it could create for instance a second back edge (not in this testcase) and suchlike) > BTW, I'm not sure your check for the non-last position makes a difference: > > diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c > > index 449232c7715..528a753b886 100644 > > --- a/gcc/tree-ssa-threadbackward.c > > +++ b/gcc/tree-ssa-threadbackward.c > > - threaded_through_latch = true; > > + { > > + threaded_through_latch = true; > > + if (j != 0) > > + latch_within_path = true; > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > + fprintf (dump_file, " (latch)"); > > + } > > } > > If the last position being considered is a simple latch, it only has a simple > outgoing jump. There's no need to thread that. You need a block with >= 2 > succ edges to thread anything. So, you are saying that any candidate thread path wouldn't have the latch in the last position if it were just an empty forwarder? I simply wasn't sure about that, so was conservative and only wanted to reject things I knew were positively bad (namely code in the path following the latch that is in danger of being moved into the latch). Ciao, Michael. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 14:44 ` Michael Matz @ 2021-09-09 15:07 ` Aldy Hernandez 2021-09-10 7:04 ` Aldy Hernandez 1 sibling, 0 replies; 33+ messages in thread From: Aldy Hernandez @ 2021-09-09 15:07 UTC (permalink / raw) To: Michael Matz; +Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod On 9/9/21 4:44 PM, Michael Matz wrote: > Hello, > > On Thu, 9 Sep 2021, Aldy Hernandez wrote: > >>> Here there's no simple latch block to start with (the backedge comes >>> directly out of the loop exit block). So my suggested improvement >>> (testing if the latch was empty and only then reject the thread), would >>> solve this. >> >> Well, there's the thing. Loop discovery is marking BB5 as the latch, so >> it's not getting threaded: > > Yes, it's a latch, but not an empty one. So the thread would make it just > even more non-empty, which might not be a problem anymore. So amending my > patch somewhere with a strategic > > && empty_block_p (latch) > > and only then rejecting the thread should make this testcase work again. Sweet. > > (There's still a catch, though: if this non-empty latch, which is also the > exit test block, is threaded through and is followed by actual code, then > that code will be inserted onto the back edge, not into the latch block > before the exit test, and so also create a (new) non-empty latch. 
That > might or might not create problems downstreams, but not as many as > transformaing an empty into a non-empty latch would do; but it could > create for instance a second back edge (not in this testcase) and > suchlike) > >> BTW, I'm not sure your check for the non-last position makes a difference: >> >>> diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c >>> index 449232c7715..528a753b886 100644 >>> --- a/gcc/tree-ssa-threadbackward.c >>> +++ b/gcc/tree-ssa-threadbackward.c >>> - threaded_through_latch = true; >>> + { >>> + threaded_through_latch = true; >>> + if (j != 0) >>> + latch_within_path = true; >>> + if (dump_file && (dump_flags & TDF_DETAILS)) >>> + fprintf (dump_file, " (latch)"); >>> + } >>> } >> >> If the last position being considered is a simple latch, it only has a simple >> outgoing jump. There's no need to thread that. You need a block with >= 2 >> succ edges to thread anything. > > So, you are saying that any candidate thread path wouldn't have the latch > in the last position if it were just an empty forwarder? I simply wasn't > sure about that, so was conservative and only wanted to reject things I > knew where positively bad (namely code in the path following the latch > that is in danger of being moved into the latch). Correct. In the backwards threader, we don't start the party if the last block has < 2 succs: FOR_EACH_BB_FN (bb, fun) { if (EDGE_COUNT (bb->succs) > 1) threader.maybe_thread_block (bb); } Thanks. Aldy ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 14:44 ` Michael Matz 2021-09-09 15:07 ` Aldy Hernandez @ 2021-09-10 7:04 ` Aldy Hernandez 1 sibling, 0 replies; 33+ messages in thread From: Aldy Hernandez @ 2021-09-10 7:04 UTC (permalink / raw) To: Michael Matz Cc: Richard Biener, Jeff Law, GCC Mailing List, Andrew MacLeod, gcc-patches [-- Attachment #1: Type: text/plain, Size: 2187 bytes --] I think things are clear enough to propose a patch. Thanks for everyone's input. I have added some comments and tweaked Michael's patch to include the final BB where we're threading to, in the check. Without this last check, we fail in tree-ssa/cunroll-15.c because the latch becomes polluted with PHI nodes. OK for trunk? Aldy commit e735a2cd01485773635bd66c97af21059992d5a7 (HEAD -> pending/global-ranges) Author: Aldy Hernandez <aldyh@redhat.com> Date: Thu Sep 9 20:30:28 2021 +0200 Disable threading through latches until after loop optimizations. The motivation for this patch was enabling the use of global ranges in the path solver, but this caused certain properties of loops to be destroyed, which made subsequent loop optimizations fail. Consequently, this patch's main goal is to disable jump threading involving the latch until after loop optimizations have run. I have added a bit in cfun to indicate when the "loopdone" optimization has completed. This helps the path threader determine when it's safe to thread through loop structures. I can adapt the patch if there is an alternate way of determining this. As can be seen in the test adjustments, we mostly shift the threading from the early threaders (ethread, thread[12]) to the late threaders (thread[34]). I have nuked some of the early notes in the testcases that came as part of the jump threader rewrite. They're mostly noise now. 
Note that we could probably relax some other restrictions in profitable_path_p when loop optimizations have completed, but it would require more testing, and I'm hesitant to touch more things than needed at this point. I have added a reminder to the function to keep this in mind. Finally, perhaps as a follow-up, we should apply the same restrictions to the forward threader. At some point I'd like to combine the cost models. Tested on x86-64 Linux. p.s. There is a thorough discussion involving the limitations of jump threading involving loops here: https://gcc.gnu.org/pipermail/gcc/2021-September/237247.html [-- Attachment #2: 0001-Disable-threading-through-latches-until-after-loop-o.patch --] [-- Type: text/x-patch, Size: 11395 bytes --] From e735a2cd01485773635bd66c97af21059992d5a7 Mon Sep 17 00:00:00 2001 From: Aldy Hernandez <aldyh@redhat.com> Date: Thu, 9 Sep 2021 20:30:28 +0200 Subject: [PATCH] Disable threading through latches until after loop optimizations. The motivation for this patch was enabling the use of global ranges in the path solver, but this caused certain properties of loops to be destroyed, which made subsequent loop optimizations fail. Consequently, this patch's main goal is to disable jump threading involving the latch until after loop optimizations have run. I have added a bit in cfun to indicate when the "loopdone" optimization has completed. This helps the path threader determine when it's safe to thread through loop structures. I can adapt the patch if there is an alternate way of determining this. As can be seen in the test adjustments, we mostly shift the threading from the early threaders (ethread, thread[12]) to the late threaders (thread[34]). I have nuked some of the early notes in the testcases that came as part of the jump threader rewrite. They're mostly noise now. 
Note that we could probably relax some other restrictions in profitable_path_p when loop optimizations have completed, but it would require more testing, and I'm hesitant to touch more things than needed at this point. I have added a reminder to the function to keep this in mind. Finally, perhaps as a follow-up, we should apply the same restrictions to the forward threader. At some point I'd like to combine the cost models. Tested on x86-64 Linux. p.s. There is a thorough discussion involving the limitations of jump threading involving loops here: https://gcc.gnu.org/pipermail/gcc/2021-September/237247.html gcc/ChangeLog: * function.h (struct function): Add loop_optimizers_done. (set_loop_optimizers_done): New. (loop_optimizers_done_p): New. * gimple-range-path.cc (path_range_query::internal_range_of_expr): Intersect with global range. * tree-ssa-loop.c (tree_ssa_loop_done): Call set_loop_optimizers_done. * tree-ssa-threadbackward.c (back_threader_profitability::profitable_path_p): Disable threading through latches until after loop optimizations have run. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/ssa-dom-thread-2b.c: Adjust for disabling of threading through latches. * gcc.dg/tree-ssa/ssa-dom-thread-6.c: Same. * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Same. Co-authored-by: Michael Matz <matz@suse.de> --- gcc/function.h | 15 ++++++++ gcc/gimple-range-path.cc | 3 ++ .../gcc.dg/tree-ssa/ssa-dom-thread-2b.c | 4 +- .../gcc.dg/tree-ssa/ssa-dom-thread-6.c | 37 +------------------ .../gcc.dg/tree-ssa/ssa-dom-thread-7.c | 17 +-------- gcc/tree-ssa-loop.c | 1 + gcc/tree-ssa-threadbackward.c | 28 +++++++++++++- 7 files changed, 50 insertions(+), 55 deletions(-) diff --git a/gcc/function.h b/gcc/function.h index 36003e7576a..fb99b6a44fc 100644 --- a/gcc/function.h +++ b/gcc/function.h @@ -438,6 +438,9 @@ struct GTY(()) function { /* Set if there are any OMP_TARGET regions in the function. */ unsigned int has_omp_target : 1; + + /* Set if SSA loop optimizers have completed. 
*/ + unsigned int loop_optimizers_done : 1; }; /* Add the decl D to the local_decls list of FUN. */ @@ -514,6 +517,18 @@ set_loops_for_fn (struct function *fn, struct loops *loops) fn->x_current_loops = loops; } +inline void +set_loop_optimizers_done (struct function *fn) +{ + fn->loop_optimizers_done = 1; +} + +inline bool +loop_optimizers_done_p (struct function *fn) +{ + return fn->loop_optimizers_done; +} + /* For backward compatibility... eventually these should all go away. */ #define current_function_funcdef_no (cfun->funcdef_no) diff --git a/gcc/gimple-range-path.cc b/gcc/gimple-range-path.cc index a4fa3b296ff..c616b65756f 100644 --- a/gcc/gimple-range-path.cc +++ b/gcc/gimple-range-path.cc @@ -127,6 +127,9 @@ path_range_query::internal_range_of_expr (irange &r, tree name, gimple *stmt) basic_block bb = stmt ? gimple_bb (stmt) : exit_bb (); if (stmt && range_defined_in_block (r, name, bb)) { + if (TREE_CODE (name) == SSA_NAME) + r.intersect (gimple_range_global (name)); + set_cache (r, name); return true; } diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c index e1c33e86cd7..823ada982ff 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-2b.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-dom2-stats -fdisable-tree-ethread" } */ +/* { dg-options "-O2 -fdump-tree-thread3-stats -fdump-tree-dom2-stats -fdisable-tree-ethread" } */ void foo(); void bla(); @@ -26,4 +26,4 @@ void thread_latch_through_header (void) case. And we want to thread through the header as well. These are both caught by threading in DOM. 
*/ /* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2"} } */ -/* { dg-final { scan-tree-dump-times "Jumps threaded: 1" 1 "thread1"} } */ +/* { dg-final { scan-tree-dump-times "Jumps threaded: 1" 1 "thread3"} } */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c index c7bf867b084..ee46759bacc 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c @@ -1,41 +1,8 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread2-details" } */ +/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-thread3-details" } */ -/* All the threads in the thread1 dump start on a X->BB12 edge, as can - be seen in the dump: - - Registering FSM jump thread: (x, 12) incoming edge; ... - etc - etc - - Before the new evrp, we were threading paths that started at the - following edges: - - Registering FSM jump thread: (10, 12) incoming edge - Registering FSM jump thread: (6, 12) incoming edge - Registering FSM jump thread: (9, 12) incoming edge - - This was because the PHI at BB12 had constant values coming in from - BB10, BB6, and BB9: - - # state_10 = PHI <state_11(7), 0(10), state_11(5), 1(6), state_11(8), 2(9), state_11(11)> - - Now with the new evrp, we get: - - # state_10 = PHI <0(7), 0(10), state_11(5), 1(6), 0(8), 2(9), 1(11)> - - Thus, we have 3 more paths that are known to be constant and can be - threaded. Which means that by the second threading pass, we can - only find one profitable path. - - For the record, all these extra constants are better paths coming - out of switches. For example: - - SWITCH_BB -> BBx -> BBy -> BBz -> PHI - - We now know the value of the switch index at PHI. 
*/ /* { dg-final { scan-tree-dump-times "Registering FSM jump" 6 "thread1" } } */ -/* { dg-final { scan-tree-dump-times "Registering FSM jump" 1 "thread2" } } */ +/* { dg-final { scan-tree-dump-times "Registering FSM jump" 1 "thread3" } } */ int sum0, sum1, sum2, sum3; int foo (char *s, char **ret) diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c index 5fc2145a432..ba07942f9dd 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c @@ -1,23 +1,8 @@ /* { dg-do compile } */ /* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */ -/* Here we have the same issue as was commented in ssa-dom-thread-6.c. - The PHI coming into the threader has a lot more constants, so the - threader can thread more paths. - -$ diff clean/a.c.105t.mergephi2 a.c.105t.mergephi2 -252c252 -< # s_50 = PHI <s_49(10), 5(14), s_51(18), s_51(22), 1(26), 1(29), 1(31), s_51(5), 4(12), 1(15), 5(17), 1(19), 3(21), 1(23), 6(25), 7(28), s_51(30)> ---- -> # s_50 = PHI <s_49(10), 5(14), 4(18), 5(22), 1(26), 1(29), 1(31), s_51(5), 4(12), 1(15), 5(17), 1(19), 3(21), 1(23), 6(25), 7(28), 7(30)> -272a273 - - I spot checked a few and they all have the same pattern. We are - basically tracking the switch index better through multiple - paths. */ - /* { dg-final { scan-tree-dump "Jumps threaded: 18" "thread1" } } */ -/* { dg-final { scan-tree-dump "Jumps threaded: 8" "thread2" } } */ +/* { dg-final { scan-tree-dump "Jumps threaded: 8" "thread3" } } */ /* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2" } } */ /* aarch64 has the highest CASE_VALUES_THRESHOLD in GCC. 
It's high enough diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c index 0cc4b3bbccf..c827a51d3ce 100644 --- a/gcc/tree-ssa-loop.c +++ b/gcc/tree-ssa-loop.c @@ -528,6 +528,7 @@ tree_ssa_loop_done (void) free_numbers_of_iterations_estimates (cfun); scev_finalize (); loop_optimizer_finalize (cfun, true); + set_loop_optimizers_done (cfun); return 0; } diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c index 449232c7715..ded31a9e2de 100644 --- a/gcc/tree-ssa-threadbackward.c +++ b/gcc/tree-ssa-threadbackward.c @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3. If not see #include "ssa.h" #include "tree-cfgcleanup.h" #include "tree-pretty-print.h" +#include "cfghooks.h" // Path registry for the backwards threader. After all paths have been // registered with register_path(), thread_through_all_blocks() is called @@ -564,7 +565,10 @@ back_threader_registry::thread_through_all_blocks (bool may_peel_loop_headers) TAKEN_EDGE, otherwise it is NULL. CREATES_IRREDUCIBLE_LOOP, if non-null is set to TRUE if threading this path - would create an irreducible loop. */ + would create an irreducible loop. + + ?? It seems we should be able to loosen some of the restrictions in + this function after loop optimizations have run. */ bool back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, @@ -725,7 +729,11 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, the last entry in the array when determining if we thread through the loop latch. */ if (loop->latch == bb) - threaded_through_latch = true; + { + threaded_through_latch = true; + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, " (latch)"); + } } gimple *stmt = get_gimple_control_stmt (m_path[0]); @@ -845,6 +853,22 @@ back_threader_profitability::profitable_path_p (const vec<basic_block> &m_path, "a multiway branch.\n"); return false; } + + /* Threading through a non-empty latch would cause code to be added + to the latch. 
This could alter the loop form sufficiently to + cause loop optimizations to fail. Disable these threads until + after loop optimizations have run. */ + if ((threaded_through_latch + || (taken_edge && taken_edge->dest == loop->latch)) + && !loop_optimizers_done_p (cfun) + && empty_block_p (loop->latch)) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, + " FAIL: FSM Thread through latch before loop opts would create non-empty latch\n"); + return false; + + } return true; } -- 2.31.1 ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: More aggressive threading causing loop-interchange-9.c regression 2021-09-09 8:14 ` Aldy Hernandez 2021-09-09 8:24 ` Richard Biener 2021-09-09 12:52 ` Michael Matz @ 2021-09-09 16:54 ` Jeff Law 2 siblings, 0 replies; 33+ messages in thread From: Jeff Law @ 2021-09-09 16:54 UTC (permalink / raw) To: Aldy Hernandez, Michael Matz Cc: Richard Biener, GCC Mailing List, Andrew MacLeod On 9/9/2021 2:14 AM, Aldy Hernandez wrote: > > > On 9/8/21 8:13 PM, Michael Matz wrote: >> Hello, >> >> [lame answer to self] >> >> On Wed, 8 Sep 2021, Michael Matz wrote: >> >>>>> The forward threader guards against this by simply disallowing >>>>> threadings that involve different loops. As I see >>>> >>>> The thread in question (5->9->3) is all within the same outer loop, >>>> though. BTW, the backward threader also disallows threading across >>>> different loops (see path_crosses_loops variable). >> ... >>> Maybe it's possible to not disable threading over latches >>> alltogether in >>> the backward threader (like it's tried now), but I haven't looked at >>> the >>> specific situation here in depth, so take my view only as opinion >>> from a >>> large distance :-) >> >> I've now looked at the concrete situation. So yeah, the whole path >> is in >> the same loop, crosses the latch, _and there's code following the latch >> on that path_. (I.e. the latch isn't the last block in the path). In >> particular, after loop_optimizer_init() (before any threading) we have: >> >> <bb 3> [local count: 118111600]: >> # j_19 = PHI <j_13(9), 0(7)> >> sum_11 = c[j_19]; >> if (n_10(D) > 0) >> goto <bb 8>; [89.00%] >> else >> goto <bb 5>; [11.00%] >> >> <bb 8> [local count: 105119324]: >> ... 
>> >> <bb 5> [local count: 118111600]: >> # sum_21 = PHI <sum_14(4), sum_11(3)> >> c[j_19] = sum_21; >> j_13 = j_19 + 1; >> if (n_10(D) > j_13) >> goto <bb 9>; [89.00%] >> else >> goto <bb 6>; [11.00%] >> >> <bb 9> [local count: 105119324]: >> goto <bb 3>; [100.00%] >> >> With bb9 the outer (empty) latch, bb3 the outer header, and bb8 the >> pre-header of inner loop, but more importantly something that's not >> at the >> start of the outer loop. >> >> Now, any thread that includes the backedge 9->3 _including_ its >> destination (i.e. where the backedge isn't the last to-be-redirected >> edge) >> necessarily duplicates all code from that destination onto the back >> edge. >> Here it's the load from c[j] into sum_11. >> >> The important part is the code is emitted onto the back edge, >> conceptually; in reality it's simply included into the (new) latch block >> (the duplicate of bb9, which is bb12 intermediately, then named bb7 >> after >> cfg_cleanup). >> >> That's what we can't have for some of our structural loop optimizers: >> there must be no code executed after the exit test (e.g. in the latch >> block). (This requirement makes reasoning about which code is or isn't >> executed completely for an iteration trivial; simply everything in the >> body is always executed; e.g. loop interchange uses this to check that >> there are no memory references after the exit test, because those would >> then be only conditional and hence make loop interchange very awkward). >> >> Note that this situation can't be later rectified anymore: the >> duplicated >> instructions (because they are memory refs) must remain after the exit >> test. Only by rerolling/unrotating the loop (i.e. noticing that the >> memory refs on the loop-entry path and on the back edge are equivalent) >> would that be possible, but that's something we aren't capable of. 
Even >> if we were that would simply just revert the whole work that the >> threader >> did, so it's better to not even do that to start with. >> >> I believe something like below would be appropriate, it disables >> threading >> if the path contains a latch at the non-last position (due to being >> backwards on the non-first position in the array). I.e. it disables >> rotating the loop if there's danger of polluting the back edge. It might >> be improved if the blocks following (preceding!) the latch are themself >> empty because then no code is duplicated. It might also be improved if >> the latch is already non-empty. That code should probably only be >> active >> before the loop optimizers, but currently the backward threader isn't >> differentiating between before/after loop-optims. >> >> I haven't tested this patch at all, except that it fixes the testcase :) > > Thanks for looking at this. > > I think you're onto something with this approach. Perhaps in addition > to the loop header threading Richard mentions. > > Your patch causes some regressions, but I think most are noise from > FSM tests that must be adjusted. However, there are some other ones > that are curious: > > > FAIL: gcc.dg/tree-ssa/ldist-22.c scan-tree-dump ldist "generated > memset zero" > > FAIL: gcc.dg/tree-ssa/pr66752-3.c scan-tree-dump-not dce2 "if .flag" > < XFAIL: gcc.dg/shrink-wrap-loop.c scan-rtl-dump pro_and_epilogue > "Performing shrink-wrapping" > < XFAIL: gcc.dg/Warray-bounds-87.c pr101671 (test for bogus messages, > line 36) > > FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times > graphite "1 loops carried no dependency" 1 > > FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times > optimized "loopfn.1" 4 > > FAIL: libgomp.graphite/force-parallel-8.c scan-tree-dump-times > graphite "5 loops carried no dependency" 1 > > Interestingly your patch is fixing shrink-wrap-loop.c and > Warray-bounds-87, both of which were introduced by the backward > threader rewrite. 
At least the Warray-bounds was the threader peeling > off an iteration that caused a bogus warning. > > The ldist-22 regression is interesting though: > > void foo () > { > int i; > > <bb 2> : > goto <bb 6>; [INV] > > <bb 3> : > a[i_1] = 0; > if (i_1 > 100) > goto <bb 4>; [INV] > else > goto <bb 5>; [INV] > > <bb 4> : > b[i_1] = i_1; > > <bb 5> : > i_8 = i_1 + 1; > > <bb 6> : > # i_1 = PHI <0(2), i_8(5)> > if (i_1 <= 1023) > goto <bb 3>; [INV] > else > goto <bb 7>; [INV] > > <bb 7> : > return; > > } > > Here we fail to look past 5->6 because BB5 is the latch and is not the > last block in the path. So we fail to thread 3->5->6->3. Doing so > would have split the function into two loops, one of which could use a > memset: > > void foo () > { > int i; > > <bb 2> : > goto <bb 6>; [INV] > > <bb 3> : > # i_12 = PHI <i_1(6), i_9(4)> > a[i_12] = 0; > if (i_12 > 100) > goto <bb 5>; [INV] > else > goto <bb 4>; [INV] > > <bb 4> : > i_9 = i_12 + 1; > goto <bb 3>; [100.00%] > > <bb 5> : > b[i_12] = i_12; > i_8 = i_12 + 1; > > <bb 6> : > # i_1 = PHI <0(2), i_8(5)> > if (i_1 <= 1023) > goto <bb 3>; [INV] > else > goto <bb 7>; [INV] > > <bb 7> : > return; > > } > > I would have to agree that threading through latches is problematic. > For that matter, the ldist-22 test shows that we're depending on the > threader to do work that seems to belong in the loop optimizer world. > > Would it be crazy to suggest that we disable threading through latches > altogether, and do whatever we're missing in the loop world? It seems > loop has all the tools, cost model, and framework to do so. Of > course, I know 0 about loop, and would hate to add work to other's > plates. Threading through latches can be extremely problematical as they can change the underlying loop structure. For example, it can take a simple loop and turn it into a loop nest. 
I thought we'd already disabled threading through latches except for the case where doing so allows us to thread through an indirect jump (i.e., the original motivation for the FSM threader). As I mentioned in our private email, threading through headers, through latches, across loops, etc. has a number of negative effects and we generally want to avoid that, at least in the instances before loop optimization and vectorization. The threader doesn't have any cost model for what is effectively loop peeling, loop rotation, loop header copying and the like. Zdenek argued those belong in the loop optimizers, not jump threading, and I broadly agree with that assessment. Jeff ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2021-09-10 16:38 UTC | newest] Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-09-07 11:49 More aggressive threading causing loop-interchange-9.c regression Aldy Hernandez 2021-09-07 14:45 ` Michael Matz 2021-09-08 10:44 ` Aldy Hernandez 2021-09-08 13:13 ` Richard Biener 2021-09-08 13:25 ` Aldy Hernandez 2021-09-08 13:49 ` Richard Biener 2021-09-08 16:19 ` Aldy Hernandez 2021-09-08 16:39 ` Michael Matz 2021-09-08 18:13 ` Michael Matz 2021-09-09 6:57 ` Richard Biener 2021-09-09 7:37 ` Aldy Hernandez 2021-09-09 7:45 ` Richard Biener 2021-09-09 8:36 ` Aldy Hernandez 2021-09-09 8:58 ` Richard Biener 2021-09-09 9:21 ` Aldy Hernandez 2021-09-09 10:15 ` Richard Biener 2021-09-09 11:28 ` Aldy Hernandez 2021-09-10 15:51 ` Jeff Law 2021-09-10 16:11 ` Aldy Hernandez 2021-09-10 15:43 ` Jeff Law 2021-09-10 16:05 ` Aldy Hernandez 2021-09-10 16:21 ` Jeff Law 2021-09-10 16:38 ` Aldy Hernandez 2021-09-09 16:59 ` Jeff Law 2021-09-09 12:47 ` Michael Matz 2021-09-09 8:14 ` Aldy Hernandez 2021-09-09 8:24 ` Richard Biener 2021-09-09 12:52 ` Michael Matz 2021-09-09 13:37 ` Aldy Hernandez 2021-09-09 14:44 ` Michael Matz 2021-09-09 15:07 ` Aldy Hernandez 2021-09-10 7:04 ` Aldy Hernandez 2021-09-09 16:54 ` Jeff Law