* Re: [PATCH] tree-optimization/113073 - amend PR112736 fix [not found] <20231219123224.3D481385E459@sourceware.org> @ 2023-12-20 9:50 ` Thomas Schwinge 2023-12-20 10:21 ` Richard Biener 0 siblings, 1 reply; 5+ messages in thread From: Thomas Schwinge @ 2023-12-20 9:50 UTC (permalink / raw) To: Richard Biener, Andrew Stubbs, Julian Brown; +Cc: gcc-patches Hi! On 2023-12-19T13:30:58+0100, Richard Biener <rguenther@suse.de> wrote: > The PR112736 testcase fails on RISC-V because the aligned exception > uses the wrong check. The alignment support scheme can be > dr_aligned even when the access isn't aligned to the vector size > but some targets are happy with element alignment. The following > fixes that. > > Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed. I've noticed this to regresses GCN target as follows: PASS: gcc.dg/vect/bb-slp-pr78205.c (test for excess errors) PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "optimized: basic block" 3 PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1 [-PASS:-]{+FAIL:+} gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times optimized " = c\\[4\\];" 1 As so often, I've got no clue whether that's a vectorizer, GCN back end, or test case issue. ;-) 'diff'ing before vs. after: --- bb-slp-pr78205.c.191t.slp2 2023-12-20 09:49:45.834344620 +0100 +++ bb-slp-pr78205.c.191t.slp2 2023-12-20 09:10:14.706300941 +0100 [...] @@ -505,8 +505,9 @@ [...]/bb-slp-pr78205.c:9:8: note: create vector_type-pointer variable to type: vector(4) double vectorizing a pointer ref: c[0] [...]/bb-slp-pr78205.c:9:8: note: created &c[0] [...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.7_19 = MEM <vector(4) double> [(double *)&c]; -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_20 = MEM <vector(4) double> [(double *)&c + 32B]; -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: _20 = MEM[(double *)&c + 32B]; +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_21 = {_20, 0.0, 0.0, 0.0}; +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; [...]/bb-slp-pr78205.c:9:8: note: ------>vectorizing SLP node starting from: a[0] = _1; [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[0], type of def: internal [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[1], type of def: internal [...] @@ -537,9 +538,10 @@ [...]/bb-slp-pr78205.c:13:8: note: transform load. ncopies = 1 [...]/bb-slp-pr78205.c:13:8: note: create vector_type-pointer variable to type: vector(4) double vectorizing a pointer ref: c[2] [...]/bb-slp-pr78205.c:13:8: note: created &c[2] -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_23 = MEM <vector(4) double> [(double *)&c]; -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_24 = MEM <vector(4) double> [(double *)&c + 32B]; -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>; +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_24 = MEM <vector(4) double> [(double *)&c]; +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: _25 = MEM[(double *)&c + 32B]; +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_26 = {_25, 0.0, 0.0, 0.0}; +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>; [...]/bb-slp-pr78205.c:13:8: note: ------>vectorizing SLP node starting from: b[0] = _3; [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[2], type of def: internal [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[3], type of def: internal [...] @@ -580,18 +582,22 @@ double _4; double _5; vector(2) double _17; + double _20; + double _25; <bb 2> [local count: 1073741824]: vect__1.7_19 = MEM <vector(4) double> [(double *)&c]; - vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; + _20 = MEM[(double *)&c + 32B]; + vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; _1 = c[0]; _2 = c[1]; - MEM <vector(4) double> [(double *)&a] = vect__1.9_21; - vect__3.14_23 = MEM <vector(4) double> [(double *)&c]; - vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>; + MEM <vector(4) double> [(double *)&a] = vect__1.9_22; + vect__3.14_24 = MEM <vector(4) double> [(double *)&c]; + _25 = MEM[(double *)&c + 32B]; + vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>; _3 = c[2]; _4 = c[3]; - MEM <vector(4) double> [(double *)&b] = vect__1.16_25; + MEM <vector(4) double> [(double *)&b] = vect__1.16_27; _5 = c[4]; _17 = {_5, _5}; MEM <vector(2) double> [(double *)&x] = _17; --- bb-slp-pr78205.c.265t.optimized 2023-12-20 09:49:45.838344586 +0100 +++ bb-slp-pr78205.c.265t.optimized 2023-12-20 09:10:14.706300941 +0100 @@ -6,17 +6,17 @@ vector(4) double vect__1.16; vector(4) double vect__1.9; vector(4) double vect__1.7; - double _5; vector(2) double _17; + double _20; <bb 2> [local count: 1073741824]: vect__1.7_19 = MEM <vector(4) double> [(double *)&c]; - vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; - MEM <vector(4) double> [(double *)&a] = vect__1.9_21; - vect__1.16_25 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>; - MEM <vector(4) double> [(double *)&b] = vect__1.16_25; - _5 = c[4]; - _17 = {_5, _5}; + _20 = MEM[(double *)&c + 32B]; + vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; + MEM <vector(4) double> [(double *)&a] = vect__1.9_22; + vect__1.16_27 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>; + MEM <vector(4) double> [(double *)&b] = vect__1.16_27; + _17 = {_20, _20}; MEM <vector(2) double> [(double *)&x] = _17; return; --- bb-slp-pr78205.s 2023-12-20 09:49:45.846344519 +0100 +++ bb-slp-pr78205.s 2023-12-20 09:10:14.722300807 +0100 @@ -41,7 +41,17 @@ v_addc_co_u32 v7, s[22:23], 0, v7, s[22:23] flat_load_dwordx2 v[6:7], v[6:7] offset:0 s_waitcnt 0 + v_writelane_b32 v8, s12, 0 + v_writelane_b32 v9, s13, 0 + s_mov_b64 exec, 1 + v_add_co_u32 v8, vcc, 32, v8 + v_addc_co_u32 v9, vcc, 0, v9, vcc + flat_load_dwordx2 v[8:9], v[8:9] + s_waitcnt 0 + v_readlane_b32 s12, v8, 0 + v_readlane_b32 s13, v9, 0 s_mov_b64 s[18:19], 10 + s_mov_b64 exec, 15 v_cndmask_b32 v0, 0, 4, s[18:19] s_mov_b64 exec, 15 v_mov_b32 v11, v7 @@ -73,15 +83,6 @@ v_mov_b32 v5, s19 v_addc_co_u32 v5, s[22:23], 0, v5, s[22:23] flat_store_dwordx2 v[4:5], v[6:7] offset:0 - v_writelane_b32 v4, s12, 0 - v_writelane_b32 v5, s13, 0 - s_mov_b64 exec, 1 - v_add_co_u32 v4, vcc, 32, v4 - v_addc_co_u32 v5, vcc, 0, v5, vcc - flat_load_dwordx2 v[4:5], v[4:5] - s_waitcnt 0 - v_readlane_b32 s12, v4, 0 - v_readlane_b32 s13, v5, 0 s_mov_b64 exec, 3 v_mov_b32 v6, s12 v_mov_b32 v7, s13 I haven't looked at the full context, but this appears to effectively just move this block of code, and use different registers. In: + s_mov_b64 exec, 15 v_cndmask_b32 v0, 0, 4, s[18:19] s_mov_b64 exec, 15 ... isn't the second (pre-existing) 's_mov_b64 exec, 15' now redundant, though? Grüße Thomas > PR tree-optimization/113073 > * tree-vect-stmts.cc (vectorizable_load): Properly ensure > to exempt only vector-size aligned overreads. > --- > gcc/tree-vect-stmts.cc | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > index fc6923cf68a..e9ff728dfd4 100644 > --- a/gcc/tree-vect-stmts.cc > +++ b/gcc/tree-vect-stmts.cc > @@ -11476,7 +11476,9 @@ vectorizable_load (vec_info *vinfo, > - (group_size * vf - gap), nunits)) > /* DR will be unused. */ > ltype = NULL_TREE; > - else if (alignment_support_scheme == dr_aligned) > + else if (known_ge (vect_align, > + tree_to_poly_uint64 > + (TYPE_SIZE_UNIT (vectype)))) > /* Aligned access to excess elements is OK if > at least one element is accessed in the > scalar loop. */ ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955 ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] tree-optimization/113073 - amend PR112736 fix 2023-12-20 9:50 ` [PATCH] tree-optimization/113073 - amend PR112736 fix Thomas Schwinge @ 2023-12-20 10:21 ` Richard Biener 2023-12-20 13:44 ` Richard Biener 0 siblings, 1 reply; 5+ messages in thread From: Richard Biener @ 2023-12-20 10:21 UTC (permalink / raw) To: Thomas Schwinge; +Cc: Andrew Stubbs, Julian Brown, gcc-patches On Wed, 20 Dec 2023, Thomas Schwinge wrote: > Hi! > > On 2023-12-19T13:30:58+0100, Richard Biener <rguenther@suse.de> wrote: > > The PR112736 testcase fails on RISC-V because the aligned exception > > uses the wrong check. The alignment support scheme can be > > dr_aligned even when the access isn't aligned to the vector size > > but some targets are happy with element alignment. The following > > fixes that. > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed. > > I've noticed this to regresses GCN target as follows: > > PASS: gcc.dg/vect/bb-slp-pr78205.c (test for excess errors) > PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "optimized: basic block" 3 > PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1 > [-PASS:-]{+FAIL:+} gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times optimized " = c\\[4\\];" 1 > > As so often, I've got no clue whether that's a vectorizer, GCN back end, > or test case issue. ;-) > > 'diff'ing before vs. after: > > --- bb-slp-pr78205.c.191t.slp2 2023-12-20 09:49:45.834344620 +0100 > +++ bb-slp-pr78205.c.191t.slp2 2023-12-20 09:10:14.706300941 +0100 > [...] > @@ -505,8 +505,9 @@ > [...]/bb-slp-pr78205.c:9:8: note: create vector_type-pointer variable to type: vector(4) double vectorizing a pointer ref: c[0] > [...]/bb-slp-pr78205.c:9:8: note: created &c[0] > [...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.7_19 = MEM <vector(4) double> [(double *)&c]; > -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_20 = MEM <vector(4) double> [(double *)&c + 32B]; > -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; > +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: _20 = MEM[(double *)&c + 32B]; > +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_21 = {_20, 0.0, 0.0, 0.0}; > +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; > [...]/bb-slp-pr78205.c:9:8: note: ------>vectorizing SLP node starting from: a[0] = _1; > [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[0], type of def: internal > [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[1], type of def: internal > [...] > @@ -537,9 +538,10 @@ > [...]/bb-slp-pr78205.c:13:8: note: transform load. ncopies = 1 > [...]/bb-slp-pr78205.c:13:8: note: create vector_type-pointer variable to type: vector(4) double vectorizing a pointer ref: c[2] > [...]/bb-slp-pr78205.c:13:8: note: created &c[2] > -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_23 = MEM <vector(4) double> [(double *)&c]; > -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_24 = MEM <vector(4) double> [(double *)&c + 32B]; > -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>; > +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_24 = MEM <vector(4) double> [(double *)&c]; > +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: _25 = MEM[(double *)&c + 32B]; > +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_26 = {_25, 0.0, 0.0, 0.0}; > +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>; > [...]/bb-slp-pr78205.c:13:8: note: ------>vectorizing SLP node starting from: b[0] = _3; > [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[2], type of def: internal > [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[3], type of def: internal > [...] > @@ -580,18 +582,22 @@ > double _4; > double _5; > vector(2) double _17; > + double _20; > + double _25; > > <bb 2> [local count: 1073741824]: > vect__1.7_19 = MEM <vector(4) double> [(double *)&c]; > - vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; > + _20 = MEM[(double *)&c + 32B]; > + vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; > _1 = c[0]; > _2 = c[1]; > - MEM <vector(4) double> [(double *)&a] = vect__1.9_21; > - vect__3.14_23 = MEM <vector(4) double> [(double *)&c]; > - vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>; > + MEM <vector(4) double> [(double *)&a] = vect__1.9_22; > + vect__3.14_24 = MEM <vector(4) double> [(double *)&c]; > + _25 = MEM[(double *)&c + 32B]; that looks like a noop (but it's odd we keep the unused load) > + vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>; > _3 = c[2]; > _4 = c[3]; > - MEM <vector(4) double> [(double *)&b] = vect__1.16_25; > + MEM <vector(4) double> [(double *)&b] = vect__1.16_27; > _5 = c[4]; > _17 = {_5, _5}; > MEM <vector(2) double> [(double *)&x] = _17; > > --- bb-slp-pr78205.c.265t.optimized 2023-12-20 09:49:45.838344586 +0100 > +++ bb-slp-pr78205.c.265t.optimized 2023-12-20 09:10:14.706300941 +0100 > @@ -6,17 +6,17 @@ > vector(4) double vect__1.16; > vector(4) double vect__1.9; > vector(4) double vect__1.7; > - double _5; > vector(2) double _17; > + double _20; > > <bb 2> [local count: 1073741824]: > vect__1.7_19 = MEM <vector(4) double> [(double *)&c]; > - vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; > - MEM <vector(4) double> [(double *)&a] = vect__1.9_21; > - vect__1.16_25 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>; > - MEM <vector(4) double> [(double *)&b] = vect__1.16_25; > - _5 = c[4]; > - _17 = {_5, _5}; > + _20 = MEM[(double *)&c + 32B]; that looks similar in the end, but trades c[4] for MEM here. That's because we CSEd c[4] with the "dead" load above. I think we could simply remove that extra scan-tree-dump-times, we should instead look to _not_ see a vector load starting at c[4], but scan-tree-dump-not is also not very reliable (in a false negative sense). I'll see what that "dead" load is. > + vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; > + MEM <vector(4) double> [(double *)&a] = vect__1.9_22; > + vect__1.16_27 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>; > + MEM <vector(4) double> [(double *)&b] = vect__1.16_27; > + _17 = {_20, _20}; > MEM <vector(2) double> [(double *)&x] = _17; > return; > --- bb-slp-pr78205.s 2023-12-20 09:49:45.846344519 +0100 > +++ bb-slp-pr78205.s 2023-12-20 09:10:14.722300807 +0100 > @@ -41,7 +41,17 @@ > v_addc_co_u32 v7, s[22:23], 0, v7, s[22:23] > flat_load_dwordx2 v[6:7], v[6:7] offset:0 > s_waitcnt 0 > + v_writelane_b32 v8, s12, 0 > + v_writelane_b32 v9, s13, 0 > + s_mov_b64 exec, 1 > + v_add_co_u32 v8, vcc, 32, v8 > + v_addc_co_u32 v9, vcc, 0, v9, vcc > + flat_load_dwordx2 v[8:9], v[8:9] > + s_waitcnt 0 > + v_readlane_b32 s12, v8, 0 > + v_readlane_b32 s13, v9, 0 > s_mov_b64 s[18:19], 10 > + s_mov_b64 exec, 15 > v_cndmask_b32 v0, 0, 4, s[18:19] > s_mov_b64 exec, 15 > v_mov_b32 v11, v7 > @@ -73,15 +83,6 @@ > v_mov_b32 v5, s19 > v_addc_co_u32 v5, s[22:23], 0, v5, s[22:23] > flat_store_dwordx2 v[4:5], v[6:7] offset:0 > - v_writelane_b32 v4, s12, 0 > - v_writelane_b32 v5, s13, 0 > - s_mov_b64 exec, 1 > - v_add_co_u32 v4, vcc, 32, v4 > - v_addc_co_u32 v5, vcc, 0, v5, vcc > - flat_load_dwordx2 v[4:5], v[4:5] > - s_waitcnt 0 > - v_readlane_b32 s12, v4, 0 > - v_readlane_b32 s13, v5, 0 > s_mov_b64 exec, 3 > v_mov_b32 v6, s12 > v_mov_b32 v7, s13 > > I haven't looked at the full context, but this appears to effectively > just move this block of code, and use different registers. > > In: > > + s_mov_b64 exec, 15 > v_cndmask_b32 v0, 0, 4, s[18:19] > s_mov_b64 exec, 15 > > ... isn't the second (pre-existing) 's_mov_b64 exec, 15' now redundant, > though? I have no idea about that. > > Gr??e > Thomas > > > > PR tree-optimization/113073 > > * tree-vect-stmts.cc (vectorizable_load): Properly ensure > > to exempt only vector-size aligned overreads. > > --- > > gcc/tree-vect-stmts.cc | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > > index fc6923cf68a..e9ff728dfd4 100644 > > --- a/gcc/tree-vect-stmts.cc > > +++ b/gcc/tree-vect-stmts.cc > > @@ -11476,7 +11476,9 @@ vectorizable_load (vec_info *vinfo, > > - (group_size * vf - gap), nunits)) > > /* DR will be unused. */ > > ltype = NULL_TREE; > > - else if (alignment_support_scheme == dr_aligned) > > + else if (known_ge (vect_align, > > + tree_to_poly_uint64 > > + (TYPE_SIZE_UNIT (vectype)))) > > /* Aligned access to excess elements is OK if > > at least one element is accessed in the > > scalar loop. */ > ----------------- > Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstra?e 201, 80634 M?nchen; Gesellschaft mit beschr?nkter Haftung; Gesch?ftsf?hrer: Thomas Heurung, Frank Th?rauf; Sitz der Gesellschaft: M?nchen; Registergericht M?nchen, HRB 106955 > -- Richard Biener <rguenther@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] tree-optimization/113073 - amend PR112736 fix 2023-12-20 10:21 ` Richard Biener @ 2023-12-20 13:44 ` Richard Biener 2023-12-20 14:31 ` Thomas Schwinge 0 siblings, 1 reply; 5+ messages in thread From: Richard Biener @ 2023-12-20 13:44 UTC (permalink / raw) To: Thomas Schwinge; +Cc: Andrew Stubbs, Julian Brown, gcc-patches On Wed, 20 Dec 2023, Richard Biener wrote: > On Wed, 20 Dec 2023, Thomas Schwinge wrote: > > > Hi! > > > > On 2023-12-19T13:30:58+0100, Richard Biener <rguenther@suse.de> wrote: > > > The PR112736 testcase fails on RISC-V because the aligned exception > > > uses the wrong check. The alignment support scheme can be > > > dr_aligned even when the access isn't aligned to the vector size > > > but some targets are happy with element alignment. The following > > > fixes that. > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed. > > > > I've noticed this to regresses GCN target as follows: > > > > PASS: gcc.dg/vect/bb-slp-pr78205.c (test for excess errors) > > PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "optimized: basic block" 3 > > PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1 > > [-PASS:-]{+FAIL:+} gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times optimized " = c\\[4\\];" 1 > > > > As so often, I've got no clue whether that's a vectorizer, GCN back end, > > or test case issue. ;-) > > > > 'diff'ing before vs. after: > > > > --- bb-slp-pr78205.c.191t.slp2 2023-12-20 09:49:45.834344620 +0100 > > +++ bb-slp-pr78205.c.191t.slp2 2023-12-20 09:10:14.706300941 +0100 > > [...] > > @@ -505,8 +505,9 @@ > > [...]/bb-slp-pr78205.c:9:8: note: create vector_type-pointer variable to type: vector(4) double vectorizing a pointer ref: c[0] > > [...]/bb-slp-pr78205.c:9:8: note: created &c[0] > > [...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.7_19 = MEM <vector(4) double> [(double *)&c]; > > -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_20 = MEM <vector(4) double> [(double *)&c + 32B]; > > -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; > > +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: _20 = MEM[(double *)&c + 32B]; > > +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_21 = {_20, 0.0, 0.0, 0.0}; > > +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; > > [...]/bb-slp-pr78205.c:9:8: note: ------>vectorizing SLP node starting from: a[0] = _1; > > [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[0], type of def: internal > > [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[1], type of def: internal > > [...] > > @@ -537,9 +538,10 @@ > > [...]/bb-slp-pr78205.c:13:8: note: transform load. ncopies = 1 > > [...]/bb-slp-pr78205.c:13:8: note: create vector_type-pointer variable to type: vector(4) double vectorizing a pointer ref: c[2] > > [...]/bb-slp-pr78205.c:13:8: note: created &c[2] > > -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_23 = MEM <vector(4) double> [(double *)&c]; > > -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_24 = MEM <vector(4) double> [(double *)&c + 32B]; > > -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>; > > +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_24 = MEM <vector(4) double> [(double *)&c]; > > +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: _25 = MEM[(double *)&c + 32B]; > > +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_26 = {_25, 0.0, 0.0, 0.0}; > > +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>; > > [...]/bb-slp-pr78205.c:13:8: note: ------>vectorizing SLP node starting from: b[0] = _3; > > [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[2], type of def: internal > > [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[3], type of def: internal > > [...] > > @@ -580,18 +582,22 @@ > > double _4; > > double _5; > > vector(2) double _17; > > + double _20; > > + double _25; > > > > <bb 2> [local count: 1073741824]: > > vect__1.7_19 = MEM <vector(4) double> [(double *)&c]; > > - vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; > > + _20 = MEM[(double *)&c + 32B]; > > + vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; > > _1 = c[0]; > > _2 = c[1]; > > - MEM <vector(4) double> [(double *)&a] = vect__1.9_21; > > - vect__3.14_23 = MEM <vector(4) double> [(double *)&c]; > > - vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>; > > + MEM <vector(4) double> [(double *)&a] = vect__1.9_22; > > + vect__3.14_24 = MEM <vector(4) double> [(double *)&c]; > > + _25 = MEM[(double *)&c + 32B]; > > that looks like a noop (but it's odd we keep the unused load) > > > + vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>; > > _3 = c[2]; > > _4 = c[3]; > > - MEM <vector(4) double> [(double *)&b] = vect__1.16_25; > > + MEM <vector(4) double> [(double *)&b] = vect__1.16_27; > > _5 = c[4]; > > _17 = {_5, _5}; > > MEM <vector(2) double> [(double *)&x] = _17; > > > > --- bb-slp-pr78205.c.265t.optimized 2023-12-20 09:49:45.838344586 +0100 > > +++ bb-slp-pr78205.c.265t.optimized 2023-12-20 09:10:14.706300941 +0100 > > @@ -6,17 +6,17 @@ > > vector(4) double vect__1.16; > > vector(4) double vect__1.9; > > vector(4) double vect__1.7; > > - double _5; > > vector(2) double _17; > > + double _20; > > > > <bb 2> [local count: 1073741824]: > > vect__1.7_19 = MEM <vector(4) double> [(double *)&c]; > > - vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>; > > - MEM <vector(4) double> [(double *)&a] = vect__1.9_21; > > - vect__1.16_25 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>; > > - MEM <vector(4) double> [(double *)&b] = vect__1.16_25; > > - _5 = c[4]; > > - _17 = {_5, _5}; > > + _20 = MEM[(double *)&c + 32B]; > > that looks similar in the end, but trades c[4] for MEM here. That's > because we CSEd c[4] with the "dead" load above. > > I think we could simply remove that extra scan-tree-dump-times, we > should instead look to _not_ see a vector load starting at c[4], > but scan-tree-dump-not is also not very reliable (in a false negative > sense). > > I'll see what that "dead" load is. Should be fixed by r14-6748-ga8f0278ade1353 Richard. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] tree-optimization/113073 - amend PR112736 fix 2023-12-20 13:44 ` Richard Biener @ 2023-12-20 14:31 ` Thomas Schwinge 0 siblings, 0 replies; 5+ messages in thread From: Thomas Schwinge @ 2023-12-20 14:31 UTC (permalink / raw) To: Richard Biener; +Cc: Andrew Stubbs, Julian Brown, gcc-patches Hi Richard! On 2023-12-20T14:44:29+0100, Richard Biener <rguenther@suse.de> wrote: > On Wed, 20 Dec 2023, Richard Biener wrote: >> On Wed, 20 Dec 2023, Thomas Schwinge wrote: >> > On 2023-12-19T13:30:58+0100, Richard Biener <rguenther@suse.de> wrote: >> > > The PR112736 testcase fails on RISC-V because the aligned exception >> > > uses the wrong check. The alignment support scheme can be >> > > dr_aligned even when the access isn't aligned to the vector size >> > > but some targets are happy with element alignment. The following >> > > fixes that. >> > > >> > > Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed. >> > >> > I've noticed this to regresses GCN target as follows: >> > >> > PASS: gcc.dg/vect/bb-slp-pr78205.c (test for excess errors) >> > PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "optimized: basic block" 3 >> > PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1 >> > [-PASS:-]{+FAIL:+} gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times optimized " = c\\[4\\];" 1 > Should be fixed by r14-6748-ga8f0278ade1353 Thanks! Confirmed that 'gcc.dg/vect/bb-slp-pr78205.c' again is all-PASS (and no unexpected changes in the dumps, and 'bb-slp-pr78205.s' identical once again, before vs. after). Grüße Thomas ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955 ^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH] tree-optimization/113073 - amend PR112736 fix @ 2023-12-19 12:30 Richard Biener 0 siblings, 0 replies; 5+ messages in thread From: Richard Biener @ 2023-12-19 12:30 UTC (permalink / raw) To: gcc-patches The PR112736 testcase fails on RISC-V because the aligned exception uses the wrong check. The alignment support scheme can be dr_aligned even when the access isn't aligned to the vector size but some targets are happy with element alignment. The following fixes that. Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed. PR tree-optimization/113073 * tree-vect-stmts.cc (vectorizable_load): Properly ensure to exempt only vector-size aligned overreads. --- gcc/tree-vect-stmts.cc | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index fc6923cf68a..e9ff728dfd4 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -11476,7 +11476,9 @@ vectorizable_load (vec_info *vinfo, - (group_size * vf - gap), nunits)) /* DR will be unused. */ ltype = NULL_TREE; - else if (alignment_support_scheme == dr_aligned) + else if (known_ge (vect_align, + tree_to_poly_uint64 + (TYPE_SIZE_UNIT (vectype)))) /* Aligned access to excess elements is OK if at least one element is accessed in the scalar loop. */ -- 2.35.3 ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2023-12-20 14:31 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <20231219123224.3D481385E459@sourceware.org> 2023-12-20 9:50 ` [PATCH] tree-optimization/113073 - amend PR112736 fix Thomas Schwinge 2023-12-20 10:21 ` Richard Biener 2023-12-20 13:44 ` Richard Biener 2023-12-20 14:31 ` Thomas Schwinge 2023-12-19 12:30 Richard Biener
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).