* [PATCH] tree-optimization/113073 - amend PR112736 fix
@ 2023-12-19 12:30 Richard Biener
0 siblings, 0 replies; 5+ messages in thread
From: Richard Biener @ 2023-12-19 12:30 UTC (permalink / raw)
To: gcc-patches
The PR112736 testcase fails on RISC-V because the aligned exception
uses the wrong check. The alignment support scheme can be
dr_aligned even when the access isn't aligned to the vector size
but some targets are happy with element alignment. The following
fixes that.
Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.
PR tree-optimization/113073
* tree-vect-stmts.cc (vectorizable_load): Properly ensure
to exempt only vector-size aligned overreads.
---
gcc/tree-vect-stmts.cc | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index fc6923cf68a..e9ff728dfd4 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -11476,7 +11476,9 @@ vectorizable_load (vec_info *vinfo,
- (group_size * vf - gap), nunits))
/* DR will be unused. */
ltype = NULL_TREE;
- else if (alignment_support_scheme == dr_aligned)
+ else if (known_ge (vect_align,
+ tree_to_poly_uint64
+ (TYPE_SIZE_UNIT (vectype))))
/* Aligned access to excess elements is OK if
at least one element is accessed in the
scalar loop. */
--
2.35.3
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] tree-optimization/113073 - amend PR112736 fix
2023-12-20 13:44 ` Richard Biener
@ 2023-12-20 14:31 ` Thomas Schwinge
0 siblings, 0 replies; 5+ messages in thread
From: Thomas Schwinge @ 2023-12-20 14:31 UTC (permalink / raw)
To: Richard Biener; +Cc: Andrew Stubbs, Julian Brown, gcc-patches
Hi Richard!
On 2023-12-20T14:44:29+0100, Richard Biener <rguenther@suse.de> wrote:
> On Wed, 20 Dec 2023, Richard Biener wrote:
>> On Wed, 20 Dec 2023, Thomas Schwinge wrote:
>> > On 2023-12-19T13:30:58+0100, Richard Biener <rguenther@suse.de> wrote:
>> > > The PR112736 testcase fails on RISC-V because the aligned exception
>> > > uses the wrong check. The alignment support scheme can be
>> > > dr_aligned even when the access isn't aligned to the vector size
>> > > but some targets are happy with element alignment. The following
>> > > fixes that.
>> > >
>> > > Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.
>> >
>> > I've noticed this to regresses GCN target as follows:
>> >
>> > PASS: gcc.dg/vect/bb-slp-pr78205.c (test for excess errors)
>> > PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "optimized: basic block" 3
>> > PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1
>> > [-PASS:-]{+FAIL:+} gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times optimized " = c\\[4\\];" 1
> Should be fixed by r14-6748-ga8f0278ade1353
Thanks! Confirmed that 'gcc.dg/vect/bb-slp-pr78205.c' again is all-PASS
(and no unexpected changes in the dumps, and 'bb-slp-pr78205.s' identical
once again, before vs. after).
Grüße
Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] tree-optimization/113073 - amend PR112736 fix
2023-12-20 10:21 ` Richard Biener
@ 2023-12-20 13:44 ` Richard Biener
2023-12-20 14:31 ` Thomas Schwinge
0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2023-12-20 13:44 UTC (permalink / raw)
To: Thomas Schwinge; +Cc: Andrew Stubbs, Julian Brown, gcc-patches
On Wed, 20 Dec 2023, Richard Biener wrote:
> On Wed, 20 Dec 2023, Thomas Schwinge wrote:
>
> > Hi!
> >
> > On 2023-12-19T13:30:58+0100, Richard Biener <rguenther@suse.de> wrote:
> > > The PR112736 testcase fails on RISC-V because the aligned exception
> > > uses the wrong check. The alignment support scheme can be
> > > dr_aligned even when the access isn't aligned to the vector size
> > > but some targets are happy with element alignment. The following
> > > fixes that.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.
> >
> > I've noticed this to regresses GCN target as follows:
> >
> > PASS: gcc.dg/vect/bb-slp-pr78205.c (test for excess errors)
> > PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "optimized: basic block" 3
> > PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1
> > [-PASS:-]{+FAIL:+} gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times optimized " = c\\[4\\];" 1
> >
> > As so often, I've got no clue whether that's a vectorizer, GCN back end,
> > or test case issue. ;-)
> >
> > 'diff'ing before vs. after:
> >
> > --- bb-slp-pr78205.c.191t.slp2 2023-12-20 09:49:45.834344620 +0100
> > +++ bb-slp-pr78205.c.191t.slp2 2023-12-20 09:10:14.706300941 +0100
> > [...]
> > @@ -505,8 +505,9 @@
> > [...]/bb-slp-pr78205.c:9:8: note: create vector_type-pointer variable to type: vector(4) double vectorizing a pointer ref: c[0]
> > [...]/bb-slp-pr78205.c:9:8: note: created &c[0]
> > [...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
> > -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_20 = MEM <vector(4) double> [(double *)&c + 32B];
> > -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> > +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: _20 = MEM[(double *)&c + 32B];
> > +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_21 = {_20, 0.0, 0.0, 0.0};
> > +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> > [...]/bb-slp-pr78205.c:9:8: note: ------>vectorizing SLP node starting from: a[0] = _1;
> > [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[0], type of def: internal
> > [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[1], type of def: internal
> > [...]
> > @@ -537,9 +538,10 @@
> > [...]/bb-slp-pr78205.c:13:8: note: transform load. ncopies = 1
> > [...]/bb-slp-pr78205.c:13:8: note: create vector_type-pointer variable to type: vector(4) double vectorizing a pointer ref: c[2]
> > [...]/bb-slp-pr78205.c:13:8: note: created &c[2]
> > -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_23 = MEM <vector(4) double> [(double *)&c];
> > -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_24 = MEM <vector(4) double> [(double *)&c + 32B];
> > -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>;
> > +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_24 = MEM <vector(4) double> [(double *)&c];
> > +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: _25 = MEM[(double *)&c + 32B];
> > +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_26 = {_25, 0.0, 0.0, 0.0};
> > +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>;
> > [...]/bb-slp-pr78205.c:13:8: note: ------>vectorizing SLP node starting from: b[0] = _3;
> > [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[2], type of def: internal
> > [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[3], type of def: internal
> > [...]
> > @@ -580,18 +582,22 @@
> > double _4;
> > double _5;
> > vector(2) double _17;
> > + double _20;
> > + double _25;
> >
> > <bb 2> [local count: 1073741824]:
> > vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
> > - vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> > + _20 = MEM[(double *)&c + 32B];
> > + vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> > _1 = c[0];
> > _2 = c[1];
> > - MEM <vector(4) double> [(double *)&a] = vect__1.9_21;
> > - vect__3.14_23 = MEM <vector(4) double> [(double *)&c];
> > - vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>;
> > + MEM <vector(4) double> [(double *)&a] = vect__1.9_22;
> > + vect__3.14_24 = MEM <vector(4) double> [(double *)&c];
> > + _25 = MEM[(double *)&c + 32B];
>
> that looks like a noop (but it's odd we keep the unused load)
>
> > + vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>;
> > _3 = c[2];
> > _4 = c[3];
> > - MEM <vector(4) double> [(double *)&b] = vect__1.16_25;
> > + MEM <vector(4) double> [(double *)&b] = vect__1.16_27;
> > _5 = c[4];
> > _17 = {_5, _5};
> > MEM <vector(2) double> [(double *)&x] = _17;
> >
> > --- bb-slp-pr78205.c.265t.optimized 2023-12-20 09:49:45.838344586 +0100
> > +++ bb-slp-pr78205.c.265t.optimized 2023-12-20 09:10:14.706300941 +0100
> > @@ -6,17 +6,17 @@
> > vector(4) double vect__1.16;
> > vector(4) double vect__1.9;
> > vector(4) double vect__1.7;
> > - double _5;
> > vector(2) double _17;
> > + double _20;
> >
> > <bb 2> [local count: 1073741824]:
> > vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
> > - vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> > - MEM <vector(4) double> [(double *)&a] = vect__1.9_21;
> > - vect__1.16_25 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>;
> > - MEM <vector(4) double> [(double *)&b] = vect__1.16_25;
> > - _5 = c[4];
> > - _17 = {_5, _5};
> > + _20 = MEM[(double *)&c + 32B];
>
> that looks similar in the end, but trades c[4] for MEM here. That's
> because we CSEd c[4] with the "dead" load above.
>
> I think we could simply remove that extra scan-tree-dump-times, we
> should instead look to _not_ see a vector load starting at c[4],
> but scan-tree-dump-not is also not very reliable (in a false negative
> sense).
>
> I'll see what that "dead" load is.
Should be fixed by r14-6748-ga8f0278ade1353
Richard.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] tree-optimization/113073 - amend PR112736 fix
2023-12-20 9:50 ` Thomas Schwinge
@ 2023-12-20 10:21 ` Richard Biener
2023-12-20 13:44 ` Richard Biener
0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2023-12-20 10:21 UTC (permalink / raw)
To: Thomas Schwinge; +Cc: Andrew Stubbs, Julian Brown, gcc-patches
On Wed, 20 Dec 2023, Thomas Schwinge wrote:
> Hi!
>
> On 2023-12-19T13:30:58+0100, Richard Biener <rguenther@suse.de> wrote:
> > The PR112736 testcase fails on RISC-V because the aligned exception
> > uses the wrong check. The alignment support scheme can be
> > dr_aligned even when the access isn't aligned to the vector size
> > but some targets are happy with element alignment. The following
> > fixes that.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.
>
> I've noticed this to regresses GCN target as follows:
>
> PASS: gcc.dg/vect/bb-slp-pr78205.c (test for excess errors)
> PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "optimized: basic block" 3
> PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1
> [-PASS:-]{+FAIL:+} gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times optimized " = c\\[4\\];" 1
>
> As so often, I've got no clue whether that's a vectorizer, GCN back end,
> or test case issue. ;-)
>
> 'diff'ing before vs. after:
>
> --- bb-slp-pr78205.c.191t.slp2 2023-12-20 09:49:45.834344620 +0100
> +++ bb-slp-pr78205.c.191t.slp2 2023-12-20 09:10:14.706300941 +0100
> [...]
> @@ -505,8 +505,9 @@
> [...]/bb-slp-pr78205.c:9:8: note: create vector_type-pointer variable to type: vector(4) double vectorizing a pointer ref: c[0]
> [...]/bb-slp-pr78205.c:9:8: note: created &c[0]
> [...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
> -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_20 = MEM <vector(4) double> [(double *)&c + 32B];
> -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: _20 = MEM[(double *)&c + 32B];
> +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_21 = {_20, 0.0, 0.0, 0.0};
> +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> [...]/bb-slp-pr78205.c:9:8: note: ------>vectorizing SLP node starting from: a[0] = _1;
> [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[0], type of def: internal
> [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[1], type of def: internal
> [...]
> @@ -537,9 +538,10 @@
> [...]/bb-slp-pr78205.c:13:8: note: transform load. ncopies = 1
> [...]/bb-slp-pr78205.c:13:8: note: create vector_type-pointer variable to type: vector(4) double vectorizing a pointer ref: c[2]
> [...]/bb-slp-pr78205.c:13:8: note: created &c[2]
> -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_23 = MEM <vector(4) double> [(double *)&c];
> -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_24 = MEM <vector(4) double> [(double *)&c + 32B];
> -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>;
> +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_24 = MEM <vector(4) double> [(double *)&c];
> +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: _25 = MEM[(double *)&c + 32B];
> +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_26 = {_25, 0.0, 0.0, 0.0};
> +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>;
> [...]/bb-slp-pr78205.c:13:8: note: ------>vectorizing SLP node starting from: b[0] = _3;
> [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[2], type of def: internal
> [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[3], type of def: internal
> [...]
> @@ -580,18 +582,22 @@
> double _4;
> double _5;
> vector(2) double _17;
> + double _20;
> + double _25;
>
> <bb 2> [local count: 1073741824]:
> vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
> - vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> + _20 = MEM[(double *)&c + 32B];
> + vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> _1 = c[0];
> _2 = c[1];
> - MEM <vector(4) double> [(double *)&a] = vect__1.9_21;
> - vect__3.14_23 = MEM <vector(4) double> [(double *)&c];
> - vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>;
> + MEM <vector(4) double> [(double *)&a] = vect__1.9_22;
> + vect__3.14_24 = MEM <vector(4) double> [(double *)&c];
> + _25 = MEM[(double *)&c + 32B];
that looks like a noop (but it's odd we keep the unused load)
> + vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>;
> _3 = c[2];
> _4 = c[3];
> - MEM <vector(4) double> [(double *)&b] = vect__1.16_25;
> + MEM <vector(4) double> [(double *)&b] = vect__1.16_27;
> _5 = c[4];
> _17 = {_5, _5};
> MEM <vector(2) double> [(double *)&x] = _17;
>
> --- bb-slp-pr78205.c.265t.optimized 2023-12-20 09:49:45.838344586 +0100
> +++ bb-slp-pr78205.c.265t.optimized 2023-12-20 09:10:14.706300941 +0100
> @@ -6,17 +6,17 @@
> vector(4) double vect__1.16;
> vector(4) double vect__1.9;
> vector(4) double vect__1.7;
> - double _5;
> vector(2) double _17;
> + double _20;
>
> <bb 2> [local count: 1073741824]:
> vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
> - vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> - MEM <vector(4) double> [(double *)&a] = vect__1.9_21;
> - vect__1.16_25 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>;
> - MEM <vector(4) double> [(double *)&b] = vect__1.16_25;
> - _5 = c[4];
> - _17 = {_5, _5};
> + _20 = MEM[(double *)&c + 32B];
that looks similar in the end, but trades c[4] for MEM here. That's
because we CSEd c[4] with the "dead" load above.
I think we could simply remove that extra scan-tree-dump-times, we
should instead look to _not_ see a vector load starting at c[4],
but scan-tree-dump-not is also not very reliable (in a false negative
sense).
I'll see what that "dead" load is.
> + vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> + MEM <vector(4) double> [(double *)&a] = vect__1.9_22;
> + vect__1.16_27 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>;
> + MEM <vector(4) double> [(double *)&b] = vect__1.16_27;
> + _17 = {_20, _20};
> MEM <vector(2) double> [(double *)&x] = _17;
> return;
> --- bb-slp-pr78205.s 2023-12-20 09:49:45.846344519 +0100
> +++ bb-slp-pr78205.s 2023-12-20 09:10:14.722300807 +0100
> @@ -41,7 +41,17 @@
> v_addc_co_u32 v7, s[22:23], 0, v7, s[22:23]
> flat_load_dwordx2 v[6:7], v[6:7] offset:0
> s_waitcnt 0
> + v_writelane_b32 v8, s12, 0
> + v_writelane_b32 v9, s13, 0
> + s_mov_b64 exec, 1
> + v_add_co_u32 v8, vcc, 32, v8
> + v_addc_co_u32 v9, vcc, 0, v9, vcc
> + flat_load_dwordx2 v[8:9], v[8:9]
> + s_waitcnt 0
> + v_readlane_b32 s12, v8, 0
> + v_readlane_b32 s13, v9, 0
> s_mov_b64 s[18:19], 10
> + s_mov_b64 exec, 15
> v_cndmask_b32 v0, 0, 4, s[18:19]
> s_mov_b64 exec, 15
> v_mov_b32 v11, v7
> @@ -73,15 +83,6 @@
> v_mov_b32 v5, s19
> v_addc_co_u32 v5, s[22:23], 0, v5, s[22:23]
> flat_store_dwordx2 v[4:5], v[6:7] offset:0
> - v_writelane_b32 v4, s12, 0
> - v_writelane_b32 v5, s13, 0
> - s_mov_b64 exec, 1
> - v_add_co_u32 v4, vcc, 32, v4
> - v_addc_co_u32 v5, vcc, 0, v5, vcc
> - flat_load_dwordx2 v[4:5], v[4:5]
> - s_waitcnt 0
> - v_readlane_b32 s12, v4, 0
> - v_readlane_b32 s13, v5, 0
> s_mov_b64 exec, 3
> v_mov_b32 v6, s12
> v_mov_b32 v7, s13
>
> I haven't looked at the full context, but this appears to effectively
> just move this block of code, and use different registers.
>
> In:
>
> + s_mov_b64 exec, 15
> v_cndmask_b32 v0, 0, 4, s[18:19]
> s_mov_b64 exec, 15
>
> ... isn't the second (pre-existing) 's_mov_b64 exec, 15' now redundant,
> though?
I have no idea about that.
>
> Gr??e
> Thomas
>
>
> > PR tree-optimization/113073
> > * tree-vect-stmts.cc (vectorizable_load): Properly ensure
> > to exempt only vector-size aligned overreads.
> > ---
> > gcc/tree-vect-stmts.cc | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index fc6923cf68a..e9ff728dfd4 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -11476,7 +11476,9 @@ vectorizable_load (vec_info *vinfo,
> > - (group_size * vf - gap), nunits))
> > /* DR will be unused. */
> > ltype = NULL_TREE;
> > - else if (alignment_support_scheme == dr_aligned)
> > + else if (known_ge (vect_align,
> > + tree_to_poly_uint64
> > + (TYPE_SIZE_UNIT (vectype))))
> > /* Aligned access to excess elements is OK if
> > at least one element is accessed in the
> > scalar loop. */
> -----------------
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstra?e 201, 80634 M?nchen; Gesellschaft mit beschr?nkter Haftung; Gesch?ftsf?hrer: Thomas Heurung, Frank Th?rauf; Sitz der Gesellschaft: M?nchen; Registergericht M?nchen, HRB 106955
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] tree-optimization/113073 - amend PR112736 fix
[not found] <20231219123224.3D481385E459@sourceware.org>
@ 2023-12-20 9:50 ` Thomas Schwinge
2023-12-20 10:21 ` Richard Biener
0 siblings, 1 reply; 5+ messages in thread
From: Thomas Schwinge @ 2023-12-20 9:50 UTC (permalink / raw)
To: Richard Biener, Andrew Stubbs, Julian Brown; +Cc: gcc-patches
Hi!
On 2023-12-19T13:30:58+0100, Richard Biener <rguenther@suse.de> wrote:
> The PR112736 testcase fails on RISC-V because the aligned exception
> uses the wrong check. The alignment support scheme can be
> dr_aligned even when the access isn't aligned to the vector size
> but some targets are happy with element alignment. The following
> fixes that.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.
I've noticed this to regresses GCN target as follows:
PASS: gcc.dg/vect/bb-slp-pr78205.c (test for excess errors)
PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "optimized: basic block" 3
PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1
[-PASS:-]{+FAIL:+} gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times optimized " = c\\[4\\];" 1
As so often, I've got no clue whether that's a vectorizer, GCN back end,
or test case issue. ;-)
'diff'ing before vs. after:
--- bb-slp-pr78205.c.191t.slp2 2023-12-20 09:49:45.834344620 +0100
+++ bb-slp-pr78205.c.191t.slp2 2023-12-20 09:10:14.706300941 +0100
[...]
@@ -505,8 +505,9 @@
[...]/bb-slp-pr78205.c:9:8: note: create vector_type-pointer variable to type: vector(4) double vectorizing a pointer ref: c[0]
[...]/bb-slp-pr78205.c:9:8: note: created &c[0]
[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
-[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_20 = MEM <vector(4) double> [(double *)&c + 32B];
-[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
+[...]/bb-slp-pr78205.c:9:8: note: add new stmt: _20 = MEM[(double *)&c + 32B];
+[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_21 = {_20, 0.0, 0.0, 0.0};
+[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
[...]/bb-slp-pr78205.c:9:8: note: ------>vectorizing SLP node starting from: a[0] = _1;
[...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[0], type of def: internal
[...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[1], type of def: internal
[...]
@@ -537,9 +538,10 @@
[...]/bb-slp-pr78205.c:13:8: note: transform load. ncopies = 1
[...]/bb-slp-pr78205.c:13:8: note: create vector_type-pointer variable to type: vector(4) double vectorizing a pointer ref: c[2]
[...]/bb-slp-pr78205.c:13:8: note: created &c[2]
-[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_23 = MEM <vector(4) double> [(double *)&c];
-[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_24 = MEM <vector(4) double> [(double *)&c + 32B];
-[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>;
+[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_24 = MEM <vector(4) double> [(double *)&c];
+[...]/bb-slp-pr78205.c:13:8: note: add new stmt: _25 = MEM[(double *)&c + 32B];
+[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_26 = {_25, 0.0, 0.0, 0.0};
+[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>;
[...]/bb-slp-pr78205.c:13:8: note: ------>vectorizing SLP node starting from: b[0] = _3;
[...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[2], type of def: internal
[...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[3], type of def: internal
[...]
@@ -580,18 +582,22 @@
double _4;
double _5;
vector(2) double _17;
+ double _20;
+ double _25;
<bb 2> [local count: 1073741824]:
vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
- vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
+ _20 = MEM[(double *)&c + 32B];
+ vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
_1 = c[0];
_2 = c[1];
- MEM <vector(4) double> [(double *)&a] = vect__1.9_21;
- vect__3.14_23 = MEM <vector(4) double> [(double *)&c];
- vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>;
+ MEM <vector(4) double> [(double *)&a] = vect__1.9_22;
+ vect__3.14_24 = MEM <vector(4) double> [(double *)&c];
+ _25 = MEM[(double *)&c + 32B];
+ vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>;
_3 = c[2];
_4 = c[3];
- MEM <vector(4) double> [(double *)&b] = vect__1.16_25;
+ MEM <vector(4) double> [(double *)&b] = vect__1.16_27;
_5 = c[4];
_17 = {_5, _5};
MEM <vector(2) double> [(double *)&x] = _17;
--- bb-slp-pr78205.c.265t.optimized 2023-12-20 09:49:45.838344586 +0100
+++ bb-slp-pr78205.c.265t.optimized 2023-12-20 09:10:14.706300941 +0100
@@ -6,17 +6,17 @@
vector(4) double vect__1.16;
vector(4) double vect__1.9;
vector(4) double vect__1.7;
- double _5;
vector(2) double _17;
+ double _20;
<bb 2> [local count: 1073741824]:
vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
- vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
- MEM <vector(4) double> [(double *)&a] = vect__1.9_21;
- vect__1.16_25 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>;
- MEM <vector(4) double> [(double *)&b] = vect__1.16_25;
- _5 = c[4];
- _17 = {_5, _5};
+ _20 = MEM[(double *)&c + 32B];
+ vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
+ MEM <vector(4) double> [(double *)&a] = vect__1.9_22;
+ vect__1.16_27 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>;
+ MEM <vector(4) double> [(double *)&b] = vect__1.16_27;
+ _17 = {_20, _20};
MEM <vector(2) double> [(double *)&x] = _17;
return;
--- bb-slp-pr78205.s 2023-12-20 09:49:45.846344519 +0100
+++ bb-slp-pr78205.s 2023-12-20 09:10:14.722300807 +0100
@@ -41,7 +41,17 @@
v_addc_co_u32 v7, s[22:23], 0, v7, s[22:23]
flat_load_dwordx2 v[6:7], v[6:7] offset:0
s_waitcnt 0
+ v_writelane_b32 v8, s12, 0
+ v_writelane_b32 v9, s13, 0
+ s_mov_b64 exec, 1
+ v_add_co_u32 v8, vcc, 32, v8
+ v_addc_co_u32 v9, vcc, 0, v9, vcc
+ flat_load_dwordx2 v[8:9], v[8:9]
+ s_waitcnt 0
+ v_readlane_b32 s12, v8, 0
+ v_readlane_b32 s13, v9, 0
s_mov_b64 s[18:19], 10
+ s_mov_b64 exec, 15
v_cndmask_b32 v0, 0, 4, s[18:19]
s_mov_b64 exec, 15
v_mov_b32 v11, v7
@@ -73,15 +83,6 @@
v_mov_b32 v5, s19
v_addc_co_u32 v5, s[22:23], 0, v5, s[22:23]
flat_store_dwordx2 v[4:5], v[6:7] offset:0
- v_writelane_b32 v4, s12, 0
- v_writelane_b32 v5, s13, 0
- s_mov_b64 exec, 1
- v_add_co_u32 v4, vcc, 32, v4
- v_addc_co_u32 v5, vcc, 0, v5, vcc
- flat_load_dwordx2 v[4:5], v[4:5]
- s_waitcnt 0
- v_readlane_b32 s12, v4, 0
- v_readlane_b32 s13, v5, 0
s_mov_b64 exec, 3
v_mov_b32 v6, s12
v_mov_b32 v7, s13
I haven't looked at the full context, but this appears to effectively
just move this block of code, and use different registers.
In:
+ s_mov_b64 exec, 15
v_cndmask_b32 v0, 0, 4, s[18:19]
s_mov_b64 exec, 15
... isn't the second (pre-existing) 's_mov_b64 exec, 15' now redundant,
though?
Grüße
Thomas
> PR tree-optimization/113073
> * tree-vect-stmts.cc (vectorizable_load): Properly ensure
> to exempt only vector-size aligned overreads.
> ---
> gcc/tree-vect-stmts.cc | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index fc6923cf68a..e9ff728dfd4 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -11476,7 +11476,9 @@ vectorizable_load (vec_info *vinfo,
> - (group_size * vf - gap), nunits))
> /* DR will be unused. */
> ltype = NULL_TREE;
> - else if (alignment_support_scheme == dr_aligned)
> + else if (known_ge (vect_align,
> + tree_to_poly_uint64
> + (TYPE_SIZE_UNIT (vectype))))
> /* Aligned access to excess elements is OK if
> at least one element is accessed in the
> scalar loop. */
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2023-12-20 14:31 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-19 12:30 [PATCH] tree-optimization/113073 - amend PR112736 fix Richard Biener
[not found] <20231219123224.3D481385E459@sourceware.org>
2023-12-20 9:50 ` Thomas Schwinge
2023-12-20 10:21 ` Richard Biener
2023-12-20 13:44 ` Richard Biener
2023-12-20 14:31 ` Thomas Schwinge
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).