[PATCH] tree-optimization/113073

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH] tree-optimization/113073 - amend PR112736 fix
@ 2023-12-19 12:30 Richard Biener
  0 siblings, 0 replies; 5+ messages in thread
From: Richard Biener @ 2023-12-19 12:30 UTC (permalink / raw)
  To: gcc-patches

The PR112736 testcase fails on RISC-V because the aligned exception
uses the wrong check.  The alignment support scheme can be
dr_aligned even when the access isn't aligned to the vector size
but some targets are happy with element alignment.  The following
fixes that.

Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

	PR tree-optimization/113073
	* tree-vect-stmts.cc (vectorizable_load): Properly ensure
	to exempt only vector-size aligned overreads.
---
 gcc/tree-vect-stmts.cc | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index fc6923cf68a..e9ff728dfd4 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -11476,7 +11476,9 @@ vectorizable_load (vec_info *vinfo,
 				      - (group_size * vf - gap), nunits))
 			  /* DR will be unused.  */
 			  ltype = NULL_TREE;
-			else if (alignment_support_scheme == dr_aligned)
+			else if (known_ge (vect_align,
+					   tree_to_poly_uint64
+					     (TYPE_SIZE_UNIT (vectype))))
 			  /* Aligned access to excess elements is OK if
 			     at least one element is accessed in the
 			     scalar loop.  */
-- 
2.35.3

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] tree-optimization/113073 - amend PR112736 fix
  2023-12-20 13:44     ` Richard Biener
@ 2023-12-20 14:31       ` Thomas Schwinge
  0 siblings, 0 replies; 5+ messages in thread
From: Thomas Schwinge @ 2023-12-20 14:31 UTC (permalink / raw)
  To: Richard Biener; +Cc: Andrew Stubbs, Julian Brown, gcc-patches

Hi Richard!

On 2023-12-20T14:44:29+0100, Richard Biener <rguenther@suse.de> wrote:
> On Wed, 20 Dec 2023, Richard Biener wrote:
>> On Wed, 20 Dec 2023, Thomas Schwinge wrote:
>> > On 2023-12-19T13:30:58+0100, Richard Biener <rguenther@suse.de> wrote:
>> > > The PR112736 testcase fails on RISC-V because the aligned exception
>> > > uses the wrong check.  The alignment support scheme can be
>> > > dr_aligned even when the access isn't aligned to the vector size
>> > > but some targets are happy with element alignment.  The following
>> > > fixes that.
>> > >
>> > > Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.
>> >
>> > I've noticed this to regresses GCN target as follows:
>> >
>> >     PASS: gcc.dg/vect/bb-slp-pr78205.c (test for excess errors)
>> >     PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "optimized: basic block" 3
>> >     PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1
>> >     [-PASS:-]{+FAIL:+} gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times optimized " = c\\[4\\];" 1

> Should be fixed by r14-6748-ga8f0278ade1353

Thanks!  Confirmed that 'gcc.dg/vect/bb-slp-pr78205.c' again is all-PASS
(and no unexpected changes in the dumps, and 'bb-slp-pr78205.s' identical
once again, before vs. after).


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] tree-optimization/113073 - amend PR112736 fix
  2023-12-20 10:21   ` Richard Biener
@ 2023-12-20 13:44     ` Richard Biener
  2023-12-20 14:31       ` Thomas Schwinge
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2023-12-20 13:44 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: Andrew Stubbs, Julian Brown, gcc-patches

On Wed, 20 Dec 2023, Richard Biener wrote:

> On Wed, 20 Dec 2023, Thomas Schwinge wrote:
> 
> > Hi!
> > 
> > On 2023-12-19T13:30:58+0100, Richard Biener <rguenther@suse.de> wrote:
> > > The PR112736 testcase fails on RISC-V because the aligned exception
> > > uses the wrong check.  The alignment support scheme can be
> > > dr_aligned even when the access isn't aligned to the vector size
> > > but some targets are happy with element alignment.  The following
> > > fixes that.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.
> > 
> > I've noticed this to regresses GCN target as follows:
> > 
> >     PASS: gcc.dg/vect/bb-slp-pr78205.c (test for excess errors)
> >     PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "optimized: basic block" 3
> >     PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1
> >     [-PASS:-]{+FAIL:+} gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times optimized " = c\\[4\\];" 1
> > 
> > As so often, I've got no clue whether that's a vectorizer, GCN back end,
> > or test case issue.  ;-)
> > 
> > 'diff'ing before vs. after:
> > 
> >     --- bb-slp-pr78205.c.191t.slp2        2023-12-20 09:49:45.834344620 +0100
> >     +++ bb-slp-pr78205.c.191t.slp2        2023-12-20 09:10:14.706300941 +0100
> >     [...]
> >     @@ -505,8 +505,9 @@
> >      [...]/bb-slp-pr78205.c:9:8: note: create vector_type-pointer variable to type: vector(4) double  vectorizing a pointer ref: c[0]
> >      [...]/bb-slp-pr78205.c:9:8: note: created &c[0]
> >      [...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
> >     -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_20 = MEM <vector(4) double> [(double *)&c + 32B];
> >     -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> >     +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: _20 = MEM[(double *)&c + 32B];
> >     +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_21 = {_20, 0.0, 0.0, 0.0};
> >     +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> >      [...]/bb-slp-pr78205.c:9:8: note: ------>vectorizing SLP node starting from: a[0] = _1;
> >      [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[0], type of def: internal
> >      [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[1], type of def: internal
> >     [...]
> >     @@ -537,9 +538,10 @@
> >      [...]/bb-slp-pr78205.c:13:8: note: transform load. ncopies = 1
> >      [...]/bb-slp-pr78205.c:13:8: note: create vector_type-pointer variable to type: vector(4) double  vectorizing a pointer ref: c[2]
> >      [...]/bb-slp-pr78205.c:13:8: note: created &c[2]
> >     -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_23 = MEM <vector(4) double> [(double *)&c];
> >     -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_24 = MEM <vector(4) double> [(double *)&c + 32B];
> >     -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>;
> >     +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_24 = MEM <vector(4) double> [(double *)&c];
> >     +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: _25 = MEM[(double *)&c + 32B];
> >     +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_26 = {_25, 0.0, 0.0, 0.0};
> >     +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>;
> >      [...]/bb-slp-pr78205.c:13:8: note: ------>vectorizing SLP node starting from: b[0] = _3;
> >      [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[2], type of def: internal
> >      [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[3], type of def: internal
> >     [...]
> >     @@ -580,18 +582,22 @@
> >        double _4;
> >        double _5;
> >        vector(2) double _17;
> >     +  double _20;
> >     +  double _25;
> > 
> >        <bb 2> [local count: 1073741824]:
> >        vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
> >     -  vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> >     +  _20 = MEM[(double *)&c + 32B];
> >     +  vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> >        _1 = c[0];
> >        _2 = c[1];
> >     -  MEM <vector(4) double> [(double *)&a] = vect__1.9_21;
> >     -  vect__3.14_23 = MEM <vector(4) double> [(double *)&c];
> >     -  vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>;
> >     +  MEM <vector(4) double> [(double *)&a] = vect__1.9_22;
> >     +  vect__3.14_24 = MEM <vector(4) double> [(double *)&c];
> >     +  _25 = MEM[(double *)&c + 32B];
> 
> that looks like a noop (but it's odd we keep the unused load)
> 
> >     +  vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>;
> >        _3 = c[2];
> >        _4 = c[3];
> >     -  MEM <vector(4) double> [(double *)&b] = vect__1.16_25;
> >     +  MEM <vector(4) double> [(double *)&b] = vect__1.16_27;
> >        _5 = c[4];
> >        _17 = {_5, _5};
> >        MEM <vector(2) double> [(double *)&x] = _17;
> > 
> >     --- bb-slp-pr78205.c.265t.optimized   2023-12-20 09:49:45.838344586 +0100
> >     +++ bb-slp-pr78205.c.265t.optimized   2023-12-20 09:10:14.706300941 +0100
> >     @@ -6,17 +6,17 @@
> >        vector(4) double vect__1.16;
> >        vector(4) double vect__1.9;
> >        vector(4) double vect__1.7;
> >     -  double _5;
> >        vector(2) double _17;
> >     +  double _20;
> > 
> >        <bb 2> [local count: 1073741824]:
> >        vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
> >     -  vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
> >     -  MEM <vector(4) double> [(double *)&a] = vect__1.9_21;
> >     -  vect__1.16_25 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>;
> >     -  MEM <vector(4) double> [(double *)&b] = vect__1.16_25;
> >     -  _5 = c[4];
> >     -  _17 = {_5, _5};
> >     +  _20 = MEM[(double *)&c + 32B];
> 
> that looks similar in the end, but trades c[4] for MEM here.  That's
> because we CSEd c[4] with the "dead" load above.
> 
> I think we could simply remove that extra scan-tree-dump-times, we
> should instead look to _not_ see a vector load starting at c[4],
> but scan-tree-dump-not is also not very reliable (in a false negative
> sense).
> 
> I'll see what that "dead" load is.

Should be fixed by r14-6748-ga8f0278ade1353

Richard.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] tree-optimization/113073 - amend PR112736 fix
  2023-12-20  9:50 ` Thomas Schwinge
@ 2023-12-20 10:21   ` Richard Biener
  2023-12-20 13:44     ` Richard Biener
  0 siblings, 1 reply; 5+ messages in thread
From: Richard Biener @ 2023-12-20 10:21 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: Andrew Stubbs, Julian Brown, gcc-patches

On Wed, 20 Dec 2023, Thomas Schwinge wrote:

> Hi!
> 
> On 2023-12-19T13:30:58+0100, Richard Biener <rguenther@suse.de> wrote:
> > The PR112736 testcase fails on RISC-V because the aligned exception
> > uses the wrong check.  The alignment support scheme can be
> > dr_aligned even when the access isn't aligned to the vector size
> > but some targets are happy with element alignment.  The following
> > fixes that.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.
> 
> I've noticed this to regresses GCN target as follows:
> 
>     PASS: gcc.dg/vect/bb-slp-pr78205.c (test for excess errors)
>     PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "optimized: basic block" 3
>     PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1
>     [-PASS:-]{+FAIL:+} gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times optimized " = c\\[4\\];" 1
> 
> As so often, I've got no clue whether that's a vectorizer, GCN back end,
> or test case issue.  ;-)
> 
> 'diff'ing before vs. after:
> 
>     --- bb-slp-pr78205.c.191t.slp2        2023-12-20 09:49:45.834344620 +0100
>     +++ bb-slp-pr78205.c.191t.slp2        2023-12-20 09:10:14.706300941 +0100
>     [...]
>     @@ -505,8 +505,9 @@
>      [...]/bb-slp-pr78205.c:9:8: note: create vector_type-pointer variable to type: vector(4) double  vectorizing a pointer ref: c[0]
>      [...]/bb-slp-pr78205.c:9:8: note: created &c[0]
>      [...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
>     -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_20 = MEM <vector(4) double> [(double *)&c + 32B];
>     -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
>     +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: _20 = MEM[(double *)&c + 32B];
>     +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_21 = {_20, 0.0, 0.0, 0.0};
>     +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
>      [...]/bb-slp-pr78205.c:9:8: note: ------>vectorizing SLP node starting from: a[0] = _1;
>      [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[0], type of def: internal
>      [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[1], type of def: internal
>     [...]
>     @@ -537,9 +538,10 @@
>      [...]/bb-slp-pr78205.c:13:8: note: transform load. ncopies = 1
>      [...]/bb-slp-pr78205.c:13:8: note: create vector_type-pointer variable to type: vector(4) double  vectorizing a pointer ref: c[2]
>      [...]/bb-slp-pr78205.c:13:8: note: created &c[2]
>     -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_23 = MEM <vector(4) double> [(double *)&c];
>     -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_24 = MEM <vector(4) double> [(double *)&c + 32B];
>     -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>;
>     +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_24 = MEM <vector(4) double> [(double *)&c];
>     +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: _25 = MEM[(double *)&c + 32B];
>     +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_26 = {_25, 0.0, 0.0, 0.0};
>     +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>;
>      [...]/bb-slp-pr78205.c:13:8: note: ------>vectorizing SLP node starting from: b[0] = _3;
>      [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[2], type of def: internal
>      [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[3], type of def: internal
>     [...]
>     @@ -580,18 +582,22 @@
>        double _4;
>        double _5;
>        vector(2) double _17;
>     +  double _20;
>     +  double _25;
> 
>        <bb 2> [local count: 1073741824]:
>        vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
>     -  vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
>     +  _20 = MEM[(double *)&c + 32B];
>     +  vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
>        _1 = c[0];
>        _2 = c[1];
>     -  MEM <vector(4) double> [(double *)&a] = vect__1.9_21;
>     -  vect__3.14_23 = MEM <vector(4) double> [(double *)&c];
>     -  vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>;
>     +  MEM <vector(4) double> [(double *)&a] = vect__1.9_22;
>     +  vect__3.14_24 = MEM <vector(4) double> [(double *)&c];
>     +  _25 = MEM[(double *)&c + 32B];

that looks like a noop (but it's odd we keep the unused load)

>     +  vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>;
>        _3 = c[2];
>        _4 = c[3];
>     -  MEM <vector(4) double> [(double *)&b] = vect__1.16_25;
>     +  MEM <vector(4) double> [(double *)&b] = vect__1.16_27;
>        _5 = c[4];
>        _17 = {_5, _5};
>        MEM <vector(2) double> [(double *)&x] = _17;
> 
>     --- bb-slp-pr78205.c.265t.optimized   2023-12-20 09:49:45.838344586 +0100
>     +++ bb-slp-pr78205.c.265t.optimized   2023-12-20 09:10:14.706300941 +0100
>     @@ -6,17 +6,17 @@
>        vector(4) double vect__1.16;
>        vector(4) double vect__1.9;
>        vector(4) double vect__1.7;
>     -  double _5;
>        vector(2) double _17;
>     +  double _20;
> 
>        <bb 2> [local count: 1073741824]:
>        vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
>     -  vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
>     -  MEM <vector(4) double> [(double *)&a] = vect__1.9_21;
>     -  vect__1.16_25 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>;
>     -  MEM <vector(4) double> [(double *)&b] = vect__1.16_25;
>     -  _5 = c[4];
>     -  _17 = {_5, _5};
>     +  _20 = MEM[(double *)&c + 32B];

that looks similar in the end, but trades c[4] for MEM here.  That's
because we CSEd c[4] with the "dead" load above.

I think we could simply remove that extra scan-tree-dump-times, we
should instead look to _not_ see a vector load starting at c[4],
but scan-tree-dump-not is also not very reliable (in a false negative
sense).

I'll see what that "dead" load is.

>     +  vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
>     +  MEM <vector(4) double> [(double *)&a] = vect__1.9_22;
>     +  vect__1.16_27 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>;
>     +  MEM <vector(4) double> [(double *)&b] = vect__1.16_27;
>     +  _17 = {_20, _20};
>        MEM <vector(2) double> [(double *)&x] = _17;
>        return;
>     --- bb-slp-pr78205.s  2023-12-20 09:49:45.846344519 +0100
>     +++ bb-slp-pr78205.s  2023-12-20 09:10:14.722300807 +0100
>     @@ -41,7 +41,17 @@
>             v_addc_co_u32   v7, s[22:23], 0, v7, s[22:23]
>             flat_load_dwordx2       v[6:7], v[6:7] offset:0
>             s_waitcnt       0
>     +       v_writelane_b32 v8, s12, 0
>     +       v_writelane_b32 v9, s13, 0
>     +       s_mov_b64       exec, 1
>     +       v_add_co_u32    v8, vcc, 32, v8
>     +       v_addc_co_u32   v9, vcc, 0, v9, vcc
>     +       flat_load_dwordx2       v[8:9], v[8:9]
>     +       s_waitcnt       0
>     +       v_readlane_b32  s12, v8, 0
>     +       v_readlane_b32  s13, v9, 0
>             s_mov_b64       s[18:19], 10
>     +       s_mov_b64       exec, 15
>             v_cndmask_b32   v0, 0, 4, s[18:19]
>             s_mov_b64       exec, 15
>             v_mov_b32       v11, v7
>     @@ -73,15 +83,6 @@
>             v_mov_b32       v5, s19
>             v_addc_co_u32   v5, s[22:23], 0, v5, s[22:23]
>             flat_store_dwordx2      v[4:5], v[6:7] offset:0
>     -       v_writelane_b32 v4, s12, 0
>     -       v_writelane_b32 v5, s13, 0
>     -       s_mov_b64       exec, 1
>     -       v_add_co_u32    v4, vcc, 32, v4
>     -       v_addc_co_u32   v5, vcc, 0, v5, vcc
>     -       flat_load_dwordx2       v[4:5], v[4:5]
>     -       s_waitcnt       0
>     -       v_readlane_b32  s12, v4, 0
>     -       v_readlane_b32  s13, v5, 0
>             s_mov_b64       exec, 3
>             v_mov_b32       v6, s12
>             v_mov_b32       v7, s13
> 
> I haven't looked at the full context, but this appears to effectively
> just move this block of code, and use different registers.
> 
> In:
> 
>     +       s_mov_b64       exec, 15
>             v_cndmask_b32   v0, 0, 4, s[18:19]
>             s_mov_b64       exec, 15
> 
> ... isn't the second (pre-existing) 's_mov_b64 exec, 15' now redundant,
> though?

I have no idea about that.

> 
> Gr??e
>  Thomas
> 
> 
> >       PR tree-optimization/113073
> >       * tree-vect-stmts.cc (vectorizable_load): Properly ensure
> >       to exempt only vector-size aligned overreads.
> > ---
> >  gcc/tree-vect-stmts.cc | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index fc6923cf68a..e9ff728dfd4 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -11476,7 +11476,9 @@ vectorizable_load (vec_info *vinfo,
> >                                     - (group_size * vf - gap), nunits))
> >                         /* DR will be unused.  */
> >                         ltype = NULL_TREE;
> > -                     else if (alignment_support_scheme == dr_aligned)
> > +                     else if (known_ge (vect_align,
> > +                                        tree_to_poly_uint64
> > +                                          (TYPE_SIZE_UNIT (vectype))))
> >                         /* Aligned access to excess elements is OK if
> >                            at least one element is accessed in the
> >                            scalar loop.  */
> -----------------
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstra?e 201, 80634 M?nchen; Gesellschaft mit beschr?nkter Haftung; Gesch?ftsf?hrer: Thomas Heurung, Frank Th?rauf; Sitz der Gesellschaft: M?nchen; Registergericht M?nchen, HRB 106955
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] tree-optimization/113073 - amend PR112736 fix
       [not found] <20231219123224.3D481385E459@sourceware.org>
@ 2023-12-20  9:50 ` Thomas Schwinge
  2023-12-20 10:21   ` Richard Biener
  0 siblings, 1 reply; 5+ messages in thread
From: Thomas Schwinge @ 2023-12-20  9:50 UTC (permalink / raw)
  To: Richard Biener, Andrew Stubbs, Julian Brown; +Cc: gcc-patches

Hi!

On 2023-12-19T13:30:58+0100, Richard Biener <rguenther@suse.de> wrote:
> The PR112736 testcase fails on RISC-V because the aligned exception
> uses the wrong check.  The alignment support scheme can be
> dr_aligned even when the access isn't aligned to the vector size
> but some targets are happy with element alignment.  The following
> fixes that.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

I've noticed this to regresses GCN target as follows:

    PASS: gcc.dg/vect/bb-slp-pr78205.c (test for excess errors)
    PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "optimized: basic block" 3
    PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1
    [-PASS:-]{+FAIL:+} gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times optimized " = c\\[4\\];" 1

As so often, I've got no clue whether that's a vectorizer, GCN back end,
or test case issue.  ;-)

'diff'ing before vs. after:

    --- bb-slp-pr78205.c.191t.slp2        2023-12-20 09:49:45.834344620 +0100
    +++ bb-slp-pr78205.c.191t.slp2        2023-12-20 09:10:14.706300941 +0100
    [...]
    @@ -505,8 +505,9 @@
     [...]/bb-slp-pr78205.c:9:8: note: create vector_type-pointer variable to type: vector(4) double  vectorizing a pointer ref: c[0]
     [...]/bb-slp-pr78205.c:9:8: note: created &c[0]
     [...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
    -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_20 = MEM <vector(4) double> [(double *)&c + 32B];
    -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
    +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: _20 = MEM[(double *)&c + 32B];
    +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_21 = {_20, 0.0, 0.0, 0.0};
    +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
     [...]/bb-slp-pr78205.c:9:8: note: ------>vectorizing SLP node starting from: a[0] = _1;
     [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[0], type of def: internal
     [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[1], type of def: internal
    [...]
    @@ -537,9 +538,10 @@
     [...]/bb-slp-pr78205.c:13:8: note: transform load. ncopies = 1
     [...]/bb-slp-pr78205.c:13:8: note: create vector_type-pointer variable to type: vector(4) double  vectorizing a pointer ref: c[2]
     [...]/bb-slp-pr78205.c:13:8: note: created &c[2]
    -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_23 = MEM <vector(4) double> [(double *)&c];
    -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_24 = MEM <vector(4) double> [(double *)&c + 32B];
    -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>;
    +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_24 = MEM <vector(4) double> [(double *)&c];
    +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: _25 = MEM[(double *)&c + 32B];
    +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_26 = {_25, 0.0, 0.0, 0.0};
    +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>;
     [...]/bb-slp-pr78205.c:13:8: note: ------>vectorizing SLP node starting from: b[0] = _3;
     [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[2], type of def: internal
     [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[3], type of def: internal
    [...]
    @@ -580,18 +582,22 @@
       double _4;
       double _5;
       vector(2) double _17;
    +  double _20;
    +  double _25;

       <bb 2> [local count: 1073741824]:
       vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
    -  vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
    +  _20 = MEM[(double *)&c + 32B];
    +  vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
       _1 = c[0];
       _2 = c[1];
    -  MEM <vector(4) double> [(double *)&a] = vect__1.9_21;
    -  vect__3.14_23 = MEM <vector(4) double> [(double *)&c];
    -  vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>;
    +  MEM <vector(4) double> [(double *)&a] = vect__1.9_22;
    +  vect__3.14_24 = MEM <vector(4) double> [(double *)&c];
    +  _25 = MEM[(double *)&c + 32B];
    +  vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>;
       _3 = c[2];
       _4 = c[3];
    -  MEM <vector(4) double> [(double *)&b] = vect__1.16_25;
    +  MEM <vector(4) double> [(double *)&b] = vect__1.16_27;
       _5 = c[4];
       _17 = {_5, _5};
       MEM <vector(2) double> [(double *)&x] = _17;

    --- bb-slp-pr78205.c.265t.optimized   2023-12-20 09:49:45.838344586 +0100
    +++ bb-slp-pr78205.c.265t.optimized   2023-12-20 09:10:14.706300941 +0100
    @@ -6,17 +6,17 @@
       vector(4) double vect__1.16;
       vector(4) double vect__1.9;
       vector(4) double vect__1.7;
    -  double _5;
       vector(2) double _17;
    +  double _20;

       <bb 2> [local count: 1073741824]:
       vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
    -  vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
    -  MEM <vector(4) double> [(double *)&a] = vect__1.9_21;
    -  vect__1.16_25 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>;
    -  MEM <vector(4) double> [(double *)&b] = vect__1.16_25;
    -  _5 = c[4];
    -  _17 = {_5, _5};
    +  _20 = MEM[(double *)&c + 32B];
    +  vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
    +  MEM <vector(4) double> [(double *)&a] = vect__1.9_22;
    +  vect__1.16_27 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>;
    +  MEM <vector(4) double> [(double *)&b] = vect__1.16_27;
    +  _17 = {_20, _20};
       MEM <vector(2) double> [(double *)&x] = _17;
       return;

    --- bb-slp-pr78205.s  2023-12-20 09:49:45.846344519 +0100
    +++ bb-slp-pr78205.s  2023-12-20 09:10:14.722300807 +0100
    @@ -41,7 +41,17 @@
            v_addc_co_u32   v7, s[22:23], 0, v7, s[22:23]
            flat_load_dwordx2       v[6:7], v[6:7] offset:0
            s_waitcnt       0
    +       v_writelane_b32 v8, s12, 0
    +       v_writelane_b32 v9, s13, 0
    +       s_mov_b64       exec, 1
    +       v_add_co_u32    v8, vcc, 32, v8
    +       v_addc_co_u32   v9, vcc, 0, v9, vcc
    +       flat_load_dwordx2       v[8:9], v[8:9]
    +       s_waitcnt       0
    +       v_readlane_b32  s12, v8, 0
    +       v_readlane_b32  s13, v9, 0
            s_mov_b64       s[18:19], 10
    +       s_mov_b64       exec, 15
            v_cndmask_b32   v0, 0, 4, s[18:19]
            s_mov_b64       exec, 15
            v_mov_b32       v11, v7
    @@ -73,15 +83,6 @@
            v_mov_b32       v5, s19
            v_addc_co_u32   v5, s[22:23], 0, v5, s[22:23]
            flat_store_dwordx2      v[4:5], v[6:7] offset:0
    -       v_writelane_b32 v4, s12, 0
    -       v_writelane_b32 v5, s13, 0
    -       s_mov_b64       exec, 1
    -       v_add_co_u32    v4, vcc, 32, v4
    -       v_addc_co_u32   v5, vcc, 0, v5, vcc
    -       flat_load_dwordx2       v[4:5], v[4:5]
    -       s_waitcnt       0
    -       v_readlane_b32  s12, v4, 0
    -       v_readlane_b32  s13, v5, 0
            s_mov_b64       exec, 3
            v_mov_b32       v6, s12
            v_mov_b32       v7, s13

I haven't looked at the full context, but this appears to effectively
just move this block of code, and use different registers.

In:

    +       s_mov_b64       exec, 15
            v_cndmask_b32   v0, 0, 4, s[18:19]
            s_mov_b64       exec, 15

... isn't the second (pre-existing) 's_mov_b64 exec, 15' now redundant,
though?


Grüße
 Thomas


>       PR tree-optimization/113073
>       * tree-vect-stmts.cc (vectorizable_load): Properly ensure
>       to exempt only vector-size aligned overreads.
> ---
>  gcc/tree-vect-stmts.cc | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index fc6923cf68a..e9ff728dfd4 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -11476,7 +11476,9 @@ vectorizable_load (vec_info *vinfo,
>                                     - (group_size * vf - gap), nunits))
>                         /* DR will be unused.  */
>                         ltype = NULL_TREE;
> -                     else if (alignment_support_scheme == dr_aligned)
> +                     else if (known_ge (vect_align,
> +                                        tree_to_poly_uint64
> +                                          (TYPE_SIZE_UNIT (vectype))))
>                         /* Aligned access to excess elements is OK if
>                            at least one element is accessed in the
>                            scalar loop.  */
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-12-20 14:31 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-19 12:30 [PATCH] tree-optimization/113073 - amend PR112736 fix Richard Biener
     [not found] <20231219123224.3D481385E459@sourceware.org>
2023-12-20  9:50 ` Thomas Schwinge
2023-12-20 10:21   ` Richard Biener
2023-12-20 13:44     ` Richard Biener
2023-12-20 14:31       ` Thomas Schwinge

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).