From mboxrd@z Thu Jan  1 00:00:00 1970
From: Thomas Schwinge <thomas@codesourcery.com>
To: Richard Biener <rguenther@suse.de>, Andrew Stubbs <ams@codesourcery.com>,
	Julian Brown <julian@codesourcery.com>
CC: <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] tree-optimization/113073 - amend PR112736 fix
In-Reply-To: <20231219123224.3D481385E459@sourceware.org>
References: <20231219123224.3D481385E459@sourceware.org>
User-Agent: Notmuch/0.29.3+94~g74c3f1b (https://notmuchmail.org) Emacs/28.2
 (x86_64-pc-linux-gnu)
Date: Wed, 20 Dec 2023 10:50:07 +0100
Message-ID: <871qbh4300.fsf@euler.schwinge.homeip.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
List-Id: <gcc-patches.gcc.gnu.org>

Hi!

On 2023-12-19T13:30:58+0100, Richard Biener <rguenther@suse.de> wrote:
> The PR112736 testcase fails on RISC-V because the aligned exception
> uses the wrong check.  The alignment support scheme can be
> dr_aligned even when the access isn't aligned to the vector size
> but some targets are happy with element alignment.  The following
> fixes that.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, pushed.

I've noticed that this regresses the GCN target as follows:

    PASS: gcc.dg/vect/bb-slp-pr78205.c (test for excess errors)
    PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "optimized: basic block" 3
    PASS: gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times slp2 "BB vectorization with gaps at the end of a load is not supported" 1
    [-PASS:-]{+FAIL:+} gcc.dg/vect/bb-slp-pr78205.c scan-tree-dump-times optimized " = c\\[4\\];" 1

As so often, I've got no clue whether that's a vectorizer, GCN back end,
or test case issue.  ;-)

'diff'ing before vs. after:

    --- bb-slp-pr78205.c.191t.slp2        2023-12-20 09:49:45.834344620 +0100
    +++ bb-slp-pr78205.c.191t.slp2        2023-12-20 09:10:14.706300941 +0100
    [...]
    @@ -505,8 +505,9 @@
     [...]/bb-slp-pr78205.c:9:8: note: create vector_type-pointer variable to type: vector(4) double  vectorizing a pointer ref: c[0]
     [...]/bb-slp-pr78205.c:9:8: note: created &c[0]
     [...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
    -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_20 = MEM <vector(4) double> [(double *)&c + 32B];
    -[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
    +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: _20 = MEM[(double *)&c + 32B];
    +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.8_21 = {_20, 0.0, 0.0, 0.0};
    +[...]/bb-slp-pr78205.c:9:8: note: add new stmt: vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
     [...]/bb-slp-pr78205.c:9:8: note: ------>vectorizing SLP node starting from: a[0] = _1;
     [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[0], type of def: internal
     [...]/bb-slp-pr78205.c:9:8: note: vect_is_simple_use: operand c[1], type of def: internal
    [...]
    @@ -537,9 +538,10 @@
     [...]/bb-slp-pr78205.c:13:8: note: transform load. ncopies = 1
     [...]/bb-slp-pr78205.c:13:8: note: create vector_type-pointer variable to type: vector(4) double  vectorizing a pointer ref: c[2]
     [...]/bb-slp-pr78205.c:13:8: note: created &c[2]
    -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_23 = MEM <vector(4) double> [(double *)&c];
    -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_24 = MEM <vector(4) double> [(double *)&c + 32B];
    -[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>;
    +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.14_24 = MEM <vector(4) double> [(double *)&c];
    +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: _25 = MEM[(double *)&c + 32B];
    +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__3.15_26 = {_25, 0.0, 0.0, 0.0};
    +[...]/bb-slp-pr78205.c:13:8: note: add new stmt: vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>;
     [...]/bb-slp-pr78205.c:13:8: note: ------>vectorizing SLP node starting from: b[0] = _3;
     [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[2], type of def: internal
     [...]/bb-slp-pr78205.c:13:8: note: vect_is_simple_use: operand c[3], type of def: internal
    [...]
    @@ -580,18 +582,22 @@
       double _4;
       double _5;
       vector(2) double _17;
    +  double _20;
    +  double _25;

       <bb 2> [local count: 1073741824]:
       vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
    -  vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
    +  _20 = MEM[(double *)&c + 32B];
    +  vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
       _1 = c[0];
       _2 = c[1];
    -  MEM <vector(4) double> [(double *)&a] = vect__1.9_21;
    -  vect__3.14_23 = MEM <vector(4) double> [(double *)&c];
    -  vect__1.16_25 = VEC_PERM_EXPR <vect__3.14_23, vect__3.14_23, { 2, 3, 2, 3 }>;
    +  MEM <vector(4) double> [(double *)&a] = vect__1.9_22;
    +  vect__3.14_24 = MEM <vector(4) double> [(double *)&c];
    +  _25 = MEM[(double *)&c + 32B];
    +  vect__1.16_27 = VEC_PERM_EXPR <vect__3.14_24, vect__3.14_24, { 2, 3, 2, 3 }>;
       _3 = c[2];
       _4 = c[3];
    -  MEM <vector(4) double> [(double *)&b] = vect__1.16_25;
    +  MEM <vector(4) double> [(double *)&b] = vect__1.16_27;
       _5 = c[4];
       _17 = {_5, _5};
       MEM <vector(2) double> [(double *)&x] = _17;

    --- bb-slp-pr78205.c.265t.optimized   2023-12-20 09:49:45.838344586 +0100
    +++ bb-slp-pr78205.c.265t.optimized   2023-12-20 09:10:14.706300941 +0100
    @@ -6,17 +6,17 @@
       vector(4) double vect__1.16;
       vector(4) double vect__1.9;
       vector(4) double vect__1.7;
    -  double _5;
       vector(2) double _17;
    +  double _20;

       <bb 2> [local count: 1073741824]:
       vect__1.7_19 = MEM <vector(4) double> [(double *)&c];
    -  vect__1.9_21 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
    -  MEM <vector(4) double> [(double *)&a] = vect__1.9_21;
    -  vect__1.16_25 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>;
    -  MEM <vector(4) double> [(double *)&b] = vect__1.16_25;
    -  _5 = c[4];
    -  _17 = {_5, _5};
    +  _20 = MEM[(double *)&c + 32B];
    +  vect__1.9_22 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 0, 1, 0, 1 }>;
    +  MEM <vector(4) double> [(double *)&a] = vect__1.9_22;
    +  vect__1.16_27 = VEC_PERM_EXPR <vect__1.7_19, vect__1.7_19, { 2, 3, 2, 3 }>;
    +  MEM <vector(4) double> [(double *)&b] = vect__1.16_27;
    +  _17 = {_20, _20};
       MEM <vector(2) double> [(double *)&x] = _17;
       return;
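
One possible reading of the FAIL (an inference from the dumps, not something confirmed in this thread): the scalar load of the fifth element is still present in the optimized dump, but it is now spelled '_20 = MEM[(double *)&c + 32B];' instead of '_5 = c[4];', so the scan pattern for ' = c[4];' no longer matches.  The minimal C sketch below only checks that the two spellings name the same location.  The declaration 'double c[5]' and both helper names are assumptions made for illustration; the dumps only show that c[0] to c[4] are read.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the testcase declares something like 'double c[5]'.  */
static double c[5];

/* Index of the element that starts BYTE_OFFSET bytes into c, or -1 if
   the offset does not fall on an element boundary.  */
static int
element_index_at (size_t byte_offset)
{
  if (byte_offset % sizeof (double) != 0)
    return -1;
  return (int) (byte_offset / sizeof (double));
}

/* Nonzero if 'MEM[(double *)&c + BYTE_OFFSET B]' and 'c[INDEX]' refer to
   the same memory location.  */
static int
same_location (size_t byte_offset, size_t index)
{
  assert (index < sizeof c / sizeof c[0]);
  return (const char *) c + byte_offset == (const char *) &c[index];
}
```

With an 8-byte double, element_index_at (32) yields 4 and same_location (32, 4) holds, which would be consistent with the test pattern needing an update rather than the generated code being wrong.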

    --- bb-slp-pr78205.s  2023-12-20 09:49:45.846344519 +0100
    +++ bb-slp-pr78205.s  2023-12-20 09:10:14.722300807 +0100
    @@ -41,7 +41,17 @@
            v_addc_co_u32   v7, s[22:23], 0, v7, s[22:23]
            flat_load_dwordx2       v[6:7], v[6:7] offset:0
            s_waitcnt       0
    +       v_writelane_b32 v8, s12, 0
    +       v_writelane_b32 v9, s13, 0
    +       s_mov_b64       exec, 1
    +       v_add_co_u32    v8, vcc, 32, v8
    +       v_addc_co_u32   v9, vcc, 0, v9, vcc
    +       flat_load_dwordx2       v[8:9], v[8:9]
    +       s_waitcnt       0
    +       v_readlane_b32  s12, v8, 0
    +       v_readlane_b32  s13, v9, 0
            s_mov_b64       s[18:19], 10
    +       s_mov_b64       exec, 15
            v_cndmask_b32   v0, 0, 4, s[18:19]
            s_mov_b64       exec, 15
            v_mov_b32       v11, v7
    @@ -73,15 +83,6 @@
            v_mov_b32       v5, s19
            v_addc_co_u32   v5, s[22:23], 0, v5, s[22:23]
            flat_store_dwordx2      v[4:5], v[6:7] offset:0
    -       v_writelane_b32 v4, s12, 0
    -       v_writelane_b32 v5, s13, 0
    -       s_mov_b64       exec, 1
    -       v_add_co_u32    v4, vcc, 32, v4
    -       v_addc_co_u32   v5, vcc, 0, v5, vcc
    -       flat_load_dwordx2       v[4:5], v[4:5]
    -       s_waitcnt       0
    -       v_readlane_b32  s12, v4, 0
    -       v_readlane_b32  s13, v5, 0
            s_mov_b64       exec, 3
            v_mov_b32       v6, s12
            v_mov_b32       v7, s13

I haven't looked at the full context, but this appears to effectively
just move this block of code, and use different registers.

In:

    +       s_mov_b64       exec, 15
            v_cndmask_b32   v0, 0, 4, s[18:19]
            s_mov_b64       exec, 15

... isn't the second (pre-existing) 's_mov_b64 exec, 15' now redundant,
though?
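
To make the question concrete, here is a toy C model of the check, not the GCN back end's actual logic: it walks a straight-line instruction sequence and counts writes to EXEC that store the value EXEC already holds.  The struct and function names are hypothetical, and the model assumes that instructions such as 'v_cndmask_b32' leave EXEC untouched and that no label or branch sits between the two writes.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy instruction: either writes an immediate to EXEC, or does something
   else that is assumed to leave EXEC alone.  Hypothetical model only.  */
struct insn
{
  int writes_exec;      /* nonzero for 's_mov_b64 exec, <imm>' */
  uint64_t exec_value;  /* immediate written when writes_exec is set */
};

/* Count EXEC writes that store the value EXEC is already known to hold.
   Assumes straight-line code: no labels or branches inside SEQ.  */
static size_t
count_redundant_exec_writes (const struct insn *seq, size_t n)
{
  int known = 0;
  uint64_t current = 0;
  size_t redundant = 0;

  for (size_t i = 0; i < n; i++)
    {
      if (!seq[i].writes_exec)
        continue;
      if (known && current == seq[i].exec_value)
        redundant++;
      known = 1;
      current = seq[i].exec_value;
    }
  return redundant;
}
```

Fed the sequence 'exec = 1', other, 'exec = 15', other, 'exec = 15', the function reports one redundant write, which matches the suspicion about the second 's_mov_b64 exec, 15'.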


Grüße
 Thomas


>       PR tree-optimization/113073
>       * tree-vect-stmts.cc (vectorizable_load): Properly ensure
>       to exempt only vector-size aligned overreads.
> ---
>  gcc/tree-vect-stmts.cc | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index fc6923cf68a..e9ff728dfd4 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -11476,7 +11476,9 @@ vectorizable_load (vec_info *vinfo,
>                                     - (group_size * vf - gap), nunits))
>                         /* DR will be unused.  */
>                         ltype = NULL_TREE;
> -                     else if (alignment_support_scheme == dr_aligned)
> +                     else if (known_ge (vect_align,
> +                                        tree_to_poly_uint64
> +                                          (TYPE_SIZE_UNIT (vectype))))
>                         /* Aligned access to excess elements is OK if
>                            at least one element is accessed in the
>                            scalar loop.  */
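
As a rough illustration of why the patch compares the known alignment against the vector size: the usual argument is that a full-vector load from a vector-size-aligned address cannot straddle a page boundary, so reading excess elements is harmless once one element of that vector is known to be accessed.  The C sketch below is a simplified stand-in, not the vectorizer's code.  Both helper names are hypothetical, all quantities are assumed to be in bytes, and the page size is assumed to be a power of two no smaller than the vector size.

```c
#include <stdint.h>

/* Nonzero if a VEC_BYTES-wide load starting at ADDR stays within one
   page of PAGE_BYTES bytes.  */
static int
load_stays_in_page (uintptr_t addr, uintptr_t vec_bytes,
                    uintptr_t page_bytes)
{
  return addr / page_bytes == (addr + vec_bytes - 1) / page_bytes;
}

/* Simplified stand-in for the patched condition: the over-read is only
   exempted when the known alignment is at least the vector size.  */
static int
overread_exempt (uintptr_t known_align_bytes, uintptr_t vec_bytes)
{
  return known_align_bytes >= vec_bytes;
}
```

For a 32-byte vector and 4096-byte pages, a load at offset 4064 (32-byte aligned) stays within the page, while a load at offset 4088 (only element aligned) crosses into the next one; correspondingly overread_exempt (32, 32) holds and overread_exempt (8, 32) does not.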