From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 41CBF3851C3D; Fri, 13 Aug 2021 05:01:46 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 41CBF3851C3D
From: "law at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/101895] New: [11/12 Regression] SLP
 Vectorizer change pushes VEC_PERM_EXPR into bad location spoiling further
 optimization opportunities
Date: Fri, 13 Aug 2021 05:01:45 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 11.0
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: law at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter cc target_milestone
Message-ID: <bug-101895-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Fri, 13 Aug 2021 05:01:46 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101895

            Bug ID: 101895
           Summary: [11/12 Regression] SLP Vectorizer change pushes
                    VEC_PERM_EXPR into bad location spoiling further
                    optimization opportunities
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: law at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
  Target Milestone: ---

Consider this code:


void foo(int * restrict a, int b, int *c) {
  a[0] = c[0]*b + a[0];
  a[1] = c[2]*b + a[1];
  a[2] = c[1]*b + a[2];
  a[3] = c[3]*b + a[3];
}

Prior to this commit:

commit 126ed72b9f48f8530b194532cc281fb761690435
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Sep 30 17:08:01 2020 +0200

    optimize permutes in SLP, remove vect_attempt_slp_rearrange_stmts

    This introduces a permute optimization phase for SLP which is
    intended to cover the existing permute eliding for SLP reductions
    plus handling commonizing the easy cases.

    It currently uses graphds to compute a postorder on the reverse
    SLP graph and it handles all cases vect_attempt_slp_rearrange_stmts
    did (hopefully - I've adjusted most testcases that triggered it
    a few days ago).  It restricts itself to move around bijective
    permutations to simplify things for now, mainly around constant nodes.

    As a prerequesite it makes the SLP graph cyclic (ugh).  It looks
    like it would pay off to compute a PRE/POST order visit array
    once and elide all the recursive SLP graph walks and their
    visited hash-set.  At least for the time where we do not change
    the SLP graph during such walk.

    I do not like using graphds too much but at least I don't have to
    re-implement yet another RPO walk, so maybe it isn't too bad.

    It now computes permute placement during iteration and thus should
    get cycles more obviously correct.

[ ... ]

GCC would generate this (x86_64 -O3 -march=native):

  vect__1.6_27 = VEC_PERM_EXPR <vect__1.5_25, vect__1.5_25, { 0, 2, 1, 3 }>;
  vect__2.7_29 = vect__1.6_27 * _28;
  _1 = *c_18(D);
  _2 = _1 * b_19(D);
  vectp.9_30 = a_20(D);
  vect__3.10_31 = MEM <vector(4) int> [(int *)vectp.9_30];
  vect__4.11_32 = vect__2.7_29 + vect__3.10_31;


This is good.  Note how the VEC_PERM_EXPR happens before the vector multiply
and how the vector multiply directly feeds the vector add.  On our target we
have a vector multiply-add which would be generated and all is good.


After the above change we generate this:

  vect__2.6_28 = vect__1.5_25 * _27;
  _29 = VEC_PERM_EXPR <vect__2.6_28, vect__2.6_28, { 0, 2, 1, 3 }>;
  _1 = *c_18(D);
  _2 = _1 * b_19(D);
  vectp.8_30 = a_20(D);
  vect__3.9_31 = MEM <vector(4) int> [(int *)vectp.8_30];
  vect__4.10_32 = _29 + vect__3.9_31;


Note how we have the vmul, then permute, then vadd.  This spoils our ability to
generate a vmadd.  This behavior is still seen on the trunk as well.

Conceptually it seems to me that having a permute at the start or end of a
chain of vector operations is better than moving the permute into the middle of
a chain of dependent vector operations.
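
To make the point about placement concrete, here is a rough scalar sketch (not
GCC internals) of the two orderings.  It assumes the permute can be modelled as
a plain index shuffle; the names permute4, foo_perm_first, foo_perm_middle and
self_check are made up for illustration.  Because multiplying every lane by the
same scalar commutes with a lane permutation, both orderings compute the same
result, so hoisting the permute to the start of the chain should be safe here
and it leaves the multiply feeding the add directly.

```c
#include <assert.h>
#include <string.h>

/* Scalar reference: the function from the report. */
static void foo_ref(int *restrict a, int b, const int *c) {
  a[0] = c[0] * b + a[0];
  a[1] = c[2] * b + a[1];
  a[2] = c[1] * b + a[2];
  a[3] = c[3] * b + a[3];
}

/* Hypothetical stand-in for VEC_PERM_EXPR <v, v, sel> on 4 lanes. */
static void permute4(int *out, const int *v, const int *sel) {
  for (int i = 0; i < 4; i++)
    out[i] = v[sel[i]];
}

/* Permute first: the multiply result feeds the add directly,
   which is the shape a multiply-add pattern wants. */
static void foo_perm_first(int *a, int b, const int *c) {
  static const int sel[4] = { 0, 2, 1, 3 };
  int p[4];
  permute4(p, c, sel);
  for (int i = 0; i < 4; i++)
    a[i] = p[i] * b + a[i];
}

/* Permute in the middle: multiply, then permute, then add. */
static void foo_perm_middle(int *a, int b, const int *c) {
  static const int sel[4] = { 0, 2, 1, 3 };
  int m[4], p[4];
  for (int i = 0; i < 4; i++)
    m[i] = c[i] * b;
  permute4(p, m, sel);
  for (int i = 0; i < 4; i++)
    a[i] = p[i] + a[i];
}

/* Returns 1 if both placements match the scalar reference. */
int placements_agree(const int *a0, int b, const int *c) {
  int r[4], x[4], y[4];
  memcpy(r, a0, sizeof r);
  memcpy(x, a0, sizeof x);
  memcpy(y, a0, sizeof y);
  foo_ref(r, b, c);
  foo_perm_first(x, b, c);
  foo_perm_middle(y, b, c);
  return memcmp(r, x, sizeof r) == 0 && memcmp(r, y, sizeof r) == 0;
}

/* Small self-check on arbitrary sample data. */
void self_check(void) {
  const int a0[4] = { 1, 2, 3, 4 };
  const int c[4] = { 10, 20, 30, 40 };
  assert(placements_agree(a0, 3, c));
}
```

This only shows that the two placements are equivalent for this testcase; it
says nothing about how the SLP permute optimization should pick between them.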

We could probably fix this in the backend with some special patterns, but ISTM
that getting it right in SLP would be better.