From: "law at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/101895] New: [11/12 Regression] SLP
 Vectorizer change pushes VEC_PERM_EXPR into bad location spoiling further
 optimization opportunities
Date: Fri, 13 Aug 2021 05:01:45 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101895

            Bug ID: 101895
           Summary: [11/12 Regression] SLP Vectorizer change pushes
                    VEC_PERM_EXPR into bad location spoiling further
                    optimization opportunities
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: law at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
  Target Milestone: ---

Consider this code:

void foo(int * restrict a, int b, int *c) {
    a[0] = c[0]*b + a[0];
    a[1] = c[2]*b + a[1];
    a[2] = c[1]*b + a[2];
    a[3] = c[3]*b + a[3];
}

Prior to this commit:

commit 126ed72b9f48f8530b194532cc281fb761690435
Author: Richard Biener
Date:   Wed Sep 30 17:08:01 2020 +0200

    optimize permutes in SLP, remove vect_attempt_slp_rearrange_stmts

    This introduces a permute optimization phase for SLP which is
    intended to cover the existing permute eliding for SLP reductions
    plus handling commonizing the easy cases.

    It currently uses graphds to compute a postorder on the reverse
    SLP graph and it handles all cases vect_attempt_slp_rearrange_stmts
    did (hopefully - I've adjusted most testcases that triggered it
    a few days ago).  It restricts itself to move around bijective
    permutations to simplify things for now, mainly around constant nodes.

    As a prerequesite it makes the SLP graph cyclic (ugh).  It looks
    like it would pay off to compute a PRE/POST order visit array
    once and elide all the recursive SLP graph walks and their
    visited hash-set.  At least for the time where we do not change
    the SLP graph during such walk.

    I do not like using graphds too much but at least I don't have to
    re-implement yet another RPO walk, so maybe it isn't too bad.

    It now computes permute placement during iteration and thus should get
    cycles more obviously correct.

[ ... ]

GCC would generate this (x86_64 -O3 -march=native):

  vect__1.6_27 = VEC_PERM_EXPR ;
  vect__2.7_29 = vect__1.6_27 * _28;
  _1 = *c_18(D);
  _2 = _1 * b_19(D);
  vectp.9_30 = a_20(D);
  vect__3.10_31 = MEM [(int *)vectp.9_30];
  vect__4.11_32 = vect__2.7_29 + vect__3.10_31;

This is good.  Note how the VEC_PERM_EXPR happens before the vector
multiply and how the vector multiply directly feeds the vector add.  On our
target we have a vector multiply-add which would be generated and all is
good.

After the above change we generate this:

  vect__2.6_28 = vect__1.5_25 * _27;
  _29 = VEC_PERM_EXPR ;
  _1 = *c_18(D);
  _2 = _1 * b_19(D);
  vectp.8_30 = a_20(D);
  vect__3.9_31 = MEM [(int *)vectp.8_30];
  vect__4.10_32 = _29 + vect__3.9_31;

Note how we have the vmul, then permute, then vadd.  This spoils our
ability to generate a vmadd.

This behavior is still seen on the trunk as well.

Conceptually it seems to me that having a permute at the start or end of a
chain of vector operations is better than moving the permute into the
middle of a chain of dependent vector operations.

We could probably fix this in the backend with some special patterns, but
ISTM that getting it right in SLP would be better.
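
To make the two placements concrete, here is a minimal scalar sketch in C. It is not GCC's SLP code and not the actual GIMPLE; the helper names (permute, permute_first, permute_in_middle, placements_agree, self_check) are hypothetical, and the lane selection {0, 2, 1, 3} is read off the testcase above. It only illustrates that a lane-wise multiply by a uniform scalar commutes with a permute, so both orders compute the same values, while only the permute-first order leaves the multiply feeding the add directly.

```c
#include <assert.h>
#include <string.h>

enum { LANES = 4 };

/* Lane selection taken from the testcase: a[i] uses c[SEL[i]]. */
const int SEL[LANES] = {0, 2, 1, 3};

/* Scalar stand-in for a vector permute: out[i] = in[SEL[i]]. */
void permute(int out[LANES], const int in[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = in[SEL[i]];
}

/* Permute at the start of the chain: the multiply feeds the add
   directly, which is the shape a multiply-add pattern wants. */
void permute_first(int a[LANES], int b, const int c[LANES])
{
    int p[LANES];
    permute(p, c);
    for (int i = 0; i < LANES; i++)
        a[i] = p[i] * b + a[i];
}

/* Permute in the middle: multiply, permute the product, then add.
   Same values, but the multiply no longer feeds the add. */
void permute_in_middle(int a[LANES], int b, const int c[LANES])
{
    int m[LANES], p[LANES];
    for (int i = 0; i < LANES; i++)
        m[i] = c[i] * b;
    permute(p, m);
    for (int i = 0; i < LANES; i++)
        a[i] = p[i] + a[i];
}

/* Returns 1 when both placements give identical results for the
   given inputs, 0 otherwise.  a0 is left unmodified. */
int placements_agree(const int a0[LANES], int b, const int c[LANES])
{
    int a1[LANES], a2[LANES];
    memcpy(a1, a0, sizeof a1);
    memcpy(a2, a0, sizeof a2);
    permute_first(a1, b, c);
    permute_in_middle(a2, b, c);
    return memcmp(a1, a2, sizeof a1) == 0;
}

/* Small self-check with arbitrary sample values. */
void self_check(void)
{
    const int a0[LANES] = {1, 2, 3, 4};
    const int c[LANES] = {10, 20, 30, 40};
    assert(placements_agree(a0, 3, c));
}
```

Since the two orders are interchangeable in value, the choice between them is purely a placement question, which is why it seems reasonable to prefer the start or end of the dependent chain.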