From: "tnfchris at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/104408] New: SLP discovery fails due to -Ofast rewriting
Date: Sun, 06 Feb 2022 09:13:14 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408

            Bug ID: 104408
           Summary: SLP discovery fails due to -Ofast rewriting
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following testcase:

typedef struct { float r, i; } cf;

void
f (cf *restrict a, cf *restrict b, cf *restrict c, cf *restrict d, cf e)
{
  for (int i = 0; i < 100; ++i)
    {
      b[i].r = e.r * (c[i].r - d[i].r) - e.i * (c[i].i - d[i].i);
      b[i].i = e.r * (c[i].i - d[i].i) + e.i * (c[i].r - d[i].r);
    }
}

when compiled at -O3 forms an SLP tree but fails at -Ofast because match.pd
rewrites the expression into

      b[i].r = e.r * (c[i].r - d[i].r) + e.i * (d[i].i - c[i].i);
      b[i].i = e.r * (c[i].i - d[i].i) + e.i * (c[i].r - d[i].r);

and so introduces a different interleaving in the second multiply operation.

It's unclear to me what the gain of doing this is, as it results in worse
vector and scalar code because the computed values of the nodes can no longer
be shared.

Without the rewriting, the first code can re-use the load from the first vector
and just reverse the elements:

.L2:
        ldr     q1, [x3, x0]
        ldr     q0, [x2, x0]
        fsub    v0.4s, v0.4s, v1.4s
        fmul    v1.4s, v2.4s, v0.4s
        fmul    v0.4s, v3.4s, v0.4s
        rev64   v1.4s, v1.4s
        fneg    v0.2d, v0.2d
        fadd    v0.4s, v0.4s, v1.4s
        str     q0, [x1, x0]
        add     x0, x0, 16
        cmp     x0, 800
        bne     .L2

With the rewrite, it forces an increase in VF to be able to handle the
interleaving:

.L2:
        ld2     {v0.4s - v1.4s}, [x3], 32
        ld2     {v4.4s - v5.4s}, [x2], 32
        fsub    v2.4s, v1.4s, v5.4s
        fsub    v3.4s, v4.4s, v0.4s
        fsub    v5.4s, v5.4s, v1.4s
        fmul    v2.4s, v2.4s, v6.4s
        fmul    v4.4s, v6.4s, v3.4s
        fmla    v2.4s, v7.4s, v3.4s
        fmla    v4.4s, v5.4s, v7.4s
        mov     v0.16b, v2.16b
        mov     v1.16b, v4.16b
        st2     {v0.4s - v1.4s}, [x1], 32
        cmp     x5, x1
        bne     .L2

In scalar code you lose the ability to re-use the subtract, so you get an
extra sub.
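To make the lost sharing concrete, here is a minimal sketch that writes both forms out with the differences named explicitly. The function names f_shared and f_rewritten, the length parameter n, and the temporaries dr, di and di_neg are hypothetical and only for illustration; the second function simply mirrors the rewritten expression quoted in the report and is not taken from match.pd itself.

```c
#include <stddef.h>

typedef struct { float r, i; } cf;

/* Form as written in the testcase: both outputs are built from the
   same two differences, dr and di.  */
void
f_shared (cf *restrict b, const cf *restrict c, const cf *restrict d,
          cf e, size_t n)
{
  for (size_t k = 0; k < n; ++k)
    {
      float dr = c[k].r - d[k].r;
      float di = c[k].i - d[k].i;
      b[k].r = e.r * dr - e.i * di;
      b[k].i = e.r * di + e.i * dr;
    }
}

/* Form quoted in the report after the rewrite: the real part now uses
   (d.i - c.i) while the imaginary part still uses (c.i - d.i), so the
   source expresses three distinct differences instead of two.  */
void
f_rewritten (cf *restrict b, const cf *restrict c, const cf *restrict d,
             cf e, size_t n)
{
  for (size_t k = 0; k < n; ++k)
    {
      float dr = c[k].r - d[k].r;
      float di = c[k].i - d[k].i;
      float di_neg = d[k].i - c[k].i;   /* only needed by the rewritten form */
      b[k].r = e.r * dr + e.i * di_neg;
      b[k].i = e.r * di + e.i * dr;
    }
}
```

Both functions compute the same values for ordinary finite inputs; the difference is only in how many subtractions the expression asks for and whether the two output lanes draw on the same operands.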