From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id EBB963858D1E; Sun,  6 Feb 2022 09:13:14 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org EBB963858D1E
From: "tnfchris at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/104408] New: SLP discovery fails due to
 -Ofast rewriting
Date: Sun, 06 Feb 2022 09:13:14 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: tnfchris at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 keywords bug_severity priority component assigned_to reporter
 target_milestone
Message-ID: <bug-104408-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Sun, 06 Feb 2022 09:13:15 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104408

            Bug ID: 104408
           Summary: SLP discovery fails due to -Ofast rewriting
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following testcase:

typedef struct { float r, i; } cf;
void
f (cf *restrict a, cf *restrict b, cf *restrict c, cf *restrict d, cf e)
{
  for (int i = 0; i < 100; ++i)
    {
      b[i].r = e.r * (c[i].r - d[i].r) - e.i * (c[i].i - d[i].i);
      b[i].i = e.r * (c[i].i - d[i].i) + e.i * (c[i].r - d[i].r);
    }
}

when compiled at -O3 forms an SLP tree, but SLP discovery fails at -Ofast
because match.pd rewrites the expression into

      b[i].r = e.r * (c[i].r - d[i].r) + e.i * (d[i].i - c[i].i);
      b[i].i = e.r * (c[i].i - d[i].i) + e.i * (c[i].r - d[i].r);

and so introduces a different interleaving in the second multiply operation.
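
To make the rewrite concrete, here is a minimal sketch of just the real part in
both forms. The function names are made up for illustration and are not
anything in GCC or match.pd; the point is only that the second operand of the
e.i multiply changes from (c.i - d.i) to (d.i - c.i), which is no longer the
same value the imaginary part uses.

```c
/* Real part as written in the testcase: subtracts e.i * (c.i - d.i).  */
static float
real_as_written (float er, float ei, float cr, float ci, float dr, float di)
{
  return er * (cr - dr) - ei * (ci - di);
}

/* Real part after the rewrite: adds e.i * (d.i - c.i) instead, so the
   operand order of the inner subtraction is swapped.  */
static float
real_rewritten (float er, float ei, float cr, float ci, float dr, float di)
{
  return er * (cr - dr) + ei * (di - ci);
}
```

For small exactly representable inputs the two functions return the same
value; what differs is the shape of the expression that SLP discovery sees.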

It's unclear to me what the gain of doing this is, as it results in worse
vector and scalar code because the computed values of the nodes can no longer
be shared.

Without the rewriting the first code can re-use the load from the first vector
and just reverse the elements:

.L2:
        ldr     q1, [x3, x0]
        ldr     q0, [x2, x0]
        fsub    v0.4s, v0.4s, v1.4s
        fmul    v1.4s, v2.4s, v0.4s
        fmul    v0.4s, v3.4s, v0.4s
        rev64   v1.4s, v1.4s
        fneg    v0.2d, v0.2d
        fadd    v0.4s, v0.4s, v1.4s
        str     q0, [x1, x0]
        add     x0, x0, 16
        cmp     x0, 800
        bne     .L2

With the rewrite, the VF has to increase to be able to handle the
interleaving:

.L2:
        ld2     {v0.4s - v1.4s}, [x3], 32
        ld2     {v4.4s - v5.4s}, [x2], 32
        fsub    v2.4s, v1.4s, v5.4s
        fsub    v3.4s, v4.4s, v0.4s
        fsub    v5.4s, v5.4s, v1.4s
        fmul    v2.4s, v2.4s, v6.4s
        fmul    v4.4s, v6.4s, v3.4s
        fmla    v2.4s, v7.4s, v3.4s
        fmla    v4.4s, v5.4s, v7.4s
        mov     v0.16b, v2.16b
        mov     v1.16b, v4.16b
        st2     {v0.4s - v1.4s}, [x1], 32
        cmp     x5, x1
        bne     .L2

In scalar code you lose the ability to re-use the subtract, so you get an extra
sub.
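
The scalar cost can be sketched by writing one loop iteration with explicit
temporaries. This is a hand-written illustration, not compiler output, and the
function and variable names are hypothetical. In the form as written, two
operand subtractions feed all four multiplies; in the rewritten form a third
subtraction is needed because (d.i - c.i) is a separate expression from
(c.i - d.i).

```c
typedef struct { float r, i; } cf;

/* Form as written: dr and di are each computed once and used twice.  */
static cf
scalar_shared (cf e, cf c, cf d)
{
  float dr = c.r - d.r;         /* operand sub 1 */
  float di = c.i - d.i;         /* operand sub 2 */
  cf b;
  b.r = e.r * dr - e.i * di;
  b.i = e.r * di + e.i * dr;
  return b;
}

/* Form after the rewrite: the real part needs its own swapped subtract.  */
static cf
scalar_rewritten (cf e, cf c, cf d)
{
  float dr = c.r - d.r;         /* operand sub 1 */
  float di = c.i - d.i;         /* operand sub 2 */
  float di_swapped = d.i - c.i; /* operand sub 3, not shared with di */
  cf b;
  b.r = e.r * dr + e.i * di_swapped;
  b.i = e.r * di + e.i * dr;
  return b;
}
```

Whether a given compiler turns the third subtract back into a negate of di is
not something this sketch shows; it only mirrors the two source-level forms.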