From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id DAB583858C5E; Thu,  9 Mar 2023 14:30:14 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org DAB583858C5E
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1678372214;
	bh=ZPo55lrPPEEQwTuM0Js8dCBFAjVHdE9X/p5xoRsy3oE=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=Bcq6Vz1Eb7VDKIxbAPSaovzDWtfJSHkRnde8dG4pweKnEiVeGi58D7nw0akriIB1y
	 EW5+4XJQ6/dJiKHUIlcvKh/x3m9mTOXnfyH+C68krQEDby1p6edi+kgOFDIkIA1wK5
	 1+ZaL6Ao6mrnEIwM82a0T+uf9wtoJbl0UqouuGqI=
From: "tnfchris at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/109072] [12/13 Regression] SLP costs for vec duplicate
 too high since g:4963079769c99c4073adfd799885410ad484cbbe
Date: Thu, 09 Mar 2023 14:30:14 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: tnfchris at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 12.3
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-109072-4-pHA0G7edgS@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-109072-4@http.gcc.gnu.org/bugzilla/>
References: <bug-109072-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109072
--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to rsandifo@gcc.gnu.org from comment #1)
> (In reply to Tamar Christina from comment #0)
> > The SLP costs went from:
> >
> >   Vector cost: 2
> >   Scalar cost: 4
> >
> > to:
> >
> >   Vector cost: 12
> >   Scalar cost: 4
> >
> > it looks like it's no longer costing it as a duplicate but instead 4 vec
> > inserts.
> We do cost it as a duplicate, but we only try to vectorize up to
> the stores, rather than up to the load back.  So we're costing
> the difference between:
>
>         fmov    s1, s0
>         stp     s1, s1, [x0]
>         stp     s1, s1, [x0, 8]
>
> (no idea why we have an fmov, pretend we don't) and:
>
>         fmov    s1, s0
>         dup     v1.4s, v1.s[0]
>         str     q1, [x0]
>
> If we want the latter as a general principle, the PR is
> easy to fix.  But if we don't, we'd need to make the
> vectoriser start at the load or (alternatively) fold
> to a constructor independently of vectorisation.

I thought the SLP algorithm was bottom up and stores were
already sinks?  So is this maybe a bug?

Ah, guess there are two problems.

1. How did we end up with such poor scalar code? At least 5 instructions are
unneeded (separate issue).
2. The costing of the above. I guess I'm still slightly confused about how we
got to that cost.

If it's costing purely on latency, then the two are equivalent, no? If you take
throughput into account the first would win, but the difference in costs is
still a lot higher than I would have expected.
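
As a rough illustration of why the jump from 2 to 12 matters, here is a minimal
sketch, assuming the profitability check boils down to "vector cost must be
strictly below scalar cost". The function name is hypothetical and this is not
GCC's actual code; the numbers are the ones quoted from comment #0.

```c
#include <stdbool.h>

/* Simplified model (an assumption, not GCC's implementation):
   vectorize only when the total vector cost is strictly lower
   than the scalar cost of the same statements.  */
bool
slp_profitable_p (unsigned vector_cost, unsigned scalar_cost)
{
  return vector_cost < scalar_cost;
}
```

Under this model the old costs (vector 2, scalar 4) are profitable and the new
ones (vector 12, scalar 4) are not.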

In this case:

node 0x4f45480 1 times scalar_to_vec costs 4 in prologue

seems quite high, but I guess it doesn't know that there's no regfile transfer?
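
For reference, a minimal sketch of the kind of code being discussed, assuming
the testcase reduces to four stores of the same scalar followed by a load back
of the whole block. The names `vec4f` and `splat_via_memory` are hypothetical,
and plain C with `memcpy` stands in for whatever vector load the real testcase
uses.

```c
#include <string.h>

/* Hypothetical stand-in for a 4 x float vector type.  */
typedef struct { float lane[4]; } vec4f;

vec4f
splat_via_memory (float x)
{
  float tmp[4];

  /* Four adjacent stores of the same scalar: the group of stores
     that SLP would start from.  */
  tmp[0] = x;
  tmp[1] = x;
  tmp[2] = x;
  tmp[3] = x;

  /* The "load back": read the four elements as one block.  */
  vec4f out;
  memcpy (out.lane, tmp, sizeof tmp);
  return out;
}
```

If vectorisation stops at the stores, the choice is between the two stp and the
dup + str sequences quoted above; the load back is not part of what gets
costed.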