From: "tnfchris at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/109072] [12/13 Regression] SLP costs for vec duplicate too high since g:4963079769c99c4073adfd799885410ad484cbbe
Date: Thu, 09 Mar 2023 15:04:48 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109072

--- Comment #4 from Tamar Christina ---

(In reply to rsandifo@gcc.gnu.org from comment #3)
> (In reply to Tamar Christina from comment #2)
> > I thought the SLP algorithm was bottom up and stores were
> > already sinks?
> Yeah, they are. But the point is that we're vectorising
> the stores in isolation, with no knowledge of what happens
> later.
> The reason the code here is particularly bad is
> that the array is later loaded into a vector. But the
> vectoriser doesn't know that.

Ah right, you meant use the loads as the seeds. Yeah, that makes sense.

> > Ah, guess there are two problems.
> >
> > 1. How did we end up with such poor scalar code? At least 5
> >    instructions are unneeded (separate issue).
> > 2. The costing of the above: I guess I'm still slightly confused how
> >    we got to that cost.
> The patch that introduced the regression uses an on-the-side costing
> scheme for store sequences. If it thinks that the scalar code is
> better, it manipulates the vector body cost so that the body is twice
> as expensive as the scalar body. The prologue cost (1 for the
> scalar_to_vec) is then added on top.

Ah, that makes sense.

> > If it's costing purely on latency, then the two are equivalent, no? If
> > you take throughput into account, the first would win, but the
> > difference in costs is still a lot higher than I would have expected.
> >
> > In this case:
> >
> >   node 0x4f45480 1 times scalar_to_vec costs 4 in prologue
> >
> > seems quite high, but I guess it doesn't know that there's no regfile
> > transfer?
> Which -mcpu/-mtune are you using? For generic it's 1 rather than 4
> (so that the vector cost is 9 rather than 12, although still
> higher than the scalar cost).

I was using neoverse-v1, which looks like it matches neoverse-n2 with a
scalar_to_vec cost of 4, but neoverse-n1 has 6. That really seems
excessive.