From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: 
Received: by sourceware.org (Postfix, from userid 48)
	id D50533858CDB; Thu, 9 Mar 2023 14:46:29 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org D50533858CDB
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1678373189;
	bh=R+P+OGo/ktgTIWNc6Atp7/GCUWJX0C9nRvIK0eWPjeI=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=xqkKIUKIY/Bas5Oujfs39ZAAnLwKeDFOKQiPGAPtPGOc/0V2QZuv3zU8m4T9cOiA2
	 URgxcaxt7LgVdPagLD/6IY6oJRbTTD6Q5HPZge4SvLoujlJUrMg/Ru7JwSjgUOCzem
	 4tk8L3X+m3AGNkupVFRujIb9Pwj1r2Y1Bq0t1Z9Y=
From: "rsandifo at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/109072] [12/13 Regression] SLP costs for vec duplicate
 too high since g:4963079769c99c4073adfd799885410ad484cbbe
Date: Thu, 09 Mar 2023 14:46:29 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rsandifo at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 12.3
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: 

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109072

--- Comment #3 from rsandifo at gcc dot gnu.org ---
(In reply to Tamar Christina from comment #2)
> I thought the SLP algorithm was bottom up and stores were
> already sinks?
Yeah, they are.  But the point is that we're vectorising the stores in
isolation, with no knowledge of what happens later.  The reason the code
here is particularly bad is that the array is later loaded into a vector.
But the vectoriser doesn't know that.

> Ah, guess there are two problems.
>
> 1. how did we end up with such poor scalar code, at least 5 instructions are
> unneeded (separate issue)
> 2. The costing of the above, I guess I'm still slightly confused how we got
> to that cost
The patch that introduced the regression uses an on-the-side costing
scheme for store sequences.  If it thinks that the scalar code is better,
it manipulates the vector body cost so that the body is twice as
expensive as the scalar body.  The prologue cost (1 for the
scalar_to_vec) is then added on top.

> If it's costing purely on latency then the two are equivalent, no?  If you
> take throughput into account the first would win, but the difference in
> costs is still a lot higher than I would have expected.
>
> In this case:
>
> node 0x4f45480 1 times scalar_to_vec costs 4 in prologue
>
> seems quite high, but I guess it doesn't know that there's no regfile
> transfer?
Which -mcpu/-mtune are you using?  For generic it's 1 rather than 4
(so that the vector cost is 9 rather than 12, although still higher
than the scalar cost).