From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/109072] [12/13 Regression] SLP costs for vec duplicate too high since g:4963079769c99c4073adfd799885410ad484cbbe
Date: Fri, 10 Mar 2023 07:40:17 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: rsandifo at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 12.3

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109072

--- Comment #7 from Richard Biener ---
(In reply to rsandifo@gcc.gnu.org from comment #1)
> (In reply to Tamar Christina from comment #0)
> > The SLP costs went from:
> > 
> > Vector cost: 2
> > Scalar cost: 4
> > 
> > to:
> > 
> > Vector cost: 12
> > Scalar cost: 4
> > 
> > it looks like it's no longer costing it as a duplicate but instead 4 vec
> > inserts.
> We do cost it as a duplicate, but we only try to vectorize up to
> the stores, rather than up to the load back.  So we're costing
> the difference between:
> 
>         fmov    s1, s0
>         stp     s1, s1, [x0]
>         stp     s1, s1, [x0, 8]
> 
> (no idea why we have an fmov, pretend we don't) and:
> 
>         fmov    s1, s0
>         dup     v1.4s, v1.s[0]
>         str     q1, [x0]
> 
> If we want the latter as a general principle, the PR is
> easy to fix.  But if we don't, we'd need to make the
> vectoriser start at the load or (alternatively) fold
> to a constructor independently of vectorisation.

Just to clarify, the vectorizer sees

  [local count: 1073741824]:
  data[0] = res_2(D);
  data[1] = res_2(D);
  data[2] = res_2(D);
  data[3] = res_2(D);
  _7 = MEM <__Float32x4_t> [(float * {ref-all})&data];
  data ={v} {CLOBBER(eol)};
  return _7;

and indeed the SLP vectorizer does not consider vector typed loads as
"sinks" to start SLP discovery from.  We could handle those the same as
a CONSTRUCTOR, but then SLP discovery isn't prepared to follow "memory
edges" (for must-aliases).

The question here would be whether, for example, SRA could have elided
'data', materializing the vector load as a CONSTRUCTOR (I also have an
old VN patch that would do this, but it has profitability issues, so I
never pushed it).

Whatever you do with cost heuristics, you'll find a testcase where that
regresses.
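For reference, a minimal C function consistent with the GIMPLE above would look something like the following.  This is my reconstruction, not the exact testcase from the PR; the name `splat_via_memory` is made up, and a GNU `vector_size` type stands in for the AArch64 `__Float32x4_t` so the sketch is target-independent:

```c
/* Hypothetical reconstruction: four scalar stores of the same value
   followed by a vector-typed load back from the same stack slot.
   In the GIMPLE dump above the load shows up as
   _7 = MEM <__Float32x4_t> [(float * {ref-all})&data];  */
typedef float v4sf __attribute__((vector_size(16)));

v4sf splat_via_memory(float res)
{
    float data[4];
    data[0] = res;   /* the SLP vectorizer discovers these stores ...  */
    data[1] = res;
    data[2] = res;
    data[3] = res;

    v4sf out;
    __builtin_memcpy(&out, data, sizeof out);  /* ... but not the
                                                  vector load back */
    return out;
}
```

Whether this compiles to a `dup` plus a single `str q` or to four scalar stores and a reload is exactly what the cost comparison in the quoted comment decides.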