From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/109072] [12/13 Regression] SLP costs for vec duplicate too high since g:4963079769c99c4073adfd799885410ad484cbbe
Date: Fri, 10 Mar 2023 07:40:17 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: rsandifo at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 12.3

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109072

--- Comment #7 from Richard Biener ---
(In reply to rsandifo@gcc.gnu.org from comment #1)
> (In reply to Tamar Christina from comment #0)
> > The SLP costs went from:
> > 
> > Vector cost: 2
> > Scalar cost: 4
> > 
> > to:
> > 
> > Vector cost: 12
> > Scalar cost: 4
> > 
> > it looks like it's no longer costing it as a duplicate but instead 4 vec
> > inserts.
> We do cost it as a duplicate, but we only try to vectorize up to
> the stores, rather than up to the load back.  So we're costing
> the difference between:
> 
>         fmov    s1, s0
>         stp     s1, s1, [x0]
>         stp     s1, s1, [x0, 8]
> 
> (no idea why we have an fmov, pretend we don't) and:
> 
>         fmov    s1, s0
>         dup     v1.4s, v1.s[0]
>         str     q1, [x0]
> 
> If we want the latter as a general principle, the PR is
> easy to fix.  But if we don't, we'd need to make the
> vectoriser start at the load or (alternatively) fold
> to a constructor independently of vectorisation.

Just to clarify, the vectorizer sees

  [local count: 1073741824]:
  data[0] = res_2(D);
  data[1] = res_2(D);
  data[2] = res_2(D);
  data[3] = res_2(D);
  _7 = MEM <__Float32x4_t> [(float * {ref-all})&data];
  data ={v} {CLOBBER(eol)};
  return _7;

and indeed the SLP vectorizer does not consider vector typed loads as
"sinks" to start SLP discovery from.  We could handle those the same as
a CONSTRUCTOR, but then SLP discovery isn't prepared to follow "memory
edges" (for must-aliases).

The question here would be whether, for example, SRA could have elided
'data', materializing the vector load as a CONSTRUCTOR (I also have an
old VN patch that would do this, but it has profitability issues, so I
never pushed it).

Whatever you do with cost heuristics, you'll find a testcase where that
regresses.
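For reference, a minimal C function consistent with the GIMPLE above would look something like the following.  This is my reconstruction, not the exact testcase from the PR; the name `splat_via_memory` is made up, and a GNU `vector_size` type stands in for the AArch64 `__Float32x4_t` so the sketch is target-independent:

```c
/* Hypothetical reconstruction: four scalar stores of the same value
   followed by a vector-typed load back from the same stack slot.
   In the GIMPLE dump above the load shows up as
   _7 = MEM <__Float32x4_t> [(float * {ref-all})&data];  */
typedef float v4sf __attribute__((vector_size(16)));

v4sf splat_via_memory(float res)
{
    float data[4];
    data[0] = res;   /* the SLP vectorizer discovers these stores ...  */
    data[1] = res;
    data[2] = res;
    data[3] = res;

    v4sf out;
    __builtin_memcpy(&out, data, sizeof out);  /* ... but not the
                                                  vector load back */
    return out;
}
```

Whether this compiles to a `dup` plus a single `str q` or to four scalar stores and a reload is exactly what the cost comparison in the quoted comment decides.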