From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: 
Received: by sourceware.org (Postfix, from userid 48)
	id D50533858CDB; Thu, 9 Mar 2023 14:46:29 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org D50533858CDB
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1678373189;
	bh=R+P+OGo/ktgTIWNc6Atp7/GCUWJX0C9nRvIK0eWPjeI=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=xqkKIUKIY/Bas5Oujfs39ZAAnLwKeDFOKQiPGAPtPGOc/0V2QZuv3zU8m4T9cOiA2
	 URgxcaxt7LgVdPagLD/6IY6oJRbTTD6Q5HPZge4SvLoujlJUrMg/Ru7JwSjgUOCzem
	 4tk8L3X+m3AGNkupVFRujIb9Pwj1r2Y1Bq0t1Z9Y=
From: "rsandifo at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/109072] [12/13 Regression] SLP costs for vec duplicate
 too high since g:4963079769c99c4073adfd799885410ad484cbbe
Date: Thu, 09 Mar 2023 14:46:29 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rsandifo at gcc dot gnu.org
X-Bugzilla-Status: NEW
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 12.3
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: 

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109072

--- Comment #3 from rsandifo at gcc dot gnu.org ---
(In reply to Tamar Christina from comment #2)
> I thought the SLP algorithm was bottom up and stores were
> already sinks?
Yeah, they are.  But the point is that we're vectorising the stores in
isolation, with no knowledge of what happens later.  The reason the code
here is particularly bad is that the array is later loaded into a vector.
But the vectoriser doesn't know that.

> Ah, guess there are two problems.
>
> 1. how did we end up with such poor scalar code, at least 5 instructions are
> unneeded (separate issue)
> 2. The costing of the above, I guess I'm still slightly confused how we got
> to that cost
The patch that introduced the regression uses an on-the-side costing
scheme for store sequences.  If it thinks that the scalar code is better,
it manipulates the vector body cost so that the body is twice as
expensive as the scalar body.  The prologue cost (1 for the
scalar_to_vec) is then added on top.

> If it's costing purely on latency then the two are equivalent, no?  If you
> take throughput into account the first would win, but the difference in
> costs is still a lot higher than I would have expected.
>
> In this case:
>
> node 0x4f45480 1 times scalar_to_vec costs 4 in prologue
>
> seems quite high, but I guess it doesn't know that there's no regfile
> transfer?
Which -mcpu/-mtune are you using?  For generic it's 1 rather than 4
(so that the vector cost is 9 rather than 12, although still higher
than the scalar cost).