From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/99881] Regression compare -O2 -ftree-vectorize with -O2 on SKX/CLX
Date: Tue, 06 Apr 2021 11:44:22 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99881

--- Comment #5 from Richard Biener ---
(In reply to Hongtao.liu from comment #4)
> (In reply to Richard Biener from comment #3)
> > But 2 element construction _should_ be cheap. What is missing is the move
> > cost from GPR to XMM regs (but we do not have a good idea whether the sources
> > are memory, so it's not as clear-cut here either).
> >
> > IMHO a better approach might be to up unaligned vector store/load costs?
> >
> > For the testcase at hand why does a throughput of 1 pose a problem? There's
> > only one punpckldq instruction around?
> >
>
> There're several lea/add (which also may use port 5) instructions around
> punpckldq, considering that FAST LEA and Int ALU will be common in address
> computation, throughput of 1 for punpckldq will be a bottleneck.
>
> refer to https://godbolt.org/z/hK9r5vTzd for original case

Too bad. But this is starting to model resource constraints which are not
at all handled by the generic part of the vectorizer cost model. We kind-of
have the ability to do this in the target (see how rs6000 models some of this
in its finish_cost hook via rs6000_density_test). But then the cost model
suffers from quite some GIGO already and I fear adding complexity will only
produce more 'G'.

As you have seen you need quite some offset to make up for the saved store,
I think trying to get integer_to_sse costed for the movd/pinsrq would be
a better way than parametrizing 'vec_construct' (because there's no
vec_construct instruction - there's multiple pieces to it).

> > Note that for the case of non-loop vectorization of 'double' the two element
> > vector CTORs are common and important to handle cheaply. See also all the
> > discussion in PR98856