From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48) id 70A8E383B80B; Mon, 16 Aug 2021 09:43:38 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 70A8E383B80B
From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/101929] r12-2549 regress x264_r by 4% on CLX.
Date: Mon, 16 Aug 2021 09:43:37 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: cc
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 Aug 2021 09:43:38 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101929

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #3 from Richard Biener ---
It's interesting to note that in

-  _820 = {_187, _189, _187, _189};
-  vect_t2_188.65_821 = VIEW_CONVERT_EXPR(_820);
-  vect__200.67_823 = vect_t0_184.64_819 - vect_t2_188.65_821;
-  vect__191.66_822 = vect_t0_184.64_819 + vect_t2_188.65_821;
-  _824 = VEC_PERM_EXPR ;

we only need parts of the CTOR for the add/sub parts (because we ignore some
lanes with the blend).
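To make the "only parts of the CTOR are needed" point concrete, here is a
small scalar C model of the sequence. It is only a sketch: the type v4si,
the helper names, and in particular the blend selector { 0, 1, 6, 7 } are
assumptions made for illustration (the actual permute mask is not shown
above). Under that assumed selector, the add only contributes lanes 0-1 and
the sub only lanes 2-3, so each CTOR element is consumed once per operation.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for a 4-lane int vector (hypothetical helper type). */
typedef struct { int32_t lane[4]; } v4si;

/* Model of a two-input permute: selector values 0..3 pick lanes of a,
   values 4..7 pick lanes of b. */
static v4si vec_perm(v4si a, v4si b, const int sel[4])
{
    v4si r;
    for (int i = 0; i < 4; i++) {
        assert(sel[i] >= 0 && sel[i] < 8);
        r.lane[i] = sel[i] < 4 ? a.lane[sel[i]] : b.lane[sel[i] - 4];
    }
    return r;
}

/* Full-width form: build the whole CTOR, add and subtract on all four
   lanes, then blend the two results. */
static v4si addsub_full(v4si t0, int32_t x, int32_t y)
{
    static const int sel[4] = { 0, 1, 6, 7 };  /* ASSUMED blend mask */
    v4si t2 = { { x, y, x, y } };              /* models {_187, _189, _187, _189} */
    v4si sum, diff;
    for (int i = 0; i < 4; i++) {
        sum.lane[i]  = t0.lane[i] + t2.lane[i];
        diff.lane[i] = t0.lane[i] - t2.lane[i];
    }
    return vec_perm(sum, diff, sel);
}

/* Same result, computing only the lanes that survive the assumed blend:
   the add needs just the low half of the CTOR, the sub just the high half. */
static v4si addsub_needed_lanes(v4si t0, int32_t x, int32_t y)
{
    v4si r;
    r.lane[0] = t0.lane[0] + x;
    r.lane[1] = t0.lane[1] + y;
    r.lane[2] = t0.lane[2] - x;
    r.lane[3] = t0.lane[3] - y;
    return r;
}
```

Both functions return the same four lanes for any input that does not
overflow; the second one just never materializes the lanes the blend
throws away.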
Needing only those lanes might even allow eliding the final compose of the
low/high part and expose some more insn parallelism.  Of course that looks
quite difficult to achieve.

--

Note your CTOR cost estimates might be off given the CTORs are mostly regular
like

{ _181, _181, _181, _181, _262, _262, _262, _262, _343, _343, _343, _343, _48,
_48, _48, _48 }

and thus could use 4 splats to xmm and 4 inserts?

For the V4SI vectorization we unfortunately decide to do

t.c:37:9: note:   Using a splat of the uniform operand
t.c:37:9: note:   Using a splat of the uniform operand
t.c:37:9: note:   Building parent vector operands from scalars instead

and thus end up with { _49, _50, _49, _50 }.

That said, I don't think the backend gets easy access to the actual CTOR
layout yet to improve costing (similar to permutes and the actual permute
mask).

--

It's difficult (if not impossible) for the vectorizer to second-guess the
followup FRE; we're a long way from doing loop + SLP vectorization in one go
and discovering that we can elide the vector store.
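As a rough illustration of the "4 splats to xmm and 4 inserts" idea for the
regular 16-element CTOR above, here is a scalar C sketch. It is not backend
or vectorizer code: all function names are hypothetical, and it compares
element values where GCC would compare SSA names. It checks whether every
4-lane chunk of a CTOR is uniform and, if so, builds the full vector from
one splat and one insert per chunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { LANES = 16, CHUNK = 4 };   /* 16 x int32, built from 4-lane (xmm-sized) chunks */

/* True if every CHUNK-lane group of the constructor repeats one element,
   e.g. { a, a, a, a, b, b, b, b, ... }. */
static bool ctor_is_chunkwise_uniform(const int32_t elts[LANES])
{
    for (size_t c = 0; c < LANES; c += CHUNK)
        for (size_t i = 1; i < CHUNK; i++)
            if (elts[c + i] != elts[c])
                return false;
    return true;
}

/* Models one splat: broadcast x into all lanes of a 4-lane chunk. */
static void splat4(int32_t out[CHUNK], int32_t x)
{
    for (size_t i = 0; i < CHUNK; i++)
        out[i] = x;
}

/* Models one insert: place a 4-lane chunk at position idx of the wide vector. */
static void insert_chunk(int32_t vec[LANES], const int32_t chunk[CHUNK], size_t idx)
{
    assert(idx < LANES / CHUNK);
    memcpy(vec + idx * CHUNK, chunk, sizeof(int32_t) * CHUNK);
}

/* Builds the wide vector with LANES / CHUNK splats and as many inserts,
   instead of LANES individual element inserts. */
static void build_by_splat_insert(int32_t vec[LANES], const int32_t vals[LANES / CHUNK])
{
    for (size_t c = 0; c < LANES / CHUNK; c++) {
        int32_t chunk[CHUNK];
        splat4(chunk, vals[c]);
        insert_chunk(vec, chunk, c);
    }
}
```

A CTOR shaped like { _49, _50, _49, _50 } fails the uniformity check in this
model, which is the case where the cheaper splat-plus-insert sequence would
not apply as written.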