From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
Date: Fri, 05 Mar 2021 10:43:47 +0000
X-Bugzilla-Keywords: missed-optimization, ra
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Assigned-To: rguenth at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 11.0

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #26 from Richard Biener ---
(In reply to rguenther@suse.de from comment #25)
> On Fri, 5 Mar 2021, ubizjak at gmail dot com wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
> >
> > --- Comment #24 from Uroš Bizjak ---
> > (In reply to Richard Biener from comment #22)
> > > I guess the idea of this insn setup was exactly to get IRA/LRA choose
> > > the optimal instruction sequence - otherwise exposing the reload so
> > > late is probably suboptimal.
> >
> > There is one more tool in the toolbox. A peephole2 pattern can be
> > conditionalized on an available XMM register. So, if an XMM reg is
> > available, the GPR->XMM move can be emitted in front of the insn. So, if
> > there is XMM register pressure, pinsrd will be used, but if an XMM
> > register is available, it will be reused to emit punpcklqdq.
> >
> > The peephole2 pattern can also be conditionalized for targets where
> > GPR->XMM moves are fast.
>
> Note the trick is esp. important when GPR->XMM moves are _slow_. But only
> in the case we originally combine two GPR operands. Doing two
> GPR->XMM moves and then one punpcklqdq hides half of the latency of the
> slow moves since they have no data dependence on each other. So for the
> peephole we should try to match this - a reloaded operand and a GPR
> operand. When the %xmm operand results from an SSE computation there's
> no point in splitting out a GPR->XMM move.
>
> So in the end a peephole2 sounds like it could better match the condition
> the transform is profitable on.

I tried

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index db5be59f5b7..8d0d3077cf8 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -1419,6 +1419,23 @@
   DONE;
 })
 
+(define_peephole2
+  [(set (match_operand:DI 0 "sse_reg_operand")
+       (match_operand:DI 1 "general_gr_operand"))
+   (match_scratch:DI 2 "sse_reg_operand")
+   (set (match_operand:V2DI 2 "sse_reg_operand")
+       (vec_concat:V2DI (match_dup:DI 0)
+                        (match_operand:DI 3 "general_gr_operand")))]
+  "reload_completed"
+  [(set (match_dup 0)
+        (match_dup 1))
+   (set (match_dup 2)
+        (match_dup 3))
+   (set (match_dup 2)
+        (vec_concat:V2DI (match_dup 0)
+                         (match_dup 2)))]
+  "")
+
 ;; Merge movsd/movhpd to movupd for TARGET_SSE_UNALIGNED_LOAD_OPTIMAL targets.
 (define_peephole2
   [(set (match_operand:V2DF 0 "sse_reg_operand")

but that doesn't seem to match for some unknown reason.
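
The latency argument quoted from comment #25 can be illustrated with a toy critical-path calculation. This is only a sketch under stated assumptions, not a measurement: the cycle counts MOVE and SHUF are hypothetical placeholders, the operation names are labels chosen for illustration, and the insert instruction is modeled as a single operation that performs its own GPR->XMM transfer only after its XMM input is ready (real cores may split it into micro-ops and behave differently).

```python
def finish_times(ops):
    """Return the completion cycle of every operation.

    ops maps a name to (latency, dependencies) and must list each
    operation after the operations it depends on.
    """
    done = {}
    for name, (latency, deps) in ops.items():
        # An operation starts once all of its inputs are available.
        start = max((done[d] for d in deps), default=0)
        done[name] = start + latency
    return done


def critical_path(ops):
    """Cycle at which the last operation of the sequence completes."""
    return max(finish_times(ops).values())


# Hypothetical latencies in cycles (placeholders, not measured values).
MOVE = 6  # GPR->XMM move on a target where such moves are slow
SHUF = 1  # register-to-register shuffle such as punpcklqdq

# Move one GPR, then insert the other GPR into the same XMM register.
# Assumption: the insert pays for a GPR->XMM transfer plus a shuffle
# and cannot begin before the first move has completed.
serial = {
    "movq_lo": (MOVE, []),
    "pinsr_hi": (MOVE + SHUF, ["movq_lo"]),
}

# Two independent GPR->XMM moves followed by one punpcklqdq.
parallel = {
    "movq_lo": (MOVE, []),
    "movq_hi": (MOVE, []),
    "punpcklqdq": (SHUF, ["movq_lo", "movq_hi"]),
}

serial_cycles = critical_path(serial)
parallel_cycles = critical_path(parallel)
print("serial:", serial_cycles, "parallel:", parallel_cycles)
```

Under this model the two sequences differ by exactly one MOVE, which corresponds to the "half of the latency of the slow moves" mentioned in the quoted comment. The model says nothing about register pressure, which is the other side of the trade-off discussed in comment #24.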