From mboxrd@z Thu Jan  1 00:00:00 1970
From: "rguenth at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af
Date: Fri, 05 Mar 2021 07:44:24 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: tree-optimization
X-Bugzilla-Version: 11.0
X-Bugzilla-Keywords: missed-optimization, ra
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rguenth at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: rguenth at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 11.0
Content-Type: text/plain; charset="UTF-8"

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

--- Comment #22 from Richard Biener ---
(In reply to Uroš Bizjak from comment #21)
> (In reply to Uroš Bizjak from comment #20)
> > (In reply to Richard Biener from comment #18)
> > > Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 6 vs. 3.  Not
> > > sure if we should somehow do this late somehow (peephole or splitter) since
> > > it requires one more %xmm register.
> > What happens if you disparage [v]pinsrd alternatives in vec_concatv2di?
> 
> Please try this:
> 
> diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
> index db5be59f5b7..edf7b1a3074 100644
> --- a/gcc/config/i386/sse.md
> +++ b/gcc/config/i386/sse.md
> @@ -16043,7 +16043,12 @@
>              (const_string "maybe_evex")
>            ]
>            (const_string "orig")))
> -   (set_attr "mode" "TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")])
> +   (set_attr "mode" "TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")
> +   (set (attr "preferred_for_speed")
> +     (cond [(eq_attr "alternative" "0,1,2,3")
> +              (symbol_ref "false")
> +           ]
> +           (symbol_ref "true")))])
>  
>  (define_insn "*vec_concatv2di_0"

That works to avoid the vpinsrq.  I guess a mem operand behaves similarly
to a gpr (plus the load uop); at least I don't have any evidence to the
contrary (but I didn't do any microbenchmarks either).

I'm not sure IRA/LRA will handle this optimally when register pressure
causes spilling and both gpr operands need to be reloaded.  At least for

typedef long v2di __attribute__((vector_size(16)));

v2di foo (long a, long b)
{
  return (v2di){a, b};
}

with -msse4.1 -O3 -ffixed-xmm1 -ffixed-xmm2 -ffixed-xmm3 -ffixed-xmm4
-ffixed-xmm5 -ffixed-xmm6 -ffixed-xmm7 -ffixed-xmm8 -ffixed-xmm9
-ffixed-xmm10 -ffixed-xmm11 -ffixed-xmm12 -ffixed-xmm13 -ffixed-xmm14
-ffixed-xmm15 I get with the patch

foo:
.LFB0:
        .cfi_startproc
        movq    %rsi, -16(%rsp)
        movq    %rdi, %xmm0
        pinsrq  $1, -16(%rsp), %xmm0
        ret

while without it's

        movq    %rdi, %xmm0
        pinsrq  $1, %rsi, %xmm0

As far as I understand the LRA dumps, the new attribute is a hard one,
applying even when the other alternatives are worse.  In this case we
choose alt 7.  Also covering alts 7 and 8 with the preferred_for_speed
attribute causes reload failures - which is expected if there's no way
for LRA to choose alt 1.  The following seems to work for the small
testcase above but not for the important case in the benchmark (meh).

diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index db5be59f5b7..e393a0d823b 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -15992,7 +15992,7 @@
            (match_operand:DI 1 "register_operand"
          "  0, 0,x ,Yv,0,Yv,0,0,v")
            (match_operand:DI 2 "nonimmediate_operand"
-         " rm,rm,rm,rm,x,Yv,x,m,m")))]
+         " !rm,!rm,!rm,!rm,x,Yv,x,!m,!m")))]
   "TARGET_SSE"
   "@
    pinsrq\t{$1, %2, %0|%0, %2, 1}

I guess the idea of this insn setup was exactly to get IRA/LRA to choose
the optimal instruction sequence - otherwise exposing the reload so late
is probably suboptimal.