Date: Thu, 27 Jan 2022 13:04:34 +0100
From: Jan Hubicka
To: rguenther at suse dot de
Cc: gcc-bugs@gcc.gnu.org
Subject: Re: [Bug rtl-optimization/102178] [12 Regression] SPECFP 2006 470.lbm regressions on AMD Zen CPUs after r12-897-gde56f95afaaa22

> I would say so.  It saves code size and also uop space unless the two
> can magically fuse to a immediate to %xmm move (I doubt that).
I made a simple benchmark:

double a=10;
int
main()
{
  long int i;
  /* sum must start at 0; it was left uninitialized before.  */
  double sum = 0, val1, val2, val3, val4;
  for (i = 0; i < 1000000000; i++)
    {
#if 1
#if 1
      /* Variant 1: 64-bit immediate to GPR, then GPR to %xmm.  */
      asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val1): :"r8","xmm11");
      asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val2): :"r8","xmm11");
      asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val3): :"r8","xmm11");
      asm __volatile__("movabsq $0x3ff03db8fde2ef4e, %%r8;vmovq %%r8, %0": "=x"(val4): :"r8","xmm11");
#else
      /* Variant 2: memory load to GPR, then GPR to %xmm.  */
      asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val1):"m"(a) :"r8","xmm11");
      asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val2):"m"(a) :"r8","xmm11");
      asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val3):"m"(a) :"r8","xmm11");
      asm __volatile__("movq %1, %%r8;vmovq %%r8, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
#else
      /* Variant 3: memory load directly to %xmm.  */
      asm __volatile__("vmovq %1, %0": "=x"(val1):"m"(a) :"r8","xmm11");
      asm __volatile__("vmovq %1, %0": "=x"(val2):"m"(a) :"r8","xmm11");
      asm __volatile__("vmovq %1, %0": "=x"(val3):"m"(a) :"r8","xmm11");
      asm __volatile__("vmovq %1, %0": "=x"(val4):"m"(a) :"r8","xmm11");
#endif
      sum += val1 + val2 + val3 + val4;
    }
  return sum;
}

and indeed the third variant runs in 1.2s, while the first two both take
2.4s on my zen2 laptop.
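For reference, the immediate in the first variant is just the bit pattern of a double that travels through a GPR on its way to %xmm. Below is a minimal portable sketch of that reinterpretation, assuming a 64-bit IEEE-754 double as on x86-64. The helper names (double_from_bits, bits_from_double, check_benchmark_constant) are made up for illustration and are not part of the benchmark or of GCC.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: double is a 64-bit IEEE-754 value, as on x86-64.  */
_Static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Hypothetical helper: reinterpret a 64-bit integer pattern as a double.
   This is the portable equivalent of what movabsq + vmovq do in registers:
   the bits are copied unchanged, no integer-to-float conversion happens.  */
static double double_from_bits(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Hypothetical helper: the reverse direction, double to raw bits.  */
static uint64_t bits_from_double(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Sanity check on the constant used in the first benchmark variant.  */
static void check_benchmark_constant(void)
{
    const uint64_t imm = UINT64_C(0x3ff03db8fde2ef4e);
    double v = double_from_bits(imm);

    /* Sign bit 0 and exponent field 0x3ff put the value in [1.0, 2.0).  */
    assert(v >= 1.0 && v < 2.0);
    /* The copy is lossless in both directions.  */
    assert(bits_from_double(v) == imm);
}
```

Calling check_benchmark_constant() from a small main is enough to confirm the round trip. The memcpy form is used instead of a pointer cast so that the sketch stays within strict aliasing rules.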