From: "lili.cui at intel dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/110148] [14 Regression] TSVC s242 regression between g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8
Date: Sun, 25 Jun 2023 05:56:04 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: middle-end
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization, needs-bisection
X-Bugzilla-Severity: normal
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 14.0

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #3 from cuilili ---
I reproduced the S1244 regression on znver3.

Src code:

    for (int i = 0; i < LEN_1D-1; i++) {
        a[i] = b[i] + c[i] * c[i] + b[i] * b[i] + c[i];
        d[i] = a[i] + a[i+1];
    }

--------------------------------------------------------------------
Base version:                       Base + commit version:
Assembler                           Assembler
Loop1:                              Loop1:
vmovsd 0x60c400(%rax),%xmm2         vmovsd 0x60ba00(%rax),%xmm2
vmovsd 0x60ba00(%rax),%xmm1         vmovsd 0x60c400(%rax),%xmm1
add    $0x8,%rax                    add    $0x8,%rax
--------------------------------------------------------------------
vaddsd %xmm1,%xmm2,%xmm0            vmovsd %xmm2,%xmm2,%xmm0
vmulsd %xmm2,%xmm2,%xmm2            vfmadd132sd %xmm2,%xmm1,%xmm0
vfmadd132sd %xmm1,%xmm2,%xmm1       vfmadd132sd %xmm1,%xmm2,%xmm1
--------------------------------------------------------------------
vaddsd %xmm1,%xmm0,%xmm0            vaddsd %xmm1,%xmm0,%xmm0
vmovsd %xmm0,0x60cdf8(%rax)         vmovsd %xmm0,0x60cdf8(%rax)
vaddsd 0x60ce00(%rax),%xmm0,%xmm0   vaddsd 0x60ce00(%rax),%xmm0,%xmm0
vmovsd %xmm0,0x60aff8(%rax)         vmovsd %xmm0,0x60aff8(%rax)
cmp    $0x9f8,%rax                  cmp    $0x9f8,%rax
jne    Loop1                        jne    Loop1

In the Base version the FMA depends on the result of the multiply, which
lengthens the critical dependency chain; the Base + commit version uses two
independent FMAs instead. I have not found out why znver3 regresses. The same
binary running on ICX shows an 11% gain (with #define iterations 100000000).
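
To make the dependency difference easier to see, here is a rough C sketch of the
two evaluation orders as I read them off the listings above. It is not compiler
output. The names a_base, a_commit and s1244 are made up for illustration, and
the assignment of b and c to registers is arbitrary, since the listings do not
say which array lives at which address. LEN_1D = 320 is an assumption taken from
the 0xa00-byte spacing between the array addresses (and the 0x9f8 loop bound,
i.e. 319 iterations of 8 bytes). The multiplies and adds are written out
separately; the comments mark which pairs correspond to a fused vfmadd132sd.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size: 320 doubles per array (0xa00 bytes between arrays). */
#define LEN_1D 320

/* Base shape: the FMA consumes the result of the multiply,
   so the critical path is mul -> fma -> add. */
static double a_base(double b, double c)
{
    double sum = b + c;        /* vaddsd, off the critical path       */
    double sq  = b * b;        /* vmulsd                              */
    double acc = c * c + sq;   /* vfmadd132sd, waits for the multiply */
    return sum + acc;          /* vaddsd                              */
}

/* Base + commit shape: two FMAs that do not depend on each other,
   so the critical path is fma -> add. */
static double a_commit(double b, double c)
{
    double t0 = b * b + c;     /* vfmadd132sd                    */
    double t1 = c * c + b;     /* vfmadd132sd, independent of t0 */
    return t0 + t1;            /* vaddsd                         */
}

/* The s1244 loop, computing a[i] with one of the two shapes above. */
static void s1244(double *a, const double *b, const double *c, double *d,
                  double (*shape)(double, double))
{
    assert(a && b && c && d && shape);
    for (size_t i = 0; i < LEN_1D - 1; i++) {
        a[i] = shape(b[i], c[i]);
        d[i] = a[i] + a[i + 1];   /* a[i+1] still holds its old value here */
    }
}
```

Both shapes compute the same value only under reassociation, so results can
differ in the last bits for general inputs; for small integer inputs they agree
exactly.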