From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 6FA4D3858C62; Sun, 25 Jun 2023 05:56:05 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 6FA4D3858C62
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1687672565;
	bh=EQUMRttpOBc1aOB2w5qZ8ljyo21XAa31c+DrSyfArTE=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=JmoIc4EPx8EI0ElBYymolL+R2rsdxNdVAbDRVpzzsdbC5HPHIPOxkS8nDBJzEApIx
	 zTzw2HtVGBuwEbCkk45eQcPOZCiXi2W/jfzl8nfzi9qE+2eqLPvHmUXelvnr/lqiEt
	 Zy8zQWsMjuOleaPVZJO1dXcTYRob72P6uQbvB3k8=
From: "lili.cui at intel dot com" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/110148] [14 Regression] TSVC s242 regression between
 g:c0df96b3cda5738afbba3a65bb054183c5cd5530 and
 g:e4c986fde56a6248f8fbe6cf0704e1da34b055d8
Date: Sun, 25 Jun 2023 05:56:04 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: middle-end
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization, needs-bisection
X-Bugzilla-Severity: normal
X-Bugzilla-Who: lili.cui at intel dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: 14.0
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-110148-4-CvRznp1kAJ@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-110148-4@http.gcc.gnu.org/bugzilla/>
References: <bug-110148-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110148

--- Comment #3 from cuilili <lili.cui at intel dot com> ---
I reproduced the S1244 regression on znver3.

Source code:

for (int i = 0; i < LEN_1D-1; i++)
  {
    a[i] = b[i] + c[i] * c[i] + b[i] * b[i] + c[i];
    d[i] = a[i] + a[i+1];
  }
--------------------------------------------------------
Base version:                     Base + commit version:

Assembler                         Assembler
Loop1:                            Loop1:
vmovsd 0x60c400(%rax),%xmm2       vmovsd 0x60ba00(%rax),%xmm2
vmovsd 0x60ba00(%rax),%xmm1       vmovsd 0x60c400(%rax),%xmm1
add    $0x8,%rax                  add    $0x8,%rax
--------------------------------------------------------------------
vaddsd %xmm1,%xmm2,%xmm0          vmovsd %xmm2,%xmm2,%xmm0
vmulsd %xmm2,%xmm2,%xmm2          vfmadd132sd %xmm2,%xmm1,%xmm0
vfmadd132sd %xmm1,%xmm2,%xmm1     vfmadd132sd %xmm1,%xmm2,%xmm1
--------------------------------------------------------------------
vaddsd %xmm1,%xmm0,%xmm0          vaddsd %xmm1,%xmm0,%xmm0
vmovsd %xmm0,0x60cdf8(%rax)       vmovsd %xmm0,0x60cdf8(%rax)
vaddsd 0x60ce00(%rax),%xmm0,%xmm0 vaddsd 0x60ce00(%rax),%xmm0,%xmm0
vmovsd %xmm0,0x60aff8(%rax)       vmovsd %xmm0,0x60aff8(%rax)
cmp    $0x9f8,%rax                cmp    $0x9f8,%rax
jne    Loop1:                     jne    Loop1


In the Base version, the FMA depends on the result of the preceding multiply,
which lengthens the critical dependency chain. In the Base + commit version the
two FMAs are independent of each other. I have not found out why znver3
regresses. The same binary running on ICX shows an 11% gain (with
#define iterations 100000000).
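
To make the two dependency chains easier to compare, here is a rough C sketch of
the groupings that the two assembly listings appear to compute. It is only my
reading of the listings, not compiler output. The names expr_base, expr_commit
and s1244_kernel are made up for illustration, LEN_1D = 320 and the double
element type are inferred from the cmp $0x9f8 loop bound and the vmovsd
accesses, and I cannot tell from the addresses which register holds b[i] and
which holds c[i] (the expression is symmetric in b and c, so the sketch picks
one assignment arbitrarily).

```c
#include <stddef.h>

/* Assumption: 0x9f8 / 8 = 319 iterations of 8-byte elements, so LEN_1D = 320. */
#define LEN_1D 320

/* Grouping that matches the Base listing: the multiply result feeds the
   FMA, so the chain is mul -> fma -> add. */
static double expr_base(double b, double c)
{
    double sum = b + c;       /* vaddsd */
    double sq  = c * c;       /* vmulsd */
    double acc = b * b + sq;  /* vfmadd132sd, has to wait for sq */
    return sum + acc;         /* vaddsd */
}

/* Grouping that matches the Base + commit listing: two FMAs that do not
   depend on each other, so the chain is fma -> add. */
static double expr_commit(double b, double c)
{
    double t0 = c * c + b;    /* vfmadd132sd */
    double t1 = b * b + c;    /* vfmadd132sd, independent of t0 */
    return t0 + t1;           /* vaddsd */
}

/* The loop from the source above, parameterised on the grouping.
   Whether the compiler really contracts these expressions into FMAs
   depends on the target and on flags such as -ffp-contract. */
static void s1244_kernel(double *a, const double *b, const double *c,
                         double *d, size_t len,
                         double (*expr)(double, double))
{
    for (size_t i = 0; i + 1 < len; i++) {
        a[i] = expr(b[i], c[i]);
        d[i] = a[i] + a[i + 1];
    }
}
```

Both groupings are the same sum reassociated, so they agree exactly on small
integer inputs and may differ in the last bits otherwise. Calling
s1244_kernel(a, b, c, d, LEN_1D, expr_base) and then again with expr_commit is
one way to time the two chains separately from the rest of TSVC.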