From: "amonakov at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3
Date: Sat, 26 Nov 2022 19:36:52 +0000
X-Bugzilla-Status: RESOLVED
X-Bugzilla-Resolution: FIXED
X-Bugzilla-Target-Milestone: 12.0

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #21 from Alexander Monakov ---
(In reply to Michael_S from comment #19)
> > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> > 'unlaminated' (turned to 2 uops before renaming), so selecting
> > independent IVs for the two arrays actually helps on this testcase.
>
> Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd
> 32(%rdx), %ymm3, %ymm0' would be turned into 2 uops.

The difference is at which point in the pipeline: the latter (with a
base-only address) goes through renaming as one fused uop.

> Misuse of load+op is a far bigger problem in this particular test case
> than sub-optimal loop overhead. Assuming execution on Intel Skylake, it
> turns a loop that could potentially run at 3 clocks per iteration into a
> loop of 4+ clocks per iteration.

Sorry, which assembler output does this refer to?

> But I consider it a separate issue. I reported a similar issue in 97127,
> but here it is more serious. It looks to me that the issue is not soluble
> within the existing gcc optimization framework. The only chance is if you
> accept my old and simple advice - within inner loops pretend that AVX is
> RISC, i.e. generate code as if the load-op form of AVX instructions
> didn't exist.

In bug 97127 the best explanation we have so far is that we don't optimally
handle the case where the non-memory inputs of an fma are reused, so we
cannot combine a load with an fma without causing an extra register copy
(PR 97127 comment 16 demonstrates what I mean). I cannot imagine such
trouble arising with more common commutative operations like mul/add,
especially with the non-destructive VEX encoding. If you hit such examples,
I would suggest reporting them as well, because their root cause might be
different.

In general, load-op combining should be very helpful on x86, because it
reduces the number of uops flowing through the renaming stage, which is one
of the narrowest points in the pipeline.