From: "already5chosen at yahoo dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3
Date: Sat, 26 Nov 2022 18:27:35 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #19 from Michael_S ---
(In reply to Alexander Monakov from comment #18)
> The apparent 'bias' is introduced by instruction scheduling: haifa-sched
> lifts a +64 increment over memory accesses, transforming +0 and +32
> displacements to -64 and -32. Sometimes this helps a little bit even on
> modern x86 CPUs.

I don't think it ever helps on Intel Sandy Bridge or later, or on AMD Zen1
or later.

> Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> IVs for the two arrays actually helps on this testcase.

Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and
'vfnmadd231pd 32(%rdx), %ymm3, %ymm0' would be turned into 2 uops.

Misuse of load+op is a far bigger problem in this particular test case than
suboptimal loop overhead. Assuming execution on Intel Skylake, it turns a
loop that could potentially run at 3 clocks per iteration into one that
takes 4+ clocks per iteration.

But I consider it a separate issue. I reported a similar issue in PR 97127,
but here it is more serious. It looks to me that this issue is not solvable
within the existing gcc optimization framework. The only chance is if you
accept my old and simple advice: within inner loops, pretend that AVX is
RISC, i.e. generate code as if the load-op forms of AVX instructions did
not exist.
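
To make the "AVX as RISC" suggestion concrete, here is a rough sketch (not
actual gcc output; the %ymm4 scratch register is only an assumption for
illustration) of the split codegen for the instruction quoted above, in the
same AT&T syntax:

    # load-op form discussed above; unlaminated into 2 uops before renaming
    vfnmadd231pd  32(%rdx,%rax), %ymm3, %ymm0

    # split form: a separate load plus a register-register FMA; still 2 uops
    # in total, but the FMA keeps its register form and the scheduler can
    # place the load independently (scratch register %ymm4 is hypothetical)
    vmovupd       32(%rdx,%rax), %ymm4
    vfnmadd231pd  %ymm4, %ymm3, %ymm0

Whether this alone recovers the 3 clocks per iteration mentioned above would
of course have to be measured; the sketch only shows the shape of the code
the advice is asking for.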