From: "already5chosen at yahoo dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3
Date: Sat, 26 Nov 2022 18:27:35 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #19 from Michael_S ---
(In reply to Alexander Monakov from comment #18)
> The apparent 'bias' is introduced by instruction scheduling: haifa-sched
> lifts a +64 increment over memory accesses, transforming +0 and +32
> displacements to -64 and -32. Sometimes this helps a little bit even on
> modern x86 CPUs.

I don't think it ever helps on Intel Sandy Bridge or later, or on AMD Zen1
or later.

> Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> IVs for the two arrays actually helps on this testcase.

Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and
'vfnmadd231pd 32(%rdx), %ymm3, %ymm0' would be turned into 2 uops.

Misuse of load+op is a far bigger problem in this particular test case than
suboptimal loop overhead. Assuming execution on Intel Skylake, it turns a
loop that could potentially run at 3 clocks per iteration into one that
takes 4+ clocks per iteration.

But I consider it a separate issue. I reported a similar issue in PR 97127,
but here it is more serious. It looks to me that this issue is not solvable
within the existing gcc optimization framework. The only chance is if you
accept my old and simple advice: within inner loops, pretend that AVX is
RISC, i.e. generate code as if the load-op forms of AVX instructions did
not exist.
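
To make the "AVX as RISC" suggestion concrete, here is a rough sketch (not
actual gcc output; the %ymm4 scratch register is only an assumption for
illustration) of the split codegen for the instruction quoted above, in the
same AT&T syntax:

    # load-op form discussed above; unlaminated into 2 uops before renaming
    vfnmadd231pd  32(%rdx,%rax), %ymm3, %ymm0

    # split form: a separate load plus a register-register FMA; still 2 uops
    # in total, but the FMA keeps its register form and the scheduler can
    # place the load independently (scratch register %ymm4 is hypothetical)
    vmovupd       32(%rdx,%rax), %ymm4
    vfnmadd231pd  %ymm4, %ymm3, %ymm0

Whether this alone recovers the 3 clocks per iteration mentioned above would
of course have to be measured; the sketch only shows the shape of the code
the advice is asking for.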