From: "amonakov at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3
Date: Sat, 26 Nov 2022 19:36:52 +0000
X-Bugzilla-Status: RESOLVED
X-Bugzilla-Resolution: FIXED
X-Bugzilla-Target-Milestone: 12.0

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #21 from Alexander Monakov ---
(In reply to Michael_S from comment #19)
> > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> > 'unlaminated' (turned to 2 uops before renaming), so selecting
> > independent IVs for the two arrays actually helps on this testcase.
>
> Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd
> 32(%rdx), %ymm3, %ymm0' would be turned into 2 uops.

The difference is at which point in the pipeline: the latter (with a
base-only address) goes through renaming as one fused uop.

> Misuse of load+op is a far bigger problem in this particular test case
> than sub-optimal loop overhead. Assuming execution on Intel Skylake, it
> turns a loop that could potentially run at 3 clocks per iteration into a
> loop of 4+ clocks per iteration.

Sorry, which assembler output does this refer to?

> But I consider it a separate issue. I reported a similar issue in 97127,
> but here it is more serious. It looks to me that the issue is not soluble
> within the existing gcc optimization framework. The only chance is if you
> accept my old and simple advice - within inner loops pretend that AVX is
> RISC, i.e. generate code as if the load-op form of AVX instructions
> didn't exist.

In bug 97127 the best explanation we have so far is that we don't optimally
handle the case where the non-memory inputs of an fma are reused, so we
cannot combine a load with an fma without causing an extra register copy
(PR 97127 comment 16 demonstrates what I mean). I cannot imagine such
trouble arising with more common commutative operations like mul/add,
especially with the non-destructive VEX encoding. If you hit such examples,
I would suggest reporting them as well, because their root cause might be
different.

In general, load-op combining should be very helpful on x86, because it
reduces the number of uops flowing through the renaming stage, which is one
of the narrowest points in the pipeline.