From mboxrd@z Thu Jan 1 00:00:00 1970
From: "pinskia at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug middle-end/102977] [GCC12 regression] vectorizer failed to generate complex fma with SVE
Date: Thu, 28 Oct 2021 04:44:27 +0000
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Priority: P3

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102977

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |INVALID
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #1 from Andrew Pinski ---
Huh.
The trunk code is vectorized all the way:

        ptrue   p1.h, vl8                       ; set p1.h to 8 wide
        ptrue   p0.b, all                       ; set p0.b to all ones
        ld2h    {z2.h - z3.h}, p1/z, [x1]       ; load the 8x2 vector into z2/z3
        ld2h    {z0.h - z1.h}, p1/z, [x2]       ; load the 8x2 vector into z0/z1
        ld2h    {z16.h - z17.h}, p1/z, [x0]     ; load the 8x2 vector into z16/z17
        fmul    z6.h, z0.h, z3.h                ; z6 = z0 * z3
        movprfx z7, z16                         ; z7 = z16
        fmla    z7.h, p0/m, z0.h, z2.h          ; z7 += z0*z2
        fmla    z6.h, p0/m, z1.h, z2.h          ; z6 += z1*z2
        movprfx z4, z7                          ; z4 = z7
        fmls    z4.h, p0/m, z1.h, z3.h          ; z4 -= z1*z3
        fadd    z5.h, z6.h, z17.h               ; z5 = z6 + z17
        st2h    {z4.h - z5.h}, p1, [x0]         ; store the 8x2 vector to [x0]

Note that the way ld2 works is that the first element goes into the first
vector, the second element goes into the second vector, the 3rd element goes
into the first vector, the 4th element goes into the second vector, and so on.

So this is optimized all the way. The lower limit on the size of an SVE
vector is 128 bits (8 half floats), so 8 half floats will always fit into one
vector just fine. So this is vectorized all the way, such that it is even
unrolled.
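
For readers who want to check the arithmetic, here is a rough scalar model in C of what the sequence above computes. It is only a sketch under stated assumptions: the function names (load2, store2, complex_fma8) are made up for illustration and are not from the bug report's testcase, float stands in for the half-float element type so that it builds with any C compiler, and the local arrays are merely named after the registers in the listing.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 8 };  /* matches "ptrue p1.h, vl8": 8 active elements */

/* Model of ld2h: de-interleave LANES (even, odd) pairs into two vectors. */
static void load2(const float *mem, float even[LANES], float odd[LANES])
{
    for (size_t i = 0; i < LANES; i++) {
        even[i] = mem[2 * i];      /* 1st, 3rd, 5th ... element */
        odd[i]  = mem[2 * i + 1];  /* 2nd, 4th, 6th ... element */
    }
}

/* Model of st2h: interleave two vectors back into memory. */
static void store2(float *mem, const float even[LANES], const float odd[LANES])
{
    for (size_t i = 0; i < LANES; i++) {
        mem[2 * i]     = even[i];
        mem[2 * i + 1] = odd[i];
    }
}

/* Hypothetical name: x0[i] += x1[i] * x2[i] for 8 complex values stored
   as interleaved (real, imaginary) pairs, step by step as in the listing. */
void complex_fma8(float *x0, const float *x1, const float *x2)
{
    float z0[LANES], z1[LANES], z2[LANES], z3[LANES];
    float z4[LANES], z5[LANES], z16[LANES], z17[LANES];

    assert(x0 != NULL && x1 != NULL && x2 != NULL);

    load2(x1, z2, z3);    /* ld2h {z2.h - z3.h}, p1/z, [x1]   */
    load2(x2, z0, z1);    /* ld2h {z0.h - z1.h}, p1/z, [x2]   */
    load2(x0, z16, z17);  /* ld2h {z16.h - z17.h}, p1/z, [x0] */

    for (size_t i = 0; i < LANES; i++) {
        float z6 = z0[i] * z3[i];          /* fmul         */
        float z7 = z16[i] + z0[i] * z2[i]; /* movprfx+fmla */
        z6 += z1[i] * z2[i];               /* fmla         */
        z4[i] = z7 - z1[i] * z3[i];        /* movprfx+fmls */
        z5[i] = z6 + z17[i];               /* fadd         */
    }

    store2(x0, z4, z5);   /* st2h {z4.h - z5.h}, p1, [x0] */
}
```

With x0[0] = 1+2i, x1[0] = 3+4i and x2[0] = 5+6i, the first stored element works out to (1+2i) + (3+4i)(5+6i) = -8+40i, which is the complex multiply-accumulate the bug title refers to, just done with plain fmul/fmla/fmls/fadd on the de-interleaved real and imaginary vectors.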