From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id E1995385800A; Tue, 23 May 2023 10:35:00 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org E1995385800A DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org; s=default; t=1684838100; bh=z3nC/19f8HxktiGGyyzpfs3Cfdgy8Kz5xTsoHGUPNEA=; h=From:To:Subject:Date:In-Reply-To:References:From; b=LMnk0sV7WcmhrmY89xlUXrZmhccgqmObTwGNWnlE355x2NyIzn/y92ygbLO88TthK cgmVQTTaCzdG4KsurmP3SuK2twUww+lMRwXtXnQ2GyD3OcS8H4aY3EsDH760YzVCpu dvgWPrJR63OopHj3WlA68uluKwxjIZvzg356GMTs= From: "cvs-commit at gcc dot gnu.org" To: gcc-bugs@gcc.gnu.org Subject: [Bug target/109632] Inefficient codegen when complex numbers are emulated with structs Date: Tue, 23 May 2023 10:34:59 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: gcc X-Bugzilla-Component: target X-Bugzilla-Version: 14.0 X-Bugzilla-Keywords: missed-optimization X-Bugzilla-Severity: normal X-Bugzilla-Who: cvs-commit at gcc dot gnu.org X-Bugzilla-Status: ASSIGNED X-Bugzilla-Resolution: X-Bugzilla-Priority: P3 X-Bugzilla-Assigned-To: rsandifo at gcc dot gnu.org X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 List-Id: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=3D109632 --- Comment #11 from CVS Commits --- The trunk branch has been updated by Richard Sandiford : https://gcc.gnu.org/g:b096a6ebe9d9f9fed4c105f6555f724eb32af95c commit r14-1131-gb096a6ebe9d9f9fed4c105f6555f724eb32af95c Author: Richard Sandiford Date: Tue May 23 11:34:42 2023 +0100 aarch64: Provide FPR alternatives for some bit insertions [PR109632] At -O2, and so with SLP vectorisation enabled: struct complx_t { float re, im; }; complx_t add(complx_t a, complx_t b) { return {a.re + b.re, a.im + b.im}; } generates: fmov w3, s1 fmov x0, d0 fmov x1, d2 fmov w2, s3 bfi x0, x3, 32, 32 fmov d31, x0 bfi x1, x2, 32, 32 fmov d30, x1 fadd v31.2s, v31.2s, v30.2s fmov x1, d31 lsr x0, x1, 32 fmov s1, w0 lsr w0, w1, 0 fmov s0, w0 ret This is because complx_t is passed and returned in FPRs, but GCC gives it DImode. We therefore =C3=A2need=C3=A2 to assemble a DImode pseudo f= rom the two individual floats, bitcast it to a vector, do the arithmetic, bitcast it back to a DImode pseudo, then extract the individual floats. There are many problems here. The most basic is that we shouldn't use SLP for such a trivial example. But SLP should in principle be beneficial for more complicated examples, so preventing SLP for the example above just changes the reproducer needed. A more fundamental problem is that it doesn't make sense to use single DImode pseudos in a testcase like this. I have a WIP patch to allow re and im to be stored in individual SFmode pseudos instead, but it's quite an invasive change and might end up going nowhere. A simpler problem to tackle is that we allow DImode pseudos to be stored in FPRs, but we don't provide any patterns for inserting values into them, even though INS makes that easy for element-like insertions. This patch adds some patterns for that. Doing that showed that aarch64_modes_tieable_p was too strict: it didn't allow SFmode and DImode values to be tied, even though both of them occupy a single GPR and FPR, and even though we allow both classes to change between the modes. The *aarch64_bfidi_subreg_ pattern is especially ugly, but it's not clear what target-independent code ought to simplify it to, if it was going to simplify it. We should probably do the same thing for extractions, but that's left as future work. After the patch we generate: ins v0.s[1], v1.s[0] ins v2.s[1], v3.s[0] fadd v0.2s, v0.2s, v2.2s fmov x0, d0 ushr d1, d0, 32 lsr w0, w0, 0 fmov s0, w0 ret which seems like a step in the right direction. All in all, there's nothing elegant about this patchh. It just seems like the least worst option. gcc/ PR target/109632 * config/aarch64/aarch64.cc (aarch64_modes_tieable_p): Allow subregs between any scalars that are 64 bits or smaller. * config/aarch64/iterators.md (SUBDI_BITS): New int iterator. (bits_etype): New int attribute. * config/aarch64/aarch64.md (*insv_reg_) (*aarch64_bfi_): New patterns. (*aarch64_bfidi_subreg_): Likewise. gcc/testsuite/ * gcc.target/aarch64/ins_bitfield_1.c: New test. * gcc.target/aarch64/ins_bitfield_2.c: Likewise. * gcc.target/aarch64/ins_bitfield_3.c: Likewise. * gcc.target/aarch64/ins_bitfield_4.c: Likewise. * gcc.target/aarch64/ins_bitfield_5.c: Likewise. * gcc.target/aarch64/ins_bitfield_6.c: Likewise.=