From: "tnfchris at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/109632] New: Inefficient codegen when complex numbers are emulated with structs
Date: Wed, 26 Apr 2023 13:03:02 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632

            Bug ID: 109632
           Summary: Inefficient codegen when complex numbers are emulated
                    with structs
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*

The following two cases
are the same:

    struct complx_t {
      float re;
      float im;
    };

    complx_t
    add(const complx_t &a, const complx_t &b) {
      return {a.re + b.re, a.im + b.im};
    }

    _Complex float
    add(const _Complex float *a, const _Complex float *b) {
      return {__real__ *a + __real__ *b, __imag__ *a + __imag__ *b};
    }

But we generate very different code (looking at -O2). For the first one we do:

        ldr     d1, [x1]
        ldr     d0, [x0]
        fadd    v0.2s, v0.2s, v1.2s
        fmov    x0, d0
        lsr     x1, x0, 32
        lsr     w0, w0, 0
        fmov    s1, w1
        fmov    s0, w0
        ret

which is bad for obvious reasons, but it also never needed to go through the
general registers (genreg) for such a reversal. We could have used many other
NEON instructions.

For the second one we generate the good instructions:

add(float _Complex const*, float _Complex const*):
        ldp     s3, s2, [x0]
        ldp     s0, s1, [x1]
        fadd    s1, s2, s1
        fadd    s0, s3, s0
        ret

The difference is that in the second one we have decomposed the initial
structure by loading the elements:

  [local count: 1073741824]:
  _1 = REALPART_EXPR <*a_8(D)>;
  _2 = REALPART_EXPR <*b_9(D)>;
  _3 = _1 + _2;
  _4 = IMAGPART_EXPR <*a_8(D)>;
  _5 = IMAGPART_EXPR <*b_9(D)>;
  _6 = _4 + _5;
  _10 = COMPLEX_EXPR <_3, _6>;
  return _10;

In the first one we've kept them as vectors:

  [local count: 1073741824]:
  vect__1.6_13 = MEM [(float *)a_8(D)];
  vect__2.9_15 = MEM [(float *)b_9(D)];
  vect__3.10_16 = vect__1.6_13 + vect__2.9_15;
  MEM [(float *)&D.4435] = vect__3.10_16;
  return D.4435;

This part is probably a costing issue: we SLP them even though it's not
profitable, because for the AAPCS64 we have to return them in separate
registers.

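To make the "are the same" claim concrete, here is a minimal sketch that puts both variants in one translation unit and compares their results. It assumes GCC or Clang, since it relies on the GNU `_Complex`, `__real__` and `__imag__` extensions; `same_sum` is a hypothetical helper that is not part of the report, and the `_Complex` variant builds its result part by part rather than with the brace initializer used above.

```cpp
// Sketch only: checks that the two add() variants compute the same values.
// Relies on the GNU _Complex / __real__ / __imag__ extensions (GCC or Clang).
struct complx_t {
  float re;
  float im;
};

// Struct-based version, as in the report.
complx_t add(const complx_t &a, const complx_t &b) {
  return {a.re + b.re, a.im + b.im};
}

// _Complex version; the result is assembled part by part here instead of
// with the brace initializer from the report.
_Complex float add(const _Complex float *a, const _Complex float *b) {
  _Complex float r;
  __real__ r = __real__ *a + __real__ *b;
  __imag__ r = __imag__ *a + __imag__ *b;
  return r;
}

// Hypothetical helper (not from the report): feeds the same inputs to both
// variants and compares the real and imaginary parts of the results.
bool same_sum(float are, float aim, float bre, float bim) {
  complx_t sa{are, aim}, sb{bre, bim};
  _Complex float ca, cb;
  __real__ ca = are;
  __imag__ ca = aim;
  __real__ cb = bre;
  __imag__ cb = bim;
  complx_t rs = add(sa, sb);
  _Complex float rc = add(&ca, &cb);
  return rs.re == __real__ rc && rs.im == __imag__ rc;
}
```

Compiling the two `add` overloads with `-O2 -S` on an aarch64 GCC should let the assembly be compared side by side; the exact output may differ between compiler versions.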
Using -fno-tree-vectorize gets the gimple code right:

  [local count: 1073741824]:
  _1 = a_8(D)->re;
  _2 = b_9(D)->re;
  _3 = _1 + _2;
  D.4435.re = _3;
  _4 = a_8(D)->im;
  _5 = b_9(D)->im;
  _6 = _4 + _5;
  D.4435.im = _6;
  return D.4435;

But we generate worse code:

        ldp     s1, s0, [x0]
        mov     x2, 0
        ldp     s3, s2, [x1]
        fadd    s1, s1, s3
        fadd    s0, s0, s2
        fmov    w1, s1
        fmov    w0, s0
        bfi     x2, x1, 0, 32
        bfi     x2, x0, 32, 32
        lsr     x0, x2, 32
        lsr     w2, w2, 0
        fmov    s1, w0
        fmov    s0, w2

where we again use the general registers (genreg) as a very complicated way to
do a no-op.

So there are two bugs here:

1. a costing issue: we shouldn't SLP here.
2. an expansion issue: the code out of expand is bad to begin with.
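
For anyone wanting to look at the second bug in isolation, one option is to disable the vectorizer for just the affected function instead of passing -fno-tree-vectorize for the whole file. The sketch below assumes GCC's `optimize` function attribute, which the GCC manual describes as intended for debugging rather than production use. The names `NO_TREE_VECTORIZE` and `add_novect` are hypothetical, and I have not verified that this reproduces the exact listing above.

```cpp
// Sketch only: per-function equivalent of -fno-tree-vectorize using GCC's
// optimize attribute. NO_TREE_VECTORIZE and add_novect are hypothetical names.
struct complx_t {
  float re;
  float im;
};

#if defined(__GNUC__) && !defined(__clang__)
// GCC prepends -f to strings that do not start with 'O', so this requests
// -fno-tree-vectorize for the annotated function only.
#define NO_TREE_VECTORIZE __attribute__((optimize("no-tree-vectorize")))
#else
// Other compilers: expand to nothing, the function still compiles.
#define NO_TREE_VECTORIZE
#endif

// Same body as the struct-based add() in the report, with SLP disabled so
// that only the expansion issue remains visible in the generated code.
NO_TREE_VECTORIZE
complx_t add_novect(const complx_t &a, const complx_t &b) {
  return {a.re + b.re, a.im + b.im};
}
```

This only changes which passes run on the function; the computed result is the same as for the original `add`.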