From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id BC0223858C53; Wed, 26 Apr 2023 13:03:02 +0000 (GMT)
From: "tnfchris at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/109632] New: Inefficient codegen when complex numbers
 are emulated with structs
Date: Wed, 26 Apr 2023 13:03:02 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: normal
X-Bugzilla-Who: tnfchris at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 keywords bug_severity priority component assigned_to reporter
 target_milestone cf_gcctarget
Message-ID: <bug-109632-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109632

            Bug ID: 109632
           Summary: Inefficient codegen when complex numbers are emulated
                    with structs
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*

The following two functions are equivalent:

struct complx_t {
    float re;
    float im;
};

complx_t
add(const complx_t &a, const complx_t &b) {
  return {a.re + b.re, a.im + b.im};
}

_Complex float
add(const _Complex float *a, const _Complex float *b) {
  return {__real__ *a + __real__ *b, __imag__ *a + __imag__ *b};
}
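
As a quick sanity check of the equivalence claim, here is a minimal sketch that
runs both versions on the same inputs and compares the results. The helper
`same_result` is hypothetical and not part of the report, and the _Complex
version is written out with `__real__`/`__imag__` assignments so that it relies
only on those GNU extensions; it assumes a GNU-compatible C++ compiler.

```cpp
#include <cassert>

struct complx_t {
    float re;
    float im;
};

// Struct-based version, as in the report.
complx_t add(const complx_t &a, const complx_t &b) {
    return {a.re + b.re, a.im + b.im};
}

// _Complex version; same arithmetic, spelled with the GNU
// __real__/__imag__ extensions instead of a braced return.
_Complex float add(const _Complex float *a, const _Complex float *b) {
    _Complex float r;
    __real__ r = __real__ *a + __real__ *b;
    __imag__ r = __imag__ *a + __imag__ *b;
    return r;
}

// Hypothetical helper: true when both versions agree on the given inputs.
bool same_result(float are, float aim, float bre, float bim) {
    complx_t sa = {are, aim};
    complx_t sb = {bre, bim};

    _Complex float ca, cb;
    __real__ ca = are;
    __imag__ ca = aim;
    __real__ cb = bre;
    __imag__ cb = bim;

    complx_t s = add(sa, sb);
    _Complex float c = add(&ca, &cb);
    return s.re == __real__ c && s.im == __imag__ c;
}
```

Both versions perform the same two float additions, so any difference in the
generated code comes from how the compiler handles the aggregate, not from the
arithmetic.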

But we generate very different code for them at -O2. For the first one we do:

        ldr     d1, [x1]
        ldr     d0, [x0]
        fadd    v0.2s, v0.2s, v1.2s
        fmov    x0, d0
        lsr     x1, x0, 32
        lsr     w0, w0, 0
        fmov    s1, w1
        fmov    s0, w0
        ret

which is bad for obvious reasons, but it also never needed to go through the
general registers (genreg) for such a reversal. We could have used many other
NEON instructions.

For the second one we generate the good instructions:

add(float _Complex const*, float _Complex const*):
        ldp     s3, s2, [x0]
        ldp     s0, s1, [x1]
        fadd    s1, s2, s1
        fadd    s0, s3, s0
        ret

The difference being that in the second one we have decomposed the initial
structure by loading the elements:

  <bb 2> [local count: 1073741824]:
  _1 = REALPART_EXPR <*a_8(D)>;
  _2 = REALPART_EXPR <*b_9(D)>;
  _3 = _1 + _2;
  _4 = IMAGPART_EXPR <*a_8(D)>;
  _5 = IMAGPART_EXPR <*b_9(D)>;
  _6 = _4 + _5;
  _10 = COMPLEX_EXPR <_3, _6>;
  return _10;

In the first one we've kept them as vectors:

  <bb 2> [local count: 1073741824]:
  vect__1.6_13 = MEM <const vector(2) float> [(float *)a_8(D)];
  vect__2.9_15 = MEM <const vector(2) float> [(float *)b_9(D)];
  vect__3.10_16 = vect__1.6_13 + vect__2.9_15;
  MEM <vector(2) float> [(float *)&D.4435] = vect__3.10_16;
  return D.4435;

This part is probably a costing issue: we SLP them even though it's not
profitable, because under the AAPCS64 we have to return them in separate
registers.

Using -fno-tree-vectorize gets the gimple code right:

  <bb 2> [local count: 1073741824]:
  _1 = a_8(D)->re;
  _2 = b_9(D)->re;
  _3 = _1 + _2;
  D.4435.re = _3;
  _4 = a_8(D)->im;
  _5 = b_9(D)->im;
  _6 = _4 + _5;
  D.4435.im = _6;
  return D.4435;

But the code we generate from it is even worse:

        ldp     s1, s0, [x0]
        mov     x2, 0
        ldp     s3, s2, [x1]
        fadd    s1, s1, s3
        fadd    s0, s0, s2
        fmov    w1, s1
        fmov    w0, s0
        bfi     x2, x1, 0, 32
        bfi     x2, x0, 32, 32
        lsr     x0, x2, 32
        lsr     w2, w2, 0
        fmov    s1, w0
        fmov    s0, w2

where we again go through the general registers (genreg) as a very complicated
way to do a no-op.
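
To make the no-op concrete, here is a rough C++ model of the fmov/bfi/lsr/fmov
tail of that sequence. It is only a sketch: the function names are hypothetical,
and the per-line comments map each statement to the instruction it is meant to
imitate. Packing the two float bit patterns into one 64-bit value and splitting
it again returns exactly the inputs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct complx_t {
    float re;
    float im;
};

// Bit-exact float <-> integer moves, standing in for fmov.
static std::uint32_t to_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

static float from_bits(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Hypothetical model of the general-register round trip.
complx_t pack_and_unpack(float re, float im) {
    std::uint32_t w1 = to_bits(re);                         // fmov w1, s1
    std::uint32_t w0 = to_bits(im);                         // fmov w0, s0
    std::uint64_t x2 = 0;                                   // mov  x2, 0
    x2 = (x2 & ~0xffffffffULL) | w1;                        // bfi  x2, x1, 0, 32
    x2 = (x2 & 0xffffffffULL) | (std::uint64_t(w0) << 32);  // bfi  x2, x0, 32, 32
    std::uint32_t hi = std::uint32_t(x2 >> 32);             // lsr  x0, x2, 32
    std::uint32_t lo = std::uint32_t(x2);                   // lsr  w2, w2, 0
    return {from_bits(lo), from_bits(hi)};                  // fmov s0, w2 ; fmov s1, w0
}

// True when the round trip leaves both bit patterns unchanged.
bool is_identity(float re, float im) {
    complx_t r = pack_and_unpack(re, im);
    return to_bits(r.re) == to_bits(re) && to_bits(r.im) == to_bits(im);
}
```

In this model the only observable effect is which register ends up holding
which half, something that needs no integer instructions at all.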

So there are two bugs here:

1. a costing bug: we shouldn't SLP here
2. an expansion bug: the code out of expand is bad to begin with.