From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id EF2E1385840F; Fri,  1 Mar 2024 09:16:56 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org EF2E1385840F
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1709284616;
	bh=aIRTyzuLn8E80frjvAbFbRDFHevbozcrMyxE+drQi0A=;
	h=From:To:Subject:Date:From;
	b=cwPS6T4Hfm+NQgh2YGQCbqzwAp+uIHxta8FQ/nGLYlYMVU/3NmPyKKiUFxiz7R3LN
	 g2PUcG5HZ9yurUGPcgKPLfAr3Q75PkYiTawyr+FapU1s/f0iESuc/4wDjMEA5q9iF6
	 q8B2nrcfBrkoJxT401P86n0g9jYgiD+rQtMJ9sEc=
From: "matteo at mitalia dot net" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug rtl-optimization/114187] New: [14 regression] bizarre register
 dance on x86_64 for pass-by-value struct
Date: Fri, 01 Mar 2024 09:16:52 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: new
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: rtl-optimization
X-Bugzilla-Version: 14.0
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: matteo at mitalia dot net
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: bug_id short_desc product version bug_status
 bug_severity priority component assigned_to reporter target_milestone
Message-ID: <bug-114187-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114187

            Bug ID: 114187
           Summary: [14 regression] bizarre register dance on x86_64 for
                    pass-by-value struct
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: matteo at mitalia dot net
  Target Milestone: ---

Sample code (+ godbolt link https://godbolt.org/z/zf6e16Wcq )

```
struct P2d {
    double x, y;
};

double sumxy(double x, double y) {
    return x + y;
}

double sumxy_p(P2d p) {
    return p.x + p.y;
}

double sumxy_p_ref(const P2d& p) {
    return p.x + p.y;
}
```

With g++ 13.2 at -O3, this generates perfectly reasonable code:

```
sumxy(double, double):
        addsd   xmm0, xmm1
        ret
sumxy_p(P2d):
        addsd   xmm0, xmm1
        ret
sumxy_p_ref(P2d const&):
        movsd   xmm0, QWORD PTR [rdi]
        addsd   xmm0, QWORD PTR [rdi+8]
        ret
```

Instead, with g++ 14 (g++
(Compiler-Explorer-Build-gcc-b05f474c8f7768dad50a99a2d676660ee4db09c6-binutils-2.40)
14.0.1 20240301 (experimental)) we get

```
sumxy(double, double):
        addsd   xmm0, xmm1
        ret
sumxy_p(P2d):
        movq    rax, xmm1
        movq    rdx, xmm0
        xchg    rdx, rax
        movq    xmm0, rax
        movq    xmm2, rdx
        addsd   xmm0, xmm2
        ret
sumxy_p_ref(P2d const&):
        movsd   xmm0, QWORD PTR [rdi]
        addsd   xmm0, QWORD PTR [rdi+8]
        ret
```

Notice the bizarre register dance for sumxy_p(P2d): p.x goes through
xmm0 → rdx → rax → xmm0, and p.y in turn goes through xmm1 → rax → rdx → xmm2,
before they finally get summed. sumxy(double, double), which register-wise
should be the same, is unaffected.

This exact same code (both for gcc 13 and gcc 14) is generated at all
optimization levels I tested (-Og, -O1, -O2, -O3) except -O0 of course, so it
doesn't seem to depend on particular optimization passes enabled only at high
optimization levels. Also (as is reasonable) it doesn't seem to depend on the
C++ frontend, as compiling this with plain gcc (adding a typedef for the struct
and changing the reference to a pointer) yields the exact same results.
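
For reference, a minimal sketch of what that plain C variant could look like,
assuming only the two changes described above (a typedef for the struct and a
pointer instead of the reference). It is written so that it compiles as both C
and C++; the function names are kept from the original sample.

```
/* Sketch of the C variant: typedef added, reference replaced by a pointer. */
typedef struct P2d {
    double x, y;
} P2d;

double sumxy(double x, double y) {
    return x + y;
}

/* Struct passed by value: this is the function showing the register dance. */
double sumxy_p(P2d p) {
    return p.x + p.y;
}

/* Pointer version of sumxy_p_ref. */
double sumxy_p_ref(const P2d *p) {
    return p->x + p->y;
}
```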

Most importantly, it seems to be something target-specific, as ARM64 builds
don't exhibit any particular problems, and produce pretty much the same
(reasonable) code on both 14.0 and 13.2:

```
sumxy(double, double):
        fadd    d0, d0, d1
        ret
sumxy_p(P2d):
        fadd    d0, d0, d1
        ret
sumxy_p_ref(P2d const&):
        ldp     d0, d31, [x0]
        fadd    d0, d0, d31
        ret
```

(gcc 13.2 generates slightly different code for sumxy_p_ref, but in a very
minor way)

Fiddling around, with -march=nocona (which leaves gcc 13.2 unaffected) I get a
more compact but still absurd dance:

```
sumxy_p(P2d):
        movsd   QWORD PTR [rsp-8], xmm1
        mov     rdx, QWORD PTR [rsp-8]
        movq    xmm2, rdx
        addsd   xmm0, xmm2
        ret
```

Here p.x is left in xmm0 where it should be, but xmm1 goes through the stack
(!), a GP register (rdx) and finally to xmm2. It feels like in general it wants
to launder xmm1 through a 64-bit GP register before summing it, a bit like a
light version of -ffloat-store.
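
To make the "laundering" concrete, here is a rough source-level sketch of what
the generated code appears to do with p.y: copy its bits into a 64-bit integer
and back before the addition. The helper through_gpr and the function
sumxy_p_laundered are hypothetical names introduced only for illustration. This
is not a reproducer, since a compiler is free to optimize the copies away; it
only shows that the round trip leaves the value unchanged.

```
#include <cstdint>
#include <cstring>

struct P2d {
    double x, y;
};

// Hypothetical helper (not from GCC or the sample): moves the bits of a
// double into a 64-bit integer and back, mirroring xmm1 -> rdx -> xmm2.
static double through_gpr(double v) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);   // roughly: movq rdx, xmm1
    double out;
    std::memcpy(&out, &bits, sizeof out);  // roughly: movq xmm2, rdx
    return out;
}

// Same result as sumxy_p, with p.y taking the bit-preserving detour.
double sumxy_p_laundered(P2d p) {
    return p.x + through_gpr(p.y);
}
```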