public inbox for gcc-bugs@sourceware.org
From: "mkretz at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/115204] New: unnecessary stack usage and copies (of temporaries)
Date: Thu, 23 May 2024 12:19:27 +0000
Message-ID: <bug-115204-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115204

            Bug ID: 115204
           Summary: unnecessary stack usage and copies (of temporaries)
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mkretz at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

Test case (https://compiler-explorer.com/z/P7s75EhMr):

struct A {
  int data[8];
};

struct A gen();

void g(struct A);

void f()
{
  g(gen());
}

GCC places the A object returned by 'gen()' on the stack, copies it into the
outgoing argument area, and then calls 'g'. Why? Instead of

f:
        sub     rsp, 40
        xor     eax, eax
        mov     rdi, rsp
        call    gen
        sub     rsp, 32
        movdqa  xmm0, XMMWORD PTR [rsp+32]
        movups  XMMWORD PTR [rsp], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp+48]
        movups  XMMWORD PTR [rsp+16], xmm0
        call    g
        add     rsp, 72
        ret

can GCC just elide the copy? Like this:

f:
        sub     rsp, 40
        xor     eax, eax
        mov     rdi, rsp
        call    gen
        call    g
        add     rsp, 40
        ret
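The stack adjustments in the two listings above can be reproduced with a little
arithmetic. This is only a sketch: it assumes a 4-byte 'int', the 16-byte stack
alignment the x86-64 psABI requires at a call, and the 8-byte return address
already on the stack at entry to 'f'. The helper names are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

struct A {
  int data[8];
};

enum { RETURN_ADDRESS = 8, STACK_ALIGN = 16 };

/* Hypothetical helper: smallest rsp adjustment that reserves 'bytes' and
   leaves rsp 16-byte aligned, given that the return address (8 bytes)
   is already on the stack. */
size_t frame_size(size_t bytes)
{
  size_t used = bytes + RETURN_ADDRESS;
  size_t rounded = (used + STACK_ALIGN - 1) / STACK_ALIGN * STACK_ALIGN;
  return rounded - RETURN_ADDRESS;
}

/* Current output: slot for the value returned by gen() plus a second
   slot for the by-value argument of g(). */
size_t current_total(void)
{
  return frame_size(sizeof(struct A)) + sizeof(struct A);
}

/* Suggested output: the return slot doubles as the argument slot. */
size_t proposed_total(void)
{
  return frame_size(sizeof(struct A));
}

/* Matches 'add rsp, 72' and 'add rsp, 40' when sizeof(int) == 4. */
void check_frames(void)
{
  assert(sizeof(struct A) == 32);
  assert(current_total() == 72);
  assert(proposed_total() == 40);
}
```

So the copy costs 32 additional bytes of stack on top of the four vector moves.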


I understand that this optimization requires that the caller never reads from
the object again. So a second call to 'g' with the same object returned from
'gen' (like in https://compiler-explorer.com/z/6rMYdnb34) requires that the
first call to 'g' gets a copy. But the second call does not require a copy.
I.e.

int f()
{
  struct A a = gen();
  g(a);
  g(a);
  return 1;
}

compiles to

f:
        sub     rsp, 40
        xor     eax, eax
        mov     rdi, rsp
        call    gen
        sub     rsp, 32
        movdqa  xmm0, XMMWORD PTR [rsp+32]
        movups  XMMWORD PTR [rsp], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp+48]
        movups  XMMWORD PTR [rsp+16], xmm0
        call    g
        movdqa  xmm0, XMMWORD PTR [rsp+32]
        movups  XMMWORD PTR [rsp], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp+48]
        movups  XMMWORD PTR [rsp+16], xmm0
        call    g
        mov     eax, 1
        add     rsp, 72
        ret

but could be

f:
        sub     rsp, 40
        xor     eax, eax
        mov     rdi, rsp
        call    gen
        sub     rsp, 32
        movdqa  xmm0, XMMWORD PTR [rsp+32]
        movups  XMMWORD PTR [rsp], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp+48]
        movups  XMMWORD PTR [rsp+16], xmm0
        call    g
        add     rsp, 32
        call    g
        mov     eax, 1
        add     rsp, 40
        ret
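The reason only the first call needs a copy can be seen at the source level. In
the sketch below the bodies of 'gen' and 'g' are hypothetical (the test case
only declares them), and 'g' deliberately overwrites its parameter, which a
callee is allowed to do with a by-value argument.

```c
#include <assert.h>

struct A {
  int data[8];
};

/* Hypothetical definition: fills the object with 1..8. */
struct A gen(void)
{
  struct A a;
  for (int i = 0; i < 8; ++i)
    a.data[i] = i + 1;
  return a;
}

int g_sum; /* running sum of every element g() has seen */

/* Hypothetical definition: reads its parameter, then clobbers it. */
void g(struct A a)
{
  for (int i = 0; i < 8; ++i) {
    g_sum += a.data[i];
    a.data[i] = 0;
  }
}

int f(void)
{
  struct A a = gen();
  g(a); /* 'a' is read again below, so this call must get a copy */
  g(a); /* last use of 'a': the callee could take over its storage */
  return 1;
}

void check_f(void)
{
  g_sum = 0;
  assert(f() == 1);
  assert(g_sum == 72); /* 36 per call; would be 36 in total if the
                          first call had clobbered the caller's 'a' */
}
```

If the first call had been handed the caller's storage directly, the second
call would see zeros. After the second call nothing reads 'a' any more, so
letting the callee use that storage is unobservable.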

IIUC, the second change would be significantly harder to implement because it
needs to shrink the stack. However, I don't believe this second case is as
important. The first one should be sufficiently common because temporaries are
often passed directly as function arguments. So the following variation

void f()
{
  g(gen(), gen());
}

is something I see often, leading to many unnecessary stack copies. Instead of

f:
        sub     rsp, 72
        xor     eax, eax
        mov     rdi, rsp
        call    gen
        lea     rdi, [rsp+32]
        xor     eax, eax
        call    gen
        sub     rsp, 64
        movdqa  xmm0, XMMWORD PTR [rsp+64]
        movups  XMMWORD PTR [rsp+32], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp+80]
        movups  XMMWORD PTR [rsp+48], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp+96]
        movups  XMMWORD PTR [rsp], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp+112]
        movups  XMMWORD PTR [rsp+16], xmm0
        call    g
        add     rsp, 136
        ret

I think it should be:

f:
        sub     rsp, 72
        xor     eax, eax
        mov     rdi, rsp
        call    gen
        lea     rdi, [rsp+32]
        xor     eax, eax
        call    gen
        call    g
        add     rsp, 72
        ret
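One detail when comparing the two listings above: as far as I can tell, the
current output copies the result of the first 'gen()' call to [rsp+32], the
slot of the second argument, while the suggested output leaves the first result
at [rsp], the slot of the first argument. C does not specify the order in which
function arguments are evaluated, so either pairing is acceptable. A small
sketch, with hypothetical bodies for 'gen' and 'g' that tag each result with a
call counter:

```c
#include <assert.h>

struct A {
  int data[8];
};

int gen_calls; /* number of gen() calls so far */

/* Hypothetical definition: data[0] records which call produced the object. */
struct A gen(void)
{
  struct A a = {{0}};
  a.data[0] = gen_calls++;
  return a;
}

int first_tag, second_tag;

/* Hypothetical definition: remembers the tag of each argument. */
void g(struct A x, struct A y)
{
  first_tag = x.data[0];
  second_tag = y.data[0];
}

void f(void)
{
  g(gen(), gen());
}

void check_order(void)
{
  gen_calls = 0;
  f();
  assert(gen_calls == 2);
  /* The tags are 0 and 1, but which argument gets which one is
     unspecified, so only the order-independent property is checked. */
  assert(first_tag != second_tag);
  assert(first_tag + second_tag == 1);
}
```

Code that depends on a particular pairing is relying on unspecified behavior,
so constructing each result directly in the slot that is most convenient should
be fine.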

IIUC, this depends on the psABI and I don't know how target-dependent such an
optimization is. That's why I
