[Bug c/99339] New: Poor codegen with simple varargs

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

* [Bug c/99339] New: Poor codegen with simple varargs
@ 2021-03-02 11:49 redbeard0531 at gmail dot com
  2021-03-02 12:44 ` [Bug middle-end/99339] " rguenth at gcc dot gnu.org
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: redbeard0531 at gmail dot com @ 2021-03-02 11:49 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

            Bug ID: 99339
           Summary: Poor codegen with simple varargs
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

These two functions should generate the same code, but the varargs version is
much worse. And it is even worser still when you enable
-fstack-protector-strong.

https://godbolt.org/z/noEYoh

#include <stdarg.h>

int test_va(int x, ...) {
    int i;
    va_list va;
    va_start(va, x);
    i = va_arg(va, int);
    va_end(va);
    return i + x;
}

int test_args(int x, int i) {
    return i + x;
}

# explicit args with or without stack protection
test_args:
        lea     eax, [rsi+rdi]
        ret

# without stack protection (why aren't dead stores to the stack being
eliminated?)
test_va:
        lea     rax, [rsp+8]
        mov     QWORD PTR [rsp-40], rsi
        mov     QWORD PTR [rsp-64], rax
        lea     rax, [rsp-48]
        mov     QWORD PTR [rsp-56], rax
        mov     eax, DWORD PTR [rsp-40]
        mov     DWORD PTR [rsp-72], 8
        add     eax, edi
        ret

# with stack protection (yikes!)
test_va:
        sub     rsp, 88
        mov     QWORD PTR [rsp+40], rsi
        mov     rax, QWORD PTR fs:40
        mov     QWORD PTR [rsp+24], rax
        xor     eax, eax
        lea     rax, [rsp+96]
        mov     DWORD PTR [rsp], 8
        mov     QWORD PTR [rsp+8], rax
        lea     rax, [rsp+32]
        mov     QWORD PTR [rsp+16], rax
        mov     eax, DWORD PTR [rsp+40]
        add     eax, edi
        mov     rdx, QWORD PTR [rsp+24]
        sub     rdx, QWORD PTR fs:40
        jne     .L7
        add     rsp, 88
        ret
.L7:
        call    __stack_chk_fail

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/99339] Poor codegen with simple varargs
  2021-03-02 11:49 [Bug c/99339] New: Poor codegen with simple varargs redbeard0531 at gmail dot com
@ 2021-03-02 12:44 ` rguenth at gcc dot gnu.org
  2021-03-02 12:46 ` rguenth at gcc dot gnu.org
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-02 12:44 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*
          Component|c                           |middle-end
                 CC|                            |matz at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-03-02

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
The stack space is not eliminated because we lower __builtin_va_start only
after RTL expansion and that reserves stack space necessary for accessing
some of the meta (including the passed value itself) as memory.

So it's unavoidable up to somebody designing sth smarter around varargs
and GIMPLE.

Arguably the not lowered variant would be easier to expand optimally:

int test_va (int x)
{
  struct  va[1];
  int i;
  int _7;

  <bb 2> [local count: 1073741824]:
  __builtin_va_start (&va, 0);
  i_4 = .VA_ARG (&va, 0B, 0B);
  __builtin_va_end (&va);
  _7 = i_4 + x_6(D);
  va ={v} {CLOBBER};
  return _7;

I'm not fully sure why we lower at all.  Part of the lowering determines
whether there's any FP arguments referenced and optimizes based on that,
but IIRC that's all.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/99339] Poor codegen with simple varargs
  2021-03-02 11:49 [Bug c/99339] New: Poor codegen with simple varargs redbeard0531 at gmail dot com
  2021-03-02 12:44 ` [Bug middle-end/99339] " rguenth at gcc dot gnu.org
@ 2021-03-02 12:46 ` rguenth at gcc dot gnu.org
  2021-03-02 12:53 ` rguenth at gcc dot gnu.org
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-02 12:46 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, clang manages to produce the following, which shows the situation could be
worse ;)

test_va:                                # @test_va
        .cfi_startproc
# %bb.0:
        subq    $88, %rsp
        .cfi_def_cfa_offset 96
        movl    %eax, %r10d
        movl    %edi, %eax
        testb   %r10b, %r10b
        je      .LBB0_2
# %bb.1:
        movaps  %xmm0, -48(%rsp)
        movaps  %xmm1, -32(%rsp)
        movaps  %xmm2, -16(%rsp)
        movaps  %xmm3, (%rsp)
        movaps  %xmm4, 16(%rsp)
        movaps  %xmm5, 32(%rsp)
        movaps  %xmm6, 48(%rsp)
        movaps  %xmm7, 64(%rsp)
.LBB0_2:
        movq    %rsi, -88(%rsp)
        movq    %rdx, -80(%rsp)
        movq    %rcx, -72(%rsp)
        movq    %r8, -64(%rsp)
        movq    %r9, -56(%rsp)
        leaq    -96(%rsp), %rcx
        movq    %rcx, -112(%rsp)
        leaq    96(%rsp), %rcx
        movq    %rcx, -120(%rsp)
        movabsq $206158430216, %rcx     # imm = 0x3000000008
        movq    %rcx, -128(%rsp)
        movl    $8, %edx
        cmpq    $40, %rdx
        ja      .LBB0_4
# %bb.3:
        movl    $8, %ecx
        addq    -112(%rsp), %rcx
        addl    $8, %edx
        movl    %edx, -128(%rsp)
        jmp     .LBB0_5
.LBB0_4:
        movq    -120(%rsp), %rcx
        leaq    8(%rcx), %rdx
        movq    %rdx, -120(%rsp)
.LBB0_5:
        addl    (%rcx), %eax
        addq    $88, %rsp
        .cfi_def_cfa_offset 8
        retq

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/99339] Poor codegen with simple varargs
  2021-03-02 11:49 [Bug c/99339] New: Poor codegen with simple varargs redbeard0531 at gmail dot com
  2021-03-02 12:44 ` [Bug middle-end/99339] " rguenth at gcc dot gnu.org
  2021-03-02 12:46 ` rguenth at gcc dot gnu.org
@ 2021-03-02 12:53 ` rguenth at gcc dot gnu.org
  2021-03-02 13:11 ` jakub at gcc dot gnu.org
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-02 12:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
So we could try to lower even va_start/end to expose the va_list meta fully
to the middle-end early which should eventually allow eliding it.  That
would require introducing other builtins/internal fns to allow referencing
the frame or the incoming arg registers by number.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/99339] Poor codegen with simple varargs
  2021-03-02 11:49 [Bug c/99339] New: Poor codegen with simple varargs redbeard0531 at gmail dot com
                   ` (2 preceding siblings ...)
  2021-03-02 12:53 ` rguenth at gcc dot gnu.org
@ 2021-03-02 13:11 ` jakub at gcc dot gnu.org
  2021-03-02 13:28 ` redbeard0531 at gmail dot com
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-03-02 13:11 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Clearly LLVM doesn't do the varargs optimizations we do.

Anyway, perhaps for these special cases we could just figure out which of the
exact arguments is accessed (if for all .VA_ARG calls we see the exact sequence
of those calls from __builtin_va_start and they are few) in a target hook
called from the stdarg pass and let the target hook replace them with some
internal fn call or register var read etc.
And optimize away __builtin_va_start if nothing else would use it after that
optimization.

The question is how common in the wild it is and if it is worth the work.
I guess e.g. the open function (when not implemented in assembly or hacked up
so that it just has 3 arguments on the definition instead of 2 + ...) is an
example where it could benefit from that.

The lowering is done so that the GIMPLE optimizers can actually optimize all
the struct field loads/stores etc., doing it only in RTL optimizations results
in worse code.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/99339] Poor codegen with simple varargs
  2021-03-02 11:49 [Bug c/99339] New: Poor codegen with simple varargs redbeard0531 at gmail dot com
                   ` (3 preceding siblings ...)
  2021-03-02 13:11 ` jakub at gcc dot gnu.org
@ 2021-03-02 13:28 ` redbeard0531 at gmail dot com
  2021-03-02 15:30 ` redbeard0531 at gmail dot com
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: redbeard0531 at gmail dot com @ 2021-03-02 13:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

--- Comment #5 from Mathias Stearn <redbeard0531 at gmail dot com> ---
I filed a related bug with clang right after this one if anyone want to follow
along https://bugs.llvm.org/show_bug.cgi?id=49395.

Just because clang does worse doesn't mean gcc shouldn't do better ;)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/99339] Poor codegen with simple varargs
  2021-03-02 11:49 [Bug c/99339] New: Poor codegen with simple varargs redbeard0531 at gmail dot com
                   ` (4 preceding siblings ...)
  2021-03-02 13:28 ` redbeard0531 at gmail dot com
@ 2021-03-02 15:30 ` redbeard0531 at gmail dot com
  2021-03-03  7:28 ` rguenth at gcc dot gnu.org
  2021-03-03  8:41 ` jakub at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: redbeard0531 at gmail dot com @ 2021-03-02 15:30 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

--- Comment #6 from Mathias Stearn <redbeard0531 at gmail dot com> ---
> The question is how common in the wild it is and if it is worth the work.

I'm not sure how common it is, but this is my use case. The code in the bug
report is a slightly simplified example from some Real World Code I am working
on:
https://source.wiredtiger.com/develop/struct_w_t___c_u_r_s_o_r.html#af19f6f9d9c7fc248ab38879032620b2f.
Essentially WT_CURSOR is a structure of function pointers that all take a
WT_CURSOR pointer as the first argument, similar to C++ virtual functions. The
get_key() and get_value() "methods" both take (WT_CURSOR*, ...) arguments, and
unpack the arguments based on the format string that was set up when you opened
the cursor. The expectation is that they will be called many times for each
cursor object. Since we know the format string when creating the cursor I was
experimenting with creating specialized functions for common formats rather
than dispatching through the generic format-inspecting logic every time. I
noticed that I was able to get even more performance by declaring the functions
as taking the arguments they actually take rather than dealing with va_args,
then casting the function pointers to the generic (WT_CURSOR, ...) type when
assigning into the method slot. I assume this is UB in C (I don't know the C
standard nearly as well as C++) but all ABIs I know of are designed to make
this kind of thing work.

I'd rather not have to do that kind of questionable shenanigans in order to get
the same performance.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/99339] Poor codegen with simple varargs
  2021-03-02 11:49 [Bug c/99339] New: Poor codegen with simple varargs redbeard0531 at gmail dot com
                   ` (5 preceding siblings ...)
  2021-03-02 15:30 ` redbeard0531 at gmail dot com
@ 2021-03-03  7:28 ` rguenth at gcc dot gnu.org
  2021-03-03  8:41 ` jakub at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: rguenth at gcc dot gnu.org @ 2021-03-03  7:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamborm at gcc dot gnu.org

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
For simple cases some IPA pass (IPA-CP or IPA-SRA?) could also 'clone' varargs
functions based on callers, eliding varargs and thus also allow inlining
(or like the early IPA-SRA did, modify a function in place if all callers are
simple).

Directly supporting inlining might also be possible.

What's required for all this is some local analysis of the varargs function
on whether it's possible to replace the .VA_ARG calls with direct parameter
references (no .VA_ARG in loops for example, no passing of the va_list to
other functions, etc.).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug middle-end/99339] Poor codegen with simple varargs
  2021-03-02 11:49 [Bug c/99339] New: Poor codegen with simple varargs redbeard0531 at gmail dot com
                   ` (6 preceding siblings ...)
  2021-03-03  7:28 ` rguenth at gcc dot gnu.org
@ 2021-03-03  8:41 ` jakub at gcc dot gnu.org
  7 siblings, 0 replies; 9+ messages in thread
From: jakub at gcc dot gnu.org @ 2021-03-03  8:41 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The stdarg pass already performs similar analysis, see e.g.
reachable_at_most_once function, because if those aren't used in loops and
escape, it computes not just whether the function uses some fp or gpr args, but
also how many (though, upper bound for that).
For the optimization discussed here, it would need to punt also on cases like:
va_start
if (cond)
  va_arg
va_arg
(well, could assign the first va_arg a particular register, but not to the
other one).

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-03-03  8:41 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-02 11:49 [Bug c/99339] New: Poor codegen with simple varargs redbeard0531 at gmail dot com
2021-03-02 12:44 ` [Bug middle-end/99339] " rguenth at gcc dot gnu.org
2021-03-02 12:46 ` rguenth at gcc dot gnu.org
2021-03-02 12:53 ` rguenth at gcc dot gnu.org
2021-03-02 13:11 ` jakub at gcc dot gnu.org
2021-03-02 13:28 ` redbeard0531 at gmail dot com
2021-03-02 15:30 ` redbeard0531 at gmail dot com
2021-03-03  7:28 ` rguenth at gcc dot gnu.org
2021-03-03  8:41 ` jakub at gcc dot gnu.org

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).