public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug c/107831] New: Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page
@ 2022-11-23 10:09 pskocik at gmail dot com
  2022-11-23 12:34 ` [Bug c/107831] " pskocik at gmail dot com
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: pskocik at gmail dot com @ 2022-11-23 10:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107831

            Bug ID: 107831
           Summary: Missed optimization: -fclash-stack-protection causes
                    unnecessary code generation for dynamic stack
                    allocations that are clearly less than a page
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pskocik at gmail dot com
  Target Milestone: ---

I'm talking allocations such as

char buf [ (uint8_t)size ];

The resulting code for this should ideally be the same with or without
-fstack-clash-protection as this can clearly never skip a whole page.

But gcc generates a big loop trying to touch every page-sized subpart of that
allocation.

https://godbolt.org/z/G8EbzbshK

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/107831] Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page
  2022-11-23 10:09 [Bug c/107831] New: Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page pskocik at gmail dot com
@ 2022-11-23 12:34 ` pskocik at gmail dot com
  2022-11-23 17:01 ` jakub at gcc dot gnu.org
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pskocik at gmail dot com @ 2022-11-23 12:34 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107831

--- Comment #1 from Petr Skocik <pskocik at gmail dot com> ---
Sidenote regarding the stack-allocating code for cases when the size is not
known to be less than pagesize: the code generated for those cases is quite
large. It could be replaced (at least under -Os) with a call to a special
assembly function that'd pop the return address (assuming the target machine
pushes return addresses to the stack), allocate adjust and allocate the stack
size in a piecemeal fashion so as to not skip guard pages, the repush the
return address and return to caller with the stacksize expanded.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/107831] Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page
  2022-11-23 10:09 [Bug c/107831] New: Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page pskocik at gmail dot com
  2022-11-23 12:34 ` [Bug c/107831] " pskocik at gmail dot com
@ 2022-11-23 17:01 ` jakub at gcc dot gnu.org
  2022-11-23 17:05 ` jakub at gcc dot gnu.org
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-11-23 17:01 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107831

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
(In reply to Petr Skocik from comment #1)
> Sidenote regarding the stack-allocating code for cases when the size is not
> known to be less than pagesize: the code generated for those cases is quite
> large. It could be replaced (at least under -Os) with a call to a special
> assembly function that'd pop the return address (assuming the target machine
> pushes return addresses to the stack), allocate adjust and allocate the
> stack size in a piecemeal fashion so as to not skip guard pages, the repush
> the return address and return to caller with the stacksize expanded.

You certainly don't want to kill the return stack the CPU has, even if it
results in a few saved bytes for -Os.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/107831] Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page
  2022-11-23 10:09 [Bug c/107831] New: Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page pskocik at gmail dot com
  2022-11-23 12:34 ` [Bug c/107831] " pskocik at gmail dot com
  2022-11-23 17:01 ` jakub at gcc dot gnu.org
@ 2022-11-23 17:05 ` jakub at gcc dot gnu.org
  2022-11-23 17:37 ` jakub at gcc dot gnu.org
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-11-23 17:05 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107831

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
We could use range information I guess to find out if a dynamic allocation must
be small enough, but because it is a dynamic allocation, we also need a
guarantee there won't be too many small dynamic allocations in the same frame.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/107831] Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page
  2022-11-23 10:09 [Bug c/107831] New: Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page pskocik at gmail dot com
                   ` (2 preceding siblings ...)
  2022-11-23 17:05 ` jakub at gcc dot gnu.org
@ 2022-11-23 17:37 ` jakub at gcc dot gnu.org
  2022-11-23 17:57 ` law at gcc dot gnu.org
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-11-23 17:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107831

--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Say for
void bar (char *);
void
foo (int x, int y)
{
  __attribute__((assume (x < 64)));
  for (int i = 0; i < y; ++i)
    bar (__builtin_alloca (x));
}
all the alloca calls are known to be small, yet they can quickly cross pages.
Similarly:
void
baz (int x)
{
  if (x >= 512) __builtin_unreachable ();
  char a[x];
  bar (a);
  char b[x];
  bar (b);
  char c[x];
  bar (c);
  char d[x];
  bar (d);
  char e[x];
  bar (e);
  char f[x];
  bar (f);
  char g[x];
  bar (g);
  char h[x];
  bar (h);
  char i[x];
  bar (i);
  char j[x];
  bar (j);
}
All the VLAs here are small, yet together they can cross a page.
So, we'd need to punt for dynamic allocations in loops and for others estimate
the maximum size of all the allocations together (+ __builtin_alloca overhead +
normal frame size).

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/107831] Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page
  2022-11-23 10:09 [Bug c/107831] New: Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page pskocik at gmail dot com
                   ` (3 preceding siblings ...)
  2022-11-23 17:37 ` jakub at gcc dot gnu.org
@ 2022-11-23 17:57 ` law at gcc dot gnu.org
  2022-11-23 21:27 ` pskocik at gmail dot com
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: law at gcc dot gnu.org @ 2022-11-23 17:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107831

--- Comment #5 from Jeffrey A. Law <law at gcc dot gnu.org> ---
Right.  You also have to know the distance from the last probe (possibly an
implicit one) to the start of the alloca space before you can contemplate
eliding the probes in alloca space.  There's a hook we can use, but it was more
meant for AArch64 IIRC, but we might be able to use it for this purpose.

You also have to worry about dynamic allocations in a loop.  A single byte
alloca could jump the stack if it's inside loop with a suitable number of
iterations.


In general, the model taken was to try and minimize explicit probes in the
common case (ie, function entry) at the expense of taking additional probes in
the uncommon case (dynamic allocations).

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/107831] Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page
  2022-11-23 10:09 [Bug c/107831] New: Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page pskocik at gmail dot com
                   ` (4 preceding siblings ...)
  2022-11-23 17:57 ` law at gcc dot gnu.org
@ 2022-11-23 21:27 ` pskocik at gmail dot com
  2022-11-24 19:16 ` pskocik at gmail dot com
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: pskocik at gmail dot com @ 2022-11-23 21:27 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107831

--- Comment #6 from Petr Skocik <pskocik at gmail dot com> ---
(In reply to Jakub Jelinek from comment #2)
> (In reply to Petr Skocik from comment #1)
> > Sidenote regarding the stack-allocating code for cases when the size is not
> > known to be less than pagesize: the code generated for those cases is quite
> > large. It could be replaced (at least under -Os) with a call to a special
> > assembly function that'd pop the return address (assuming the target machine
> > pushes return addresses to the stack), allocate adjust and allocate the
> > stack size in a piecemeal fashion so as to not skip guard pages, the repush
> > the return address and return to caller with the stacksize expanded.
> 
> You certainly don't want to kill the return stack the CPU has, even if it
> results in a few saved bytes for -Os.

That's a very interesting point  because I have written x86_64 assembly
"functions" that  did pop the return address, pushed something to the stack,
and then repushed the return address and returned. In a loop, it doesn't seem
to perform badly compared to inline code, so I figure it shouldn't be messing
with the return stack buffer. After all, even though the return happens through
a different place in the callstack, it's still returning to the original
caller. The one time I absolutely must have accidentally messed with the return
stack buffer was when I wrote context switching routine and originally tried to
"ret" to the new context. It turned out to be very measurably many times slower
that `pop %rcx; jmp *%rcx;` (also measured on a loop), so that's why I think
popping a return address, allocating on the stack, and then pushing and
returning is not really a performance killer (on my Intel CPU anyway). If it
was messing with the return stack buffer, I think would be getting  similar
slowdowns to what I got with context switching code trying to `ret`.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/107831] Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page
  2022-11-23 10:09 [Bug c/107831] New: Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page pskocik at gmail dot com
                   ` (5 preceding siblings ...)
  2022-11-23 21:27 ` pskocik at gmail dot com
@ 2022-11-24 19:16 ` pskocik at gmail dot com
  2022-11-24 19:31 ` jakub at gcc dot gnu.org
  2022-12-17 19:51 ` pskocik at gmail dot com
  8 siblings, 0 replies; 10+ messages in thread
From: pskocik at gmail dot com @ 2022-11-24 19:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107831

--- Comment #7 from Petr Skocik <pskocik at gmail dot com> ---
(In reply to Jakub Jelinek from comment #4)
> Say for
> void bar (char *);
> void
> foo (int x, int y)
> {
>   __attribute__((assume (x < 64)));
>   for (int i = 0; i < y; ++i)
>     bar (__builtin_alloca (x));
> }
> all the alloca calls are known to be small, yet they can quickly cross pages.
> Similarly:
> void
> baz (int x)
> {
>   if (x >= 512) __builtin_unreachable ();
>   char a[x];
>   bar (a);
>   char b[x];
>   bar (b);
>   char c[x];
>   bar (c);
>   char d[x];
>   bar (d);
>   char e[x];
>   bar (e);
>   char f[x];
>   bar (f);
>   char g[x];
>   bar (g);
>   char h[x];
>   bar (h);
>   char i[x];
>   bar (i);
>   char j[x];
>   bar (j);
> }
> All the VLAs here are small, yet together they can cross a page.
> So, we'd need to punt for dynamic allocations in loops and for others
> estimate
> the maximum size of all the allocations together (+ __builtin_alloca
> overhead + normal frame size).

I think this shouldn't need probes either (unless you tried to coalesce the
allocations) on architectures where making a function call touches the stack.
Also alloca's of less than or equal to half a page intertwined with writes
anywhere to the allocated blocks should be always safe (but I guess I'll just
turn stack-clash-protection off in the one file where I'm making such clearly
safe dynamic stack allocations).

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/107831] Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page
  2022-11-23 10:09 [Bug c/107831] New: Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page pskocik at gmail dot com
                   ` (6 preceding siblings ...)
  2022-11-24 19:16 ` pskocik at gmail dot com
@ 2022-11-24 19:31 ` jakub at gcc dot gnu.org
  2022-12-17 19:51 ` pskocik at gmail dot com
  8 siblings, 0 replies; 10+ messages in thread
From: jakub at gcc dot gnu.org @ 2022-11-24 19:31 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107831

--- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Alloca itself doesn't touch the stack on many architectures, and the code
doesn't have to have a function call in between.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Bug c/107831] Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page
  2022-11-23 10:09 [Bug c/107831] New: Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page pskocik at gmail dot com
                   ` (7 preceding siblings ...)
  2022-11-24 19:31 ` jakub at gcc dot gnu.org
@ 2022-12-17 19:51 ` pskocik at gmail dot com
  8 siblings, 0 replies; 10+ messages in thread
From: pskocik at gmail dot com @ 2022-12-17 19:51 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107831

--- Comment #9 from Petr Skocik <pskocik at gmail dot com> ---
Regarding the size of alloca/VLA-generated code under -fstack-clash-protection.
I've played with this a little bit and while I love the feature, the code size
increases seem quite significant and unnecessarily so.

Take a simple

void ALLOCA_C(size_t Sz){ char buf[Sz]; asm volatile ("" : : "r"(&buf[0])); }

gcc -fno-stack-clash-protection: 17 bytes
gcc -fstack-clash-protection: 72 bytes

clang manages with less of an increase:

-fno-stack-clash_protection: 26 bytes
-stack-clash-protection: 45 bytes

Still this could be as low as 11 bytes  for the -fclash-stack-protection
version (less than for the unprotected one!) all by using a simple call to an
assembly function, whose code can be no-clobber without much extra effort.

Linked in compiler explorer is a crack at the idea along with benchmarks: 
https://godbolt.org/z/f8rhG1ozs

The performance impact of the call seems negligible (practically less than 1ns,
though in the above quick-and-dirty benchmark it fluctuates a tiny bit,
sometimes even giving the non-inline version an edge).

I originally suggested popping the address of the stack and repushing before
calling returning. Ended up just repushing -- the old return address becomes
part of the alloca allocation. The concern that this could mess up the return
stack buffer of the CPU seems valid but all the benchmarks indicate it
doesn't--not even when the ret address is popped--just as long as the return
target address is the same. 

(When it isn't, the performance penalty is rather significant: measured a 19
times slowdown of that for comparison (it's also in the linked benchmarks)).

The (x86-64) assembly function:
#define STR(...) STR__(__VA_ARGS__) //{{{
#define STR__(...) #__VA_ARGS__ //}}}
asm(STR(
.global safeAllocaAsm;
safeAllocaAsm: //no clobber, though does expect 16-byte aligned at entry as
usual
    push %r10;
    cmp $16, %rdi;
ja .LsafeAllocaAsm__test32;
    push 8(%rsp);
    ret;
    .LsafeAllocaAsm__test32:
    push %r10;
    push %rdi;
    mov %rsp, %r10;
    sub $17, %rdi;
    and $-16, %rdi; //(-32+15)&(-16) //substract the 32 and 16-align, rounding
up
    jnz .LsafeAllocaAsm__probes;
.LsafeAllocaAsm__ret:
    lea (3*8)(%r10,%rdi,1), %rdi;
    push (%rdi);
    mov -8(%rdi), %r10;
    mov -16(%rdi), %rdi;
    ret;
.LsafeAllocaAsm__probes:
    sub %rdi, %r10;  //r10 is the desired rsp
    .LsafeAllocaAsm__probedPastDesiredSpEh:
    cmp %rsp, %r10; jge .LsafeAllocaAsm__pastDesiredSp;
    orl $0x0,(%rsp);
    sub $0x1000,%rsp;
    jmp .LsafeAllocaAsm__probedPastDesiredSpEh;
.LsafeAllocaAsm__pastDesiredSp:
    mov %r10, %rsp; //set the desired sp
    jmp .LsafeAllocaAsm__ret;
.size safeAllocaAsm, .-safeAllocaAsm;
));

Cheers, 
Petr Skocik

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-12-17 19:51 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-23 10:09 [Bug c/107831] New: Missed optimization: -fclash-stack-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page pskocik at gmail dot com
2022-11-23 12:34 ` [Bug c/107831] " pskocik at gmail dot com
2022-11-23 17:01 ` jakub at gcc dot gnu.org
2022-11-23 17:05 ` jakub at gcc dot gnu.org
2022-11-23 17:37 ` jakub at gcc dot gnu.org
2022-11-23 17:57 ` law at gcc dot gnu.org
2022-11-23 21:27 ` pskocik at gmail dot com
2022-11-24 19:16 ` pskocik at gmail dot com
2022-11-24 19:31 ` jakub at gcc dot gnu.org
2022-12-17 19:51 ` pskocik at gmail dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).