Subject: [Bug c/107831] Missed optimization: -fstack-clash-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page
From: pskocik at gmail dot com
Date: Sat, 17 Dec 2022 19:51:16 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107831

--- Comment #9 from Petr Skocik ---
Regarding the size of alloca/VLA-generated code under
-fstack-clash-protection: I've played with this a little, and while I love
the feature, the code-size increases seem quite significant, and
unnecessarily so. Take a simple

  #include <stddef.h>
  void ALLOCA_C(size_t Sz) { char buf[Sz]; asm volatile("" : : "r"(&buf[0])); }

  gcc   -fno-stack-clash-protection: 17 bytes
  gcc   -fstack-clash-protection:    72 bytes

clang manages with less of an increase:

  clang -fno-stack-clash-protection: 26 bytes
  clang -fstack-clash-protection:    45 bytes

Still, this could be as low as 11 bytes for the -fstack-clash-protection
version (less than for the unprotected one!), all by using a simple call to
an assembly function whose code can be made no-clobber without much extra
effort.

Linked in Compiler Explorer is a crack at the idea, along with benchmarks:
https://godbolt.org/z/f8rhG1ozs

The performance impact of the call seems negligible: practically less than
1 ns, though in the above quick-and-dirty benchmark it fluctuates a tiny
bit, sometimes even giving the non-inline version an edge.

I originally suggested popping the return address off the stack and
repushing it before returning. I ended up just repushing -- the old return
address becomes part of the alloca allocation. The concern that this could
mess up the return stack buffer of the CPU seems valid, but all the
benchmarks indicate it doesn't -- not even when the ret address is popped --
as long as the return target address is the same. (When it isn't, the
performance penalty is rather significant: I measured a 19x slowdown for
that case; it's also in the linked benchmarks.)
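For scale, the matching call site is tiny. A hypothetical caller-side
sequence (a sketch: the %rbx/%rax choices are illustrative; only the size
request in %rdi, the no-clobber behavior, and the allocation ending up at
the new stack top come from the function below) could be:

    mov  %rbx, %rdi        # dynamic size, e.g. a VLA bound
    call safeAllocaAsm     # moves %rsp down, probing if needed
    mov  %rsp, %rax        # the new block sits at the new stack top

That's 3 + 5 + 3 = 11 bytes of code, which is where a call-site figure like
the 11 bytes above can come from.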
The (x86-64) assembly function:

#define STR(...) STR__(__VA_ARGS__)
#define STR__(...) #__VA_ARGS__
asm(STR(
.global safeAllocaAsm;
safeAllocaAsm: //no clobber, though it does expect 16-byte alignment at entry, as usual
    push %r10;                    //spill %r10 (the slot becomes part of the allocation)
    cmp $16, %rdi;
    ja .LsafeAllocaAsm__test32;
    push 8(%rsp);                 //size <= 16: re-push the return address and
    ret;                          //return through the copy; the %r10 spill and the
                                  //old return-address slot form the 16-byte allocation
.LsafeAllocaAsm__test32:
    push %r10;
    push %rdi;
    mov %rsp, %r10;
    sub $17, %rdi;                //(size-32+15) & -16: subtract the 32 bytes the
    and $-16, %rdi;               //pushes already provide and 16-align, rounding up
    jnz .LsafeAllocaAsm__probes;
.LsafeAllocaAsm__ret:
    lea (3*8)(%r10,%rdi,1), %rdi; //address of the original return-address slot
    push (%rdi);                  //re-push the return address at the new stack top
    mov -8(%rdi), %r10;           //restore %r10
    mov -24(%rdi), %rdi;          //restore %rdi (pushed third, 24 below the slot)
    ret;
.LsafeAllocaAsm__probes:
    sub %rdi, %r10;               //%r10 is the desired %rsp
.LsafeAllocaAsm__probedPastDesiredSpEh:
    cmp %rsp, %r10;
    jge .LsafeAllocaAsm__pastDesiredSp;
    orl $0x0, (%rsp);             //touch the current page, then step down one
    sub $0x1000, %rsp;            //page and re-check, so the guard page is
                                  //always hit before it could be skipped
    jmp .LsafeAllocaAsm__probedPastDesiredSpEh;
.LsafeAllocaAsm__pastDesiredSp:
    mov %r10, %rsp;               //set the desired sp
    jmp .LsafeAllocaAsm__ret;
.size safeAllocaAsm, .-safeAllocaAsm;
));

Cheers,
Petr Skocik