Subject: [Bug c/107831] Missed optimization: -fstack-clash-protection causes unnecessary code generation for dynamic stack allocations that are clearly less than a page
From: pskocik at gmail dot com
Date: Sat, 17 Dec 2022 19:51:16 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107831

--- Comment #9 from Petr Skocik ---
Regarding the size of alloca/VLA-generated code under
-fstack-clash-protection: I've played with this a little, and while I love
the feature, the code-size increases seem quite significant, and
unnecessarily so. Take a simple

  #include <stddef.h>
  void ALLOCA_C(size_t Sz) { char buf[Sz]; asm volatile("" : : "r"(&buf[0])); }

  gcc   -fno-stack-clash-protection: 17 bytes
  gcc   -fstack-clash-protection:    72 bytes

clang manages with less of an increase:

  clang -fno-stack-clash-protection: 26 bytes
  clang -fstack-clash-protection:    45 bytes

Still, this could be as low as 11 bytes for the -fstack-clash-protection
version (less than for the unprotected one!), all by using a simple call to
an assembly function whose code can be made no-clobber without much extra
effort.

Linked in Compiler Explorer is a crack at the idea, along with benchmarks:
https://godbolt.org/z/f8rhG1ozs

The performance impact of the call seems negligible: practically less than
1 ns, though in the above quick-and-dirty benchmark it fluctuates a tiny
bit, sometimes even giving the non-inline version an edge.

I originally suggested popping the return address off the stack and
repushing it before returning. I ended up just repushing -- the old return
address becomes part of the alloca allocation. The concern that this could
mess up the return stack buffer of the CPU seems valid, but all the
benchmarks indicate it doesn't -- not even when the ret address is popped --
as long as the return target address is the same. (When it isn't, the
performance penalty is rather significant: I measured a 19x slowdown for
that case; it's also in the linked benchmarks.)
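For scale, the matching call site is tiny. A hypothetical caller-side
sequence (a sketch: the %rbx/%rax choices are illustrative; only the size
request in %rdi, the no-clobber behavior, and the allocation ending up at
the new stack top come from the function below) could be:

    mov  %rbx, %rdi        # dynamic size, e.g. a VLA bound
    call safeAllocaAsm     # moves %rsp down, probing if needed
    mov  %rsp, %rax        # the new block sits at the new stack top

That's 3 + 5 + 3 = 11 bytes of code, which is where a call-site figure like
the 11 bytes above can come from.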
The (x86-64) assembly function:

#define STR(...) STR__(__VA_ARGS__)
#define STR__(...) #__VA_ARGS__
asm(STR(
.global safeAllocaAsm;
safeAllocaAsm: //no clobber, though it does expect 16-byte alignment at entry, as usual
    push %r10;                    //spill %r10 (the slot becomes part of the allocation)
    cmp $16, %rdi;
    ja .LsafeAllocaAsm__test32;
    push 8(%rsp);                 //size <= 16: re-push the return address and
    ret;                          //return through the copy; the %r10 spill and the
                                  //old return-address slot form the 16-byte allocation
.LsafeAllocaAsm__test32:
    push %r10;
    push %rdi;
    mov %rsp, %r10;
    sub $17, %rdi;                //(size-32+15) & -16: subtract the 32 bytes the
    and $-16, %rdi;               //pushes already provide and 16-align, rounding up
    jnz .LsafeAllocaAsm__probes;
.LsafeAllocaAsm__ret:
    lea (3*8)(%r10,%rdi,1), %rdi; //address of the original return-address slot
    push (%rdi);                  //re-push the return address at the new stack top
    mov -8(%rdi), %r10;           //restore %r10
    mov -24(%rdi), %rdi;          //restore %rdi (pushed third, 24 below the slot)
    ret;
.LsafeAllocaAsm__probes:
    sub %rdi, %r10;               //%r10 is the desired %rsp
.LsafeAllocaAsm__probedPastDesiredSpEh:
    cmp %rsp, %r10;
    jge .LsafeAllocaAsm__pastDesiredSp;
    orl $0x0, (%rsp);             //touch the current page, then step down one
    sub $0x1000, %rsp;            //page and re-check, so the guard page is
                                  //always hit before it could be skipped
    jmp .LsafeAllocaAsm__probedPastDesiredSpEh;
.LsafeAllocaAsm__pastDesiredSp:
    mov %r10, %rsp;               //set the desired sp
    jmp .LsafeAllocaAsm__ret;
.size safeAllocaAsm, .-safeAllocaAsm;
));

Cheers,
Petr Skocik