From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH][GCC][AArch64] Updated stack-clash implementation supporting 64k probes. [patch (1/6)]
From: Jeff Law
To: Tamar Christina
Cc: "gcc-patches@gcc.gnu.org", nd, James Greenhalgh, Richard Earnshaw, Marcus Shawcroft
Date: Fri, 13 Jul 2018 16:46:00 -0000
Message-ID: <4dcce101-0b0c-c2df-414a-94880dbeb8b2@redhat.com>
In-Reply-To: <20180712173958.GA24275@arm.com>
References: <20180711112037.GA15738@arm.com> <0321b0d1-b768-87eb-3573-70547a1c3595@redhat.com> <20180712173958.GA24275@arm.com>
X-SW-Source: 2018-07/txt/msg00736.txt.bz2

On 07/12/2018 11:39 AM, Tamar Christina wrote:
>>> +
>>> +  /* Round size to the nearest multiple of guard_size, and calculate the
>>> +     residual as the difference between the original size and the rounded
>>> +     size.  */
>>> +  HOST_WIDE_INT rounded_size = size & -guard_size;
>>> +  HOST_WIDE_INT residual = size - rounded_size;
>>> +
>>> +  /* We can handle a small number of allocations/probes inline.  Otherwise
>>> +     punt to a loop.  */
>>> +  if (rounded_size <= STACK_CLASH_MAX_UNROLL_PAGES * guard_size)
>>> +    {
>>> +      for (HOST_WIDE_INT i = 0; i < rounded_size; i += guard_size)
>>> +        {
>>> +          aarch64_sub_sp (NULL, temp2, guard_size, true);
>>> +          emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
>>> +                                           STACK_CLASH_CALLER_GUARD));
>>> +        }
>> So the only concern with this code is that it'll be inefficient and
>> possibly incorrect for probe sizes larger than ARITH_FACTOR.
>> Ultimately, that's a case I don't think we really care that much about.
>> I wouldn't lose sleep if the port clamped the requested probing interval
>> at ARITH_FACTOR.
>>
> I'm a bit confused here; ARITH_FACTOR seems to have to do with the Ada
> stack probing implementation, which isn't used by this new code aside
> from the part that emits the actual probe when doing a variable or large
> allocation in aarch64_output_probe_stack_range.
>
> Clamping the probing interval at ARITH_FACTOR means we can't do 64KB
> probing intervals.

It may have been a misunderstanding on my part.  My understanding is that
ARITH_FACTOR represents the largest immediate constant we could handle in
this code using a single insn.  Anything above ARITH_FACTOR needed a
scratch register, and I couldn't convince myself that we had a suitable
scratch register available.

But I'm really not well versed on the aarch64 architecture or the various
implementation details in aarch64.c, so I'm happy to defer to you and
others @ ARM on what's best to do here.

>> That can be problematic in a couple of cases.  First, it can run afoul
>> of combine-stack-adjustments.  Essentially that pass will combine a
>> series of stack adjustments into a single insn, so your unrolled probing
>> turns into something like this:
>>
>>   multi-page stack adjust
>>   probe first page
>>   probe second page
>>   probe third page
>>   and so on...
>>
> This is something we actually do want, particularly in the case of 4KB
> pages, as the immediates would fit in the store.  It's one of the things
> we were thinking about for future improvements.
>
>> That violates the design constraint that we never allocate more than a
>> page at a time.
> Would there be a big issue here if we didn't adhere to this constraint?

Yes, because it enlarges the window for exploitation.  Consider signal
delivery occurring after the adjustment but before the probes.  The signal
handler could then be running on a clashed heap/stack.

>
>> Do you happen to have a libc.so and ld.so compiled with this code?  I've
>> got a scanner here which will look at a DSO and flag obviously invalid
>> stack allocations.
>> If you've got a suitable libc.so and ld.so, I can run them through the
>> scanner.
>>
>> Along similar lines, have you run the glibc testsuite with this stuff
>> enabled by default?  That proved useful to find codegen issues,
>> particularly with the register inheritance in the epilogue.
>>
> I'm running one now; I'll send the two binaries once I get the results
> back from the run.  Thanks!

Great.  Looking forward to getting those .so files so I can throw them
into the scanner.

>>> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
>>> index 830f97603487d6ed07c4dc854a7898c4d17894d1..d1ed54c7160ab682c78d5950947608244d293025 100644
>>> --- a/gcc/config/aarch64/aarch64.md
>>> +++ b/gcc/config/aarch64/aarch64.md
>>> @@ -3072,7 +3072,7 @@
>>>
>>>  (define_insn "cmp<mode>"
>>>    [(set (reg:CC CC_REGNUM)
>>> -        (compare:CC (match_operand:GPI 0 "register_operand" "r,r,r")
>>> +        (compare:CC (match_operand:GPI 0 "register_operand" "rk,r,r")
>>>                      (match_operand:GPI 1 "aarch64_plus_operand" "r,I,J")))]
>>>    ""
>>>    "@
>> I don't think I needed this, but I can certainly understand how it might
>> be necessary.  The only one I know we needed was the probe_stack_range
>> constraint change.
>>
> It's not strictly needed, but it allows us to do the comparison with the
> stack pointer in the loop without needing to emit sp to a temporary first.
> So it's just a tiny optimization. :)

Understood.  A similar tweak for cmp was done in a patch from Richard E in
his patches for spectre v1 mitigation.  I'm certainly not objecting :-)

Jeff
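[Editorial note: the point of the "rk" constraint change can be made concrete. The instruction sequences below are a hand-written sketch, not compiler output, and the register numbers are arbitrary: with operand 0 constrained only as "r", the stack pointer is not a valid input to the cmp pattern and must be copied to a scratch register first; with "rk" (the "k" constraint matches sp on AArch64), the comparison can name sp directly.]

```
	/* Sketch only -- not actual compiler output.  x10/x11 arbitrary.  */

	/* Operand 0 constrained "r": sp must be moved to a scratch first.  */
	mov	x10, sp
	cmp	x10, x11	// compare the copy of sp against the loop limit
	b.ne	.Lloop

	/* Operand 0 constrained "rk": sp is accepted directly.  */
	cmp	sp, x11		// one insn, no scratch register needed
	b.ne	.Lloop
```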