Sender: libc-alpha-owner@sourceware.org
Subject: Re: [RFC] nptl: change default stack guard size of threads
To: James Greenhalgh, Florian Weimer
Cc: Rich Felker, Szabolcs Nagy, GNU C Library, nd, Richard Earnshaw, Wilco Dijkstra
References: <5A1ECB40.9080801@arm.com> <76c38ecf-6497-c96c-5c8c-95cceed100a5@redhat.com> <5A1EFF28.9050406@arm.com> <5c796246-1907-8cf4-00fc-eee11614b092@redhat.com> <20171129205148.GG1627@brightrain.aerifal.cx> <00c123b5-dd46-6777-2c24-d80eae8d35df@redhat.com> <20171205105530.GA12966@arm.com>
From: Jeff Law
Date: Mon, 11 Dec 2017 23:49:00 -0000
In-Reply-To: <20171205105530.GA12966@arm.com>

On 12/05/2017 03:55 AM, James Greenhalgh wrote:
>>>
>>> Hm? What are you thinking of that GCC might have gotten wrong?
>>
>> Use 64 KiB probe intervals (almost) everywhere as an optimization. I
>> assumed the original RFC patch was motivated by that.
>>
>> I knew that ARM would be broken because that's what the gcc ARM
>> maintainers want. I assumed that it was restricted to that, but now I'm
>> worried that it's not.
>
> To be clear here, I'm coming in to the review of the probing support in GCC
> late, and with very little context on the design of the feature. I certainly
> wouldn't want to cause you worry - I've got no intention of pushing for
> optimization to a larger guard page size if it would leave things broken
> for AArch64.
No worries. Richard E. can give you the background on the AArch64 side
of things. I'll try to channel Richard's request. If I over-simplify
or mis-state something, Richard's view should be taken as "the truth".

AArch64 is at a bit of a disadvantage from an architectural standpoint
relative to x86 and at a disadvantage from an ABI standpoint relative to
PPC.

On x86 the calling sequence itself is an implicit probe as it pushes the
return address onto the stack. Thus except for a couple of oddball cases
we assume at function entry that *sp has been probed and thus an
attacker can't have pushed the stack pointer into the guard page.

On PPC we maintain a backchain pointer at *sp, so the mechanism is
different, but the net result again is that an attacker can't have
pushed the stack pointer into the guard page.

On s390 the caller typically allocates space for the callee to perform
register saves. So while the attacker could push the stack pointer into
the guard, callee register saves limit how far into the guard the stack
pointer can possibly be -- and the prologue code obviously knows where
those save areas are located.

The closest thing we have on aarch64 is that the caller must have saved
LR into its stack. But we have no idea of the difference between where
the caller saved LR and the value of the stack pointer in the callee.
Thus to fully protect AArch64 we would have to emit many more probes
than on other architectures because we have to make worst case
assumptions at function entry. This cost was deemed too high. The key
is that the outgoing argument space sits between the caller's LR slot
and the callee's frame.

So it was decided that certain requirements would be placed on the
caller and that the callee would be able to make certain assumptions WRT
whether or not the caller would write into the outgoing argument area.

After analysis (of spec2k6 I believe) and review of the kernel's
guarantees WRT the guard size it was decided that the size of the guard
on aarch64 would be 64k. That corresponds to a single page for a RHEL
kernel. It corresponds to 16 pages on a Fedora kernel.

The caller would be responsible for ensuring that it always would
write/probe within 1k of the limit of its stack. Thus the callee would
be able to allocate up to 63k without probing. This essentially brings
the cost of probing down into the noise on AArch64.

Once probing, we probe at 4k intervals. That fits nicely into the 12bit
shifted immediates available on aarch64. In theory a larger probing
interval would reduce the cost of probing, but you'd have to twiddle the
sequences in the target files to get a scratch register in places where
they don't right now.

We all agreed that there's a bit of a hole where unprotected code
calling protected code could leave the stack pointer somewhere in the
guard page on aarch64 and be a vector for attack. However, it's more
likely that in those scenarios the attacker has enough control on the
caller side that they'd just jump the guard in the caller.

So that's why things are the way they are. Again, if I've gotten
something wrong, I apologize to you and Richard :-)

>
> Likewise, I have no real desire for us to emit a bunch of extra operations
> if we're not required to for glibc.
Agreed. I think we all want probing to be low enough overhead that we
just enable it by default everywhere to get protected and nobody notices.

> If assuming that 64k probes are sufficient on AArch64 is not going to allow
> us a correct implementation, then we can't assume 64k probes on AArch64.
> My understanding was that we were safe in this as the kernel was giving us a
> generous 1MB to play with, and we could modify glibc to also give us 64k
> (I admit, I had not considered ILP32, where you've rightly pointed out we
> will eat lots of address space if we make this decision).
Richard E. explicitly took ILP32 off the table during our discussion. I
believe the conclusion was that if/when the time came, ARM would
address ILP32 independently. I don't offhand know of anything in the
current implementation that would break on ILP32 -- except for concerns
about eating up address space with large guards around thread stacks.

>
>>> GCC needs to emit probe intervals for the smallest supported page size
>>> on the target architecture. If it does not do that, we end up in
>>> trouble on the glibc side.
>
> This is where I may have a misunderstanding, why would it require probing
> at the smallest page size, rather than probing at a multiple of the guard
> size? It is very likely I'm missing something here as I don't know the glibc
> side of this at all.
I'm not sure where that statement comes from either. I guess the
concern is someone could boot a kernel with a smaller page size and
perhaps the kernel/glibc create their guards based on # pages rather
than absolute size. Thus booting a kernel with a smaller page size would
run code with less protection.

jeff

ps. I still need to address your questions/comments on the aarch64 GCC
bits. I'm keen to wrap that up for gcc-8 as it's the only target where
we've got support for custom stack clash protected prologues that hasn't
been merged yet.
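
To make the aarch64 numbers in this message concrete, here is a minimal
sketch in C of the arithmetic only. All names (GUARD_SIZE, CALLER_SLACK,
unprobed_budget, needs_probe, guard_bytes) are made up for illustration
and are not GCC, glibc or kernel identifiers. The "guard sized as a page
count" model is just the hypothetical concern described above, not a
statement of what the kernel or glibc actually do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Figures quoted in the message; the macro names are hypothetical. */
#define GUARD_SIZE   ((size_t)64 * 1024) /* guard size assumed on aarch64 */
#define CALLER_SLACK ((size_t)1 * 1024)  /* caller writes/probes within 1k
                                            of the limit of its stack */

/* Bytes a callee may allocate before it has to emit any probe,
   given the guard size and the slack the caller is allowed. */
static size_t unprobed_budget(size_t guard, size_t caller_slack)
{
  assert(caller_slack <= guard);
  return guard - caller_slack;
}

/* True if a frame of `frame` bytes exceeds the unprobed budget. */
static bool needs_probe(size_t frame)
{
  return frame > unprobed_budget(GUARD_SIZE, CALLER_SLACK);
}

/* ASSUMPTION: models a guard that is specified as a number of pages
   rather than as an absolute size in bytes. */
static size_t guard_bytes(size_t pages, size_t page_size)
{
  return pages * page_size;
}
```

Under these assumptions unprobed_budget(GUARD_SIZE, CALLER_SLACK) is 63k,
so needs_probe() is false for frames up to 63k. guard_bytes(16, 4096)
and guard_bytes(1, 65536) both give 64k, while a one-page guard on a
4k-page kernel gives only 4k, well below the 64k the compiled code would
be assuming.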