From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: [RFC] nptl: change default stack guard size of threads
To: James Greenhalgh <james.greenhalgh@arm.com>,
 Florian Weimer <fweimer@redhat.com>
Cc: Rich Felker <dalias@libc.org>, Szabolcs Nagy <Szabolcs.Nagy@arm.com>,
 GNU C Library <libc-alpha@sourceware.org>, nd <nd@arm.com>,
 Richard Earnshaw <Richard.Earnshaw@arm.com>,
 Wilco Dijkstra <Wilco.Dijkstra@arm.com>
References: <5A1ECB40.9080801@arm.com>
 <76c38ecf-6497-c96c-5c8c-95cceed100a5@redhat.com> <5A1EFF28.9050406@arm.com>
 <5c796246-1907-8cf4-00fc-eee11614b092@redhat.com>
 <20171129205148.GG1627@brightrain.aerifal.cx>
 <00c123b5-dd46-6777-2c24-d80eae8d35df@redhat.com>
 <20171205105530.GA12966@arm.com>
From: Jeff Law <law@redhat.com>
Message-ID: <b3576661-98d8-ce11-4c8c-54b3d732e6df@redhat.com>
Date: Mon, 11 Dec 2017 23:49:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.4.0
MIME-Version: 1.0
In-Reply-To: <20171205105530.GA12966@arm.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2017-12/txt/msg00349.txt.bz2

On 12/05/2017 03:55 AM, James Greenhalgh wrote:
>>>
>>> Hm? What are you thinking of that GCC might have gotten wrong?
>>
>> Use 64 KiB probe intervals (almost) everywhere as an optimization.  I 
>> assumed the original RFC patch was motivated by that.
>>
>> I knew that ARM would be broken because that's what the gcc ARM 
>> maintainers want.  I assumed that it was restricted to that, but now I'm 
>> worried that it's not.
> 
> To be clear here, I'm coming in to the review of the probing support in GCC
> late, and with very little context on the design of the feature. I certainly
> wouldn't want to cause you worry - I've got no intention of pushing for
> optimization to a larger guard page size if it would leave things broken
> for AArch64.
No worries.  Richard E. can give you the background on the AArch64 side
of things.  I'll try to channel Richard's request. If I over-simplify or
mis-state something, Richard's view should be taken as "the truth".

AArch64 is at a bit of a disadvantage from an architectural standpoint
relative to x86 and at a disadvantage from an ABI standpoint relative to
PPC.

On x86 the calling sequence itself is an implicit probe as it pushes the
return address onto the stack.  Thus except for a couple oddball cases
we assume at function entry that *sp has been probed and thus an
attacker can't have pushed the stack pointer into the guard page.

On PPC we maintain a backchain pointer at *sp, so the mechanism is
different, but the net result again is that an attacker can't have
pushed the stack pointer into the guard page.

On s390 the caller typically allocates space for the callee to perform
register saves.  So while the attacker could push the stack pointer into
the guard, callee register saves limit how far into the guard the stack
pointer can possibly be -- and the prologue code obviously knows where
those save areas are located.

The closest thing we have on aarch64 is that the caller must have saved
LR into its stack.  But we have no idea how far the caller's LR save
slot is from the value of the stack pointer in the callee.

Thus to fully protect AArch64 we would have to emit many more probes
than on other architectures because we have to make worst case
assumptions at function entry.  This cost was deemed too high.

The key is that the outgoing argument space sits between the caller's LR
slot and the callee's frame.  So it was decided that certain
requirements would be placed on the caller and that the callee would be
able to make certain assumptions WRT whether or not the caller would
write into the outgoing argument area.

After analysis (of spec2k6 I believe) and review of the kernel's
guarantees WRT the guard size it was decided that the size of the guard
on aarch64 would be 64k.  That corresponds to 1 page on a RHEL kernel
(64k pages) and to 16 pages on a Fedora kernel (4k pages).

The caller would be responsible for ensuring that it always would
write/probe within 1k of the limit of its stack.  Thus the callee would
be able to allocate up to 63k without probing.  This essentially brings
the cost of probing down into the noise on AArch64.
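
To make the arithmetic concrete, here is a minimal sketch in C.  It is
not GCC's implementation; the names GUARD_SIZE, CALLER_SLACK,
max_unprobed_frame and frame_needs_probes are made up for illustration,
and treating "up to 63k" as inclusive is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values taken from the discussion above; the names are hypothetical. */
#define GUARD_SIZE   ((size_t)64 * 1024)  /* guard size assumed on aarch64 */
#define CALLER_SLACK ((size_t)1 * 1024)   /* caller wrote/probed within 1k */

/* Largest frame a callee could allocate without emitting a probe. */
size_t max_unprobed_frame(void)
{
    assert(CALLER_SLACK < GUARD_SIZE);
    return GUARD_SIZE - CALLER_SLACK;
}

/* True when the frame is larger than the unprobed allowance.
   Assumption: a frame of exactly 63k still needs no probe. */
bool frame_needs_probes(size_t frame_size)
{
    return frame_size > max_unprobed_frame();
}
```

Under these assumptions the vast majority of functions, whose frames are
far smaller than 63k, would take the no-probe path.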

Once probing, we probe at 4k intervals.  That fits nicely into the
12-bit shifted immediates available on aarch64.  In theory a larger
probing interval would reduce the cost of probing, but you'd have to
twiddle the sequences in the target files to get a scratch register in
places where they don't have one right now.
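
As a rough illustration of probing at a fixed 4k interval, the sketch
below lists the offsets below the incoming stack pointer at which a
probe would land.  This is a simplification and not the actual prologue
sequence: it assumes one probe per full 4k step and ignores both the
residual allocation and the unprobed allowance.  PROBE_INTERVAL and
probe_offsets are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* 4k probe interval, i.e. 1 shifted left by 12 bits. */
#define PROBE_INTERVAL ((size_t)1 << 12)

/* Store up to `cap` probe offsets (bytes below the starting sp) for an
   allocation of `alloc_size` bytes.  One probe per full interval;
   returns the number of offsets written. */
size_t probe_offsets(size_t alloc_size, size_t *out, size_t cap)
{
    assert(out != NULL || cap == 0);

    size_t total = alloc_size / PROBE_INTERVAL;
    size_t n = 0;
    for (size_t i = 0; i < total && n < cap; i++)
        out[n++] = (i + 1) * PROBE_INTERVAL;
    return n;
}
```

With a larger interval the same loop would run fewer iterations, which
is the cost reduction mentioned above.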

We all agreed that there's a bit of a hole where unprotected code
calling protected code could leave the stack pointer somewhere in the
guard page on aarch64 and be a vector for attack.  However, it's more
likely that in those scenarios the attacker has enough control on the
caller side that they'd just jump the guard in the caller.

So that's why things are the way they are.  Again, if I've gotten
something wrong, I apologize to you and Richard :-)

> 
> Likewise, I have no real desire for us to emit a bunch of extra operations
> if we're not required to for glibc.
Agreed.  I think we all want probing to be low enough overhead that we
just enable it by default everywhere to get protected and nobody notices.

> If assuming that 64k probes are sufficient on AArch64 is not going to allow
> us a correct implementation, then we can't assume 64k probes on AArch64. My
> understanding was that we were safe in this as the kernel was giving us a
> generous 1MB to play with, and we could modify glibc to also give us 64k
> (I admit, I had not considered ILP32, where you've rightly pointed out we
> will eat lots of address space if we make this decision).
Richard E. explicitly took ILP32 off the table during our discussion.  I
believe the conclusion was that if/when the time came, ARM would
address ILP32 independently.  I don't offhand know of anything in the
current implementation that would break on ILP32 -- except for concerns
about eating up address space with large guards around thread stacks.


> 
>>> GCC needs to emit probe intervals for the smallest supported page size 
>>> on the target architecture.  If it does not do that, we end up in 
>>> trouble on the glibc side.
> 
> This is where I may have a misunderstanding, why would it require probing
> at the smallest page size, rather than probing at a multiple of the guard
> size? It is very likely I'm missing something here as I don't know the glibc
> side of this at all.
I'm not sure where that statement comes from either.  I guess the
concern is someone could boot a kernel with a smaller page size and
perhaps the kernel/glibc create their guards based on # pages rather
than absolute size.  Thus booting a kernel with a smaller pagesize would
leave code with less protection.
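
A small sketch of that concern, purely as an illustration: I don't know
whether the kernel or glibc actually size their guards this way, and
all of the function names below are hypothetical.  It compares a guard
defined as a page count with one defined as an absolute size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Guard defined as a page count: its size shrinks with the page size. */
size_t guard_bytes_from_pages(size_t pages, size_t page_size)
{
    return pages * page_size;
}

/* Guard defined as an absolute size, rounded up to whole pages. */
size_t guard_bytes_from_size(size_t wanted, size_t page_size)
{
    assert(page_size != 0);
    return (wanted + page_size - 1) / page_size * page_size;
}

/* Code compiled to assume `assumed` bytes of guard is only covered
   when the guard actually provided is at least that large. */
bool guard_covers(size_t actual, size_t assumed)
{
    return actual >= assumed;
}
```

For example, a one-page guard is 64k with 64k pages but only 4k with 4k
pages, so code built on a 64k assumption would no longer be covered; a
guard requested as 64k absolute stays at 64k (16 pages of 4k).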

jeff

ps.  I still need to address your questions/comments on the aarch64 GCC
bits.  I'm keen to wrap that up for gcc-8 as it's the only target where
we've got support for custom stack clash protected prologues that hasn't
been merged yet.