Subject: Re: RFC: stack/heap collision vulnerability and mitigation with GCC
To: "Richard Earnshaw (lists)" <Richard.Earnshaw@arm.com>, gcc-patches <gcc-patches@gcc.gnu.org>
References: <bef46e40-8004-0f80-4928-ad0795eb76ba@redhat.com> <9dbdb66f-a9ec-c04a-8d83-e1597213e2da@arm.com> <6a46678a-01b7-50e8-c5e1-65ca25a5f662@redhat.com> <a20df538-c347-e2ab-038d-26064e2789ee@arm.com> <0afa3ebe-3a80-9324-c107-e1e39e747730@redhat.com> <ad0fb64c-6b0d-b3da-d01d-34d4d62385ee@arm.com>
From: Jeff Law <law@redhat.com>
Message-ID: <c05f961a-2023-bcf6-afa9-18870dd4f0e0@redhat.com>
Date: Thu, 22 Jun 2017 15:30:00 -0000
In-Reply-To: <ad0fb64c-6b0d-b3da-d01d-34d4d62385ee@arm.com>

On 06/22/2017 03:53 AM, Richard Earnshaw (lists) wrote:
> On 21/06/17 18:25, Jeff Law wrote:
>> On 06/21/2017 02:41 AM, Richard Earnshaw (lists) wrote:
>>
>>>> But the stack pointer might have already been advanced into the guard
>>>> page by the caller.   For the sake of argument assume the guard page is
>>>> 0xf1000 and assume that our stack pointer at entry is 0xf1010 and that
>>>> the caller hasn't touched the 0xf1000 page.
>>>
>>> Then make sure the caller does touch the 0xf1000 page.  If it's
>>> allocated that much stack it should be forced to do the probe and not
>>> rely on all its children having to do it because it can't be bothered.
>> That needs to be mandated at the ABI level if it's going to happen.  The
>> threat model assumes that the caller adheres to the ABI, but was not
>> necessarily compiled with -fstack-check.
> 
> The base ABI would never mandate stack probes.  Some systems may locate
> the stack in such a way that it can never collide with the heap, making
> guard pages and probes completely unnecessary (but perhaps at the
> expense of limiting the theoretical maximum stack size).  It might say
> "if you do stack probes, use this model", but even then it would need to
> parameterize the whole model as there are just too many OS configuration
> options to consider (page size, size of guard zone, for example).
I'm not suggesting mandating stack probes in the ABI, but that certain
changes to the ABI can be made which would in turn make stack probing
far more efficient.

For example the PPC ABI mandates that *sp hold the outer frame address.
That turns out to be amazingly useful from a probing standpoint -- we
always know that *sp was written.  That knowledge allows us to eliminate
98.5% of the prologue probing for glibc on PPC.

I'm not suggesting you do exactly the same thing, but I do think there
are things you could do at the ABI level which would drastically improve
the generated code when stack probing is enabled and which would have
minimal cost.


> 
>>
>> I'm all for making the common path fast and letting the uncommon cases
>> pay additional penalties.  That mindset has driven the work-to-date.
>>
>> But I don't think I have the liberty to change existing ABIs to
>> facilitate lower overhead approaches.  But I think ARM does, given it
>> owns the ABI for aarch64 and I would happily exploit whatever guarantees
>> we can derive from an updated ABI.
>>
>> So if you want the caller to touch the page, you need to amend the ABI.
>> I'd think touching the lowest address of the alloca area and outgoing
>> args, if large, would be sufficient.
>>
>>
> 
> I can't help but feel there's a bit of a goode olde mediaeval witch hunt
> going on here.  As Wilco points out, we can never defend against a
> function that is built without probe operations but skips the entire
> guard zone.  The only defence there is a larger guard zone, but how big
> do you make it?
No witchhunt at all.

It just happens to be the case that x86 hits *sp when it stores the
return pointer and that ppc always stores the backchain into *sp when it
allocates additional stack space.  As a result on those targets we know
the offset between the stack pointer and the most recent probe is zero
at the start of the callee's prologue.  That allows us to avoid the vast
majority of explicit probes.

aarch64's ABI and ISA don't provide us with any such guarantees, so we
have to make very conservative assumptions, which leads to much more
explicit probing.
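
As a rough illustration of why the known offset matters, here is a toy
model in C.  It is a sketch under stated assumptions, not what GCC
actually emits: the function name, its parameters and the rule "one
probe per guard-sized stretch" are all hypothetical simplifications.

```c
#include <stddef.h>

/*
 * Toy model only (hypothetical, not GCC's implementation).
 *
 * known_offset: bytes between the incoming stack pointer and the most
 *               recent address known to have been touched.  This is 0
 *               when *sp is known to have been written at entry; a
 *               target without that guarantee must assume more.
 * frame_size:   bytes the callee's prologue allocates.
 * guard_size:   size of the guard region (must be nonzero).  The model
 *               never lets a stretch this large go untouched.
 *
 * Returns how many explicit probes the prologue needs in this model.
 */
static size_t explicit_probes(size_t known_offset, size_t frame_size,
                              size_t guard_size)
{
    return (known_offset + frame_size) / guard_size;
}
```

In this model, with a 4096-byte guard and a 256-byte frame, a known
offset of 0 needs no probe at all, while a worst-case assumed offset of
4095 forces one probe into every such prologue.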

s390 is in a similar situation to aarch64 in that it has to make
worst-case assumptions at prologue entry in the callee.  I suspect many
others will be too (but I haven't investigated each architecture as I've
been focused strictly on RHEL targets).


> 
> So we can design a half-way house probing scheme which doesn't really
> solve the problem and is perhaps so expensive that most people will turn
> it off.  Or we can design something that addresses the problem
> scientifically if applied everywhere and has almost zero impact on the
> code size or performance.  The latter would probably be so cheap that
> most people would never notice that it was even on at all.  Yes, you'd
> need a system recompile to deploy it in full, but even a fairly limited
> rebuild of critical libraries (libc, libstdc++) would help.
> 
> Whichever route we take, this wouldn't be an ABI break.  New code will
> still interoperate with old code; you just don't get full heap
> protection if you mix old and new code.

As the port maintainers I think you have significant say in how this
plays out.  You could (for example) say you simply aren't going to
worry about cases where there's more than 1k of outgoing argument space,
or where the caller has a large alloca and wasn't compiled with
-fstack-check, and set the initial offset to 1k.

That would provide a great deal of coverage, but still leaves folks
using the aarch64 port vulnerable in certain corner cases.
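
One way to read that trade-off is as a predicate over callers.  The
sketch below is only an interpretation: the struct, its field names and
the choice to treat "large alloca" as "larger than 1k" are assumptions
made for illustration, not anything the port defines.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed initial offset from the example above: 1k. */
#define ASSUMED_INITIAL_OFFSET 1024u

/* Hypothetical description of a caller; all fields are illustrative. */
struct caller_info {
    bool   built_with_probing;   /* caller was compiled with probing on */
    size_t outgoing_args_bytes;  /* size of its outgoing argument area  */
    size_t largest_alloca_bytes; /* 0 if the caller never uses alloca   */
};

/*
 * Returns true if, under this reading, a callee that assumes a 1k
 * initial offset is still protected when called from c.  Treating a
 * "large" alloca as one above 1k is an assumption of this sketch.
 */
static bool covered_by_1k_assumption(const struct caller_info *c)
{
    if (c->built_with_probing)
        return true;
    return c->outgoing_args_bytes <= ASSUMED_INITIAL_OFFSET
        && c->largest_alloca_bytes <= ASSUMED_INITIAL_OFFSET;
}
```

Callers for which this returns false are the corner cases that would
remain exposed.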

Red Hat would have to look at that carefully and decide to either leave
customers vulnerable to the corner cases or pay the penalty of getting
full protection.  I honestly do not know where we'd land on that
question, but we'd certainly have to evaluate the pros and cons.

Or you could go with something like PPC64 and mandate something in the
ABI.  For example, you could mandate that the last word of dynamically
allocated stack space must be written to, and that the last word of the
outgoing args space must be written to if the outgoing args space is
larger than 1k.  Note this would remain compatible with existing code,
since you're just forcing the caller to write into those areas at
allocation time.  The onus is on the uncommon code (allocas and really
big outgoing args) to pay a tiny penalty, which lets the common code
(small frames in callee) run fast.
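
At the source level, the caller-side rule for dynamic allocations might
look roughly like the sketch below.  This is only an illustration: in a
real ABI change the compiler would emit the store itself (and would
store a full word), the function name is hypothetical, and <alloca.h>
assumes a glibc-style system with a downward-growing stack.

```c
#include <alloca.h>   /* glibc; other systems declare alloca elsewhere */
#include <stddef.h>
#include <string.h>

/* Sum n bytes after copying them into dynamically allocated stack space. */
size_t sum_via_stack_copy(const unsigned char *src, size_t n)
{
    size_t total = 0;
    unsigned char *buf = alloca(n ? n : 1);

    /*
     * The one extra store the proposed rule would require.  On a
     * downward-growing stack, alloca returns the lowest address of the
     * new area, so this touches the far end of the allocation.  A byte
     * store stands in for the word store an ABI would specify.
     */
    *(volatile unsigned char *)buf = 0;

    memcpy(buf, src, n);
    for (size_t i = 0; i < n; i++)
        total += buf[i];
    return total;
}
```

The cost is a single store next to the existing alloca setup; a callee
entered afterwards can then rely on the area just above its incoming
stack pointer having been touched.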

I suspect doing something like that would probably eliminate more than
90% of the explicit probes in glibc with minimal cost.  Let's face it, a
single write into alloca space is cheap compared to just the setup for
alloca, and a write into the outgoing args for >1k outgoing args isn't
likely to happen often enough to ever matter.

And something like that is 100% compatible with existing code.  You can
mix and match freely.  Obviously non-compliant code has a vulnerability,
but as compliant compilers get into the wild, the amount of non-compliant
code would quickly drop.


Jeff