public inbox for gcc-patches@gcc.gnu.org
* RFC: stack/heap collision vulnerability and mitigation with GCC
@ 2017-06-19 17:07 Jeff Law
  2017-06-19 17:29 ` Jakub Jelinek
                   ` (4 more replies)
  0 siblings, 5 replies; 66+ messages in thread
From: Jeff Law @ 2017-06-19 17:07 UTC (permalink / raw)
  To: gcc-patches

As some of you are likely aware, Qualys has just published fairly
detailed information on using stack/heap clashes as an attack vector.
Eric B, Michael M -- sorry I couldn't say more when I contacted you about
-fstack-check and some PPC specific stuff.  This has been under embargo
for the last month.


--


http://www.openwall.com/lists/oss-security/2017/06/19/1


Obviously various vulnerabilities pointed out in that advisory are being
mitigated, particularly those found within glibc.  But those are really
just scratching the surface of this issue.

At its core, this chained attack relies first upon using various
techniques to bring the stack and heap close together.  Then the
exploits rely on large stack allocations to "jump the guard".  Once the
guard has been jumped, the stack and heap have collided and all hell
breaks loose.

The "jump the guard" step can be mitigated with help from the compiler.
We just have to ensure that as we allocate chunks of stack space that we
touch each allocated page.  That ensures that the guard page is hit.
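
As a concrete illustration, here's the idea written out by hand in C
(a minimal sketch; in practice the compiler emits the equivalent probes
in the prologue, and the 4k probe interval is an assumption):

  #define PROBE_INTERVAL 4096   /* assumed page/probe size */

  void
  big_frame (void)
  {
    volatile char buf[64 * 1024];       /* 16 pages, larger than the guard */
    unsigned long off;

    /* Touch one byte in every page of the allocation so no access
       can land beyond the guard page without first hitting it.  */
    for (off = 0; off < sizeof buf; off += PROBE_INTERVAL)
      buf[off] = 0;

    /* ... real work using buf ...  */
  }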

This sounds a whole lot like -fstack-check and initially that's what
folks were hoping could be used to eliminate this class of problems.

--

Unfortunately, -fstack-check is actually not well suited for our purposes.

Some background.  -fstack-check was designed primarily for Ada's needs.
It assumes the whole program is compiled with -fstack-check and it is
designed to ensure there is enough stack space left so that if the
program hits the guard (say via infinite recursion) the program can
safely call into a signal handler and raise an exception.

To ensure there's always enough space to meet that design requirement,
-fstack-check probes stack space ahead of the actual need of the code.

The assumption that all code was compiled with -fstack-check allows for
elision of some stack probes as they are assumed to have been probed by
earlier callers in the call chain.  This elision is safe in an
environment where all callers use -fstack-check, but fatally flawed in a
mixed environment.

Most ports first probe by pages for whatever space is requested, then
after all probing is done, they actually allocate space.  This runs
afoul of valgrind in various unpleasant ways (including crashing
valgrind on two targets).

Only x86-linux currently uses a "moving sp" allocation and probing
strategy.  ie, it actually allocates space, then probes the space.
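
To make the ordering difference concrete, here is a small hand-written C
model of the two approaches (illustrative only -- the function names are
made up, "sp" stands in for the downward-growing stack pointer, and the
4k probe interval is an assumption):

  #define PROBE_INTERVAL 4096

  /* Probe-then-allocate, as most ports do today: the probes land
     below sp before any of that space has been allocated, which is
     exactly what valgrind objects to.  */
  char *
  probe_then_allocate (char *sp, long size)
  {
    long off;

    for (off = PROBE_INTERVAL; off <= size; off += PROBE_INTERVAL)
      sp[-off] = 0;                 /* probe ahead of the allocation */
    return sp - size;               /* allocate everything at once */
  }

  /* Moving-sp: allocate one page, then probe within the space that
     has just been allocated, and repeat.  */
  char *
  moving_sp (char *sp, long size)
  {
    while (size >= PROBE_INTERVAL)
      {
        sp -= PROBE_INTERVAL;       /* allocate a page */
        *sp = 0;                    /* probe it */
        size -= PROBE_INTERVAL;
      }
    return sp - size;               /* sub-page residual */
  }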

--

After much poking around I concluded that we really need to implement
allocation and probing via a "moving sp" strategy.   Probing into
unallocated areas runs afoul of valgrind, so that's a non-starter.

Allocating stack space, then probing the pages within the space is
vulnerable to async signal delivery between the allocation point and the
probe point.  If that occurs the signal handler could end up running on
a stack that has collided with the heap.

Ideally we would allocate and probe a page as an atomic unit (which is
feasible on PPC).  Alternatively, due to ISA restrictions, allocate a
page, then probe the page as distinct instructions.  The latter still
has a race, but we'd have to take the async signal in a single
instruction window.
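
(On ppc, for example, a single store-with-update instruction such as
stdu r1,-4096(r1) both moves the stack pointer and stores into the
newly allocated page, which is what makes the atomic variant feasible
there.)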

A key point to remember is that you can never have an allocation
(potentially using more than one allocation site) which is larger than a
page without probing the page.

Furthermore, we cannot assume that earlier functions in the call stack
were compiled with stack checking enabled.  Thus we cannot make any
assumptions about what pages other functions in the call stack have
probed or not probed.

Finally, we need not ensure the ability to handle a signal at stack
overflow.  It is fine for the kernel to halt the process immediately if
it detects a reference to the guard page.


--

With all that in mind, we also want to be as efficient as possible, and
I think we do pretty well on x86 and ppc.  On x86, the call instruction
itself stores into the stack, and on ppc the stack is only supposed to
be allocated via the store-with-base-register-modification instructions,
which also store into *sp.

Those "implicit probes" allow us to greatly reduce the amount of probing
we do on those architectures.  If a function allocates less than a page
of space, no probing is needed -- this covers the vast majority of
functions.  Furthermore, if we allocate N pages + M bytes of residuals,
we need only explicitly probe the N pages, but not any of the residual
allocation.
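
(A worked example under the assumed 4k interval: a function needing 9k
of stack is two full pages plus 1k of residual; it must explicitly probe
the two pages, but the 1k residual, being smaller than a page, cannot
jump the guard on its own and so needs no probe.)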

In glibc, we end up creating probes in ~1.5% of the functions on those
two architectures.  We could probably do even better on PPC, but we
currently assume 4k pages, which is overly conservative on that target.

aarch64 is significantly worse.  There are no implicit probes we can
exploit.  Furthermore, the prologue may allocate stack space 3-4 times.
So we have to track the distance to the most recent probe and, when
that distance grows too large, emit a probe.  Of course we have to make
worst-case assumptions at function entry.

s390 is much like aarch64 in that it doesn't have implicit probes.
However, it has simpler prologue code.


Dynamic (alloca) space is handled fairly generically with simple code to
allocate a page and probe the just allocated page.
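
In rough C terms (again a model only -- the helper name is made up and
the real sequence is emitted by the expander; 4k interval assumed):

  /* Allocate SIZE bytes of dynamic stack, probing as we go, so sp
     never moves more than a page past the last probed address.  */
  char *
  alloca_probed (char *sp, unsigned long size)
  {
    while (size)
      {
        unsigned long chunk = size > 4096 ? 4096 : size;
        sp -= chunk;
        *sp = 0;                    /* probe the just-allocated chunk */
        size -= chunk;
      }
    return sp;
  }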


Michael Matz has suggested some generic support so that we don't have to
write target-specific code for each and every target we support.  The
idea is to have a helper function which allocates and probes stack
space.  The port can then call that helper function from within its
prologue generator.  I think this is wise -- I wouldn't want to go
through this exercise on every port.


--


So, time to open the discussion to questions & comments.

I've got patches I need to clean up and post for comments that implement
this for x86, ppc, aarch64 and s390.  x86 and ppc are IMHO in good
shape.  There's an unhandled case for s390.  I've got evaluation still
to do on aarch64.



Jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 17:07 RFC: stack/heap collision vulnerability and mitigation with GCC Jeff Law
@ 2017-06-19 17:29 ` Jakub Jelinek
  2017-06-19 17:45   ` Jeff Law
                     ` (2 more replies)
  2017-06-19 17:51 ` Joseph Myers
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 66+ messages in thread
From: Jakub Jelinek @ 2017-06-19 17:29 UTC (permalink / raw)
  To: Jeff Law, Eric Botcazou; +Cc: gcc-patches

On Mon, Jun 19, 2017 at 11:07:06AM -0600, Jeff Law wrote:
> After much poking around I concluded that we really need to implement
> allocation and probing via a "moving sp" strategy.   Probing into
> unallocated areas runs afoul of valgrind, so that's a non-starter.
> 
> Allocating stack space, then probing the pages within the space is
> vulnerable to async signal delivery between the allocation point and the
> probe point.  If that occurs the signal handler could end up running on
> a stack that has collided with the heap.
> 
> Ideally we would allocate and probe a page as an atomic unit (which is
> feasible on PPC).  Alternatively, due to ISA restrictions, allocate a
> page, then probe the page as distinct instructions.  The latter still
> has a race, but we'd have to take the async signal in a single
> instruction window.

And if the allocation is only a page at a time, the single insn race window
can be mitigated in the kernel (probe (read-only is fine) the word at the
stack when setting up a signal frame for async signal).

> So, time to open the discussion to questions & comments.
> 
> I've got patches I need to cleanup and post for comments that implement
> this for x86, ppc, aarch64 and s390.  x86 and ppc are IMHO in good
> shape.  There's an unhandled case for s390.  I've got evaluation still
> to do on aarch64.

In the patches Jeff is going to post, we have (at least for
-fasynchronous-unwind-tables which is on by default on e.g. x86)
precise unwind info even with the new stack check mode.
ira.c currently has:
       /* We need the frame pointer to catch stack overflow exceptions if
          the stack pointer is moving (as for the alloca case just above).  */
       || (STACK_CHECK_MOVING_SP
           && flag_stack_check
           && flag_exceptions
           && cfun->can_throw_non_call_exceptions)
For alloca we have a frame pointer for other reasons, the question is
if we really need this hunk even if we provided proper unwind info
even for the Ada -fstack-check mode.  Or, if we provide proper unwind info
for -fasynchronous-unwind-tables, if the above could not be also
&& !flag_asynchronous_unwind_tables.  Eric, what exactly is the reason
for the above, is it just lack of proper CFI notes, or something different?

Also, on i?86 orq $0, (%rsp) or orl $0, (%esp) is used to probe stack,
while it is shorter, is it actually faster or as slow as movq $0, (%rsp)
or movl $0, (%esp) ?

	Jakub

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 17:29 ` Jakub Jelinek
@ 2017-06-19 17:45   ` Jeff Law
  2017-06-19 17:51     ` Jakub Jelinek
  2017-06-19 18:00   ` Richard Biener
  2017-06-20  7:50   ` Eric Botcazou
  2 siblings, 1 reply; 66+ messages in thread
From: Jeff Law @ 2017-06-19 17:45 UTC (permalink / raw)
  To: Jakub Jelinek, Eric Botcazou; +Cc: gcc-patches

On 06/19/2017 11:29 AM, Jakub Jelinek wrote:
> 
> Also, on i?86 orq $0, (%rsp) or orl $0, (%esp) is used to probe stack,
> while it is shorter, is it actually faster or as slow as movq $0, (%rsp)
> or movl $0, (%esp) ?
Florian raised this privately to me as well.  There's a couple of issues.

1. Is there a performance penalty/gain for sub-word operations?  If not,
   we can improve things slightly there.  Even if it's performance
   neutral we can probably do better on code size.

2. I would *prefer* if the probe actually changed the value, and the
   more destructive, the better :-0  It makes it easier to catch
   something that has gone wild.


These are pretty minor implementation details IMHO, but now is a good
time to revisit the probe style.


I'm mostly concerned about holes in the basic probing strategy, how
we're going to deal with the additional architectures (I can't imagine
we'll want to go through the pain of a custom implementation for each
target) and the UI.

On the last topic.  When we first started this work, it appeared that we
could make most targets with -fstack-check=specific support work
(possibly with some inefficiency) by just dropping the
probe-ahead-of-need aspects of the existing implementation.

ie, we'd drop the requirement for being able to run the signal handler,
stop probing 2 pages beyond the current stack requirements, and instead
just probe up to what the current function needs.

This felt like a "probing policy" (ahead-of-need vs as-needed).

But when we ran into the issues with valgrind it became clear that we
really couldn't safely use the current port support for
-fstack-check=specific.

Thus I find myself rethinking: is this a probing policy option, or
should it just be another variant of -fstack-check=<something>?

Jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 17:45   ` Jeff Law
@ 2017-06-19 17:51     ` Jakub Jelinek
  2017-06-19 21:51       ` Jeff Law
  2017-06-20  8:03       ` Uros Bizjak
  0 siblings, 2 replies; 66+ messages in thread
From: Jakub Jelinek @ 2017-06-19 17:51 UTC (permalink / raw)
  To: Jeff Law, Uros Bizjak, Jan Hubicka; +Cc: Eric Botcazou, gcc-patches

On Mon, Jun 19, 2017 at 11:45:13AM -0600, Jeff Law wrote:
> On 06/19/2017 11:29 AM, Jakub Jelinek wrote:
> > 
> > Also, on i?86 orq $0, (%rsp) or orl $0, (%esp) is used to probe stack,
> > while it is shorter, is it actually faster or as slow as movq $0, (%rsp)
> > or movl $0, (%esp) ?
> Florian raised this privately to me as well.  There's a couple of issues.
> 
> 1. Is there a performance penalty/gain for sub-word operations?  If not,
>    we can improve things slighly there.  Even if it's performance
>    neutral we can probably do better on code size.

CCing Uros and Honza here, I believe there are at least on x86 penalties
for 2-byte, maybe for 1-byte and then sometimes some stalls when you
write or read in a different size from a recent write or read.

> Thus I find myself rethinking is this a probing policy option or should
> it just be another variant of -fstack-check=<something>.

Yeah, IMHO it is just another way of stack probing next to generic and
specific, and for users it would be easier to write -fstack-check=whatever
than -fstack-check -fstack-check-probe=whatever

	Jakub

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 17:07 RFC: stack/heap collision vulnerability and mitigation with GCC Jeff Law
  2017-06-19 17:29 ` Jakub Jelinek
@ 2017-06-19 17:51 ` Joseph Myers
  2017-06-19 17:55   ` Jakub Jelinek
                     ` (2 more replies)
  2017-06-19 18:12 ` Richard Kenner
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 66+ messages in thread
From: Joseph Myers @ 2017-06-19 17:51 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches

On Mon, 19 Jun 2017, Jeff Law wrote:

> A key point to remember is that you can never have an allocation
> (potentially using more than one allocation site) which is larger than a
> page without probing the page.

There's a platform ABI issue here.  At least some kernel fixes for these 
stack issues, as I understand it, increase the size of the stack guard to 
more than a single page.  It would be possible to define the ABI to 
require such a larger guard for protection and so reduce the number of 
(non-alloca/VLA-using) functions that need probes generated, depending on 
whether a goal is to achieve security on kernels without such a fix.  
(Thinking in terms of how to get to enabling such probes by default.)

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 17:51 ` Joseph Myers
@ 2017-06-19 17:55   ` Jakub Jelinek
  2017-06-19 18:21   ` Florian Weimer
  2017-06-19 19:05   ` Jeff Law
  2 siblings, 0 replies; 66+ messages in thread
From: Jakub Jelinek @ 2017-06-19 17:55 UTC (permalink / raw)
  To: Joseph Myers; +Cc: Jeff Law, gcc-patches

On Mon, Jun 19, 2017 at 05:50:56PM +0000, Joseph Myers wrote:
> On Mon, 19 Jun 2017, Jeff Law wrote:
> 
> > A key point to remember is that you can never have an allocation
> > (potentially using more than one allocation site) which is larger than a
> > page without probing the page.
> 
> There's a platform ABI issue here.  At least some kernel fixes for these 
> stack issues, as I understand it, increase the size of the stack guard to 
> more than a single page.  It would be possible to define the ABI to 
> require such a larger guard for protection and so reduce the number of 
> (non-alloca/VLA-using) functions that need probes generated, depending on 
> whether a goal is to achieve security on kernels without such a fix.  
> (Thinking in terms of how to get to enabling such probes by default.)

Note that the kernel-imposed stack guard page is just one thing (even
though probably the most common); POSIX threads allow specifying the
guard size for thread stacks too, and increasing the guard size too much
in that case is a bigger problem than just doing it for a single initial
thread.  Also, people can override it; if they use a 0 guard size, we
can say it is their problem to allow this kind of exploit, but asking
them to use much larger guard sizes might be a problem for apps that
create many threads.
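
For reference, the per-thread knob in question is
pthread_attr_setguardsize; a minimal sketch of creating a thread with a
larger guard (the 64k figure and the function names are arbitrary):

  #include <pthread.h>

  static void *
  worker (void *arg)
  {
    return arg;
  }

  int
  spawn_with_big_guard (pthread_t *tid)
  {
    pthread_attr_t attr;
    int ret;

    pthread_attr_init (&attr);
    /* The default guard is typically one page; ask for 64k.  */
    pthread_attr_setguardsize (&attr, 64 * 1024);
    ret = pthread_create (tid, &attr, worker, NULL);
    pthread_attr_destroy (&attr);
    return ret;
  }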

	Jakub

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 17:29 ` Jakub Jelinek
  2017-06-19 17:45   ` Jeff Law
@ 2017-06-19 18:00   ` Richard Biener
  2017-06-19 18:02     ` Richard Biener
  2017-06-20  7:50   ` Eric Botcazou
  2 siblings, 1 reply; 66+ messages in thread
From: Richard Biener @ 2017-06-19 18:00 UTC (permalink / raw)
  To: Jakub Jelinek, Jeff Law, Eric Botcazou; +Cc: gcc-patches

On June 19, 2017 7:29:32 PM GMT+02:00, Jakub Jelinek <jakub@redhat.com> wrote:
>On Mon, Jun 19, 2017 at 11:07:06AM -0600, Jeff Law wrote:
>> After much poking around I concluded that we really need to implement
>> allocation and probing via a "moving sp" strategy.   Probing into
>> unallocated areas runs afoul of valgrind, so that's a non-starter.
>> 
>> Allocating stack space, then probing the pages within the space is
>> vulnerable to async signal delivery between the allocation point and
>the
>> probe point.  If that occurs the signal handler could end up running
>on
>> a stack that has collided with the heap.
>> 
>> Ideally we would allocate and probe a page as an atomic unit (which
>is
>> feasible on PPC).  Alternatively, due to ISA restrictions, allocate a
>> page, then probe the page as distinct instructions.  The latter still
>> has a race, but we'd have to take the async signal in a single
>> instruction window.
>
>And if the allocation is only a page at a time, the single insn race
>window
>can be mitigated in the kernel (probe (read-only is fine) the word at
>the
>stack when setting up a signal frame for async signal).
>
>> So, time to open the discussion to questions & comments.
>> 
>> I've got patches I need to cleanup and post for comments that
>implement
>> this for x86, ppc, aarch64 and s390.  x86 and ppc are IMHO in good
>> shape.  There's an unhandled case for s390.  I've got evaluation
>still
>> to do on aarch64.
>
>In the patches Jeff is going to post, we have (at least for
>-fasynchronous-unwind-tables which is on by default on e.g. x86)
>precise unwind info even with the new stack check mode.
>ira.c currently has:
>     /* We need the frame pointer to catch stack overflow exceptions if
>   the stack pointer is moving (as for the alloca case just above).  */
>       || (STACK_CHECK_MOVING_SP
>           && flag_stack_check
>           && flag_exceptions
>           && cfun->can_throw_non_call_exceptions)
>For alloca we have a frame pointer for other reasons, the question is
>if we really need this hunk even if we provided proper unwind info
>even for the Ada -fstack-check mode.  Or, if we provide proper unwind
>info
>for -fasynchronous-unwind-tables, if the above could not be also
>&& !flag_asynchronous_unwind_tables.  Eric, what exactly is the reason
>for the above, is it just lack of proper CFI notes, or something
>different?
>
>Also, on i?86 orq $0, (%rsp) or orl $0, (%esp) is used to probe stack,
>while it is shorter, is it actually faster or as slow as movq $0,
>(%rsp)
>or movl $0, (%esp) ?

It at least has the chance of bypassing all of the store queue in CPUs and thus causing no cacheline allocation or triggering prefetching.

Not sure if any of that is done though.

Performance counters might tell.

Otherwise incrementing SP by 4095 and then pushing al would work as well (and be similarly short as the or).

Richard.

>	Jakub

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 18:00   ` Richard Biener
@ 2017-06-19 18:02     ` Richard Biener
  2017-06-19 18:15       ` Florian Weimer
  2017-06-19 22:08       ` Jeff Law
  0 siblings, 2 replies; 66+ messages in thread
From: Richard Biener @ 2017-06-19 18:02 UTC (permalink / raw)
  To: Jakub Jelinek, Jeff Law, Eric Botcazou; +Cc: gcc-patches

On June 19, 2017 8:00:19 PM GMT+02:00, Richard Biener <richard.guenther@gmail.com> wrote:
>On June 19, 2017 7:29:32 PM GMT+02:00, Jakub Jelinek <jakub@redhat.com>
>wrote:
>>On Mon, Jun 19, 2017 at 11:07:06AM -0600, Jeff Law wrote:
>>> After much poking around I concluded that we really need to
>implement
>>> allocation and probing via a "moving sp" strategy.   Probing into
>>> unallocated areas runs afoul of valgrind, so that's a non-starter.
>>> 
>>> Allocating stack space, then probing the pages within the space is
>>> vulnerable to async signal delivery between the allocation point and
>>the
>>> probe point.  If that occurs the signal handler could end up running
>>on
>>> a stack that has collided with the heap.
>>> 
>>> Ideally we would allocate and probe a page as an atomic unit (which
>>is
>>> feasible on PPC).  Alternatively, due to ISA restrictions, allocate
>a
>>> page, then probe the page as distinct instructions.  The latter
>still
>>> has a race, but we'd have to take the async signal in a single
>>> instruction window.
>>
>>And if the allocation is only a page at a time, the single insn race
>>window
>>can be mitigated in the kernel (probe (read-only is fine) the word at
>>the
>>stack when setting up a signal frame for async signal).
>>
>>> So, time to open the discussion to questions & comments.
>>> 
>>> I've got patches I need to cleanup and post for comments that
>>implement
>>> this for x86, ppc, aarch64 and s390.  x86 and ppc are IMHO in good
>>> shape.  There's an unhandled case for s390.  I've got evaluation
>>still
>>> to do on aarch64.
>>
>>In the patches Jeff is going to post, we have (at least for
>>-fasynchronous-unwind-tables which is on by default on e.g. x86)
>>precise unwind info even with the new stack check mode.
>>ira.c currently has:
>>     /* We need the frame pointer to catch stack overflow exceptions
>if
>>   the stack pointer is moving (as for the alloca case just above). 
>*/
>>       || (STACK_CHECK_MOVING_SP
>>           && flag_stack_check
>>           && flag_exceptions
>>           && cfun->can_throw_non_call_exceptions)
>>For alloca we have a frame pointer for other reasons, the question is
>>if we really need this hunk even if we provided proper unwind info
>>even for the Ada -fstack-check mode.  Or, if we provide proper unwind
>>info
>>for -fasynchronous-unwind-tables, if the above could not be also
>>&& !flag_asynchronous_unwind_tables.  Eric, what exactly is the reason
>>for the above, is it just lack of proper CFI notes, or something
>>different?
>>
>>Also, on i?86 orq $0, (%rsp) or orl $0, (%esp) is used to probe stack,
>>while it is shorter, is it actually faster or as slow as movq $0,
>>(%rsp)
>>or movl $0, (%esp) ?
>
>It at least has the chance of bypassing all of the store queue in CPUs
>and thus cause no cacheline allocation or trigger prefetching.
>
>Not sure if any of that is done though.
>
>Performance counters might tell.
>
>Otherwise incrementing SP by 4095 and then pushing al would work as
>well (and be similarly short as the or).

Oh, and using push intelligently with first bumping to SP & 4096-1 + 4095 would solve the signal atomicity as well. Might be larger and somewhat interfere with the CPU's stack engine.  Who knows...

Richard.

>Richard.
>
>Richard.
>
>>	Jakub

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 17:07 RFC: stack/heap collision vulnerability and mitigation with GCC Jeff Law
  2017-06-19 17:29 ` Jakub Jelinek
  2017-06-19 17:51 ` Joseph Myers
@ 2017-06-19 18:12 ` Richard Kenner
  2017-06-19 22:05   ` Jeff Law
  2017-06-20  8:21   ` Eric Botcazou
  2017-06-20  8:17 ` Eric Botcazou
  2017-06-20  9:27 ` Richard Earnshaw (lists)
  4 siblings, 2 replies; 66+ messages in thread
From: Richard Kenner @ 2017-06-19 18:12 UTC (permalink / raw)
  To: law; +Cc: gcc-patches

Out of curiosity, does the old Alpha/VMS stack-checking API meet the
requirements?  From what I recall, I think it does.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 18:02     ` Richard Biener
@ 2017-06-19 18:15       ` Florian Weimer
  2017-06-19 21:57         ` Jeff Law
  2017-06-19 22:08       ` Jeff Law
  1 sibling, 1 reply; 66+ messages in thread
From: Florian Weimer @ 2017-06-19 18:15 UTC (permalink / raw)
  To: Richard Biener, Jakub Jelinek, Jeff Law, Eric Botcazou; +Cc: gcc-patches

On 06/19/2017 08:02 PM, Richard Biener wrote:
> Oh, and using push intelligently with first bumping to SP & 4096-1 + 4095 would solve the signal atomicity as well. Might be larger and somewhat interfere with CPUs stack engine.  Who knows...

On x86-64, PUSH REG is just a single byte, so for sequences that have to
move SP and probe, it's the shortest possible sequence AFAIK.  NEG/NOT
can take an offsettable memory operand, but it's three bytes.

(I believe the use of ORQ in the current -fstack-check probes might be
an oversight.  For a start, the REX prefix seems completely unnecessary.)
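
For concreteness, the candidate sequences and their sizes as I read the
encodings (worth double-checking against the SDM):

  push %rax           50                        1 byte, moves SP
  notl (%esp)         f7 14 24                  3 bytes, RMW (32-bit)
  orq  $0, (%rsp)     48 83 0c 24 00            5 bytes, RMW
  movq $0, (%rsp)     48 c7 04 24 00 00 00 00   8 bytes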

Thanks,
Florian

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 17:51 ` Joseph Myers
  2017-06-19 17:55   ` Jakub Jelinek
@ 2017-06-19 18:21   ` Florian Weimer
  2017-06-19 21:56     ` Joseph Myers
  2017-06-19 19:05   ` Jeff Law
  2 siblings, 1 reply; 66+ messages in thread
From: Florian Weimer @ 2017-06-19 18:21 UTC (permalink / raw)
  To: Joseph Myers, Jeff Law; +Cc: gcc-patches

On 06/19/2017 07:50 PM, Joseph Myers wrote:
> There's a platform ABI issue here.  At least some kernel fixes for these 
> stack issues, as I understand it, increase the size of the stack guard to 
> more than a single page.  It would be possible to define the ABI to 
> require such a larger guard for protection and so reduce the number of 
> (non-alloca/VLA-using) functions that need probes generated, depending on 
> whether a goal is to achieve security on kernels without such a fix.  
> (Thinking in terms of how to get to enabling such probes by default.)

I think architectures such as aarch64 without implied stack probing as
part of the function call sequence would benefit most from an ABI
agreement (splitting the probing responsibility in some way between
caller and callee).  For architectures with some form of implied
probing, the complications from negotiating a guard region size between
GCC, kernel, glibc, and perhaps even applications (see Jakub's comment
about thread stacks) outweigh the performance gains.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 17:51 ` Joseph Myers
  2017-06-19 17:55   ` Jakub Jelinek
  2017-06-19 18:21   ` Florian Weimer
@ 2017-06-19 19:05   ` Jeff Law
  2017-06-19 19:45     ` Jakub Jelinek
  2017-06-20  8:27     ` Richard Earnshaw (lists)
  2 siblings, 2 replies; 66+ messages in thread
From: Jeff Law @ 2017-06-19 19:05 UTC (permalink / raw)
  To: Joseph Myers; +Cc: gcc-patches

On 06/19/2017 11:50 AM, Joseph Myers wrote:
> On Mon, 19 Jun 2017, Jeff Law wrote:
> 
>> A key point to remember is that you can never have an allocation
>> (potentially using more than one allocation site) which is larger than a
>> page without probing the page.
> 
> There's a platform ABI issue here.  At least some kernel fixes for these 
> stack issues, as I understand it, increase the size of the stack guard to 
> more than a single page.  It would be possible to define the ABI to 
> require such a larger guard for protection and so reduce the number of 
> (non-alloca/VLA-using) functions that need probes generated, depending on 
> whether a goal is to achieve security on kernels without such a fix.  
> (Thinking in terms of how to get to enabling such probes by default.)
On 32 bit platforms we don't have a lot of address space left, so we
have to be careful about creating too large of a guard.

On 64 bit platforms we have a lot more freedom and I suspect larger
guards, mandated by the ABI, would be useful, if for no other reason than
allowing us to allocate more stack without probing.   A simple array of
PATH_MAX characters triggers probing right now.   I suspect (but didn't
bother to confirm) that PATH_MAX arrays are what causes git to have so
many large stacks.

Also if we look at something like ppc and aarch64, we've currently got
the PROBE_INTERVAL set to 4k.  But in reality they're using much larger
page sizes.  So we could improve things there as well.


jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 19:05   ` Jeff Law
@ 2017-06-19 19:45     ` Jakub Jelinek
  2017-06-19 21:41       ` Jeff Law
  2017-06-20  8:27     ` Richard Earnshaw (lists)
  1 sibling, 1 reply; 66+ messages in thread
From: Jakub Jelinek @ 2017-06-19 19:45 UTC (permalink / raw)
  To: Jeff Law; +Cc: Joseph Myers, gcc-patches

On Mon, Jun 19, 2017 at 01:04:57PM -0600, Jeff Law wrote:
> On 06/19/2017 11:50 AM, Joseph Myers wrote:
> > On Mon, 19 Jun 2017, Jeff Law wrote:
> > 
> >> A key point to remember is that you can never have an allocation
> >> (potentially using more than one allocation site) which is larger than a
> >> page without probing the page.
> > 
> > There's a platform ABI issue here.  At least some kernel fixes for these 
> > stack issues, as I understand it, increase the size of the stack guard to 
> > more than a single page.  It would be possible to define the ABI to 
> > require such a larger guard for protection and so reduce the number of 
> > (non-alloca/VLA-using) functions that need probes generated, depending on 
> > whether a goal is to achieve security on kernels without such a fix.  
> > (Thinking in terms of how to get to enabling such probes by default.)
> On 32 bit platforms we don't have a lot of address space left, so we
> have to be careful about creating too large of a guard.
> 
> On 64 bit platforms we have a lot more freedom and I suspect larger
> guards, mandated by the ABI would be useful, if for no other reason than
> allowing us to allocate more stack without probing.   A simple array of
> PATH_MAX characters triggers probing right now.   I suspect (but didn't
> bother to confirm) that PATH_MAX arrays are what causes git to have so
> many large stacks.
> 
> Also if we look at something like ppc and aarch64, we've currently got
> the PROBE_INTERVAL set to 4k.  But in reality they're using much larger
> page sizes.  So we could improve things there as well.

ppc can use 4K, 16K, 64K or 256K pages, aarch64 4K, 16K or 64K.
So, unless the ABI (or some ABI extension for Linux) says that the guard
page is at least 16K or 64K on these arches (and unless glibc changes the
default pthread_attr_getguardsize - currently defaults everywhere to 1
page), you can't rely on more than 4K there.

	Jakub

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 19:45     ` Jakub Jelinek
@ 2017-06-19 21:41       ` Jeff Law
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff Law @ 2017-06-19 21:41 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Joseph Myers, gcc-patches

On 06/19/2017 01:45 PM, Jakub Jelinek wrote:
> On Mon, Jun 19, 2017 at 01:04:57PM -0600, Jeff Law wrote:
>> On 06/19/2017 11:50 AM, Joseph Myers wrote:
>>> On Mon, 19 Jun 2017, Jeff Law wrote:
>>>
>>>> A key point to remember is that you can never have an allocation
>>>> (potentially using more than one allocation site) which is larger than a
>>>> page without probing the page.
>>>
>>> There's a platform ABI issue here.  At least some kernel fixes for these 
>>> stack issues, as I understand it, increase the size of the stack guard to 
>>> more than a single page.  It would be possible to define the ABI to 
>>> require such a larger guard for protection and so reduce the number of 
>>> (non-alloca/VLA-using) functions that need probes generated, depending on 
>>> whether a goal is to achieve security on kernels without such a fix.  
>>> (Thinking in terms of how to get to enabling such probes by default.)
>> On 32 bit platforms we don't have a lot of address space left, so we
>> have to be careful about creating too large of a guard.
>>
>> On 64 bit platforms we have a lot more freedom and I suspect larger
>> guards, mandated by the ABI would be useful, if for no other reason than
>> allowing us to allocate more stack without probing.   A simple array of
>> PATH_MAX characters triggers probing right now.   I suspect (but didn't
>> bother to confirm) that PATH_MAX arrays are what causes git to have so
>> many large stacks.
>>
>> Also if we look at something like ppc and aarch64, we've currently got
>> the PROBE_INTERVAL set to 4k.  But in reality they're using much larger
>> page sizes.  So we could improve things there as well.
> 
> ppc can use 4K, 16K, 64K or 256K pages, aarch64 4K, 16K or 64K.
> So, unless the ABI (or some ABI extension for Linux) says that the guard
> page is at least 16K or 64K on these arches (and unless glibc changes the
> default pthread_attr_getguardsize - currently defaults everywhere to 1
> page), you can't rely on more than 4K there.

While the hardware can use the smaller pages, we can (and
probably should) be clearer in the ABI.  The current pagesize exported
by the kernel on those targets is 16/32k IIRC.

jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 17:51     ` Jakub Jelinek
@ 2017-06-19 21:51       ` Jeff Law
  2017-06-20  8:03       ` Uros Bizjak
  1 sibling, 0 replies; 66+ messages in thread
From: Jeff Law @ 2017-06-19 21:51 UTC (permalink / raw)
  To: Jakub Jelinek, Uros Bizjak, Jan Hubicka; +Cc: Eric Botcazou, gcc-patches

On 06/19/2017 11:51 AM, Jakub Jelinek wrote:
> On Mon, Jun 19, 2017 at 11:45:13AM -0600, Jeff Law wrote:
>> On 06/19/2017 11:29 AM, Jakub Jelinek wrote:
>>>
>>> Also, on i?86 orq $0, (%rsp) or orl $0, (%esp) is used to probe stack,
>>> while it is shorter, is it actually faster or as slow as movq $0, (%rsp)
>>> or movl $0, (%esp) ?
>> Florian raised this privately to me as well.  There's a couple of issues.
>>
>> 1. Is there a performance penalty/gain for sub-word operations?  If not,
>>    we can improve things slightly there.  Even if it's performance
>>    neutral we can probably do better on code size.
> 
> CCing Uros and Honza here, I believe there are at least on x86 penalties
> for 2-byte, maybe for 1-byte and then sometimes some stalls when you
> write or read in a different size from a recent write or read.
Obviously, I'll go with whatever Honza & Uros say is the most efficient.
 This stuff would be highly localized and is easily tweaked into
whatever final form we want.

> 
>> Thus I find myself rethinking is this a probing policy option or should
>> it just be another variant of -fstack-check=<something>.
> 
> Yeah, IMHO it is just another way of stack probing next to generic and
> specific, and for users it would be easier to write -fstack-check=whatever
> than -fstack-check -fstack-check-probe=whatever
That's essentially where I'm leaning now.  The difficulty is in
selecting a name.  ISTM that -fstack-check=specific becomes horribly bad
though.  It should really be -fstack-check=ada or somesuch.

jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 18:21   ` Florian Weimer
@ 2017-06-19 21:56     ` Joseph Myers
  2017-06-19 22:05       ` Jeff Law
  0 siblings, 1 reply; 66+ messages in thread
From: Joseph Myers @ 2017-06-19 21:56 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Jeff Law, gcc-patches

On Mon, 19 Jun 2017, Florian Weimer wrote:

> I think architectures such as aarch64 without implied stack probing as
> part of the function call sequence would benefit most from an ABI
> agreement (splitting the probing responsibility in some way between
> caller and callee).  For architectures with some form of implied

I'd expect that, regardless of architecture, if calls don't write to the 
stack, the caller has to save its own return address somewhere before 
making a call, which means writing the saved link register.  Is the 
problem case something like: the caller allocates stack space 
unconditionally, without writing to it, and then a particular case in the 
caller calls what it believes to be a noreturn function, or a function 
that it knows won't return in that particular case, so doesn't need to 
save the return address (although not saving return addresses when calling 
noreturn functions is problematic in practice when you want to backtrace 
from abort), so makes a call without ever having written anything to the 
stack (and then you chain many such calls to do large stack allocations, 
never writing to the stack, with each individual allocation being small)?  
Or is the concern simply that the caller might have been compiled without 
stack checking and you don't know *where* it wrote to the stack, even 
given that it must have saved its return address somewhere?

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 18:15       ` Florian Weimer
@ 2017-06-19 21:57         ` Jeff Law
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff Law @ 2017-06-19 21:57 UTC (permalink / raw)
  To: Florian Weimer, Richard Biener, Jakub Jelinek, Eric Botcazou; +Cc: gcc-patches

On 06/19/2017 12:15 PM, Florian Weimer wrote:
> On 06/19/2017 08:02 PM, Richard Biener wrote:
>> Oh, and using push intelligently with first bumping to SP & 4096-1 + 4095 would solve the signal atomicity as well. Might be larger and somewhat interfere with CPUs stack engine.  Who knows...
> 
> On x86-64, PUSH REG is just a single byte, so for sequences that have to
> move SP and probe, it's the shortest possible sequence AFAIK.  NEG/NOT
> can take an offsettable memory operand, but it's three bytes.
Right.  I think we want guidance from Honza & Uros on what the most
runtime efficient mechanisms are (or are likely to be, there's a certain
amount of guesswork that has to happen here), then we look at which are
the most code space efficient.  I'm personally willing to trade off some
unwinder table space if it gives us more compact code.

Jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 21:56     ` Joseph Myers
@ 2017-06-19 22:05       ` Jeff Law
  2017-06-19 22:10         ` Florian Weimer
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff Law @ 2017-06-19 22:05 UTC (permalink / raw)
  To: Joseph Myers, Florian Weimer; +Cc: gcc-patches

On 06/19/2017 03:56 PM, Joseph Myers wrote:
> On Mon, 19 Jun 2017, Florian Weimer wrote:
> 
>> I think architectures such as aarch64 without implied stack probing as
>> part of the function call sequence would benefit most from an ABI
>> agreement (splitting the probing responsibility in some way between
>> caller and callee).  For architectures with some form of implied
> 
> I'd expect that, regardless of architecture, if calls don't write to the 
> stack, the caller has to save its own return address somewhere before 
> making a call, which means writing the saved link register. 
True, but the callee doesn't know the offset where the caller saved the
return address.  In fact, different callers could have stored it at
different offsets.  AFAICT for these targets we just have to make a
worst case assumption about the caller.



>  Is the
> problem case something like: the caller allocates stack space 
> unconditionally, without writing to it, and then a particular case in the 
> caller calls what it believes to be a noreturn function, or a function 
> that it knows won't return in that particular case, so doesn't need to 
> save the return address (although not saving return addresses when calling 
> noreturn functions is problematic in practice when you want to backtrace 
> from abort), so makes a call without ever having written anything to the 
> stack (and then you chain many such calls to do large stack allocations, 
> never writing to the stack, with each individual allocation being small)?  
Noreturn functions are a bit special.  In the interest of safety my
patches do two things.

1. Callee always probes *(sp+small offset).  This avoids problems if the
caller allocated space, but turned the call into a jump because it knew
the callee was noreturn and thus it didn't need to tear down the
caller's frame.  GCC doesn't do this optimization anyway, but better
safe than sorry.

2. GCC will explicitly refuse to optimize a call to a noreturn function
into a jump.





> Or is the concern simply that the caller might have been compiled without 
> stack checking and you don't know *where* it wrote to the stack, even 
> given that it must have saved its return address somewhere?
Right.  In a mixed environment, you don't know if the caller was
compiled with -fstack-check or not.  So unless the architecture does
something useful (stores the return pointer on the stack) or the ABI
mandates something useful (*sp always contains the outer frame), you
have to make worst-case assumptions.


Jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 18:12 ` Richard Kenner
@ 2017-06-19 22:05   ` Jeff Law
  2017-06-19 22:07     ` Richard Kenner
  2017-06-20  8:21   ` Eric Botcazou
  1 sibling, 1 reply; 66+ messages in thread
From: Jeff Law @ 2017-06-19 22:05 UTC (permalink / raw)
  To: Richard Kenner; +Cc: gcc-patches

On 06/19/2017 12:12 PM, Richard Kenner wrote:
> Out of curiosity, does the old Alpha/VMS stack-checking API meet the
> requirements?  From what I recall, I think it does.
Unsure.  Is this documented somewhere?

jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 22:05   ` Jeff Law
@ 2017-06-19 22:07     ` Richard Kenner
  0 siblings, 0 replies; 66+ messages in thread
From: Richard Kenner @ 2017-06-19 22:07 UTC (permalink / raw)
  To: law; +Cc: gcc-patches

> > Out of curiosity, does the old Alpha/VMS stack-checking API meet the
> > requirements?  From what I recall, I think it does.
> Unsure.  Is this documented somewhere?

It seems to be in

   http://h20565.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04621389

starting at page 3-54.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 18:02     ` Richard Biener
  2017-06-19 18:15       ` Florian Weimer
@ 2017-06-19 22:08       ` Jeff Law
  1 sibling, 0 replies; 66+ messages in thread
From: Jeff Law @ 2017-06-19 22:08 UTC (permalink / raw)
  To: Richard Biener, Jakub Jelinek, Eric Botcazou; +Cc: gcc-patches

On 06/19/2017 12:02 PM, Richard Biener wrote:
> On June 19, 2017 8:00:19 PM GMT+02:00, Richard Biener <richard.guenther@gmail.com> wrote:
>> On June 19, 2017 7:29:32 PM GMT+02:00, Jakub Jelinek <jakub@redhat.com>
>> wrote:
>>> On Mon, Jun 19, 2017 at 11:07:06AM -0600, Jeff Law wrote:
>>>> After much poking around I concluded that we really need to
>> implement
>>>> allocation and probing via a "moving sp" strategy.   Probing into
>>>> unallocated areas runs afoul of valgrind, so that's a non-starter.
>>>>
>>>> Allocating stack space, then probing the pages within the space is
>>>> vulnerable to async signal delivery between the allocation point and
>>> the
>>>> probe point.  If that occurs the signal handler could end up running
>>> on
>>>> a stack that has collided with the heap.
>>>>
>>>> Ideally we would allocate and probe a page as an atomic unit (which
>>> is
>>>> feasible on PPC).  Alternatively, due to ISA restrictions, allocate
>> a
>>>> page, then probe the page as distinct instructions.  The latter
>> still
>>>> has a race, but we'd have to take the async signal in a single
>>>> instruction window.
>>>
>>> And if the allocation is only a page at a time, the single insn race
>>> window
>>> can be mitigated in the kernel (probe (read-only is fine) the word at
>>> the
>>> stack when setting up a signal frame for async signal).
>>>
>>>> So, time to open the discussion to questions & comments.
>>>>
>>>> I've got patches I need to cleanup and post for comments that
>>> implement
>>>> this for x86, ppc, aarch64 and s390.  x86 and ppc are IMHO in good
>>>> shape.  There's an unhandled case for s390.  I've got evaluation
>>> still
>>>> to do on aarch64.
>>>
>>> In the patches Jeff is going to post, we have (at least for
>>> -fasynchronous-unwind-tables which is on by default on e.g. x86)
>>> precise unwind info even with the new stack check mode.
>>> ira.c currently has:
>>>     /* We need the frame pointer to catch stack overflow exceptions
>> if
>>>   the stack pointer is moving (as for the alloca case just above). 
>> */
>>>       || (STACK_CHECK_MOVING_SP
>>>           && flag_stack_check
>>>           && flag_exceptions
>>>           && cfun->can_throw_non_call_exceptions)
>>> For alloca we have a frame pointer for other reasons, the question is
>>> if we really need this hunk even if we provided proper unwind info
>>> even for the Ada -fstack-check mode.  Or, if we provide proper unwind
>>> info
>>> for -fasynchronous-unwind-tables, if the above could not be also
>>> && !flag_asynchronous_unwind_tables.  Eric, what exactly is the reason
>>> for the above, is it just lack of proper CFI notes, or something
>>> different?
>>>
>>> Also, on i?86 orq $0, (%rsp) or orl $0, (%esp) is used to probe stack,
>>> while it is shorter, is it actually faster or as slow as movq $0,
>>> (%rsp)
>>> or movl $0, (%esp) ?
>>
>> It at least has the chance of bypassing all of the store queue in CPUs
>> and thus cause no cacheline allocation or trigger prefetching.
>>
>> Not sure if any of that is done though.
>>
>> Performance counters might tell.
>>
>> Otherwise incrementing SP by 4095 and then pushing al would work as
>> well (and be similarly short as the or).
> 
> Oh, and using push intelligently with first bumping to SP & 4096-1 + 4095 would solve the signal atomicity as well. Might be larger and somewhat interfere with CPUs stack engine.  Who knows...
Happy to rely on Honza or Uros for guidance on that.  Though we do have
to maintain proper stack alignment, right?

jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 22:05       ` Jeff Law
@ 2017-06-19 22:10         ` Florian Weimer
  0 siblings, 0 replies; 66+ messages in thread
From: Florian Weimer @ 2017-06-19 22:10 UTC (permalink / raw)
  To: Jeff Law, Joseph Myers; +Cc: gcc-patches

On 06/20/2017 12:05 AM, Jeff Law wrote:
> On 06/19/2017 03:56 PM, Joseph Myers wrote:
>> On Mon, 19 Jun 2017, Florian Weimer wrote:
>>
>>> I think architectures such as aarch64 without implied stack probing as
>>> part of the function call sequence would benefit most from an ABI
>>> agreement (splitting the probing responsibility in some way between
>>> caller and callee).  For architectures with some form of implied
>>
>> I'd expect that, regardless of architecture, if calls don't write to the 
>> stack, the caller has to save its own return address somewhere before 
>> making a call, which means writing the saved link register.

> True, but the callee doesn't know the offset where the caller saved the
> return address.  In fact, different callers could have stored it at
> different offsets.  AFAICT for these targets we just have to make a
> worst case assumption about the caller.

There are also some weird corner cases like this one:

H. Baker, “CONS Should Not CONS Its Arguments, Part II: Cheney on the
M.T.A.” <http://home.pipeline.com/~hbaker1/CheneyMTA.html>.

So I think some sort of convention is needed here.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 17:29 ` Jakub Jelinek
  2017-06-19 17:45   ` Jeff Law
  2017-06-19 18:00   ` Richard Biener
@ 2017-06-20  7:50   ` Eric Botcazou
  2 siblings, 0 replies; 66+ messages in thread
From: Eric Botcazou @ 2017-06-20  7:50 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Jeff Law

> In the patches Jeff is going to post, we have (at least for
> -fasynchronous-unwind-tables which is on by default on e.g. x86)
> precise unwind info even with the new stack check mode.
> ira.c currently has:
>        /* We need the frame pointer to catch stack overflow exceptions if
>           the stack pointer is moving (as for the alloca case just above).  */
>        || (STACK_CHECK_MOVING_SP
>            && flag_stack_check
>            && flag_exceptions
>            && cfun->can_throw_non_call_exceptions)

There is the same test in ix86_finalize_stack_realign_flags.

> For alloca we have a frame pointer for other reasons, the question is
> if we really need this hunk even if we provided proper unwind info
> even for the Ada -fstack-check mode.  Or, if we provide proper unwind info
> for -fasynchronous-unwind-tables, if the above could not be also
> && !flag_asynchronous_unwind_tables.  Eric, what exactly is the reason
> for the above, is it just lack of proper CFI notes, or something different?

The reason was the assertion present in ix86_adjust_stack_and_probe that
the CFA was not the stack pointer, which was removed in r230245:
  https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01499.html
I agree that it's probably obsolete by now.

> Also, on i?86 orq $0, (%rsp) or orl $0, (%esp) is used to probe stack,
> while it is shorter, is it actually faster or as slow as movq $0, (%rsp)
> or movl $0, (%esp) ?

It's the idiom used on Windows since day #1, see libgcc/config/i386/cygwin.S.

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 17:51     ` Jakub Jelinek
  2017-06-19 21:51       ` Jeff Law
@ 2017-06-20  8:03       ` Uros Bizjak
  2017-06-20 10:18         ` Richard Biener
  1 sibling, 1 reply; 66+ messages in thread
From: Uros Bizjak @ 2017-06-20  8:03 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Jeff Law, Jan Hubicka, Eric Botcazou, gcc-patches

On Mon, Jun 19, 2017 at 7:51 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Mon, Jun 19, 2017 at 11:45:13AM -0600, Jeff Law wrote:
>> On 06/19/2017 11:29 AM, Jakub Jelinek wrote:
>> >
>> > Also, on i?86 orq $0, (%rsp) or orl $0, (%esp) is used to probe stack,
>> > while it is shorter, is it actually faster or as slow as movq $0, (%rsp)
>> > or movl $0, (%esp) ?
>> Florian raised this privately to me as well.  There's a couple of issues.
>>
>> 1. Is there a performance penalty/gain for sub-word operations?  If not,
>>    we can improve things slighly there.  Even if it's performance
>>    neutral we can probably do better on code size.
>
> CCing Uros and Honza here, I believe there are at least on x86 penalties
> for 2-byte, maybe for 1-byte and then sometimes some stalls when you
> write or read in a different size from a recent write or read.

Don't use orq $0, (%rsp), as this is a high latency RMW insn.

movq $0x0, (%rsp) is fast, but also quite long insn.

push $0x0 adjusts the stack pointer by 4 or 8 bytes, depending on the
target word size.  A push insn also updates the stack engine's SP delta,
so a synchronizing update of SP (effectively, another ALU operation) is
required if SP is later referenced from an insn other than
push/pop/call/ret.  There are no non-word-sized register pushes.

I think that for the purpose of stack probe, we can write a byte to
the end of the stack, so

movb $0x0, (%rsp).

This is relatively short insn, and operates in the same way for 32bit
and 64bit targets. There are no issues with partial memory stalls
since nothing immediately reads a different sized value from the
written location.

Uros.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 17:07 RFC: stack/heap collision vulnerability and mitigation with GCC Jeff Law
                   ` (2 preceding siblings ...)
  2017-06-19 18:12 ` Richard Kenner
@ 2017-06-20  8:17 ` Eric Botcazou
  2017-06-20 21:52   ` Jeff Law
  2017-06-21  7:56   ` Andreas Schwab
  2017-06-20  9:27 ` Richard Earnshaw (lists)
  4 siblings, 2 replies; 66+ messages in thread
From: Eric Botcazou @ 2017-06-20  8:17 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches

> As some of you are likely aware, Qualys has just published fairly
> detailed information on using stack/heap clashes as an attack vector.
> Eric B, Michael M -- sorry I couldn't say more when I contacted you about
> -fstack-check and some PPC specific stuff.  This has been under embargo
> for the last month.

No problem and thanks for putting together this message.

> Unfortunately, -fstack-check is actually not well suited for our purposes.
> 
> Some background.  -fstack-check was designed primarily for Ada's needs.
> It assumes the whole program is compiled with -fstack-check and it is
> designed to ensure there is enough stack space left so that if the
> program hits the guard (say via infinite recursion) the program can
> safely call into a signal handler and raise an exception.
> 
> To ensure there's always enough space to meet that design requirement,
> -fstack-check probes stack space ahead of the actual need of the code.
> 
> The assumption that all code was compiled with -fstack-check allows for
> elision of some stack probes as they are assumed to have been probed by
> earlier callers in the call chain.  This elision is safe in an
> environment where all callers use -fstack-check, but fatally flawed in a
> mixed environment.
> 
> Most ports first probe by pages for whatever space is requested, then
> after all probing is done, they actually allocate space.  This runs
> afoul of valgrind in various unpleasant ways (including crashing
> valgrind on two targets).
> 
> Only x86-linux currently uses a "moving sp" allocation and probing
> strategy.  ie, it actually allocates space, then probes the space.

Right, because the Linux kernel for x86/x86-64 is the only OS flavor that 
doesn't let you probe the stack ahead of the stack pointer.  All other 
combinations of OS and architecture we tried (and it's quite a lot) do.

> After much poking around I concluded that we really need to implement
> allocation and probing via a "moving sp" strategy.   Probing into
> unallocated areas runs afoul of valgrind, so that's a non-starter.

The reason why you cannot use this strategy on a global basis for stack 
checking is that some ABIs specify that you cannot update the stack pointer 
more than once to establish a frame; others don't explicitly care but...

> Allocating stack space, then probing the pages within the space is
> vulnerable to async signal delivery between the allocation point and the
> probe point.  If that occurs the signal handler could end up running on
> a stack that has collided with the heap.

...yes, there are difficulties with the "moving sp" strategy.

> Finally, we need not ensure the ability to handle a signal at stack
> overflow.  It is fine for the kernel to halt the process immediately if
> it detects a reference to the guard page.

In Ada it's the opposite and we use an alternate signal stack in this case.

> Dynamic (alloca) space is handled fairly generically with simple code to
> allocate a page and probe the just allocated page.

Right, it's not the most difficult part.

> Michael Matz has suggested some generic support so that we don't have to
> write target specific code for each and every target we support.  The
> idea is to have a helper function which allocates and probes stack
> space.  The port can then call that helper function from within its
> prologue generator.  I think this is wise -- I wouldn't want to go
> through this exercise on every port.

Interesting.  We never convinced ourselves that this was worthwhile.

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 18:12 ` Richard Kenner
  2017-06-19 22:05   ` Jeff Law
@ 2017-06-20  8:21   ` Eric Botcazou
  2017-06-20 15:50     ` Jeff Law
  2017-06-20 19:48     ` Jakub Jelinek
  1 sibling, 2 replies; 66+ messages in thread
From: Eric Botcazou @ 2017-06-20  8:21 UTC (permalink / raw)
  To: Richard Kenner; +Cc: gcc-patches, law

> Out of curiosity, does the old Alpha/VMS stack-checking API meet the
> requirements?  From what I recall, I think it does.

No, it's the usual probe-first-and-then-allocate strategy and Jeff rejects it 
because of valgrind.  I'd personally rather change valgrind but...

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 19:05   ` Jeff Law
  2017-06-19 19:45     ` Jakub Jelinek
@ 2017-06-20  8:27     ` Richard Earnshaw (lists)
  2017-06-20 15:50       ` Jeff Law
  1 sibling, 1 reply; 66+ messages in thread
From: Richard Earnshaw (lists) @ 2017-06-20  8:27 UTC (permalink / raw)
  To: Jeff Law, Joseph Myers; +Cc: gcc-patches

On 19/06/17 20:04, Jeff Law wrote:
> On 06/19/2017 11:50 AM, Joseph Myers wrote:
>> On Mon, 19 Jun 2017, Jeff Law wrote:
>>
>>> A key point to remember is that you can never have an allocation
>>> (potentially using more than one allocation site) which is larger than a
>>> page without probing the page.
>>
>> There's a platform ABI issue here.  At least some kernel fixes for these 
>> stack issues, as I understand it, increase the size of the stack guard to 
>> more than a single page.  It would be possible to define the ABI to 
>> require such a larger guard for protection and so reduce the number of 
>> (non-alloca/VLA-using) functions that need probes generated, depending on 
>> whether a goal is to achieve security on kernels without such a fix.  
>> (Thinking in terms of how to get to enabling such probes by default.)
> On 32 bit platforms we don't have a lot of address space left, so we
> have to be careful about creating too large of a guard.
> 
> On 64 bit platforms we have a lot more freedom and I suspect larger
> guards, mandated by the ABI would be useful, if for no other reason than
> allowing us to allocate more stack without probing.   A simple array of
> PATH_MAX characters triggers probing right now.   I suspect (but didn't
> bother to confirm) that PATH_MAX arrays are what cause git to have so
> many large stacks.
> 
> Also if we look at something like ppc and aarch64, we've currently got
> the PROBE_INTERVAL set to 4k.  But in reality they're using much larger
> page sizes.  So we could improve things there as well.
> 

There are aarch64 linux systems using 4k pages for compatibility with
existing aarch32 binaries.

R.

> 
> jeff
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-19 17:07 RFC: stack/heap collision vulnerability and mitigation with GCC Jeff Law
                   ` (3 preceding siblings ...)
  2017-06-20  8:17 ` Eric Botcazou
@ 2017-06-20  9:27 ` Richard Earnshaw (lists)
  2017-06-20 21:39   ` Jeff Law
  4 siblings, 1 reply; 66+ messages in thread
From: Richard Earnshaw (lists) @ 2017-06-20  9:27 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

On 19/06/17 18:07, Jeff Law wrote:
> As some of you are likely aware, Qualys has just published fairly
> detailed information on using stack/heap clashes as an attack vector.
> Eric B, Michael M -- sorry I couldn't say more when I contact you about
> -fstack-check and some PPC specific stuff.  This has been under embargo
> for the last month.
> 
> 
> --
> 
> 
> http://www.openwall.com/lists/oss-security/2017/06/19/1
> 
[...]
> aarch64 is significantly worse.  There are no implicit probes we can
> exploit.  Furthermore, the prologue may allocate stack space 3-4 times.
> So we have to track the distance to the most recent probe and when that
> distance grows too large, we have to emit a probe.  Of course we have to
> make worst case assumptions at function entry.
> 

I'm not sure I understand what you're saying here.  According to the
comment above aarch64_expand_prologue, the stack frame looks like:

+-------------------------------+
|                               |
|  incoming stack arguments     |
|                               |
+-------------------------------+
|                               | <-- incoming stack pointer (aligned)
|  callee-allocated save area   |
|  for register varargs         |
|                               |
+-------------------------------+
|  local variables              | <-- frame_pointer_rtx
|                               |
+-------------------------------+
|  padding0                     | \
+-------------------------------+  |
|  callee-saved registers       |  | frame.saved_regs_size
+-------------------------------+  |
|  LR'                          |  |
+-------------------------------+  |
|  FP'                          | / <- hard_frame_pointer_rtx (aligned)
+-------------------------------+
|  dynamic allocation           |
+-------------------------------+
|  padding                      |
+-------------------------------+
|  outgoing stack arguments     | <-- arg_pointer
|                               |
+-------------------------------+
|                               | <-- stack_pointer_rtx (aligned)

Now for the majority of frames the amount of local variables is small
and there is neither dynamic allocation nor the need for outgoing local
variables.  In this case the first instruction in the function is

	stp	fp, lr, [sp, #-FrameSize]!

So this instruction allocates all the stack needed and stores the
required registers.  That acts as an implicit probe as far as I can tell.


If the locals area gets slightly larger (>= 512 bytes) then the sequence
becomes
	sub	sp, sp, #FrameSize
	stp	fp, lr, [sp]

But again this acts as a sufficient implicit probe provided that
FrameSize does not exceed the probe interval.

Yes, we need more implicit probes if the local variable space becomes
large and we need additional probes for checking the outgoing area and
the dynamic area, but again, if those are small (< 512) we could replace
the existing
	sub	sp, sp, #n
with
	str	xzr, [sp, #-n]!

and thus the explicit probe now becomes the stack allocation operation.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20  8:03       ` Uros Bizjak
@ 2017-06-20 10:18         ` Richard Biener
  2017-06-20 11:10           ` Uros Bizjak
  0 siblings, 1 reply; 66+ messages in thread
From: Richard Biener @ 2017-06-20 10:18 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Jakub Jelinek, Jeff Law, Jan Hubicka, Eric Botcazou, gcc-patches

On Tue, Jun 20, 2017 at 10:03 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Mon, Jun 19, 2017 at 7:51 PM, Jakub Jelinek <jakub@redhat.com> wrote:
>> On Mon, Jun 19, 2017 at 11:45:13AM -0600, Jeff Law wrote:
>>> On 06/19/2017 11:29 AM, Jakub Jelinek wrote:
>>> >
>>> > Also, on i?86 orq $0, (%rsp) or orl $0, (%esp) is used to probe stack,
>>> > while it is shorter, is it actually faster or as slow as movq $0, (%rsp)
>>> > or movl $0, (%esp) ?
>>> Florian raised this privately to me as well.  There's a couple of issues.
>>>
>>> 1. Is there a performance penalty/gain for sub-word operations?  If not,
>>>    we can improve things slightly there.  Even if it's performance
>>>    neutral we can probably do better on code size.
>>
>> CCing Uros and Honza here, I believe there are at least on x86 penalties
>> for 2-byte, maybe for 1-byte and then sometimes some stalls when you
>> write or read in a different size from a recent write or read.
>
> Don't use orq $0, (%rsp), as this is a high latency RMW insn.

Well, but _maybe_ it's optimized because oring 0 never changes anything?
At least it would be nice if it would only trigger the page-fault side-effect
and then not consume other CPU resources.

I guess micro-benchmark plus performance counters might tell.

> movq $0x0, (%rsp) is fast, but also quite long insn.
>
> push $0x0 moves the stack pointer by 4 or 8 bytes, depending on
> target word size. Push insn also updates delta stack pointer, so
> update of SP is required (effectively, another ALU operation) if SP is
> later referenced from insn other than push/pop/call/ret. There are no
> non-word-sized register pushes.

I only suggested push $0x0 because that doesn't leave the window
open for the async signal where %rsp points somewhere we didn't probe yet.

> I think that for the purpose of stack probe, we can write a byte to
> the end of the stack, so
>
> movb $0x0, (%rsp).
>
> This is relatively short insn, and operates in the same way for 32bit
> and 64bit targets. There are no issues with partial memory stalls
> since nothing immediately reads a different sized value from the
> written location.
>
> Uros.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 10:18         ` Richard Biener
@ 2017-06-20 11:10           ` Uros Bizjak
  2017-06-20 12:13             ` Florian Weimer
  0 siblings, 1 reply; 66+ messages in thread
From: Uros Bizjak @ 2017-06-20 11:10 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jakub Jelinek, Jeff Law, Jan Hubicka, Eric Botcazou, gcc-patches

On Tue, Jun 20, 2017 at 12:18 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Tue, Jun 20, 2017 at 10:03 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>> On Mon, Jun 19, 2017 at 7:51 PM, Jakub Jelinek <jakub@redhat.com> wrote:
>>> On Mon, Jun 19, 2017 at 11:45:13AM -0600, Jeff Law wrote:
>>>> On 06/19/2017 11:29 AM, Jakub Jelinek wrote:
>>>> >
>>>> > Also, on i?86 orq $0, (%rsp) or orl $0, (%esp) is used to probe stack,
>>>> > while it is shorter, is it actually faster or as slow as movq $0, (%rsp)
>>>> > or movl $0, (%esp) ?
>>>> Florian raised this privately to me as well.  There's a couple of issues.
>>>>
>>>> 1. Is there a performance penalty/gain for sub-word operations?  If not,
>>>>    we can improve things slightly there.  Even if it's performance
>>>>    neutral we can probably do better on code size.
>>>
>>> CCing Uros and Honza here, I believe there are at least on x86 penalties
>>> for 2-byte, maybe for 1-byte and then sometimes some stalls when you
>>> write or read in a different size from a recent write or read.
>>
>> Don't use orq $0, (%rsp), as this is a high latency RMW insn.
>
> Well, but _maybe_ it's optimized because oring 0 never changes anything?
> At least it would be nice if it would only trigger the page-fault side-effect
> and then not consume other CPU resources.

It doesn't look so:

--cut here--
void
__attribute__ ((noinline))
test_or (void)
{
  volatile int a;
  unsigned int n;

  for (n = 0; n < (unsigned) -1; n++)
    asm ("orl $0, %0" : "+m" (a));
}

void
__attribute__ ((noinline))
test_movb (void)
{
  volatile int a;
  unsigned int n;

  for (n = 0; n < (unsigned) -1; n++)
    asm ("movb $0, %0" : "+m" (a));
}

void
__attribute__ ((noinline))
test_movl (void)
{
  volatile int a;
  unsigned int n;

  for (n = 0; n < (unsigned) -1; n++)
    asm ("movl $0, %0" : "+m" (a));
}

int main()
{
 test_or ();
 test_movb ();
 test_movl ();
 return 0;
}
--cut here--

  74,99%  a.out    a.out          [.] test_or
  12,50%  a.out    a.out          [.] test_movb
  12,50%  a.out    a.out          [.] test_movl

Uros.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 11:10           ` Uros Bizjak
@ 2017-06-20 12:13             ` Florian Weimer
  2017-06-20 12:17               ` Uros Bizjak
  0 siblings, 1 reply; 66+ messages in thread
From: Florian Weimer @ 2017-06-20 12:13 UTC (permalink / raw)
  To: Uros Bizjak, Richard Biener
  Cc: Jakub Jelinek, Jeff Law, Jan Hubicka, Eric Botcazou, gcc-patches

On 06/20/2017 01:10 PM, Uros Bizjak wrote:

>   74,99%  a.out    a.out          [.] test_or
>   12,50%  a.out    a.out          [.] test_movb
>   12,50%  a.out    a.out          [.] test_movl

Could you try notl/notb/negl/negb as well, please?

Thanks,
Florian

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 12:13             ` Florian Weimer
@ 2017-06-20 12:17               ` Uros Bizjak
  2017-06-20 12:20                 ` Uros Bizjak
  2017-06-20 15:59                 ` Jeff Law
  0 siblings, 2 replies; 66+ messages in thread
From: Uros Bizjak @ 2017-06-20 12:17 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Richard Biener, Jakub Jelinek, Jeff Law, Jan Hubicka,
	Eric Botcazou, gcc-patches

On Tue, Jun 20, 2017 at 2:13 PM, Florian Weimer <fweimer@redhat.com> wrote:
> On 06/20/2017 01:10 PM, Uros Bizjak wrote:
>
>>   74,99%  a.out    a.out          [.] test_or
>>   12,50%  a.out    a.out          [.] test_movb
>>   12,50%  a.out    a.out          [.] test_movl
>
> Could you try notl/notb/negl/negb as well, please?

These all have the same (long) runtime as test_or.

Uros.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 12:17               ` Uros Bizjak
@ 2017-06-20 12:20                 ` Uros Bizjak
  2017-06-20 12:27                   ` Richard Biener
  2017-06-20 15:59                 ` Jeff Law
  1 sibling, 1 reply; 66+ messages in thread
From: Uros Bizjak @ 2017-06-20 12:20 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Richard Biener, Jakub Jelinek, Jeff Law, Jan Hubicka,
	Eric Botcazou, gcc-patches

On Tue, Jun 20, 2017 at 2:17 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Tue, Jun 20, 2017 at 2:13 PM, Florian Weimer <fweimer@redhat.com> wrote:
>> On 06/20/2017 01:10 PM, Uros Bizjak wrote:
>>
>>>   74,99%  a.out    a.out          [.] test_or
>>>   12,50%  a.out    a.out          [.] test_movb
>>>   12,50%  a.out    a.out          [.] test_movl
>>
>> Could you try notl/notb/negl/negb as well, please?
>
> These all have the same (long) runtime as test_or.

Perhaps we can use "testb $0, %0"? It doesn't write to the memory, but
otherwise has the same runtime as movb/movl.

Uros.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 12:20                 ` Uros Bizjak
@ 2017-06-20 12:27                   ` Richard Biener
  2017-06-20 21:57                     ` Jeff Law
  0 siblings, 1 reply; 66+ messages in thread
From: Richard Biener @ 2017-06-20 12:27 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Florian Weimer, Jakub Jelinek, Jeff Law, Jan Hubicka,
	Eric Botcazou, gcc-patches

On Tue, Jun 20, 2017 at 2:20 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Tue, Jun 20, 2017 at 2:17 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>> On Tue, Jun 20, 2017 at 2:13 PM, Florian Weimer <fweimer@redhat.com> wrote:
>>> On 06/20/2017 01:10 PM, Uros Bizjak wrote:
>>>
>>>>   74,99%  a.out    a.out          [.] test_or
>>>>   12,50%  a.out    a.out          [.] test_movb
>>>>   12,50%  a.out    a.out          [.] test_movl
>>>
>>> Could you try notl/notb/negl/negb as well, please?
>>
>> These all have the same (long) runtime as test_or.
>
> Perhaps we can use "testb $0, %0"? It doesn't write to the memory, but
> otherwise has the same runtime as movb/movl.

That sounds good, OTOH it's a matter of putting strain on the
memory fetch or store side...  We'll get cacheline allocations in
any case (but the memory will be used eventually).  Instead of a
test, a mere movb into a scratch register (aka a load instead of a
store) would work as well, apart from the need for a scratch register.

We can also vectorize with scatters ;)  (just kidding)

Richard.

> Uros.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20  8:27     ` Richard Earnshaw (lists)
@ 2017-06-20 15:50       ` Jeff Law
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff Law @ 2017-06-20 15:50 UTC (permalink / raw)
  To: Richard Earnshaw (lists), Joseph Myers; +Cc: gcc-patches

On 06/20/2017 02:27 AM, Richard Earnshaw (lists) wrote:
> On 19/06/17 20:04, Jeff Law wrote:
>> On 06/19/2017 11:50 AM, Joseph Myers wrote:
>>> On Mon, 19 Jun 2017, Jeff Law wrote:
>>>
>>>> A key point to remember is that you can never have an allocation
>>>> (potentially using more than one allocation site) which is larger than a
>>>> page without probing the page.
>>>
>>> There's a platform ABI issue here.  At least some kernel fixes for these 
>>> stack issues, as I understand it, increase the size of the stack guard to 
>>> more than a single page.  It would be possible to define the ABI to 
>>> require such a larger guard for protection and so reduce the number of 
>>> (non-alloca/VLA-using) functions that need probes generated, depending on 
>>> whether a goal is to achieve security on kernels without such a fix.  
>>> (Thinking in terms of how to get to enabling such probes by default.)
>> On 32 bit platforms we don't have a lot of address space left, so we
>> have to be careful about creating too large of a guard.
>>
>> On 64 bit platforms we have a lot more freedom and I suspect larger
>> guards, mandated by the ABI would be useful, if for no other reason than
>> allowing us to allocate more stack without probing.   A simple array of
>> PATH_MAX characters triggers probing right now.   I suspect (but didn't
>> bother to confirm) that PATH_MAX arrays are what cause git to have so
>> many large stacks.
>>
>> Also if we look at something like ppc and aarch64, we've currently got
>> the PROBE_INTERVAL set to 4k.  But in reality they're using much larger
>> page sizes.  So we could improve things there as well.
>>
> 
> There are aarch64 linux systems using 4k pages for compatibility with
> existing aarch32 binaries.
Ah.  That's good to know.  Thanks.

jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20  8:21   ` Eric Botcazou
@ 2017-06-20 15:50     ` Jeff Law
  2017-06-20 19:48     ` Jakub Jelinek
  1 sibling, 0 replies; 66+ messages in thread
From: Jeff Law @ 2017-06-20 15:50 UTC (permalink / raw)
  To: Eric Botcazou, Richard Kenner; +Cc: gcc-patches

On 06/20/2017 02:21 AM, Eric Botcazou wrote:
>> Out of curiosity, does the old Alpha/VMS stack-checking API meet the
>> requirements?  From what I recall, I think it does.
> 
> No, it's the usual probe-first-and-then-allocate strategy and Jeff rejects it 
> because of valgrind.  I'd personally rather change valgrind but...
I'm torn here.  It'd certainly be a hell of a lot easier to punt this to
valgrind, but with the issues I just couldn't get comfortable with that.

We're probing pages which are potentially unmapped and far away from the
stack pointer.  The kernel and/or valgrind has to look at the reference
and make a guess whether or not it's really a request for more stack or
a wild pointer -- and there's no real good way to tell the difference.
Thus we're dependent upon the heuristics used by the kernel and valgrind
and we're dependent on those heuristics essentially being the same (and
I'm certain they are not at this time :-)

One could also argue that these probes are undefined behavior precisely
because they potentially hit unmapped pages far away from the
stack pointer.

Any probing beyond the stack pointer is also going to trigger a valgrind
warning.  Valgrind has some support for avoiding the warning if a reference
hits the red zone -- but these probes can be well beyond the red zone.

In a world where distros may be turning on -fstack-check=<whatever> by
default, valgrind has to continue to work and not generate an
unreasonable number of false positive warnings, else the tool becomes
useless.


Jeff



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 12:17               ` Uros Bizjak
  2017-06-20 12:20                 ` Uros Bizjak
@ 2017-06-20 15:59                 ` Jeff Law
  1 sibling, 0 replies; 66+ messages in thread
From: Jeff Law @ 2017-06-20 15:59 UTC (permalink / raw)
  To: Uros Bizjak, Florian Weimer
  Cc: Richard Biener, Jakub Jelinek, Jan Hubicka, Eric Botcazou, gcc-patches

On 06/20/2017 06:17 AM, Uros Bizjak wrote:
> On Tue, Jun 20, 2017 at 2:13 PM, Florian Weimer <fweimer@redhat.com> wrote:
>> On 06/20/2017 01:10 PM, Uros Bizjak wrote:
>>
>>>   74,99%  a.out    a.out          [.] test_or
>>>   12,50%  a.out    a.out          [.] test_movb
>>>   12,50%  a.out    a.out          [.] test_movl
>>
>> Could you try notl/notb/negl/negb as well, please?
> 
> These all have the same (long) runtime as test_or.
That would be my expectation -- they (not/neg) are going to be RMW.

So can we agree that moving away from RMW to a simple W style instruction
for the probe is where we want to go?  Then we can kick around the exact
form of that store.

FWIW, we don't have to store zero -- ultimately we care about the side
effect of triggering the page fault, not the value written.  So we could
just as easily store a register into the probed address to avoid the
codesize cost of encoding an immediate.

I did that in my local s390 patches.  It may not be necessary there, but
it allowed me to avoid thinking too hard about the ISA and get s390
proof of concept code running :-)


Jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20  8:21   ` Eric Botcazou
  2017-06-20 15:50     ` Jeff Law
@ 2017-06-20 19:48     ` Jakub Jelinek
  2017-06-20 20:37       ` Eric Botcazou
  1 sibling, 1 reply; 66+ messages in thread
From: Jakub Jelinek @ 2017-06-20 19:48 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: Richard Kenner, gcc-patches, law

On Tue, Jun 20, 2017 at 10:21:14AM +0200, Eric Botcazou wrote:
> > Out of curiosity, does the old Alpha/VMS stack-checking API meet the
> > requirements?  From what I recall, I think it does.
> 
> No, it's the usual probe-first-and-then-allocate strategy and Jeff rejects it 
> because of valgrind.  I'd personally rather change valgrind but...

But then valgrind won't be able to find bugs in the code (storing and later
reading stuff into the volatile parts of the stack that could be overwritten
by any asynchronous signal).  GCC had various bugs in this area and valgrind
has been able to report those.  Unless the probe instruction is sufficiently
magic that it won't usually appear in other code.

Only checking loads below the stack is not sufficient, some buggy code could
e.g. store some data below stack pointer (below red zone if any), then
subtract stack and then try to read it, etc.

Not to mention that it isn't just false positive messages with current
valgrind on -fstack-check code, e.g. on ppc64 it just crashes.

	Jakub

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 19:48     ` Jakub Jelinek
@ 2017-06-20 20:37       ` Eric Botcazou
  2017-06-20 20:46         ` Jeff Law
  0 siblings, 1 reply; 66+ messages in thread
From: Eric Botcazou @ 2017-06-20 20:37 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Richard Kenner, law

> But then valgrind won't be able to find bugs in the code (storing and later
> reading stuff into the volatile parts of the stack that could be overwritten
> by any asynchronous signal).  GCC had various bugs in this area and
> valgrind has been able to report those.  Unless the probe instruction is
> sufficiently magic that it won't usually appear in other code.

Right, maybe this magic aspect was the reason why it was initially implemented 
like that for Cygwin; at least you know that orl $0 is meant to be special.

> Only checking loads below the stack is not sufficient, some buggy code could
> e.g. store some data below stack pointer (below red zone if any), then
> subtract stack and then try to read it, etc.
> 
> Not to mention that it isn't just false positive messages with current
> valgrind on -fstack-check code, e.g. on ppc64 it just crashes.

The reasoning seems weird though since, apart from x86/x86-64, you're going to 
gratuitously inflict this painful "moving sp" thing on every program compiled 
on Linux because of just one tool that you can adapt.

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 20:37       ` Eric Botcazou
@ 2017-06-20 20:46         ` Jeff Law
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff Law @ 2017-06-20 20:46 UTC (permalink / raw)
  To: Eric Botcazou, Jakub Jelinek; +Cc: gcc-patches, Richard Kenner

On 06/20/2017 02:37 PM, Eric Botcazou wrote:
>> But then valgrind won't be able to find bugs in the code (storing and later
>> reading stuff into the volatile parts of the stack that could be overwritten
>> by any asynchronous signal).  GCC had various bugs in this area and
>> valgrind has been able to report those.  Unless the probe instruction is
>> sufficiently magic that it won't usually appear in other code.
> 
> Right, maybe this magic aspect was the reason why it was initially implemented 
> like that for Cygwin; at least you know that orl $0 is meant to be special.
> 
>> Only checking loads below the stack is not sufficient, some buggy code could
>> e.g. store some data below stack pointer (below red zone if any), then
>> subtract stack and then try to read it, etc.
>>
>> Not to mention that it isn't just false positive messages with current
>> valgrind on -fstack-check code, e.g. on ppc64 it just crashes.
> 
> The reasoning seems weird though since, apart from x86/x86-64, you're going to 
> gratuitously inflict this painful "moving sp" thing on every program compiled 
> on Linux because of just one tool that you can adapt.
I don't see MOVING_SP as painful, except perhaps on aarch64.  On
something like PPC MOVING_SP turns out to be exceedingly clean.

jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20  9:27 ` Richard Earnshaw (lists)
@ 2017-06-20 21:39   ` Jeff Law
  2017-06-21  8:41     ` Richard Earnshaw (lists)
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff Law @ 2017-06-20 21:39 UTC (permalink / raw)
  To: Richard Earnshaw (lists), gcc-patches

On 06/20/2017 03:27 AM, Richard Earnshaw (lists) wrote:
> On 19/06/17 18:07, Jeff Law wrote:
>> As some of you are likely aware, Qualys has just published fairly
>> detailed information on using stack/heap clashes as an attack vector.
>> Eric B, Michael M -- sorry I couldn't say more when I contact you about
>> -fstack-check and some PPC specific stuff.  This has been under embargo
>> for the last month.
>>
>>
>> --
>>
>>
>> http://www.openwall.com/lists/oss-security/2017/06/19/1
>>
> [...]
>> aarch64 is significantly worse.  There are no implicit probes we can
>> exploit.  Furthermore, the prologue may allocate stack space 3-4 times.
>> So we have to track the distance to the most recent probe and when that
>> distance grows too large, we have to emit a probe.  Of course we have to
>> make worst case assumptions at function entry.
>>
> 
> I'm not sure I understand what you're saying here.  According to the
> comment above aarch64_expand_prologue, the stack frame looks like:
> 
> +-------------------------------+
> |                               |
> |  incoming stack arguments     |
> |                               |
> +-------------------------------+
> |                               | <-- incoming stack pointer (aligned)
> |  callee-allocated save area   |
> |  for register varargs         |
> |                               |
> +-------------------------------+
> |  local variables              | <-- frame_pointer_rtx
> |                               |
> +-------------------------------+
> |  padding0                     | \
> +-------------------------------+  |
> |  callee-saved registers       |  | frame.saved_regs_size
> +-------------------------------+  |
> |  LR'                          |  |
> +-------------------------------+  |
> |  FP'                          | / <- hard_frame_pointer_rtx (aligned)
> +-------------------------------+
> |  dynamic allocation           |
> +-------------------------------+
> |  padding                      |
> +-------------------------------+
> |  outgoing stack arguments     | <-- arg_pointer
> |                               |
> +-------------------------------+
> |                               | <-- stack_pointer_rtx (aligned)
> 
> Now for the majority of frames the amount of local variables is small
> and there is neither dynamic allocation nor the need for outgoing local
> variables.  In this case the first instruction in the function is
> 
> 	stp	fp, lr, [sp, #-FrameSize]!
But the stack pointer might have already been advanced into the guard
page by the caller.   For the sake of argument assume the guard page is
0xf1000 and assume that our stack pointer at entry is 0xf1010 and that
the caller hasn't touched the 0xf1000 page.

If FrameSize >= 32, then the stores are going to hit the 0xf0000 page
rather than the 0xf1000 page.   That's jumping the guard.  Thus we have
to emit a probe prior to this stack allocation.

Now because this instruction stores at *new_sp, it does allow us to
eliminate future probes and I do take advantage of that in my code.

The implementation is actually rather simple.  We keep a conservative
estimate of the offset of the last known probe relative to the stack
pointer.  At entry we have to assume the offset is:

PROBE_INTERVAL - (STACK_BOUNDARY / BITS_PER_UNIT)


A stack allocation increases the offset.  A store into the stack
decreases the offset.

A probe is required before an allocation that increases the offset to >=
PROBE_INTERVAL.

An allocation + store instruction such as shown does both, but can be
(and is) easily modeled.  The only tricky case here is that you can't naively
break it up into an allocation and store as that can force an
unnecessary probe (say if the allocated space is just enough to hold the
stored objects).
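
To make the bookkeeping concrete, here is a small C model of the
scheme just described -- my sketch only, not the actual
implementation, and both constants are assumptions:

--cut here--
#define PROBE_INTERVAL 4096        /* assumed */
#define STACK_BOUNDARY_BYTES 16    /* STACK_BOUNDARY / BITS_PER_UNIT */

/* Conservative distance from the stack pointer to the last known
   probe.  */
static unsigned probe_offset;

static void
at_function_entry (void)
{
  /* Worst case assumption about what the caller left us.  */
  probe_offset = PROBE_INTERVAL - STACK_BOUNDARY_BYTES;
}

/* Return nonzero if an explicit probe must be emitted before an
   allocation of SIZE bytes (SIZE < PROBE_INTERVAL here; anything
   larger needs a probing loop, as for alloca).  */
static int
note_allocation (unsigned size)
{
  int need_probe = probe_offset + size >= PROBE_INTERVAL;
  if (need_probe)
    probe_offset = 0;           /* the emitted probe hits *sp */
  probe_offset += size;
  return need_probe;
}

/* A store at SP + SP_OFFSET acts as an implicit probe.  */
static void
note_store (unsigned sp_offset)
{
  if (sp_offset < probe_offset)
    probe_offset = sp_offset;
}

/* An allocate-and-store such as "stp fp, lr, [sp, #-F]!" must be
   modeled as a unit rather than split naively: the store lands
   exactly at the new *sp, so it can skip the guard only if the move
   goes more than one interval past the last probe.  */
static int
note_alloc_and_store (unsigned size)
{
  int need_probe = probe_offset + size > PROBE_INTERVAL;
  probe_offset = 0;             /* afterwards we are probed at *sp */
  return need_probe;
}
--cut here--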




> 
> 
> If the locals area gets slightly larger (>= 512 bytes) then the sequence
> becomes
> 	sub	sp, sp, #FrameSize
> 	stp	fp, lr, [sp]
> 
> But again this acts as a sufficient implicit probe provided that
> FrameSize does not exceed the probe interval.
And again, the store acts as a probe which can eliminate potential
probes that might occur later in the instruction stream.  But if the
allocation by the "sub" instruction causes our running offset to cross
PROBE_INTERVAL, then we must emit a probe prior to the "sub" instruction.

Hopefully it'll be clearer when I post the code :-)  aarch64 is one that
will need updating as all work to-date has been with Red Hat's 4.8
compiler with the aarch64 code generator bolted onto the side.

So perhaps "no implicit probes" was too strong.  It would probably be
better stated "no implicit probes in the caller".  We certainly use
stores in the prologue to try and eliminate probes.  In fact, we try
harder on aarch64 than any other target.

Jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20  8:17 ` Eric Botcazou
@ 2017-06-20 21:52   ` Jeff Law
  2017-06-20 22:20     ` Eric Botcazou
  2017-06-21 19:07     ` Florian Weimer
  2017-06-21  7:56   ` Andreas Schwab
  1 sibling, 2 replies; 66+ messages in thread
From: Jeff Law @ 2017-06-20 21:52 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: gcc-patches

On 06/20/2017 02:16 AM, Eric Botcazou wrote:
> 
> Right, because the Linux kernel for x86/x86-64 is the only OS flavor that 
> doesn't let you probe the stack ahead of the stack pointer.  All other 
> combinations of OS and architecture we tried (and it's quite a lot) do.
But what you end up depending on is undocumented behavior of a
particular kernel implementation.  That seems rather unwise.


> 
>> After much poking around I concluded that we really need to implement
>> allocation and probing via a "moving sp" strategy.   Probing into
>> unallocated areas runs afoul of valgrind, so that's a non-starter.
> 
> The reason why you cannot use this strategy on a global basis for stack 
> checking is that some ABIs specify that you cannot update the stack pointer 
> more than once to establish a frame; others don't explicitly care but...
Which ABIs have that property?  I'll be the first to admit that I've
purged much of my weird ABI memories.

Supporting ABIs which force us into a probe, then allocate strategy is
actually easy.  We can use the existing -fstack-check code, but use the
value 0 for STACK_CHECK_PROTECT.

Just replace all uses of STACK_CHECK_PROTECT with calls to a wrapper.

The wrapper looks like

if (magic_flag)
  return STACK_CHECK_PROTECT;
else
  return 0;

That's precisely what we were planning to do prior to bumping against
the valgrind issues.  That indirection makes it easy to ensure we didn't
change the behavior of the existing stack-check for Ada, but also allows
us to change the behavior for the new stack checking option.


> 
>> Allocating stack space, then probing the pages within the space is
>> vulnerable to async signal delivery between the allocation point and the
>> probe point.  If that occurs the signal handler could end up running on
>> a stack that has collided with the heap.
> 
> ...yes, there are difficulties with the "moving sp" strategy.
> 
>> Finally, we need not ensure the ability to handle a signal at stack
>> overflow.  It is fine for the kernel to halt the process immediately if
>> it detects a reference to the guard page.
> 
> In Ada it's the opposite and we use an alternate signal stack in this case.
Ah, so if you're running on an alternate stack, then why probe ahead of
need?  I thought the whole point of probing a couple pages ahead was to
ensure you could take the signal in Ada.

I've also wondered if a 2 page guard would solve some of these problems.
 In the event of stack overflow, the kernel maps in one of the two pages
for use by the signal handler.   But changing things at this point may
not be worth the effort.

> 
>> Michael Matz has suggested some generic support so that we don't have to
>> write target-specific code for each and every target we support.  The
>> idea is to have a helper function which allocates and probes stack
>> space.  The port can then call that helper function from within its
>> prologue generator.  I think this is wise -- I wouldn't want to go
>> through this exercise on every port.
> 
> Interesting.  We never convinced ourselves that this was worthwhile.
The idea is not to have to write probing code for all those embedded
targets.  I doubt anyone really wants to write probes for the mn103,
rl78, mep, etc etc.

With Matz's little helper routine, they just have to pick the right
point in their prologue code to call the helper.  At least that's the
theory.

Jeff


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 12:27                   ` Richard Biener
@ 2017-06-20 21:57                     ` Jeff Law
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff Law @ 2017-06-20 21:57 UTC (permalink / raw)
  To: Richard Biener, Uros Bizjak
  Cc: Florian Weimer, Jakub Jelinek, Jan Hubicka, Eric Botcazou, gcc-patches

On 06/20/2017 06:27 AM, Richard Biener wrote:
> On Tue, Jun 20, 2017 at 2:20 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>> On Tue, Jun 20, 2017 at 2:17 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>> On Tue, Jun 20, 2017 at 2:13 PM, Florian Weimer <fweimer@redhat.com> wrote:
>>>> On 06/20/2017 01:10 PM, Uros Bizjak wrote:
>>>>
>>>>>   74,99%  a.out    a.out          [.] test_or
>>>>>   12,50%  a.out    a.out          [.] test_movb
>>>>>   12,50%  a.out    a.out          [.] test_movl
>>>>
>>>> Could you try notl/notb/negl/negb as well, please?
>>>
>>> These all have the same (long) runtime as test_or.
>>
>> Perhaps we can use "testb $0, %0"? It doesn't write to the memory, but
>> otherwise has the same runtime as movb/movl.
> 
> That sounds good, OTOH it's a matter of putting strain on the
> memory fetch or store side...  We'll get cacheline allocations in
> any case (but the memory will be used eventually).  Instead of a
> test, a mere movb into a scratch register (aka a load instead of a
> store) would work as well, apart from the need for a scratch register.
It was never clear to me why we always implement probes via stores --
though from a development standpoint a destructive store is useful.

I'd expect a tst to generate the desired SEGV.

How does that compare to the partial-allocation + push approach?


> 
> We can also vectorize with scatters ;)  (just kidding)
:-)

jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 21:52   ` Jeff Law
@ 2017-06-20 22:20     ` Eric Botcazou
  2017-06-21 17:31       ` Jeff Law
  2017-06-21 19:07     ` Florian Weimer
  1 sibling, 1 reply; 66+ messages in thread
From: Eric Botcazou @ 2017-06-20 22:20 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches

> But what you end up depending on is undocumented behavior of a
> particular kernel implementation.  That seems rather unwise.

And it's the single example of such a thing in the entire codebase?
I don't know the code of the sanitizer much, but from the outside it looks 
full of similar tricks...

> Which ABIs have that property?  I'll be the first to admit that I've
> purged much of my weird ABI memories.

The original Alpha ABI mentioned by Richard IIRC for example.

> Supporting ABIs which force us into a probe, then allocate strategy is
> actually easy.  We can use the existing -fstack-check code, but use the
> value 0 for STACK_CHECK_PROTECT.
> 
> Just replace all uses of STACK_CHECK_PROTECT with calls to a wrapper.
> 
> The wrapper looks like
> 
> if (magic_flag)
>   return STACK_CHECK_PROTECT;
> else
>   return 0;
> 
> That's precisely what we were planning to do prior to bumping against
> the valgrind issues.  That indirection makes it easy to ensure we didn't
> change the behavior of the existing stack-check for Ada, but also allows
> us to change the behavior for the new stack checking option.

Yes, that would seem the most straightforward thing to do modulo Valgrind.

> Ah, so if you're running on an alternate stack, then why probe ahead of
> need?  I thought the whole point of probing a couple pages ahead was to
> ensure you could take the signal in Ada.

We run on the alternate stack only when we do _not_ probe ahead, i.e. on 
x86/x86-64 Linux.

> I've also wondered if a 2 page guard would solve some of these problems.
>  In the event of stack overflow, the kernel maps in one of the two pages
> for use by the signal handler.   But changing things at this point may
> not be worth the effort.

That was exactly the strategy used by Tru64 (so you needed to manually unmap 
the page after you had recovered from the overflow).

-- 
Eric Botcazou

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20  8:17 ` Eric Botcazou
  2017-06-20 21:52   ` Jeff Law
@ 2017-06-21  7:56   ` Andreas Schwab
  1 sibling, 0 replies; 66+ messages in thread
From: Andreas Schwab @ 2017-06-21  7:56 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: Jeff Law, gcc-patches

On Jun 20 2017, Eric Botcazou <ebotcazou@adacore.com> wrote:

> Right, because the Linux kernel for x86/x86-64 is the only OS flavor that 
> doesn't let you probe the stack ahead of the stack pointer.  All other 
> combinations of OS and architecture we tried (and it's quite a lot) do.

Take a look at do_page_fault in arch/*/mm/fault.c; there are a lot of
architectures that place a limit on how far you can probe below USP.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 21:39   ` Jeff Law
@ 2017-06-21  8:41     ` Richard Earnshaw (lists)
  2017-06-21 17:25       ` Jeff Law
  0 siblings, 1 reply; 66+ messages in thread
From: Richard Earnshaw (lists) @ 2017-06-21  8:41 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

On 20/06/17 22:39, Jeff Law wrote:
> On 06/20/2017 03:27 AM, Richard Earnshaw (lists) wrote:
>> On 19/06/17 18:07, Jeff Law wrote:
>>> As some of you are likely aware, Qualys has just published fairly
>>> detailed information on using stack/heap clashes as an attack vector.
>>> Eric B, Michael M -- sorry I couldn't say more when I contact you about
>>> -fstack-check and some PPC specific stuff.  This has been under embargo
>>> for the last month.
>>>
>>>
>>> --
>>>
>>>
>>> http://www.openwall.com/lists/oss-security/2017/06/19/1
>>>
>> [...]
>>> aarch64 is significantly worse.  There are no implicit probes we can
>>> exploit.  Furthermore, the prologue may allocate stack space 3-4 times.
>>> So we have to track the distance to the most recent probe and when that
>>> distance grows too large, we have to emit a probe.  Of course we have to
>>> make worst case assumptions at function entry.
>>>
>>
>> I'm not sure I understand what you're saying here.  According to the
>> comment above aarch64_expand_prologue, the stack frame looks like:
>>
>> +-------------------------------+
>> |                               |
>> |  incoming stack arguments     |
>> |                               |
>> +-------------------------------+
>> |                               | <-- incoming stack pointer (aligned)
>> |  callee-allocated save area   |
>> |  for register varargs         |
>> |                               |
>> +-------------------------------+
>> |  local variables              | <-- frame_pointer_rtx
>> |                               |
>> +-------------------------------+
>> |  padding0                     | \
>> +-------------------------------+  |
>> |  callee-saved registers       |  | frame.saved_regs_size
>> +-------------------------------+  |
>> |  LR'                          |  |
>> +-------------------------------+  |
>> |  FP'                          | / <- hard_frame_pointer_rtx (aligned)
>> +-------------------------------+
>> |  dynamic allocation           |
>> +-------------------------------+
>> |  padding                      |
>> +-------------------------------+
>> |  outgoing stack arguments     | <-- arg_pointer
>> |                               |
>> +-------------------------------+
>> |                               | <-- stack_pointer_rtx (aligned)
>>
>> Now for the majority of frames the amount of local variables is small
>> and there is neither dynamic allocation nor the need for outgoing local
>> variables.  In this case the first instruction in the function is
>>
>> 	stp	fp, lr, [sp, #-FrameSize]!
> But the stack pointer might have already been advanced into the guard
> page by the caller.   For the sake of argument assume the guard page is
> 0xf1000 and assume that our stack pointer at entry is 0xf1010 and that
> the caller hasn't touched the 0xf1000 page.

Then make sure the caller does touch the 0xf1000 page.  If it's
allocated that much stack it should be forced to do the probe and not
rely on all its children having to do it because it can't be bothered.


> 
> If FrameSize >= 32, then the stores are going to hit the 0xf0000 page
> rather than the 0xf1000 page.   That's jumping the guard.  Thus we have
> to emit a probe prior to this stack allocation.
> 
> Now because this instruction stores at *new_sp, it does allow us to
> eliminate future probes and I do take advantage of that in my code.
> 
> The implementation is actually rather simple.  We keep a conservative
> estimate of the offset of the last known probe relative to the stack
> pointer.  At entry we have to assume the offset is:
> 
> PROBE_INTERVAL - (STACK_BOUNDARY / BITS_PER_UNIT)
> 
> 
> A stack allocation increases the offset.  A store into the stack
> decreases the offset.
> 
> A probe is required before an allocation that increases the offset to >=
> PROBE_INTERVAL.
> 
> An allocation + store instruction such as shown does both, but can be
> (and is) easily modeled.  The only tricky case here is that you can't naively
> break it up into an allocation and store as that can force an
> unnecessary probe (say if the allocated space is just enough to hold the
> stored objects).
> 

It's better to pay the (relatively small) price when doing large
allocations than to repeatedly pay the price on every small allocation.

> 
> 
> 
>>
>>
>> If the locals area gets slightly larger (>= 512 bytes) then the sequence
>> becomes
>> 	sub	sp, sp, #FrameSize
>> 	stp	fp, lr, [sp]
>>
>> But again this acts as a sufficient implicit probe provided that
>> FrameSize does not exceed the probe interval.
> And again, the store acts as a probe which can eliminate potential
> probes that might occur later in the instruction stream.  But if the
> allocation by the "sub" instruction causes our running offset to cross
> PROBE_INTERVAL, then we must emit a probe prior to the "sub" instruction.
> 
> Hopefully it'll be clearer when I post the code :-)  aarch64 is one that
> will need updating as all work to-date has been with Red Hat's 4.8
> compiler with the aarch64 code generator bolted onto the side.
> 
> So perhaps "no implicit probes" was too strong.  It would probably be
> better stated "no implicit probes in the caller".  We certainly use
> stores in the prologue to try and eliminate probes.  In fact, we try
> harder on aarch64 than any other target.
> 
> Jeff
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-21  8:41     ` Richard Earnshaw (lists)
@ 2017-06-21 17:25       ` Jeff Law
  2017-06-22  9:53         ` Richard Earnshaw (lists)
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff Law @ 2017-06-21 17:25 UTC (permalink / raw)
  To: Richard Earnshaw (lists), gcc-patches

On 06/21/2017 02:41 AM, Richard Earnshaw (lists) wrote:

>> But the stack pointer might have already been advanced into the guard
>> page by the caller.   For the sake of argument assume the guard page is
>> 0xf1000 and assume that our stack pointer at entry is 0xf1010 and that
>> the caller hasn't touched the 0xf1000 page.
> 
> Then make sure the caller does touch the 0xf1000 page.  If it's
> allocated that much stack it should be forced to do the probe and not
> rely on all it's children having to do it because it can't be bothered.
That needs to be mandated at the ABI level if it's going to happen.  The
threat model assumes that the caller adheres to the ABI, but was not
necessarily compiled with -fstack-check.

I'm all for making the common path fast and letting the uncommon cases
pay additional penalties.  That mindset has driven the work-to-date.

But I don't think I have the liberty to change existing ABIs to
facilitate lower overhead approaches.  But I think ARM does given it
owns the ABI for aarch64 and I would happily exploit whatever guarantees
we can derive from an updated ABI.

So if you want the caller to touch the page, you need to amend the ABI.
I'd think touching the lowest address of the alloca area and outgoing
args, if large, would be sufficient.


jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 22:20     ` Eric Botcazou
@ 2017-06-21 17:31       ` Jeff Law
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff Law @ 2017-06-21 17:31 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: gcc-patches

On 06/20/2017 04:20 PM, Eric Botcazou wrote:
>> But what you end up depending on is undocumented behavior of a
>> particular kernel implementation.  That seems rather unwise.
> 
> And it's the single example of such a thing in the entire codebase?
> I don't know the code of the sanitizer much, but from the outside it looks 
> full of similar tricks...
I think the sanitizer runtime code is a pile of *(&@#$.  But I'm not
involved with that :-)


> 
>> Which ABIs have that property?  I'll be the first to admit that I've
>> purged much of my weird ABI memories.
> 
> The original Alpha ABI mentioned by Richard IIRC for example.
> 
>> Supporting ABIs which force us into a probe, then allocate strategy is
>> actually easy.  We can use the existing -fstack-check code, but use the
>> value 0 for STACK_CHECK_PROTECT.
>>
>> Just replace all uses of STACK_CHECK_PROTECT with calls to a wrapper.
>>
>> The wrapper looks like
>>
>> if (magic_flag)
>>   return STACK_CHECK_PROTECT;
>> else
>>   return 0;
>>
>> That's precisely what we were planning to do prior to bumping against
>> the valgrind issues.  That indirection makes it easy to ensure we didn't
>> change the behavior of the existing stack-check for Ada, but also allows
>> us to change the behavior for the new stack checking option.
> 
> Yes, that would seem the most straightforward thing to do modulo Valgrind.
And we could still do that for ports like the Alpha which mandate that
model or for ports that don't care about valgrind.


> 
>> Ah, so if you're running on an alternate stack, then why probe ahead of
>> need?  I thought the whole point of probing a couple pages ahead was to
>> ensure you could take the signal in Ada.
> 
> We run on the alternate stack only when we do _not_ probe ahead, i.e. on 
> x86/x86-64 Linux.
Ah.  Another piece of the puzzle makes sense.  Thanks.

Jeff

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 21:52   ` Jeff Law
  2017-06-20 22:20     ` Eric Botcazou
@ 2017-06-21 19:07     ` Florian Weimer
  1 sibling, 0 replies; 66+ messages in thread
From: Florian Weimer @ 2017-06-21 19:07 UTC (permalink / raw)
  To: Jeff Law, Eric Botcazou; +Cc: gcc-patches

On 06/20/2017 11:52 PM, Jeff Law wrote:
> I've also wondered if a 2 page guard would solve some of these problems.
>  In the event of stack overflow, the kernel maps in one of the two pages
> for use by the signal handler.   But changing things at this point may
> not be worth the effort.

I think Hotspot does that as well.  At the low level, Java programs can
recover from stack overflow (but it's still a VM error which taints the
entire process because these errors can strike at too many places, and
critical invariants could be violated).

Thanks,
Florian

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-21 17:25       ` Jeff Law
@ 2017-06-22  9:53         ` Richard Earnshaw (lists)
  2017-06-22 15:30           ` Jeff Law
  2017-06-28  6:45           ` Florian Weimer
  0 siblings, 2 replies; 66+ messages in thread
From: Richard Earnshaw (lists) @ 2017-06-22  9:53 UTC (permalink / raw)
  To: Jeff Law, gcc-patches

On 21/06/17 18:25, Jeff Law wrote:
> On 06/21/2017 02:41 AM, Richard Earnshaw (lists) wrote:
> 
>>> But the stack pointer might have already been advanced into the guard
>>> page by the caller.   For the sake of argument assume the guard page is
>>> 0xf1000 and assume that our stack pointer at entry is 0xf1010 and that
>>> the caller hasn't touched the 0xf1000 page.
>>
>> Then make sure the caller does touch the 0xf1000 page.  If it's
>> allocated that much stack it should be forced to do the probe and not
>> rely on all its children having to do it because it can't be bothered.
> That needs to be mandated at the ABI level if it's going to happen.  The
> threat model assumes that the caller adheres to the ABI, but was not
> necessarily compiled with -fstack-check.

The base ABI would never mandate stack probes.  Some systems may locate
the stack in such a way that it can never collide with the heap, making
guard pages and probes completely unnecessary (but perhaps at the
expense of limiting the theoretical maximum stack size).  It might say
"if you do stack probes, use this model", but even then it would need to
parameterize the whole model as there are just too many OS configuration
options to consider (page size, size of guard zone, for example).

> 
> I'm all for making the common path fast and letting the uncommon cases
> pay additional penalties.  That mindset has driven the work-to-date.
> 
> But I don't think I have the liberty to change existing ABIs to
> facilitate lower overhead approaches.  But I think ARM does given it
> owns the ABI for aarch64 and I would happily exploit whatever guarantees
> we can derive from an updated ABI.
> 
> So if you want the caller to touch the page, you need to amend the ABI.
> I'd think touching the lowest address of the alloca area and outgoing
> args, if large would be sufficient.
> 
> 

I can't help but feel there's a bit of a goode olde mediaeval witch hunt
going on here.  As Wilco points out, we can never defend against a
function that is built without probe operations but skips the entire
guard zone.  The only defence there is a larger guard zone, but how big
do you make it?

So we can design a half-way house probing scheme which doesn't really
solve the problem and is perhaps so expensive that most people will turn
it off.  Or we can design something that addresses the problem
scientifically if applied everywhere and has almost zero impact on the
code size or performance.  The latter would probably be so cheap that
most people would never notice that it was even on at all.  Yes, you'd
need a system recompile to deploy it in full, but even a fairly limited
rebuild of critical libraries (libc, libstdc++) would help.

Whichever route we take, this wouldn't be an ABI break.  New code will
still interoperate with old code; you just don't get full heap
protection if you mix old and new.

I know which I'd prefer.

R.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-22  9:53         ` Richard Earnshaw (lists)
@ 2017-06-22 15:30           ` Jeff Law
  2017-06-22 16:07             ` Szabolcs Nagy
  2017-06-28  6:45           ` Florian Weimer
  1 sibling, 1 reply; 66+ messages in thread
From: Jeff Law @ 2017-06-22 15:30 UTC (permalink / raw)
  To: Richard Earnshaw (lists), gcc-patches

On 06/22/2017 03:53 AM, Richard Earnshaw (lists) wrote:
> On 21/06/17 18:25, Jeff Law wrote:
>> On 06/21/2017 02:41 AM, Richard Earnshaw (lists) wrote:
>>
>>>> But the stack pointer might have already been advanced into the guard
>>>> page by the caller.   For the sake of argument assume the guard page is
>>>> 0xf1000 and assume that our stack pointer at entry is 0xf1010 and that
>>>> the caller hasn't touched the 0xf1000 page.
>>>
>>> Then make sure the caller does touch the 0xf1000 page.  If it's
>>> allocated that much stack it should be forced to do the probe and not
>>> rely on all its children having to do it because it can't be bothered.
>> That needs to be mandated at the ABI level if it's going to happen.  The
>> threat model assumes that the caller adheres to the ABI, but was not
>> necessarily compiled with -fstack-check.
> 
> The base ABI would never mandate stack probes.  Some systems may locate
> the stack in such a way that it can never collide with the heap, making
> guard pages and probes completely unnecessary (but perhaps at the
> expense of limiting the theoretical maximum stack size).  It might say
> "if you do stack probes, use this model", but even then it would need to
> parameterize the whole model as there are just too many OS configuration
> options to consider (page size, size of guard zone, for example).
I'm not suggesting mandating stack probes in the ABI, but that certain
changes to the ABI can be made which would in turn make stack probing
far more efficient.

For example the PPC ABI mandates that *sp hold the outer frame address.
That turns out to be amazingly useful from a probing standpoint -- we
always know that *sp was written.  That knowledge allows us to eliminate
98.5% of the prologue probing for glibc on PPC.

I'm not suggesting you do exactly the same thing, but I do think there
are things you could do at the ABI level which would drastically improve
the generated code when stack probing is enabled and which would have
minimal cost.



> 
>>
>> I'm all for making the common path fast and letting the uncommon cases
>> pay additional penalties.  That mindset has driven the work-to-date.
>>
>> But I don't think I have the liberty to change existing ABIs to
>> facilitate lower overhead approaches.  But I think ARM does given it
>> owns the ABI for aarch64 and I would happily exploit whatever guarantees
>> we can derive from an updated ABI.
>>
>> So if you want the caller to touch the page, you need to amend the ABI.
>> I'd think touching the lowest address of the alloca area and outgoing
>> args, if large would be sufficient.
>>
>>
> 
> I can't help but feel there's a bit of a goode olde mediaeval witch hunt
> going on here.  As Wilco points out, we can never defend against a
> function that is built without probe operations but skips the entire
> guard zone.  The only defence there is a larger guard zone, but how big
> do you make it?
No witch hunt at all.

It just happens to be the case that x86 hits *sp when it stores the
return pointer and that ppc always stores the backchain into *sp when it
allocates additional stack space.  As a result on those targets we know
the offset between the stack pointer and the most recent probe is zero
at the start of the callee's prologue.  That allows us to avoid the vast
majority of explicit probes.

aarch64's ABI and ISA don't provide us with any such guarantees and we
have to make very conservative assumptions which leads to much more
explicit probing.

s390 is in a similar situation to aarch64 in that it has to make worst
case assumptions at prologue entry in the callee.  I suspect many others
will be too (but I haven't investigated each architecture as I've been
focused strictly on RHEL targets).




> 
> So we can design a half-way house probing scheme which doesn't really
> solve the problem and is perhaps so expensive that most people will turn
> it off.  Or we can design something that addresses the problem
> scientifically if applied everywhere and has almost zero impact on the
> code size or performance.  The latter would probably be so cheap that
> most people would never notice that it was even on at all.  Yes, you'd
> need a system recompile to deploy it in full, but even a fairly limited
> rebuild of critical libraries (libc, libstdc++) would help.
> 
> Whichever route we take, this wouldn't be an ABI break.  New code will
> still interoperate with old code; you just don't get full heap
> protection if you mix old and new.
As the port maintainers, I think you have significant say in how this
plays out and you could (for example) say you simply aren't going to
worry about cases where there's more than 1k of outgoing argument space,
or where the caller has a large alloca and wasn't compiled with
-fstack-check, and just set the initial offset to 1k.

That would provide a great deal of coverage, but still leaves folks
using the aarch64 port vulnerable in certain corner cases.

Red Hat would have to look at that carefully and decide to either leave
customers vulnerable to the corner cases or pay the penalty of getting
full protection.  I honestly do not know where we'd land on that
question, but we'd certainly have to evaluate the pros and cons.

Or you could go with something like PPC64 and mandate something in the
ABI.  For example, you could mandate that the last word of dynamically
allocated stack space must be written to and that the last word of the
outgoing args space must be written to if the outgoing args space is >
1k.  Note this would remain compatible with existing code since you're
just forcing the caller to write into those areas at allocation time and
the onus is on the uncommon code (allocas and really big outgoing args)
to pay a tiny penalty and allow the common code (small frames in callee)
to run fast.

I suspect doing something like that would probably eliminate more than
90% of the explicit probes in glibc with minimal cost.  Let's face it, a
single write into alloca space is cheap compared to just the setup for
alloca, and a write into the outgoing args for >1k outgoing args isn't
likely to happen often enough to ever matter.

And something like that is 100% compatible with existing code.  You can
mix and match freely.  Obviously non-compliant code has a vulnerability,
but as compliant compilers get into the wild the amount of non-compliant
code would quickly drop.
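
In aarch64 terms the mandated writes might look something like this
(purely illustrative -- the register choice and the 1k threshold come
from the hypothetical ABI above, not from anything that exists today):

  sub  sp, sp, x10     // alloca of x10 bytes
  str  xzr, [sp]       // mandated write into the last (lowest)
                       // word of the dynamically allocated area

  sub  sp, sp, #2048   // more than 1k of outgoing argument space
  str  xzr, [sp]       // mandated write into the last word of
                       // the outgoing argument area

That's two cheap stores on the uncommon paths and nothing at all on
the common path.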




Jeff


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-22 15:30           ` Jeff Law
@ 2017-06-22 16:07             ` Szabolcs Nagy
  2017-06-22 16:15               ` Jeff Law
  0 siblings, 1 reply; 66+ messages in thread
From: Szabolcs Nagy @ 2017-06-22 16:07 UTC (permalink / raw)
  To: Jeff Law, Richard Earnshaw (lists), gcc-patches; +Cc: nd

On 22/06/17 16:30, Jeff Law wrote:
> It just happens to be the case that x86 hits *sp when it stores the
> return pointer and that ppc always stores the backchain into *sp when it
> allocates additional stack space.  As a result on those targets we know
> the offset between the stack pointer and the most recent probe is zero
> at the start of the callee's prologue.  That allows us to avoid the vast
> majority of explicit probes.
> 
> aarch64's ABI and ISA don't provide us with any such guarantees and we
> have to make very conservative assumptions which leads to much more
> explicit probing.

i think the aarch64 abi does not disallow access by the caller
to *sp, so i don't see the problem.

the caller can give various guarantees about how far back the
last stack access was relative to the sp at entry to a function,
but this does not have to be abi! the abi is not a hard guarantee
anyway, asm code can always do tricks that are not abi clean but
happen to work in practice, so in that sense relying on *sp
'was written' is an incorrect assumption even on ppc/x86. in fact it
would not work with the llvm 'safe stack' runtime as a probe: the
stack clash would happen on the un-safe stack in that case,
not where the return address is stored.


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-22 16:07             ` Szabolcs Nagy
@ 2017-06-22 16:15               ` Jeff Law
  0 siblings, 0 replies; 66+ messages in thread
From: Jeff Law @ 2017-06-22 16:15 UTC (permalink / raw)
  To: Szabolcs Nagy, Richard Earnshaw (lists), gcc-patches; +Cc: nd

On 06/22/2017 10:07 AM, Szabolcs Nagy wrote:
> On 22/06/17 16:30, Jeff Law wrote:
>> It just happens to be the case that x86 hits *sp when it stores the
>> return pointer and that ppc always stores the backchain into *sp when it
>> allocates additional stack space.  As a result on those targets we know
>> the offset between the stack pointer and the most recent probe is zero
>> at the start of the callee's prologue.  That allows us to avoid the vast
>> majority of explicit probes.
>>
>> aarch64's ABI and ISA don't provide us with any such guarantees and we
>> have to make very conservative assumptions which leads to much more
>> explicit probing.
> 
> i think the aarch64 abi does not disallow access by the caller
> to *sp, so i don't see the problem.
> 
> the caller can give various guarantees about how far back the
> last stack access was relative to the sp at entry to a function,
> but this does not have to be abi! the abi is not a hard guarantee
> anyway, asm code can always do tricks that are not abi clean but
> happen to work in practice, so in that sense relying on *sp
> 'was written' is an incorrect assumption even on ppc/x86. in fact it
> would not work with the llvm 'safe stack' runtime as a probe: the
> stack clash would happen on the un-safe stack in that case,
> not where the return address is stored.
Even assembly code needs to follow the ABI.  The ABI is the contract
that must be maintained at function call boundaries.

I'll have to look at llvm's safe stack again to see if/how that might
impact the assumptions we're making.

Jeff


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-22  9:53         ` Richard Earnshaw (lists)
  2017-06-22 15:30           ` Jeff Law
@ 2017-06-28  6:45           ` Florian Weimer
  2017-07-13 23:21             ` Jeff Law
  1 sibling, 1 reply; 66+ messages in thread
From: Florian Weimer @ 2017-06-28  6:45 UTC (permalink / raw)
  To: Richard Earnshaw (lists); +Cc: Jeff Law, gcc-patches

* Richard Earnshaw:

> I can't help but feel there's a bit of a goode olde mediaeval witch hunt
> going on here.  As Wilco points out, we can never defend against a
> function that is built without probe operations but skips the entire
> guard zone.  The only defence there is a larger guard zone, but how big
> do you make it?

Right.  And in the exploitable cases we have seen, there is a
dynamically sized allocation which the attacker can influence, so it
seems fairly likely that in a partially hardened binary, there could
be another call stack which is exploitable, with a non-hardened
function at the top.

I think a probing scheme which assumes that if the caller moves the
stack pointer into more than half of the guard area, that's the
caller's fault would be totally appropriate in practice.  If possible,
callee-only probing for its own stack usage is preferable, but not if
it means instrumenting all functions which use the stack.
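
For example, with a 64KB guard such a split would let the callee probe
only once per 32KB of its own allocation (a sketch; the 64KB guard
size is an assumption here, not part of any current ABI):

  sub  sp, sp, #32768   // the caller may be up to 32KB past its
  str  xzr, [sp]        // last probe, so probing every 32KB of
                        // new space still cannot skip a 64KB guard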

> Yes, you'd need a system recompile to deploy it in full, but even a
> fairly limited rebuild of critical libraries (libc, libstdc++) would
> help.

And we'd be mostly interested in protecting dynamically sized stack
allocations anyway, and perhaps the occasional BUFSIZ and PATH_MAX
arrays.


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-28  6:45           ` Florian Weimer
@ 2017-07-13 23:21             ` Jeff Law
  2017-07-18 19:54               ` Florian Weimer
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff Law @ 2017-07-13 23:21 UTC (permalink / raw)
  To: Florian Weimer, Richard Earnshaw (lists); +Cc: gcc-patches

On 06/28/2017 12:45 AM, Florian Weimer wrote:
> * Richard Earnshaw:
> 
>> I can't help but feel there's a bit of a goode olde mediaeval witch hunt
>> going on here.  As Wilco points out, we can never defend against a
>> function that is built without probe operations but skips the entire
>> guard zone.  The only defence there is a larger guard zone, but how big
>> do you make it?
> 
> Right.  And in the exploitable cases we have seen, there is a
> dynamically sized allocation which the attacker can influence, so it
> seems fairly likely that in a partially hardened binary, there could
> be another call stack which is exploitable, with a non-hardened
> function at the top.
> 
> I think a probing scheme which assumes that if the caller moves the
> stack pointer into more than half of the guard area, that's the
> caller's fault would be totally appropriate in practice.  If possible,
> callee-only probing for its own stack usage is preferable, but not if
> it means instrumenting all functions which use the stack.
That position is a surprise, Florian :-)  I would have expected a full
protection position, particularly after the discussions we've had about
noreturn functions.

I guess the difference in your position is driven by the relatively high
frequency of probing that worst-case assumptions are going to impose on
aarch64, combined with a relatively small vulnerability surface?   Which
is a fairly stark contrast to the noreturn situation, where it rarely, if
ever, comes up in practice and never on a hot path?

Jeff


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-07-13 23:21             ` Jeff Law
@ 2017-07-18 19:54               ` Florian Weimer
  0 siblings, 0 replies; 66+ messages in thread
From: Florian Weimer @ 2017-07-18 19:54 UTC (permalink / raw)
  To: Jeff Law; +Cc: Richard Earnshaw (lists), gcc-patches

* Jeff Law:

> On 06/28/2017 12:45 AM, Florian Weimer wrote:
>> * Richard Earnshaw:
>> 
>>> I can't help but feel there's a bit of a goode olde mediaeval witch hunt
>>> going on here.  As Wilco points out, we can never defend against a
>>> function that is built without probe operations but skips the entire
>>> guard zone.  The only defence there is a larger guard zone, but how big
>>> do you make it?
>> 
>> Right.  And in the exploitable cases we have seen, there is a
>> dynamically sized allocation which the attacker can influence, so it
>> seems fairly likely that in a partially hardened binary, there could
>> be another call stack which is exploitable, with a non-hardened
>> function at the top.
>> 
>> I think a probing scheme which assumes that if the caller moves the
>> stack pointer into more than half of the guard area, that's the
>> caller's fault would be totally appropriate in practice.  If possible,
>> callee-only probing for its own stack usage is preferable, but not if
>> it means instrumenting all functions which use the stack.

> That position is a surprise, Florian :-)  I would have expected a full
> protection position, particularly after the discussions we've had about
> noreturn functions.

I might have gotten carried away on that one.

I really want stack probing to be enabled by default across the board,
so this becomes a non-issue because the caller has been compiled with
probing as well.  However, in order to get there, we need extremely
cheap instrumentation, and if we cover every hypothetical corner case,
this might force us to instrument all functions, and that again
defeats the goal of enabling it by default.

Does that make sense?

> I guess the difference in your position is driven by the relatively high
> frequency of probing that worst-case assumptions are going to impose on
> aarch64, combined with a relatively small vulnerability surface?

Right, and I expect that the limited form of probing can be enabled by
default, so that eventually, the caller will take care of its share of
probing (i.e., it has never moved the stack pointer more than half
into the guard page, or whatever caller/callee split of
responsibilities we come up with).

> Which is a fairly stark contrast to the noreturn situation, where it
> rarely, if ever, comes up in practice and never on a hot path?

I've since researched the noreturn situation a bit more.  We never
turn noreturn functions into tail calls because the intent is to
preserve the call stack, in the expectation that either the noreturn
function itself performs a backtrace, or that someone later looks at
the coredump.  So the noreturn risk just doesn't seem to be there.


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-22 16:10     ` Jeff Law
@ 2017-06-22 22:57       ` Wilco Dijkstra
  0 siblings, 0 replies; 66+ messages in thread
From: Wilco Dijkstra @ 2017-06-22 22:57 UTC (permalink / raw)
  To: Jeff Law, Richard Earnshaw, GCC Patches; +Cc: nd

Jeff Law wrote:
> You can be in one of 3 states when you start the callee's prologue.
>
> 1. You're somewhere in the normal stack.
>
> 2. You've passed the guard and are already in the heap or elsewhere
>
> 3. You're somewhere in the guard
>
> State #3 is what we're trying to address.  The attacker has advanced the
> stack pointer into the guard, but has not fully breached the guard.  We
> need to ensure that we hit the guard to prevent the breach.

State 3 is only interesting if all code has been built with probing enabled. In that
case we'd define how far the caller may have gone into the guard band and the
callee will probe exactly when needed, so it's guaranteed to prevent a breach.

If the caller hasn't enabled probing and does alloca, you're simply dead already.
All an attacker needs to do is to increase the alloca size to complete the breach.

If a full breach isn't feasible in a caller (ie. no alloca) then it can only go a tiny
amount into the guard band since the outgoing argument size is tiny. If the callee
has probing enabled and uses a reasonable maximum outgoing argument size
(say more than the maximum across a whole distro), the chances you could breach
the stack guard are infinitesimal.

So I still don't understand what exactly is so much worse on AArch64.

Wilco


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-21 17:47   ` Wilco Dijkstra
@ 2017-06-22 16:10     ` Jeff Law
  2017-06-22 22:57       ` Wilco Dijkstra
  0 siblings, 1 reply; 66+ messages in thread
From: Jeff Law @ 2017-06-22 16:10 UTC (permalink / raw)
  To: Wilco Dijkstra, Richard Earnshaw, GCC Patches; +Cc: nd

On 06/21/2017 11:47 AM, Wilco Dijkstra wrote:
> Jeff Law wrote:
> 
>> I'm a little confused.  I'm not defining or changing the ABI.  I'm
>> working within my understanding of the existing aarch64 ABI used on
>> linux systems.  My understanding after reading that ABI and the prologue
>> code for aarch64 is there's nothing that can currently be relied upon in
>> terms of the offset from the incoming stack pointer to the most recent
>> "probe" in the caller.
> 
> Well what we need is a properly defined ABI on when to emit probes.  That
> doesn't exist yet indeed, so there is nothing you can rely on today.  But that
> is true for any architecture. In particular if the caller hasn't been compiled with
> probes, even if the call instruction does an implicit probe, you still have to 
> assume that the stack guard has been breached already (ie. doing probes in
> the callee is useless if they aren't done in the whole call chain).
You can be in one of 3 states when you start the callee's prologue.

1. You're somewhere in the normal stack.

2. You've passed the guard and are already in the heap or elsewhere

3. You're somewhere in the guard


State #1 is trivially handled.  You emit probes every PROBE_INTERVAL
bytes in newly allocated space and you're OK.

State #2 we can't do anything about.

State #3 is what we're trying to address.  The attacker has advanced the
stack pointer into the guard, but has not fully breached the guard.  We
need to ensure that we hit the guard to prevent the breach.

On x86 and ppc state #3 does not occur due to the ISA and ABI.  We rely
on not being in state #3 to avoid a great number of the explicit probes.

But on aarch64, s390 and likely many other architectures, state #3 is a
real possibility and we need to account for it.

As it stands right now we can be in state #3 and the stack pointer could
be STACK_BOUNDARY / BITS_PER_UNIT bytes away from the end of the guard.
If we were to emit a

stp lr, fp, [sp, -32]!

That would complete the jump over the guard: it allocates 32 bytes of
space, but the 16 bytes it writes land beyond the guard, so we never
hit the guard page.


If (for example) the ABI were to mandate touching the last outgoing arg
slot when that area is > 1k, and touching the last slot of the dynamic
space in the caller, then we would know that the stack pointer must be
at least 3kbytes away from the end of the guard.  And thus

stp lr, fp, [sp, -32]!

Would hit the guard.  In fact, we could allocate up to 3kbytes of space
without a probe.
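
To make that concrete, under such a hypothetical mandate the callee
could emit something like this (sizes are illustrative; the 3kbyte
window is a 4k guard minus the 1k outgoing-argument cap):

  // frame <= 3k: the normal frame store doubles as the probe
  sub  sp, sp, #2048
  stp  x29, x30, [sp]

  // frame > 3k: probe before the 3k window is exhausted
  sub  sp, sp, #3072
  str  xzr, [sp]
  sub  sp, sp, #1024
  stp  x29, x30, [sp]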

> Remember Richard's reply was to this:
>>> aarch64 is significantly worse.  There are no implicit probes we can
>>> exploit.  Furthermore, the prologue may allocate stack space 3-4 times.
>>> So we have to track the distance to the most recent probe and when that
>>> distance grows too large, we have to emit a probe.  Of course we have to
>>> make worst case assumptions at function entry.
> 
> As pointed out, AArch64 is not significantly worse since the majority of frames
> do not need any probes (let alone multiple probes as suggested), neither do we
> need to make worst-case assumptions on entry (stack guard has already been
> breached).
See above.  It is significantly worse on aarch64/s390 because we need to
account for case #3 above.  Whereas x86 and ppc do not because of how
their ISA and ABIs work.

> 
>> Just limiting the size of the outgoing arguments is not sufficient
>> though.  You still have the dynamic allocation area in the caller.  The
>> threat model assumes the caller follows the ABI, but does not have to
>> have been compiled with -fstack-check.
> 
> The only mitigation for that is to increase the stack guard. A callee cannot 
> somehow undo crossing the stack guard by an unchecked caller (that includes
> the case where calls do an implicit probe).
I think you're looking at case #2 above and indeed we can't do anything
about that.  If they've fully breached the guard, then there's nothing
we can do.  But if they have only partially breached the guard then we
can and should stop them (case #3).

jeff


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-21 17:05 ` Jeff Law
@ 2017-06-21 17:47   ` Wilco Dijkstra
  2017-06-22 16:10     ` Jeff Law
  0 siblings, 1 reply; 66+ messages in thread
From: Wilco Dijkstra @ 2017-06-21 17:47 UTC (permalink / raw)
  To: Jeff Law, Richard Earnshaw, GCC Patches; +Cc: nd

Jeff Law wrote:

> I'm a little confused.  I'm not defining or changing the ABI.  I'm
> working within my understanding of the existing aarch64 ABI used on
> linux systems.  My understanding after reading that ABI and the prologue
> code for aarch64 is there's nothing that can currently be relied upon in
> terms of the offset from the incoming stack pointer to the most recent
> "probe" in the caller.

Well what we need is a properly defined ABI on when to emit probes.  That
doesn't exist yet indeed, so there is nothing you can rely on today.  But that
is true for any architecture. In particular if the caller hasn't been compiled with
probes, even if the call instruction does an implicit probe, you still have to 
assume that the stack guard has been breached already (ie. doing probes in
the callee is useless if they aren't done in the whole call chain).

Remember Richard's reply was to this:
>> aarch64 is significantly worse.  There are no implicit probes we can
>> exploit.  Furthermore, the prologue may allocate stack space 3-4 times.
>> So we have to track the distance to the most recent probe and when that
>> distance grows too large, we have to emit a probe.  Of course we have to
>> make worst case assumptions at function entry.

As pointed out, AArch64 is not significantly worse since the majority of frames
do not need any probes (let alone multiple probes as suggested), neither do we
need to make worst-case assumptions on entry (stack guard has already been
breached).

> Just limiting the size of the outgoing arguments is not sufficient
> though.  You still have the dynamic allocation area in the caller.  The
> threat model assumes the caller follows the ABI, but does not have to
> have been compiled with -fstack-check.

The only mitigation for that is to increase the stack guard. A callee cannot 
somehow undo crossing the stack guard by an unchecked caller (that includes
the case where calls do an implicit probe).

> Thus you'd have to have an ABI mandated probe at the lowest address of
> the dynamic allocation area and limit the size of the outgoing stack
> arguments for your suggestion to be useful.

Yes, every alloca will need at least one probe, larger ones need a loop or call
to do the probes.
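
For a large or variable-sized alloca the probe loop might look
something like this (a sketch only; it assumes a 4k guard page and
that x10 holds the requested size rounded up to a page multiple):

.Lprobe:
  sub   sp, sp, #4096   // allocate one page at a time
  str   xzr, [sp]       // and probe it immediately
  subs  x10, x10, #4096
  b.hi  .Lprobe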

> If ARM wants to update the ABI, that's fine with me.  But until that
> happens and compilers which implement that ABI are ubiquitous ISTM we
> can't actually depend on those guarantees.

Indeed, given current binaries don't do probes for alloca or large frames, the only
possible mitigation for those is to increase the stack guard.

Wilco
    


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 23:22 Wilco Dijkstra
  2017-06-21  8:34 ` Richard Earnshaw (lists)
@ 2017-06-21 17:05 ` Jeff Law
  2017-06-21 17:47   ` Wilco Dijkstra
  1 sibling, 1 reply; 66+ messages in thread
From: Jeff Law @ 2017-06-21 17:05 UTC (permalink / raw)
  To: Wilco Dijkstra, Richard Earnshaw, GCC Patches; +Cc: nd

On 06/20/2017 05:22 PM, Wilco Dijkstra wrote:
> Jeff Law wrote:
>> But the stack pointer might have already been advanced into the guard
>> page by the caller.   For the sake of argument assume the guard page is
>> 0xf1000 and assume that our stack pointer at entry is 0xf1010 and that
>> the caller hasn't touched the 0xf1000 page.
>>
>> If FrameSize >= 32, then the stores are going to hit the 0xf0000 page
>> rather than the 0xf1000 page.   That's jumping the guard.  Thus we have
>> to emit a probe prior to this stack allocation.
> 
> That's an incorrect ABI that allows adjusting the frame by 4080+32! A correct
> one might allow say 1024 bytes for outgoing arguments. That means when
> you call a function, there is still guard-page-size - 1024 bytes left that you can
> use to allocate locals. With a 4K guard page that allows leaf functions up to 3KB, 
> and depending on the frame locals of 2-3KB plus up to 1024 bytes of outgoing
> arguments without inserting any probes beyond the normal frame stores. 
> 
> This design means almost no functions need additional probes. Assuming we're
> also increasing the guard page size to 64KB, it's cheap even for large functions.
I'm a little confused.  I'm not defining or changing the ABI.  I'm
working within my understanding of the existing aarch64 ABI used on
linux systems.  My understanding after reading that ABI and the prologue
code for aarch64 is there's nothing that can currently be relied upon in
terms of the offset from the incoming stack pointer to the most recent
"probe" in the caller.

Just limiting the size of the outgoing arguments is not sufficient
though.  You still have the dynamic allocation area in the caller.  The
threat model assumes the caller follows the ABI, but does not have to
have been compiled with -fstack-check.

Thus you'd have to have an ABI mandated probe at the lowest address of
the dynamic allocation area and limit the size of the outgoing stack
arguments for your suggestion to be useful.

If ARM wants to update the ABI, that's fine with me.  But until that
happens and compilers which implement that ABI are ubiquitous ISTM we
can't actually depend on those guarantees.

If I'm wrong and there is some guarantee we can rely on, let me know.
It's trivial to change the initial state to utilize such guarantees.

jeff


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-21  8:34 ` Richard Earnshaw (lists)
  2017-06-21  8:44   ` Andreas Schwab
@ 2017-06-21  9:03   ` Wilco Dijkstra
  1 sibling, 0 replies; 66+ messages in thread
From: Wilco Dijkstra @ 2017-06-21  9:03 UTC (permalink / raw)
  To: Richard Earnshaw, Jeff Law, GCC Patches; +Cc: nd

Richard Earnshaw wrote:
> A mere 256 bytes for the caller would permit 32 x 8byte arguments on the
> stack which, with at least 8 parameters passed in registers, would allow
> for calls with 40 parameters.  There can't be many in that space.  Any
> function making calls with more than that might need additional probes,
> but that's going to be exceedingly rare.
> 
> Put the cost on the least common sequences, even if they pay
> disproportionately - it will be a win over all.

Functions with large outgoing arguments are extremely rare indeed; it tails off
really fast after 64 bytes. The only large cases I've seen are from Fortran code -
and those cases seem buggy (40KBytes of outgoing args means 5000 double
args, which is unlikely).

Wilco


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-21  8:44   ` Andreas Schwab
@ 2017-06-21  8:46     ` Richard Earnshaw (lists)
  2017-06-21  8:46       ` Richard Earnshaw (lists)
  0 siblings, 1 reply; 66+ messages in thread
From: Richard Earnshaw (lists) @ 2017-06-21  8:46 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Wilco Dijkstra, Jeff Law, GCC Patches, nd

On 21/06/17 09:44, Andreas Schwab wrote:
> On Jun 21 2017, "Richard Earnshaw (lists)" <Richard.Earnshaw@arm.com> wrote:
> 
>> A mere 256 bytes for the caller would permit 32 x 8byte arguments on the
>> stack which, with at least 8 parameters passed in registers, would allow
>> for calls with 40 parameters.  There can't be many in that space.  Any
>> function making calls with more than that might need additional probes,
>> but that's going to be exceedingly rare.
> 
> With passing structures by value you can have arbitrary large
> parameters.
> 
> Andreas.
> 


No.  Those are passed by copies which appear in the locals portion of
the frame (so are covered by normal allocation principles).  Only
structures of less than 16 bytes are passed by direct copy.

R.


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-21  8:46     ` Richard Earnshaw (lists)
@ 2017-06-21  8:46       ` Richard Earnshaw (lists)
  0 siblings, 0 replies; 66+ messages in thread
From: Richard Earnshaw (lists) @ 2017-06-21  8:46 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Wilco Dijkstra, Jeff Law, GCC Patches, nd

On 21/06/17 09:46, Richard Earnshaw (lists) wrote:
> On 21/06/17 09:44, Andreas Schwab wrote:
>> On Jun 21 2017, "Richard Earnshaw (lists)" <Richard.Earnshaw@arm.com> wrote:
>>
>>> A mere 256 bytes for the caller would permit 32 x 8byte arguments on the
>>> stack which, with at least 8 parameters passed in registers, would allow
>>> for calls with 40 parameters.  There can't be many in that space.  Any
>>> function making calls with more than that might need additional probes,
>>> but that's going to be exceedingly rare.
>>
>> With passing structures by value you can have arbitrary large
>> parameters.
>>
>> Andreas.
>>
> 
> 
> No.  Those are passed by copies which appear in the locals portion of
* pointers to copies *
R.

> the frame (so are covered by normal allocation principles).  Only
> structures of less than 16 bytes are passed by direct copy.
> 
> R.
> 


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-21  8:34 ` Richard Earnshaw (lists)
@ 2017-06-21  8:44   ` Andreas Schwab
  2017-06-21  8:46     ` Richard Earnshaw (lists)
  2017-06-21  9:03   ` Wilco Dijkstra
  1 sibling, 1 reply; 66+ messages in thread
From: Andreas Schwab @ 2017-06-21  8:44 UTC (permalink / raw)
  To: Richard Earnshaw (lists); +Cc: Wilco Dijkstra, Jeff Law, GCC Patches, nd

On Jun 21 2017, "Richard Earnshaw (lists)" <Richard.Earnshaw@arm.com> wrote:

> A mere 256 bytes for the caller would permit 32 x 8byte arguments on the
> stack which, with at least 8 parameters passed in registers, would allow
> for calls with 40 parameters.  There can't be many in that space.  Any
> function making calls with more than that might need additional probes,
> but that's going to be exceedingly rare.

With passing structures by value you can have arbitrary large
parameters.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
  2017-06-20 23:22 Wilco Dijkstra
@ 2017-06-21  8:34 ` Richard Earnshaw (lists)
  2017-06-21  8:44   ` Andreas Schwab
  2017-06-21  9:03   ` Wilco Dijkstra
  2017-06-21 17:05 ` Jeff Law
  1 sibling, 2 replies; 66+ messages in thread
From: Richard Earnshaw (lists) @ 2017-06-21  8:34 UTC (permalink / raw)
  To: Wilco Dijkstra, Jeff Law, GCC Patches; +Cc: nd

On 21/06/17 00:22, Wilco Dijkstra wrote:
> Jeff Law wrote:
>> But the stack pointer might have already been advanced into the guard
>> page by the caller.   For the sake of argument assume the guard page is
>> 0xf1000 and assume that our stack pointer at entry is 0xf1010 and that
>> the caller hasn't touched the 0xf1000 page.
>>
>> If FrameSize >= 32, then the stores are going to hit the 0xf0000 page
>> rather than the 0xf1000 page.   That's jumping the guard.  Thus we have
>> to emit a probe prior to this stack allocation.
> 
> That's an incorrect ABI that allows adjusting the frame by 4080+32! A
> correct one might allow say 1024 bytes for outgoing arguments. That
> means when you call a function, there is still guard-page-size - 1024
> bytes left that you can use to allocate locals. With a 4K guard page
> that allows leaf functions up to 3KB, and depending on the frame
> locals of 2-3KB plus up to 1024 bytes of outgoing arguments without
> inserting any probes beyond the normal frame stores.
> 
> This design means almost no functions need additional probes. Assuming
> we're also increasing the guard page size to 64KB, it's cheap even for
> large functions.
> 
> Wilco

A mere 256 bytes for the caller would permit 32 x 8byte arguments on the
stack which, with at least 8 parameters passed in registers, would allow
for calls with 40 parameters.  There can't be many in that space.  Any
function making calls with more than that might need additional probes,
but that's going to be exceedingly rare.

Put the cost on the least common sequences, even if they pay
disproportionately - it will be a win over all.

R.


* Re: RFC: stack/heap collision vulnerability and mitigation with GCC
@ 2017-06-20 23:22 Wilco Dijkstra
  2017-06-21  8:34 ` Richard Earnshaw (lists)
  2017-06-21 17:05 ` Jeff Law
  0 siblings, 2 replies; 66+ messages in thread
From: Wilco Dijkstra @ 2017-06-20 23:22 UTC (permalink / raw)
  To: Jeff Law, Richard Earnshaw, GCC Patches; +Cc: nd

Jeff Law wrote:
> But the stack pointer might have already been advanced into the guard
> page by the caller.   For the sake of argument assume the guard page is
> 0xf1000 and assume that our stack pointer at entry is 0xf1010 and that
> the caller hasn't touched the 0xf1000 page.
>
> If FrameSize >= 32, then the stores are going to hit the 0xf0000 page
> rather than the 0xf1000 page.   That's jumping the guard.  Thus we have
> to emit a probe prior to this stack allocation.

That's an incorrect ABI that allows adjusting the frame by 4080+32! A correct
one might allow say 1024 bytes for outgoing arguments. That means when
you call a function, there is still guard-page-size - 1024 bytes left that you can
use to allocate locals. With a 4K guard page that allows leaf functions up to 3KB, 
and depending on the frame locals of 2-3KB plus up to 1024 bytes of outgoing
arguments without inserting any probes beyond the normal frame stores. 

This design means almost no functions need additional probes. Assuming we're
also increasing the guard page size to 64KB, it's cheap even for large functions.
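
As a sketch (using the 4K guard and the 1024-byte outgoing-argument
cap assumed above), a maximal leaf frame then needs nothing beyond
its normal stores:

  sub  sp, sp, #3072    // guard-page-size - outgoing-argument cap
  stp  x29, x30, [sp]   // the store at the new sp is the probe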

Wilco


end of thread

Thread overview: 66+ messages
2017-06-19 17:07 RFC: stack/heap collision vulnerability and mitigation with GCC Jeff Law
2017-06-19 17:29 ` Jakub Jelinek
2017-06-19 17:45   ` Jeff Law
2017-06-19 17:51     ` Jakub Jelinek
2017-06-19 21:51       ` Jeff Law
2017-06-20  8:03       ` Uros Bizjak
2017-06-20 10:18         ` Richard Biener
2017-06-20 11:10           ` Uros Bizjak
2017-06-20 12:13             ` Florian Weimer
2017-06-20 12:17               ` Uros Bizjak
2017-06-20 12:20                 ` Uros Bizjak
2017-06-20 12:27                   ` Richard Biener
2017-06-20 21:57                     ` Jeff Law
2017-06-20 15:59                 ` Jeff Law
2017-06-19 18:00   ` Richard Biener
2017-06-19 18:02     ` Richard Biener
2017-06-19 18:15       ` Florian Weimer
2017-06-19 21:57         ` Jeff Law
2017-06-19 22:08       ` Jeff Law
2017-06-20  7:50   ` Eric Botcazou
2017-06-19 17:51 ` Joseph Myers
2017-06-19 17:55   ` Jakub Jelinek
2017-06-19 18:21   ` Florian Weimer
2017-06-19 21:56     ` Joseph Myers
2017-06-19 22:05       ` Jeff Law
2017-06-19 22:10         ` Florian Weimer
2017-06-19 19:05   ` Jeff Law
2017-06-19 19:45     ` Jakub Jelinek
2017-06-19 21:41       ` Jeff Law
2017-06-20  8:27     ` Richard Earnshaw (lists)
2017-06-20 15:50       ` Jeff Law
2017-06-19 18:12 ` Richard Kenner
2017-06-19 22:05   ` Jeff Law
2017-06-19 22:07     ` Richard Kenner
2017-06-20  8:21   ` Eric Botcazou
2017-06-20 15:50     ` Jeff Law
2017-06-20 19:48     ` Jakub Jelinek
2017-06-20 20:37       ` Eric Botcazou
2017-06-20 20:46         ` Jeff Law
2017-06-20  8:17 ` Eric Botcazou
2017-06-20 21:52   ` Jeff Law
2017-06-20 22:20     ` Eric Botcazou
2017-06-21 17:31       ` Jeff Law
2017-06-21 19:07     ` Florian Weimer
2017-06-21  7:56   ` Andreas Schwab
2017-06-20  9:27 ` Richard Earnshaw (lists)
2017-06-20 21:39   ` Jeff Law
2017-06-21  8:41     ` Richard Earnshaw (lists)
2017-06-21 17:25       ` Jeff Law
2017-06-22  9:53         ` Richard Earnshaw (lists)
2017-06-22 15:30           ` Jeff Law
2017-06-22 16:07             ` Szabolcs Nagy
2017-06-22 16:15               ` Jeff Law
2017-06-28  6:45           ` Florian Weimer
2017-07-13 23:21             ` Jeff Law
2017-07-18 19:54               ` Florian Weimer
2017-06-20 23:22 Wilco Dijkstra
2017-06-21  8:34 ` Richard Earnshaw (lists)
2017-06-21  8:44   ` Andreas Schwab
2017-06-21  8:46     ` Richard Earnshaw (lists)
2017-06-21  8:46       ` Richard Earnshaw (lists)
2017-06-21  9:03   ` Wilco Dijkstra
2017-06-21 17:05 ` Jeff Law
2017-06-21 17:47   ` Wilco Dijkstra
2017-06-22 16:10     ` Jeff Law
2017-06-22 22:57       ` Wilco Dijkstra
