public inbox for gcc-patches@gcc.gnu.org
* [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
@ 2017-07-11 21:21 Jeff Law
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff Law @ 2017-07-11 21:21 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2504 bytes --]


This patch introduces aarch64 -fstack-check=clash prologue support.

aarch64 is the first target that does not have any implicit probes in
the caller.  Thus at prologue entry it must make conservative
assumptions about the offset of the most recent probed address relative
to the stack pointer.

However, it can and does use stores into the stack generated by the
prologue to avoid some explicit probes.  It's not as efficient as x86 or
ppc, but it does try hard to be efficient.


We track the offset of the most recent probe relative to the stack
pointer.  The initial state is PROBE_INTERVAL - (STACK_BOUNDARY /
BITS_PER_UNIT) bytes away.  Allocations increase that value.  Stores
decrease the value.

When the value crosses PROBE_INTERVAL we emit an explicit probe.

aarch64 prologues can allocate space in 3 different places.  There's
INITIAL_ADJUST and FINAL_ADJUST which are simple adjustments and they're
trivial to handle.  We just add the adjustment to the current offset and
if we cross PROBE_INTERVAL, we emit a probe before the stack adjustment.
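
Roughly, the bookkeeping amounts to something like the sketch below.  This
is illustrative only, not the patch: the function names are made up, and
the real code additionally chunks allocations larger than PROBE_INTERVAL
and adjusts the tracked offset after each probe it emits.

  /* Sketch: decide whether allocating SIZE more bytes needs an explicit
     probe, given the current distance from SP to the last probed or
     stored address.  */
  static bool
  allocation_needs_probe (HOST_WIDE_INT size, HOST_WIDE_INT last_probe_offset)
  {
    /* Moving SP down by SIZE widens the gap to the most recent probe;
       once that gap reaches PROBE_INTERVAL we must probe explicitly.  */
    return last_probe_offset + size >= PROBE_INTERVAL;
  }

  /* Sketch: a store at SP + SP_OFFSET acts as an implicit probe.  */
  static void
  note_store (HOST_WIDE_INT sp_offset, HOST_WIDE_INT *last_probe_offset)
  {
    if (sp_offset < *last_probe_offset)
      *last_probe_offset = sp_offset;
  }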

Handling of CALLEE_ADJUST is more subtle because it's two actions in one
instruction: the instruction both allocates space and stores into the two
words at the lowest address of the allocated space.

It's easiest to mentally model the instruction as two distinct operations.

First the instruction adjusts the stack by CALLEE_ADJUST - 2 * WORD_SIZE
and thus increases the offset to the most recent probe.  If the offset
to the most recent probe crosses PROBE_INTERVAL, then we must emit a
probe prior to the CALLEE_ADJUST instruction.  Note that CALLEE_ADJUST
might be 0 or any multiple of 16.  For 0/16 this step is a nop.

Second the CALLEE_ADJUST instruction adjusts the stack by 2 * WORD_SIZE
and stores into both just-allocated words.  We never need a probe for
this action.  Furthermore, this action always zeros the offset to the
most recent probe which is very helpful in avoiding subsequent explicit
probes.
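
As a concrete example (illustrative only, assuming the write-back
candidates are x29/x30), a CALLEE_ADJUST of 96 is a single pre-indexed
store-pair:

	stp	x29, x30, [sp, #-96]!

Modeled as above: the first step drops SP by 96 - 16 = 80 bytes with no
store, increasing the offset to the most recent probe by 80; the second
step drops SP by the final 16 bytes and writes both words at the new SP,
which is why the offset to the most recent probe becomes zero afterwards.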


Between CALLEE_ADJUST and FINAL_ADJUST are other register saves.  We
track the offsets for those and decrease the offset to the most recent
probe appropriately.

I'm pretty sure this implementation is about as good as it can get with
the ISA/ABI in place on aarch64.

I'm pretty sure that the loop to allocate large stacks does not update
CFIs correctly.

Again, you'll note that like ppc, this does not add any new tests.  It
uses the tests that were added with the x86 support unmodified.

Thoughts/Comments?  OK for the trunk?

[-- Attachment #2: P7 --]
[-- Type: text/plain, Size: 10422 bytes --]


	* config/aarch64/aarch64.c (aarch64_output_probe_stack_range): Handle
	-fstack-check=clash probing too.
	(aarch64_save_callee_saves): Return smallest offset from SP
	that was written.
	(aarch64_allocate_and_probe_stack_space): New function.
	(aarch64_expand_prologue): Track distance between SP and most
	recent probe.  Use aarch64_allocate_and_probe_stack_space
	when -fstack-check=clash rather than just adjusting sp.
	Dump actions via dump_stack_clash_frame_info.
	Handle probing and probe offset update for CALLEE_ADJUST space.
	Use return value from aarch64_save_callee_saves to reduce
	last_probe_offset.
commit 86871749daad396cf825b3982dee8e80eda96d8a
Author: root <root@gigabyte-r120-05.khw.lab.eng.bos.redhat.com>
Date:   Fri Jul 7 11:17:06 2017 -0400

    aarch64 support with dumping, adjust tests

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 0a8b40a..40bc183 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -2830,6 +2830,9 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
   char loop_lab[32];
   rtx xops[2];
 
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK)
+    reg1 = stack_pointer_rtx;
+
   ASM_GENERATE_INTERNAL_LABEL (loop_lab, "LPSRL", labelno++);
 
   /* Loop.  */
@@ -2841,7 +2844,14 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
   output_asm_insn ("sub\t%0, %0, %1", xops);
 
   /* Probe at TEST_ADDR.  */
-  output_asm_insn ("str\txzr, [%0]", xops);
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK)
+    {
+      gcc_assert (xops[0] == stack_pointer_rtx);
+      xops[1] = GEN_INT (PROBE_INTERVAL - 8);
+      output_asm_insn ("str\txzr, [%0, %1]", xops);
+    }
+  else
+    output_asm_insn ("str\txzr, [%0]", xops);
 
   /* Test if TEST_ADDR == LAST_ADDR.  */
   xops[1] = reg2;
@@ -3245,9 +3255,11 @@ aarch64_return_address_signing_enabled (void)
 
 /* Emit code to save the callee-saved registers from register number START
    to LIMIT to the stack at the location starting at offset START_OFFSET,
-   skipping any write-back candidates if SKIP_WB is true.  */
+   skipping any write-back candidates if SKIP_WB is true.
 
-static void
+   Return the smallest offset from SP that is written.  */
+
+static HOST_WIDE_INT
 aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 			   unsigned start, unsigned limit, bool skip_wb)
 {
@@ -3257,6 +3269,8 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
   unsigned regno;
   unsigned regno2;
 
+  HOST_WIDE_INT lowest_offset = PROBE_INTERVAL;
+
   for (regno = aarch64_next_callee_save (start, limit);
        regno <= limit;
        regno = aarch64_next_callee_save (regno + 1, limit))
@@ -3274,6 +3288,8 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 
       reg = gen_rtx_REG (mode, regno);
       offset = start_offset + cfun->machine->frame.reg_offset[regno];
+      if (offset < lowest_offset)
+	lowest_offset = offset;
       mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
 					      offset));
 
@@ -3289,6 +3305,8 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 	  rtx mem2;
 
 	  offset = start_offset + cfun->machine->frame.reg_offset[regno2];
+	  if (offset < lowest_offset)
+	    lowest_offset = offset;
 	  mem2 = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
 						   offset));
 	  insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2,
@@ -3306,6 +3324,7 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 
       RTX_FRAME_RELATED_P (insn) = 1;
     }
+  return lowest_offset;
 }
 
 /* Emit code to restore the callee registers of mode MODE from register
@@ -3605,6 +3624,71 @@ aarch64_set_handled_components (sbitmap components)
       cfun->machine->reg_is_wrapped_separately[regno] = true;
 }
 
+/* Allocate SIZE bytes of stack space using SCRATCH_REG as a scratch
+   register.
+
+   LAST_PROBE_OFFSET contains the offset between the stack pointer and
+   the last known probe.  As LAST_PROBE_OFFSET crosses PROBE_INTERVAL
+   emit a probe and adjust LAST_PROBE_OFFSET.  */
+static void
+aarch64_allocate_and_probe_stack_space (int scratchreg, HOST_WIDE_INT size,
+					HOST_WIDE_INT *last_probe_offset)
+{
+  rtx temp = gen_rtx_REG (word_mode, scratchreg);
+
+  HOST_WIDE_INT rounded_size = size & -PROBE_INTERVAL;
+
+  /* We can handle a small number of allocations/probes inline.  Otherwise
+     punt to a loop.  */
+  if (rounded_size && rounded_size <= 4 * PROBE_INTERVAL)
+    {
+      for (HOST_WIDE_INT i = 0; i < rounded_size; i += PROBE_INTERVAL)
+	{
+	  /* We should never need a scratch register for this adjustment.  */
+	  aarch64_sub_sp (-1, PROBE_INTERVAL, true);
+
+	  /* We just allocated PROBE_INTERVAL bytes.  Thus, a probe is
+	     mandatory.  Note that LAST_PROBE_OFFSET does not change here.  */
+	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					   (PROBE_INTERVAL
+					    - GET_MODE_SIZE (word_mode))));
+	}
+      dump_stack_clash_frame_info (PROBE_INLINE, size != rounded_size);
+    }
+  else if (rounded_size)
+    {
+      /* Compute the ending address.  */
+      emit_move_insn (temp, GEN_INT (-rounded_size));
+      emit_insn (gen_add3_insn (temp, stack_pointer_rtx, temp));
+
+      /* This allocates and probes the stack.  Like the inline version above
+	 it does not need to change LAST_PROBE_OFFSET.
+
+	 It almost certainly does not update CFIs correctly.  */
+      emit_insn (gen_probe_stack_range (temp, temp, temp));
+      dump_stack_clash_frame_info (PROBE_LOOP, size != rounded_size);
+    }
+  else
+    dump_stack_clash_frame_info (PROBE_INLINE, size != rounded_size);
+    
+
+  /* Handle any residuals.  */
+  HOST_WIDE_INT residual = size - rounded_size;
+  if (residual)
+    {
+      aarch64_sub_sp (-1, residual, true);
+      *last_probe_offset += residual;
+      if (*last_probe_offset >= PROBE_INTERVAL)
+	{
+	  *last_probe_offset -= PROBE_INTERVAL;
+	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					   (residual
+					    - GET_MODE_SIZE (word_mode))));
+	}
+    }
+  return;
+}
+
 /* AArch64 stack frames generated by this compiler look like:
 
 	+-------------------------------+
@@ -3661,6 +3745,11 @@ aarch64_expand_prologue (void)
   unsigned reg2 = cfun->machine->frame.wb_candidate2;
   rtx_insn *insn;
 
+  /* We know nothing about the last write into the stack, so we have to make
+     worst case assumptions here.  */
+  HOST_WIDE_INT last_probe_offset
+    = PROBE_INTERVAL - (STACK_BOUNDARY / BITS_PER_UNIT);
+
   /* Sign return address for functions.  */
   if (aarch64_return_address_signing_enabled ())
     {
@@ -3686,16 +3775,64 @@ aarch64_expand_prologue (void)
 	aarch64_emit_probe_stack_range (get_stack_check_protect (), frame_size);
     }
 
-  aarch64_sub_sp (IP0_REGNUM, initial_adjust, true);
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK
+      && initial_adjust != 0)
+    aarch64_allocate_and_probe_stack_space (IP0_REGNUM, initial_adjust,
+					    &last_probe_offset);
+  else
+    aarch64_sub_sp (IP0_REGNUM, initial_adjust, true);
+
+  /* There are no explicit probes in this case.  */
+  if (initial_adjust == 0
+      && callee_adjust < 32
+      && final_adjust == 0)
+    dump_stack_clash_frame_info (NO_PROBE_NO_FRAME, callee_adjust != 0);
 
   if (callee_adjust != 0)
-    aarch64_push_regs (reg1, reg2, callee_adjust);
+    {
+      aarch64_push_regs (reg1, reg2, callee_adjust);
+
+      /* For a small CALLEE_ADJUST we don't need a probe here as the store of
+	 REG1/REG2 will hit the page we want.  */
+      if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK
+	  && callee_adjust >= 32)
+	{
+	  /* callee_adjust has to fit in a 7 bit signed scaled offset,
+	     so it should never be near PROBE_INTERVAL.  */
+	  gcc_assert (callee_adjust < PROBE_INTERVAL);
+	  last_probe_offset += callee_adjust;
 
+	  /* Probe if necessary.  */
+	  if (last_probe_offset >= PROBE_INTERVAL)
+	    {
+	      emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					       (callee_adjust
+					        - GET_MODE_SIZE (word_mode))));
+	      dump_stack_clash_frame_info (PROBE_INLINE, true);
+	    }
+
+	  /* Regardless of whether or not we explicitly probed, the register
+	     push constitutes an implicit probe at *sp.  */
+	  last_probe_offset = 0;
+	}
+
+    }
+
+  /* The calls to aarch64_save_callee_saves store into the stack and
+     can act as implicit probes.  So it tracks the offset closest to
+     SP that is written and returns it.  If that return value is less
+     than LAST_PROBE_OFFSET then we can adjust LAST_PROBE_OFFSET, possibly
+     saving an explicit probe for FINAL_ADJUST.  */
+  HOST_WIDE_INT tmp;
   if (frame_pointer_needed)
     {
       if (callee_adjust == 0)
-	aarch64_save_callee_saves (DImode, callee_offset, R29_REGNUM,
-				   R30_REGNUM, false);
+	{
+	  tmp = aarch64_save_callee_saves (DImode, callee_offset, R29_REGNUM,
+					   R30_REGNUM, false);
+	  if (tmp < last_probe_offset)
+	    last_probe_offset = tmp;
+	}
       insn = emit_insn (gen_add3_insn (hard_frame_pointer_rtx,
 				       stack_pointer_rtx,
 				       GEN_INT (callee_offset)));
@@ -3703,11 +3840,22 @@ aarch64_expand_prologue (void)
       emit_insn (gen_stack_tie (stack_pointer_rtx, hard_frame_pointer_rtx));
     }
 
-  aarch64_save_callee_saves (DImode, callee_offset, R0_REGNUM, R30_REGNUM,
-			     callee_adjust != 0 || frame_pointer_needed);
-  aarch64_save_callee_saves (DFmode, callee_offset, V0_REGNUM, V31_REGNUM,
-			     callee_adjust != 0 || frame_pointer_needed);
-  aarch64_sub_sp (IP1_REGNUM, final_adjust, !frame_pointer_needed);
+  tmp = aarch64_save_callee_saves (DImode, callee_offset, R0_REGNUM, R30_REGNUM,
+				   callee_adjust != 0 || frame_pointer_needed);
+  if (tmp < last_probe_offset)
+    last_probe_offset = tmp;
+
+  tmp = aarch64_save_callee_saves (DFmode, callee_offset, V0_REGNUM, V31_REGNUM,
+				   callee_adjust != 0 || frame_pointer_needed);
+  if (tmp < last_probe_offset)
+    last_probe_offset = tmp;
+
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK
+      && final_adjust != 0)
+    aarch64_allocate_and_probe_stack_space (IP1_REGNUM, final_adjust,
+					    &last_probe_offset);
+  else
+    aarch64_sub_sp (IP1_REGNUM, final_adjust, !frame_pointer_needed);
 }
 
 /* Return TRUE if we can use a simple_return insn.


* Re: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
  2017-07-17 22:42       ` Wilco Dijkstra
@ 2017-07-17 23:16         ` Jeff Law
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff Law @ 2017-07-17 23:16 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches; +Cc: nd

On 07/17/2017 04:42 PM, Wilco Dijkstra wrote:
> Jeff Law wrote:    
>> On 07/17/2017 05:27 AM, Wilco Dijkstra wrote:
> 
>>> A minimum guard size of 64KB seems reasonable even on systems with
>>> 4KB pages. However whatever the chosen guard size, you cannot defend
>>> against hostile code. An OS can of course increase the guard size well 
>>> beyond the minimum required, but that's simply reducing the probability -
>>> it is never going to block a big unchecked alloca.
>> That's a kernel issue and I'm not in a position to change that.  On
>> systems with a 64bit address space, I'm keen to see the kernel team
>> increase the guard size, but it's not something I can unilaterally make
>> happen.
> 
> Well if we want the stack guard to catch cases with high probability even if some
> or most of the code is unchecked, it must be made much larger. And the fact
> you can set the stack guard to zero in GLIBC is worrying as that would allow an
> attacker to trivially bypass the stack guard...
Well, the attacker would have to arrange to emit a call to change the
guard size -- if the program doesn't already have that code sequence,
then that's going to be reasonably hard (they'd have to control the
argument to that pthread call to set the guard or conjure up a ROP
sequence -- and if they've got a ROP chain started there's probably more
direct attacks).




>>> In 99% of the frames only one stack allocation is made. There are a few
>>> cases where the stack can be adjusted twice.
>> BUt we still have to deal with the cases where there are multiple
>> adjustments.  Punting in the case of multiple adjustments isn't the
>> right thing to do.  Some level of tracking is needed.
> 
> I didn't say we should punt, just that no tracking is required. The AArch64 prolog
> code is extremely simple. The only adjustments that need to be checked are 
> initial_adjust and final_adjust. Callee_adjust doesn't need any checks since it is 
> limited by the range of STP (ie. < 512) and if the locals are large, it is zero.
Hmm, so we can't ever get into a case where INITIAL_ADJUST and
CALLEE_ADJUST are both nonzero?  If so, then yes, that does simplify
things -- dealing with that case is a serious annoyance and I'd be
happy to add an assert to ensure it never happens.   That's where
knowing the architecture details (which I don't) really turns out to be
useful :-)

> 
> Anyway the only complex case is shrinkwrapping. We know that at least LR must be
> saved before a call, but with -fomit-frame-pointer it doesn't always end up at the
> bottom of the callee-saves. We could take its offset into account or force it at offset 0.
Right.  I'd roughly figured out that the LR save is not separately
wrapped, which is good to know in terms of cases that have to be supported.


> 
>>> To be safe I think we first need to probe and then allocate. Or are there going
>>> to be extra checks in asynchronous interrupt handlers that check whether SP is
>>> above the stack guard?
>> That hits the red zone, which is unfortunate.  But it's acceptable IMHO.
> 
> How do you mean? What will happen if SP points into the heap? Will the trap
> handler just write to it without any check?
In general writing below SP tends to be discouraged.  But there's cases
where, IMHO, it's called for.  I think this is one of them.

I'm not aware of any checks in the signal handler frame setup done by
the kernel, which is a significant issue.  I'd focused on making sure we
don't have a window where the stack pointer has been pushed past the
guard before probing.  However, as Richard E has pointed out, if the
kernel doesn't have checks, then the signal frame setup itself in the
kernel could be used to push through the guard depending on the size of
the signal frame and the state of the stack pointer relative to the end
of the guard.

Jeff


* Re: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
  2017-07-17 15:24     ` Jeff Law
  2017-07-17 15:31       ` Jakub Jelinek
@ 2017-07-17 22:42       ` Wilco Dijkstra
  2017-07-17 23:16         ` Jeff Law
  1 sibling, 1 reply; 16+ messages in thread
From: Wilco Dijkstra @ 2017-07-17 22:42 UTC (permalink / raw)
  To: Jeff Law, GCC Patches; +Cc: nd

Jeff Law wrote:    
> On 07/17/2017 05:27 AM, Wilco Dijkstra wrote:

> > A minimum guard size of 64KB seems reasonable even on systems with
> > 4KB pages. However whatever the chosen guard size, you cannot defend
> > against hostile code. An OS can of course increase the guard size well 
> > beyond the minimum required, but that's simply reducing the probability -
> > it is never going to block a big unchecked alloca.
> That's a kernel issue and I'm not in a position to change that.  On
> systems with a 64bit address space, I'm keen to see the kernel team
> increase the guard size, but it's not something I can unilaterally make
> happen.

Well if we want the stack guard to catch cases with high probability even if some
or most of the code is unchecked, it must be made much larger. And the fact
you can set the stack guard to zero in GLIBC is worrying as that would allow an
attacker to trivially bypass the stack guard...

> > Outgoing args are not an area for concern even with a 4KB stack guard.
> They are a concern and as the size of the guard comes down, they become
> an increasing concern, IMHO.

At 4KB it's 1000 times more likely to have a frame that could skip the stack
guard than outgoing args. That's before we consider alloca which is even
more common.

> > In 99% of the frames only one stack allocation is made. There are a few
> > cases where the stack can be adjusted twice.
> BUt we still have to deal with the cases where there are multiple
> adjustments.  Punting in the case of multiple adjustments isn't the
> right thing to do.  Some level of tracking is needed.

I didn't say we should punt, just that no tracking is required. The AArch64 prolog
code is extremely simple. The only adjustments that need to be checked are 
initial_adjust and final_adjust. Callee_adjust doesn't need any checks since it is 
limited by the range of STP (ie. < 512) and if the locals are large, it is zero.

Anyway the only complex case is shrinkwrapping. We know that at least LR must be
saved before a call, but with -fomit-frame-pointer it doesn't always end up at the
bottom of the callee-saves. We could take its offset into account or force it at offset 0.

> > To be safe I think we first need to probe and then allocate. Or are there going
> > to be extra checks in asynchronous interrupt handlers that check whether SP is
> > above the stack guard?
> That hits the red zone, which is unfortunate.  But it's acceptable IMHO.

How do you mean? What will happen if SP points into the heap? Will the trap
handler just write to it without any check?

Wilco


* Re: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
  2017-07-17 15:51         ` Jeff Law
@ 2017-07-17 20:30           ` Trevor Saunders
  0 siblings, 0 replies; 16+ messages in thread
From: Trevor Saunders @ 2017-07-17 20:30 UTC (permalink / raw)
  To: Jeff Law; +Cc: Wilco Dijkstra, GCC Patches, nd

On Mon, Jul 17, 2017 at 09:51:40AM -0600, Jeff Law wrote:
> On 07/16/2017 12:36 PM, Trevor Saunders wrote:
> >>> On the other hand if probing is fast enough that it can be on by default
> >>> in gcc there should be much less of it.  Even if you did change the ABI
> >>> to require probing it seems unlikely that code violating that
> >>> requirement would hit problems other than this security concern, so I'd
> >>> expect there will be some non compliant asm out there.
> >> Certainly my goal is to enable it by default one day.  Even if that's
> >> ultimately done at the distro level or by a configure time switch.
> >>
> >> Certainly if/when we reach that point the amount of unprotected code
> >> will drop dramatically, but based on my experience I fully expect some
> >> folks to turn it off.  They'll say something like "hey, we've audited
> >> our code and we don't have any large stack or allocas, so I'm turning
> >> this thing off to get some un-measurable performance gain."  It'd be an
> >> uber-dumb thing to do, but it's going to happen.
> > 
> > I agree with that, though we could consider not having an option for it.
> > Of course people can compile their own compiler, but I think it's pretty
> > clear if you do that the security problem is your fault too.
> I suspect not having an option is a non-starter.  But maybe I'm wrong on
> that one.

I can see conditions under which I'd support it depending on
architecture and impact, but I can't speak for anyone else, so you're
probably wise to expect it won't happen.  As an example it seems

> >> Tweaking the ABI to mandate touching *sp in the outgoing args area &
> >> alloca space is better because we likely wouldn't have an option to
> >> avoid that access.   So a well meaning, but clueless, developer couldn't
> >> turn the option off like they could stack checking.
> > 
> > However see the comment about assembly code, I can easily see someone
> > forgetting to touch *sp in hand written assembly.  Obviously much less of
> > that, but it kind of sounds like you want perfect here.
> Perfect would be good :-)
> 
> The odds of hand written assembly code having a large enough outgoing
> args size or dynamic frame to require touching are exceedingly small.

Probably true, but that's not perfect so we're back in the land of
balancing probabilities.

> >>> It seems to me pretty important to ask how many programs out there have
> >>> a caller that can push the stack into the guard page, but not past it.
> >> I've actually spent a lot of time thinking about that precise problem. You
> >> don't need large frames to do that -- you just need controlled heap
> >> leaks and/or controlled recursion.  ie, even if a function has no
> >> allocas and a small frame, it can put the stack pointer into the guard.
> > 
> > There may not be room for it on 32 bit platforms, but I think another
> > thing we learned here is that it's a mistake to allow the heap to grow
> > into space the stack might use.  That would require the use of recursion
> > here.
> The heap grows in response to explicit requests and appropriate checks
> can be made to prevent jumping the guard due to heap growth.  It's the
> implicit mechanisms for stack growth that cause headaches.

I'm aware the heap can't jump the guard on its own.  However my
understanding of some of the known exploits was that they used heap
leaks to grow the heap until the stack couldn't grow any larger to make
it easier to jump the guard.

Has anyone looked at how necessary it is to support stack growth on 64
bit arches? I'd hope we could just have a set size that gets faulted in,
but I'm probably wrong.

> >>   I think the largest
> >>> buffer Qualys found was less than 400k? So 1 256k guard page should
> >>> protect 95% of functions, and 1m or 2m  seems like enough to protect
> >>> against all non malicious programs.  I'm not sure, but this is a 64 bit
> >> arch, so it seems like we should have the address space for large guard
> >>> pages like that.
> >> I'm all for larger guards, particularly on 64 bit architectures.
> >>
> >> We use 64k pages on aarch64 for RHEL which implicitly gives us a minimum
> >> guard of 64k.  That would be a huge factor in any analysis I would do if
> >> the aarch64 maintainers choose not to fully protect their architecture
> >> and I was forced to make a recommendation for Red Hat and its customers.
> >>  I hope I don't have to sit down and do the analysis on this and make
> >> any kind of recommendation.
> > 
> > It's certainly easier to say it's not the compiler's job to fully protect
> > when you don't have to make that recommendation ;)
> Lots of factors come into play.  What I don't want to do is put Red Hat
> in a position where some customer gets hacked because we left open a
> known hole that we could have reasonably closed.

Yeah, that's understandable.  If I thought all of this should be fixed in
the compiler I'd agree with not wanting to leave any holes.

> >> The fact that Qualys found nothing larger than X (for any X) in the code
> >> they scanned isn't relevant.  There could well be code out there they
> >> did not look at that uses > X or code that is yet to be written that
> >> uses > X.
> > 
> > If you want to protect all code someone could write I agree.  I've been
> > thinking more about protecting most reasonable code and saying if you
> > allocate a 50mb buffer on the stack that's your bug, and we don't need
> > to make that safe.
> Depends on your point of view.  I can't reasonably take that position
> with my Red Hat hat on.  In that role, I would say that any architecture
> for Red Hat Enterprise Linux needs to have a minimum level of protection
> against stack clash style attacks and that would include being safe
> against someone allocating very large structures on the stack. Such code
> may be dumb, such code may be inefficient and not terribly portable.
> But in the environments where RHEL is deployed, we can't just dismiss it
> out-of-hand.

That's understandable.

> That level of protection does not necessarily extend to an unbound
> alloca/vla because if it's unbound, then it's just a matter of
> controlling the size and when the unbound alloca occurs to jump the
> stack -- no amount of compiler hackery can fix that situation.

true, though most probing schemes should at least make that harder to
exploit, so they are a reason to want to enable checking as much as
possible.

Trev


* Re: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
  2017-07-16 18:36       ` Trevor Saunders
@ 2017-07-17 15:51         ` Jeff Law
  2017-07-17 20:30           ` Trevor Saunders
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff Law @ 2017-07-17 15:51 UTC (permalink / raw)
  To: Trevor Saunders; +Cc: Wilco Dijkstra, GCC Patches, nd

On 07/16/2017 12:36 PM, Trevor Saunders wrote:
>>> On the other hand if probing is fast enough that it can be on by default
>>> in gcc there should be much less of it.  Even if you did change the ABI
>>> to require probing it seems unlikely that code violating that
>>> requirement would hit problems other than this security concern, so I'd
>>> expect there will be some non compliant asm out there.
>> Certainly my goal is to enable it by default one day.  Even if that's
>> ultimately done at the distro level or by a configure time switch.
>>
>> Certainly if/when we reach that point the amount of unprotected code
>> will drop dramatically, but based on my experience I fully expect some
>> folks to turn it off.  They'll say something like "hey, we've audited
>> our code and we don't have any large stack or allocas, so I'm turning
>> this thing off to get some un-measurable performance gain."  It'd be an
>> uber-dumb thing to do, but it's going to happen.
> 
> I agree with that, though we could consider not having an option for it.
> Of course people can compile there own compiler, but I think its pretty
> clear if you do that the security problem is your fault too.
I suspect not having an option is a non-starter.  But maybe I'm wrong on
that one.

>> Tweaking the ABI to mandate touching *sp in the outgoing args area &
>> alloca space is better because we likely wouldn't have an option to
>> avoid that access.   So a well meaning, but clueless, developer couldn't
>> turn the option off like they could stack checking.
> 
> However see the comment about assembly code, I can easily see someone
> forgetting to touch *sp in hand written assembly.  Obviously much less of
> that, but it kind of sounds like you want perfect here.
Perfect would be good :-)

The odds of hand written assembly code having a large enough outgoing
args size or dynamic frame to require touching are exceedingly small.

>>> It seems to me pretty important to ask how many programs out there have
>>> a caller that can push the stack into the guard page, but not past it.
>> I've actually spent a lot of time thinking about that precise problem. You
>> don't need large frames to do that -- you just need controlled heap
>> leaks and/or controlled recursion.  ie, even if a function has no
>> allocas and a small frame, it can put the stack pointer into the guard.
> 
> There may not be room for it on 32 bit platforms, but I think another
> thing we learned here is that it's a mistake to allow the heap to grow
> into space the stack might use.  That would require the use of recursion
> here.
The heap grows in response to explicit requests and appropriate checks
can be made to prevent jumping the guard due to heap growth.  It's the
implicit mechanisms for stack growth that cause headaches.



>>   I think the largest
>>> buffer Qualys found was less than 400k? So 1 256k guard page should
>>> protect 95% of functions, and 1m or 2m  seems like enough to protect
>>> against all non malicious programs.  I'm not sure, but this is a 64 bit
>> arch, so it seems like we should have the address space for large guard
>>> pages like that.
>> I'm all for larger guards, particularly on 64 bit architectures.
>>
>> We use 64k pages on aarch64 for RHEL which implicitly gives us a minimum
>> guard of 64k.  That would be a huge factor in any analysis I would do if
>> the aarch64 maintainers choose not to fully protect their architecture
>> and I was forced to make a recommendation for Red Hat and its customers.
>>  I hope I don't have to sit down and do the analysis on this and make
>> any kind of recommendation.
> 
> It's certainly easier to say it's not the compiler's job to fully protect
> when you don't have to make that recommendation ;)
Lots of factors come into play.  What I don't want to do is put Red Hat
in a position where some customer gets hacked because we left open a
known hole that we could have reasonably closed.


> 
>> The fact that Qualys found nothing larger than X (for any X) in the code
>> they scanned isn't relevant.  There could well be code out there they
>> did not look at that uses > X or code that is yet to be written that
>> uses > X.
> 
> If you want to protect all code someone could write I agree.  I've been
> thinking more about protecting most reasonable code and saying if you
> allocate a 50mb buffer on the stack that's your bug, and we don't need
> to make that safe.
Depends on your point of view.  I can't reasonably take that position
with my Red Hat hat on.  In that role, I would say that any architecture
for Red Hat Enterprise Linux needs to have a minimum level of protection
against stack clash style attacks and that would include being safe
against someone allocating very large structures on the stack. Such code
may be dumb, such code may be inefficient and not terribly portable.
But in the environments where RHEL is deployed, we can't just dismiss it
out-of-hand.

That level of protection does not necessarily extend to an unbound
alloca/vla because if it's unbound, then it's just a matter of
controlling the size and when the unbound alloca occurs to jump the
stack -- no amount of compiler hackery can fix that situation.

> 
> It might well not be what you need, but it does kind of seem like what's
> generally useful is protecting almost all if not all "reasonable"
> programs with the least performance impact.  Looking at various real
> programs to decide what is reasonable or not seems like a good way to
> draw that line.
It's an interesting exercise, but more for determining what the common
cases are, so that those can be made the most efficient.


> 
>>> On the other hand making -fstack-check=clash the default seems to me
>>> like a very significant security improvement.
>> Agreed and it's where I'd like this to go.
> 
> yeah, I'm just wondering if it isn't possible to get there already on 64
> bit machines with some kernel changes.
> 
> thanks / sorry about lag.
No worries.

Thanks,
jeff


* Re: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
  2017-07-17 15:24     ` Jeff Law
@ 2017-07-17 15:31       ` Jakub Jelinek
  2017-07-17 22:42       ` Wilco Dijkstra
  1 sibling, 0 replies; 16+ messages in thread
From: Jakub Jelinek @ 2017-07-17 15:31 UTC (permalink / raw)
  To: Jeff Law; +Cc: Wilco Dijkstra, GCC Patches, nd

On Mon, Jul 17, 2017 at 09:24:27AM -0600, Jeff Law wrote:
> >> While I'm not really comfortable with the 2k/2k split in general, I
> >> think I can support it from a Red Hat point of view -- largely because
> >> we use 64k pages in RHEL.  So our guard is actually 64k.  Having a
> >> hostile call chain push 2k into the guard isn't so bad in that case.
> > 
> > A minimum guard size of 64KB seems reasonable even on systems with
> > 4KB pages. However whatever the chosen guard size, you cannot defend
> > against hostile code. An OS can of course increase the guard size well 
> > beyond the minimum required, but that's simply reducing the probability -
> > it is never going to block a big unchecked alloca.
> That's a kernel issue and I'm not in a position to change that.  On
> systems with a 64bit address space, I'm keen to see the kernel team
> increase the guard size, but it's not something I can unilaterally make
> happen.

That is actually not true.  It is a kernel issue only for the initial thread's
stack.  For pthread_create created threads it is a matter of how big the
default guard size is (these days glibc uses a single page by default) and
how many apps override that guardsize to something different (smaller
including 0 or larger).  And, while it might be acceptable to have larger
guard size for the initial thread, if you have hundreds of thousands of
threads, every extra KB in the guard areas will count significantly.

	Jakub


* Re: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
  2017-07-17 11:27   ` Wilco Dijkstra
@ 2017-07-17 15:24     ` Jeff Law
  2017-07-17 15:31       ` Jakub Jelinek
  2017-07-17 22:42       ` Wilco Dijkstra
  0 siblings, 2 replies; 16+ messages in thread
From: Jeff Law @ 2017-07-17 15:24 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches; +Cc: nd

On 07/17/2017 05:27 AM, Wilco Dijkstra wrote:
> Jeff Law wrote:
> 
>> So would a half-half (2k caller/2k callee) split like Florian has
>> proposed work for you?  ie, we simply declare a caller that pushes the
>> stack pointer 2k or more into the guard as broken?
> 
> My results show using a 4KB guard size is not ideal. 2KB of outgoing
> args is too large (it means 264 arguments) so it's a waste of the limited
> range.
Ideal depends on your point of view, obviously.  Can you interface
directly with Richard and determine what split you want?  Having me as
an intermediary just slows things down :-)



> 
>> While I'm not really comfortable with the 2k/2k split in general, I
>> think I can support it from a Red Hat point of view -- largely because
>> we use 64k pages in RHEL.  So our guard is actually 64k.  Having a
>> hostile call chain push 2k into the guard isn't so bad in that case.
> 
> A minimum guard size of 64KB seems reasonable even on systems with
> 4KB pages. However whatever the chosen guard size, you cannot defend
> against hostile code. An OS can of course increase the guard size well 
> beyond the minimum required, but that's simply reducing the probability -
> it is never going to block a big unchecked alloca.
That's a kernel issue and I'm not in a position to change that.  On
systems with a 64bit address space, I'm keen to see the kernel team
increase the guard size, but it's not something I can unilaterally make
happen.




> 
>> In fact, with a 64k guard, the dynamic area dwarfs the outgoing args as
>> the area of most concern.  And with that size guard we're far more
>> likely to see an attempted jump with an unconstrained alloca rather than
>> with a fairly large alloca.
> 
> Outgoing args are not an area for concern even with a 4KB stack guard.
They are a concern and as the size of the guard comes down, they become
an increasing concern, IMHO.

> 
> Unconstrained alloca's are indeed the most risky - if a program has a single
> alloca that can be controlled then it is 100% unsafe no matter how many
> million probes you do elsewhere.
Agreed.

> 
>> I believe Richard Earnshaw indicated that 4k pages were in use on some
>> aarch64 systems, so I didn't try to use a larger probe interval.  Though
>> I certainly considered it.  I think that's a decision that belongs in
>> the hands of the target maintainers.  Though I think it's best if you're
>> going to use a larger probe interval to mandate a suitable page size in
>> the ABI.
> 
> The probe interval doesn't have to be the same as the (minimum) page size.
Correct.  It has to be smaller than the guard.  So what guard does the
kernel guarantee?  I'm going to set things up assuming a single guard
page, y'all can modify as needed.

> 
>> Some (simpler) tracking is still needed because allocations happen in
>> potentially 3 places for aarch64.  There's almost certainly cases where
>> none of them individually are larger than PROBE_INTERVAL, but as a group
>> they could be.
> 
> In 99% of the frames only one stack allocation is made. There are a few
> cases where the stack can be adjusted twice.
BUt we still have to deal with the cases where there are multiple
adjustments.  Punting in the case of multiple adjustments isn't the
right thing to do.  Some level of tracking is needed.

> 
>> So how about this.
>>
>> First we have to disallow any individual allocation from allocating more
>> than PROBE_INTERVAL bytes.
>>
>> If the total frame size is less than X (where we're still discussing X),
>> we emit no probes.
> 
> ... and the outgoing args are smaller than Y.
> 
>> If the total frame size is greater than X, then after the first
>> allocation we probe the highest address within in the just allocated
>> area and we probe every PROBE_INTERVAL bytes thereafter as space is
>> allocated.
> 
> To be safe I think we first need to probe and then allocate. Or are there going
> to be extra checks in asynchronous interrupt handlers that check whether SP is
> above the stack guard?
That hits the red zone, which is unfortunate.  But it's acceptable IMHO.


Jeff


* Re: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
  2017-07-14  6:19 ` Jeff Law
@ 2017-07-17 11:27   ` Wilco Dijkstra
  2017-07-17 15:24     ` Jeff Law
  0 siblings, 1 reply; 16+ messages in thread
From: Wilco Dijkstra @ 2017-07-17 11:27 UTC (permalink / raw)
  To: Jeff Law, GCC Patches; +Cc: nd

Jeff Law wrote:

> So would a half-half (2k caller/2k callee) split like Florian has
> proposed work for you?  ie, we simply declare a caller that pushes the
> stack pointer 2k or more into the guard as broken?

My results show using a 4KB guard size is not ideal. 2KB of outgoing
args is too large (it means 264 arguments) so it's a waste of the limited
range.

> While I'm not really comfortable with the 2k/2k split in general, I
> think I can support it from a Red Hat point of view -- largely because
> we use 64k pages in RHEL.  So our guard is actually 64k.  Having a
> hostile call chain push 2k into the guard isn't so bad in that case.

A minimum guard size of 64KB seems reasonable even on systems with
4KB pages. However whatever the chosen guard size, you cannot defend
against hostile code. An OS can of course increase the guard size well 
beyond the minimum required, but that's simply reducing the probability -
it is never going to block a big unchecked alloca.

> In fact, with a 64k guard, the dynamic area dwarfs the outgoing args as
> the area of most concern.  And with that size guard we're far more
> likely to see an attempted jump with an unconstrained alloca rather than
> with a fairly large alloca.

Outgoing args are not an area for concern even with a 4KB stack guard.

Unconstrained alloca's are indeed the most risky - if a program has a single
alloca that can be controlled then it is 100% unsafe no matter how many
million probes you do elsewhere.

> And if there's an attacker controlled unconstrained alloca, then, well,
> we've lost because no matter the size of the guard, the attacker can
> jump it and there isn't a damn thing the callee can do about it.

Exactly. This means checking dynamic allocation by default is essential.

> I believe Richard Earnshaw indicated that 4k pages were in use on some
> aarch64 systems, so I didn't try to use a larger probe interval.  Though
> I certainly considered it.  I think that's a decision that belongs in
> the hands of the target maintainers.  Though I think it's best if you're
> going to use a larger probe interval to mandate a suitable page size in
> the ABI.

The probe interval doesn't have to be the same as the (minimum) page size.

> Some (simpler) tracking is still needed because allocations happen in
> potentially 3 places for aarch64.  There's almost certainly cases where
> none of them individually are larger than PROBE_INTERVAL, but as a group
> they could be.

In 99% of the frames only one stack allocation is made. There are a few
cases where the stack can be adjusted twice.

> So how about this.
>
> First we have to disallow any individual allocation from allocating more
> than PROBE_INTERVAL bytes.
>
> If the total frame size is less than X (where we're still discussing X),
> we emit no probes.

... and the outgoing args are smaller than Y.

> If the total frame size is greater than X, then after the first
> allocation we probe the highest address within in the just allocated
> area and we probe every PROBE_INTERVAL bytes thereafter as space is
> allocated.

To be safe I think we first need to probe and then allocate. Or are there going
to be extra checks in asynchronous interrupt handlers that check whether SP is
above the stack guard?

> PROBE_INTERVAL is currently 4k, but the aarch64 maintainers can choose
> to change that.  Note that significantly larger probe intervals may
> require tweaking the sequences -- I'm not familiar enough with the
> various immediate field limitations on aarch64 instructions to be sure
> either way on that issue.

On AArch64 a probe can reach up to 16MBytes using 2 instructions.
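
For instance, something along these lines (illustrative only; the choice
of x9 as a scratch register is arbitrary):

	sub	x9, sp, #4095, lsl #12	// roughly 16MB below SP
	str	xzr, [x9]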

Wilco

    


* Re: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
  2017-07-12 21:08     ` Jeff Law
@ 2017-07-16 18:36       ` Trevor Saunders
  2017-07-17 15:51         ` Jeff Law
  0 siblings, 1 reply; 16+ messages in thread
From: Trevor Saunders @ 2017-07-16 18:36 UTC (permalink / raw)
  To: Jeff Law; +Cc: Wilco Dijkstra, GCC Patches, nd

On Wed, Jul 12, 2017 at 03:08:28PM -0600, Jeff Law wrote:
> On 07/12/2017 06:47 AM, Trevor Saunders wrote:
> > On Tue, Jul 11, 2017 at 08:02:26PM -0600, Jeff Law wrote:
> >> On 07/11/2017 06:20 PM, Wilco Dijkstra wrote:
> >>> Jeff Law wrote:
> >>>> aarch64 is the first target that does not have any implicit probes in
> >>>> the caller.  Thus at prologue entry it must make conservative
> >>>> assumptions about the offset of the most recent probed address relative
> >>>> to the stack pointer.
> >>>
> >>> No - like I mentioned before that's not correct nor acceptable as it would imply
> >>> that ~70% of functions need a probe at entry. I did a quick run across SPEC and
> >>> found the outgoing argument size is > 1024 in just 9 functions out of 147000!
> >>> Those 9 were odd special cases due to auto generated code to interface between
> >>> C and Fortran. This is extremely unlikely to occur anywhere else. So even assuming
> >>> an unchecked caller, large outgoing arguments are simply not a realistic threat.
>> Whether or not such frames exist in SPEC isn't the question.  There's
> >> nothing in the ABI or ISA that allows us to avoid those probes without
> >> compromising security.
> >>
> >> Mixed code scenarios are going to be a fact of life, probably forever
> >> unless we can convince every ISV providing software that works on top of
> >> RHEL/*BSD/whatever  to turn on probing (based on my experiences, that
> >> has exactly a zero chance of occurring).
> > 
> > On the other hand if probing is fast enough that it can be on by default
> > in gcc there should be much less of it.  Even if you did change the ABI
> > to require probing it seems unlikely that code violating that
> > requirement would hit problems other than this security concern, so I'd
> > expect there will be some non compliant asm out there.
> Certainly my goal is to enable it by default one day.  Even if that's
> ultimately done at the distro level or by a configure time switch.
> 
> Certainly if/when we reach that point the amount of unprotected code
> will drop dramatically, but based on my experience I fully expect some
> folks to turn it off.  They'll say something like "hey, we've audited
> our code and we don't have any large stack or allocas, so I'm turning
> this thing off to get some un-measurable performance gain."  It'd be an
> uber-dumb thing to do, but it's going to happen.

I agree with that, though we could consider not having an option for it.
Of course people can compile their own compiler, but I think it's pretty
clear if you do that the security problem is your fault too.

> And once that happens, they're open to attack again, even if they were
> correct in their audit and their code only calls stack-clash protected
> libraries.
> 
> Tweaking the ABI to mandate touching *sp in the outgoing args area &
> alloca space is better because we likely wouldn't have an option to
> avoid that access.   So a well meaning, but clueless, developer couldn't
> turn the option off like they could stack checking.

However see the comment about assembly code, I can easily see someone
forgetting to touch *sp in hand written assembly.  Obviously much less of
that, but it kind of sounds like you want perfect here.

> >> In a mixed code scenario a caller may have a large alloca or large
> >> outgoing argument area that pushes the stack pointer into the guard page
> >> without actually accessing the guard page.  That requires a callee which
> >> is compiled with stack checking to make worst case assumptions at
> >> function entry to protect itself as much as possible from these attacks.
> > 
> > It seems to me pretty important to ask how many programs out there have
> > a caller that can push the stack into the guard page, but not past it.
> I've actually spent a lot of time thinking about that precise problem. You
> don't need large frames to do that -- you just need controlled heap
> leaks and/or controlled recursion.  ie, even if a function has no
> allocas and a small frame, it can put the stack pointer into the guard.

There may not be room for it on 32 bit platforms, but I think another
thing we learned here is that it's a mistake to allow the heap to grow
into space the stack might use.  That would require the use of recursion
here.

> Thus in the callee we have to make worst case assumptions to be safe on
> targets where there is no implicit probe in the caller.
> 
> 
> 
> > I'd expect that's not the case for most allocas, either they are safe,
> > or can increase the stack arbitrarily.  I'd expect its more likely with
> > outgoing arg or large buffer allocations though.
> Unfortunately not.  A small alloca is not inherently safe unless you
> also touch the allocated space.  That's precisely why the generic code
> touches residuals after the loop that handles PROBE_INTERVAL chunks
> (consider a small alloca in a loop).
> 
> That's also why I suggested a small backwards compatible tweak to the
> aarch64 ABI.  Specifically there should be a mandated touch of *sp in
> any function where the outgoing args size is greater than some well
> defined value or if the function allocates any dynamic stack space.
> 
> That mandated touch of *sp at the ABI level would be enough to allow for
> a much less conservative assumed state at function start and would allow
> the vast majority of functions to need no probing at all.
> 
> 
> 
>   I think the largest
> > buffer Qualys found was less than 400k? So 1 256k guard page should
> > protect 95% of functions, and 1m or 2m  seems like enough to protect
> > against all non malicious programs.  I'm not sure, but this is a 64 bit
> arch, so it seems like we should have the address space for large guard
> > pages like that.
> I'm all for larger guards, particularly on 64 bit architectures.
> 
> We use 64k pages on aarch64 for RHEL which implicitly gives us a minimum
> guard of 64k.  That would be a huge factor in any analysis I would do if
> the aarch64 maintainers choose not to fully protect their architecture
> and I was forced to make a recommendation for Red Hat and its customers.
>  I hope I don't have to sit down and do the analysis on this and make
> any kind of recommendation.

It's certainly easier to say it's not the compiler's job to fully protect
when you don't have to make that recommendation ;)

> The fact that Qualys found nothing larger than X (for any X) in the code
> they scanned isn't relevant.  There could well be code out there they
> did not look at that uses > X or code that is yet to be written that
> uses > X.

If you want to protect all code someone could write I agree.  I've been
thinking more about protecting most reasonable code and saying if you
allocate a 50mb buffer on the stack that's your bug, and we don't need
to make that safe.

> > 
> > There's also the trade off that increasing the amount of code that
> > probes reduces the set of code that can either move the stack into or
> > past the guard page.
> Correct.
> 
> > 
> >> THe aarch64 maintainers can certain nix what I've done or modify it in
> >> ways that suit them.  That is their choice as port maintainers.
> >> However, Red Hat will have to evaluate the cost of reducing security for
> >> our customer base against the performance improvement of such changes.
> >> As I've said before, I do not know where that decision would fall.
> >>
> >>
> >>>
> >>> Therefore even when using a tiny 4K probe size we can safely adjust SP by 3KB
> >>> before needing an explicit probe - now only 0.6% of functions need a probe.
> >>> If we choose a proper minimum probe distance, say 64KB, explicit probes are
> >>> basically non-existent (just 35 functions, or ~0.02% of all functions are > 64KB).
> >>> Clearly inserting probes can be the default as the impact on code quality is negligible.
> >> Again, there's nothing that says 3k is safe.   You're picking an
> >> arbitrary point that is safe in a codebase you've looked at.  But I'm
> >> looking at this from a "what guarantees do I have from an ABI or ISA
> >> standpoint".  The former may be more performant, but it's inherently
> >> less secure than the latter.

It might well not be what you need, but it does kind of seem like what's
generally useful is protecting almost all if not all "reasonable"
programs with the least performance impact.  Looking at various real
programs to decide what is reasonable or not seems like a good way to
draw that line.

> > On the other hand making -fstack-check=clash the default seems to me
> > like a very significant security improvement.
> Agreed and it's where I'd like this to go.

yeah, I'm just wondering if it isn't possible to get there already on 64
bit machines with some kernel changes.

thanks / sorry about lag.

Trev

> 
> Jeff


* Re: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
  2017-07-14  8:17 ` Jakub Jelinek
@ 2017-07-14 11:49   ` Wilco Dijkstra
  0 siblings, 0 replies; 16+ messages in thread
From: Wilco Dijkstra @ 2017-07-14 11:49 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Jeff Law, GCC Patches, nd

Jakub Jelinek wrote:
> On Wed, Jul 12, 2017 at 12:20:32AM +0000, Wilco Dijkstra wrote:

> > Therefore even when using a tiny 4K probe size we can safely adjust SP by 3KB
> > before needing an explicit probe - now only 0.6% of functions need a probe.
> > If we choose a proper minimum probe distance, say 64KB, explicit probes are
> > basically non-existent (just 35 functions, or ~0.02% of all functions are > 64KB).
> > Clearly inserting probes can be the default as the impact on code quality is negligible.
> 
> For non-leaf functions you need at least one probe no matter how small the
> frame size is (if it is bigger than 0), explicit or implicit, unless you
> perform IPA analysis on the callgraph and determine when that isn't needed,
> because you can have deep call stacks that would, through functions that
> don't touch anything, skip stack pages.  Of course, such probes can be stores
> of call used registers, it can be any store to the stack.

Well you need to save the return address somewhere, so a non-leaf function already
has an implicit probe before a call (even if shrinkwrapped). So it is not possible for a 
long sequence of function calls or a recursive function to jump the stack guard - the
only way to jump the guard is using a huge unchecked static or dynamic allocation.

One key thing to understand is that it doesn't matter where exactly the return address
is saved in a frame. You could save it at a random location and all it would mean is that
if the probe size is N, you only need to insert additional explicit probes if the frame is
larger than N/2 (sum of static and dynamic allocation). Obviously you could do better
than that with a well defined frame layout.
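
As an illustration (example numbers only): with N = 4KB and every frame at
most 2KB, the LR save in a caller's frame and the LR save in its callee's
frame are at most 2KB + 2KB = 4KB apart, no matter where the saves sit
within their frames, so SP can never move past a 4KB guard without one of
those implicit probes landing in it.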

Before we consider IPA, how about optimizing trivial alloca's first? For example why
does GCC emit dynamic allocations for:

void f(void*);
void alloca (int x)
{
  if (x < 100)
    f (__builtin_alloca (x));
  f (__builtin_alloca (16));
}

Wilco


* Re: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
  2017-07-12  0:20 Wilco Dijkstra
  2017-07-12  2:02 ` Jeff Law
  2017-07-14  6:19 ` Jeff Law
@ 2017-07-14  8:17 ` Jakub Jelinek
  2017-07-14 11:49   ` Wilco Dijkstra
  2 siblings, 1 reply; 16+ messages in thread
From: Jakub Jelinek @ 2017-07-14  8:17 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Jeff Law, GCC Patches, nd

On Wed, Jul 12, 2017 at 12:20:32AM +0000, Wilco Dijkstra wrote:
> No - like I mentioned before that's not correct nor acceptable as it would imply
> that ~70% of functions need a probe at entry. I did a quick run across SPEC and
> found the outgoing argument size is > 1024 in just 9 functions out of 147000!
> Those 9 were odd special cases due to auto generated code to interface between
> C and Fortran. This is extremely unlikely to occur anywhere else. So even assuming
> an unchecked caller, large outgoing arguments are simply not a realistic threat.
> 
> Therefore even when using a tiny 4K probe size we can safely adjust SP by 3KB
> before needing an explicit probe - now only 0.6% of functions need a probe.
> If we choose a proper minimum probe distance, say 64KB, explicit probes are
> basically non-existent (just 35 functions, or ~0.02% of all functions are > 64KB).
> Clearly inserting probes can be the default as the impact on code quality is negligible.

For non-leaf functions you need at least one probe no matter how small the
frame size is (if it is bigger than 0), explicit or implicit, unless you
perform IPA analysis on the callgraph and determine when that isn't needed,
because you can have deep call stacks that would, through functions that
don't touch anything, skip stack pages.  Of course, such a probe can be a
store of call-used registers; it can be any store to the stack.

	Jakub

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
  2017-07-12  0:20 Wilco Dijkstra
  2017-07-12  2:02 ` Jeff Law
@ 2017-07-14  6:19 ` Jeff Law
  2017-07-17 11:27   ` Wilco Dijkstra
  2017-07-14  8:17 ` Jakub Jelinek
  2 siblings, 1 reply; 16+ messages in thread
From: Jeff Law @ 2017-07-14  6:19 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches; +Cc: nd

On 07/11/2017 06:20 PM, Wilco Dijkstra wrote:
> Jeff Law wrote:
>> aarch64 is the first target that does not have any implicit probes in
>> the caller.  Thus at prologue entry it must make conservative
>> assumptions about the offset of the most recent probed address relative
>> to the stack pointer.
> 
> No - like I mentioned before that's not correct nor acceptable as it would imply
> that ~70% of functions need a probe at entry. I did a quick run across SPEC and
> found the outgoing argument size is > 1024 in just 9 functions out of 147000!
> Those 9 were odd special cases due to auto generated code to interface between
> C and Fortran. This is extremely unlikely to occur anywhere else. So even assuming
> an unchecked caller, large outgoing arguments are simply not a realistic threat.
So would a half-half (2k caller/2k callee) split like Florian has
proposed work for you?  I.e., we simply declare a caller that pushes the
stack pointer 2k or more into the guard as broken?

While I'm not really comfortable with the 2k/2k split in general, I
think I can support it from a Red Hat point of view -- largely because
we use 64k pages in RHEL.  So our guard is actually 64k.  Having a
hostile call chain push 2k into the guard isn't so bad in that case.

In fact, with a 64k guard, the dynamic area dwarfs the outgoing args as
the area of most concern.  And with that size guard we're far more
likely to see an attempted jump with an unconstrained alloca rather than
with a fairly large alloca.

And if there's an attacker controlled unconstrained alloca, then, well,
we've lost because no matter the size of the guard, the attacker can
jump it and there isn't a damn thing the callee can do about it.


> 
> Therefore even when using a tiny 4K probe size we can safely adjust SP by 3KB
> before needing an explicit probe - now only 0.6% of functions need a probe.
> If we choose a proper minimum probe distance, say 64KB, explicit probes are
> basically non-existent (just 35 functions, or ~0.02% of all functions are > 64KB).
> Clearly inserting probes can be the default as the impact on code quality is negligible.
I believe Richard Earnshaw indicated that 4k pages were in use on some
aarch64 systems, so I didn't try to use a larger probe interval.  Though
I certainly considered it.  I think that's a decision that belongs in
the hands of the target maintainers.  Though I think it's best, if you're
going to use a larger probe interval, to mandate a suitable page size in
the ABI.




> 
> With respect to implementation it is relatively easy to decide in aarch64_layout_frame
> which frames need probes and where. I don't think keeping a running offset of the last
> probe/store is useful, it'll just lead to inefficiencies and bugs. 
It's certainly possible.  My goal was the minimize probing while
providing stack-clash safety, even if it meant a more complex tracking
scheme.

Some (simpler) tracking is still needed because allocations happen in
potentially 3 places for aarch64.  There are almost certainly cases where
none of them individually is larger than PROBE_INTERVAL, but as a group
they could be.
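
For instance, with made-up numbers (a 4K PROBE_INTERVAL and hypothetical
adjustment sizes, purely illustrative):

#include <assert.h>

int main (void)
{
  const int probe_interval = 4096;
  const int initial_adjust = 2048;   /* hypothetical prologue allocations */
  const int callee_adjust  = 1024;
  const int final_adjust   = 2048;

  /* No single adjustment crosses the probe interval ...  */
  assert (initial_adjust < probe_interval);
  assert (callee_adjust  < probe_interval);
  assert (final_adjust   < probe_interval);
  /* ... but together they do, so at least one probe is still needed.  */
  assert (initial_adjust + callee_adjust + final_adjust > probe_interval);
  return 0;
}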

But that would certainly be simpler than the tracking I did, at the cost
of sometimes probing when it's not strictly necessary.  Again, given the
lack of implicit probes prior to function entry I wanted to do as good a
job as possible exploiting the implicit probes generated in the prologue.


So how about this.

First we have to disallow any individual allocation from allocating more
than PROBE_INTERVAL bytes.

If the total frame size is less than X (where we're still discussing X),
we emit no probes.

If the total frame size is greater than X, then after the first
allocation we probe the highest address within the just allocated
area and we probe every PROBE_INTERVAL bytes thereafter as space is
allocated.

PROBE_INTERVAL is currently 4k, but the aarch64 maintainers can choose
to change that.  Note that significantly larger probe intervals may
require tweaking the sequences -- I'm not familiar enough with the
various immediate field limitations on aarch64 instructions to be sure
either way on that issue.
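
In rough C pseudocode the policy would look something like this (the names
emit_allocation/emit_probe, the chunk list and the value of X are all
placeholders for illustration, not the actual implementation):

#define PROBE_INTERVAL 4096
#define X 1024   /* placeholder no-probe threshold, still under discussion */

extern void emit_allocation (long bytes);   /* stand-in for an SP adjustment */
extern void emit_probe (long sp_offset);    /* stand-in for a store probe    */

void
expand_frame_allocations (long total_frame, const long *chunk, int nchunks)
{
  if (total_frame < X)
    {
      /* Small frame: allocate everything, no probes at all.  */
      for (int i = 0; i < nchunks; i++)
        emit_allocation (chunk[i]);
      return;
    }

  long unprobed = 0;   /* bytes allocated since the last probe */
  for (int i = 0; i < nchunks; i++)
    {
      /* Each individual chunk is required to be <= PROBE_INTERVAL.  */
      emit_allocation (chunk[i]);
      unprobed += chunk[i];
      if (i == 0 || unprobed >= PROBE_INTERVAL)
        {
          /* Probe the highest word of the area just allocated; after the
             first allocation this fires whenever PROBE_INTERVAL or more
             bytes have accumulated since the last probe.  */
          emit_probe (chunk[i] - 8);
          unprobed = 0;
        }
    }
}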



> The patch doesn't deal
> with the delayed stores due to shrinkwrapping for example. Inserting probes before
> the prolog would be easier, eg.
Yea, I'm glad you brought up shrink-wrapping.  It's definitely a
concern.  Though probably less so with your suggestions from above.

Jeff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
  2017-07-12 12:47   ` Trevor Saunders
@ 2017-07-12 21:08     ` Jeff Law
  2017-07-16 18:36       ` Trevor Saunders
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff Law @ 2017-07-12 21:08 UTC (permalink / raw)
  To: Trevor Saunders; +Cc: Wilco Dijkstra, GCC Patches, nd

On 07/12/2017 06:47 AM, Trevor Saunders wrote:
> On Tue, Jul 11, 2017 at 08:02:26PM -0600, Jeff Law wrote:
>> On 07/11/2017 06:20 PM, Wilco Dijkstra wrote:
>>> Jeff Law wrote:
>>>> aarch64 is the first target that does not have any implicit probes in
>>>> the caller.  Thus at prologue entry it must make conservative
>>>> assumptions about the offset of the most recent probed address relative
>>>> to the stack pointer.
>>>
>>> No - like I mentioned before that's not correct nor acceptable as it would imply
>>> that ~70% of functions need a probe at entry. I did a quick run across SPEC and
>>> found the outgoing argument size is > 1024 in just 9 functions out of 147000!
>>> Those 9 were odd special cases due to auto generated code to interface between
>>> C and Fortran. This is extremely unlikely to occur anywhere else. So even assuming
>>> an unchecked caller, large outgoing arguments are simply not a realistic threat.
>> Whether or not such frames exist in SPEC isn't the question.  There's
>> nothing in the ABI or ISA that allows us to avoid those probes without
>> compromising security.
>>
>> Mixed code scenarios are going to be a fact of life, probably forever
>> unless we can convince every ISV providing software that works on top of
>> RHEL/*BSD/whatever  to turn on probing (based on my experiences, that
>> has exactly a zero chance of occurring).
> 
> On the other hand if probing is fast enough that it can be on by default
> in gcc there should be much less of it.  Even if you did change the ABI
> to require probing it seems unlikely that code violating that
> requirement would hit problems other than this security concern, so I'd
> expect there will be some non-compliant asm out there.
Certainly my goal is to enable it by default one day.  Even if that's
ultimately done at the distro level or by a configure time switch.

Certainly if/when we reach that point the amount of unprotected code
will drop dramatically, but based on my experience I fully expect some
folks to turn it off.  They'll say something like "hey, we've audited
our code and we don't have any large stack or allocas, so I'm turning
this thing off to get some un-measurable performance gain."  It'd be an
uber-dumb thing to do, but it's going to happen.

And once that happens, they're open to attack again, even if they were
correct in their audit and their code only calls stack-clash protected
libraries.

Tweaking the ABI to mandate touching *sp in the outgoing args area &
alloca space is better because we likely wouldn't have an option to
avoid that access.  So a well-meaning, but clueless, developer couldn't
turn the option off like they could stack checking.

> 
>> In a mixed code scenario a caller may have a large alloca or large
>> outgoing argument area that pushes the stack pointer into the guard page
>> without actually accessing the guard page.  That requires a callee which
>> is compiled with stack checking to make worst case assumptions at
>> function entry to protect itself as much as possible from these attacks.
> 
> It seems to me pretty important to ask how many programs out there have
> a caller that can push the stack into the guard page, but not past it.
I've actually spent a lot of time thinking about that precise problem. You
don't need large frames to do that -- you just need controlled heap
leaks and/or controlled recursion.  ie, even if a function has no
allocas and a small frame, it can put the stack pointer into the guard.

Thus in the callee we have to make worst case assumptions to be safe on
targets where there is no implicit probe in the caller.



> I'd expect that's not the case for most allocas; either they are safe,
> or can increase the stack arbitrarily.  I'd expect it's more likely with
> outgoing arg or large buffer allocations though.
Unfortunately not.  A small alloca is not inherently safe unless you
also touch the allocated space.  That's precisely why the generic code
touches residuals after the loop that handles PROBE_INTERVAL chunks
(consider a small alloca in a loop).
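
A hypothetical example of the "small alloca in a loop" case (illustrative
only; consume() is a made-up external function):

void consume (char *);

void
small_alloca_in_a_loop (int n)
{
  for (int i = 0; i < n; i++)
    {
      /* Each allocation is tiny, but alloca'd space lives until the
         function returns, so the total dynamic allocation across the
         loop is unbounded.  That is why the residual of every
         allocation must be touched rather than assuming a single
         small alloca is harmless.  */
      char *p = __builtin_alloca (256);
      consume (p);
    }
}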

That's also why I suggested a small backwards compatible tweak to the
aarch64 ABI.  Specifically there should be a mandated touch of *sp in
any function where the outgoing args size is greater than some well
defined value or if the function allocates any dynamic stack space.

That mandated touch of *sp at the ABI level would be enough to allow for
a much less conservative assumed state at function start and would allow
the vast majority of functions to need no probing at all.
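
At the source level the effect of that ABI tweak would look roughly like the
sketch below (hypothetical; in practice the compiler would emit the store
itself as part of expanding the alloca or the outgoing-argument setup):

void g (void *);

void
f (unsigned long n)
{
  char *p = __builtin_alloca (n);
  /* The mandated touch: a store into the lowest word of the newly
     allocated area, which sits at or near the current stack pointer,
     so the callee sees a recent implicit probe at *sp.  */
  p[0] = 0;
  g (p);
}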



> I think the largest
> buffer Qualys found was less than 400k? So a 256k guard page should
> protect 95% of functions, and 1m or 2m seems like enough to protect
> against all non-malicious programs.  I'm not sure, but this is a 64-bit
> arch, so it seems like we should have the address space for large guard
> pages like that.
I'm all for larger guards, particularly on 64 bit architectures.

We use 64k pages on aarch64 for RHEL which implicitly gives us a minimum
guard of 64k.  That would be a huge factor in any analysis I would do if
the aarch64 maintainers choose not to fully protect their architecture
and I was forced to make a recommendation for Red Hat and its customers.
 I hope I don't have to sit down and do the analysis on this and make
any kind of recommendation.

The fact that Qualys found nothing larger than X (for any X) in the code
they scanned isn't relevant.  There could well be code out there they
did not look at that uses > X or code that is yet to be written that
uses > X.

> 
> There's also the trade off that increasing the amount of code that
> probes reduces the set of code that can either move the stack into or
> past the guard page.
Correct.

> 
>> The aarch64 maintainers can certainly nix what I've done or modify it in
>> ways that suit them.  That is their choice as port maintainers.
>> However, Red Hat will have to evaluate the cost of reducing security for
>> our customer base against the performance improvement of such changes.
>> As I've said before, I do not know where that decision would fall.
>>
>>
>>>
>>> Therefore even when using a tiny 4K probe size we can safely adjust SP by 3KB
>>> before needing an explicit probe - now only 0.6% of functions need a probe.
>>> If we choose a proper minimum probe distance, say 64KB, explicit probes are
>>> basically non-existent (just 35 functions, or ~0.02% of all functions are > 64KB).
>>> Clearly inserting probes can be the default as the impact on code quality is negligible.
>> Again, there's nothing that says 3k is safe.   You're picking an
>> arbitrary point that is safe in a codebase you've looked at.  But I'm
>> looking at this from a "what guarantees do I have from an ABI or ISA
>> standpoint".  The former may be more performant, but it's inherently
>> less secure than the latter.
> 
> On the other hand making -fstack-check=clash the default seems to me
> like a very significant security improvement.
Agreed and it's where I'd like this to go.

Jeff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
  2017-07-12  2:02 ` Jeff Law
@ 2017-07-12 12:47   ` Trevor Saunders
  2017-07-12 21:08     ` Jeff Law
  0 siblings, 1 reply; 16+ messages in thread
From: Trevor Saunders @ 2017-07-12 12:47 UTC (permalink / raw)
  To: Jeff Law; +Cc: Wilco Dijkstra, GCC Patches, nd

On Tue, Jul 11, 2017 at 08:02:26PM -0600, Jeff Law wrote:
> On 07/11/2017 06:20 PM, Wilco Dijkstra wrote:
> > Jeff Law wrote:
> >> aarch64 is the first target that does not have any implicit probes in
> >> the caller.  Thus at prologue entry it must make conservative
> >> assumptions about the offset of the most recent probed address relative
> >> to the stack pointer.
> > 
> > No - like I mentioned before that's not correct nor acceptable as it would imply
> > that ~70% of functions need a probe at entry. I did a quick run across SPEC and
> > found the outgoing argument size is > 1024 in just 9 functions out of 147000!
> > Those 9 were odd special cases due to auto generated code to interface between
> > C and Fortran. This is extremely unlikely to occur anywhere else. So even assuming
> > an unchecked caller, large outgoing arguments are simply not a realistic threat.
> Whether or not such frames exist in SPEC isn't the question.  There's
> nothing in the ABI or ISA that allows us to avoid those probes without
> compromising security.
> 
> Mixed code scenarios are going to be a fact of life, probably forever
> unless we can convince every ISV providing software that works on top of
> RHEL/*BSD/whatever  to turn on probing (based on my experiences, that
> has exactly a zero chance of occurring).

On the other hand if probing is fast enough that it can be on by default
in gcc there should be much less of it.  Even if you did change the ABI
to require probing it seems unlikely that code violating that
requirement would hit problems other than this security concern, so I'd
expect there will be some non-compliant asm out there.

> In a mixed code scenario a caller may have a large alloca or large
> outgoing argument area that pushes the stack pointer into the guard page
> without actually accessing the guard page.  That requires a callee which
> is compiled with stack checking to make worst case assumptions at
> function entry to protect itself as much as possible from these attacks.

It seems to me pretty important to ask how many programs out there have
a caller that can push the stack into the guard page, but not past it.
I'd expect that's not the case for most allocas; either they are safe,
or can increase the stack arbitrarily.  I'd expect it's more likely with
outgoing arg or large buffer allocations though.  I think the largest
buffer Qualys found was less than 400k? So a 256k guard page should
protect 95% of functions, and 1m or 2m seems like enough to protect
against all non-malicious programs.  I'm not sure, but this is a 64-bit
arch, so it seems like we should have the address space for large guard
pages like that.

There's also the trade off that increasing the amount of code that
probes reduces the set of code that can either move the stack into or
past the guard page.

> The aarch64 maintainers can certainly nix what I've done or modify it in
> ways that suit them.  That is their choice as port maintainers.
> However, Red Hat will have to evaluate the cost of reducing security for
> our customer base against the performance improvement of such changes.
> As I've said before, I do not know where that decision would fall.
> 
> 
> > 
> > Therefore even when using a tiny 4K probe size we can safely adjust SP by 3KB
> > before needing an explicit probe - now only 0.6% of functions need a probe.
> > If we choose a proper minimum probe distance, say 64KB, explicit probes are
> > basically non-existent (just 35 functions, or ~0.02% of all functions are > 64KB).
> > Clearly inserting probes can be the default as the impact on code quality is negligible.
> Again, there's nothing that says 3k is safe.   You're picking an
> arbitrary point that is safe in a codebase you've looked at.  But I'm
> looking at this from a "what guarantees do I have from an ABI or ISA
> standpoint".  The former may be more performant, but it's inherently
> less secure than the latter.

On the other hand making -fstack-check=clash the default seems to me
like a very significant security improvement.

Trev

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
  2017-07-12  0:20 Wilco Dijkstra
@ 2017-07-12  2:02 ` Jeff Law
  2017-07-12 12:47   ` Trevor Saunders
  2017-07-14  6:19 ` Jeff Law
  2017-07-14  8:17 ` Jakub Jelinek
  2 siblings, 1 reply; 16+ messages in thread
From: Jeff Law @ 2017-07-12  2:02 UTC (permalink / raw)
  To: Wilco Dijkstra, GCC Patches; +Cc: nd

On 07/11/2017 06:20 PM, Wilco Dijkstra wrote:
> Jeff Law wrote:
>> aarch64 is the first target that does not have any implicit probes in
>> the caller.  Thus at prologue entry it must make conservative
>> assumptions about the offset of the most recent probed address relative
>> to the stack pointer.
> 
> No - like I mentioned before that's not correct nor acceptable as it would imply
> that ~70% of functions need a probe at entry. I did a quick run across SPEC and
> found the outgoing argument size is > 1024 in just 9 functions out of 147000!
> Those 9 were odd special cases due to auto generated code to interface between
> C and Fortran. This is extremely unlikely to occur anywhere else. So even assuming
> an unchecked caller, large outgoing arguments are simply not a realistic threat.
Whether or not such frames exist in SPEC isn't the question.  There's
nothing in the ABI or ISA that allows us to avoid those probes without
compromising security.

Mixed code scenarios are going to be a fact of life, probably forever
unless we can convince every ISV providing software that works on top of
RHEL/*BSD/whatever  to turn on probing (based on my experiences, that
has exactly a zero chance of occurring).

In a mixed code scenario a caller may have a large alloca or large
outgoing argument area that pushes the stack pointer into the guard page
without actually accessing the guard page.  That requires a callee which
is compiled with stack checking to make worst case assumptions at
function entry to protect itself as much as possible from these attacks.

The aarch64 maintainers can certainly nix what I've done or modify it in
ways that suit them.  That is their choice as port maintainers.
However, Red Hat will have to evaluate the cost of reducing security for
our customer base against the performance improvement of such changes.
As I've said before, I do not know where that decision would fall.


> 
> Therefore even when using a tiny 4K probe size we can safely adjust SP by 3KB
> before needing an explicit probe - now only 0.6% of functions need a probe.
> If we choose a proper minimum probe distance, say 64KB, explicit probes are
> basically non-existent (just 35 functions, or ~0.02% of all functions are > 64KB).
> Clearly inserting probes can be the default as the impact on code quality is negligible.
Again, there's nothing that says 3k is safe.   You're picking an
arbitrary point that is safe in a codebase you've looked at.  But I'm
looking at this from a "what guarantees do I have from an ABI or ISA
standpoint".  The former may be more performant, but it's inherently
less secure than the latter.

> 
> With respect to implementation it is relatively easy to decide in aarch64_layout_frame
> which frames need probes and where. I don't think keeping a running offset of the last
> probe/store is useful, it'll just lead to inefficiencies and bugs. The patch doesn't deal
> with the delayed stores due to shrinkwrapping for example. Inserting probes before
> the prolog would be easier, eg.
> 
> sub tmp, sp, 65536
> str xzr, [tmp, 1024]  // allow up to 1KB of outgoing arguments in callee
> sub tmp, sp, 131072
> str xzr, [tmp, 1024]
> ... normal prolog for frame size 128-192KB
If you think you can do better without compromising security, be my
guest.  Realistically, I'm pretty close to the limit of how much more
time I can spend on the aarch64 target dependent issues.

So I think the questions I need answered from the aarch64 maintainers are:

Are they going to nak this code outright?
Are they going to keep the basic structure but suggest tweaks?

I need to know the general direction so that I know whether or not to
continue to carry the aarch64 changes in the patchkit.  If y'all are
going to nak, then I'll drop them from upstream except for the changes
to indirect STACK_CHECK_PROTECT.


jeff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
@ 2017-07-12  0:20 Wilco Dijkstra
  2017-07-12  2:02 ` Jeff Law
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Wilco Dijkstra @ 2017-07-12  0:20 UTC (permalink / raw)
  To: Jeff Law, GCC Patches; +Cc: nd

Jeff Law wrote:
> aarch64 is the first target that does not have any implicit probes in
> the caller.  Thus at prologue entry it must make conservative
> assumptions about the offset of the most recent probed address relative
> to the stack pointer.

No - like I mentioned before that's not correct nor acceptable as it would imply
that ~70% of functions need a probe at entry. I did a quick run across SPEC and
found the outgoing argument size is > 1024 in just 9 functions out of 147000!
Those 9 were odd special cases due to auto generated code to interface between
C and Fortran. This is extremely unlikely to occur anywhere else. So even assuming
an unchecked caller, large outgoing arguments are simply not a realistic threat.

Therefore even when using a tiny 4K probe size we can safely adjust SP by 3KB
before needing an explicit probe - now only 0.6% of functions need a probe.
If we choose a proper minimum probe distance, say 64KB, explicit probes are
basically non-existent (just 35 functions, or ~0.02% of all functions are > 64KB).
Clearly inserting probes can be the default as the impact on code quality is negligible.

With respect to implementation it is relatively easy to decide in aarch64_layout_frame
which frames need probes and where. I don't think keeping a running offset of the last
probe/store is useful, it'll just lead to inefficiencies and bugs. The patch doesn't deal
with the delayed stores due to shrinkwrapping for example. Inserting probes before
the prolog would be easier, eg.

sub tmp, sp, 65536    // tmp = sp - 64KB
str xzr, [tmp, 1024]  // allow up to 1KB of outgoing arguments in callee
sub tmp, sp, 131072   // tmp = sp - 128KB
str xzr, [tmp, 1024]  // second probe, again 1KB above the 128KB point
... normal prolog for frame size 128-192KB

Wilco

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-07-17 23:16 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-11 21:21 [PATCH][RFA/RFC] Stack clash mitigation patch 07/08 Jeff Law
2017-07-12  0:20 Wilco Dijkstra
2017-07-12  2:02 ` Jeff Law
2017-07-12 12:47   ` Trevor Saunders
2017-07-12 21:08     ` Jeff Law
2017-07-16 18:36       ` Trevor Saunders
2017-07-17 15:51         ` Jeff Law
2017-07-17 20:30           ` Trevor Saunders
2017-07-14  6:19 ` Jeff Law
2017-07-17 11:27   ` Wilco Dijkstra
2017-07-17 15:24     ` Jeff Law
2017-07-17 15:31       ` Jakub Jelinek
2017-07-17 22:42       ` Wilco Dijkstra
2017-07-17 23:16         ` Jeff Law
2017-07-14  8:17 ` Jakub Jelinek
2017-07-14 11:49   ` Wilco Dijkstra
