public inbox for gcc-patches@gcc.gnu.org
* [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
@ 2017-07-12  0:20 Wilco Dijkstra
  2017-07-12  2:02 ` Jeff Law
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Wilco Dijkstra @ 2017-07-12  0:20 UTC (permalink / raw)
  To: Jeff Law, GCC Patches; +Cc: nd

Jeff Law wrote:
> aarch64 is the first target that does not have any implicit probes in
> the caller.  Thus at prologue entry it must make conservative
> assumptions about the offset of the most recent probed address relative
> to the stack pointer.

No - as I mentioned before, that's neither correct nor acceptable, as it would imply
that ~70% of functions need a probe at entry. I did a quick run across SPEC and
found the outgoing argument size is > 1024 in just 9 functions out of 147000!
Those 9 were odd special cases due to auto-generated code interfacing between
C and Fortran. This is extremely unlikely to occur anywhere else. So even assuming
an unchecked caller, large outgoing arguments are simply not a realistic threat.

Therefore even when using a tiny 4K probe size we can safely adjust SP by 3KB
before needing an explicit probe - now only 0.6% of functions need a probe.
If we choose a proper minimum probe distance, say 64KB, explicit probes are
basically non-existent (just 35 functions, or ~0.02% of all functions are > 64KB).
Clearly inserting probes can be the default as the impact on code quality is negligible.
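The threshold arithmetic above can be modeled directly (a hedged sketch; the
constants mirror the 4K probe interval and 1KB outgoing-argument allowance
discussed here, and the helper name is invented for illustration):

```c
#include <assert.h>

/* Model of the argument above: with a 4 KB probe interval and a 1 KB
   allowance for the callee's outgoing arguments, SP may drop by up to
   3 KB before an explicit probe at function entry becomes necessary.  */
#define PROBE_INTERVAL 4096
#define OUTGOING_ARGS_ALLOWANCE 1024

static int
needs_entry_probe (long frame_size)
{
  return frame_size > PROBE_INTERVAL - OUTGOING_ARGS_ALLOWANCE;
}
```

With these constants only frames larger than 3 KB need an entry probe, which
is the basis for the 0.6% figure quoted above.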

With respect to implementation, it is relatively easy to decide in aarch64_layout_frame
which frames need probes and where. I don't think keeping a running offset of the last
probe/store is useful; it'll just lead to inefficiencies and bugs. For example, the patch
doesn't deal with stores delayed due to shrink-wrapping. Inserting probes before
the prologue would be easier, e.g.:

sub tmp, sp, 65536
str xzr, [tmp, 1024]  // allow up to 1KB of outgoing arguments in callee
sub tmp, sp, 131072
str xzr, [tmp, 1024]
... normal prolog for frame size 128-192KB
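One plausible generalization of this sequence (a hypothetical helper, not part
of any patch; the 64 KB guard size and the loop bound are assumptions) is a
loop emitting one probe per guard-sized page the new frame spans:

```c
#include <stdio.h>

#define GUARD_SIZE 65536              /* assumed minimum probe distance */
#define OUTGOING_ARGS_ALLOWANCE 1024

/* Print the pre-prologue probe sequence sketched above for a frame of
   FRAME_SIZE bytes and return the number of probes emitted.  */
static int
emit_entry_probes (long frame_size)
{
  int nprobes = 0;
  for (long off = GUARD_SIZE; off <= frame_size; off += GUARD_SIZE)
    {
      printf ("sub\ttmp, sp, %ld\n", off);
      printf ("str\txzr, [tmp, %d]\n", OUTGOING_ARGS_ALLOWANCE);
      nprobes++;
    }
  return nprobes;
}
```

For the 128 KB case this reproduces the two probes shown above.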

Wilco

* [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
@ 2017-07-11 21:21 Jeff Law
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff Law @ 2017-07-11 21:21 UTC (permalink / raw)
  To: gcc-patches



This patch introduces aarch64 -fstack-check=clash prologue support.

aarch64 is the first target that does not have any implicit probes in
the caller.  Thus at prologue entry it must make conservative
assumptions about the offset of the most recent probed address relative
to the stack pointer.

However, it can and does use stores into the stack generated by the
prologue to avoid some explicit probes.  It's not as efficient as x86 or
ppc, but it does try hard to be efficient.


We track the offset of the most recent probe relative to the stack
pointer.  The initial state is PROBE_INTERVAL - (STACK_BOUNDARY /
BITS_PER_UNIT) bytes away.  Allocations increase that value.  Stores
decrease the value.

When the value crosses PROBE_INTERVAL we emit an explicit probe.
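A minimal sketch of this bookkeeping (simplified relative to the patch: it
probes after rather than before the adjustment, and the names are invented
for illustration):

```c
#define PROBE_INTERVAL 4096
#define STACK_BOUNDARY_BYTES 16   /* STACK_BOUNDARY / BITS_PER_UNIT */

/* Distance from SP to the most recently probed address; starts at the
   conservative worst case described above.  */
static long last_probe_offset = PROBE_INTERVAL - STACK_BOUNDARY_BYTES;
static int probes_emitted;

/* An allocation moves SP further from the last probe; once the distance
   reaches PROBE_INTERVAL an explicit probe is required.  */
static void
model_allocate (long bytes)
{
  last_probe_offset += bytes;
  while (last_probe_offset >= PROBE_INTERVAL)
    {
      probes_emitted++;
      last_probe_offset -= PROBE_INTERVAL;
    }
}

/* A store OFFSET bytes above SP is an implicit probe; it can only ever
   move the last known probe closer to SP.  */
static void
model_store (long offset)
{
  if (offset < last_probe_offset)
    last_probe_offset = offset;
}
```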

aarch64 prologues can allocate space in 3 different places.  There's
INITIAL_ADJUST and FINAL_ADJUST which are simple adjustments and they're
trivial to handle.  We just add the adjustment to the current offset and
if we cross PROBE_INTERVAL, we emit a probe before the stack adjustment.

Handling of CALLEE_ADJUST is more subtle because it is two actions in one
instruction: the instruction both allocates space and stores into two
words at the lowest address in the allocated space.

It's easiest to mentally model the instruction as two distinct operations.

First the instruction adjusts the stack by CALLEE_ADJUST - 2 * WORD_SIZE
and thus increases the offset to the most recent probe.  If the offset
to the most recent probe crosses PROBE_INTERVAL, then we must emit a
probe prior to the CALLEE_ADJUST instruction.  Note that CALLEE_ADJUST
might be 0 or any multiple of 16.  For 0/16 this step is a nop.

Second the CALLEE_ADJUST instruction adjusts the stack by 2 * WORD_SIZE
and stores into both just allocated words.  We never need a probe for
this action.  Furthermore, this action always zeros the offset to the
most recent probe which is very helpful in avoiding subsequent explicit
probes.
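The two-step model can be written out explicitly (an illustrative sketch,
not code from the patch; the names and the PROBED flag are invented, and
CALLEE_ADJUST == 0 emits no instruction at all):

```c
#define PROBE_INTERVAL 4096
#define WORD_SIZE 8

/* Model the CALLEE_ADJUST store-pair-with-writeback as the two steps
   described above.  Returns the new offset to the most recent probe;
   sets *PROBED if an explicit probe must precede the instruction.  */
static long
model_callee_adjust (long last_probe_offset, long callee_adjust, int *probed)
{
  *probed = 0;
  if (callee_adjust == 0)
    return last_probe_offset;        /* no instruction is emitted */

  /* Step 1: allocate CALLEE_ADJUST - 2 * WORD_SIZE bytes (a nop when
     CALLEE_ADJUST is 16), increasing the distance to the last probe.  */
  last_probe_offset += callee_adjust - 2 * WORD_SIZE;
  if (last_probe_offset >= PROBE_INTERVAL)
    *probed = 1;

  /* Step 2: allocate 2 * WORD_SIZE bytes and store into both words at
     the new SP -- an implicit probe that zeroes the offset.  */
  return 0;
}
```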


Between CALLEE_ADJUST and FINAL_ADJUST are other register saves.  We
track the offsets for those and decrease the offset to the most recent
probe appropriately.

I'm pretty sure this implementation is about as good as it can get with
the ISA/ABI in place on aarch64.

I'm pretty sure that the loop to allocate large stacks does not update
CFIs correctly.

Again, you'll note that like ppc, this does not add any new tests.  It
uses the tests that were added with the x86 support unmodified.

Thoughts/Comments?  OK for the trunk?



	* config/aarch64/aarch64.c (output_probe_stack_range): Handle
	-fstack-check=clash probing too.
	(aarch64_save_callee_saves): Return smallest offset from SP
	that was written.
	(aarch64_allocate_and_probe_stack_space): New function.
	(aarch64_expand_prologue): Track distance between SP and most
	recent probe.  Use aarch64_allocate_and_probe_stack_space
	when -fstack-check=clash rather than just adjusting sp.
	Dump actions via dump_stack_clash_frame_info.
	Handle probing and probe offset update for CALLEE_ADJUST space.
	Use return value from aarch64_save_callee_saves to reduce
	last_probe_offset.
commit 86871749daad396cf825b3982dee8e80eda96d8a
Author: root <root@gigabyte-r120-05.khw.lab.eng.bos.redhat.com>
Date:   Fri Jul 7 11:17:06 2017 -0400

    aarch64 support with dumping, adjust tests

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 0a8b40a..40bc183 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -2830,6 +2830,9 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
   char loop_lab[32];
   rtx xops[2];
 
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK)
+    reg1 = stack_pointer_rtx;
+
   ASM_GENERATE_INTERNAL_LABEL (loop_lab, "LPSRL", labelno++);
 
   /* Loop.  */
@@ -2841,7 +2844,14 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
   output_asm_insn ("sub\t%0, %0, %1", xops);
 
   /* Probe at TEST_ADDR.  */
-  output_asm_insn ("str\txzr, [%0]", xops);
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK)
+    {
+      gcc_assert (xops[0] == stack_pointer_rtx);
+      xops[1] = GEN_INT (PROBE_INTERVAL - 8);
+      output_asm_insn ("str\txzr, [%0, %1]", xops);
+    }
+  else
+    output_asm_insn ("str\txzr, [%0]", xops);
 
   /* Test if TEST_ADDR == LAST_ADDR.  */
   xops[1] = reg2;
@@ -3245,9 +3255,11 @@ aarch64_return_address_signing_enabled (void)
 
 /* Emit code to save the callee-saved registers from register number START
    to LIMIT to the stack at the location starting at offset START_OFFSET,
-   skipping any write-back candidates if SKIP_WB is true.  */
+   skipping any write-back candidates if SKIP_WB is true.
 
-static void
+   Return the smallest offset from SP that is written.  */
+
+static HOST_WIDE_INT
 aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 			   unsigned start, unsigned limit, bool skip_wb)
 {
@@ -3257,6 +3269,8 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
   unsigned regno;
   unsigned regno2;
 
+  HOST_WIDE_INT lowest_offset = PROBE_INTERVAL;
+
   for (regno = aarch64_next_callee_save (start, limit);
        regno <= limit;
        regno = aarch64_next_callee_save (regno + 1, limit))
@@ -3274,6 +3288,8 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 
       reg = gen_rtx_REG (mode, regno);
       offset = start_offset + cfun->machine->frame.reg_offset[regno];
+      if (offset < lowest_offset)
+	lowest_offset = offset;
       mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
 					      offset));
 
@@ -3289,6 +3305,8 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 	  rtx mem2;
 
 	  offset = start_offset + cfun->machine->frame.reg_offset[regno2];
+	  if (offset < lowest_offset)
+	    lowest_offset = offset;
 	  mem2 = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
 						   offset));
 	  insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2,
@@ -3306,6 +3324,7 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 
       RTX_FRAME_RELATED_P (insn) = 1;
     }
+  return lowest_offset;
 }
 
 /* Emit code to restore the callee registers of mode MODE from register
@@ -3605,6 +3624,71 @@ aarch64_set_handled_components (sbitmap components)
       cfun->machine->reg_is_wrapped_separately[regno] = true;
 }
 
+/* Allocate SIZE bytes of stack space using SCRATCH_REG as a scratch
+   register.
+
+   LAST_PROBE_OFFSET contains the offset between the stack pointer and
+   the last known probe.  As LAST_PROBE_OFFSET crosses PROBE_INTERVAL
+   emit a probe and adjust LAST_PROBE_OFFSET.  */
+static void
+aarch64_allocate_and_probe_stack_space (int scratchreg, HOST_WIDE_INT size,
+					HOST_WIDE_INT *last_probe_offset)
+{
+  rtx temp = gen_rtx_REG (word_mode, scratchreg);
+
+  HOST_WIDE_INT rounded_size = size & -PROBE_INTERVAL;
+
+  /* We can handle a small number of allocations/probes inline.  Otherwise
+     punt to a loop.  */
+  if (rounded_size && rounded_size <= 4 * PROBE_INTERVAL)
+    {
+      for (HOST_WIDE_INT i = 0; i < rounded_size; i += PROBE_INTERVAL)
+	{
+	  /* We should never need a scratch register for this adjustment.  */
+	  aarch64_sub_sp (-1, PROBE_INTERVAL, true);
+
+	  /* We just allocated PROBE_INTERVAL bytes.  Thus, a probe is
+	     mandatory.  Note that LAST_PROBE_OFFSET does not change here.  */
+	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					   (PROBE_INTERVAL
+					    - GET_MODE_SIZE (word_mode))));
+	}
+      dump_stack_clash_frame_info (PROBE_INLINE, size != rounded_size);
+    }
+  else if (rounded_size)
+    {
+      /* Compute the ending address.  */
+      emit_move_insn (temp, GEN_INT (-rounded_size));
+      emit_insn (gen_add3_insn (temp, stack_pointer_rtx, temp));
+
+      /* This allocates and probes the stack.  Like the inline version above
+	 it does not need to change LAST_PROBE_OFFSET.
+
+	 It almost certainly does not update CFIs correctly.  */
+      emit_insn (gen_probe_stack_range (temp, temp, temp));
+      dump_stack_clash_frame_info (PROBE_LOOP, size != rounded_size);
+    }
+  else
+    dump_stack_clash_frame_info (PROBE_INLINE, size != rounded_size);
+    
+
+  /* Handle any residuals.  */
+  HOST_WIDE_INT residual = size - rounded_size;
+  if (residual)
+    {
+      aarch64_sub_sp (-1, residual, true);
+      *last_probe_offset += residual;
+      if (*last_probe_offset >= PROBE_INTERVAL)
+	{
+	  *last_probe_offset -= PROBE_INTERVAL;
+	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					   (residual
+					    - GET_MODE_SIZE (word_mode))));
+	}
+    }
+  return;
+}
+
 /* AArch64 stack frames generated by this compiler look like:
 
 	+-------------------------------+
@@ -3661,6 +3745,11 @@ aarch64_expand_prologue (void)
   unsigned reg2 = cfun->machine->frame.wb_candidate2;
   rtx_insn *insn;
 
+  /* We know nothing about the last write into the stack, so we have to make
+     worst case assumptions here.  */
+  HOST_WIDE_INT last_probe_offset
+    = PROBE_INTERVAL - (STACK_BOUNDARY / BITS_PER_UNIT);
+
   /* Sign return address for functions.  */
   if (aarch64_return_address_signing_enabled ())
     {
@@ -3686,16 +3775,64 @@ aarch64_expand_prologue (void)
 	aarch64_emit_probe_stack_range (get_stack_check_protect (), frame_size);
     }
 
-  aarch64_sub_sp (IP0_REGNUM, initial_adjust, true);
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK
+      && initial_adjust != 0)
+    aarch64_allocate_and_probe_stack_space (IP0_REGNUM, initial_adjust,
+					    &last_probe_offset);
+  else
+    aarch64_sub_sp (IP0_REGNUM, initial_adjust, true);
+
+  /* There are no explicit probes in this case.  */
+  if (initial_adjust == 0
+      && callee_adjust < 32
+      && final_adjust == 0)
+    dump_stack_clash_frame_info (NO_PROBE_NO_FRAME, callee_adjust != 0);
 
   if (callee_adjust != 0)
-    aarch64_push_regs (reg1, reg2, callee_adjust);
+    {
+      aarch64_push_regs (reg1, reg2, callee_adjust);
+
+      /* For a small CALLEE_ADJUST we don't need a probe here as the store of
+	 REG1/REG2 will hit the page we want.  */
+      if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK
+	  && callee_adjust >= 32)
+	{
+	  /* callee_adjust has to fit in a 7 bit signed scaled offset,
+	     so it should never be near PROBE_INTERVAL.  */
+	  gcc_assert (callee_adjust < PROBE_INTERVAL);
+	  last_probe_offset += callee_adjust;
 
+	  /* Probe if necessary.  */
+	  if (last_probe_offset >= PROBE_INTERVAL)
+	    {
+	      emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					       (callee_adjust
+					        - GET_MODE_SIZE (word_mode))));
+	      dump_stack_clash_frame_info (PROBE_INLINE, true);
+	    }
+
+	  /* Regardless of whether or not we explicitly probed, the register
+	     push constitutes an implicit probe at *sp.  */
+	  last_probe_offset = 0;
+	}
+
+    }
+
+  /* The calls to aarch64_save_callee_saves store into the stack and
+     can act as implicit probes.  So it tracks the offset closest to
+     SP that is written and returns it.  If that return value is less
+     than LAST_PROBE_OFFSET then we can adjust LAST_PROBE_OFFSET, possibly
+     saving an explicit probe for FINAL_ADJUST.  */
+  HOST_WIDE_INT tmp;
   if (frame_pointer_needed)
     {
       if (callee_adjust == 0)
-	aarch64_save_callee_saves (DImode, callee_offset, R29_REGNUM,
-				   R30_REGNUM, false);
+	{
+	  tmp = aarch64_save_callee_saves (DImode, callee_offset, R29_REGNUM,
+					   R30_REGNUM, false);
+	  if (tmp < last_probe_offset)
+	    last_probe_offset = tmp;
+	}
       insn = emit_insn (gen_add3_insn (hard_frame_pointer_rtx,
 				       stack_pointer_rtx,
 				       GEN_INT (callee_offset)));
@@ -3703,11 +3840,22 @@ aarch64_expand_prologue (void)
       emit_insn (gen_stack_tie (stack_pointer_rtx, hard_frame_pointer_rtx));
     }
 
-  aarch64_save_callee_saves (DImode, callee_offset, R0_REGNUM, R30_REGNUM,
-			     callee_adjust != 0 || frame_pointer_needed);
-  aarch64_save_callee_saves (DFmode, callee_offset, V0_REGNUM, V31_REGNUM,
-			     callee_adjust != 0 || frame_pointer_needed);
-  aarch64_sub_sp (IP1_REGNUM, final_adjust, !frame_pointer_needed);
+  tmp = aarch64_save_callee_saves (DImode, callee_offset, R0_REGNUM, R30_REGNUM,
+				   callee_adjust != 0 || frame_pointer_needed);
+  if (tmp < last_probe_offset)
+    last_probe_offset = tmp;
+
+  tmp = aarch64_save_callee_saves (DFmode, callee_offset, V0_REGNUM, V31_REGNUM,
+				   callee_adjust != 0 || frame_pointer_needed);
+  if (tmp < last_probe_offset)
+    last_probe_offset = tmp;
+
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK
+      && final_adjust != 0)
+    aarch64_allocate_and_probe_stack_space (IP1_REGNUM, final_adjust,
+					    &last_probe_offset);
+  else
+    aarch64_sub_sp (IP1_REGNUM, final_adjust, !frame_pointer_needed);
 }
 
 /* Return TRUE if we can use a simple_return insn.



Thread overview: 16+ messages
2017-07-12  0:20 [PATCH][RFA/RFC] Stack clash mitigation patch 07/08 Wilco Dijkstra
2017-07-12  2:02 ` Jeff Law
2017-07-12 12:47   ` Trevor Saunders
2017-07-12 21:08     ` Jeff Law
2017-07-16 18:36       ` Trevor Saunders
2017-07-17 15:51         ` Jeff Law
2017-07-17 20:30           ` Trevor Saunders
2017-07-14  6:19 ` Jeff Law
2017-07-17 11:27   ` Wilco Dijkstra
2017-07-17 15:24     ` Jeff Law
2017-07-17 15:31       ` Jakub Jelinek
2017-07-17 22:42       ` Wilco Dijkstra
2017-07-17 23:16         ` Jeff Law
2017-07-14  8:17 ` Jakub Jelinek
2017-07-14 11:49   ` Wilco Dijkstra
