[PATCH][RFA/RFC] Stack clash mitigation patch 07/08

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Jeff Law <law@redhat.com>
To: gcc-patches <gcc-patches@gcc.gnu.org>
Subject: [PATCH][RFA/RFC] Stack clash mitigation patch 07/08
Date: Tue, 11 Jul 2017 21:21:00 -0000	[thread overview]
Message-ID: <a49e4c85-cb71-2b02-9292-64a92c1a2c5a@redhat.com> (raw)

[-- Attachment #1: Type: text/plain, Size: 2504 bytes --]


This patch introduces aarch64 -fstack-check=clash prologue support.

aarch64 is the first target that does not have any implicit probes in
the caller.  Thus at prologue entry it must make conservative
assumptions about the offset of the most recent probed address relative
to the stack pointer.

However, it can and does use stores into the stack generated by the
prologue to avoid some explicit probes.  It's not as efficient as x86 or
ppc, but it does try hard to be efficient.


We track the offset of the most recent probe relative to the stack
pointer.  The initial state is PROBE_INTERVAL - (STACK_BOUNDARY /
BITS_PER_UNIT) bytes away.  Allocations increase that value.  Stores
decrease the value.

When the value crosses PROBE_INTERVAL we emit an explicit probe.

aarch64 prologues can allocate space in 3 different places.  There's
INITIAL_ADJUST and FINAL_ADJUST which are simple adjustments and they're
trivial to handle.  We just add the adjustment to the current offset and
if we cross PROBE_INTERVAL, we emit a probe before the stack adjustment.

Handling of CALLEE_ADJUST is more subtle because its two actions in one
instruction.  The instruction both allocates space and stores into two
words at the lowest address in the allocated space.

Its easiest to mentally model the instruction as two distinct operations.

First the instruction adjusts the stack by CALLEE_ADJUST - 2 * WORD_SIZE
and thus increases the offset to the most recent probe.  If the offset
to the most recent probe crosses PROBE_INTERVAL, then we must emit a
probe prior to the CALLEE_ADJUST instruction.  Note that CALLEE_ADJUST
might be 0 or any multiple of 16.  For 0/16 this step is a nop.

Second the CALLEE_ADJUST instruction adjusts the stack by 2 * WORD_SIZE
and stores into both just allocated words.  We never need a probe for
this action.  Furthermore, this action always zeros the offset to the
most recent probe which is very helpful in avoiding subsequent explicit
probes.


Between CALLEE_ADJUST and FINAL_ADJUST are other register saves.  We
track the offsets for those and decrease the offset to the most recent
probe appropriately.

I'm pretty sure this implementation is about as good as it can get with
the ISA/ABI in place on aarch64.

I'm pretty sure that the loop to allocate large stacks does not update
CFIs correctly.

Again, you'll note that like ppc, this does not add any new tests.  It
uses the tests that were added with the x86 support unmodifed.

Thoughts/Comments?  OK for the trunk?

[-- Attachment #2: P7 --]
[-- Type: text/plain, Size: 10422 bytes --]


	* config/aarch/aarch64.c (output_probe_stack_range): Handle
	-fstack-check=clash probing too.
	(aarch64_save_callee_saves): Return smallest offset from SP
	that was written.
	(aarch64_allocate_and_probe_stack_space): New function.
	(aarch64_expand_prologue): Track distance between SP and most
	recent probe.  Use aarch64_allocate_and_probe_stack_space
	when -fstack-check=clash rather than just adjusting sp.
	Dump actions via dump_stack_clash_frame_info.
	Handle probing and probe offset update for CALLEE_ADJUST space.
	Use return value from aarch64_save_callee_saves to reduce
	last_probe_offset.
commit 86871749daad396cf825b3982dee8e80eda96d8a
Author: root <root@gigabyte-r120-05.khw.lab.eng.bos.redhat.com>
Date:   Fri Jul 7 11:17:06 2017 -0400

    aarch64 support with dumping, adjust tests

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 0a8b40a..40bc183 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -2830,6 +2830,9 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
   char loop_lab[32];
   rtx xops[2];
 
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK)
+    reg1 = stack_pointer_rtx;
+
   ASM_GENERATE_INTERNAL_LABEL (loop_lab, "LPSRL", labelno++);
 
   /* Loop.  */
@@ -2841,7 +2844,14 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
   output_asm_insn ("sub\t%0, %0, %1", xops);
 
   /* Probe at TEST_ADDR.  */
-  output_asm_insn ("str\txzr, [%0]", xops);
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK)
+    {
+      gcc_assert (xops[0] == stack_pointer_rtx);
+      xops[1] = GEN_INT (PROBE_INTERVAL - 8);
+      output_asm_insn ("str\txzr, [%0, %1]", xops);
+    }
+  else
+    output_asm_insn ("str\txzr, [%0]", xops);
 
   /* Test if TEST_ADDR == LAST_ADDR.  */
   xops[1] = reg2;
@@ -3245,9 +3255,11 @@ aarch64_return_address_signing_enabled (void)
 
 /* Emit code to save the callee-saved registers from register number START
    to LIMIT to the stack at the location starting at offset START_OFFSET,
-   skipping any write-back candidates if SKIP_WB is true.  */
+   skipping any write-back candidates if SKIP_WB is true.
 
-static void
+   Return the smallest offset from SP that is written.  */
+
+static HOST_WIDE_INT
 aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 			   unsigned start, unsigned limit, bool skip_wb)
 {
@@ -3257,6 +3269,8 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
   unsigned regno;
   unsigned regno2;
 
+  HOST_WIDE_INT lowest_offset = PROBE_INTERVAL;
+
   for (regno = aarch64_next_callee_save (start, limit);
        regno <= limit;
        regno = aarch64_next_callee_save (regno + 1, limit))
@@ -3274,6 +3288,8 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 
       reg = gen_rtx_REG (mode, regno);
       offset = start_offset + cfun->machine->frame.reg_offset[regno];
+      if (offset < lowest_offset)
+	lowest_offset = offset;
       mem = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
 					      offset));
 
@@ -3289,6 +3305,8 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 	  rtx mem2;
 
 	  offset = start_offset + cfun->machine->frame.reg_offset[regno2];
+	  if (offset < lowest_offset)
+	    lowest_offset = offset;
 	  mem2 = gen_mem_ref (mode, plus_constant (Pmode, stack_pointer_rtx,
 						   offset));
 	  insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2,
@@ -3306,6 +3324,7 @@ aarch64_save_callee_saves (machine_mode mode, HOST_WIDE_INT start_offset,
 
       RTX_FRAME_RELATED_P (insn) = 1;
     }
+  return lowest_offset;
 }
 
 /* Emit code to restore the callee registers of mode MODE from register
@@ -3605,6 +3624,71 @@ aarch64_set_handled_components (sbitmap components)
       cfun->machine->reg_is_wrapped_separately[regno] = true;
 }
 
+/* Allocate SIZE bytes of stack space using SCRATCH_REG as a scratch
+   register.
+
+   LAST_PROBE_OFFSET contains the offset between the stack pointer and
+   the last known probe.  As LAST_PROBE_OFFSET crosses PROBE_INTERVAL
+   emit a probe and adjust LAST_PROBE_OFFSET.  */
+static void
+aarch64_allocate_and_probe_stack_space (int scratchreg, HOST_WIDE_INT size,
+					HOST_WIDE_INT *last_probe_offset)
+{
+  rtx temp = gen_rtx_REG (word_mode, scratchreg);
+
+  HOST_WIDE_INT rounded_size = size & -PROBE_INTERVAL;
+
+  /* We can handle a small number of allocations/probes inline.  Otherwise
+     punt to a loop.  */
+  if (rounded_size && rounded_size <= 4 * PROBE_INTERVAL)
+    {
+      for (HOST_WIDE_INT i = 0; i < rounded_size; i += PROBE_INTERVAL)
+	{
+	  /* We should never need a scratch register for this adjustment.  */
+	  aarch64_sub_sp (-1, PROBE_INTERVAL, true);
+
+	  /* We just allocated PROBE_INTERVAL bytes.  Thus, a probe is
+	     mandatory.  Note that LAST_PROBE_OFFSET does not change here.  */
+	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					   (PROBE_INTERVAL
+					    - GET_MODE_SIZE (word_mode))));
+	}
+      dump_stack_clash_frame_info (PROBE_INLINE, size != rounded_size);
+    }
+  else if (rounded_size)
+    {
+      /* Compute the ending address.  */
+      emit_move_insn (temp, GEN_INT (-rounded_size));
+      emit_insn (gen_add3_insn (temp, stack_pointer_rtx, temp));
+
+      /* This allocates and probes the stack.  Like the inline version above
+	 it does not need to change LAST_PROBE_OFFSET.
+
+	 It almost certainly does not update CFIs correctly.  */
+      emit_insn (gen_probe_stack_range (temp, temp, temp));
+      dump_stack_clash_frame_info (PROBE_LOOP, size != rounded_size);
+    }
+  else
+    dump_stack_clash_frame_info (PROBE_INLINE, size != rounded_size);
+    
+
+  /* Handle any residuals.  */
+  HOST_WIDE_INT residual = size - rounded_size;
+  if (residual)
+    {
+      aarch64_sub_sp (-1, residual, true);
+      *last_probe_offset += residual;
+      if (*last_probe_offset >= PROBE_INTERVAL)
+	{
+	  *last_probe_offset -= PROBE_INTERVAL;
+	  emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					   (residual
+					    - GET_MODE_SIZE (word_mode))));
+	}
+    }
+  return;
+}
+
 /* AArch64 stack frames generated by this compiler look like:
 
 	+-------------------------------+
@@ -3661,6 +3745,11 @@ aarch64_expand_prologue (void)
   unsigned reg2 = cfun->machine->frame.wb_candidate2;
   rtx_insn *insn;
 
+  /* We know nothing about the last write into the stack, so we have to make
+     worst case assumptions here.  */
+  HOST_WIDE_INT last_probe_offset
+    = PROBE_INTERVAL - (STACK_BOUNDARY / BITS_PER_UNIT);
+
   /* Sign return address for functions.  */
   if (aarch64_return_address_signing_enabled ())
     {
@@ -3686,16 +3775,64 @@ aarch64_expand_prologue (void)
 	aarch64_emit_probe_stack_range (get_stack_check_protect (), frame_size);
     }
 
-  aarch64_sub_sp (IP0_REGNUM, initial_adjust, true);
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK
+      && initial_adjust != 0)
+    aarch64_allocate_and_probe_stack_space (IP0_REGNUM, initial_adjust,
+					    &last_probe_offset);
+  else
+    aarch64_sub_sp (IP0_REGNUM, initial_adjust, true);
+
+  /* There are no explicit probes in this case.  */
+  if (initial_adjust == 0
+      && callee_adjust < 32
+      && final_adjust == 0)
+    dump_stack_clash_frame_info (NO_PROBE_NO_FRAME, callee_adjust != 0);
 
   if (callee_adjust != 0)
-    aarch64_push_regs (reg1, reg2, callee_adjust);
+    {
+      aarch64_push_regs (reg1, reg2, callee_adjust);
+
+      /* For a small CALLEE_ADJUST we don't need a probe here as the store of
+	 REG1/REG2 will hit the page we want.  */
+      if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK
+	  && callee_adjust >= 32)
+	{
+	  /* callee_adjust has to fit in a 7 bit signed scaled offset,
+	     so it should never be near PROBE_INTERVAL.  */
+	  gcc_assert (callee_adjust < PROBE_INTERVAL);
+	  last_probe_offset += callee_adjust;
 
+	  /* Probe if necessary.  */
+	  if (last_probe_offset >= PROBE_INTERVAL)
+	    {
+	      emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+					       (callee_adjust
+					        - GET_MODE_SIZE (word_mode))));
+	      dump_stack_clash_frame_info (PROBE_INLINE, true);
+	    }
+
+	  /* Regardless of whether or not we explicitly probed, the register
+	     push constitutes an implicit probe at *sp.  */
+	  last_probe_offset = 0;
+	}
+
+    }
+
+  /* The calls to aarch64_save_callee_saves store into the stack and
+     can act as implicit probes.  So it tracks the offset closest to
+     SP that is written and returns it.  If that return value is less
+     than LAST_PROBE_OFFSET then we can adjust LAST_PROBE_OFFSET, possibly
+     saving an explicit probe for FINAL_ADJUST.  */
+  HOST_WIDE_INT tmp;
   if (frame_pointer_needed)
     {
       if (callee_adjust == 0)
-	aarch64_save_callee_saves (DImode, callee_offset, R29_REGNUM,
-				   R30_REGNUM, false);
+	{
+	  tmp = aarch64_save_callee_saves (DImode, callee_offset, R29_REGNUM,
+					   R30_REGNUM, false);
+	  if (tmp < last_probe_offset)
+	    last_probe_offset = tmp;
+	}
       insn = emit_insn (gen_add3_insn (hard_frame_pointer_rtx,
 				       stack_pointer_rtx,
 				       GEN_INT (callee_offset)));
@@ -3703,11 +3840,22 @@ aarch64_expand_prologue (void)
       emit_insn (gen_stack_tie (stack_pointer_rtx, hard_frame_pointer_rtx));
     }
 
-  aarch64_save_callee_saves (DImode, callee_offset, R0_REGNUM, R30_REGNUM,
-			     callee_adjust != 0 || frame_pointer_needed);
-  aarch64_save_callee_saves (DFmode, callee_offset, V0_REGNUM, V31_REGNUM,
-			     callee_adjust != 0 || frame_pointer_needed);
-  aarch64_sub_sp (IP1_REGNUM, final_adjust, !frame_pointer_needed);
+  tmp = aarch64_save_callee_saves (DImode, callee_offset, R0_REGNUM, R30_REGNUM,
+				   callee_adjust != 0 || frame_pointer_needed);
+  if (tmp < last_probe_offset)
+    last_probe_offset = tmp;
+
+  tmp = aarch64_save_callee_saves (DFmode, callee_offset, V0_REGNUM, V31_REGNUM,
+				   callee_adjust != 0 || frame_pointer_needed);
+  if (tmp < last_probe_offset)
+    last_probe_offset = tmp;
+
+  if (flag_stack_check == STACK_CLASH_BUILTIN_STACK_CHECK
+      && final_adjust != 0)
+    aarch64_allocate_and_probe_stack_space (IP1_REGNUM, final_adjust,
+					    &last_probe_offset);
+  else
+    aarch64_sub_sp (IP1_REGNUM, final_adjust, !frame_pointer_needed);
 }
 
 /* Return TRUE if we can use a simple_return insn.

next             reply	other threads:[~2017-07-11 21:21 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-11 21:21 Jeff Law [this message]
2017-07-12  0:20 Wilco Dijkstra
2017-07-12  2:02 ` Jeff Law
2017-07-12 12:47   ` Trevor Saunders
2017-07-12 21:08     ` Jeff Law
2017-07-16 18:36       ` Trevor Saunders
2017-07-17 15:51         ` Jeff Law
2017-07-17 20:30           ` Trevor Saunders
2017-07-14  6:19 ` Jeff Law
2017-07-17 11:27   ` Wilco Dijkstra
2017-07-17 15:24     ` Jeff Law
2017-07-17 15:31       ` Jakub Jelinek
2017-07-17 22:42       ` Wilco Dijkstra
2017-07-17 23:16         ` Jeff Law
2017-07-14  8:17 ` Jakub Jelinek
2017-07-14 11:49   ` Wilco Dijkstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a49e4c85-cb71-2b02-9292-64a92c1a2c5a@redhat.com \
    --to=law@redhat.com \
    --cc=gcc-patches@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).