* [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits)
@ 2017-09-29 15:15 Jeff Law
[not found] ` <DB6PR0801MB205340E4DACC1CF3696A150983480@DB6PR0801MB2053.eurprd08.prod.outlook.com>
2017-10-13 20:47 ` Wilco Dijkstra
0 siblings, 2 replies; 13+ messages in thread
From: Jeff Law @ 2017-09-29 15:15 UTC (permalink / raw)
To: gcc-patches
Cc: Richard Earnshaw, James Greenhalgh, Wilco Dijkstra, Marcus Shawcroft
[-- Attachment #1: Type: text/plain, Size: 2531 bytes --]
Wilco has done most of the design/implementation review work to date and
should have state on most of this code.
--
Here's the current aarch64 patch for stack clash protection. These are the
only bits of stack clash protection that haven't been committed to the
trunk.
Looking through my archives I can't find the V3 patch. So I'm going to
describe what's changed since the V2 patch.
aarch64_output_probe_stack_range changes were slightly simplified.
There was no need to set reg1 to stack_pointer_rtx for stack-clash
protection.
We have PARAMs for the size of the guard and the probe interval. For
aarch64 we override the default guard size and set it to 64k. The guard
size determines whether probing is needed at all; once probing is needed,
probes are emitted every probe-interval bytes. The 1k reserved for outgoing
args doesn't have a PARAM since it's target specific.
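As a minimal sketch of what those defaults imply (assuming the 64k guard and
the 1k reserved for the caller described above; illustrative C, not one of the
testcases included in the patch):

/* Illustrative only: with a 64k guard and 1k assumed consumed by the
   caller, frames below 63k need no probes, larger ones do.  */
void small_frame (void)
{
  char buf[60 * 1024];            /* < 63k: allocated with no probing.  */
  __builtin_memset (buf, 0, sizeof buf);
}

void large_frame (void)
{
  char buf[128 * 1024];           /* >= 63k: the prologue must probe.  */
  __builtin_memset (buf, 0, sizeof buf);
}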
The code tries to handle probe intervals > ARITH_FACTOR, but I haven't
thoroughly tested that.
The code should add the appropriate notes to get the dwarf2 CFI records
we need, including the case when we have a probing loop.
The epilogue would sometimes elide setting IP0_REGNUM or IP1_REGNUM
assuming the prologue had done so and left them in a usable state.
However, those registers won't have the expected values when stack-clash
protection is enabled. The epilogue has been updated to handle that
case and avoid the incorrect optimization. This was tickled during testing.
We emit REG_STACK_CHECK notes and blockage insns to prevent
combine-stack-adjustments and/or the scheduler from mucking up the
probing sequences. A testcase for this issue is included. This was
discovered during binary scanning of glibc built with stack clash
protection.
The trailing probe to ensure that a caller always has a probe within 1k
of the end of the outgoing args should be correct now. Testcases for
a couple instances of this problem (from Wilco) are included.
Similarly we turn on the hook to do the same for the alloca space.
We no longer track the most recent probe. A better understanding of the
aarch64 prologue code shows this wasn't really needed.
This has been bootstrapped and regression tested on aarch64 linux. I've
also bootstrapped and regression tested with stack clash protections
enabled by default to exercise the code further (regressions in that
case are related to mixing stack-clash and stack-check as well as the
usual guality problems). Earlier versions have been used to build and
test glibc as well.
OK for the trunk?
Jeff
[-- Attachment #2: P --]
[-- Type: text/plain, Size: 20259 bytes --]
commit 7fcdc6933501f9a6b124174e8b2c4113c3adcf1d
Author: root <root@gigabyte-r120-05.khw.lab.eng.bos.redhat.com>
Date: Fri Jul 7 11:17:06 2017 -0400
* config/aarch64/aarch64.c (aarch64_output_probe_stack_range): Handle
-fstack-clash-protection probing too.
(aarch64_allocate_and_probe_stack_space): New function.
(aarch64_expand_prologue): Assert we never have both an initial
adjustment and callee save adjustment. Probe and dump as needed when
-fstack-clash-protection is enabled.
(aarch64_expand_epilogue): Do not assume the prologue has
set IP1_REGNUM or IP0_REGNUM to useful values when
-fstack-clash-protection is active.
(aarch64_override_options_internal): Assume 64kb of guard space.
(aarch64_stack_clash_protection_final_dynamic_probe): New function.
(TARGET_STACK_CLASH_PROTECTION_FINAL_DYNAMIC_PROBE): define.
* aarch64.md (probe_stack_range): Fix constraint on destination
operand.
* lib/target-supports.exp
(check_effective_target_supports_stack_clash_protection): Enable for
aarch64 targets.
* gcc.target/aarch64/stack-check-12.c: New test.
* gcc.target/aarch64/stack-check-13.c: New test.
* gcc.target/aarch64/stack-check-14.c: New test.
* gcc.target/aarch64/stack-check-15.c: New test.
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 71b1408496a..c7224237572 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,20 @@
+2017-09-27 Jeff Law <law@redhat.com>
+
+ * config/aarch64/aarch64.c (aarch64_output_probe_stack_range): Handle
+ -fstack-clash-protection probing too.
+ (aarch64_allocate_and_probe_stack_space): New function.
+ (aarch64_expand_prologue): Assert we never have both an initial
+ adjustment and callee save adjustment. Probe and dump as needed when
+ -fstack-clash-protection is enabled.
+ (aarch64_expand_epilogue): Do not assume the prologue has
+ set IP1_REGNUM or IP0_REGNUM to useful values when
+ -fstack-clash-protection is active.
+ (aarch64_override_options_internal): Assume 64kb of guard space.
+ (aarch64_stack_clash_protection_final_dynamic_probe): New function.
+ (TARGET_STACK_CLASH_PROTECTION_FINAL_DYNAMIC_PROBE): define.
+ * aarch64.md (probe_stack_range): Fix constraint on destination
+ operand.
+
2017-09-27 Christophe Lyon <christophe.lyon@linaro.org>
PR target/71727
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 4c411df12ce..4f74bcbc234 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -2846,7 +2846,14 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
output_asm_insn ("sub\t%0, %0, %1", xops);
/* Probe at TEST_ADDR. */
- output_asm_insn ("str\txzr, [%0]", xops);
+ if (flag_stack_clash_protection)
+ {
+ gcc_assert (xops[0] == stack_pointer_rtx);
+ xops[1] = GEN_INT (PROBE_INTERVAL - 8);
+ output_asm_insn ("str\txzr, [%0, %1]", xops);
+ }
+ else
+ output_asm_insn ("str\txzr, [%0]", xops);
/* Test if TEST_ADDR == LAST_ADDR. */
xops[1] = reg2;
@@ -3606,6 +3613,125 @@ aarch64_set_handled_components (sbitmap components)
cfun->machine->reg_is_wrapped_separately[regno] = true;
}
+/* Allocate SIZE bytes of stack space using SCRATCH_REG as a scratch
+ register. */
+
+static void
+aarch64_allocate_and_probe_stack_space (int scratchreg, HOST_WIDE_INT size)
+{
+ HOST_WIDE_INT probe_interval
+ = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_PROBE_INTERVAL);
+ HOST_WIDE_INT guard_size
+ = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
+ HOST_WIDE_INT guard_used_by_caller = 1024;
+
+ /* SIZE should be large enough to require probing here. ie, it
+ must be larger than GUARD_SIZE - GUARD_USED_BY_CALLER.
+
+ We can allocate GUARD_SIZE - GUARD_USED_BY_CALLER as a single chunk
+ without any probing. */
+ gcc_assert (size >= guard_size - guard_used_by_caller);
+ aarch64_sub_sp (scratchreg, guard_size - guard_used_by_caller, true);
+ HOST_WIDE_INT orig_size = size;
+ size -= (guard_size - guard_used_by_caller);
+
+ HOST_WIDE_INT rounded_size = size & -probe_interval;
+ HOST_WIDE_INT residual = size - rounded_size;
+
+ /* We can handle a small number of allocations/probes inline. Otherwise
+ punt to a loop. */
+ if (rounded_size && rounded_size <= 4 * probe_interval)
+ {
+ /* We don't use aarch64_sub_sp here because we don't want to
+ repeatedly load SCRATCHREG. */
+ rtx scratch_rtx = gen_rtx_REG (Pmode, scratchreg);
+ if (probe_interval > ARITH_FACTOR)
+ emit_move_insn (scratch_rtx, GEN_INT (-probe_interval));
+ else
+ scratch_rtx = GEN_INT (-probe_interval);
+
+ for (HOST_WIDE_INT i = 0; i < rounded_size; i += probe_interval)
+ {
+ rtx_insn *insn = emit_insn (gen_add2_insn (stack_pointer_rtx,
+ scratch_rtx));
+ add_reg_note (insn, REG_STACK_CHECK, const0_rtx);
+
+ if (probe_interval > ARITH_FACTOR)
+ {
+ RTX_FRAME_RELATED_P (insn) = 1;
+ rtx adj = plus_constant (Pmode, stack_pointer_rtx, rounded_size);
+ add_reg_note (insn, REG_CFA_ADJUST_CFA,
+ gen_rtx_SET (stack_pointer_rtx, adj));
+ }
+
+ emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+ (probe_interval
+ - GET_MODE_SIZE (word_mode))));
+ emit_insn (gen_blockage ());
+ }
+ dump_stack_clash_frame_info (PROBE_INLINE, size != rounded_size);
+ }
+ else if (rounded_size)
+ {
+ /* Compute the ending address. */
+ rtx temp = gen_rtx_REG (word_mode, scratchreg);
+ emit_move_insn (temp, GEN_INT (-rounded_size));
+ rtx_insn *insn
+ = emit_insn (gen_add3_insn (temp, stack_pointer_rtx, temp));
+
+ /* For the initial allocation, we don't have a frame pointer
+ set up, so we always need CFI notes. If we're doing the
+ final allocation, then we may have a frame pointer, in which
+ case it is the CFA, otherwise we need CFI notes.
+
+ We can determine which allocation we are doing by looking at
+ the temporary register. IP0 is the initial allocation, IP1
+ is the final allocation. */
+ if (scratchreg == IP0_REGNUM || !frame_pointer_needed)
+ {
+ /* We want the CFA independent of the stack pointer for the
+ duration of the loop. */
+ add_reg_note (insn, REG_CFA_DEF_CFA,
+ plus_constant (Pmode, temp,
+ (rounded_size + (orig_size - size))));
+ RTX_FRAME_RELATED_P (insn) = 1;
+ }
+
+ /* This allocates and probes the stack.
+
+ It also probes at a 4k interval regardless of the value of
+ PARAM_STACK_CLASH_PROTECTION_PROBE_INTERVAL. */
+ insn = emit_insn (gen_probe_stack_range (stack_pointer_rtx,
+ stack_pointer_rtx, temp));
+
+ /* Now reset the CFA register if needed. */
+ if (scratchreg == IP0_REGNUM || !frame_pointer_needed)
+ {
+ add_reg_note (insn, REG_CFA_DEF_CFA,
+ plus_constant (Pmode, stack_pointer_rtx,
+ (rounded_size + (orig_size - size))));
+ RTX_FRAME_RELATED_P (insn) = 1;
+ }
+
+ emit_insn (gen_blockage ());
+ dump_stack_clash_frame_info (PROBE_LOOP, size != rounded_size);
+ }
+ else
+ dump_stack_clash_frame_info (PROBE_INLINE, size != rounded_size);
+
+ /* Handle any residuals.
+ Note that any residual must be probed. */
+ if (residual)
+ {
+ aarch64_sub_sp (scratchreg, residual, true);
+ add_reg_note (get_last_insn (), REG_STACK_CHECK, const0_rtx);
+ emit_stack_probe (plus_constant (Pmode, stack_pointer_rtx,
+ (residual - GET_MODE_SIZE (word_mode))));
+ emit_insn (gen_blockage ());
+ }
+ return;
+}
+
/* AArch64 stack frames generated by this compiler look like:
+-------------------------------+
@@ -3687,7 +3813,54 @@ aarch64_expand_prologue (void)
aarch64_emit_probe_stack_range (get_stack_check_protect (), frame_size);
}
- aarch64_sub_sp (IP0_REGNUM, initial_adjust, true);
+ /* We do not fully protect aarch64 against stack clash style attacks
+ as doing so would be prohibitively expensive with less utility over
+ time as newer compilers are deployed.
+
+ We assume the guard is at least 64k. Furthermore, we assume that
+ the caller has not pushed the stack pointer more than 1k into
+ the guard. A caller that pushes the stack pointer more than 1k into
+ the guard is considered invalid.
+
+ Note that the caller's ability to push the stack pointer into the
+ guard is a function of the number and size of outgoing arguments and/or
+ dynamic stack allocations due to the mandatory save of the link register
+ in the caller's frame.
+
+ With those assumptions the callee can allocate up to 63k of stack
+ space without probing.
+
+ When probing is needed, we emit a probe at the start of the prologue
+ and every PARAM_STACK_CLASH_PROTECTION_PROBE_INTERVAL bytes thereafter.
+
+ We have to track how much space has been allocated, but we do not
+ track stores into the stack as implicit probes except for the
+ fp/lr store. */
+ HOST_WIDE_INT guard_size
+ = 1 << PARAM_VALUE (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE);
+ HOST_WIDE_INT guard_used_by_caller = 1024;
+ if (flag_stack_clash_protection)
+ {
+ if (frame_size == 0)
+ dump_stack_clash_frame_info (NO_PROBE_NO_FRAME, false);
+ else if (initial_adjust < guard_size - guard_used_by_caller
+ && final_adjust < guard_size - guard_used_by_caller)
+ dump_stack_clash_frame_info (NO_PROBE_SMALL_FRAME, true);
+ }
+
+ /* In theory we should never have both an initial adjustment
+ and a callee save adjustment. Verify that is the case since the
+ code below does not handle it for -fstack-clash-protection. */
+ gcc_assert (initial_adjust == 0 || callee_adjust == 0);
+
+ /* Only probe if the initial adjustment is larger than the guard
+ less the amount of the guard reserved for use by the caller's
+ outgoing args. */
+ if (flag_stack_clash_protection
+ && initial_adjust >= guard_size - guard_used_by_caller)
+ aarch64_allocate_and_probe_stack_space (IP0_REGNUM, initial_adjust);
+ else
+ aarch64_sub_sp (IP0_REGNUM, initial_adjust, true);
if (callee_adjust != 0)
aarch64_push_regs (reg1, reg2, callee_adjust);
@@ -3708,7 +3881,30 @@ aarch64_expand_prologue (void)
callee_adjust != 0 || frame_pointer_needed);
aarch64_save_callee_saves (DFmode, callee_offset, V0_REGNUM, V31_REGNUM,
callee_adjust != 0 || frame_pointer_needed);
- aarch64_sub_sp (IP1_REGNUM, final_adjust, !frame_pointer_needed);
+
+ /* We may need to probe the final adjustment as well. */
+ if (flag_stack_clash_protection && final_adjust != 0)
+ {
+ /* First probe if the final adjustment is larger than the guard size
+ less the amount of the guard reserved for use by the caller's
+ outgoing args. */
+ if (final_adjust >= guard_size - guard_used_by_caller)
+ aarch64_allocate_and_probe_stack_space (IP1_REGNUM, final_adjust);
+ else
+ aarch64_sub_sp (IP1_REGNUM, final_adjust, !frame_pointer_needed);
+
+ /* We must also probe if the final adjustment is larger than the guard
+ that is assumed used by the caller. This may be sub-optimal. */
+ if (final_adjust >= guard_used_by_caller)
+ {
+ if (dump_file)
+ fprintf (dump_file,
+ "Stack clash aarch64 large outgoing arg, probing\n");
+ emit_stack_probe (stack_pointer_rtx);
+ }
+ }
+ else
+ aarch64_sub_sp (IP1_REGNUM, final_adjust, !frame_pointer_needed);
}
/* Return TRUE if we can use a simple_return insn.
@@ -3774,7 +3970,11 @@ aarch64_expand_epilogue (bool for_sibcall)
RTX_FRAME_RELATED_P (insn) = callee_adjust == 0;
}
else
- aarch64_add_sp (IP1_REGNUM, final_adjust, df_regs_ever_live_p (IP1_REGNUM));
+ aarch64_add_sp (IP1_REGNUM, final_adjust,
+ /* A stack clash protection prologue may not have
+ left IP1_REGNUM in a usable state. */
+ (flag_stack_clash_protection
+ || df_regs_ever_live_p (IP1_REGNUM)));
aarch64_restore_callee_saves (DImode, callee_offset, R0_REGNUM, R30_REGNUM,
callee_adjust != 0, &cfi_ops);
@@ -3797,7 +3997,11 @@ aarch64_expand_epilogue (bool for_sibcall)
cfi_ops = NULL;
}
- aarch64_add_sp (IP0_REGNUM, initial_adjust, df_regs_ever_live_p (IP0_REGNUM));
+ /* A stack clash protection prologue may not have left IP0_REGNUM
+ in a usable state. */
+ aarch64_add_sp (IP0_REGNUM, initial_adjust,
+ (flag_stack_clash_protection
+ || df_regs_ever_live_p (IP0_REGNUM)));
if (cfi_ops)
{
@@ -9094,6 +9298,12 @@ aarch64_override_options_internal (struct gcc_options *opts)
&& opts->x_optimize >= aarch64_tune_params.prefetch->default_opt_level)
opts->x_flag_prefetch_loop_arrays = 1;
+ /* We assume the guard page is 64k. */
+ maybe_set_param_value (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE,
+ 16,
+ opts->x_param_values,
+ global_options_set.x_param_values);
+
aarch64_override_options_after_change_1 (opts);
}
@@ -15245,6 +15455,28 @@ aarch64_sched_can_speculate_insn (rtx_insn *insn)
}
}
+/* It has been decided to allow up to 1kb of outgoing argument
+ space to be allocated w/o probing. If more than 1kb of outgoing
+ argument space is allocated, then it must be probed and the last
+ probe must occur no more than 1kbyte away from the end of the
+ allocated space.
+
+ This implies that the residual part of an alloca allocation may
+ need probing in cases where the generic code might not otherwise
+ think a probe is needed.
+
+ This target hook returns TRUE when allocating RESIDUAL bytes of
+ alloca space requires an additional probe, otherwise FALSE is
+ returned. */
+
+static bool
+aarch64_stack_clash_protection_final_dynamic_probe (rtx residual)
+{
+ return (residual == CONST0_RTX (Pmode)
+ || GET_CODE (residual) != CONST_INT
+ || INTVAL (residual) >= 1024);
+}
+
/* Target-specific selftests. */
#if CHECKING_P
@@ -15691,6 +15923,10 @@ aarch64_libgcc_floating_mode_supported_p
#undef TARGET_CONSTANT_ALIGNMENT
#define TARGET_CONSTANT_ALIGNMENT aarch64_constant_alignment
+#undef TARGET_STACK_CLASH_PROTECTION_FINAL_DYNAMIC_PROBE
+#define TARGET_STACK_CLASH_PROTECTION_FINAL_DYNAMIC_PROBE \
+ aarch64_stack_clash_protection_final_dynamic_probe
+
#if CHECKING_P
#undef TARGET_RUN_TARGET_SELFTESTS
#define TARGET_RUN_TARGET_SELFTESTS selftest::aarch64_run_selftests
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index f8cdb063546..f483e4dddc3 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -5589,7 +5589,7 @@
)
(define_insn "probe_stack_range"
- [(set (match_operand:DI 0 "register_operand" "=r")
+ [(set (match_operand:DI 0 "register_operand" "=rk")
(unspec_volatile:DI [(match_operand:DI 1 "register_operand" "0")
(match_operand:DI 2 "register_operand" "r")]
UNSPECV_PROBE_STACK_RANGE))]
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index f2a44809c79..df4f3351769 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,13 @@
+2017-09-27 Jeff Law <law@redhat.com>
+
+ * lib/target-supports.exp
+ (check_effective_target_supports_stack_clash_protection): Enable for
+ aarch64 targets.
+ * gcc.target/aarch64/stack-check-12.c: New test.
+ * gcc.target/aarch64/stack-check-13.c: New test.
+ * gcc.target/aarch64/stack-check-14.c: New test.
+ * gcc.target/aarch64/stack-check-15.c: New test.
+
2017-09-27 David Malcolm <dmalcolm@redhat.com>
* jit.dg/all-non-failing-tests.h: Add
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-12.c b/gcc/testsuite/gcc.target/aarch64/stack-check-12.c
new file mode 100644
index 00000000000..2ce38483b6b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-12.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=12" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+extern void arf (unsigned long int *, unsigned long int *);
+void
+frob ()
+{
+ unsigned long int num[1000];
+ unsigned long int den[1000];
+ arf (den, num);
+}
+
+/* This verifies that the scheduler did not break the dependencies
+ by adjusting the offsets within the probe and that the scheduler
+ did not reorder around the stack probes. */
+/* { dg-final { scan-assembler-times "sub\\tsp, sp, #4096\\n\\tstr\\txzr, .sp, 4088." 3 } } */
+
+
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-13.c b/gcc/testsuite/gcc.target/aarch64/stack-check-13.c
new file mode 100644
index 00000000000..d8886835989
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-13.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=12" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#define ARG32(X) X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,X
+#define ARG192(X) ARG32(X),ARG32(X),ARG32(X),ARG32(X),ARG32(X),ARG32(X)
+void out1(ARG192(__int128));
+int t1(int);
+
+int t3(int x)
+{
+ if (x < 1000)
+ return t1 (x) + 1;
+
+ out1 (ARG192(1));
+ return 0;
+}
+
+
+
+/* This test creates a large (> 1k) outgoing argument area that needs
+ to be probed. We don't test the exact size of the space or the
+ exact offset to make the test a little less sensitive to trivial
+ output changes. */
+/* { dg-final { scan-assembler-times "sub\\tsp, sp, #....\\n\\tstr\\txzr, \\\[sp" 1 } } */
+
+
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-14.c b/gcc/testsuite/gcc.target/aarch64/stack-check-14.c
new file mode 100644
index 00000000000..59ffe01376d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-14.c
@@ -0,0 +1,25 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=12" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+int t1(int);
+
+int t2(int x)
+{
+ char *p = __builtin_alloca (4050);
+ x = t1 (x);
+ return p[x];
+}
+
+
+/* This test has a constant sized alloca that is smaller than the
+ probe interval. But it actually requires two probes instead
+ of one because of the optimistic assumptions we made in the
+ aarch64 prologue code WRT probing state.
+
+ The form can change quite a bit so we just check for two
+ probes without looking at the actual address. */
+/* { dg-final { scan-assembler-times "str\\txzr," 2 } } */
+
+
+
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-15.c b/gcc/testsuite/gcc.target/aarch64/stack-check-15.c
new file mode 100644
index 00000000000..e06db6dc2f0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-15.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fstack-clash-protection --param stack-clash-protection-guard-size=12" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+int t1(int);
+
+int t2(int x)
+{
+ char *p = __builtin_alloca (x);
+ x = t1 (x);
+ return p[x];
+}
+
+
+/* This test has a variable sized alloca. It requires 3 probes.
+ One in the loop, one for the residual and at the end of the
+ alloca area.
+
+ The form can change quite a bit so we just check for two
+ probes without looking at the actual address. */
+/* { dg-final { scan-assembler-times "str\\txzr," 3 } } */
+
+
+
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 57f646ce2df..6fe81827207 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -8725,14 +8725,9 @@ proc check_effective_target_autoincdec { } {
#
proc check_effective_target_supports_stack_clash_protection { } {
- # Temporary until the target bits are fully ACK'd.
-# if { [istarget aarch*-*-*] } {
-# return 1
-# }
-
if { [istarget x86_64-*-*] || [istarget i?86-*-*]
|| [istarget powerpc*-*-*] || [istarget rs6000*-*-*]
- || [istarget s390*-*-*] } {
+ || [istarget aarch64*-*-*] || [istarget s390*-*-*] } {
return 1
}
return 0
* Re: [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits)
[not found] ` <DB6PR0801MB205340E4DACC1CF3696A150983480@DB6PR0801MB2053.eurprd08.prod.outlook.com>
@ 2017-10-13 16:14 ` Wilco Dijkstra
0 siblings, 0 replies; 13+ messages in thread
From: Wilco Dijkstra @ 2017-10-13 16:14 UTC (permalink / raw)
To: gcc-patches, Jeff Law; +Cc: nd
Hi,
Sorry for the delay - I finally had a chance to look at this again.
I'll start with alloca:
@@ -15245,6 +15455,28 @@ aarch64_sched_can_speculate_insn (rtx_insn *insn)
}
}
+/* It has been decided to allow up to 1kb of outgoing argument
+ space to be allocated w/o probing. If more than 1kb of outgoing
+ argument space is allocated, then it must be probed and the last
+ probe must occur no more than 1kbyte away from the end of the
+ allocated space.
+
+ This implies that the residual part of an alloca allocation may
+ need probing in cases where the generic code might not otherwise
+ think a probe is needed.
+
+ This target hook returns TRUE when allocating RESIDUAL bytes of
+ alloca space requires an additional probe, otherwise FALSE is
+ returned. */
+
+static bool
+aarch64_stack_clash_protection_final_dynamic_probe (rtx residual)
+{
+ return (residual == CONST0_RTX (Pmode)
+ || GET_CODE (residual) != CONST_INT
+ || INTVAL (residual) >= 1024);
+}
+
The const0 check is wrong - for alloca(0) we do not want to insert a probe!
I don't get how this works - probes are done at strange offsets, there is always
a final probe (even if residual is < 1024), and the alignment info seems to be lost
somehow, so the generated code ends up quite bad:
void *p;
void small_alloca (int x)
{
if (x > 100)
p = __builtin_alloca (4096);
}
sub sp, sp, #4096
str xzr, [sp, 4088]
mov x0, sp
str xzr, [x0], 15 *** +15 and extra probe
and x0, x0, -16 *** already aligned
The same unnecessary probe happens for a variable alloca after the loop:
void alloca (int x)
{
if (x > 100)
p = __builtin_alloca (x);
}
add x0, x0, 15
and x0, x0, -16
and x1, x0, -4096
sub x1, sp, x1
.L22:
mov x2, sp
cmp x2, x1
beq .L23
sub sp, sp, #4096
str xzr, [sp, 4088]
b .L22
.L23:
and x0, x0, 4095
sub x1, x0, #8
sub sp, sp, x0
str xzr, [sp, x1]
str xzr, [sp, -8]
mov x1, sp
That does lots of unnecessary operations, including always an extra probe
(which isn't needed since the final adjustment is always less than the max
probe distance). I think it should be:
add x0, x0, 15
subs x1, x0, 4096
bhi .L22
.L23:
and x0, x0, 4080
sub sp, sp, x0
str xzr, [sp, probe_offset - 8]
mov x1, sp
.L22: (loop marked as unlikely so out of the way)
subs x1, x1, 4096
sub sp, sp, #4096
str xzr, [sp, probe_offset]
bhi .L22
b .L23
On AArch64 probe_offset would be min (1024, outgoing_args_size). On other
targets it could be 0 or (probe distance - 8) depending on the probe design. This
means you always do exactly the number of probes that are needed and avoid
the corner case of alloca (0). Typically the only extra executed instructions are a
compare+non-taken branch plus a store. Codesize overhead is reduced by a
third (from 12 to 8 instructions).
Wilco
* Re: [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits)
2017-09-29 15:15 [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits) Jeff Law
[not found] ` <DB6PR0801MB205340E4DACC1CF3696A150983480@DB6PR0801MB2053.eurprd08.prod.outlook.com>
@ 2017-10-13 20:47 ` Wilco Dijkstra
2017-10-28 11:23 ` Jeff Law
1 sibling, 1 reply; 13+ messages in thread
From: Wilco Dijkstra @ 2017-10-13 20:47 UTC (permalink / raw)
To: Jeff Law, gcc-patches
Cc: Richard Earnshaw, James Greenhalgh, Marcus Shawcroft, nd
Hi,
To continue the review of the AArch64 frame code I tried a few examples
to figure out what it does now. For initial_adjust <= 63*1024 and final_adjust <
1024 there are no probes inserted as expected, i.e. the vast majority of
functions are unaffected. So that works perfectly.
For larger frames the first oddity is that there are now 2 separate params
controlling how probes are generated:
stack-clash-protection-guard-size (default 12, but set to 16 on AArch64)
stack-clash-protection-probe-interval (default 12)
I don't see how this makes sense. These values are closely related, so if
one is different from the other, probing becomes ineffective/incorrect.
For example we generate code that trivially bypasses the guard despite
all the probing:
--param=stack-clash-protection-probe-interval=13
--param=stack-clash-protection-guard-size=12
So if there is a good reason to continue with 2 separate values, we must
force probe interval <= guard size!
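As a hand-written illustration (not compiler output) of why the constraint
matters: with a hypothetical 8k probe interval against a 4k guard, a single
inter-probe adjustment can step over the entire guard page before any probe
touches it:

  sub   sp, sp, #8192      /* one probe-interval drop                    */
  str   xzr, [sp]          /* probe                                      */
  sub   sp, sp, #8192      /* this drop alone spans the whole 4k guard,  */
                           /* so sp can land in valid memory beyond it   */
  str   xzr, [sp]          /* ...and the next probe arrives too late     */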
Also on AArch64 --param=stack-clash-protection-probe-interval=16 causes
crashes due to the offsets used in the probes - we don't need large offsets
as we want to probe close to the bottom of the stack.
Functions with a large stack emit a lot of code, much like alloca does; here I used
--param=stack-clash-protection-probe-interval=15:
int f1(int x)
{
char arr[128*1024];
return arr[x];
}
f1:
mov x16, 64512
sub sp, sp, x16
.cfi_def_cfa_offset 64512
mov x16, -32768
add sp, sp, x16
.cfi_def_cfa_offset -1024
str xzr, [sp, 32760]
add sp, sp, x16
.cfi_def_cfa_offset -66560
str xzr, [sp, 32760]
sub sp, sp, #1024
.cfi_def_cfa_offset -65536
str xzr, [sp, 1016]
ldrb w0, [sp, w0, sxtw]
.cfi_def_cfa_offset 131072
add sp, sp, 131072
.cfi_def_cfa_offset 0
ret
Note the cfa offsets are wrong.
There is an odd mix of a big initial adjustment, then some probes+adjustments and
then a final adjustment and probe for the remainder. I can't see the point of having
both an initial and remainder adjustment. I would expect this:
sub sp, sp, 65536
str xzr, [sp, 1024]
sub sp, sp, 65536
str xzr, [sp, 1024]
ldrb w0, [sp, w0, sxtw]
add sp, sp, 131072
ret
int f2(int x)
{
char arr[256*1024];
return arr[x];
}
f2:
mov x16, 64512
sub sp, sp, x16
mov x16, -65536
movk x16, 0xfffd, lsl 16
add x16, sp, x16
.LPSRL0:
sub sp, sp, 4096
str xzr, [sp, 4088]
cmp sp, x16
b.ne .LPSRL0
sub sp, sp, #1024
str xzr, [sp, 1016]
ldrb w0, [sp, w0, sxtw]
add sp, sp, 262144
ret
The cfa entries are OK for this case. There is a mix of positive/negative offsets which
makes things confusing. Again there are 3 kinds of adjustments when for this size we
only need the loop.
Reusing the existing gen_probe_stack_range code appears to be a bad idea since
it ignores the probe interval and just defaults to 4KB. I don't see why it should be
any more complex than this:
sub x16, sp, 262144 // only need temporary if > 1MB
.LPSRL0:
sub sp, sp, 65536
str xzr, [sp, 1024]
cmp sp, x16
b.ne .LPSRL0
ldrb w0, [sp, w0, sxtw]
add sp, sp, 262144
ret
Probe insertion if final adjustment >= 1024 also generates a lot of redundant
code - although this is more a theoretical issue given this is so rare.
Wilco
* Re: [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits)
2017-10-13 20:47 ` Wilco Dijkstra
@ 2017-10-28 11:23 ` Jeff Law
2017-11-21 11:59 ` James Greenhalgh
2017-11-27 16:32 ` Szabolcs Nagy
0 siblings, 2 replies; 13+ messages in thread
From: Jeff Law @ 2017-10-28 11:23 UTC (permalink / raw)
To: Wilco Dijkstra, gcc-patches
Cc: Richard Earnshaw, James Greenhalgh, Marcus Shawcroft, nd
On 10/13/2017 02:26 PM, Wilco Dijkstra wrote:
> Hi,
>
> To continue the review of the AArch64 frame code I tried a few examples
> to figure out what it does now. For initial_adjust <= 63*1024 and final_adjust <
> 1024 there are no probes inserted as expected, ie. the vast majority of
> functions are unaffected. So that works perfectly.
Right.
>
> For larger frames the first oddity is that there are now 2 separate params
> controlling how probes are generated:
>
> stack-clash-protection-guard-size (default 12, but set to 16 on AArch64)
> stack-clash-protection-probe-interval (default 12)
>
> I don't see how this makes sense. These values are closely related, so if
> one is different from the other, probing becomes ineffective/incorrect.
> For example we generate code that trivially bypasses the guard despite
> all the probing:
My hope would be that we simply don't ever use the params. They were
done as much for *you* to experiment with as anything. I'd happily just
delete them as there are essentially no guard rails to ensure their values
are sane.
>
> --param=stack-clash-protection-probe-interval=13
> --param=stack-clash-protection-guard-size=12
>
> So if there is a good reason to continue with 2 separate values, we must
> force probe interval <= guard size!
The param code really isn't designed to enforce values that are
inter-dependent. It has min, max & default values. No more, no less.
If you set up something inconsistent with the params, it's simply not
going to work.
>
> Also on AArch64 --param=stack-clash-protection-probe-interval=16 causes
> crashes due to the offsets used in the probes - we don't need large offsets
> as we want to probe close to the bottom of the stack.
Not a surprise. While I tried to handle larger intervals, I certainly
didn't test them. Given the ISA I wouldn't expect an interval > 12 to
be useful or necessarily even work correctly.
>
> Functions with a large stack emit like alloca a lot of code, here I used
> --param=stack-clash-protection-probe-interval=15:
>
> int f1(int x)
> {
> char arr[128*1024];
> return arr[x];
> }
>
> f1:
> mov x16, 64512
> sub sp, sp, x16
> .cfi_def_cfa_offset 64512
> mov x16, -32768
> add sp, sp, x16
> .cfi_def_cfa_offset -1024
> str xzr, [sp, 32760]
> add sp, sp, x16
> .cfi_def_cfa_offset -66560
> str xzr, [sp, 32760]
> sub sp, sp, #1024
> .cfi_def_cfa_offset -65536
> str xzr, [sp, 1016]
> ldrb w0, [sp, w0, sxtw]
> .cfi_def_cfa_offset 131072
> add sp, sp, 131072
> .cfi_def_cfa_offset 0
> ret
>
> Note the cfa offsets are wrong.
Yes. They definitely look wrong. There's a clear logic error in
setting up the ADJUST_CFA note when the probing interval is larger than
2**12. That should be easily fixed. Let me poke at it.
>
> There is an odd mix of a big initial adjustment, then some probes+adjustments and
> then a final adjustment and probe for the remainder. I can't see the point of having
> both an initial and remainder adjustment. I would expect this:
>
> sub sp, sp, 65536
> str xzr, [sp, 1024]
> sub sp, sp, 65536
> str xzr, [sp, 1024]
> ldrb w0, [sp, w0, sxtw]
> add sp, sp, 131072
> ret
I'm really not able to justify spending further time optimizing the
aarch64 implementation. I've done the best I can. You can take the
work as-is or improve it, but I really can't justify further time
investment on that architecture.
>
>
> int f2(int x)
> {
> char arr[128*1024];
> return arr[x];
> }
>
> f2:
> mov x16, 64512
> sub sp, sp, x16
> mov x16, -65536
> movk x16, 0xfffd, lsl 16
> add x16, sp, x16
> .LPSRL0:
> sub sp, sp, 4096
> str xzr, [sp, 4088]
> cmp sp, x16
> b.ne .LPSRL0
> sub sp, sp, #1024
> str xzr, [sp, 1016]
> ldrb w0, [sp, w0, sxtw]
> add sp, sp, 262144
> ret
>
> The cfa entries are OK for this case. There is a mix of positive/negative offsets which
> makes things confusing. Again there are 3 kinds of adjustments when for this size we
> only need the loop.
>
> Reusing the existing gen_probe_stack_range code appears a bad idea since
> it ignores the probe interval and just defaults to 4KB. I don't see why it should be
> any more complex than this:
>
> sub x16, sp, 262144 // only need temporary if > 1MB
> .LPSRL0:
> sub sp, sp, 65536
> str xzr, [sp, 1024]
> cmp sp, x16
> b.ne .LPSRL0
> ldrb w0, [sp, w0, sxtw]
> add sp, sp, 262144
> ret
>
> Probe insertion if final adjustment >= 1024 also generates a lot of redundant
> code - although this is more a theoretical issue given this is so rare.
Again, if ARM wants this optimized, then ARM's engineers are going to
have to take the lead here. I've invested all I can reasonably invest
in terms of trying optimize the probing for this target.
jeff
* Re: [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits)
2017-10-28 11:23 ` Jeff Law
@ 2017-11-21 11:59 ` James Greenhalgh
2017-11-22 18:52 ` Jeff Law
2017-11-27 16:32 ` Szabolcs Nagy
1 sibling, 1 reply; 13+ messages in thread
From: James Greenhalgh @ 2017-11-21 11:59 UTC (permalink / raw)
To: Jeff Law
Cc: Wilco Dijkstra, gcc-patches, Richard Earnshaw, Marcus Shawcroft, nd
I've finally built up enough courage to start getting my head around this...
I see one outstanding issue sitting on this patch version:
On Sat, Oct 28, 2017 at 05:08:54AM +0100, Jeff Law wrote:
> On 10/13/2017 02:26 PM, Wilco Dijkstra wrote:
> > --param=stack-clash-protection-probe-interval=13
> > --param=stack-clash-protection-guard-size=12
> >
> > So if there is a good reason to continue with 2 separate values, we must
> > force probe interval <= guard size!
> The param code really isn't designed to enforce values that are
> inter-dependent. It has a min, max & default values. No more, no less.
> If you set up something inconsistent with the params, it's simply not
> going to work.
>
>
> >
> > Also on AArch64 --param=stack-clash-protection-probe-interval=16 causes
> > crashes due to the offsets used in the probes - we don't need large offsets
> > as we want to probe close to the bottom of the stack.
> Not a surprise. While I tried to handle larger intervals, I certainly
> didn't test them. Given the ISA I wouldn't expect an interval > 12 to
> be useful or necessarily even work correctly.
Understood - weird behaviour with weird params doesn't concern me.
> > Functions with a large stack emit like alloca a lot of code, here I used
> > --param=stack-clash-protection-probe-interval=15:
> >
> > int f1(int x)
> > {
> > char arr[128*1024];
> > return arr[x];
> > }
> >
> > f1:
> > mov x16, 64512
> > sub sp, sp, x16
> > .cfi_def_cfa_offset 64512
> > mov x16, -32768
> > add sp, sp, x16
> > .cfi_def_cfa_offset -1024
> > str xzr, [sp, 32760]
> > add sp, sp, x16
> > .cfi_def_cfa_offset -66560
> > str xzr, [sp, 32760]
> > sub sp, sp, #1024
> > .cfi_def_cfa_offset -65536
> > str xzr, [sp, 1016]
> > ldrb w0, [sp, w0, sxtw]
> > .cfi_def_cfa_offset 131072
> > add sp, sp, 131072
> > .cfi_def_cfa_offset 0
> > ret
> >
> > Note the cfa offsets are wrong.
> Yes. They definitely look wrong. There's a clear logic error in
> setting up the ADJUST_CFA note when the probing interval is larger than
> 2**12. That should be easily fixed. Let me poke at it.
This one does concern me; how did you get on? Did it respond well to
prodding?
> > There is an odd mix of a big initial adjustment, then some probes+adjustments and
> > then a final adjustment and probe for the remainder. I can't see the point of having
> > both an initial and remainder adjustment. I would expect this:
> >
> > sub sp, sp, 65536
> > str xzr, [sp, 1024]
> > sub sp, sp, 65536
> > str xzr, [sp, 1024]
> > ldrb w0, [sp, w0, sxtw]
> > add sp, sp, 131072
> > ret
> I'm really not able to justify spending further time optimizing the
> aarch64 implementation. I've done the best I can. You can take the
> work as-is or improve it, but I really can't justify further time
> investment on that architecture.
Makes sense. Understood. And certainly not required to land this patch.
> > int f2(int x)
> > {
> > char arr[128*1024];
> > return arr[x];
> > }
> >
> > f2:
> > mov x16, 64512
> > sub sp, sp, x16
> > mov x16, -65536
> > movk x16, 0xfffd, lsl 16
> > add x16, sp, x16
> > .LPSRL0:
> > sub sp, sp, 4096
> > str xzr, [sp, 4088]
> > cmp sp, x16
> > b.ne .LPSRL0
> > sub sp, sp, #1024
> > str xzr, [sp, 1016]
> > ldrb w0, [sp, w0, sxtw]
> > add sp, sp, 262144
> > ret
> >
> > The cfa entries are OK for this case. There is a mix of positive/negative offsets which
> > makes things confusing. Again there are 3 kinds of adjustments when for this size we
> > only need the loop.
> >
> > Reusing the existing gen_probe_stack_range code appears a bad idea since
> > it ignores the probe interval and just defaults to 4KB. I don't see why it should be
> > any more complex than this:
> >
> > sub x16, sp, 262144 // only need temporary if > 1MB
> > .LPSRL0:
> > sub sp, sp, 65536
> > str xzr, [sp, 1024]
> > cmp sp, x16
> > b.ne .LPSRL0
> > ldrb w0, [sp, w0, sxtw]
> > add sp, sp, 262144
> > ret
> >
> > Probe insertion if final adjustment >= 1024 also generates a lot of redundant
> > code - although this is more a theoretical issue given this is so rare.
> Again, if ARM wants this optimized, then ARM's engineers are going to
> have to take the lead here. I've invested all I can reasonably invest
> in terms of trying optimize the probing for this target.
Likewise here - thanks for your work so far, I have no expectation of this
being fully optimized before I OK it to land.
Sorry for the big delay getting round to this patch, I hope to get serious
time to put in to it later this week, and it would be helpful to close out
the few remaining issues before I do.
Thanks,
James
* Re: [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits)
2017-11-21 11:59 ` James Greenhalgh
@ 2017-11-22 18:52 ` Jeff Law
2017-11-27 19:25 ` James Greenhalgh
0 siblings, 1 reply; 13+ messages in thread
From: Jeff Law @ 2017-11-22 18:52 UTC (permalink / raw)
To: James Greenhalgh
Cc: Wilco Dijkstra, gcc-patches, Richard Earnshaw, Marcus Shawcroft, nd
On 11/21/2017 04:57 AM, James Greenhalgh wrote:
> I've finally built up enough courage to start getting my head around this...
Can't blame you for avoiding :-) This stuff isn't my idea of fun either.
>
> I see one outstanding issue sitting on this patch version:
>
> On Sat, Oct 28, 2017 at 05:08:54AM +0100, Jeff Law wrote:
>> On 10/13/2017 02:26 PM, Wilco Dijkstra wrote:
>>> --param=stack-clash-protection-probe-interval=13
>>> --param=stack-clash-protection-guard-size=12
>>>
>>> So if there is a good reason to continue with 2 separate values, we must
>>> force probe interval <= guard size!
>> The param code really isn't designed to enforce values that are
>> inter-dependent. It has a min, max & default values. No more, no less.
>> If you set up something inconsistent with the params, it's simply not
>> going to work.
>>
>>
>>>
>>> Also on AArch64 --param=stack-clash-protection-probe-interval=16 causes
>>> crashes due to the offsets used in the probes - we don't need large offsets
>>> as we want to probe close to the bottom of the stack.
>> Not a surprise. While I tried to handle larger intervals, I certainly
>> didn't test them. Given the ISA I wouldn't expect an interval > 12 to
>> be useful or necessarily even work correctly.
>
> Understood - weird behaviour with weird params don't concern me.
Likewise. My thinking was to expose them so that folks who care can use them
to test the impact of changing them and find the right balance given the
various architectural and OS issues -- then we remove the params :-)
I don't feel that it's worth any significant time sanity checking the
params.
>
>>> Functions with a large stack emit like alloca a lot of code, here I used
>>> --param=stack-clash-protection-probe-interval=15:
>>>
>>> int f1(int x)
>>> {
>>> char arr[128*1024];
>>> return arr[x];
>>> }
>>>
>>> f1:
>>> mov x16, 64512
>>> sub sp, sp, x16
>>> .cfi_def_cfa_offset 64512
>>> mov x16, -32768
>>> add sp, sp, x16
>>> .cfi_def_cfa_offset -1024
>>> str xzr, [sp, 32760]
>>> add sp, sp, x16
>>> .cfi_def_cfa_offset -66560
>>> str xzr, [sp, 32760]
>>> sub sp, sp, #1024
>>> .cfi_def_cfa_offset -65536
>>> str xzr, [sp, 1016]
>>> ldrb w0, [sp, w0, sxtw]
>>> .cfi_def_cfa_offset 131072
>>> add sp, sp, 131072
>>> .cfi_def_cfa_offset 0
>>> ret
>>>
>>> Note the cfa offsets are wrong.
>> Yes. They definitely look wrong. There's a clear logic error in
>> setting up the ADJUST_CFA note when the probing interval is larger than
>> 2**12. That should be easily fixed. Let me poke at it.
>
> This one does concern me, how did you get on? Did it respond well to
> prodding?
I've got a fix for the CFA note. It's a one-liner; the bug came from my
misunderstanding of how some of the CFA machinery works. It only
affects cases where a --param has adjusted the probing interval to a
point where it's too large for a simple arithmetic insn to adjust the stack.
>>>
>>> The cfa entries are OK for this case. There is a mix of positive/negative offsets which
>>> makes things confusing. Again there are 3 kinds of adjustments when for this size we
>>> only need the loop.
>>>
>>> Reusing the existing gen_probe_stack_range code appears a bad idea since
>>> it ignores the probe interval and just defaults to 4KB. I don't see why it should be
>>> any more complex than this:
>>>
>>> sub x16, sp, 262144 // only need temporary if > 1MB
>>> .LPSRL0:
>>> sub sp, sp, 65536
>>> str xzr, [sp, 1024]
>>> cmp sp, x16
>>> b.ne .LPSRL0
>>> ldrb w0, [sp, w0, sxtw]
>>> add sp, sp, 262144
>>> ret
>>>
>>> Probe insertion if final adjustment >= 1024 also generates a lot of redundant
>>> code - although this is more a theoretical issue given this is so rare.
>> Again, if ARM wants this optimized, then ARM's engineers are going to
>> have to take the lead here. I've invested all I can reasonably invest
>> in terms of trying optimize the probing for this target.
>
> Likewise here - thanks for your work so far, I have no expectation of this
> being fully optimized before I OK it to land.
>
> Sorry for the big delay getting round to this patch, I hope to get serious
> time to put in to it later this week, and it would be helpful to close out
> the few remaining issues before I do.
That'd be awesome. It's probably too late for further improvements to
land into Red Hat's RHEL 7.5 compiler.
However, I'm in the process now of backporting the kit to F27/F28 so the
glibc teams can bulletproof glibc there and we can certainly exploit
further refinements in F27/F28.
jeff
* Re: [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits)
2017-10-28 11:23 ` Jeff Law
2017-11-21 11:59 ` James Greenhalgh
@ 2017-11-27 16:32 ` Szabolcs Nagy
2017-11-27 17:50 ` Wilco Dijkstra
` (2 more replies)
1 sibling, 3 replies; 13+ messages in thread
From: Szabolcs Nagy @ 2017-11-27 16:32 UTC (permalink / raw)
To: Jeff Law, Wilco Dijkstra, gcc-patches
Cc: nd, Richard Earnshaw, James Greenhalgh, Marcus Shawcroft
On 28/10/17 05:08, Jeff Law wrote:
> On 10/13/2017 02:26 PM, Wilco Dijkstra wrote:
>> For larger frames the first oddity is that there are now 2 separate params
>> controlling how probes are generated:
>>
>> stack-clash-protection-guard-size (default 12, but set to 16 on AArch64)
>> stack-clash-protection-probe-interval (default 12)
>>
>> I don't see how this makes sense. These values are closely related, so if
>> one is different from the other, probing becomes ineffective/incorrect.
>> For example we generate code that trivially bypasses the guard despite
>> all the probing:
> My hope would be that we simply don't ever use the params. They were
> done as much for *you* to experiment with as anything. I'd happy just
> delete them as there's essentially no guard rails to ensure their values
> are sane.
so is there a consensus now that 64k guard size is what
gcc stack probing will assume?
if so i can propose a patch to glibc to actually have
that much guard by default in threads.. (i think it
makes sense on all 64bit targets to have bigger guard
and a consensus here would help making that change)
>> Also on AArch64 --param=stack-clash-protection-probe-interval=16 causes
>> crashes due to the offsets used in the probes - we don't need large offsets
>> as we want to probe close to the bottom of the stack.
> Not a surprise. While I tried to handle larger intervals, I certainly
> didn't test them. Given the ISA I wouldn't expect an interval > 12 to
> be useful or necessarily even work correctly.
it's not clear what makes probing at every 64k hard,
i think this should be clarified before we stick to
this design. (..or before backporting such patches)
* Re: [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits)
2017-11-27 16:32 ` Szabolcs Nagy
@ 2017-11-27 17:50 ` Wilco Dijkstra
2017-11-27 18:44 ` Jeff Law
2017-11-27 18:30 ` Jeff Law
2017-11-28 16:31 ` Rich Felker
2 siblings, 1 reply; 13+ messages in thread
From: Wilco Dijkstra @ 2017-11-27 17:50 UTC (permalink / raw)
To: Szabolcs Nagy, Jeff Law, gcc-patches
Cc: nd, Richard Earnshaw, James Greenhalgh, Marcus Shawcroft
Szabolcs Nagy wrote:
>On 28/10/17 05:08, Jeff Law wrote:
>
>> My hope would be that we simply don't ever use the params. They were
>> done as much for *you* to experiment with as anything. I'd happy just
>> delete them as there's essentially no guard rails to ensure their values
>> are sane.
>
> so is there a consensus now that 64k guard size is what
> gcc stack probing will assume?
I think right now only AArch64 will use a 64KB probe size which is always
enabled. It is best to hardcode this so it can't be changed or turned off,
inadvertently or not...
> if so i can propose a patch to glibc to actually have
> that much guard by default in threads.. (i think it
> makes sense on all 64bit targets to have bigger guard
> and a consensus here would help making that change)
Assuming a minimum 64KB thread guard size on 64-bit systems is unlikely to
be controversial - the guard size in OS/GLIBC may be larger than the probe
size used in GCC, so I suggest to propose a patch.
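(For the application side, such a guard is requested through the standard
pthread_attr_setguardsize interface; the sketch below is illustrative only and
not part of this patch or of any proposed glibc change.)

#include <pthread.h>

static void *worker (void *arg) { return arg; }

/* Illustrative: spawn a thread whose stack guard is at least 64k.
   The implementation may round the size up to page granularity.  */
int spawn_with_64k_guard (pthread_t *thread)
{
  pthread_attr_t attr;
  pthread_attr_init (&attr);
  pthread_attr_setguardsize (&attr, 64 * 1024);
  int rc = pthread_create (thread, &attr, worker, (void *) 0);
  pthread_attr_destroy (&attr);
  return rc;
}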
>>> Also on AArch64 --param=stack-clash-protection-probe-interval=16 causes
>>> crashes due to the offsets used in the probes - we don't need large offsets
>>> as we want to probe close to the bottom of the stack.
>> Not a surprise. While I tried to handle larger intervals, I certainly
>> didn't test them. Given the ISA I wouldn't expect an interval > 12 to
>> be useful or necessarily even work correctly.
>
> it's not clear what makes probing at every 64k hard,
> i think this should be clarified before we stick to
> this design. (..or before backporting such patches)
There is nothing inherently hard about large probe intervals. Any size can be
supported efficiently on AArch64 (like I showed in my examples of expected
code sequences for probes).
Wilco
* Re: [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits)
2017-11-27 16:32 ` Szabolcs Nagy
2017-11-27 17:50 ` Wilco Dijkstra
@ 2017-11-27 18:30 ` Jeff Law
2017-11-28 16:31 ` Rich Felker
2 siblings, 0 replies; 13+ messages in thread
From: Jeff Law @ 2017-11-27 18:30 UTC (permalink / raw)
To: Szabolcs Nagy, Wilco Dijkstra, gcc-patches
Cc: nd, Richard Earnshaw, James Greenhalgh, Marcus Shawcroft
On 11/27/2017 08:48 AM, Szabolcs Nagy wrote:
> On 28/10/17 05:08, Jeff Law wrote:
>> On 10/13/2017 02:26 PM, Wilco Dijkstra wrote:
>>> For larger frames the first oddity is that there are now 2 separate params
>>> controlling how probes are generated:
>>>
>>> stack-clash-protection-guard-size (default 12, but set to 16 on AArch64)
>>> stack-clash-protection-probe-interval (default 12)
>>>
>>> I don't see how this makes sense. These values are closely related, so if
>>> one is different from the other, probing becomes ineffective/incorrect.
>>> For example we generate code that trivially bypasses the guard despite
>>> all the probing:
>> My hope would be that we simply don't ever use the params. They were
>> done as much for *you* to experiment with as anything. I'd happy just
>> delete them as there's essentially no guard rails to ensure their values
>> are sane.
>
> so is there a consensus now that 64k guard size is what
> gcc stack probing will assume?
Only aarch64 currently assumes a guard that large. Other targets are
still assuming a single 4k page guard.
I'd certainly like to see a larger guard, at least on the 64bit targets.
Address space is at a much higher premium on the 32bit targets so it's
less clear if we can really bump the guard size on them.
>
>>> Also on AArch64 --param=stack-clash-protection-probe-interval=16 causes
>>> crashes due to the offsets used in the probes - we don't need large offsets
>>> as we want to probe close to the bottom of the stack.
>> Not a surprise. While I tried to handle larger intervals, I certainly
>> didn't test them. Given the ISA I wouldn't expect an interval > 12 to
>> be useful or necessarily even work correctly.
>
> it's not clear what makes probing at every 64k hard,
> i think this should be clarified before we stick to
> this design. (..or before backporting such patches)
You just need to be able to find additional scratch registers when the
probe interval is > 12 on aarch64. It also impacts the dwarf2 unwinding
notes.
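To make the scratch-register point concrete (hand-written sketch, not
generated code): the scaled unsigned-offset form of str only reaches 32760
bytes, so probes placed near the far end of a 64k interval no longer fit in
the immediate field and need a temporary:

  /* Interval 2^12: the probe offset fits in the str immediate.  */
  sub   sp, sp, #4096
  str   xzr, [sp, 4088]

  /* Interval 2^16: 65528 exceeds the 32760-byte immediate range, so a
     scratch register is needed (with the corresponding CFI bookkeeping).  */
  sub   sp, sp, #65536
  mov   x16, 65528
  str   xzr, [sp, x16]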
jeff
* Re: [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits)
2017-11-27 17:50 ` Wilco Dijkstra
@ 2017-11-27 18:44 ` Jeff Law
0 siblings, 0 replies; 13+ messages in thread
From: Jeff Law @ 2017-11-27 18:44 UTC (permalink / raw)
To: Wilco Dijkstra, Szabolcs Nagy, gcc-patches
Cc: nd, Richard Earnshaw, James Greenhalgh, Marcus Shawcroft
On 11/27/2017 10:33 AM, Wilco Dijkstra wrote:
> Szabolcs Nagy wrote:
>> On 28/10/17 05:08, Jeff Law wrote:
>>
>>> My hope would be that we simply don't ever use the params. They were
>>> done as much for *you* to experiment with as anything. I'd happy just
>>> delete them as there's essentially no guard rails to ensure their values
>>> are sane.
>>
>> so is there a consensus now that 64k guard size is what
>> gcc stack probing will assume?
>
> I think right now only AArch64 will use a 64KB probe size which is always
> enabled. It is best to hardcode this so it can't be changed or turned off,
> inadvertently or not...
>
>> if so i can propose a patch to glibc to actually have
>> that much guard by default in threads.. (i think it
>> makes sense on all 64bit targets to have bigger guard
>> and a consensus here would help making that change)
>
> Assuming a minimum 64KB thread guard size on 64-bit systems is unlikely to
> be controversial - the guard size in OS/GLIBC may be larger than the probe
> size used in GCC, so I suggest to propose a patch.
Agreed.
Jeff
* Re: [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits)
2017-11-22 18:52 ` Jeff Law
@ 2017-11-27 19:25 ` James Greenhalgh
2017-12-19 0:49 ` Jeff Law
0 siblings, 1 reply; 13+ messages in thread
From: James Greenhalgh @ 2017-11-27 19:25 UTC (permalink / raw)
To: Jeff Law
Cc: Wilco Dijkstra, gcc-patches, Richard Earnshaw, Marcus Shawcroft, nd
On Wed, Nov 22, 2017 at 06:28:24PM +0000, Jeff Law wrote:
> On 11/21/2017 04:57 AM, James Greenhalgh wrote:
> > I've finally built up enough courage to start getting my head around this...
> Can't blame you for avoiding :-) This stuff isn't my idea of fun either.
Right, here's where I'm up to... I think that I have a reasonable grasp of
the overall intentions for this patch and what we're trying to do.
We want to put out regular probes (writes of zero) to the stack, at a regular
interval (given by PARAM_STACK_CLASH_PROTECTION_PROBE_INTERVAL), starting
at a known offset from the stack base (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE).
There's some subtlety around when and where we need probes and what the
efficient sequences to achieve that would be. Handling that complexity, and
the trouble of having two possible stack adjustments to protect (initial
adjustment and final adjustment), is where most of the body of the patch
comes from.
So, as far as I can read there are a couple of design points that I need
clarified. If we get through these it might be possible for Wilco to pick
up the patch and optimise the cases he's marked as a concern earlier.
Let me try to expand on these...
One point of contention in the review so far has been whether the default
values for PARAM_STACK_CLASH_PROTECTION_PROBE_INTERVAL and
PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE are correct for AArch64. As this
has some implications for the algorithm we would like to use for probing,
where exactly we'd like to emit the probes, and might need a glibc patch
it will be helpful to narrow this down to an answer.
I see two positions.
Wilco would like both parameters set to 16 - that means (ignoring the
caller-protected stuff for now) a guard page size of 64k, and a probe
every 64k.
Jeff's current patch has a 64k guard page size, and a probe every 4k.
What concerns me is this from Jeff from couple of emails back:
> Given the ISA I wouldn't expect an interval > 12 to be useful or necessarily
> even work correctly.
There's a clear difference of opinion here!
I'd guess the difference of opinion is whether you need to probe every page
(4k) or whether you need to probe on each multiple of the minimum guard page
size (64k), but maybe it is just a difficulty in juggling registers? For
Wilco's position to work, we'd need to be happy emitting a probe once for
every 64k - from the sounds of the other comments on this thread, there's no
problem with that - we could have a minimum size of 64k for
PARAM_STACK_CLASH_PROTECTION_PROBE_INTERVAL (though the current code would
probably need to change to accomodate that). Clearly we can reduce the
number of probes required to 1/16th if this were feasible.
So, first design question - other than potential troubles grabbing scratch
registers (we'll revisit this), is there a reason that would stop us deciding
to default to 64k for both parameters on AArch64?
Where this gets messy is handling the stack write. The current design
will drop the stack pointer, then emit a probe back near the top where we
were previously sitting, i.e.
sp = sp - 4096
*(sp + 4088) = 0
For larger probe intervals, we end up juggling bigger numbers, which makes
the calculation slightly more unwieldy.
One suggestion from Wilco would be to change this sequence to:
sp = sp - 4096
*(sp) = 0
In other words, drop the stack and then emit a probe where we are now
sitting. That would save us juggling the large number for the load offset.
Is there an obvious flaw I've missed in probing at the bottom of the range
rather than the top? It seems to me that as long as we were to ensure we
managed to emit the first probe (the one at the bottom of GUARD_SIZE)
correctly, we could probably get away with this - but I've missed some of the
early discussions on the topic.
Maybe the reasoning for this comes from the optimisation for sharing
responsibility between caller and callee - which I'll admit to not fully
understanding yet?
I think my understanding of the ideal code sequence from Wilco would be
(modulo corner cases):
/* Initial probe, assuming that our caller was well behaved, we need
to probe 64k below where they were last obliged to probe. First drop
the stack pointer to the right place, then probe as required. */
sp = sp - (GUARD_PAGE_SIZE - caller_saved_1k);
*(sp) = 0;
/* Subsequent probes, one every probe interval. */
sp = sp - PROBE_INTERVAL;
*(sp) = 0;
sp = sp - PROBE_INTERVAL;
*(sp) = 0;
sp = sp - PROBE_INTERVAL;
*(sp) = 0;
sp = sp - residual;
*(sp) = 0;
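In instruction terms my sketch of that sequence would be roughly the
following (purely illustrative; the 1k caller-protected reserve, the
residual amount and the use of x9 as a scratch register are my own
assumptions):

  // Initial drop: guard size (64k) minus the 1k of outgoing-args space
  // the caller is assumed to have protected.  64512 does not fit an
  // add/sub immediate, so it needs a scratch register.
  mov     x9, #64512
  sub     sp, sp, x9
  str     xzr, [sp]              // probe at the bottom of the first drop
  // Subsequent probes, one every 64k probe interval.
  sub     sp, sp, #65536
  str     xzr, [sp]
  sub     sp, sp, #65536
  str     xzr, [sp]
  // Residual allocation (whatever is left, less than one interval).
  sub     sp, sp, #496           // 496 is just an example residual
  str     xzr, [sp]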
Is this good enough to provide the level of protection we need?
I think if I can understand these two design questions I can make a better
job of the review here. I still don't want to push for any optimisation
for AArch64 (we can hack that in ourselves :-) ); I just want to understand
where we can be flexible, and where we'd be shooting ourselves in the foot
if we screwed up. Thanks for the earlier replies, they've been helpful.
Thanks,
James
* Re: [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits)
2017-11-27 16:32 ` Szabolcs Nagy
2017-11-27 17:50 ` Wilco Dijkstra
2017-11-27 18:30 ` Jeff Law
@ 2017-11-28 16:31 ` Rich Felker
2 siblings, 0 replies; 13+ messages in thread
From: Rich Felker @ 2017-11-28 16:31 UTC (permalink / raw)
To: Szabolcs Nagy
Cc: Jeff Law, Wilco Dijkstra, gcc-patches, nd, Richard Earnshaw,
James Greenhalgh, Marcus Shawcroft
On Mon, Nov 27, 2017 at 03:48:41PM +0000, Szabolcs Nagy wrote:
> On 28/10/17 05:08, Jeff Law wrote:
> > On 10/13/2017 02:26 PM, Wilco Dijkstra wrote:
> >> For larger frames the first oddity is that there are now 2 separate params
> >> controlling how probes are generated:
> >>
> >> stack-clash-protection-guard-size (default 12, but set to 16 on AArch64)
> >> stack-clash-protection-probe-interval (default 12)
> >>
> >> I don't see how this makes sense. These values are closely related, so if
> >> one is different from the other, probing becomes ineffective/incorrect.
> >> For example we generate code that trivially bypasses the guard despite
> >> all the probing:
> > My hope would be that we simply don't ever use the params. They were
> > done as much for *you* to experiment with as anything. I'd happy just
> > delete them as there's essentially no guard rails to ensure their values
> > are sane.
>
> so is there a consensus now that 64k guard size is what
> gcc stack probing will assume?
I would very much prefer the compiler never assume more than 1 guard
page is present. I understand the performance argument, but I think
it's mostly in artificial scenarios where it makes a difference:
If a function has 4k and actually uses it, you're almost certainly
looking at >4k cycles, and a few cycles to probe the guard page are
irrelevant (dominated by the actual work).
If a function has >4k of local data and code paths that can easily be
determined not to use it (e.g. the big data is in a conditional block),
the compiler should shrink-wrap them anyway in order to avoid
pathologically blowing away the stack like LLVM does all over
the place (hoisting stack allocations up as far as it can). The probe
can then likewise be moved with the shrink-wrapping.
The only remaining case is functions which have >4k of local data
that's mostly unused, but with no easy way to determine that it's not used,
or where which small part is actually used is data-dependent. This is
the case where the probe is mildly costly, but it seems very unlikely
to be a common case in real-world usage.
> if so i can propose a patch to glibc to actually have
> that much guard by default in threads.. (i think it
> makes sense on all 64bit targets to have bigger guard
> and a consensus here would help making that change)
I don't object to making musl libc default to >64k (I'd prefer 68k to
avoid preserving alignment mod a large power of 2) guard size on
64-bit archs (on 32-bit, including ILP32, though, it's prohibitive
because it exhausts virtual address space; this may affect aarch64's
ILP32 model). It doesn't have any significant cost and it's useful
hardening. But I think it would be unfortunate if smaller guard sizes,
which applications can request/set, were left unsafe due to the
compiler making a tradeoff for performance that doesn't actually get
you any measurable real-world performance benefits.
Rich
* Re: [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits)
2017-11-27 19:25 ` James Greenhalgh
@ 2017-12-19 0:49 ` Jeff Law
0 siblings, 0 replies; 13+ messages in thread
From: Jeff Law @ 2017-12-19 0:49 UTC (permalink / raw)
To: James Greenhalgh
Cc: Wilco Dijkstra, gcc-patches, Richard Earnshaw, Marcus Shawcroft, nd
On 11/27/2017 11:44 AM, James Greenhalgh wrote:
> On Wed, Nov 22, 2017 at 06:28:24PM +0000, Jeff Law wrote:
>> On 11/21/2017 04:57 AM, James Greenhalgh wrote:
>>> I've finally built up enough courage to start getting my head around this...
>> Can't blame you for avoiding :-) This stuff isn't my idea of fun either.
>
> Right, here's where I'm up to... I think that I have a reasonable grasp of
> the overall intentions for this patch and what we're trying to do.
Good. I realize it's somewhat painful. Wilco's feedback WRT what is
and what is not possible in aarch64 prologues has been invaluable in
simplifying things. But in the end it's just painful.
>
> We want to put out regular probes (writes of zero) to the stack, at a regular
> interval (given by PARAM_STACK_CLASH_PROTECTION_PROBE_INTERVAL ), starting
> at a known offset from the stack
> base ( PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE ). There's some subtlety
> around when and where we need probes, what the efficient sequences to
> achieve that would be, etc. and handling that complexity, and the trouble
> of having two possible stack adjustments to protect (initial adjustment and
> final adjustment), is where most of the body of the patch comes from.
Right.
>
> So, as far as I can read there are a couple of design points that I need
> clarified. If we get through these it might be possible for Wilco to pick
> up the patch and optimise the cases he's marked as a concern earlier.
>
> Let me try to expand on these...
>
> One point of contention in the review so far has been whether the default
> values for PARAM_STACK_CLASH_PROTECTION_PROBE_INTERVAL and
> PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE are correct for AArch64. As this
> has some implications for the algorithm we would like to use for probing,
> where exactly we'd like to emit the probes, and might need a glibc patch,
> it will be helpful to narrow this down to an answer.
So the guard size came out of a discussion with Richard Earnshaw. I
think there's a general agreement on that. It was explicitly decided to
punt ILP32 on aarch64 and not worry about it. So that's really the one
thing that could throw a wrench into the guard size.
>
> I see two positions.
>
> Wilco would like both parameters set to 16 - that means (ignoring the
> caller-protected stuff for now) a guard page size of 64k, and a probe
> every 64k.
>
> Jeff's current patch has a 64k guard page size, and a probe every 4k.
>
> What concerns me is this from Jeff a couple of emails back:
>
>> Given the ISA I wouldn't expect an interval > 12 to be useful or necessarily
>> even work correctly.
>
> There's a clear difference of opinion here!
So the issue is that once you get an interval > 12 you're unable to
adjust the stack pointer in a single instruction without using a scratch
register.
*Most* of the code tries to handle this. The notable exception is
aarch64_output_probe_stack_range which is used by both stack clash
protection and Ada style -fstack-check.
For stack clash prologues we get there via
aarch64_expand_prologue -> aarch64_allocate_and_probe_stack_space
Which implies that we have to have another scratch register in the
prologue code. We already use IP0_REGNUM and IP1_REGNUM depending on
context. So neither of those is available. I'm not familiar enough
with aarch64 to know if there's another register we can safely use.
Without another register we can safely use, probing at intervals > 4k is
a non-starter AFAICT.
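To make the register problem concrete (again, only an illustration, not
code from the patch; x9 stands in for the scratch register we don't have
free here):

  // A plain page-sized drop encodes directly as an add/sub immediate:
  sub     sp, sp, #4096
  // but the combined adjustments the prologue actually wants -- for
  // instance the guard size minus the 1k outgoing-args reserve, or any
  // residual over 4095 that isn't a multiple of 4096 -- do not fit the
  // immediate encoding, so they need a scratch register:
  mov     x9, #64512
  sub     sp, sp, x9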
> I'd guess the difference of opinion is whether you need to probe every page
> (4k) or whether you need to probe on each multiple of the minimum guard page
> size (64k), but maybe it is just a difficulty in juggling registers? For
> Wilco's position to work, we'd need to be happy emitting a probe once for
> every 64k - from the sounds of the other comments on this thread, there's no
> problem with that - we could have a minimum size of 64k for
> PARAM_STACK_CLASH_PROTECTION_PROBE_INTERVAL (though the current code would
> probably need to change to accommodate that). Clearly we can reduce the
> number of probes required to 1/16th if this were feasible.
It's primarily a matter of getting a register.
>
> So, first design question - other than potential troubles grabbing scratch
> registers (we'll revisit this), is there a reason that would stop us deciding
> to default to 64k for both parameters on AArch64?
Florian might have comments here.
>
> Where this gets messy is handling the stack write. The current design
> will drop the stack pointer, then emit a probe back near the top where we
> were previously sitting, i.e.
>
> sp = sp - 4096
> *(sp + 4088) = 0
>
> For larger probe intervals, we end up juggling bigger numbers, which makes
> the calculation slightly more unwieldy.
Right.
>
> One suggestion from Wilco would be to change this sequence to:
>
> sp = sp - 4096
> *(sp) = 0
>
> In other words, drop the stack and then emit a probe where we are now
> sitting. That would save us juggling the large number for the load offset.
I think this runs afoul of valgrind. Being valgrind-safe is IMHO a
design requirement. I've already been chastised for hitting the red
zone too many times :-) I'm told there is no red zone in the aarch64 ABI.
Jeff
end of thread, other threads:[~2017-12-19 0:49 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-29 15:15 [RFA][PATCH] Stack clash protection 07/08 -- V4 (aarch64 bits) Jeff Law
[not found] ` <DB6PR0801MB205340E4DACC1CF3696A150983480@DB6PR0801MB2053.eurprd08.prod.outlook.com>
2017-10-13 16:14 ` Wilco Dijkstra
2017-10-13 20:47 ` Wilco Dijkstra
2017-10-28 11:23 ` Jeff Law
2017-11-21 11:59 ` James Greenhalgh
2017-11-22 18:52 ` Jeff Law
2017-11-27 19:25 ` James Greenhalgh
2017-12-19 0:49 ` Jeff Law
2017-11-27 16:32 ` Szabolcs Nagy
2017-11-27 17:50 ` Wilco Dijkstra
2017-11-27 18:44 ` Jeff Law
2017-11-27 18:30 ` Jeff Law
2017-11-28 16:31 ` Rich Felker