From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Sandiford <Richard.Sandiford@arm.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
nd <nd@arm.com>, James Greenhalgh <James.Greenhalgh@arm.com>,
Richard Earnshaw <Richard.Earnshaw@arm.com>,
Marcus Shawcroft <Marcus.Shawcroft@arm.com>
Subject: Re: [PATCH][GCC][AArch64] Add support for SVE stack clash probing [patch (2/7)]
Date: Thu, 27 Sep 2018 10:14:00 -0000
Message-ID: <20180927100225.GA14440@arm.com>
In-Reply-To: <20180926082027.GA27633@arm.com>
[-- Attachment #1: Type: text/plain, Size: 25558 bytes --]
Hi All,
It turns out the testsuite didn't have a case which would cause a large enough
spill to enter the loop. After creating one I noticed a bug in the loop and
fixed it.
The loop is now:
.cfi_startproc
mov x15, sp
cntb x16, all, mul #11
add x16, x16, 304
.cfi_def_cfa_register 15
.SVLPSPL0:
cmp x16, 61440
b.lt .SVLPEND0
sub sp, sp, 61440
str xzr, [sp, 0]
sub x16, x16, 61440
b .SVLPSPL0
.SVLPEND0:
sub sp, sp, x16
.cfi_escape 0xf,0xc,0x8f,0,0x92,0x2e,0,0x8,0x58,0x1e,0x23,0xb0,0x2,0x22
for a 64KB guard size.
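For reference, the .cfi_escape above decodes as follows (a by-hand decode
assuming the standard DWARF opcode numbering, with AArch64 DWARF register 46
being VG, the number of 64-bit granules per SVE vector):

  0xf            DW_CFA_def_cfa_expression
  0xc            expression length (12 bytes)
  0x8f,0         DW_OP_breg31 (sp + 0)
  0x92,0x2e,0    DW_OP_bregx (reg 46 = VG, offset 0)
  0x8,0x58       DW_OP_const1u 88
  0x1e           DW_OP_mul
  0x23,0xb0,0x2  DW_OP_plus_uconst 304
  0x22           DW_OP_plus

i.e. CFA = sp + VG * 88 + 304, which matches the allocation above, since cntb
(the number of bytes in one SVE vector) is VG * 8 and 8 * 11 == 88.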
I'm also adding a new testcase that causes a large enough spill to enter the loop.
Ok for trunk?
Thanks,
Tamar
gcc/
2018-09-27 Tamar Christina <tamar.christina@arm.com>
PR target/86486
* config/aarch64/aarch64-protos.h (aarch64_output_probe_sve_stack_clash): New.
* config/aarch64/aarch64.c (aarch64_output_probe_sve_stack_clash,
aarch64_uimm12_nearest_value): New.
(aarch64_allocate_and_probe_stack_space): Add SVE specific section.
* config/aarch64/aarch64.md (probe_sve_stack_clash): New.
gcc/testsuite/
2018-09-27 Tamar Christina <tamar.christina@arm.com>
PR target/86486
* gcc.target/aarch64/stack-check-prologue-16.c: New test.
* gcc.target/aarch64/stack-check-cfa-3.c: New test.
* gcc.target/aarch64/sve/struct_vect_24.c: New test.
* gcc.target/aarch64/sve/struct_vect_24_run.c: New test.
On 09/26/2018 09:20, Tamar Christina wrote:
> Hi Richard,
>
> I've added a new loop that should also exit early as described in my previous email.
>
> An example sequence is:
>
> .cfi_startproc
> mov x15, sp
> cntb x16, all, mul #11
> add x16, x16, 304
> .cfi_def_cfa_register 15
> cmp x16, 61440
> b.lt .SVLPEND0
> .SVLPSPL0:
> sub sp, sp, 61440
> str xzr, [sp, 0]
> subs x16, x16, 61440
> b.hs .SVLPSPL0
> add x16, x16, 61440
> .SVLPEND0:
> sub sp, sp, x16
> .cfi_escape 0xf,0xc,0x8f,0,0x92,0x2e,0,0x8,0x58,0x1e,0x23,0xb0,0x2,0x22
>
> for a 64KB guard size, and for a 4KB guard size
>
> .cfi_startproc
> mov x15, sp
> cntb x16, all, mul #11
> add x16, x16, 304
> .cfi_def_cfa_register 15
> cmp x16, 3072
> b.lt .SVLPEND0
> .SVLPSPL0:
> sub sp, sp, 3072
> str xzr, [sp, 0]
> subs x16, x16, 3072
> b.hs .SVLPSPL0
> add x16, x16, 3072
> .SVLPEND0:
> sub sp, sp, x16
> .cfi_escape 0xf,0xc,0x8f,0,0x92,0x2e,0,0x8,0x58,0x1e,0x23,0xb0,0x2,0x22
>
>
> This has about the same semantics as alloca, except we prioritize the common case
> where no probe is required. We also change the amount we adjust the stack and
> the probing interval to be the nearest value to `guard size - abi buffer` that
> fits in the 12-bit shifted immediate used by cmp.
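> 
> As a standalone illustration, a minimal C sketch of that rounding (this
> mirrors the aarch64_uimm12_nearest_value helper in the patch below; the
> name and types here are just for illustration):
> 
>   /* Round VAL down to the nearest value usable as a 12-bit unsigned
>      immediate, optionally shifted left by 12, as accepted by cmp/sub.  */
>   static long long
>   nearest_uimm12 (long long val)
>   {
>     if ((val & 0xfffLL) == val)
>       return val;                   /* Already fits unshifted.  */
>     return val & (0xfffLL << 12);   /* Keep only the shifted-immediate bits.  */
>   }
> 
> For a 64KB guard with a 1KB ABI buffer this gives
> nearest_uimm12 (65536 - 1024) == 61440, and for a 4KB guard
> nearest_uimm12 (4096 - 1024) == 3072, matching the sequences above.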
>
> While this means we may probe a bit more often than strictly required, in
> practice the number of SVE vectors you'd have to spill to require a probe at
> all is significant, let alone enough to enter the loop more than once.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues in the SVE
> testsuite. The target was tested with stack clash protection both on and off
> by default.
>
> Ok for trunk?
>
> Thanks,
> Tamar
>
> gcc/
> 2018-09-26 Tamar Christina <tamar.christina@arm.com>
>
> PR target/86486
> * config/aarch64/aarch64-protos.h (aarch64_output_probe_sve_stack_clash): New.
> * config/aarch64/aarch64.c (aarch64_output_probe_sve_stack_clash,
> aarch64_uimm12_nearest_value): New.
> (aarch64_allocate_and_probe_stack_space): Add SVE specific section.
> * config/aarch64/aarch64.md (probe_sve_stack_clash): New.
>
> gcc/testsuite/
> 2018-09-26 Tamar Christina <tamar.christina@arm.com>
>
> PR target/86486
> * gcc.target/aarch64/stack-check-prologue-16.c: New test.
> * gcc.target/aarch64/stack-check-cfa-3.c: New test.
>
>
> On 09/20/2018 10:23, Tamar Christina wrote:
> > Hi Richard,
> >
> > On 09/11/2018 16:20, Richard Sandiford wrote:
> > > Tamar Christina <Tamar.Christina@arm.com> writes:
> > > >> > +
> > > >> > + /* No probe leave. */
> > > >> > + ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_end_lab);
> > > >> > + return "";
> > > >>
> > > >> With the CFA stuff and constant load, I think this works out as:
> > > >>
> > > >> ---------------------------------------------
> > > >> # 12 insns
> > > >> mov r15, base
> > > >> mov adjustment, N
> > > >> 1:
> > > >> cmp adjustment, guard_size
> > > >> b.lt 2f
> > > >> sub base, base, guard_size
> > > >> str xzr, [base, limit]
> > > >> sub adjustment, adjustment, guard_size
> > > >> b 1b
> > > >> 2:
> > > >> sub base, base, adjustment
> > > >> cmp adjustment, limit
> > > >> b.le 3f
> > > >> str xzr, [base, limit]
> > > >> 3:
> > > >> ---------------------------------------------
> > > >>
> > > >> What do you think about something like:
> > > >>
> > > >> ---------------------------------------------
> > > >> # 10 insns
> > > >> mov adjustment, N
> > > >> sub r15, base, adjustment
> > > >> subs adjustment, adjustment, min_probe_threshold
> > > >> b.lo 2f
> > > >> 1:
> > > >> add base, x15, adjustment
> > > >> str xzr, [base, 0]
> > > >> subs adjustment, adjustment, 16
> > > >> and adjustment, adjustment, ~(guard_size-1)
> > > >> b.hs 1b
> > > >> 2:
> > > >> mov base, r15
> > > >> ---------------------------------------------
> > > >>
> > > >> or (with different trade-offs):
> > > >>
> > > >> ---------------------------------------------
> > > >> # 11 insns
> > > >> mov adjustment, N
> > > >> sub r15, base, adjustment
> > > >> subs adjustment, adjustment, min_probe_threshold
> > > >> b.lo 2f
> > > >> # Might be 0, leading to a double probe
> > > >> and r14, adjustment, guard_size-1
> > > >> 1:
> > > >> add base, x15, adjustment
> > > >> str xzr, [base, 0]
> > > >> subs adjustment, adjustment, r14
> > > >> mov r14, guard_size
> > > >> b.hs 1b
> > > >> 2:
> > > >> mov base, r15
> > > >> ---------------------------------------------
> > > >>
> > > >> or (longer, but with a simpler loop):
> > > >>
> > > >> ---------------------------------------------
> > > >> # 12 insns
> > > >> mov adjustment, N
> > > >> sub r15, base, adjustment
> > > >> subs adjustment, adjustment, min_probe_threshold
> > > >> b.lo 2f
> > > >> str xzr, [base, -16]!
> > > >> sub adjustment, adjustment, 32
> > > >> and adjustment, adjustment, -(guard_size-1)
> > > >> 1:
> > > >> add base, x15, adjustment
> > > >> str xzr, [base, 0]
> > > >> subs adjustment, adjustment, guard_size
> > > >> b.hs 1b
> > > >> 2:
> > > >> mov base, r15
> > > >> ---------------------------------------------
> > > >>
> > > >> with the CFA based on r15+offset?
> > > >>
> > > >> These loops probe more often than necessary in some cases,
> > > >> but they only need a single branch in the common case that
> > > >> ADJUSTMENT <= MIN_PROBE_THRESHOLD.
> > > >
> > > > I haven't changed the loop yet because I'm a bit on the fence about
> > > > whether the implementation difficulties would outweigh the benefits.
> > > > We are planning on doing something smarter for SVE so optimizing these
> > > > loops only to replace them later may not be time well spent now.
> > > >
> > > > The problem is that to support both 4KB and 64KB pages, instructions such
> > > > as subs would require different immediates and shifts. Granted we technically
> > > > only support these two so I could hardcode the values, but that would mean
> > > > these functions are less general than the rest.
> > >
> > > Because of the min_probe_threshold? You could conservatively clamp it
> > > to the next lowest value that's in range, which we could do without
> > > having to hard-code specific values. I think it would be better
> > > to do that even with the current code, since hard-coding 2048 with:
> > >
> > > /* Test if ADJUSTMENT < RESIDUAL_PROBE_GUARD; in principle any power of two
> > > larger than 1024B would work, but we need one that works for all supported
> > > guard-sizes. What we actually want to check is guard-size - 1KB, but this
> > > immediate won't fit inside a cmp without requiring a temporary, so instead we
> > > just accept a smaller immediate that doesn't require one; we may probe a bit
> > > more often, but that doesn't matter much in the long run. */
> > >
> > > seems a bit of a hack.
> > >
> > > > If you think it would be worthwhile, I'd be happy to use one of these
> > > > loops instead.
> > >
> > > Yeah, I still think we should do this unless we can commit to doing
> > > the optimised version by a specific date, and that date is soon enough
> > > that the optimisation could reasonably be backported to GCC 8.
> > >
> >
> > While implementing these loops I found them a bit hard to follow, or rather a bit
> > difficult to prove correct; to someone looking at the code it may not be trivially
> > clear what they do. I believe the main concern here is that the common case
> > isn't short-circuited, e.g. spills small enough not to require a probe. So how about
> >
> > mov r15, base
> > mov adjustment, N
> > cmp adjustment, nearest(min_probe_threshold)
> > b.lt end
> > begin:
> > sub base, base, nearest(min_probe_threshold)
> > str xzr, [base, 0]
> > subs adjustment, adjustment, nearest(min_probe_threshold)
> > b.hs begin
> > end:
> > sub base, base, adjustment
> >
> > as an alternative? This is 9 insns but also much simpler, and it follows the same
> > semantics as the other probing code. It has the downside that we probe a bit more
> > often than we need to, but in the average case you'd likely not enter the loop more
> > than once, so I'd expect the number of probes in real-world usage to be the same as
> > with the previous code, since you'd have to spill a significant number of SVE
> > vectors to enter the loop at all, let alone iterate.
> >
> > This is still safe, as the only invariant we have to hold is that we never drop
> > the SP by more than a page at a time; dropping by less than a page is fine.
> >
> > nearest just rounds down to the nearest value that fits in a 12-bit shifted immediate.
> >
> >
> > Thanks,
> > Tamar
> >
> > > > @@ -4830,11 +4929,11 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2,
> > > > }
> > > > }
> > > >
> > > > - HOST_WIDE_INT size;
> > > > + HOST_WIDE_INT size = 0;
> > > > /* If SIZE is not large enough to require probing, just adjust the stack and
> > > > exit. */
> > > > - if (!poly_size.is_constant (&size)
> > > > - || known_lt (poly_size, min_probe_threshold)
> > > > + if ((poly_size.is_constant (&size)
> > > > + && known_lt (poly_size, min_probe_threshold))
> > > > || !flag_stack_clash_protection)
> > > > {
> > > > aarch64_sub_sp (temp1, temp2, poly_size, frame_related_p);
> > >
> > > I still think we should remove this poly_size.is_constant, and instead:
> > >
> > > > @@ -4842,9 +4941,64 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2,
> > > > }
> > >
> > > > if (dump_file)
> > > > - fprintf (dump_file,
> > > > - "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC " bytes"
> > > > - ", probing will be required.\n", size);
> > > > + {
> > > > + if (poly_size.is_constant ())
> > > > + fprintf (dump_file,
> > > > + "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC
> > > > + " bytes, probing will be required.\n", size);
> > > > + else
> > > > + {
> > > > + fprintf (dump_file, "Stack clash SVE prologue: ");
> > > > + print_dec (poly_size, dump_file);
> > > > + fprintf (dump_file, " bytes, dynamic probing will be required.\n");
> > > > + }
> > > > + }
> > > > +
> > > > + /* Handle the SVE non-constant case first. */
> > > > + if (!poly_size.is_constant ())
> > >
> > > ...use is_constant (&size) here, and put the dump messages for the
> > > constant and non-constant cases in their respective constant and
> > > non-constant blocks. That way each use of "size" is directly protected
> > > by an is_constant call, and there's no need to initialise size to 0.
> > >
> > > The non-constant case doesn't have the new special handling of
> > > final_adjustment_p, so I think the !is_constant block should assert
> > > !final_adjustment_p.
> > >
> > > Thanks,
> > > Richard
> >
> > --
>
> --
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index ef95fc829b83886e2ff00e4664e31af916e99b0c..e2d8734a8d5e513588e3b0318e9c67fdaebdf0d4 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -453,6 +453,7 @@ void aarch64_asm_output_labelref (FILE *, const char *);
> void aarch64_cpu_cpp_builtins (cpp_reader *);
> const char * aarch64_gen_far_branch (rtx *, int, const char *, const char *);
> const char * aarch64_output_probe_stack_range (rtx, rtx);
> +const char * aarch64_output_probe_sve_stack_clash (rtx, rtx, rtx, rtx);
> void aarch64_err_no_fpadvsimd (machine_mode);
> void aarch64_expand_epilogue (bool);
> void aarch64_expand_mov_immediate (rtx, rtx, rtx (*) (rtx, rtx) = 0);
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index d4b13d48d852a70848fc7c51fd867e776efb5e55..d189198a377e698964d34ef03a4c1a92fe1be4f0 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -208,6 +208,7 @@ static bool aarch64_builtin_support_vector_misalignment (machine_mode mode,
> static machine_mode aarch64_simd_container_mode (scalar_mode, poly_int64);
> static bool aarch64_print_address_internal (FILE*, machine_mode, rtx,
> aarch64_addr_query_type);
> +static HOST_WIDE_INT aarch64_uimm12_nearest_value (HOST_WIDE_INT val);
>
> /* Major revision number of the ARM Architecture implemented by the target. */
> unsigned aarch64_architecture_version;
> @@ -3973,6 +3974,89 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
> return "";
> }
>
> +/* Emit the probe loop for doing stack clash probes and stack adjustments for
> + SVE. This emits probes from BASE to BASE - ADJUSTMENT based on a guard size
> + of GUARD_SIZE. When a probe is emitted it is done at MIN_PROBE_THRESHOLD bytes
> + from the current BASE, at intervals of MIN_PROBE_THRESHOLD. By the end of this
> + function BASE = BASE - ADJUSTMENT. */
> +
> +const char *
> +aarch64_output_probe_sve_stack_clash (rtx base, rtx adjustment,
> + rtx min_probe_threshold, rtx guard_size)
> +{
> + /* This function is not allowed to use any instruction generation function
> + like gen_ and friends. If you do you'll likely ICE during CFG validation,
> + so instead emit the code you want using output_asm_insn. */
> + gcc_assert (flag_stack_clash_protection);
> + gcc_assert (CONST_INT_P (min_probe_threshold) && CONST_INT_P (guard_size));
> + gcc_assert (INTVAL (guard_size) > INTVAL (min_probe_threshold));
> +
> + /* The minimum required allocation before the residual requires probing. */
> + HOST_WIDE_INT residual_probe_guard = INTVAL (min_probe_threshold);
> +
> + /* Clamp the value down to the nearest value that can be used with a cmp. */
> + residual_probe_guard = aarch64_uimm12_nearest_value (residual_probe_guard);
> + rtx probe_offset_value_rtx = gen_int_mode (residual_probe_guard, Pmode);
> +
> + gcc_assert (INTVAL (min_probe_threshold) >= residual_probe_guard);
> + gcc_assert (aarch64_uimm12_shift (residual_probe_guard));
> +
> + static int labelno = 0;
> + char loop_start_lab[32];
> + char loop_end_lab[32];
> + rtx xops[2];
> +
> + ASM_GENERATE_INTERNAL_LABEL (loop_start_lab, "SVLPSPL", labelno);
> + ASM_GENERATE_INTERNAL_LABEL (loop_end_lab, "SVLPEND", labelno++);
> +
> + /* Test if ADJUSTMENT < RESIDUAL_PROBE_GUARD. */
> + xops[0] = adjustment;
> + xops[1] = probe_offset_value_rtx;
> + output_asm_insn ("cmp\t%0, %1", xops);
> +
> + /* Branch to end if not enough adjustment to probe. */
> + fputs ("\tb.lt\t", asm_out_file);
> + assemble_name_raw (asm_out_file, loop_end_lab);
> + fputc ('\n', asm_out_file);
> +
> + /* Emit loop start label. */
> + ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_start_lab);
> +
> + /* BASE = BASE - RESIDUAL_PROBE_GUARD. */
> + xops[0] = base;
> + xops[1] = gen_int_mode (residual_probe_guard, Pmode);
> + output_asm_insn ("sub\t%0, %0, %1", xops);
> +
> + /* Probe at BASE. */
> + xops[1] = const0_rtx;
> + output_asm_insn ("str\txzr, [%0, %1]", xops);
> +
> + /* ADJUSTMENT = ADJUSTMENT - RESIDUAL_PROBE_GUARD. */
> + xops[0] = adjustment;
> + xops[1] = probe_offset_value_rtx;
> + output_asm_insn ("subs\t%0, %0, %1", xops);
> +
> + /* Branch to start if still more bytes to allocate. */
> + fputs ("\tb.hs\t", asm_out_file);
> + assemble_name_raw (asm_out_file, loop_start_lab);
> + fputc ('\n', asm_out_file);
> +
> + /* ADJUSTMENT = ADJUSTMENT + RESIDUAL_PROBE_GUARD, we need to undo the last
> + subtract in order to know how much to drop the stack by. */
> + xops[0] = adjustment;
> + xops[1] = probe_offset_value_rtx;
> + output_asm_insn ("add\t%0, %0, %1", xops);
> +
> + /* No more probes needed; loop exit. */
> + ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_end_lab);
> +
> + /* BASE = BASE - ADJUSTMENT. */
> + xops[0] = base;
> + xops[1] = adjustment;
> + output_asm_insn ("sub\t%0, %0, %1", xops);
> + return "";
> +}
> +
> /* Determine whether a frame chain needs to be generated. */
> static bool
> aarch64_needs_frame_chain (void)
> @@ -4835,21 +4919,76 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2,
> }
> }
>
> - HOST_WIDE_INT size;
> /* If SIZE is not large enough to require probing, just adjust the stack and
> exit. */
> - if (!poly_size.is_constant (&size)
> - || known_lt (poly_size, min_probe_threshold)
> + if (known_lt (poly_size, min_probe_threshold)
> || !flag_stack_clash_protection)
> {
> aarch64_sub_sp (temp1, temp2, poly_size, frame_related_p);
> return;
> }
>
> + HOST_WIDE_INT size;
> + /* Handle the SVE non-constant case first. */
> + if (!poly_size.is_constant (&size))
> + {
> +
> + if (dump_file)
> + {
> + fprintf (dump_file, "Stack clash SVE prologue: ");
> + print_dec (poly_size, dump_file);
> + fprintf (dump_file, " bytes, dynamic probing will be required.\n");
> + }
> +
> + /* First calculate the number of bytes we're actually spilling. */
> + aarch64_add_offset (Pmode, temp1, CONST0_RTX (GET_MODE (temp1)),
> + poly_size, temp1, temp2, false, true);
> +
> + rtx_insn *insn = get_last_insn ();
> +
> + if (frame_related_p)
> + {
> + /* This is done to provide unwinding information for the stack
> + adjustments we're about to do. However, to prevent the optimizers
> + from removing the R15 move and leaving the CFA note (which would be
> + very wrong), we tie the old and new stack pointers together.
> + The tie will expand to nothing, but the optimizers will not touch
> + the instruction. */
> + rtx stack_ptr_copy = gen_rtx_REG (Pmode, R15_REGNUM);
> + emit_move_insn (stack_ptr_copy, stack_pointer_rtx);
> + emit_insn (gen_stack_tie (stack_ptr_copy, stack_pointer_rtx));
> +
> + /* We want the CFA independent of the stack pointer for the
> + duration of the loop. */
> + add_reg_note (insn, REG_CFA_DEF_CFA, stack_ptr_copy);
> + RTX_FRAME_RELATED_P (insn) = 1;
> + }
> +
> + rtx probe_const = gen_int_mode (min_probe_threshold, DImode);
> + rtx guard_const = gen_int_mode (guard_size, DImode);
> +
> + insn = emit_insn (gen_probe_sve_stack_clash (stack_pointer_rtx,
> + stack_pointer_rtx, temp1,
> + probe_const, guard_const));
> +
> + /* Now reset the CFA register if needed. */
> + if (frame_related_p)
> + {
> + add_reg_note (insn, REG_CFA_DEF_CFA,
> + gen_rtx_PLUS (Pmode, stack_pointer_rtx,
> + gen_int_mode (poly_size, Pmode)));
> + RTX_FRAME_RELATED_P (insn) = 1;
> + }
> +
> + return;
> + }
> +
> if (dump_file)
> - fprintf (dump_file,
> - "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC " bytes"
> - ", probing will be required.\n", size);
> + {
> + fprintf (dump_file,
> + "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC
> + " bytes, probing will be required.\n", size);
> + }
>
> /* Round size to the nearest multiple of guard_size, and calculate the
> residual as the difference between the original size and the rounded
> @@ -5458,6 +5597,16 @@ aarch64_uimm12_shift (HOST_WIDE_INT val)
> );
> }
>
> +/* Returns the nearest value less than or equal to VAL that fits as a 12-bit
> + unsigned immediate, optionally left-shifted by 12. */
> +static HOST_WIDE_INT
> +aarch64_uimm12_nearest_value (HOST_WIDE_INT val)
> +{
> + if ((val & (((HOST_WIDE_INT) 0xfff) << 0)) == val)
> + return val;
> +
> + return val & (((HOST_WIDE_INT) 0xfff) << 12);
> +}
>
> /* Return true if val is an immediate that can be loaded into a
> register by a MOVZ instruction. */
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index b8da13f14fa9990e8fdc3c71ed407c8afc65a324..b422713019e5063babec1fb81d0dfc7b50c76038 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -6464,6 +6464,25 @@
> [(set_attr "length" "32")]
> )
>
> +;; This instruction is used to generate the stack-clash stack adjustment and
> +;; probing loop. We can't change the control flow during prologue and epilogue
> +;; code generation, so we must emit a volatile unspec and expand it later on.
> +
> +(define_insn "probe_sve_stack_clash"
> + [(set (match_operand:DI 0 "register_operand" "=rk")
> + (unspec_volatile:DI [(match_operand:DI 1 "register_operand" "0")
> + (match_operand:DI 2 "register_operand" "r")
> + (match_operand:DI 3 "const_int_operand" "n")
> + (match_operand:DI 4 "aarch64_plus_immediate" "L")]
> + UNSPECV_PROBE_STACK_RANGE))]
> + "TARGET_SVE"
> +{
> + return aarch64_output_probe_sve_stack_clash (operands[0], operands[2],
> + operands[3], operands[4]);
> +}
> + [(set_attr "length" "32")]
> +)
> +
> ;; Named pattern for expanding thread pointer reference.
> (define_expand "get_thread_pointerdi"
> [(match_operand:DI 0 "register_operand" "=r")]
> diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-3.c b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-3.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..6ea87392843e4b9561cf6d43ffee57887db62e4e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-3.c
> @@ -0,0 +1,30 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=armv8-a+sve -fstack-clash-protection --param stack-clash-protection-guard-size=16 -funwind-tables -ftree-vectorize" } */
> +/* { dg-require-effective-target supports_stack_clash_protection } */
> +
> +#include <stdint.h>
> +
> +#define N 20040
> +
> +void __attribute__ ((noinline, noclone))
> +test (int8_t *restrict dest, int8_t *restrict src)
> +{
> + for (int i = 0; i < N; i+=8)
> + {
> + dest[i] += src[i * 4];
> + dest[i+1] += src[i * 4 + 1];
> + dest[i+2] += src[i * 4 + 2];
> + dest[i+3] += src[i * 4 + 3];
> + dest[i+4] += src[i * 4 + 4];
> + dest[i+5] += src[i * 4 + 5];
> + dest[i+6] += src[i * 4 + 6];
> + dest[i+7] += src[i * 4 + 7];
> + }
> +}
> +/* { dg-final { scan-assembler-times {mov\tx15, sp} 1 } } */
> +/* { dg-final { scan-assembler-times {\.cfi_def_cfa_register 15} 1 } } */
> +/* { dg-final { scan-assembler-times {\.cfi_escape 0xf,0xc,0x8f,0,0x92,0x2e,0,0x8,0x58,0x1e,0x23,0xb0,0x2,0x22} 1 } } */
> +
> +/* Checks that the CFA notes are correct for every sp adjustment, but we also
> + need to make sure we can unwind correctly before the frame is set up. So
> + check that we're copying sp into r15 and setting the CFA there. */
> diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-16.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-16.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..fd139bb09274509bd4faeef324520927a1fe3d3c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-16.c
> @@ -0,0 +1,32 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target supports_stack_clash_protection } */
> +/* { dg-options "-O2 -march=armv8-a+sve -fstack-clash-protection --param stack-clash-protection-guard-size=16 -ftree-vectorize" } */
> +
> +
> +#include <stdint.h>
> +
> +#define N 20040
> +
> +void __attribute__ ((noinline, noclone))
> +test (int8_t *restrict dest, int8_t *restrict src)
> +{
> + for (int i = 0; i < N; i+=8)
> + {
> + dest[i] += src[i * 4];
> + dest[i+1] += src[i * 4 + 1];
> + dest[i+2] += src[i * 4 + 2];
> + dest[i+3] += src[i * 4 + 3];
> + dest[i+4] += src[i * 4 + 4];
> + dest[i+5] += src[i * 4 + 5];
> + dest[i+6] += src[i * 4 + 6];
> + dest[i+7] += src[i * 4 + 7];
> + }
> +}
> +
> +
> +/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 0\]} 1 } } */
> +/* { dg-final { scan-assembler-times {cmp\s+x[0-9]+, 61440} 1 } } */
> +/* { dg-final { scan-assembler-times {subs\s+x[0-9]+, x[0-9]+, 61440} 1 } } */
> +
> +/* SVE spill, requires probing as vector size is unknown at compile time. */
> +
>
--
[-- Attachment #2: rb9914.patch --]
[-- Type: text/x-diff, Size: 14985 bytes --]
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index ef95fc829b83886e2ff00e4664e31af916e99b0c..e2d8734a8d5e513588e3b0318e9c67fdaebdf0d4 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -453,6 +453,7 @@ void aarch64_asm_output_labelref (FILE *, const char *);
void aarch64_cpu_cpp_builtins (cpp_reader *);
const char * aarch64_gen_far_branch (rtx *, int, const char *, const char *);
const char * aarch64_output_probe_stack_range (rtx, rtx);
+const char * aarch64_output_probe_sve_stack_clash (rtx, rtx, rtx, rtx);
void aarch64_err_no_fpadvsimd (machine_mode);
void aarch64_expand_epilogue (bool);
void aarch64_expand_mov_immediate (rtx, rtx, rtx (*) (rtx, rtx) = 0);
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index d4b13d48d852a70848fc7c51fd867e776efb5e55..245fd6832ec0afe27c42a242c901a2e13024f935 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -208,6 +208,7 @@ static bool aarch64_builtin_support_vector_misalignment (machine_mode mode,
static machine_mode aarch64_simd_container_mode (scalar_mode, poly_int64);
static bool aarch64_print_address_internal (FILE*, machine_mode, rtx,
aarch64_addr_query_type);
+static HOST_WIDE_INT aarch64_uimm12_nearest_value (HOST_WIDE_INT val);
/* Major revision number of the ARM Architecture implemented by the target. */
unsigned aarch64_architecture_version;
@@ -3973,6 +3974,83 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2)
return "";
}
+/* Emit the probe loop for doing stack clash probes and stack adjustments for
+ SVE. This emits probes from BASE to BASE - ADJUSTMENT based on a guard size
+ of GUARD_SIZE. When a probe is emitted it is done at MIN_PROBE_THRESHOLD bytes
+ from the current BASE, at intervals of MIN_PROBE_THRESHOLD. By the end of this
+ function BASE = BASE - ADJUSTMENT. */
+
+const char *
+aarch64_output_probe_sve_stack_clash (rtx base, rtx adjustment,
+ rtx min_probe_threshold, rtx guard_size)
+{
+ /* This function is not allowed to use any instruction generation function
+ like gen_ and friends. If you do you'll likely ICE during CFG validation,
+ so instead emit the code you want using output_asm_insn. */
+ gcc_assert (flag_stack_clash_protection);
+ gcc_assert (CONST_INT_P (min_probe_threshold) && CONST_INT_P (guard_size));
+ gcc_assert (INTVAL (guard_size) > INTVAL (min_probe_threshold));
+
+ /* The minimum required allocation before the residual requires probing. */
+ HOST_WIDE_INT residual_probe_guard = INTVAL (min_probe_threshold);
+
+ /* Clamp the value down to the nearest value that can be used with a cmp. */
+ residual_probe_guard = aarch64_uimm12_nearest_value (residual_probe_guard);
+ rtx probe_offset_value_rtx = gen_int_mode (residual_probe_guard, Pmode);
+
+ gcc_assert (INTVAL (min_probe_threshold) >= residual_probe_guard);
+ gcc_assert (aarch64_uimm12_shift (residual_probe_guard));
+
+ static int labelno = 0;
+ char loop_start_lab[32];
+ char loop_end_lab[32];
+ rtx xops[2];
+
+ ASM_GENERATE_INTERNAL_LABEL (loop_start_lab, "SVLPSPL", labelno);
+ ASM_GENERATE_INTERNAL_LABEL (loop_end_lab, "SVLPEND", labelno++);
+
+ /* Emit loop start label. */
+ ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_start_lab);
+
+ /* Test if ADJUSTMENT < RESIDUAL_PROBE_GUARD. */
+ xops[0] = adjustment;
+ xops[1] = probe_offset_value_rtx;
+ output_asm_insn ("cmp\t%0, %1", xops);
+
+ /* Branch to end if not enough adjustment to probe. */
+ fputs ("\tb.lt\t", asm_out_file);
+ assemble_name_raw (asm_out_file, loop_end_lab);
+ fputc ('\n', asm_out_file);
+
+ /* BASE = BASE - RESIDUAL_PROBE_GUARD. */
+ xops[0] = base;
+ xops[1] = gen_int_mode (residual_probe_guard, Pmode);
+ output_asm_insn ("sub\t%0, %0, %1", xops);
+
+ /* Probe at BASE. */
+ xops[1] = const0_rtx;
+ output_asm_insn ("str\txzr, [%0, %1]", xops);
+
+ /* ADJUSTMENT = ADJUSTMENT - RESIDUAL_PROBE_GUARD. */
+ xops[0] = adjustment;
+ xops[1] = probe_offset_value_rtx;
+ output_asm_insn ("sub\t%0, %0, %1", xops);
+
+ /* Branch to start if still more bytes to allocate. */
+ fputs ("\tb\t", asm_out_file);
+ assemble_name_raw (asm_out_file, loop_start_lab);
+ fputc ('\n', asm_out_file);
+
+ /* No more probes needed; loop exit. */
+ ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_end_lab);
+
+ /* BASE = BASE - ADJUSTMENT. */
+ xops[0] = base;
+ xops[1] = adjustment;
+ output_asm_insn ("sub\t%0, %0, %1", xops);
+ return "";
+}
+
/* Determine whether a frame chain needs to be generated. */
static bool
aarch64_needs_frame_chain (void)
@@ -4835,21 +4913,76 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2,
}
}
- HOST_WIDE_INT size;
/* If SIZE is not large enough to require probing, just adjust the stack and
exit. */
- if (!poly_size.is_constant (&size)
- || known_lt (poly_size, min_probe_threshold)
+ if (known_lt (poly_size, min_probe_threshold)
|| !flag_stack_clash_protection)
{
aarch64_sub_sp (temp1, temp2, poly_size, frame_related_p);
return;
}
+ HOST_WIDE_INT size;
+ /* Handle the SVE non-constant case first. */
+ if (!poly_size.is_constant (&size))
+ {
+
+ if (dump_file)
+ {
+ fprintf (dump_file, "Stack clash SVE prologue: ");
+ print_dec (poly_size, dump_file);
+ fprintf (dump_file, " bytes, dynamic probing will be required.\n");
+ }
+
+ /* First calculate the number of bytes we're actually spilling. */
+ aarch64_add_offset (Pmode, temp1, CONST0_RTX (GET_MODE (temp1)),
+ poly_size, temp1, temp2, false, true);
+
+ rtx_insn *insn = get_last_insn ();
+
+ if (frame_related_p)
+ {
+ /* This is done to provide unwinding information for the stack
+ adjustments we're about to do. However, to prevent the optimizers
+ from removing the R15 move and leaving the CFA note (which would be
+ very wrong), we tie the old and new stack pointers together.
+ The tie will expand to nothing, but the optimizers will not touch
+ the instruction. */
+ rtx stack_ptr_copy = gen_rtx_REG (Pmode, R15_REGNUM);
+ emit_move_insn (stack_ptr_copy, stack_pointer_rtx);
+ emit_insn (gen_stack_tie (stack_ptr_copy, stack_pointer_rtx));
+
+ /* We want the CFA independent of the stack pointer for the
+ duration of the loop. */
+ add_reg_note (insn, REG_CFA_DEF_CFA, stack_ptr_copy);
+ RTX_FRAME_RELATED_P (insn) = 1;
+ }
+
+ rtx probe_const = gen_int_mode (min_probe_threshold, DImode);
+ rtx guard_const = gen_int_mode (guard_size, DImode);
+
+ insn = emit_insn (gen_probe_sve_stack_clash (stack_pointer_rtx,
+ stack_pointer_rtx, temp1,
+ probe_const, guard_const));
+
+ /* Now reset the CFA register if needed. */
+ if (frame_related_p)
+ {
+ add_reg_note (insn, REG_CFA_DEF_CFA,
+ gen_rtx_PLUS (Pmode, stack_pointer_rtx,
+ gen_int_mode (poly_size, Pmode)));
+ RTX_FRAME_RELATED_P (insn) = 1;
+ }
+
+ return;
+ }
+
if (dump_file)
- fprintf (dump_file,
- "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC " bytes"
- ", probing will be required.\n", size);
+ {
+ fprintf (dump_file,
+ "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC
+ " bytes, probing will be required.\n", size);
+ }
/* Round size to the nearest multiple of guard_size, and calculate the
residual as the difference between the original size and the rounded
@@ -5458,6 +5591,16 @@ aarch64_uimm12_shift (HOST_WIDE_INT val)
);
}
+/* Returns the nearest value less than or equal to VAL that fits as a 12-bit
+ unsigned immediate, optionally left-shifted by 12. */
+static HOST_WIDE_INT
+aarch64_uimm12_nearest_value (HOST_WIDE_INT val)
+{
+ if ((val & (((HOST_WIDE_INT) 0xfff) << 0)) == val)
+ return val;
+
+ return val & (((HOST_WIDE_INT) 0xfff) << 12);
+}
/* Return true if val is an immediate that can be loaded into a
register by a MOVZ instruction. */
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index b8da13f14fa9990e8fdc3c71ed407c8afc65a324..100bd2cc603656d2b8ba97f905c5eff16c59793b 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -6464,6 +6464,25 @@
[(set_attr "length" "32")]
)
+;; This instruction is used to generate the stack-clash stack adjustment and
+;; probing loop. We can't change the control flow during prologue and epilogue
+;; code generation, so we must emit a volatile unspec and expand it later on.
+
+(define_insn "probe_sve_stack_clash"
+ [(set (match_operand:DI 0 "register_operand" "=rk")
+ (unspec_volatile:DI [(match_operand:DI 1 "register_operand" "0")
+ (match_operand:DI 2 "register_operand" "r")
+ (match_operand:DI 3 "const_int_operand" "n")
+ (match_operand:DI 4 "aarch64_plus_immediate" "L")]
+ UNSPECV_PROBE_STACK_RANGE))]
+ "TARGET_SVE"
+{
+ return aarch64_output_probe_sve_stack_clash (operands[0], operands[2],
+ operands[3], operands[4]);
+}
+ [(set_attr "length" "28")]
+)
+
;; Named pattern for expanding thread pointer reference.
(define_expand "get_thread_pointerdi"
[(match_operand:DI 0 "register_operand" "=r")]
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-3.c b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-3.c
new file mode 100644
index 0000000000000000000000000000000000000000..6ea87392843e4b9561cf6d43ffee57887db62e4e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-3.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=armv8-a+sve -fstack-clash-protection --param stack-clash-protection-guard-size=16 -funwind-tables -ftree-vectorize" } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+
+#include <stdint.h>
+
+#define N 20040
+
+void __attribute__ ((noinline, noclone))
+test (int8_t *restrict dest, int8_t *restrict src)
+{
+ for (int i = 0; i < N; i+=8)
+ {
+ dest[i] += src[i * 4];
+ dest[i+1] += src[i * 4 + 1];
+ dest[i+2] += src[i * 4 + 2];
+ dest[i+3] += src[i * 4 + 3];
+ dest[i+4] += src[i * 4 + 4];
+ dest[i+5] += src[i * 4 + 5];
+ dest[i+6] += src[i * 4 + 6];
+ dest[i+7] += src[i * 4 + 7];
+ }
+}
+/* { dg-final { scan-assembler-times {mov\tx15, sp} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_def_cfa_register 15} 1 } } */
+/* { dg-final { scan-assembler-times {\.cfi_escape 0xf,0xc,0x8f,0,0x92,0x2e,0,0x8,0x58,0x1e,0x23,0xb0,0x2,0x22} 1 } } */
+
+/* Checks that the CFA notes are correct for every sp adjustment, but we also
+ need to make sure we can unwind correctly before the frame is set up. So
+ check that we're copying sp into r15 and setting the CFA there. */
diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-16.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-16.c
new file mode 100644
index 0000000000000000000000000000000000000000..fd0e987597eba406fa7351433fe7157743aeca42
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-16.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+/* { dg-options "-O2 -march=armv8-a+sve -fstack-clash-protection --param stack-clash-protection-guard-size=16 -ftree-vectorize" } */
+
+
+#include <stdint.h>
+
+#define N 20040
+
+void __attribute__ ((noinline, noclone))
+test (int8_t *restrict dest, int8_t *restrict src)
+{
+ for (int i = 0; i < N; i+=8)
+ {
+ dest[i] += src[i * 4];
+ dest[i+1] += src[i * 4 + 1];
+ dest[i+2] += src[i * 4 + 2];
+ dest[i+3] += src[i * 4 + 3];
+ dest[i+4] += src[i * 4 + 4];
+ dest[i+5] += src[i * 4 + 5];
+ dest[i+6] += src[i * 4 + 6];
+ dest[i+7] += src[i * 4 + 7];
+ }
+}
+
+
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 0\]} 1 } } */
+/* { dg-final { scan-assembler-times {cmp\s+x[0-9]+, 61440} 1 } } */
+/* { dg-final { scan-assembler-times {sub\s+x[0-9]+, x[0-9]+, 61440} 1 } } */
+
+/* SVE spill, requires probing as vector size is unknown at compile time. */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_24.c b/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_24.c
new file mode 100644
index 0000000000000000000000000000000000000000..4199e391881f1e260535c3fbdb6c41e428d54fbf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_24.c
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+/* { dg-options "-O2 -ftree-vectorize -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+
+#include <stdint.h>
+
+#define N 2000
+#define S 2 * 64 * 1024
+
+#define TEST_LOOP(NAME, TYPE) \
+ void __attribute__ ((noinline, noclone)) \
+ NAME (TYPE *restrict dest, TYPE *restrict src) \
+ { \
+ volatile char foo[S]; \
+ foo[S-1]=1; \
+ for (int i = 0; i < N; i=i+4) \
+ { \
+ dest[i] += src[i * 4]; \
+ dest[i+1] += src[(i+1) * 4]; \
+ dest[i+2] += src[(i+2) * 4]; \
+ dest[i+3] += src[(i+3) * 4]; \
+ } \
+ }
+
+#define TEST(NAME) \
+ TEST_LOOP (NAME##_i8, int8_t) \
+ TEST_LOOP (NAME##_i16, uint16_t) \
+ TEST_LOOP (NAME##_f32, float) \
+ TEST_LOOP (NAME##_f64, double)
+
+TEST (test)
+
+/* Check the vectorized loop for stack clash probing. */
+/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 0\]} 2 } } */
+/* { dg-final { scan-assembler-times {cmp\s+x[0-9]+, 61440} 2 } } */
+/* { dg-final { scan-assembler-times {sub\s+x[0-9]+, x[0-9]+, 61440} 2 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_24_run.c b/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_24_run.c
new file mode 100644
index 0000000000000000000000000000000000000000..cbf1bdfe1e3f23d342042c5b1bb5994714a65cec
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_24_run.c
@@ -0,0 +1,37 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-require-effective-target supports_stack_clash_protection } */
+/* { dg-options "-O2 -ftree-vectorize -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */
+
+#include "struct_vect_24.c"
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, TYPE) \
+ { \
+ TYPE out[N]; \
+ TYPE in[N * 4]; \
+ for (int i = 0; i < N; ++i) \
+ { \
+ out[i] = i * 7 / 2; \
+ asm volatile ("" ::: "memory"); \
+ } \
+ for (int i = 0; i < N * 4; ++i) \
+ { \
+ in[i] = i * 9 / 2; \
+ asm volatile ("" ::: "memory"); \
+ } \
+ NAME (out, in); \
+ for (int i = 0; i < N; ++i) \
+ { \
+ TYPE expected = i * 7 / 2 + in[i * 4]; \
+ if (out[i] != expected) \
+ __builtin_abort (); \
+ asm volatile ("" ::: "memory"); \
+ } \
+ }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+ TEST (test);
+ return 0;
+}