Date: Thu, 27 Sep 2018 10:14:00 -0000
From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Sandiford
Cc: "gcc-patches@gcc.gnu.org", nd, James Greenhalgh, Richard Earnshaw,
	Marcus Shawcroft
Subject: Re: [PATCH][GCC][AArch64] Add support for SVE stack clash probing [patch (2/7)]
Message-ID: <20180927100225.GA14440@arm.com>
References: <20180828121925.GA31072@arm.com> <878t4qp6lw.fsf@arm.com>
	<20180907160506.GB14299@arm.com> <87in3ckqmx.fsf@arm.com>
	<20180920092313.GA28769@arm.com> <20180926082027.GA27633@arm.com>
In-Reply-To: <20180926082027.GA27633@arm.com>
User-Agent: Mutt/1.5.24 (2015-08-30)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="xHFwDpU9dbj6ez1V"

--xHFwDpU9dbj6ez1V
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

Hi All,

It turns out the testsuite didn't have a case in it which would cause a
significant enough spill to enter the loop.  After creating one I noticed
a bug in the loop and fixed it.

The loops are now

	.cfi_startproc
	mov	x15, sp
	cntb	x16, all, mul #11
	add	x16, x16, 304
	.cfi_def_cfa_register 15
.SVLPSPL0:
	cmp	x16, 61440
	b.lt	.SVLPEND0
	sub	sp, sp, 61440
	str	xzr, [sp, 0]
	subs	x16, x16, 61440
	b	.SVLPSPL0
.SVLPEND0:
	sub	sp, sp, x16
	.cfi_escape 0xf,0xc,0x8f,0,0x92,0x2e,0,0x8,0x58,0x1e,0x23,0xb0,0x2,0x22

for a 64KB guard size.

I'm also adding a new testcase that causes a large enough spill to enter
the loop.

Ok for trunk?

Thanks,
Tamar

gcc/
2018-09-27  Tamar Christina

	PR target/86486
	* config/aarch64/aarch64-protos.h
	(aarch64_output_probe_sve_stack_clash): New.
	* config/aarch64/aarch64.c (aarch64_output_probe_sve_stack_clash,
	aarch64_uimm12_nearest_value): New.
	(aarch64_allocate_and_probe_stack_space): Add SVE specific section.
	* config/aarch64/aarch64.md (probe_sve_stack_clash): New.

gcc/testsuite/
2018-09-27  Tamar Christina

	PR target/86486
	* gcc.target/aarch64/stack-check-prologue-16.c: New test.
	* gcc.target/aarch64/stack-check-cfa-3.c: New test.
	* gcc.target/aarch64/sve/struct_vect_24.c: New test.
	* gcc.target/aarch64/sve/struct_vect_24_run.c: New test.

The 09/26/2018 09:20, Tamar Christina wrote:
> Hi Richard,
>
> I've added a new loop that should also exit early as described in my previous email.
>
> An example sequence is:
>
> 	.cfi_startproc
> 	mov	x15, sp
> 	cntb	x16, all, mul #11
> 	add	x16, x16, 304
> 	.cfi_def_cfa_register 15
> 	cmp	x16, 61440
> 	b.lt	.SVLPEND0
> .SVLPSPL0:
> 	sub	sp, sp, 61440
> 	str	xzr, [sp, 0]
> 	subs	x16, x16, 61440
> 	b.hs	.SVLPSPL0
> 	add	x16, x16, 61440
> .SVLPEND0:
> 	sub	sp, sp, x16
> 	.cfi_escape 0xf,0xc,0x8f,0,0x92,0x2e,0,0x8,0x58,0x1e,0x23,0xb0,0x2,0x22
>
> for a 64KB guard size, and for a 4KB guard size
>
> 	.cfi_startproc
> 	mov	x15, sp
> 	cntb	x16, all, mul #11
> 	add	x16, x16, 304
> 	.cfi_def_cfa_register 15
> 	cmp	x16, 3072
> 	b.lt	.SVLPEND0
> .SVLPSPL0:
> 	sub	sp, sp, 3072
> 	str	xzr, [sp, 0]
> 	subs	x16, x16, 3072
> 	b.hs	.SVLPSPL0
> 	add	x16, x16, 3072
> .SVLPEND0:
> 	sub	sp, sp, x16
> 	.cfi_escape 0xf,0xc,0x8f,0,0x92,0x2e,0,0x8,0x58,0x1e,0x23,0xb0,0x2,0x22
>
>
> This has about the same semantics as alloca, except we prioritize the common case
> where no probe is required.  We also change the amount we adjust the stack and
> the probing interval to be the nearest value to `guard size - abi buffer` that
> fits in the 12-bit shifted immediate used by cmp.
>
> While this would mean we probe a bit more often than we require, in practice the
> amount of SVE vectors you'd need to spill is significant.  Even more so to enter the
> loop more than once.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues in sve testsuite.
> Target was tested with stack clash on and off by default.
>
> Ok for trunk?
>
> Thanks,
> Tamar
>
> gcc/
> 2018-09-26  Tamar Christina
>
> 	PR target/86486
> 	* config/aarch64/aarch64-protos.h (aarch64_output_probe_sve_stack_clash): New.
> 	* config/aarch64/aarch64.c (aarch64_output_probe_sve_stack_clash,
> 	aarch64_uimm12_nearest_value): New.
> 	(aarch64_allocate_and_probe_stack_space): Add SVE specific section.
> 	* config/aarch64/aarch64.md (probe_sve_stack_clash): New.
>
> gcc/testsuite/
> 2018-09-26  Tamar Christina
>
> 	PR target/86486
> 	* gcc.target/aarch64/stack-check-prologue-16.c: New test
> 	* gcc.target/aarch64/stack-check-cfa-3.c: New test.
>
>
> The 09/20/2018 10:23, Tamar Christina wrote:
> > Hi Richard,
> >
> > The 09/11/2018 16:20, Richard Sandiford wrote:
> > > Tamar Christina writes:
> > > >> > +
> > > >> > + /* No probe leave.
*/ > > > >> > + ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_end_lab); > > > >> > + return ""; > > > >> > > > >> With the CFA stuff and constant load, I think this works out as: > > > >> > > > >> --------------------------------------------- > > > >> # 12 insns > > > >> mov r15, base > > > >> mov adjustment, N > > > >> 1: > > > >> cmp adjustment, guard_size > > > >> b.lt 2f > > > >> sub base, base, guard_size > > > >> str xzr, [base, limit] > > > >> sub adjustment, adjustment, guard_size > > > >> b 1b > > > >> 2: > > > >> sub base, base, adjustment > > > >> cmp adjustment, limit > > > >> b.le 3f > > > >> str xzr, [base, limit] > > > >> 3: > > > >> --------------------------------------------- > > > >> > > > >> What do you think about something like: > > > >> > > > >> --------------------------------------------- > > > >> # 10 insns > > > >> mov adjustment, N > > > >> sub r15, base, adjustment > > > >> subs adjustment, adjustment, min_probe_threshold > > > >> b.lo 2f > > > >> 1: > > > >> add base, x15, adjustment > > > >> str xzr, [base, 0] > > > >> subs adjustment, adjustment, 16 > > > >> and adjustment, adjustment, ~(guard_size-1) > > > >> b.hs 1b > > > >> 2: > > > >> mov base, r15 > > > >> --------------------------------------------- > > > >> > > > >> or (with different trade-offs): > > > >> > > > >> --------------------------------------------- > > > >> # 11 insns > > > >> mov adjustment, N > > > >> sub r15, base, adjustment > > > >> subs adjustment, adjustment, min_probe_threshold > > > >> b.lo 2f > > > >> # Might be 0, leading to a double probe > > > >> and r14, adjustment, guard_size-1 > > > >> 1: > > > >> add base, x15, adjustment > > > >> str xzr, [base, 0] > > > >> subs adjustment, adjustment, r14 > > > >> mov r14, guard_size > > > >> b.hs 1b > > > >> 2: > > > >> mov base, r15 > > > >> --------------------------------------------- > > > >> > > > >> or (longer, but with a simpler loop): > > > >> > > > >> --------------------------------------------- > > > >> # 12 insns > > > >> mov adjustment, N > > > >> sub r15, base, adjustment > > > >> subs adjustment, adjustment, min_probe_threshold > > > >> b.lo 2f > > > >> str xzr, [base, -16]! > > > >> sub adjustment, adjustment, 32 > > > >> and adjustment, adjustment, -(guard_size-1) > > > >> 1: > > > >> add base, x15, adjustment > > > >> str xzr, [base, 0] > > > >> subs adjustment, adjustment, guard_size > > > >> b.hs 1b > > > >> 2: > > > >> mov base, r15 > > > >> --------------------------------------------- > > > >> > > > >> with the CFA based on r15+offset? > > > >> > > > >> These loops probe more often than necessary in some cases, > > > >> but they only need a single branch in the common case that > > > >> ADJUSTMENT <= MIN_PROBE_THRESHOLD. > > > > > > > > I haven't changed the loop yet because I'm a bit on the edge about > > > > whether the implementation difficulties would outweigh the benefits. > > > > We are planning on doing something smarter for SVE so optimizing these > > > > loops only to replace them later may not be time well spent now. > > > > > > > > The problem is that to support both 4KB and 64KB pages, instructions such > > > > as subs would require different immediates and shifts. Granted we technically > > > > only support these two so I could hardcode the values, but that would mean > > > > these functions are less general than the rest. > > > > > > Because of the min_probe_threshold? 
You could conservatively clamp it > > > to the next lowest value that's in range, which we could do without > > > having to hard-code specific values. I think it would be better > > > to do that even with the current code, since hard-coding 2048 with: > > > > > > /* Test if ADJUSTMENT < RESIDUAL_PROBE_GUARD, in principle any power of two > > > larger than 1024B would work, but we need one that works for all supported > > > guard-sizes. What we actually want to check is guard-size - 1KB, but this > > > immediate won't fit inside a cmp without requiring a tempory, so instead we > > > just accept a smaller immediate that doesn't, we may probe a bit more often > > > but that doesn't matter much on the long run. */ > > > > > > seems a bit of a hack. > > > > > > > If you think it would be worthwhile, I'd be happy to use one of these > > > > loops instead. > > > > > > Yeah, I still think we should do this unless we can commit to doing > > > the optimised version by a specific date, and that date is soon enough > > > that the optimisation could reasonably be backported to GCC 8. > > > > > > > While implementing these loops I found them a bit hard to follow, or rather a bit > > difficult to prove correct, to someone looking at the code it may not be trivially clear > > what it does. I believe the main concern here is that the common case > > isn't shortcutted? e.g. spills small enough not to require a probe. So how about > > > > mov r15, base > > mov adjustment, N > > cmp adjustment, nearest(min_probe_threshold) > > b.lt end > > begin: > > sub base, base, nearest(min_probe_threshold) > > str xzr, [base, 0] > > subs size, size, nearest(min_probe_threshold) > > b.hs begin > > end: > > sub base, base, size > > > > as an alternative? Which is 9 insn but also much simpler and follows the same semantics > > as the other probing codes. This has the downside that we probe a bit more often than we need to > > but on the average case you'd likely not enter the loop more than once, so I'd expect in real world usage > > the amount of probes to be the same as the previous code, since you'd have to spill a significant amount of SVE > > vectors in order to enter the loop, let alone iterate. > > > > This is still safe as the only invariant we have to hold is not to drop the SP by more than a page at a time, > > doing less than a page it fine. > > > > nearest just rounds down to the nearest value that fits in a 12-bit shifted immediate. > > > > > > Thanks, > > Tamar > > > > > > @@ -4830,11 +4929,11 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2, > > > > } > > > > } > > > > > > > > - HOST_WIDE_INT size; > > > > + HOST_WIDE_INT size = 0; > > > > /* If SIZE is not large enough to require probing, just adjust the stack and > > > > exit. 
*/ > > > > - if (!poly_size.is_constant (&size) > > > > - || known_lt (poly_size, min_probe_threshold) > > > > + if ((poly_size.is_constant (&size) > > > > + && known_lt (poly_size, min_probe_threshold)) > > > > || !flag_stack_clash_protection) > > > > { > > > > aarch64_sub_sp (temp1, temp2, poly_size, frame_related_p); > > > > > > I still think we should remove this poly_size.is_constant, and instead: > > > > > > > @@ -4842,9 +4941,64 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2, > > > > } > > > > > > > if (dump_file) > > > > - fprintf (dump_file, > > > > - "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC " bytes" > > > > - ", probing will be required.\n", size); > > > > + { > > > > + if (poly_size.is_constant ()) > > > > + fprintf (dump_file, > > > > + "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC > > > > + " bytes, probing will be required.\n", size); > > > > + else > > > > + { > > > > + fprintf (dump_file, "Stack clash SVE prologue: "); > > > > + print_dec (poly_size, dump_file); > > > > + fprintf (dump_file, " bytes, dynamic probing will be required.\n"); > > > > + } > > > > + } > > > > + > > > > + /* Handle the SVE non-constant case first. */ > > > > + if (!poly_size.is_constant ()) > > > > > > ...use is_constant (&size) here, and put the dump messages for the > > > constant and non-constant cases in their respective constant and > > > non-constant blocks. That way each use of "size" is directly protected > > > by an is_constant call, and there's no need to initialise size to 0. > > > > > > The non-constant case doesn't have the new special handling of > > > final_adjustment_p, so I think the !is_constant block should assert > > > !final_adjustment_p. > > > > > > Thanks, > > > Richard > > > > -- > > -- > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h > index ef95fc829b83886e2ff00e4664e31af916e99b0c..e2d8734a8d5e513588e3b0318e9c67fdaebdf0d4 100644 > --- a/gcc/config/aarch64/aarch64-protos.h > +++ b/gcc/config/aarch64/aarch64-protos.h > @@ -453,6 +453,7 @@ void aarch64_asm_output_labelref (FILE *, const char *); > void aarch64_cpu_cpp_builtins (cpp_reader *); > const char * aarch64_gen_far_branch (rtx *, int, const char *, const char *); > const char * aarch64_output_probe_stack_range (rtx, rtx); > +const char * aarch64_output_probe_sve_stack_clash (rtx, rtx, rtx, rtx); > void aarch64_err_no_fpadvsimd (machine_mode); > void aarch64_expand_epilogue (bool); > void aarch64_expand_mov_immediate (rtx, rtx, rtx (*) (rtx, rtx) = 0); > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > index d4b13d48d852a70848fc7c51fd867e776efb5e55..d189198a377e698964d34ef03a4c1a92fe1be4f0 100644 > --- a/gcc/config/aarch64/aarch64.c > +++ b/gcc/config/aarch64/aarch64.c > @@ -208,6 +208,7 @@ static bool aarch64_builtin_support_vector_misalignment (machine_mode mode, > static machine_mode aarch64_simd_container_mode (scalar_mode, poly_int64); > static bool aarch64_print_address_internal (FILE*, machine_mode, rtx, > aarch64_addr_query_type); > +static HOST_WIDE_INT aarch64_uimm12_nearest_value (HOST_WIDE_INT val); > > /* Major revision number of the ARM Architecture implemented by the target. */ > unsigned aarch64_architecture_version; > @@ -3973,6 +3974,89 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2) > return ""; > } > > +/* Emit the probe loop for doing stack clash probes and stack adjustments for > + SVE. This emits probes from BASE to BASE - ADJUSTMENT based on a guard size > + of GUARD_SIZE. 
When a probe is emitted it is done at MIN_PROBE_OFFSET bytes > + from the current BASE at an interval of MIN_PROBE_OFFSET. By the end of this > + function BASE = BASE - ADJUSTMENT. */ > + > +const char * > +aarch64_output_probe_sve_stack_clash (rtx base, rtx adjustment, > + rtx min_probe_threshold, rtx guard_size) > +{ > + /* This function is not allowed to use any instruction generation function > + like gen_ and friends. If you do you'll likely ICE during CFG validation, > + so instead emit the code you want using output_asm_insn. */ > + gcc_assert (flag_stack_clash_protection); > + gcc_assert (CONST_INT_P (min_probe_threshold) && CONST_INT_P (guard_size)); > + gcc_assert (INTVAL (guard_size) > INTVAL (min_probe_threshold)); > + > + /* The minimum required allocation before the residual requires probing. */ > + HOST_WIDE_INT residual_probe_guard = INTVAL (min_probe_threshold); > + > + /* Clamp the value down to the nearest value that can be used with a cmp. */ > + residual_probe_guard = aarch64_uimm12_nearest_value (residual_probe_guard); > + rtx probe_offset_value_rtx = gen_int_mode (residual_probe_guard, Pmode); > + > + gcc_assert (INTVAL (min_probe_threshold) >= residual_probe_guard); > + gcc_assert (aarch64_uimm12_shift (residual_probe_guard)); > + > + static int labelno = 0; > + char loop_start_lab[32]; > + char loop_end_lab[32]; > + rtx xops[2]; > + > + ASM_GENERATE_INTERNAL_LABEL (loop_start_lab, "SVLPSPL", labelno); > + ASM_GENERATE_INTERNAL_LABEL (loop_end_lab, "SVLPEND", labelno++); > + > + /* ADJUSTMENT == RESIDUAL_PROBE_GUARD. */ > + xops[0] = adjustment; > + xops[1] = probe_offset_value_rtx; > + output_asm_insn ("cmp\t%0, %1", xops); > + > + /* Branch to end if not enough adjustment to probe. */ > + fputs ("\tb.lt\t", asm_out_file); > + assemble_name_raw (asm_out_file, loop_end_lab); > + fputc ('\n', asm_out_file); > + > + /* Emit loop start label. */ > + ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_start_lab); > + > + /* BASE = BASE - RESIDUAL_PROBE_GUARD. */ > + xops[0] = base; > + xops[1] = gen_int_mode (residual_probe_guard, Pmode); > + output_asm_insn ("sub\t%0, %0, %1", xops); > + > + /* Probe at BASE. */ > + xops[1] = const0_rtx; > + output_asm_insn ("str\txzr, [%0, %1]", xops); > + > + /* ADJUSTMENT = ADJUSTMENT - RESIDUAL_PROBE_GUARD. */ > + xops[0] = adjustment; > + xops[1] = probe_offset_value_rtx; > + output_asm_insn ("subs\t%0, %0, %1", xops); > + > + /* Branch to start if still more bytes to allocate. */ > + fputs ("\tb.hs\t", asm_out_file); > + assemble_name_raw (asm_out_file, loop_start_lab); > + fputc ('\n', asm_out_file); > + > + /* ADJUSTMENT = ADJUSTMENT + RESIDUAL_PROBE_GUARD, we need to undo the last > + subtract in order to know how much to drop the stack by. */ > + xops[0] = adjustment; > + xops[1] = probe_offset_value_rtx; > + output_asm_insn ("add\t%0, %0, %1", xops); > + > + /* No probe leave. */ > + ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_end_lab); > + > + /* BASE = BASE - ADJUSTMENT. */ > + xops[0] = base; > + xops[1] = adjustment; > + output_asm_insn ("sub\t%0, %0, %1", xops); > + return ""; > +} > + > /* Determine whether a frame chain needs to be generated. */ > static bool > aarch64_needs_frame_chain (void) > @@ -4835,21 +4919,76 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2, > } > } > > - HOST_WIDE_INT size; > /* If SIZE is not large enough to require probing, just adjust the stack and > exit. 
*/ > - if (!poly_size.is_constant (&size) > - || known_lt (poly_size, min_probe_threshold) > + if (known_lt (poly_size, min_probe_threshold) > || !flag_stack_clash_protection) > { > aarch64_sub_sp (temp1, temp2, poly_size, frame_related_p); > return; > } > > + HOST_WIDE_INT size; > + /* Handle the SVE non-constant case first. */ > + if (!poly_size.is_constant (&size)) > + { > + > + if (dump_file) > + { > + fprintf (dump_file, "Stack clash SVE prologue: "); > + print_dec (poly_size, dump_file); > + fprintf (dump_file, " bytes, dynamic probing will be required.\n"); > + } > + > + /* First calculate the amount of bytes we're actually spilling. */ > + aarch64_add_offset (Pmode, temp1, CONST0_RTX (GET_MODE (temp1)), > + poly_size, temp1, temp2, false, true); > + > + rtx_insn *insn = get_last_insn (); > + > + if (frame_related_p) > + { > + /* This is done to provide unwinding information for the stack > + adjustments we're about to do, however to prevent the optimizers > + from removing the R15 move and leaving the CFA note (which would be > + very wrong) we tie the old and new stack pointer together. > + The tie will expand to nothing but the optimizers will not touch > + the instruction. */ > + rtx stack_ptr_copy = gen_rtx_REG (Pmode, R15_REGNUM); > + emit_move_insn (stack_ptr_copy, stack_pointer_rtx); > + emit_insn (gen_stack_tie (stack_ptr_copy, stack_pointer_rtx)); > + > + /* We want the CFA independent of the stack pointer for the > + duration of the loop. */ > + add_reg_note (insn, REG_CFA_DEF_CFA, stack_ptr_copy); > + RTX_FRAME_RELATED_P (insn) = 1; > + } > + > + rtx probe_const = gen_int_mode (min_probe_threshold, DImode); > + rtx guard_const = gen_int_mode (guard_size, DImode); > + > + insn = emit_insn (gen_probe_sve_stack_clash (stack_pointer_rtx, > + stack_pointer_rtx, temp1, > + probe_const, guard_const)); > + > + /* Now reset the CFA register if needed. */ > + if (frame_related_p) > + { > + add_reg_note (insn, REG_CFA_DEF_CFA, > + gen_rtx_PLUS (Pmode, stack_pointer_rtx, > + gen_int_mode (poly_size, Pmode))); > + RTX_FRAME_RELATED_P (insn) = 1; > + } > + > + return; > + } > + > if (dump_file) > - fprintf (dump_file, > - "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC " bytes" > - ", probing will be required.\n", size); > + { > + fprintf (dump_file, > + "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC > + " bytes, probing will be required.\n", size); > + } > > /* Round size to the nearest multiple of guard_size, and calculate the > residual as the difference between the original size and the rounded > @@ -5458,6 +5597,16 @@ aarch64_uimm12_shift (HOST_WIDE_INT val) > ); > } > > +/* Returns the nearest value to VAL that will fit as a 12-bit unsigned immediate > + that can be created with a left shift of 0 or 12. */ > +static HOST_WIDE_INT > +aarch64_uimm12_nearest_value (HOST_WIDE_INT val) > +{ > + if ((val & (((HOST_WIDE_INT) 0xfff) << 0)) == val) > + return val; > + > + return val & (((HOST_WIDE_INT) 0xfff) << 12); > +} > > /* Return true if val is an immediate that can be loaded into a > register by a MOVZ instruction. */ > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md > index b8da13f14fa9990e8fdc3c71ed407c8afc65a324..b422713019e5063babec1fb81d0dfc7b50c76038 100644 > --- a/gcc/config/aarch64/aarch64.md > +++ b/gcc/config/aarch64/aarch64.md > @@ -6464,6 +6464,25 @@ > [(set_attr "length" "32")] > ) > > +;; This instruction is used to generate the stack clash stack adjustment and > +;; probing loop. 
We can't change the control flow during prologue and epilogue > +;; code generation. So we must emit a volatile unspec and expand it later on. > + > +(define_insn "probe_sve_stack_clash" > + [(set (match_operand:DI 0 "register_operand" "=rk") > + (unspec_volatile:DI [(match_operand:DI 1 "register_operand" "0") > + (match_operand:DI 2 "register_operand" "r") > + (match_operand:DI 3 "const_int_operand" "n") > + (match_operand:DI 4 "aarch64_plus_immediate" "L")] > + UNSPECV_PROBE_STACK_RANGE))] > + "TARGET_SVE" > +{ > + return aarch64_output_probe_sve_stack_clash (operands[0], operands[2], > + operands[3], operands[4]); > +} > + [(set_attr "length" "32")] > +) > + > ;; Named pattern for expanding thread pointer reference. > (define_expand "get_thread_pointerdi" > [(match_operand:DI 0 "register_operand" "=r")] > diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-3.c b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-3.c > new file mode 100644 > index 0000000000000000000000000000000000000000..6ea87392843e4b9561cf6d43ffee57887db62e4e > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-3.c > @@ -0,0 +1,30 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -march=armv8-a+sve -fstack-clash-protection --param stack-clash-protection-guard-size=16 -funwind-tables -ftree-vectorize" } */ > +/* { dg-require-effective-target supports_stack_clash_protection } */ > + > +#include > + > +#define N 20040 > + > +void __attribute__ ((noinline, noclone)) > +test (int8_t *restrict dest, int8_t *restrict src) > +{ > + for (int i = 0; i < N; i+=8) > + { > + dest[i] += src[i * 4]; > + dest[i+1] += src[i * 4 + 1]; > + dest[i+2] += src[i * 4 + 2]; > + dest[i+3] += src[i * 4 + 3]; > + dest[i+4] += src[i * 4 + 4]; > + dest[i+5] += src[i * 4 + 5]; > + dest[i+6] += src[i * 4 + 6]; > + dest[i+7] += src[i * 4 + 7]; > + } > +} > +/* { dg-final { scan-assembler-times {mov\tx15, sp} 1 } } */ > +/* { dg-final { scan-assembler-times {\.cfi_def_cfa_register 15} 1 } } */ > +/* { dg-final { scan-assembler-times {\.cfi_escape 0xf,0xc,0x8f,0,0x92,0x2e,0,0x8,0x58,0x1e,0x23,0xb0,0x2,0x22} 1 } } */ > + > +/* Checks that the CFA notes are correct for every sp adjustment, but we also > + need to make sure we can unwind correctly before the frame is set up. So > + check that we're emitting r15 with a copy of sp an setting the CFA there. 
*/ > diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-16.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-16.c > new file mode 100644 > index 0000000000000000000000000000000000000000..fd139bb09274509bd4faeef324520927a1fe3d3c > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-16.c > @@ -0,0 +1,32 @@ > +/* { dg-do compile } */ > +/* { dg-require-effective-target supports_stack_clash_protection } */ > +/* { dg-options "-O2 -march=armv8-a+sve -fstack-clash-protection --param stack-clash-protection-guard-size=16 -ftree-vectorize" } */ > + > + > +#include > + > +#define N 20040 > + > +void __attribute__ ((noinline, noclone)) > +test (int8_t *restrict dest, int8_t *restrict src) > +{ > + for (int i = 0; i < N; i+=8) > + { > + dest[i] += src[i * 4]; > + dest[i+1] += src[i * 4 + 1]; > + dest[i+2] += src[i * 4 + 2]; > + dest[i+3] += src[i * 4 + 3]; > + dest[i+4] += src[i * 4 + 4]; > + dest[i+5] += src[i * 4 + 5]; > + dest[i+6] += src[i * 4 + 6]; > + dest[i+7] += src[i * 4 + 7]; > + } > +} > + > + > +/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 0\]} 1 } } */ > +/* { dg-final { scan-assembler-times {cmp\s+x[0-9]+, 61440} 1 } } */ > +/* { dg-final { scan-assembler-times {subs\s+x[0-9]+, x[0-9]+, 61440} 1 } } */ > + > +/* SVE spill, requires probing as vector size is unknown at compile time. */ > + > -- --xHFwDpU9dbj6ez1V Content-Type: text/x-diff; charset=utf-8 Content-Disposition: attachment; filename="rb9914.patch" Content-length: 14985 diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h index ef95fc829b83886e2ff00e4664e31af916e99b0c..e2d8734a8d5e513588e3b0318e9c67fdaebdf0d4 100644 --- a/gcc/config/aarch64/aarch64-protos.h +++ b/gcc/config/aarch64/aarch64-protos.h @@ -453,6 +453,7 @@ void aarch64_asm_output_labelref (FILE *, const char *); void aarch64_cpu_cpp_builtins (cpp_reader *); const char * aarch64_gen_far_branch (rtx *, int, const char *, const char *); const char * aarch64_output_probe_stack_range (rtx, rtx); +const char * aarch64_output_probe_sve_stack_clash (rtx, rtx, rtx, rtx); void aarch64_err_no_fpadvsimd (machine_mode); void aarch64_expand_epilogue (bool); void aarch64_expand_mov_immediate (rtx, rtx, rtx (*) (rtx, rtx) = 0); diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index d4b13d48d852a70848fc7c51fd867e776efb5e55..245fd6832ec0afe27c42a242c901a2e13024f935 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -208,6 +208,7 @@ static bool aarch64_builtin_support_vector_misalignment (machine_mode mode, static machine_mode aarch64_simd_container_mode (scalar_mode, poly_int64); static bool aarch64_print_address_internal (FILE*, machine_mode, rtx, aarch64_addr_query_type); +static HOST_WIDE_INT aarch64_uimm12_nearest_value (HOST_WIDE_INT val); /* Major revision number of the ARM Architecture implemented by the target. */ unsigned aarch64_architecture_version; @@ -3973,6 +3974,83 @@ aarch64_output_probe_stack_range (rtx reg1, rtx reg2) return ""; } +/* Emit the probe loop for doing stack clash probes and stack adjustments for + SVE. This emits probes from BASE to BASE - ADJUSTMENT based on a guard size + of GUARD_SIZE. When a probe is emitted it is done at MIN_PROBE_OFFSET bytes + from the current BASE at an interval of MIN_PROBE_OFFSET. By the end of this + function BASE = BASE - ADJUSTMENT. 
*/ + +const char * +aarch64_output_probe_sve_stack_clash (rtx base, rtx adjustment, + rtx min_probe_threshold, rtx guard_size) +{ + /* This function is not allowed to use any instruction generation function + like gen_ and friends. If you do you'll likely ICE during CFG validation, + so instead emit the code you want using output_asm_insn. */ + gcc_assert (flag_stack_clash_protection); + gcc_assert (CONST_INT_P (min_probe_threshold) && CONST_INT_P (guard_size)); + gcc_assert (INTVAL (guard_size) > INTVAL (min_probe_threshold)); + + /* The minimum required allocation before the residual requires probing. */ + HOST_WIDE_INT residual_probe_guard = INTVAL (min_probe_threshold); + + /* Clamp the value down to the nearest value that can be used with a cmp. */ + residual_probe_guard = aarch64_uimm12_nearest_value (residual_probe_guard); + rtx probe_offset_value_rtx = gen_int_mode (residual_probe_guard, Pmode); + + gcc_assert (INTVAL (min_probe_threshold) >= residual_probe_guard); + gcc_assert (aarch64_uimm12_shift (residual_probe_guard)); + + static int labelno = 0; + char loop_start_lab[32]; + char loop_end_lab[32]; + rtx xops[2]; + + ASM_GENERATE_INTERNAL_LABEL (loop_start_lab, "SVLPSPL", labelno); + ASM_GENERATE_INTERNAL_LABEL (loop_end_lab, "SVLPEND", labelno++); + + /* Emit loop start label. */ + ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_start_lab); + + /* ADJUSTMENT == RESIDUAL_PROBE_GUARD. */ + xops[0] = adjustment; + xops[1] = probe_offset_value_rtx; + output_asm_insn ("cmp\t%0, %1", xops); + + /* Branch to end if not enough adjustment to probe. */ + fputs ("\tb.lt\t", asm_out_file); + assemble_name_raw (asm_out_file, loop_end_lab); + fputc ('\n', asm_out_file); + + /* BASE = BASE - RESIDUAL_PROBE_GUARD. */ + xops[0] = base; + xops[1] = gen_int_mode (residual_probe_guard, Pmode); + output_asm_insn ("sub\t%0, %0, %1", xops); + + /* Probe at BASE. */ + xops[1] = const0_rtx; + output_asm_insn ("str\txzr, [%0, %1]", xops); + + /* ADJUSTMENT = ADJUSTMENT - RESIDUAL_PROBE_GUARD. */ + xops[0] = adjustment; + xops[1] = probe_offset_value_rtx; + output_asm_insn ("sub\t%0, %0, %1", xops); + + /* Branch to start if still more bytes to allocate. */ + fputs ("\tb\t", asm_out_file); + assemble_name_raw (asm_out_file, loop_start_lab); + fputc ('\n', asm_out_file); + + /* No probe leave. */ + ASM_OUTPUT_INTERNAL_LABEL (asm_out_file, loop_end_lab); + + /* BASE = BASE - ADJUSTMENT. */ + xops[0] = base; + xops[1] = adjustment; + output_asm_insn ("sub\t%0, %0, %1", xops); + return ""; +} + /* Determine whether a frame chain needs to be generated. */ static bool aarch64_needs_frame_chain (void) @@ -4835,21 +4913,76 @@ aarch64_allocate_and_probe_stack_space (rtx temp1, rtx temp2, } } - HOST_WIDE_INT size; /* If SIZE is not large enough to require probing, just adjust the stack and exit. */ - if (!poly_size.is_constant (&size) - || known_lt (poly_size, min_probe_threshold) + if (known_lt (poly_size, min_probe_threshold) || !flag_stack_clash_protection) { aarch64_sub_sp (temp1, temp2, poly_size, frame_related_p); return; } + HOST_WIDE_INT size; + /* Handle the SVE non-constant case first. */ + if (!poly_size.is_constant (&size)) + { + + if (dump_file) + { + fprintf (dump_file, "Stack clash SVE prologue: "); + print_dec (poly_size, dump_file); + fprintf (dump_file, " bytes, dynamic probing will be required.\n"); + } + + /* First calculate the amount of bytes we're actually spilling. 
*/ + aarch64_add_offset (Pmode, temp1, CONST0_RTX (GET_MODE (temp1)), + poly_size, temp1, temp2, false, true); + + rtx_insn *insn = get_last_insn (); + + if (frame_related_p) + { + /* This is done to provide unwinding information for the stack + adjustments we're about to do, however to prevent the optimizers + from removing the R15 move and leaving the CFA note (which would be + very wrong) we tie the old and new stack pointer together. + The tie will expand to nothing but the optimizers will not touch + the instruction. */ + rtx stack_ptr_copy = gen_rtx_REG (Pmode, R15_REGNUM); + emit_move_insn (stack_ptr_copy, stack_pointer_rtx); + emit_insn (gen_stack_tie (stack_ptr_copy, stack_pointer_rtx)); + + /* We want the CFA independent of the stack pointer for the + duration of the loop. */ + add_reg_note (insn, REG_CFA_DEF_CFA, stack_ptr_copy); + RTX_FRAME_RELATED_P (insn) = 1; + } + + rtx probe_const = gen_int_mode (min_probe_threshold, DImode); + rtx guard_const = gen_int_mode (guard_size, DImode); + + insn = emit_insn (gen_probe_sve_stack_clash (stack_pointer_rtx, + stack_pointer_rtx, temp1, + probe_const, guard_const)); + + /* Now reset the CFA register if needed. */ + if (frame_related_p) + { + add_reg_note (insn, REG_CFA_DEF_CFA, + gen_rtx_PLUS (Pmode, stack_pointer_rtx, + gen_int_mode (poly_size, Pmode))); + RTX_FRAME_RELATED_P (insn) = 1; + } + + return; + } + if (dump_file) - fprintf (dump_file, - "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC " bytes" - ", probing will be required.\n", size); + { + fprintf (dump_file, + "Stack clash AArch64 prologue: " HOST_WIDE_INT_PRINT_DEC + " bytes, probing will be required.\n", size); + } /* Round size to the nearest multiple of guard_size, and calculate the residual as the difference between the original size and the rounded @@ -5458,6 +5591,16 @@ aarch64_uimm12_shift (HOST_WIDE_INT val) ); } +/* Returns the nearest value to VAL that will fit as a 12-bit unsigned immediate + that can be created with a left shift of 0 or 12. */ +static HOST_WIDE_INT +aarch64_uimm12_nearest_value (HOST_WIDE_INT val) +{ + if ((val & (((HOST_WIDE_INT) 0xfff) << 0)) == val) + return val; + + return val & (((HOST_WIDE_INT) 0xfff) << 12); +} /* Return true if val is an immediate that can be loaded into a register by a MOVZ instruction. */ diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index b8da13f14fa9990e8fdc3c71ed407c8afc65a324..100bd2cc603656d2b8ba97f905c5eff16c59793b 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -6464,6 +6464,25 @@ [(set_attr "length" "32")] ) +;; This instruction is used to generate the stack clash stack adjustment and +;; probing loop. We can't change the control flow during prologue and epilogue +;; code generation. So we must emit a volatile unspec and expand it later on. + +(define_insn "probe_sve_stack_clash" + [(set (match_operand:DI 0 "register_operand" "=rk") + (unspec_volatile:DI [(match_operand:DI 1 "register_operand" "0") + (match_operand:DI 2 "register_operand" "r") + (match_operand:DI 3 "const_int_operand" "n") + (match_operand:DI 4 "aarch64_plus_immediate" "L")] + UNSPECV_PROBE_STACK_RANGE))] + "TARGET_SVE" +{ + return aarch64_output_probe_sve_stack_clash (operands[0], operands[2], + operands[3], operands[4]); +} + [(set_attr "length" "28")] +) + ;; Named pattern for expanding thread pointer reference. 
(define_expand "get_thread_pointerdi" [(match_operand:DI 0 "register_operand" "=r")] diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-3.c b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-3.c new file mode 100644 index 0000000000000000000000000000000000000000..6ea87392843e4b9561cf6d43ffee57887db62e4e --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/stack-check-cfa-3.c @@ -0,0 +1,30 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=armv8-a+sve -fstack-clash-protection --param stack-clash-protection-guard-size=16 -funwind-tables -ftree-vectorize" } */ +/* { dg-require-effective-target supports_stack_clash_protection } */ + +#include + +#define N 20040 + +void __attribute__ ((noinline, noclone)) +test (int8_t *restrict dest, int8_t *restrict src) +{ + for (int i = 0; i < N; i+=8) + { + dest[i] += src[i * 4]; + dest[i+1] += src[i * 4 + 1]; + dest[i+2] += src[i * 4 + 2]; + dest[i+3] += src[i * 4 + 3]; + dest[i+4] += src[i * 4 + 4]; + dest[i+5] += src[i * 4 + 5]; + dest[i+6] += src[i * 4 + 6]; + dest[i+7] += src[i * 4 + 7]; + } +} +/* { dg-final { scan-assembler-times {mov\tx15, sp} 1 } } */ +/* { dg-final { scan-assembler-times {\.cfi_def_cfa_register 15} 1 } } */ +/* { dg-final { scan-assembler-times {\.cfi_escape 0xf,0xc,0x8f,0,0x92,0x2e,0,0x8,0x58,0x1e,0x23,0xb0,0x2,0x22} 1 } } */ + +/* Checks that the CFA notes are correct for every sp adjustment, but we also + need to make sure we can unwind correctly before the frame is set up. So + check that we're emitting r15 with a copy of sp an setting the CFA there. */ diff --git a/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-16.c b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-16.c new file mode 100644 index 0000000000000000000000000000000000000000..fd0e987597eba406fa7351433fe7157743aeca42 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/stack-check-prologue-16.c @@ -0,0 +1,32 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target supports_stack_clash_protection } */ +/* { dg-options "-O2 -march=armv8-a+sve -fstack-clash-protection --param stack-clash-protection-guard-size=16 -ftree-vectorize" } */ + + +#include + +#define N 20040 + +void __attribute__ ((noinline, noclone)) +test (int8_t *restrict dest, int8_t *restrict src) +{ + for (int i = 0; i < N; i+=8) + { + dest[i] += src[i * 4]; + dest[i+1] += src[i * 4 + 1]; + dest[i+2] += src[i * 4 + 2]; + dest[i+3] += src[i * 4 + 3]; + dest[i+4] += src[i * 4 + 4]; + dest[i+5] += src[i * 4 + 5]; + dest[i+6] += src[i * 4 + 6]; + dest[i+7] += src[i * 4 + 7]; + } +} + + +/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 0\]} 1 } } */ +/* { dg-final { scan-assembler-times {cmp\s+x[0-9]+, 61440} 1 } } */ +/* { dg-final { scan-assembler-times {sub\s+x[0-9]+, x[0-9]+, 61440} 1 } } */ + +/* SVE spill, requires probing as vector size is unknown at compile time. 
*/ + diff --git a/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_24.c b/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_24.c new file mode 100644 index 0000000000000000000000000000000000000000..4199e391881f1e260535c3fbdb6c41e428d54fbf --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_24.c @@ -0,0 +1,36 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target supports_stack_clash_protection } */ +/* { dg-options "-O2 -ftree-vectorize -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */ + +#include + +#define N 2000 +#define S 2 * 64 * 1024 + +#define TEST_LOOP(NAME, TYPE) \ + void __attribute__ ((noinline, noclone)) \ + NAME (TYPE *restrict dest, TYPE *restrict src) \ + { \ + volatile char foo[S]; \ + foo[S-1]=1; \ + for (int i = 0; i < N; i=i+4) \ + { \ + dest[i] += src[i * 4]; \ + dest[i+1] += src[(i+1) * 4]; \ + dest[i+2] += src[(i+2) * 4]; \ + dest[i+3] += src[(i+3) * 4]; \ + } \ + } + +#define TEST(NAME) \ + TEST_LOOP (NAME##_i8, int8_t) \ + TEST_LOOP (NAME##_i16, uint16_t) \ + TEST_LOOP (NAME##_f32, float) \ + TEST_LOOP (NAME##_f64, double) + +TEST (test) + +/* Check the vectorized loop for stack clash probing. */ +/* { dg-final { scan-assembler-times {str\s+xzr, \[sp, 0\]} 2 } } */ +/* { dg-final { scan-assembler-times {cmp\s+x[0-9]+, 61440} 2 } } */ +/* { dg-final { scan-assembler-times {sub\s+x[0-9]+, x[0-9]+, 61440} 2 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_24_run.c b/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_24_run.c new file mode 100644 index 0000000000000000000000000000000000000000..cbf1bdfe1e3f23d342042c5b1bb5994714a65cec --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/sve/struct_vect_24_run.c @@ -0,0 +1,37 @@ +/* { dg-do run { target aarch64_sve_hw } } */ +/* { dg-require-effective-target supports_stack_clash_protection } */ +/* { dg-options "-O2 -ftree-vectorize -fstack-clash-protection --param stack-clash-protection-guard-size=16" } */ + +#include "struct_vect_22.c" + +#undef TEST_LOOP +#define TEST_LOOP(NAME, TYPE) \ + { \ + TYPE out[N]; \ + TYPE in[N * 4]; \ + for (int i = 0; i < N; ++i) \ + { \ + out[i] = i * 7 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + for (int i = 0; i < N * 4; ++i) \ + { \ + in[i] = i * 9 / 2; \ + asm volatile ("" ::: "memory"); \ + } \ + NAME (out, in); \ + for (int i = 0; i < N; ++i) \ + { \ + TYPE expected = i * 7 / 2 + in[i * 4]; \ + if (out[i] != expected) \ + __builtin_abort (); \ + asm volatile ("" ::: "memory"); \ + } \ + } + +int __attribute__ ((optimize (1))) +main (void) +{ + TEST (test); + return 0; +} --xHFwDpU9dbj6ez1V--