public inbox for gcc-patches@gcc.gnu.org
* Re: [PATCH] tree-optimization/112736 - avoid overread with non-grouped SLP load
       [not found] <20231212105302.9AB6D385AC2F@sourceware.org>
@ 2023-12-12 12:43 ` Richard Sandiford
  2023-12-12 13:22   ` Richard Biener
  0 siblings, 1 reply; 3+ messages in thread
From: Richard Sandiford @ 2023-12-12 12:43 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <rguenther@suse.de> writes:
> The following avoids over/under-read of storage when vectorizing
> a non-grouped load with SLP.  Instead of forcing peeling for gaps
> use a smaller load for the last vector which might access excess
> elements.  This builds upon the existing optimization avoiding
> peeling for gaps, generalizing it to all gap widths leaving a
> power-of-two remaining number of elements (but it doesn't replace
> or improve that particular case at this point).
>
> I wonder if the poly relational compares I set up are good enough
> to guarantee /* remain should now be > 0 and < nunits.  */.
>
> There is existing test coverage that always runs into
> /* DR will be unused.  */ when the gap is wider than nunits.  Compared to the
> existing gap == nunits/2 case this only adjusts the load that will
> cause the overrun at the end, not every load.  Apart from the
> poly relational compares it should reliably cover these cases but
> I'll leave it for stage1 to remove.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
> built and tested SPEC CPU 2017.
>
> OK?
>
> 	PR tree-optimization/112736
> 	* tree-vect-stmts.cc (vectorizable_load): Extend optimization
> 	to avoid peeling for gaps to handle single-element non-groups
> 	we now allow with SLP.
>
> 	* gcc.dg/torture/pr112736.c: New testcase.

Mostly LGTM FWIW.  A couple of comments below:

> ---
>  gcc/testsuite/gcc.dg/torture/pr112736.c | 27 ++++++++
>  gcc/tree-vect-stmts.cc                  | 86 ++++++++++++++++++++-----
>  2 files changed, 96 insertions(+), 17 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/torture/pr112736.c
>
> diff --git a/gcc/testsuite/gcc.dg/torture/pr112736.c b/gcc/testsuite/gcc.dg/torture/pr112736.c
> new file mode 100644
> index 00000000000..6abb56edba3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/torture/pr112736.c
> @@ -0,0 +1,27 @@
> +/* { dg-do run { target *-*-linux* *-*-gnu* *-*-uclinux* } } */
> +
> +#include <sys/mman.h>
> +#include <unistd.h>
> +
> +int a, c[3][5];
> +
> +void __attribute__((noipa))
> +fn1 (int * __restrict b)
> +{
> +  int e;
> +  for (a = 2; a >= 0; a--)
> +    for (e = 0; e < 4; e++)
> +      c[a][e] = b[a];
> +}
> +
> +int main()
> +{
> +  long pgsz = sysconf (_SC_PAGESIZE);
> +  void *p = mmap (NULL, pgsz * 2, PROT_READ|PROT_WRITE,
> +                  MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
> +  if (p == MAP_FAILED)
> +    return 0;
> +  mprotect (p, pgsz, PROT_NONE);
> +  fn1 (p + pgsz);
> +  return 0;
> +}
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 390c8472fd6..c03c4c08c9d 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -11465,26 +11465,70 @@ vectorizable_load (vec_info *vinfo,
>  			if (new_vtype != NULL_TREE)
>  			  ltype = half_vtype;
>  		      }
> +		    /* Try to use a single smaller load when we are about
> +		       to load accesses excess elements compared to the

s/accesses //

> +		       unrolled scalar loop.
> +		       ???  This should cover the above case as well.  */
> +		    else if (known_gt ((vec_num * j + i + 1) * nunits,
> +				       (group_size * vf - gap)))

At first it seemed odd to be using known_gt rather than maybe_gt here,
given that peeling for gaps is a correctness issue.  But as things stand
this is just an optimisation, and VLA vectors (to whatever extent they're
handled by this code) allegedly work correctly without it.  So I agree
known_gt is correct.  We might need to revisit it when dealing with the
??? though.
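
A minimal stand-alone model of that known_gt/maybe_gt distinction,
assuming a single runtime parameter x >= 0 and two coefficients -- an
illustration only, not GCC's real poly-int.h classes nor the values the
vectorizer actually compares:

  #include <cstdint>
  #include <cstdio>

  /* Model a quantity "c0 + c1 * x" where x >= 0 is only known at
     runtime, e.g. the number of extra chunks of a VLA vector.  */
  struct poly1 { int64_t c0, c1; };

  /* A > B for every possible x: must hold at x == 0 and must not
     degrade as x grows.  */
  static bool model_known_gt (poly1 a, poly1 b)
  { return a.c0 > b.c0 && a.c1 >= b.c1; }

  /* A > B for at least one x >= 0.  */
  static bool model_maybe_gt (poly1 a, poly1 b)
  { return a.c0 > b.c0 || a.c1 > b.c1; }

  int main ()
  {
    poly1 vec_end = { 2, 2 };    /* hypothetical (vec_num*j+i+1)*nunits */
    poly1 scalar_end = { 3, 0 }; /* hypothetical group_size*vf - gap */
    /* 2 + 2x > 3 holds for x >= 1 but not for x == 0, so the overrun
       would be "maybe" but not "known" -- the case maybe_gt would have
       to catch if this were a correctness rather than an optimisation
       question.  */
    printf ("known_gt %d, maybe_gt %d\n",
            model_known_gt (vec_end, scalar_end),
            model_maybe_gt (vec_end, scalar_end));
    return 0;
  }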

> +		      {
> +			if (known_ge ((vec_num * j + i + 1) * nunits
> +				      - (group_size * vf - gap), nunits))
> +			  /* DR will be unused.  */
> +			  ltype = NULL_TREE;
> +			else if (alignment_support_scheme == dr_aligned)
> +			  /* Aligned access to excess elements is OK if
> +			     at least one element is accessed in the
> +			     scalar loop.  */
> +			  ;
> +			else
> +			  {
> +			    auto remain
> +			      = ((group_size * vf - gap)
> +				 - (vec_num * j + i) * nunits);
> +			    /* remain should now be > 0 and < nunits.  */
> +			    unsigned num;
> +			    if (constant_multiple_p (nunits, remain, &num))
> +			      {
> +				tree ptype;
> +				new_vtype
> +				  = vector_vector_composition_type (vectype,
> +								    num,
> +								    &ptype);
> +				if (new_vtype)
> +				  ltype = ptype;
> +			      }
> +			    /* Else use multiple loads or a masked load?  */
> +			  }
> +		      }
>  		    tree offset
>  		      = (dataref_offset ? dataref_offset
>  					: build_int_cst (ref_type, 0));
> -		    if (ltype != vectype
> -			&& memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> +		    if (!ltype)
> +		      ;
> +		    else if (ltype != vectype
> +			     && memory_access_type == VMAT_CONTIGUOUS_REVERSE)
>  		      {
>  			unsigned HOST_WIDE_INT gap_offset
> -			  = gap * tree_to_uhwi (TYPE_SIZE_UNIT (elem_type));
> +			  = (tree_to_uhwi (TYPE_SIZE_UNIT (vectype))
> +			     - tree_to_uhwi (TYPE_SIZE_UNIT (ltype)));
>  			tree gapcst = build_int_cst (ref_type, gap_offset);
>  			offset = size_binop (PLUS_EXPR, offset, gapcst);

Can this trigger for the VLA case?  Probably not.  But it might be worth
using tree_to_poly_uint64 instead of tree_to_uhwi, and auto instead of
unsigned HOST_WIDE_INT, since the calculation is no longer based on scalars.

>  		      }
> -		    data_ref
> -		      = fold_build2 (MEM_REF, ltype, dataref_ptr, offset);
> -		    if (alignment_support_scheme == dr_aligned)
> -		      ;
> -		    else
> -		      TREE_TYPE (data_ref)
> -			= build_aligned_type (TREE_TYPE (data_ref),
> -					      align * BITS_PER_UNIT);
> -		    if (ltype != vectype)
> +		    if (ltype)
> +		      {
> +			data_ref
> +			  = fold_build2 (MEM_REF, ltype, dataref_ptr, offset);
> +			if (alignment_support_scheme == dr_aligned)
> +			  ;
> +			else
> +			  TREE_TYPE (data_ref)
> +			    = build_aligned_type (TREE_TYPE (data_ref),
> +						  align * BITS_PER_UNIT);
> +		      }
> +		    if (!ltype)
> +		      data_ref = build_constructor (vectype, NULL);
> +		    else if (ltype != vectype)
>  		      {
>  			vect_copy_ref_info (data_ref,
>  					    DR_REF (first_dr_info->dr));
> @@ -11494,18 +11538,26 @@ vectorizable_load (vec_info *vinfo,
>  						     gsi);
>  			data_ref = NULL;
>  			vec<constructor_elt, va_gc> *v;
> -			vec_alloc (v, 2);
> +			unsigned num
> +			  = (exact_div (tree_to_poly_uint64
> +					  (TYPE_SIZE_UNIT (vectype)),
> +					tree_to_poly_uint64
> +					  (TYPE_SIZE_UNIT (ltype)))
> +			     .to_constant ());

FWIW, vector_unroll_factor probably fits here, but maybe that's a bit
of a stretch.  The current version is OK with me too.
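
As a stand-alone sketch of what the composed value amounts to (made-up
fixed-width sizes, not the GCC internals): the last vector is built from
the one piece that is actually loaded plus num - 1 zero pieces, padded at
the front for VMAT_CONTIGUOUS_REVERSE and at the back otherwise:

  #include <array>
  #include <cstdio>

  /* Hypothetical sizes: a 4-element "vectype" built from 2-element
     pieces, so num == 2.  */
  using piece = std::array<int, 2>;
  using vec4 = std::array<int, 4>;

  static vec4
  compose (piece loaded, bool reverse)
  {
    constexpr unsigned num
      = std::tuple_size<vec4>::value / std::tuple_size<piece>::value;
    vec4 v{};   /* zero-filled, like the build_zero_cst elements */
    /* The loaded piece goes into the highest slots in the reversed
       case, into the lowest slots otherwise.  */
    unsigned pos = reverse ? (num - 1) * loaded.size () : 0;
    for (unsigned k = 0; k < loaded.size (); ++k)
      v[pos + k] = loaded[k];
    return v;
  }

  int main ()
  {
    piece p = { 7, 8 };             /* the elements we may legally read */
    vec4 fwd = compose (p, false);  /* { 7, 8, 0, 0 } */
    vec4 rev = compose (p, true);   /* { 0, 0, 7, 8 } */
    printf ("%d %d %d %d / %d %d %d %d\n", fwd[0], fwd[1], fwd[2], fwd[3],
            rev[0], rev[1], rev[2], rev[3]);
    return 0;
  }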

Thanks,
Richard

> +			vec_alloc (v, num);
>  			if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
>  			  {
> -			    CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> -						    build_zero_cst (ltype));
> +			    while (--num)
> +			      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> +						      build_zero_cst (ltype));
>  			    CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, tem);
>  			  }
>  			else
>  			  {
>  			    CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, tem);
> -			    CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> -						    build_zero_cst (ltype));
> +			    while (--num)
> +			      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> +						      build_zero_cst (ltype));
>  			  }
>  			gcc_assert (new_vtype != NULL_TREE);
>  			if (new_vtype == vectype)


* Re: [PATCH] tree-optimization/112736 - avoid overread with non-grouped SLP load
  2023-12-12 12:43 ` [PATCH] tree-optimization/112736 - avoid overread with non-grouped SLP load Richard Sandiford
@ 2023-12-12 13:22   ` Richard Biener
  0 siblings, 0 replies; 3+ messages in thread
From: Richard Biener @ 2023-12-12 13:22 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Tue, 12 Dec 2023, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > The following avoids over/under-read of storage when vectorizing
> > a non-grouped load with SLP.  Instead of forcing peeling for gaps
> > use a smaller load for the last vector which might access excess
> > elements.  This builds upon the existing optimization avoiding
> > peeling for gaps, generalizing it to all gap widths leaving a
> > power-of-two remaining number of elements (but it doesn't replace
> > or improve that particular case at this point).
> >
> > I wonder if the poly relational compares I set up are good enough
> > to guarantee /* remain should now be > 0 and < nunits.  */.
> >
> > There is existing test coverage that always runs into
> > /* DR will be unused.  */ when the gap is wider than nunits.  Compared to the
> > existing gap == nunits/2 case this only adjusts the load that will
> > cause the overrun at the end, not every load.  Apart from the
> > poly relational compares it should reliably cover these cases but
> > I'll leave it for stage1 to remove.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
> > built and tested SPEC CPU 2017.
> >
> > OK?
> >
> > 	PR tree-optimization/112736
> > 	* tree-vect-stmts.cc (vectorizable_load): Extend optimization
> > 	to avoid peeling for gaps to handle single-element non-groups
> > 	we now allow with SLP.
> >
> > 	* gcc.dg/torture/pr112736.c: New testcase.
> 
> Mostly LGTM FWIW.  A couple of comments below:
> 
> > ---
> >  gcc/testsuite/gcc.dg/torture/pr112736.c | 27 ++++++++
> >  gcc/tree-vect-stmts.cc                  | 86 ++++++++++++++++++++-----
> >  2 files changed, 96 insertions(+), 17 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/torture/pr112736.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/torture/pr112736.c b/gcc/testsuite/gcc.dg/torture/pr112736.c
> > new file mode 100644
> > index 00000000000..6abb56edba3
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/torture/pr112736.c
> > @@ -0,0 +1,27 @@
> > +/* { dg-do run { target *-*-linux* *-*-gnu* *-*-uclinux* } } */
> > +
> > +#include <sys/mman.h>
> > +#include <unistd.h>
> > +
> > +int a, c[3][5];
> > +
> > +void __attribute__((noipa))
> > +fn1 (int * __restrict b)
> > +{
> > +  int e;
> > +  for (a = 2; a >= 0; a--)
> > +    for (e = 0; e < 4; e++)
> > +      c[a][e] = b[a];
> > +}
> > +
> > +int main()
> > +{
> > +  long pgsz = sysconf (_SC_PAGESIZE);
> > +  void *p = mmap (NULL, pgsz * 2, PROT_READ|PROT_WRITE,
> > +                  MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
> > +  if (p == MAP_FAILED)
> > +    return 0;
> > +  mprotect (p, pgsz, PROT_NONE);
> > +  fn1 (p + pgsz);
> > +  return 0;
> > +}
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index 390c8472fd6..c03c4c08c9d 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -11465,26 +11465,70 @@ vectorizable_load (vec_info *vinfo,
> >  			if (new_vtype != NULL_TREE)
> >  			  ltype = half_vtype;
> >  		      }
> > +		    /* Try to use a single smaller load when we are about
> > +		       to load accesses excess elements compared to the
> 
> s/accesses //
> 
> > +		       unrolled scalar loop.
> > +		       ???  This should cover the above case as well.  */
> > +		    else if (known_gt ((vec_num * j + i + 1) * nunits,
> > +				       (group_size * vf - gap)))
> 
> At first it seemed odd to be using known_gt rather than maybe_gt here,
> given that peeling for gaps is a correctness issue.  But as things stand
> this is just an optimisation, and VLA vectors (to whatever extent they're
> handled by this code) allegedly work correctly without it.  So I agree
> known_gt is correct.  We might need to revisit it when dealing with the
> ??? though.

Yeah, maybe_gt would be needed if this were for correctness, but then ...

> > +		      {
> > +			if (known_ge ((vec_num * j + i + 1) * nunits
> > +				      - (group_size * vf - gap), nunits))
> > +			  /* DR will be unused.  */
> > +			  ltype = NULL_TREE;
> > +			else if (alignment_support_scheme == dr_aligned)
> > +			  /* Aligned access to excess elements is OK if
> > +			     at least one element is accessed in the
> > +			     scalar loop.  */
> > +			  ;
> > +			else
> > +			  {
> > +			    auto remain
> > +			      = ((group_size * vf - gap)
> > +				 - (vec_num * j + i) * nunits);
> > +			    /* remain should now be > 0 and < nunits.  */

... we probably don't know it's < nunits anymore.  Indeed, let's revisit
this at some later point.

> > +			    unsigned num;
> > +			    if (constant_multiple_p (nunits, remain, &num))
> > +			      {
> > +				tree ptype;
> > +				new_vtype
> > +				  = vector_vector_composition_type (vectype,
> > +								    num,
> > +								    &ptype);
> > +				if (new_vtype)
> > +				  ltype = ptype;
> > +			      }
> > +			    /* Else use multiple loads or a masked load?  */
> > +			  }
> > +		      }
> >  		    tree offset
> >  		      = (dataref_offset ? dataref_offset
> >  					: build_int_cst (ref_type, 0));
> > -		    if (ltype != vectype
> > -			&& memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> > +		    if (!ltype)
> > +		      ;
> > +		    else if (ltype != vectype
> > +			     && memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> >  		      {
> >  			unsigned HOST_WIDE_INT gap_offset
> > -			  = gap * tree_to_uhwi (TYPE_SIZE_UNIT (elem_type));
> > +			  = (tree_to_uhwi (TYPE_SIZE_UNIT (vectype))
> > +			     - tree_to_uhwi (TYPE_SIZE_UNIT (ltype)));
> >  			tree gapcst = build_int_cst (ref_type, gap_offset);
> >  			offset = size_binop (PLUS_EXPR, offset, gapcst);
> 
> Can this trigger for the VLA case?  Probably not.  But it might be worth
> using tree_to_poly_uint64 instead of tree_to_uhwi, and auto instead of
> unsigned HOST_WIDE_INT, since the calculation is no longer based on scalars.

I'm not 100% sure it cannot, so fixed.

> >  		      }
> > -		    data_ref
> > -		      = fold_build2 (MEM_REF, ltype, dataref_ptr, offset);
> > -		    if (alignment_support_scheme == dr_aligned)
> > -		      ;
> > -		    else
> > -		      TREE_TYPE (data_ref)
> > -			= build_aligned_type (TREE_TYPE (data_ref),
> > -					      align * BITS_PER_UNIT);
> > -		    if (ltype != vectype)
> > +		    if (ltype)
> > +		      {
> > +			data_ref
> > +			  = fold_build2 (MEM_REF, ltype, dataref_ptr, offset);
> > +			if (alignment_support_scheme == dr_aligned)
> > +			  ;
> > +			else
> > +			  TREE_TYPE (data_ref)
> > +			    = build_aligned_type (TREE_TYPE (data_ref),
> > +						  align * BITS_PER_UNIT);
> > +		      }
> > +		    if (!ltype)
> > +		      data_ref = build_constructor (vectype, NULL);
> > +		    else if (ltype != vectype)
> >  		      {
> >  			vect_copy_ref_info (data_ref,
> >  					    DR_REF (first_dr_info->dr));
> > @@ -11494,18 +11538,26 @@ vectorizable_load (vec_info *vinfo,
> >  						     gsi);
> >  			data_ref = NULL;
> >  			vec<constructor_elt, va_gc> *v;
> > -			vec_alloc (v, 2);
> > +			unsigned num
> > +			  = (exact_div (tree_to_poly_uint64
> > +					  (TYPE_SIZE_UNIT (vectype)),
> > +					tree_to_poly_uint64
> > +					  (TYPE_SIZE_UNIT (ltype)))
> > +			     .to_constant ());
> 
> FWIW, vector_unroll_factor probably fits here, but maybe that's a bit
> of a stretch.  The current version is OK with me too.

Yeah, we've computed this number above either statically (two, for the
"old" correctness case) or via constant_multiple_p.  It was just least
awkward to recompute it here ...  I'll add a comment.
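
(As a purely hypothetical worked example of that recomputation: with a
16-byte V4SI vectype and remain == 1, constant_multiple_p (4, 1) gives
num == 4 and ltype is a 4-byte piece, so exact_div (16, 4) recovers the
same 4; for the old gap == nunits/2 path ltype is the 8-byte half vector
and the division yields the static 2.)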

Richard.

> Thanks,
> Richard
> 
> > +			vec_alloc (v, num);
> >  			if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
> >  			  {
> > -			    CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> > -						    build_zero_cst (ltype));
> > +			    while (--num)
> > +			      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> > +						      build_zero_cst (ltype));
> >  			    CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, tem);
> >  			  }
> >  			else
> >  			  {
> >  			    CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, tem);
> > -			    CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> > -						    build_zero_cst (ltype));
> > +			    while (--num)
> > +			      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
> > +						      build_zero_cst (ltype));
> >  			  }
> >  			gcc_assert (new_vtype != NULL_TREE);
> >  			if (new_vtype == vectype)
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


* [PATCH] tree-optimization/112736 - avoid overread with non-grouped SLP load
@ 2023-12-12 10:51 Richard Biener
  0 siblings, 0 replies; 3+ messages in thread
From: Richard Biener @ 2023-12-12 10:51 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.sandiford

The following avoids over/under-read of storage when vectorizing
a non-grouped load with SLP.  Instead of forcing peeling for gaps
use a smaller load for the last vector which might access excess
elements.  This builds upon the existing optimization avoiding
peeling for gaps, generalizing it to all gap widths leaving a
power-of-two remaining number of elements (but it doesn't replace
or improve that particular case at this point).
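
For concreteness, a small stand-alone sketch of that arithmetic with
made-up fixed-width numbers (an illustration only, not the GCC code
itself): with nunits == 4 and a scalar iteration space of 6 elements,
the second vector would over-read, and only the remaining 2 elements are
loaded for it:

  #include <cstdio>

  int main ()
  {
    const unsigned nunits = 4, group_size = 3, vf = 2, gap = 0;
    const unsigned vec_num = 2, j = 0;   /* two vectors, first unroll copy */
    const unsigned scalar_elems = group_size * vf - gap;   /* 6 */
    for (unsigned i = 0; i < vec_num; ++i)
      {
        unsigned end = (vec_num * j + i + 1) * nunits;     /* 4, then 8 */
        if (end <= scalar_elems)
          printf ("vector %u: full %u-element load\n", i, nunits);
        else
          {
            unsigned remain = scalar_elems - (vec_num * j + i) * nunits;
            /* remain == 2 here, so the vector is composed from
               nunits / remain == 2 pieces: one loaded half plus one
               zero half, instead of peeling for gaps.  */
            printf ("vector %u: load %u elements, zero-pad %u\n",
                    i, remain, nunits - remain);
          }
      }
    return 0;
  }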

I wonder if the poly relational compares I set up are good enough
to guarantee /* remain should now be > 0 and < nunits.  */.

There is existing test coverage that always runs into
/* DR will be unused.  */ when the gap is wider than nunits.  Compared to the
existing gap == nunits/2 case this only adjusts the load that will
cause the overrun at the end, not every load.  Apart from the
poly relational compares it should reliably cover these cases but
I'll leave it for stage1 to remove.

Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
built and tested SPEC CPU 2017.

OK?

	PR tree-optimization/112736
	* tree-vect-stmts.cc (vectorizable_load): Extend optimization
	to avoid peeling for gaps to handle single-element non-groups
	we now allow with SLP.

	* gcc.dg/torture/pr112736.c: New testcase.
---
 gcc/testsuite/gcc.dg/torture/pr112736.c | 27 ++++++++
 gcc/tree-vect-stmts.cc                  | 86 ++++++++++++++++++++-----
 2 files changed, 96 insertions(+), 17 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/torture/pr112736.c

diff --git a/gcc/testsuite/gcc.dg/torture/pr112736.c b/gcc/testsuite/gcc.dg/torture/pr112736.c
new file mode 100644
index 00000000000..6abb56edba3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/torture/pr112736.c
@@ -0,0 +1,27 @@
+/* { dg-do run { target *-*-linux* *-*-gnu* *-*-uclinux* } } */
+
+#include <sys/mman.h>
+#include <unistd.h>
+
+int a, c[3][5];
+
+void __attribute__((noipa))
+fn1 (int * __restrict b)
+{
+  int e;
+  for (a = 2; a >= 0; a--)
+    for (e = 0; e < 4; e++)
+      c[a][e] = b[a];
+}
+
+int main()
+{
+  long pgsz = sysconf (_SC_PAGESIZE);
+  void *p = mmap (NULL, pgsz * 2, PROT_READ|PROT_WRITE,
+                  MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
+  if (p == MAP_FAILED)
+    return 0;
+  mprotect (p, pgsz, PROT_NONE);
+  fn1 (p + pgsz);
+  return 0;
+}
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 390c8472fd6..c03c4c08c9d 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -11465,26 +11465,70 @@ vectorizable_load (vec_info *vinfo,
 			if (new_vtype != NULL_TREE)
 			  ltype = half_vtype;
 		      }
+		    /* Try to use a single smaller load when we are about
+		       to load accesses excess elements compared to the
+		       unrolled scalar loop.
+		       ???  This should cover the above case as well.  */
+		    else if (known_gt ((vec_num * j + i + 1) * nunits,
+				       (group_size * vf - gap)))
+		      {
+			if (known_ge ((vec_num * j + i + 1) * nunits
+				      - (group_size * vf - gap), nunits))
+			  /* DR will be unused.  */
+			  ltype = NULL_TREE;
+			else if (alignment_support_scheme == dr_aligned)
+			  /* Aligned access to excess elements is OK if
+			     at least one element is accessed in the
+			     scalar loop.  */
+			  ;
+			else
+			  {
+			    auto remain
+			      = ((group_size * vf - gap)
+				 - (vec_num * j + i) * nunits);
+			    /* remain should now be > 0 and < nunits.  */
+			    unsigned num;
+			    if (constant_multiple_p (nunits, remain, &num))
+			      {
+				tree ptype;
+				new_vtype
+				  = vector_vector_composition_type (vectype,
+								    num,
+								    &ptype);
+				if (new_vtype)
+				  ltype = ptype;
+			      }
+			    /* Else use multiple loads or a masked load?  */
+			  }
+		      }
 		    tree offset
 		      = (dataref_offset ? dataref_offset
 					: build_int_cst (ref_type, 0));
-		    if (ltype != vectype
-			&& memory_access_type == VMAT_CONTIGUOUS_REVERSE)
+		    if (!ltype)
+		      ;
+		    else if (ltype != vectype
+			     && memory_access_type == VMAT_CONTIGUOUS_REVERSE)
 		      {
 			unsigned HOST_WIDE_INT gap_offset
-			  = gap * tree_to_uhwi (TYPE_SIZE_UNIT (elem_type));
+			  = (tree_to_uhwi (TYPE_SIZE_UNIT (vectype))
+			     - tree_to_uhwi (TYPE_SIZE_UNIT (ltype)));
 			tree gapcst = build_int_cst (ref_type, gap_offset);
 			offset = size_binop (PLUS_EXPR, offset, gapcst);
 		      }
-		    data_ref
-		      = fold_build2 (MEM_REF, ltype, dataref_ptr, offset);
-		    if (alignment_support_scheme == dr_aligned)
-		      ;
-		    else
-		      TREE_TYPE (data_ref)
-			= build_aligned_type (TREE_TYPE (data_ref),
-					      align * BITS_PER_UNIT);
-		    if (ltype != vectype)
+		    if (ltype)
+		      {
+			data_ref
+			  = fold_build2 (MEM_REF, ltype, dataref_ptr, offset);
+			if (alignment_support_scheme == dr_aligned)
+			  ;
+			else
+			  TREE_TYPE (data_ref)
+			    = build_aligned_type (TREE_TYPE (data_ref),
+						  align * BITS_PER_UNIT);
+		      }
+		    if (!ltype)
+		      data_ref = build_constructor (vectype, NULL);
+		    else if (ltype != vectype)
 		      {
 			vect_copy_ref_info (data_ref,
 					    DR_REF (first_dr_info->dr));
@@ -11494,18 +11538,26 @@ vectorizable_load (vec_info *vinfo,
 						     gsi);
 			data_ref = NULL;
 			vec<constructor_elt, va_gc> *v;
-			vec_alloc (v, 2);
+			unsigned num
+			  = (exact_div (tree_to_poly_uint64
+					  (TYPE_SIZE_UNIT (vectype)),
+					tree_to_poly_uint64
+					  (TYPE_SIZE_UNIT (ltype)))
+			     .to_constant ());
+			vec_alloc (v, num);
 			if (memory_access_type == VMAT_CONTIGUOUS_REVERSE)
 			  {
-			    CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
-						    build_zero_cst (ltype));
+			    while (--num)
+			      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
+						      build_zero_cst (ltype));
 			    CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, tem);
 			  }
 			else
 			  {
 			    CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, tem);
-			    CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
-						    build_zero_cst (ltype));
+			    while (--num)
+			      CONSTRUCTOR_APPEND_ELT (v, NULL_TREE,
+						      build_zero_cst (ltype));
 			  }
 			gcc_assert (new_vtype != NULL_TREE);
 			if (new_vtype == vectype)
-- 
2.35.3
