public inbox for gcc-patches@gcc.gnu.org
* Re: [PATCH] tree-optimization/115252 - enhance peeling for gaps avoidance
       [not found] <b44942a2-0ba2-47b9-9dfe-5f75ffad7be5@DB5PEPF00014B9C.eurprd02.prod.outlook.com>
@ 2024-05-29  9:25 ` Richard Sandiford
  2024-05-29  9:50   ` Richard Biener
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Sandiford @ 2024-05-29  9:25 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <rguenther@suse.de> writes:
> Code generation for contiguous load vectorization can already deal
> with generalized avoidance of loading from a gap.  The following
> extends the detection of the peeling-for-gaps requirement with that,
> gets rid of the old special casing of a half load and makes sure
> that when we do access the gap we have peeling for gaps enabled.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> This is the first patch in a series to improve peeling for gaps;
> it turned out to be an improvement to the generated code rather than
> just the (delayed from stage3) removal of the "old" half-vector codepath.
>
> I'll wait for the pre-CI testing before pushing so you also have time
> for some comments.

LGTM FWIW (some trivia below).

Out of interest, how far are we off being able to load:

    a[i*8+0]
    a[i*8+1]
    a[i*8+3]
    a[i*8+4]

as two half vectors?  It doesn't look like we're quite there yet,
but I might have misread.
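
For concreteness, a hypothetical testcase with that access pattern
(function and array names made up, not part of the patch) would be:

typedef unsigned char uint8_t;
typedef short int16_t;

/* Four used lanes (0, 1, 3, 4) out of a group of 8: gaps both inside
   the group and at its end.  */
void
sum_used_lanes (int16_t * __restrict out, uint8_t *a, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = a[i * 8 + 0] + a[i * 8 + 1] + a[i * 8 + 3] + a[i * 8 + 4];
}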

It would be nice if we could eventually integrate the overrun_p checks
with the vectorizable_load code that the code is trying to predict.
E.g. we could run through the vectorizable_load code during the
analysis phase and record overruns, similarly to Kewen's costing
patches.  As it stands, it seems difficult to make sure that the two
checks are exactly in sync, especially when the structure is so
different.

> Richard.
>
> 	PR tree-optimization/115252
> 	* tree-vect-stmts.cc (get_group_load_store_type): Enhance
> 	detecting the number of cases where we can avoid accessing a gap
> 	during code generation.
> 	(vectorizable_load): Remove old half-vector peeling for gap
> 	avoidance which is now redundant.  Add gap-aligned case where
> 	it's OK to access the gap.  Add assert that we have peeling for
> 	gaps enabled when we access a gap.
>
> 	* gcc.dg/vect/slp-gap-1.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/vect/slp-gap-1.c | 18 +++++++++
>  gcc/tree-vect-stmts.cc                | 58 +++++++++++++--------------
>  2 files changed, 46 insertions(+), 30 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-gap-1.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-gap-1.c b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> new file mode 100644
> index 00000000000..36463ca22c5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3" } */
> +
> +typedef unsigned char uint8_t;
> +typedef short int16_t;
> +void pixel_sub_wxh(int16_t * __restrict diff, uint8_t *pix1, uint8_t *pix2) {
> +  for (int y = 0; y < 4; y++) {
> +    for (int x = 0; x < 4; x++)
> +      diff[x + y * 4] = pix1[x] - pix2[x];
> +    pix1 += 16;
> +    pix2 += 32;
> +  }
> +}
> +
> +/* We can vectorize this without peeling for gaps and thus without epilogue,
> +   but the only thing we can reliably scan is the zero-padding trick for the
> +   partial loads.  */
> +/* { dg-final { scan-tree-dump-times "\{_\[0-9\]\+, 0" 6 "vect" { target vect64 } } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index a01099d3456..b26cc74f417 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2072,16 +2072,22 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>  	  dr_alignment_support alss;
>  	  int misalign = dr_misalignment (first_dr_info, vectype);
>  	  tree half_vtype;
> +	  poly_uint64 remain;
> +	  unsigned HOST_WIDE_INT tem, num;
>  	  if (overrun_p
>  	      && !masked_p
>  	      && (((alss = vect_supportable_dr_alignment (vinfo, first_dr_info,
>  							  vectype, misalign)))
>  		   == dr_aligned
>  		  || alss == dr_unaligned_supported)
> -	      && known_eq (nunits, (group_size - gap) * 2)
> -	      && known_eq (nunits, group_size)
> -	      && (vector_vector_composition_type (vectype, 2, &half_vtype)
> -		  != NULL_TREE))
> +	      && can_div_trunc_p (group_size
> +				  * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
> +				  nunits, &tem, &remain)
> +	      && (known_eq (remain, 0u)
> +		  || (constant_multiple_p (nunits, remain, &num)
> +		      && (vector_vector_composition_type (vectype, num,
> +							  &half_vtype)
> +			  != NULL_TREE))))
>  	    overrun_p = false;

Might be worth renaming half_vtype now that it isn't necessarily
a strict half.

>  
>  	  if (overrun_p && !can_overrun_p)
> @@ -11533,33 +11539,14 @@ vectorizable_load (vec_info *vinfo,
>  		    unsigned HOST_WIDE_INT gap = DR_GROUP_GAP (first_stmt_info);
>  		    unsigned int vect_align
>  		      = vect_known_alignment_in_bytes (first_dr_info, vectype);
> -		    unsigned int scalar_dr_size
> -		      = vect_get_scalar_dr_size (first_dr_info);
> -		    /* If there's no peeling for gaps but we have a gap
> -		       with slp loads then load the lower half of the
> -		       vector only.  See get_group_load_store_type for
> -		       when we apply this optimization.  */
> -		    if (slp
> -			&& loop_vinfo
> -			&& !LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && gap != 0
> -			&& known_eq (nunits, (group_size - gap) * 2)
> -			&& known_eq (nunits, group_size)
> -			&& gap >= (vect_align / scalar_dr_size))
> -		      {
> -			tree half_vtype;
> -			new_vtype
> -			  = vector_vector_composition_type (vectype, 2,
> -							    &half_vtype);
> -			if (new_vtype != NULL_TREE)
> -			  ltype = half_vtype;
> -		      }
>  		    /* Try to use a single smaller load when we are about
>  		       to load excess elements compared to the unrolled
> -		       scalar loop.
> -		       ???  This should cover the above case as well.  */
> -		    else if (known_gt ((vec_num * j + i + 1) * nunits,
> +		       scalar loop.  */
> +		    if (known_gt ((vec_num * j + i + 1) * nunits,
>  				       (group_size * vf - gap)))

Missing reindentation.

Pre-existing, but shouldn't this be maybe_gt rather than known_gt?
We can only skip doing it if we know for sure that the load won't cross
the gap.  (Not sure whether the difference can trigger in practice.)
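
To make the may/must distinction concrete, here is a simplified
stand-alone model (explicitly not the real poly_int implementation):
a value is c0 + c1 * N for some runtime N >= 0.

#include <assert.h>

struct poly { long c0, c1; };

/* "known": holds for every N >= 0.  */
static int
known_gt_model (struct poly a, struct poly b)
{
  return a.c0 > b.c0 && a.c1 >= b.c1;
}

/* "maybe": holds for at least one N >= 0.  */
static int
maybe_gt_model (struct poly a, struct poly b)
{
  return a.c0 > b.c0 || a.c1 > b.c1;
}

int
main (void)
{
  struct poly x = { 2, 2 };  /* 2 + 2N */
  struct poly y = { 4, 0 };  /* 4 */
  /* At N == 0, x == 2 <= 4, but for N >= 2, x > 4.  */
  assert (!known_gt_model (x, y) && maybe_gt_model (x, y));
  return 0;
}

With known_gt the guard is false in such "only possibly greater" cases,
so the smaller-load handling would be skipped even though the load may
still run past group_size * vf - gap; maybe_gt would handle it as soon
as the overrun is possible.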

Thanks,
Richard

>  		      {
> +			poly_uint64 remain = ((group_size * vf - gap)
> +					      - (vec_num * j + i) * nunits);
>  			if (known_ge ((vec_num * j + i + 1) * nunits
>  				      - (group_size * vf - gap), nunits))
>  			  /* DR will be unused.  */
> @@ -11571,11 +11558,15 @@ vectorizable_load (vec_info *vinfo,
>  			     at least one element is accessed in the
>  			     scalar loop.  */
>  			  ;
> +			else if (known_gt (vect_align,
> +					   ((nunits - remain)
> +					    * vect_get_scalar_dr_size
> +						(first_dr_info))))
> +			  /* Aligned access to the gap area when there's
> +			     at least one element in it is OK.  */
> +			  ;
>  			else
>  			  {
> -			    auto remain
> -			      = ((group_size * vf - gap)
> -				 - (vec_num * j + i) * nunits);
>  			    /* remain should now be > 0 and < nunits.  */
>  			    unsigned num;
>  			    if (constant_multiple_p (nunits, remain, &num))
> @@ -11589,6 +11580,13 @@ vectorizable_load (vec_info *vinfo,
>  				  ltype = ptype;
>  			      }
>  			    /* Else use multiple loads or a masked load?  */
> +			    /* For loop vectorization we now should have
> +			       an alternate type or LOOP_VINFO_PEELING_FOR_GAPS
> +			       set.  */
> +			    if (loop_vinfo)
> +			      gcc_assert (new_vtype
> +					  || LOOP_VINFO_PEELING_FOR_GAPS
> +					       (loop_vinfo));
>  			  }
>  		      }
>  		    tree offset


* Re: [PATCH] tree-optimization/115252 - enhance peeling for gaps avoidance
  2024-05-29  9:25 ` [PATCH] tree-optimization/115252 - enhance peeling for gaps avoidance Richard Sandiford
@ 2024-05-29  9:50   ` Richard Biener
  2024-05-29 11:04     ` Richard Biener
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Biener @ 2024-05-29  9:50 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Wed, 29 May 2024, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > Code generation for contiguous load vectorization can already deal
> > with generalized avoidance of loading from a gap.  The following
> > extends the detection of the peeling-for-gaps requirement with that,
> > gets rid of the old special casing of a half load and makes sure
> > that when we do access the gap we have peeling for gaps enabled.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > This is the first patch in a series to improve peeling for gaps;
> > it turned out to be an improvement to the generated code rather than
> > just the (delayed from stage3) removal of the "old" half-vector codepath.
> >
> > I'll wait for the pre-CI testing before pushing so you also have time
> > for some comments.
> 
> LGTM FWIW (some trivia below).
> 
> Out of interest, how far are we off being able to load:
> 
>     a[i*8+0]
>     a[i*8+1]
>     a[i*8+3]
>     a[i*8+4]
> 
> as two half vectors?  It doesn't look like we're quite there yet,
> but I might have misread.

The code in vectorizable_load that eventually would do this only
triggers when we run into the final "gap" part.  We do not look
at the intermediate gaps at all (if the above is what we see
in the loop body).  Extending the code to handle the case
where the intermediate gap is produced because of unrolling (VF > 1)
should be possible - we'd simply need to check whether the currently
loaded elements have unused ones at the end.
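
Concretely, take a (made-up, not from the patch) loop like:

typedef unsigned char uint8_t;
typedef short int16_t;

/* Group of 4 consecutive elements used out of a stride of 8, i.e. a
   gap of 4 at the end of each group.  */
void
scale_rows (int16_t * __restrict out, uint8_t *in, int n)
{
  for (int i = 0; i < n; i++)
    {
      out[i * 4 + 0] = in[i * 8 + 0] * 2;
      out[i * 4 + 1] = in[i * 8 + 1] * 2;
      out[i * 4 + 2] = in[i * 8 + 2] * 2;
      out[i * 4 + 3] = in[i * 8 + 3] * 2;
    }
}

With a vectorization factor of two the elements wanted per vector
iteration are in[0..3] and in[8..11]: the unused in[4..7] becomes an
intermediate gap inside the unrolled iteration, while in[12..15] is the
final gap that the existing code does look at.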

> It would be nice if we could eventually integrate the overrun_p checks
> with the vectorizable_load code that the code is trying to predict.
> E.g. we could run through the vectorizable_load code during the
> analysis phase and record overruns, similarly to Kewen's costing
> patches.  As it stands, it seems difficult to make sure that the two
> checks are exactly in sync, especially when the structure is so
> different.

Yeah - that's why I put the assert in now (which I do expect to
trigger - also thanks to poly-int may vs. must semantics...)

Richard.

> > Richard.
> >
> > 	PR tree-optimization/115252
> > 	* tree-vect-stmts.cc (get_group_load_store_type): Enhance
> > 	detecting the number of cases where we can avoid accessing a gap
> > 	during code generation.
> > 	(vectorizable_load): Remove old half-vector peeling for gap
> > 	avoidance which is now redundant.  Add gap-aligned case where
> > 	it's OK to access the gap.  Add assert that we have peeling for
> > 	gaps enabled when we access a gap.
> >
> > 	* gcc.dg/vect/slp-gap-1.c: New testcase.
> > ---
> >  gcc/testsuite/gcc.dg/vect/slp-gap-1.c | 18 +++++++++
> >  gcc/tree-vect-stmts.cc                | 58 +++++++++++++--------------
> >  2 files changed, 46 insertions(+), 30 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/slp-gap-1.c b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> > new file mode 100644
> > index 00000000000..36463ca22c5
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3" } */
> > +
> > +typedef unsigned char uint8_t;
> > +typedef short int16_t;
> > +void pixel_sub_wxh(int16_t * __restrict diff, uint8_t *pix1, uint8_t *pix2) {
> > +  for (int y = 0; y < 4; y++) {
> > +    for (int x = 0; x < 4; x++)
> > +      diff[x + y * 4] = pix1[x] - pix2[x];
> > +    pix1 += 16;
> > +    pix2 += 32;
> > +  }
> > +}
> > +
> > +/* We can vectorize this without peeling for gaps and thus without epilogue,
> > +   but the only thing we can reliably scan is the zero-padding trick for the
> > +   partial loads.  */
> > +/* { dg-final { scan-tree-dump-times "\{_\[0-9\]\+, 0" 6 "vect" { target vect64 } } } */
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index a01099d3456..b26cc74f417 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -2072,16 +2072,22 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
> >  	  dr_alignment_support alss;
> >  	  int misalign = dr_misalignment (first_dr_info, vectype);
> >  	  tree half_vtype;
> > +	  poly_uint64 remain;
> > +	  unsigned HOST_WIDE_INT tem, num;
> >  	  if (overrun_p
> >  	      && !masked_p
> >  	      && (((alss = vect_supportable_dr_alignment (vinfo, first_dr_info,
> >  							  vectype, misalign)))
> >  		   == dr_aligned
> >  		  || alss == dr_unaligned_supported)
> > -	      && known_eq (nunits, (group_size - gap) * 2)
> > -	      && known_eq (nunits, group_size)
> > -	      && (vector_vector_composition_type (vectype, 2, &half_vtype)
> > -		  != NULL_TREE))
> > +	      && can_div_trunc_p (group_size
> > +				  * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
> > +				  nunits, &tem, &remain)
> > +	      && (known_eq (remain, 0u)
> > +		  || (constant_multiple_p (nunits, remain, &num)
> > +		      && (vector_vector_composition_type (vectype, num,
> > +							  &half_vtype)
> > +			  != NULL_TREE))))
> >  	    overrun_p = false;
> 
> Might be worth renaming half_vtype now that it isn't necessarily
> a strict half.
> 
> >  
> >  	  if (overrun_p && !can_overrun_p)
> > @@ -11533,33 +11539,14 @@ vectorizable_load (vec_info *vinfo,
> >  		    unsigned HOST_WIDE_INT gap = DR_GROUP_GAP (first_stmt_info);
> >  		    unsigned int vect_align
> >  		      = vect_known_alignment_in_bytes (first_dr_info, vectype);
> > -		    unsigned int scalar_dr_size
> > -		      = vect_get_scalar_dr_size (first_dr_info);
> > -		    /* If there's no peeling for gaps but we have a gap
> > -		       with slp loads then load the lower half of the
> > -		       vector only.  See get_group_load_store_type for
> > -		       when we apply this optimization.  */
> > -		    if (slp
> > -			&& loop_vinfo
> > -			&& !LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && gap != 0
> > -			&& known_eq (nunits, (group_size - gap) * 2)
> > -			&& known_eq (nunits, group_size)
> > -			&& gap >= (vect_align / scalar_dr_size))
> > -		      {
> > -			tree half_vtype;
> > -			new_vtype
> > -			  = vector_vector_composition_type (vectype, 2,
> > -							    &half_vtype);
> > -			if (new_vtype != NULL_TREE)
> > -			  ltype = half_vtype;
> > -		      }
> >  		    /* Try to use a single smaller load when we are about
> >  		       to load excess elements compared to the unrolled
> > -		       scalar loop.
> > -		       ???  This should cover the above case as well.  */
> > -		    else if (known_gt ((vec_num * j + i + 1) * nunits,
> > +		       scalar loop.  */
> > +		    if (known_gt ((vec_num * j + i + 1) * nunits,
> >  				       (group_size * vf - gap)))
> 
> Missing reindentation.
> 
> Pre-existing, but shouldn't this be maybe_gt rather than known_gt?
> We can only skip doing it if we know for sure that the load won't cross
> the gap.  (Not sure whether the difference can trigger in practice.)
> 
> Thanks,
> Richard
> 
> >  		      {
> > +			poly_uint64 remain = ((group_size * vf - gap)
> > +					      - (vec_num * j + i) * nunits);
> >  			if (known_ge ((vec_num * j + i + 1) * nunits
> >  				      - (group_size * vf - gap), nunits))
> >  			  /* DR will be unused.  */
> > @@ -11571,11 +11558,15 @@ vectorizable_load (vec_info *vinfo,
> >  			     at least one element is accessed in the
> >  			     scalar loop.  */
> >  			  ;
> > +			else if (known_gt (vect_align,
> > +					   ((nunits - remain)
> > +					    * vect_get_scalar_dr_size
> > +						(first_dr_info))))
> > +			  /* Aligned access to the gap area when there's
> > +			     at least one element in it is OK.  */
> > +			  ;
> >  			else
> >  			  {
> > -			    auto remain
> > -			      = ((group_size * vf - gap)
> > -				 - (vec_num * j + i) * nunits);
> >  			    /* remain should now be > 0 and < nunits.  */
> >  			    unsigned num;
> >  			    if (constant_multiple_p (nunits, remain, &num))
> > @@ -11589,6 +11580,13 @@ vectorizable_load (vec_info *vinfo,
> >  				  ltype = ptype;
> >  			      }
> >  			    /* Else use multiple loads or a masked load?  */
> > +			    /* For loop vectorization we now should have
> > +			       an alternate type or LOOP_VINFO_PEELING_FOR_GAPS
> > +			       set.  */
> > +			    if (loop_vinfo)
> > +			      gcc_assert (new_vtype
> > +					  || LOOP_VINFO_PEELING_FOR_GAPS
> > +					       (loop_vinfo));
> >  			  }
> >  		      }
> >  		    tree offset
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


* Re: [PATCH] tree-optimization/115252 - enhance peeling for gaps avoidance
  2024-05-29  9:50   ` Richard Biener
@ 2024-05-29 11:04     ` Richard Biener
  0 siblings, 0 replies; 4+ messages in thread
From: Richard Biener @ 2024-05-29 11:04 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Wed, 29 May 2024, Richard Biener wrote:

> On Wed, 29 May 2024, Richard Sandiford wrote:
> 
> > Richard Biener <rguenther@suse.de> writes:
> > > Code generation for contiguous load vectorization can already deal
> > > with generalized avoidance of loading from a gap.  The following
> > > extends the detection of the peeling-for-gaps requirement with that,
> > > gets rid of the old special casing of a half load and makes sure
> > > that when we do access the gap we have peeling for gaps enabled.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > >
> > > This is the first patch in a series to improve peeling for gaps;
> > > it turned out to be an improvement to the generated code rather than
> > > just the (delayed from stage3) removal of the "old" half-vector codepath.
> > >
> > > I'll wait for the pre-CI testing before pushing so you also have time
> > > for some comments.
> > 
> > LGTM FWIW (some trivia below).
> > 
> > Out of interest, how far are we off being able to load:
> > 
> >     a[i*8+0]
> >     a[i*8+1]
> >     a[i*8+3]
> >     a[i*8+4]
> > 
> > as two half vectors?  It doesn't look like we're quite there yet,
> > but I might have misread.
> 
> The code in vectorizable_load that eventually would do this only
> triggers when we run into the final "gap" part.  We do not look
> at the intermediate gaps at all (if the above is what we see
> in the loop body).  Extending the code to handle the case
> where the intermediate gap is produced because of unrolling (VF > 1)
> should be possible - we'd simply need to check whether the currently
> loaded elements have unused ones at the end.
> 
> > It would be nice if we could eventually integrate the overrun_p checks
> > with the vectorizable_load code that the code is trying to predict.
> > E.g. we could run through the vectorizable_load code during the
> > analysis phase and record overruns, similarly to Kewen's costing
> > patches.  As it stands, it seems difficult to make sure that the two
> > checks are exactly in sync, especially when the structure is so
> > different.
> 
> Yeah - that's why I put the assert in now (which I do expect to
> trigger - also thanks to poly-int may vs. must semantics...)

I quickly looked, and while it should be possible to set
LOOP_VINFO_PEELING_FOR_GAPS from the loops generating the actual
accesses (we run through those now after the costing refactoring)
there are quite a lot of paths that would (possibly) need to be
covered.  We could possibly set a stmt-local 
LOOP_VINFO_PEELING_FOR_GAPS flag conservatively and clear it
in the few places we handle the gap, only updating the global
LOOP_VINFO_PEELING_FOR_GAPS at the end, but it's still going to
be tricky to not forget a path here.
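
For illustration, the shape of that would be roughly the following
(made-up names, not actual vectorizer code; just the set-conservatively,
clear-where-handled, merge-at-the-end idea):

struct stmt_gap_info
{
  int group_has_gap;       /* the access group leaves a trailing gap */
  int uses_smaller_load;   /* code generation stops the load short of the gap */
  int gap_access_is_safe;  /* accessing the gap cannot fault (aligned case) */
};

static int
compute_peeling_for_gaps (const struct stmt_gap_info *stmts, int nstmts)
{
  int need_peeling = 0;
  for (int i = 0; i < nstmts; i++)
    {
      /* Conservative per-statement default: any gap needs peeling...  */
      int stmt_needs_peeling = stmts[i].group_has_gap;
      /* ... cleared only on the few paths that handle the gap.  */
      if (stmts[i].uses_smaller_load || stmts[i].gap_access_is_safe)
        stmt_needs_peeling = 0;
      need_peeling |= stmt_needs_peeling;
    }
  /* Only update the global flag once, at the end.  */
  return need_peeling;
}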

I've amended my TODO accordingly.

Richard.

> Richard.
> 
> > > Richard.
> > >
> > > 	PR tree-optimization/115252
> > > 	* tree-vect-stmts.cc (get_group_load_store_type): Enhance
> > > 	detecting the number of cases where we can avoid accessing a gap
> > > 	during code generation.
> > > 	(vectorizable_load): Remove old half-vector peeling for gap
> > > 	avoidance which is now redundant.  Add gap-aligned case where
> > > 	it's OK to access the gap.  Add assert that we have peeling for
> > > 	gaps enabled when we access a gap.
> > >
> > > 	* gcc.dg/vect/slp-gap-1.c: New testcase.
> > > ---
> > >  gcc/testsuite/gcc.dg/vect/slp-gap-1.c | 18 +++++++++
> > >  gcc/tree-vect-stmts.cc                | 58 +++++++++++++--------------
> > >  2 files changed, 46 insertions(+), 30 deletions(-)
> > >  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/vect/slp-gap-1.c b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> > > new file mode 100644
> > > index 00000000000..36463ca22c5
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
> > > @@ -0,0 +1,18 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-additional-options "-O3" } */
> > > +
> > > +typedef unsigned char uint8_t;
> > > +typedef short int16_t;
> > > +void pixel_sub_wxh(int16_t * __restrict diff, uint8_t *pix1, uint8_t *pix2) {
> > > +  for (int y = 0; y < 4; y++) {
> > > +    for (int x = 0; x < 4; x++)
> > > +      diff[x + y * 4] = pix1[x] - pix2[x];
> > > +    pix1 += 16;
> > > +    pix2 += 32;
> > > +  }
> > > +}
> > > +
> > > +/* We can vectorize this without peeling for gaps and thus without epilogue,
> > > +   but the only thing we can reliably scan is the zero-padding trick for the
> > > +   partial loads.  */
> > > +/* { dg-final { scan-tree-dump-times "\{_\[0-9\]\+, 0" 6 "vect" { target vect64 } } } */
> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > index a01099d3456..b26cc74f417 100644
> > > --- a/gcc/tree-vect-stmts.cc
> > > +++ b/gcc/tree-vect-stmts.cc
> > > @@ -2072,16 +2072,22 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
> > >  	  dr_alignment_support alss;
> > >  	  int misalign = dr_misalignment (first_dr_info, vectype);
> > >  	  tree half_vtype;
> > > +	  poly_uint64 remain;
> > > +	  unsigned HOST_WIDE_INT tem, num;
> > >  	  if (overrun_p
> > >  	      && !masked_p
> > >  	      && (((alss = vect_supportable_dr_alignment (vinfo, first_dr_info,
> > >  							  vectype, misalign)))
> > >  		   == dr_aligned
> > >  		  || alss == dr_unaligned_supported)
> > > -	      && known_eq (nunits, (group_size - gap) * 2)
> > > -	      && known_eq (nunits, group_size)
> > > -	      && (vector_vector_composition_type (vectype, 2, &half_vtype)
> > > -		  != NULL_TREE))
> > > +	      && can_div_trunc_p (group_size
> > > +				  * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
> > > +				  nunits, &tem, &remain)
> > > +	      && (known_eq (remain, 0u)
> > > +		  || (constant_multiple_p (nunits, remain, &num)
> > > +		      && (vector_vector_composition_type (vectype, num,
> > > +							  &half_vtype)
> > > +			  != NULL_TREE))))
> > >  	    overrun_p = false;
> > 
> > Might be worth renaming half_vtype now that it isn't necessarily
> > a strict half.
> > 
> > >  
> > >  	  if (overrun_p && !can_overrun_p)
> > > @@ -11533,33 +11539,14 @@ vectorizable_load (vec_info *vinfo,
> > >  		    unsigned HOST_WIDE_INT gap = DR_GROUP_GAP (first_stmt_info);
> > >  		    unsigned int vect_align
> > >  		      = vect_known_alignment_in_bytes (first_dr_info, vectype);
> > > -		    unsigned int scalar_dr_size
> > > -		      = vect_get_scalar_dr_size (first_dr_info);
> > > -		    /* If there's no peeling for gaps but we have a gap
> > > -		       with slp loads then load the lower half of the
> > > -		       vector only.  See get_group_load_store_type for
> > > -		       when we apply this optimization.  */
> > > -		    if (slp
> > > -			&& loop_vinfo
> > > -			&& !LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && gap != 0
> > > -			&& known_eq (nunits, (group_size - gap) * 2)
> > > -			&& known_eq (nunits, group_size)
> > > -			&& gap >= (vect_align / scalar_dr_size))
> > > -		      {
> > > -			tree half_vtype;
> > > -			new_vtype
> > > -			  = vector_vector_composition_type (vectype, 2,
> > > -							    &half_vtype);
> > > -			if (new_vtype != NULL_TREE)
> > > -			  ltype = half_vtype;
> > > -		      }
> > >  		    /* Try to use a single smaller load when we are about
> > >  		       to load excess elements compared to the unrolled
> > > -		       scalar loop.
> > > -		       ???  This should cover the above case as well.  */
> > > -		    else if (known_gt ((vec_num * j + i + 1) * nunits,
> > > +		       scalar loop.  */
> > > +		    if (known_gt ((vec_num * j + i + 1) * nunits,
> > >  				       (group_size * vf - gap)))
> > 
> > Missing reindentation.
> > 
> > Pre-existing, but shouldn't this be maybe_gt rather than known_gt?
> > We can only skip doing it if we know for sure that the load won't cross
> > the gap.  (Not sure whether the difference can trigger in practice.)
> > 
> > Thanks,
> > Richard
> > 
> > >  		      {
> > > +			poly_uint64 remain = ((group_size * vf - gap)
> > > +					      - (vec_num * j + i) * nunits);
> > >  			if (known_ge ((vec_num * j + i + 1) * nunits
> > >  				      - (group_size * vf - gap), nunits))
> > >  			  /* DR will be unused.  */
> > > @@ -11571,11 +11558,15 @@ vectorizable_load (vec_info *vinfo,
> > >  			     at least one element is accessed in the
> > >  			     scalar loop.  */
> > >  			  ;
> > > +			else if (known_gt (vect_align,
> > > +					   ((nunits - remain)
> > > +					    * vect_get_scalar_dr_size
> > > +						(first_dr_info))))
> > > +			  /* Aligned access to the gap area when there's
> > > +			     at least one element in it is OK.  */
> > > +			  ;
> > >  			else
> > >  			  {
> > > -			    auto remain
> > > -			      = ((group_size * vf - gap)
> > > -				 - (vec_num * j + i) * nunits);
> > >  			    /* remain should now be > 0 and < nunits.  */
> > >  			    unsigned num;
> > >  			    if (constant_multiple_p (nunits, remain, &num))
> > > @@ -11589,6 +11580,13 @@ vectorizable_load (vec_info *vinfo,
> > >  				  ltype = ptype;
> > >  			      }
> > >  			    /* Else use multiple loads or a masked load?  */
> > > +			    /* For loop vectorization we now should have
> > > +			       an alternate type or LOOP_VINFO_PEELING_FOR_GAPS
> > > +			       set.  */
> > > +			    if (loop_vinfo)
> > > +			      gcc_assert (new_vtype
> > > +					  || LOOP_VINFO_PEELING_FOR_GAPS
> > > +					       (loop_vinfo));
> > >  			  }
> > >  		      }
> > >  		    tree offset
> > 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)


* [PATCH] tree-optimization/115252 - enhance peeling for gaps avoidance
@ 2024-05-28 12:17 Richard Biener
  0 siblings, 0 replies; 4+ messages in thread
From: Richard Biener @ 2024-05-28 12:17 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.sandiford

Code generation for contiguous load vectorization can already deal
with generalized avoidance of loading from a gap.  The following
extends the detection of the peeling-for-gaps requirement with that,
gets rid of the old special casing of a half load and makes sure
that when we do access the gap we have peeling for gaps enabled.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

This is the first patch in a series to improve peeling for gaps;
it turned out to be an improvement to the generated code rather than
just the (delayed from stage3) removal of the "old" half-vector codepath.

I'll wait for the pre-CI testing before pushing so you also have time
for some comments.

Richard.

	PR tree-optimization/115252
	* tree-vect-stmts.cc (get_group_load_store_type): Enhance
	detecting the number of cases where we can avoid accessing a gap
	during code generation.
	(vectorizable_load): Remove old half-vector peeling for gap
	avoidance which is now redundant.  Add gap-aligned case where
	it's OK to access the gap.  Add assert that we have peeling for
	gaps enabled when we access a gap.

	* gcc.dg/vect/slp-gap-1.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-gap-1.c | 18 +++++++++
 gcc/tree-vect-stmts.cc                | 58 +++++++++++++--------------
 2 files changed, 46 insertions(+), 30 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-gap-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/slp-gap-1.c b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
new file mode 100644
index 00000000000..36463ca22c5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-gap-1.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3" } */
+
+typedef unsigned char uint8_t;
+typedef short int16_t;
+void pixel_sub_wxh(int16_t * __restrict diff, uint8_t *pix1, uint8_t *pix2) {
+  for (int y = 0; y < 4; y++) {
+    for (int x = 0; x < 4; x++)
+      diff[x + y * 4] = pix1[x] - pix2[x];
+    pix1 += 16;
+    pix2 += 32;
+  }
+}
+
+/* We can vectorize this without peeling for gaps and thus without epilogue,
+   but the only thing we can reliably scan is the zero-padding trick for the
+   partial loads.  */
+/* { dg-final { scan-tree-dump-times "\{_\[0-9\]\+, 0" 6 "vect" { target vect64 } } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index a01099d3456..b26cc74f417 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2072,16 +2072,22 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 	  dr_alignment_support alss;
 	  int misalign = dr_misalignment (first_dr_info, vectype);
 	  tree half_vtype;
+	  poly_uint64 remain;
+	  unsigned HOST_WIDE_INT tem, num;
 	  if (overrun_p
 	      && !masked_p
 	      && (((alss = vect_supportable_dr_alignment (vinfo, first_dr_info,
 							  vectype, misalign)))
 		   == dr_aligned
 		  || alss == dr_unaligned_supported)
-	      && known_eq (nunits, (group_size - gap) * 2)
-	      && known_eq (nunits, group_size)
-	      && (vector_vector_composition_type (vectype, 2, &half_vtype)
-		  != NULL_TREE))
+	      && can_div_trunc_p (group_size
+				  * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
+				  nunits, &tem, &remain)
+	      && (known_eq (remain, 0u)
+		  || (constant_multiple_p (nunits, remain, &num)
+		      && (vector_vector_composition_type (vectype, num,
+							  &half_vtype)
+			  != NULL_TREE))))
 	    overrun_p = false;
 
 	  if (overrun_p && !can_overrun_p)
@@ -11533,33 +11539,14 @@ vectorizable_load (vec_info *vinfo,
 		    unsigned HOST_WIDE_INT gap = DR_GROUP_GAP (first_stmt_info);
 		    unsigned int vect_align
 		      = vect_known_alignment_in_bytes (first_dr_info, vectype);
-		    unsigned int scalar_dr_size
-		      = vect_get_scalar_dr_size (first_dr_info);
-		    /* If there's no peeling for gaps but we have a gap
-		       with slp loads then load the lower half of the
-		       vector only.  See get_group_load_store_type for
-		       when we apply this optimization.  */
-		    if (slp
-			&& loop_vinfo
-			&& !LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && gap != 0
-			&& known_eq (nunits, (group_size - gap) * 2)
-			&& known_eq (nunits, group_size)
-			&& gap >= (vect_align / scalar_dr_size))
-		      {
-			tree half_vtype;
-			new_vtype
-			  = vector_vector_composition_type (vectype, 2,
-							    &half_vtype);
-			if (new_vtype != NULL_TREE)
-			  ltype = half_vtype;
-		      }
 		    /* Try to use a single smaller load when we are about
 		       to load excess elements compared to the unrolled
-		       scalar loop.
-		       ???  This should cover the above case as well.  */
-		    else if (known_gt ((vec_num * j + i + 1) * nunits,
+		       scalar loop.  */
+		    if (known_gt ((vec_num * j + i + 1) * nunits,
 				       (group_size * vf - gap)))
 		      {
+			poly_uint64 remain = ((group_size * vf - gap)
+					      - (vec_num * j + i) * nunits);
 			if (known_ge ((vec_num * j + i + 1) * nunits
 				      - (group_size * vf - gap), nunits))
 			  /* DR will be unused.  */
@@ -11571,11 +11558,15 @@ vectorizable_load (vec_info *vinfo,
 			     at least one element is accessed in the
 			     scalar loop.  */
 			  ;
+			else if (known_gt (vect_align,
+					   ((nunits - remain)
+					    * vect_get_scalar_dr_size
+						(first_dr_info))))
+			  /* Aligned access to the gap area when there's
+			     at least one element in it is OK.  */
+			  ;
 			else
 			  {
-			    auto remain
-			      = ((group_size * vf - gap)
-				 - (vec_num * j + i) * nunits);
 			    /* remain should now be > 0 and < nunits.  */
 			    unsigned num;
 			    if (constant_multiple_p (nunits, remain, &num))
@@ -11589,6 +11580,13 @@ vectorizable_load (vec_info *vinfo,
 				  ltype = ptype;
 			      }
 			    /* Else use multiple loads or a masked load?  */
+			    /* For loop vectorization we now should have
+			       an alternate type or LOOP_VINFO_PEELING_FOR_GAPS
+			       set.  */
+			    if (loop_vinfo)
+			      gcc_assert (new_vtype
+					  || LOOP_VINFO_PEELING_FOR_GAPS
+					       (loop_vinfo));
 			  }
 		      }
 		    tree offset
-- 
2.35.3
