public inbox for gcc-patches@gcc.gnu.org
* [PATCH] Avoid peeling for gaps if accesses are aligned
@ 2016-11-07 15:02 Richard Biener
  2016-11-08 10:20 ` Richard Biener
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Biener @ 2016-11-07 15:02 UTC (permalink / raw)
  To: gcc-patches


Currently we force peeling for gaps whenever element overrun can occur
but for aligned accesses we know that the loads won't trap and thus
we can avoid this.

Bootstrap and regtest running on x86_64-unknown-linux-gnu (I expect
some testsuite fallout here so didn't bother to invent a new testcase).

Just in case somebody thinks the overrun is a bad idea in general
(even when not trapping).  Like for ASAN or valgrind.

Richard.

2016-11-07  Richard Biener  <rguenther@suse.de>

	* tree-vect-stmts.c (get_group_load_store_type): If the
	access is aligned do not trigger peeling for gaps.

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	(revision 241893)
+++ gcc/tree-vect-stmts.c	(working copy)
@@ -1770,6 +1771,11 @@ get_group_load_store_type (gimple *stmt,
 			       " non-consecutive accesses\n");
 	      return false;
 	    }
+	  /* If the access is aligned an overrun is fine.  */
+	  if (overrun_p
+	      && aligned_access_p
+		   (STMT_VINFO_DATA_REF (vinfo_for_stmt (first_stmt))))
+	    overrun_p = false;
 	  if (overrun_p && !can_overrun_p)
 	    {
 	      if (dump_enabled_p ())
@@ -1789,6 +1795,10 @@ get_group_load_store_type (gimple *stmt,
       /* If there is a gap at the end of the group then these optimizations
 	 would access excess elements in the last iteration.  */
       bool would_overrun_p = (gap != 0);
+      /* If the access is aligned an overrun is fine.  */
+      if (would_overrun_p
+	  && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
+	would_overrun_p = false;
       if (!STMT_VINFO_STRIDED_P (stmt_info)
 	  && (can_overrun_p || !would_overrun_p)
 	  && compare_step_with_zero (stmt) > 0)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] Avoid peeling for gaps if accesses are aligned
  2016-11-07 15:02 [PATCH] Avoid peeling for gaps if accesses are aligned Richard Biener
@ 2016-11-08 10:20 ` Richard Biener
  2017-03-01 11:54   ` Richard Sandiford
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Biener @ 2016-11-08 10:20 UTC (permalink / raw)
  To: gcc-patches

On Mon, 7 Nov 2016, Richard Biener wrote:

> 
> Currently we force peeling for gaps whenever element overrun can occur
> but for aligned accesses we know that the loads won't trap and thus
> we can avoid this.
> 
> Bootstrap and regtest running on x86_64-unknown-linux-gnu (I expect
> some testsuite fallout here so didn't bother to invent a new testcase).
> 
> Just in case somebody thinks the overrun is a bad idea in general
> (even when not trapping).  Like for ASAN or valgrind.

This is what I applied.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

Richard.

2016-11-08  Richard Biener  <rguenther@suse.de>

	* tree-vect-stmts.c (get_group_load_store_type): If the
	access is aligned do not trigger peeling for gaps.
	* tree-vect-data-refs.c (vect_compute_data_ref_alignment): Do not
	force alignment of vars with DECL_USER_ALIGN.

	* gcc.dg/vect/vect-nb-iter-ub-2.c: Adjust.

diff --git a/gcc/testsuite/gcc.dg/vect/vect-nb-iter-ub-2.c b/gcc/testsuite/gcc.dg/vect/vect-nb-iter-ub-2.c
index bc07b4b..4e13702 100644
--- a/gcc/testsuite/gcc.dg/vect/vect-nb-iter-ub-2.c
+++ b/gcc/testsuite/gcc.dg/vect/vect-nb-iter-ub-2.c
@@ -3,7 +3,7 @@
 #include "tree-vect.h"
 
 int ii[32];
-char cc[66] =
+char cc[66] __attribute__((aligned(1))) =
   { 0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0, 8, 0, 9, 0,
     10, 0, 11, 0, 12, 0, 13, 0, 14, 0, 15, 0, 16, 0, 17, 0, 18, 0, 19, 0,
     20, 0, 21, 0, 22, 0, 23, 0, 24, 0, 25, 0, 26, 0, 27, 0, 28, 0, 29, 0,
diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c
index b03cb1e..f014d68 100644
--- a/gcc/tree-vect-data-refs.c
+++ b/gcc/tree-vect-data-refs.c
@@ -831,6 +831,19 @@ vect_compute_data_ref_alignment (struct data_reference *dr)
 	  return true;
 	}
 
+      if (DECL_USER_ALIGN (base))
+	{
+	  if (dump_enabled_p ())
+	    {
+	      dump_printf_loc (MSG_NOTE, vect_location,
+			       "not forcing alignment of user-aligned "
+			       "variable: ");
+	      dump_generic_expr (MSG_NOTE, TDF_SLIM, base);
+	      dump_printf (MSG_NOTE, "\n");
+	    }
+	  return true;
+	}
+
       /* Force the alignment of the decl.
 	 NOTE: This is the only change to the code we make during
 	 the analysis phase, before deciding to vectorize the loop.  */
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 15aec21..c29e73d 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1770,6 +1770,11 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
 			       " non-consecutive accesses\n");
 	      return false;
 	    }
+	  /* If the access is aligned an overrun is fine.  */
+	  if (overrun_p
+	      && aligned_access_p
+		   (STMT_VINFO_DATA_REF (vinfo_for_stmt (first_stmt))))
+	    overrun_p = false;
 	  if (overrun_p && !can_overrun_p)
 	    {
 	      if (dump_enabled_p ())
@@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
       /* If there is a gap at the end of the group then these optimizations
 	 would access excess elements in the last iteration.  */
       bool would_overrun_p = (gap != 0);
+      /* If the access is aligned an overrun is fine.  */
+      if (would_overrun_p
+	  && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
+	would_overrun_p = false;
       if (!STMT_VINFO_STRIDED_P (stmt_info)
 	  && (can_overrun_p || !would_overrun_p)
 	  && compare_step_with_zero (stmt) > 0)


* Re: [PATCH] Avoid peeling for gaps if accesses are aligned
  2016-11-08 10:20 ` Richard Biener
@ 2017-03-01 11:54   ` Richard Sandiford
  2017-03-01 12:06     ` Richard Biener
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Sandiford @ 2017-03-01 11:54 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Sorry for the late reply, but:

Richard Biener <rguenther@suse.de> writes:
> On Mon, 7 Nov 2016, Richard Biener wrote:
>
>> 
>> Currently we force peeling for gaps whenever element overrun can occur
>> but for aligned accesses we know that the loads won't trap and thus
>> we can avoid this.
>> 
>> Bootstrap and regtest running on x86_64-unknown-linux-gnu (I expect
>> some testsuite fallout here so didn't bother to invent a new testcase).
>> 
>> Just in case somebody thinks the overrun is a bad idea in general
>> (even when not trapping).  Like for ASAN or valgrind.
>
> This is what I applied.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> Richard.
[...]
> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 15aec21..c29e73d 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
>        /* If there is a gap at the end of the group then these optimizations
>  	 would access excess elements in the last iteration.  */
>        bool would_overrun_p = (gap != 0);
> +      /* If the access is aligned an overrun is fine.  */
> +      if (would_overrun_p
> +	  && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
> +	would_overrun_p = false;
>        if (!STMT_VINFO_STRIDED_P (stmt_info)
>  	  && (can_overrun_p || !would_overrun_p)
>  	  && compare_step_with_zero (stmt) > 0)

...is this right for all cases?  I think it only looks for single-vector
alignment, but the gap can in principle be vector-sized or larger,
at least for load-lanes.

E.g. say we have a 128-bit vector of doubles in a group of size 4
and a gap of 2 or 3.  Even if the access itself is aligned, the group
spans two vectors and we have no guarantee that the second one
is mapped.

I haven't been able to come up with a testcase though.  We seem to be
overly conservative when computing alignments.

Thanks,
Richard


* Re: [PATCH] Avoid peeling for gaps if accesses are aligned
  2017-03-01 11:54   ` Richard Sandiford
@ 2017-03-01 12:06     ` Richard Biener
  2017-03-01 12:18       ` Richard Sandiford
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Biener @ 2017-03-01 12:06 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Wed, 1 Mar 2017, Richard Sandiford wrote:

> Sorry for the late reply, but:
> 
> Richard Biener <rguenther@suse.de> writes:
> > On Mon, 7 Nov 2016, Richard Biener wrote:
> >
> >> 
> >> Currently we force peeling for gaps whenever element overrun can occur
> >> but for aligned accesses we know that the loads won't trap and thus
> >> we can avoid this.
> >> 
> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu (I expect
> >> some testsuite fallout here so didn't bother to invent a new testcase).
> >> 
> >> Just in case somebody thinks the overrun is a bad idea in general
> >> (even when not trapping).  Like for ASAN or valgrind.
> >
> > This is what I applied.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > Richard.
> [...]
> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> > index 15aec21..c29e73d 100644
> > --- a/gcc/tree-vect-stmts.c
> > +++ b/gcc/tree-vect-stmts.c
> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
> >        /* If there is a gap at the end of the group then these optimizations
> >  	 would access excess elements in the last iteration.  */
> >        bool would_overrun_p = (gap != 0);
> > +      /* If the access is aligned an overrun is fine.  */
> > +      if (would_overrun_p
> > +	  && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
> > +	would_overrun_p = false;
> >        if (!STMT_VINFO_STRIDED_P (stmt_info)
> >  	  && (can_overrun_p || !would_overrun_p)
> >  	  && compare_step_with_zero (stmt) > 0)
> 
> ...is this right for all cases?  I think it only looks for single-vector
> alignment, but the gap can in principle be vector-sized or larger,
> at least for load-lanes.
>
> E.g. say we have a 128-bit vector of doubles in a group of size 4
> and a gap of 2 or 3.  Even if the access itself is aligned, the group
> spans two vectors and we have no guarantee that the second one
> is mapped.

The check assumes that if aligned_access_p () returns true then the
whole access is aligned in a way that it can't cross page boundaries.
That's of course not the case if alignment is 16 bytes but the access
will be a multiple of that.
 
> I haven't been able to come up with a testcase though.  We seem to be
> overly conservative when computing alignments.

Not sure if we can run into this with load-lanes given that bumps the
vectorization factor.  Also does load-lane work with gaps?

I think that gap can never be larger than nunits-1 so it is by definition
in the last "vector" independent of the VF.

Classical gap case is

for (i=0; i<n; ++i)
 {
   y[3*i + 0] = x[4*i + 0];
   y[3*i + 1] = x[4*i + 1];
   y[3*i + 2] = x[4*i + 2];
 }

where x has a gap of 1.  You'll get VF of 12 for the above.  Make
the y's different streams and you should get the perfect case for
load-lane:

for (i=0; i<n; ++i)
 {
   y[i] = x[4*i + 0];
   z[i] = x[4*i + 1];
   w[i] = x[4*i + 2];
 } 

previously we'd peel at least 4 iterations into the epilogue for
the fear of accessing x[4*i + 3].  When x is V4SI aligned that's
ok.

Richard.


* Re: [PATCH] Avoid peeling for gaps if accesses are aligned
  2017-03-01 12:06     ` Richard Biener
@ 2017-03-01 12:18       ` Richard Sandiford
  2017-03-01 12:23         ` Richard Biener
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Sandiford @ 2017-03-01 12:18 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <rguenther@suse.de> writes:
> On Wed, 1 Mar 2017, Richard Sandiford wrote:
>
>> Sorry for the late reply, but:
>> 
>> Richard Biener <rguenther@suse.de> writes:
>> > On Mon, 7 Nov 2016, Richard Biener wrote:
>> >
>> >> 
>> >> Currently we force peeling for gaps whenever element overrun can occur
>> >> but for aligned accesses we know that the loads won't trap and thus
>> >> we can avoid this.
>> >> 
>> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu (I expect
>> >> some testsuite fallout here so didn't bother to invent a new testcase).
>> >> 
>> >> Just in case somebody thinks the overrun is a bad idea in general
>> >> (even when not trapping).  Like for ASAN or valgrind.
>> >
>> > This is what I applied.
>> >
>> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
>> >
>> > Richard.
>> [...]
>> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> > index 15aec21..c29e73d 100644
>> > --- a/gcc/tree-vect-stmts.c
>> > +++ b/gcc/tree-vect-stmts.c
>> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
>> >        /* If there is a gap at the end of the group then these optimizations
>> >  	 would access excess elements in the last iteration.  */
>> >        bool would_overrun_p = (gap != 0);
>> > +      /* If the access is aligned an overrun is fine.  */
>> > +      if (would_overrun_p
>> > +	  && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
>> > +	would_overrun_p = false;
>> >        if (!STMT_VINFO_STRIDED_P (stmt_info)
>> >  	  && (can_overrun_p || !would_overrun_p)
>> >  	  && compare_step_with_zero (stmt) > 0)
>> 
>> ...is this right for all cases?  I think it only looks for single-vector
>> alignment, but the gap can in principle be vector-sized or larger,
>> at least for load-lanes.
>>
>> E.g. say we have a 128-bit vector of doubles in a group of size 4
>> and a gap of 2 or 3.  Even if the access itself is aligned, the group
>> spans two vectors and we have no guarantee that the second one
>> is mapped.
>
> The check assumes that if aligned_access_p () returns true then the
> whole access is aligned in a way that it can't cross page boundaries.
> That's of course not the case if alignment is 16 bytes but the access
> will be a multiple of that.
>  
>> I haven't been able to come up with a testcase though.  We seem to be
>> overly conservative when computing alignments.
>
> Not sure if we can run into this with load-lanes given that bumps the
> vectorization factor.  Also does load-lane work with gaps?
>
> I think that gap can never be larger than nunits-1 so it is by definition
> in the last "vector" independent of the VF.
>
> Classical gap case is
>
> for (i=0; i<n; ++i)
>  {
>    y[3*i + 0] = x[4*i + 0];
>    y[3*i + 1] = x[4*i + 1];
>    y[3*i + 2] = x[4*i + 2];
>  }
>
> where x has a gap of 1.  You'll get VF of 12 for the above.  Make
> the y's different streams and you should get the perfect case for
> load-lane:
>
> for (i=0; i<n; ++i)
>  {
>    y[i] = x[4*i + 0];
>    z[i] = x[4*i + 1];
>    w[i] = x[4*i + 2];
>  } 
>
> previously we'd peel at least 4 iterations into the epilogue for
> the fear of accessing x[4*i + 3].  When x is V4SI aligned that's
> ok.

The case I was thinking of was like the second, but with the
element type being DI or DF and with the + 2 statement removed.
E.g.:

double __attribute__((noinline))
foo (double *a)
{
  double res = 0.0;
  for (int n = 0; n < 256; n += 4)
    res += a[n] + a[n + 1];
  return res;
}

(with -ffast-math).  We do use LD4 for this, and having "a" aligned
to V2DF isn't enough to guarantee that we can access a[n + 2]
and a[n + 3].

Thanks,
Richard


* Re: [PATCH] Avoid peeling for gaps if accesses are aligned
  2017-03-01 12:18       ` Richard Sandiford
@ 2017-03-01 12:23         ` Richard Biener
  2017-03-01 12:40           ` Richard Sandiford
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Biener @ 2017-03-01 12:23 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Wed, 1 Mar 2017, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >
> >> Sorry for the late reply, but:
> >> 
> >> Richard Biener <rguenther@suse.de> writes:
> >> > On Mon, 7 Nov 2016, Richard Biener wrote:
> >> >
> >> >> 
> >> >> Currently we force peeling for gaps whenever element overrun can occur
> >> >> but for aligned accesses we know that the loads won't trap and thus
> >> >> we can avoid this.
> >> >> 
> >> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu (I expect
> >> >> some testsuite fallout here so didn't bother to invent a new testcase).
> >> >> 
> >> >> Just in case somebody thinks the overrun is a bad idea in general
> >> >> (even when not trapping).  Like for ASAN or valgrind.
> >> >
> >> > This is what I applied.
> >> >
> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >> >
> >> > Richard.
> >> [...]
> >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> >> > index 15aec21..c29e73d 100644
> >> > --- a/gcc/tree-vect-stmts.c
> >> > +++ b/gcc/tree-vect-stmts.c
> >> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
> >> >        /* If there is a gap at the end of the group then these optimizations
> >> >  	 would access excess elements in the last iteration.  */
> >> >        bool would_overrun_p = (gap != 0);
> >> > +      /* If the access is aligned an overrun is fine.  */
> >> > +      if (would_overrun_p
> >> > +	  && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
> >> > +	would_overrun_p = false;
> >> >        if (!STMT_VINFO_STRIDED_P (stmt_info)
> >> >  	  && (can_overrun_p || !would_overrun_p)
> >> >  	  && compare_step_with_zero (stmt) > 0)
> >> 
> >> ...is this right for all cases?  I think it only looks for single-vector
> >> alignment, but the gap can in principle be vector-sized or larger,
> >> at least for load-lanes.
> >>
> >> E.g. say we have a 128-bit vector of doubles in a group of size 4
> >> and a gap of 2 or 3.  Even if the access itself is aligned, the group
> >> spans two vectors and we have no guarantee that the second one
> >> is mapped.
> >
> > The check assumes that if aligned_access_p () returns true then the
> > whole access is aligned in a way that it can't cross page boundaries.
> > That's of course not the case if alignment is 16 bytes but the access
> > will be a multiple of that.
> >  
> >> I haven't been able to come up with a testcase though.  We seem to be
> >> overly conservative when computing alignments.
> >
> > Not sure if we can run into this with load-lanes given that bumps the
> > vectorization factor.  Also does load-lane work with gaps?
> >
> > I think that gap can never be larger than nunits-1 so it is by definition
> > in the last "vector" independent of the VF.
> >
> > Classical gap case is
> >
> > for (i=0; i<n; ++i)
> >  {
> >    y[3*i + 0] = x[4*i + 0];
> >    y[3*i + 1] = x[4*i + 1];
> >    y[3*i + 2] = x[4*i + 2];
> >  }
> >
> > where x has a gap of 1.  You'll get VF of 12 for the above.  Make
> > the y's different streams and you should get the perfect case for
> > load-lane:
> >
> > for (i=0; i<n; ++i)
> >  {
> >    y[i] = x[4*i + 0];
> >    z[i] = x[4*i + 1];
> >    w[i] = x[4*i + 2];
> >  } 
> >
> > previously we'd peel at least 4 iterations into the epilogue for
> > the fear of accessing x[4*i + 3].  When x is V4SI aligned that's
> > ok.
> 
> The case I was thinking of was like the second, but with the
> element type being DI or DF and with the + 2 statement removed.
> E.g.:
> 
> double __attribute__((noinline))
> foo (double *a)
> {
>   double res = 0.0;
>   for (int n = 0; n < 256; n += 4)
>     res += a[n] + a[n + 1];
>   return res;
> }
> 
> (with -ffast-math).  We do use LD4 for this, and having "a" aligned
> to V2DF isn't enough to guarantee that we can access a[n + 2]
> and a[n + 3].

Yes, indeed.  It's safe when peeling for gaps would remove
N < alignof (ref) / sizeof (ref) scalar iterations.

Peeling for gaps simply subtracts one from the niter of the vectorized 
loop.

One should be able to construct a testcase w/o load-lanes by ensuring
a high enough VF.

Richard.

> Thanks,
> Richard
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)


* Re: [PATCH] Avoid peeling for gaps if accesses are aligned
  2017-03-01 12:23         ` Richard Biener
@ 2017-03-01 12:40           ` Richard Sandiford
  2017-03-01 14:37             ` Richard Biener
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Sandiford @ 2017-03-01 12:40 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <rguenther@suse.de> writes:
> On Wed, 1 Mar 2017, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
>> >
>> >> Sorry for the late reply, but:
>> >> 
>> >> Richard Biener <rguenther@suse.de> writes:
>> >> > On Mon, 7 Nov 2016, Richard Biener wrote:
>> >> >
>> >> >> 
>> >> >> Currently we force peeling for gaps whenever element overrun can occur
>> >> >> but for aligned accesses we know that the loads won't trap and thus
>> >> >> we can avoid this.
>> >> >> 
>> >> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu (I expect
>> >> >> some testsuite fallout here so didn't bother to invent a new testcase).
>> >> >> 
>> >> >> Just in case somebody thinks the overrun is a bad idea in general
>> >> >> (even when not trapping).  Like for ASAN or valgrind.
>> >> >
>> >> > This is what I applied.
>> >> >
>> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
>> >> >
>> >> > Richard.
>> >> [...]
>> >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> >> > index 15aec21..c29e73d 100644
>> >> > --- a/gcc/tree-vect-stmts.c
>> >> > +++ b/gcc/tree-vect-stmts.c
>> >> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
>> >> >        /* If there is a gap at the end of the group then these optimizations
>> >> >  	 would access excess elements in the last iteration.  */
>> >> >        bool would_overrun_p = (gap != 0);
>> >> > +      /* If the access is aligned an overrun is fine.  */
>> >> > +      if (would_overrun_p
>> >> > +	  && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
>> >> > +	would_overrun_p = false;
>> >> >        if (!STMT_VINFO_STRIDED_P (stmt_info)
>> >> >  	  && (can_overrun_p || !would_overrun_p)
>> >> >  	  && compare_step_with_zero (stmt) > 0)
>> >> 
>> >> ...is this right for all cases?  I think it only looks for single-vector
>> >> alignment, but the gap can in principle be vector-sized or larger,
>> >> at least for load-lanes.
>> >>
>> >> E.g. say we have a 128-bit vector of doubles in a group of size 4
>> >> and a gap of 2 or 3.  Even if the access itself is aligned, the group
>> >> spans two vectors and we have no guarantee that the second one
>> >> is mapped.
>> >
>> > The check assumes that if aligned_access_p () returns true then the
>> > whole access is aligned in a way that it can't cross page boundaries.
>> > That's of course not the case if alignment is 16 bytes but the access
>> > will be a multiple of that.
>> >  
>> >> I haven't been able to come up with a testcase though.  We seem to be
>> >> overly conservative when computing alignments.
>> >
>> > Not sure if we can run into this with load-lanes given that bumps the
>> > vectorization factor.  Also does load-lane work with gaps?
>> >
>> > I think that gap can never be larger than nunits-1 so it is by definition
>> > in the last "vector" independent of the VF.
>> >
>> > Classical gap case is
>> >
>> > for (i=0; i<n; ++i)
>> >  {
>> >    y[3*i + 0] = x[4*i + 0];
>> >    y[3*i + 1] = x[4*i + 1];
>> >    y[3*i + 2] = x[4*i + 2];
>> >  }
>> >
>> > where x has a gap of 1.  You'll get VF of 12 for the above.  Make
>> > the y's different streams and you should get the perfect case for
>> > load-lane:
>> >
>> > for (i=0; i<n; ++i)
>> >  {
>> >    y[i] = x[4*i + 0];
>> >    z[i] = x[4*i + 1];
>> >    w[i] = x[4*i + 2];
>> >  } 
>> >
>> > previously we'd peel at least 4 iterations into the epilogue for
>> > the fear of accessing x[4*i + 3].  When x is V4SI aligned that's
>> > ok.
>> 
>> The case I was thinking of was like the second, but with the
>> element type being DI or DF and with the + 2 statement removed.
>> E.g.:
>> 
>> double __attribute__((noinline))
>> foo (double *a)
>> {
>>   double res = 0.0;
>>   for (int n = 0; n < 256; n += 4)
>>     res += a[n] + a[n + 1];
>>   return res;
>> }
>> 
>> (with -ffast-math).  We do use LD4 for this, and having "a" aligned
>> to V2DF isn't enough to guarantee that we can access a[n + 2]
>> and a[n + 3].
>
> Yes, indeed.  It's safe when peeling for gaps would remove
> N < alignof (ref) / sizeof (ref) scalar iterations.
>
> Peeling for gaps simply subtracts one from the niter of the vectorized 
> loop.

I think subtracting one is enough in all cases.  It's only the final
iteration of the scalar loop that can't access a[n + 2] and a[n + 3].

(Of course, subtracting one happens before peeling for niters, so it
only makes a difference if the original niters was a multiple of the VF,
in which case we peel a full vector's worth of iterations instead of
peeling none.)

Thanks,
Richard


* Re: [PATCH] Avoid peeling for gaps if accesses are aligned
  2017-03-01 12:40           ` Richard Sandiford
@ 2017-03-01 14:37             ` Richard Biener
  2017-03-01 19:59               ` Richard Sandiford
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Biener @ 2017-03-01 14:37 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Wed, 1 Mar 2017, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >
> >> Richard Biener <rguenther@suse.de> writes:
> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >> >
> >> >> Sorry for the late reply, but:
> >> >> 
> >> >> Richard Biener <rguenther@suse.de> writes:
> >> >> > On Mon, 7 Nov 2016, Richard Biener wrote:
> >> >> >
> >> >> >> 
> >> >> >> Currently we force peeling for gaps whenever element overrun can occur
> >> >> >> but for aligned accesses we know that the loads won't trap and thus
> >> >> >> we can avoid this.
> >> >> >> 
> >> >> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu (I expect
> >> >> >> some testsuite fallout here so didn't bother to invent a new testcase).
> >> >> >> 
> >> >> >> Just in case somebody thinks the overrun is a bad idea in general
> >> >> >> (even when not trapping).  Like for ASAN or valgrind.
> >> >> >
> >> >> > This is what I applied.
> >> >> >
> >> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >> >> >
> >> >> > Richard.
> >> >> [...]
> >> >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> >> >> > index 15aec21..c29e73d 100644
> >> >> > --- a/gcc/tree-vect-stmts.c
> >> >> > +++ b/gcc/tree-vect-stmts.c
> >> >> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
> >> >> >        /* If there is a gap at the end of the group then these optimizations
> >> >> >  	 would access excess elements in the last iteration.  */
> >> >> >        bool would_overrun_p = (gap != 0);
> >> >> > +      /* If the access is aligned an overrun is fine.  */
> >> >> > +      if (would_overrun_p
> >> >> > +	  && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
> >> >> > +	would_overrun_p = false;
> >> >> >        if (!STMT_VINFO_STRIDED_P (stmt_info)
> >> >> >  	  && (can_overrun_p || !would_overrun_p)
> >> >> >  	  && compare_step_with_zero (stmt) > 0)
> >> >> 
> >> >> ...is this right for all cases?  I think it only looks for single-vector
> >> >> alignment, but the gap can in principle be vector-sized or larger,
> >> >> at least for load-lanes.
> >> >>
> >> >> E.g. say we have a 128-bit vector of doubles in a group of size 4
> >> >> and a gap of 2 or 3.  Even if the access itself is aligned, the group
> >> >> spans two vectors and we have no guarantee that the second one
> >> >> is mapped.
> >> >
> >> > The check assumes that if aligned_access_p () returns true then the
> >> > whole access is aligned in a way that it can't cross page boundaries.
> >> > That's of course not the case if alignment is 16 bytes but the access
> >> > will be a multiple of that.
> >> >  
> >> >> I haven't been able to come up with a testcase though.  We seem to be
> >> >> overly conservative when computing alignments.
> >> >
> >> > Not sure if we can run into this with load-lanes given that bumps the
> >> > vectorization factor.  Also does load-lane work with gaps?
> >> >
> >> > I think that gap can never be larger than nunits-1 so it is by definition
> >> > in the last "vector" independent of the VF.
> >> >
> >> > Classical gap case is
> >> >
> >> > for (i=0; i<n; ++i)
> >> >  {
> >> >    y[3*i + 0] = x[4*i + 0];
> >> >    y[3*i + 1] = x[4*i + 1];
> >> >    y[3*i + 2] = x[4*i + 2];
> >> >  }
> >> >
> >> > where x has a gap of 1.  You'll get VF of 12 for the above.  Make
> >> > the y's different streams and you should get the perfect case for
> >> > load-lane:
> >> >
> >> > for (i=0; i<n; ++i)
> >> >  {
> >> >    y[i] = x[4*i + 0];
> >> >    z[i] = x[4*i + 1];
> >> >    w[i] = x[4*i + 2];
> >> >  } 
> >> >
> >> > previously we'd peel at least 4 iterations into the epilogue for
> >> > the fear of accessing x[4*i + 3].  When x is V4SI aligned that's
> >> > ok.
> >> 
> >> The case I was thinking of was like the second, but with the
> >> element type being DI or DF and with the + 2 statement removed.
> >> E.g.:
> >> 
> >> double __attribute__((noinline))
> >> foo (double *a)
> >> {
> >>   double res = 0.0;
> >>   for (int n = 0; n < 256; n += 4)
> >>     res += a[n] + a[n + 1];
> >>   return res;
> >> }
> >> 
> >> (with -ffast-math).  We do use LD4 for this, and having "a" aligned
> >> to V2DF isn't enough to guarantee that we can access a[n + 2]
> >> and a[n + 3].
> >
> > Yes, indeed.  It's safe when peeling for gaps would remove
> > N < alignof (ref) / sizeof (ref) scalar iterations.
> >
> > Peeling for gaps simply subtracts one from the niter of the vectorized 
> > loop.
> 
> I think subtracting one is enough in all cases.  It's only the final
> iteration of the scalar loop that can't access a[n + 2] and a[n + 3].
> 
> (Of course, subtracting one happens before peeling for niters, so it
> only makes a difference if the original niters was a multiple of the VF,
> in which case we peel a full vector's worth of iterations instead of
> peeling none.)

I think one could extend the gcc.dg/vect/group-no-gaps-1.c testcase
to cover the case with a bigger VF, for example by having different
types for a and b.

I can try playing with this later this week but if you can come up
with a testcase that exercises load-/store-lanes that would be great.
See also gcc.dg/vect/pr49038.c for a less convoluted testcase to copy
from.

Richard.


* Re: [PATCH] Avoid peeling for gaps if accesses are aligned
  2017-03-01 14:37             ` Richard Biener
@ 2017-03-01 19:59               ` Richard Sandiford
  2017-03-02 12:05                 ` Richard Biener
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Sandiford @ 2017-03-01 19:59 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <rguenther@suse.de> writes:
> On Wed, 1 Mar 2017, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
>> >
>> >> Richard Biener <rguenther@suse.de> writes:
>> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
>> >> >
>> >> >> Sorry for the late reply, but:
>> >> >> 
>> >> >> Richard Biener <rguenther@suse.de> writes:
>> >> >> > On Mon, 7 Nov 2016, Richard Biener wrote:
>> >> >> >
>> >> >> >> 
>> >> >> >> Currently we force peeling for gaps whenever element overrun can occur
>> >> >> >> but for aligned accesses we know that the loads won't trap and thus
>> >> >> >> we can avoid this.
>> >> >> >> 
>> >> >> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu (I expect
>> >> >> >> some testsuite fallout here so didn't bother to invent a new testcase).
>> >> >> >> 
>> >> >> >> Just in case somebody thinks the overrun is a bad idea in general
>> >> >> >> (even when not trapping).  Like for ASAN or valgrind.
>> >> >> >
>> >> >> > This is what I applied.
>> >> >> >
>> >> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
>> >> >> >
>> >> >> > Richard.
>> >> >> [...]
>> >> >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> >> >> > index 15aec21..c29e73d 100644
>> >> >> > --- a/gcc/tree-vect-stmts.c
>> >> >> > +++ b/gcc/tree-vect-stmts.c
>> >> >> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
>> >> >> >        /* If there is a gap at the end of the group then these optimizations
>> >> >> >  	 would access excess elements in the last iteration.  */
>> >> >> >        bool would_overrun_p = (gap != 0);
>> >> >> > +      /* If the access is aligned an overrun is fine.  */
>> >> >> > +      if (would_overrun_p
>> >> >> > +	  && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
>> >> >> > +	would_overrun_p = false;
>> >> >> >        if (!STMT_VINFO_STRIDED_P (stmt_info)
>> >> >> >  	  && (can_overrun_p || !would_overrun_p)
>> >> >> >  	  && compare_step_with_zero (stmt) > 0)
>> >> >> 
>> >> >> ...is this right for all cases?  I think it only looks for single-vector
>> >> >> alignment, but the gap can in principle be vector-sized or larger,
>> >> >> at least for load-lanes.
>> >> >>
>> >> >> E.g. say we have a 128-bit vector of doubles in a group of size 4
>> >> >> and a gap of 2 or 3.  Even if the access itself is aligned, the group
>> >> >> spans two vectors and we have no guarantee that the second one
>> >> >> is mapped.
>> >> >
>> >> > The check assumes that if aligned_access_p () returns true then the
>> >> > whole access is aligned in a way that it can't cross page boundaries.
>> >> > That's of course not the case if alignment is 16 bytes but the access
>> >> > will be a multiple of that.
>> >> >  
>> >> >> I haven't been able to come up with a testcase though.  We seem to be
>> >> >> overly conservative when computing alignments.
>> >> >
>> >> > Not sure if we can run into this with load-lanes given that bumps the
>> >> > vectorization factor.  Also does load-lane work with gaps?
>> >> >
>> >> > I think that gap can never be larger than nunits-1 so it is by definition
>> >> > in the last "vector" independent of the VF.
>> >> >
>> >> > Classical gap case is
>> >> >
>> >> > for (i=0; i<n; ++i)
>> >> >  {
>> >> >    y[3*i + 0] = x[4*i + 0];
>> >> >    y[3*i + 1] = x[4*i + 1];
>> >> >    y[3*i + 2] = x[4*i + 2];
>> >> >  }
>> >> >
>> >> > where x has a gap of 1.  You'll get VF of 12 for the above.  Make
>> >> > the y's different streams and you should get the perfect case for
>> >> > load-lane:
>> >> >
>> >> > for (i=0; i<n; ++i)
>> >> >  {
>> >> >    y[i] = x[4*i + 0];
>> >> >    z[i] = x[4*i + 1];
>> >> >    w[i] = x[4*i + 2];
>> >> >  } 
>> >> >
>> >> > previously we'd peel at least 4 iterations into the epilogue for
>> >> > the fear of accessing x[4*i + 3].  When x is V4SI aligned that's
>> >> > ok.
>> >> 
>> >> The case I was thinking of was like the second, but with the
>> >> element type being DI or DF and with the + 2 statement removed.
>> >> E.g.:
>> >> 
>> >> double __attribute__((noinline))
>> >> foo (double *a)
>> >> {
>> >>   double res = 0.0;
>> >>   for (int n = 0; n < 256; n += 4)
>> >>     res += a[n] + a[n + 1];
>> >>   return res;
>> >> }
>> >> 
>> >> (with -ffast-math).  We do use LD4 for this, and having "a" aligned
>> >> to V2DF isn't enough to guarantee that we can access a[n + 2]
>> >> and a[n + 3].
>> >
>> > Yes, indeed.  It's safe when peeling for gaps would remove
>> > N < alignof (ref) / sizeof (ref) scalar iterations.
>> >
>> > Peeling for gaps simply subtracts one from the niter of the vectorized 
>> > loop.
>> 
>> I think subtracting one is enough in all cases.  It's only the final
>> iteration of the scalar loop that can't access a[n + 2] and a[n + 3].
>> 
>> (Of course, subtracting one happens before peeling for niters, so it
>> only makes a difference if the original niters was a multiple of the VF,
>> in which case we peel a full vector's worth of iterations instead of
>> peeling none.)
>
> I think one could extend the gcc.dg/vect/group-no-gaps-1.c testcase
> > to cover the case with a bigger VF, for example by having different
> types for a and b.
>
> I can try playing with this later this week but if you can come up
> with a testcase that exercises load-/store-lanes that would be great.
> See also gcc.dg/vect/pr49038.c for a less convoluted testcase to copy
> from.

Yeah, that's what I'd tried originally, but couldn't convince
vect_compute_data_ref_alignment to realise that the base was aligned.

On that topic, the same patch had:

      if (DECL_USER_ALIGN (base))
	{
	  if (dump_enabled_p ())
	    {
	      dump_printf_loc (MSG_NOTE, vect_location,
			       "not forcing alignment of user-aligned "
			       "variable: ");
	      dump_generic_expr (MSG_NOTE, TDF_SLIM, base);
	      dump_printf (MSG_NOTE, "\n");
	    }
	  return true;
	}

But shouldn't this also be testing whether decl is packed?  The docs
say that the aligned attribute only specifies a minimum, so increasing
it should be fine.

Thanks,
Richard


* Re: [PATCH] Avoid peeling for gaps if accesses are aligned
  2017-03-01 19:59               ` Richard Sandiford
@ 2017-03-02 12:05                 ` Richard Biener
  2017-03-03  9:39                   ` Richard Sandiford
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Biener @ 2017-03-02 12:05 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Wed, 1 Mar 2017, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >
> >> Richard Biener <rguenther@suse.de> writes:
> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >> >
> >> >> Richard Biener <rguenther@suse.de> writes:
> >> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >> >> >
> >> >> >> Sorry for the late reply, but:
> >> >> >> 
> >> >> >> Richard Biener <rguenther@suse.de> writes:
> >> >> >> > On Mon, 7 Nov 2016, Richard Biener wrote:
> >> >> >> >
> >> >> >> >> 
> >> >> >> >> Currently we force peeling for gaps whenever element overrun can occur
> >> >> >> >> but for aligned accesses we know that the loads won't trap and thus
> >> >> >> >> we can avoid this.
> >> >> >> >> 
> >> >> >> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu (I expect
> >> >> >> >> some testsuite fallout here so didn't bother to invent a new testcase).
> >> >> >> >> 
> >> >> >> >> Just in case somebody thinks the overrun is a bad idea in general
> >> >> >> >> (even when not trapping).  Like for ASAN or valgrind.
> >> >> >> >
> >> >> >> > This is what I applied.
> >> >> >> >
> >> >> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >> >> >> >
> >> >> >> > Richard.
> >> >> >> [...]
> >> >> >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> >> >> >> > index 15aec21..c29e73d 100644
> >> >> >> > --- a/gcc/tree-vect-stmts.c
> >> >> >> > +++ b/gcc/tree-vect-stmts.c
> >> >> >> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
> >> >> >> >        /* If there is a gap at the end of the group then these optimizations
> >> >> >> >  	 would access excess elements in the last iteration.  */
> >> >> >> >        bool would_overrun_p = (gap != 0);
> >> >> >> > +      /* If the access is aligned an overrun is fine.  */
> >> >> >> > +      if (would_overrun_p
> >> >> >> > +	  && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
> >> >> >> > +	would_overrun_p = false;
> >> >> >> >        if (!STMT_VINFO_STRIDED_P (stmt_info)
> >> >> >> >  	  && (can_overrun_p || !would_overrun_p)
> >> >> >> >  	  && compare_step_with_zero (stmt) > 0)
> >> >> >> 
> >> >> >> ...is this right for all cases?  I think it only looks for single-vector
> >> >> >> alignment, but the gap can in principle be vector-sized or larger,
> >> >> >> at least for load-lanes.
> >> >> >>
> >> >> >> E.g. say we have a 128-bit vector of doubles in a group of size 4
> >> >> >> and a gap of 2 or 3.  Even if the access itself is aligned, the group
> >> >> >> spans two vectors and we have no guarantee that the second one
> >> >> >> is mapped.
> >> >> >
> >> >> > The check assumes that if aligned_access_p () returns true then the
> >> >> > whole access is aligned in a way that it can't cross page boundaries.
> >> >> > That's of course not the case if alignment is 16 bytes but the access
> >> >> > will be a multiple of that.
> >> >> >  
> >> >> >> I haven't been able to come up with a testcase though.  We seem to be
> >> >> >> overly conservative when computing alignments.
> >> >> >
> >> >> > Not sure if we can run into this with load-lanes given that bumps the
> >> >> > vectorization factor.  Also does load-lane work with gaps?
> >> >> >
> >> >> > I think that gap can never be larger than nunits-1 so it is by definition
> >> >> > in the last "vector" independent of the VF.
> >> >> >
> >> >> > Classical gap case is
> >> >> >
> >> >> > for (i=0; i<n; ++i)
> >> >> >  {
> >> >> >    y[3*i + 0] = x[4*i + 0];
> >> >> >    y[3*i + 1] = x[4*i + 1];
> >> >> >    y[3*i + 2] = x[4*i + 2];
> >> >> >  }
> >> >> >
> >> >> > where x has a gap of 1.  You'll get VF of 12 for the above.  Make
> >> >> > the y's different streams and you should get the perfect case for
> >> >> > load-lane:
> >> >> >
> >> >> > for (i=0; i<n; ++i)
> >> >> >  {
> >> >> >    y[i] = x[4*i + 0];
> >> >> >    z[i] = x[4*i + 1];
> >> >> >    w[i] = x[4*i + 2];
> >> >> >  } 
> >> >> >
> >> >> > previously we'd peel at least 4 iterations into the epilogue for
> >> >> > the fear of accessing x[4*i + 3].  When x is V4SI aligned that's
> >> >> > ok.
> >> >> 
> >> >> The case I was thinking of was like the second, but with the
> >> >> element type being DI or DF and with the + 2 statement removed.
> >> >> E.g.:
> >> >> 
> >> >> double __attribute__((noinline))
> >> >> foo (double *a)
> >> >> {
> >> >>   double res = 0.0;
> >> >>   for (int n = 0; n < 256; n += 4)
> >> >>     res += a[n] + a[n + 1];
> >> >>   return res;
> >> >> }
> >> >> 
> >> >> (with -ffast-math).  We do use LD4 for this, and having "a" aligned
> >> >> to V2DF isn't enough to guarantee that we can access a[n + 2]
> >> >> and a[n + 3].
> >> >
> >> > Yes, indeed.  It's safe when peeling for gaps would remove
> >> > N < alignof (ref) / sizeof (ref) scalar iterations.
> >> >
> >> > Peeling for gaps simply subtracts one from the niter of the vectorized 
> >> > loop.
> >> 
> >> I think subtracting one is enough in all cases.  It's only the final
> >> iteration of the scalar loop that can't access a[n + 2] and a[n + 3].
> >> 
> >> (Of course, subtracting one happens before peeling for niters, so it
> >> only makes a difference if the original niters was a multiple of the VF,
> >> in which case we peel a full vector's worth of iterations instead of
> >> peeling none.)
> >
> > I think one could extend the gcc.dg/vect/group-no-gaps-1.c testcase
> >> > to cover the case with a bigger VF, for example by having different
> > types for a and b.
> >
> > I can try playing with this later this week but if you can come up
> > with a testcase that exercises load-/store-lanes that would be great.
> > See also gcc.dg/vect/pr49038.c for a less convoluted testcase to copy
> > from.
> 
> Yeah, that's what I'd tried originally, but couldn't convince
> vect_compute_data_ref_alignment to realise that the base was aligned.

Try using __builtin_assume_aligned.

> On that topic, the same patch had:
> 
>       if (DECL_USER_ALIGN (base))
> 	{
> 	  if (dump_enabled_p ())
> 	    {
> 	      dump_printf_loc (MSG_NOTE, vect_location,
> 			       "not forcing alignment of user-aligned "
> 			       "variable: ");
> 	      dump_generic_expr (MSG_NOTE, TDF_SLIM, base);
> 	      dump_printf (MSG_NOTE, "\n");
> 	    }
> 	  return true;
> 	}
> 
> But shouldn't this also be testing whether decl is packed?  The docs
> say that the aligned attribute only specifies a minimum, so increasing
> it should be fine.

I think I copied this from elsewhere (the ipa-alignment pass I think).

What do you mean by checking packed?  That we may not increase
alignment of a decl declared as packed?  
symtab_node::can_increase_alignment_p doesn't check that either
(it also doesn't check DECL_USER_ALIGN).

The packed attribute docs suggest you may combine aligned and packed
and then get exact alignment, so you suggest

  if (DECL_USER_ALIGN (base) && DECL_PACKED (base))

or simply

  if (DECL_PACKED (base))

?

Thanks,
Richard.


* Re: [PATCH] Avoid peeling for gaps if accesses are aligned
  2017-03-02 12:05                 ` Richard Biener
@ 2017-03-03  9:39                   ` Richard Sandiford
  2017-03-03 10:19                     ` Richard Biener
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Sandiford @ 2017-03-03  9:39 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

Richard Biener <rguenther@suse.de> writes:
> On Wed, 1 Mar 2017, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
>> >
>> >> Richard Biener <rguenther@suse.de> writes:
>> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
>> >> >
>> >> >> Richard Biener <rguenther@suse.de> writes:
>> >> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
>> >> >> >
>> >> >> >> Sorry for the late reply, but:
>> >> >> >> 
>> >> >> >> Richard Biener <rguenther@suse.de> writes:
>> >> >> >> > On Mon, 7 Nov 2016, Richard Biener wrote:
>> >> >> >> >
>> >> >> >> >> 
>> >> >> >> >> Currently we force peeling for gaps whenever element
>> >> >> >> >> overrun can occur
>> >> >> >> >> but for aligned accesses we know that the loads won't trap
>> >> >> >> >> and thus
>> >> >> >> >> we can avoid this.
>> >> >> >> >> 
>> >> >> >> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu
>> >> >> >> >> (I expect
>> >> >> >> >> some testsuite fallout here so didn't bother to invent a
>> >> >> >> >> new testcase).
>> >> >> >> >> 
>> >> >> >> >> Just in case somebody thinks the overrun is a bad idea in general
>> >> >> >> >> (even when not trapping).  Like for ASAN or valgrind.
>> >> >> >> >
>> >> >> >> > This is what I applied.
>> >> >> >> >
>> >> >> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
>> >> >> >> >
>> >> >> >> > Richard.
>> >> >> >> [...]
>> >> >> >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> >> >> >> > index 15aec21..c29e73d 100644
>> >> >> >> > --- a/gcc/tree-vect-stmts.c
>> >> >> >> > +++ b/gcc/tree-vect-stmts.c
>> >> >> >> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
>> >> >> >> >        /* If there is a gap at the end of the group then these optimizations
>> >> >> >> >  	 would access excess elements in the last iteration.  */
>> >> >> >> >        bool would_overrun_p = (gap != 0);
>> >> >> >> > +      /* If the access is aligned an overrun is fine.  */
>> >> >> >> > +      if (would_overrun_p
>> >> >> >> > +	  && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
>> >> >> >> > +	would_overrun_p = false;
>> >> >> >> >        if (!STMT_VINFO_STRIDED_P (stmt_info)
>> >> >> >> >  	  && (can_overrun_p || !would_overrun_p)
>> >> >> >> >  	  && compare_step_with_zero (stmt) > 0)
>> >> >> >> 
>> >> >> >> ...is this right for all cases?  I think it only looks for
>> >> >> >> single-vector
>> >> >> >> alignment, but the gap can in principle be vector-sized or larger,
>> >> >> >> at least for load-lanes.
>> >> >> >>
>> >> >> >> E.g. say we have a 128-bit vector of doubles in a group of size 4
>> >> >> >> and a gap of 2 or 3.  Even if the access itself is aligned, the group
>> >> >> >> spans two vectors and we have no guarantee that the second one
>> >> >> >> is mapped.
>> >> >> >
>> >> >> > The check assumes that if aligned_access_p () returns true then the
>> >> >> > whole access is aligned in a way that it can't cross page boundaries.
>> >> >> > That's of course not the case if alignment is 16 bytes but the access
>> >> >> > will be a multiple of that.
>> >> >> >  
>> >> >> >> I haven't been able to come up with a testcase though.  We seem to be
>> >> >> >> overly conservative when computing alignments.
>> >> >> >
>> >> >> > Not sure if we can run into this with load-lanes given that bumps the
>> >> >> > vectorization factor.  Also does load-lane work with gaps?
>> >> >> >
>> >> >> > I think that gap can never be larger than nunits-1 so it is by definition
>> >> >> > in the last "vector" independent of the VF.
>> >> >> >
>> >> >> > Classical gap case is
>> >> >> >
>> >> >> > for (i=0; i<n; ++i)
>> >> >> >  {
>> >> >> >    y[3*i + 0] = x[4*i + 0];
>> >> >> >    y[3*i + 1] = x[4*i + 1];
>> >> >> >    y[3*i + 2] = x[4*i + 2];
>> >> >> >  }
>> >> >> >
>> >> >> > where x has a gap of 1.  You'll get VF of 12 for the above.  Make
>> >> >> > the y's different streams and you should get the perfect case for
>> >> >> > load-lane:
>> >> >> >
>> >> >> > for (i=0; i<n; ++i)
>> >> >> >  {
>> >> >> >    y[i] = x[4*i + 0];
>> >> >> >    z[i] = x[4*i + 1];
>> >> >> >    w[i] = x[4*i + 2];
>> >> >> >  } 
>> >> >> >
>> >> >> > previously we'd peel at least 4 iterations into the epilogue for
>> >> >> > the fear of accessing x[4*i + 3].  When x is V4SI aligned that's
>> >> >> > ok.
>> >> >> 
>> >> >> The case I was thinking of was like the second, but with the
>> >> >> element type being DI or DF and with the + 2 statement removed.
>> >> >> E.g.:
>> >> >> 
>> >> >> double __attribute__((noinline))
>> >> >> foo (double *a)
>> >> >> {
>> >> >>   double res = 0.0;
>> >> >>   for (int n = 0; n < 256; n += 4)
>> >> >>     res += a[n] + a[n + 1];
>> >> >>   return res;
>> >> >> }
>> >> >> 
>> >> >> (with -ffast-math).  We do use LD4 for this, and having "a" aligned
>> >> >> to V2DF isn't enough to guarantee that we can access a[n + 2]
>> >> >> and a[n + 3].
>> >> >
>> >> > Yes, indeed.  It's safe when peeling for gaps would remove
>> >> > N < alignof (ref) / sizeof (ref) scalar iterations.
>> >> >
>> >> > Peeling for gaps simply subtracts one from the niter of the vectorized 
>> >> > loop.
>> >> 
>> >> I think subtracting one is enough in all cases.  It's only the final
>> >> iteration of the scalar loop that can't access a[n + 2] and a[n + 3].
>> >> 
>> >> (Of course, subtracting one happens before peeling for niters, so it
>> >> only makes a difference if the original niters was a multiple of the VF,
>> >> in which case we peel a full vector's worth of iterations instead of
>> >> peeling none.)
>> >
>> > I think one could extend the gcc.dg/vect/group-no-gaps-1.c testcase
>> > to cover the case with a bigger VF, for example by having different
>> > types for a and b.
>> >
>> > I can try playing with this later this week but if you can come up
>> > with a testcase that exercises load-/store-lanes that would be great.
>> > See also gcc.dg/vect/pr49038.c for a less convoluted testcase to copy
>> > from.
>> 
>> Yeah, that's what I'd tried originally, but couldn't convince
>> vect_compute_data_ref_alignment to realise that the base was aligned.
>
> Try using __builtin_assume_aligned.

Thanks, that did the trick.  Filed as PR79824.

>> On that topic, the same patch had:
>> 
>>       if (DECL_USER_ALIGN (base))
>> 	{
>> 	  if (dump_enabled_p ())
>> 	    {
>> 	      dump_printf_loc (MSG_NOTE, vect_location,
>> 			       "not forcing alignment of user-aligned "
>> 			       "variable: ");
>> 	      dump_generic_expr (MSG_NOTE, TDF_SLIM, base);
>> 	      dump_printf (MSG_NOTE, "\n");
>> 	    }
>> 	  return true;
>> 	}
>> 
>> But shouldn't this also be testing whether decl is packed?  The docs
>> say that the aligned attribute only specifies a minimum, so increasing
>> it should be fine.
>
> I think I copied this from elsewhere (the ipa-alignment pass I think).
>
> What do you mean with checking packed?  That we may not increase
> alignment of a decl declared as packed?  
> symtab_node::can_increase_alignment_p doesn't check that either
> (it also doesn't check DECL_USER_ALIGN).
>
> The packed attribute docs suggest you may combine aligned and packed
> and then get exact alignment, so you suggest
>
>   if (DECL_USER_ALIGN (base) && DECL_PACKED (base))
>
> or simply
>
>   if (DECL_PACKED (base))
>
> ?

Yeah, I was thinking of the second one, but I'd forgotten that you can't
pack a variable, just a field.

I don't think we can make meaningful guarantees about a "maximum"
alignment for variables.  E.g. one variable could be placed after
another variable with a higher alignment and effectively have its
alignment increased that way.  Plus of course objects could happen
to fall on nicely-aligned addresses even if the requested alignment
was smaller.

It just seems that one of the use cases of the aligned attribute
is to encourage optimisation.  If the compiler happens to find a
reason for increasing it further then I think we should allow that.
E.g. people who add aligned attributes based on the width of one
vector architecture probably didn't want to stop the vectoriser
from handling later architectures that have wider vectors.

Thanks,
Richard


* Re: [PATCH] Avoid peeling for gaps if accesses are aligned
  2017-03-03  9:39                   ` Richard Sandiford
@ 2017-03-03 10:19                     ` Richard Biener
  0 siblings, 0 replies; 12+ messages in thread
From: Richard Biener @ 2017-03-03 10:19 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches

On Fri, 3 Mar 2017, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >
> >> Richard Biener <rguenther@suse.de> writes:
> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >> >
> >> >> Richard Biener <rguenther@suse.de> writes:
> >> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >> >> >
> >> >> >> Richard Biener <rguenther@suse.de> writes:
> >> >> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >> >> >> >
> >> >> >> >> Sorry for the late reply, but:
> >> >> >> >> 
> >> >> >> >> Richard Biener <rguenther@suse.de> writes:
> >> >> >> >> > On Mon, 7 Nov 2016, Richard Biener wrote:
> >> >> >> >> >
> >> >> >> >> >> 
> >> >> >> >> >> Currently we force peeling for gaps whenever element
> >> >> >> >> >> overrun can occur
> >> >> >> >> >> but for aligned accesses we know that the loads won't trap
> >> >> >> >> >> and thus
> >> >> >> >> >> we can avoid this.
> >> >> >> >> >> 
> >> >> >> >> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu
> >> >> >> >> >> (I expect
> >> >> >> >> >> some testsuite fallout here so didn't bother to invent a
> >> >> >> >> >> new testcase).
> >> >> >> >> >> 
> >> >> >> >> >> Just in case somebody thinks the overrun is a bad idea in general
> >> >> >> >> >> (even when not trapping).  Like for ASAN or valgrind.
> >> >> >> >> >
> >> >> >> >> > This is what I applied.
> >> >> >> >> >
> >> >> >> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >> >> >> >> >
> >> >> >> >> > Richard.
> >> >> >> >> [...]
> >> >> >> >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> >> >> >> >> > index 15aec21..c29e73d 100644
> >> >> >> >> > --- a/gcc/tree-vect-stmts.c
> >> >> >> >> > +++ b/gcc/tree-vect-stmts.c
> >> >> >> >> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
> >> >> >> >> >        /* If there is a gap at the end of the group then these optimizations
> >> >> >> >> >  	 would access excess elements in the last iteration.  */
> >> >> >> >> >        bool would_overrun_p = (gap != 0);
> >> >> >> >> > +      /* If the access is aligned an overrun is fine.  */
> >> >> >> >> > +      if (would_overrun_p
> >> >> >> >> > +	  && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
> >> >> >> >> > +	would_overrun_p = false;
> >> >> >> >> >        if (!STMT_VINFO_STRIDED_P (stmt_info)
> >> >> >> >> >  	  && (can_overrun_p || !would_overrun_p)
> >> >> >> >> >  	  && compare_step_with_zero (stmt) > 0)
> >> >> >> >> 
> >> >> >> >> ...is this right for all cases?  I think it only looks for
> >> >> >> >> single-vector
> >> >> >> >> alignment, but the gap can in principle be vector-sized or larger,
> >> >> >> >> at least for load-lanes.
> >> >> >> >>
> >> >> >> >> E.g. say we have a 128-bit vector of doubles in a group of size 4
> >> >> >> >> and a gap of 2 or 3.  Even if the access itself is aligned, the group
> >> >> >> >> spans two vectors and we have no guarantee that the second one
> >> >> >> >> is mapped.
> >> >> >> >
> >> >> >> > The check assumes that if aligned_access_p () returns true then the
> >> >> >> > whole access is aligned in a way that it can't cross page boundaries.
> >> >> >> > That's of course not the case if alignment is 16 bytes but the access
> >> >> >> > will be a multiple of that.
> >> >> >> >  
> >> >> >> >> I haven't been able to come up with a testcase though.  We seem to be
> >> >> >> >> overly conservative when computing alignments.
> >> >> >> >
> >> >> >> > Not sure if we can run into this with load-lanes given that bumps the
> >> >> >> > vectorization factor.  Also does load-lane work with gaps?
> >> >> >> >
> >> >> >> > I think that gap can never be larger than nunits-1 so it is by definition
> >> >> >> > in the last "vector" independent of the VF.
> >> >> >> >
> >> >> >> > Classical gap case is
> >> >> >> >
> >> >> >> > for (i=0; i<n; ++i)
> >> >> >> >  {
> >> >> >> >    y[3*i + 0] = x[4*i + 0];
> >> >> >> >    y[3*i + 1] = x[4*i + 1];
> >> >> >> >    y[3*i + 2] = x[4*i + 2];
> >> >> >> >  }
> >> >> >> >
> >> >> >> > where x has a gap of 1.  You'll get VF of 12 for the above.  Make
> >> >> >> > the y's different streams and you should get the perfect case for
> >> >> >> > load-lane:
> >> >> >> >
> >> >> >> > for (i=0; i<n; ++i)
> >> >> >> >  {
> >> >> >> >    y[i] = x[4*i + 0];
> >> >> >> >    z[i] = x[4*i + 1];
> >> >> >> >    w[i] = x[4*i + 2];
> >> >> >> >  } 
> >> >> >> >
> >> >> >> > previously we'd peel at least 4 iterations into the epilogue for
> >> >> >> > the fear of accessing x[4*i + 3].  When x is V4SI aligned that's
> >> >> >> > ok.
> >> >> >> 
> >> >> >> The case I was thinking of was like the second, but with the
> >> >> >> element type being DI or DF and with the + 2 statement removed.
> >> >> >> E.g.:
> >> >> >> 
> >> >> >> double __attribute__((noinline))
> >> >> >> foo (double *a)
> >> >> >> {
> >> >> >>   double res = 0.0;
> >> >> >>   for (int n = 0; n < 256; n += 4)
> >> >> >>     res += a[n] + a[n + 1];
> >> >> >>   return res;
> >> >> >> }
> >> >> >> 
> >> >> >> (with -ffast-math).  We do use LD4 for this, and having "a" aligned
> >> >> >> to V2DF isn't enough to guarantee that we can access a[n + 2]
> >> >> >> and a[n + 3].
> >> >> >
> >> >> > Yes, indeed.  It's safe when peeling for gaps would remove
> >> >> > N < alignof (ref) / sizeof (ref) scalar iterations.
> >> >> >
> >> >> > Peeling for gaps simply subtracts one from the niter of the vectorized 
> >> >> > loop.
> >> >> 
> >> >> I think subtracting one is enough in all cases.  It's only the final
> >> >> iteration of the scalar loop that can't access a[n + 2] and a[n + 3].
> >> >> 
> >> >> (Of course, subtracting one happens before peeling for niters, so it
> >> >> only makes a difference if the original niters was a multiple of the VF,
> >> >> in which case we peel a full vector's worth of iterations instead of
> >> >> peeling none.)
> >> >
> >> > I think one could extend the gcc.dg/vect/group-no-gaps-1.c testcase
> >> > to cover the case with a bigger VF, for example by having different
> >> > types for a and b.
> >> >
> >> > I can try playing with this later this week but if you can come up
> >> > with a testcase that exercises load-/store-lanes that would be great.
> >> > See also gcc.dg/vect/pr49038.c for a less convoluted testcase to copy
> >> > from.
> >> 
> >> Yeah, that's what I'd tried originally, but couldn't convince
> >> vect_compute_data_ref_alignment to realise that the base was aligned.
> >
> > Try using __builtin_assume_aligned.
> 
> Thanks, that did the trick.  Filed as PR79824.
> 
> >> On that topic, the same patch had:
> >> 
> >>       if (DECL_USER_ALIGN (base))
> >> 	{
> >> 	  if (dump_enabled_p ())
> >> 	    {
> >> 	      dump_printf_loc (MSG_NOTE, vect_location,
> >> 			       "not forcing alignment of user-aligned "
> >> 			       "variable: ");
> >> 	      dump_generic_expr (MSG_NOTE, TDF_SLIM, base);
> >> 	      dump_printf (MSG_NOTE, "\n");
> >> 	    }
> >> 	  return true;
> >> 	}
> >> 
> >> But shouldn't this also be testing whether decl is packed?  The docs
> >> say that the aligned attribute only specifies a minimum, so increasing
> >> it should be fine.
> >
> > I think I copied this from elsewhere (the ipa-alignment pass I think).
> >
> > What do you mean with checking packed?  That we may not increase
> > alignment of a decl declared as packed?  
> > symtab_node::can_increase_alignment_p doesn't check that either
> > (it also doesn't check DECL_USER_ALIGN).
> >
> > The packed attribute docs suggest you may combine aligned and packed
> > and then get exact alignment, so you suggest
> >
> >   if (DECL_USER_ALIGN (base) && DECL_PACKED (base))
> >
> > or simply
> >
> >   if (DECL_PACKED (base))
> >
> > ?
> 
> Yeah, was thinking of the second one, but I'd forgotten that you can't
> pack a variable, just a field.
> 
> I don't think we can make meaningful guarantees about a "maximum"
> alignment for variables.  E.g. one variable could be placed after
> another variable with a higher alignment and effectively have its
> alignment increased that way.  Plus of course objects could happen
> to fall on nicely-aligned addresses even if the requested alignment
> was smaller.

Yeah, this also makes the idea of preserving "lowering of alignment"
of decls bogus.

> It just seems that one of the use cases of the aligned attribute
> is to encourage optimisation.  If the compiler happens to find a
> reason for increasing it further then I think we should allow that.
> E.g. people who add aligned attributes based on the width of one
> vector architecture probably didn't want to stop the vectoriser
> from handling later architectures that have wider vectors.

So the only reason I can think of for honoring DECL_USER_ALIGN is to avoid
excessively re-aligning data if data segment size is a concern.

Richard.


end of thread, other threads:[~2017-03-03 10:19 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-07 15:02 [PATCH] Avoid peeling for gaps if accesses are aligned Richard Biener
2016-11-08 10:20 ` Richard Biener
2017-03-01 11:54   ` Richard Sandiford
2017-03-01 12:06     ` Richard Biener
2017-03-01 12:18       ` Richard Sandiford
2017-03-01 12:23         ` Richard Biener
2017-03-01 12:40           ` Richard Sandiford
2017-03-01 14:37             ` Richard Biener
2017-03-01 19:59               ` Richard Sandiford
2017-03-02 12:05                 ` Richard Biener
2017-03-03  9:39                   ` Richard Sandiford
2017-03-03 10:19                     ` Richard Biener
