public inbox for gcc-patches@gcc.gnu.org
* [PATCH][RFC] tree-optimization/101908 - avoid STLF fails when vectorizing
@ 2022-03-25 12:11 Richard Biener
  2022-03-25 13:25 ` Hongtao Liu
  2022-03-28 13:37 ` Richard Sandiford
  0 siblings, 2 replies; 8+ messages in thread
From: Richard Biener @ 2022-03-25 12:11 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.sandiford, hongtao.liu

Since we're now vectorizing by default at -O2, issues like PR101908
become more important, where we apply basic-block vectorization to
parts of a function covering loads from function parameters passed
on the stack.  We have no good idea how the stack pushing was
performed, but we do know it happened recently when we are at the
start of the function, so a STLF (store-to-load forwarding) failure
is inevitable unless the argument passing code used the same or
larger vector stores, aligned the same way.

Until there's a robust IPA-based solution, the following implements
target-independent heuristics in the vectorizer to retain scalar
loads for loads from parameters likely passed in memory (I use a
BLKmode DECL_MODE check for this rather than firing up
cumulative-args).  I've also restricted this to loads from the
first "block" (which can be less than the first basic block if
there's a call, for example), since that covers the testcase.
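
To illustrate what retaining scalar loads means here, the following
is a minimal sketch (my own, using GCC's generic vector extension;
the function name and values are made up and this is not the
vectorizer's actual output): the vector operands are built from
individual element loads instead of one vector load from the
parameter's stack slot that the caller has only just written.

```c
#include <assert.h>

struct vec3 { double x, y, z; };
struct ray { struct vec3 orig, dir; };

typedef double v2df __attribute__ ((vector_size (16)));

/* Hypothetical illustration of VMAT_ELEMENTWISE for the loads: each
   element is read with a scalar load and the vectors are built from
   those scalars, rather than issuing one 16-byte vector load from
   the just-stored stack slot of the by-value parameter.  */
void
ray_sphere_elementwise (struct ray ray, double *res)
{
  v2df a = { ray.orig.y, ray.orig.z };	/* two scalar loads */
  v2df b = { ray.dir.x, ray.dir.y };	/* two scalar loads */
  v2df r = a * b;			/* the multiply stays vectorized */
  res[0] = r[0];
  res[1] = r[1];
}
```

The arithmetic is unchanged; only the way the operands reach the
vector registers differs, which is what avoids the forwarding stall.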

Note that for the testcase (but not c-ray from the bug report)
there's an x86 peephole2 that vectorizes things back, so the patch
is not effective there.

Any comments?  I know we're also looking at x86 port-specific
mitigations, but the issue will hit arm and power/z as well, I think.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

Thanks,
Richard.

2022-03-25  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/101908
	* tree-vect-stmts.cc (get_group_load_store_type): Add
	heuristic to limit BB vectorization of function parameters.

	* gcc.dg/vect/bb-slp-pr101908.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c | 18 ++++++++++++++
 gcc/tree-vect-stmts.cc                      | 27 ++++++++++++++++++++-
 2 files changed, 44 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
new file mode 100644
index 00000000000..b7534a18f0e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_double } */
+
+struct vec3 {
+  double x, y, z;
+};
+
+struct ray {
+  struct vec3 orig, dir;
+};
+
+void ray_sphere (struct ray ray, double *res)
+{
+  res[0] = ray.orig.y * ray.dir.x;
+  res[1] = ray.orig.z * ray.dir.y;
+}
+
+/* { dg-final { scan-tree-dump "STLF fail" "slp2" } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index f7449a79d1c..1e37e9678b6 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2197,7 +2197,32 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
   /* Stores can't yet have gaps.  */
   gcc_assert (slp_node || vls_type == VLS_LOAD || gap == 0);
 
-  if (slp_node)
+  tree parm;
+  if (!loop_vinfo
+      && vls_type == VLS_LOAD
+      /* The access is based on a PARM_DECL.  */
+      && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
+      && ((parm = TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0)), true)
+      && TREE_CODE (parm) == PARM_DECL
+      /* Likely passed on the stack.  */
+      && DECL_MODE (parm) == BLKmode
+      /* The access is in the first group.  */
+      && first_dr_info->group == 0)
+    {
+      /* When doing BB vectorizing force early loads from function parameters
+	 passed on the stack and thus stored recently to be done elementwise
+	 to avoid store-to-load forwarding penalties.
+	 Note this will cause vectorization to fail for the load because of
+	 the fear of underestimating the cost of elementwise accesses,
+	 see the end of get_load_store_type.  We are then going to effectively
+	 do the same via handling the loads as external input to the SLP.  */
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_NOTE, vect_location,
+			 "Not using vector loads from function parameter "
+			 "for the fear of causing a STLF fail\n");
+      *memory_access_type = VMAT_ELEMENTWISE;
+    }
+  else if (slp_node)
     {
       /* For SLP vectorization we directly vectorize a subchain
 	 without permutation.  */
-- 
2.34.1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH][RFC] tree-optimization/101908 - avoid STLF fails when vectorizing
  2022-03-25 12:11 [PATCH][RFC] tree-optimization/101908 - avoid STLF fails when vectorizing Richard Biener
@ 2022-03-25 13:25 ` Hongtao Liu
  2022-03-25 13:42   ` Richard Biener
  2022-03-28 13:37 ` Richard Sandiford
  1 sibling, 1 reply; 8+ messages in thread
From: Hongtao Liu @ 2022-03-25 13:25 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Richard Sandiford, Liu, Hongtao

On Fri, Mar 25, 2022 at 8:11 PM Richard Biener via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Since we're now vectorizing by default at -O2, issues like PR101908
> become more important, where we apply basic-block vectorization to
> parts of a function covering loads from function parameters passed
> on the stack.  We have no good idea how the stack pushing was
> performed, but we do know it happened recently when we are at the
> start of the function, so a STLF (store-to-load forwarding) failure
> is inevitable unless the argument passing code used the same or
> larger vector stores, aligned the same way.
>
> Until there's a robust IPA-based solution, the following implements
> target-independent heuristics in the vectorizer to retain scalar
> loads for loads from parameters likely passed in memory (I use a
> BLKmode DECL_MODE check for this rather than firing up
> cumulative-args).  I've also restricted this to loads from the
> first "block" (which can be less than the first basic block if
> there's a call, for example), since that covers the testcase.
I prefer this patch to the md-reorg way.
Can the vectorizer similarly handle arguments passed by reference
when the object they point to is a local variable of the caller?
I.e., we can also prevent vectorization for the case below.

struct vec3 {
  double x, y, z;
};

struct ray {
 struct vec3 orig, dir;
};

void
__attribute__((noinline))
ray_sphere (struct ray* __restrict ray, double *__restrict res)
{
  res[0] = ray->orig.y * ray->dir.x;
  res[1] = ray->orig.z * ray->dir.y;
}

extern struct ray g;
void bar (double* res)
{
    struct ray tmp = g;
    ray_sphere (&tmp, res);
}

>
> Note that for the testcase (but not c-ray from the bug report)
> there's an x86 peephole2 that vectorizes things back, so the patch
> is not effective there.
>
> Any comments?  I know we're also looking at x86 port-specific
> mitigations, but the issue will hit arm and power/z as well, I think.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> Thanks,
> Richard.
>
> 2022-03-25  Richard Biener  <rguenther@suse.de>
>
>         PR tree-optimization/101908
>         * tree-vect-stmts.cc (get_group_load_store_type): Add
>         heuristic to limit BB vectorization of function parameters.
>
>         * gcc.dg/vect/bb-slp-pr101908.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c | 18 ++++++++++++++
>  gcc/tree-vect-stmts.cc                      | 27 ++++++++++++++++++++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> new file mode 100644
> index 00000000000..b7534a18f0e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target vect_double } */
> +
> +struct vec3 {
> +  double x, y, z;
> +};
> +
> +struct ray {
> +  struct vec3 orig, dir;
> +};
> +
> +void ray_sphere (struct ray ray, double *res)
> +{
> +  res[0] = ray.orig.y * ray.dir.x;
> +  res[1] = ray.orig.z * ray.dir.y;
> +}
> +
> +/* { dg-final { scan-tree-dump "STLF fail" "slp2" } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index f7449a79d1c..1e37e9678b6 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2197,7 +2197,32 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>    /* Stores can't yet have gaps.  */
>    gcc_assert (slp_node || vls_type == VLS_LOAD || gap == 0);
>
> -  if (slp_node)
> +  tree parm;
> +  if (!loop_vinfo
> +      && vls_type == VLS_LOAD
> +      /* The access is based on a PARM_DECL.  */
> +      && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
> +      && ((parm = TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0)), true)
> +      && TREE_CODE (parm) == PARM_DECL
> +      /* Likely passed on the stack.  */
> +      && DECL_MODE (parm) == BLKmode
> +      /* The access is in the first group.  */
> +      && first_dr_info->group == 0)
> +    {
> +      /* When doing BB vectorizing force early loads from function parameters
> +        passed on the stack and thus stored recently to be done elementwise
> +        to avoid store-to-load forwarding penalties.
> +        Note this will cause vectorization to fail for the load because of
> +        the fear of underestimating the cost of elementwise accesses,
> +        see the end of get_load_store_type.  We are then going to effectively
> +        do the same via handling the loads as external input to the SLP.  */
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_NOTE, vect_location,
> +                        "Not using vector loads from function parameter "
> +                        "for the fear of causing a STLF fail\n");
> +      *memory_access_type = VMAT_ELEMENTWISE;
> +    }
> +  else if (slp_node)
>      {
>        /* For SLP vectorization we directly vectorize a subchain
>          without permutation.  */
> --
> 2.34.1



-- 
BR,
Hongtao


* Re: [PATCH][RFC] tree-optimization/101908 - avoid STLF fails when vectorizing
  2022-03-25 13:25 ` Hongtao Liu
@ 2022-03-25 13:42   ` Richard Biener
  2022-03-25 14:09     ` Hongtao Liu
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Biener @ 2022-03-25 13:42 UTC (permalink / raw)
  To: Hongtao Liu; +Cc: GCC Patches, Richard Sandiford, Liu, Hongtao

On Fri, 25 Mar 2022, Hongtao Liu wrote:

> On Fri, Mar 25, 2022 at 8:11 PM Richard Biener via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > Since we're now vectorizing by default at -O2, issues like PR101908
> > become more important, where we apply basic-block vectorization to
> > parts of a function covering loads from function parameters passed
> > on the stack.  We have no good idea how the stack pushing was
> > performed, but we do know it happened recently when we are at the
> > start of the function, so a STLF (store-to-load forwarding) failure
> > is inevitable unless the argument passing code used the same or
> > larger vector stores, aligned the same way.
> >
> > Until there's a robust IPA-based solution, the following implements
> > target-independent heuristics in the vectorizer to retain scalar
> > loads for loads from parameters likely passed in memory (I use a
> > BLKmode DECL_MODE check for this rather than firing up
> > cumulative-args).  I've also restricted this to loads from the
> > first "block" (which can be less than the first basic block if
> > there's a call, for example), since that covers the testcase.
> I prefer this patch to the md-reorg way.

I was mostly posting the variant because it is target-agnostic.  In
general I agree that mitigation is best done at the RTL level after
scheduling/RA, so that the instruction sequence from function entry
is visible.

Did you have any success in coding this up yet?  Do you think it
can be done in a way that can be reused by multiple targets?

> Can the vectorizer similarly handle arguments passed by reference
> when the object they point to is a local variable of the caller?
> I.e., we can also prevent vectorization for the case below.
> 
> struct vec3 {
>   double x, y, z;
> };
> 
> struct ray {
>  struct vec3 orig, dir;
> };
> 
> void
> __attribute__((noinline))
> ray_sphere (struct ray* __restrict ray, double *__restrict res)
> {
>   res[0] = ray->orig.y * ray->dir.x;
>   res[1] = ray->orig.z * ray->dir.y;
> }
> 
> extern struct ray g;
> void bar (double* res)
> {
>     struct ray tmp = g;
>     ray_sphere (&tmp, res);
> }

Without IPA analysis this will be hard, if not impossible.

Richard.

> >
> > Note that for the testcase (but not c-ray from the bug report)
> > there's an x86 peephole2 that vectorizes things back, so the patch
> > is not effective there.
> >
> > Any comments?  I know we're also looking at x86 port-specific
> > mitigations, but the issue will hit arm and power/z as well, I think.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > Thanks,
> > Richard.
> >
> > 2022-03-25  Richard Biener  <rguenther@suse.de>
> >
> >         PR tree-optimization/101908
> >         * tree-vect-stmts.cc (get_group_load_store_type): Add
> >         heuristic to limit BB vectorization of function parameters.
> >
> >         * gcc.dg/vect/bb-slp-pr101908.c: New testcase.
> > ---
> >  gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c | 18 ++++++++++++++
> >  gcc/tree-vect-stmts.cc                      | 27 ++++++++++++++++++++-
> >  2 files changed, 44 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> > new file mode 100644
> > index 00000000000..b7534a18f0e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target vect_double } */
> > +
> > +struct vec3 {
> > +  double x, y, z;
> > +};
> > +
> > +struct ray {
> > +  struct vec3 orig, dir;
> > +};
> > +
> > +void ray_sphere (struct ray ray, double *res)
> > +{
> > +  res[0] = ray.orig.y * ray.dir.x;
> > +  res[1] = ray.orig.z * ray.dir.y;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "STLF fail" "slp2" } } */
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index f7449a79d1c..1e37e9678b6 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -2197,7 +2197,32 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
> >    /* Stores can't yet have gaps.  */
> >    gcc_assert (slp_node || vls_type == VLS_LOAD || gap == 0);
> >
> > -  if (slp_node)
> > +  tree parm;
> > +  if (!loop_vinfo
> > +      && vls_type == VLS_LOAD
> > +      /* The access is based on a PARM_DECL.  */
> > +      && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
> > +      && ((parm = TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0)), true)
> > +      && TREE_CODE (parm) == PARM_DECL
> > +      /* Likely passed on the stack.  */
> > +      && DECL_MODE (parm) == BLKmode
> > +      /* The access is in the first group.  */
> > +      && first_dr_info->group == 0)
> > +    {
> > +      /* When doing BB vectorizing force early loads from function parameters
> > +        passed on the stack and thus stored recently to be done elementwise
> > +        to avoid store-to-load forwarding penalties.
> > +        Note this will cause vectorization to fail for the load because of
> > +        the fear of underestimating the cost of elementwise accesses,
> > +        see the end of get_load_store_type.  We are then going to effectively
> > +        do the same via handling the loads as external input to the SLP.  */
> > +      if (dump_enabled_p ())
> > +       dump_printf_loc (MSG_NOTE, vect_location,
> > +                        "Not using vector loads from function parameter "
> > +                        "for the fear of causing a STLF fail\n");
> > +      *memory_access_type = VMAT_ELEMENTWISE;
> > +    }
> > +  else if (slp_node)
> >      {
> >        /* For SLP vectorization we directly vectorize a subchain
> >          without permutation.  */
> > --
> > 2.34.1
> 
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)


* Re: [PATCH][RFC] tree-optimization/101908 - avoid STLF fails when vectorizing
  2022-03-25 13:42   ` Richard Biener
@ 2022-03-25 14:09     ` Hongtao Liu
  0 siblings, 0 replies; 8+ messages in thread
From: Hongtao Liu @ 2022-03-25 14:09 UTC (permalink / raw)
  To: Richard Biener; +Cc: GCC Patches, Richard Sandiford, Liu, Hongtao

On Fri, Mar 25, 2022 at 9:42 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Fri, 25 Mar 2022, Hongtao Liu wrote:
>
> > On Fri, Mar 25, 2022 at 8:11 PM Richard Biener via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > Since we're now vectorizing by default at -O2, issues like PR101908
> > > become more important, where we apply basic-block vectorization to
> > > parts of a function covering loads from function parameters passed
> > > on the stack.  We have no good idea how the stack pushing was
> > > performed, but we do know it happened recently when we are at the
> > > start of the function, so a STLF (store-to-load forwarding) failure
> > > is inevitable unless the argument passing code used the same or
> > > larger vector stores, aligned the same way.
> > >
> > > Until there's a robust IPA-based solution, the following implements
> > > target-independent heuristics in the vectorizer to retain scalar
> > > loads for loads from parameters likely passed in memory (I use a
> > > BLKmode DECL_MODE check for this rather than firing up
> > > cumulative-args).  I've also restricted this to loads from the
> > > first "block" (which can be less than the first basic block if
> > > there's a call, for example), since that covers the testcase.
> > I prefer this patch to the md-reorg way.
>
> I was mostly posting the variant because it is target agnostic.  In
> general I agree that mitigation is best done at the RTL level after
> scheduling/RA so that the instruction sequence from function entry
> is visible.
>
> Did you have any success in coding this up yet?  Do you think it
Not yet.
> can be done in a way to be re-used by multiple targets?
>
> > Can the vectorizer similarly handle arguments passed by reference
> > when the object they point to is a local variable of the caller?
> > I.e., we can also prevent vectorization for the case below.
> >
> > struct vec3 {
> >   double x, y, z;
> > };
> >
> > struct ray {
> >  struct vec3 orig, dir;
> > };
> >
> > void
> > __attribute__((noinline))
> > ray_sphere (struct ray* __restrict ray, double *__restrict res)
> > {
> >   res[0] = ray->orig.y * ray->dir.x;
> >   res[1] = ray->orig.z * ray->dir.y;
> > }
> >
> > extern struct ray g;
> > void bar (double* res)
> > {
> >     struct ray tmp = g;
> >     ray_sphere (&tmp, res);
> > }
>
> Without IPA analysis this will be hard, if not impossible.
>
> Richard.
>
> > >
> > > Note that for the testcase (but not c-ray from the bug report)
> > > there's an x86 peephole2 that vectorizes things back, so the patch
> > > is not effective there.
> > >
> > > Any comments?  I know we're also looking at x86 port-specific
> > > mitigations, but the issue will hit arm and power/z as well, I think.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > 2022-03-25  Richard Biener  <rguenther@suse.de>
> > >
> > >         PR tree-optimization/101908
> > >         * tree-vect-stmts.cc (get_group_load_store_type): Add
> > >         heuristic to limit BB vectorization of function parameters.
> > >
> > >         * gcc.dg/vect/bb-slp-pr101908.c: New testcase.
> > > ---
> > >  gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c | 18 ++++++++++++++
> > >  gcc/tree-vect-stmts.cc                      | 27 ++++++++++++++++++++-
> > >  2 files changed, 44 insertions(+), 1 deletion(-)
> > >  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> > >
> > > diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> > > new file mode 100644
> > > index 00000000000..b7534a18f0e
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> > > @@ -0,0 +1,18 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-require-effective-target vect_double } */
> > > +
> > > +struct vec3 {
> > > +  double x, y, z;
> > > +};
> > > +
> > > +struct ray {
> > > +  struct vec3 orig, dir;
> > > +};
> > > +
> > > +void ray_sphere (struct ray ray, double *res)
> > > +{
> > > +  res[0] = ray.orig.y * ray.dir.x;
> > > +  res[1] = ray.orig.z * ray.dir.y;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "STLF fail" "slp2" } } */
> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > index f7449a79d1c..1e37e9678b6 100644
> > > --- a/gcc/tree-vect-stmts.cc
> > > +++ b/gcc/tree-vect-stmts.cc
> > > @@ -2197,7 +2197,32 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
> > >    /* Stores can't yet have gaps.  */
> > >    gcc_assert (slp_node || vls_type == VLS_LOAD || gap == 0);
> > >
> > > -  if (slp_node)
> > > +  tree parm;
> > > +  if (!loop_vinfo
> > > +      && vls_type == VLS_LOAD
> > > +      /* The access is based on a PARM_DECL.  */
> > > +      && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
> > > +      && ((parm = TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0)), true)
> > > +      && TREE_CODE (parm) == PARM_DECL
> > > +      /* Likely passed on the stack.  */
> > > +      && DECL_MODE (parm) == BLKmode
> > > +      /* The access is in the first group.  */
> > > +      && first_dr_info->group == 0)
> > > +    {
> > > +      /* When doing BB vectorizing force early loads from function parameters
> > > +        passed on the stack and thus stored recently to be done elementwise
> > > +        to avoid store-to-load forwarding penalties.
> > > +        Note this will cause vectorization to fail for the load because of
> > > +        the fear of underestimating the cost of elementwise accesses,
> > > +        see the end of get_load_store_type.  We are then going to effectively
> > > +        do the same via handling the loads as external input to the SLP.  */
> > > +      if (dump_enabled_p ())
> > > +       dump_printf_loc (MSG_NOTE, vect_location,
> > > +                        "Not using vector loads from function parameter "
> > > +                        "for the fear of causing a STLF fail\n");
> > > +      *memory_access_type = VMAT_ELEMENTWISE;
> > > +    }
> > > +  else if (slp_node)
> > >      {
> > >        /* For SLP vectorization we directly vectorize a subchain
> > >          without permutation.  */
> > > --
> > > 2.34.1
> >
> >
> >
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)



--
BR,
Hongtao


* Re: [PATCH][RFC] tree-optimization/101908 - avoid STLF fails when vectorizing
  2022-03-25 12:11 [PATCH][RFC] tree-optimization/101908 - avoid STLF fails when vectorizing Richard Biener
  2022-03-25 13:25 ` Hongtao Liu
@ 2022-03-28 13:37 ` Richard Sandiford
  2022-03-28 13:49   ` Richard Biener
  1 sibling, 1 reply; 8+ messages in thread
From: Richard Sandiford @ 2022-03-28 13:37 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, hongtao.liu

Richard Biener <rguenther@suse.de> writes:
> Since we're now vectorizing by default at -O2, issues like PR101908
> become more important, where we apply basic-block vectorization to
> parts of a function covering loads from function parameters passed
> on the stack.  We have no good idea how the stack pushing was
> performed, but we do know it happened recently when we are at the
> start of the function, so a STLF (store-to-load forwarding) failure
> is inevitable unless the argument passing code used the same or
> larger vector stores, aligned the same way.
>
> Until there's a robust IPA-based solution, the following implements
> target-independent heuristics in the vectorizer to retain scalar
> loads for loads from parameters likely passed in memory (I use a
> BLKmode DECL_MODE check for this rather than firing up
> cumulative-args).  I've also restricted this to loads from the
> first "block" (which can be less than the first basic block if
> there's a call, for example), since that covers the testcase.
>
> Note that for the testcase (but not c-ray from the bug report)
> there's an x86 peephole2 that vectorizes things back, so the patch
> is not effective there.
>
> Any comments?  I know we're also looking at x86 port-specific
> mitigations, but the issue will hit arm and power/z as well, I think.

I'm not sure this is a target-independent win.  In a loop that:

  stores 2 scalars
  loads the stored scalars as a vector
  adds a vector
  stores a vector

(no feedback), I see a 20% regression using elementwise accesses for
the load vs. using a normal vector load (measured on a Cortex-A72).
With feedback the elementwise version is still slower, but obviously
not by such a big factor.
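
A sketch of a loop with that shape (my reconstruction from the
description above, not the actual benchmark; the function name,
the scratch buffer and the constant added are made up):

```c
#include <string.h>

typedef double v2df __attribute__ ((vector_size (16)));

/* Each iteration stores two scalars, reloads them as one vector
   (the step that is sensitive to store-to-load forwarding), does a
   vector add and a vector store.  */
void
stlf_loop (const double *a, const double *b, double *tmp,
	   double *out, int n)
{
  for (int i = 0; i < n; i++)
    {
      tmp[0] = a[i];			/* scalar store */
      tmp[1] = b[i];			/* scalar store */
      v2df v;
      memcpy (&v, tmp, sizeof v);	/* vector load of the stored scalars */
      v += (v2df) { 1.0, 1.0 };		/* vector add */
      memcpy (&out[2 * i], &v, sizeof v);  /* vector store */
    }
}
```

The point of comparison is replacing the memcpy-style vector load of
tmp with two scalar loads plus a vector construction, which is what
VMAT_ELEMENTWISE would generate.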

I can see the argument for penalising this in the cost model in
a target-independent way, but I'm not sure we should change the
vectorisation strategy.

Thanks,
Richard

> Bootstrapped and tested on x86_64-unknown-linux-gnu.
>
> Thanks,
> Richard.
>
> 2022-03-25  Richard Biener  <rguenther@suse.de>
>
> 	PR tree-optimization/101908
> 	* tree-vect-stmts.cc (get_group_load_store_type): Add
> 	heuristic to limit BB vectorization of function parameters.
>
> 	* gcc.dg/vect/bb-slp-pr101908.c: New testcase.
> ---
>  gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c | 18 ++++++++++++++
>  gcc/tree-vect-stmts.cc                      | 27 ++++++++++++++++++++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> new file mode 100644
> index 00000000000..b7534a18f0e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target vect_double } */
> +
> +struct vec3 {
> +  double x, y, z;
> +};
> +
> +struct ray {
> +  struct vec3 orig, dir;
> +};
> +
> +void ray_sphere (struct ray ray, double *res)
> +{
> +  res[0] = ray.orig.y * ray.dir.x;
> +  res[1] = ray.orig.z * ray.dir.y;
> +}
> +
> +/* { dg-final { scan-tree-dump "STLF fail" "slp2" } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index f7449a79d1c..1e37e9678b6 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -2197,7 +2197,32 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>    /* Stores can't yet have gaps.  */
>    gcc_assert (slp_node || vls_type == VLS_LOAD || gap == 0);
>  
> -  if (slp_node)
> +  tree parm;
> +  if (!loop_vinfo
> +      && vls_type == VLS_LOAD
> +      /* The access is based on a PARM_DECL.  */
> +      && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
> +      && ((parm = TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0)), true)
> +      && TREE_CODE (parm) == PARM_DECL
> +      /* Likely passed on the stack.  */
> +      && DECL_MODE (parm) == BLKmode
> +      /* The access is in the first group.  */
> +      && first_dr_info->group == 0)
> +    {
> +      /* When doing BB vectorizing force early loads from function parameters
> +	 passed on the stack and thus stored recently to be done elementwise
> +	 to avoid store-to-load forwarding penalties.
> +	 Note this will cause vectorization to fail for the load because of
> +	 the fear of underestimating the cost of elementwise accesses,
> +	 see the end of get_load_store_type.  We are then going to effectively
> +	 do the same via handling the loads as external input to the SLP.  */
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_NOTE, vect_location,
> +			 "Not using vector loads from function parameter "
> +			 "for the fear of causing a STLF fail\n");
> +      *memory_access_type = VMAT_ELEMENTWISE;
> +    }
> +  else if (slp_node)
>      {
>        /* For SLP vectorization we directly vectorize a subchain
>  	 without permutation.  */


* Re: [PATCH][RFC] tree-optimization/101908 - avoid STLF fails when vectorizing
  2022-03-28 13:37 ` Richard Sandiford
@ 2022-03-28 13:49   ` Richard Biener
  2022-03-28 14:35     ` Richard Sandiford
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Biener @ 2022-03-28 13:49 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches, hongtao.liu

On Mon, 28 Mar 2022, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > Since we're now vectorizing by default at -O2, issues like PR101908
> > become more important, where we apply basic-block vectorization to
> > parts of a function covering loads from function parameters passed
> > on the stack.  We have no good idea how the stack pushing was
> > performed, but we do know it happened recently when we are at the
> > start of the function, so a STLF (store-to-load forwarding) failure
> > is inevitable unless the argument passing code used the same or
> > larger vector stores, aligned the same way.
> >
> > Until there's a robust IPA-based solution, the following implements
> > target-independent heuristics in the vectorizer to retain scalar
> > loads for loads from parameters likely passed in memory (I use a
> > BLKmode DECL_MODE check for this rather than firing up
> > cumulative-args).  I've also restricted this to loads from the
> > first "block" (which can be less than the first basic block if
> > there's a call, for example), since that covers the testcase.
> >
> > Note that for the testcase (but not c-ray from the bug report)
> > there's an x86 peephole2 that vectorizes things back, so the patch
> > is not effective there.
> >
> > Any comments?  I know we're also looking at x86 port-specific
> > mitigations, but the issue will hit arm and power/z as well, I think.
> 
> I'm not sure this is a target-independent win.  In a loop that:
> 
>   stores 2 scalars
>   loads the stored scalars as a vector
>   adds a vector
>   stores a vector
> 
> (no feedback), I see a 20% regression using elementwise accesses for
> the load vs. using a normal vector load (measured on a Cortex-A72).
> With feedback the elementwise version is still slower, but obviously
> not by such a big factor.

I see, so that's even without a call in between the scalar stores
and the vector load, as in the case we're trying to cover.  I
suspect that the two elementwise accesses execute too close to the
two scalar stores to benefit from any forwarding on the A72
micro-architecture.  Do you see a speedup when performing a vector
store instead of two scalar stores?

> I can see the argument for penalising this in the cost model in
> a target-independent way, but I'm not sure we should change the
> vectorisation strategy.

OK.  We tried that, but I think the better idea is to split the
stores when that is likely faster, late during md-reorg, where we
should have the best idea of whether we'll actually hit a STLF fail.

Since that's also what Hongtao prefers, I'll drop this patch
for now.

Thanks,
Richard.

> Thanks,
> Richard
> 
> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >
> > Thanks,
> > Richard.
> >
> > 2022-03-25  Richard Biener  <rguenther@suse.de>
> >
> > 	PR tree-optimization/101908
> > 	* tree-vect-stmts.cc (get_group_load_store_type): Add
> > 	heuristic to limit BB vectorization of function parameters.
> >
> > 	* gcc.dg/vect/bb-slp-pr101908.c: New testcase.
> > ---
> >  gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c | 18 ++++++++++++++
> >  gcc/tree-vect-stmts.cc                      | 27 ++++++++++++++++++++-
> >  2 files changed, 44 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> > new file mode 100644
> > index 00000000000..b7534a18f0e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr101908.c
> > @@ -0,0 +1,18 @@
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target vect_double } */
> > +
> > +struct vec3 {
> > +  double x, y, z;
> > +};
> > +
> > +struct ray {
> > +  struct vec3 orig, dir;
> > +};
> > +
> > +void ray_sphere (struct ray ray, double *res)
> > +{
> > +  res[0] = ray.orig.y * ray.dir.x;
> > +  res[1] = ray.orig.z * ray.dir.y;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "STLF fail" "slp2" } } */
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index f7449a79d1c..1e37e9678b6 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -2197,7 +2197,32 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
> >    /* Stores can't yet have gaps.  */
> >    gcc_assert (slp_node || vls_type == VLS_LOAD || gap == 0);
> >  
> > -  if (slp_node)
> > +  tree parm;
> > +  if (!loop_vinfo
> > +      && vls_type == VLS_LOAD
> > +      /* The access is based on a PARM_DECL.  */
> > +      && TREE_CODE (DR_BASE_ADDRESS (first_dr_info->dr)) == ADDR_EXPR
> > +      && ((parm = TREE_OPERAND (DR_BASE_ADDRESS (first_dr_info->dr), 0)), true)
> > +      && TREE_CODE (parm) == PARM_DECL
> > +      /* Likely passed on the stack.  */
> > +      && DECL_MODE (parm) == BLKmode
> > +      /* The access is in the first group.  */
> > +      && first_dr_info->group == 0)
> > +    {
> > +      /* When doing BB vectorizing force early loads from function parameters
> > +	 passed on the stack and thus stored recently to be done elementwise
> > +	 to avoid store-to-load forwarding penalties.
> > +	 Note this will cause vectorization to fail for the load because of
> > +	 the fear of underestimating the cost of elementwise accesses,
> > +	 see the end of get_load_store_type.  We are then going to effectively
> > +	 do the same via handling the loads as external input to the SLP.  */
> > +      if (dump_enabled_p ())
> > +	dump_printf_loc (MSG_NOTE, vect_location,
> > +			 "Not using vector loads from function parameter "
> > +			 "for the fear of causing a STLF fail\n");
> > +      *memory_access_type = VMAT_ELEMENTWISE;
> > +    }
> > +  else if (slp_node)
> >      {
> >        /* For SLP vectorization we directly vectorize a subchain
> >  	 without permutation.  */
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Ivo Totev; HRB 36809 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH][RFC] tree-optimization/101908 - avoid STLF fails when vectorizing
  2022-03-28 13:49   ` Richard Biener
@ 2022-03-28 14:35     ` Richard Sandiford
  2022-03-29  6:21       ` Richard Biener
  0 siblings, 1 reply; 8+ messages in thread
From: Richard Sandiford @ 2022-03-28 14:35 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, hongtao.liu

Richard Biener <rguenther@suse.de> writes:
> On Mon, 28 Mar 2022, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > Since we're now vectorizing by default at -O2 issues like PR101908
>> > become more important where we apply basic-block vectorization to
>> > parts of the function covering loads from function parameters passed
>> > on the stack.  We have no good idea how the stack pushing was
>> > performed, but we do know it was done recently, so at the start of
>> > the function an STLF failure is inevitable if the argument-passing
>> > code didn't use the same or larger vector stores, aligned the same
>> > way.
>> >
>> > Until there's a robust IPA based solution the following implements
>> > target independent heuristics in the vectorizer to retain scalar
>> > loads for loads from parameters likely passed in memory (I use
>> > a BLKmode DECL_MODE check for this rather than firing up
>> > cumulative-args).  I've restricted this also to loads from the
>> > first "block" (that can be less than the first basic block if there's
>> > a call for example), since that covers the testcase.
>> >
>> > Note that for the testcase (but not c-ray from the bug report) there's
>> > an x86 peephole2 that vectorizes things back, so the patch is
>> > not effective there.
>> >
>> > Any comments?  I know we're also looking at x86 port-specific
>> > mitigations, but the issue will hit arm and power/z as well, I think.
>> 
>> I'm not sure this is a target-independent win.  In a loop that:
>> 
>>   stores 2 scalars
>>   loads the stored scalars as a vector
>>   adds a vector
>>   stores a vector
>> 
>> (no feedback), I see a 20% regression using elementwise accesses for
>> the load vs. using a normal vector load (measured on a Cortex-A72).
>> With feedback the elementwise version is still slower, but obviously
>> not by such a big factor.
>
> I see, so that's even without a call in between the scalar stores
> and the vector load, as in the case we're trying to cover.  I
> would suspect that the two elementwise accesses execute too close
> to the two scalar stores to benefit from any forwarding with the
> A72 micro-architecture?  Do you see a speedup when performing a
> vector store instead of two scalar stores?

Yeah, it's faster with a vector store than with 2 scalar stores.

The difference between elementwise loads and vector loads reproduces
even when execution of the stores and the load is forced further apart.

Note that (unlike x86?) the elementwise loads are still done on the
vector side, so this is not a scalar->vector vs. scalar->scalar
trade-off.

Thanks,
Richard


* Re: [PATCH][RFC] tree-optimization/101908 - avoid STLF fails when vectorizing
  2022-03-28 14:35     ` Richard Sandiford
@ 2022-03-29  6:21       ` Richard Biener
  0 siblings, 0 replies; 8+ messages in thread
From: Richard Biener @ 2022-03-29  6:21 UTC (permalink / raw)
  To: Richard Sandiford; +Cc: gcc-patches, hongtao.liu

On Mon, 28 Mar 2022, Richard Sandiford wrote:

> Richard Biener <rguenther@suse.de> writes:
> > On Mon, 28 Mar 2022, Richard Sandiford wrote:
> >
> >> Richard Biener <rguenther@suse.de> writes:
> >> > Since we're now vectorizing by default at -O2 issues like PR101908
> >> > become more important where we apply basic-block vectorization to
> >> > parts of the function covering loads from function parameters passed
> >> > on the stack.  We have no good idea how the stack pushing was
> >> > performed, but we do know it was done recently, so at the start of
> >> > the function an STLF failure is inevitable if the argument-passing
> >> > code didn't use the same or larger vector stores, aligned the same
> >> > way.
> >> >
> >> > Until there's a robust IPA based solution the following implements
> >> > target independent heuristics in the vectorizer to retain scalar
> >> > loads for loads from parameters likely passed in memory (I use
> >> > a BLKmode DECL_MODE check for this rather than firing up
> >> > cumulative-args).  I've restricted this also to loads from the
> >> > first "block" (that can be less than the first basic block if there's
> >> > a call for example), since that covers the testcase.
> >> >
> >> > Note that for the testcase (but not c-ray from the bug report) there's
> >> > an x86 peephole2 that vectorizes things back, so the patch is
> >> > not effective there.
> >> >
> >> > Any comments?  I know we're also looking at x86 port-specific
> >> > mitigations, but the issue will hit arm and power/z as well, I think.
> >> 
> >> I'm not sure this is a target-independent win.  In a loop that:
> >> 
> >>   stores 2 scalars
> >>   loads the stored scalars as a vector
> >>   adds a vector
> >>   stores a vector
> >> 
> >> (no feedback), I see a 20% regression using elementwise accesses for
> >> the load vs. using a normal vector load (measured on a Cortex-A72).
> >> With feedback the elementwise version is still slower, but obviously
> >> not by such a big factor.
> >
> > I see, so that's even without a call in between the scalar stores
> > and the vector load, as in the case we're trying to cover.  I
> > would suspect that the two elementwise accesses execute too close
> > to the two scalar stores to benefit from any forwarding with the
> > A72 micro-architecture?  Do you see a speedup when performing a
> > vector store instead of two scalar stores?
> 
> Yeah, it's faster with a vector store than with 2 scalar stores.
> 
> The difference between elementwise loads and vector loads reproduces
> even when execution of the stores and the load is forced further apart.
> 
> Note that (unlike x86?) the elementwise loads are still done on the
> vector side, so this is not a scalar->vector vs. scalar->scalar
> trade-off.

That's the same on x86: scalar loads are still done on the vector
side, but forwarding from the store buffer isn't possible if the
stores were scalar and the load is a vector load.  The CPU has to
wait for the data to reach L1-D for the load to complete.  With
matching scalar/scalar or vector/vector store/load widths the data
can forward from the store buffers.

Richard.


end of thread, other threads:[~2022-03-29  6:21 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-25 12:11 [PATCH][RFC] tree-optimization/101908 - avoid STLF fails when vectorizing Richard Biener
2022-03-25 13:25 ` Hongtao Liu
2022-03-25 13:42   ` Richard Biener
2022-03-25 14:09     ` Hongtao Liu
2022-03-28 13:37 ` Richard Sandiford
2022-03-28 13:49   ` Richard Biener
2022-03-28 14:35     ` Richard Sandiford
2022-03-29  6:21       ` Richard Biener
