public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] Fix PR84101, account for function ABI details in vectorization costs
@ 2018-01-30 10:48 Richard Biener
  2018-01-30 14:51 ` Richard Biener
  2018-02-14  5:25 ` Jeff Law
  0 siblings, 2 replies; 8+ messages in thread
From: Richard Biener @ 2018-01-30 10:48 UTC (permalink / raw)
  To: gcc-patches; +Cc: law


This patch tries to deal with the "easy" part of a function ABI,
the return value location, in vectorization costing.  The testcase
shows that if we vectorize the returned value but the function
doesn't return in memory or in a vector register but as in this
case in an integer register pair (reg:TI ax) (bah, ABI details
exposed late?  why's this not a parallel?) we end up spilling
badly.

The idea is to account for such spilling so if vectorization
benefits outweight the spilling we'll vectorize anyway.

I think the particular testcase could be fixed in the subreg
pass basically undoing the vectorization but I realize that
generally this is a too hard problem and avoiding vectorization
is better.  Still this patch is somewhat fragile in that it
depends on us "seeing" that the stored to decl is returned
(see cfun_returns).

Bootstrap & regtest running on x86_64-unknown-linux-gnu.

I'd like to hear opinions on my use of hard_function_value
and also from other target maintainers.  I'm not sure we
have sufficient testsuite coverage of _profitable_ vectorization
of a return value.  Feel free to add to this for your
target.

Ok for trunk?

Thanks,
Richard.

2018-01-30  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/84101
	* tree-vect-stmts.c: Include explow.h for hard_function_value.
	(cfun_returns): New helper.
	(vect_model_store_cost): When vectorizing a store to a decl
	we return and the function ABI returns via a non-vector
	register account for the possible spilling that will happen.

	* gcc.target/i386/pr84101.c: New testcase.

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	(revision 257139)
+++ gcc/tree-vect-stmts.c	(working copy)
@@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.
 #include "tree-cfg.h"
 #include "tree-ssa-loop-manip.h"
 #include "cfgloop.h"
+#include "explow.h"
 #include "tree-ssa-loop.h"
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
@@ -893,6 +894,22 @@ vect_model_promotion_demotion_cost (stmt
                      "prologue_cost = %d .\n", inside_cost, prologue_cost);
 }
 
+/* Returns true if the current function returns DECL.  */
+
+static bool
+cfun_returns (tree decl)
+{
+  edge_iterator ei;
+  edge e;
+  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
+    {
+      greturn *ret = safe_dyn_cast <greturn *> (last_stmt (e->src));
+      if (ret && gimple_return_retval (ret) == decl)
+	return true;
+    }
+  return false;
+}
+
 /* Function vect_model_store_cost
 
    Models cost for stores.  In the case of grouped accesses, one access
@@ -971,6 +988,36 @@ vect_model_store_cost (stmt_vec_info stm
 				       vec_to_scalar, stmt_info, 0, vect_body);
     }
 
+  /* When vectorizing an SLP store into the function result assign
+     a penalty if the function returns in a non-vector register.
+     In this case we assume we'll end up with having to spill the
+     vector result and do element loads.  */
+  if (slp_node)
+    {
+      tree base = get_base_address (dr->ref);
+      if (base
+	  && (TREE_CODE (base) == RESULT_DECL
+	      || (DECL_P (base) && cfun_returns (base))))
+	{
+	  rtx reg = hard_function_value (TREE_TYPE (base), cfun->decl, 0, 1);
+	  if ((REG_P (reg) && ! VECTOR_MODE_P (GET_MODE (reg)))
+	      || GET_CODE (reg) == PARALLEL
+	      || GET_CODE (reg) == CONCAT)
+	    {
+	      /* Spill.  */
+	      inside_cost += record_stmt_cost (body_cost_vec,
+					       ncopies, vector_store,
+					       stmt_info, 0, vect_body);
+	      /* Element loads.  */
+	      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
+	      inside_cost += record_stmt_cost (body_cost_vec,
+					       ncopies * assumed_nunits,
+					       scalar_load,
+					       stmt_info, 0, vect_body);
+	    }
+	}
+    }
+
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
                      "vect_model_store_cost: inside_cost = %d, "
Index: gcc/testsuite/gcc.target/i386/pr84101.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr84101.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/pr84101.c	(working copy)
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-slp2-details" } */
+
+typedef struct uint64_pair uint64_pair_t ;
+struct uint64_pair
+{
+  unsigned long w0 ;
+  unsigned long w1 ;
+} ;
+
+uint64_pair_t pair(int num)
+{
+  uint64_pair_t p ;
+
+  p.w0 = num << 1 ;
+  p.w1 = num >> 1 ;
+
+  return p ;
+}
+
+/* { dg-final { scan-tree-dump-not "basic block vectorized" "slp2" } } */

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Fix PR84101, account for function ABI details in vectorization costs
  2018-01-30 10:48 [PATCH] Fix PR84101, account for function ABI details in vectorization costs Richard Biener
@ 2018-01-30 14:51 ` Richard Biener
  2018-01-31 14:14   ` Richard Biener
  2018-02-14  5:25 ` Jeff Law
  1 sibling, 1 reply; 8+ messages in thread
From: Richard Biener @ 2018-01-30 14:51 UTC (permalink / raw)
  To: gcc-patches; +Cc: law

On Tue, 30 Jan 2018, Richard Biener wrote:

> 
> This patch tries to deal with the "easy" part of a function ABI,
> the return value location, in vectorization costing.  The testcase
> shows that if we vectorize the returned value but the function
> doesn't return in memory or in a vector register but as in this
> case in an integer register pair (reg:TI ax) (bah, ABI details
> exposed late?  why's this not a parallel?) we end up spilling
> badly.
> 
> The idea is to account for such spilling so if vectorization
> benefits outweight the spilling we'll vectorize anyway.
> 
> I think the particular testcase could be fixed in the subreg
> pass basically undoing the vectorization but I realize that
> generally this is a too hard problem and avoiding vectorization
> is better.  Still this patch is somewhat fragile in that it
> depends on us "seeing" that the stored to decl is returned
> (see cfun_returns).
> 
> Bootstrap & regtest running on x86_64-unknown-linux-gnu.
> 
> I'd like to hear opinions on my use of hard_function_value
> and also from other target maintainers.  I'm not sure we
> have sufficient testsuite coverage of _profitable_ vectorization
> of a return value.  Feel free to add to this for your
> target.
> 
> Ok for trunk?

Ok, looks like hard_function_value doesn't like being called
with

    type <record_type 0x7ffff65009d8 optional asm_written 
needs-constructing type_1 type_5 type_6 BLK
        size <integer_cst 0x7ffff64ee558 constant 8008>

we then ICE:

#0  fancy_abort (
    file=0x2094ef0 
"/space/rguenther/src/svn/early-lto-debug/gcc/machmode.h", 
    line=292, 
    function=0x2095534 <opt_mode<scalar_int_mode>::require() 
const::__FUNCTION__> "require") at 
/space/rguenther/src/svn/early-lto-debug/gcc/diagnostic.c:1500
#1  0x0000000000c34790 in opt_mode<scalar_int_mode>::require (
    this=0x7fffffffcf20)
    at /space/rguenther/src/svn/early-lto-debug/gcc/machmode.h:292
#2  0x0000000000db271a in hard_function_value (valtype=0x7ffff65009d8, 
    func=0x7ffff64fff00, fntype=0x0, outgoing=1)
    at /space/rguenther/src/svn/early-lto-debug/gcc/explow.c:2212

2201          /* int_size_in_bytes can return -1.  We don't need a check 
here
2202             since the value of bytes will then be large enough that 
no
2203             mode will match anyway.  */
2204
2205          FOR_EACH_MODE_IN_CLASS (tmpmode, MODE_INT)
2206            {
(gdb) l
2207              /* Have we found a large enough mode?  */
2208              if (GET_MODE_SIZE (tmpmode.require ()) >= bytes)
2209                break;
2210            }
2211
2212          PUT_MODE (val, tmpmode.require ());

which means we didn't find an integer mode of the wanted size.
The backend simply returned (reg:BLK 0 ax).  I suppose I have
to guard hard_function_value with sth ...

Richard.

> Thanks,
> Richard.
> 
> 2018-01-30  Richard Biener  <rguenther@suse.de>
> 
> 	PR tree-optimization/84101
> 	* tree-vect-stmts.c: Include explow.h for hard_function_value.
> 	(cfun_returns): New helper.
> 	(vect_model_store_cost): When vectorizing a store to a decl
> 	we return and the function ABI returns via a non-vector
> 	register account for the possible spilling that will happen.
> 
> 	* gcc.target/i386/pr84101.c: New testcase.
> 
> Index: gcc/tree-vect-stmts.c
> ===================================================================
> --- gcc/tree-vect-stmts.c	(revision 257139)
> +++ gcc/tree-vect-stmts.c	(working copy)
> @@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.
>  #include "tree-cfg.h"
>  #include "tree-ssa-loop-manip.h"
>  #include "cfgloop.h"
> +#include "explow.h"
>  #include "tree-ssa-loop.h"
>  #include "tree-scalar-evolution.h"
>  #include "tree-vectorizer.h"
> @@ -893,6 +894,22 @@ vect_model_promotion_demotion_cost (stmt
>                       "prologue_cost = %d .\n", inside_cost, prologue_cost);
>  }
>  
> +/* Returns true if the current function returns DECL.  */
> +
> +static bool
> +cfun_returns (tree decl)
> +{
> +  edge_iterator ei;
> +  edge e;
> +  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
> +    {
> +      greturn *ret = safe_dyn_cast <greturn *> (last_stmt (e->src));
> +      if (ret && gimple_return_retval (ret) == decl)
> +	return true;
> +    }
> +  return false;
> +}
> +
>  /* Function vect_model_store_cost
>  
>     Models cost for stores.  In the case of grouped accesses, one access
> @@ -971,6 +988,36 @@ vect_model_store_cost (stmt_vec_info stm
>  				       vec_to_scalar, stmt_info, 0, vect_body);
>      }
>  
> +  /* When vectorizing an SLP store into the function result assign
> +     a penalty if the function returns in a non-vector register.
> +     In this case we assume we'll end up with having to spill the
> +     vector result and do element loads.  */
> +  if (slp_node)
> +    {
> +      tree base = get_base_address (dr->ref);
> +      if (base
> +	  && (TREE_CODE (base) == RESULT_DECL
> +	      || (DECL_P (base) && cfun_returns (base))))
> +	{
> +	  rtx reg = hard_function_value (TREE_TYPE (base), cfun->decl, 0, 1);
> +	  if ((REG_P (reg) && ! VECTOR_MODE_P (GET_MODE (reg)))
> +	      || GET_CODE (reg) == PARALLEL
> +	      || GET_CODE (reg) == CONCAT)
> +	    {
> +	      /* Spill.  */
> +	      inside_cost += record_stmt_cost (body_cost_vec,
> +					       ncopies, vector_store,
> +					       stmt_info, 0, vect_body);
> +	      /* Element loads.  */
> +	      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
> +	      inside_cost += record_stmt_cost (body_cost_vec,
> +					       ncopies * assumed_nunits,
> +					       scalar_load,
> +					       stmt_info, 0, vect_body);
> +	    }
> +	}
> +    }
> +
>    if (dump_enabled_p ())
>      dump_printf_loc (MSG_NOTE, vect_location,
>                       "vect_model_store_cost: inside_cost = %d, "
> Index: gcc/testsuite/gcc.target/i386/pr84101.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/pr84101.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/i386/pr84101.c	(working copy)
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fdump-tree-slp2-details" } */
> +
> +typedef struct uint64_pair uint64_pair_t ;
> +struct uint64_pair
> +{
> +  unsigned long w0 ;
> +  unsigned long w1 ;
> +} ;
> +
> +uint64_pair_t pair(int num)
> +{
> +  uint64_pair_t p ;
> +
> +  p.w0 = num << 1 ;
> +  p.w1 = num >> 1 ;
> +
> +  return p ;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "basic block vectorized" "slp2" } } */
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Fix PR84101, account for function ABI details in vectorization costs
  2018-01-30 14:51 ` Richard Biener
@ 2018-01-31 14:14   ` Richard Biener
  0 siblings, 0 replies; 8+ messages in thread
From: Richard Biener @ 2018-01-31 14:14 UTC (permalink / raw)
  To: gcc-patches; +Cc: law

On Tue, 30 Jan 2018, Richard Biener wrote:

> On Tue, 30 Jan 2018, Richard Biener wrote:
> 
> > 
> > This patch tries to deal with the "easy" part of a function ABI,
> > the return value location, in vectorization costing.  The testcase
> > shows that if we vectorize the returned value but the function
> > doesn't return in memory or in a vector register but as in this
> > case in an integer register pair (reg:TI ax) (bah, ABI details
> > exposed late?  why's this not a parallel?) we end up spilling
> > badly.
> > 
> > The idea is to account for such spilling so if vectorization
> > benefits outweight the spilling we'll vectorize anyway.
> > 
> > I think the particular testcase could be fixed in the subreg
> > pass basically undoing the vectorization but I realize that
> > generally this is a too hard problem and avoiding vectorization
> > is better.  Still this patch is somewhat fragile in that it
> > depends on us "seeing" that the stored to decl is returned
> > (see cfun_returns).
> > 
> > Bootstrap & regtest running on x86_64-unknown-linux-gnu.
> > 
> > I'd like to hear opinions on my use of hard_function_value
> > and also from other target maintainers.  I'm not sure we
> > have sufficient testsuite coverage of _profitable_ vectorization
> > of a return value.  Feel free to add to this for your
> > target.
> > 
> > Ok for trunk?
> 
> Ok, looks like hard_function_value doesn't like being called
> with
> 
>     type <record_type 0x7ffff65009d8 optional asm_written 
> needs-constructing type_1 type_5 type_6 BLK
>         size <integer_cst 0x7ffff64ee558 constant 8008>
> 
> we then ICE:
> 
> #0  fancy_abort (
>     file=0x2094ef0 
> "/space/rguenther/src/svn/early-lto-debug/gcc/machmode.h", 
>     line=292, 
>     function=0x2095534 <opt_mode<scalar_int_mode>::require() 
> const::__FUNCTION__> "require") at 
> /space/rguenther/src/svn/early-lto-debug/gcc/diagnostic.c:1500
> #1  0x0000000000c34790 in opt_mode<scalar_int_mode>::require (
>     this=0x7fffffffcf20)
>     at /space/rguenther/src/svn/early-lto-debug/gcc/machmode.h:292
> #2  0x0000000000db271a in hard_function_value (valtype=0x7ffff65009d8, 
>     func=0x7ffff64fff00, fntype=0x0, outgoing=1)
>     at /space/rguenther/src/svn/early-lto-debug/gcc/explow.c:2212
> 
> 2201          /* int_size_in_bytes can return -1.  We don't need a check 
> here
> 2202             since the value of bytes will then be large enough that 
> no
> 2203             mode will match anyway.  */
> 2204
> 2205          FOR_EACH_MODE_IN_CLASS (tmpmode, MODE_INT)
> 2206            {
> (gdb) l
> 2207              /* Have we found a large enough mode?  */
> 2208              if (GET_MODE_SIZE (tmpmode.require ()) >= bytes)
> 2209                break;
> 2210            }
> 2211
> 2212          PUT_MODE (val, tmpmode.require ());
> 
> which means we didn't find an integer mode of the wanted size.
> The backend simply returned (reg:BLK 0 ax).  I suppose I have
> to guard hard_function_value with sth ...

!aggregate_value_p seems to do the trick.

Bootstrapped and tested on x86_64-unknown-linux-gnu, ok for trunk?

Thanks,
Richard.

2018-01-31  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/84101
	* tree-vect-stmts.c: Include explow.h for hard_function_value.
	(cfun_returns): New helper.
	(vect_model_store_cost): When vectorizing a store to a decl
	we return and the function ABI returns via a non-vector
	register account for the possible spilling that will happen.

	* gcc.target/i386/pr84101.c: New testcase.

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	(revision 257181)
+++ gcc/tree-vect-stmts.c	(working copy)
@@ -43,6 +43,7 @@ along with GCC; see the file COPYING3.
 #include "tree-cfg.h"
 #include "tree-ssa-loop-manip.h"
 #include "cfgloop.h"
+#include "explow.h"
 #include "tree-ssa-loop.h"
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
@@ -893,6 +894,22 @@ vect_model_promotion_demotion_cost (stmt
                      "prologue_cost = %d .\n", inside_cost, prologue_cost);
 }
 
+/* Returns true if the current function returns DECL.  */
+
+static bool
+cfun_returns (tree decl)
+{
+  edge_iterator ei;
+  edge e;
+  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
+    {
+      greturn *ret = safe_dyn_cast <greturn *> (last_stmt (e->src));
+      if (ret && gimple_return_retval (ret) == decl)
+	return true;
+    }
+  return false;
+}
+
 /* Function vect_model_store_cost
 
    Models cost for stores.  In the case of grouped accesses, one access
@@ -971,6 +988,37 @@ vect_model_store_cost (stmt_vec_info stm
 				       vec_to_scalar, stmt_info, 0, vect_body);
     }
 
+  /* When vectorizing an SLP store into the function result assign
+     a penalty if the function returns in a non-vector register.
+     In this case we assume we'll end up with having to spill the
+     vector result and do element loads.  */
+  if (slp_node)
+    {
+      tree base = get_base_address (dr->ref);
+      if (base
+	  && (TREE_CODE (base) == RESULT_DECL
+	      || (DECL_P (base) && cfun_returns (base)))
+	  && !aggregate_value_p (base, cfun->decl))
+	{
+	  rtx reg = hard_function_value (TREE_TYPE (base), cfun->decl, 0, 1);
+	  if ((REG_P (reg) && ! VECTOR_MODE_P (GET_MODE (reg)))
+	      || GET_CODE (reg) == PARALLEL
+	      || GET_CODE (reg) == CONCAT)
+	    {
+	      /* Spill.  */
+	      inside_cost += record_stmt_cost (body_cost_vec,
+					       ncopies, vector_store,
+					       stmt_info, 0, vect_body);
+	      /* Element loads.  */
+	      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
+	      inside_cost += record_stmt_cost (body_cost_vec,
+					       ncopies * assumed_nunits,
+					       scalar_load,
+					       stmt_info, 0, vect_body);
+	    }
+	}
+    }
+
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
                      "vect_model_store_cost: inside_cost = %d, "
Index: gcc/testsuite/gcc.target/i386/pr84101.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr84101.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/pr84101.c	(working copy)
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-slp2-details" } */
+
+typedef struct uint64_pair uint64_pair_t ;
+struct uint64_pair
+{
+  unsigned long w0 ;
+  unsigned long w1 ;
+} ;
+
+uint64_pair_t pair(int num)
+{
+  uint64_pair_t p ;
+
+  p.w0 = num << 1 ;
+  p.w1 = num >> 1 ;
+
+  return p ;
+}
+
+/* { dg-final { scan-tree-dump-not "basic block vectorized" "slp2" } } */

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Fix PR84101, account for function ABI details in vectorization costs
  2018-01-30 10:48 [PATCH] Fix PR84101, account for function ABI details in vectorization costs Richard Biener
  2018-01-30 14:51 ` Richard Biener
@ 2018-02-14  5:25 ` Jeff Law
  2018-02-14 11:52   ` Richard Biener
  1 sibling, 1 reply; 8+ messages in thread
From: Jeff Law @ 2018-02-14  5:25 UTC (permalink / raw)
  To: Richard Biener, gcc-patches

On 01/30/2018 02:59 AM, Richard Biener wrote:
> 
> This patch tries to deal with the "easy" part of a function ABI,
> the return value location, in vectorization costing.  The testcase
> shows that if we vectorize the returned value but the function
> doesn't return in memory or in a vector register but as in this
> case in an integer register pair (reg:TI ax) (bah, ABI details
> exposed late?  why's this not a parallel?) we end up spilling
> badly.
PARALLEL is used when the ABI mandates a value be returned in multiple
places.  Typically that happens when the value is returned in different
types of registers (integer, floating point, vector).

Presumably it's not a PARALLEL in this case because the value is only
returned in %eax.

> 
> The idea is to account for such spilling so if vectorization
> benefits outweight the spilling we'll vectorize anyway.
That's a pretty serious bleed of the target into the vectorizer.  But
we've already deemed that the vectorizer is going to have these target
dependencies.  So I won't object on those grounds.


> 
> I think the particular testcase could be fixed in the subreg
> pass basically undoing the vectorization but I realize that
> generally this is a too hard problem and avoiding vectorization
> is better.  Still this patch is somewhat fragile in that it
> depends on us "seeing" that the stored to decl is returned
> (see cfun_returns).
> 
> Bootstrap & regtest running on x86_64-unknown-linux-gnu.
> 
> I'd like to hear opinions on my use of hard_function_value
> and also from other target maintainers.  I'm not sure we
> have sufficient testsuite coverage of _profitable_ vectorization
> of a return value.  Feel free to add to this for your
> target.
Well, it's the right way to get at the information.  I'm not aware of
any other way to get what you want.  We could possibly hide the RTL bits
to avoid GET_MODE and friends within the vectorizer -- your call.

I'm not sure the bits in vect_mode_store_cost are right though.  ISTM
you want to penalize if and only if the return value is not stored in a
vector-capable location.  If it's a PARALLEL and any element is a
suitable vector register, then do not penalize -- the spills are
unavoidable in that case.

So I think you have to iterate over elements in the PARALLEL case to
verify none of them are suitable for holding the vector result.

I'm not entirely sure what to do with CONCAT.  I wasn't immediately
aware it could show up in that context.


Or am I missing something here?


> 
> Ok for trunk?
> 
> Thanks,
> Richard.
> 
> 2018-01-30  Richard Biener  <rguenther@suse.de>
> 
> 	PR tree-optimization/84101
> 	* tree-vect-stmts.c: Include explow.h for hard_function_value.
> 	(cfun_returns): New helper.
> 	(vect_model_store_cost): When vectorizing a store to a decl
> 	we return and the function ABI returns via a non-vector
> 	register account for the possible spilling that will happen.
> 
> 	* gcc.target/i386/pr84101.c: New testcase.


Jeff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Fix PR84101, account for function ABI details in vectorization costs
  2018-02-14  5:25 ` Jeff Law
@ 2018-02-14 11:52   ` Richard Biener
  2018-02-14 12:49     ` Jakub Jelinek
  2018-06-28  5:52     ` Jeff Law
  0 siblings, 2 replies; 8+ messages in thread
From: Richard Biener @ 2018-02-14 11:52 UTC (permalink / raw)
  To: Jeff Law; +Cc: gcc-patches

On Tue, 13 Feb 2018, Jeff Law wrote:

> On 01/30/2018 02:59 AM, Richard Biener wrote:
> > 
> > This patch tries to deal with the "easy" part of a function ABI,
> > the return value location, in vectorization costing.  The testcase
> > shows that if we vectorize the returned value but the function
> > doesn't return in memory or in a vector register but as in this
> > case in an integer register pair (reg:TI ax) (bah, ABI details
> > exposed late?  why's this not a parallel?) we end up spilling
> > badly.
> PARALLEL is used when the ABI mandates a value be returned in multiple
> places.  Typically that happens when the value is returned in different
> types of registers (integer, floating point, vector).
> 
> Presumably it's not a PARALLEL in this case because the value is only
> returned in %eax.

It's returned in %eax and %rdx (TImode after all).  But maybe
"standard register pairs" are not represented as PARALLEL ...

> > 
> > The idea is to account for such spilling so if vectorization
> > benefits outweight the spilling we'll vectorize anyway.
> That's a pretty serious bleed of the target into the vectorizer.  But
> we've already deemed that the vectorizer is going to have these target
> dependencies.  So I won't object on those grounds.
> 
> 
> > 
> > I think the particular testcase could be fixed in the subreg
> > pass basically undoing the vectorization but I realize that
> > generally this is a too hard problem and avoiding vectorization
> > is better.  Still this patch is somewhat fragile in that it
> > depends on us "seeing" that the stored to decl is returned
> > (see cfun_returns).
> > 
> > Bootstrap & regtest running on x86_64-unknown-linux-gnu.
> > 
> > I'd like to hear opinions on my use of hard_function_value
> > and also from other target maintainers.  I'm not sure we
> > have sufficient testsuite coverage of _profitable_ vectorization
> > of a return value.  Feel free to add to this for your
> > target.
> Well, it's the right way to get at the information.  I'm not aware of
> any other way to get what you want.  We could possibly hide the RTL bits
> to avoid GET_MODE and friends within the vectorizer -- your call.
>
> I'm not sure the bits in vect_mode_store_cost are right though.  ISTM
> you want to penalize if and only if the return value is not stored in a
> vector-capable location.  If it's a PARALLEL and any element is a
> suitable vector register, then do not penalize -- the spills are
> unavoidable in that case.

So we vectorize sth like

  res[0] = a;
  res[1] = b;

to

  vector_reg = {a, b};
  MEM[&res] = vector_reg;

and if 'res' is a PARALLEL we'd end up spilling vector_reg so we
can assign it to MEM[&res] and/or decompose 'res' according to the
PARALLEL.  Without this we should be able to manage to emit
direct sets of the PARALLELs components to a and b?  Certainly
it works that way for the (reg:TI ax) testcase (which is not a
PARALLEL ...).

> So I think you have to iterate over elements in the PARALLEL case to
> verify none of them are suitable for holding the vector result.

Ok, so in case ret was bigger than just res[0] and res[1] and the
PARALLEL for it would contain both a vector of two elements and
sth for the rest then yes, I guess we might be able to generate
optimal code for the assignment to the vector part of the PARALLEL.

> I'm not entirely sure what to do with CONCAT.  I wasn't immediately
> aware it could show up in that context.

It was just a guess that it might occur for complex vars.  OTOH
I wasn't able to create a testcase that didn't end up using
a SSA name for the actual return value so we don't catch this
case with the patch.

void bar(_Complex double *);
_Complex double foo (double x, double y)
{
  _Complex double z;
  bar (&z);
  __real z = x;
  __imag z = y;
  return z;
}

gets vectorized to

foo:
.LFB0:
        .cfi_startproc
        subq    $40, %rsp
        .cfi_def_cfa_offset 48
        unpcklpd        %xmm1, %xmm0
        leaq    16(%rsp), %rdi
        movaps  %xmm0, (%rsp)
        call    bar
        movapd  (%rsp), %xmm0
        movaps  %xmm0, 16(%rsp)
        movsd   16(%rsp), %xmm0
        movsd   24(%rsp), %xmm1
        addq    $40, %rsp
        .cfi_def_cfa_offset 8
        ret

vs. non-vectorized

foo:
.LFB0:
        .cfi_startproc
        subq    $40, %rsp
        .cfi_def_cfa_offset 48
        leaq    16(%rsp), %rdi
        movsd   %xmm0, 8(%rsp)
        movsd   %xmm1, (%rsp)
        call    bar
        movsd   8(%rsp), %xmm0
        movsd   (%rsp), %xmm1
        addq    $40, %rsp
        .cfi_def_cfa_offset 8
        ret

but hard_function_value indeed returns a PARALLEL here:

(parallel:DC [
        (expr_list:REG_DEP_TRUE (reg:DF 21 xmm0)
            (const_int 0 [0]))
        (expr_list:REG_DEP_TRUE (reg:DF 22 xmm1)
            (const_int 8 [0x8]))
    ])


> Or am I missing something here?

No idea - how would a testcase with a PARALLEL like you mention
look like?  On x86_64 stuff like struct A { v2df x; v2df y; };
is returned via an invisible by-reference parameter.

ISTR PARALLELs are really "parallel" so are they in a specific
order when returned from hard_function_value?  Are there ever
PARALLELs with just one element in this context?

I can happily restrict the patch to the REG_P && ! VECTOR_MODE_P
case for now as that is the only testcase I have sofar.
I also agree it's somewhat ugly and a hack only given it doesn't
handle the _Complex testcase above...

Richard.

> 
> > 
> > Ok for trunk?
> > 
> > Thanks,
> > Richard.
> > 
> > 2018-01-30  Richard Biener  <rguenther@suse.de>
> > 
> > 	PR tree-optimization/84101
> > 	* tree-vect-stmts.c: Include explow.h for hard_function_value.
> > 	(cfun_returns): New helper.
> > 	(vect_model_store_cost): When vectorizing a store to a decl
> > 	we return and the function ABI returns via a non-vector
> > 	register account for the possible spilling that will happen.
> > 
> > 	* gcc.target/i386/pr84101.c: New testcase.
> 
> 
> Jeff
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Fix PR84101, account for function ABI details in vectorization costs
  2018-02-14 11:52   ` Richard Biener
@ 2018-02-14 12:49     ` Jakub Jelinek
  2018-02-16 23:49       ` Jeff Law
  2018-06-28  5:52     ` Jeff Law
  1 sibling, 1 reply; 8+ messages in thread
From: Jakub Jelinek @ 2018-02-14 12:49 UTC (permalink / raw)
  To: Richard Biener; +Cc: Jeff Law, gcc-patches

On Wed, Feb 14, 2018 at 12:52:45PM +0100, Richard Biener wrote:
> On Tue, 13 Feb 2018, Jeff Law wrote:
> 
> > On 01/30/2018 02:59 AM, Richard Biener wrote:
> > > 
> > > This patch tries to deal with the "easy" part of a function ABI,
> > > the return value location, in vectorization costing.  The testcase
> > > shows that if we vectorize the returned value but the function
> > > doesn't return in memory or in a vector register but as in this
> > > case in an integer register pair (reg:TI ax) (bah, ABI details
> > > exposed late?  why's this not a parallel?) we end up spilling
> > > badly.
> > PARALLEL is used when the ABI mandates a value be returned in multiple
> > places.  Typically that happens when the value is returned in different
> > types of registers (integer, floating point, vector).
> > 
> > Presumably it's not a PARALLEL in this case because the value is only
> > returned in %eax.
> 
> It's returned in %eax and %rdx (TImode after all).  But maybe
> "standard register pairs" are not represented as PARALLEL ...

Yes, it is (reg:TI %rax) if low part is in register 0 and high part in
register 1.

	Jakub

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Fix PR84101, account for function ABI details in vectorization costs
  2018-02-14 12:49     ` Jakub Jelinek
@ 2018-02-16 23:49       ` Jeff Law
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff Law @ 2018-02-16 23:49 UTC (permalink / raw)
  To: Jakub Jelinek, Richard Biener; +Cc: gcc-patches

On 02/14/2018 04:57 AM, Jakub Jelinek wrote:
> On Wed, Feb 14, 2018 at 12:52:45PM +0100, Richard Biener wrote:
>> On Tue, 13 Feb 2018, Jeff Law wrote:
>>
>>> On 01/30/2018 02:59 AM, Richard Biener wrote:
>>>>
>>>> This patch tries to deal with the "easy" part of a function ABI,
>>>> the return value location, in vectorization costing.  The testcase
>>>> shows that if we vectorize the returned value but the function
>>>> doesn't return in memory or in a vector register but as in this
>>>> case in an integer register pair (reg:TI ax) (bah, ABI details
>>>> exposed late?  why's this not a parallel?) we end up spilling
>>>> badly.
>>> PARALLEL is used when the ABI mandates a value be returned in multiple
>>> places.  Typically that happens when the value is returned in different
>>> types of registers (integer, floating point, vector).
>>>
>>> Presumably it's not a PARALLEL in this case because the value is only
>>> returned in %eax.
>>
>> It's returned in %eax and %rdx (TImode after all).  But maybe
>> "standard register pairs" are not represented as PARALLEL ...
> 
> Yes, it is (reg:TI %rax) if low part is in register 0 and high part in
> register 1.
Right.  In general we shouldn't see CONCATs the in the insn chain.
They'll be converted to a suitably wide REG.

PARALLEL is for a value that appears in two places.  So for example on
the m68k you might return pointers in a0 and d0.  So we wrap those reg
expressions inside a PARALLEL.

Jeff

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] Fix PR84101, account for function ABI details in vectorization costs
  2018-02-14 11:52   ` Richard Biener
  2018-02-14 12:49     ` Jakub Jelinek
@ 2018-06-28  5:52     ` Jeff Law
  1 sibling, 0 replies; 8+ messages in thread
From: Jeff Law @ 2018-06-28  5:52 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches

[ Returning to an old patch... ]
On 02/14/2018 04:52 AM, Richard Biener wrote:
> On Tue, 13 Feb 2018, Jeff Law wrote:
> 
>> On 01/30/2018 02:59 AM, Richard Biener wrote:
>>>
>>> This patch tries to deal with the "easy" part of a function ABI,
>>> the return value location, in vectorization costing.  The testcase
>>> shows that if we vectorize the returned value but the function
>>> doesn't return in memory or in a vector register but as in this
>>> case in an integer register pair (reg:TI ax) (bah, ABI details
>>> exposed late?  why's this not a parallel?) we end up spilling
>>> badly.
>> PARALLEL is used when the ABI mandates a value be returned in multiple
>> places.  Typically that happens when the value is returned in different
>> types of registers (integer, floating point, vector).
>>
>> Presumably it's not a PARALLEL in this case because the value is only
>> returned in %eax.
> 
> It's returned in %eax and %rdx (TImode after all).  But maybe
> "standard register pairs" are not represented as PARALLEL ...
Register pairs and PARALLELs handle two different issues.

A register pair such as eax/edx is used to hold a value that is larger
than a single register.

A PARALLEL is used to hold a return value that has to appear in multiple
places.  So for example a0/d0 on m68k in some circumstances.

They can even be combined.  You might have eax/edx as the first entry in
a parallel with xmm0 as the second entry.  That would indicate a TImode
value that is in eax/edx as well as in xmm0.

From the standpoint of costing the spills for a return value we can
generate the return value into any object in the PARALLEL, but we will
have to copy it to all the other objects in the PARALLEL.

So if none of the objects in the PARALLEL are suitable for vector
operations, then obviously we're going to have to copy from the vector
register to all the elements in the PARALLEL.  This is the most
expensive case.

If one of the objects in the PARALLEL is suitable for vector ops, then
that's where we want the result to end up.  We still have to copy the
result to the other elements in the PARALLEL, but it's one less
copy/spill than the previous case.

Note that copying might have to go through memory on some targets if the
registers in the PARALLEL are in different register files.

CONCAT is (IIRC) not supposed to show up in the RTL chain at all.

Note that expr.c may do something stupid with vectors.  Just looking at
emit_group_load_1 makes me wonder if everything is going to go through
memory when we've got PARALLELs and vectors.  That may actually make
your changes to vect_mode_store_cost more correct.

Jeff

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-06-28  5:52 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-30 10:48 [PATCH] Fix PR84101, account for function ABI details in vectorization costs Richard Biener
2018-01-30 14:51 ` Richard Biener
2018-01-31 14:14   ` Richard Biener
2018-02-14  5:25 ` Jeff Law
2018-02-14 11:52   ` Richard Biener
2018-02-14 12:49     ` Jakub Jelinek
2018-02-16 23:49       ` Jeff Law
2018-06-28  5:52     ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).