public inbox for gcc-patches@gcc.gnu.org
* [PATCH][i386] Adjust vec_construct cost for AVX256/512, penalize elementwise load vectorization
@ 2018-02-14 10:26 Richard Biener
  2018-02-15  8:07 ` Shalnov, Sergey
  2018-02-21  7:22 ` Kirill Yukhin
  0 siblings, 2 replies; 3+ messages in thread
From: Richard Biener @ 2018-02-14 10:26 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jan Hubicka, kirill.yukhin


The following tries to account for the fact that when constructing
AVX256 or AVX512 vectors from elements we can only use insertps to
insert into the low 128 bits of a vector and have to use
vinserti128 or vinserti64x4 to build the larger AVX256/512 vectors.
Those combining operations also have higher latency (Agner documents
3 cycles on Broadwell for reg-reg vinserti128 while insertps has
one-cycle latency).  Agner doesn't have tables for AVX512 yet, but
I guess the story is similar for vinserti64x4.

Latency is similar for FP adds, so I re-used ix86_cost->addss for
this cost.
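
To make this concrete, here is a minimal intrinsics sketch (mine, not
part of the patch; the function name is made up for illustration) of
building a V8SF from scalar loads -- the final _mm256_insertf128_ps
maps to the higher-latency vinsertf128 the new cost accounts for:

#include <immintrin.h>

__m256
build_v8sf (const float *p)
{
  /* Inserts within the low 128 bits: movss plus insertps, cheap.  */
  __m128 lo = _mm_load_ss (&p[0]);
  lo = _mm_insert_ps (lo, _mm_load_ss (&p[1]), 0x10);
  lo = _mm_insert_ps (lo, _mm_load_ss (&p[2]), 0x20);
  lo = _mm_insert_ps (lo, _mm_load_ss (&p[3]), 0x30);
  __m128 hi = _mm_load_ss (&p[4]);
  hi = _mm_insert_ps (hi, _mm_load_ss (&p[5]), 0x10);
  hi = _mm_insert_ps (hi, _mm_load_ss (&p[6]), 0x20);
  hi = _mm_insert_ps (hi, _mm_load_ss (&p[7]), 0x30);
  /* The extra cross-lane combine this patch charges addss for.  */
  return _mm256_insertf128_ps (_mm256_castps128_ps256 (lo), hi, 1);
}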

This works towards fixing the PRs referenced below, where we end
up vectorizing a lot of loads via elementwise construction, mostly
"enabled" by the new support for alias versioning for variable
strides.  There, as analyzed for PR84037, the large number of scalar
loads and vector builds before any meaningful computation means
the CPU is bottlenecked on AGU and load ports and doesn't get
any meaningful work done, so the vectorization ends up not being
profitable (even with some more massaging in the vectorizer and
using SLP, which reduces the number of loads a lot, I can only
get into same-speed-as-not-vectorized territory).
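
For reference, a made-up loop of the kind that now gets vectorized
this way (the variable stride defeats contiguous vector loads, so
every element is loaded separately and the vector built piecewise):

void
f (double *restrict out, const double *in, int stride, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = in[i * stride] + 1.0;
}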

So the real fix for those issues is to account for these
microarchitectural effects in the backend costing.  I've decided
to plumb this onto the vector construction op when it happens
to be fed by loads, scaling the cost by the number of
vector elements (overall latency should grow with the number
of dependences).
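
In cost-table units that amounts to, for a V16SF construct fed by
elementwise loads (illustrative breakdown only, ignoring the
split-register scaling ix86_vec_cost can apply):

  (sse_op           /* the element inserts */
   + 3 * addss)     /* one vinserti64x4 plus two vinserti128 */
  * 16              /* scaled by TYPE_VECTOR_SUBPARTS */

where the old code charged a flat sse_op.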

Bootstrap/regtest running on x86_64-unknown-linux-gnu.

I've benchmarked this on Haswell with SPEC CPU 2006; three runs show
that it doesn't regress any benchmark beyond the noise level but
improves 416.gamess by 7%, 465.tonto by 6% and 481.wrf by 2%.  It also
fixes the Polyhedron capacita regression (which is what I "tuned" the
factoring with).  I've mentioned the bugs referring to the above
affected benchmarks in the ChangeLog, but it still has to be verified
whether those bugs are fully fixed (84037 is).

Ok for trunk?

Any confirmation of the microarchitectural bottleneck in, say,
Capacita from people with access to cycle-accurate simulators
is welcome ;)  Performance counters only help so much (not much...),
so my guesses are based on Agner and finger-counting.

Thanks,
Richard.

2018-02-13  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/84037
	PR tree-optimization/84016
	PR target/82862
	* config/i386/i386.c (ix86_builtin_vectorization_cost):
	Adjust vec_construct for the fact we need additional higher latency
	128bit inserts for AVX256 and AVX512 vector builds.
	(ix86_add_stmt_cost): Scale vector construction cost for
	elementwise loads.

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 257620)
+++ gcc/config/i386/i386.c	(working copy)
@@ -45904,7 +45904,18 @@ ix86_builtin_vectorization_cost (enum ve
 			      ix86_cost->sse_op, true);
 
       case vec_construct:
-	return ix86_vec_cost (mode, ix86_cost->sse_op, false);
+	{
+	  /* N element inserts.  */
+	  int cost = ix86_vec_cost (mode, ix86_cost->sse_op, false);
+	  /* One vinserti128 for combining two SSE vectors for AVX256.  */
+	  if (GET_MODE_BITSIZE (mode) == 256)
+	    cost += ix86_vec_cost (mode, ix86_cost->addss, true);
+	  /* One vinserti64x4 and two vinserti128 for combining SSE
+	     and AVX256 vectors to AVX512.  */
+	  else if (GET_MODE_BITSIZE (mode) == 512)
+	    cost += 3 * ix86_vec_cost (mode, ix86_cost->addss, true);
+	  return cost;
+	}
 
       default:
         gcc_unreachable ();
@@ -50243,6 +50254,18 @@ ix86_add_stmt_cost (void *data, int coun
 	  break;
 	}
     }
+  /* If we do elementwise loads into a vector then we are bound by
+     latency and execution resources for the many scalar loads
+     (AGU and load ports).  Try to account for this by scaling the
+     construction cost by the number of elements involved.  */
+  if (kind == vec_construct
+      && stmt_info
+      && stmt_info->type == load_vec_info_type
+      && stmt_info->memory_access_type == VMAT_ELEMENTWISE)
+    {
+      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
+      stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
+    }
   if (stmt_cost == -1)
     stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
 


* RE: [PATCH][i386] Adjust vec_construct cost for AVX256/512, penalize elementwise load vectorization
  2018-02-14 10:26 [PATCH][i386] Adjust vec_construct cost for AVX256/512, penalize elementwise load vectorization Richard Biener
@ 2018-02-15  8:07 ` Shalnov, Sergey
  2018-02-21  7:22 ` Kirill Yukhin
  1 sibling, 0 replies; 3+ messages in thread
From: Shalnov, Sergey @ 2018-02-15  8:07 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jan Hubicka, kirill.yukhin, Richard Biener

Richard,
I've benchmarked your patch on Skylake with SPEC CPU 20[06|17][fp|int]rate
and other smaller benchmark suites.  I found that it doesn't regress
any benchmark beyond the noise level but improves 525.x264 by 1.8%,
526.blender by 1.9% and 465.tonto by 3.2%.
I think this is a good reason to merge the patch.
Sergey


* Re: [PATCH][i386] Adjust vec_construct cost for AVX256/512, penalize elementwise load vectorization
  2018-02-14 10:26 [PATCH][i386] Adjust vec_construct cost for AVX256/512, penalize elementwise load vectorization Richard Biener
  2018-02-15  8:07 ` Shalnov, Sergey
@ 2018-02-21  7:22 ` Kirill Yukhin
  1 sibling, 0 replies; 3+ messages in thread
From: Kirill Yukhin @ 2018-02-21  7:22 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jan Hubicka

Hello Richard,
On 14 Feb 11:26, Richard Biener wrote:
> [...]
> Ok for trunk?
Your patch is OK for trunk.

--
Thanks, K

