public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH] Adjust costing of emulated vectorized gather/scatter
@ 2023-08-30 10:35 liuhongt
  2023-08-30 12:18 ` Richard Biener
  0 siblings, 1 reply; 4+ messages in thread
From: liuhongt @ 2023-08-30 10:35 UTC (permalink / raw)
  To: gcc-patches; +Cc: rguenther, hubicka

r14-332-g24905a4bd1375c adjusts costing of emulated vectorized
gather/scatter.
----
commit 24905a4bd1375ccd99c02510b9f9529015a48315
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Jan 18 11:04:49 2023 +0100

    Adjust costing of emulated vectorized gather/scatter

    Emulated gather/scatter behave similar to strided elementwise
    accesses in that they need to decompose the offset vector
    and construct or decompose the data vector so handle them
    the same way, pessimizing the cases with may elements.
----

But for emulated gather/scatter, offset vector load/vec_construct has
aready been counted, and in real case, it's probably eliminated by
later optimizer.
Also after decomposing, element loads from continous memory could be
less bounded compared to normal elementwise load.
The patch decreases the cost a little bit.

This will enable gather emulation for below loop with VF=8(ymm)

double
foo (double* a, double* b, unsigned int* c, int n)
{
  double sum = 0;
  for (int i = 0; i != n; i++)
    sum += a[i] * b[c[i]];
  return sum;
}

For the upper loop, microbenchmark result shows on ICX,
emulated gather with VF=8 is 30% faster than emulated gather with
VF=4 when tripcount is big enough.
It bring back ~4% for 510.parest still ~5% regression compared to
gather instruction due to throughput bound.

For -march=znver1/2/3/4, the change doesn't enable VF=8(ymm) for the
loop, VF remains 4(xmm) as before(guess related to their own cost
model).


Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

	PR target/111064
	* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
	Decrease cost a little bit for vec_to_scalar(offset vector) in
	emulated gather.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/pr111064.c: New test.
---
 gcc/config/i386/i386.cc                  | 11 ++++++++++-
 gcc/testsuite/gcc.target/i386/pr111064.c | 12 ++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr111064.c

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 1bc3f11ff07..337e0f1bfbb 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -24079,7 +24079,16 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 	  || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER))
     {
       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
-      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
+      /* For emulated gather/scatter, offset vector load/vec_construct has
+	 already been counted and in real case, it's probably eliminated by
+	 later optimizer.
+	 Also after decomposing, element loads from continous memory
+	 could be less bounded compared to normal elementwise load.  */
+      if (kind == vec_to_scalar
+	  && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
+	stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
+      else
+	stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
     }
   else if ((kind == vec_construct || kind == scalar_to_vec)
 	   && node
diff --git a/gcc/testsuite/gcc.target/i386/pr111064.c b/gcc/testsuite/gcc.target/i386/pr111064.c
new file mode 100644
index 00000000000..aa2589bd36f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr111064.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -march=icelake-server -mno-gather" } */
+/* { dg-final { scan-assembler-times {(?n)vfmadd[123]*pd.*ymm} 2 { target { ! ia32 } } } }  */
+
+double
+foo (double* a, double* b, unsigned int* c, int n)
+{
+  double sum = 0;
+  for (int i = 0; i != n; i++)
+    sum += a[i] * b[c[i]];
+  return sum;
+}
-- 
2.31.1


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-08-31  8:55 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-30 10:35 [PATCH] Adjust costing of emulated vectorized gather/scatter liuhongt
2023-08-30 12:18 ` Richard Biener
2023-08-31  8:06   ` Hongtao Liu
2023-08-31  8:53     ` Richard Biener

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).