public inbox for gcc-patches@gcc.gnu.org
From: liuhongt <hongtao.liu@intel.com>
To: gcc-patches@gcc.gnu.org
Cc: rguenther@suse.de, hubicka@ucw.cz
Subject: [PATCH] Adjust costing of emulated vectorized gather/scatter
Date: Wed, 30 Aug 2023 18:35:16 +0800	[thread overview]
Message-ID: <20230830103516.882926-1-hongtao.liu@intel.com> (raw)

r14-332-g24905a4bd1375c adjusts costing of emulated vectorized
gather/scatter.
----
commit 24905a4bd1375ccd99c02510b9f9529015a48315
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Jan 18 11:04:49 2023 +0100

    Adjust costing of emulated vectorized gather/scatter

    Emulated gather/scatter behave similar to strided elementwise
    accesses in that they need to decompose the offset vector
    the same way, pessimizing the cases with may elements.
----

But for emulated gather/scatter, the offset vector load/vec_construct has
already been counted, and in real cases it is probably eliminated by the
later optimizers.
Also, after decomposing, element loads from contiguous memory are likely
to be cheaper than normal elementwise loads.
The patch therefore decreases the cost a little bit.

This will enable gather emulation for the loop below with VF=8 (ymm):

double
foo (double* a, double* b, unsigned int* c, int n)
{
  double sum = 0;
  for (int i = 0; i != n; i++)
    sum += a[i] * b[c[i]];
  return sum;
}
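
For illustration only (this is a hedged sketch of the scalarized form the
emulation amounts to, not the vectorizer's actual output; the function name
and VF=8 blocking are assumptions for this example):

```c
/* Sketch of an emulated gather at VF=8: the offset vector c[i..i+7] is
   decomposed into scalars (the vec_to_scalar this patch re-costs), each
   scalar indexes an element load from b, and the result is combined with
   the contiguous vector load of a[i..i+7].  */
static double
foo_emulated_vf8 (const double *a, const double *b,
		  const unsigned int *c, int n)
{
  double sum = 0.0;
  int i = 0;
  for (; i + 8 <= n; i += 8)
    {
      double acc = 0.0;
      for (int lane = 0; lane < 8; lane++)
	/* One vec_to_scalar extraction of the offset, then a scalar
	   element load from b.  */
	acc += a[i + lane] * b[c[i + lane]];
      sum += acc;
    }
  for (; i < n; i++)	/* scalar epilogue */
    sum += a[i] * b[c[i]];
  return sum;
}
```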

For the above loop, a microbenchmark on ICX shows that emulated gather
with VF=8 is 30% faster than emulated gather with VF=4 when the trip
count is big enough.
It brings back ~4% for 510.parest; there is still a ~5% regression
compared to real gather instructions due to the throughput bound.

For -march=znver1/2/3/4, the change doesn't enable VF=8 (ymm) for the
loop; VF remains 4 (xmm) as before (presumably related to their own cost
models).


Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

	PR target/111064
	* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
	Decrease cost a little bit for vec_to_scalar(offset vector) in
	emulated gather.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/pr111064.c: New test.
---
 gcc/config/i386/i386.cc                  | 11 ++++++++++-
 gcc/testsuite/gcc.target/i386/pr111064.c | 12 ++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr111064.c

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 1bc3f11ff07..337e0f1bfbb 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -24079,7 +24079,16 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 	  || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER))
     {
       stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
-      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
+      /* For emulated gather/scatter, offset vector load/vec_construct has
+	 already been counted and in real cases it's probably eliminated by
+	 later optimizers.
+	 Also after decomposing, element loads from contiguous memory
+	 could be less bounded compared to normal elementwise loads.  */
+      if (kind == vec_to_scalar
+	  && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
+	stmt_cost *= TYPE_VECTOR_SUBPARTS (vectype);
+      else
+	stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
     }
   else if ((kind == vec_construct || kind == scalar_to_vec)
 	   && node
diff --git a/gcc/testsuite/gcc.target/i386/pr111064.c b/gcc/testsuite/gcc.target/i386/pr111064.c
new file mode 100644
index 00000000000..aa2589bd36f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr111064.c
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -march=icelake-server -mno-gather" } */
+/* { dg-final { scan-assembler-times {(?n)vfmadd[123]*pd.*ymm} 2 { target { ! ia32 } } } }  */
+
+double
+foo (double* a, double* b, unsigned int* c, int n)
+{
+  double sum = 0;
+  for (int i = 0; i != n; i++)
+    sum += a[i] * b[c[i]];
+  return sum;
+}
-- 
2.31.1


Thread overview: 4+ messages
2023-08-30 10:35 liuhongt [this message]
2023-08-30 12:18 ` Richard Biener
2023-08-31  8:06   ` Hongtao Liu
2023-08-31  8:53     ` Richard Biener
