From: liuhongt <hongtao.liu@intel.com>
To: gcc-patches@gcc.gnu.org
Cc: crazylht@gmail.com, hjl.tools@gmail.com
Subject: [PATCH] Don't vectorize when vector stmts are only vec_contruct and stores
Date: Mon, 4 Dec 2023 13:32:08 +0800 [thread overview]
Message-ID: <20231204053208.908533-1-hongtao.liu@intel.com> (raw)
.i.e. for below cases.
a[0] = b1;
a[1] = b2;
..
a[n] = bn;
There're extra dependences when contructing the vector, but not for
scalar store. According to experiments, it's generally worse.
The patch adds an cut-off heuristic when vec_stmt is just
vec_construct and vector store. It improves SPEC2017 a little bit.
BenchMarks Ratio
500.perlbench_r 2.60%
502.gcc_r 0.30%
505.mcf_r 0.40%
520.omnetpp_r -1.00%
523.xalancbmk_r 0.90%
525.x264_r 0.00%
531.deepsjeng_r 0.30%
541.leela_r 0.90%
548.exchange2_r 3.20%
557.xz_r 1.40%
503.bwaves_r 0.00%
507.cactuBSSN_r 0.00%
508.namd_r 0.30%
510.parest_r 0.00%
511.povray_r 0.20%
519.lbm_r SAME BIN
521.wrf_r -0.30%
526.blender_r -1.20%
527.cam4_r -0.20%
538.imagick_r 4.00%
544.nab_r 0.40%
549.fotonik3d_r 0.00%
554.roms_r 0.00%
Geomean-int 0.90%
Geomean-fp 0.30%
Geomean-all 0.50%
And
Regressed testcases:
gcc.target/i386/part-vect-absneghf.c
gcc.target/i386/part-vect-copysignhf.c
gcc.target/i386/part-vect-xorsignhf.c
Regressed under -m32 since it generates 2 vector
.ABS/NEG/XORSIGN/COPYSIGN vs original 1 64-bit vec_construct. The
original testcases are used to test vectorization capability for
.ABS/NEG/XORG/COPYSIGN, so just restrict testcase to TARGET_64BIT.
gcc.target/i386/pr111023-2.c
gcc.target/i386/pr111023.c
Regressed under -m32
testcase as below
void
v8hi_v8qi (v8hi *dst, v16qi src)
{
short tem[8];
tem[0] = src[0];
tem[1] = src[1];
tem[2] = src[2];
tem[3] = src[3];
tem[4] = src[4];
tem[5] = src[5];
tem[6] = src[6];
tem[7] = src[7];
dst[0] = *(v8hi *) tem;
}
under 64-bit target, vectorizer realize it's just permutation of
original src vector, but under -m32, vectorizer relies on
vec_construct for vectorization. I think optimziation for this case
under 32-bit target maynot impact much, so just add
-fno-vect-cost-model.
gcc.target/i386/pr91446.c: This testcase is guard for cost model of
vector store, not vectorization capability, so just adjust testcase.
gcc.target/i386/pr108938-3.c: This testcase relies on vec_construct to
optimize for bswap, like other optimziation vectorizer can't realize
optimization after it. So the current solution is add
-fno-vect-cost-model to the testcase.
costmodel-pr104582-1.c
costmodel-pr104582-2.c
costmodel-pr104582-4.c
Failed since it's now not vectorized, looked at the PR, it's exactly
what's wanted, so adjust testcase to scan-tree-dump-not.
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
Ok for trunk?
gcc/ChangeLog:
PR target/99881
PR target/104582
* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
Check if kind is vec_construct or vector store.
(ix86_vector_costs::finish_cost): Don't do vectorization when
vector stmts are only vec_construct and stores.
(ix86_vector_costs::ix86_vect_construct_store_only_p): New
function.
(ix86_vector_costs::ix86_vect_cut_off): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/part-vect-absneghf.c: Restrict testcase to
TARGET_64BIT.
* gcc.target/i386/part-vect-copysignhf.c: Ditto.
* gcc.target/i386/part-vect-xorsignhf.c: Ditto.
* gcc.target/i386/pr91446.c: Adjust testcase.
* gcc.target/i386/pr108938-3.c: Add -fno-vect-cost-model.
* gcc.target/i386/pr111023-2.c: Ditto.
* gcc.target/i386/pr111023.c: Ditto.
* gcc.target/i386/pr99881.c: Remove xfail.
* gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: Changed
to Scan-tree-dump-not.
* gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Ditto.
* gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Ditto.
---
gcc/config/i386/i386.cc | 81 ++++++++++++++++++-
.../costmodel/x86_64/costmodel-pr104582-1.c | 2 +-
.../costmodel/x86_64/costmodel-pr104582-3.c | 2 +-
.../costmodel/x86_64/costmodel-pr104582-4.c | 2 +-
.../gcc.target/i386/part-vect-absneghf.c | 4 +-
.../gcc.target/i386/part-vect-copysignhf.c | 4 +-
.../gcc.target/i386/part-vect-xorsignhf.c | 4 +-
gcc/testsuite/gcc.target/i386/pr108938-3.c | 2 +-
gcc/testsuite/gcc.target/i386/pr111023-2.c | 2 +-
gcc/testsuite/gcc.target/i386/pr111023.c | 2 +-
gcc/testsuite/gcc.target/i386/pr91446.c | 14 ++--
gcc/testsuite/gcc.target/i386/pr99881.c | 2 +-
12 files changed, 99 insertions(+), 22 deletions(-)
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index dcaea6c2096..a4b23e29eba 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -24573,6 +24573,10 @@ public:
private:
+ /* Don't do vectorization for certain patterns. */
+ void ix86_vect_cut_off ();
+
+ bool ix86_vect_construct_store_only_p (vect_cost_for_stmt, stmt_vec_info);
/* Estimate register pressure of the vectorized code. */
void ix86_vect_estimate_reg_pressure ();
/* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for
@@ -24581,12 +24585,15 @@ private:
where we know it's not loaded from memory. */
unsigned m_num_gpr_needed[3];
unsigned m_num_sse_needed[3];
+
+ bool m_vec_construct_store_only;
};
ix86_vector_costs::ix86_vector_costs (vec_info* vinfo, bool costing_for_scalar)
: vector_costs (vinfo, costing_for_scalar),
m_num_gpr_needed (),
- m_num_sse_needed ()
+ m_num_sse_needed (),
+ m_vec_construct_store_only (true)
{
}
@@ -24609,6 +24616,10 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
= (kind == scalar_stmt || kind == scalar_load || kind == scalar_store);
int stmt_cost = - 1;
+ if (m_vec_construct_store_only
+ && !ix86_vect_construct_store_only_p (kind, stmt_info))
+ m_vec_construct_store_only = false;
+
bool fp = false;
machine_mode mode = scalar_p ? SImode : TImode;
@@ -24865,8 +24876,45 @@ ix86_vector_costs::ix86_vect_estimate_reg_pressure ()
}
}
+/* Return true if KIND and STMT_INFO is either vec_construct or vector
+ store, stmt_info could be promote/demote between vec_construct and
+ vector store, also count that. */
+bool
+ix86_vector_costs::ix86_vect_construct_store_only_p (vect_cost_for_stmt kind,
+ stmt_vec_info stmt_info)
+{
+ switch (kind)
+ {
+ case vec_construct:
+ case vector_store:
+ case unaligned_store:
+ case vector_scatter_store:
+ case vec_promote_demote:
+ return true;
+
+ /* Vectorizer will try VEC_PACK_TRUNK_EXPR for things likes
+ char* a;
+ short b1, b2, b3, b4;
+ a[0] = b1;
+ a[1] = b2;
+ a[2] = b3;
+ a[3] = b4;
+ Also don't vectorized it. */
+ case vector_stmt:
+ if (stmt_info && stmt_info->stmt
+ && gimple_code (stmt_info->stmt) == GIMPLE_ASSIGN
+ && gimple_assign_rhs_code (stmt_info->stmt) == NOP_EXPR)
+ return true;
+
+ default:
+ break;
+ }
+
+ return false;
+}
+
void
-ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
+ix86_vector_costs::ix86_vect_cut_off ()
{
loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
if (loop_vinfo && !m_costing_for_scalar)
@@ -24885,10 +24933,39 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
&& (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
> ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
m_costs[vect_body] = INT_MAX;
+ return;
+ }
+
+ /* Don't do vectorization for things like
+
+ a[0] = b1;
+ a[1] = b2;
+ ..
+ a[n] = bn;
+
+ There're extra dependences when contructing the vector, but not for
+ scalar store. According to experiments, it's generally worse. */
+ if (m_vec_construct_store_only)
+ {
+ m_costs[0] = m_costs[1] = m_costs[2] = INT_MAX;
+ if (dump_enabled_p ())
+ dump_printf_loc (MSG_NOTE, vect_location,
+ "Skip vectorization for stmts which only contains"
+ " vec_construct and vector store.\n");
+ return;
}
+ return;
+}
+
+void
+ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
+{
+
ix86_vect_estimate_reg_pressure ();
+ ix86_vect_cut_off ();
+
vector_costs::finish_cost (scalar_costs);
}
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
index 992a845ad7a..f940af70b72 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c
@@ -12,4 +12,4 @@ foo (unsigned long *a, unsigned long *b)
s.b = b_;
}
-/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
+/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
index 999c4905708..eff60b2a82a 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c
@@ -10,4 +10,4 @@ foo (double a, double b)
s.b = b;
}
-/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
+/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
index cc471e1ed73..2f354338e96 100644
--- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c
@@ -12,4 +12,4 @@ foo (signed long *a, unsigned long *b)
s.b = b_;
}
-/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */
+/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */
diff --git a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c
index 48aed14d604..4052210ec39 100644
--- a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c
+++ b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c
@@ -85,7 +85,7 @@ do_test (void)
abort ();
}
-/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" } } */
-/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" { target { ! ia32 } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" { target { ! ia32 } } } } */
/* { dg-final { scan-tree-dump-times {(?n)ABS_EXPR <vect} 2 "optimized" { target { ! ia32 } } } } */
/* { dg-final { scan-tree-dump-times {(?n)= -vect} 2 "optimized" { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c b/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c
index 811617bc3dd..006941f6651 100644
--- a/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c
+++ b/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c
@@ -55,6 +55,6 @@ do_test (void)
abort ();
}
-/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" } } */
-/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" { target { ! ia32 } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" { target { ! ia32 } } } } */
/* { dg-final { scan-tree-dump-times ".COPYSIGN" 2 "optimized" { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c b/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c
index a8ec60a088a..a57dc5aba23 100644
--- a/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c
+++ b/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c
@@ -55,6 +55,6 @@ do_test (void)
abort ();
}
-/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" } } */
-/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" { target { ! ia32 } } } } */
+/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" { target { ! ia32 } } } } */
/* { dg-final { scan-tree-dump-times ".XORSIGN" 2 "optimized" { target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr108938-3.c b/gcc/testsuite/gcc.target/i386/pr108938-3.c
index 32ac544c7ed..24725a9ab1d 100644
--- a/gcc/testsuite/gcc.target/i386/pr108938-3.c
+++ b/gcc/testsuite/gcc.target/i386/pr108938-3.c
@@ -1,5 +1,5 @@
/* { dg-do compile } */
-/* { dg-options "-O2 -ftree-vectorize -mno-movbe" } */
+/* { dg-options "-O2 -ftree-vectorize -mno-movbe -fno-vect-cost-model" } */
/* { dg-final { scan-assembler-times "bswap\[\t ]+" 2 { target { ! ia32 } } } } */
/* { dg-final { scan-assembler-times "bswap\[\t ]+" 3 { target ia32 } } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr111023-2.c b/gcc/testsuite/gcc.target/i386/pr111023-2.c
index 6c69f947544..db25668a9ae 100644
--- a/gcc/testsuite/gcc.target/i386/pr111023-2.c
+++ b/gcc/testsuite/gcc.target/i386/pr111023-2.c
@@ -1,6 +1,6 @@
/* PR target/111023 */
/* { dg-do compile } */
-/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1" } */
+/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1 -fno-vect-cost-model" } */
typedef char v16qi __attribute__((vector_size (16)));
typedef short v8hi __attribute__((vector_size (16)));
diff --git a/gcc/testsuite/gcc.target/i386/pr111023.c b/gcc/testsuite/gcc.target/i386/pr111023.c
index 6144c371f32..18cd579d937 100644
--- a/gcc/testsuite/gcc.target/i386/pr111023.c
+++ b/gcc/testsuite/gcc.target/i386/pr111023.c
@@ -1,6 +1,6 @@
/* PR target/111023 */
/* { dg-do compile } */
-/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1" } */
+/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1 -fno-vect-cost-model" } */
typedef unsigned char v16qi __attribute__((vector_size (16)));
typedef unsigned short v8hi __attribute__((vector_size (16)));
diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c
index 0243ca3ea68..3e0f8a4b315 100644
--- a/gcc/testsuite/gcc.target/i386/pr91446.c
+++ b/gcc/testsuite/gcc.target/i386/pr91446.c
@@ -10,15 +10,15 @@ typedef struct
extern void bar (info *);
void
-foo (unsigned long long width, unsigned long long height,
- long long x, long long y)
+foo (unsigned long long* width,
+ long long* x)
{
info t;
- t.width = width;
- t.height = height;
- t.x = x;
- t.y = y;
+ t.width = width[0];
+ t.height = width[1];
+ t.x = x[0];
+ t.y = x[1];
bar (&t);
}
-/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */
+/* { dg-final { scan-assembler-times "vmovdq\[au\]\[^\n\r\]*xmm\[0-9\]" 4 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c
index 3e087eb2ed7..1a59325ac1c 100644
--- a/gcc/testsuite/gcc.target/i386/pr99881.c
+++ b/gcc/testsuite/gcc.target/i386/pr99881.c
@@ -1,7 +1,7 @@
/* PR target/99881. */
/* { dg-do compile { target { ! ia32 } } } */
/* { dg-options "-Ofast -march=skylake" } */
-/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */
void
foo (int* __restrict a, int n, int c)
--
2.31.1
next reply other threads:[~2023-12-04 5:32 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-04 5:32 liuhongt [this message]
2023-12-04 14:10 ` Richard Biener
2023-12-06 2:16 ` Hongtao Liu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20231204053208.908533-1-hongtao.liu@intel.com \
--to=hongtao.liu@intel.com \
--cc=crazylht@gmail.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=hjl.tools@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).