public inbox for gcc-patches@gcc.gnu.org
* [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences
@ 2023-06-29 12:28 Richard Biener
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Biener @ 2023-06-29 12:28 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.sandiford

With applying loop masking to epilogues on x86_64 AVX512 we see
some significant performance regressions when evaluating SPEC CPU 2017
that are caused by store-to-load forwarding fails across outer
loop iterations when the inner loop does not iterate.  Consider

  for (j = 0; j < m; ++j)
    for (i = 0; i < n; ++i)
      a[j*n + i] += b[j*n + i];

with 'n' chosen so that the vectorized inner loop code is fully
executed by the masked epilogue, with that masked epilogue
storing O > n elements (the elements >= n masked, of course).
The masked load performed for the next outer loop iteration
then gets a hit in the store queue, but it obviously cannot
forward, so we have to wait for the store to retire.

That causes a significant hit to performance, especially if 'n'
would have allowed a non-masked epilogue to fully cover 'n' as
well (say n == 4 for a V4DImode epilogue), avoiding the need for
store forwarding and the wait for the store to retire.

The following applies a very simple heuristic, disabling
the use of loop masking when there's a memory reference pair
with dependence distance zero.  That resolves the issue
(other problematic dependence distances seem to be less common,
at least).

I have applied this heuristic in generic vectorizer code but
restricted it to non-VL vector sizes.  There currently isn't
a way for the target to request disabling of masking only;
while we can reject the vectorization at costing time, that
will not re-consider the same vector mode without masking.
It seems simply re-costing with masking disabled should be
possible though; we'd just need an indication of whether that
should be done.  Maybe always when the current vector mode is
of fixed size?

I wonder how SVE vectorized code behaves in these situations?
The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
503.bwaves_r, though I think both need a hardware vector
size covering at least 8 doubles to show the issue.  527.cam4_r
has 4 elements in the inner loop, 503.bwaves_r 5, IIRC.

Bootstrap / regtest running on x86_64-unknown-linux-gnu.

Any comments?

Thanks,
Richard.

	PR target/110456
	* tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
	* tree-vectorizer.cc (vec_info_shared::vec_info_shared):
	Initialize has_zero_dep_dist.
	* tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
	Remember if we've seen a dependence distance of zero.
	* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
	When we've seen a dependence distance of zero and the vector
	type has constant size disable the use of partial vectors.
---
 gcc/tree-vect-data-refs.cc |  2 ++
 gcc/tree-vect-stmts.cc     | 10 ++++++++++
 gcc/tree-vectorizer.cc     |  1 +
 gcc/tree-vectorizer.h      |  3 +++
 4 files changed, 16 insertions(+)

diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
index ebe93832b1e..40cde95c16a 100644
--- a/gcc/tree-vect-data-refs.cc
+++ b/gcc/tree-vect-data-refs.cc
@@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
 			     "dependence distance == 0 between %T and %T\n",
 			     DR_REF (dra), DR_REF (drb));
 
+	  loop_vinfo->shared->has_zero_dep_dist = true;
+
 	  /* When we perform grouped accesses and perform implicit CSE
 	     by detecting equal accesses and doing disambiguation with
 	     runtime alias tests like for
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index d642d3c257f..3bcbc000323 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
       using_partial_vectors_p = true;
     }
 
+  if (loop_vinfo->shared->has_zero_dep_dist
+      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
+    {
+      if (dump_enabled_p ())
+	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			 "disabling partial vectors because of possible "
+			 "STLF issues\n");
+      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+    }
+
   if (!using_partial_vectors_p)
     {
       if (dump_enabled_p ())
diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
index a048e9d8917..74457259b6e 100644
--- a/gcc/tree-vectorizer.cc
+++ b/gcc/tree-vectorizer.cc
@@ -478,6 +478,7 @@ vec_info::~vec_info ()
 
 vec_info_shared::vec_info_shared ()
   : n_stmts (0),
+    has_zero_dep_dist (false),
     datarefs (vNULL),
     datarefs_copy (vNULL),
     ddrs (vNULL)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index a36974c2c0d..7626cda2a73 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -419,6 +419,9 @@ public:
   /* The number of scalar stmts.  */
   unsigned n_stmts;
 
+  /* Whether there's a dependence with zero distance.  */
+  bool has_zero_dep_dist;
+
   /* All data references.  Freed by free_data_refs, so not an auto_vec.  */
   vec<data_reference_p> datarefs;
   vec<data_reference> datarefs_copy;
-- 
2.35.3


Thread overview: 7+ messages
2023-06-29 12:55 ` [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences Richard Sandiford
2023-06-29 13:16   ` Richard Biener
2023-07-04 13:37     ` Richard Biener
2023-07-04 19:49       ` Richard Sandiford
2023-07-05 13:29         ` Richard Biener
2023-07-05 13:55           ` Richard Sandiford
2023-06-29 12:28 Richard Biener
