From: Richard Biener
To: Richard Sandiford
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences
Date: Tue, 4 Jul 2023 13:37:50 +0000 (UTC)
On Thu, 29 Jun 2023, Richard Biener wrote:

> On Thu, 29 Jun 2023, Richard Sandiford wrote:
> 
> > Richard Biener writes:
> > > With applying loop masking to epilogues on x86_64 AVX512 we see
> > > some significant performance regressions when evaluating SPEC CPU 2017
> > > that are caused by store-to-load forwarding failures across outer
> > > loop iterations when the inner loop does not iterate.  Consider
> > >
> > >   for (j = 0; j < m; ++j)
> > >     for (i = 0; i < n; ++i)
> > >       a[j*n + i] += b[j*n + i];
> > >
> > > with 'n' chosen so that the inner loop vectorized code is fully
> > > executed by the masked epilogue and that masked epilogue
> > > storing O > n elements (with elements >= n masked off, of course).
> > > Then the masked load performed for the next outer loop iteration
> > > will get a hit in the store queue but it obviously cannot forward,
> > > so we have to wait for the store to retire.
> > >
> > > That causes a significant hit to performance, especially if 'n'
> > > would have made a non-masked epilogue fully cover 'n' as well
> > > (say n == 4 for a V4DImode epilogue), avoiding the need for
> > > store-forwarding and waiting for the retiring of the store.
> > >
> > > The following applies a very simple heuristic, disabling
> > > the use of loop masking when there's a memory reference pair
> > > with dependence distance zero.  That resolves the issue
> > > (other problematic dependence distances seem to be less common
> > > at least).
> > >
> > > I have applied this heuristic in generic vectorizer code but
> > > restricted it to non-VL vector sizes.  There currently isn't
> > > a way for the target to request disabling of masking only,
> > > while we can reject the vectorization at costing time, which will
> > > not re-consider the same vector mode but without masking.
> > > It seems simply re-costing with masking disabled should be
> > > possible though; we'd just need an indication whether that
> > > should be done?  Maybe always when the current vector mode is
> > > of fixed size?
> > >
> > > I wonder how SVE vectorized code behaves in these situations?
> > > The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
> > > 503.bwaves_r though I think both will need a hardware vector
> > > size covering at least 8 doubles to show the issue.  527.cam4_r
> > > has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.
> > >
> > > Bootstrap / regtest running on x86_64-unknown-linux-gnu.
> > >
> > > Any comments?
> > >
> > > Thanks,
> > > Richard.
> > >
> > > 	PR target/110456
> > > 	* tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
> > > 	* tree-vectorizer.cc (vec_info_shared::vec_info_shared):
> > > 	Initialize has_zero_dep_dist.
> > > 	* tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
> > > 	Remember if we've seen a dependence distance of zero.
> > > 	* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
> > > 	When we've seen a dependence distance of zero and the vector
> > > 	type has constant size, disable the use of partial vectors.
> > > ---
> > >  gcc/tree-vect-data-refs.cc |  2 ++
> > >  gcc/tree-vect-stmts.cc     | 10 ++++++++++
> > >  gcc/tree-vectorizer.cc     |  1 +
> > >  gcc/tree-vectorizer.h      |  3 +++
> > >  4 files changed, 16 insertions(+)
> > >
> > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> > > index ebe93832b1e..40cde95c16a 100644
> > > --- a/gcc/tree-vect-data-refs.cc
> > > +++ b/gcc/tree-vect-data-refs.cc
> > > @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
> > >  			     "dependence distance == 0 between %T and %T\n",
> > >  			     DR_REF (dra), DR_REF (drb));
> > >
> > > +	  loop_vinfo->shared->has_zero_dep_dist = true;
> > > +
> > >  	  /* When we perform grouped accesses and perform implicit CSE
> > >  	     by detecting equal accesses and doing disambiguation with
> > >  	     runtime alias tests like for
> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > index d642d3c257f..3bcbc000323 100644
> > > --- a/gcc/tree-vect-stmts.cc
> > > +++ b/gcc/tree-vect-stmts.cc
> > > @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
> > >        using_partial_vectors_p = true;
> > >      }
> > >
> > > +  if (loop_vinfo->shared->has_zero_dep_dist
> > > +      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
> > 
> > I don't think it makes sense to treat VLA and VLS differently here.
> > 
> > But RMW operations are very common, so it seems like we're giving up a
> > lot on the off chance that the inner loop is applied iteratively
> > to successive memory locations.
> 
> Yes ...
> 
> > Maybe that's still OK for AVX512, where I guess loop masking is more
> > of a niche use case.  But if so, then yeah, I think a hook to disable
> > masking might be better here.
> 
> It's a niche use case in that if you'd cost the main vector loop
> with and without masking, the not-masked case is always winning.
> I understand with SVE it would be the same if you fix the
> vector size, but otherwise it would be a win to mask as there's
> the chance the HW implementation uses bigger vectors than the
> architected minimum size.
> 
> So for AVX512 the win is with the epilogue and the case of
> few scalar iterations where the epilogue iterations play
> a significant role.  Since we only vectorize the epilogue
> of the main loop but not the epilogue of the epilogue loop,
> we're leaving quite some iterations unvectorized when the
> main loop uses 512-bit vectors and the epilogue 256-bit.
> 
> But that case specifically is also prone to make the cross-iteration
> issue wrt an outer loop significant ...
> 
> I'll see to address this on the costing side somehow.

So with us requiring both partial-vector and non-partial-vector variants
working during vectorizable_* I thought it should be possible to simply
reset LOOP_VINFO_USING_PARTIAL_VECTORS_P and re-cost without
re-analyzing in case the target deemed the partial-vector-using loop not
profitable.  While the following successfully hacks this in place, the
question is whether the above is really true for VLA vector types?

Aside from this "working", maybe the targets should get to say more
specifically what they'd like to change - should we make the
finish_cost () hook have a more elaborate return value, or provide more
hints like we already do with the suggested unroll factor?  I can
imagine a "partial vector usage is bad" or "too high VF"?  But then we
probably still want to re-do the final costing part, so we'd need a way
to push away the per-stmt costing we already did (something nicer than
the hack below).  Maybe make 'bool costing_for_scalar' a tri-state,
adding 'produce copy from current vector cost in vinfo'?

Anyway - the takeaway is the x86 backend would like a way to disable the
use of partial vectors for the epilogue (or main loop) in some cases.
An alternative to doing this somehow via the costing hooks would be to
add a new hook - the specific data dependence check could be a hook
invoked after dependence checking, initializing
LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P?

Any good ideas?  Anything that comes to your minds that would be useful
in this area for other targets?

Thanks,
Richard.

The following applies on top of the earlier posted patch in this thread.

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 8989985700a..a892f72ded3 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23976,6 +23976,10 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
 	  && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
 	      > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
 	m_costs[vect_body] = INT_MAX;
+
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
+	  && loop_vinfo->shared->has_zero_dep_dist)
+	m_costs[vect_body] = INT_MAX;
     }
 
   vector_costs::finish_cost (scalar_costs);
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 3b46c58a8d8..85802f07443 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2773,6 +2773,7 @@ start_over:
     }
 
   loop_vinfo->vector_costs = init_cost (loop_vinfo, false);
+  vector_costs saved_costs = *loop_vinfo->vector_costs;
 
   /* Analyze the alignment of the data-refs in the loop.
      Fail if a data reference is found that cannot be vectorized.  */
@@ -3017,6 +3018,8 @@ start_over:
       LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
     }
 
+  saved_costs = *loop_vinfo->vector_costs;
+again_no_partial_vectors:
   /* If we're vectorizing an epilogue loop, the vectorized loop
      either needs to be able to handle fewer than VF scalars, or needs
      to have a lower VF than the main loop.
      */
@@ -3043,6 +3046,17 @@ start_over:
     {
       ok = opt_result::failure_at (vect_location,
				    "Loop costings may not be worthwhile.\n");
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "trying with partial vectors disabled\n");
+	  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+	  LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = false;
+	  LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = false;
+	  *loop_vinfo->vector_costs = saved_costs;
+	  goto again_no_partial_vectors;
+	}
       goto again;
     }
   if (!res)
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 08d071463fb..003af878ee7 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1854,7 +1854,7 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
       using_partial_vectors_p = true;
     }
 
-  if (loop_vinfo->shared->has_zero_dep_dist
+  if (0 && loop_vinfo->shared->has_zero_dep_dist
       && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
     {
       if (dump_enabled_p ())