From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=s3z6=CW=arm.com=richard.sandiford@sourceware.org>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by sourceware.org (Postfix) with ESMTP id 1B6A83858D35
	for <gcc-patches@gcc.gnu.org>; Tue,  4 Jul 2023 19:49:36 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 1B6A83858D35
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 0A046165C;
	Tue,  4 Jul 2023 12:50:18 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.110.72])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 316EC3F663;
	Tue,  4 Jul 2023 12:49:35 -0700 (PDT)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Richard Biener <rguenther@suse.de>
Mail-Followup-To: Richard Biener <rguenther@suse.de>,gcc-patches@gcc.gnu.org, richard.sandiford@arm.com
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences
References: <5b45a654-529e-4b2e-8a3e-b81ba94d28a1@AM7EUR03FT055.eop-EUR03.prod.protection.outlook.com>
	<mptbkgy5sxh.fsf@arm.com>
	<nycvar.YFH.7.77.849.2306291310060.4723@jbgna.fhfr.qr>
	<nycvar.YFH.7.77.849.2307041327340.4723@jbgna.fhfr.qr>
Date: Tue, 04 Jul 2023 20:49:33 +0100
In-Reply-To: <nycvar.YFH.7.77.849.2307041327340.4723@jbgna.fhfr.qr> (Richard
	Biener's message of "Tue, 4 Jul 2023 13:37:50 +0000 (UTC)")
Message-ID: <mpta5wbsbhu.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Status: No, score=-27.0 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Richard Biener <rguenther@suse.de> writes:
> On Thu, 29 Jun 2023, Richard Biener wrote:
>
>> On Thu, 29 Jun 2023, Richard Sandiford wrote:
>> 
>> > Richard Biener <rguenther@suse.de> writes:
>> > > With applying loop masking to epilogues on x86_64 AVX512 we see
>> > > some significant performance regressions when evaluating SPEC CPU 2017
>> > > that are caused by store-to-load forwarding fails across outer
>> > > loop iterations when the inner loop does not iterate.  Consider
>> > >
>> > >   for (j = 0; j < m; ++j)
>> > >     for (i = 0; i < n; ++i)
>> > >       a[j*n + i] += b[j*n + i];
>> > >
>> > > with 'n' chosen so that the inner loop vectorized code is fully
>> > > executed by the masked epilogue and that masked epilogue
>> > > storing O > n elements (with elements >= n masked of course).
>> > > Then the masked load performed for the next outer loop iteration
>> > > will get a hit in the store queue but it obviously cannot forward
>> > > so we have to wait for the store to retire.
>> > >
>> > > That causes a significant hit to performance especially if 'n'
>> > > would have made a non-masked epilogue to fully cover 'n' as well
>> > > (say n == 4 for a V4DImode epilogue), avoiding the need for
>> > > store-forwarding and waiting for the retiring of the store.
>> > >
>> > > The following applies a very simple heuristic, disabling
>> > > the use of loop masking when there's a memory reference pair
>> > > with dependence distance zero.  That resolves the issue
>> > > (other problematic dependence distances seem to be less common
>> > > at least).
>> > >
>> > > I have applied this heuristic in generic vectorizer code but
>> > > restricted it to non-VL vector sizes.  There currently isn't
>> > > a way for the target to request disabling of masking only,
>> > > while we can reject the vectoriztion at costing time that will
>> > > not re-consider the same vector mode but without masking.
>> > > It seems simply re-costing with masking disabled should be
>> > > possible through, we'd just need an indication whether that
>> > > should be done?  Maybe always when the current vector mode is
>> > > of fixed size?
>> > >
>> > > I wonder how SVE vectorized code behaves in these situations?
>> > > The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
>> > > 503.bwaves_r though I think both will need a hardware vector
>> > > size covering at least 8 doubles to show the issue.  527.cam4_r
>> > > has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.
>> > >
>> > > Bootstrap / regtest running on x86_64-unknown-linux-gnu.
>> > >
>> > > Any comments?
>> > >
>> > > Thanks,
>> > > Richard.
>> > >
>> > > 	PR target/110456
>> > > 	* tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
>> > > 	* tree-vectorizer.cc (vec_info_shared::vec_info_shared):
>> > > 	Initialize has_zero_dep_dist.
>> > > 	* tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
>> > > 	Remember if we've seen a dependence distance of zero.
>> > > 	* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
>> > > 	When we've seen a dependence distance of zero and the vector
>> > > 	type has constant size disable the use of partial vectors.
>> > > ---
>> > >  gcc/tree-vect-data-refs.cc |  2 ++
>> > >  gcc/tree-vect-stmts.cc     | 10 ++++++++++
>> > >  gcc/tree-vectorizer.cc     |  1 +
>> > >  gcc/tree-vectorizer.h      |  3 +++
>> > >  4 files changed, 16 insertions(+)
>> > >
>> > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
>> > > index ebe93832b1e..40cde95c16a 100644
>> > > --- a/gcc/tree-vect-data-refs.cc
>> > > +++ b/gcc/tree-vect-data-refs.cc
>> > > @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
>> > >  			     "dependence distance == 0 between %T and %T\n",
>> > >  			     DR_REF (dra), DR_REF (drb));
>> > >  
>> > > +	  loop_vinfo->shared->has_zero_dep_dist = true;
>> > > +
>> > >  	  /* When we perform grouped accesses and perform implicit CSE
>> > >  	     by detecting equal accesses and doing disambiguation with
>> > >  	     runtime alias tests like for
>> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>> > > index d642d3c257f..3bcbc000323 100644
>> > > --- a/gcc/tree-vect-stmts.cc
>> > > +++ b/gcc/tree-vect-stmts.cc
>> > > @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
>> > >        using_partial_vectors_p = true;
>> > >      }
>> > >  
>> > > +  if (loop_vinfo->shared->has_zero_dep_dist
>> > > +      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
>> > 
>> > I don't think it makes sense to treat VLA and VLS differently here.
>> > 
>> > But RMW operations are very common, so it seems like we're giving up a
>> > lot on the off chance that the inner loop is applied iteratively
>> > to successive memory locations.
>> 
>> Yes ...
>> 
>> > Maybe that's still OK for AVX512, where I guess loop masking is more
>> > of a niche use case.  But if so, then yeah, I think a hook to disable
>> > masking might be better here.
>> 
>> It's a nice use case in that if you'd cost the main vector loop
>> with and without masking the not masked case is always winning.
>> I understand with SVE it would be the same if you fix the
>> vector size but otherwise it would be a win to mask as there's
>> the chance the HW implementation uses bigger vectors than the
>> architected minimum size.
>> 
>> So for AVX512 the win is with the epilogue and the case of
>> few scalar iterations where the epilogue iterations play
>> a significant role.  Since we only vectorize the epilogue
>> of the main loop but not the epilogue of the epilogue loop
>> we're leaving quite some iterations unvectorized when the
>> main loop uses 512bit vectors and the epilogue 256bit.
>> 
>> But that case specifically is also prone to make the cross-iteration
>> issue wrt an outer loop significiant ...
>> 
>> I'll see to address this on the costing side somehow.
>
> So with us requiring both partial vector and non-partial vector variants
> working during vectorizable_* I thought it should be possible to
> simply reset LOOP_VINFO_USING_PARTIAL_VECTORS_P and re-cost without
> re-analyzing in case the target deemed the partial vector using loop
> not profitable.
>
> While the following successfully hacks this in-place the question is
> whether the above is really true for VLA vector types?

Yeah, I think so.  We do support full-vector VLA.

> Besides from this "working", maybe the targets should get to say
> more specifically what they'd like to change - should we make
> the finish_cost () hook have a more elaborate return value or
> provide more hints like we already do with the suggested unroll factor?
>
> I can imagine a "partial vector usage is bad" or "too high VF"?
> But then we probably still want to re-do the final costing part
> so we'd need a way to push away the per-stmt costing we already
> did (sth nicer than the hack below).  Maybe make
> 'bool costing_for_scalar' a tri-state, adding 'produce copy
> from current vector cost in vinfo'?

Not sure I follow the tri-state thing.  If the idea is to redo
the vector_costs::add_stmt_costs stuff, could we just export
the cost_vec from vect_analyze_loop_operations and apply the
costs in vect_analyze_loop_costing instead?

That seems worthwhile anyway, if we think the costs are going
to depend on partial vs. full vectors.  As things stand, we could
already cost with CAN_USE_PARTIAL_VECTORS_P set to 1 and then later
set it to 0.

> Anyway - the takeaway is the x86 backend likes a way to disable
> the use of partial vectors for the epilog (or main loop) in some cases.
> An alternative to doing this somehow via the costing hooks would be
> to add a new hook - the specific data dependence check could be
> a hook invoked after dependence checking, initializing
> LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P?
>
> Any good ideas?  Anything that comes to your minds that would be
> useful in this area for other targets?
>
> Thanks,
> Richard.
>
> The following applies ontop of the earlier posted patch in this thread.
>
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 8989985700a..a892f72ded3 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -23976,6 +23976,10 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
>  	  && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ())
>  	      > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo))))
>  	m_costs[vect_body] = INT_MAX;
> +
> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)
> +	  && loop_vinfo->shared->has_zero_dep_dist)
> +	m_costs[vect_body] = INT_MAX;
>      }
>  
>    vector_costs::finish_cost (scalar_costs);
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 3b46c58a8d8..85802f07443 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -2773,6 +2773,7 @@ start_over:
>      }
>  
>    loop_vinfo->vector_costs = init_cost (loop_vinfo, false);
> +  vector_costs saved_costs = *loop_vinfo->vector_costs;

The target can derive the class and add its own member variables,
so I think we'd need some kind of clone method for this to be safe.

>    /* Analyze the alignment of the data-refs in the loop.
>       Fail if a data reference is found that cannot be vectorized.  */
> @@ -3017,6 +3018,8 @@ start_over:
>  	LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
>      }
>  
> +  saved_costs = *loop_vinfo->vector_costs;
> +again_no_partial_vectors:
>    /* If we're vectorizing an epilogue loop, the vectorized loop either needs
>       to be able to handle fewer than VF scalars, or needs to have a lower VF
>       than the main loop.  */
> @@ -3043,6 +3046,17 @@ start_over:
>      {
>        ok = opt_result::failure_at (vect_location,
>  				   "Loop costings may not be worthwhile.\n");
> +      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
> +	{
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_NOTE, vect_location,
> +			     "trying with partial vectors disabled\n");
> +	  LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +	  LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = false;
> +	  LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = false;

I guess these last two variables should have been set after
vect_determine_partial_vectors_and_peeling rather than before.
Sorry for not noticing earlier.

The approach seems OK to me otherwise FWIW.

Thanks,
Richard

> +	  *loop_vinfo->vector_costs = saved_costs;
> +	  goto again_no_partial_vectors;
> +	}
>        goto again;
>      }
>    if (!res)
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 08d071463fb..003af878ee7 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1854,7 +1854,7 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
>        using_partial_vectors_p = true;
>      }
>  
> -  if (loop_vinfo->shared->has_zero_dep_dist
> +  if (0 && loop_vinfo->shared->has_zero_dep_dist
>        && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())
>      {
>        if (dump_enabled_p ())