From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=XfGK=CR=arm.com=richard.sandiford@sourceware.org>
Received: from foss.arm.com (foss.arm.com [217.140.110.172])
	by sourceware.org (Postfix) with ESMTP id 817673858C53
	for <gcc-patches@gcc.gnu.org>; Thu, 29 Jun 2023 12:55:40 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 817673858C53
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14])
	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 5EF61C14;
	Thu, 29 Jun 2023 05:56:23 -0700 (PDT)
Received: from localhost (e121540-lin.manchester.arm.com [10.32.110.72])
	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 3EC653F73F;
	Thu, 29 Jun 2023 05:55:39 -0700 (PDT)
From: Richard Sandiford <richard.sandiford@arm.com>
To: Richard Biener <rguenther@suse.de>
Mail-Followup-To: Richard Biener <rguenther@suse.de>,gcc-patches@gcc.gnu.org, richard.sandiford@arm.com
Cc: gcc-patches@gcc.gnu.org
Subject: Re: [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences
References: <5b45a654-529e-4b2e-8a3e-b81ba94d28a1@AM7EUR03FT055.eop-EUR03.prod.protection.outlook.com>
Date: Thu, 29 Jun 2023 13:55:38 +0100
In-Reply-To: <5b45a654-529e-4b2e-8a3e-b81ba94d28a1@AM7EUR03FT055.eop-EUR03.prod.protection.outlook.com>
	(Richard Biener's message of "Thu, 29 Jun 2023 12:28:12 +0000 (UTC)")
Message-ID: <mptbkgy5sxh.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Status: No, score=-27.2 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

Richard Biener <rguenther@suse.de> writes:
> With applying loop masking to epilogues on x86_64 AVX512 we see
> some significant performance regressions when evaluating SPEC CPU 2017
> that are caused by store-to-load forwarding fails across outer
> loop iterations when the inner loop does not iterate.  Consider
>
>   for (j = 0; j < m; ++j)
>     for (i = 0; i < n; ++i)
>       a[j*n + i] += b[j*n + i];
>
> with 'n' chosen so that the inner loop vectorized code is fully
> executed by the masked epilogue and that masked epilogue
> storing O > n elements (with elements >= n masked of course).
> Then the masked load performed for the next outer loop iteration
> will get a hit in the store queue but it obviously cannot forward
> so we have to wait for the store to retire.
>
> That causes a significant hit to performance especially if 'n'
> would have made a non-masked epilogue to fully cover 'n' as well
> (say n == 4 for a V4DImode epilogue), avoiding the need for
> store-forwarding and waiting for the retiring of the store.
>
> The following applies a very simple heuristic, disabling
> the use of loop masking when there's a memory reference pair
> with dependence distance zero.  That resolves the issue
> (other problematic dependence distances seem to be less common
> at least).
>
> I have applied this heuristic in generic vectorizer code but
> restricted it to non-VL vector sizes.  There currently isn't
> a way for the target to request disabling of masking only,
> while we can reject the vectoriztion at costing time that will
> not re-consider the same vector mode but without masking.
> It seems simply re-costing with masking disabled should be
> possible through, we'd just need an indication whether that
> should be done?  Maybe always when the current vector mode is
> of fixed size?
>
> I wonder how SVE vectorized code behaves in these situations?
> The affected SPEC CPU 2017 benchmarks were 527.cam4_r and
> 503.bwaves_r though I think both will need a hardware vector
> size covering at least 8 doubles to show the issue.  527.cam4_r
> has 4 elements in the inner loop, 503.bwaves_r 5 IIRC.
>
> Bootstrap / regtest running on x86_64-unknown-linux-gnu.
>
> Any comments?
>
> Thanks,
> Richard.
>
> 	PR target/110456
> 	* tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New.
> 	* tree-vectorizer.cc (vec_info_shared::vec_info_shared):
> 	Initialize has_zero_dep_dist.
> 	* tree-vect-data-refs.cc (vect_analyze_data_ref_dependence):
> 	Remember if we've seen a dependence distance of zero.
> 	* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
> 	When we've seen a dependence distance of zero and the vector
> 	type has constant size disable the use of partial vectors.
> ---
>  gcc/tree-vect-data-refs.cc |  2 ++
>  gcc/tree-vect-stmts.cc     | 10 ++++++++++
>  gcc/tree-vectorizer.cc     |  1 +
>  gcc/tree-vectorizer.h      |  3 +++
>  4 files changed, 16 insertions(+)
>
> diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc
> index ebe93832b1e..40cde95c16a 100644
> --- a/gcc/tree-vect-data-refs.cc
> +++ b/gcc/tree-vect-data-refs.cc
> @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
>  			     "dependence distance == 0 between %T and %T\n",
>  			     DR_REF (dra), DR_REF (drb));
>  
> +	  loop_vinfo->shared->has_zero_dep_dist = true;
> +
>  	  /* When we perform grouped accesses and perform implicit CSE
>  	     by detecting equal accesses and doing disambiguation with
>  	     runtime alias tests like for
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index d642d3c257f..3bcbc000323 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
>        using_partial_vectors_p = true;
>      }
>  
> +  if (loop_vinfo->shared->has_zero_dep_dist
> +      && TYPE_VECTOR_SUBPARTS (vectype).is_constant ())

I don't think it makes sense to treat VLA and VLS differently here.

But RMW operations are very common, so it seems like we're giving up a
lot on the off chance that the inner loop is applied iteratively
to successive memory locations.

Maybe that's still OK for AVX512, where I guess loop masking is more
of a niche use case.  But if so, then yeah, I think a hook to disable
masking might be better here.

Thanks,
Richard

> +    {
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +			 "disabling partial vectors because of possible "
> +			 "STLF issues\n");
> +      LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
> +    }
> +
>    if (!using_partial_vectors_p)
>      {
>        if (dump_enabled_p ())
> diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
> index a048e9d8917..74457259b6e 100644
> --- a/gcc/tree-vectorizer.cc
> +++ b/gcc/tree-vectorizer.cc
> @@ -478,6 +478,7 @@ vec_info::~vec_info ()
>  
>  vec_info_shared::vec_info_shared ()
>    : n_stmts (0),
> +    has_zero_dep_dist (false),
>      datarefs (vNULL),
>      datarefs_copy (vNULL),
>      ddrs (vNULL)
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index a36974c2c0d..7626cda2a73 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -419,6 +419,9 @@ public:
>    /* The number of scalar stmts.  */
>    unsigned n_stmts;
>  
> +  /* Whether there's a dependence with zero distance.  */
> +  bool has_zero_dep_dist;
> +
>    /* All data references.  Freed by free_data_refs, so not an auto_vec.  */
>    vec<data_reference_p> datarefs;
>    vec<data_reference> datarefs_copy;