From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id 817673858C53 for ; Thu, 29 Jun 2023 12:55:40 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 817673858C53 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 5EF61C14; Thu, 29 Jun 2023 05:56:23 -0700 (PDT) Received: from localhost (e121540-lin.manchester.arm.com [10.32.110.72]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 3EC653F73F; Thu, 29 Jun 2023 05:55:39 -0700 (PDT) From: Richard Sandiford To: Richard Biener Mail-Followup-To: Richard Biener ,gcc-patches@gcc.gnu.org, richard.sandiford@arm.com Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences References: <5b45a654-529e-4b2e-8a3e-b81ba94d28a1@AM7EUR03FT055.eop-EUR03.prod.protection.outlook.com> Date: Thu, 29 Jun 2023 13:55:38 +0100 In-Reply-To: <5b45a654-529e-4b2e-8a3e-b81ba94d28a1@AM7EUR03FT055.eop-EUR03.prod.protection.outlook.com> (Richard Biener's message of "Thu, 29 Jun 2023 12:28:12 +0000 (UTC)") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Status: No, score=-27.2 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Richard Biener writes: > With applying loop masking to epilogues on x86_64 AVX512 we see > some significant performance regressions when evaluating SPEC CPU 2017 > that are caused by store-to-load forwarding fails across outer > loop iterations when the inner loop does not iterate. Consider > > for (j = 0; j < m; ++j) > for (i = 0; i < n; ++i) > a[j*n + i] += b[j*n + i]; > > with 'n' chosen so that the inner loop vectorized code is fully > executed by the masked epilogue and that masked epilogue > storing O > n elements (with elements >= n masked of course). > Then the masked load performed for the next outer loop iteration > will get a hit in the store queue but it obviously cannot forward > so we have to wait for the store to retire. > > That causes a significant hit to performance especially if 'n' > would have made a non-masked epilogue to fully cover 'n' as well > (say n == 4 for a V4DImode epilogue), avoiding the need for > store-forwarding and waiting for the retiring of the store. > > The following applies a very simple heuristic, disabling > the use of loop masking when there's a memory reference pair > with dependence distance zero. That resolves the issue > (other problematic dependence distances seem to be less common > at least). > > I have applied this heuristic in generic vectorizer code but > restricted it to non-VL vector sizes. There currently isn't > a way for the target to request disabling of masking only, > while we can reject the vectoriztion at costing time that will > not re-consider the same vector mode but without masking. > It seems simply re-costing with masking disabled should be > possible through, we'd just need an indication whether that > should be done? Maybe always when the current vector mode is > of fixed size? > > I wonder how SVE vectorized code behaves in these situations? > The affected SPEC CPU 2017 benchmarks were 527.cam4_r and > 503.bwaves_r though I think both will need a hardware vector > size covering at least 8 doubles to show the issue. 527.cam4_r > has 4 elements in the inner loop, 503.bwaves_r 5 IIRC. > > Bootstrap / regtest running on x86_64-unknown-linux-gnu. > > Any comments? > > Thanks, > Richard. > > PR target/110456 > * tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New. > * tree-vectorizer.cc (vec_info_shared::vec_info_shared): > Initialize has_zero_dep_dist. > * tree-vect-data-refs.cc (vect_analyze_data_ref_dependence): > Remember if we've seen a dependence distance of zero. > * tree-vect-stmts.cc (check_load_store_for_partial_vectors): > When we've seen a dependence distance of zero and the vector > type has constant size disable the use of partial vectors. > --- > gcc/tree-vect-data-refs.cc | 2 ++ > gcc/tree-vect-stmts.cc | 10 ++++++++++ > gcc/tree-vectorizer.cc | 1 + > gcc/tree-vectorizer.h | 3 +++ > 4 files changed, 16 insertions(+) > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc > index ebe93832b1e..40cde95c16a 100644 > --- a/gcc/tree-vect-data-refs.cc > +++ b/gcc/tree-vect-data-refs.cc > @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr, > "dependence distance == 0 between %T and %T\n", > DR_REF (dra), DR_REF (drb)); > > + loop_vinfo->shared->has_zero_dep_dist = true; > + > /* When we perform grouped accesses and perform implicit CSE > by detecting equal accesses and doing disambiguation with > runtime alias tests like for > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > index d642d3c257f..3bcbc000323 100644 > --- a/gcc/tree-vect-stmts.cc > +++ b/gcc/tree-vect-stmts.cc > @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype, > using_partial_vectors_p = true; > } > > + if (loop_vinfo->shared->has_zero_dep_dist > + && TYPE_VECTOR_SUBPARTS (vectype).is_constant ()) I don't think it makes sense to treat VLA and VLS differently here. But RMW operations are very common, so it seems like we're giving up a lot on the off chance that the inner loop is applied iteratively to successive memory locations. Maybe that's still OK for AVX512, where I guess loop masking is more of a niche use case. But if so, then yeah, I think a hook to disable masking might be better here. Thanks, Richard > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > + "disabling partial vectors because of possible " > + "STLF issues\n"); > + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; > + } > + > if (!using_partial_vectors_p) > { > if (dump_enabled_p ()) > diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc > index a048e9d8917..74457259b6e 100644 > --- a/gcc/tree-vectorizer.cc > +++ b/gcc/tree-vectorizer.cc > @@ -478,6 +478,7 @@ vec_info::~vec_info () > > vec_info_shared::vec_info_shared () > : n_stmts (0), > + has_zero_dep_dist (false), > datarefs (vNULL), > datarefs_copy (vNULL), > ddrs (vNULL) > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > index a36974c2c0d..7626cda2a73 100644 > --- a/gcc/tree-vectorizer.h > +++ b/gcc/tree-vectorizer.h > @@ -419,6 +419,9 @@ public: > /* The number of scalar stmts. */ > unsigned n_stmts; > > + /* Whether there's a dependence with zero distance. */ > + bool has_zero_dep_dist; > + > /* All data references. Freed by free_data_refs, so not an auto_vec. */ > vec datarefs; > vec datarefs_copy;