From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id 1B6A83858D35 for ; Tue, 4 Jul 2023 19:49:36 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 1B6A83858D35 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 0A046165C; Tue, 4 Jul 2023 12:50:18 -0700 (PDT) Received: from localhost (e121540-lin.manchester.arm.com [10.32.110.72]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 316EC3F663; Tue, 4 Jul 2023 12:49:35 -0700 (PDT) From: Richard Sandiford To: Richard Biener Mail-Followup-To: Richard Biener ,gcc-patches@gcc.gnu.org, richard.sandiford@arm.com Cc: gcc-patches@gcc.gnu.org Subject: Re: [PATCH][RFC] target/110456 - avoid loop masking with zero distance dependences References: <5b45a654-529e-4b2e-8a3e-b81ba94d28a1@AM7EUR03FT055.eop-EUR03.prod.protection.outlook.com> Date: Tue, 04 Jul 2023 20:49:33 +0100 In-Reply-To: (Richard Biener's message of "Tue, 4 Jul 2023 13:37:50 +0000 (UTC)") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Status: No, score=-27.0 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,SPF_HELO_NONE,SPF_NONE,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Richard Biener writes: > On Thu, 29 Jun 2023, Richard Biener wrote: > >> On Thu, 29 Jun 2023, Richard Sandiford wrote: >> >> > Richard Biener writes: >> > > With applying loop masking to epilogues on x86_64 AVX512 we see >> > > some significant performance regressions when evaluating SPEC CPU 2017 >> > > that are caused by store-to-load forwarding fails across outer >> > > loop iterations when the inner loop does not iterate. Consider >> > > >> > > for (j = 0; j < m; ++j) >> > > for (i = 0; i < n; ++i) >> > > a[j*n + i] += b[j*n + i]; >> > > >> > > with 'n' chosen so that the inner loop vectorized code is fully >> > > executed by the masked epilogue and that masked epilogue >> > > storing O > n elements (with elements >= n masked of course). >> > > Then the masked load performed for the next outer loop iteration >> > > will get a hit in the store queue but it obviously cannot forward >> > > so we have to wait for the store to retire. >> > > >> > > That causes a significant hit to performance especially if 'n' >> > > would have made a non-masked epilogue to fully cover 'n' as well >> > > (say n == 4 for a V4DImode epilogue), avoiding the need for >> > > store-forwarding and waiting for the retiring of the store. >> > > >> > > The following applies a very simple heuristic, disabling >> > > the use of loop masking when there's a memory reference pair >> > > with dependence distance zero. That resolves the issue >> > > (other problematic dependence distances seem to be less common >> > > at least). >> > > >> > > I have applied this heuristic in generic vectorizer code but >> > > restricted it to non-VL vector sizes. There currently isn't >> > > a way for the target to request disabling of masking only, >> > > while we can reject the vectoriztion at costing time that will >> > > not re-consider the same vector mode but without masking. >> > > It seems simply re-costing with masking disabled should be >> > > possible through, we'd just need an indication whether that >> > > should be done? Maybe always when the current vector mode is >> > > of fixed size? >> > > >> > > I wonder how SVE vectorized code behaves in these situations? >> > > The affected SPEC CPU 2017 benchmarks were 527.cam4_r and >> > > 503.bwaves_r though I think both will need a hardware vector >> > > size covering at least 8 doubles to show the issue. 527.cam4_r >> > > has 4 elements in the inner loop, 503.bwaves_r 5 IIRC. >> > > >> > > Bootstrap / regtest running on x86_64-unknown-linux-gnu. >> > > >> > > Any comments? >> > > >> > > Thanks, >> > > Richard. >> > > >> > > PR target/110456 >> > > * tree-vectorizer.h (vec_info_shared::has_zero_dep_dist): New. >> > > * tree-vectorizer.cc (vec_info_shared::vec_info_shared): >> > > Initialize has_zero_dep_dist. >> > > * tree-vect-data-refs.cc (vect_analyze_data_ref_dependence): >> > > Remember if we've seen a dependence distance of zero. >> > > * tree-vect-stmts.cc (check_load_store_for_partial_vectors): >> > > When we've seen a dependence distance of zero and the vector >> > > type has constant size disable the use of partial vectors. >> > > --- >> > > gcc/tree-vect-data-refs.cc | 2 ++ >> > > gcc/tree-vect-stmts.cc | 10 ++++++++++ >> > > gcc/tree-vectorizer.cc | 1 + >> > > gcc/tree-vectorizer.h | 3 +++ >> > > 4 files changed, 16 insertions(+) >> > > >> > > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc >> > > index ebe93832b1e..40cde95c16a 100644 >> > > --- a/gcc/tree-vect-data-refs.cc >> > > +++ b/gcc/tree-vect-data-refs.cc >> > > @@ -470,6 +470,8 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr, >> > > "dependence distance == 0 between %T and %T\n", >> > > DR_REF (dra), DR_REF (drb)); >> > > >> > > + loop_vinfo->shared->has_zero_dep_dist = true; >> > > + >> > > /* When we perform grouped accesses and perform implicit CSE >> > > by detecting equal accesses and doing disambiguation with >> > > runtime alias tests like for >> > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc >> > > index d642d3c257f..3bcbc000323 100644 >> > > --- a/gcc/tree-vect-stmts.cc >> > > +++ b/gcc/tree-vect-stmts.cc >> > > @@ -1839,6 +1839,16 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype, >> > > using_partial_vectors_p = true; >> > > } >> > > >> > > + if (loop_vinfo->shared->has_zero_dep_dist >> > > + && TYPE_VECTOR_SUBPARTS (vectype).is_constant ()) >> > >> > I don't think it makes sense to treat VLA and VLS differently here. >> > >> > But RMW operations are very common, so it seems like we're giving up a >> > lot on the off chance that the inner loop is applied iteratively >> > to successive memory locations. >> >> Yes ... >> >> > Maybe that's still OK for AVX512, where I guess loop masking is more >> > of a niche use case. But if so, then yeah, I think a hook to disable >> > masking might be better here. >> >> It's a nice use case in that if you'd cost the main vector loop >> with and without masking the not masked case is always winning. >> I understand with SVE it would be the same if you fix the >> vector size but otherwise it would be a win to mask as there's >> the chance the HW implementation uses bigger vectors than the >> architected minimum size. >> >> So for AVX512 the win is with the epilogue and the case of >> few scalar iterations where the epilogue iterations play >> a significant role. Since we only vectorize the epilogue >> of the main loop but not the epilogue of the epilogue loop >> we're leaving quite some iterations unvectorized when the >> main loop uses 512bit vectors and the epilogue 256bit. >> >> But that case specifically is also prone to make the cross-iteration >> issue wrt an outer loop significiant ... >> >> I'll see to address this on the costing side somehow. > > So with us requiring both partial vector and non-partial vector variants > working during vectorizable_* I thought it should be possible to > simply reset LOOP_VINFO_USING_PARTIAL_VECTORS_P and re-cost without > re-analyzing in case the target deemed the partial vector using loop > not profitable. > > While the following successfully hacks this in-place the question is > whether the above is really true for VLA vector types? Yeah, I think so. We do support full-vector VLA. > Besides from this "working", maybe the targets should get to say > more specifically what they'd like to change - should we make > the finish_cost () hook have a more elaborate return value or > provide more hints like we already do with the suggested unroll factor? > > I can imagine a "partial vector usage is bad" or "too high VF"? > But then we probably still want to re-do the final costing part > so we'd need a way to push away the per-stmt costing we already > did (sth nicer than the hack below). Maybe make > 'bool costing_for_scalar' a tri-state, adding 'produce copy > from current vector cost in vinfo'? Not sure I follow the tri-state thing. If the idea is to redo the vector_costs::add_stmt_costs stuff, could we just export the cost_vec from vect_analyze_loop_operations and apply the costs in vect_analyze_loop_costing instead? That seems worthwhile anyway, if we think the costs are going to depend on partial vs. full vectors. As things stand, we could already cost with CAN_USE_PARTIAL_VECTORS_P set to 1 and then later set it to 0. > Anyway - the takeaway is the x86 backend likes a way to disable > the use of partial vectors for the epilog (or main loop) in some cases. > An alternative to doing this somehow via the costing hooks would be > to add a new hook - the specific data dependence check could be > a hook invoked after dependence checking, initializing > LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P? > > Any good ideas? Anything that comes to your minds that would be > useful in this area for other targets? > > Thanks, > Richard. > > The following applies ontop of the earlier posted patch in this thread. > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index 8989985700a..a892f72ded3 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -23976,6 +23976,10 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) > && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ()) > > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo)))) > m_costs[vect_body] = INT_MAX; > + > + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo) > + && loop_vinfo->shared->has_zero_dep_dist) > + m_costs[vect_body] = INT_MAX; > } > > vector_costs::finish_cost (scalar_costs); > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > index 3b46c58a8d8..85802f07443 100644 > --- a/gcc/tree-vect-loop.cc > +++ b/gcc/tree-vect-loop.cc > @@ -2773,6 +2773,7 @@ start_over: > } > > loop_vinfo->vector_costs = init_cost (loop_vinfo, false); > + vector_costs saved_costs = *loop_vinfo->vector_costs; The target can derive the class and add its own member variables, so I think we'd need some kind of clone method for this to be safe. > /* Analyze the alignment of the data-refs in the loop. > Fail if a data reference is found that cannot be vectorized. */ > @@ -3017,6 +3018,8 @@ start_over: > LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true; > } > > + saved_costs = *loop_vinfo->vector_costs; > +again_no_partial_vectors: > /* If we're vectorizing an epilogue loop, the vectorized loop either needs > to be able to handle fewer than VF scalars, or needs to have a lower VF > than the main loop. */ > @@ -3043,6 +3046,17 @@ start_over: > { > ok = opt_result::failure_at (vect_location, > "Loop costings may not be worthwhile.\n"); > + if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo)) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "trying with partial vectors disabled\n"); > + LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false; > + LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = false; > + LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = false; I guess these last two variables should have been set after vect_determine_partial_vectors_and_peeling rather than before. Sorry for not noticing earlier. The approach seems OK to me otherwise FWIW. Thanks, Richard > + *loop_vinfo->vector_costs = saved_costs; > + goto again_no_partial_vectors; > + } > goto again; > } > if (!res) > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > index 08d071463fb..003af878ee7 100644 > --- a/gcc/tree-vect-stmts.cc > +++ b/gcc/tree-vect-stmts.cc > @@ -1854,7 +1854,7 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype, > using_partial_vectors_p = true; > } > > - if (loop_vinfo->shared->has_zero_dep_dist > + if (0 && loop_vinfo->shared->has_zero_dep_dist > && TYPE_VECTOR_SUBPARTS (vectype).is_constant ()) > { > if (dump_enabled_p ())