From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-ed1-x52e.google.com (mail-ed1-x52e.google.com [IPv6:2a00:1450:4864:20::52e]) by sourceware.org (Postfix) with ESMTPS id 8AEB9383D012 for ; Mon, 12 Jul 2021 06:32:55 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 8AEB9383D012 Received: by mail-ed1-x52e.google.com with SMTP id l26so17962822eda.10 for ; Sun, 11 Jul 2021 23:32:55 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:content-transfer-encoding; bh=6sFFmE35B4Zv6SU7clfqNJ5pRA7OHRSGTFMlPF4xaBY=; b=j8bGDdKcVhJjK/Qugq/eVkTk9vnci++EdMDRTyfTGXFmVT4LuhI54Zirlxi0TQgapu mwvBj5Z5BGoI35TEn9tEQDjYYNIQWESVS5MAevY9fD7aSOD3Ukz3HdUhVojdFEDeU0IX CYOfAILQ3CWxtd42FknW+ZGukqPrwbeyE9FE8GrcC0ZjAkk3yyBnhVfsQRFQxE3rwo12 JJx/3rXiH2uEbxAfQaD8RvPshYq+WcRbUrRSzD6P3EH38q98Suo6KTuW+f5vDlQuknQN aP3yo1tfrBVRoYbVp0g8Pc4CrGgMpF2SNTk6fAyVOmBgnqRFxitkfY6rB3V+CslteQgM pfCg== X-Gm-Message-State: AOAM5333sRyn2PLT1yy8qSY+A/1l/xMf1d7IiMKBEPiyk7o/Q9q7pNYS ox3xOgI+FX7YEt2w4D7nqwCGTaafr1tKoDMP66o= X-Google-Smtp-Source: ABdhPJwuuJp/ai3etwRhKhrk5F52iPHF5HG61pFBGI1P+ypDA7MS40J2W6TMmVoXJnGlvlneCt+bpGg9yi9w4HiOYIM= X-Received: by 2002:aa7:c652:: with SMTP id z18mr4968840edr.361.1626071574462; Sun, 11 Jul 2021 23:32:54 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Richard Biener Date: Mon, 12 Jul 2021 08:32:43 +0200 Message-ID: Subject: Re: [PATCH 10/10] vect: Reuse reduction accumulators between loops To: Richard Biener , GCC Patches , Richard Sandiford Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-8.6 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.4 X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on server2.sourceware.org X-BeenThere: gcc-patches@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Gcc-patches mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 12 Jul 2021 06:32:57 -0000 On Fri, Jul 9, 2021 at 3:12 PM Richard Sandiford wrote: > > Thanks for the review. > > Richard Biener writes: > >> @@ -588,6 +600,23 @@ public: > >> /* Unrolling factor */ > >> poly_uint64 vectorization_factor; > >> > >> + /* If this loop is an epilogue loop whose main loop can be skipped, > >> + MAIN_LOOP_EDGE is the edge from the main loop to this loop's > >> + preheader. SKIP_MAIN_LOOP_EDGE is then the edge that skips the > >> + main loop and goes straight to this loop's preheader. > >> + > >> + Both fields are null otherwise. */ > >> + edge main_loop_edge; > >> + edge skip_main_loop_edge; > >> + > >> + /* If this loop is an epilogue loop that might be skipped after exe= cuting > >> + the main loop, this edge is the one that skips the epilogue. */ > >> + edge skip_this_loop_edge; > >> + > >> + /* After vectorization, maps live-out SSA names to information abou= t > >> + the reductions that generated them. */ > >> + hash_map reusable_accumulators; > > > > Is that the LC PHI node defs or the definition inside of the loop? > > If the latter we could attach the info directly to its stmt-info? > > Ah, yeah, I should improve the comment there. It's the vectoriser's > replacement for the original LC PHI node, i.e. the final scalar result > after the reduction has taken place. OK > >> @@ -1186,6 +1215,21 @@ public: > >> /* The vector type for performing the actual reduction. */ > >> tree reduc_vectype; > >> > >> + /* If IS_REDUC_INFO is true and if the reduction is operating on N > >> + elements in parallel, this vector gives the initial values of th= ese > >> + N elements. */ > > > > That's N scalar elements or N vector elements? I suppose it's for > > SLP reductions (rather than SLP reduction chains) and never non-SLP > > reductions? > > Yeah, poor wording again, sorry. I meant something closer to: > > /* If IS_REDUC_INFO is true and if the vector code is performing > N scalar reductions in parallel, this vector gives the initial > scalar values of those N reductions. */ > > >> + vec reduc_initial_values; > >> + > >> + /* If IS_REDUC_INFO is true and if the reduction is operating on N > >> + elements in parallel, this vector gives the scalar result of eac= h > >> + reduction. */ > >> + vec reduc_scalar_results; > > Same change here. > > >> [=E2=80=A6] > >> diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c > >> index 2909e8a0fc3..b7b0523e3c8 100644 > >> --- a/gcc/tree-vect-loop-manip.c > >> +++ b/gcc/tree-vect-loop-manip.c > >> @@ -2457,6 +2457,31 @@ vect_update_epilogue_niters (loop_vec_info epil= ogue_vinfo, > >> return vect_determine_partial_vectors_and_peeling (epilogue_vinfo, = true); > >> } > >> > >> +/* LOOP_VINFO is an epilogue loop and MAIN_LOOP_VALUE is available on= exit > >> + from the corresponding main loop. Return a value that is availabl= e in > >> + LOOP_VINFO's preheader, using SKIP_VALUE if the main loop is skipp= ed. > >> + Passing a null SKIP_VALUE is equivalent to passing zero. */ > >> + > >> +tree > >> +vect_get_main_loop_result (loop_vec_info loop_vinfo, tree main_loop_v= alue, > >> + tree skip_value) > >> +{ > >> + if (!loop_vinfo->main_loop_edge) > >> + return main_loop_value; > >> + > >> + if (!skip_value) > >> + skip_value =3D build_zero_cst (TREE_TYPE (main_loop_value)); > > > > shouldn't that be the initial value? > > For the current use case, the above two conditions are never true. > I wrote it like this because I had a follow-on patch (which might > not go anywhere) that needed this function for 0-based IVs. > > Maybe that's a bad risk/reward trade-off though. Not having to pass > zero makes things only slightly simpler for the follow-on patch, > and I guess could be dangerous in other cases. > > Perhaps in that case though I should change loop_vinfo->main_loop_edge > into a gcc_assert as well. Yeah, I think asserts (and comments in case it's because we don't handle some specific cases yet) are better than possibly wrong behavior. > >> + tree phi_result =3D make_ssa_name (TREE_TYPE (main_loop_value)); > >> + basic_block bb =3D loop_vinfo->main_loop_edge->dest; > >> + gphi *new_phi =3D create_phi_node (phi_result, bb); > >> + add_phi_arg (new_phi, main_loop_value, loop_vinfo->main_loop_edge, > >> + UNKNOWN_LOCATION); > >> + add_phi_arg (new_phi, skip_value, > >> + loop_vinfo->skip_main_loop_edge, UNKNOWN_LOCATION); > >> + return phi_result; > >> +} > >> + > >> /* Function vect_do_peeling. > >> > >> Input: > >> [=E2=80=A6] > >> @@ -4823,6 +4842,100 @@ info_for_reduction (vec_info *vinfo, stmt_vec_= info stmt_info) > >> return stmt_info; > >> } > >> > >> +/* PHI is a reduction in LOOP_VINFO that we are going to vectorize us= ing vector > >> + type VECTYPE. See if LOOP_VINFO is an epilogue loop whose main lo= op had a > >> + matching reduction that we can build on. Adjust REDUC_INFO and re= turn true > >> + if so, otherwise return false. */ > >> + > >> +static bool > >> +vect_find_reusable_accumulator (loop_vec_info loop_vinfo, > >> + stmt_vec_info reduc_info) > >> +{ > >> + loop_vec_info main_loop_vinfo =3D LOOP_VINFO_ORIG_LOOP_INFO (loop_v= info); > >> + if (!main_loop_vinfo) > >> + return false; > >> + > >> + if (STMT_VINFO_REDUC_TYPE (reduc_info) !=3D TREE_CODE_REDUCTION) > >> + return false; > >> + > >> + unsigned int num_phis =3D reduc_info->reduc_initial_values.length (= ); > >> + auto_vec main_loop_results (num_phis); > >> + auto_vec initial_values (num_phis); > >> + if (edge main_loop_edge =3D loop_vinfo->main_loop_edge) > >> + { > >> + /* The epilogue loop can be entered either from the main loop o= r > >> + from an earlier guard block. */ > >> + edge skip_edge =3D loop_vinfo->skip_main_loop_edge; > >> + for (tree incoming_value : reduc_info->reduc_initial_values) > >> + { > >> + /* Look for: > >> + > >> + INCOMING_VALUE =3D phi >> + INITIAL_VALUE(guard block)>. */ > >> + gcc_assert (TREE_CODE (incoming_value) =3D=3D SSA_NAME); > >> + > >> + gphi *phi =3D as_a (SSA_NAME_DEF_STMT (incoming_val= ue)); > >> + gcc_assert (gimple_bb (phi) =3D=3D main_loop_edge->dest); > >> + > >> + tree from_main_loop =3D PHI_ARG_DEF_FROM_EDGE (phi, main_loo= p_edge); > >> + tree from_skip =3D PHI_ARG_DEF_FROM_EDGE (phi, skip_edge); > >> + > >> + main_loop_results.quick_push (from_main_loop); > >> + initial_values.quick_push (from_skip); > >> + } > >> + } > >> + else > >> + /* The main loop dominates the epilogue loop. */ > >> + main_loop_results.splice (reduc_info->reduc_initial_values); > >> + > >> + /* See if the main loop has the kind of accumulator we need. */ > >> + vect_reusable_accumulator *accumulator > >> + =3D main_loop_vinfo->reusable_accumulators.get (main_loop_results= [0]); > >> + if (!accumulator > >> + || num_phis !=3D accumulator->reduc_info->reduc_scalar_results.= length () > >> + || !std::equal (main_loop_results.begin (), main_loop_results.e= nd (), > >> + accumulator->reduc_info->reduc_scalar_results.be= gin ())) > >> + return false; > >> + > >> + /* For now, only handle the case in which both loops are operating = on the > >> + same vector types. In future we could reduce wider vectors to n= arrower > >> + ones as well. */ > >> + tree vectype =3D STMT_VINFO_VECTYPE (reduc_info); > >> + tree old_vectype =3D TREE_TYPE (accumulator->reduc_input); > >> + if (!useless_type_conversion_p (old_vectype, vectype)) > > > > It should be indeed quite trivial to handle, likewise the case where we > > have multiple PHIs - just reduce to a single input vector and have the > > possibly multiple input vectors in the epilogue filled with neutral > > elements. I'll see if I can cook up stuff for this next week. > > Yeah, agreed. The multi-vector epilogue case should be especially easy > to handle, but it's not interesting for SVE as things stand, since: > > (a) non-SLP reductions use a single cycle for ncopies>1 (a misfeature > IMO -- on targets with wide pipelines we want exactly the opposite) > > (b) SLP reductions are limited to single vectors for variable-length targ= ets. > > So it wasn't possible to trigger multiple epilogue vectors for the > motivating SVE use case. OK, I see. If the series is in I'll see to create testcases for x86_64. > >> [=E2=80=A6] > >> @@ -5196,6 +5305,37 @@ vect_create_epilog_for_reduction (loop_vec_info= loop_vinfo, > >> reduc_inputs.safe_push (single_input); > >> } > >> > >> + tree orig_reduc_input =3D reduc_inputs[0]; > >> + > >> + /* If this loop is an epilogue loop that can be skipped after the > >> + main loop, we can only share a reduction operation between the > >> + main loop and the epilogue if we put it at the target of the > >> + skip edge. > > > > Do you have a testcase where we cannot do this? > > No, it's being defensive. I wasn't sure how the epilogue code would > evolve in future. > > >> + We can still reuse accumulators if this check fails. Doing so h= as > >> + the minor(?) benefit of making the epilogue loop's scalar result > >> + independent of the main loop's scalar result. */ > >> + bool unify_with_main_loop_p =3D false; > >> + if (reduc_info->reused_accumulator > >> + && loop_vinfo->skip_this_loop_edge > >> + && single_succ_p (exit_bb) > >> + && single_succ (exit_bb) =3D=3D loop_vinfo->skip_this_loop_edge= ->dest) > >> + { > >> + unify_with_main_loop_p =3D true; > >> + > >> + basic_block reduc_block =3D loop_vinfo->skip_this_loop_edge->de= st; > >> + reduc_inputs[0] =3D make_ssa_name (vectype); > >> + gphi *new_phi =3D create_phi_node (reduc_inputs[0], reduc_block= ); > >> + add_phi_arg (new_phi, orig_reduc_input, single_succ_edge (exit_= bb), > >> + UNKNOWN_LOCATION); > >> + add_phi_arg (new_phi, reduc_info->reused_accumulator->reduc_inp= ut, > >> + loop_vinfo->skip_this_loop_edge, UNKNOWN_LOCATION); > >> + exit_gsi =3D gsi_after_labels (reduc_block); > >> + } > >> + > >> + /* Shouldn't be used beyond this point. */ > >> + exit_bb =3D nullptr; > >> + > >> if (STMT_VINFO_REDUC_TYPE (reduc_info) =3D=3D COND_REDUCTION > >> && reduc_fn !=3D IFN_LAST) > >> { > >> @@ -5819,6 +5958,12 @@ vect_create_epilog_for_reduction (loop_vec_info= loop_vinfo, > >> scalar_results[0] =3D new_temp; > >> } > >> > >> + /* Record this operation if it could be reused by the epilogue loop= . */ > >> + if (STMT_VINFO_REDUC_TYPE (reduc_info) =3D=3D TREE_CODE_REDUCTION > >> + && !double_reduc) > > > > what's the issue with double_reduc? > > Probably nothing TBH. I haven't been able to construct a case that > uses predicated double reductions with vect-partial-vector-usage=3D1, > but that's probably a missed optimisation. > > There again, double reductions themselves seem to be hard to trigger > now that we have loop interchange. Is there a good way of testing > them without -fno-loop-interchange? there are a bunch of testcases in gcc.dg/vect/vect-double-reduc-?.c, I don't see how interchange avoids the double reduction, in fact when doing interchange we no longer can apply outer loop vectorization (but it's still a double reduction, just only inner loop vectorized). But eventually we don't do epilogue vectorization for outer loop vectorizations with reductions. Oh, and of course vect.exp runs with -O2 -ftree-vectorize, avoiding any of the high-level loop opts ... Richard. > Thanks, > Richard > > >> + loop_vinfo->reusable_accumulators.put (scalar_results[0], > >> + { orig_reduc_input, reduc_i= nfo }); > >> + > >> if (double_reduc) > >> loop =3D outer_loop; > >>