From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by sourceware.org (Postfix) with ESMTP id 0CB0D3858C78 for ; Wed, 6 Sep 2023 20:44:24 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 0CB0D3858C78 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id A70DF11FB; Wed, 6 Sep 2023 13:45:01 -0700 (PDT) Received: from localhost (e121540-lin.manchester.arm.com [10.32.110.72]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 11DF43F67D; Wed, 6 Sep 2023 13:44:22 -0700 (PDT) From: Richard Sandiford To: Robin Dapp Mail-Followup-To: Robin Dapp ,Jeff Law , gcc-patches , richard.sandiford@arm.com Cc: Jeff Law , gcc-patches Subject: Re: [PATCH] fwprop: Allow UNARY_P and check register pressure. References: <5a90c8a9-1570-5af4-bfdc-19d097bfee6e@gmail.com> <9acc1a24-5d01-40ad-b4b2-5948585d3e8c@gmail.com> <48bed106-190e-ab5f-4099-fdfd4f5a193f@gmail.com> <77daf152-e1fe-4980-8297-d37901d925e8@gmail.com> Date: Wed, 06 Sep 2023 21:44:21 +0100 In-Reply-To: (Robin Dapp's message of "Wed, 6 Sep 2023 13:22:12 +0200") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Status: No, score=-25.1 required=5.0 tests=BAYES_00,GIT_PATCH_0,KAM_DMARC_NONE,KAM_DMARC_STATUS,KAM_LAZY_DOMAIN_SECURITY,KAM_SHORT,SPF_HELO_NONE,SPF_NONE,TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org List-Id: Robin Dapp writes: > Hi Richard, > > I did some testing with the attached v2 that does not restrict to UNARY > anymore. As feared ;) there is some more fallout that I'm detailing below. > > On Power there is one guality fail (pr43051-1.c) that I would take > the liberty of ignoring for now. > > On x86 there are four fails: > > - cond_op_addsubmuldiv__Float16-2.c: assembler error > unsupported masking for `vmovsh'. I guess that's a latent backend > problem. > > - ifcvt-3.c, pr49095.c: Here we propagate into a compare. Before, we had > (cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2) 0). > That looks like a costing problem and can hopefully solveable by making > the second compare more expensive, preventing the propagation. > i386 costing (or every costing?) is brittle so that could well break other > things. > > - pr88873.c: This is interesting because even before this patch we > propagated with different register classes (V2DF vs DI). With the patch > we check the register pressure, find the class NO_REGS for V2DF and > abort (because the patch assumes NO_REGS = high pressure). I'm thinking > of keeping the old behavior for reg-reg propagations and only checking > the pressure for more complex operations. > > aarch64 has the most fails: > > - One guality fail (same as Power). > - shrn-combine-[123].c as before. > > - A class of (hopefully, I only checked some) similar cases where we > propagate an unspec_whilelo into an unspec_ptest. Before we would only > set a REG_EQUALS note. > Before we managed to create a while_ultsivnx16bi_cc whereas now we have > while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined. > We create redundant whilelos and I'm not sure how to improve that. I > guess a peephole is out of the question :) > > - pred-combine-and.c: Here the new propagation appears useful at first. > We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the > individual and registers remain live up to the propagation site (while > being dead before the patch). > With the registers dead, combine could create a single fcmgt before. > Now it only manages a 2->2 combination because we still need the registers > and end up with two fcmgts. > The code is worse but this seems more bad luck than anything. > > - Addressing fails from before: I looked into these and suspect all of > them are a similar. > What happens is that we have a poly_int offset that we shift, negate > and then add to x0. The result is used as load address. > Before, we would pull (combine) the (plus x0 reg) into the load keeping > the neg and shift. > Now we propagate everything into a single (set (minus x0 offset)). > The propagation itself seems worthwhile because we save one insn. > However as we got rid of the base/offset split by lumping everything > together, combine cannot pull the (plus) into the address load and > we require an aarch64_split_add_offset. This will emit the longer > sequence of ashiftl and subtract. The "base" address is x0 here so > we cannot convert (minus x0 ...)) into neg. > I didn't go through all of aarch64_split_add_offset. I suppose we > could re-add the separation of base/offset there but that might be > a loss when the result is not used as an address. > > Again, all in all no fatal problems but pretty annoying :) It's not much > but just gradually worse than with just UNARY. Any idea on how/whether to > continue? Thanks for giving it a go. Can you post the latest version of the regpressure patch too? The previous on-list version I could find seems to be too old. Thanks, Richard > Regards > Robin > > gcc/ChangeLog: > > * fwprop.cc (fwprop_propagation::profitable_p): Add unary > handling. > (fwprop_propagation::update_register_pressure): New function. > (fwprop_propagation::register_pressure_high_p): New function > (reg_single_def_for_src_p): Look through unary expressions. > (try_fwprop_subst_pattern): Check register pressure. > (forward_propagate_into): Call new function. > (fwprop_init): Init register pressure. > (fwprop_done): Clean up register pressure. > (fwprop_insn): Add comment. > > gcc/testsuite/ChangeLog: > > * gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test. > --- > gcc/fwprop.cc | 359 +++++++++++++++++- > .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c | 64 ++++ > 2 files changed, 419 insertions(+), 4 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c > > diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc > index 0707a234726..ce6f5a74b00 100644 > --- a/gcc/fwprop.cc > +++ b/gcc/fwprop.cc > @@ -36,6 +36,10 @@ along with GCC; see the file COPYING3. If not see > #include "tree-pass.h" > #include "rtl-iter.h" > #include "target.h" > +#include "dominance.h" > + > +#include "ira.h" > +#include "regpressure.h" > > /* This pass does simple forward propagation and simplification when an > operand of an insn can only come from a single def. This pass uses > @@ -103,6 +107,10 @@ using namespace rtl_ssa; > > static int num_changes; > > +/* Keep track of which registers already increased the pressure to avoid double > + booking. */ > +sbitmap pressure_accounted; > + > /* Do not try to replace constant addresses or addresses of local and > argument slots. These MEM expressions are made only once and inserted > in many instructions, as well as being used to control symbol table > @@ -181,6 +189,8 @@ namespace > bool changed_mem_p () const { return result_flags & CHANGED_MEM; } > bool folded_to_constants_p () const; > bool profitable_p () const; > + bool register_pressure_high_p (rtx, rtx, rtx_insn *, rtx_insn *) const; > + bool update_register_pressure (rtx, rtx, rtx_insn *, rtx_insn *) const; > > bool check_mem (int, rtx) final override; > void note_simplification (int, uint16_t, rtx, rtx) final override; > @@ -340,18 +350,272 @@ fwprop_propagation::profitable_p () const > && !paradoxical_subreg_p (to)) > return true; > > - if (CONSTANT_P (to)) > + /* Disallow propagations into debug insns when it is not a reg-reg > + propagation. Their costs are usually 0 and a cost comparison will > + always be profitable no matter how much we complicate the pattern. */ > + if (DEBUG_INSN_P (insn)) > + return false; > + > + /* For more complex operations check each sub-rtx. */ > + subrtx_iterator::array_type array; > + FOR_EACH_SUBRTX (iter, array, to, NONCONST) > + { > + const_rtx x = *iter; > + > + if (REG_P (x)) > + continue; > + > + else if (GET_CODE (x) == SUBREG > + && REG_P (SUBREG_REG (x)) > + && !paradoxical_subreg_p (x)) > + continue; > + > + else if (rtx_equal_p (x, to)) > + continue; > + > + else > + return false; > + } > + > + return true; > +} > + > +/* Check if the register pressure in any predecessor block of USE's block > + until DEF's block is equal or higher than the number of hardregs in NU's > + register class. */ > +bool > +fwprop_propagation::register_pressure_high_p (rtx nu, rtx old, rtx_insn *def, > + rtx_insn *use) const > +{ > + enum reg_class nu_class, old_class; > + int nu_nregs, old_nregs; > + nu_class = regpressure_get_regno_pressure_class (REGNO (nu), &nu_nregs); > + old_class > + = regpressure_get_regno_pressure_class (REGNO (old), &old_nregs); > + > + if (nu_class == NO_REGS && old_class == NO_REGS) > return true; > > - return false; > This patch enables the forwarding of sources from more complex > operations like unary (i.e. vec_duplicate) or binary. As this > potentially involves replacing a register with a register of a different > class the ira_hoist_pressure machinery is used to calculate the change > in register pressure. If the propagation would increase the pressure > beyond the number of hard regs, we don't perform it. > + if (nu_class == old_class) > + return false; > + > + basic_block bbfrom = BLOCK_FOR_INSN (def); > + basic_block bbto = BLOCK_FOR_INSN (use); > + > + basic_block bb; > + > + sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun)); > + bitmap_clear (visited); > + auto_vec q; > + q.safe_push (bbto); > + > + bool high = false; > + > + while (!q.is_empty ()) > + { > + bb = q.pop (); > + > + if (bitmap_bit_p (visited, bb->index)) > + continue; > + > + /* Nothing to do if the register to be replaced is not live > + in this BB. */ > + if (bb != bbfrom && !regpressure_is_live_in (bb, REGNO (old))) > + continue; > + > + /* Nothing to do if the replacement register is already live in > + this BB. */ > + if (regpressure_is_live_in (bb, REGNO (nu))) > + continue; > + > + bitmap_set_bit (visited, bb->index); > + > + int press = regpressure_get (bb, nu_class); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "pressure for reg %d in bb %d: %d\n", > + REGNO (nu), bb->index, press); > + > + if (press + nu_nregs > ira_class_hard_regs_num[nu_class]) > + { > + high = true; > + break; > + } > + > + if (bb == bbfrom || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun)) > + break; > + > + edge e; > + edge_iterator ei; > + > + FOR_EACH_EDGE (e, ei, bb->preds) > + { > + basic_block pred = e->src; > + q.safe_push (pred); > + } > + } > + > + sbitmap_free (visited); > + > + return high; > } > > -/* Check that X has a single def. */ > +/* Update the register pressure when propagating from DEF to USE > + replacing OLD with NU. Return TRUE if the pressure was updated. */ > +bool > +fwprop_propagation::update_register_pressure (rtx nu, rtx old, rtx_insn *def, > + rtx_insn *use) const > +{ > + basic_block bb; > + > + enum reg_class nu_class, old_class; > + int nu_nregs, old_nregs; > + nu_class = regpressure_get_regno_pressure_class (REGNO (nu), &nu_nregs); > + old_class = regpressure_get_regno_pressure_class (REGNO (old), &old_nregs); > + > + bool changed = false; > + > + if (nu_class == old_class) > + return false; > + > + basic_block bbfrom = BLOCK_FOR_INSN (def); > + basic_block bbto = BLOCK_FOR_INSN (use); > + > + sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun)); > + bitmap_clear (visited); > + auto_vec q; > + q.safe_push (bbto); > + > + /* Precompute the users. */ > + auto_vec users; > + df_ref duse; > + bool should_decrease = true; > + FOR_EACH_INSN_USE (duse, def) > + { > + rtx dreg = DF_REF_REAL_REG (duse); > + if (REGNO (dreg) != REGNO (old)) > + continue; > + > + df_ref op_ref = DF_REG_USE_CHAIN (REGNO (dreg)); > + for (; op_ref; op_ref = DF_REF_NEXT_REG (op_ref)) > + { > + if (!DF_REF_INSN_INFO (op_ref)) > + continue; > + > + rtx_insn *insn = DF_REF_INSN (op_ref); > + > + /* If there are other users in BBTO, never decrease the > + pressure. */ > + if (BLOCK_FOR_INSN (insn) == bbto && insn != use) > + { > + should_decrease = false; > + break; > + } > + > + users.safe_push (insn); > + } > + } > + > + while (!q.is_empty ()) > + { > + bb = q.pop (); > + > + if (bitmap_bit_p (visited, bb->index)) > + continue; > + > + /* Nothing to do if the register to be replaced is not live > + in this BB. */ > + if (bb != bbfrom && !regpressure_is_live_in (bb, REGNO (old))) > + continue; > + > + /* Nothing to do if the new register is already live in this BB. */ > + if (regpressure_is_live_in (bb, REGNO (nu))) > + continue; > + > + bitmap_set_bit (visited, bb->index); > + > + if (regpressure_is_live_in (bb, REGNO (old))) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "increasing pressure for " > + "reg %d in bb %d by %d reg(s).\n", > + REGNO (nu), bb->index, nu_nregs); > + regpressure_set_live_in (bb, REGNO (nu)); > + regpressure_increase (bb, nu_class, nu_nregs); > + changed = true; > + > + if (should_decrease) > + { > + rtx_insn *user; > + int ix; > + bool should_decrease_local = true; > + > + /* If this BB dominates no BB where OLD is used (except BBTO) > + the register pressure is decreased. */ > + FOR_EACH_VEC_ELT(users, ix, user) > + { > + basic_block ubb = BLOCK_FOR_INSN (user); > + if (ubb == bbto) > + continue; > + if (dominated_by_p (CDI_DOMINATORS, ubb, bb)) > + { > + should_decrease_local = false; > + break; > + } > + } > + > + if (should_decrease_local) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "decreasing pressure for " > + "reg %d in bb %d by %d reg(s).\n", > + REGNO (old), bb->index, old_nregs); > + regpressure_clear_live_in (bb, REGNO (old)); > + regpressure_decrease (bb, old_class, old_nregs); > + } > + } > + } > + > + if (bb == bbfrom || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun)) > + break; > + > + edge e; > + edge_iterator ei; > + > + FOR_EACH_EDGE (e, ei, bb->preds) > + { > + basic_block pred = e->src; > + q.safe_push (pred); > + } > + } > + > + sbitmap_free (visited); > + return changed; > +} > + > +/* Check that X has a single def. In case of more complex operations > + check each sub-rtx. */ > > static bool > reg_single_def_p (rtx x) > { > - return REG_P (x) && crtl->ssa->single_dominating_def (REGNO (x)); > + if (REG_P (x)) > + return crtl->ssa->single_dominating_def (REGNO (x)); > + > + subrtx_var_iterator::array_type array; > + FOR_EACH_SUBRTX_VAR (iter, array, x, ALL) > + { > + rtx sub = *iter; > + > + if (SUBREG_P (x) && !contains_paradoxical_subreg_p (sub)) > + sub = SUBREG_REG (x); > + > + bool ok = REG_P (sub) && crtl->ssa->single_dominating_def (REGNO (sub)); > + > + if (!ok) > + return false; > + } > + > + return true; > } > > /* Try to substitute (set DEST SRC), which defines DEF, into note NOTE of > @@ -490,6 +754,81 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change, > } > } > > + /* Perform register-pressure checks for all sub-rtxs. > + If FLAG_IRA_HOIST_PRESSURE is set only allow the propagation if the > + register pressure is not high. > + Otherwise only allow the propagation if DEF has only a single use. */ > + if (ok) > + { > + rtx_insn *def_rtl = def_insn->rtl (); > + > + subrtx_var_iterator::array_type array; > + FOR_EACH_SUBRTX_VAR (iter, array, src, NONCONST) > + { > + rtx srcreg = *iter; > + rtx dstreg = dest; > + > + /* For register pressure we do not care about subregs so strip > + them. */ > + while (SUBREG_P (dstreg)) > + dstreg = SUBREG_REG (dstreg); > + > + while (SUBREG_P (srcreg)) > + srcreg = SUBREG_REG (srcreg); > + > + if (!REG_P (dstreg) || !REG_P (srcreg)) > + continue; > + > + /* Nothing to do for the same class. */ > + bool same_class = regpressure_same_class (dstreg, srcreg); > + if (same_class) > + { > + if (flag_ira_hoist_pressure) > + continue; > + else > + { > + /* Check if this is the only use. If not do not > + continue. */ > + for (use_info *use : def->all_uses ()) > + if (use->insn () != use_insn) > + return false; > + } > + } > + > + /* Never allow different pressure classes when not considering > + register pressure. */ > + if (!same_class && !flag_ira_hoist_pressure) > + return false; > + > + if (!bitmap_bit_p (pressure_accounted, REGNO (srcreg))) > + { > + if (!prop.register_pressure_high_p (srcreg, dstreg, def_rtl, > + use_rtl)) > + { > + bool updated > + = prop.update_register_pressure (srcreg, dstreg, def_rtl, > + use_rtl); > + if (updated) > + bitmap_set_bit (pressure_accounted, REGNO (srcreg)); > + } > + else > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "cannot propagate from insn %d into" > + " insn %d: %s\n", def_insn->uid (), > + use_insn->uid (), > + "would increase register pressure beyond number" > + " of hard regs"); > + ok = false; > + } > + } > + else > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "pressure for reg %d has already" > + " been accounted\n", REGNO (srcreg)); > + } > + } > + > if (!ok) > { > /* The pattern didn't match, but if all uses of SRC folded to > @@ -538,6 +877,7 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change, > confirm_change_group (); > crtl->ssa->change_insn (use_change); > num_changes++; > + > return true; > } > > @@ -890,6 +1230,10 @@ fwprop_init (void) > > df_analyze (); > crtl->ssa = new rtl_ssa::function_info (cfun); > + > + regpressure_init (); > + pressure_accounted = sbitmap_alloc (DF_REG_SIZE (df)); > + bitmap_clear (pressure_accounted); > } > > static void > @@ -910,6 +1254,9 @@ fwprop_done (void) > fprintf (dump_file, > "\nNumber of successful forward propagations: %d\n\n", > num_changes); > + > + regpressure_cleanup (); > + sbitmap_free (pressure_accounted); > } > > /* Try to optimize INSN, returning true if something changes. > @@ -919,6 +1266,10 @@ fwprop_done (void) > static bool > fwprop_insn (insn_info *insn, bool fwprop_addr_p) > { > + /* TODO: when rejecting propagations due to register pressure > + we would actually like to try the propagation with most > + potential = uses first instead of going through them sequentially. */ > + > for (use_info *use : insn->uses ()) > { > if (use->is_mem ()) > diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c > new file mode 100644 > index 00000000000..0ed1bb80f73 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c > @@ -0,0 +1,64 @@ > +/* { dg-do compile } */ > +/* { dg-additional-options "-O2 -march=rv64gcv -mabi=lp64d --param=riscv-autovec-preference=fixed-vlmax -fdump-rtl-fwprop1-details" } */ > + > +/* Check that we propagate from vec_duplicates into vadd.vv, turning it into > + vadd.vx. */ > + > +#include > + > +/* Define 26 variables so that we can perform two forward propagations until we > + reach the hard reg limit of 28. */ > +uint64_t *restrict dst1; > +uint64_t *restrict dst2; > +uint64_t *restrict dst3; > +uint64_t *restrict dst4; > +uint64_t *restrict dst5; > +uint64_t *restrict dst6; > +uint64_t *restrict dst7; > +uint64_t *restrict dst8; > +uint64_t *restrict dst9; > +uint64_t *restrict dst10; > +uint64_t *restrict dst11; > +uint64_t *restrict a1; > +uint64_t *restrict a2; > +uint64_t *restrict a3; > +uint64_t *restrict a4; > +uint64_t *restrict a5; > +uint64_t *restrict a6; > +uint64_t *restrict a7; > +uint64_t *restrict a8; > +uint64_t *restrict a9; > +uint64_t *restrict a10; > +uint64_t *restrict a11; > +uint64_t b1; > +uint64_t b2; > +uint64_t b3; > +uint64_t b4; > + > +void foo (int n) > +{ > + __builtin_expect (n % 128, 0); > + for (int i = 0; i < n; i++) > + { > + /* We expect b3 to be propagated into vadd here as well as in the other > + 7 occurences (without increasing the pressure every time). > + Then b2 is still supposed to be propagated but b1 is not even though > + it would be better to do so. */ > + dst1[i] = a1[i] + b3; > + dst2[i] = a2[i] + b3; > + dst3[i] = a3[i] + b2; > + dst4[i] = a4[i] + b1; > + dst5[i] = a5[i] + b1; > + dst6[i] = a6[i] + b3; > + dst7[i] = a7[i] + b3; > + dst8[i] = a8[i] + b3; > + dst9[i] = a9[i] + b3; > + dst10[i] = a10[i] + b3; > + dst11[i] = a11[i] + b3; > + } > +} > + > +/* { dg-final { scan-rtl-dump-times "increasing pressure for reg" 2 "fwprop1" } } */ > +/* { dg-final { scan-rtl-dump-times "pressure for reg \[0-9\]+ has already been accounted" 7 "fwprop1" } } */ > +/* { dg-final { scan-rtl-dump-times "would increase register pressure beyond number of hard regs" 2 "fwprop1" } } */ > +/* { dg-final { scan-assembler-times "\tvadd\.vx" 9 } } */