From: Uros Bizjak <ubizjak@gmail.com>
To: Richard Biener <rguenther@suse.de>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
Jakub Jelinek <jakub@redhat.com>
Subject: Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
Date: Mon, 05 Aug 2019 12:44:00 -0000 [thread overview]
Message-ID: <CAFULd4Y9AHNpGyrsygqxQ6GtVeu5pyvQQt8Q7owHttuZFhbFbQ@mail.gmail.com> (raw)
In-Reply-To: <alpine.LSU.2.20.1908051316260.19626@zhemvz.fhfr.qr>
On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Sun, 4 Aug 2019, Uros Bizjak wrote:
>
> > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Thu, 1 Aug 2019, Uros Bizjak wrote:
> > >
> > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote:
> > > >
> > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
> > > >>>> necessary even when going the STV route. The actual regression
> > > >>>> for the testcase could also be solved by turing the smaxsi3
> > > >>>> back into a compare and jump rather than a conditional move sequence.
> > > >>>> So I wonder how you'd do that given that there's pass_if_after_reload
> > > >>>> after pass_split_after_reload and I'm not sure we can split
> > > >>>> as late as pass_split_before_sched2 (there's also a split _after_
> > > >>>> sched2 on x86 it seems).
> > > >>>>
> > > >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the
> > > >>>> case STV doesn't end up doing any transform?
> > > >>>
> > > >>> If STV doesn't transform the insn, then a pre-reload splitter splits
> > > >>> the insn back to compare+cmove.
> > > >>
> > > >> OK, that would work. But there's no way to force a jumpy sequence then
> > > >> which we know is faster than compare+cmove because later RTL
> > > >> if-conversion passes happily re-discover the smax (or conditional move)
> > > >> sequence.
> > > >>
> > > >>> However, considering the SImode move
> > > >>> from/to int/xmm register is relatively cheap, the cost function should
> > > >>> be tuned so that STV always converts smaxsi3 pattern.
> > > >>
> > > >> Note that on both Zen and even more so bdverN the int/xmm transition
> > > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov
> > > >> sequence... (for the loop in hmmer which is the only one I see
> > > >> any effect of any of my patches). So identifying chains that
> > > >> start/end in memory is important for cost reasons.
> > > >
> > > > Please note that the cost function also considers the cost of move
> > > > from/to xmm. So, the cost of the whole chain would disable the
> > > > transformation.
> > > >
> > > >> So I think the splitting has to happen after the last if-conversion
> > > >> pass (and thus we may need to allocate a scratch register for this
> > > >> purpose?)
> > > >
> > > > I really hope that the underlying issue will be solved by a machine
> > > > dependant pass inserted somewhere after the pre-reload split. This
> > > > way, we can split unconverted smax to the cmove, and this later pass
> > > > would handle jcc and cmove instructions. Until then... yes your
> > > > proposed approach is one of the ways to avoid unwanted if-conversion,
> > > > although sometimes we would like to split to cmove instead.
> > >
> > > So the following makes STV also consider SImode chains, re-using the
> > > DImode chain code. I've kept a simple incomplete smaxsi3 pattern
> > > and also did not alter the {SI,DI}mode chain cost function - it's
> > > quite off for TARGET_64BIT. With this I get the expected conversion
> > > for the testcase derived from hmmer.
> > >
> > > No further testing sofar.
> > >
> > > Is it OK to re-use the DImode chain code this way? I'll clean things
> > > up some more of course.
> >
> > Yes, the approach looks OK to me. It makes chain building mode
> > agnostic, and the chain building can be used for
> > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
> > b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> > minmax and surrounding SImode operations)
> > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> > DImode operations)
> >
> > > Still need help with the actual patterns for minmax and how the splitters
> > > should look like.
> >
> > Please look at the attached patch. Maybe we can add memory_operand as
> > operand 1 and operand 2 predicate, but let's keep things simple for
> > now.
>
> Thanks. The attached patch makes the patch cleaner and it survives
> "some" barebone testing. It also touches the cost function to
> avoid being too overly trigger-happy. I've also ended up using
> ix86_cost->sse_op instead of COSTS_N_INSN-based magic. In
> particular we estimated GPR reg-reg move as COST_N_INSNS(2) while
> move costs shouldn't be wrapped in COST_N_INSNS.
> IMHO we should probably disregard any reg-reg moves for costing pre-RA.
> At least with the current code every reg-reg move biases in favor of
> SSE...
>
> And we're simply adding move and non-move costs in 'gain', somewhat
> mixing apples and oranges? We could separate those and require
> both to be a net positive win?
>
> Still using -mtune=bdverN exposes that some cost tables have xmm and gpr
> costs as apples and oranges... (so it never triggers for Bulldozer)
>
> I now run into
>
> /space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1:
> error: unrecognizable insn:
> (insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
> (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
> (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0)))
> -1
> (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ])
> (expr_list:REG_UNUSED (reg:CC 17 flags)
> (nil))))
> during RTL pass: stv
>
> where even with -mavx2 we do not have s{min,max}v2di3. We do have
> an expander here but it seems only AVX512F has the DImode min/max
> ops. I have adjusted dimode_scalar_to_vector_candidate_p
> accordingly.
Uh, you need to use some other mode iterator that SWI48 then, like:
(define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
and then we need to split DImode for 32bits, too.
Uros.
> I'm considering to rename the
> dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
> functions to drop the dimode_ prefix - is that OK or do you
> prefer some other prefix?
>
> So - bootstrap with --with-arch=skylake in progress.
>
> It detects quite a few chains (unsurprisingly) so I guess we need
> to address compile-time issues in the pass before enabling this
> enhancement (maybe as followup?).
>
> Further comments on the actual patch welcome, I consider it
> "finished" if testing reveals no issues. ChangeLog still needs
> to be written and testcases to be added.
>
> Thanks,
> Richard.
>
> Index: gcc/config/i386/i386-features.c
> ===================================================================
> --- gcc/config/i386/i386-features.c (revision 274111)
> +++ gcc/config/i386/i386-features.c (working copy)
> @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
>
> /* Initialize new chain. */
>
> -scalar_chain::scalar_chain ()
> +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
> {
> + smode = smode_;
> + vmode = vmode_;
> +
> chain_id = ++max_id;
>
> if (dump_file)
> @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate
> && !HARD_REGISTER_P (SET_DEST (def_set)))
> bitmap_set_bit (defs, REGNO (SET_DEST (def_set)));
>
> + /* ??? The following is quadratic since analyze_register_chain
> + iterates over all refs to look for dual-mode regs. Instead this
> + should be done separately for all regs mentioned in the chain once. */
> df_ref ref;
> df_ref def;
> for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref))
> @@ -473,9 +479,11 @@ dimode_scalar_chain::vector_const_cost (
> {
> gcc_assert (CONST_INT_P (exp));
>
> - if (standard_sse_constant_p (exp, V2DImode))
> - return COSTS_N_INSNS (1);
> - return ix86_cost->sse_load[1];
> + if (standard_sse_constant_p (exp, vmode))
> + return ix86_cost->sse_op;
> + /* We have separate costs for SImode and DImode, use SImode costs
> + for smaller modes. */
> + return ix86_cost->sse_load[smode == DImode ? 1 : 0];
> }
>
> /* Compute a gain for chain conversion. */
> @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai
> if (dump_file)
> fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id);
>
> + /* SSE costs distinguish between SImode and DImode loads/stores, for
> + int costs factor in the number of GPRs involved. When supporting
> + smaller modes than SImode the int load/store costs need to be
> + adjusted as well. */
> + unsigned sse_cost_idx = smode == DImode ? 1 : 0;
> + unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
> +
> EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
> {
> rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
> rtx def_set = single_set (insn);
> rtx src = SET_SRC (def_set);
> rtx dst = SET_DEST (def_set);
> + int igain = 0;
>
> if (REG_P (src) && REG_P (dst))
> - gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move;
> + igain += 2 * m - ix86_cost->xmm_move;
> else if (REG_P (src) && MEM_P (dst))
> - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> + igain
> + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
> else if (MEM_P (src) && REG_P (dst))
> - gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1];
> + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
> else if (GET_CODE (src) == ASHIFT
> || GET_CODE (src) == ASHIFTRT
> || GET_CODE (src) == LSHIFTRT)
> {
> if (CONST_INT_P (XEXP (src, 0)))
> - gain -= vector_const_cost (XEXP (src, 0));
> - gain += ix86_cost->shift_const;
> + igain -= vector_const_cost (XEXP (src, 0));
> + igain += m * ix86_cost->shift_const - ix86_cost->sse_op;
> if (INTVAL (XEXP (src, 1)) >= 32)
> - gain -= COSTS_N_INSNS (1);
> + igain -= COSTS_N_INSNS (1);
> }
> else if (GET_CODE (src) == PLUS
> || GET_CODE (src) == MINUS
> @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai
> || GET_CODE (src) == XOR
> || GET_CODE (src) == AND)
> {
> - gain += ix86_cost->add;
> + igain += m * ix86_cost->add - ix86_cost->sse_op;
> /* Additional gain for andnot for targets without BMI. */
> if (GET_CODE (XEXP (src, 0)) == NOT
> && !TARGET_BMI)
> - gain += 2 * ix86_cost->add;
> + igain += m * ix86_cost->add;
>
> if (CONST_INT_P (XEXP (src, 0)))
> - gain -= vector_const_cost (XEXP (src, 0));
> + igain -= vector_const_cost (XEXP (src, 0));
> if (CONST_INT_P (XEXP (src, 1)))
> - gain -= vector_const_cost (XEXP (src, 1));
> + igain -= vector_const_cost (XEXP (src, 1));
> }
> else if (GET_CODE (src) == NEG
> || GET_CODE (src) == NOT)
> - gain += ix86_cost->add - COSTS_N_INSNS (1);
> + igain += m * ix86_cost->add - ix86_cost->sse_op;
> + else if (GET_CODE (src) == SMAX
> + || GET_CODE (src) == SMIN
> + || GET_CODE (src) == UMAX
> + || GET_CODE (src) == UMIN)
> + {
> + /* We do not have any conditional move cost, estimate it as a
> + reg-reg move. Comparisons are costed as adds. */
> + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
> + /* Integer SSE ops are all costed the same. */
> + igain -= ix86_cost->sse_op;
> + }
> else if (GET_CODE (src) == COMPARE)
> {
> /* Assume comparison cost is the same. */
> @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai
> else if (CONST_INT_P (src))
> {
> if (REG_P (dst))
> - gain += COSTS_N_INSNS (2);
> + /* DImode can be immediate for TARGET_64BIT and SImode always. */
> + igain += COSTS_N_INSNS (m);
> else if (MEM_P (dst))
> - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1];
> - gain -= vector_const_cost (src);
> + igain += (m * ix86_cost->int_store[2]
> + - ix86_cost->sse_store[sse_cost_idx]);
> + igain -= vector_const_cost (src);
> }
> else
> gcc_unreachable ();
> +
> + if (igain != 0 && dump_file)
> + {
> + fprintf (dump_file, " Instruction gain %d for ", igain);
> + dump_insn_slim (dump_file, insn);
> + }
> + gain += igain;
> }
>
> if (dump_file)
> fprintf (dump_file, " Instruction conversion gain: %d\n", gain);
>
> + /* ??? What about integer to SSE? */
> EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi)
> cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer;
>
> @@ -573,7 +611,7 @@ rtx
> dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg)
> {
> if (x == reg)
> - return gen_rtx_SUBREG (V2DImode, new_reg, 0);
> + return gen_rtx_SUBREG (vmode, new_reg, 0);
>
> const char *fmt = GET_RTX_FORMAT (GET_CODE (x));
> int i, j;
> @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies
> start_sequence ();
> if (!TARGET_INTER_UNIT_MOVES_TO_VEC)
> {
> - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> - emit_move_insn (adjust_address (tmp, SImode, 0),
> - gen_rtx_SUBREG (SImode, reg, 0));
> - emit_move_insn (adjust_address (tmp, SImode, 4),
> - gen_rtx_SUBREG (SImode, reg, 4));
> + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
> + if (smode == DImode && !TARGET_64BIT)
> + {
> + emit_move_insn (adjust_address (tmp, SImode, 0),
> + gen_rtx_SUBREG (SImode, reg, 0));
> + emit_move_insn (adjust_address (tmp, SImode, 4),
> + gen_rtx_SUBREG (SImode, reg, 4));
> + }
> + else
> + emit_move_insn (tmp, reg);
> emit_move_insn (vreg, tmp);
> }
> - else if (TARGET_SSE4_1)
> + else if (!TARGET_64BIT && smode == DImode)
> {
> - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> - CONST0_RTX (V4SImode),
> - gen_rtx_SUBREG (SImode, reg, 0)));
> - emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> - gen_rtx_SUBREG (V4SImode, vreg, 0),
> - gen_rtx_SUBREG (SImode, reg, 4),
> - GEN_INT (2)));
> + if (TARGET_SSE4_1)
> + {
> + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> + CONST0_RTX (V4SImode),
> + gen_rtx_SUBREG (SImode, reg, 0)));
> + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0),
> + gen_rtx_SUBREG (V4SImode, vreg, 0),
> + gen_rtx_SUBREG (SImode, reg, 4),
> + GEN_INT (2)));
> + }
> + else
> + {
> + rtx tmp = gen_reg_rtx (DImode);
> + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> + CONST0_RTX (V4SImode),
> + gen_rtx_SUBREG (SImode, reg, 0)));
> + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> + CONST0_RTX (V4SImode),
> + gen_rtx_SUBREG (SImode, reg, 4)));
> + emit_insn (gen_vec_interleave_lowv4si
> + (gen_rtx_SUBREG (V4SImode, vreg, 0),
> + gen_rtx_SUBREG (V4SImode, vreg, 0),
> + gen_rtx_SUBREG (V4SImode, tmp, 0)));
> + }
> }
> else
> - {
> - rtx tmp = gen_reg_rtx (DImode);
> - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0),
> - CONST0_RTX (V4SImode),
> - gen_rtx_SUBREG (SImode, reg, 0)));
> - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0),
> - CONST0_RTX (V4SImode),
> - gen_rtx_SUBREG (SImode, reg, 4)));
> - emit_insn (gen_vec_interleave_lowv4si
> - (gen_rtx_SUBREG (V4SImode, vreg, 0),
> - gen_rtx_SUBREG (V4SImode, vreg, 0),
> - gen_rtx_SUBREG (V4SImode, tmp, 0)));
> - }
> + emit_move_insn (gen_lowpart (smode, vreg), reg);
> rtx_insn *seq = get_insns ();
> end_sequence ();
> rtx_insn *insn = DF_REF_INSN (ref);
> @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign
> bitmap_copy (conv, insns);
>
> if (scalar_copy)
> - scopy = gen_reg_rtx (DImode);
> + scopy = gen_reg_rtx (smode);
>
> for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref))
> {
> @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign
> start_sequence ();
> if (!TARGET_INTER_UNIT_MOVES_FROM_VEC)
> {
> - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP);
> + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP);
> emit_move_insn (tmp, reg);
> - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> - adjust_address (tmp, SImode, 0));
> - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> - adjust_address (tmp, SImode, 4));
> + if (!TARGET_64BIT && smode == DImode)
> + {
> + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> + adjust_address (tmp, SImode, 0));
> + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> + adjust_address (tmp, SImode, 4));
> + }
> + else
> + emit_move_insn (scopy, tmp);
> }
> - else if (TARGET_SSE4_1)
> + else if (!TARGET_64BIT && smode == DImode)
> {
> - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx));
> - emit_insn
> - (gen_rtx_SET
> - (gen_rtx_SUBREG (SImode, scopy, 0),
> - gen_rtx_VEC_SELECT (SImode,
> - gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> -
> - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> - emit_insn
> - (gen_rtx_SET
> - (gen_rtx_SUBREG (SImode, scopy, 4),
> - gen_rtx_VEC_SELECT (SImode,
> - gen_rtx_SUBREG (V4SImode, reg, 0), tmp)));
> + if (TARGET_SSE4_1)
> + {
> + rtx tmp = gen_rtx_PARALLEL (VOIDmode,
> + gen_rtvec (1, const0_rtx));
> + emit_insn
> + (gen_rtx_SET
> + (gen_rtx_SUBREG (SImode, scopy, 0),
> + gen_rtx_VEC_SELECT (SImode,
> + gen_rtx_SUBREG (V4SImode, reg, 0),
> + tmp)));
> +
> + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx));
> + emit_insn
> + (gen_rtx_SET
> + (gen_rtx_SUBREG (SImode, scopy, 4),
> + gen_rtx_VEC_SELECT (SImode,
> + gen_rtx_SUBREG (V4SImode, reg, 0),
> + tmp)));
> + }
> + else
> + {
> + rtx vcopy = gen_reg_rtx (V2DImode);
> + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> + gen_rtx_SUBREG (SImode, vcopy, 0));
> + emit_move_insn (vcopy,
> + gen_rtx_LSHIFTRT (V2DImode,
> + vcopy, GEN_INT (32)));
> + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> + gen_rtx_SUBREG (SImode, vcopy, 0));
> + }
> }
> else
> - {
> - rtx vcopy = gen_reg_rtx (V2DImode);
> - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0));
> - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0),
> - gen_rtx_SUBREG (SImode, vcopy, 0));
> - emit_move_insn (vcopy,
> - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32)));
> - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4),
> - gen_rtx_SUBREG (SImode, vcopy, 0));
> - }
> + emit_move_insn (scopy, reg);
> +
> rtx_insn *seq = get_insns ();
> end_sequence ();
> emit_conversion_insns (seq, insn);
> @@ -816,14 +879,14 @@ dimode_scalar_chain::convert_op (rtx *op
> if (GET_CODE (*op) == NOT)
> {
> convert_op (&XEXP (*op, 0), insn);
> - PUT_MODE (*op, V2DImode);
> + PUT_MODE (*op, vmode);
> }
> else if (MEM_P (*op))
> {
> - rtx tmp = gen_reg_rtx (DImode);
> + rtx tmp = gen_reg_rtx (GET_MODE (*op));
>
> emit_insn_before (gen_move_insn (tmp, *op), insn);
> - *op = gen_rtx_SUBREG (V2DImode, tmp, 0);
> + *op = gen_rtx_SUBREG (vmode, tmp, 0);
>
> if (dump_file)
> fprintf (dump_file, " Preloading operand for insn %d into r%d\n",
> @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op
> gcc_assert (!DF_REF_CHAIN (ref));
> break;
> }
> - *op = gen_rtx_SUBREG (V2DImode, *op, 0);
> + *op = gen_rtx_SUBREG (vmode, *op, 0);
> }
> else if (CONST_INT_P (*op))
> {
> rtx vec_cst;
> - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0);
> + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0);
>
> /* Prefer all ones vector in case of -1. */
> if (constm1_operand (*op, GET_MODE (*op)))
> - vec_cst = CONSTM1_RTX (V2DImode);
> + vec_cst = CONSTM1_RTX (vmode);
> else
> - vec_cst = gen_rtx_CONST_VECTOR (V2DImode,
> - gen_rtvec (2, *op, const0_rtx));
> + {
> + unsigned n = GET_MODE_NUNITS (vmode);
> + rtx *v = XALLOCAVEC (rtx, n);
> + v[0] = *op;
> + for (unsigned i = 1; i < n; ++i)
> + v[i] = const0_rtx;
> + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v));
> + }
>
> - if (!standard_sse_constant_p (vec_cst, V2DImode))
> + if (!standard_sse_constant_p (vec_cst, vmode))
> {
> start_sequence ();
> - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst));
> + vec_cst = validize_mem (force_const_mem (vmode, vec_cst));
> rtx_insn *seq = get_insns ();
> end_sequence ();
> emit_insn_before (seq, insn);
> @@ -870,7 +939,7 @@ dimode_scalar_chain::convert_op (rtx *op
> else
> {
> gcc_assert (SUBREG_P (*op));
> - gcc_assert (GET_MODE (*op) == V2DImode);
> + gcc_assert (GET_MODE (*op) == vmode);
> }
> }
>
> @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i
> {
> /* There are no scalar integer instructions and therefore
> temporary register usage is required. */
> - rtx tmp = gen_reg_rtx (DImode);
> + rtx tmp = gen_reg_rtx (GET_MODE (dst));
> emit_conversion_insns (gen_move_insn (dst, tmp), insn);
> - dst = gen_rtx_SUBREG (V2DImode, tmp, 0);
> + dst = gen_rtx_SUBREG (vmode, tmp, 0);
> }
>
> switch (GET_CODE (src))
> @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i
> case ASHIFTRT:
> case LSHIFTRT:
> convert_op (&XEXP (src, 0), insn);
> - PUT_MODE (src, V2DImode);
> + PUT_MODE (src, vmode);
> break;
>
> case PLUS:
> @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i
> case IOR:
> case XOR:
> case AND:
> + case SMAX:
> + case SMIN:
> + case UMAX:
> + case UMIN:
> convert_op (&XEXP (src, 0), insn);
> convert_op (&XEXP (src, 1), insn);
> - PUT_MODE (src, V2DImode);
> + PUT_MODE (src, vmode);
> break;
>
> case NEG:
> src = XEXP (src, 0);
> convert_op (&src, insn);
> - subreg = gen_reg_rtx (V2DImode);
> - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn);
> - src = gen_rtx_MINUS (V2DImode, subreg, src);
> + subreg = gen_reg_rtx (vmode);
> + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn);
> + src = gen_rtx_MINUS (vmode, subreg, src);
> break;
>
> case NOT:
> src = XEXP (src, 0);
> convert_op (&src, insn);
> - subreg = gen_reg_rtx (V2DImode);
> - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn);
> - src = gen_rtx_XOR (V2DImode, src, subreg);
> + subreg = gen_reg_rtx (vmode);
> + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn);
> + src = gen_rtx_XOR (vmode, src, subreg);
> break;
>
> case MEM:
> @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i
> break;
>
> case SUBREG:
> - gcc_assert (GET_MODE (src) == V2DImode);
> + gcc_assert (GET_MODE (src) == vmode);
> break;
>
> case COMPARE:
> src = SUBREG_REG (XEXP (XEXP (src, 0), 0));
>
> - gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
> - || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
> + gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
> + || (SUBREG_P (src) && GET_MODE (src) == vmode));
>
> if (REG_P (src))
> - subreg = gen_rtx_SUBREG (V2DImode, src, 0);
> + subreg = gen_rtx_SUBREG (vmode, src, 0);
> else
> subreg = copy_rtx_if_shared (src);
> emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
> @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i
> PATTERN (insn) = def_set;
>
> INSN_CODE (insn) = -1;
> - recog_memoized (insn);
> + int patt = recog_memoized (insn);
> + if (patt == -1)
> + fatal_insn_not_found (insn);
> df_insn_rescan (insn);
> }
>
> @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn
> (const_int 0 [0]))) */
>
> static bool
> -convertible_comparison_p (rtx_insn *insn)
> +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
> {
> if (!TARGET_SSE4_1)
> return false;
> @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn
>
> if (!SUBREG_P (op1)
> || !SUBREG_P (op2)
> - || GET_MODE (op1) != SImode
> - || GET_MODE (op2) != SImode
> + || GET_MODE (op1) != mode
> + || GET_MODE (op2) != mode
> || ((SUBREG_BYTE (op1) != 0
> - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
> + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
> && (SUBREG_BYTE (op2) != 0
> - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
> + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
> return false;
>
> op1 = SUBREG_REG (op1);
> @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn
>
> if (op1 != op2
> || !REG_P (op1)
> - || GET_MODE (op1) != DImode)
> + || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
> return false;
>
> return true;
> @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn
> /* The DImode version of scalar_to_vector_candidate_p. */
>
> static bool
> -dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
> +dimode_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
> {
> rtx def_set = single_set (insn);
>
> @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx
> rtx dst = SET_DEST (def_set);
>
> if (GET_CODE (src) == COMPARE)
> - return convertible_comparison_p (insn);
> + return convertible_comparison_p (insn, mode);
>
> /* We are interested in DImode promotion only. */
> - if ((GET_MODE (src) != DImode
> + if ((GET_MODE (src) != mode
> && !CONST_INT_P (src))
> - || GET_MODE (dst) != DImode)
> + || GET_MODE (dst) != mode)
> return false;
>
> if (!REG_P (dst) && !MEM_P (dst))
> @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx
> return false;
> break;
>
> + case SMAX:
> + case SMIN:
> + case UMAX:
> + case UMIN:
> + if ((mode == DImode && !TARGET_AVX512F)
> + || (mode == SImode && !TARGET_SSE4_1))
> + return false;
> + /* Fallthru. */
> +
> case PLUS:
> case MINUS:
> case IOR:
> @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx
> && !CONST_INT_P (XEXP (src, 1)))
> return false;
>
> - if (GET_MODE (XEXP (src, 1)) != DImode
> + if (GET_MODE (XEXP (src, 1)) != mode
> && !CONST_INT_P (XEXP (src, 1)))
> return false;
> break;
> @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx
> || !REG_P (XEXP (XEXP (src, 0), 0))))
> return false;
>
> - if (GET_MODE (XEXP (src, 0)) != DImode
> + if (GET_MODE (XEXP (src, 0)) != mode
> && !CONST_INT_P (XEXP (src, 0)))
> return false;
>
> @@ -1383,19 +1467,13 @@ timode_scalar_to_vector_candidate_p (rtx
> return false;
> }
>
> -/* Return 1 if INSN may be converted into vector
> - instruction. */
> -
> -static bool
> -scalar_to_vector_candidate_p (rtx_insn *insn)
> -{
> - if (TARGET_64BIT)
> - return timode_scalar_to_vector_candidate_p (insn);
> - else
> - return dimode_scalar_to_vector_candidate_p (insn);
> -}
> +/* For a given bitmap of insn UIDs scans all instruction and
> + remove insn from CANDIDATES in case it has both convertible
> + and not convertible definitions.
>
> -/* The DImode version of remove_non_convertible_regs. */
> + All insns in a bitmap are conversion candidates according to
> + scalar_to_vector_candidate_p. Currently it implies all insns
> + are single_set. */
>
> static void
> dimode_remove_non_convertible_regs (bitmap candidates)
> @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm
> BITMAP_FREE (regs);
> }
>
> -/* For a given bitmap of insn UIDs scans all instruction and
> - remove insn from CANDIDATES in case it has both convertible
> - and not convertible definitions.
> -
> - All insns in a bitmap are conversion candidates according to
> - scalar_to_vector_candidate_p. Currently it implies all insns
> - are single_set. */
> -
> -static void
> -remove_non_convertible_regs (bitmap candidates)
> -{
> - if (TARGET_64BIT)
> - timode_remove_non_convertible_regs (candidates);
> - else
> - dimode_remove_non_convertible_regs (candidates);
> -}
> -
> /* Main STV pass function. Find and convert scalar
> instructions into vector mode when profitable. */
>
> @@ -1577,11 +1638,14 @@ static unsigned int
> convert_scalars_to_vector ()
> {
> basic_block bb;
> - bitmap candidates;
> int converted_insns = 0;
>
> bitmap_obstack_initialize (NULL);
> - candidates = BITMAP_ALLOC (NULL);
> + const machine_mode cand_mode[3] = { SImode, DImode, TImode };
> + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
> + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */
> + for (unsigned i = 0; i < 3; ++i)
> + bitmap_initialize (&candidates[i], &bitmap_default_obstack);
>
> calculate_dominance_info (CDI_DOMINATORS);
> df_set_flags (DF_DEFER_INSN_RESCAN);
> @@ -1597,51 +1661,73 @@ convert_scalars_to_vector ()
> {
> rtx_insn *insn;
> FOR_BB_INSNS (bb, insn)
> - if (scalar_to_vector_candidate_p (insn))
> + if (TARGET_64BIT
> + && timode_scalar_to_vector_candidate_p (insn))
> {
> if (dump_file)
> - fprintf (dump_file, " insn %d is marked as a candidate\n",
> + fprintf (dump_file, " insn %d is marked as a TImode candidate\n",
> INSN_UID (insn));
>
> - bitmap_set_bit (candidates, INSN_UID (insn));
> + bitmap_set_bit (&candidates[2], INSN_UID (insn));
> + }
> + else
> + {
> + /* Check {SI,DI}mode. */
> + for (unsigned i = 0; i <= 1; ++i)
> + if (dimode_scalar_to_vector_candidate_p (insn, cand_mode[i]))
> + {
> + if (dump_file)
> + fprintf (dump_file, " insn %d is marked as a %s candidate\n",
> + INSN_UID (insn), i == 0 ? "SImode" : "DImode");
> +
> + bitmap_set_bit (&candidates[i], INSN_UID (insn));
> + break;
> + }
> }
> }
>
> - remove_non_convertible_regs (candidates);
> + if (TARGET_64BIT)
> + timode_remove_non_convertible_regs (&candidates[2]);
> + for (unsigned i = 0; i <= 1; ++i)
> + dimode_remove_non_convertible_regs (&candidates[i]);
>
> - if (bitmap_empty_p (candidates))
> - if (dump_file)
> + for (unsigned i = 0; i <= 2; ++i)
> + if (!bitmap_empty_p (&candidates[i]))
> + break;
> + else if (i == 2 && dump_file)
> fprintf (dump_file, "There are no candidates for optimization.\n");
>
> - while (!bitmap_empty_p (candidates))
> - {
> - unsigned uid = bitmap_first_set_bit (candidates);
> - scalar_chain *chain;
> + for (unsigned i = 0; i <= 2; ++i)
> + while (!bitmap_empty_p (&candidates[i]))
> + {
> + unsigned uid = bitmap_first_set_bit (&candidates[i]);
> + scalar_chain *chain;
>
> - if (TARGET_64BIT)
> - chain = new timode_scalar_chain;
> - else
> - chain = new dimode_scalar_chain;
> + if (cand_mode[i] == TImode)
> + chain = new timode_scalar_chain;
> + else
> + chain = new dimode_scalar_chain (cand_mode[i], cand_vmode[i]);
>
> - /* Find instructions chain we want to convert to vector mode.
> - Check all uses and definitions to estimate all required
> - conversions. */
> - chain->build (candidates, uid);
> + /* Find instructions chain we want to convert to vector mode.
> + Check all uses and definitions to estimate all required
> + conversions. */
> + chain->build (&candidates[i], uid);
>
> - if (chain->compute_convert_gain () > 0)
> - converted_insns += chain->convert ();
> - else
> - if (dump_file)
> - fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> - chain->chain_id);
> + if (chain->compute_convert_gain () > 0)
> + converted_insns += chain->convert ();
> + else
> + if (dump_file)
> + fprintf (dump_file, "Chain #%d conversion is not profitable\n",
> + chain->chain_id);
>
> - delete chain;
> - }
> + delete chain;
> + }
>
> if (dump_file)
> fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
>
> - BITMAP_FREE (candidates);
> + for (unsigned i = 0; i <= 2; ++i)
> + bitmap_release (&candidates[i]);
> bitmap_obstack_release (NULL);
> df_process_deferred_rescans ();
>
> Index: gcc/config/i386/i386-features.h
> ===================================================================
> --- gcc/config/i386/i386-features.h (revision 274111)
> +++ gcc/config/i386/i386-features.h (working copy)
> @@ -127,11 +127,16 @@ namespace {
> class scalar_chain
> {
> public:
> - scalar_chain ();
> + scalar_chain (enum machine_mode, enum machine_mode);
> virtual ~scalar_chain ();
>
> static unsigned max_id;
>
> + /* Scalar mode. */
> + enum machine_mode smode;
> + /* Vector mode. */
> + enum machine_mode vmode;
> +
> /* ID of a chain. */
> unsigned int chain_id;
> /* A queue of instructions to be included into a chain. */
> @@ -162,6 +167,8 @@ class scalar_chain
> class dimode_scalar_chain : public scalar_chain
> {
> public:
> + dimode_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
> + : scalar_chain (smode_, vmode_) {}
> int compute_convert_gain ();
> private:
> void mark_dual_mode_def (df_ref def);
> @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
> class timode_scalar_chain : public scalar_chain
> {
> public:
> + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
> +
> /* Convert from TImode to V1TImode is always faster. */
> int compute_convert_gain () { return 1; }
>
> Index: gcc/config/i386/i386.md
> ===================================================================
> --- gcc/config/i386/i386.md (revision 274111)
> +++ gcc/config/i386/i386.md (working copy)
> @@ -17721,6 +17721,27 @@ (define_peephole2
> std::swap (operands[4], operands[5]);
> })
>
> +;; min/max patterns
> +
> +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
> +
> +(define_insn_and_split "<code><mode>3"
> + [(set (match_operand:SWI48 0 "register_operand")
> + (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
> + (match_operand:SWI48 2 "register_operand")))
> + (clobber (reg:CC FLAGS_REG))]
> + "TARGET_STV && TARGET_SSE4_1
> + && can_create_pseudo_p ()"
> + "#"
> + "&& 1"
> + [(set (reg:CCGC FLAGS_REG)
> + (compare:CCGC (match_dup 1)(match_dup 2)))
> + (set (match_dup 0)
> + (if_then_else:SWI48
> + (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0))
> + (match_dup 1)
> + (match_dup 2)))])
> +
> ;; Conditional addition patterns
> (define_expand "add<mode>cc"
> [(match_operand:SWI 0 "register_operand")
next prev parent reply other threads:[~2019-08-05 12:44 UTC|newest]
Thread overview: 61+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-07-23 14:03 Richard Biener
2019-07-24 9:14 ` Richard Biener
2019-07-24 11:30 ` Richard Biener
2019-07-24 15:12 ` Jeff Law
2019-07-27 10:07 ` Uros Bizjak
2019-08-09 22:15 ` Jeff Law
2019-07-25 9:15 ` Martin Jambor
2019-07-25 12:57 ` Richard Biener
2019-07-27 11:14 ` Uros Bizjak
2019-07-27 18:23 ` Uros Bizjak
2019-07-31 12:01 ` Richard Biener
2019-08-01 8:54 ` Uros Bizjak
2019-08-01 9:28 ` Richard Biener
2019-08-01 9:38 ` Uros Bizjak
2019-08-03 17:26 ` Richard Biener
2019-08-04 17:11 ` Uros Bizjak
2019-08-04 17:23 ` Jakub Jelinek
2019-08-04 17:36 ` Uros Bizjak
2019-08-05 8:47 ` Richard Biener
2019-08-05 9:13 ` Richard Sandiford
2019-08-05 10:08 ` Uros Bizjak
2019-08-05 10:12 ` Richard Sandiford
2019-08-05 10:24 ` Uros Bizjak
2019-08-05 10:39 ` Richard Sandiford
2019-08-05 11:50 ` Richard Biener
2019-08-05 11:59 ` Uros Bizjak
2019-08-05 12:16 ` Richard Biener
2019-08-05 12:23 ` Uros Bizjak
2019-08-05 12:33 ` Uros Bizjak
2019-08-08 16:23 ` Jeff Law
2019-08-05 12:44 ` Uros Bizjak [this message]
2019-08-05 12:51 ` Uros Bizjak
2019-08-05 12:54 ` Jakub Jelinek
2019-08-05 12:57 ` Uros Bizjak
2019-08-05 13:04 ` Richard Biener
2019-08-05 13:09 ` Uros Bizjak
2019-08-05 13:29 ` Richard Biener
2019-08-05 19:35 ` Uros Bizjak
2019-08-07 9:52 ` Richard Biener
2019-08-07 12:04 ` Richard Biener
2019-08-07 12:11 ` Uros Bizjak
2019-08-07 12:42 ` Uros Bizjak
2019-08-07 12:58 ` Uros Bizjak
2019-08-07 13:00 ` Richard Biener
2019-08-07 13:32 ` Uros Bizjak
2019-08-07 14:15 ` Richard Biener
2019-08-09 7:28 ` Uros Bizjak
2019-08-09 10:13 ` Richard Biener
2019-08-09 10:26 ` Jakub Jelinek
2019-08-09 11:15 ` Richard Biener
2019-08-09 11:06 ` Richard Biener
2019-08-09 13:13 ` Richard Biener
2019-08-09 14:39 ` Uros Bizjak
2019-08-12 12:57 ` Richard Biener
2019-08-12 14:48 ` Uros Bizjak
2019-08-13 16:28 ` Jeff Law
2019-08-13 20:07 ` H.J. Lu
2019-08-15 9:24 ` Uros Bizjak
2019-08-13 15:20 ` Jeff Law
2019-08-14 9:15 ` Richard Biener
2019-08-14 9:36 ` Uros Bizjak
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAFULd4Y9AHNpGyrsygqxQ6GtVeu5pyvQQt8Q7owHttuZFhbFbQ@mail.gmail.com \
--to=ubizjak@gmail.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=jakub@redhat.com \
--cc=rguenther@suse.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).