Date: Mon, 12 Aug 2019 12:57:00 -0000
From: Richard Biener
To: Uros Bizjak
Cc: Jakub Jelinek, gcc-patches@gcc.gnu.org, Jeff Law, hjl.tools@gmail.com
Subject: Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
References: <20190805125358.GR2726@tucnak>

On Fri, 9 Aug 2019, Uros Bizjak wrote:

> On Fri, Aug 9, 2019 at 3:00 PM Richard Biener wrote:
> > > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for the DImode
> > > > > > > > > > condition; I'll provide a _doubleword splitter later.
> > > > > > > > >
> > > > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > > > > to force use of %zmmN?
> > > > > > > >
> > > > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > > > >
> > > > > > >     case SMAX:
> > > > > > >     case SMIN:
> > > > > > >     case UMAX:
> > > > > > >     case UMIN:
> > > > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > > > >         return false;
> > > > > > >
> > > > > > > so there's no way to use AVX512VL for 32bit?
> > > > > >
> > > > > > There is a way, but on 32bit targets we need to split the DImode
> > > > > > operation into a sequence of SImode operations for the unconverted
> > > > > > pattern.  This is of course doable, but somehow more complex than
> > > > > > simply emitting a DImode compare + DImode cmove, which is what the
> > > > > > current splitter does.  So, a follow-up task.
> > > > >
> > > > > Please find attached the complete .md part that enables SImode for
> > > > > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both 32-bit and
> > > > > 64-bit targets.  The patterns also allow for a memory operand 2, so
> > > > > STV has a chance to create the vector pattern with an implicit load.
> > > > > In case STV fails, the memory operand 2 is loaded into a register
> > > > > first; operand 2 is used in the compare and cmove instruction, so
> > > > > pre-loading the operand should be beneficial.
> > > >
> > > > Thanks.
> > > >
> > > > > Also note that splitting should happen rarely.  Due to the cost
> > > > > function, STV should effectively always convert minmax to a vector
> > > > > insn.
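The trade-off described in the quoted exchange — the current splitter emits one DImode compare plus cmove rather than splitting into per-half SImode operations — can be sketched at the source level as follows (an illustrative sketch only; `smax_di_scalar` is a hypothetical name, not GCC code):

```c
#include <stdint.h>

/* Illustrative sketch: the scalar fallback the current splitter
   prefers for a DImode smax on 32-bit targets amounts to a full
   double-word compare followed by a conditional move, instead of
   a sequence of per-half SImode operations.  */
static int64_t
smax_di_scalar (int64_t a, int64_t b)
{
  return a > b ? a : b;   /* one compare + one cmove */
}
```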
> > > > I've analyzed the 464.h264ref slowdown on Haswell and it is due to
> > > > this kind of "simple" conversion:
> > > >
> > > >   5.50 1d0:   test   %esi,%esi
> > > >   0.07        mov    $0x0,%eax
> > > >               cmovs  %eax,%esi
> > > >   5.84        imul   %r8d,%esi
> > > >
> > > > to
> > > >
> > > >   0.65 1e0:   vpxor  %xmm0,%xmm0,%xmm0
> > > >   0.32        vpmaxsd -0x10(%rsp),%xmm0,%xmm0
> > > >  40.45        vmovd  %xmm0,%eax
> > > >   2.45        imul   %r8d,%eax
> > > >
> > > > which looks like an RA artifact in the end.  We spill %esi only
> > > > with -mstv here as STV introduces a (subreg:V4SI ...) use
> > > > of a pseudo ultimately set from di.  STV creates an additional
> > > > pseudo for this (copy-in) but it places that copy next to the
> > > > original def rather than next to the start of the chain it
> > > > converts, which is probably why we spill.  And this
> > > > is because it inserts those at each definition of the pseudo
> > > > rather than just at the reaching definition(s) or at the
> > > > uses of the pseudo in the chain (that is because there may be
> > > > defs of that pseudo in the chain itself).  Note that STV emits
> > > > such "conversion" copies as simple reg-reg moves:
> > > >
> > > >   (insn 1094 3 4 2 (set (reg:SI 777)
> > > >           (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
> > > >        (nil))
> > > >
> > > > but those do not prevail very long (this one gets removed by CSE2).
> > > > So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
> > > > and computes
> > > >
> > > >   r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
> > > >     a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618
> > > >
> > > > so I wonder if STV shouldn't instead emit gpr->xmm moves
> > > > here (but I guess nothing again prevents RTL optimizers from
> > > > combining that with the single-use in the max instruction...).
> > > > So this boils down to STV splitting live ranges but other
> > > > passes undoing that, and then RA not considering splitting
> > > > live ranges here, arriving at a suboptimal allocation.
> > > >
> > > > A testcase showing this issue is (simplified from 464.h264ref
> > > > UMVLine16Y_11):
> > > >
> > > > unsigned short
> > > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> > > > {
> > > >   if (y != width)
> > > >     {
> > > >       y = y < 0 ? 0 : y;
> > > >       return Pic[y * width];
> > > >     }
> > > >   return Pic[y];
> > > > }
> > > >
> > > > where the condition and the Pic[y] load mimic the other use of y.
> > > > Different, even worse spilling is generated by
> > > >
> > > > unsigned short
> > > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> > > > {
> > > >   y = y < 0 ? 0 : y;
> > > >   return Pic[y * width] + y;
> > > > }
> > > >
> > > > I guess this all shows that STV's "trick" of simply wrapping
> > > > integer mode pseudos in (subreg:vector-mode ...) is bad?
> > > >
> > > > I've added a (failing) testcase to reflect the above.
> > >
> > > Experimenting a bit with using V4SImode pseudos just for the
> > > conversion insns, we end up preserving those moves (but I
> > > do have to use a lowpart set; using reg:V4SI = subreg:V4SI SImode-reg
> > > ends up using movv4si_internal which only leaves us with
> > > memory for the SImode operand) _plus_ moving the move next
> > > to the actual use has an effect.  Not necessarily a good one
> > > though:
> > >
> > >   vpxor   %xmm0, %xmm0, %xmm0
> > >   vmovaps %xmm0, -16(%rsp)
> > >   movl    %esi, -16(%rsp)
> > >   vpmaxsd -16(%rsp), %xmm0, %xmm0
> > >   vmovd   %xmm0, %eax
> > >
> > > eh?  I guess the lowpart set is not good (my patch has this
> > > as well, but I got saved by never having vector modes to subset...).
> > > Using
> > >
> > >   (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ]))
> > >       (const_vector:V4SI [
> > >               (const_int 0 [0]) repeated x4
> > >           ])
> > >       (const_int 1 [0x1]))) "t3.c":5:10 -1
> > >
> > > for the move ends up with
> > >
> > >   vpxor   %xmm1, %xmm1, %xmm1
> > >   vpinsrd $0, %esi, %xmm1, %xmm0
> > >
> > > eh?  LRA chooses the correct alternative here but somehow
> > > postreload CSE CSEs the zero with the xmm1 clearing, leading
> > > to the vpinsrd...  (I guess a general issue, not sure if really
> > > worse - definitely a larger instruction).  Unfortunately
> > > postreload-cse doesn't add a REG_EQUAL note.  This happens only
> > > when emitting the reg move before the use; not doing that emits
> > > a vmovd as expected.
> > >
> > > At least the spilling is gone here.
> > >
> > > I am re-testing as follows; the main change is that
> > > general_scalar_chain::make_vector_copies now generates a
> > > vector pseudo as destination (and I've fixed up the code
> > > to not generate (subreg:V4SI (reg:V4SI 1234) 0)).
> > >
> > > Hope this fixes the observed slowdowns (it fixes the new testcase).
> >
> > It fixes the slowdown observed in 416.gamess and 464.h264ref.
> >
> > Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress.
> >
> > CCing Jeff who "knows RTL".
> >
> > OK?
>
> Please add -mno-stv to gcc.target/i386/minmax-{1,2}.c to avoid
> spurious test failures on SSE4.1 targets.

Done.  I've also adjusted the i386.md ChangeLog entry as follows:

	* config/i386/i386.md (3): New expander.
	(*3_1): New insn-and-split.
	(*di3_doubleword): Likewise.

I see

  FAIL: gcc.target/i386/pr65105-3.c scan-assembler ptest
  FAIL: gcc.target/i386/pr65105-5.c scan-assembler ptest
  FAIL: gcc.target/i386/pr78794.c scan-assembler pandn

with the latest patch (this is with -m32), where -mstv causes all
spills to go away and the cmoves to be replaced (so clearly better
code after the patch) for pr65105-5.c; no obvious improvement for
pr65105-3.c, where cmov does appear with -mstv.
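For reference, the kind of scalar min/max chain the new patterns let STV convert is the clamp-to-zero idiom from the h264ref analysis earlier in the thread (a hypothetical reduction, not one of the actual testcases):

```c
/* A minimal clamp-to-zero: GCC canonicalizes the ternary to an SImode
   smax, which STV can now turn into pmaxsd (SSE4.1) instead of the
   scalar test/cmovs sequence shown earlier in the thread.  */
static int
clamp_nonneg (int y)
{
  return y < 0 ? 0 : y;
}
```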
I'd rather not "fix" those by adding -mno-stv but instead have the
Intel people fix the costing for slm and/or decide what to do.  For
pr65105-3.c I'm not sure why if-conversion didn't choose to use cmov,
so clearly the enabled minmax patterns expose the "failure" here.

I've also seen a 32-bit ICE for a bogus store we create with the
live-range splitting fix; that is fixed in the patch below
(convert_insn REG src handling with a MEM dst needs to account for a
vector-mode src case).

Maybe it would help to split out the changes unrelated to {DI,SI}mode
chain support from the STV costing, and also separately install the
live-range splitting "fix"?  I'm willing to do some more legwork to
make review and approval easier here.

Anyway, bootstrapped & tested on x86_64-unknown-linux-gnu.  I've
re-checked SPEC CPU 2006 on Haswell with no changes over the previous
results.

Thanks,
Richard.

2019-08-12  Richard Biener

	PR target/91154
	* config/i386/i386-features.h (scalar_chain::scalar_chain): Add
	mode arguments.
	(scalar_chain::smode): New member.
	(scalar_chain::vmode): Likewise.
	(dimode_scalar_chain): Rename to...
	(general_scalar_chain): ... this.
	(general_scalar_chain::general_scalar_chain): Take mode arguments.
	(timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
	base with TImode and V1TImode.
	* config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
	(general_scalar_chain::vector_const_cost): Adjust for SImode
	chains.
	(general_scalar_chain::compute_convert_gain): Likewise.  Fix
	reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
	scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
	gain if not zero.
	(general_scalar_chain::replace_with_subreg): Use vmode/smode.
	Elide the subreg if the reg is already vector.
	(general_scalar_chain::make_vector_copies): Likewise.  Handle
	non-DImode chains appropriately.  Use a vector-mode pseudo as
	destination.
	(general_scalar_chain::convert_reg): Likewise.
	(general_scalar_chain::convert_op): Likewise.
Elide the subreg if the reg is already vector. (general_scalar_chain::convert_insn): Likewise. Add fatal_insn_not_found if the result is not recognized. (convertible_comparison_p): Pass in the scalar mode and use that. (general_scalar_to_vector_candidate_p): Likewise. Rename from dimode_scalar_to_vector_candidate_p. Add {S,U}{MIN,MAX} support. (scalar_to_vector_candidate_p): Remove by inlining into single caller. (general_remove_non_convertible_regs): Rename from dimode_remove_non_convertible_regs. (remove_non_convertible_regs): Remove by inlining into single caller. (convert_scalars_to_vector): Handle SImode and DImode chains in addition to TImode chains. * config/i386/i386.md (3): New expander. (*3_1): New insn-and-split. (*di3_doubleword): Likewise. * gcc.target/i386/pr91154.c: New testcase. * gcc.target/i386/minmax-3.c: Likewise. * gcc.target/i386/minmax-4.c: Likewise. * gcc.target/i386/minmax-5.c: Likewise. * gcc.target/i386/minmax-6.c: Likewise. * gcc.target/i386/minmax-1.c: Add -mno-stv. * gcc.target/i386/minmax-2.c: Likewise. Index: gcc/config/i386/i386-features.c =================================================================== --- gcc/config/i386/i386-features.c (revision 274278) +++ gcc/config/i386/i386-features.c (working copy) @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0; /* Initialize new chain. */ -scalar_chain::scalar_chain () +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) { + smode = smode_; + vmode = vmode_; + chain_id = ++max_id; if (dump_file) @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins conversion. */ void -dimode_scalar_chain::mark_dual_mode_def (df_ref def) +general_scalar_chain::mark_dual_mode_def (df_ref def) { gcc_assert (DF_REF_REG_DEF_P (def)); @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate && !HARD_REGISTER_P (SET_DEST (def_set))) bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); + /* ??? 
The following is quadratic since analyze_register_chain + iterates over all refs to look for dual-mode regs. Instead this + should be done separately for all regs mentioned in the chain once. */ df_ref ref; df_ref def; for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates, instead of using a scalar one. */ int -dimode_scalar_chain::vector_const_cost (rtx exp) +general_scalar_chain::vector_const_cost (rtx exp) { gcc_assert (CONST_INT_P (exp)); - if (standard_sse_constant_p (exp, V2DImode)) - return COSTS_N_INSNS (1); - return ix86_cost->sse_load[1]; + if (standard_sse_constant_p (exp, vmode)) + return ix86_cost->sse_op; + /* We have separate costs for SImode and DImode, use SImode costs + for smaller modes. */ + return ix86_cost->sse_load[smode == DImode ? 1 : 0]; } /* Compute a gain for chain conversion. */ int -dimode_scalar_chain::compute_convert_gain () +general_scalar_chain::compute_convert_gain () { bitmap_iterator bi; unsigned insn_uid; @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai if (dump_file) fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); + /* SSE costs distinguish between SImode and DImode loads/stores, for + int costs factor in the number of GPRs involved. When supporting + smaller modes than SImode the int load/store costs need to be + adjusted as well. */ + unsigned sse_cost_idx = smode == DImode ? 1 : 0; + unsigned m = smode == DImode ? (TARGET_64BIT ? 
1 : 2) : 1; + EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) { rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; rtx def_set = single_set (insn); rtx src = SET_SRC (def_set); rtx dst = SET_DEST (def_set); + int igain = 0; if (REG_P (src) && REG_P (dst)) - gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move; + igain += 2 * m - ix86_cost->xmm_move; else if (REG_P (src) && MEM_P (dst)) - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; + igain + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; else if (MEM_P (src) && REG_P (dst)) - gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; else if (GET_CODE (src) == ASHIFT || GET_CODE (src) == ASHIFTRT || GET_CODE (src) == LSHIFTRT) { if (CONST_INT_P (XEXP (src, 0))) - gain -= vector_const_cost (XEXP (src, 0)); - gain += ix86_cost->shift_const; + igain -= vector_const_cost (XEXP (src, 0)); + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; if (INTVAL (XEXP (src, 1)) >= 32) - gain -= COSTS_N_INSNS (1); + igain -= COSTS_N_INSNS (1); } else if (GET_CODE (src) == PLUS || GET_CODE (src) == MINUS @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai || GET_CODE (src) == XOR || GET_CODE (src) == AND) { - gain += ix86_cost->add; + igain += m * ix86_cost->add - ix86_cost->sse_op; /* Additional gain for andnot for targets without BMI. 
*/ if (GET_CODE (XEXP (src, 0)) == NOT && !TARGET_BMI) - gain += 2 * ix86_cost->add; + igain += m * ix86_cost->add; if (CONST_INT_P (XEXP (src, 0))) - gain -= vector_const_cost (XEXP (src, 0)); + igain -= vector_const_cost (XEXP (src, 0)); if (CONST_INT_P (XEXP (src, 1))) - gain -= vector_const_cost (XEXP (src, 1)); + igain -= vector_const_cost (XEXP (src, 1)); } else if (GET_CODE (src) == NEG || GET_CODE (src) == NOT) - gain += ix86_cost->add - COSTS_N_INSNS (1); + igain += m * ix86_cost->add - ix86_cost->sse_op; + else if (GET_CODE (src) == SMAX + || GET_CODE (src) == SMIN + || GET_CODE (src) == UMAX + || GET_CODE (src) == UMIN) + { + /* We do not have any conditional move cost, estimate it as a + reg-reg move. Comparisons are costed as adds. */ + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); + /* Integer SSE ops are all costed the same. */ + igain -= ix86_cost->sse_op; + } else if (GET_CODE (src) == COMPARE) { /* Assume comparison cost is the same. */ @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai else if (CONST_INT_P (src)) { if (REG_P (dst)) - gain += COSTS_N_INSNS (2); + /* DImode can be immediate for TARGET_64BIT and SImode always. */ + igain += COSTS_N_INSNS (m); else if (MEM_P (dst)) - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; - gain -= vector_const_cost (src); + igain += (m * ix86_cost->int_store[2] + - ix86_cost->sse_store[sse_cost_idx]); + igain -= vector_const_cost (src); } else gcc_unreachable (); + + if (igain != 0 && dump_file) + { + fprintf (dump_file, " Instruction gain %d for ", igain); + dump_insn_slim (dump_file, insn); + } + gain += igain; } if (dump_file) fprintf (dump_file, " Instruction conversion gain: %d\n", gain); + /* ??? What about integer to SSE? */ EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; @@ -570,10 +608,11 @@ dimode_scalar_chain::compute_convert_gai /* Replace REG in X with a V2DI subreg of NEW_REG. 
*/ rtx -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) { if (x == reg) - return gen_rtx_SUBREG (V2DImode, new_reg, 0); + return (GET_MODE (new_reg) == vmode + ? new_reg : gen_rtx_SUBREG (vmode, new_reg, 0)); const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); int i, j; @@ -593,7 +632,7 @@ dimode_scalar_chain::replace_with_subreg /* Replace REG in INSN with a V2DI subreg of NEW_REG. */ void -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, rtx reg, rtx new_reg) { replace_with_subreg (single_set (insn), reg, new_reg); @@ -624,10 +663,10 @@ scalar_chain::emit_conversion_insns (rtx and replace its uses in a chain. */ void -dimode_scalar_chain::make_vector_copies (unsigned regno) +general_scalar_chain::make_vector_copies (unsigned regno) { rtx reg = regno_reg_rtx[regno]; - rtx vreg = gen_reg_rtx (DImode); + rtx vreg = gen_reg_rtx (vmode); df_ref ref; for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) @@ -636,36 +675,59 @@ dimode_scalar_chain::make_vector_copies start_sequence (); if (!TARGET_INTER_UNIT_MOVES_TO_VEC) { - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); - emit_move_insn (adjust_address (tmp, SImode, 0), - gen_rtx_SUBREG (SImode, reg, 0)); - emit_move_insn (adjust_address (tmp, SImode, 4), - gen_rtx_SUBREG (SImode, reg, 4)); - emit_move_insn (vreg, tmp); + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); + if (smode == DImode && !TARGET_64BIT) + { + emit_move_insn (adjust_address (tmp, SImode, 0), + gen_rtx_SUBREG (SImode, reg, 0)); + emit_move_insn (adjust_address (tmp, SImode, 4), + gen_rtx_SUBREG (SImode, reg, 4)); + } + else + emit_move_insn (tmp, reg); + emit_move_insn (vreg, + gen_rtx_VEC_MERGE (vmode, + gen_rtx_VEC_DUPLICATE (vmode, + tmp), + CONST0_RTX (vmode), + GEN_INT (HOST_WIDE_INT_1U))); + } - else if (TARGET_SSE4_1) + else if 
(!TARGET_64BIT && smode == DImode) { - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 0))); - emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (SImode, reg, 4), - GEN_INT (2))); + if (TARGET_SSE4_1) + { + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 0))); + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (SImode, reg, 4), + GEN_INT (2))); + } + else + { + rtx tmp = gen_reg_rtx (DImode); + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 0))); + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 4))); + emit_insn (gen_vec_interleave_lowv4si + (gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, tmp, 0))); + } } else { - rtx tmp = gen_reg_rtx (DImode); - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 0))); - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 4))); - emit_insn (gen_vec_interleave_lowv4si - (gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, tmp, 0))); + emit_move_insn (vreg, + gen_rtx_VEC_MERGE (vmode, + gen_rtx_VEC_DUPLICATE (vmode, + reg), + CONST0_RTX (vmode), + GEN_INT (HOST_WIDE_INT_1U))); } rtx_insn *seq = get_insns (); end_sequence (); @@ -695,7 +757,7 @@ dimode_scalar_chain::make_vector_copies in case register is used in not convertible insn. 
*/ void -dimode_scalar_chain::convert_reg (unsigned regno) +general_scalar_chain::convert_reg (unsigned regno) { bool scalar_copy = bitmap_bit_p (defs_conv, regno); rtx reg = regno_reg_rtx[regno]; @@ -707,7 +769,7 @@ dimode_scalar_chain::convert_reg (unsign bitmap_copy (conv, insns); if (scalar_copy) - scopy = gen_reg_rtx (DImode); + scopy = gen_reg_rtx (smode); for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) { @@ -727,40 +789,55 @@ dimode_scalar_chain::convert_reg (unsign start_sequence (); if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) { - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); emit_move_insn (tmp, reg); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), - adjust_address (tmp, SImode, 0)); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), - adjust_address (tmp, SImode, 4)); + if (!TARGET_64BIT && smode == DImode) + { + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), + adjust_address (tmp, SImode, 0)); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), + adjust_address (tmp, SImode, 4)); + } + else + emit_move_insn (scopy, tmp); } - else if (TARGET_SSE4_1) + else if (!TARGET_64BIT && smode == DImode) { - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); - emit_insn - (gen_rtx_SET - (gen_rtx_SUBREG (SImode, scopy, 0), - gen_rtx_VEC_SELECT (SImode, - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); - - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); - emit_insn - (gen_rtx_SET - (gen_rtx_SUBREG (SImode, scopy, 4), - gen_rtx_VEC_SELECT (SImode, - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); + if (TARGET_SSE4_1) + { + rtx tmp = gen_rtx_PARALLEL (VOIDmode, + gen_rtvec (1, const0_rtx)); + emit_insn + (gen_rtx_SET + (gen_rtx_SUBREG (SImode, scopy, 0), + gen_rtx_VEC_SELECT (SImode, + gen_rtx_SUBREG (V4SImode, reg, 0), + tmp))); + + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); + emit_insn + (gen_rtx_SET + (gen_rtx_SUBREG (SImode, 
scopy, 4), + gen_rtx_VEC_SELECT (SImode, + gen_rtx_SUBREG (V4SImode, reg, 0), + tmp))); + } + else + { + rtx vcopy = gen_reg_rtx (V2DImode); + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), + gen_rtx_SUBREG (SImode, vcopy, 0)); + emit_move_insn (vcopy, + gen_rtx_LSHIFTRT (V2DImode, + vcopy, GEN_INT (32))); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), + gen_rtx_SUBREG (SImode, vcopy, 0)); + } } else - { - rtx vcopy = gen_reg_rtx (V2DImode); - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), - gen_rtx_SUBREG (SImode, vcopy, 0)); - emit_move_insn (vcopy, - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), - gen_rtx_SUBREG (SImode, vcopy, 0)); - } + emit_move_insn (scopy, reg); + rtx_insn *seq = get_insns (); end_sequence (); emit_conversion_insns (seq, insn); @@ -809,21 +886,21 @@ dimode_scalar_chain::convert_reg (unsign registers conversion. 
*/ void -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn) +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn) { *op = copy_rtx_if_shared (*op); if (GET_CODE (*op) == NOT) { convert_op (&XEXP (*op, 0), insn); - PUT_MODE (*op, V2DImode); + PUT_MODE (*op, vmode); } else if (MEM_P (*op)) { - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (*op)); emit_insn_before (gen_move_insn (tmp, *op), insn); - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); + *op = gen_rtx_SUBREG (vmode, tmp, 0); if (dump_file) fprintf (dump_file, " Preloading operand for insn %d into r%d\n", @@ -841,24 +918,31 @@ dimode_scalar_chain::convert_op (rtx *op gcc_assert (!DF_REF_CHAIN (ref)); break; } - *op = gen_rtx_SUBREG (V2DImode, *op, 0); + if (GET_MODE (*op) != vmode) + *op = gen_rtx_SUBREG (vmode, *op, 0); } else if (CONST_INT_P (*op)) { rtx vec_cst; - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); /* Prefer all ones vector in case of -1. */ if (constm1_operand (*op, GET_MODE (*op))) - vec_cst = CONSTM1_RTX (V2DImode); + vec_cst = CONSTM1_RTX (vmode); else - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, - gen_rtvec (2, *op, const0_rtx)); + { + unsigned n = GET_MODE_NUNITS (vmode); + rtx *v = XALLOCAVEC (rtx, n); + v[0] = *op; + for (unsigned i = 1; i < n; ++i) + v[i] = const0_rtx; + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); + } - if (!standard_sse_constant_p (vec_cst, V2DImode)) + if (!standard_sse_constant_p (vec_cst, vmode)) { start_sequence (); - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); rtx_insn *seq = get_insns (); end_sequence (); emit_insn_before (seq, insn); @@ -870,14 +954,14 @@ dimode_scalar_chain::convert_op (rtx *op else { gcc_assert (SUBREG_P (*op)); - gcc_assert (GET_MODE (*op) == V2DImode); + gcc_assert (GET_MODE (*op) == vmode); } } /* Convert INSN to vector mode. 
*/ void -dimode_scalar_chain::convert_insn (rtx_insn *insn) +general_scalar_chain::convert_insn (rtx_insn *insn) { rtx def_set = single_set (insn); rtx src = SET_SRC (def_set); @@ -888,9 +972,9 @@ dimode_scalar_chain::convert_insn (rtx_i { /* There are no scalar integer instructions and therefore temporary register usage is required. */ - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (dst)); emit_conversion_insns (gen_move_insn (dst, tmp), insn); - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); + dst = gen_rtx_SUBREG (vmode, tmp, 0); } switch (GET_CODE (src)) @@ -899,7 +983,7 @@ dimode_scalar_chain::convert_insn (rtx_i case ASHIFTRT: case LSHIFTRT: convert_op (&XEXP (src, 0), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case PLUS: @@ -907,25 +991,29 @@ dimode_scalar_chain::convert_insn (rtx_i case IOR: case XOR: case AND: + case SMAX: + case SMIN: + case UMAX: + case UMIN: convert_op (&XEXP (src, 0), insn); convert_op (&XEXP (src, 1), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case NEG: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); - src = gen_rtx_MINUS (V2DImode, subreg, src); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); + src = gen_rtx_MINUS (vmode, subreg, src); break; case NOT: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); - src = gen_rtx_XOR (V2DImode, src, subreg); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); + src = gen_rtx_XOR (vmode, src, subreg); break; case MEM: @@ -936,20 +1024,22 @@ dimode_scalar_chain::convert_insn (rtx_i case REG: if (!MEM_P (dst)) convert_op (&src, insn); + else if (GET_MODE (src) != smode) + src = gen_rtx_SUBREG (smode, src, 0); break; case 
     case SUBREG:
-      gcc_assert (GET_MODE (src) == V2DImode);
+      gcc_assert (GET_MODE (src) == vmode);
       break;

     case COMPARE:
       src = SUBREG_REG (XEXP (XEXP (src, 0), 0));

-      gcc_assert ((REG_P (src) && GET_MODE (src) == DImode)
-                  || (SUBREG_P (src) && GET_MODE (src) == V2DImode));
+      gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode))
+                  || (SUBREG_P (src) && GET_MODE (src) == vmode));

       if (REG_P (src))
-        subreg = gen_rtx_SUBREG (V2DImode, src, 0);
+        subreg = gen_rtx_SUBREG (vmode, src, 0);
       else
        subreg = copy_rtx_if_shared (src);
       emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg),
@@ -977,7 +1067,9 @@ dimode_scalar_chain::convert_insn (rtx_i
   PATTERN (insn) = def_set;
   INSN_CODE (insn) = -1;
-  recog_memoized (insn);
+  int patt = recog_memoized (insn);
+  if (patt == -1)
+    fatal_insn_not_found (insn);
   df_insn_rescan (insn);
 }

@@ -1116,7 +1208,7 @@ timode_scalar_chain::convert_insn (rtx_i
 }

 void
-dimode_scalar_chain::convert_registers ()
+general_scalar_chain::convert_registers ()
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1186,7 +1278,7 @@ has_non_address_hard_reg (rtx_insn *insn
                     (const_int 0 [0])))  */

 static bool
-convertible_comparison_p (rtx_insn *insn)
+convertible_comparison_p (rtx_insn *insn, enum machine_mode mode)
 {
   if (!TARGET_SSE4_1)
     return false;
@@ -1219,12 +1311,12 @@ convertible_comparison_p (rtx_insn *insn
   if (!SUBREG_P (op1)
       || !SUBREG_P (op2)
-      || GET_MODE (op1) != SImode
-      || GET_MODE (op2) != SImode
+      || GET_MODE (op1) != mode
+      || GET_MODE (op2) != mode
       || ((SUBREG_BYTE (op1) != 0
-           || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode))
+           || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode))
          && (SUBREG_BYTE (op2) != 0
-             || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode))))
+             || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode))))
     return false;

   op1 = SUBREG_REG (op1);
@@ -1232,7 +1324,7 @@ convertible_comparison_p (rtx_insn *insn
   if (op1 != op2
       || !REG_P (op1)
-      || GET_MODE (op1) != DImode)
+      || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ())
     return false;

   return true;
@@ -1241,7 +1333,7 @@ convertible_comparison_p (rtx_insn *insn
 /* The DImode version of scalar_to_vector_candidate_p.  */

 static bool
-dimode_scalar_to_vector_candidate_p (rtx_insn *insn)
+general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode)
 {
   rtx def_set = single_set (insn);
@@ -1255,12 +1347,12 @@ dimode_scalar_to_vector_candidate_p (rtx
   rtx dst = SET_DEST (def_set);

   if (GET_CODE (src) == COMPARE)
-    return convertible_comparison_p (insn);
+    return convertible_comparison_p (insn, mode);

   /* We are interested in DImode promotion only.  */
-  if ((GET_MODE (src) != DImode
+  if ((GET_MODE (src) != mode
        && !CONST_INT_P (src))
-      || GET_MODE (dst) != DImode)
+      || GET_MODE (dst) != mode)
     return false;

   if (!REG_P (dst) && !MEM_P (dst))
@@ -1280,6 +1372,15 @@ dimode_scalar_to_vector_candidate_p (rtx
        return false;
       break;

+    case SMAX:
+    case SMIN:
+    case UMAX:
+    case UMIN:
+      if ((mode == DImode && !TARGET_AVX512VL)
+          || (mode == SImode && !TARGET_SSE4_1))
+        return false;
+      /* Fallthru.  */
+
     case PLUS:
     case MINUS:
     case IOR:
@@ -1290,7 +1391,7 @@ dimode_scalar_to_vector_candidate_p (rtx
          && !CONST_INT_P (XEXP (src, 1)))
        return false;

-      if (GET_MODE (XEXP (src, 1)) != DImode
+      if (GET_MODE (XEXP (src, 1)) != mode
          && !CONST_INT_P (XEXP (src, 1)))
        return false;
       break;
@@ -1319,7 +1420,7 @@ dimode_scalar_to_vector_candidate_p (rtx
              || !REG_P (XEXP (XEXP (src, 0), 0))))
        return false;

-      if (GET_MODE (XEXP (src, 0)) != DImode
+      if (GET_MODE (XEXP (src, 0)) != mode
          && !CONST_INT_P (XEXP (src, 0)))
        return false;
@@ -1383,22 +1484,16 @@ timode_scalar_to_vector_candidate_p (rtx
   return false;
 }

-/* Return 1 if INSN may be converted into vector
-   instruction.  */
-
-static bool
-scalar_to_vector_candidate_p (rtx_insn *insn)
-{
-  if (TARGET_64BIT)
-    return timode_scalar_to_vector_candidate_p (insn);
-  else
-    return dimode_scalar_to_vector_candidate_p (insn);
-}
+/* For a given bitmap of insn UIDs scans all instruction and
+   remove insn from CANDIDATES in case it has both convertible
+   and not convertible definitions.

-/* The DImode version of remove_non_convertible_regs.  */
+   All insns in a bitmap are conversion candidates according to
+   scalar_to_vector_candidate_p.  Currently it implies all insns
+   are single_set.  */

 static void
-dimode_remove_non_convertible_regs (bitmap candidates)
+general_remove_non_convertible_regs (bitmap candidates)
 {
   bitmap_iterator bi;
   unsigned id;
@@ -1553,23 +1648,6 @@ timode_remove_non_convertible_regs (bitm
   BITMAP_FREE (regs);
 }

-/* For a given bitmap of insn UIDs scans all instruction and
-   remove insn from CANDIDATES in case it has both convertible
-   and not convertible definitions.
-
-   All insns in a bitmap are conversion candidates according to
-   scalar_to_vector_candidate_p.  Currently it implies all insns
-   are single_set.  */
-
-static void
-remove_non_convertible_regs (bitmap candidates)
-{
-  if (TARGET_64BIT)
-    timode_remove_non_convertible_regs (candidates);
-  else
-    dimode_remove_non_convertible_regs (candidates);
-}
-
 /* Main STV pass function.  Find and convert scalar
    instructions into vector mode when profitable.  */

@@ -1577,11 +1655,14 @@ static unsigned int
 convert_scalars_to_vector ()
 {
   basic_block bb;
-  bitmap candidates;
   int converted_insns = 0;

   bitmap_obstack_initialize (NULL);
-  candidates = BITMAP_ALLOC (NULL);
+  const machine_mode cand_mode[3] = { SImode, DImode, TImode };
+  const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode };
+  bitmap_head candidates[3];  /* { SImode, DImode, TImode } */
+  for (unsigned i = 0; i < 3; ++i)
+    bitmap_initialize (&candidates[i], &bitmap_default_obstack);

   calculate_dominance_info (CDI_DOMINATORS);
   df_set_flags (DF_DEFER_INSN_RESCAN);
@@ -1597,51 +1678,73 @@ convert_scalars_to_vector ()
     {
       rtx_insn *insn;
       FOR_BB_INSNS (bb, insn)
-       if (scalar_to_vector_candidate_p (insn))
+       if (TARGET_64BIT
+           && timode_scalar_to_vector_candidate_p (insn))
          {
            if (dump_file)
-             fprintf (dump_file, "  insn %d is marked as a candidate\n",
+             fprintf (dump_file, "  insn %d is marked as a TImode candidate\n",
                       INSN_UID (insn));

-           bitmap_set_bit (candidates, INSN_UID (insn));
+           bitmap_set_bit (&candidates[2], INSN_UID (insn));
+         }
+       else
+         {
+           /* Check {SI,DI}mode.  */
+           for (unsigned i = 0; i <= 1; ++i)
+             if (general_scalar_to_vector_candidate_p (insn, cand_mode[i]))
+               {
+                 if (dump_file)
+                   fprintf (dump_file, "  insn %d is marked as a %s candidate\n",
+                            INSN_UID (insn), i == 0 ? "SImode" : "DImode");
+
+                 bitmap_set_bit (&candidates[i], INSN_UID (insn));
+                 break;
+               }
          }
     }

-  remove_non_convertible_regs (candidates);
+  if (TARGET_64BIT)
+    timode_remove_non_convertible_regs (&candidates[2]);
+  for (unsigned i = 0; i <= 1; ++i)
+    general_remove_non_convertible_regs (&candidates[i]);

-  if (bitmap_empty_p (candidates))
-    if (dump_file)
+  for (unsigned i = 0; i <= 2; ++i)
+    if (!bitmap_empty_p (&candidates[i]))
+      break;
+    else if (i == 2 && dump_file)
       fprintf (dump_file, "There are no candidates for optimization.\n");

-  while (!bitmap_empty_p (candidates))
-    {
-      unsigned uid = bitmap_first_set_bit (candidates);
-      scalar_chain *chain;
+  for (unsigned i = 0; i <= 2; ++i)
+    while (!bitmap_empty_p (&candidates[i]))
+      {
+       unsigned uid = bitmap_first_set_bit (&candidates[i]);
+       scalar_chain *chain;

-      if (TARGET_64BIT)
-       chain = new timode_scalar_chain;
-      else
-       chain = new dimode_scalar_chain;
+       if (cand_mode[i] == TImode)
+         chain = new timode_scalar_chain;
+       else
+         chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]);

-      /* Find instructions chain we want to convert to vector mode.
-        Check all uses and definitions to estimate all required
-        conversions.  */
-      chain->build (candidates, uid);
+       /* Find instructions chain we want to convert to vector mode.
+          Check all uses and definitions to estimate all required
+          conversions.  */
+       chain->build (&candidates[i], uid);

-      if (chain->compute_convert_gain () > 0)
-       converted_insns += chain->convert ();
-      else
-       if (dump_file)
-         fprintf (dump_file, "Chain #%d conversion is not profitable\n",
-                  chain->chain_id);
+       if (chain->compute_convert_gain () > 0)
+         converted_insns += chain->convert ();
+       else
+         if (dump_file)
+           fprintf (dump_file, "Chain #%d conversion is not profitable\n",
+                    chain->chain_id);

-      delete chain;
-    }
+       delete chain;
+      }

   if (dump_file)
     fprintf (dump_file, "Total insns converted: %d\n",
             converted_insns);

-  BITMAP_FREE (candidates);
+  for (unsigned i = 0; i <= 2; ++i)
+    bitmap_release (&candidates[i]);
   bitmap_obstack_release (NULL);
   df_process_deferred_rescans ();
Index: gcc/config/i386/i386-features.h
===================================================================
--- gcc/config/i386/i386-features.h	(revision 274278)
+++ gcc/config/i386/i386-features.h	(working copy)
@@ -127,11 +127,16 @@ namespace {
 class scalar_chain
 {
  public:
-  scalar_chain ();
+  scalar_chain (enum machine_mode, enum machine_mode);
   virtual ~scalar_chain ();

   static unsigned max_id;

+  /* Scalar mode.  */
+  enum machine_mode smode;
+  /* Vector mode.  */
+  enum machine_mode vmode;
+
   /* ID of a chain.  */
   unsigned int chain_id;
   /* A queue of instructions to be included into a chain.  */
@@ -159,9 +164,11 @@ class scalar_chain
   virtual void convert_registers () = 0;
 };

-class dimode_scalar_chain : public scalar_chain
+class general_scalar_chain : public scalar_chain
 {
  public:
+  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+    : scalar_chain (smode_, vmode_) {}
   int compute_convert_gain ();
 private:
   void mark_dual_mode_def (df_ref def);
@@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
 class timode_scalar_chain : public scalar_chain
 {
  public:
+  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
+
   /* Convert from TImode to V1TImode is always faster.  */
   int compute_convert_gain () { return 1; }
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 274278)
+++ gcc/config/i386/i386.md	(working copy)
@@ -17719,6 +17719,110 @@ (define_expand "add<mode>cc"
    (match_operand:SWI 3 "const_int_operand")]
   ""
   "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;")
+
+;; min/max patterns
+
+(define_mode_iterator MAXMIN_IMODE
+  [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")])
+(define_code_attr maxmin_rel
+  [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")])
+
+(define_expand "<code><mode>3"
+  [(parallel
+    [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
+         (maxmin:MAXMIN_IMODE
+           (match_operand:MAXMIN_IMODE 1 "register_operand")
+           (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+     (clobber (reg:CC FLAGS_REG))])]
+  "TARGET_STV")
+
+(define_insn_and_split "*<code><mode>3_1"
+  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
+       (maxmin:MAXMIN_IMODE
+         (match_operand:MAXMIN_IMODE 1 "register_operand")
+         (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+       (if_then_else:MAXMIN_IMODE (match_dup 3)
+         (match_dup 1)
+         (match_dup 2)))]
+{
+  machine_mode mode = <MODE>mode;
+
+  if (!register_operand (operands[2], mode))
+    operands[2] = force_reg (mode, operands[2]);
+
+  enum rtx_code code = <maxmin_rel>;
+  machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]);
+  rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG);
+
+  rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]);
+  emit_insn (gen_rtx_SET (flags, tmp));
+
+  operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
+})
+
+(define_insn_and_split "*<code>di3_doubleword"
+  [(set (match_operand:DI 0 "register_operand")
+       (maxmin:DI (match_operand:DI 1 "register_operand")
+                  (match_operand:DI 2 "nonimmediate_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (match_dup 0)
+       (if_then_else:SI (match_dup 6)
+         (match_dup 1)
+         (match_dup 2)))
+   (set (match_dup 3)
+       (if_then_else:SI (match_dup 6)
+         (match_dup 4)
+         (match_dup 5)))]
+{
+  if (!register_operand (operands[2], DImode))
+    operands[2] = force_reg (DImode, operands[2]);
+
+  split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]);
+
+  rtx cmplo[2] = { operands[1], operands[2] };
+  rtx cmphi[2] = { operands[4], operands[5] };
+
+  enum rtx_code code = <maxmin_rel>;
+
+  switch (code)
+    {
+    case LE: case LEU:
+      std::swap (cmplo[0], cmplo[1]);
+      std::swap (cmphi[0], cmphi[1]);
+      code = swap_condition (code);
+      /* FALLTHRU */
+
+    case GE: case GEU:
+      {
+       bool uns = (code == GEU);
+       rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx)
+         = uns ? gen_sub3_carry_ccc : gen_sub3_carry_ccgz;
+
+       emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1]));
+
+       rtx tmp = gen_rtx_SCRATCH (SImode);
+       emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1]));
+
+       rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG);
+       operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
+
+       break;
+      }
+
+    default:
+      gcc_unreachable ();
+    }
})

 ;; Misc patterns (?)
Index: gcc/testsuite/gcc.target/i386/pr91154.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr91154.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/pr91154.c	(working copy)
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse4.1 -mstv" } */
+
+void foo (int *dc, int *mc, int *tpdd, int *tpmd, int M)
+{
+  int sc;
+  int k;
+  for (k = 1; k <= M; k++)
+    {
+      dc[k] = dc[k-1] + tpdd[k-1];
+      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
+      if (dc[k] < -987654321) dc[k] = -987654321;
+    }
+}
+
+/* We want to convert the loop to SSE since SSE pmaxsd is faster than
+   compare + conditional move.  */
+/* { dg-final { scan-assembler-not "cmov" } } */
+/* { dg-final { scan-assembler-times "pmaxsd" 2 } } */
+/* { dg-final { scan-assembler-times "paddd" 2 } } */
Index: gcc/testsuite/gcc.target/i386/minmax-1.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-1.c	(revision 274278)
+++ gcc/testsuite/gcc.target/i386/minmax-1.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -march=opteron" } */
+/* { dg-options "-O2 -march=opteron -mno-stv" } */
 /* { dg-final { scan-assembler "test" } } */
 /* { dg-final { scan-assembler-not "cmp" } } */
 #define max(a,b) (((a) > (b))? (a) : (b))
Index: gcc/testsuite/gcc.target/i386/minmax-2.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-2.c	(revision 274278)
+++ gcc/testsuite/gcc.target/i386/minmax-2.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2" } */
+/* { dg-options "-O2 -mno-stv" } */
 /* { dg-final { scan-assembler "test" } } */
 /* { dg-final { scan-assembler-not "cmp" } } */
 #define max(a,b) (((a) > (b))? (a) : (b))
Index: gcc/testsuite/gcc.target/i386/minmax-3.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-3.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-3.c	(working copy)
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int ssi[1024];
+unsigned int usi[1024];
+long long sdi[1024];
+unsigned long long udi[1024];
+
+#define CHECK(FN, VARIANT) \
+void \
+FN ## VARIANT (void) \
+{ \
+  for (int i = 1; i < 1024; ++i) \
+    VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \
+}
+
+CHECK(max, ssi);
+CHECK(min, ssi);
+CHECK(max, usi);
+CHECK(min, usi);
+CHECK(max, sdi);
+CHECK(min, sdi);
+CHECK(max, udi);
+CHECK(min, udi);
Index: gcc/testsuite/gcc.target/i386/minmax-4.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-4.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-4.c	(working copy)
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv -msse4.1" } */
+
+#include "minmax-3.c"
+
+/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */
+/* { dg-final { scan-assembler-times "pmaxud" 1 } } */
+/* { dg-final { scan-assembler-times "pminsd" 1 } } */
+/* { dg-final { scan-assembler-times "pminud" 1 } } */
Index: gcc/testsuite/gcc.target/i386/minmax-6.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-6.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-6.c	(working copy)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -march=haswell" } */
+
+unsigned short
+UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
+{
+  if (y != width)
+    {
+      y = y < 0 ? 0 : y;
+      return Pic[y * width];
+    }
+  return Pic[y];
+}
+
+/* We do not want the RA to spill %esi for it's dual-use but using
+   pmaxsd is OK.  */
+/* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
+/* { dg-final { scan-assembler "pmaxsd" } } */
---1609908220-888097525-1565594583=:11741--