* [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs @ 2019-07-23 14:03 Richard Biener 2019-07-24 9:14 ` Richard Biener ` (2 more replies) 0 siblings, 3 replies; 61+ messages in thread From: Richard Biener @ 2019-07-23 14:03 UTC (permalink / raw) To: gcc-patches Cc: Jan Hubicka, ubizjak, kirill.yukhin, vmakarov, hjl.tools, Martin Jambor The following fixes the runtime regression of 456.hmmer caused by matching ICC in code generation and using cmov more aggressively (through GIMPLE level MAX_EXPR usage). Apparently (discovered by manual assembler editing) using the SSE unit for performing SImode loads, adds and then two signed max operations plus stores is quite a bit faster than cmovs - even faster than the original single cmov plus branchy second max. Even more so for AMD CPUs than Intel CPUs. Instead of hacking up some pattern recognition pass to transform integer mode memory-to-memory computation chains involving conditional moves to "vector" code (similar to what STV does for TImode ops on x86_64) the following simply allows SImode into SSE registers (support for this is already there in some important places like move patterns!). For the particular case of 456.hmmer the required support is loads/stores (already implemented), SImode adds and SImode smax. So the patch adds a smax pattern for SImode (we don't have any for scalar modes but currently expand via a conditional move sequence) emitting as SSE vector max or cmp/cmov depending on the alternative. And it amends the *add<mode>_1 pattern with SSE alternatives (which have to come before the memory alternative as IRA otherwise doesn't consider reloading a memory operand to a register). With this in place the runtime of 456.hmmer improves by 10% on Haswell which is back to before regression speed but not to same levels as seen with manually editing just the single important loop. I'm currently benchmarking all SPEC CPU 2006 on Haswell. 
More interesting is probably Zen where moves crossing the integer - vector domain are excessively expensive (they get done via the stack). Clearly this approach will run into register allocation issues but it looks cleaner than writing yet another STV-like pass (STV itself is quite awkwardly structured so I refrain from touching it...). Anyway - comments? It seems to me that MMX-in-SSE does something very similar. Bootstrapped on x86_64-unknown-linux-gnu, previous testing revealed some issue. Forgot that *add<mode>_1 also handles DImode..., fixed below, re-testing in progress. Thanks, Richard. 2019-07-23 Richard Biener <rguenther@suse.de> PR target/91154 * config/i386/i386.md (smaxsi3): New. (*add<mode>_1): Add SSE and AVX variants. * config/i386/i386.c (ix86_lea_for_add_ok): Do not allow SSE registers. Index: gcc/config/i386/i386.md =================================================================== --- gcc/config/i386/i386.md (revision 273732) +++ gcc/config/i386/i386.md (working copy) @@ -1881,6 +1881,33 @@ (define_expand "mov<mode>" "" "ix86_expand_move (<MODE>mode, operands); DONE;") +(define_insn "smaxsi3" + [(set (match_operand:SI 0 "register_operand" "=r,v,x") + (smax:SI (match_operand:SI 1 "register_operand" "%0,v,0") + (match_operand:SI 2 "register_operand" "r,v,x"))) + (clobber (reg:CC FLAGS_REG))] + "TARGET_SSE4_1" +{ + switch (get_attr_type (insn)) + { + case TYPE_SSEADD: + if (which_alternative == 1) + return "vpmaxsd\t{%2, %1, %0|%0, %1, %2}"; + else + return "pmaxsd\t{%2, %0|%0, %2}"; + case TYPE_ICMOV: + /* ??? Instead split this after reload? 
*/ + return "cmpl\t{%2, %0|%0, %2}\n" + "\tcmovl\t{%2, %0|%0, %2}"; + default: + gcc_unreachable (); + } +} + [(set_attr "isa" "noavx,avx,noavx") + (set_attr "prefix" "orig,vex,orig") + (set_attr "memory" "none") + (set_attr "type" "icmov,sseadd,sseadd")]) + (define_insn "*mov<mode>_xor" [(set (match_operand:SWI48 0 "register_operand" "=r") (match_operand:SWI48 1 "const0_operand")) @@ -5368,10 +5395,10 @@ (define_insn_and_split "*add<dwi>3_doubl }) (define_insn "*add<mode>_1" - [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r") + [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,v,x,r,r,r") (plus:SWI48 - (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r") - (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le"))) + (match_operand:SWI48 1 "nonimmediate_operand" "%0,v,0,0,r,r") + (match_operand:SWI48 2 "x86_64_general_operand" "re,v,x,*m,0,le"))) (clobber (reg:CC FLAGS_REG))] "ix86_binary_operator_ok (PLUS, <MODE>mode, operands)" { @@ -5390,10 +5417,23 @@ (define_insn "*add<mode>_1" return "dec{<imodesuffix>}\t%0"; } + case TYPE_SSEADD: + if (which_alternative == 1) + { + if (<MODE>mode == SImode) + return "%vpaddd\t{%2, %1, %0|%0, %1, %2}"; + else + return "%vpaddq\t{%2, %1, %0|%0, %1, %2}"; + } + else if (<MODE>mode == SImode) + return "paddd\t{%2, %0|%0, %2}"; + else + return "paddq\t{%2, %0|%0, %2}"; + default: /* For most processors, ADD is faster than LEA. This alternative was added to use ADD as much as possible. 
*/ - if (which_alternative == 2) + if (which_alternative == 4) std::swap (operands[1], operands[2]); gcc_assert (rtx_equal_p (operands[0], operands[1])); @@ -5403,9 +5443,14 @@ (define_insn "*add<mode>_1" return "add{<imodesuffix>}\t{%2, %0|%0, %2}"; } } - [(set (attr "type") - (cond [(eq_attr "alternative" "3") + [(set_attr "isa" "*,avx,noavx,*,*,*") + (set (attr "type") + (cond [(eq_attr "alternative" "5") (const_string "lea") + (eq_attr "alternative" "1") + (const_string "sseadd") + (eq_attr "alternative" "2") + (const_string "sseadd") (match_operand:SWI48 2 "incdec_operand") (const_string "incdec") ] Index: gcc/config/i386/i386.c =================================================================== --- gcc/config/i386/i386.c (revision 273732) +++ gcc/config/i386/i386.c (working copy) @@ -14616,6 +14616,9 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx unsigned int regno1 = true_regnum (operands[1]); unsigned int regno2 = true_regnum (operands[2]); + if (SSE_REGNO_P (regno1)) + return false; + /* If a = b + c, (a!=b && a!=c), must use lea form. */ if (regno0 != regno1 && regno0 != regno2) return true; ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-07-23 14:03 [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs Richard Biener @ 2019-07-24 9:14 ` Richard Biener 2019-07-24 11:30 ` Richard Biener 2019-07-24 15:12 ` Jeff Law 2019-07-25 9:15 ` Martin Jambor 2 siblings, 1 reply; 61+ messages in thread From: Richard Biener @ 2019-07-24 9:14 UTC (permalink / raw) To: gcc-patches Cc: Jan Hubicka, ubizjak, kirill.yukhin, vmakarov, hjl.tools, Martin Jambor [-- Attachment #1: Type: text/plain, Size: 9509 bytes --] On Tue, 23 Jul 2019, Richard Biener wrote: > > The following fixes the runtime regression of 456.hmmer caused > by matching ICC in code generation and using cmov more aggressively > (through GIMPLE level MAX_EXPR usage). Apparently (discovered > by manual assembler editing) using the SSE unit for performing > SImode loads, adds and then two signed max operations plus stores > is quite a bit faster than cmovs - even faster than the original > single cmov plus branchy second max. Even more so for AMD CPUs > than Intel CPUs. > > Instead of hacking up some pattern recognition pass to transform > integer mode memory-to-memory computation chains involving > conditional moves to "vector" code (similar to what STV does > for TImode ops on x86_64) the following simply allows SImode > into SSE registers (support for this is already there in some > important places like move patterns!). For the particular > case of 456.hmmer the required support is loads/stores > (already implemented), SImode adds and SImode smax. > > So the patch adds a smax pattern for SImode (we don't have any > for scalar modes but currently expand via a conditional move sequence) > emitting as SSE vector max or cmp/cmov depending on the alternative. 
> > And it amends the *add<mode>_1 pattern with SSE alternatives > (which have to come before the memory alternative as IRA otherwise > doesn't consider reloading a memory operand to a register). > > With this in place the runtime of 456.hmmer improves by 10% > on Haswell which is back to before regression speed but not > to same levels as seen with manually editing just the single > important loop. > > I'm currently benchmarking all SPEC CPU 2006 on Haswell. More > interesting is probably Zen where moves crossing the > integer - vector domain are excessively expensive (they get > done via the stack). > > Clearly this approach will run into register allocation issues > but it looks cleaner than writing yet another STV-like pass > (STV itself is quite awkwardly structured so I refrain from > touching it...). > > Anyway - comments? It seems to me that MMX-in-SSE does > something very similar. > > Bootstrapped on x86_64-unknown-linux-gnu, previous testing > revealed some issue. Forgot that *add<mode>_1 also handles > DImode..., fixed below, re-testing in progress. Bootstrapped/tested on x86_64-unknown-linux-gnu. A 3-run of SPEC CPU 2006 on a Haswell machine completed and results are in the noise besides the 456.hmmer improvement: 456.hmmer 9330 184 50.7 S 9330 162 57.4 S 456.hmmer 9330 182 51.2 * 9330 162 57.7 * 456.hmmer 9330 182 51.2 S 9330 162 57.7 S the peak binaries (patched) are all slightly bigger, the smaxsi3 pattern triggers 6840 times, every time using SSE registers and never expanding to the cmov variant. The *add<mode>_1 pattern ends up using SSE regs 264 times (out of undoubtedly many more, uncounted, times). 
I do see cases where the RA ends up moving sources of the max from GPR to XMM when the destination is stored to memory and used in other ops with SSE but still it could have used XMM regs for the sources as well: movl -208(%rbp), %r8d addl (%r9,%rax), %r8d vmovd %r8d, %xmm2 movq -120(%rbp), %r8 # MAX WITH SSE vpmaxsd %xmm4, %xmm2, %xmm2 amending the *add<mode>_1 was of course the trickiest part, mostly because the GPR case has memory alternatives while the SSE part does not (since we have to use a whole-vector add we can't use a memory operand which would be wider than SImode - AVX512 might come to the rescue with using {splat} from scalar/immediate or masking but that might come at a runtime cost as well). Allowing memory and splitting after reload, adding a match-scratch might work as well. But I'm not sure if that wouldn't make using SSE regs too obvious if it's not all in the same alternative. While the above code isn't too bad on Core, both Bulldozer and Zen take a big hit. Another case from 400.perlbench: vmovd .LC45(%rip), %xmm7 vmovd %ebp, %xmm5 # MAX WITH SSE vpmaxsd %xmm7, %xmm5, %xmm4 vmovd %xmm4, %ecx eh? I can't see why the RA would ever choose the second alternative. It looks like it prefers SSE_REGS for the operand set from a constant. A testcase like int foo (int a) { return a > 5 ? a : 5; } produces the above with -mavx2, possibly IRA thinks the missing matching constraint for the 2nd alternative makes it win? The dumps aren't too verbose here just showing the costs, not how we arrive at them. Generally using SSE for scalar integer ops shouldn't be bad, esp. in loops it might free GPRs for induction variables. Cons are larger instruction encoding and inefficient/missing handling of immediates and no memory operands. Of course in the end it's just that for some unknown reason cmp + cmov is so much slower than pmaxsd (OK, it's a lot less uops, but...) and that pmaxsd is quite a bit faster than the variant with a (very well predicted) branch. Richard. 
> Thanks, > Richard. > > 2019-07-23 Richard Biener <rguenther@suse.de> > > PR target/91154 > * config/i386/i386.md (smaxsi3): New. > (*add<mode>_1): Add SSE and AVX variants. > * config/i386/i386.c (ix86_lea_for_add_ok): Do not allow > SSE registers. > > Index: gcc/config/i386/i386.md > =================================================================== > --- gcc/config/i386/i386.md (revision 273732) > +++ gcc/config/i386/i386.md (working copy) > @@ -1881,6 +1881,33 @@ (define_expand "mov<mode>" > "" > "ix86_expand_move (<MODE>mode, operands); DONE;") > > +(define_insn "smaxsi3" > + [(set (match_operand:SI 0 "register_operand" "=r,v,x") > + (smax:SI (match_operand:SI 1 "register_operand" "%0,v,0") > + (match_operand:SI 2 "register_operand" "r,v,x"))) > + (clobber (reg:CC FLAGS_REG))] > + "TARGET_SSE4_1" > +{ > + switch (get_attr_type (insn)) > + { > + case TYPE_SSEADD: > + if (which_alternative == 1) > + return "vpmaxsd\t{%2, %1, %0|%0, %1, %2}"; > + else > + return "pmaxsd\t{%2, %0|%0, %2}"; > + case TYPE_ICMOV: > + /* ??? Instead split this after reload? 
*/ > + return "cmpl\t{%2, %0|%0, %2}\n" > + "\tcmovl\t{%2, %0|%0, %2}"; > + default: > + gcc_unreachable (); > + } > +} > + [(set_attr "isa" "noavx,avx,noavx") > + (set_attr "prefix" "orig,vex,orig") > + (set_attr "memory" "none") > + (set_attr "type" "icmov,sseadd,sseadd")]) > + > (define_insn "*mov<mode>_xor" > [(set (match_operand:SWI48 0 "register_operand" "=r") > (match_operand:SWI48 1 "const0_operand")) > @@ -5368,10 +5395,10 @@ (define_insn_and_split "*add<dwi>3_doubl > }) > > (define_insn "*add<mode>_1" > - [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r") > + [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,v,x,r,r,r") > (plus:SWI48 > - (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r") > - (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le"))) > + (match_operand:SWI48 1 "nonimmediate_operand" "%0,v,0,0,r,r") > + (match_operand:SWI48 2 "x86_64_general_operand" "re,v,x,*m,0,le"))) > (clobber (reg:CC FLAGS_REG))] > "ix86_binary_operator_ok (PLUS, <MODE>mode, operands)" > { > @@ -5390,10 +5417,23 @@ (define_insn "*add<mode>_1" > return "dec{<imodesuffix>}\t%0"; > } > > + case TYPE_SSEADD: > + if (which_alternative == 1) > + { > + if (<MODE>mode == SImode) > + return "%vpaddd\t{%2, %1, %0|%0, %1, %2}"; > + else > + return "%vpaddq\t{%2, %1, %0|%0, %1, %2}"; > + } > + else if (<MODE>mode == SImode) > + return "paddd\t{%2, %0|%0, %2}"; > + else > + return "paddq\t{%2, %0|%0, %2}"; > + > default: > /* For most processors, ADD is faster than LEA. This alternative > was added to use ADD as much as possible. 
*/ > - if (which_alternative == 2) > + if (which_alternative == 4) > std::swap (operands[1], operands[2]); > > gcc_assert (rtx_equal_p (operands[0], operands[1])); > @@ -5403,9 +5443,14 @@ (define_insn "*add<mode>_1" > return "add{<imodesuffix>}\t{%2, %0|%0, %2}"; > } > } > - [(set (attr "type") > - (cond [(eq_attr "alternative" "3") > + [(set_attr "isa" "*,avx,noavx,*,*,*") > + (set (attr "type") > + (cond [(eq_attr "alternative" "5") > (const_string "lea") > + (eq_attr "alternative" "1") > + (const_string "sseadd") > + (eq_attr "alternative" "2") > + (const_string "sseadd") > (match_operand:SWI48 2 "incdec_operand") > (const_string "incdec") > ] > Index: gcc/config/i386/i386.c > =================================================================== > --- gcc/config/i386/i386.c (revision 273732) > +++ gcc/config/i386/i386.c (working copy) > @@ -14616,6 +14616,9 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx > unsigned int regno1 = true_regnum (operands[1]); > unsigned int regno2 = true_regnum (operands[2]); > > + if (SSE_REGNO_P (regno1)) > + return false; > + > /* If a = b + c, (a!=b && a!=c), must use lea form. */ > if (regno0 != regno1 && regno0 != regno2) > return true; > -- Richard Biener <rguenther@suse.de> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-07-24 9:14 ` Richard Biener @ 2019-07-24 11:30 ` Richard Biener 0 siblings, 0 replies; 61+ messages in thread From: Richard Biener @ 2019-07-24 11:30 UTC (permalink / raw) To: gcc-patches Cc: Jan Hubicka, ubizjak, kirill.yukhin, vmakarov, hjl.tools, Martin Jambor On Wed, 24 Jul 2019, Richard Biener wrote: > On Tue, 23 Jul 2019, Richard Biener wrote: > > > > > The following fixes the runtime regression of 456.hmmer caused > > by matching ICC in code generation and using cmov more aggressively > > (through GIMPLE level MAX_EXPR usage). Apparently (discovered > > by manual assembler editing) using the SSE unit for performing > > SImode loads, adds and then two signed max operations plus stores > > is quite a bit faster than cmovs - even faster than the original > > single cmov plus branchy second max. Even more so for AMD CPUs > > than Intel CPUs. > > > > Instead of hacking up some pattern recognition pass to transform > > integer mode memory-to-memory computation chains involving > > conditional moves to "vector" code (similar to what STV does > > for TImode ops on x86_64) the following simply allows SImode > > into SSE registers (support for this is already there in some > > important places like move patterns!). For the particular > > case of 456.hmmer the required support is loads/stores > > (already implemented), SImode adds and SImode smax. > > > > So the patch adds a smax pattern for SImode (we don't have any > > for scalar modes but currently expand via a conditional move sequence) > > emitting as SSE vector max or cmp/cmov depending on the alternative. > > > > And it amends the *add<mode>_1 pattern with SSE alternatives > > (which have to come before the memory alternative as IRA otherwise > > doesn't consider reloading a memory operand to a register). 
> > > > With this in place the runtime of 456.hmmer improves by 10% > > on Haswell which is back to before regression speed but not > > to same levels as seen with manually editing just the single > > important loop. > > > > I'm currently benchmarking all SPEC CPU 2006 on Haswell. More > > interesting is probably Zen where moves crossing the > > integer - vector domain are excessively expensive (they get > > done via the stack). > > > > Clearly this approach will run into register allocation issues > > but it looks cleaner than writing yet another STV-like pass > > (STV itself is quite awkwardly structured so I refrain from > > touching it...). > > > > Anyway - comments? It seems to me that MMX-in-SSE does > > something very similar. > > > > Bootstrapped on x86_64-unknown-linux-gnu, previous testing > > revealed some issue. Forgot that *add<mode>_1 also handles > > DImode..., fixed below, re-testing in progress. > > Bootstrapped/tested on x86_64-unknown-linux-gnu. A 3-run of > SPEC CPU 2006 on a Haswell machine completed and results > are in the noise besides the 456.hmmer improvement: > > 456.hmmer 9330 184 50.7 S 9330 162 > 57.4 S > 456.hmmer 9330 182 51.2 * 9330 162 > 57.7 * > 456.hmmer 9330 182 51.2 S 9330 162 > 57.7 S > > the peak binaries (patched) are all slightly bigger, the > smaxsi3 pattern triggers 6840 times, every time using SSE > registers and never expanding to the cmov variant. The > *add<mode>_1 pattern ends up using SSE regs 264 times > (out of undoubtedly many more, uncounted, times). 
> > I do see cases where the RA ends up moving sources of > > the max from GPR to XMM when the destination is stored > > to memory and used in other ops with SSE but still > > it could have used XMM regs for the sources as well: > > > > movl -208(%rbp), %r8d > > addl (%r9,%rax), %r8d > > vmovd %r8d, %xmm2 > > movq -120(%rbp), %r8 > > # MAX WITH SSE > > vpmaxsd %xmm4, %xmm2, %xmm2 > > > > amending the *add<mode>_1 was of course the trickiest part, > > mostly because the GPR case has memory alternatives while > > the SSE part does not (since we have to use a whole-vector > > add we can't use a memory operand which would be wider > > than SImode - AVX512 might come to the rescue with > > using {splat} from scalar/immediate or masking > > but that might come at a runtime cost as well). Allowing > > memory and splitting after reload, adding a match-scratch > > might work as well. But I'm not sure if that wouldn't > > make using SSE regs too obvious if it's not all in the > > same alternative. While the above code isn't too bad > > on Core, both Bulldozer and Zen take a big hit. > > > > Another case from 400.perlbench: > > > > vmovd .LC45(%rip), %xmm7 > > vmovd %ebp, %xmm5 > > # MAX WITH SSE > > vpmaxsd %xmm7, %xmm5, %xmm4 > > vmovd %xmm4, %ecx > > > > eh? I can't see why the RA would ever choose the second > > alternative. It looks like it prefers SSE_REGS for the > > operand set from a constant. A testcase like > > > > int foo (int a) > > { > > return a > 5 ? a : 5; > > } > > > > produces the above with -mavx2, possibly IRA thinks > > the missing matching constraint for the 2nd alternative > > makes it win? The dumps aren't too verbose here just > > showing the costs, not how we arrive at them. Eh, this is due to my use of the "cpu" attribute for smaxsi3 which makes it only enable this alternative for -mavx. Removing that we fail to consider SSE regs for the original and this testcase :/ Oh well. RA needs some more pixie dust it seems ... Richard. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-07-23 14:03 [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs Richard Biener 2019-07-24 9:14 ` Richard Biener @ 2019-07-24 15:12 ` Jeff Law 2019-07-27 10:07 ` Uros Bizjak 2019-07-25 9:15 ` Martin Jambor 2 siblings, 1 reply; 61+ messages in thread From: Jeff Law @ 2019-07-24 15:12 UTC (permalink / raw) To: Richard Biener, gcc-patches Cc: Jan Hubicka, ubizjak, kirill.yukhin, vmakarov, hjl.tools, Martin Jambor On 7/23/19 8:00 AM, Richard Biener wrote: > > The following fixes the runtime regression of 456.hmmer caused > by matching ICC in code generation and using cmov more aggressively > (through GIMPLE level MAX_EXPR usage). Apparently (discovered > by manual assembler editing) using the SSE unit for performing > SImode loads, adds and then two signed max operations plus stores > is quite a bit faster than cmovs - even faster than the original > single cmov plus branchy second max. Even more so for AMD CPUs > than Intel CPUs. > > Instead of hacking up some pattern recognition pass to transform > integer mode memory-to-memory computation chains involving > conditional moves to "vector" code (similar to what STV does > for TImode ops on x86_64) the following simply allows SImode > into SSE registers (support for this is already there in some > important places like move patterns!). For the particular > case of 456.hmmer the required support is loads/stores > (already implemented), SImode adds and SImode smax. > > So the patch adds a smax pattern for SImode (we don't have any > for scalar modes but currently expand via a conditional move sequence) > emitting as SSE vector max or cmp/cmov depending on the alternative. > > And it amends the *add<mode>_1 pattern with SSE alternatives > (which have to come before the memory alternative as IRA otherwise > doesn't consider reloading a memory operand to a register). 
> > With this in place the runtime of 456.hmmer improves by 10% > on Haswell which is back to before regression speed but not > to same levels as seen with manually editing just the single > important loop. > > I'm currently benchmarking all SPEC CPU 2006 on Haswell. More > interesting is probably Zen where moves crossing the > integer - vector domain are excessively expensive (they get > done via the stack). > > Clearly this approach will run into register allocation issues > but it looks cleaner than writing yet another STV-like pass > (STV itself is quite awkwardly structured so I refrain from > touching it...). > > Anyway - comments? It seems to me that MMX-in-SSE does > something very similar. > > Bootstrapped on x86_64-unknown-linux-gnu, previous testing > revealed some issue. Forgot that *add<mode>_1 also handles > DImode..., fixed below, re-testing in progress. Certainly simpler than most of the options and seems effective. FWIW, I think all the STV code is still disabled and has been for several releases. One could make an argument it should get dropped. If someone wants to make something like STV work, they can try again and hopefully learn from the problems with the first implementation. jeff ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-07-24 15:12 ` Jeff Law @ 2019-07-27 10:07 ` Uros Bizjak 2019-08-09 22:15 ` Jeff Law 0 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-07-27 10:07 UTC (permalink / raw) To: Jeff Law Cc: Richard Biener, gcc-patches, Jan Hubicka, Kirill Yukhin, Vladimir Makarov, H. J. Lu, Martin Jambor On Wed, Jul 24, 2019 at 5:03 PM Jeff Law <law@redhat.com> wrote: > > Clearly this approach will run into register allocation issues > > but it looks cleaner than writing yet another STV-like pass > > (STV itself is quite awkwardly structured so I refrain from > > touching it...). > > > > Anyway - comments? It seems to me that MMX-in-SSE does > > something very similar. > > > > Bootstrapped on x86_64-unknown-linux-gnu, previous testing > > revealed some issue. Forgot that *add<mode>_1 also handles > > DImode..., fixed below, re-testing in progress. > Certainly simpler than most of the options and seems effective. > > FWIW, I think all the STV code is still disabled and has been for > several releases. One could make an argument it should get dropped. If > someone wants to make something like STV work, they can try again and > hopefully learn from the problems with the first implementation. Huh? STV code is *enabled by default* on 32bit SSE2 targets, and works surprisingly well (*) for DImode arithmetic, logic and constant shift operations. Even 32bit multilib on x86_64 is built with STV. I am indeed surprised that the perception of the developers is that STV doesn't work. Maybe I'm missing something obvious here? (*) The infrastructure includes: - cost analysis of the whole STV chain, including moves from integer registers, loading and storing DImode values - preloading of arguments into vector registers to avoid duplicate int-vec moves - different strategies to move arguments between int and vector registers (e.g. respects TARGET_INTER_UNIT_MOVES_{TO,FROM}_VEC flag) Uros. 
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-07-27 10:07 ` Uros Bizjak @ 2019-08-09 22:15 ` Jeff Law 0 siblings, 0 replies; 61+ messages in thread From: Jeff Law @ 2019-08-09 22:15 UTC (permalink / raw) To: Uros Bizjak Cc: Richard Biener, gcc-patches, Jan Hubicka, Kirill Yukhin, Vladimir Makarov, H. J. Lu, Martin Jambor On 7/27/19 3:22 AM, Uros Bizjak wrote: > On Wed, Jul 24, 2019 at 5:03 PM Jeff Law <law@redhat.com> wrote: > >>> Clearly this approach will run into register allocation issues >>> but it looks cleaner than writing yet another STV-like pass >>> (STV itself is quite awkwardly structured so I refrain from >>> touching it...). >>> >>> Anyway - comments? It seems to me that MMX-in-SSE does >>> something very similar. >>> >>> Bootstrapped on x86_64-unknown-linux-gnu, previous testing >>> revealed some issue. Forgot that *add<mode>_1 also handles >>> DImode..., fixed below, re-testing in progress. >> Certainly simpler than most of the options and seems effective. >> >> FWIW, I think all the STV code is still disabled and has been for >> several releases. One could make an argument it should get dropped. If >> someone wants to make something like STV work, they can try again and >> hopefully learn from the problems with the first implementation. > > Huh? > > STV code is *enabled by default* on 32bit SSE2 targets, and works > surprisingly well (*) for DImode arithmetic, logic and constant shift > operations. Even 32bit multilib on x86_64 is built with STV. I must be mis-remembering or confusing it with something else. Sorry for any confusion. Jeff ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-07-23 14:03 [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs Richard Biener 2019-07-24 9:14 ` Richard Biener 2019-07-24 15:12 ` Jeff Law @ 2019-07-25 9:15 ` Martin Jambor 2019-07-25 12:57 ` Richard Biener 2 siblings, 1 reply; 61+ messages in thread From: Martin Jambor @ 2019-07-25 9:15 UTC (permalink / raw) To: Richard Biener, gcc-patches Cc: Jan Hubicka, ubizjak, kirill.yukhin, vmakarov, hjl.tools, Martin Jambor Hello, On Tue, Jul 23 2019, Richard Biener wrote: > The following fixes the runtime regression of 456.hmmer caused > by matching ICC in code generation and using cmov more aggressively > (through GIMPLE level MAX_EXPR usage). Apparently (discovered > by manual assembler editing) using the SSE unit for performing > SImode loads, adds and then two signed max operations plus stores > is quite a bit faster than cmovs - even faster than the original > single cmov plus branchy second max. Even more so for AMD CPUs > than Intel CPUs. > > Instead of hacking up some pattern recognition pass to transform > integer mode memory-to-memory computation chains involving > conditional moves to "vector" code (similar to what STV does > for TImode ops on x86_64) the following simply allows SImode > into SSE registers (support for this is already there in some > important places like move patterns!). For the particular > case of 456.hmmer the required support is loads/stores > (already implemented), SImode adds and SImode smax. > > So the patch adds a smax pattern for SImode (we don't have any > for scalar modes but currently expand via a conditional move sequence) > emitting as SSE vector max or cmp/cmov depending on the alternative. > > And it amends the *add<mode>_1 pattern with SSE alternatives > (which have to come before the memory alternative as IRA otherwise > doesn't consider reloading a memory operand to a register). 
> > With this in place the runtime of 456.hmmer improves by 10% > > on Haswell which is back to before regression speed but not > > to same levels as seen with manually editing just the single > > important loop. > > > > I'm currently benchmarking all SPEC CPU 2006 on Haswell. More > > interesting is probably Zen where moves crossing the > > integer - vector domain are excessively expensive (they get > > done via the stack). There was a znver2 CPU machine not doing anything useful overnight here so I benchmarked your patch using SPEC 2006 and SPEC CPUrate 2017 on top of trunk r273663 (I forgot to pull, so before Honza's znver2 tuning patches, I am afraid). All benchmarks were run only once with options -Ofast -march=native -mtune=native. By far the biggest change was indeed 456.hmmer which improved by an incredible 35%. There was no other change bigger than +- 1.5% in SPEC 2006 so the SPECint score grew by almost 3.4%. I understand this patch fixes a regression in that benchmark but even so, 456.hmmer built with the Monday trunk was 23% slower than with gcc 9 and with the patch is 20% faster than gcc 9. In SPEC 2017, there were two changes worth mentioning although they probably need to be confirmed and re-measured on top of the new tuning changes. 525.x264_r regressed by 3.37% and 511.povray_r improved by 3.04%. Martin > > Clearly this approach will run into register allocation issues > but it looks cleaner than writing yet another STV-like pass > (STV itself is quite awkwardly structured so I refrain from > touching it...). > > Anyway - comments? It seems to me that MMX-in-SSE does > something very similar. > > Bootstrapped on x86_64-unknown-linux-gnu, previous testing > revealed some issue. Forgot that *add<mode>_1 also handles > DImode..., fixed below, re-testing in progress. > > Thanks, > Richard. > > 2019-07-23 Richard Biener <rguenther@suse.de> > > PR target/91154 > * config/i386/i386.md (smaxsi3): New. > (*add<mode>_1): Add SSE and AVX variants. 
> * config/i386/i386.c (ix86_lea_for_add_ok): Do not allow > SSE registers. > ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-07-25 9:15 ` Martin Jambor @ 2019-07-25 12:57 ` Richard Biener 2019-07-27 11:14 ` Uros Bizjak 0 siblings, 1 reply; 61+ messages in thread From: Richard Biener @ 2019-07-25 12:57 UTC (permalink / raw) To: Martin Jambor; +Cc: gcc-patches, Jakub Jelinek, ubizjak, vmakarov On Thu, 25 Jul 2019, Martin Jambor wrote: > Hello, > > On Tue, Jul 23 2019, Richard Biener wrote: > > The following fixes the runtime regression of 456.hmmer caused > > by matching ICC in code generation and using cmov more aggressively > > (through GIMPLE level MAX_EXPR usage). Apparently (discovered > > by manual assembler editing) using the SSE unit for performing > > SImode loads, adds and then two signed max operations plus stores > > is quite a bit faster than cmovs - even faster than the original > > single cmov plus branchy second max. Even more so for AMD CPUs > > than Intel CPUs. > > > > Instead of hacking up some pattern recognition pass to transform > > integer mode memory-to-memory computation chains involving > > conditional moves to "vector" code (similar to what STV does > > for TImode ops on x86_64) the following simply allows SImode > > into SSE registers (support for this is already there in some > > important places like move patterns!). For the particular > > case of 456.hmmer the required support is loads/stores > > (already implemented), SImode adds and SImode smax. > > > > So the patch adds a smax pattern for SImode (we don't have any > > for scalar modes but currently expand via a conditional move sequence) > > emitting as SSE vector max or cmp/cmov depending on the alternative. > > > > And it amends the *add<mode>_1 pattern with SSE alternatives > > (which have to come before the memory alternative as IRA otherwise > > doesn't consider reloading a memory operand to a register). 
> > > > With this in place the runtime of 456.hmmer improves by 10% > > on Haswell which is back to before regression speed but not > > to same levels as seen with manually editing just the single > > important loop. > > > > I'm currently benchmarking all SPEC CPU 2006 on Haswell. More > > interesting is probably Zen where moves crossing the > > integer - vector domain are excessively expensive (they get > > done via the stack). > > There was a znver2 CPU machine not doing anything useful overnight here > so I benchmarked your patch using SPEC 2006 and SPEC CPUrate 2017 on top > of trunk r273663 (I forgot to pull, so before Honza's znver2 tuning > patches, I am afraid). All benchmarks were run only once with options > -Ofast -march=native -mtune=native. > > By far the biggest change was indeed 456.hmmer which improved by > incredible 35%. There was no other change bigger than +- 1.5% in SPEC > 2006 so the SPECint score grew by almost 3.4%. > > I understand this patch fixes a regression in that benchmark but even > so, 456.hmmer built with the Monday trunk was 23% slower than with gcc 9 > and with the patch is 20% faster than gcc 9. > > In SPEC 2017, there were two changes worth mentioning although they > probably need to be confirmed and re-measured on top of the new tuning > changes. 525.x264_r regressed by 3.37% and 511.povray_r improved by > 3.04%. Thanks for checking. Meanwhile I figured how to restore the effects of the patch without disabling the GPR alternative in smaxsi3. The additional trick I need is avoid register class preferencing from moves so *movsi_internal gets a few more *s (in the end we'd need to split the r = g alternative because for r = C we _do_ want to prefer general regs - unless it's a special constant that can be loaded into a SSE reg? Or maybe that's not needed and reload costs will take care of that). 
I've also needed to pessimize the GPR alternative in smaxsi3 because that instruction is supposed to drive all the decision as it is cheaper when done on SSE regs. Tunings still wreck things, like using -march=bdver2 will give you one vpmaxsd and the rest in integer regs, including an inter-unit move via the stack. Still this is the best I can get to with my limited .md / LRA skills. Is avoiding register-class preferencing from moves good? I think it makes sense at least. How would one write smaxsi3 as a splitter to be split after reload in the case LRA assigned the GPR alternative? Is it even worth doing? Even the SSE reg alternative can be split to remove the not needed CC clobber. Finally I'm unsure about the add where I needed to place the SSE alternative before the 2nd op memory one since it otherwise gets the same cost and wins. So - how to go forward with this? Thanks, Richard. Index: gcc/config/i386/i386.md =================================================================== --- gcc/config/i386/i386.md (revision 273792) +++ gcc/config/i386/i386.md (working copy) @@ -1881,6 +1881,33 @@ (define_expand "mov<mode>" "" "ix86_expand_move (<MODE>mode, operands); DONE;") +(define_insn "smaxsi3" + [(set (match_operand:SI 0 "register_operand" "=?r,v,x") + (smax:SI (match_operand:SI 1 "register_operand" "%0,v,0") + (match_operand:SI 2 "register_operand" "r,v,x"))) + (clobber (reg:CC FLAGS_REG))] + "" +{ + switch (get_attr_type (insn)) + { + case TYPE_SSEADD: + if (which_alternative == 1) + return "vpmaxsd\t{%2, %1, %0|%0, %1, %2}"; + else + return "pmaxsd\t{%2, %0|%0, %2}"; + case TYPE_ICMOV: + /* ??? Instead split this after reload? 
*/ + return "cmpl\t{%2, %0|%0, %2}\n" + "\tcmovl\t{%2, %0|%0, %2}"; + default: + gcc_unreachable (); + } +} + [(set_attr "isa" "*,avx,sse4_noavx") + (set_attr "prefix" "orig,vex,orig") + (set_attr "memory" "none") + (set_attr "type" "icmov,sseadd,sseadd")]) + (define_insn "*mov<mode>_xor" [(set (match_operand:SWI48 0 "register_operand" "=r") (match_operand:SWI48 1 "const0_operand")) @@ -2342,9 +2369,9 @@ (define_peephole2 (define_insn "*movsi_internal" [(set (match_operand:SI 0 "nonimmediate_operand" - "=r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k") + "=*r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k") (match_operand:SI 1 "general_operand" - "g ,re,C ,*y,m ,*y,*y,r ,C ,*v,m ,*v,*v,r ,*r,*km,*k ,CBC"))] + "g ,*re,C ,*y,m ,*y,*y,r ,C ,*v,m ,*v,*v,r ,*r,*km,*k ,CBC"))] "!(MEM_P (operands[0]) && MEM_P (operands[1]))" { switch (get_attr_type (insn)) @@ -5368,10 +5395,10 @@ (define_insn_and_split "*add<dwi>3_doubl }) (define_insn "*add<mode>_1" - [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r") + [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,v,x,r,r,r") (plus:SWI48 - (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r") - (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le"))) + (match_operand:SWI48 1 "nonimmediate_operand" "%0,v,0,0,r,r") + (match_operand:SWI48 2 "x86_64_general_operand" "re,v,x,*m,0,le"))) (clobber (reg:CC FLAGS_REG))] "ix86_binary_operator_ok (PLUS, <MODE>mode, operands)" { @@ -5390,10 +5417,23 @@ (define_insn "*add<mode>_1" return "dec{<imodesuffix>}\t%0"; } + case TYPE_SSEADD: + if (which_alternative == 1) + { + if (<MODE>mode == SImode) + return "%vpaddd\t{%2, %1, %0|%0, %1, %2}"; + else + return "%vpaddq\t{%2, %1, %0|%0, %1, %2}"; + } + else if (<MODE>mode == SImode) + return "paddd\t{%2, %0|%0, %2}"; + else + return "paddq\t{%2, %0|%0, %2}"; + default: /* For most processors, ADD is faster than LEA. This alternative was added to use ADD as much as possible. 
*/ - if (which_alternative == 2) + if (which_alternative == 4) std::swap (operands[1], operands[2]); gcc_assert (rtx_equal_p (operands[0], operands[1])); @@ -5403,9 +5443,14 @@ (define_insn "*add<mode>_1" return "add{<imodesuffix>}\t{%2, %0|%0, %2}"; } } - [(set (attr "type") - (cond [(eq_attr "alternative" "3") + [(set_attr "isa" "*,avx,sse2,*,*,*") + (set (attr "type") + (cond [(eq_attr "alternative" "5") (const_string "lea") + (eq_attr "alternative" "1") + (const_string "sseadd") + (eq_attr "alternative" "2") + (const_string "sseadd") (match_operand:SWI48 2 "incdec_operand") (const_string "incdec") ] Index: gcc/config/i386/i386.c =================================================================== --- gcc/config/i386/i386.c (revision 273792) +++ gcc/config/i386/i386.c (working copy) @@ -14616,6 +14616,9 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx unsigned int regno1 = true_regnum (operands[1]); unsigned int regno2 = true_regnum (operands[2]); + if (SSE_REGNO_P (regno1)) + return false; + /* If a = b + c, (a!=b && a!=c), must use lea form. */ if (regno0 != regno1 && regno0 != regno2) return true; ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-07-25 12:57 ` Richard Biener @ 2019-07-27 11:14 ` Uros Bizjak 2019-07-27 18:23 ` Uros Bizjak 0 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-07-27 11:14 UTC (permalink / raw) To: Richard Biener Cc: Martin Jambor, gcc-patches, Jakub Jelinek, Vladimir Makarov On Thu, Jul 25, 2019 at 2:21 PM Richard Biener <rguenther@suse.de> wrote: > > On Thu, 25 Jul 2019, Martin Jambor wrote: > > > Hello, > > > > On Tue, Jul 23 2019, Richard Biener wrote: > > > The following fixes the runtime regression of 456.hmmer caused > > > by matching ICC in code generation and using cmov more aggressively > > > (through GIMPLE level MAX_EXPR usage). Apparently (discovered > > > by manual assembler editing) using the SSE unit for performing > > > SImode loads, adds and then two signed max operations plus stores > > > is quite a bit faster than cmovs - even faster than the original > > > single cmov plus branchy second max. Even more so for AMD CPUs > > > than Intel CPUs. > > > > > > Instead of hacking up some pattern recognition pass to transform > > > integer mode memory-to-memory computation chains involving > > > conditional moves to "vector" code (similar to what STV does > > > for TImode ops on x86_64) the following simply allows SImode > > > into SSE registers (support for this is already there in some > > > important places like move patterns!). For the particular > > > case of 456.hmmer the required support is loads/stores > > > (already implemented), SImode adds and SImode smax. > > > > > > So the patch adds a smax pattern for SImode (we don't have any > > > for scalar modes but currently expand via a conditional move sequence) > > > emitting as SSE vector max or cmp/cmov depending on the alternative. 
> > > > > > And it amends the *add<mode>_1 pattern with SSE alternatives > > > (which have to come before the memory alternative as IRA otherwise > > > doesn't consider reloading a memory operand to a register). > > > > > > With this in place the runtime of 456.hmmer improves by 10% > > > on Haswell which is back to before regression speed but not > > > to same levels as seen with manually editing just the single > > > important loop. > > > > > > I'm currently benchmarking all SPEC CPU 2006 on Haswell. More > > > interesting is probably Zen where moves crossing the > > > integer - vector domain are excessively expensive (they get > > > done via the stack). > > > > There was a znver2 CPU machine not doing anything useful overnight here > > so I benchmarked your patch using SPEC 2006 and SPEC CPUrate 2017 on top > > of trunk r273663 (I forgot to pull, so before Honza's znver2 tuning > > patches, I am afraid). All benchmarks were run only once with options > > -Ofast -march=native -mtune=native. > > > > By far the biggest change was indeed 456.hmmer which improved by > > incredible 35%. There was no other change bigger than +- 1.5% in SPEC > > 2006 so the SPECint score grew by almost 3.4%. > > > > I understand this patch fixes a regression in that benchmark but even > > so, 456.hmmer built with the Monday trunk was 23% slower than with gcc 9 > > and with the patch is 20% faster than gcc 9. > > > > In SPEC 2017, there were two changes worth mentioning although they > > probably need to be confirmed and re-measured on top of the new tuning > > changes. 525.x264_r regressed by 3.37% and 511.povray_r improved by > > 3.04%. > > Thanks for checking. Meanwhile I figured how to restore the > effects of the patch without disabling the GPR alternative in > smaxsi3. 
The additional trick I need is avoid register class > preferencing from moves so *movsi_internal gets a few more *s > (in the end we'd need to split the r = g alternative because > for r = C we _do_ want to prefer general regs - unless it's > a special constant that can be loaded into a SSE reg? Or maybe > that's not needed and reload costs will take care of that). > > I've also needed to pessimize the GPR alternative in smaxsi3 > because that instruction is supposed to drive all the decision > as it is cheaper when done on SSE regs. > > Tunings still wreck things, like using -march=bdver2 will > give you one vpmaxsd and the rest in integer regs, including > an inter-unit move via the stack. > > Still this is the best I can get to with my limited .md / LRA > skills. > > Is avoiding register-class preferencing from moves good? I think > it makes sense at least. > > How would one write smaxsi3 as a splitter to be split after > reload in the case LRA assigned the GPR alternative? Is it > even worth doing? Even the SSE reg alternative can be split > to remove the not needed CC clobber. > > Finally I'm unsure about the add where I needed to place > the SSE alternative before the 2nd op memory one since it > otherwise gets the same cost and wins. > > So - how to go forward with this? Sorry to come a bit late to the discussion. We are aware of CMOV issue for quite some time, but the issue is not understood yet in detail (I was hoping for Intel people to look at this). However, you demonstrated that using PMAX and PMIN instead of scalar CMOV can bring us big gains, and this thread now deals on how to best implement PMAX/PMIN for scalar code. I think that the way to go forward is with STV infrastructure. Currently, the implementation only deals with DImode on SSE2 32bit targets, but I see no issues on using STV pass also for SImode (on 32bit and 64bit targets). 
There are actually two STV passes, the first one (currently run on 64bit targets) is run before cse2, and the second (which currently runs on 32bit SSE2 only) is run after combine and before split1 pass. The second pass is interesting to us. The base idea of the second STV pass (for 32bit targets!) is that we introduce DImode _doubleword instructions that otherwise do not exist with integer registers. Now, the passes up to and including combine pass can use these instructions to simplify and optimize the insn flow. Later, based on cost analysis, STV pass either converts the _doubleword instructions to a real vector ones (e.g. V2DImode patterns) or leaves them intact, and a follow-up split pass splits them into scalar SImode instruction pairs. STV pass also takes care to move and preload values from their scalar form to a vector representation (using SUBREGs). Please note that all this happens on pseudos, and register allocator will later simply use scalar (integer) registers in scalar patterns and vector registers with vector insn patterns. Your approach to amend existing scalar SImode patterns with vector registers will introduce no end of problems. Register allocator will do funny things during register pressure, where values will take a trip to a vector register before being stored to memory (and vice versa, you already found some of them). Current RA simply can't distinguish clearly between two register sets. So, my advice would be to use STV pass also for SImode values, on 64bit and 32bit targets. On both targets, we will be able to use instructions that operate on vector register set, and for 32bit targets (and to some extent on 64bit targets), we would perhaps be able to relax register pressure in a kind of controlled way. So, to demonstrate the benefits of existing STV pass, it should be relatively easy to introduce 64bit max/min pattern on 32bit target to handle 64bit values. 
For 32bit values, the pass should be re-run to convert SImode scalar operations to vector operations in a controlled way, based on various cost functions. Uros. > Thanks, > Richard. > > Index: gcc/config/i386/i386.md > =================================================================== > --- gcc/config/i386/i386.md (revision 273792) > +++ gcc/config/i386/i386.md (working copy) > @@ -1881,6 +1881,33 @@ (define_expand "mov<mode>" > "" > "ix86_expand_move (<MODE>mode, operands); DONE;") > > +(define_insn "smaxsi3" > + [(set (match_operand:SI 0 "register_operand" "=?r,v,x") > + (smax:SI (match_operand:SI 1 "register_operand" "%0,v,0") > + (match_operand:SI 2 "register_operand" "r,v,x"))) > + (clobber (reg:CC FLAGS_REG))] > + "" > +{ > + switch (get_attr_type (insn)) > + { > + case TYPE_SSEADD: > + if (which_alternative == 1) > + return "vpmaxsd\t{%2, %1, %0|%0, %1, %2}"; > + else > + return "pmaxsd\t{%2, %0|%0, %2}"; > + case TYPE_ICMOV: > + /* ??? Instead split this after reload? */ > + return "cmpl\t{%2, %0|%0, %2}\n" > + "\tcmovl\t{%2, %0|%0, %2}"; > + default: > + gcc_unreachable (); > + } > +} > + [(set_attr "isa" "*,avx,sse4_noavx") > + (set_attr "prefix" "orig,vex,orig") > + (set_attr "memory" "none") > + (set_attr "type" "icmov,sseadd,sseadd")]) > + > (define_insn "*mov<mode>_xor" > [(set (match_operand:SWI48 0 "register_operand" "=r") > (match_operand:SWI48 1 "const0_operand")) > @@ -2342,9 +2369,9 @@ (define_peephole2 > > (define_insn "*movsi_internal" > [(set (match_operand:SI 0 "nonimmediate_operand" > - "=r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k") > + "=*r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k") > (match_operand:SI 1 "general_operand" > - "g ,re,C ,*y,m ,*y,*y,r ,C ,*v,m ,*v,*v,r ,*r,*km,*k ,CBC"))] > + "g ,*re,C ,*y,m ,*y,*y,r ,C ,*v,m ,*v,*v,r ,*r,*km,*k ,CBC"))] > "!(MEM_P (operands[0]) && MEM_P (operands[1]))" > { > switch (get_attr_type (insn)) > @@ -5368,10 +5395,10 @@ (define_insn_and_split "*add<dwi>3_doubl > }) > 
> (define_insn "*add<mode>_1" > - [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r") > + [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,v,x,r,r,r") > (plus:SWI48 > - (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r") > - (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le"))) > + (match_operand:SWI48 1 "nonimmediate_operand" "%0,v,0,0,r,r") > + (match_operand:SWI48 2 "x86_64_general_operand" "re,v,x,*m,0,le"))) > (clobber (reg:CC FLAGS_REG))] > "ix86_binary_operator_ok (PLUS, <MODE>mode, operands)" > { > @@ -5390,10 +5417,23 @@ (define_insn "*add<mode>_1" > return "dec{<imodesuffix>}\t%0"; > } > > + case TYPE_SSEADD: > + if (which_alternative == 1) > + { > + if (<MODE>mode == SImode) > + return "%vpaddd\t{%2, %1, %0|%0, %1, %2}"; > + else > + return "%vpaddq\t{%2, %1, %0|%0, %1, %2}"; > + } > + else if (<MODE>mode == SImode) > + return "paddd\t{%2, %0|%0, %2}"; > + else > + return "paddq\t{%2, %0|%0, %2}"; > + > default: > /* For most processors, ADD is faster than LEA. This alternative > was added to use ADD as much as possible. 
*/ > - if (which_alternative == 2) > + if (which_alternative == 4) > std::swap (operands[1], operands[2]); > > gcc_assert (rtx_equal_p (operands[0], operands[1])); > @@ -5403,9 +5443,14 @@ (define_insn "*add<mode>_1" > return "add{<imodesuffix>}\t{%2, %0|%0, %2}"; > } > } > - [(set (attr "type") > - (cond [(eq_attr "alternative" "3") > + [(set_attr "isa" "*,avx,sse2,*,*,*") > + (set (attr "type") > + (cond [(eq_attr "alternative" "5") > (const_string "lea") > + (eq_attr "alternative" "1") > + (const_string "sseadd") > + (eq_attr "alternative" "2") > + (const_string "sseadd") > (match_operand:SWI48 2 "incdec_operand") > (const_string "incdec") > ] > Index: gcc/config/i386/i386.c > =================================================================== > --- gcc/config/i386/i386.c (revision 273792) > +++ gcc/config/i386/i386.c (working copy) > @@ -14616,6 +14616,9 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx > unsigned int regno1 = true_regnum (operands[1]); > unsigned int regno2 = true_regnum (operands[2]); > > + if (SSE_REGNO_P (regno1)) > + return false; > + > /* If a = b + c, (a!=b && a!=c), must use lea form. */ > if (regno0 != regno1 && regno0 != regno2) > return true; ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-07-27 11:14 ` Uros Bizjak @ 2019-07-27 18:23 ` Uros Bizjak 2019-07-31 12:01 ` Richard Biener 0 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-07-27 18:23 UTC (permalink / raw) To: Richard Biener Cc: Martin Jambor, gcc-patches, Jakub Jelinek, Vladimir Makarov [-- Attachment #1: Type: text/plain, Size: 5717 bytes --] On Sat, Jul 27, 2019 at 12:07 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > How would one write smaxsi3 as a splitter to be split after > > reload in the case LRA assigned the GPR alternative? Is it > > even worth doing? Even the SSE reg alternative can be split > > to remove the not needed CC clobber. > > > > Finally I'm unsure about the add where I needed to place > > the SSE alternative before the 2nd op memory one since it > > otherwise gets the same cost and wins. > > > > So - how to go forward with this? > > Sorry to come a bit late to the discussion. > > We are aware of CMOV issue for quite some time, but the issue is not > understood yet in detail (I was hoping for Intel people to look at > this). However, you demonstrated that using PMAX and PMIN instead of > scalar CMOV can bring us big gains, and this thread now deals on how > to best implement PMAX/PMIN for scalar code. > > I think that the way to go forward is with STV infrastructure. > Currently, the implementation only deals with DImode on SSE2 32bit > targets, but I see no issues on using STV pass also for SImode (on > 32bit and 64bit targets). There are actually two STV passes, the first > one (currently run on 64bit targets) is run before cse2, and the > second (which currently runs on 32bit SSE2 only) is run after combine > and before split1 pass. The second pass is interesting to us. > > The base idea of the second STV pass (for 32bit targets!) is that we > introduce DImode _doubleword instructions that otherwise do not exist > with integer registers. 
Now, the passes up to and including combine > pass can use these instructions to simplify and optimize the insn > flow. Later, based on cost analysis, STV pass either converts the > _doubleword instructions to a real vector ones (e.g. V2DImode > patterns) or leaves them intact, and a follow-up split pass splits > them into scalar SImode instruction pairs. STV pass also takes care to > move and preload values from their scalar form to a vector > representation (using SUBREGs). Please note that all this happens on > pseudos, and register allocator will later simply use scalar (integer) > registers in scalar patterns and vector registers with vector insn > patterns. > > Your approach to amend existing scalar SImode patterns with vector > registers will introduce no end of problems. Register allocator will > do funny things during register pressure, where values will take a > trip to a vector register before being stored to memory (and vice > versa, you already found some of them). Current RA simply can't > distinguish clearly between two register sets. > > So, my advice would be to use STV pass also for SImode values, on > 64bit and 32bit targets. On both targets, we will be able to use > instructions that operate on vector register set, and for 32bit > targets (and to some extent on 64bit targets), we would perhaps be > able to relax register pressure in a kind of controlled way. > > So, to demonstrate the benefits of existing STV pass, it should be > relatively easy to introduce 64bit max/min pattern on 32bit target to > handle 64bit values. For 32bit values, the pass should be re-run to > convert SImode scalar operations to vector operations in a controlled > way, based on various cost functions. Please find attached patch to see STV in action. The compilation will crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2 dump, you will be able to see chain building, cost calculation and conversion insertion. 
The testcase: --cut here-- long long test (long long a, long long b) { return (a > b) ? a : b; } --cut here-- gcc -O2 -m32 -msse2 (-mstv): _.268r.stv2 dump: Searching for mode conversion candidates... insn 2 is marked as a candidate insn 3 is marked as a candidate insn 7 is marked as a candidate Created a new instruction chain #1 Building chain #1... Adding insn 2 to chain #1 Adding insn 7 into chain's #1 queue Adding insn 7 to chain #1 r85 use in insn 12 isn't convertible Mark r85 def in insn 7 as requiring both modes in chain #1 Adding insn 3 into chain's #1 queue Adding insn 3 to chain #1 Collected chain #1... insns: 2, 3, 7 defs to convert: r85 Computing gain for chain #1... Instruction conversion gain: 24 Registers conversion cost: 6 Total gain: 18 Converting chain #1... ... (insn 2 5 3 2 (set (reg/v:DI 83 [ a ]) (mem/c:DI (reg/f:SI 16 argp) [1 a+0 S8 A32])) "max.c":2:1 66 {*movdi_internal} (nil)) (insn 3 2 4 2 (set (reg/v:DI 84 [ b ]) (mem/c:DI (plus:SI (reg/f:SI 16 argp) (const_int 8 [0x8])) [1 b+0 S8 A32])) "max.c":2:1 66 {*movdi_internal} (nil)) (note 4 3 7 2 NOTE_INSN_FUNCTION_BEG) (insn 7 4 15 2 (set (subreg:V2DI (reg:DI 85) 0) (smax:V2DI (subreg:V2DI (reg/v:DI 84 [ b ]) 0) (subreg:V2DI (reg/v:DI 83 [ a ]) 0))) "max.c":3:22 -1 (expr_list:REG_DEAD (reg/v:DI 84 [ b ]) (expr_list:REG_DEAD (reg/v:DI 83 [ a ]) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil))))) (insn 15 7 16 2 (set (reg:V2DI 87) (subreg:V2DI (reg:DI 85) 0)) "max.c":3:22 -1 (nil)) (insn 16 15 17 2 (set (subreg:SI (reg:DI 86) 0) (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1 (nil)) (insn 17 16 18 2 (set (reg:V2DI 87) (lshiftrt:V2DI (reg:V2DI 87) (const_int 32 [0x20]))) "max.c":3:22 -1 (nil)) (insn 18 17 12 2 (set (subreg:SI (reg:DI 86) 4) (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1 (nil)) (insn 12 18 13 2 (set (reg/i:DI 0 ax) (reg:DI 86)) "max.c":4:1 66 {*movdi_internal} (expr_list:REG_DEAD (reg:DI 86) (nil))) (insn 13 12 0 2 (use (reg/i:DI 0 ax)) "max.c":4:1 -1 (nil)) Uros. 
[-- Attachment #2: smax.diff.txt --] [-- Type: text/plain, Size: 1502 bytes --] Index: i386-features.c =================================================================== --- i386-features.c (revision 273844) +++ i386-features.c (working copy) @@ -531,6 +531,9 @@ if (CONST_INT_P (XEXP (src, 1))) gain -= vector_const_cost (XEXP (src, 1)); } + else if (GET_CODE (src) == SMAX + || (GET_CODE (src) == SMIN)) + gain += COSTS_N_INSNS (3); else if (GET_CODE (src) == NEG || GET_CODE (src) == NOT) gain += ix86_cost->add - COSTS_N_INSNS (1); @@ -907,6 +910,8 @@ case IOR: case XOR: case AND: + case SMAX: + case SMIN: convert_op (&XEXP (src, 0), insn); convert_op (&XEXP (src, 1), insn); PUT_MODE (src, V2DImode); @@ -1285,6 +1290,8 @@ case IOR: case XOR: case AND: + case SMAX: + case SMIN: if (!REG_P (XEXP (src, 1)) && !MEM_P (XEXP (src, 1)) && !CONST_INT_P (XEXP (src, 1))) Index: i386.md =================================================================== --- i386.md (revision 273844) +++ i386.md (working copy) @@ -17489,6 +17489,14 @@ gcc_unreachable (); }) +(define_insn "smaxdi3" + [(set (match_operand:DI 0 "register_operand") + (smax:DI (match_operand:DI 1 "register_operand") + (match_operand:DI 2 "register_operand"))) + (clobber (reg:CC FLAGS_REG))] + "!TARGET_64BIT && TARGET_STV && TARGET_SSE2" + "#") + (define_expand "mov<mode>cc" [(set (match_operand:X87MODEF 0 "register_operand") (if_then_else:X87MODEF ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-07-27 18:23 ` Uros Bizjak @ 2019-07-31 12:01 ` Richard Biener 2019-08-01 8:54 ` Uros Bizjak 0 siblings, 1 reply; 61+ messages in thread From: Richard Biener @ 2019-07-31 12:01 UTC (permalink / raw) To: Uros Bizjak; +Cc: Martin Jambor, gcc-patches, Jakub Jelinek, Vladimir Makarov [-- Attachment #1: Type: text/plain, Size: 7601 bytes --] On Sat, 27 Jul 2019, Uros Bizjak wrote: > On Sat, Jul 27, 2019 at 12:07 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > How would one write smaxsi3 as a splitter to be split after > > > reload in the case LRA assigned the GPR alternative? Is it > > > even worth doing? Even the SSE reg alternative can be split > > > to remove the not needed CC clobber. > > > > > > Finally I'm unsure about the add where I needed to place > > > the SSE alternative before the 2nd op memory one since it > > > otherwise gets the same cost and wins. > > > > > > So - how to go forward with this? > > > > Sorry to come a bit late to the discussion. > > > > We are aware of CMOV issue for quite some time, but the issue is not > > understood yet in detail (I was hoping for Intel people to look at > > this). However, you demonstrated that using PMAX and PMIN instead of > > scalar CMOV can bring us big gains, and this thread now deals on how > > to best implement PMAX/PMIN for scalar code. > > > > I think that the way to go forward is with STV infrastructure. > > Currently, the implementation only deals with DImode on SSE2 32bit > > targets, but I see no issues on using STV pass also for SImode (on > > 32bit and 64bit targets). There are actually two STV passes, the first > > one (currently run on 64bit targets) is run before cse2, and the > > second (which currently runs on 32bit SSE2 only) is run after combine > > and before split1 pass. The second pass is interesting to us. > > > > The base idea of the second STV pass (for 32bit targets!) 
is that we > > introduce DImode _doubleword instructions that otherwise do not exist > > with integer registers. Now, the passes up to and including combine > > pass can use these instructions to simplify and optimize the insn > > flow. Later, based on cost analysis, STV pass either converts the > > _doubleword instructions to a real vector ones (e.g. V2DImode > > patterns) or leaves them intact, and a follow-up split pass splits > > them into scalar SImode instruction pairs. STV pass also takes care to > > move and preload values from their scalar form to a vector > > representation (using SUBREGs). Please note that all this happens on > > pseudos, and register allocator will later simply use scalar (integer) > > registers in scalar patterns and vector registers with vector insn > > patterns. > > > > Your approach to amend existing scalar SImode patterns with vector > > registers will introduce no end of problems. Register allocator will > > do funny things during register pressure, where values will take a > > trip to a vector register before being stored to memory (and vice > > versa, you already found some of them). Current RA simply can't > > distinguish clearly between two register sets. > > > > So, my advice would be to use STV pass also for SImode values, on > > 64bit and 32bit targets. On both targets, we will be able to use > > instructions that operate on vector register set, and for 32bit > > targets (and to some extent on 64bit targets), we would perhaps be > > able to relax register pressure in a kind of controlled way. > > > > So, to demonstrate the benefits of existing STV pass, it should be > > relatively easy to introduce 64bit max/min pattern on 32bit target to > > handle 64bit values. For 32bit values, the pass should be re-run to > > convert SImode scalar operations to vector operations in a controlled > > way, based on various cost functions. 
I've looked at STV before trying to use RA to solve the issue but quickly stepped away because of its structure which seems to be tied to particular modes, duplicating things for TImode and DImode so it looked like I have to write up everything again for SImode... It really should be possible to run the pass once, handling a set of modes rather than re-running it for the SImode case I am after. See also a recent PR about STV slowness and tendency to hog memory because it seems to enable every DF problem that is around... > Please find attached patch to see STV in action. The compilation will > crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2 > dump, you will be able to see chain building, cost calculation and > conversion insertion. So you unconditionally add a smaxdi3 pattern - indeed this looks necessary even when going the STV route. The actual regression for the testcase could also be solved by turning the smaxsi3 back into a compare and jump rather than a conditional move sequence. So I wonder how you'd do that given that there's pass_if_after_reload after pass_split_after_reload and I'm not sure we can split as late as pass_split_before_sched2 (there's also a split _after_ sched2 on x86 it seems). So how would you go about implementing {s,u}{min,max}{si,di}3 for the case STV doesn't end up doing any transform? You could save me some guesswork here if you can come up with a reasonably complete final set of patterns (ok, I only care about smaxsi3) so I can have a look at the STV approach again (you may remember I simply "split" at assembler emission time). Thanks, Richard. > The testcase: > > --cut here-- > long long test (long long a, long long b) > { > return (a > b) ? a : b; > } > --cut here-- > > gcc -O2 -m32 -msse2 (-mstv): > > _.268r.stv2 dump: > > Searching for mode conversion candidates... 
> insn 2 is marked as a candidate > insn 3 is marked as a candidate > insn 7 is marked as a candidate > Created a new instruction chain #1 > Building chain #1... > Adding insn 2 to chain #1 > Adding insn 7 into chain's #1 queue > Adding insn 7 to chain #1 > r85 use in insn 12 isn't convertible > Mark r85 def in insn 7 as requiring both modes in chain #1 > Adding insn 3 into chain's #1 queue > Adding insn 3 to chain #1 > Collected chain #1... > insns: 2, 3, 7 > defs to convert: r85 > Computing gain for chain #1... > Instruction conversion gain: 24 > Registers conversion cost: 6 > Total gain: 18 > Converting chain #1... > > ... > > (insn 2 5 3 2 (set (reg/v:DI 83 [ a ]) > (mem/c:DI (reg/f:SI 16 argp) [1 a+0 S8 A32])) "max.c":2:1 66 > {*movdi_internal} > (nil)) > (insn 3 2 4 2 (set (reg/v:DI 84 [ b ]) > (mem/c:DI (plus:SI (reg/f:SI 16 argp) > (const_int 8 [0x8])) [1 b+0 S8 A32])) "max.c":2:1 66 > {*movdi_internal} > (nil)) > (note 4 3 7 2 NOTE_INSN_FUNCTION_BEG) > (insn 7 4 15 2 (set (subreg:V2DI (reg:DI 85) 0) > (smax:V2DI (subreg:V2DI (reg/v:DI 84 [ b ]) 0) > (subreg:V2DI (reg/v:DI 83 [ a ]) 0))) "max.c":3:22 -1 > (expr_list:REG_DEAD (reg/v:DI 84 [ b ]) > (expr_list:REG_DEAD (reg/v:DI 83 [ a ]) > (expr_list:REG_UNUSED (reg:CC 17 flags) > (nil))))) > (insn 15 7 16 2 (set (reg:V2DI 87) > (subreg:V2DI (reg:DI 85) 0)) "max.c":3:22 -1 > (nil)) > (insn 16 15 17 2 (set (subreg:SI (reg:DI 86) 0) > (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1 > (nil)) > (insn 17 16 18 2 (set (reg:V2DI 87) > (lshiftrt:V2DI (reg:V2DI 87) > (const_int 32 [0x20]))) "max.c":3:22 -1 > (nil)) > (insn 18 17 12 2 (set (subreg:SI (reg:DI 86) 4) > (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1 > (nil)) > (insn 12 18 13 2 (set (reg/i:DI 0 ax) > (reg:DI 86)) "max.c":4:1 66 {*movdi_internal} > (expr_list:REG_DEAD (reg:DI 86) > (nil))) > (insn 13 12 0 2 (use (reg/i:DI 0 ax)) "max.c":4:1 -1 > (nil)) > > Uros. 
> -- Richard Biener <rguenther@suse.de> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-07-31 12:01 ` Richard Biener @ 2019-08-01 8:54 ` Uros Bizjak 2019-08-01 9:28 ` Richard Biener 0 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-01 8:54 UTC (permalink / raw) To: Richard Biener Cc: Martin Jambor, gcc-patches, Jakub Jelinek, Vladimir Makarov On Wed, Jul 31, 2019 at 1:21 PM Richard Biener <rguenther@suse.de> wrote: > > On Sat, 27 Jul 2019, Uros Bizjak wrote: > > > On Sat, Jul 27, 2019 at 12:07 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > How would one write smaxsi3 as a splitter to be split after > > > > reload in the case LRA assigned the GPR alternative? Is it > > > > even worth doing? Even the SSE reg alternative can be split > > > > to remove the not needed CC clobber. > > > > > > > > Finally I'm unsure about the add where I needed to place > > > > the SSE alternative before the 2nd op memory one since it > > > > otherwise gets the same cost and wins. > > > > > > > > So - how to go forward with this? > > > > > > Sorry to come a bit late to the discussion. > > > > > > We are aware of CMOV issue for quite some time, but the issue is not > > > understood yet in detail (I was hoping for Intel people to look at > > > this). However, you demonstrated that using PMAX and PMIN instead of > > > scalar CMOV can bring us big gains, and this thread now deals on how > > > to best implement PMAX/PMIN for scalar code. > > > > > > I think that the way to go forward is with STV infrastructure. > > > Currently, the implementation only deals with DImode on SSE2 32bit > > > targets, but I see no issues on using STV pass also for SImode (on > > > 32bit and 64bit targets). There are actually two STV passes, the first > > > one (currently run on 64bit targets) is run before cse2, and the > > > second (which currently runs on 32bit SSE2 only) is run after combine > > > and before split1 pass. The second pass is interesting to us. 
> > > > > > The base idea of the second STV pass (for 32bit targets!) is that we > > > introduce a DImode _doubleword instructons that otherwise do not exist > > > with integer registers. Now, the passes up to and including combine > > > pass can use these instructions to simplify and optimize the insn > > > flow. Later, based on cost analysis, STV pass either converts the > > > _doubleword instructions to a real vector ones (e.g. V2DImode > > > patterns) or leaves them intact, and a follow-up split pass splits > > > them into scalar SImode instruction pairs. STV pass also takes care to > > > move and preload values from their scalar form to a vector > > > representation (using SUBREGs). Please note that all this happens on > > > pseudos, and register allocator will later simply use scalar (integer) > > > registers in scalar patterns and vector registers with vector insn > > > patterns. > > > > > > Your approach to amend existing scalar SImode patterns with vector > > > registers will introduce no end of problems. Register allocator will > > > do funny things during register pressure, where values will take a > > > trip to a vector register before being stored to memory (and vice > > > versa, you already found some of them). Current RA simply can't > > > distinguish clearly between two register sets. > > > > > > So, my advice would be to use STV pass also for SImode values, on > > > 64bit and 32bit targets. On both targets, we will be able to use > > > instructions that operate on vector register set, and for 32bit > > > targets (and to some extent on 64bit targets), we would perhaps be > > > able to relax register pressure in a kind of controlled way. > > > > > > So, to demonstrate the benefits of existing STV pass, it should be > > > relatively easy to introduce 64bit max/min pattern on 32bit target to > > > handle 64bit values. 
For 32bit values, the pass should be re-run to > > > convert SImode scalar operations to vector operations in a controlled > > > way, based on various cost functions. > > I've looked at STV before trying to use RA to solve the issue but > quickly stepped away because of its structure which seems to be > tied to particular modes, duplicating things for TImode and DImode > so it looked like I have to write up everything again for SImode... ATM, DImode is used exclusively for x86_32 while TImode is used exclusively for x86_64. Also, TImode is used for different purpose before combine, while DImode is used after combine. I don't remember the details, but IIRC it made sense for the intended purpose. > > It really should be possible to run the pass once, handling a set > of modes rather than re-running it for the SImode case I am after. > See also a recent PR about STV slowness and tendency to hog memory > because it seems to enable every DF problem that is around... Huh, I was not aware of implementation details... > > Please find attached patch to see STV in action. The compilation will > > crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2 > > dump, you will be able to see chain building, cost calculation and > > conversion insertion. > > So you unconditionally add a smaxdi3 pattern - indeed this looks > necessary even when going the STV route. The actual regression > for the testcase could also be solved by turing the smaxsi3 > back into a compare and jump rather than a conditional move sequence. > So I wonder how you'd do that given that there's pass_if_after_reload > after pass_split_after_reload and I'm not sure we can split > as late as pass_split_before_sched2 (there's also a split _after_ > sched2 on x86 it seems). > > So how would you go implement {s,u}{min,max}{si,di}3 for the > case STV doesn't end up doing any transform? If STV doesn't transform the insn, then a pre-reload splitter splits the insn back to compare+cmove. 
However, considering the SImode move from/to int/xmm register is relatively cheap, the cost function should be tuned so that STV always converts smaxsi3 pattern. (As said before, the fix of the slowdown with consecutive cmov insns is a side effect of the transformation to smax insn that helps in this particular case, I think that this issue should be fixed in a general way, there are already a couple of PRs reported). > You could save me some guesswork here if you can come up with > a reasonably complete final set of patterns (ok, I only care > about smaxsi3) so I can have a look at the STV approach again > (you may remember I simply "split" at assembler emission time). I think that the cost function should always enable smaxsi3 generation. To further optimize STV chain (to avoid unnecessary xmm<->int transitions) we could add all integer logic, arithmetic and constant shifts to the candidates (the ones that DImode STV converts). Uros. > Thanks, > Richard. > > > The testcase: > > > > --cut here-- > > long long test (long long a, long long b) > > { > > return (a > b) ? a : b; > > } > > --cut here-- > > > > gcc -O2 -m32 -msse2 (-mstv): > > > > _.268r.stv2 dump: > > > > Searching for mode conversion candidates... > > insn 2 is marked as a candidate > > insn 3 is marked as a candidate > > insn 7 is marked as a candidate > > Created a new instruction chain #1 > > Building chain #1... > > Adding insn 2 to chain #1 > > Adding insn 7 into chain's #1 queue > > Adding insn 7 to chain #1 > > r85 use in insn 12 isn't convertible > > Mark r85 def in insn 7 as requiring both modes in chain #1 > > Adding insn 3 into chain's #1 queue > > Adding insn 3 to chain #1 > > Collected chain #1... > > insns: 2, 3, 7 > > defs to convert: r85 > > Computing gain for chain #1... > > Instruction conversion gain: 24 > > Registers conversion cost: 6 > > Total gain: 18 > > Converting chain #1... > > > > ... 
> > > > (insn 2 5 3 2 (set (reg/v:DI 83 [ a ]) > > (mem/c:DI (reg/f:SI 16 argp) [1 a+0 S8 A32])) "max.c":2:1 66 > > {*movdi_internal} > > (nil)) > > (insn 3 2 4 2 (set (reg/v:DI 84 [ b ]) > > (mem/c:DI (plus:SI (reg/f:SI 16 argp) > > (const_int 8 [0x8])) [1 b+0 S8 A32])) "max.c":2:1 66 > > {*movdi_internal} > > (nil)) > > (note 4 3 7 2 NOTE_INSN_FUNCTION_BEG) > > (insn 7 4 15 2 (set (subreg:V2DI (reg:DI 85) 0) > > (smax:V2DI (subreg:V2DI (reg/v:DI 84 [ b ]) 0) > > (subreg:V2DI (reg/v:DI 83 [ a ]) 0))) "max.c":3:22 -1 > > (expr_list:REG_DEAD (reg/v:DI 84 [ b ]) > > (expr_list:REG_DEAD (reg/v:DI 83 [ a ]) > > (expr_list:REG_UNUSED (reg:CC 17 flags) > > (nil))))) > > (insn 15 7 16 2 (set (reg:V2DI 87) > > (subreg:V2DI (reg:DI 85) 0)) "max.c":3:22 -1 > > (nil)) > > (insn 16 15 17 2 (set (subreg:SI (reg:DI 86) 0) > > (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1 > > (nil)) > > (insn 17 16 18 2 (set (reg:V2DI 87) > > (lshiftrt:V2DI (reg:V2DI 87) > > (const_int 32 [0x20]))) "max.c":3:22 -1 > > (nil)) > > (insn 18 17 12 2 (set (subreg:SI (reg:DI 86) 4) > > (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1 > > (nil)) > > (insn 12 18 13 2 (set (reg/i:DI 0 ax) > > (reg:DI 86)) "max.c":4:1 66 {*movdi_internal} > > (expr_list:REG_DEAD (reg:DI 86) > > (nil))) > > (insn 13 12 0 2 (use (reg/i:DI 0 ax)) "max.c":4:1 -1 > > (nil)) > > > > Uros. > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; > GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-01 8:54 ` Uros Bizjak @ 2019-08-01 9:28 ` Richard Biener 2019-08-01 9:38 ` Uros Bizjak 0 siblings, 1 reply; 61+ messages in thread From: Richard Biener @ 2019-08-01 9:28 UTC (permalink / raw) To: Uros Bizjak; +Cc: Martin Jambor, gcc-patches, Jakub Jelinek, Vladimir Makarov [-- Attachment #1: Type: text/plain, Size: 10442 bytes --] On Thu, 1 Aug 2019, Uros Bizjak wrote: > On Wed, Jul 31, 2019 at 1:21 PM Richard Biener <rguenther@suse.de> wrote: > > > > On Sat, 27 Jul 2019, Uros Bizjak wrote: > > > > > On Sat, Jul 27, 2019 at 12:07 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > > > How would one write smaxsi3 as a splitter to be split after > > > > > reload in the case LRA assigned the GPR alternative? Is it > > > > > even worth doing? Even the SSE reg alternative can be split > > > > > to remove the not needed CC clobber. > > > > > > > > > > Finally I'm unsure about the add where I needed to place > > > > > the SSE alternative before the 2nd op memory one since it > > > > > otherwise gets the same cost and wins. > > > > > > > > > > So - how to go forward with this? > > > > > > > > Sorry to come a bit late to the discussion. > > > > > > > > We are aware of CMOV issue for quite some time, but the issue is not > > > > understood yet in detail (I was hoping for Intel people to look at > > > > this). However, you demonstrated that using PMAX and PMIN instead of > > > > scalar CMOV can bring us big gains, and this thread now deals on how > > > > to best implement PMAX/PMIN for scalar code. > > > > > > > > I think that the way to go forward is with STV infrastructure. > > > > Currently, the implementation only deals with DImode on SSE2 32bit > > > > targets, but I see no issues on using STV pass also for SImode (on > > > > 32bit and 64bit targets). 
There are actually two STV passes, the first > > > > one (currently run on 64bit targets) is run before cse2, and the > > > > second (which currently runs on 32bit SSE2 only) is run after combine > > > > and before split1 pass. The second pass is interesting to us. > > > > > > > > The base idea of the second STV pass (for 32bit targets!) is that we > > > > introduce a DImode _doubleword instructons that otherwise do not exist > > > > with integer registers. Now, the passes up to and including combine > > > > pass can use these instructions to simplify and optimize the insn > > > > flow. Later, based on cost analysis, STV pass either converts the > > > > _doubleword instructions to a real vector ones (e.g. V2DImode > > > > patterns) or leaves them intact, and a follow-up split pass splits > > > > them into scalar SImode instruction pairs. STV pass also takes care to > > > > move and preload values from their scalar form to a vector > > > > representation (using SUBREGs). Please note that all this happens on > > > > pseudos, and register allocator will later simply use scalar (integer) > > > > registers in scalar patterns and vector registers with vector insn > > > > patterns. > > > > > > > > Your approach to amend existing scalar SImode patterns with vector > > > > registers will introduce no end of problems. Register allocator will > > > > do funny things during register pressure, where values will take a > > > > trip to a vector register before being stored to memory (and vice > > > > versa, you already found some of them). Current RA simply can't > > > > distinguish clearly between two register sets. > > > > > > > > So, my advice would be to use STV pass also for SImode values, on > > > > 64bit and 32bit targets. 
On both targets, we will be able to use > > > > instructions that operate on vector register set, and for 32bit > > > > targets (and to some extent on 64bit targets), we would perhaps be > > > > able to relax register pressure in a kind of controlled way. > > > > > > > > So, to demonstrate the benefits of existing STV pass, it should be > > > > relatively easy to introduce 64bit max/min pattern on 32bit target to > > > > handle 64bit values. For 32bit values, the pass should be re-run to > > > > convert SImode scalar operations to vector operations in a controlled > > > > way, based on various cost functions. > > > > I've looked at STV before trying to use RA to solve the issue but > > quickly stepped away because of its structure which seems to be > > tied to particular modes, duplicating things for TImode and DImode > > so it looked like I have to write up everything again for SImode... > > ATM, DImode is used exclusively for x86_32 while TImode is used > exclusively for x86_64. Also, TImode is used for different purpose > before combine, while DImode is used after combine. I don't remember > the details, but IIRC it made sense for the intended purpose. > > > > It really should be possible to run the pass once, handling a set > > of modes rather than re-running it for the SImode case I am after. > > See also a recent PR about STV slowness and tendency to hog memory > > because it seems to enable every DF problem that is around... > > Huh, I was not aware of implementation details... > > > > Please find attached patch to see STV in action. The compilation will > > > crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2 > > > dump, you will be able to see chain building, cost calculation and > > > conversion insertion. > > > > So you unconditionally add a smaxdi3 pattern - indeed this looks > > necessary even when going the STV route. 
The actual regression > > for the testcase could also be solved by turing the smaxsi3 > > back into a compare and jump rather than a conditional move sequence. > > So I wonder how you'd do that given that there's pass_if_after_reload > > after pass_split_after_reload and I'm not sure we can split > > as late as pass_split_before_sched2 (there's also a split _after_ > > sched2 on x86 it seems). > > > > So how would you go implement {s,u}{min,max}{si,di}3 for the > > case STV doesn't end up doing any transform? > > If STV doesn't transform the insn, then a pre-reload splitter splits > the insn back to compare+cmove. OK, that would work. But there's no way to force a jumpy sequence then which we know is faster than compare+cmove because later RTL if-conversion passes happily re-discover the smax (or conditional move) sequence. > However, considering the SImode move > from/to int/xmm register is relatively cheap, the cost function should > be tuned so that STV always converts smaxsi3 pattern. Note that on both Zen and even more so bdverN the int/xmm transition makes it no longer profitable but a _lot_ slower than the cmp/cmov sequence... (for the loop in hmmer which is the only one I see any effect of any of my patches). So identifying chains that start/end in memory is important for cost reasons. So I think the splitting has to happen after the last if-conversion pass (and thus we may need to allocate a scratch register for this purpose?) > (As said before, > the fix of the slowdown with consecutive cmov insns is a side effect > of the transformation to smax insn that helps in this particular case, > I think that this issue should be fixed in a general way, there are > already a couple of PRs reported). > > > You could save me some guesswork here if you can come up with > > a reasonably complete final set of patterns (ok, I only care > > about smaxsi3) so I can have a look at the STV approach again > > (you may remember I simply "split" at assembler emission time). 
> > I think that the cost function should always enable smaxsi3 > generation. To further optimize STV chain (to avoid unnecessary > xmm<->int transitions) we could add all integer logic, arithmetic and > constant shifts to the candidates (the ones that DImode STV converts). > > Uros. > > > Thanks, > > Richard. > > > > > The testcase: > > > > > > --cut here-- > > > long long test (long long a, long long b) > > > { > > > return (a > b) ? a : b; > > > } > > > --cut here-- > > > > > > gcc -O2 -m32 -msse2 (-mstv): > > > > > > _.268r.stv2 dump: > > > > > > Searching for mode conversion candidates... > > > insn 2 is marked as a candidate > > > insn 3 is marked as a candidate > > > insn 7 is marked as a candidate > > > Created a new instruction chain #1 > > > Building chain #1... > > > Adding insn 2 to chain #1 > > > Adding insn 7 into chain's #1 queue > > > Adding insn 7 to chain #1 > > > r85 use in insn 12 isn't convertible > > > Mark r85 def in insn 7 as requiring both modes in chain #1 > > > Adding insn 3 into chain's #1 queue > > > Adding insn 3 to chain #1 > > > Collected chain #1... > > > insns: 2, 3, 7 > > > defs to convert: r85 > > > Computing gain for chain #1... > > > Instruction conversion gain: 24 > > > Registers conversion cost: 6 > > > Total gain: 18 > > > Converting chain #1... > > > > > > ... 
> > > > > > (insn 2 5 3 2 (set (reg/v:DI 83 [ a ]) > > > (mem/c:DI (reg/f:SI 16 argp) [1 a+0 S8 A32])) "max.c":2:1 66 > > > {*movdi_internal} > > > (nil)) > > > (insn 3 2 4 2 (set (reg/v:DI 84 [ b ]) > > > (mem/c:DI (plus:SI (reg/f:SI 16 argp) > > > (const_int 8 [0x8])) [1 b+0 S8 A32])) "max.c":2:1 66 > > > {*movdi_internal} > > > (nil)) > > > (note 4 3 7 2 NOTE_INSN_FUNCTION_BEG) > > > (insn 7 4 15 2 (set (subreg:V2DI (reg:DI 85) 0) > > > (smax:V2DI (subreg:V2DI (reg/v:DI 84 [ b ]) 0) > > > (subreg:V2DI (reg/v:DI 83 [ a ]) 0))) "max.c":3:22 -1 > > > (expr_list:REG_DEAD (reg/v:DI 84 [ b ]) > > > (expr_list:REG_DEAD (reg/v:DI 83 [ a ]) > > > (expr_list:REG_UNUSED (reg:CC 17 flags) > > > (nil))))) > > > (insn 15 7 16 2 (set (reg:V2DI 87) > > > (subreg:V2DI (reg:DI 85) 0)) "max.c":3:22 -1 > > > (nil)) > > > (insn 16 15 17 2 (set (subreg:SI (reg:DI 86) 0) > > > (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1 > > > (nil)) > > > (insn 17 16 18 2 (set (reg:V2DI 87) > > > (lshiftrt:V2DI (reg:V2DI 87) > > > (const_int 32 [0x20]))) "max.c":3:22 -1 > > > (nil)) > > > (insn 18 17 12 2 (set (subreg:SI (reg:DI 86) 4) > > > (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1 > > > (nil)) > > > (insn 12 18 13 2 (set (reg/i:DI 0 ax) > > > (reg:DI 86)) "max.c":4:1 66 {*movdi_internal} > > > (expr_list:REG_DEAD (reg:DI 86) > > > (nil))) > > > (insn 13 12 0 2 (use (reg/i:DI 0 ax)) "max.c":4:1 -1 > > > (nil)) > > > > > > Uros. > > > > > > > -- > > Richard Biener <rguenther@suse.de> > > SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; > > GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg) > -- Richard Biener <rguenther@suse.de> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-01 9:28 ` Richard Biener @ 2019-08-01 9:38 ` Uros Bizjak 2019-08-03 17:26 ` Richard Biener 0 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-01 9:38 UTC (permalink / raw) To: Richard Biener Cc: Martin Jambor, gcc-patches, Jakub Jelinek, Vladimir Makarov On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote: > > > So you unconditionally add a smaxdi3 pattern - indeed this looks > > > necessary even when going the STV route. The actual regression > > > for the testcase could also be solved by turing the smaxsi3 > > > back into a compare and jump rather than a conditional move sequence. > > > So I wonder how you'd do that given that there's pass_if_after_reload > > > after pass_split_after_reload and I'm not sure we can split > > > as late as pass_split_before_sched2 (there's also a split _after_ > > > sched2 on x86 it seems). > > > > > > So how would you go implement {s,u}{min,max}{si,di}3 for the > > > case STV doesn't end up doing any transform? > > > > If STV doesn't transform the insn, then a pre-reload splitter splits > > the insn back to compare+cmove. > > OK, that would work. But there's no way to force a jumpy sequence then > which we know is faster than compare+cmove because later RTL > if-conversion passes happily re-discover the smax (or conditional move) > sequence. > > > However, considering the SImode move > > from/to int/xmm register is relatively cheap, the cost function should > > be tuned so that STV always converts smaxsi3 pattern. > > Note that on both Zen and even more so bdverN the int/xmm transition > makes it no longer profitable but a _lot_ slower than the cmp/cmov > sequence... (for the loop in hmmer which is the only one I see > any effect of any of my patches). So identifying chains that > start/end in memory is important for cost reasons. 
Please note that the cost function also considers the cost of move from/to xmm. So, the cost of the whole chain would disable the transformation. > So I think the splitting has to happen after the last if-conversion > pass (and thus we may need to allocate a scratch register for this > purpose?) I really hope that the underlying issue will be solved by a machine-dependent pass inserted somewhere after the pre-reload split. This way, we can split an unconverted smax to a cmove, and this later pass would handle jcc and cmove instructions. Until then... yes, your proposed approach is one of the ways to avoid unwanted if-conversion, although sometimes we would like to split to cmove instead. Uros. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-01 9:38 ` Uros Bizjak @ 2019-08-03 17:26 ` Richard Biener 2019-08-04 17:11 ` Uros Bizjak 0 siblings, 1 reply; 61+ messages in thread From: Richard Biener @ 2019-08-03 17:26 UTC (permalink / raw) To: Uros Bizjak; +Cc: gcc-patches, Jakub Jelinek On Thu, 1 Aug 2019, Uros Bizjak wrote: > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote: > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks >>>> necessary even when going the STV route. The actual regression >>>> for the testcase could also be solved by turing the smaxsi3 >>>> back into a compare and jump rather than a conditional move sequence. >>>> So I wonder how you'd do that given that there's pass_if_after_reload >>>> after pass_split_after_reload and I'm not sure we can split >>>> as late as pass_split_before_sched2 (there's also a split _after_ >>>> sched2 on x86 it seems). >>>> >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the >>>> case STV doesn't end up doing any transform? >>> >>> If STV doesn't transform the insn, then a pre-reload splitter splits >>> the insn back to compare+cmove. >> >> OK, that would work. But there's no way to force a jumpy sequence then >> which we know is faster than compare+cmove because later RTL >> if-conversion passes happily re-discover the smax (or conditional move) >> sequence. >> >>> However, considering the SImode move >>> from/to int/xmm register is relatively cheap, the cost function should >>> be tuned so that STV always converts smaxsi3 pattern. >> >> Note that on both Zen and even more so bdverN the int/xmm transition >> makes it no longer profitable but a _lot_ slower than the cmp/cmov >> sequence... (for the loop in hmmer which is the only one I see >> any effect of any of my patches). So identifying chains that >> start/end in memory is important for cost reasons. 
> > Please note that the cost function also considers the cost of move > from/to xmm. So, the cost of the whole chain would disable the > transformation. > >> So I think the splitting has to happen after the last if-conversion >> pass (and thus we may need to allocate a scratch register for this >> purpose?) > > I really hope that the underlying issue will be solved by a machine > dependant pass inserted somewhere after the pre-reload split. This > way, we can split unconverted smax to the cmove, and this later pass > would handle jcc and cmove instructions. Until then... yes your > proposed approach is one of the ways to avoid unwanted if-conversion, > although sometimes we would like to split to cmove instead. So the following makes STV also consider SImode chains, re-using the DImode chain code. I've kept a simple incomplete smaxsi3 pattern and also did not alter the {SI,DI}mode chain cost function - it's quite off for TARGET_64BIT. With this I get the expected conversion for the testcase derived from hmmer. No further testing so far. Is it OK to re-use the DImode chain code this way? I'll clean things up some more of course. Still need help with the actual patterns for minmax and what the splitters should look like. Richard. Index: gcc/config/i386/i386-features.c =================================================================== --- gcc/config/i386/i386-features.c (revision 274037) +++ gcc/config/i386/i386-features.c (working copy) @@ -276,8 +276,11 @@ /* Initialize new chain.
*/ -scalar_chain::scalar_chain () +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) { + smode = smode_; + vmode = vmode_; + chain_id = ++max_id; if (dump_file) @@ -473,7 +476,7 @@ { gcc_assert (CONST_INT_P (exp)); - if (standard_sse_constant_p (exp, V2DImode)) + if (standard_sse_constant_p (exp, vmode)) return COSTS_N_INSNS (1); return ix86_cost->sse_load[1]; } @@ -534,6 +537,9 @@ else if (GET_CODE (src) == NEG || GET_CODE (src) == NOT) gain += ix86_cost->add - COSTS_N_INSNS (1); + else if (GET_CODE (src) == SMAX + || GET_CODE (src) == SMIN) + gain += COSTS_N_INSNS (3); else if (GET_CODE (src) == COMPARE) { /* Assume comparison cost is the same. */ @@ -573,7 +579,7 @@ dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) { if (x == reg) - return gen_rtx_SUBREG (V2DImode, new_reg, 0); + return gen_rtx_SUBREG (vmode, new_reg, 0); const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); int i, j; @@ -707,7 +713,7 @@ bitmap_copy (conv, insns); if (scalar_copy) - scopy = gen_reg_rtx (DImode); + scopy = gen_reg_rtx (smode); for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) { @@ -750,6 +756,10 @@ gen_rtx_VEC_SELECT (SImode, gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); } + else if (smode == SImode) + { + emit_move_insn (scopy, gen_rtx_SUBREG (SImode, reg, 0)); + } else { rtx vcopy = gen_reg_rtx (V2DImode); @@ -816,14 +826,14 @@ if (GET_CODE (*op) == NOT) { convert_op (&XEXP (*op, 0), insn); - PUT_MODE (*op, V2DImode); + PUT_MODE (*op, vmode); } else if (MEM_P (*op)) { - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (*op)); emit_insn_before (gen_move_insn (tmp, *op), insn); - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); + *op = gen_rtx_SUBREG (vmode, tmp, 0); if (dump_file) fprintf (dump_file, " Preloading operand for insn %d into r%d\n", @@ -841,24 +851,30 @@ gcc_assert (!DF_REF_CHAIN (ref)); break; } - *op = gen_rtx_SUBREG (V2DImode, *op, 0); + *op = gen_rtx_SUBREG (vmode, *op, 0); } else if 
(CONST_INT_P (*op)) { rtx vec_cst; - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); /* Prefer all ones vector in case of -1. */ if (constm1_operand (*op, GET_MODE (*op))) - vec_cst = CONSTM1_RTX (V2DImode); + vec_cst = CONSTM1_RTX (vmode); else - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, - gen_rtvec (2, *op, const0_rtx)); + { + unsigned n = GET_MODE_NUNITS (vmode); + rtx *v = XALLOCAVEC (rtx, n); + v[0] = *op; + for (unsigned i = 1; i < n; ++i) + v[i] = const0_rtx; + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); + } - if (!standard_sse_constant_p (vec_cst, V2DImode)) + if (!standard_sse_constant_p (vec_cst, vmode)) { start_sequence (); - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); rtx_insn *seq = get_insns (); end_sequence (); emit_insn_before (seq, insn); @@ -870,7 +886,7 @@ else { gcc_assert (SUBREG_P (*op)); - gcc_assert (GET_MODE (*op) == V2DImode); + gcc_assert (GET_MODE (*op) == vmode); } } @@ -888,9 +904,9 @@ { /* There are no scalar integer instructions and therefore temporary register usage is required. 
*/ - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (dst)); emit_conversion_insns (gen_move_insn (dst, tmp), insn); - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); + dst = gen_rtx_SUBREG (vmode, tmp, 0); } switch (GET_CODE (src)) @@ -899,7 +915,7 @@ case ASHIFTRT: case LSHIFTRT: convert_op (&XEXP (src, 0), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case PLUS: @@ -907,25 +923,27 @@ case IOR: case XOR: case AND: + case SMAX: + case SMIN: convert_op (&XEXP (src, 0), insn); convert_op (&XEXP (src, 1), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case NEG: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); - src = gen_rtx_MINUS (V2DImode, subreg, src); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); + src = gen_rtx_MINUS (vmode, subreg, src); break; case NOT: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); - src = gen_rtx_XOR (V2DImode, src, subreg); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); + src = gen_rtx_XOR (vmode, src, subreg); break; case MEM: @@ -939,17 +957,17 @@ break; case SUBREG: - gcc_assert (GET_MODE (src) == V2DImode); + gcc_assert (GET_MODE (src) == vmode); break; case COMPARE: src = SUBREG_REG (XEXP (XEXP (src, 0), 0)); - gcc_assert ((REG_P (src) && GET_MODE (src) == DImode) - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); + gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode)) + || (SUBREG_P (src) && GET_MODE (src) == vmode)); if (REG_P (src)) - subreg = gen_rtx_SUBREG (V2DImode, src, 0); + subreg = gen_rtx_SUBREG (vmode, src, 0); else subreg = copy_rtx_if_shared (src); emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), @@ 
-1186,7 +1204,7 @@ (const_int 0 [0]))) */ static bool -convertible_comparison_p (rtx_insn *insn) +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) { if (!TARGET_SSE4_1) return false; @@ -1219,12 +1237,12 @@ if (!SUBREG_P (op1) || !SUBREG_P (op2) - || GET_MODE (op1) != SImode - || GET_MODE (op2) != SImode + || GET_MODE (op1) != mode + || GET_MODE (op2) != mode || ((SUBREG_BYTE (op1) != 0 - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) && (SUBREG_BYTE (op2) != 0 - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) return false; op1 = SUBREG_REG (op1); @@ -1232,7 +1250,7 @@ if (op1 != op2 || !REG_P (op1) - || GET_MODE (op1) != DImode) + || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ()) return false; return true; @@ -1241,7 +1259,7 @@ /* The DImode version of scalar_to_vector_candidate_p. */ static bool -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) +dimode_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) { rtx def_set = single_set (insn); @@ -1255,12 +1273,12 @@ rtx dst = SET_DEST (def_set); if (GET_CODE (src) == COMPARE) - return convertible_comparison_p (insn); + return convertible_comparison_p (insn, mode); /* We are interested in DImode promotion only. 
*/ - if ((GET_MODE (src) != DImode + if ((GET_MODE (src) != mode && !CONST_INT_P (src)) - || GET_MODE (dst) != DImode) + || GET_MODE (dst) != mode) return false; if (!REG_P (dst) && !MEM_P (dst)) @@ -1285,12 +1303,14 @@ case IOR: case XOR: case AND: + case SMAX: + case SMIN: if (!REG_P (XEXP (src, 1)) && !MEM_P (XEXP (src, 1)) && !CONST_INT_P (XEXP (src, 1))) return false; - if (GET_MODE (XEXP (src, 1)) != DImode + if (GET_MODE (XEXP (src, 1)) != mode && !CONST_INT_P (XEXP (src, 1))) return false; break; @@ -1319,7 +1339,7 @@ || !REG_P (XEXP (XEXP (src, 0), 0)))) return false; - if (GET_MODE (XEXP (src, 0)) != DImode + if (GET_MODE (XEXP (src, 0)) != mode && !CONST_INT_P (XEXP (src, 0))) return false; @@ -1392,7 +1412,7 @@ if (TARGET_64BIT) return timode_scalar_to_vector_candidate_p (insn); else - return dimode_scalar_to_vector_candidate_p (insn); + return dimode_scalar_to_vector_candidate_p (insn, DImode); } /* The DImode version of remove_non_convertible_regs. */ @@ -1577,11 +1597,12 @@ convert_scalars_to_vector () { basic_block bb; - bitmap candidates; + bitmap candidates, sicandidates; int converted_insns = 0; bitmap_obstack_initialize (NULL); candidates = BITMAP_ALLOC (NULL); + sicandidates = BITMAP_ALLOC (NULL); calculate_dominance_info (CDI_DOMINATORS); df_set_flags (DF_DEFER_INSN_RESCAN); @@ -1605,28 +1626,43 @@ bitmap_set_bit (candidates, INSN_UID (insn)); } + else if (dimode_scalar_to_vector_candidate_p (insn, SImode)) + { + if (dump_file) + fprintf (dump_file, " insn %d is marked as a SI candidate\n", + INSN_UID (insn)); + + bitmap_set_bit (sicandidates, INSN_UID (insn)); + } } remove_non_convertible_regs (candidates); + dimode_remove_non_convertible_regs (sicandidates); - if (bitmap_empty_p (candidates)) + if (bitmap_empty_p (candidates) + && bitmap_empty_p (sicandidates)) if (dump_file) fprintf (dump_file, "There are no candidates for optimization.\n"); - while (!bitmap_empty_p (candidates)) + bitmap cand = candidates; + do { - unsigned uid = 
bitmap_first_set_bit (candidates); + while (!bitmap_empty_p (cand)) + { + unsigned uid = bitmap_first_set_bit (cand); scalar_chain *chain; - if (TARGET_64BIT) + if (TARGET_64BIT && cand == candidates) chain = new timode_scalar_chain; - else - chain = new dimode_scalar_chain; + else if (cand == candidates) + chain = new dimode_scalar_chain (DImode, V2DImode); + else if (cand == sicandidates) + chain = new dimode_scalar_chain (SImode, V4SImode); /* Find instructions chain we want to convert to vector mode. Check all uses and definitions to estimate all required conversions. */ - chain->build (candidates, uid); + chain->build (cand, uid); if (chain->compute_convert_gain () > 0) converted_insns += chain->convert (); @@ -1637,11 +1673,17 @@ delete chain; } + if (cand == sicandidates) + break; + cand = sicandidates; + } + while (1); if (dump_file) fprintf (dump_file, "Total insns converted: %d\n", converted_insns); BITMAP_FREE (candidates); + BITMAP_FREE (sicandidates); bitmap_obstack_release (NULL); df_process_deferred_rescans (); Index: gcc/config/i386/i386-features.h =================================================================== --- gcc/config/i386/i386-features.h (revision 274037) +++ gcc/config/i386/i386-features.h (working copy) @@ -127,11 +127,16 @@ class scalar_chain { public: - scalar_chain (); + scalar_chain (enum machine_mode, enum machine_mode); virtual ~scalar_chain (); static unsigned max_id; + /* Scalar mode. */ + enum machine_mode smode; + /* Vector mode. */ + enum machine_mode vmode; + /* ID of a chain. */ unsigned int chain_id; /* A queue of instructions to be included into a chain. 
*/ @@ -162,6 +167,8 @@ class dimode_scalar_chain : public scalar_chain { public: + dimode_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) + : scalar_chain (smode_, vmode_) {} int compute_convert_gain (); private: void mark_dual_mode_def (df_ref def); @@ -178,6 +185,8 @@ class timode_scalar_chain : public scalar_chain { public: + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {} + /* Convert from TImode to V1TImode is always faster. */ int compute_convert_gain () { return 1; } Index: gcc/config/i386/i386.md =================================================================== --- gcc/config/i386/i386.md (revision 274037) +++ gcc/config/i386/i386.md (working copy) @@ -5325,6 +5325,16 @@ (const_string "SI") (const_string "<MODE>")))]) +;; min/max patterns + +(define_insn "smaxsi3" + [(set (match_operand:SI 0 "register_operand") + (smax:SI (match_operand:SI 1 "register_operand") + (match_operand:SI 2 "register_operand"))) + (clobber (reg:CC FLAGS_REG))] + "TARGET_STV && TARGET_SSE4_1" + "#") + ;; Add instructions (define_expand "add<mode>3" ^ permalink raw reply [flat|nested] 61+ messages in thread
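As an editorial aside, the scalar shape under discussion can be sketched in plain C. This is a simplified stand-in for the 456.hmmer inner-loop chain, not the actual benchmark source: SImode loads feed an add and two signed max operations before the store, and with SImode values allowed in SSE registers each max becomes a candidate for pmaxsd instead of a cmp/cmov pair.

```c
#include <assert.h>

/* Simplified stand-in for the hmmer-style chain: load, add, two signed
   maxes, store.  GIMPLE turns the ternaries into MAX_EXPRs; the patch
   lets the resulting SImode smax chain live in SSE registers.  */
static int
smax (int a, int b)
{
  return a > b ? a : b;
}

static void
chain_kernel (int *dst, const int *a, const int *b, const int *c, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = smax (smax (a[i] + b[i], c[i]), 0);
}
```

On a tree with the patch, compiled with -O2 -msse4.1, the two maxes in this loop are the kind of operations STV can keep in xmm registers together with the loads, add and store.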
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-03 17:26 ` Richard Biener @ 2019-08-04 17:11 ` Uros Bizjak 2019-08-04 17:23 ` Jakub Jelinek ` (2 more replies) 0 siblings, 3 replies; 61+ messages in thread From: Uros Bizjak @ 2019-08-04 17:11 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek [-- Attachment #1: Type: text/plain, Size: 3584 bytes --] On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote: > > On Thu, 1 Aug 2019, Uros Bizjak wrote: > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote: > > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks > >>>> necessary even when going the STV route. The actual regression > >>>> for the testcase could also be solved by turing the smaxsi3 > >>>> back into a compare and jump rather than a conditional move sequence. > >>>> So I wonder how you'd do that given that there's pass_if_after_reload > >>>> after pass_split_after_reload and I'm not sure we can split > >>>> as late as pass_split_before_sched2 (there's also a split _after_ > >>>> sched2 on x86 it seems). > >>>> > >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the > >>>> case STV doesn't end up doing any transform? > >>> > >>> If STV doesn't transform the insn, then a pre-reload splitter splits > >>> the insn back to compare+cmove. > >> > >> OK, that would work. But there's no way to force a jumpy sequence then > >> which we know is faster than compare+cmove because later RTL > >> if-conversion passes happily re-discover the smax (or conditional move) > >> sequence. > >> > >>> However, considering the SImode move > >>> from/to int/xmm register is relatively cheap, the cost function should > >>> be tuned so that STV always converts smaxsi3 pattern. > >> > >> Note that on both Zen and even more so bdverN the int/xmm transition > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov > >> sequence... 
(for the loop in hmmer which is the only one I see > >> any effect of any of my patches). So identifying chains that > >> start/end in memory is important for cost reasons. > > > > Please note that the cost function also considers the cost of move > > from/to xmm. So, the cost of the whole chain would disable the > > transformation. > > > >> So I think the splitting has to happen after the last if-conversion > >> pass (and thus we may need to allocate a scratch register for this > >> purpose?) > > > > I really hope that the underlying issue will be solved by a machine > > dependant pass inserted somewhere after the pre-reload split. This > > way, we can split unconverted smax to the cmove, and this later pass > > would handle jcc and cmove instructions. Until then... yes your > > proposed approach is one of the ways to avoid unwanted if-conversion, > > although sometimes we would like to split to cmove instead. > > So the following makes STV also consider SImode chains, re-using the > DImode chain code. I've kept a simple incomplete smaxsi3 pattern > and also did not alter the {SI,DI}mode chain cost function - it's > quite off for TARGET_64BIT. With this I get the expected conversion > for the testcase derived from hmmer. > > No further testing sofar. > > Is it OK to re-use the DImode chain code this way? I'll clean things > up some more of course. Yes, the approach looks OK to me. It makes chain building mode agnostic, and the chain building can be used for a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. b) SImode x86_32 and x86_64 (this will be mainly used for SImode minmax and surrounding SImode operations) c) DImode x86_64 (also, mainly used for DImode minmax and surrounding DImode operations) > Still need help with the actual patterns for minmax and how the splitters > should look like. Please look at the attached patch. Maybe we can add memory_operand as operand 1 and operand 2 predicate, but let's keep things simple for now. 
Uros.

[-- Attachment #2: minmax-md.diff.txt --]
[-- Type: text/plain, Size: 946 bytes --]

Index: i386.md
===================================================================
--- i386.md	(revision 274008)
+++ i386.md	(working copy)
@@ -17721,6 +17721,27 @@
     std::swap (operands[4], operands[5]);
 })
 
+;; min/max patterns
+
+(define_code_attr smaxmin_rel [(smax "ge") (smin "le")])
+
+(define_insn_and_split "<code><mode>3"
+  [(set (match_operand:SWI48 0 "register_operand")
+	(smaxmin:SWI48 (match_operand:SWI48 1 "register_operand")
+		       (match_operand:SWI48 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_STV && TARGET_SSE4_1
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (reg:CCGC FLAGS_REG)
+	(compare:CCGC (match_dup 1)(match_dup 2)))
+   (set (match_dup 0)
+	(if_then_else:SWI48
+	  (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0))
+	  (match_dup 1)
+	  (match_dup 2)))])
+
 ;; Conditional addition patterns
 
 (define_expand "add<mode>cc"
   [(match_operand:SWI 0 "register_operand")

^ permalink raw reply	[flat|nested] 61+ messages in thread
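For readers following along, the split in the attached pattern is just the scalar identity smax(a, b) = (a >= b) ? a : b (and "le" for smin), expressed as a flags-setting compare followed by an if_then_else. A C sketch of that equivalence (names are illustrative, not GCC internals):

```c
#include <assert.h>

/* Direct form: what the smax RTX denotes.  */
static int
smax_direct (int a, int b)
{
  return a > b ? a : b;
}

/* Split form: the compare sets a "flag", the select reads it -- the
   shape the define_insn_and_split expands to (compare:CCGC followed by
   an if_then_else on the "ge" condition).  */
static int
smax_split (int a, int b)
{
  int ge = a >= b;    /* (set (reg:CCGC FLAGS_REG) (compare ...)) */
  return ge ? a : b;  /* (if_then_else (ge ...) (match_dup 1) (match_dup 2)) */
}
```

Note that both forms return a when a == b, which is why "ge" (rather than strict "gt") is a valid condition for smax.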
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-04 17:11 ` Uros Bizjak @ 2019-08-04 17:23 ` Jakub Jelinek 2019-08-04 17:36 ` Uros Bizjak 2019-08-05 9:13 ` Richard Sandiford 2019-08-05 11:50 ` Richard Biener 2 siblings, 1 reply; 61+ messages in thread From: Jakub Jelinek @ 2019-08-04 17:23 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Biener, gcc-patches On Sun, Aug 04, 2019 at 07:11:01PM +0200, Uros Bizjak wrote: > Yes, the approach looks OK to me. It makes chain building mode > agnostic, and the chain building can be used for > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. > b) SImode x86_32 and x86_64 (this will be mainly used for SImode > minmax and surrounding SImode operations) > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding > DImode operations) > > > Still need help with the actual patterns for minmax and how the splitters > > should look like. > > Please look at the attached patch. Maybe we can add memory_operand as > operand 1 and operand 2 predicate, but let's keep things simple for > now. Shouldn't it be used also for p{min,max}ud rather than just p{min,max}sd? What about p{min,max}{s,u}{b,w,q}? Some of those are already in SSE. If the conversion of the chain fails, couldn't the STV pass split those SImode etc. min/max patterns into code with branches, rather than turn it into cmovs? Jakub ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-04 17:23 ` Jakub Jelinek @ 2019-08-04 17:36 ` Uros Bizjak 2019-08-05 8:47 ` Richard Biener 0 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-04 17:36 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Richard Biener, gcc-patches On Sun, Aug 4, 2019 at 7:23 PM Jakub Jelinek <jakub@redhat.com> wrote: > > On Sun, Aug 04, 2019 at 07:11:01PM +0200, Uros Bizjak wrote: > > Yes, the approach looks OK to me. It makes chain building mode > > agnostic, and the chain building can be used for > > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. > > b) SImode x86_32 and x86_64 (this will be mainly used for SImode > > minmax and surrounding SImode operations) > > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding > > DImode operations) > > > > > Still need help with the actual patterns for minmax and how the splitters > > > should look like. > > > > Please look at the attached patch. Maybe we can add memory_operand as > > operand 1 and operand 2 predicate, but let's keep things simple for > > now. > > Shouldn't it be used also for p{min,max}ud rather than just p{min,max}sd? > What about p{min,max}{s,u}{b,w,q}? Some of those are already in SSE. Sure, unsigned ops will also be added. I just went through Richard's patch and looked for the RTXes that it handles. I'm not sure about HImode and QImode minmax operations. While these can be added, we would need to re-run STV in HImode and QImode - I wonder if it is worth it. > If the conversion of the chain fails, couldn't the STV pass split those > SImode etc. min/max patterns into code with branches, rather than turn it > into cmovs? Since these patterns require SSE4.1, we are sure that we can split back to cmov.
But IMO, the cmov/jcc issue is orthogonal to the minmax conversion and should be handled by some other machine-specific pass that would analyse cmove insertion and eventually split unwanted cmoves back to jcc (based on some as-yet-unknown metrics). Please note that there is no definite proof that it is beneficial to convert cmoves to jcc for all x86 targets. Uros. ^ permalink raw reply [flat|nested] 61+ messages in thread
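Jakub's point about pmaxud versus pmaxsd matters for correctness, not just completeness: signed and unsigned 32-bit max disagree as soon as the sign bit is set, so smax and umax RTXes must map to distinct instructions. A quick illustration in C:

```c
#include <assert.h>
#include <stdint.h>

/* Per-lane semantics of the two SSE4.1 32-bit max instructions.  */
static int32_t
smax32 (int32_t a, int32_t b)   /* signed max, as in pmaxsd */
{
  return a > b ? a : b;
}

static uint32_t
umax32 (uint32_t a, uint32_t b) /* unsigned max, as in pmaxud */
{
  return a > b ? a : b;
}
```

On the same bit pattern 0xffffffff, smax32 sees -1 and loses to 1, while umax32 sees UINT32_MAX and wins.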
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-04 17:36 ` Uros Bizjak @ 2019-08-05 8:47 ` Richard Biener 0 siblings, 0 replies; 61+ messages in thread From: Richard Biener @ 2019-08-05 8:47 UTC (permalink / raw) To: Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches On Sun, 4 Aug 2019, Uros Bizjak wrote: > On Sun, Aug 4, 2019 at 7:23 PM Jakub Jelinek <jakub@redhat.com> wrote: > > > > On Sun, Aug 04, 2019 at 07:11:01PM +0200, Uros Bizjak wrote: > > > Yes, the approach looks OK to me. It makes chain building mode > > > agnostic, and the chain building can be used for > > > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. > > > b) SImode x86_32 and x86_64 (this will be mainly used for SImode > > > minmax and surrounding SImode operations) > > > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding > > > DImode operations) > > > > > > > Still need help with the actual patterns for minmax and how the splitters > > > > should look like. > > > > > > Please look at the attached patch. Maybe we can add memory_operand as > > > operand 1 and operand 2 predicate, but let's keep things simple for > > > now. > > > > Shouldn't it be used also for p{min,max}ud rather than just p{min,max}sd? > > What about p{min,max}{s,u}{b,w,q}? Some of those are already in SSE. > > Sure, unsigned ops will also be added. I just went through the > Richard's patch and looked for RTXes that Richard's patch handles. I'm > not sure about HImode and QImode minmax operations. While these can be > added, we would need to re-run STV in HImode and QImode - I wonder if > it is worth. I think we can always extend later, for now I'm trying to do {SI,DI}mode only, but yes, u{min,max} would be nice to not miss. > > If the conversion of the chain fails, couldn't the STV pass split those > > SImode etc. min/max patterns into code with branches, rather than turn it > > into cmovs? 
> > Since these patterns require SSE4.1, we are sure that we can split > back to cmov. But IMO, cmov/jcc issue is orthogonal to minmax > conversion and should be handled by some other machine-specific pass > that would > analyse cmove insertion and eventually split unwanted cmoves back to > jcc (based on some yet unknown metrics). Please note that there is no > definite proof that it is beneficial to convert cmoves to jcc for all > x86 targets. I guess a tunable plus (micro-)benchmarking could make this decision. But yes, this is largely independent - and if we split to jumps then RTL if-conversion will happily turn it back to cmoves anyway. Richard. ^ permalink raw reply [flat|nested] 61+ messages in thread
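The reason RTL if-conversion "happily" undoes a jumpy split is that the two shapes are semantically interchangeable; only their cost differs (branch-prediction behaviour versus cmov latency and its data dependence on both inputs). A scalar sketch of the two shapes:

```c
#include <assert.h>

/* Branchy max: compiles to cmp + jcc.  Cheap when the branch
   predicts well, costly on mispredicts.  */
static int
max_branchy (int a, int b)
{
  if (a > b)
    return a;
  return b;
}

/* Select-style max: the shape if-conversion typically turns into
   cmp + cmov.  No misprediction cost, but the result carries a data
   dependence on both inputs.  */
static int
max_branchless (int a, int b)
{
  int m = b;
  if (a > b)
    m = a;
  return m;
}
```

Since both always compute the same value, the choice between them is purely a tuning question, which is what the proposed tunable-plus-benchmarking decision would settle.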
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-04 17:11 ` Uros Bizjak 2019-08-04 17:23 ` Jakub Jelinek @ 2019-08-05 9:13 ` Richard Sandiford 2019-08-05 10:08 ` Uros Bizjak 2019-08-05 11:50 ` Richard Biener 2 siblings, 1 reply; 61+ messages in thread From: Richard Sandiford @ 2019-08-05 9:13 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Biener, gcc-patches, Jakub Jelinek Uros Bizjak <ubizjak@gmail.com> writes: > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote: >> >> On Thu, 1 Aug 2019, Uros Bizjak wrote: >> >> > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote: >> > >> >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks >> >>>> necessary even when going the STV route. The actual regression >> >>>> for the testcase could also be solved by turing the smaxsi3 >> >>>> back into a compare and jump rather than a conditional move sequence. >> >>>> So I wonder how you'd do that given that there's pass_if_after_reload >> >>>> after pass_split_after_reload and I'm not sure we can split >> >>>> as late as pass_split_before_sched2 (there's also a split _after_ >> >>>> sched2 on x86 it seems). >> >>>> >> >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the >> >>>> case STV doesn't end up doing any transform? >> >>> >> >>> If STV doesn't transform the insn, then a pre-reload splitter splits >> >>> the insn back to compare+cmove. >> >> >> >> OK, that would work. But there's no way to force a jumpy sequence then >> >> which we know is faster than compare+cmove because later RTL >> >> if-conversion passes happily re-discover the smax (or conditional move) >> >> sequence. >> >> >> >>> However, considering the SImode move >> >>> from/to int/xmm register is relatively cheap, the cost function should >> >>> be tuned so that STV always converts smaxsi3 pattern. 
>> >> >> >> Note that on both Zen and even more so bdverN the int/xmm transition >> >> makes it no longer profitable but a _lot_ slower than the cmp/cmov >> >> sequence... (for the loop in hmmer which is the only one I see >> >> any effect of any of my patches). So identifying chains that >> >> start/end in memory is important for cost reasons. >> > >> > Please note that the cost function also considers the cost of move >> > from/to xmm. So, the cost of the whole chain would disable the >> > transformation. >> > >> >> So I think the splitting has to happen after the last if-conversion >> >> pass (and thus we may need to allocate a scratch register for this >> >> purpose?) >> > >> > I really hope that the underlying issue will be solved by a machine >> > dependant pass inserted somewhere after the pre-reload split. This >> > way, we can split unconverted smax to the cmove, and this later pass >> > would handle jcc and cmove instructions. Until then... yes your >> > proposed approach is one of the ways to avoid unwanted if-conversion, >> > although sometimes we would like to split to cmove instead. >> >> So the following makes STV also consider SImode chains, re-using the >> DImode chain code. I've kept a simple incomplete smaxsi3 pattern >> and also did not alter the {SI,DI}mode chain cost function - it's >> quite off for TARGET_64BIT. With this I get the expected conversion >> for the testcase derived from hmmer. >> >> No further testing sofar. >> >> Is it OK to re-use the DImode chain code this way? I'll clean things >> up some more of course. > > Yes, the approach looks OK to me. It makes chain building mode > agnostic, and the chain building can be used for > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. 
> b) SImode x86_32 and x86_64 (this will be mainly used for SImode > minmax and surrounding SImode operations) > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding > DImode operations) > >> Still need help with the actual patterns for minmax and how the splitters >> should look like. > > Please look at the attached patch. Maybe we can add memory_operand as > operand 1 and operand 2 predicate, but let's keep things simple for > now. > > Uros. > > Index: i386.md > =================================================================== > --- i386.md (revision 274008) > +++ i386.md (working copy) > @@ -17721,6 +17721,27 @@ > std::swap (operands[4], operands[5]); > }) > > +;; min/max patterns > + > +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")]) > + > +(define_insn_and_split "<code><mode>3" > + [(set (match_operand:SWI48 0 "register_operand") > + (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand") > + (match_operand:SWI48 2 "register_operand"))) > + (clobber (reg:CC FLAGS_REG))] > + "TARGET_STV && TARGET_SSE4_1 > + && can_create_pseudo_p ()" > + "#" > + "&& 1" > + [(set (reg:CCGC FLAGS_REG) > + (compare:CCGC (match_dup 1)(match_dup 2))) > + (set (match_dup 0) > + (if_then_else:SWI48 > + (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0)) > + (match_dup 1) > + (match_dup 2)))]) > + The pattern could in theory be matched after the last pre-RA split pass has run, so I think the pattern still needs to have constraints and be matchable even without can_create_pseudo_p. It looks like the split above should work post-RA. A bit pedantic, because the pattern's probably fine in practice... Thanks, Richard > ;; Conditional addition patterns > (define_expand "add<mode>cc" > [(match_operand:SWI 0 "register_operand") ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 9:13 ` Richard Sandiford @ 2019-08-05 10:08 ` Uros Bizjak 2019-08-05 10:12 ` Richard Sandiford 0 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-05 10:08 UTC (permalink / raw) To: Uros Bizjak, Richard Biener, gcc-patches, Jakub Jelinek, Richard Sandiford On Mon, Aug 5, 2019 at 11:13 AM Richard Sandiford <richard.sandiford@arm.com> wrote: > > Uros Bizjak <ubizjak@gmail.com> writes: > > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote: > >> > >> On Thu, 1 Aug 2019, Uros Bizjak wrote: > >> > >> > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote: > >> > > >> >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks > >> >>>> necessary even when going the STV route. The actual regression > >> >>>> for the testcase could also be solved by turing the smaxsi3 > >> >>>> back into a compare and jump rather than a conditional move sequence. > >> >>>> So I wonder how you'd do that given that there's pass_if_after_reload > >> >>>> after pass_split_after_reload and I'm not sure we can split > >> >>>> as late as pass_split_before_sched2 (there's also a split _after_ > >> >>>> sched2 on x86 it seems). > >> >>>> > >> >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the > >> >>>> case STV doesn't end up doing any transform? > >> >>> > >> >>> If STV doesn't transform the insn, then a pre-reload splitter splits > >> >>> the insn back to compare+cmove. > >> >> > >> >> OK, that would work. But there's no way to force a jumpy sequence then > >> >> which we know is faster than compare+cmove because later RTL > >> >> if-conversion passes happily re-discover the smax (or conditional move) > >> >> sequence. > >> >> > >> >>> However, considering the SImode move > >> >>> from/to int/xmm register is relatively cheap, the cost function should > >> >>> be tuned so that STV always converts smaxsi3 pattern. 
> >> >> > >> >> Note that on both Zen and even more so bdverN the int/xmm transition > >> >> makes it no longer profitable but a _lot_ slower than the cmp/cmov > >> >> sequence... (for the loop in hmmer which is the only one I see > >> >> any effect of any of my patches). So identifying chains that > >> >> start/end in memory is important for cost reasons. > >> > > >> > Please note that the cost function also considers the cost of move > >> > from/to xmm. So, the cost of the whole chain would disable the > >> > transformation. > >> > > >> >> So I think the splitting has to happen after the last if-conversion > >> >> pass (and thus we may need to allocate a scratch register for this > >> >> purpose?) > >> > > >> > I really hope that the underlying issue will be solved by a machine > >> > dependant pass inserted somewhere after the pre-reload split. This > >> > way, we can split unconverted smax to the cmove, and this later pass > >> > would handle jcc and cmove instructions. Until then... yes your > >> > proposed approach is one of the ways to avoid unwanted if-conversion, > >> > although sometimes we would like to split to cmove instead. > >> > >> So the following makes STV also consider SImode chains, re-using the > >> DImode chain code. I've kept a simple incomplete smaxsi3 pattern > >> and also did not alter the {SI,DI}mode chain cost function - it's > >> quite off for TARGET_64BIT. With this I get the expected conversion > >> for the testcase derived from hmmer. > >> > >> No further testing sofar. > >> > >> Is it OK to re-use the DImode chain code this way? I'll clean things > >> up some more of course. > > > > Yes, the approach looks OK to me. It makes chain building mode > > agnostic, and the chain building can be used for > > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. 
> > b) SImode x86_32 and x86_64 (this will be mainly used for SImode > > minmax and surrounding SImode operations) > > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding > > DImode operations) > > > >> Still need help with the actual patterns for minmax and how the splitters > >> should look like. > > > > Please look at the attached patch. Maybe we can add memory_operand as > > operand 1 and operand 2 predicate, but let's keep things simple for > > now. > > > > Uros. > > > > Index: i386.md > > =================================================================== > > --- i386.md (revision 274008) > > +++ i386.md (working copy) > > @@ -17721,6 +17721,27 @@ > > std::swap (operands[4], operands[5]); > > }) > > > > +;; min/max patterns > > + > > +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")]) > > + > > +(define_insn_and_split "<code><mode>3" > > + [(set (match_operand:SWI48 0 "register_operand") > > + (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand") > > + (match_operand:SWI48 2 "register_operand"))) > > + (clobber (reg:CC FLAGS_REG))] > > + "TARGET_STV && TARGET_SSE4_1 > > + && can_create_pseudo_p ()" > > + "#" > > + "&& 1" > > + [(set (reg:CCGC FLAGS_REG) > > + (compare:CCGC (match_dup 1)(match_dup 2))) > > + (set (match_dup 0) > > + (if_then_else:SWI48 > > + (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0)) > > + (match_dup 1) > > + (match_dup 2)))]) > > + > > The pattern could in theory be matched after the last pre-RA split pass > has run, so I think the pattern still needs to have constraints and be > matchable even without can_create_pseudo_p. It looks like the split > above should work post-RA. > > A bit pedantic, because the pattern's probably fine in practice... Currently, all unmatched STV patterns split before reload, and there were no problems. If the pattern matches after last pre-RA split, then the post-reload splitter will fail, since can_create_pseudo_p also applies to the part that splits the insn. 
In any case, thanks for the heads-up, hopefully we didn't assume something that doesn't hold. Thanks, Uros. > Thanks, > Richard > > > ;; Conditional addition patterns > > (define_expand "add<mode>cc" > > [(match_operand:SWI 0 "register_operand") ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 10:08 ` Uros Bizjak @ 2019-08-05 10:12 ` Richard Sandiford 2019-08-05 10:24 ` Uros Bizjak 0 siblings, 1 reply; 61+ messages in thread From: Richard Sandiford @ 2019-08-05 10:12 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Biener, gcc-patches, Jakub Jelinek Uros Bizjak <ubizjak@gmail.com> writes: > On Mon, Aug 5, 2019 at 11:13 AM Richard Sandiford > <richard.sandiford@arm.com> wrote: >> >> Uros Bizjak <ubizjak@gmail.com> writes: >> > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote: >> >> >> >> On Thu, 1 Aug 2019, Uros Bizjak wrote: >> >> >> >> > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote: >> >> > >> >> >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks >> >> >>>> necessary even when going the STV route. The actual regression >> >> >>>> for the testcase could also be solved by turing the smaxsi3 >> >> >>>> back into a compare and jump rather than a conditional move sequence. >> >> >>>> So I wonder how you'd do that given that there's pass_if_after_reload >> >> >>>> after pass_split_after_reload and I'm not sure we can split >> >> >>>> as late as pass_split_before_sched2 (there's also a split _after_ >> >> >>>> sched2 on x86 it seems). >> >> >>>> >> >> >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the >> >> >>>> case STV doesn't end up doing any transform? >> >> >>> >> >> >>> If STV doesn't transform the insn, then a pre-reload splitter splits >> >> >>> the insn back to compare+cmove. >> >> >> >> >> >> OK, that would work. But there's no way to force a jumpy sequence then >> >> >> which we know is faster than compare+cmove because later RTL >> >> >> if-conversion passes happily re-discover the smax (or conditional move) >> >> >> sequence. 
>> >> >> >> >> >>> However, considering the SImode move >> >> >>> from/to int/xmm register is relatively cheap, the cost function should >> >> >>> be tuned so that STV always converts smaxsi3 pattern. >> >> >> >> >> >> Note that on both Zen and even more so bdverN the int/xmm transition >> >> >> makes it no longer profitable but a _lot_ slower than the cmp/cmov >> >> >> sequence... (for the loop in hmmer which is the only one I see >> >> >> any effect of any of my patches). So identifying chains that >> >> >> start/end in memory is important for cost reasons. >> >> > >> >> > Please note that the cost function also considers the cost of move >> >> > from/to xmm. So, the cost of the whole chain would disable the >> >> > transformation. >> >> > >> >> >> So I think the splitting has to happen after the last if-conversion >> >> >> pass (and thus we may need to allocate a scratch register for this >> >> >> purpose?) >> >> > >> >> > I really hope that the underlying issue will be solved by a machine >> >> > dependant pass inserted somewhere after the pre-reload split. This >> >> > way, we can split unconverted smax to the cmove, and this later pass >> >> > would handle jcc and cmove instructions. Until then... yes your >> >> > proposed approach is one of the ways to avoid unwanted if-conversion, >> >> > although sometimes we would like to split to cmove instead. >> >> >> >> So the following makes STV also consider SImode chains, re-using the >> >> DImode chain code. I've kept a simple incomplete smaxsi3 pattern >> >> and also did not alter the {SI,DI}mode chain cost function - it's >> >> quite off for TARGET_64BIT. With this I get the expected conversion >> >> for the testcase derived from hmmer. >> >> >> >> No further testing sofar. >> >> >> >> Is it OK to re-use the DImode chain code this way? I'll clean things >> >> up some more of course. >> > >> > Yes, the approach looks OK to me. 
It makes chain building mode >> > agnostic, and the chain building can be used for >> > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. >> > b) SImode x86_32 and x86_64 (this will be mainly used for SImode >> > minmax and surrounding SImode operations) >> > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding >> > DImode operations) >> > >> >> Still need help with the actual patterns for minmax and how the splitters >> >> should look like. >> > >> > Please look at the attached patch. Maybe we can add memory_operand as >> > operand 1 and operand 2 predicate, but let's keep things simple for >> > now. >> > >> > Uros. >> > >> > Index: i386.md >> > =================================================================== >> > --- i386.md (revision 274008) >> > +++ i386.md (working copy) >> > @@ -17721,6 +17721,27 @@ >> > std::swap (operands[4], operands[5]); >> > }) >> > >> > +;; min/max patterns >> > + >> > +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")]) >> > + >> > +(define_insn_and_split "<code><mode>3" >> > + [(set (match_operand:SWI48 0 "register_operand") >> > + (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand") >> > + (match_operand:SWI48 2 "register_operand"))) >> > + (clobber (reg:CC FLAGS_REG))] >> > + "TARGET_STV && TARGET_SSE4_1 >> > + && can_create_pseudo_p ()" >> > + "#" >> > + "&& 1" >> > + [(set (reg:CCGC FLAGS_REG) >> > + (compare:CCGC (match_dup 1)(match_dup 2))) >> > + (set (match_dup 0) >> > + (if_then_else:SWI48 >> > + (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0)) >> > + (match_dup 1) >> > + (match_dup 2)))]) >> > + >> >> The pattern could in theory be matched after the last pre-RA split pass >> has run, so I think the pattern still needs to have constraints and be >> matchable even without can_create_pseudo_p. It looks like the split >> above should work post-RA. >> >> A bit pedantic, because the pattern's probably fine in practice... 
>
> Currently, all unmatched STV patterns split before reload, and there
> were no problems. If the pattern matches after last pre-RA split, then
> the post-reload splitter will fail, since can_create_pseudo_p also
> applies to the part that splits the insn.

But what I meant was: you should be able to remove the
can_create_pseudo_p () and add constraints.  (You'd have to remove
can_create_pseudo_p () with constraints anyway, since the insn
wouldn't match after RA otherwise.)

Thanks,
Richard

> In any case, thanks for the heads-up, hopefully we didn't assume
> something that doesn't hold.

^ permalink raw reply	[flat|nested] 61+ messages in thread
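The operation the thread keeps coming back to is SSE4.1's pmaxsd, the lane-wise signed 32-bit maximum that STV substitutes for a scalar smaxsi3. As a point of reference, here is a portable C model of its semantics (a hypothetical helper for illustration, not code from the patch or from GCC):

```c
#include <stdint.h>
#include <assert.h>

/* Portable model of the lane-wise signed 32-bit maximum that SSE4.1
   pmaxsd computes on an xmm register: dst[i] = max (a[i], b[i]) for
   each of the four 32-bit lanes.  An STV-converted smaxsi3 effectively
   uses lane 0 of such an operation.  */
static void
pmaxsd_model (int32_t dst[4], const int32_t a[4], const int32_t b[4])
{
  for (int i = 0; i < 4; i++)
    dst[i] = a[i] >= b[i] ? a[i] : b[i];
}
```

Because the scalar value lives in lane 0 of an xmm register after conversion, the remaining three lanes carry garbage that the max operation preserves harmlessly.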
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 10:12 ` Richard Sandiford @ 2019-08-05 10:24 ` Uros Bizjak 2019-08-05 10:39 ` Richard Sandiford 0 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-05 10:24 UTC (permalink / raw) To: Uros Bizjak, Richard Biener, gcc-patches, Jakub Jelinek, Richard Sandiford On Mon, Aug 5, 2019 at 12:12 PM Richard Sandiford <richard.sandiford@arm.com> wrote: > > Uros Bizjak <ubizjak@gmail.com> writes: > > On Mon, Aug 5, 2019 at 11:13 AM Richard Sandiford > > <richard.sandiford@arm.com> wrote: > >> > >> Uros Bizjak <ubizjak@gmail.com> writes: > >> > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote: > >> >> > >> >> On Thu, 1 Aug 2019, Uros Bizjak wrote: > >> >> > >> >> > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote: > >> >> > > >> >> >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks > >> >> >>>> necessary even when going the STV route. The actual regression > >> >> >>>> for the testcase could also be solved by turing the smaxsi3 > >> >> >>>> back into a compare and jump rather than a conditional move sequence. > >> >> >>>> So I wonder how you'd do that given that there's pass_if_after_reload > >> >> >>>> after pass_split_after_reload and I'm not sure we can split > >> >> >>>> as late as pass_split_before_sched2 (there's also a split _after_ > >> >> >>>> sched2 on x86 it seems). > >> >> >>>> > >> >> >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the > >> >> >>>> case STV doesn't end up doing any transform? > >> >> >>> > >> >> >>> If STV doesn't transform the insn, then a pre-reload splitter splits > >> >> >>> the insn back to compare+cmove. > >> >> >> > >> >> >> OK, that would work. 
But there's no way to force a jumpy sequence then > >> >> >> which we know is faster than compare+cmove because later RTL > >> >> >> if-conversion passes happily re-discover the smax (or conditional move) > >> >> >> sequence. > >> >> >> > >> >> >>> However, considering the SImode move > >> >> >>> from/to int/xmm register is relatively cheap, the cost function should > >> >> >>> be tuned so that STV always converts smaxsi3 pattern. > >> >> >> > >> >> >> Note that on both Zen and even more so bdverN the int/xmm transition > >> >> >> makes it no longer profitable but a _lot_ slower than the cmp/cmov > >> >> >> sequence... (for the loop in hmmer which is the only one I see > >> >> >> any effect of any of my patches). So identifying chains that > >> >> >> start/end in memory is important for cost reasons. > >> >> > > >> >> > Please note that the cost function also considers the cost of move > >> >> > from/to xmm. So, the cost of the whole chain would disable the > >> >> > transformation. > >> >> > > >> >> >> So I think the splitting has to happen after the last if-conversion > >> >> >> pass (and thus we may need to allocate a scratch register for this > >> >> >> purpose?) > >> >> > > >> >> > I really hope that the underlying issue will be solved by a machine > >> >> > dependant pass inserted somewhere after the pre-reload split. This > >> >> > way, we can split unconverted smax to the cmove, and this later pass > >> >> > would handle jcc and cmove instructions. Until then... yes your > >> >> > proposed approach is one of the ways to avoid unwanted if-conversion, > >> >> > although sometimes we would like to split to cmove instead. > >> >> > >> >> So the following makes STV also consider SImode chains, re-using the > >> >> DImode chain code. I've kept a simple incomplete smaxsi3 pattern > >> >> and also did not alter the {SI,DI}mode chain cost function - it's > >> >> quite off for TARGET_64BIT. 
With this I get the expected conversion > >> >> for the testcase derived from hmmer. > >> >> > >> >> No further testing sofar. > >> >> > >> >> Is it OK to re-use the DImode chain code this way? I'll clean things > >> >> up some more of course. > >> > > >> > Yes, the approach looks OK to me. It makes chain building mode > >> > agnostic, and the chain building can be used for > >> > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. > >> > b) SImode x86_32 and x86_64 (this will be mainly used for SImode > >> > minmax and surrounding SImode operations) > >> > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding > >> > DImode operations) > >> > > >> >> Still need help with the actual patterns for minmax and how the splitters > >> >> should look like. > >> > > >> > Please look at the attached patch. Maybe we can add memory_operand as > >> > operand 1 and operand 2 predicate, but let's keep things simple for > >> > now. > >> > > >> > Uros. > >> > > >> > Index: i386.md > >> > =================================================================== > >> > --- i386.md (revision 274008) > >> > +++ i386.md (working copy) > >> > @@ -17721,6 +17721,27 @@ > >> > std::swap (operands[4], operands[5]); > >> > }) > >> > > >> > +;; min/max patterns > >> > + > >> > +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")]) > >> > + > >> > +(define_insn_and_split "<code><mode>3" > >> > + [(set (match_operand:SWI48 0 "register_operand") > >> > + (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand") > >> > + (match_operand:SWI48 2 "register_operand"))) > >> > + (clobber (reg:CC FLAGS_REG))] > >> > + "TARGET_STV && TARGET_SSE4_1 > >> > + && can_create_pseudo_p ()" > >> > + "#" > >> > + "&& 1" > >> > + [(set (reg:CCGC FLAGS_REG) > >> > + (compare:CCGC (match_dup 1)(match_dup 2))) > >> > + (set (match_dup 0) > >> > + (if_then_else:SWI48 > >> > + (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0)) > >> > + (match_dup 1) > >> > + (match_dup 2)))]) > >> > 
+ > >> > >> The pattern could in theory be matched after the last pre-RA split pass > >> has run, so I think the pattern still needs to have constraints and be > >> matchable even without can_create_pseudo_p. It looks like the split > >> above should work post-RA. > >> > >> A bit pedantic, because the pattern's probably fine in practice... > > > > Currently, all unmatched STV patterns split before reload, and there > > were no problems. If the pattern matches after last pre-RA split, then > > the post-reload splitter will fail, since can_create_pseudo_p also > > applies to the part that splits the insn. > > But what I meant was: you should be able to remove the > can_create_pseudo_p () and add constraints. (You'd have to remove > can_create_pseudo_p () with constraints anyway, since the insn > wouldn't match after RA otherwise.) I was under impression that it is better to split pseudo->pseudo, so reload has some more freedom on what register to choose, especially with matched and earlyclobbered DImode regs in x86_32 DImode patterns. There were some complications with andn pattern (that needed earlyclobber on a register to avoid clobbering registers in a memory address), and it was necessary to clobber the whole DImode register pair, wasting a SImode register. We can avoid all these complications by splitting before the RA, where also a pseudo can be allocated. Uros. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 10:24 ` Uros Bizjak @ 2019-08-05 10:39 ` Richard Sandiford 0 siblings, 0 replies; 61+ messages in thread From: Richard Sandiford @ 2019-08-05 10:39 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Biener, gcc-patches, Jakub Jelinek Uros Bizjak <ubizjak@gmail.com> writes: > On Mon, Aug 5, 2019 at 12:12 PM Richard Sandiford > <richard.sandiford@arm.com> wrote: >> >> Uros Bizjak <ubizjak@gmail.com> writes: >> > On Mon, Aug 5, 2019 at 11:13 AM Richard Sandiford >> > <richard.sandiford@arm.com> wrote: >> >> >> >> Uros Bizjak <ubizjak@gmail.com> writes: >> >> > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote: >> >> >> >> >> >> On Thu, 1 Aug 2019, Uros Bizjak wrote: >> >> >> >> >> >> > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote: >> >> >> > >> >> >> >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks >> >> >> >>>> necessary even when going the STV route. The actual regression >> >> >> >>>> for the testcase could also be solved by turing the smaxsi3 >> >> >> >>>> back into a compare and jump rather than a conditional move sequence. >> >> >> >>>> So I wonder how you'd do that given that there's pass_if_after_reload >> >> >> >>>> after pass_split_after_reload and I'm not sure we can split >> >> >> >>>> as late as pass_split_before_sched2 (there's also a split _after_ >> >> >> >>>> sched2 on x86 it seems). >> >> >> >>>> >> >> >> >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the >> >> >> >>>> case STV doesn't end up doing any transform? >> >> >> >>> >> >> >> >>> If STV doesn't transform the insn, then a pre-reload splitter splits >> >> >> >>> the insn back to compare+cmove. >> >> >> >> >> >> >> >> OK, that would work. 
But there's no way to force a jumpy sequence then >> >> >> >> which we know is faster than compare+cmove because later RTL >> >> >> >> if-conversion passes happily re-discover the smax (or conditional move) >> >> >> >> sequence. >> >> >> >> >> >> >> >>> However, considering the SImode move >> >> >> >>> from/to int/xmm register is relatively cheap, the cost function should >> >> >> >>> be tuned so that STV always converts smaxsi3 pattern. >> >> >> >> >> >> >> >> Note that on both Zen and even more so bdverN the int/xmm transition >> >> >> >> makes it no longer profitable but a _lot_ slower than the cmp/cmov >> >> >> >> sequence... (for the loop in hmmer which is the only one I see >> >> >> >> any effect of any of my patches). So identifying chains that >> >> >> >> start/end in memory is important for cost reasons. >> >> >> > >> >> >> > Please note that the cost function also considers the cost of move >> >> >> > from/to xmm. So, the cost of the whole chain would disable the >> >> >> > transformation. >> >> >> > >> >> >> >> So I think the splitting has to happen after the last if-conversion >> >> >> >> pass (and thus we may need to allocate a scratch register for this >> >> >> >> purpose?) >> >> >> > >> >> >> > I really hope that the underlying issue will be solved by a machine >> >> >> > dependant pass inserted somewhere after the pre-reload split. This >> >> >> > way, we can split unconverted smax to the cmove, and this later pass >> >> >> > would handle jcc and cmove instructions. Until then... yes your >> >> >> > proposed approach is one of the ways to avoid unwanted if-conversion, >> >> >> > although sometimes we would like to split to cmove instead. >> >> >> >> >> >> So the following makes STV also consider SImode chains, re-using the >> >> >> DImode chain code. I've kept a simple incomplete smaxsi3 pattern >> >> >> and also did not alter the {SI,DI}mode chain cost function - it's >> >> >> quite off for TARGET_64BIT. 
With this I get the expected conversion >> >> >> for the testcase derived from hmmer. >> >> >> >> >> >> No further testing sofar. >> >> >> >> >> >> Is it OK to re-use the DImode chain code this way? I'll clean things >> >> >> up some more of course. >> >> > >> >> > Yes, the approach looks OK to me. It makes chain building mode >> >> > agnostic, and the chain building can be used for >> >> > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. >> >> > b) SImode x86_32 and x86_64 (this will be mainly used for SImode >> >> > minmax and surrounding SImode operations) >> >> > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding >> >> > DImode operations) >> >> > >> >> >> Still need help with the actual patterns for minmax and how the splitters >> >> >> should look like. >> >> > >> >> > Please look at the attached patch. Maybe we can add memory_operand as >> >> > operand 1 and operand 2 predicate, but let's keep things simple for >> >> > now. >> >> > >> >> > Uros. 
>> >> > >> >> > Index: i386.md >> >> > =================================================================== >> >> > --- i386.md (revision 274008) >> >> > +++ i386.md (working copy) >> >> > @@ -17721,6 +17721,27 @@ >> >> > std::swap (operands[4], operands[5]); >> >> > }) >> >> > >> >> > +;; min/max patterns >> >> > + >> >> > +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")]) >> >> > + >> >> > +(define_insn_and_split "<code><mode>3" >> >> > + [(set (match_operand:SWI48 0 "register_operand") >> >> > + (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand") >> >> > + (match_operand:SWI48 2 "register_operand"))) >> >> > + (clobber (reg:CC FLAGS_REG))] >> >> > + "TARGET_STV && TARGET_SSE4_1 >> >> > + && can_create_pseudo_p ()" >> >> > + "#" >> >> > + "&& 1" >> >> > + [(set (reg:CCGC FLAGS_REG) >> >> > + (compare:CCGC (match_dup 1)(match_dup 2))) >> >> > + (set (match_dup 0) >> >> > + (if_then_else:SWI48 >> >> > + (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0)) >> >> > + (match_dup 1) >> >> > + (match_dup 2)))]) >> >> > + >> >> >> >> The pattern could in theory be matched after the last pre-RA split pass >> >> has run, so I think the pattern still needs to have constraints and be >> >> matchable even without can_create_pseudo_p. It looks like the split >> >> above should work post-RA. >> >> >> >> A bit pedantic, because the pattern's probably fine in practice... >> > >> > Currently, all unmatched STV patterns split before reload, and there >> > were no problems. If the pattern matches after last pre-RA split, then >> > the post-reload splitter will fail, since can_create_pseudo_p also >> > applies to the part that splits the insn. >> >> But what I meant was: you should be able to remove the >> can_create_pseudo_p () and add constraints. (You'd have to remove >> can_create_pseudo_p () with constraints anyway, since the insn >> wouldn't match after RA otherwise.) 
>
> I was under impression that it is better to split pseudo->pseudo, so
> reload has some more freedom on what register to choose, especially
> with matched and earlyclobbered DImode regs in x86_32 DImode patterns.
> There were some complications with andn pattern (that needed
> earlyclobber on a register to avoid clobbering registers in a memory
> address), and it was necessary to clobber the whole DImode register
> pair, wasting a SImode register. We can avoid all these complications
> by splitting before the RA, where also a pseudo can be allocated.

Yeah, splitting before RA is fine.  All I meant was that:

(define_insn_and_split "<code><mode>3"
  [(set (match_operand:SWI48 0 "register_operand" "=r")
	(smaxmin:SWI48 (match_operand:SWI48 1 "register_operand" "r")
		       (match_operand:SWI48 2 "register_operand" "r")))
   (clobber (reg:CC FLAGS_REG))]
  "TARGET_STV && TARGET_SSE4_1"
  "#"
  "&& 1"
  [(set (reg:CCGC FLAGS_REG)
	(compare:CCGC (match_dup 1) (match_dup 2)))
   (set (match_dup 0)
	(if_then_else:SWI48
	  (<smaxmin_rel> (reg:CCGC FLAGS_REG) (const_int 0))
	  (match_dup 1)
	  (match_dup 2)))])

seems like it should be correct too and avoids the theoretical problem
I mentioned.  If the instruction does survive until RA then the split
should work correctly on the reloaded instruction.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 61+ messages in thread
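The split in the pattern above expands an unconverted smax into a flags-setting compare followed by a "ge"-predicated conditional move. A minimal C sketch of the same semantics (a hypothetical helper name, not from the patch) — at -O2 on x86-64, GCC typically compiles this shape to cmp + cmovge:

```c
#include <assert.h>

/* What the splitter emits, expressed in C: compare op1 against op2
   (setting the flags), then conditionally move so that
   dst = (op1 >= op2) ? op1 : op2.  */
static int
smax_via_cmove (int op1, int op2)
{
  int dst = op2;	/* start with op2 in the destination */
  if (op1 >= op2)	/* compare:CCGC (op1, op2) */
    dst = op1;		/* cmovge: overwrite when op1 >= op2 */
  return dst;
}
```

Both operands are read unconditionally, which is why the split works equally well before or after register allocation: no control flow is introduced, only a data dependence on the flags.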
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-04 17:11 ` Uros Bizjak 2019-08-04 17:23 ` Jakub Jelinek 2019-08-05 9:13 ` Richard Sandiford @ 2019-08-05 11:50 ` Richard Biener 2019-08-05 11:59 ` Uros Bizjak ` (2 more replies) 2 siblings, 3 replies; 61+ messages in thread From: Richard Biener @ 2019-08-05 11:50 UTC (permalink / raw) To: Uros Bizjak; +Cc: gcc-patches, Jakub Jelinek On Sun, 4 Aug 2019, Uros Bizjak wrote: > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote: > > > > On Thu, 1 Aug 2019, Uros Bizjak wrote: > > > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote: > > > > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks > > >>>> necessary even when going the STV route. The actual regression > > >>>> for the testcase could also be solved by turing the smaxsi3 > > >>>> back into a compare and jump rather than a conditional move sequence. > > >>>> So I wonder how you'd do that given that there's pass_if_after_reload > > >>>> after pass_split_after_reload and I'm not sure we can split > > >>>> as late as pass_split_before_sched2 (there's also a split _after_ > > >>>> sched2 on x86 it seems). > > >>>> > > >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the > > >>>> case STV doesn't end up doing any transform? > > >>> > > >>> If STV doesn't transform the insn, then a pre-reload splitter splits > > >>> the insn back to compare+cmove. > > >> > > >> OK, that would work. But there's no way to force a jumpy sequence then > > >> which we know is faster than compare+cmove because later RTL > > >> if-conversion passes happily re-discover the smax (or conditional move) > > >> sequence. > > >> > > >>> However, considering the SImode move > > >>> from/to int/xmm register is relatively cheap, the cost function should > > >>> be tuned so that STV always converts smaxsi3 pattern. 
> > >> > > >> Note that on both Zen and even more so bdverN the int/xmm transition > > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov > > >> sequence... (for the loop in hmmer which is the only one I see > > >> any effect of any of my patches). So identifying chains that > > >> start/end in memory is important for cost reasons. > > > > > > Please note that the cost function also considers the cost of move > > > from/to xmm. So, the cost of the whole chain would disable the > > > transformation. > > > > > >> So I think the splitting has to happen after the last if-conversion > > >> pass (and thus we may need to allocate a scratch register for this > > >> purpose?) > > > > > > I really hope that the underlying issue will be solved by a machine > > > dependant pass inserted somewhere after the pre-reload split. This > > > way, we can split unconverted smax to the cmove, and this later pass > > > would handle jcc and cmove instructions. Until then... yes your > > > proposed approach is one of the ways to avoid unwanted if-conversion, > > > although sometimes we would like to split to cmove instead. > > > > So the following makes STV also consider SImode chains, re-using the > > DImode chain code. I've kept a simple incomplete smaxsi3 pattern > > and also did not alter the {SI,DI}mode chain cost function - it's > > quite off for TARGET_64BIT. With this I get the expected conversion > > for the testcase derived from hmmer. > > > > No further testing sofar. > > > > Is it OK to re-use the DImode chain code this way? I'll clean things > > up some more of course. > > Yes, the approach looks OK to me. It makes chain building mode > agnostic, and the chain building can be used for > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. 
> b) SImode x86_32 and x86_64 (this will be mainly used for SImode
> minmax and surrounding SImode operations)
> c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
> DImode operations)
>
> > Still need help with the actual patterns for minmax and how the splitters
> > should look like.
>
> Please look at the attached patch. Maybe we can add memory_operand as
> operand 1 and operand 2 predicate, but let's keep things simple for
> now.

Thanks.  The attached patch makes the patch cleaner and it survives
"some" barebone testing.  It also touches the cost function to avoid
being overly trigger-happy.  I've also ended up using ix86_cost->sse_op
instead of COSTS_N_INSNS-based magic.  In particular we estimated GPR
reg-reg move as COSTS_N_INSNS (2) while move costs shouldn't be wrapped
in COSTS_N_INSNS.  IMHO we should probably disregard any reg-reg moves
for costing pre-RA.  At least with the current code every reg-reg move
biases in favor of SSE...

And we're simply adding move and non-move costs in 'gain', somewhat
mixing apples and oranges?  We could separate those and require both
to be a net positive win?

Still using -mtune=bdverN exposes that some cost tables have xmm and
gpr costs as apples and oranges... (so it never triggers for Bulldozer)

I now run into

/space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1: error: unrecognizable insn:
(insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
        (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0)
            (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0)))
     -1 (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
during RTL pass: stv

where even with -mavx2 we do not have s{min,max}v2di3.  We do have an
expander here but it seems only AVX512F has the DImode min/max ops.
I have adjusted dimode_scalar_to_vector_candidate_p accordingly.
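The gain bookkeeping being debated here reduces to a simple shape: per-insn scalar-minus-SSE cost deltas summed over the chain, with the integer/vector unit crossings charged against the total. A toy sketch of that idea (hypothetical helper and cost numbers, not the real ix86_cost tables or the compute_convert_gain code):

```c
#include <assert.h>

/* Toy model of STV's chain-gain decision: each insn contributes the
   difference between its scalar cost and its SSE cost; moving chain
   inputs/outputs between the integer and vector units is then charged
   per crossing.  A chain is only worth converting when the net gain
   is positive -- which is why chains that start/end in memory (zero
   crossings) convert so much more readily than ones that cross into
   GPRs, especially on CPUs where crossings are expensive.  */
struct insn_gain
{
  int scalar_cost;	/* cost of the insn kept on the integer side */
  int sse_cost;		/* cost of the converted SSE equivalent */
};

static int
chain_gain (const struct insn_gain *insns, int n,
	    int n_unit_crossings, int crossing_cost)
{
  int gain = 0;
  for (int i = 0; i < n; i++)
    gain += insns[i].scalar_cost - insns[i].sse_cost;
  return gain - n_unit_crossings * crossing_cost;
}
```

With the same chain, cheap crossings yield a positive gain and expensive ones (the Zen/bdverN situation described above) flip it negative.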
I'm considering to rename the dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs} functions to drop the dimode_ prefix - is that OK or do you prefer some other prefix? So - bootstrap with --with-arch=skylake in progress. It detects quite a few chains (unsurprisingly) so I guess we need to address compile-time issues in the pass before enabling this enhancement (maybe as followup?). Further comments on the actual patch welcome, I consider it "finished" if testing reveals no issues. ChangeLog still needs to be written and testcases to be added. Thanks, Richard. Index: gcc/config/i386/i386-features.c =================================================================== --- gcc/config/i386/i386-features.c (revision 274111) +++ gcc/config/i386/i386-features.c (working copy) @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0; /* Initialize new chain. */ -scalar_chain::scalar_chain () +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) { + smode = smode_; + vmode = vmode_; + chain_id = ++max_id; if (dump_file) @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate && !HARD_REGISTER_P (SET_DEST (def_set))) bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); + /* ??? The following is quadratic since analyze_register_chain + iterates over all refs to look for dual-mode regs. Instead this + should be done separately for all regs mentioned in the chain once. */ df_ref ref; df_ref def; for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) @@ -473,9 +479,11 @@ dimode_scalar_chain::vector_const_cost ( { gcc_assert (CONST_INT_P (exp)); - if (standard_sse_constant_p (exp, V2DImode)) - return COSTS_N_INSNS (1); - return ix86_cost->sse_load[1]; + if (standard_sse_constant_p (exp, vmode)) + return ix86_cost->sse_op; + /* We have separate costs for SImode and DImode, use SImode costs + for smaller modes. */ + return ix86_cost->sse_load[smode == DImode ? 1 : 0]; } /* Compute a gain for chain conversion. 
*/ @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai if (dump_file) fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); + /* SSE costs distinguish between SImode and DImode loads/stores, for + int costs factor in the number of GPRs involved. When supporting + smaller modes than SImode the int load/store costs need to be + adjusted as well. */ + unsigned sse_cost_idx = smode == DImode ? 1 : 0; + unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1; + EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) { rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; rtx def_set = single_set (insn); rtx src = SET_SRC (def_set); rtx dst = SET_DEST (def_set); + int igain = 0; if (REG_P (src) && REG_P (dst)) - gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move; + igain += 2 * m - ix86_cost->xmm_move; else if (REG_P (src) && MEM_P (dst)) - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; + igain + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; else if (MEM_P (src) && REG_P (dst)) - gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; else if (GET_CODE (src) == ASHIFT || GET_CODE (src) == ASHIFTRT || GET_CODE (src) == LSHIFTRT) { if (CONST_INT_P (XEXP (src, 0))) - gain -= vector_const_cost (XEXP (src, 0)); - gain += ix86_cost->shift_const; + igain -= vector_const_cost (XEXP (src, 0)); + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; if (INTVAL (XEXP (src, 1)) >= 32) - gain -= COSTS_N_INSNS (1); + igain -= COSTS_N_INSNS (1); } else if (GET_CODE (src) == PLUS || GET_CODE (src) == MINUS @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai || GET_CODE (src) == XOR || GET_CODE (src) == AND) { - gain += ix86_cost->add; + igain += m * ix86_cost->add - ix86_cost->sse_op; /* Additional gain for andnot for targets without BMI. 
*/ if (GET_CODE (XEXP (src, 0)) == NOT && !TARGET_BMI) - gain += 2 * ix86_cost->add; + igain += m * ix86_cost->add; if (CONST_INT_P (XEXP (src, 0))) - gain -= vector_const_cost (XEXP (src, 0)); + igain -= vector_const_cost (XEXP (src, 0)); if (CONST_INT_P (XEXP (src, 1))) - gain -= vector_const_cost (XEXP (src, 1)); + igain -= vector_const_cost (XEXP (src, 1)); } else if (GET_CODE (src) == NEG || GET_CODE (src) == NOT) - gain += ix86_cost->add - COSTS_N_INSNS (1); + igain += m * ix86_cost->add - ix86_cost->sse_op; + else if (GET_CODE (src) == SMAX + || GET_CODE (src) == SMIN + || GET_CODE (src) == UMAX + || GET_CODE (src) == UMIN) + { + /* We do not have any conditional move cost, estimate it as a + reg-reg move. Comparisons are costed as adds. */ + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); + /* Integer SSE ops are all costed the same. */ + igain -= ix86_cost->sse_op; + } else if (GET_CODE (src) == COMPARE) { /* Assume comparison cost is the same. */ @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai else if (CONST_INT_P (src)) { if (REG_P (dst)) - gain += COSTS_N_INSNS (2); + /* DImode can be immediate for TARGET_64BIT and SImode always. */ + igain += COSTS_N_INSNS (m); else if (MEM_P (dst)) - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; - gain -= vector_const_cost (src); + igain += (m * ix86_cost->int_store[2] + - ix86_cost->sse_store[sse_cost_idx]); + igain -= vector_const_cost (src); } else gcc_unreachable (); + + if (igain != 0 && dump_file) + { + fprintf (dump_file, " Instruction gain %d for ", igain); + dump_insn_slim (dump_file, insn); + } + gain += igain; } if (dump_file) fprintf (dump_file, " Instruction conversion gain: %d\n", gain); + /* ??? What about integer to SSE? 
*/ EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; @@ -573,7 +611,7 @@ rtx dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) { if (x == reg) - return gen_rtx_SUBREG (V2DImode, new_reg, 0); + return gen_rtx_SUBREG (vmode, new_reg, 0); const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); int i, j; @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies start_sequence (); if (!TARGET_INTER_UNIT_MOVES_TO_VEC) { - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); - emit_move_insn (adjust_address (tmp, SImode, 0), - gen_rtx_SUBREG (SImode, reg, 0)); - emit_move_insn (adjust_address (tmp, SImode, 4), - gen_rtx_SUBREG (SImode, reg, 4)); + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); + if (smode == DImode && !TARGET_64BIT) + { + emit_move_insn (adjust_address (tmp, SImode, 0), + gen_rtx_SUBREG (SImode, reg, 0)); + emit_move_insn (adjust_address (tmp, SImode, 4), + gen_rtx_SUBREG (SImode, reg, 4)); + } + else + emit_move_insn (tmp, reg); emit_move_insn (vreg, tmp); } - else if (TARGET_SSE4_1) + else if (!TARGET_64BIT && smode == DImode) { - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 0))); - emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (SImode, reg, 4), - GEN_INT (2))); + if (TARGET_SSE4_1) + { + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 0))); + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (SImode, reg, 4), + GEN_INT (2))); + } + else + { + rtx tmp = gen_reg_rtx (DImode); + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 0))); + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), + 
CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 4))); + emit_insn (gen_vec_interleave_lowv4si + (gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, tmp, 0))); + } } else - { - rtx tmp = gen_reg_rtx (DImode); - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 0))); - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 4))); - emit_insn (gen_vec_interleave_lowv4si - (gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, tmp, 0))); - } + emit_move_insn (gen_lowpart (smode, vreg), reg); rtx_insn *seq = get_insns (); end_sequence (); rtx_insn *insn = DF_REF_INSN (ref); @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign bitmap_copy (conv, insns); if (scalar_copy) - scopy = gen_reg_rtx (DImode); + scopy = gen_reg_rtx (smode); for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) { @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign start_sequence (); if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) { - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); emit_move_insn (tmp, reg); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), - adjust_address (tmp, SImode, 0)); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), - adjust_address (tmp, SImode, 4)); + if (!TARGET_64BIT && smode == DImode) + { + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), + adjust_address (tmp, SImode, 0)); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), + adjust_address (tmp, SImode, 4)); + } + else + emit_move_insn (scopy, tmp); } - else if (TARGET_SSE4_1) + else if (!TARGET_64BIT && smode == DImode) { - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); - emit_insn - (gen_rtx_SET - (gen_rtx_SUBREG (SImode, scopy, 0), - gen_rtx_VEC_SELECT 
(SImode, - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); - - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); - emit_insn - (gen_rtx_SET - (gen_rtx_SUBREG (SImode, scopy, 4), - gen_rtx_VEC_SELECT (SImode, - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); + if (TARGET_SSE4_1) + { + rtx tmp = gen_rtx_PARALLEL (VOIDmode, + gen_rtvec (1, const0_rtx)); + emit_insn + (gen_rtx_SET + (gen_rtx_SUBREG (SImode, scopy, 0), + gen_rtx_VEC_SELECT (SImode, + gen_rtx_SUBREG (V4SImode, reg, 0), + tmp))); + + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); + emit_insn + (gen_rtx_SET + (gen_rtx_SUBREG (SImode, scopy, 4), + gen_rtx_VEC_SELECT (SImode, + gen_rtx_SUBREG (V4SImode, reg, 0), + tmp))); + } + else + { + rtx vcopy = gen_reg_rtx (V2DImode); + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), + gen_rtx_SUBREG (SImode, vcopy, 0)); + emit_move_insn (vcopy, + gen_rtx_LSHIFTRT (V2DImode, + vcopy, GEN_INT (32))); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), + gen_rtx_SUBREG (SImode, vcopy, 0)); + } } else - { - rtx vcopy = gen_reg_rtx (V2DImode); - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), - gen_rtx_SUBREG (SImode, vcopy, 0)); - emit_move_insn (vcopy, - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), - gen_rtx_SUBREG (SImode, vcopy, 0)); - } + emit_move_insn (scopy, reg); + rtx_insn *seq = get_insns (); end_sequence (); emit_conversion_insns (seq, insn); @@ -816,14 +879,14 @@ dimode_scalar_chain::convert_op (rtx *op if (GET_CODE (*op) == NOT) { convert_op (&XEXP (*op, 0), insn); - PUT_MODE (*op, V2DImode); + PUT_MODE (*op, vmode); } else if (MEM_P (*op)) { - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (*op)); emit_insn_before (gen_move_insn (tmp, *op), insn); - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); + *op = gen_rtx_SUBREG (vmode, tmp, 0); if 
(dump_file) fprintf (dump_file, " Preloading operand for insn %d into r%d\n", @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op gcc_assert (!DF_REF_CHAIN (ref)); break; } - *op = gen_rtx_SUBREG (V2DImode, *op, 0); + *op = gen_rtx_SUBREG (vmode, *op, 0); } else if (CONST_INT_P (*op)) { rtx vec_cst; - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); /* Prefer all ones vector in case of -1. */ if (constm1_operand (*op, GET_MODE (*op))) - vec_cst = CONSTM1_RTX (V2DImode); + vec_cst = CONSTM1_RTX (vmode); else - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, - gen_rtvec (2, *op, const0_rtx)); + { + unsigned n = GET_MODE_NUNITS (vmode); + rtx *v = XALLOCAVEC (rtx, n); + v[0] = *op; + for (unsigned i = 1; i < n; ++i) + v[i] = const0_rtx; + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); + } - if (!standard_sse_constant_p (vec_cst, V2DImode)) + if (!standard_sse_constant_p (vec_cst, vmode)) { start_sequence (); - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); rtx_insn *seq = get_insns (); end_sequence (); emit_insn_before (seq, insn); @@ -870,7 +939,7 @@ dimode_scalar_chain::convert_op (rtx *op else { gcc_assert (SUBREG_P (*op)); - gcc_assert (GET_MODE (*op) == V2DImode); + gcc_assert (GET_MODE (*op) == vmode); } } @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i { /* There are no scalar integer instructions and therefore temporary register usage is required. 
*/ - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (dst)); emit_conversion_insns (gen_move_insn (dst, tmp), insn); - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); + dst = gen_rtx_SUBREG (vmode, tmp, 0); } switch (GET_CODE (src)) @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i case ASHIFTRT: case LSHIFTRT: convert_op (&XEXP (src, 0), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case PLUS: @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i case IOR: case XOR: case AND: + case SMAX: + case SMIN: + case UMAX: + case UMIN: convert_op (&XEXP (src, 0), insn); convert_op (&XEXP (src, 1), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case NEG: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); - src = gen_rtx_MINUS (V2DImode, subreg, src); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); + src = gen_rtx_MINUS (vmode, subreg, src); break; case NOT: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); - src = gen_rtx_XOR (V2DImode, src, subreg); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); + src = gen_rtx_XOR (vmode, src, subreg); break; case MEM: @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i break; case SUBREG: - gcc_assert (GET_MODE (src) == V2DImode); + gcc_assert (GET_MODE (src) == vmode); break; case COMPARE: src = SUBREG_REG (XEXP (XEXP (src, 0), 0)); - gcc_assert ((REG_P (src) && GET_MODE (src) == DImode) - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); + gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode)) + || (SUBREG_P (src) && GET_MODE (src) == vmode)); if (REG_P (src)) - subreg = gen_rtx_SUBREG (V2DImode, src, 0); + subreg = 
gen_rtx_SUBREG (vmode, src, 0); else subreg = copy_rtx_if_shared (src); emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i PATTERN (insn) = def_set; INSN_CODE (insn) = -1; - recog_memoized (insn); + int patt = recog_memoized (insn); + if (patt == -1) + fatal_insn_not_found (insn); df_insn_rescan (insn); } @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn (const_int 0 [0]))) */ static bool -convertible_comparison_p (rtx_insn *insn) +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) { if (!TARGET_SSE4_1) return false; @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn if (!SUBREG_P (op1) || !SUBREG_P (op2) - || GET_MODE (op1) != SImode - || GET_MODE (op2) != SImode + || GET_MODE (op1) != mode + || GET_MODE (op2) != mode || ((SUBREG_BYTE (op1) != 0 - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) && (SUBREG_BYTE (op2) != 0 - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) return false; op1 = SUBREG_REG (op1); @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn if (op1 != op2 || !REG_P (op1) - || GET_MODE (op1) != DImode) + || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ()) return false; return true; @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn /* The DImode version of scalar_to_vector_candidate_p. */ static bool -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) +dimode_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) { rtx def_set = single_set (insn); @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx rtx dst = SET_DEST (def_set); if (GET_CODE (src) == COMPARE) - return convertible_comparison_p (insn); + return convertible_comparison_p (insn, mode); /* We are interested in DImode promotion only. 
*/ - if ((GET_MODE (src) != DImode + if ((GET_MODE (src) != mode && !CONST_INT_P (src)) - || GET_MODE (dst) != DImode) + || GET_MODE (dst) != mode) return false; if (!REG_P (dst) && !MEM_P (dst)) @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx return false; break; + case SMAX: + case SMIN: + case UMAX: + case UMIN: + if ((mode == DImode && !TARGET_AVX512F) + || (mode == SImode && !TARGET_SSE4_1)) + return false; + /* Fallthru. */ + case PLUS: case MINUS: case IOR: @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx && !CONST_INT_P (XEXP (src, 1))) return false; - if (GET_MODE (XEXP (src, 1)) != DImode + if (GET_MODE (XEXP (src, 1)) != mode && !CONST_INT_P (XEXP (src, 1))) return false; break; @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx || !REG_P (XEXP (XEXP (src, 0), 0)))) return false; - if (GET_MODE (XEXP (src, 0)) != DImode + if (GET_MODE (XEXP (src, 0)) != mode && !CONST_INT_P (XEXP (src, 0))) return false; @@ -1383,19 +1467,13 @@ timode_scalar_to_vector_candidate_p (rtx return false; } -/* Return 1 if INSN may be converted into vector - instruction. */ - -static bool -scalar_to_vector_candidate_p (rtx_insn *insn) -{ - if (TARGET_64BIT) - return timode_scalar_to_vector_candidate_p (insn); - else - return dimode_scalar_to_vector_candidate_p (insn); -} +/* For a given bitmap of insn UIDs scans all instruction and + remove insn from CANDIDATES in case it has both convertible + and not convertible definitions. -/* The DImode version of remove_non_convertible_regs. */ + All insns in a bitmap are conversion candidates according to + scalar_to_vector_candidate_p. Currently it implies all insns + are single_set. */ static void dimode_remove_non_convertible_regs (bitmap candidates) @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm BITMAP_FREE (regs); } -/* For a given bitmap of insn UIDs scans all instruction and - remove insn from CANDIDATES in case it has both convertible - and not convertible definitions. 
- - All insns in a bitmap are conversion candidates according to - scalar_to_vector_candidate_p. Currently it implies all insns - are single_set. */ - -static void -remove_non_convertible_regs (bitmap candidates) -{ - if (TARGET_64BIT) - timode_remove_non_convertible_regs (candidates); - else - dimode_remove_non_convertible_regs (candidates); -} - /* Main STV pass function. Find and convert scalar instructions into vector mode when profitable. */ @@ -1577,11 +1638,14 @@ static unsigned int convert_scalars_to_vector () { basic_block bb; - bitmap candidates; int converted_insns = 0; bitmap_obstack_initialize (NULL); - candidates = BITMAP_ALLOC (NULL); + const machine_mode cand_mode[3] = { SImode, DImode, TImode }; + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode }; + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */ + for (unsigned i = 0; i < 3; ++i) + bitmap_initialize (&candidates[i], &bitmap_default_obstack); calculate_dominance_info (CDI_DOMINATORS); df_set_flags (DF_DEFER_INSN_RESCAN); @@ -1597,51 +1661,73 @@ convert_scalars_to_vector () { rtx_insn *insn; FOR_BB_INSNS (bb, insn) - if (scalar_to_vector_candidate_p (insn)) + if (TARGET_64BIT + && timode_scalar_to_vector_candidate_p (insn)) { if (dump_file) - fprintf (dump_file, " insn %d is marked as a candidate\n", + fprintf (dump_file, " insn %d is marked as a TImode candidate\n", INSN_UID (insn)); - bitmap_set_bit (candidates, INSN_UID (insn)); + bitmap_set_bit (&candidates[2], INSN_UID (insn)); + } + else + { + /* Check {SI,DI}mode. */ + for (unsigned i = 0; i <= 1; ++i) + if (dimode_scalar_to_vector_candidate_p (insn, cand_mode[i])) + { + if (dump_file) + fprintf (dump_file, " insn %d is marked as a %s candidate\n", + INSN_UID (insn), i == 0 ? 
"SImode" : "DImode"); + + bitmap_set_bit (&candidates[i], INSN_UID (insn)); + break; + } } } - remove_non_convertible_regs (candidates); + if (TARGET_64BIT) + timode_remove_non_convertible_regs (&candidates[2]); + for (unsigned i = 0; i <= 1; ++i) + dimode_remove_non_convertible_regs (&candidates[i]); - if (bitmap_empty_p (candidates)) - if (dump_file) + for (unsigned i = 0; i <= 2; ++i) + if (!bitmap_empty_p (&candidates[i])) + break; + else if (i == 2 && dump_file) fprintf (dump_file, "There are no candidates for optimization.\n"); - while (!bitmap_empty_p (candidates)) - { - unsigned uid = bitmap_first_set_bit (candidates); - scalar_chain *chain; + for (unsigned i = 0; i <= 2; ++i) + while (!bitmap_empty_p (&candidates[i])) + { + unsigned uid = bitmap_first_set_bit (&candidates[i]); + scalar_chain *chain; - if (TARGET_64BIT) - chain = new timode_scalar_chain; - else - chain = new dimode_scalar_chain; + if (cand_mode[i] == TImode) + chain = new timode_scalar_chain; + else + chain = new dimode_scalar_chain (cand_mode[i], cand_vmode[i]); - /* Find instructions chain we want to convert to vector mode. - Check all uses and definitions to estimate all required - conversions. */ - chain->build (candidates, uid); + /* Find instructions chain we want to convert to vector mode. + Check all uses and definitions to estimate all required + conversions. 
*/ + chain->build (&candidates[i], uid); - if (chain->compute_convert_gain () > 0) - converted_insns += chain->convert (); - else - if (dump_file) - fprintf (dump_file, "Chain #%d conversion is not profitable\n", - chain->chain_id); + if (chain->compute_convert_gain () > 0) + converted_insns += chain->convert (); + else + if (dump_file) + fprintf (dump_file, "Chain #%d conversion is not profitable\n", + chain->chain_id); - delete chain; - } + delete chain; + } if (dump_file) fprintf (dump_file, "Total insns converted: %d\n", converted_insns); - BITMAP_FREE (candidates); + for (unsigned i = 0; i <= 2; ++i) + bitmap_release (&candidates[i]); bitmap_obstack_release (NULL); df_process_deferred_rescans (); Index: gcc/config/i386/i386-features.h =================================================================== --- gcc/config/i386/i386-features.h (revision 274111) +++ gcc/config/i386/i386-features.h (working copy) @@ -127,11 +127,16 @@ namespace { class scalar_chain { public: - scalar_chain (); + scalar_chain (enum machine_mode, enum machine_mode); virtual ~scalar_chain (); static unsigned max_id; + /* Scalar mode. */ + enum machine_mode smode; + /* Vector mode. */ + enum machine_mode vmode; + /* ID of a chain. */ unsigned int chain_id; /* A queue of instructions to be included into a chain. */ @@ -162,6 +167,8 @@ class scalar_chain class dimode_scalar_chain : public scalar_chain { public: + dimode_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) + : scalar_chain (smode_, vmode_) {} int compute_convert_gain (); private: void mark_dual_mode_def (df_ref def); @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala class timode_scalar_chain : public scalar_chain { public: + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {} + /* Convert from TImode to V1TImode is always faster. 
*/ int compute_convert_gain () { return 1; } Index: gcc/config/i386/i386.md =================================================================== --- gcc/config/i386/i386.md (revision 274111) +++ gcc/config/i386/i386.md (working copy) @@ -17721,6 +17721,27 @@ (define_peephole2 std::swap (operands[4], operands[5]); }) +;; min/max patterns + +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")]) + +(define_insn_and_split "<code><mode>3" + [(set (match_operand:SWI48 0 "register_operand") + (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand") + (match_operand:SWI48 2 "register_operand"))) + (clobber (reg:CC FLAGS_REG))] + "TARGET_STV && TARGET_SSE4_1 + && can_create_pseudo_p ()" + "#" + "&& 1" + [(set (reg:CCGC FLAGS_REG) + (compare:CCGC (match_dup 1)(match_dup 2))) + (set (match_dup 0) + (if_then_else:SWI48 + (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0)) + (match_dup 1) + (match_dup 2)))]) + ;; Conditional addition patterns (define_expand "add<mode>cc" [(match_operand:SWI 0 "register_operand") ^ permalink raw reply [flat|nested] 61+ messages in thread
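For concreteness, a reduced C loop with the shape discussed in this thread — SImode loads, an add, two signed max operations, and a store per iteration — might look as follows. This is an illustration only, not a testcase from the thread or from 456.hmmer itself; the function and array names are invented. On GIMPLE the conditionals become MAX_EXPRs, which the smax pattern (or the STV chain conversion) can then emit as pmaxsd instead of a cmp/cmov sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: per iteration this does SImode loads, one add,
   two signed max operations and an SImode store -- the kind of chain
   that starts and ends in memory and so is a candidate for staying
   entirely in SSE registers.  */
void
max_chain (int *dst, const int *a, const int *b, const int *c, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      int t = a[i] + b[i];          /* SImode add */
      t = t > c[i] ? t : c[i];      /* first signed max -> MAX_EXPR */
      t = t > dst[i] ? t : dst[i];  /* second signed max -> MAX_EXPR */
      dst[i] = t;                   /* SImode store */
    }
}
```

With the approach described above, compiling such a loop with -O2 and SSE4.1 enabled should show pmaxsd in the loop body where a cmov (or branchy) sequence was emitted before — assuming the chain cost model judges the conversion profitable.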
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 11:50 ` Richard Biener @ 2019-08-05 11:59 ` Uros Bizjak 2019-08-05 12:16 ` Richard Biener 2019-08-05 12:33 ` Uros Bizjak 2019-08-05 12:44 ` Uros Bizjak 2 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-05 11:59 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguenther@suse.de> wrote: > > On Sun, 4 Aug 2019, Uros Bizjak wrote: > > > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > On Thu, 1 Aug 2019, Uros Bizjak wrote: > > > > > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote: > > > > > > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks > > > >>>> necessary even when going the STV route. The actual regression > > > >>>> for the testcase could also be solved by turning the smaxsi3 > > > >>>> back into a compare and jump rather than a conditional move sequence. > > > >>>> So I wonder how you'd do that given that there's pass_if_after_reload > > > >>>> after pass_split_after_reload and I'm not sure we can split > > > >>>> as late as pass_split_before_sched2 (there's also a split _after_ > > > >>>> sched2 on x86 it seems). > > > >>>> > > > >>>> So how would you go about implementing {s,u}{min,max}{si,di}3 for the > > > >>>> case STV doesn't end up doing any transform? > > > >>> > > > >>> If STV doesn't transform the insn, then a pre-reload splitter splits > > > >>> the insn back to compare+cmove. > > > >> > > > >> OK, that would work. But there's no way to force a jumpy sequence then > > > >> which we know is faster than compare+cmove because later RTL > > > >> if-conversion passes happily re-discover the smax (or conditional move) > > > >> sequence. 
> > > >> > > > >>> However, considering the SImode move > > > >>> from/to int/xmm register is relatively cheap, the cost function should > > > >>> be tuned so that STV always converts the smaxsi3 pattern. > > > >> > > > >> Note that on both Zen and even more so bdverN the int/xmm transition > > > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov > > > >> sequence... (for the loop in hmmer which is the only one I see > > > >> any effect of any of my patches). So identifying chains that > > > >> start/end in memory is important for cost reasons. > > > > > > > > Please note that the cost function also considers the cost of move > > > > from/to xmm. So, the cost of the whole chain would disable the > > > > transformation. > > > > > > > >> So I think the splitting has to happen after the last if-conversion > > > >> pass (and thus we may need to allocate a scratch register for this > > > >> purpose?) > > > > > > > > I really hope that the underlying issue will be solved by a machine > > > > dependent pass inserted somewhere after the pre-reload split. This > > > > way, we can split unconverted smax to the cmove, and this later pass > > > > would handle jcc and cmove instructions. Until then... yes your > > > > proposed approach is one of the ways to avoid unwanted if-conversion, > > > > although sometimes we would like to split to cmove instead. > > > So the following makes STV also consider SImode chains, re-using the > > > DImode chain code. I've kept a simple incomplete smaxsi3 pattern > > > and also did not alter the {SI,DI}mode chain cost function - it's > > > quite off for TARGET_64BIT. With this I get the expected conversion > > > for the testcase derived from hmmer. > > > > > > No further testing so far. > > > > > > Is it OK to re-use the DImode chain code this way? I'll clean things > > > up some more of course. > > Yes, the approach looks OK to me. 
It makes chain building mode > > agnostic, and the chain building can be used for > > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. > > b) SImode x86_32 and x86_64 (this will be mainly used for SImode > > minmax and surrounding SImode operations) > > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding > > DImode operations) > > > > > Still need help with the actual patterns for minmax and what the splitters > > > should look like. > > > > Please look at the attached patch. Maybe we can add memory_operand as > > operand 1 and operand 2 predicate, but let's keep things simple for > > now. > > Thanks. The attached patch makes the patch cleaner and it survives > "some" barebone testing. It also touches the cost function to > avoid being overly trigger-happy. I've also ended up using > ix86_cost->sse_op instead of COSTS_N_INSNS-based magic. In > particular we estimated GPR reg-reg move as COSTS_N_INSNS(2) while > move costs shouldn't be wrapped in COSTS_N_INSNS. > IMHO we should probably disregard any reg-reg moves for costing pre-RA. > At least with the current code every reg-reg move biases in favor of > SSE... > > And we're simply adding move and non-move costs in 'gain', somewhat > mixing apples and oranges? We could separate those and require > both to be a net positive win? > > Still using -mtune=bdverN exposes that some cost tables have xmm and gpr > costs as apples and oranges... 
(so it never triggers for Bulldozer) > > I now run into > > /space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1: > error: unrecognizable insn: > (insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0) > (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0) > (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0))) > -1 > (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) > (expr_list:REG_UNUSED (reg:CC 17 flags) > (nil)))) > during RTL pass: stv > > where even with -mavx2 we do not have s{min,max}v2di3. We do have > an expander here but it seems only AVX512F has the DImode min/max > ops. I have adjusted dimode_scalar_to_vector_candidate_p > accordingly. > > I'm considering renaming the > dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs} > functions to drop the dimode_ prefix - is that OK or do you > prefer some other prefix? No, please just drop the prefix. > So - bootstrap with --with-arch=skylake in progress. > > It detects quite a few chains (unsurprisingly) so I guess we need > to address compile-time issues in the pass before enabling this > enhancement (maybe as followup?). > > Further comments on the actual patch welcome, I consider it > "finished" if testing reveals no issues. ChangeLog still needs > to be written and testcases to be added. I'll look at the patch later today from the x86 target PoV, maybe an opinion of the RTL expert would also come in handy here. Uros. > > Thanks, > Richard. > > Index: gcc/config/i386/i386-features.c > =================================================================== > --- gcc/config/i386/i386-features.c (revision 274111) > +++ gcc/config/i386/i386-features.c (working copy) > @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0; > > /* Initialize new chain. 
*/ > > -scalar_chain::scalar_chain () > +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > { > + smode = smode_; > + vmode = vmode_; > + > chain_id = ++max_id; > > if (dump_file) > @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate > && !HARD_REGISTER_P (SET_DEST (def_set))) > bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); > > + /* ??? The following is quadratic since analyze_register_chain > + iterates over all refs to look for dual-mode regs. Instead this > + should be done separately for all regs mentioned in the chain once. */ > df_ref ref; > df_ref def; > for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) > @@ -473,9 +479,11 @@ dimode_scalar_chain::vector_const_cost ( > { > gcc_assert (CONST_INT_P (exp)); > > - if (standard_sse_constant_p (exp, V2DImode)) > - return COSTS_N_INSNS (1); > - return ix86_cost->sse_load[1]; > + if (standard_sse_constant_p (exp, vmode)) > + return ix86_cost->sse_op; > + /* We have separate costs for SImode and DImode, use SImode costs > + for smaller modes. */ > + return ix86_cost->sse_load[smode == DImode ? 1 : 0]; > } > > /* Compute a gain for chain conversion. */ > @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai > if (dump_file) > fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); > > + /* SSE costs distinguish between SImode and DImode loads/stores, for > + int costs factor in the number of GPRs involved. When supporting > + smaller modes than SImode the int load/store costs need to be > + adjusted as well. */ > + unsigned sse_cost_idx = smode == DImode ? 1 : 0; > + unsigned m = smode == DImode ? (TARGET_64BIT ? 
1 : 2) : 1; > + > EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) > { > rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; > rtx def_set = single_set (insn); > rtx src = SET_SRC (def_set); > rtx dst = SET_DEST (def_set); > + int igain = 0; > > if (REG_P (src) && REG_P (dst)) > - gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move; > + igain += 2 * m - ix86_cost->xmm_move; > else if (REG_P (src) && MEM_P (dst)) > - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > + igain > + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; > else if (MEM_P (src) && REG_P (dst)) > - gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; > + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; > else if (GET_CODE (src) == ASHIFT > || GET_CODE (src) == ASHIFTRT > || GET_CODE (src) == LSHIFTRT) > { > if (CONST_INT_P (XEXP (src, 0))) > - gain -= vector_const_cost (XEXP (src, 0)); > - gain += ix86_cost->shift_const; > + igain -= vector_const_cost (XEXP (src, 0)); > + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; > if (INTVAL (XEXP (src, 1)) >= 32) > - gain -= COSTS_N_INSNS (1); > + igain -= COSTS_N_INSNS (1); > } > else if (GET_CODE (src) == PLUS > || GET_CODE (src) == MINUS > @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai > || GET_CODE (src) == XOR > || GET_CODE (src) == AND) > { > - gain += ix86_cost->add; > + igain += m * ix86_cost->add - ix86_cost->sse_op; > /* Additional gain for andnot for targets without BMI. 
*/ > if (GET_CODE (XEXP (src, 0)) == NOT > && !TARGET_BMI) > - gain += 2 * ix86_cost->add; > + igain += m * ix86_cost->add; > > if (CONST_INT_P (XEXP (src, 0))) > - gain -= vector_const_cost (XEXP (src, 0)); > + igain -= vector_const_cost (XEXP (src, 0)); > if (CONST_INT_P (XEXP (src, 1))) > - gain -= vector_const_cost (XEXP (src, 1)); > + igain -= vector_const_cost (XEXP (src, 1)); > } > else if (GET_CODE (src) == NEG > || GET_CODE (src) == NOT) > - gain += ix86_cost->add - COSTS_N_INSNS (1); > + igain += m * ix86_cost->add - ix86_cost->sse_op; > + else if (GET_CODE (src) == SMAX > + || GET_CODE (src) == SMIN > + || GET_CODE (src) == UMAX > + || GET_CODE (src) == UMIN) > + { > + /* We do not have any conditional move cost, estimate it as a > + reg-reg move. Comparisons are costed as adds. */ > + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); > + /* Integer SSE ops are all costed the same. */ > + igain -= ix86_cost->sse_op; > + } > else if (GET_CODE (src) == COMPARE) > { > /* Assume comparison cost is the same. */ > @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai > else if (CONST_INT_P (src)) > { > if (REG_P (dst)) > - gain += COSTS_N_INSNS (2); > + /* DImode can be immediate for TARGET_64BIT and SImode always. */ > + igain += COSTS_N_INSNS (m); > else if (MEM_P (dst)) > - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > - gain -= vector_const_cost (src); > + igain += (m * ix86_cost->int_store[2] > + - ix86_cost->sse_store[sse_cost_idx]); > + igain -= vector_const_cost (src); > } > else > gcc_unreachable (); > + > + if (igain != 0 && dump_file) > + { > + fprintf (dump_file, " Instruction gain %d for ", igain); > + dump_insn_slim (dump_file, insn); > + } > + gain += igain; > } > > if (dump_file) > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > + /* ??? What about integer to SSE? 
*/ > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > @@ -573,7 +611,7 @@ rtx > dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) > { > if (x == reg) > - return gen_rtx_SUBREG (V2DImode, new_reg, 0); > + return gen_rtx_SUBREG (vmode, new_reg, 0); > > const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); > int i, j; > @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies > start_sequence (); > if (!TARGET_INTER_UNIT_MOVES_TO_VEC) > { > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > - emit_move_insn (adjust_address (tmp, SImode, 0), > - gen_rtx_SUBREG (SImode, reg, 0)); > - emit_move_insn (adjust_address (tmp, SImode, 4), > - gen_rtx_SUBREG (SImode, reg, 4)); > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > + if (smode == DImode && !TARGET_64BIT) > + { > + emit_move_insn (adjust_address (tmp, SImode, 0), > + gen_rtx_SUBREG (SImode, reg, 0)); > + emit_move_insn (adjust_address (tmp, SImode, 4), > + gen_rtx_SUBREG (SImode, reg, 4)); > + } > + else > + emit_move_insn (tmp, reg); > emit_move_insn (vreg, tmp); > } > - else if (TARGET_SSE4_1) > + else if (!TARGET_64BIT && smode == DImode) > { > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 0))); > - emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (SImode, reg, 4), > - GEN_INT (2))); > + if (TARGET_SSE4_1) > + { > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 0))); > + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (SImode, reg, 4), > + GEN_INT (2))); > + } > + else > + { > + rtx tmp = gen_reg_rtx (DImode); > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > + CONST0_RTX 
(V4SImode), > + gen_rtx_SUBREG (SImode, reg, 0))); > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 4))); > + emit_insn (gen_vec_interleave_lowv4si > + (gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, tmp, 0))); > + } > } > else > - { > - rtx tmp = gen_reg_rtx (DImode); > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 0))); > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 4))); > - emit_insn (gen_vec_interleave_lowv4si > - (gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, tmp, 0))); > - } > + emit_move_insn (gen_lowpart (smode, vreg), reg); > rtx_insn *seq = get_insns (); > end_sequence (); > rtx_insn *insn = DF_REF_INSN (ref); > @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign > bitmap_copy (conv, insns); > > if (scalar_copy) > - scopy = gen_reg_rtx (DImode); > + scopy = gen_reg_rtx (smode); > > for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) > { > @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign > start_sequence (); > if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) > { > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > emit_move_insn (tmp, reg); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > - adjust_address (tmp, SImode, 0)); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > - adjust_address (tmp, SImode, 4)); > + if (!TARGET_64BIT && smode == DImode) > + { > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > + adjust_address (tmp, SImode, 0)); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > + adjust_address (tmp, SImode, 4)); > + } > + else > + emit_move_insn (scopy, tmp); > } 
> - else if (TARGET_SSE4_1) > + else if (!TARGET_64BIT && smode == DImode) > { > - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); > - emit_insn > - (gen_rtx_SET > - (gen_rtx_SUBREG (SImode, scopy, 0), > - gen_rtx_VEC_SELECT (SImode, > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > - > - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > - emit_insn > - (gen_rtx_SET > - (gen_rtx_SUBREG (SImode, scopy, 4), > - gen_rtx_VEC_SELECT (SImode, > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > + if (TARGET_SSE4_1) > + { > + rtx tmp = gen_rtx_PARALLEL (VOIDmode, > + gen_rtvec (1, const0_rtx)); > + emit_insn > + (gen_rtx_SET > + (gen_rtx_SUBREG (SImode, scopy, 0), > + gen_rtx_VEC_SELECT (SImode, > + gen_rtx_SUBREG (V4SImode, reg, 0), > + tmp))); > + > + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > + emit_insn > + (gen_rtx_SET > + (gen_rtx_SUBREG (SImode, scopy, 4), > + gen_rtx_VEC_SELECT (SImode, > + gen_rtx_SUBREG (V4SImode, reg, 0), > + tmp))); > + } > + else > + { > + rtx vcopy = gen_reg_rtx (V2DImode); > + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > + gen_rtx_SUBREG (SImode, vcopy, 0)); > + emit_move_insn (vcopy, > + gen_rtx_LSHIFTRT (V2DImode, > + vcopy, GEN_INT (32))); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > + gen_rtx_SUBREG (SImode, vcopy, 0)); > + } > } > else > - { > - rtx vcopy = gen_reg_rtx (V2DImode); > - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > - gen_rtx_SUBREG (SImode, vcopy, 0)); > - emit_move_insn (vcopy, > - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > - gen_rtx_SUBREG (SImode, vcopy, 0)); > - } > + emit_move_insn (scopy, reg); > + > rtx_insn *seq = get_insns (); > end_sequence (); > emit_conversion_insns (seq, insn); > @@ -816,14 +879,14 @@ dimode_scalar_chain::convert_op (rtx *op > if 
(GET_CODE (*op) == NOT) > { > convert_op (&XEXP (*op, 0), insn); > - PUT_MODE (*op, V2DImode); > + PUT_MODE (*op, vmode); > } > else if (MEM_P (*op)) > { > - rtx tmp = gen_reg_rtx (DImode); > + rtx tmp = gen_reg_rtx (GET_MODE (*op)); > > emit_insn_before (gen_move_insn (tmp, *op), insn); > - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); > + *op = gen_rtx_SUBREG (vmode, tmp, 0); > > if (dump_file) > fprintf (dump_file, " Preloading operand for insn %d into r%d\n", > @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op > gcc_assert (!DF_REF_CHAIN (ref)); > break; > } > - *op = gen_rtx_SUBREG (V2DImode, *op, 0); > + *op = gen_rtx_SUBREG (vmode, *op, 0); > } > else if (CONST_INT_P (*op)) > { > rtx vec_cst; > - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); > + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); > > /* Prefer all ones vector in case of -1. */ > if (constm1_operand (*op, GET_MODE (*op))) > - vec_cst = CONSTM1_RTX (V2DImode); > + vec_cst = CONSTM1_RTX (vmode); > else > - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, > - gen_rtvec (2, *op, const0_rtx)); > + { > + unsigned n = GET_MODE_NUNITS (vmode); > + rtx *v = XALLOCAVEC (rtx, n); > + v[0] = *op; > + for (unsigned i = 1; i < n; ++i) > + v[i] = const0_rtx; > + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); > + } > > - if (!standard_sse_constant_p (vec_cst, V2DImode)) > + if (!standard_sse_constant_p (vec_cst, vmode)) > { > start_sequence (); > - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); > + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); > rtx_insn *seq = get_insns (); > end_sequence (); > emit_insn_before (seq, insn); > @@ -870,7 +939,7 @@ dimode_scalar_chain::convert_op (rtx *op > else > { > gcc_assert (SUBREG_P (*op)); > - gcc_assert (GET_MODE (*op) == V2DImode); > + gcc_assert (GET_MODE (*op) == vmode); > } > } > > @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i > { > /* There are no scalar integer instructions and 
therefore > temporary register usage is required. */ > - rtx tmp = gen_reg_rtx (DImode); > + rtx tmp = gen_reg_rtx (GET_MODE (dst)); > emit_conversion_insns (gen_move_insn (dst, tmp), insn); > - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); > + dst = gen_rtx_SUBREG (vmode, tmp, 0); > } > > switch (GET_CODE (src)) > @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i > case ASHIFTRT: > case LSHIFTRT: > convert_op (&XEXP (src, 0), insn); > - PUT_MODE (src, V2DImode); > + PUT_MODE (src, vmode); > break; > > case PLUS: > @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i > case IOR: > case XOR: > case AND: > + case SMAX: > + case SMIN: > + case UMAX: > + case UMIN: > convert_op (&XEXP (src, 0), insn); > convert_op (&XEXP (src, 1), insn); > - PUT_MODE (src, V2DImode); > + PUT_MODE (src, vmode); > break; > > case NEG: > src = XEXP (src, 0); > convert_op (&src, insn); > - subreg = gen_reg_rtx (V2DImode); > - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); > - src = gen_rtx_MINUS (V2DImode, subreg, src); > + subreg = gen_reg_rtx (vmode); > + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); > + src = gen_rtx_MINUS (vmode, subreg, src); > break; > > case NOT: > src = XEXP (src, 0); > convert_op (&src, insn); > - subreg = gen_reg_rtx (V2DImode); > - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); > - src = gen_rtx_XOR (V2DImode, src, subreg); > + subreg = gen_reg_rtx (vmode); > + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); > + src = gen_rtx_XOR (vmode, src, subreg); > break; > > case MEM: > @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i > break; > > case SUBREG: > - gcc_assert (GET_MODE (src) == V2DImode); > + gcc_assert (GET_MODE (src) == vmode); > break; > > case COMPARE: > src = SUBREG_REG (XEXP (XEXP (src, 0), 0)); > > - gcc_assert ((REG_P (src) && GET_MODE (src) == DImode) > - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); > + gcc_assert 
((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode)) > + || (SUBREG_P (src) && GET_MODE (src) == vmode)); > > if (REG_P (src)) > - subreg = gen_rtx_SUBREG (V2DImode, src, 0); > + subreg = gen_rtx_SUBREG (vmode, src, 0); > else > subreg = copy_rtx_if_shared (src); > emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), > @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i > PATTERN (insn) = def_set; > > INSN_CODE (insn) = -1; > - recog_memoized (insn); > + int patt = recog_memoized (insn); > + if (patt == -1) > + fatal_insn_not_found (insn); > df_insn_rescan (insn); > } > > @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn > (const_int 0 [0]))) */ > > static bool > -convertible_comparison_p (rtx_insn *insn) > +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) > { > if (!TARGET_SSE4_1) > return false; > @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn > > if (!SUBREG_P (op1) > || !SUBREG_P (op2) > - || GET_MODE (op1) != SImode > - || GET_MODE (op2) != SImode > + || GET_MODE (op1) != mode > + || GET_MODE (op2) != mode > || ((SUBREG_BYTE (op1) != 0 > - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) > + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) > && (SUBREG_BYTE (op2) != 0 > - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) > + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) > return false; > > op1 = SUBREG_REG (op1); > @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn > > if (op1 != op2 > || !REG_P (op1) > - || GET_MODE (op1) != DImode) > + || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ()) > return false; > > return true; > @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn > /* The DImode version of scalar_to_vector_candidate_p. 
*/ > > static bool > -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) > +dimode_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) > { > rtx def_set = single_set (insn); > > @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx > rtx dst = SET_DEST (def_set); > > if (GET_CODE (src) == COMPARE) > - return convertible_comparison_p (insn); > + return convertible_comparison_p (insn, mode); > > /* We are interested in DImode promotion only. */ > - if ((GET_MODE (src) != DImode > + if ((GET_MODE (src) != mode > && !CONST_INT_P (src)) > - || GET_MODE (dst) != DImode) > + || GET_MODE (dst) != mode) > return false; > > if (!REG_P (dst) && !MEM_P (dst)) > @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx > return false; > break; > > + case SMAX: > + case SMIN: > + case UMAX: > + case UMIN: > + if ((mode == DImode && !TARGET_AVX512F) > + || (mode == SImode && !TARGET_SSE4_1)) > + return false; > + /* Fallthru. */ > + > case PLUS: > case MINUS: > case IOR: > @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx > && !CONST_INT_P (XEXP (src, 1))) > return false; > > - if (GET_MODE (XEXP (src, 1)) != DImode > + if (GET_MODE (XEXP (src, 1)) != mode > && !CONST_INT_P (XEXP (src, 1))) > return false; > break; > @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx > || !REG_P (XEXP (XEXP (src, 0), 0)))) > return false; > > - if (GET_MODE (XEXP (src, 0)) != DImode > + if (GET_MODE (XEXP (src, 0)) != mode > && !CONST_INT_P (XEXP (src, 0))) > return false; > > @@ -1383,19 +1467,13 @@ timode_scalar_to_vector_candidate_p (rtx > return false; > } > > -/* Return 1 if INSN may be converted into vector > - instruction. 
*/ > - > -static bool > -scalar_to_vector_candidate_p (rtx_insn *insn) > -{ > - if (TARGET_64BIT) > - return timode_scalar_to_vector_candidate_p (insn); > - else > - return dimode_scalar_to_vector_candidate_p (insn); > -} > +/* For a given bitmap of insn UIDs scans all instruction and > + remove insn from CANDIDATES in case it has both convertible > + and not convertible definitions. > > -/* The DImode version of remove_non_convertible_regs. */ > + All insns in a bitmap are conversion candidates according to > + scalar_to_vector_candidate_p. Currently it implies all insns > + are single_set. */ > > static void > dimode_remove_non_convertible_regs (bitmap candidates) > @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm > BITMAP_FREE (regs); > } > > -/* For a given bitmap of insn UIDs scans all instruction and > - remove insn from CANDIDATES in case it has both convertible > - and not convertible definitions. > - > - All insns in a bitmap are conversion candidates according to > - scalar_to_vector_candidate_p. Currently it implies all insns > - are single_set. */ > - > -static void > -remove_non_convertible_regs (bitmap candidates) > -{ > - if (TARGET_64BIT) > - timode_remove_non_convertible_regs (candidates); > - else > - dimode_remove_non_convertible_regs (candidates); > -} > - > /* Main STV pass function. Find and convert scalar > instructions into vector mode when profitable. 
*/ > > @@ -1577,11 +1638,14 @@ static unsigned int > convert_scalars_to_vector () > { > basic_block bb; > - bitmap candidates; > int converted_insns = 0; > > bitmap_obstack_initialize (NULL); > - candidates = BITMAP_ALLOC (NULL); > + const machine_mode cand_mode[3] = { SImode, DImode, TImode }; > + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode }; > + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */ > + for (unsigned i = 0; i < 3; ++i) > + bitmap_initialize (&candidates[i], &bitmap_default_obstack); > > calculate_dominance_info (CDI_DOMINATORS); > df_set_flags (DF_DEFER_INSN_RESCAN); > @@ -1597,51 +1661,73 @@ convert_scalars_to_vector () > { > rtx_insn *insn; > FOR_BB_INSNS (bb, insn) > - if (scalar_to_vector_candidate_p (insn)) > + if (TARGET_64BIT > + && timode_scalar_to_vector_candidate_p (insn)) > { > if (dump_file) > - fprintf (dump_file, " insn %d is marked as a candidate\n", > + fprintf (dump_file, " insn %d is marked as a TImode candidate\n", > INSN_UID (insn)); > > - bitmap_set_bit (candidates, INSN_UID (insn)); > + bitmap_set_bit (&candidates[2], INSN_UID (insn)); > + } > + else > + { > + /* Check {SI,DI}mode. */ > + for (unsigned i = 0; i <= 1; ++i) > + if (dimode_scalar_to_vector_candidate_p (insn, cand_mode[i])) > + { > + if (dump_file) > + fprintf (dump_file, " insn %d is marked as a %s candidate\n", > + INSN_UID (insn), i == 0 ? 
"SImode" : "DImode"); > + > + bitmap_set_bit (&candidates[i], INSN_UID (insn)); > + break; > + } > } > } > > - remove_non_convertible_regs (candidates); > + if (TARGET_64BIT) > + timode_remove_non_convertible_regs (&candidates[2]); > + for (unsigned i = 0; i <= 1; ++i) > + dimode_remove_non_convertible_regs (&candidates[i]); > > - if (bitmap_empty_p (candidates)) > - if (dump_file) > + for (unsigned i = 0; i <= 2; ++i) > + if (!bitmap_empty_p (&candidates[i])) > + break; > + else if (i == 2 && dump_file) > fprintf (dump_file, "There are no candidates for optimization.\n"); > > - while (!bitmap_empty_p (candidates)) > - { > - unsigned uid = bitmap_first_set_bit (candidates); > - scalar_chain *chain; > + for (unsigned i = 0; i <= 2; ++i) > + while (!bitmap_empty_p (&candidates[i])) > + { > + unsigned uid = bitmap_first_set_bit (&candidates[i]); > + scalar_chain *chain; > > - if (TARGET_64BIT) > - chain = new timode_scalar_chain; > - else > - chain = new dimode_scalar_chain; > + if (cand_mode[i] == TImode) > + chain = new timode_scalar_chain; > + else > + chain = new dimode_scalar_chain (cand_mode[i], cand_vmode[i]); > > - /* Find instructions chain we want to convert to vector mode. > - Check all uses and definitions to estimate all required > - conversions. */ > - chain->build (candidates, uid); > + /* Find instructions chain we want to convert to vector mode. > + Check all uses and definitions to estimate all required > + conversions. 
*/ > + chain->build (&candidates[i], uid); > > - if (chain->compute_convert_gain () > 0) > - converted_insns += chain->convert (); > - else > - if (dump_file) > - fprintf (dump_file, "Chain #%d conversion is not profitable\n", > - chain->chain_id); > + if (chain->compute_convert_gain () > 0) > + converted_insns += chain->convert (); > + else > + if (dump_file) > + fprintf (dump_file, "Chain #%d conversion is not profitable\n", > + chain->chain_id); > > - delete chain; > - } > + delete chain; > + } > > if (dump_file) > fprintf (dump_file, "Total insns converted: %d\n", converted_insns); > > - BITMAP_FREE (candidates); > + for (unsigned i = 0; i <= 2; ++i) > + bitmap_release (&candidates[i]); > bitmap_obstack_release (NULL); > df_process_deferred_rescans (); > > Index: gcc/config/i386/i386-features.h > =================================================================== > --- gcc/config/i386/i386-features.h (revision 274111) > +++ gcc/config/i386/i386-features.h (working copy) > @@ -127,11 +127,16 @@ namespace { > class scalar_chain > { > public: > - scalar_chain (); > + scalar_chain (enum machine_mode, enum machine_mode); > virtual ~scalar_chain (); > > static unsigned max_id; > > + /* Scalar mode. */ > + enum machine_mode smode; > + /* Vector mode. */ > + enum machine_mode vmode; > + > /* ID of a chain. */ > unsigned int chain_id; > /* A queue of instructions to be included into a chain. */ > @@ -162,6 +167,8 @@ class scalar_chain > class dimode_scalar_chain : public scalar_chain > { > public: > + dimode_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > + : scalar_chain (smode_, vmode_) {} > int compute_convert_gain (); > private: > void mark_dual_mode_def (df_ref def); > @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala > class timode_scalar_chain : public scalar_chain > { > public: > + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {} > + > /* Convert from TImode to V1TImode is always faster. 
*/ > int compute_convert_gain () { return 1; } > > Index: gcc/config/i386/i386.md > =================================================================== > --- gcc/config/i386/i386.md (revision 274111) > +++ gcc/config/i386/i386.md (working copy) > @@ -17721,6 +17721,27 @@ (define_peephole2 > std::swap (operands[4], operands[5]); > }) > > +;; min/max patterns > + > +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")]) > + > +(define_insn_and_split "<code><mode>3" > + [(set (match_operand:SWI48 0 "register_operand") > + (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand") > + (match_operand:SWI48 2 "register_operand"))) > + (clobber (reg:CC FLAGS_REG))] > + "TARGET_STV && TARGET_SSE4_1 > + && can_create_pseudo_p ()" > + "#" > + "&& 1" > + [(set (reg:CCGC FLAGS_REG) > + (compare:CCGC (match_dup 1)(match_dup 2))) > + (set (match_dup 0) > + (if_then_else:SWI48 > + (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0)) > + (match_dup 1) > + (match_dup 2)))]) > + > ;; Conditional addition patterns > (define_expand "add<mode>cc" > [(match_operand:SWI 0 "register_operand") ^ permalink raw reply [flat|nested] 61+ messages in thread
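The loop shape this patch targets can be sketched in plain C++ (a hypothetical reduction, not the actual 456.hmmer code — names and constants are illustrative): per iteration, SImode loads, an add, two signed max operations, and a store. At the GIMPLE level the conditionals become MAX_EXPRs; with the STV conversion above the whole chain stays in SSE registers, so each max is a single pmaxsd instead of a cmp/cmov pair.

```cpp
#include <cassert>

// Hypothetical reduction of the discussed loop shape: loads + add +
// two signed max operations + store, all SImode.  The two "if (sc < X)
// sc = X;" statements are what GIMPLE turns into MAX_EXPRs.
static void
max_chain (int n, int *dst, const int *a, const int *b, const int *c)
{
  for (int k = 0; k < n; k++)
    {
      int sc = a[k] + b[k];
      if (sc < c[k])   // first smax, against a loaded value
        sc = c[k];
      if (sc < -100)   // second smax, against a constant
        sc = -100;
      dst[k] = sc;
    }
}
```

When the chain is converted, the two maxes operate directly on the SSE registers holding the loaded values; unconverted, they split back to compare + cmov as in the `smaxsi3` pattern above.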
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
  2019-08-05 11:59             ` Uros Bizjak
@ 2019-08-05 12:16               ` Richard Biener
  2019-08-05 12:23                 ` Uros Bizjak
  0 siblings, 1 reply; 61+ messages in thread
From: Richard Biener @ 2019-08-05 12:16 UTC (permalink / raw)
  To: Uros Bizjak; +Cc: gcc-patches, Jakub Jelinek

On Mon, 5 Aug 2019, Uros Bizjak wrote:

> > dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
> > functions to drop the dimode_ prefix - is that OK or do you
> > prefer some other prefix?
>
> No, please just drop the prefix.

just noticed this applies to the derived dimode_scalar_chain class
as well where I can't simply drop the prefix.  So would
general_scalar_chain /
general_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
be OK?

Richard.
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
  2019-08-05 12:16               ` Richard Biener
@ 2019-08-05 12:23                 ` Uros Bizjak
  0 siblings, 0 replies; 61+ messages in thread
From: Uros Bizjak @ 2019-08-05 12:23 UTC (permalink / raw)
  To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek

On Mon, Aug 5, 2019 at 2:16 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Mon, 5 Aug 2019, Uros Bizjak wrote:
>
> > > dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
> > > functions to drop the dimode_ prefix - is that OK or do you
> > > prefer some other prefix?
> >
> > No, please just drop the prefix.
>
> just noticed this applies to the derived dimode_scalar_chain class
> as well where I can't simply drop the prefix.  So would
> general_scalar_chain /
> general_{scalar_to_vector_candidate_p,remove_non_convertible_regs}
> be OK?

I don't want to bikeshed too much here ;) Whatever fits you best.

Uros.
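The renaming being discussed follows from the class layout in the i386-features.h hunk above: once the chain carries its own smode/vmode pair, "dimode_" no longer describes the derived class. A stripped-down sketch of the hierarchy (machine_mode reduced to an enum, all other members elided — this is illustrative, not the committed code):

```cpp
#include <cassert>

// Minimal stand-in for GCC's machine_mode, covering only the modes
// the pass pairs up: {SImode,V4SImode}, {DImode,V2DImode},
// {TImode,V1TImode}.
enum machine_mode { SImode, DImode, TImode, V4SImode, V2DImode, V1TImode };

class scalar_chain
{
public:
  scalar_chain (machine_mode s, machine_mode v) : smode (s), vmode (v) {}
  virtual ~scalar_chain () {}
  machine_mode smode;  // scalar mode of the chain
  machine_mode vmode;  // vector mode it is converted to
};

// The class formerly named dimode_scalar_chain: now parameterized
// over the mode pair, hence the proposed "general_" prefix.
class general_scalar_chain : public scalar_chain
{
public:
  general_scalar_chain (machine_mode s, machine_mode v)
    : scalar_chain (s, v) {}
};

// TImode chains keep a fixed pair.
class timode_scalar_chain : public scalar_chain
{
public:
  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
};
```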
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 11:50 ` Richard Biener 2019-08-05 11:59 ` Uros Bizjak @ 2019-08-05 12:33 ` Uros Bizjak 2019-08-08 16:23 ` Jeff Law 2019-08-05 12:44 ` Uros Bizjak 2 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-05 12:33 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek, H. J. Lu, Jan Hubicka [-- Attachment #1: Type: text/plain, Size: 7285 bytes --] On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguenther@suse.de> wrote: > > On Sun, 4 Aug 2019, Uros Bizjak wrote: > > > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > On Thu, 1 Aug 2019, Uros Bizjak wrote: > > > > > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote: > > > > > > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks > > > >>>> necessary even when going the STV route. The actual regression > > > >>>> for the testcase could also be solved by turing the smaxsi3 > > > >>>> back into a compare and jump rather than a conditional move sequence. > > > >>>> So I wonder how you'd do that given that there's pass_if_after_reload > > > >>>> after pass_split_after_reload and I'm not sure we can split > > > >>>> as late as pass_split_before_sched2 (there's also a split _after_ > > > >>>> sched2 on x86 it seems). > > > >>>> > > > >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the > > > >>>> case STV doesn't end up doing any transform? > > > >>> > > > >>> If STV doesn't transform the insn, then a pre-reload splitter splits > > > >>> the insn back to compare+cmove. > > > >> > > > >> OK, that would work. But there's no way to force a jumpy sequence then > > > >> which we know is faster than compare+cmove because later RTL > > > >> if-conversion passes happily re-discover the smax (or conditional move) > > > >> sequence. 
> > > >> > > > >>> However, considering the SImode move > > > >>> from/to int/xmm register is relatively cheap, the cost function should > > > >>> be tuned so that STV always converts smaxsi3 pattern. > > > >> > > > >> Note that on both Zen and even more so bdverN the int/xmm transition > > > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov > > > >> sequence... (for the loop in hmmer which is the only one I see > > > >> any effect of any of my patches). So identifying chains that > > > >> start/end in memory is important for cost reasons. > > > > > > > > Please note that the cost function also considers the cost of move > > > > from/to xmm. So, the cost of the whole chain would disable the > > > > transformation. > > > > > > > >> So I think the splitting has to happen after the last if-conversion > > > >> pass (and thus we may need to allocate a scratch register for this > > > >> purpose?) > > > > > > > > I really hope that the underlying issue will be solved by a machine > > > > dependant pass inserted somewhere after the pre-reload split. This > > > > way, we can split unconverted smax to the cmove, and this later pass > > > > would handle jcc and cmove instructions. Until then... yes your > > > > proposed approach is one of the ways to avoid unwanted if-conversion, > > > > although sometimes we would like to split to cmove instead. > > > > > > So the following makes STV also consider SImode chains, re-using the > > > DImode chain code. I've kept a simple incomplete smaxsi3 pattern > > > and also did not alter the {SI,DI}mode chain cost function - it's > > > quite off for TARGET_64BIT. With this I get the expected conversion > > > for the testcase derived from hmmer. > > > > > > No further testing sofar. > > > > > > Is it OK to re-use the DImode chain code this way? I'll clean things > > > up some more of course. > > > > Yes, the approach looks OK to me. 
It makes chain building mode > > agnostic, and the chain building can be used for > > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. > > b) SImode x86_32 and x86_64 (this will be mainly used for SImode > > minmax and surrounding SImode operations) > > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding > > DImode operations) > > > > > Still need help with the actual patterns for minmax and how the splitters > > > should look like. > > > > Please look at the attached patch. Maybe we can add memory_operand as > > operand 1 and operand 2 predicate, but let's keep things simple for > > now. > > Thanks. The attached patch makes the patch cleaner and it survives > "some" barebone testing. It also touches the cost function to > avoid being too overly trigger-happy. I've also ended up using > ix86_cost->sse_op instead of COSTS_N_INSN-based magic. In > particular we estimated GPR reg-reg move as COST_N_INSNS(2) while > move costs shouldn't be wrapped in COST_N_INSNS. > IMHO we should probably disregard any reg-reg moves for costing pre-RA. > At least with the current code every reg-reg move biases in favor of > SSE... This is currently a bit mixed-up area in x86 target support. HJ is looking into this [1] and I hope Honza can review the patch. > And we're simply adding move and non-move costs in 'gain', somewhat > mixing apples and oranges? We could separate those and require > both to be a net positive win? > > Still using -mtune=bdverN exposes that some cost tables have xmm and gpr > costs as apples and oranges... 
(so it never triggers for Bulldozer) > > I now run into > > /space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1: > error: unrecognizable insn: > (insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0) > (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0) > (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0))) > -1 > (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) > (expr_list:REG_UNUSED (reg:CC 17 flags) > (nil)))) > during RTL pass: stv > > where even with -mavx2 we do not have s{min,max}v2di3. We do have > an expander here but it seems only AVX512F has the DImode min/max > ops. I have adjusted dimode_scalar_to_vector_candidate_p > accordingly. > > I'm considering to rename the > dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs} > functions to drop the dimode_ prefix - is that OK or do you > prefer some other prefix? > > So - bootstrap with --with-arch=skylake in progress. > > It detects quite a few chains (unsurprisingly) so I guess we need > to address compile-time issues in the pass before enabling this > enhancement (maybe as followup?). > > Further comments on the actual patch welcome, I consider it > "finished" if testing reveals no issues. ChangeLog still needs > to be written and testcases to be added. 
> +;; min/max patterns > + > +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")]) > + > +(define_insn_and_split "<code><mode>3" > + [(set (match_operand:SWI48 0 "register_operand") > + (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand") > + (match_operand:SWI48 2 "register_operand"))) > + (clobber (reg:CC FLAGS_REG))] > + "TARGET_STV && TARGET_SSE4_1 > + && can_create_pseudo_p ()" > + "#" > + "&& 1" > + [(set (reg:CCGC FLAGS_REG) > + (compare:CCGC (match_dup 1)(match_dup 2))) > + (set (match_dup 0) > + (if_then_else:SWI48 > + (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0)) > + (match_dup 1) > + (match_dup 2)))]) > + > ;; Conditional addition patterns > (define_expand "add<mode>cc" > [(match_operand:SWI 0 "register_operand") Please find attached (untested) i386.md patch that defines signed and unsigned min/max pattern. [1] https://gcc.gnu.org/ml/gcc-patches/2019-07/msg01542.html Uros. [-- Attachment #2: maxmin-md.diff.txt --] [-- Type: text/plain, Size: 1117 bytes --] diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index e19a591fa9d..8a492626103 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -17721,6 +17721,30 @@ std::swap (operands[4], operands[5]); }) +;; min/max patterns + +(define_code_attr maxmin_rel + [(smax "ge") (smin "le") (umax "geu") (umin "leu")]) +(define_code_attr maxmin_cmpmode + [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")]) + +(define_insn_and_split "<code><mode>3" + [(set (match_operand:SWI48 0 "register_operand") + (maxmin:SWI48 (match_operand:SWI48 1 "register_operand") + (match_operand:SWI48 2 "register_operand"))) + (clobber (reg:CC FLAGS_REG))] + "TARGET_STV && TARGET_SSE4_1 + && can_create_pseudo_p ()" + "#" + "&& 1" + [(set (reg:<maxmin_cmpmode> FLAGS_REG) + (compare:<maxmin_cmpmode> (match_dup 1)(match_dup 2))) + (set (match_dup 0) + (if_then_else:SWI48 + (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG)(const_int 0)) + (match_dup 1) + (match_dup 2)))]) + ;; Conditional addition 
patterns (define_expand "add<mode>cc" [(match_operand:SWI 0 "register_operand")
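In C++ terms, the split in the attachment maps each of the four codes to a compare followed by a conditional move on the condition given by `maxmin_rel`: ge/le on the signed CCGC flags mode, geu/leu on the unsigned CC flags mode. A behavioral sketch of those semantics:

```cpp
#include <cassert>

// Behavioral model of the four maxmin splits: compare, then select on
// ge/le (signed comparison, CCGC) or geu/leu (unsigned comparison, CC).
static int      smax32 (int a, int b)           { return a >= b ? a : b; }
static int      smin32 (int a, int b)           { return a <= b ? a : b; }
static unsigned umax32 (unsigned a, unsigned b) { return a >= b ? a : b; }
static unsigned umin32 (unsigned a, unsigned b) { return a <= b ? a : b; }
```

The signed/unsigned split is exactly why the pattern needs the `maxmin_cmpmode` attribute: the same bit pattern 0xffffffff loses to 1 under smax but wins under umax.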
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 12:33 ` Uros Bizjak @ 2019-08-08 16:23 ` Jeff Law 0 siblings, 0 replies; 61+ messages in thread From: Jeff Law @ 2019-08-08 16:23 UTC (permalink / raw) To: Uros Bizjak, Richard Biener Cc: gcc-patches, Jakub Jelinek, H. J. Lu, Jan Hubicka On 8/5/19 6:32 AM, Uros Bizjak wrote: > On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguenther@suse.de> wrote: >> >> On Sun, 4 Aug 2019, Uros Bizjak wrote: >> >>> On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote: >>>> >>>> On Thu, 1 Aug 2019, Uros Bizjak wrote: >>>> >>>>> On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote: >>>>> >>>>>>>> So you unconditionally add a smaxdi3 pattern - indeed this looks >>>>>>>> necessary even when going the STV route. The actual regression >>>>>>>> for the testcase could also be solved by turing the smaxsi3 >>>>>>>> back into a compare and jump rather than a conditional move sequence. >>>>>>>> So I wonder how you'd do that given that there's pass_if_after_reload >>>>>>>> after pass_split_after_reload and I'm not sure we can split >>>>>>>> as late as pass_split_before_sched2 (there's also a split _after_ >>>>>>>> sched2 on x86 it seems). >>>>>>>> >>>>>>>> So how would you go implement {s,u}{min,max}{si,di}3 for the >>>>>>>> case STV doesn't end up doing any transform? >>>>>>> >>>>>>> If STV doesn't transform the insn, then a pre-reload splitter splits >>>>>>> the insn back to compare+cmove. >>>>>> >>>>>> OK, that would work. But there's no way to force a jumpy sequence then >>>>>> which we know is faster than compare+cmove because later RTL >>>>>> if-conversion passes happily re-discover the smax (or conditional move) >>>>>> sequence. >>>>>> >>>>>>> However, considering the SImode move >>>>>>> from/to int/xmm register is relatively cheap, the cost function should >>>>>>> be tuned so that STV always converts smaxsi3 pattern. 
>>>>>> >>>>>> Note that on both Zen and even more so bdverN the int/xmm transition >>>>>> makes it no longer profitable but a _lot_ slower than the cmp/cmov >>>>>> sequence... (for the loop in hmmer which is the only one I see >>>>>> any effect of any of my patches). So identifying chains that >>>>>> start/end in memory is important for cost reasons. >>>>> >>>>> Please note that the cost function also considers the cost of move >>>>> from/to xmm. So, the cost of the whole chain would disable the >>>>> transformation. >>>>> >>>>>> So I think the splitting has to happen after the last if-conversion >>>>>> pass (and thus we may need to allocate a scratch register for this >>>>>> purpose?) >>>>> >>>>> I really hope that the underlying issue will be solved by a machine >>>>> dependant pass inserted somewhere after the pre-reload split. This >>>>> way, we can split unconverted smax to the cmove, and this later pass >>>>> would handle jcc and cmove instructions. Until then... yes your >>>>> proposed approach is one of the ways to avoid unwanted if-conversion, >>>>> although sometimes we would like to split to cmove instead. >>>> >>>> So the following makes STV also consider SImode chains, re-using the >>>> DImode chain code. I've kept a simple incomplete smaxsi3 pattern >>>> and also did not alter the {SI,DI}mode chain cost function - it's >>>> quite off for TARGET_64BIT. With this I get the expected conversion >>>> for the testcase derived from hmmer. >>>> >>>> No further testing sofar. >>>> >>>> Is it OK to re-use the DImode chain code this way? I'll clean things >>>> up some more of course. >>> >>> Yes, the approach looks OK to me. It makes chain building mode >>> agnostic, and the chain building can be used for >>> a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. 
>>> b) SImode x86_32 and x86_64 (this will be mainly used for SImode >>> minmax and surrounding SImode operations) >>> c) DImode x86_64 (also, mainly used for DImode minmax and surrounding >>> DImode operations) >>> >>>> Still need help with the actual patterns for minmax and how the splitters >>>> should look like. >>> >>> Please look at the attached patch. Maybe we can add memory_operand as >>> operand 1 and operand 2 predicate, but let's keep things simple for >>> now. >> >> Thanks. The attached patch makes the patch cleaner and it survives >> "some" barebone testing. It also touches the cost function to >> avoid being too overly trigger-happy. I've also ended up using >> ix86_cost->sse_op instead of COSTS_N_INSN-based magic. In >> particular we estimated GPR reg-reg move as COST_N_INSNS(2) while >> move costs shouldn't be wrapped in COST_N_INSNS. >> IMHO we should probably disregard any reg-reg moves for costing pre-RA. >> At least with the current code every reg-reg move biases in favor of >> SSE... > > This is currently a bit mixed-up area in x86 target support. HJ is > looking into this [1] and I hope Honza can review the patch. Yea, Honza's input on that would be greatly appreciated. Jeff > ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 11:50 ` Richard Biener 2019-08-05 11:59 ` Uros Bizjak 2019-08-05 12:33 ` Uros Bizjak @ 2019-08-05 12:44 ` Uros Bizjak 2019-08-05 12:51 ` Uros Bizjak 2 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-05 12:44 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguenther@suse.de> wrote: > > On Sun, 4 Aug 2019, Uros Bizjak wrote: > > > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > On Thu, 1 Aug 2019, Uros Bizjak wrote: > > > > > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote: > > > > > > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks > > > >>>> necessary even when going the STV route. The actual regression > > > >>>> for the testcase could also be solved by turing the smaxsi3 > > > >>>> back into a compare and jump rather than a conditional move sequence. > > > >>>> So I wonder how you'd do that given that there's pass_if_after_reload > > > >>>> after pass_split_after_reload and I'm not sure we can split > > > >>>> as late as pass_split_before_sched2 (there's also a split _after_ > > > >>>> sched2 on x86 it seems). > > > >>>> > > > >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the > > > >>>> case STV doesn't end up doing any transform? > > > >>> > > > >>> If STV doesn't transform the insn, then a pre-reload splitter splits > > > >>> the insn back to compare+cmove. > > > >> > > > >> OK, that would work. But there's no way to force a jumpy sequence then > > > >> which we know is faster than compare+cmove because later RTL > > > >> if-conversion passes happily re-discover the smax (or conditional move) > > > >> sequence. 
> > > >> > > > >>> However, considering the SImode move > > > >>> from/to int/xmm register is relatively cheap, the cost function should > > > >>> be tuned so that STV always converts the smaxsi3 pattern. > > > >> > > > >> Note that on both Zen and even more so bdverN the int/xmm transition > > > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov > > > >> sequence... (for the loop in hmmer which is the only one I see > > > >> any effect of any of my patches). So identifying chains that > > > >> start/end in memory is important for cost reasons. > > > > > > > > Please note that the cost function also considers the cost of moves > > > > from/to xmm. So, the cost of the whole chain would disable the > > > > transformation. > > > > > > > >> So I think the splitting has to happen after the last if-conversion > > > >> pass (and thus we may need to allocate a scratch register for this > > > >> purpose?) > > > > > > > > I really hope that the underlying issue will be solved by a machine > > > > dependent pass inserted somewhere after the pre-reload split. This > > > > way, we can split unconverted smax to the cmove, and this later pass > > > > would handle jcc and cmove instructions. Until then... yes, your > > > > proposed approach is one of the ways to avoid unwanted if-conversion, > > > > although sometimes we would like to split to cmove instead. > > > > > > So the following makes STV also consider SImode chains, re-using the > > > DImode chain code. I've kept a simple incomplete smaxsi3 pattern > > > and also did not alter the {SI,DI}mode chain cost function - it's > > > quite off for TARGET_64BIT. With this I get the expected conversion > > > for the testcase derived from hmmer. > > > > > > No further testing so far. > > > > > > Is it OK to re-use the DImode chain code this way? I'll clean things > > > up some more of course. > > > > Yes, the approach looks OK to me.
It makes chain building mode > > agnostic, and the chain building can be used for > > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. > > b) SImode x86_32 and x86_64 (this will be mainly used for SImode > > minmax and surrounding SImode operations) > > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding > > DImode operations) > > > > > Still need help with the actual patterns for minmax and what the splitters > > > should look like. > > > > Please look at the attached patch. Maybe we can add memory_operand as > > operand 1 and operand 2 predicate, but let's keep things simple for > > now. > > Thanks. The attached patch makes the patch cleaner and it survives > "some" barebone testing. It also touches the cost function to > avoid being overly trigger-happy. I've also ended up using > ix86_cost->sse_op instead of COSTS_N_INSNS-based magic. In > particular we estimated a GPR reg-reg move as COSTS_N_INSNS(2) while > move costs shouldn't be wrapped in COSTS_N_INSNS. > IMHO we should probably disregard any reg-reg moves for costing pre-RA. > At least with the current code every reg-reg move biases in favor of > SSE... > > And we're simply adding move and non-move costs in 'gain', somewhat > mixing apples and oranges? We could separate those and require > both to be a net positive win? > > Still using -mtune=bdverN exposes that some cost tables have xmm and gpr > costs as apples and oranges...
(so it never triggers for Bulldozer) > > I now run into > > /space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1: > error: unrecognizable insn: > (insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0) > (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0) > (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0))) > -1 > (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) > (expr_list:REG_UNUSED (reg:CC 17 flags) > (nil)))) > during RTL pass: stv > > where even with -mavx2 we do not have s{min,max}v2di3. We do have > an expander here but it seems only AVX512F has the DImode min/max > ops. I have adjusted dimode_scalar_to_vector_candidate_p > accordingly. Uh, you need to use some other mode iterator than SWI48 then, like: (define_mode_iterator MAXMIN_IMODE [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512F")]) and then we need to split DImode for 32 bits, too. Uros. > I'm considering renaming the > dimode_{scalar_to_vector_candidate_p,remove_non_convertible_regs} > functions to drop the dimode_ prefix - is that OK or do you > prefer some other prefix? > > So - bootstrap with --with-arch=skylake in progress. > > It detects quite a few chains (unsurprisingly) so I guess we need > to address compile-time issues in the pass before enabling this > enhancement (maybe as a followup?). > > Further comments on the actual patch welcome, I consider it > "finished" if testing reveals no issues. ChangeLog still needs > to be written and testcases to be added. > > Thanks, > Richard. > > Index: gcc/config/i386/i386-features.c > =================================================================== > --- gcc/config/i386/i386-features.c (revision 274111) > +++ gcc/config/i386/i386-features.c (working copy) > @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0; > > /* Initialize new chain.
*/ > > -scalar_chain::scalar_chain () > +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > { > + smode = smode_; > + vmode = vmode_; > + > chain_id = ++max_id; > > if (dump_file) > @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate > && !HARD_REGISTER_P (SET_DEST (def_set))) > bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); > > + /* ??? The following is quadratic since analyze_register_chain > + iterates over all refs to look for dual-mode regs. Instead this > + should be done separately for all regs mentioned in the chain once. */ > df_ref ref; > df_ref def; > for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) > @@ -473,9 +479,11 @@ dimode_scalar_chain::vector_const_cost ( > { > gcc_assert (CONST_INT_P (exp)); > > - if (standard_sse_constant_p (exp, V2DImode)) > - return COSTS_N_INSNS (1); > - return ix86_cost->sse_load[1]; > + if (standard_sse_constant_p (exp, vmode)) > + return ix86_cost->sse_op; > + /* We have separate costs for SImode and DImode, use SImode costs > + for smaller modes. */ > + return ix86_cost->sse_load[smode == DImode ? 1 : 0]; > } > > /* Compute a gain for chain conversion. */ > @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai > if (dump_file) > fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); > > + /* SSE costs distinguish between SImode and DImode loads/stores, for > + int costs factor in the number of GPRs involved. When supporting > + smaller modes than SImode the int load/store costs need to be > + adjusted as well. */ > + unsigned sse_cost_idx = smode == DImode ? 1 : 0; > + unsigned m = smode == DImode ? (TARGET_64BIT ? 
1 : 2) : 1; > + > EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) > { > rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; > rtx def_set = single_set (insn); > rtx src = SET_SRC (def_set); > rtx dst = SET_DEST (def_set); > + int igain = 0; > > if (REG_P (src) && REG_P (dst)) > - gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move; > + igain += 2 * m - ix86_cost->xmm_move; > else if (REG_P (src) && MEM_P (dst)) > - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > + igain > + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; > else if (MEM_P (src) && REG_P (dst)) > - gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; > + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; > else if (GET_CODE (src) == ASHIFT > || GET_CODE (src) == ASHIFTRT > || GET_CODE (src) == LSHIFTRT) > { > if (CONST_INT_P (XEXP (src, 0))) > - gain -= vector_const_cost (XEXP (src, 0)); > - gain += ix86_cost->shift_const; > + igain -= vector_const_cost (XEXP (src, 0)); > + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; > if (INTVAL (XEXP (src, 1)) >= 32) > - gain -= COSTS_N_INSNS (1); > + igain -= COSTS_N_INSNS (1); > } > else if (GET_CODE (src) == PLUS > || GET_CODE (src) == MINUS > @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai > || GET_CODE (src) == XOR > || GET_CODE (src) == AND) > { > - gain += ix86_cost->add; > + igain += m * ix86_cost->add - ix86_cost->sse_op; > /* Additional gain for andnot for targets without BMI. 
*/ > if (GET_CODE (XEXP (src, 0)) == NOT > && !TARGET_BMI) > - gain += 2 * ix86_cost->add; > + igain += m * ix86_cost->add; > > if (CONST_INT_P (XEXP (src, 0))) > - gain -= vector_const_cost (XEXP (src, 0)); > + igain -= vector_const_cost (XEXP (src, 0)); > if (CONST_INT_P (XEXP (src, 1))) > - gain -= vector_const_cost (XEXP (src, 1)); > + igain -= vector_const_cost (XEXP (src, 1)); > } > else if (GET_CODE (src) == NEG > || GET_CODE (src) == NOT) > - gain += ix86_cost->add - COSTS_N_INSNS (1); > + igain += m * ix86_cost->add - ix86_cost->sse_op; > + else if (GET_CODE (src) == SMAX > + || GET_CODE (src) == SMIN > + || GET_CODE (src) == UMAX > + || GET_CODE (src) == UMIN) > + { > + /* We do not have any conditional move cost, estimate it as a > + reg-reg move. Comparisons are costed as adds. */ > + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); > + /* Integer SSE ops are all costed the same. */ > + igain -= ix86_cost->sse_op; > + } > else if (GET_CODE (src) == COMPARE) > { > /* Assume comparison cost is the same. */ > @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai > else if (CONST_INT_P (src)) > { > if (REG_P (dst)) > - gain += COSTS_N_INSNS (2); > + /* DImode can be immediate for TARGET_64BIT and SImode always. */ > + igain += COSTS_N_INSNS (m); > else if (MEM_P (dst)) > - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > - gain -= vector_const_cost (src); > + igain += (m * ix86_cost->int_store[2] > + - ix86_cost->sse_store[sse_cost_idx]); > + igain -= vector_const_cost (src); > } > else > gcc_unreachable (); > + > + if (igain != 0 && dump_file) > + { > + fprintf (dump_file, " Instruction gain %d for ", igain); > + dump_insn_slim (dump_file, insn); > + } > + gain += igain; > } > > if (dump_file) > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > + /* ??? What about integer to SSE? 
*/ > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > @@ -573,7 +611,7 @@ rtx > dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) > { > if (x == reg) > - return gen_rtx_SUBREG (V2DImode, new_reg, 0); > + return gen_rtx_SUBREG (vmode, new_reg, 0); > > const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); > int i, j; > @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies > start_sequence (); > if (!TARGET_INTER_UNIT_MOVES_TO_VEC) > { > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > - emit_move_insn (adjust_address (tmp, SImode, 0), > - gen_rtx_SUBREG (SImode, reg, 0)); > - emit_move_insn (adjust_address (tmp, SImode, 4), > - gen_rtx_SUBREG (SImode, reg, 4)); > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > + if (smode == DImode && !TARGET_64BIT) > + { > + emit_move_insn (adjust_address (tmp, SImode, 0), > + gen_rtx_SUBREG (SImode, reg, 0)); > + emit_move_insn (adjust_address (tmp, SImode, 4), > + gen_rtx_SUBREG (SImode, reg, 4)); > + } > + else > + emit_move_insn (tmp, reg); > emit_move_insn (vreg, tmp); > } > - else if (TARGET_SSE4_1) > + else if (!TARGET_64BIT && smode == DImode) > { > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 0))); > - emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (SImode, reg, 4), > - GEN_INT (2))); > + if (TARGET_SSE4_1) > + { > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 0))); > + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (SImode, reg, 4), > + GEN_INT (2))); > + } > + else > + { > + rtx tmp = gen_reg_rtx (DImode); > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > + CONST0_RTX 
(V4SImode), > + gen_rtx_SUBREG (SImode, reg, 0))); > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 4))); > + emit_insn (gen_vec_interleave_lowv4si > + (gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, tmp, 0))); > + } > } > else > - { > - rtx tmp = gen_reg_rtx (DImode); > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 0))); > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 4))); > - emit_insn (gen_vec_interleave_lowv4si > - (gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, tmp, 0))); > - } > + emit_move_insn (gen_lowpart (smode, vreg), reg); > rtx_insn *seq = get_insns (); > end_sequence (); > rtx_insn *insn = DF_REF_INSN (ref); > @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign > bitmap_copy (conv, insns); > > if (scalar_copy) > - scopy = gen_reg_rtx (DImode); > + scopy = gen_reg_rtx (smode); > > for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) > { > @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign > start_sequence (); > if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) > { > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > emit_move_insn (tmp, reg); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > - adjust_address (tmp, SImode, 0)); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > - adjust_address (tmp, SImode, 4)); > + if (!TARGET_64BIT && smode == DImode) > + { > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > + adjust_address (tmp, SImode, 0)); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > + adjust_address (tmp, SImode, 4)); > + } > + else > + emit_move_insn (scopy, tmp); > } 
> - else if (TARGET_SSE4_1) > + else if (!TARGET_64BIT && smode == DImode) > { > - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); > - emit_insn > - (gen_rtx_SET > - (gen_rtx_SUBREG (SImode, scopy, 0), > - gen_rtx_VEC_SELECT (SImode, > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > - > - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > - emit_insn > - (gen_rtx_SET > - (gen_rtx_SUBREG (SImode, scopy, 4), > - gen_rtx_VEC_SELECT (SImode, > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > + if (TARGET_SSE4_1) > + { > + rtx tmp = gen_rtx_PARALLEL (VOIDmode, > + gen_rtvec (1, const0_rtx)); > + emit_insn > + (gen_rtx_SET > + (gen_rtx_SUBREG (SImode, scopy, 0), > + gen_rtx_VEC_SELECT (SImode, > + gen_rtx_SUBREG (V4SImode, reg, 0), > + tmp))); > + > + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > + emit_insn > + (gen_rtx_SET > + (gen_rtx_SUBREG (SImode, scopy, 4), > + gen_rtx_VEC_SELECT (SImode, > + gen_rtx_SUBREG (V4SImode, reg, 0), > + tmp))); > + } > + else > + { > + rtx vcopy = gen_reg_rtx (V2DImode); > + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > + gen_rtx_SUBREG (SImode, vcopy, 0)); > + emit_move_insn (vcopy, > + gen_rtx_LSHIFTRT (V2DImode, > + vcopy, GEN_INT (32))); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > + gen_rtx_SUBREG (SImode, vcopy, 0)); > + } > } > else > - { > - rtx vcopy = gen_reg_rtx (V2DImode); > - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > - gen_rtx_SUBREG (SImode, vcopy, 0)); > - emit_move_insn (vcopy, > - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > - gen_rtx_SUBREG (SImode, vcopy, 0)); > - } > + emit_move_insn (scopy, reg); > + > rtx_insn *seq = get_insns (); > end_sequence (); > emit_conversion_insns (seq, insn); > @@ -816,14 +879,14 @@ dimode_scalar_chain::convert_op (rtx *op > if 
(GET_CODE (*op) == NOT) > { > convert_op (&XEXP (*op, 0), insn); > - PUT_MODE (*op, V2DImode); > + PUT_MODE (*op, vmode); > } > else if (MEM_P (*op)) > { > - rtx tmp = gen_reg_rtx (DImode); > + rtx tmp = gen_reg_rtx (GET_MODE (*op)); > > emit_insn_before (gen_move_insn (tmp, *op), insn); > - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); > + *op = gen_rtx_SUBREG (vmode, tmp, 0); > > if (dump_file) > fprintf (dump_file, " Preloading operand for insn %d into r%d\n", > @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op > gcc_assert (!DF_REF_CHAIN (ref)); > break; > } > - *op = gen_rtx_SUBREG (V2DImode, *op, 0); > + *op = gen_rtx_SUBREG (vmode, *op, 0); > } > else if (CONST_INT_P (*op)) > { > rtx vec_cst; > - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); > + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); > > /* Prefer all ones vector in case of -1. */ > if (constm1_operand (*op, GET_MODE (*op))) > - vec_cst = CONSTM1_RTX (V2DImode); > + vec_cst = CONSTM1_RTX (vmode); > else > - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, > - gen_rtvec (2, *op, const0_rtx)); > + { > + unsigned n = GET_MODE_NUNITS (vmode); > + rtx *v = XALLOCAVEC (rtx, n); > + v[0] = *op; > + for (unsigned i = 1; i < n; ++i) > + v[i] = const0_rtx; > + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); > + } > > - if (!standard_sse_constant_p (vec_cst, V2DImode)) > + if (!standard_sse_constant_p (vec_cst, vmode)) > { > start_sequence (); > - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); > + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); > rtx_insn *seq = get_insns (); > end_sequence (); > emit_insn_before (seq, insn); > @@ -870,7 +939,7 @@ dimode_scalar_chain::convert_op (rtx *op > else > { > gcc_assert (SUBREG_P (*op)); > - gcc_assert (GET_MODE (*op) == V2DImode); > + gcc_assert (GET_MODE (*op) == vmode); > } > } > > @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i > { > /* There are no scalar integer instructions and 
therefore > temporary register usage is required. */ > - rtx tmp = gen_reg_rtx (DImode); > + rtx tmp = gen_reg_rtx (GET_MODE (dst)); > emit_conversion_insns (gen_move_insn (dst, tmp), insn); > - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); > + dst = gen_rtx_SUBREG (vmode, tmp, 0); > } > > switch (GET_CODE (src)) > @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i > case ASHIFTRT: > case LSHIFTRT: > convert_op (&XEXP (src, 0), insn); > - PUT_MODE (src, V2DImode); > + PUT_MODE (src, vmode); > break; > > case PLUS: > @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i > case IOR: > case XOR: > case AND: > + case SMAX: > + case SMIN: > + case UMAX: > + case UMIN: > convert_op (&XEXP (src, 0), insn); > convert_op (&XEXP (src, 1), insn); > - PUT_MODE (src, V2DImode); > + PUT_MODE (src, vmode); > break; > > case NEG: > src = XEXP (src, 0); > convert_op (&src, insn); > - subreg = gen_reg_rtx (V2DImode); > - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); > - src = gen_rtx_MINUS (V2DImode, subreg, src); > + subreg = gen_reg_rtx (vmode); > + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); > + src = gen_rtx_MINUS (vmode, subreg, src); > break; > > case NOT: > src = XEXP (src, 0); > convert_op (&src, insn); > - subreg = gen_reg_rtx (V2DImode); > - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); > - src = gen_rtx_XOR (V2DImode, src, subreg); > + subreg = gen_reg_rtx (vmode); > + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); > + src = gen_rtx_XOR (vmode, src, subreg); > break; > > case MEM: > @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i > break; > > case SUBREG: > - gcc_assert (GET_MODE (src) == V2DImode); > + gcc_assert (GET_MODE (src) == vmode); > break; > > case COMPARE: > src = SUBREG_REG (XEXP (XEXP (src, 0), 0)); > > - gcc_assert ((REG_P (src) && GET_MODE (src) == DImode) > - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); > + gcc_assert 
((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode)) > + || (SUBREG_P (src) && GET_MODE (src) == vmode)); > > if (REG_P (src)) > - subreg = gen_rtx_SUBREG (V2DImode, src, 0); > + subreg = gen_rtx_SUBREG (vmode, src, 0); > else > subreg = copy_rtx_if_shared (src); > emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), > @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i > PATTERN (insn) = def_set; > > INSN_CODE (insn) = -1; > - recog_memoized (insn); > + int patt = recog_memoized (insn); > + if (patt == -1) > + fatal_insn_not_found (insn); > df_insn_rescan (insn); > } > > @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn > (const_int 0 [0]))) */ > > static bool > -convertible_comparison_p (rtx_insn *insn) > +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) > { > if (!TARGET_SSE4_1) > return false; > @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn > > if (!SUBREG_P (op1) > || !SUBREG_P (op2) > - || GET_MODE (op1) != SImode > - || GET_MODE (op2) != SImode > + || GET_MODE (op1) != mode > + || GET_MODE (op2) != mode > || ((SUBREG_BYTE (op1) != 0 > - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) > + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) > && (SUBREG_BYTE (op2) != 0 > - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) > + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) > return false; > > op1 = SUBREG_REG (op1); > @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn > > if (op1 != op2 > || !REG_P (op1) > - || GET_MODE (op1) != DImode) > + || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ()) > return false; > > return true; > @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn > /* The DImode version of scalar_to_vector_candidate_p. 
*/ > > static bool > -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) > +dimode_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) > { > rtx def_set = single_set (insn); > > @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx > rtx dst = SET_DEST (def_set); > > if (GET_CODE (src) == COMPARE) > - return convertible_comparison_p (insn); > + return convertible_comparison_p (insn, mode); > > /* We are interested in DImode promotion only. */ > - if ((GET_MODE (src) != DImode > + if ((GET_MODE (src) != mode > && !CONST_INT_P (src)) > - || GET_MODE (dst) != DImode) > + || GET_MODE (dst) != mode) > return false; > > if (!REG_P (dst) && !MEM_P (dst)) > @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx > return false; > break; > > + case SMAX: > + case SMIN: > + case UMAX: > + case UMIN: > + if ((mode == DImode && !TARGET_AVX512F) > + || (mode == SImode && !TARGET_SSE4_1)) > + return false; > + /* Fallthru. */ > + > case PLUS: > case MINUS: > case IOR: > @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx > && !CONST_INT_P (XEXP (src, 1))) > return false; > > - if (GET_MODE (XEXP (src, 1)) != DImode > + if (GET_MODE (XEXP (src, 1)) != mode > && !CONST_INT_P (XEXP (src, 1))) > return false; > break; > @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx > || !REG_P (XEXP (XEXP (src, 0), 0)))) > return false; > > - if (GET_MODE (XEXP (src, 0)) != DImode > + if (GET_MODE (XEXP (src, 0)) != mode > && !CONST_INT_P (XEXP (src, 0))) > return false; > > @@ -1383,19 +1467,13 @@ timode_scalar_to_vector_candidate_p (rtx > return false; > } > > -/* Return 1 if INSN may be converted into vector > - instruction. 
*/ > - > -static bool > -scalar_to_vector_candidate_p (rtx_insn *insn) > -{ > - if (TARGET_64BIT) > - return timode_scalar_to_vector_candidate_p (insn); > - else > - return dimode_scalar_to_vector_candidate_p (insn); > -} > +/* For a given bitmap of insn UIDs scans all instruction and > + remove insn from CANDIDATES in case it has both convertible > + and not convertible definitions. > > -/* The DImode version of remove_non_convertible_regs. */ > + All insns in a bitmap are conversion candidates according to > + scalar_to_vector_candidate_p. Currently it implies all insns > + are single_set. */ > > static void > dimode_remove_non_convertible_regs (bitmap candidates) > @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm > BITMAP_FREE (regs); > } > > -/* For a given bitmap of insn UIDs scans all instruction and > - remove insn from CANDIDATES in case it has both convertible > - and not convertible definitions. > - > - All insns in a bitmap are conversion candidates according to > - scalar_to_vector_candidate_p. Currently it implies all insns > - are single_set. */ > - > -static void > -remove_non_convertible_regs (bitmap candidates) > -{ > - if (TARGET_64BIT) > - timode_remove_non_convertible_regs (candidates); > - else > - dimode_remove_non_convertible_regs (candidates); > -} > - > /* Main STV pass function. Find and convert scalar > instructions into vector mode when profitable. 
*/ > > @@ -1577,11 +1638,14 @@ static unsigned int > convert_scalars_to_vector () > { > basic_block bb; > - bitmap candidates; > int converted_insns = 0; > > bitmap_obstack_initialize (NULL); > - candidates = BITMAP_ALLOC (NULL); > + const machine_mode cand_mode[3] = { SImode, DImode, TImode }; > + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode }; > + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */ > + for (unsigned i = 0; i < 3; ++i) > + bitmap_initialize (&candidates[i], &bitmap_default_obstack); > > calculate_dominance_info (CDI_DOMINATORS); > df_set_flags (DF_DEFER_INSN_RESCAN); > @@ -1597,51 +1661,73 @@ convert_scalars_to_vector () > { > rtx_insn *insn; > FOR_BB_INSNS (bb, insn) > - if (scalar_to_vector_candidate_p (insn)) > + if (TARGET_64BIT > + && timode_scalar_to_vector_candidate_p (insn)) > { > if (dump_file) > - fprintf (dump_file, " insn %d is marked as a candidate\n", > + fprintf (dump_file, " insn %d is marked as a TImode candidate\n", > INSN_UID (insn)); > > - bitmap_set_bit (candidates, INSN_UID (insn)); > + bitmap_set_bit (&candidates[2], INSN_UID (insn)); > + } > + else > + { > + /* Check {SI,DI}mode. */ > + for (unsigned i = 0; i <= 1; ++i) > + if (dimode_scalar_to_vector_candidate_p (insn, cand_mode[i])) > + { > + if (dump_file) > + fprintf (dump_file, " insn %d is marked as a %s candidate\n", > + INSN_UID (insn), i == 0 ? 
"SImode" : "DImode"); > + > + bitmap_set_bit (&candidates[i], INSN_UID (insn)); > + break; > + } > } > } > > - remove_non_convertible_regs (candidates); > + if (TARGET_64BIT) > + timode_remove_non_convertible_regs (&candidates[2]); > + for (unsigned i = 0; i <= 1; ++i) > + dimode_remove_non_convertible_regs (&candidates[i]); > > - if (bitmap_empty_p (candidates)) > - if (dump_file) > + for (unsigned i = 0; i <= 2; ++i) > + if (!bitmap_empty_p (&candidates[i])) > + break; > + else if (i == 2 && dump_file) > fprintf (dump_file, "There are no candidates for optimization.\n"); > > - while (!bitmap_empty_p (candidates)) > - { > - unsigned uid = bitmap_first_set_bit (candidates); > - scalar_chain *chain; > + for (unsigned i = 0; i <= 2; ++i) > + while (!bitmap_empty_p (&candidates[i])) > + { > + unsigned uid = bitmap_first_set_bit (&candidates[i]); > + scalar_chain *chain; > > - if (TARGET_64BIT) > - chain = new timode_scalar_chain; > - else > - chain = new dimode_scalar_chain; > + if (cand_mode[i] == TImode) > + chain = new timode_scalar_chain; > + else > + chain = new dimode_scalar_chain (cand_mode[i], cand_vmode[i]); > > - /* Find instructions chain we want to convert to vector mode. > - Check all uses and definitions to estimate all required > - conversions. */ > - chain->build (candidates, uid); > + /* Find instructions chain we want to convert to vector mode. > + Check all uses and definitions to estimate all required > + conversions. 
*/ > + chain->build (&candidates[i], uid); > > - if (chain->compute_convert_gain () > 0) > - converted_insns += chain->convert (); > - else > - if (dump_file) > - fprintf (dump_file, "Chain #%d conversion is not profitable\n", > - chain->chain_id); > + if (chain->compute_convert_gain () > 0) > + converted_insns += chain->convert (); > + else > + if (dump_file) > + fprintf (dump_file, "Chain #%d conversion is not profitable\n", > + chain->chain_id); > > - delete chain; > - } > + delete chain; > + } > > if (dump_file) > fprintf (dump_file, "Total insns converted: %d\n", converted_insns); > > - BITMAP_FREE (candidates); > + for (unsigned i = 0; i <= 2; ++i) > + bitmap_release (&candidates[i]); > bitmap_obstack_release (NULL); > df_process_deferred_rescans (); > > Index: gcc/config/i386/i386-features.h > =================================================================== > --- gcc/config/i386/i386-features.h (revision 274111) > +++ gcc/config/i386/i386-features.h (working copy) > @@ -127,11 +127,16 @@ namespace { > class scalar_chain > { > public: > - scalar_chain (); > + scalar_chain (enum machine_mode, enum machine_mode); > virtual ~scalar_chain (); > > static unsigned max_id; > > + /* Scalar mode. */ > + enum machine_mode smode; > + /* Vector mode. */ > + enum machine_mode vmode; > + > /* ID of a chain. */ > unsigned int chain_id; > /* A queue of instructions to be included into a chain. */ > @@ -162,6 +167,8 @@ class scalar_chain > class dimode_scalar_chain : public scalar_chain > { > public: > + dimode_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > + : scalar_chain (smode_, vmode_) {} > int compute_convert_gain (); > private: > void mark_dual_mode_def (df_ref def); > @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala > class timode_scalar_chain : public scalar_chain > { > public: > + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {} > + > /* Convert from TImode to V1TImode is always faster. 
*/ > int compute_convert_gain () { return 1; } > > Index: gcc/config/i386/i386.md > =================================================================== > --- gcc/config/i386/i386.md (revision 274111) > +++ gcc/config/i386/i386.md (working copy) > @@ -17721,6 +17721,27 @@ (define_peephole2 > std::swap (operands[4], operands[5]); > }) > > +;; min/max patterns > + > +(define_code_attr smaxmin_rel [(smax "ge") (smin "le")]) > + > +(define_insn_and_split "<code><mode>3" > + [(set (match_operand:SWI48 0 "register_operand") > + (smaxmin:SWI48 (match_operand:SWI48 1 "register_operand") > + (match_operand:SWI48 2 "register_operand"))) > + (clobber (reg:CC FLAGS_REG))] > + "TARGET_STV && TARGET_SSE4_1 > + && can_create_pseudo_p ()" > + "#" > + "&& 1" > + [(set (reg:CCGC FLAGS_REG) > + (compare:CCGC (match_dup 1)(match_dup 2))) > + (set (match_dup 0) > + (if_then_else:SWI48 > + (<smaxmin_rel> (reg:CCGC FLAGS_REG)(const_int 0)) > + (match_dup 1) > + (match_dup 2)))]) > + > ;; Conditional addition patterns > (define_expand "add<mode>cc" > [(match_operand:SWI 0 "register_operand") ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 12:44 ` Uros Bizjak @ 2019-08-05 12:51 ` Uros Bizjak 2019-08-05 12:54 ` Jakub Jelinek 0 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-05 12:51 UTC (permalink / raw) To: Richard Biener; +Cc: gcc-patches, Jakub Jelinek On Mon, Aug 5, 2019 at 2:43 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguenther@suse.de> wrote: > > > > On Sun, 4 Aug 2019, Uros Bizjak wrote: > > > > > On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > > > On Thu, 1 Aug 2019, Uros Bizjak wrote: > > > > > > > > > On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote: > > > > > > > > > >>>> So you unconditionally add a smaxdi3 pattern - indeed this looks > > > > >>>> necessary even when going the STV route. The actual regression > > > > >>>> for the testcase could also be solved by turning the smaxsi3 > > > > >>>> back into a compare and jump rather than a conditional move sequence. > > > > >>>> So I wonder how you'd do that given that there's pass_if_after_reload > > > > >>>> after pass_split_after_reload and I'm not sure we can split > > > > >>>> as late as pass_split_before_sched2 (there's also a split _after_ > > > > >>>> sched2 on x86 it seems). > > > > >>>> > > > > >>>> So how would you go implement {s,u}{min,max}{si,di}3 for the > > > > >>>> case STV doesn't end up doing any transform? > > > > >>> > > > > >>> If STV doesn't transform the insn, then a pre-reload splitter splits > > > > >>> the insn back to compare+cmove. > > > > >> > > > > >> OK, that would work. But there's no way to force a jumpy sequence then > > > > >> which we know is faster than compare+cmove because later RTL > > > > >> if-conversion passes happily re-discover the smax (or conditional move) > > > > >> sequence.
> > > > >> > > > > >>> However, considering the SImode move > > > > >>> from/to int/xmm register is relatively cheap, the cost function should > > > > >>> be tuned so that STV always converts smaxsi3 pattern. > > > > >> > > > > >> Note that on both Zen and even more so bdverN the int/xmm transition > > > > >> makes it no longer profitable but a _lot_ slower than the cmp/cmov > > > > >> sequence... (for the loop in hmmer which is the only one I see > > > > >> any effect of any of my patches). So identifying chains that > > > > >> start/end in memory is important for cost reasons. > > > > > > > > > > Please note that the cost function also considers the cost of move > > > > > from/to xmm. So, the cost of the whole chain would disable the > > > > > transformation. > > > > > > > > > >> So I think the splitting has to happen after the last if-conversion > > > > >> pass (and thus we may need to allocate a scratch register for this > > > > >> purpose?) > > > > > > > > > > I really hope that the underlying issue will be solved by a machine > > > > > dependent pass inserted somewhere after the pre-reload split. This > > > > > way, we can split unconverted smax to the cmove, and this later pass > > > > > would handle jcc and cmove instructions. Until then... yes your > > > > > proposed approach is one of the ways to avoid unwanted if-conversion, > > > > > although sometimes we would like to split to cmove instead. > > > > So the following makes STV also consider SImode chains, re-using the > > > > DImode chain code. I've kept a simple incomplete smaxsi3 pattern > > > > and also did not alter the {SI,DI}mode chain cost function - it's > > > > quite off for TARGET_64BIT. With this I get the expected conversion > > > > for the testcase derived from hmmer. > > > > > > > > No further testing so far. > > > > > > > > Is it OK to re-use the DImode chain code this way? I'll clean things > > > > up some more of course. > > > Yes, the approach looks OK to me.
It makes chain building mode > > > agnostic, and the chain building can be used for > > > a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added. > > > b) SImode x86_32 and x86_64 (this will be mainly used for SImode > > > minmax and surrounding SImode operations) > > > c) DImode x86_64 (also, mainly used for DImode minmax and surrounding > > > DImode operations) > > > > Still need help with the actual patterns for minmax and what the splitters > > > > should look like. > > > Please look at the attached patch. Maybe we can add memory_operand as > > > operand 1 and operand 2 predicate, but let's keep things simple for > > > now. > > Thanks. The attached patch makes the patch cleaner and it survives > > "some" barebone testing. It also touches the cost function to > > avoid being overly trigger-happy. I've also ended up using > > ix86_cost->sse_op instead of COSTS_N_INSN-based magic. In > > particular we estimated GPR reg-reg move as COST_N_INSNS(2) while > > move costs shouldn't be wrapped in COST_N_INSNS. > > IMHO we should probably disregard any reg-reg moves for costing pre-RA. > > At least with the current code every reg-reg move biases in favor of > > SSE... > > > > And we're simply adding move and non-move costs in 'gain', somewhat > > mixing apples and oranges? We could separate those and require > > both to be a net positive win? > > > > Still using -mtune=bdverN exposes that some cost tables have xmm and gpr > > costs as apples and oranges...
(so it never triggers for Bulldozer) > > > > I now run into > > > > /space/rguenther/src/svn/trunk-bisect/libgcc/libgcov-driver.c:509:1: > > error: unrecognizable insn: > > (insn 116 115 1511 8 (set (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0) > > (smax:V2DI (subreg:V2DI (reg/v:DI 87 [ run_max ]) 0) > > (subreg:V2DI (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) 0))) > > -1 > > (expr_list:REG_DEAD (reg:DI 349 [ MEM[base: _261, offset: 0B] ]) > > (expr_list:REG_UNUSED (reg:CC 17 flags) > > (nil)))) > > during RTL pass: stv > > > > where even with -mavx2 we do not have s{min,max}v2di3. We do have > > an expander here but it seems only AVX512F has the DImode min/max > > ops. I have adjusted dimode_scalar_to_vector_candidate_p > > accordingly. > > Uh, you need to use some other mode iterator than SWI48 then, like: > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > and then we need to split DImode for 32bits, too. For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode condition, I'll provide _doubleword splitter later. Uros. ^ permalink raw reply [flat|nested] 61+ messages in thread
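The gain bookkeeping discussed in this exchange - per-insn scalar-vs-vector deltas accumulated into a gain, with cross-unit register-conversion costs subtracted, converting only on a net win - can be sketched as follows. The cost constants are purely illustrative, not values from any real processor_costs table:

```c
#include <assert.h>

/* Hypothetical cost constants for illustration only.  */
enum { INT_ADD = 1, SSE_OP = 1, CMP_CMOV = 3, SSE_TO_INT = 6 };

/* Net gain of vectorizing a chain with the given instruction mix,
   mirroring the structure of compute_convert_gain () in the patch:
   sum per-insn gains, then subtract the conversion overhead.  */
int
convert_gain (int n_add, int n_minmax, int n_conversions)
{
  int gain = 0;
  gain += n_add * (INT_ADD - SSE_OP);      /* addl -> paddd          */
  gain += n_minmax * (CMP_CMOV - SSE_OP);  /* cmp+cmov -> pmaxsd     */
  return gain - n_conversions * SSE_TO_INT;
}
```

Separating move and non-move costs, as Richard suggests above, would mean tracking two sums here and requiring each to be non-negative rather than only their total.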
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 12:51 ` Uros Bizjak @ 2019-08-05 12:54 ` Jakub Jelinek 2019-08-05 12:57 ` Uros Bizjak 0 siblings, 1 reply; 61+ messages in thread From: Jakub Jelinek @ 2019-08-05 12:54 UTC (permalink / raw) To: Uros Bizjak; +Cc: Richard Biener, gcc-patches On Mon, Aug 05, 2019 at 02:51:01PM +0200, Uros Bizjak wrote: > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > and then we need to split DImode for 32bits, too. > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > condition, I'll provide _doubleword splitter later. Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. to force use of %zmmN? Jakub ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 12:54 ` Jakub Jelinek @ 2019-08-05 12:57 ` Uros Bizjak 2019-08-05 13:04 ` Richard Biener 0 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-05 12:57 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Richard Biener, gcc-patches On Mon, Aug 5, 2019 at 2:54 PM Jakub Jelinek <jakub@redhat.com> wrote: > > On Mon, Aug 05, 2019 at 02:51:01PM +0200, Uros Bizjak wrote: > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > and then we need to split DImode for 32bits, too. > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > condition, I'll provide _doubleword splitter later. > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > to force use of %zmmN? It generates V4SI mode, so - yes, AVX512VL. Thanks, Uros. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 12:57 ` Uros Bizjak @ 2019-08-05 13:04 ` Richard Biener 2019-08-05 13:09 ` Uros Bizjak 0 siblings, 1 reply; 61+ messages in thread From: Richard Biener @ 2019-08-05 13:04 UTC (permalink / raw) To: Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches On Mon, 5 Aug 2019, Uros Bizjak wrote: > On Mon, Aug 5, 2019 at 2:54 PM Jakub Jelinek <jakub@redhat.com> wrote: > > > > On Mon, Aug 05, 2019 at 02:51:01PM +0200, Uros Bizjak wrote: > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > condition, I'll provide _doubleword splitter later. > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > to force use of %zmmN? > > It generates V4SI mode, so - yes, AVX512VL. case SMAX: case SMIN: case UMAX: case UMIN: if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) || (mode == SImode && !TARGET_SSE4_1)) return false; so there's no way to use AVX512VL for 32bit? Richard. ^ permalink raw reply [flat|nested] 61+ messages in thread
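The quoted gating condition can be restated positively in C; the enum and flag names below are illustrative stand-ins for the target macros:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MODE_SI, MODE_DI } smode_t;

/* Restatement of the quoted check: DImode min/max needs a 64-bit target
   plus AVX512VL (vpmaxsq/vpminsq on %xmm), SImode min/max needs SSE4.1
   (pmaxsd/pminsd).  */
bool
minmax_candidate_p (smode_t mode, bool target_64bit,
                    bool avx512vl, bool sse4_1)
{
  if (mode == MODE_DI && (!target_64bit || !avx512vl))
    return false;
  if (mode == MODE_SI && !sse4_1)
    return false;
  return true;
}
```

As the follow-up explains, the !target_64bit part of the DImode test is not a hardware limitation but a placeholder until a doubleword splitter exists.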
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 13:04 ` Richard Biener @ 2019-08-05 13:09 ` Uros Bizjak 2019-08-05 13:29 ` Richard Biener 2019-08-09 7:28 ` Uros Bizjak 0 siblings, 2 replies; 61+ messages in thread From: Uros Bizjak @ 2019-08-05 13:09 UTC (permalink / raw) To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches On Mon, Aug 5, 2019 at 3:04 PM Richard Biener <rguenther@suse.de> wrote: > > On Mon, 5 Aug 2019, Uros Bizjak wrote: > > > On Mon, Aug 5, 2019 at 2:54 PM Jakub Jelinek <jakub@redhat.com> wrote: > > > > > > On Mon, Aug 05, 2019 at 02:51:01PM +0200, Uros Bizjak wrote: > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > condition, I'll provide _doubleword splitter later. > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > to force use of %zmmN? > > > > It generates V4SI mode, so - yes, AVX512VL. > > case SMAX: > case SMIN: > case UMAX: > case UMIN: > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > || (mode == SImode && !TARGET_SSE4_1)) > return false; > > so there's no way to use AVX512VL for 32bit? There is a way, but on 32bit targets, we need to split DImode operation to a sequence of SImode operations for unconverted pattern. This is of course doable, but somehow more complex than simply emitting a DImode compare + DImode cmove, which is what current splitter does. So, a follow-up task. Uros. ^ permalink raw reply [flat|nested] 61+ messages in thread
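The doubleword split Uros describes - decomposing a 64-bit signed max into a sequence of 32-bit operations on a 32-bit target - could look like the following C sketch. This is a hand-written illustration of the decomposition, not the splitter later proposed for the tree:

```c
#include <assert.h>
#include <stdint.h>

/* DImode smax built from SImode pieces: the signed high words decide
   unless they are equal, in which case the unsigned low words decide.  */
int64_t
smaxdi_doubleword (int64_t a, int64_t b)
{
  int32_t  ah = (int32_t) (a >> 32), bh = (int32_t) (b >> 32);
  uint32_t al = (uint32_t) a,        bl = (uint32_t) b;

  if (ah != bh)
    return ah > bh ? a : b;   /* signed compare of high halves    */
  return al > bl ? a : b;     /* unsigned compare of low halves   */
}
```

This is visibly more code than the single DImode compare plus cmove the current splitter emits on 64-bit targets, which is why it was deferred as a follow-up.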
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 13:09 ` Uros Bizjak @ 2019-08-05 13:29 ` Richard Biener 2019-08-05 19:35 ` Uros Bizjak 2019-08-09 7:28 ` Uros Bizjak 1 sibling, 1 reply; 61+ messages in thread From: Richard Biener @ 2019-08-05 13:29 UTC (permalink / raw) To: Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches On Mon, 5 Aug 2019, Uros Bizjak wrote: > On Mon, Aug 5, 2019 at 3:04 PM Richard Biener <rguenther@suse.de> wrote: > > > > On Mon, 5 Aug 2019, Uros Bizjak wrote: > > > > > On Mon, Aug 5, 2019 at 2:54 PM Jakub Jelinek <jakub@redhat.com> wrote: > > > > > > > > On Mon, Aug 05, 2019 at 02:51:01PM +0200, Uros Bizjak wrote: > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > to force use of %zmmN? > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > case SMAX: > > case SMIN: > > case UMAX: > > case UMIN: > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > || (mode == SImode && !TARGET_SSE4_1)) > > return false; > > > > so there's no way to use AVX512VL for 32bit? > > There is a way, but on 32bit targets, we need to split DImode > operation to a sequence of SImode operations for unconverted pattern. > This is of course doable, but somehow more complex than simply > emitting a DImode compare + DImode cmove, which is what current > splitter does. So, a follow-up task. Ah, OK. So for the above condition we can elide the !TARGET_64BIT check we just need to properly split if we enable the scalar minmax pattern for DImode on 32bits, the STV conversion would go fine. Richard. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 13:29 ` Richard Biener @ 2019-08-05 19:35 ` Uros Bizjak 2019-08-07 9:52 ` Richard Biener 0 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-05 19:35 UTC (permalink / raw) To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > to force use of %zmmN? > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > case SMAX: > > > case SMIN: > > > case UMAX: > > > case UMIN: > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > || (mode == SImode && !TARGET_SSE4_1)) > > > return false; > > > > > > so there's no way to use AVX512VL for 32bit? > > > > There is a way, but on 32bit targets, we need to split DImode > > operation to a sequence of SImode operations for unconverted pattern. > > This is of course doable, but somehow more complex than simply > > emitting a DImode compare + DImode cmove, which is what current > > splitter does. So, a follow-up task. > > Ah, OK. So for the above condition we can elide the !TARGET_64BIT > check we just need to properly split if we enable the scalar minmax > pattern for DImode on 32bits, the STV conversion would go fine. Yes, that is correct. Uros. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 19:35 ` Uros Bizjak @ 2019-08-07 9:52 ` Richard Biener 2019-08-07 12:04 ` Richard Biener 2019-08-07 14:15 ` Richard Biener 0 siblings, 2 replies; 61+ messages in thread From: Richard Biener @ 2019-08-07 9:52 UTC (permalink / raw) To: Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches On Mon, 5 Aug 2019, Uros Bizjak wrote: > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > > to force use of %zmmN? > > > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > > > case SMAX: > > > > case SMIN: > > > > case UMAX: > > > > case UMIN: > > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > > || (mode == SImode && !TARGET_SSE4_1)) > > > > return false; > > > > > > > > so there's no way to use AVX512VL for 32bit? > > > > > > There is a way, but on 32bit targets, we need to split DImode > > > operation to a sequence of SImode operations for unconverted pattern. > > > This is of course doable, but somehow more complex than simply > > > emitting a DImode compare + DImode cmove, which is what current > > > splitter does. So, a follow-up task. > > > > Ah, OK. So for the above condition we can elide the !TARGET_64BIT > > check we just need to properly split if we enable the scalar minmax > > pattern for DImode on 32bits, the STV conversion would go fine. > > Yes, that is correct. So I tested the patch below (now with appropriate ChangeLog) on x86_64-unknown-linux-gnu. 
I've thrown it at SPEC CPU 2006 with the obvious hmmer improvement, now checking for off-noise results with a 3-run on those that may have one (with more than +-1 second differences in the 1-run). As-is the patch likely runs into the splitting issue for DImode on i?86 and the patch misses functional testcases. I'll do the hmmer loop with both DImode and SImode and testcases to trigger all pattern variants with the different ISAs we have. Some of the patch could be split out (the cost changes that are also effective for DImode for example). AFAICS we could go with only adding SImode avoiding the DImode splitting thing and this would solve the hmmer regression. Thanks, Richard. 2019-08-07 Richard Biener <rguenther@suse.de> PR target/91154 * config/i386/i386-features.h (scalar_chain::scalar_chain): Add mode arguments. (scalar_chain::smode): New member. (scalar_chain::vmode): Likewise. (dimode_scalar_chain): Rename to... (general_scalar_chain): ... this. (general_scalar_chain::general_scalar_chain): Take mode arguments. (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain base with TImode and V1TImode. * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust. (general_scalar_chain::vector_const_cost): Adjust for SImode chains. (general_scalar_chain::compute_convert_gain): Likewise. Fix reg-reg move cost gain, use ix86_cost->sse_op cost and adjust scalar costs. Add {S,U}{MIN,MAX} support. Dump per-instruction gain if not zero. (general_scalar_chain::replace_with_subreg): Use vmode/smode. (general_scalar_chain::make_vector_copies): Likewise. Handle non-DImode chains appropriately. (general_scalar_chain::convert_reg): Likewise. (general_scalar_chain::convert_op): Likewise. (general_scalar_chain::convert_insn): Likewise. Add fatal_insn_not_found if the result is not recognized. (convertible_comparison_p): Pass in the scalar mode and use that. (general_scalar_to_vector_candidate_p): Likewise. Rename from dimode_scalar_to_vector_candidate_p. 
Add {S,U}{MIN,MAX} support. (scalar_to_vector_candidate_p): Remove by inlining into single caller. (general_remove_non_convertible_regs): Rename from dimode_remove_non_convertible_regs. (remove_non_convertible_regs): Remove by inlining into single caller. (convert_scalars_to_vector): Handle SImode and DImode chains in addition to TImode chains. * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV. Index: gcc/config/i386/i386-features.c =================================================================== --- gcc/config/i386/i386-features.c (revision 274111) +++ gcc/config/i386/i386-features.c (working copy) @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0; /* Initialize new chain. */ -scalar_chain::scalar_chain () +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) { + smode = smode_; + vmode = vmode_; + chain_id = ++max_id; if (dump_file) @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins conversion. */ void -dimode_scalar_chain::mark_dual_mode_def (df_ref def) +general_scalar_chain::mark_dual_mode_def (df_ref def) { gcc_assert (DF_REF_REG_DEF_P (def)); @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate && !HARD_REGISTER_P (SET_DEST (def_set))) bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); + /* ??? The following is quadratic since analyze_register_chain + iterates over all refs to look for dual-mode regs. Instead this + should be done separately for all regs mentioned in the chain once. */ df_ref ref; df_ref def; for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates, instead of using a scalar one. 
*/ int -dimode_scalar_chain::vector_const_cost (rtx exp) +general_scalar_chain::vector_const_cost (rtx exp) { gcc_assert (CONST_INT_P (exp)); - if (standard_sse_constant_p (exp, V2DImode)) - return COSTS_N_INSNS (1); - return ix86_cost->sse_load[1]; + if (standard_sse_constant_p (exp, vmode)) + return ix86_cost->sse_op; + /* We have separate costs for SImode and DImode, use SImode costs + for smaller modes. */ + return ix86_cost->sse_load[smode == DImode ? 1 : 0]; } /* Compute a gain for chain conversion. */ int -dimode_scalar_chain::compute_convert_gain () +general_scalar_chain::compute_convert_gain () { bitmap_iterator bi; unsigned insn_uid; @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai if (dump_file) fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); + /* SSE costs distinguish between SImode and DImode loads/stores, for + int costs factor in the number of GPRs involved. When supporting + smaller modes than SImode the int load/store costs need to be + adjusted as well. */ + unsigned sse_cost_idx = smode == DImode ? 1 : 0; + unsigned m = smode == DImode ? (TARGET_64BIT ? 
1 : 2) : 1; + EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) { rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; rtx def_set = single_set (insn); rtx src = SET_SRC (def_set); rtx dst = SET_DEST (def_set); + int igain = 0; if (REG_P (src) && REG_P (dst)) - gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move; + igain += 2 * m - ix86_cost->xmm_move; else if (REG_P (src) && MEM_P (dst)) - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; + igain + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; else if (MEM_P (src) && REG_P (dst)) - gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; else if (GET_CODE (src) == ASHIFT || GET_CODE (src) == ASHIFTRT || GET_CODE (src) == LSHIFTRT) { if (CONST_INT_P (XEXP (src, 0))) - gain -= vector_const_cost (XEXP (src, 0)); - gain += ix86_cost->shift_const; + igain -= vector_const_cost (XEXP (src, 0)); + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; if (INTVAL (XEXP (src, 1)) >= 32) - gain -= COSTS_N_INSNS (1); + igain -= COSTS_N_INSNS (1); } else if (GET_CODE (src) == PLUS || GET_CODE (src) == MINUS @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai || GET_CODE (src) == XOR || GET_CODE (src) == AND) { - gain += ix86_cost->add; + igain += m * ix86_cost->add - ix86_cost->sse_op; /* Additional gain for andnot for targets without BMI. 
*/ if (GET_CODE (XEXP (src, 0)) == NOT && !TARGET_BMI) - gain += 2 * ix86_cost->add; + igain += m * ix86_cost->add; if (CONST_INT_P (XEXP (src, 0))) - gain -= vector_const_cost (XEXP (src, 0)); + igain -= vector_const_cost (XEXP (src, 0)); if (CONST_INT_P (XEXP (src, 1))) - gain -= vector_const_cost (XEXP (src, 1)); + igain -= vector_const_cost (XEXP (src, 1)); } else if (GET_CODE (src) == NEG || GET_CODE (src) == NOT) - gain += ix86_cost->add - COSTS_N_INSNS (1); + igain += m * ix86_cost->add - ix86_cost->sse_op; + else if (GET_CODE (src) == SMAX + || GET_CODE (src) == SMIN + || GET_CODE (src) == UMAX + || GET_CODE (src) == UMIN) + { + /* We do not have any conditional move cost, estimate it as a + reg-reg move. Comparisons are costed as adds. */ + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); + /* Integer SSE ops are all costed the same. */ + igain -= ix86_cost->sse_op; + } else if (GET_CODE (src) == COMPARE) { /* Assume comparison cost is the same. */ @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai else if (CONST_INT_P (src)) { if (REG_P (dst)) - gain += COSTS_N_INSNS (2); + /* DImode can be immediate for TARGET_64BIT and SImode always. */ + igain += COSTS_N_INSNS (m); else if (MEM_P (dst)) - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; - gain -= vector_const_cost (src); + igain += (m * ix86_cost->int_store[2] + - ix86_cost->sse_store[sse_cost_idx]); + igain -= vector_const_cost (src); } else gcc_unreachable (); + + if (igain != 0 && dump_file) + { + fprintf (dump_file, " Instruction gain %d for ", igain); + dump_insn_slim (dump_file, insn); + } + gain += igain; } if (dump_file) fprintf (dump_file, " Instruction conversion gain: %d\n", gain); + /* ??? What about integer to SSE? */ EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; @@ -570,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai /* Replace REG in X with a V2DI subreg of NEW_REG. 
*/ rtx -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) { if (x == reg) - return gen_rtx_SUBREG (V2DImode, new_reg, 0); + return gen_rtx_SUBREG (vmode, new_reg, 0); const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); int i, j; @@ -593,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg /* Replace REG in INSN with a V2DI subreg of NEW_REG. */ void -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, rtx reg, rtx new_reg) { replace_with_subreg (single_set (insn), reg, new_reg); @@ -624,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx and replace its uses in a chain. */ void -dimode_scalar_chain::make_vector_copies (unsigned regno) +general_scalar_chain::make_vector_copies (unsigned regno) { rtx reg = regno_reg_rtx[regno]; - rtx vreg = gen_reg_rtx (DImode); + rtx vreg = gen_reg_rtx (smode); df_ref ref; for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies start_sequence (); if (!TARGET_INTER_UNIT_MOVES_TO_VEC) { - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); - emit_move_insn (adjust_address (tmp, SImode, 0), - gen_rtx_SUBREG (SImode, reg, 0)); - emit_move_insn (adjust_address (tmp, SImode, 4), - gen_rtx_SUBREG (SImode, reg, 4)); + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); + if (smode == DImode && !TARGET_64BIT) + { + emit_move_insn (adjust_address (tmp, SImode, 0), + gen_rtx_SUBREG (SImode, reg, 0)); + emit_move_insn (adjust_address (tmp, SImode, 4), + gen_rtx_SUBREG (SImode, reg, 4)); + } + else + emit_move_insn (tmp, reg); emit_move_insn (vreg, tmp); } - else if (TARGET_SSE4_1) + else if (!TARGET_64BIT && smode == DImode) { - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 0))); - emit_insn (gen_sse4_1_pinsrd 
(gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (SImode, reg, 4), - GEN_INT (2))); + if (TARGET_SSE4_1) + { + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 0))); + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (SImode, reg, 4), + GEN_INT (2))); + } + else + { + rtx tmp = gen_reg_rtx (DImode); + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 0))); + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 4))); + emit_insn (gen_vec_interleave_lowv4si + (gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, tmp, 0))); + } } else - { - rtx tmp = gen_reg_rtx (DImode); - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 0))); - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 4))); - emit_insn (gen_vec_interleave_lowv4si - (gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, tmp, 0))); - } + emit_move_insn (gen_lowpart (smode, vreg), reg); rtx_insn *seq = get_insns (); end_sequence (); rtx_insn *insn = DF_REF_INSN (ref); @@ -695,7 +743,7 @@ dimode_scalar_chain::make_vector_copies in case register is used in not convertible insn. 
*/ void -dimode_scalar_chain::convert_reg (unsigned regno) +general_scalar_chain::convert_reg (unsigned regno) { bool scalar_copy = bitmap_bit_p (defs_conv, regno); rtx reg = regno_reg_rtx[regno]; @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign bitmap_copy (conv, insns); if (scalar_copy) - scopy = gen_reg_rtx (DImode); + scopy = gen_reg_rtx (smode); for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) { @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign start_sequence (); if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) { - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); emit_move_insn (tmp, reg); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), - adjust_address (tmp, SImode, 0)); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), - adjust_address (tmp, SImode, 4)); + if (!TARGET_64BIT && smode == DImode) + { + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), + adjust_address (tmp, SImode, 0)); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), + adjust_address (tmp, SImode, 4)); + } + else + emit_move_insn (scopy, tmp); } - else if (TARGET_SSE4_1) + else if (!TARGET_64BIT && smode == DImode) { - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); - emit_insn - (gen_rtx_SET - (gen_rtx_SUBREG (SImode, scopy, 0), - gen_rtx_VEC_SELECT (SImode, - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); - - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); - emit_insn - (gen_rtx_SET - (gen_rtx_SUBREG (SImode, scopy, 4), - gen_rtx_VEC_SELECT (SImode, - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); + if (TARGET_SSE4_1) + { + rtx tmp = gen_rtx_PARALLEL (VOIDmode, + gen_rtvec (1, const0_rtx)); + emit_insn + (gen_rtx_SET + (gen_rtx_SUBREG (SImode, scopy, 0), + gen_rtx_VEC_SELECT (SImode, + gen_rtx_SUBREG (V4SImode, reg, 0), + tmp))); + + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); + emit_insn + (gen_rtx_SET + (gen_rtx_SUBREG (SImode, 
scopy, 4), + gen_rtx_VEC_SELECT (SImode, + gen_rtx_SUBREG (V4SImode, reg, 0), + tmp))); + } + else + { + rtx vcopy = gen_reg_rtx (V2DImode); + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), + gen_rtx_SUBREG (SImode, vcopy, 0)); + emit_move_insn (vcopy, + gen_rtx_LSHIFTRT (V2DImode, + vcopy, GEN_INT (32))); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), + gen_rtx_SUBREG (SImode, vcopy, 0)); + } } else - { - rtx vcopy = gen_reg_rtx (V2DImode); - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), - gen_rtx_SUBREG (SImode, vcopy, 0)); - emit_move_insn (vcopy, - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), - gen_rtx_SUBREG (SImode, vcopy, 0)); - } + emit_move_insn (scopy, reg); + rtx_insn *seq = get_insns (); end_sequence (); emit_conversion_insns (seq, insn); @@ -809,21 +872,21 @@ dimode_scalar_chain::convert_reg (unsign registers conversion. 
*/ void -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn) +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn) { *op = copy_rtx_if_shared (*op); if (GET_CODE (*op) == NOT) { convert_op (&XEXP (*op, 0), insn); - PUT_MODE (*op, V2DImode); + PUT_MODE (*op, vmode); } else if (MEM_P (*op)) { - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (*op)); emit_insn_before (gen_move_insn (tmp, *op), insn); - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); + *op = gen_rtx_SUBREG (vmode, tmp, 0); if (dump_file) fprintf (dump_file, " Preloading operand for insn %d into r%d\n", @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op gcc_assert (!DF_REF_CHAIN (ref)); break; } - *op = gen_rtx_SUBREG (V2DImode, *op, 0); + *op = gen_rtx_SUBREG (vmode, *op, 0); } else if (CONST_INT_P (*op)) { rtx vec_cst; - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); /* Prefer all ones vector in case of -1. */ if (constm1_operand (*op, GET_MODE (*op))) - vec_cst = CONSTM1_RTX (V2DImode); + vec_cst = CONSTM1_RTX (vmode); else - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, - gen_rtvec (2, *op, const0_rtx)); + { + unsigned n = GET_MODE_NUNITS (vmode); + rtx *v = XALLOCAVEC (rtx, n); + v[0] = *op; + for (unsigned i = 1; i < n; ++i) + v[i] = const0_rtx; + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); + } - if (!standard_sse_constant_p (vec_cst, V2DImode)) + if (!standard_sse_constant_p (vec_cst, vmode)) { start_sequence (); - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); rtx_insn *seq = get_insns (); end_sequence (); emit_insn_before (seq, insn); @@ -870,14 +939,14 @@ dimode_scalar_chain::convert_op (rtx *op else { gcc_assert (SUBREG_P (*op)); - gcc_assert (GET_MODE (*op) == V2DImode); + gcc_assert (GET_MODE (*op) == vmode); } } /* Convert INSN to vector mode. 
*/ void -dimode_scalar_chain::convert_insn (rtx_insn *insn) +general_scalar_chain::convert_insn (rtx_insn *insn) { rtx def_set = single_set (insn); rtx src = SET_SRC (def_set); @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i { /* There are no scalar integer instructions and therefore temporary register usage is required. */ - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (dst)); emit_conversion_insns (gen_move_insn (dst, tmp), insn); - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); + dst = gen_rtx_SUBREG (vmode, tmp, 0); } switch (GET_CODE (src)) @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i case ASHIFTRT: case LSHIFTRT: convert_op (&XEXP (src, 0), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case PLUS: @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i case IOR: case XOR: case AND: + case SMAX: + case SMIN: + case UMAX: + case UMIN: convert_op (&XEXP (src, 0), insn); convert_op (&XEXP (src, 1), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case NEG: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); - src = gen_rtx_MINUS (V2DImode, subreg, src); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); + src = gen_rtx_MINUS (vmode, subreg, src); break; case NOT: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); - src = gen_rtx_XOR (V2DImode, src, subreg); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); + src = gen_rtx_XOR (vmode, src, subreg); break; case MEM: @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i break; case SUBREG: - gcc_assert (GET_MODE (src) == V2DImode); + gcc_assert (GET_MODE (src) == vmode); break; case COMPARE: src = SUBREG_REG 
(XEXP (XEXP (src, 0), 0)); - gcc_assert ((REG_P (src) && GET_MODE (src) == DImode) - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); + gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode)) + || (SUBREG_P (src) && GET_MODE (src) == vmode)); if (REG_P (src)) - subreg = gen_rtx_SUBREG (V2DImode, src, 0); + subreg = gen_rtx_SUBREG (vmode, src, 0); else subreg = copy_rtx_if_shared (src); emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i PATTERN (insn) = def_set; INSN_CODE (insn) = -1; - recog_memoized (insn); + int patt = recog_memoized (insn); + if (patt == -1) + fatal_insn_not_found (insn); df_insn_rescan (insn); } @@ -1116,7 +1191,7 @@ timode_scalar_chain::convert_insn (rtx_i } void -dimode_scalar_chain::convert_registers () +general_scalar_chain::convert_registers () { bitmap_iterator bi; unsigned id; @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn (const_int 0 [0]))) */ static bool -convertible_comparison_p (rtx_insn *insn) +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) { if (!TARGET_SSE4_1) return false; @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn if (!SUBREG_P (op1) || !SUBREG_P (op2) - || GET_MODE (op1) != SImode - || GET_MODE (op2) != SImode + || GET_MODE (op1) != mode + || GET_MODE (op2) != mode || ((SUBREG_BYTE (op1) != 0 - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) && (SUBREG_BYTE (op2) != 0 - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) return false; op1 = SUBREG_REG (op1); @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn if (op1 != op2 || !REG_P (op1) - || GET_MODE (op1) != DImode) + || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ()) return false; return true; @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn /* The DImode version of 
scalar_to_vector_candidate_p. */ static bool -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) { rtx def_set = single_set (insn); @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx rtx dst = SET_DEST (def_set); if (GET_CODE (src) == COMPARE) - return convertible_comparison_p (insn); + return convertible_comparison_p (insn, mode); /* We are interested in DImode promotion only. */ - if ((GET_MODE (src) != DImode + if ((GET_MODE (src) != mode && !CONST_INT_P (src)) - || GET_MODE (dst) != DImode) + || GET_MODE (dst) != mode) return false; if (!REG_P (dst) && !MEM_P (dst)) @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx return false; break; + case SMAX: + case SMIN: + case UMAX: + case UMIN: + if ((mode == DImode && !TARGET_AVX512VL) + || (mode == SImode && !TARGET_SSE4_1)) + return false; + /* Fallthru. */ + case PLUS: case MINUS: case IOR: @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx && !CONST_INT_P (XEXP (src, 1))) return false; - if (GET_MODE (XEXP (src, 1)) != DImode + if (GET_MODE (XEXP (src, 1)) != mode && !CONST_INT_P (XEXP (src, 1))) return false; break; @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx || !REG_P (XEXP (XEXP (src, 0), 0)))) return false; - if (GET_MODE (XEXP (src, 0)) != DImode + if (GET_MODE (XEXP (src, 0)) != mode && !CONST_INT_P (XEXP (src, 0))) return false; @@ -1383,22 +1467,16 @@ timode_scalar_to_vector_candidate_p (rtx return false; } -/* Return 1 if INSN may be converted into vector - instruction. */ - -static bool -scalar_to_vector_candidate_p (rtx_insn *insn) -{ - if (TARGET_64BIT) - return timode_scalar_to_vector_candidate_p (insn); - else - return dimode_scalar_to_vector_candidate_p (insn); -} +/* For a given bitmap of insn UIDs scans all instruction and + remove insn from CANDIDATES in case it has both convertible + and not convertible definitions. 
-/* The DImode version of remove_non_convertible_regs. */ + All insns in a bitmap are conversion candidates according to + scalar_to_vector_candidate_p. Currently it implies all insns + are single_set. */ static void -dimode_remove_non_convertible_regs (bitmap candidates) +general_remove_non_convertible_regs (bitmap candidates) { bitmap_iterator bi; unsigned id; @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm BITMAP_FREE (regs); } -/* For a given bitmap of insn UIDs scans all instruction and - remove insn from CANDIDATES in case it has both convertible - and not convertible definitions. - - All insns in a bitmap are conversion candidates according to - scalar_to_vector_candidate_p. Currently it implies all insns - are single_set. */ - -static void -remove_non_convertible_regs (bitmap candidates) -{ - if (TARGET_64BIT) - timode_remove_non_convertible_regs (candidates); - else - dimode_remove_non_convertible_regs (candidates); -} - /* Main STV pass function. Find and convert scalar instructions into vector mode when profitable. 
*/ @@ -1577,11 +1638,14 @@ static unsigned int convert_scalars_to_vector () { basic_block bb; - bitmap candidates; int converted_insns = 0; bitmap_obstack_initialize (NULL); - candidates = BITMAP_ALLOC (NULL); + const machine_mode cand_mode[3] = { SImode, DImode, TImode }; + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode }; + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */ + for (unsigned i = 0; i < 3; ++i) + bitmap_initialize (&candidates[i], &bitmap_default_obstack); calculate_dominance_info (CDI_DOMINATORS); df_set_flags (DF_DEFER_INSN_RESCAN); @@ -1597,51 +1661,73 @@ convert_scalars_to_vector () { rtx_insn *insn; FOR_BB_INSNS (bb, insn) - if (scalar_to_vector_candidate_p (insn)) + if (TARGET_64BIT + && timode_scalar_to_vector_candidate_p (insn)) { if (dump_file) - fprintf (dump_file, " insn %d is marked as a candidate\n", + fprintf (dump_file, " insn %d is marked as a TImode candidate\n", INSN_UID (insn)); - bitmap_set_bit (candidates, INSN_UID (insn)); + bitmap_set_bit (&candidates[2], INSN_UID (insn)); + } + else + { + /* Check {SI,DI}mode. */ + for (unsigned i = 0; i <= 1; ++i) + if (general_scalar_to_vector_candidate_p (insn, cand_mode[i])) + { + if (dump_file) + fprintf (dump_file, " insn %d is marked as a %s candidate\n", + INSN_UID (insn), i == 0 ? 
"SImode" : "DImode"); + + bitmap_set_bit (&candidates[i], INSN_UID (insn)); + break; + } } } - remove_non_convertible_regs (candidates); + if (TARGET_64BIT) + timode_remove_non_convertible_regs (&candidates[2]); + for (unsigned i = 0; i <= 1; ++i) + general_remove_non_convertible_regs (&candidates[i]); - if (bitmap_empty_p (candidates)) - if (dump_file) + for (unsigned i = 0; i <= 2; ++i) + if (!bitmap_empty_p (&candidates[i])) + break; + else if (i == 2 && dump_file) fprintf (dump_file, "There are no candidates for optimization.\n"); - while (!bitmap_empty_p (candidates)) - { - unsigned uid = bitmap_first_set_bit (candidates); - scalar_chain *chain; + for (unsigned i = 0; i <= 2; ++i) + while (!bitmap_empty_p (&candidates[i])) + { + unsigned uid = bitmap_first_set_bit (&candidates[i]); + scalar_chain *chain; - if (TARGET_64BIT) - chain = new timode_scalar_chain; - else - chain = new dimode_scalar_chain; + if (cand_mode[i] == TImode) + chain = new timode_scalar_chain; + else + chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]); - /* Find instructions chain we want to convert to vector mode. - Check all uses and definitions to estimate all required - conversions. */ - chain->build (candidates, uid); + /* Find instructions chain we want to convert to vector mode. + Check all uses and definitions to estimate all required + conversions. 
 */
+	  chain->build (&candidates[i], uid);
 
-      if (chain->compute_convert_gain () > 0)
-	converted_insns += chain->convert ();
-      else
-	if (dump_file)
-	  fprintf (dump_file, "Chain #%d conversion is not profitable\n",
-		   chain->chain_id);
+	  if (chain->compute_convert_gain () > 0)
+	    converted_insns += chain->convert ();
+	  else
+	    if (dump_file)
+	      fprintf (dump_file, "Chain #%d conversion is not profitable\n",
+		       chain->chain_id);
 
-      delete chain;
-    }
+	  delete chain;
+	}
 
   if (dump_file)
     fprintf (dump_file, "Total insns converted: %d\n", converted_insns);
 
-  BITMAP_FREE (candidates);
+  for (unsigned i = 0; i <= 2; ++i)
+    bitmap_release (&candidates[i]);
   bitmap_obstack_release (NULL);
 
   df_process_deferred_rescans ();
Index: gcc/config/i386/i386-features.h
===================================================================
--- gcc/config/i386/i386-features.h	(revision 274111)
+++ gcc/config/i386/i386-features.h	(working copy)
@@ -127,11 +127,16 @@ namespace {
 class scalar_chain
 {
  public:
-  scalar_chain ();
+  scalar_chain (enum machine_mode, enum machine_mode);
   virtual ~scalar_chain ();
 
   static unsigned max_id;
 
+  /* Scalar mode.  */
+  enum machine_mode smode;
+  /* Vector mode.  */
+  enum machine_mode vmode;
+
   /* ID of a chain.  */
   unsigned int chain_id;
   /* A queue of instructions to be included into a chain.  */
@@ -159,9 +164,11 @@ class scalar_chain
   virtual void convert_registers () = 0;
 };
 
-class dimode_scalar_chain : public scalar_chain
+class general_scalar_chain : public scalar_chain
 {
  public:
+  general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_)
+    : scalar_chain (smode_, vmode_) {}
   int compute_convert_gain ();
 private:
   void mark_dual_mode_def (df_ref def);
@@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala
 class timode_scalar_chain : public scalar_chain
 {
  public:
+  timode_scalar_chain () : scalar_chain (TImode, V1TImode) {}
+
   /* Convert from TImode to V1TImode is always faster.
 */
   int compute_convert_gain () { return 1; }
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 274111)
+++ gcc/config/i386/i386.md	(working copy)
@@ -17721,6 +17721,30 @@ (define_peephole2
   std::swap (operands[4], operands[5]);
 })
 
+;; min/max patterns
+
+(define_code_attr maxmin_rel
+  [(smax "ge") (smin "le") (umax "geu") (umin "leu")])
+(define_code_attr maxmin_cmpmode
+  [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")])
+
+(define_insn_and_split "<code><mode>3"
+  [(set (match_operand:SWI48 0 "register_operand")
+	(maxmin:SWI48 (match_operand:SWI48 1 "register_operand")
+		      (match_operand:SWI48 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_STV && TARGET_SSE4_1
+   && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (reg:<maxmin_cmpmode> FLAGS_REG)
+	(compare:<maxmin_cmpmode> (match_dup 1)(match_dup 2)))
+   (set (match_dup 0)
+	(if_then_else:SWI48
+	  (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG)(const_int 0))
+	  (match_dup 1)
+	  (match_dup 2)))])
+
 ;; Conditional addition patterns
 (define_expand "add<mode>cc"
   [(match_operand:SWI 0 "register_operand")

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
  2019-08-07  9:52             ` Richard Biener
@ 2019-08-07 12:04               ` Richard Biener
  2019-08-07 12:11                 ` Uros Bizjak
  2019-08-07 12:42                 ` Uros Bizjak
  2019-08-07 14:15                 ` Richard Biener
  1 sibling, 2 replies; 61+ messages in thread
From: Richard Biener @ 2019-08-07 12:04 UTC (permalink / raw)
To: Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 36817 bytes --]

On Wed, 7 Aug 2019, Richard Biener wrote:

> On Mon, 5 Aug 2019, Uros Bizjak wrote:
>
> > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote:
> >
> > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > > > >
> > > > > > > > > and then we need to split DImode for 32bits, too.
> > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > > > condition, I'll provide _doubleword splitter later.
> > > > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > > > to force use of %zmmN?
> > > > > > It generates V4SI mode, so - yes, AVX512VL.
> > > > >     case SMAX:
> > > > >     case SMIN:
> > > > >     case UMAX:
> > > > >     case UMIN:
> > > > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > > > >           || (mode == SImode && !TARGET_SSE4_1))
> > > > >         return false;
> > > > >
> > > > > so there's no way to use AVX512VL for 32bit?
> > > > There is a way, but on 32bit targets, we need to split DImode
> > > > operation to a sequence of SImode operations for unconverted pattern.
> > > > This is of course doable, but somehow more complex than simply
> > > > emitting a DImode compare + DImode cmove, which is what current
> > > > splitter does.  So, a follow-up task.
> > > Ah, OK.  So for the above condition we can elide the !TARGET_64BIT
> > > check; we just need to properly split if we enable the scalar minmax
> > > pattern for DImode on 32bits, the STV conversion would go fine.
> >
> > Yes, that is correct.
>
> So I tested the patch below (now with appropriate ChangeLog) on
> x86_64-unknown-linux-gnu.  I've thrown it at SPEC CPU 2006 with
> the obvious hmmer improvement, now checking for off-noise results
> with a 3-run on those that may have one (with more than +-1 second
> differences in the 1-run).
>
> As-is the patch likely runs into the splitting issue for DImode
> on i?86, and the patch misses functional testcases.  I'll do the
> hmmer loop with both DImode and SImode and testcases to trigger
> all pattern variants with the different ISAs we have.
>
> Some of the patch could be split out (the cost changes that are
> also effective for DImode, for example).
>
> AFAICS we could go with only adding SImode, avoiding the DImode
> splitting thing, and this would solve the hmmer regression.

I've additionally bootstrapped with --with-arch=nehalem, which reveals

FAIL: gcc.target/i386/minmax-2.c scan-assembler test
FAIL: gcc.target/i386/minmax-2.c scan-assembler-not cmp

We emit cmp + cmov here now with -msse4.1 (as soon as the max pattern
is enabled, I guess).  Otherwise testing is clean, so I suppose this is
the net effect of just doing the SImode chains; I don't have AVX512 HW
handily available to really test the DImode path.

Would you be fine with simplifying the patch down to SImode chain
handling?

Thanks,
Richard.

> Thanks,
> Richard.
>
> 2019-08-07  Richard Biener  <rguenther@suse.de>
>
> 	PR target/91154
> 	* config/i386/i386-features.h (scalar_chain::scalar_chain): Add
> 	mode arguments.
> 	(scalar_chain::smode): New member.
> 	(scalar_chain::vmode): Likewise.
> 	(dimode_scalar_chain): Rename to...
> 	(general_scalar_chain): ... this.
> 	(general_scalar_chain::general_scalar_chain): Take mode arguments.
> 	(timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
> 	base with TImode and V1TImode.
> 	* config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
> 	(general_scalar_chain::vector_const_cost): Adjust for SImode
> 	chains.
> 	(general_scalar_chain::compute_convert_gain): Likewise.  Fix
> 	reg-reg move cost gain, use ix86_cost->sse_op cost and adjust
> 	scalar costs.  Add {S,U}{MIN,MAX} support.  Dump per-instruction
> 	gain if not zero.
> 	(general_scalar_chain::replace_with_subreg): Use vmode/smode.
> 	(general_scalar_chain::make_vector_copies): Likewise.  Handle
> 	non-DImode chains appropriately.
> 	(general_scalar_chain::convert_reg): Likewise.
> 	(general_scalar_chain::convert_op): Likewise.
> 	(general_scalar_chain::convert_insn): Likewise.  Add
> 	fatal_insn_not_found if the result is not recognized.
> 	(convertible_comparison_p): Pass in the scalar mode and use that.
> 	(general_scalar_to_vector_candidate_p): Likewise.  Rename from
> 	dimode_scalar_to_vector_candidate_p.  Add {S,U}{MIN,MAX} support.
> 	(scalar_to_vector_candidate_p): Remove by inlining into single
> 	caller.
> 	(general_remove_non_convertible_regs): Rename from
> 	dimode_remove_non_convertible_regs.
> 	(remove_non_convertible_regs): Remove by inlining into single caller.
> 	(convert_scalars_to_vector): Handle SImode and DImode chains
> 	in addition to TImode chains.
> 	* config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV.
>
> Index: gcc/config/i386/i386-features.c
> ===================================================================
> --- gcc/config/i386/i386-features.c	(revision 274111)
> +++ gcc/config/i386/i386-features.c	(working copy)
> @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0;
>
>  /* Initialize new chain.
*/ > > -scalar_chain::scalar_chain () > +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > { > + smode = smode_; > + vmode = vmode_; > + > chain_id = ++max_id; > > if (dump_file) > @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins > conversion. */ > > void > -dimode_scalar_chain::mark_dual_mode_def (df_ref def) > +general_scalar_chain::mark_dual_mode_def (df_ref def) > { > gcc_assert (DF_REF_REG_DEF_P (def)); > > @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate > && !HARD_REGISTER_P (SET_DEST (def_set))) > bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); > > + /* ??? The following is quadratic since analyze_register_chain > + iterates over all refs to look for dual-mode regs. Instead this > + should be done separately for all regs mentioned in the chain once. */ > df_ref ref; > df_ref def; > for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) > @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates, > instead of using a scalar one. */ > > int > -dimode_scalar_chain::vector_const_cost (rtx exp) > +general_scalar_chain::vector_const_cost (rtx exp) > { > gcc_assert (CONST_INT_P (exp)); > > - if (standard_sse_constant_p (exp, V2DImode)) > - return COSTS_N_INSNS (1); > - return ix86_cost->sse_load[1]; > + if (standard_sse_constant_p (exp, vmode)) > + return ix86_cost->sse_op; > + /* We have separate costs for SImode and DImode, use SImode costs > + for smaller modes. */ > + return ix86_cost->sse_load[smode == DImode ? 1 : 0]; > } > > /* Compute a gain for chain conversion. 
*/ > > int > -dimode_scalar_chain::compute_convert_gain () > +general_scalar_chain::compute_convert_gain () > { > bitmap_iterator bi; > unsigned insn_uid; > @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai > if (dump_file) > fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); > > + /* SSE costs distinguish between SImode and DImode loads/stores, for > + int costs factor in the number of GPRs involved. When supporting > + smaller modes than SImode the int load/store costs need to be > + adjusted as well. */ > + unsigned sse_cost_idx = smode == DImode ? 1 : 0; > + unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1; > + > EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) > { > rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; > rtx def_set = single_set (insn); > rtx src = SET_SRC (def_set); > rtx dst = SET_DEST (def_set); > + int igain = 0; > > if (REG_P (src) && REG_P (dst)) > - gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move; > + igain += 2 * m - ix86_cost->xmm_move; > else if (REG_P (src) && MEM_P (dst)) > - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > + igain > + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; > else if (MEM_P (src) && REG_P (dst)) > - gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; > + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; > else if (GET_CODE (src) == ASHIFT > || GET_CODE (src) == ASHIFTRT > || GET_CODE (src) == LSHIFTRT) > { > if (CONST_INT_P (XEXP (src, 0))) > - gain -= vector_const_cost (XEXP (src, 0)); > - gain += ix86_cost->shift_const; > + igain -= vector_const_cost (XEXP (src, 0)); > + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; > if (INTVAL (XEXP (src, 1)) >= 32) > - gain -= COSTS_N_INSNS (1); > + igain -= COSTS_N_INSNS (1); > } > else if (GET_CODE (src) == PLUS > || GET_CODE (src) == MINUS > @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai > || GET_CODE (src) == XOR > || GET_CODE (src) 
== AND) > { > - gain += ix86_cost->add; > + igain += m * ix86_cost->add - ix86_cost->sse_op; > /* Additional gain for andnot for targets without BMI. */ > if (GET_CODE (XEXP (src, 0)) == NOT > && !TARGET_BMI) > - gain += 2 * ix86_cost->add; > + igain += m * ix86_cost->add; > > if (CONST_INT_P (XEXP (src, 0))) > - gain -= vector_const_cost (XEXP (src, 0)); > + igain -= vector_const_cost (XEXP (src, 0)); > if (CONST_INT_P (XEXP (src, 1))) > - gain -= vector_const_cost (XEXP (src, 1)); > + igain -= vector_const_cost (XEXP (src, 1)); > } > else if (GET_CODE (src) == NEG > || GET_CODE (src) == NOT) > - gain += ix86_cost->add - COSTS_N_INSNS (1); > + igain += m * ix86_cost->add - ix86_cost->sse_op; > + else if (GET_CODE (src) == SMAX > + || GET_CODE (src) == SMIN > + || GET_CODE (src) == UMAX > + || GET_CODE (src) == UMIN) > + { > + /* We do not have any conditional move cost, estimate it as a > + reg-reg move. Comparisons are costed as adds. */ > + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); > + /* Integer SSE ops are all costed the same. */ > + igain -= ix86_cost->sse_op; > + } > else if (GET_CODE (src) == COMPARE) > { > /* Assume comparison cost is the same. */ > @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai > else if (CONST_INT_P (src)) > { > if (REG_P (dst)) > - gain += COSTS_N_INSNS (2); > + /* DImode can be immediate for TARGET_64BIT and SImode always. */ > + igain += COSTS_N_INSNS (m); > else if (MEM_P (dst)) > - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > - gain -= vector_const_cost (src); > + igain += (m * ix86_cost->int_store[2] > + - ix86_cost->sse_store[sse_cost_idx]); > + igain -= vector_const_cost (src); > } > else > gcc_unreachable (); > + > + if (igain != 0 && dump_file) > + { > + fprintf (dump_file, " Instruction gain %d for ", igain); > + dump_insn_slim (dump_file, insn); > + } > + gain += igain; > } > > if (dump_file) > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > + /* ??? 
What about integer to SSE? */ > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > @@ -570,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai > /* Replace REG in X with a V2DI subreg of NEW_REG. */ > > rtx > -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) > +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) > { > if (x == reg) > - return gen_rtx_SUBREG (V2DImode, new_reg, 0); > + return gen_rtx_SUBREG (vmode, new_reg, 0); > > const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); > int i, j; > @@ -593,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg > /* Replace REG in INSN with a V2DI subreg of NEW_REG. */ > > void > -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, > +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, > rtx reg, rtx new_reg) > { > replace_with_subreg (single_set (insn), reg, new_reg); > @@ -624,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx > and replace its uses in a chain. 
*/ > > void > -dimode_scalar_chain::make_vector_copies (unsigned regno) > +general_scalar_chain::make_vector_copies (unsigned regno) > { > rtx reg = regno_reg_rtx[regno]; > - rtx vreg = gen_reg_rtx (DImode); > + rtx vreg = gen_reg_rtx (smode); > df_ref ref; > > for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) > @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies > start_sequence (); > if (!TARGET_INTER_UNIT_MOVES_TO_VEC) > { > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > - emit_move_insn (adjust_address (tmp, SImode, 0), > - gen_rtx_SUBREG (SImode, reg, 0)); > - emit_move_insn (adjust_address (tmp, SImode, 4), > - gen_rtx_SUBREG (SImode, reg, 4)); > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > + if (smode == DImode && !TARGET_64BIT) > + { > + emit_move_insn (adjust_address (tmp, SImode, 0), > + gen_rtx_SUBREG (SImode, reg, 0)); > + emit_move_insn (adjust_address (tmp, SImode, 4), > + gen_rtx_SUBREG (SImode, reg, 4)); > + } > + else > + emit_move_insn (tmp, reg); > emit_move_insn (vreg, tmp); > } > - else if (TARGET_SSE4_1) > + else if (!TARGET_64BIT && smode == DImode) > { > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 0))); > - emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (SImode, reg, 4), > - GEN_INT (2))); > + if (TARGET_SSE4_1) > + { > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 0))); > + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (SImode, reg, 4), > + GEN_INT (2))); > + } > + else > + { > + rtx tmp = gen_reg_rtx (DImode); > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 0))); > + emit_insn 
(gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 4))); > + emit_insn (gen_vec_interleave_lowv4si > + (gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, tmp, 0))); > + } > } > else > - { > - rtx tmp = gen_reg_rtx (DImode); > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 0))); > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 4))); > - emit_insn (gen_vec_interleave_lowv4si > - (gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, tmp, 0))); > - } > + emit_move_insn (gen_lowpart (smode, vreg), reg); > rtx_insn *seq = get_insns (); > end_sequence (); > rtx_insn *insn = DF_REF_INSN (ref); > @@ -695,7 +743,7 @@ dimode_scalar_chain::make_vector_copies > in case register is used in not convertible insn. 
*/ > > void > -dimode_scalar_chain::convert_reg (unsigned regno) > +general_scalar_chain::convert_reg (unsigned regno) > { > bool scalar_copy = bitmap_bit_p (defs_conv, regno); > rtx reg = regno_reg_rtx[regno]; > @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign > bitmap_copy (conv, insns); > > if (scalar_copy) > - scopy = gen_reg_rtx (DImode); > + scopy = gen_reg_rtx (smode); > > for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) > { > @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign > start_sequence (); > if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) > { > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > emit_move_insn (tmp, reg); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > - adjust_address (tmp, SImode, 0)); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > - adjust_address (tmp, SImode, 4)); > + if (!TARGET_64BIT && smode == DImode) > + { > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > + adjust_address (tmp, SImode, 0)); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > + adjust_address (tmp, SImode, 4)); > + } > + else > + emit_move_insn (scopy, tmp); > } > - else if (TARGET_SSE4_1) > + else if (!TARGET_64BIT && smode == DImode) > { > - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); > - emit_insn > - (gen_rtx_SET > - (gen_rtx_SUBREG (SImode, scopy, 0), > - gen_rtx_VEC_SELECT (SImode, > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > - > - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > - emit_insn > - (gen_rtx_SET > - (gen_rtx_SUBREG (SImode, scopy, 4), > - gen_rtx_VEC_SELECT (SImode, > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > + if (TARGET_SSE4_1) > + { > + rtx tmp = gen_rtx_PARALLEL (VOIDmode, > + gen_rtvec (1, const0_rtx)); > + emit_insn > + (gen_rtx_SET > + (gen_rtx_SUBREG (SImode, scopy, 0), > + gen_rtx_VEC_SELECT (SImode, > + gen_rtx_SUBREG (V4SImode, reg, 0), > + 
tmp))); > + > + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > + emit_insn > + (gen_rtx_SET > + (gen_rtx_SUBREG (SImode, scopy, 4), > + gen_rtx_VEC_SELECT (SImode, > + gen_rtx_SUBREG (V4SImode, reg, 0), > + tmp))); > + } > + else > + { > + rtx vcopy = gen_reg_rtx (V2DImode); > + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > + gen_rtx_SUBREG (SImode, vcopy, 0)); > + emit_move_insn (vcopy, > + gen_rtx_LSHIFTRT (V2DImode, > + vcopy, GEN_INT (32))); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > + gen_rtx_SUBREG (SImode, vcopy, 0)); > + } > } > else > - { > - rtx vcopy = gen_reg_rtx (V2DImode); > - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > - gen_rtx_SUBREG (SImode, vcopy, 0)); > - emit_move_insn (vcopy, > - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > - gen_rtx_SUBREG (SImode, vcopy, 0)); > - } > + emit_move_insn (scopy, reg); > + > rtx_insn *seq = get_insns (); > end_sequence (); > emit_conversion_insns (seq, insn); > @@ -809,21 +872,21 @@ dimode_scalar_chain::convert_reg (unsign > registers conversion. 
*/ > > void > -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn) > +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn) > { > *op = copy_rtx_if_shared (*op); > > if (GET_CODE (*op) == NOT) > { > convert_op (&XEXP (*op, 0), insn); > - PUT_MODE (*op, V2DImode); > + PUT_MODE (*op, vmode); > } > else if (MEM_P (*op)) > { > - rtx tmp = gen_reg_rtx (DImode); > + rtx tmp = gen_reg_rtx (GET_MODE (*op)); > > emit_insn_before (gen_move_insn (tmp, *op), insn); > - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); > + *op = gen_rtx_SUBREG (vmode, tmp, 0); > > if (dump_file) > fprintf (dump_file, " Preloading operand for insn %d into r%d\n", > @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op > gcc_assert (!DF_REF_CHAIN (ref)); > break; > } > - *op = gen_rtx_SUBREG (V2DImode, *op, 0); > + *op = gen_rtx_SUBREG (vmode, *op, 0); > } > else if (CONST_INT_P (*op)) > { > rtx vec_cst; > - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); > + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); > > /* Prefer all ones vector in case of -1. 
*/ > if (constm1_operand (*op, GET_MODE (*op))) > - vec_cst = CONSTM1_RTX (V2DImode); > + vec_cst = CONSTM1_RTX (vmode); > else > - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, > - gen_rtvec (2, *op, const0_rtx)); > + { > + unsigned n = GET_MODE_NUNITS (vmode); > + rtx *v = XALLOCAVEC (rtx, n); > + v[0] = *op; > + for (unsigned i = 1; i < n; ++i) > + v[i] = const0_rtx; > + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); > + } > > - if (!standard_sse_constant_p (vec_cst, V2DImode)) > + if (!standard_sse_constant_p (vec_cst, vmode)) > { > start_sequence (); > - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); > + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); > rtx_insn *seq = get_insns (); > end_sequence (); > emit_insn_before (seq, insn); > @@ -870,14 +939,14 @@ dimode_scalar_chain::convert_op (rtx *op > else > { > gcc_assert (SUBREG_P (*op)); > - gcc_assert (GET_MODE (*op) == V2DImode); > + gcc_assert (GET_MODE (*op) == vmode); > } > } > > /* Convert INSN to vector mode. */ > > void > -dimode_scalar_chain::convert_insn (rtx_insn *insn) > +general_scalar_chain::convert_insn (rtx_insn *insn) > { > rtx def_set = single_set (insn); > rtx src = SET_SRC (def_set); > @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i > { > /* There are no scalar integer instructions and therefore > temporary register usage is required. 
*/ > - rtx tmp = gen_reg_rtx (DImode); > + rtx tmp = gen_reg_rtx (GET_MODE (dst)); > emit_conversion_insns (gen_move_insn (dst, tmp), insn); > - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); > + dst = gen_rtx_SUBREG (vmode, tmp, 0); > } > > switch (GET_CODE (src)) > @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i > case ASHIFTRT: > case LSHIFTRT: > convert_op (&XEXP (src, 0), insn); > - PUT_MODE (src, V2DImode); > + PUT_MODE (src, vmode); > break; > > case PLUS: > @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i > case IOR: > case XOR: > case AND: > + case SMAX: > + case SMIN: > + case UMAX: > + case UMIN: > convert_op (&XEXP (src, 0), insn); > convert_op (&XEXP (src, 1), insn); > - PUT_MODE (src, V2DImode); > + PUT_MODE (src, vmode); > break; > > case NEG: > src = XEXP (src, 0); > convert_op (&src, insn); > - subreg = gen_reg_rtx (V2DImode); > - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); > - src = gen_rtx_MINUS (V2DImode, subreg, src); > + subreg = gen_reg_rtx (vmode); > + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); > + src = gen_rtx_MINUS (vmode, subreg, src); > break; > > case NOT: > src = XEXP (src, 0); > convert_op (&src, insn); > - subreg = gen_reg_rtx (V2DImode); > - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); > - src = gen_rtx_XOR (V2DImode, src, subreg); > + subreg = gen_reg_rtx (vmode); > + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); > + src = gen_rtx_XOR (vmode, src, subreg); > break; > > case MEM: > @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i > break; > > case SUBREG: > - gcc_assert (GET_MODE (src) == V2DImode); > + gcc_assert (GET_MODE (src) == vmode); > break; > > case COMPARE: > src = SUBREG_REG (XEXP (XEXP (src, 0), 0)); > > - gcc_assert ((REG_P (src) && GET_MODE (src) == DImode) > - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); > + gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER 
(vmode)) > + || (SUBREG_P (src) && GET_MODE (src) == vmode)); > > if (REG_P (src)) > - subreg = gen_rtx_SUBREG (V2DImode, src, 0); > + subreg = gen_rtx_SUBREG (vmode, src, 0); > else > subreg = copy_rtx_if_shared (src); > emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), > @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i > PATTERN (insn) = def_set; > > INSN_CODE (insn) = -1; > - recog_memoized (insn); > + int patt = recog_memoized (insn); > + if (patt == -1) > + fatal_insn_not_found (insn); > df_insn_rescan (insn); > } > > @@ -1116,7 +1191,7 @@ timode_scalar_chain::convert_insn (rtx_i > } > > void > -dimode_scalar_chain::convert_registers () > +general_scalar_chain::convert_registers () > { > bitmap_iterator bi; > unsigned id; > @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn > (const_int 0 [0]))) */ > > static bool > -convertible_comparison_p (rtx_insn *insn) > +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) > { > if (!TARGET_SSE4_1) > return false; > @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn > > if (!SUBREG_P (op1) > || !SUBREG_P (op2) > - || GET_MODE (op1) != SImode > - || GET_MODE (op2) != SImode > + || GET_MODE (op1) != mode > + || GET_MODE (op2) != mode > || ((SUBREG_BYTE (op1) != 0 > - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) > + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) > && (SUBREG_BYTE (op2) != 0 > - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) > + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) > return false; > > op1 = SUBREG_REG (op1); > @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn > > if (op1 != op2 > || !REG_P (op1) > - || GET_MODE (op1) != DImode) > + || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ()) > return false; > > return true; > @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn > /* The DImode version of scalar_to_vector_candidate_p. 
*/ > > static bool > -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) > +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) > { > rtx def_set = single_set (insn); > > @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx > rtx dst = SET_DEST (def_set); > > if (GET_CODE (src) == COMPARE) > - return convertible_comparison_p (insn); > + return convertible_comparison_p (insn, mode); > > /* We are interested in DImode promotion only. */ > - if ((GET_MODE (src) != DImode > + if ((GET_MODE (src) != mode > && !CONST_INT_P (src)) > - || GET_MODE (dst) != DImode) > + || GET_MODE (dst) != mode) > return false; > > if (!REG_P (dst) && !MEM_P (dst)) > @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx > return false; > break; > > + case SMAX: > + case SMIN: > + case UMAX: > + case UMIN: > + if ((mode == DImode && !TARGET_AVX512VL) > + || (mode == SImode && !TARGET_SSE4_1)) > + return false; > + /* Fallthru. */ > + > case PLUS: > case MINUS: > case IOR: > @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx > && !CONST_INT_P (XEXP (src, 1))) > return false; > > - if (GET_MODE (XEXP (src, 1)) != DImode > + if (GET_MODE (XEXP (src, 1)) != mode > && !CONST_INT_P (XEXP (src, 1))) > return false; > break; > @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx > || !REG_P (XEXP (XEXP (src, 0), 0)))) > return false; > > - if (GET_MODE (XEXP (src, 0)) != DImode > + if (GET_MODE (XEXP (src, 0)) != mode > && !CONST_INT_P (XEXP (src, 0))) > return false; > > @@ -1383,22 +1467,16 @@ timode_scalar_to_vector_candidate_p (rtx > return false; > } > > -/* Return 1 if INSN may be converted into vector > - instruction. 
*/ > - > -static bool > -scalar_to_vector_candidate_p (rtx_insn *insn) > -{ > - if (TARGET_64BIT) > - return timode_scalar_to_vector_candidate_p (insn); > - else > - return dimode_scalar_to_vector_candidate_p (insn); > -} > +/* For a given bitmap of insn UIDs scans all instruction and > + remove insn from CANDIDATES in case it has both convertible > + and not convertible definitions. > > -/* The DImode version of remove_non_convertible_regs. */ > + All insns in a bitmap are conversion candidates according to > + scalar_to_vector_candidate_p. Currently it implies all insns > + are single_set. */ > > static void > -dimode_remove_non_convertible_regs (bitmap candidates) > +general_remove_non_convertible_regs (bitmap candidates) > { > bitmap_iterator bi; > unsigned id; > @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm > BITMAP_FREE (regs); > } > > -/* For a given bitmap of insn UIDs scans all instruction and > - remove insn from CANDIDATES in case it has both convertible > - and not convertible definitions. > - > - All insns in a bitmap are conversion candidates according to > - scalar_to_vector_candidate_p. Currently it implies all insns > - are single_set. */ > - > -static void > -remove_non_convertible_regs (bitmap candidates) > -{ > - if (TARGET_64BIT) > - timode_remove_non_convertible_regs (candidates); > - else > - dimode_remove_non_convertible_regs (candidates); > -} > - > /* Main STV pass function. Find and convert scalar > instructions into vector mode when profitable. 
*/ > > @@ -1577,11 +1638,14 @@ static unsigned int > convert_scalars_to_vector () > { > basic_block bb; > - bitmap candidates; > int converted_insns = 0; > > bitmap_obstack_initialize (NULL); > - candidates = BITMAP_ALLOC (NULL); > + const machine_mode cand_mode[3] = { SImode, DImode, TImode }; > + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode }; > + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */ > + for (unsigned i = 0; i < 3; ++i) > + bitmap_initialize (&candidates[i], &bitmap_default_obstack); > > calculate_dominance_info (CDI_DOMINATORS); > df_set_flags (DF_DEFER_INSN_RESCAN); > @@ -1597,51 +1661,73 @@ convert_scalars_to_vector () > { > rtx_insn *insn; > FOR_BB_INSNS (bb, insn) > - if (scalar_to_vector_candidate_p (insn)) > + if (TARGET_64BIT > + && timode_scalar_to_vector_candidate_p (insn)) > { > if (dump_file) > - fprintf (dump_file, " insn %d is marked as a candidate\n", > + fprintf (dump_file, " insn %d is marked as a TImode candidate\n", > INSN_UID (insn)); > > - bitmap_set_bit (candidates, INSN_UID (insn)); > + bitmap_set_bit (&candidates[2], INSN_UID (insn)); > + } > + else > + { > + /* Check {SI,DI}mode. */ > + for (unsigned i = 0; i <= 1; ++i) > + if (general_scalar_to_vector_candidate_p (insn, cand_mode[i])) > + { > + if (dump_file) > + fprintf (dump_file, " insn %d is marked as a %s candidate\n", > + INSN_UID (insn), i == 0 ? 
"SImode" : "DImode"); > + > + bitmap_set_bit (&candidates[i], INSN_UID (insn)); > + break; > + } > } > } > > - remove_non_convertible_regs (candidates); > + if (TARGET_64BIT) > + timode_remove_non_convertible_regs (&candidates[2]); > + for (unsigned i = 0; i <= 1; ++i) > + general_remove_non_convertible_regs (&candidates[i]); > > - if (bitmap_empty_p (candidates)) > - if (dump_file) > + for (unsigned i = 0; i <= 2; ++i) > + if (!bitmap_empty_p (&candidates[i])) > + break; > + else if (i == 2 && dump_file) > fprintf (dump_file, "There are no candidates for optimization.\n"); > > - while (!bitmap_empty_p (candidates)) > - { > - unsigned uid = bitmap_first_set_bit (candidates); > - scalar_chain *chain; > + for (unsigned i = 0; i <= 2; ++i) > + while (!bitmap_empty_p (&candidates[i])) > + { > + unsigned uid = bitmap_first_set_bit (&candidates[i]); > + scalar_chain *chain; > > - if (TARGET_64BIT) > - chain = new timode_scalar_chain; > - else > - chain = new dimode_scalar_chain; > + if (cand_mode[i] == TImode) > + chain = new timode_scalar_chain; > + else > + chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]); > > - /* Find instructions chain we want to convert to vector mode. > - Check all uses and definitions to estimate all required > - conversions. */ > - chain->build (candidates, uid); > + /* Find instructions chain we want to convert to vector mode. > + Check all uses and definitions to estimate all required > + conversions. 
*/ > + chain->build (&candidates[i], uid); > > - if (chain->compute_convert_gain () > 0) > - converted_insns += chain->convert (); > - else > - if (dump_file) > - fprintf (dump_file, "Chain #%d conversion is not profitable\n", > - chain->chain_id); > + if (chain->compute_convert_gain () > 0) > + converted_insns += chain->convert (); > + else > + if (dump_file) > + fprintf (dump_file, "Chain #%d conversion is not profitable\n", > + chain->chain_id); > > - delete chain; > - } > + delete chain; > + } > > if (dump_file) > fprintf (dump_file, "Total insns converted: %d\n", converted_insns); > > - BITMAP_FREE (candidates); > + for (unsigned i = 0; i <= 2; ++i) > + bitmap_release (&candidates[i]); > bitmap_obstack_release (NULL); > df_process_deferred_rescans (); > > Index: gcc/config/i386/i386-features.h > =================================================================== > --- gcc/config/i386/i386-features.h (revision 274111) > +++ gcc/config/i386/i386-features.h (working copy) > @@ -127,11 +127,16 @@ namespace { > class scalar_chain > { > public: > - scalar_chain (); > + scalar_chain (enum machine_mode, enum machine_mode); > virtual ~scalar_chain (); > > static unsigned max_id; > > + /* Scalar mode. */ > + enum machine_mode smode; > + /* Vector mode. */ > + enum machine_mode vmode; > + > /* ID of a chain. */ > unsigned int chain_id; > /* A queue of instructions to be included into a chain. 
*/ > @@ -159,9 +164,11 @@ class scalar_chain > virtual void convert_registers () = 0; > }; > > -class dimode_scalar_chain : public scalar_chain > +class general_scalar_chain : public scalar_chain > { > public: > + general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > + : scalar_chain (smode_, vmode_) {} > int compute_convert_gain (); > private: > void mark_dual_mode_def (df_ref def); > @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala > class timode_scalar_chain : public scalar_chain > { > public: > + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {} > + > /* Convert from TImode to V1TImode is always faster. */ > int compute_convert_gain () { return 1; } > > Index: gcc/config/i386/i386.md > =================================================================== > --- gcc/config/i386/i386.md (revision 274111) > +++ gcc/config/i386/i386.md (working copy) > @@ -17721,6 +17721,30 @@ (define_peephole2 > std::swap (operands[4], operands[5]); > }) > > +;; min/max patterns > + > +(define_code_attr maxmin_rel > + [(smax "ge") (smin "le") (umax "geu") (umin "leu")]) > +(define_code_attr maxmin_cmpmode > + [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")]) > + > +(define_insn_and_split "<code><mode>3" > + [(set (match_operand:SWI48 0 "register_operand") > + (maxmin:SWI48 (match_operand:SWI48 1 "register_operand") > + (match_operand:SWI48 2 "register_operand"))) > + (clobber (reg:CC FLAGS_REG))] > + "TARGET_STV && TARGET_SSE4_1 > + && can_create_pseudo_p ()" > + "#" > + "&& 1" > + [(set (reg:<maxmin_cmpmode> FLAGS_REG) > + (compare:<maxmin_cmpmode> (match_dup 1)(match_dup 2))) > + (set (match_dup 0) > + (if_then_else:SWI48 > + (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG)(const_int 0)) > + (match_dup 1) > + (match_dup 2)))]) > + > ;; Conditional addition patterns > (define_expand "add<mode>cc" > [(match_operand:SWI 0 "register_operand") > -- Richard Biener <rguenther@suse.de> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, 
Germany; GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-07 12:04 ` Richard Biener @ 2019-08-07 12:11 ` Uros Bizjak 2019-08-07 12:42 ` Uros Bizjak 1 sibling, 0 replies; 61+ messages in thread From: Uros Bizjak @ 2019-08-07 12:11 UTC (permalink / raw) To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches On Wed, Aug 7, 2019 at 1:51 PM Richard Biener <rguenther@suse.de> wrote: > > On Wed, 7 Aug 2019, Richard Biener wrote: > > > On Mon, 5 Aug 2019, Uros Bizjak wrote: > > > > > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > > > > to force use of %zmmN? > > > > > > > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > > > > > > > case SMAX: > > > > > > case SMIN: > > > > > > case UMAX: > > > > > > case UMIN: > > > > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > > > > || (mode == SImode && !TARGET_SSE4_1)) > > > > > > return false; > > > > > > > > > > > > so there's no way to use AVX512VL for 32bit? > > > > > > > > > > There is a way, but on 32bit targets, we need to split DImode > > > > > operation to a sequence of SImode operations for unconverted pattern. > > > > > This is of course doable, but somehow more complex than simply > > > > > emitting a DImode compare + DImode cmove, which is what current > > > > > splitter does. So, a follow-up task. > > > > > > > > Ah, OK. 
So for the above condition we can elide the !TARGET_64BIT > > > > check we just need to properly split if we enable the scalar minmax > > > > pattern for DImode on 32bits, the STV conversion would go fine. > > > > > > Yes, that is correct. > > > > So I tested the patch below (now with appropriate ChangeLog) on > > x86_64-unknown-linux-gnu. I've thrown it at SPEC CPU 2006 with > > the obvious hmmer improvement, now checking for off-noise results > > with a 3-run on those that may have one (with more than +-1 second > > differences in the 1-run). > > > > As-is the patch likely runs into the splitting issue for DImode > > on i?86 and the patch misses functional testcases. I'll do the > > hmmer loop with both DImode and SImode and testcases to trigger > > all pattern variants with the different ISAs we have. > > > > Some of the patch could be split out (the cost changes that are > > also effective for DImode for example). > > > > AFAICS we could go with only adding SImode avoiding the DImode > > splitting thing and this would solve the hmmer regression. > > I've additionally bootstrapped with --with-arch=nehalem which > reveals > > FAIL: gcc.target/i386/minmax-2.c scan-assembler test > FAIL: gcc.target/i386/minmax-2.c scan-assembler-not cmp > > we emit cmp + cmov here now with -msse4.1 (as soon as the max > pattern is enabled I guess) > > Otherwise testing is clean, so I suppose this is the net effect > of just doing the SImode chains; I don't have AVX512 HW handily > available to really test the DImode path. > > Would you be fine to simplify the patch down to SImode chain handling? Just leave DImode for a couple of days to see what HJ's autotesters reveal. I'd just disable DImode for 32bit targets for now, we know that splitters are missing. Some remarks below. Uros. > > Thanks, > Richard. > > > Thanks, > > Richard. 
> > > > 2019-08-07 Richard Biener <rguenther@suse.de> > > > > PR target/91154 > > * config/i386/i386-features.h (scalar_chain::scalar_chain): Add > > mode arguments. > > (scalar_chain::smode): New member. > > (scalar_chain::vmode): Likewise. > > (dimode_scalar_chain): Rename to... > > (general_scalar_chain): ... this. > > (general_scalar_chain::general_scalar_chain): Take mode arguments. > > (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain > > base with TImode and V1TImode. > > * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust. > > (general_scalar_chain::vector_const_cost): Adjust for SImode > > chains. > > (general_scalar_chain::compute_convert_gain): Likewise. Fix > > reg-reg move cost gain, use ix86_cost->sse_op cost and adjust > > scalar costs. Add {S,U}{MIN,MAX} support. Dump per-instruction > > gain if not zero. > > (general_scalar_chain::replace_with_subreg): Use vmode/smode. > > (general_scalar_chain::make_vector_copies): Likewise. Handle > > non-DImode chains appropriately. > > (general_scalar_chain::convert_reg): Likewise. > > (general_scalar_chain::convert_op): Likewise. > > (general_scalar_chain::convert_insn): Likewise. Add > > fatal_insn_not_found if the result is not recognized. > > (convertible_comparison_p): Pass in the scalar mode and use that. > > (general_scalar_to_vector_candidate_p): Likewise. Rename from > > dimode_scalar_to_vector_candidate_p. Add {S,U}{MIN,MAX} support. > > (scalar_to_vector_candidate_p): Remove by inlining into single > > caller. > > (general_remove_non_convertible_regs): Rename from > > dimode_remove_non_convertible_regs. > > (remove_non_convertible_regs): Remove by inlining into single caller. > > (convert_scalars_to_vector): Handle SImode and DImode chains > > in addition to TImode chains. > > * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV. 
> > > > Index: gcc/config/i386/i386-features.c > > =================================================================== > > --- gcc/config/i386/i386-features.c (revision 274111) > > +++ gcc/config/i386/i386-features.c (working copy) > > @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0; > > > > /* Initialize new chain. */ > > > > -scalar_chain::scalar_chain () > > +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > > { > > + smode = smode_; > > + vmode = vmode_; > > + > > chain_id = ++max_id; > > > > if (dump_file) > > @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins > > conversion. */ > > > > void > > -dimode_scalar_chain::mark_dual_mode_def (df_ref def) > > +general_scalar_chain::mark_dual_mode_def (df_ref def) > > { > > gcc_assert (DF_REF_REG_DEF_P (def)); > > > > @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate > > && !HARD_REGISTER_P (SET_DEST (def_set))) > > bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); > > > > + /* ??? The following is quadratic since analyze_register_chain > > + iterates over all refs to look for dual-mode regs. Instead this > > + should be done separately for all regs mentioned in the chain once. */ > > df_ref ref; > > df_ref def; > > for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) > > @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates, > > instead of using a scalar one. */ > > > > int > > -dimode_scalar_chain::vector_const_cost (rtx exp) > > +general_scalar_chain::vector_const_cost (rtx exp) > > { > > gcc_assert (CONST_INT_P (exp)); > > > > - if (standard_sse_constant_p (exp, V2DImode)) > > - return COSTS_N_INSNS (1); > > - return ix86_cost->sse_load[1]; > > + if (standard_sse_constant_p (exp, vmode)) > > + return ix86_cost->sse_op; > > + /* We have separate costs for SImode and DImode, use SImode costs > > + for smaller modes. */ > > + return ix86_cost->sse_load[smode == DImode ? 
1 : 0]; > > } > > > > /* Compute a gain for chain conversion. */ > > > > int > > -dimode_scalar_chain::compute_convert_gain () > > +general_scalar_chain::compute_convert_gain () > > { > > bitmap_iterator bi; > > unsigned insn_uid; > > @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai > > if (dump_file) > > fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); > > > > + /* SSE costs distinguish between SImode and DImode loads/stores, for > > + int costs factor in the number of GPRs involved. When supporting > > + smaller modes than SImode the int load/store costs need to be > > + adjusted as well. */ > > + unsigned sse_cost_idx = smode == DImode ? 1 : 0; > > + unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1; > > + > > EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) > > { > > rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; > > rtx def_set = single_set (insn); > > rtx src = SET_SRC (def_set); > > rtx dst = SET_DEST (def_set); > > + int igain = 0; > > > > if (REG_P (src) && REG_P (dst)) > > - gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move; > > + igain += 2 * m - ix86_cost->xmm_move; > > else if (REG_P (src) && MEM_P (dst)) > > - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > > + igain > > + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; > > else if (MEM_P (src) && REG_P (dst)) > > - gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; > > + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; > > else if (GET_CODE (src) == ASHIFT > > || GET_CODE (src) == ASHIFTRT > > || GET_CODE (src) == LSHIFTRT) > > { > > if (CONST_INT_P (XEXP (src, 0))) > > - gain -= vector_const_cost (XEXP (src, 0)); > > - gain += ix86_cost->shift_const; > > + igain -= vector_const_cost (XEXP (src, 0)); > > + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; > > if (INTVAL (XEXP (src, 1)) >= 32) > > - gain -= COSTS_N_INSNS (1); > > + igain -= COSTS_N_INSNS (1); > > } > > else if 
(GET_CODE (src) == PLUS > > || GET_CODE (src) == MINUS > > @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai > > || GET_CODE (src) == XOR > > || GET_CODE (src) == AND) > > { > > - gain += ix86_cost->add; > > + igain += m * ix86_cost->add - ix86_cost->sse_op; > > /* Additional gain for andnot for targets without BMI. */ > > if (GET_CODE (XEXP (src, 0)) == NOT > > && !TARGET_BMI) > > - gain += 2 * ix86_cost->add; > > + igain += m * ix86_cost->add; > > > > if (CONST_INT_P (XEXP (src, 0))) > > - gain -= vector_const_cost (XEXP (src, 0)); > > + igain -= vector_const_cost (XEXP (src, 0)); > > if (CONST_INT_P (XEXP (src, 1))) > > - gain -= vector_const_cost (XEXP (src, 1)); > > + igain -= vector_const_cost (XEXP (src, 1)); > > } > > else if (GET_CODE (src) == NEG > > || GET_CODE (src) == NOT) > > - gain += ix86_cost->add - COSTS_N_INSNS (1); > > + igain += m * ix86_cost->add - ix86_cost->sse_op; > > + else if (GET_CODE (src) == SMAX > > + || GET_CODE (src) == SMIN > > + || GET_CODE (src) == UMAX > > + || GET_CODE (src) == UMIN) > > + { > > + /* We do not have any conditional move cost, estimate it as a > > + reg-reg move. Comparisons are costed as adds. */ > > + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); > > + /* Integer SSE ops are all costed the same. */ > > + igain -= ix86_cost->sse_op; > > + } > > else if (GET_CODE (src) == COMPARE) > > { > > /* Assume comparison cost is the same. */ > > @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai > > else if (CONST_INT_P (src)) > > { > > if (REG_P (dst)) > > - gain += COSTS_N_INSNS (2); > > + /* DImode can be immediate for TARGET_64BIT and SImode always. 
*/ > > + igain += COSTS_N_INSNS (m); > > else if (MEM_P (dst)) > > - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > > - gain -= vector_const_cost (src); > > + igain += (m * ix86_cost->int_store[2] > > + - ix86_cost->sse_store[sse_cost_idx]); > > + igain -= vector_const_cost (src); > > } > > else > > gcc_unreachable (); > > + > > + if (igain != 0 && dump_file) > > + { > > + fprintf (dump_file, " Instruction gain %d for ", igain); > > + dump_insn_slim (dump_file, insn); > > + } > > + gain += igain; > > } > > > > if (dump_file) > > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > > > + /* ??? What about integer to SSE? */ > > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > > > @@ -570,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai > > /* Replace REG in X with a V2DI subreg of NEW_REG. */ > > > > rtx > > -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) > > +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) > > { > > if (x == reg) > > - return gen_rtx_SUBREG (V2DImode, new_reg, 0); > > + return gen_rtx_SUBREG (vmode, new_reg, 0); > > > > const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); > > int i, j; > > @@ -593,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg > > /* Replace REG in INSN with a V2DI subreg of NEW_REG. */ > > > > void > > -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, > > +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, > > rtx reg, rtx new_reg) > > { > > replace_with_subreg (single_set (insn), reg, new_reg); > > @@ -624,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx > > and replace its uses in a chain. 
*/ > > > > void > > -dimode_scalar_chain::make_vector_copies (unsigned regno) > > +general_scalar_chain::make_vector_copies (unsigned regno) > > { > > rtx reg = regno_reg_rtx[regno]; > > - rtx vreg = gen_reg_rtx (DImode); > > + rtx vreg = gen_reg_rtx (smode); > > df_ref ref; > > > > for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) > > @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies > > start_sequence (); > > if (!TARGET_INTER_UNIT_MOVES_TO_VEC) > > { > > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > > - emit_move_insn (adjust_address (tmp, SImode, 0), > > - gen_rtx_SUBREG (SImode, reg, 0)); > > - emit_move_insn (adjust_address (tmp, SImode, 4), > > - gen_rtx_SUBREG (SImode, reg, 4)); > > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > > + if (smode == DImode && !TARGET_64BIT) > > + { > > + emit_move_insn (adjust_address (tmp, SImode, 0), > > + gen_rtx_SUBREG (SImode, reg, 0)); > > + emit_move_insn (adjust_address (tmp, SImode, 4), > > + gen_rtx_SUBREG (SImode, reg, 4)); > > + } > > + else > > + emit_move_insn (tmp, reg); > > emit_move_insn (vreg, tmp); > > } > > - else if (TARGET_SSE4_1) > > + else if (!TARGET_64BIT && smode == DImode) > > { > > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > > - CONST0_RTX (V4SImode), > > - gen_rtx_SUBREG (SImode, reg, 0))); > > - emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > > - gen_rtx_SUBREG (V4SImode, vreg, 0), > > - gen_rtx_SUBREG (SImode, reg, 4), > > - GEN_INT (2))); > > + if (TARGET_SSE4_1) > > + { > > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > > + CONST0_RTX (V4SImode), > > + gen_rtx_SUBREG (SImode, reg, 0))); > > + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > > + gen_rtx_SUBREG (V4SImode, vreg, 0), > > + gen_rtx_SUBREG (SImode, reg, 4), > > + GEN_INT (2))); > > + } > > + else > > + { > > + rtx tmp = gen_reg_rtx (DImode); > > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG 
(V4SImode, vreg, 0), > > + CONST0_RTX (V4SImode), > > + gen_rtx_SUBREG (SImode, reg, 0))); > > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > > + CONST0_RTX (V4SImode), > > + gen_rtx_SUBREG (SImode, reg, 4))); > > + emit_insn (gen_vec_interleave_lowv4si > > + (gen_rtx_SUBREG (V4SImode, vreg, 0), > > + gen_rtx_SUBREG (V4SImode, vreg, 0), > > + gen_rtx_SUBREG (V4SImode, tmp, 0))); > > + } > > } > > else > > - { > > - rtx tmp = gen_reg_rtx (DImode); > > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > > - CONST0_RTX (V4SImode), > > - gen_rtx_SUBREG (SImode, reg, 0))); > > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > > - CONST0_RTX (V4SImode), > > - gen_rtx_SUBREG (SImode, reg, 4))); > > - emit_insn (gen_vec_interleave_lowv4si > > - (gen_rtx_SUBREG (V4SImode, vreg, 0), > > - gen_rtx_SUBREG (V4SImode, vreg, 0), > > - gen_rtx_SUBREG (V4SImode, tmp, 0))); > > - } > > + emit_move_insn (gen_lowpart (smode, vreg), reg); > > rtx_insn *seq = get_insns (); > > end_sequence (); > > rtx_insn *insn = DF_REF_INSN (ref); > > @@ -695,7 +743,7 @@ dimode_scalar_chain::make_vector_copies > > in case register is used in not convertible insn. 
*/ > > > > void > > -dimode_scalar_chain::convert_reg (unsigned regno) > > +general_scalar_chain::convert_reg (unsigned regno) > > { > > bool scalar_copy = bitmap_bit_p (defs_conv, regno); > > rtx reg = regno_reg_rtx[regno]; > > @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign > > bitmap_copy (conv, insns); > > > > if (scalar_copy) > > - scopy = gen_reg_rtx (DImode); > > + scopy = gen_reg_rtx (smode); > > > > for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) > > { > > @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign > > start_sequence (); > > if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) > > { > > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > > emit_move_insn (tmp, reg); > > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > > - adjust_address (tmp, SImode, 0)); > > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > > - adjust_address (tmp, SImode, 4)); > > + if (!TARGET_64BIT && smode == DImode) > > + { > > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > > + adjust_address (tmp, SImode, 0)); > > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > > + adjust_address (tmp, SImode, 4)); > > + } > > + else > > + emit_move_insn (scopy, tmp); > > } > > - else if (TARGET_SSE4_1) > > + else if (!TARGET_64BIT && smode == DImode) > > { > > - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); > > - emit_insn > > - (gen_rtx_SET > > - (gen_rtx_SUBREG (SImode, scopy, 0), > > - gen_rtx_VEC_SELECT (SImode, > > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > > - > > - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > > - emit_insn > > - (gen_rtx_SET > > - (gen_rtx_SUBREG (SImode, scopy, 4), > > - gen_rtx_VEC_SELECT (SImode, > > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > > + if (TARGET_SSE4_1) > > + { > > + rtx tmp = gen_rtx_PARALLEL (VOIDmode, > > + gen_rtvec (1, const0_rtx)); > > + emit_insn > > + (gen_rtx_SET > > 
+ (gen_rtx_SUBREG (SImode, scopy, 0), > > + gen_rtx_VEC_SELECT (SImode, > > + gen_rtx_SUBREG (V4SImode, reg, 0), > > + tmp))); > > + > > + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > > + emit_insn > > + (gen_rtx_SET > > + (gen_rtx_SUBREG (SImode, scopy, 4), > > + gen_rtx_VEC_SELECT (SImode, > > + gen_rtx_SUBREG (V4SImode, reg, 0), > > + tmp))); > > + } > > + else > > + { > > + rtx vcopy = gen_reg_rtx (V2DImode); > > + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > > + gen_rtx_SUBREG (SImode, vcopy, 0)); > > + emit_move_insn (vcopy, > > + gen_rtx_LSHIFTRT (V2DImode, > > + vcopy, GEN_INT (32))); > > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > > + gen_rtx_SUBREG (SImode, vcopy, 0)); > > + } > > } > > else > > - { > > - rtx vcopy = gen_reg_rtx (V2DImode); > > - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > > - gen_rtx_SUBREG (SImode, vcopy, 0)); > > - emit_move_insn (vcopy, > > - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); > > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > > - gen_rtx_SUBREG (SImode, vcopy, 0)); > > - } > > + emit_move_insn (scopy, reg); > > + > > rtx_insn *seq = get_insns (); > > end_sequence (); > > emit_conversion_insns (seq, insn); > > @@ -809,21 +872,21 @@ dimode_scalar_chain::convert_reg (unsign > > registers conversion. 
*/ > > > > void > > -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn) > > +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn) > > { > > *op = copy_rtx_if_shared (*op); > > > > if (GET_CODE (*op) == NOT) > > { > > convert_op (&XEXP (*op, 0), insn); > > - PUT_MODE (*op, V2DImode); > > + PUT_MODE (*op, vmode); > > } > > else if (MEM_P (*op)) > > { > > - rtx tmp = gen_reg_rtx (DImode); > > + rtx tmp = gen_reg_rtx (GET_MODE (*op)); > > > > emit_insn_before (gen_move_insn (tmp, *op), insn); > > - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); > > + *op = gen_rtx_SUBREG (vmode, tmp, 0); > > > > if (dump_file) > > fprintf (dump_file, " Preloading operand for insn %d into r%d\n", > > @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op > > gcc_assert (!DF_REF_CHAIN (ref)); > > break; > > } > > - *op = gen_rtx_SUBREG (V2DImode, *op, 0); > > + *op = gen_rtx_SUBREG (vmode, *op, 0); > > } > > else if (CONST_INT_P (*op)) > > { > > rtx vec_cst; > > - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); > > + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); > > > > /* Prefer all ones vector in case of -1. 
*/ > > if (constm1_operand (*op, GET_MODE (*op))) > > - vec_cst = CONSTM1_RTX (V2DImode); > > + vec_cst = CONSTM1_RTX (vmode); > > else > > - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, > > - gen_rtvec (2, *op, const0_rtx)); > > + { > > + unsigned n = GET_MODE_NUNITS (vmode); > > + rtx *v = XALLOCAVEC (rtx, n); > > + v[0] = *op; > > + for (unsigned i = 1; i < n; ++i) > > + v[i] = const0_rtx; > > + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); > > + } > > > > - if (!standard_sse_constant_p (vec_cst, V2DImode)) > > + if (!standard_sse_constant_p (vec_cst, vmode)) > > { > > start_sequence (); > > - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); > > + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); > > rtx_insn *seq = get_insns (); > > end_sequence (); > > emit_insn_before (seq, insn); > > @@ -870,14 +939,14 @@ dimode_scalar_chain::convert_op (rtx *op > > else > > { > > gcc_assert (SUBREG_P (*op)); > > - gcc_assert (GET_MODE (*op) == V2DImode); > > + gcc_assert (GET_MODE (*op) == vmode); > > } > > } > > > > /* Convert INSN to vector mode. */ > > > > void > > -dimode_scalar_chain::convert_insn (rtx_insn *insn) > > +general_scalar_chain::convert_insn (rtx_insn *insn) > > { > > rtx def_set = single_set (insn); > > rtx src = SET_SRC (def_set); > > @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i > > { > > /* There are no scalar integer instructions and therefore > > temporary register usage is required. 
*/ > > - rtx tmp = gen_reg_rtx (DImode); > > + rtx tmp = gen_reg_rtx (GET_MODE (dst)); > > emit_conversion_insns (gen_move_insn (dst, tmp), insn); > > - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); > > + dst = gen_rtx_SUBREG (vmode, tmp, 0); > > } > > > > switch (GET_CODE (src)) > > @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i > > case ASHIFTRT: > > case LSHIFTRT: > > convert_op (&XEXP (src, 0), insn); > > - PUT_MODE (src, V2DImode); > > + PUT_MODE (src, vmode); > > break; > > > > case PLUS: > > @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i > > case IOR: > > case XOR: > > case AND: > > + case SMAX: > > + case SMIN: > > + case UMAX: > > + case UMIN: > > convert_op (&XEXP (src, 0), insn); > > convert_op (&XEXP (src, 1), insn); > > - PUT_MODE (src, V2DImode); > > + PUT_MODE (src, vmode); > > break; > > > > case NEG: > > src = XEXP (src, 0); > > convert_op (&src, insn); > > - subreg = gen_reg_rtx (V2DImode); > > - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); > > - src = gen_rtx_MINUS (V2DImode, subreg, src); > > + subreg = gen_reg_rtx (vmode); > > + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); > > + src = gen_rtx_MINUS (vmode, subreg, src); > > break; > > > > case NOT: > > src = XEXP (src, 0); > > convert_op (&src, insn); > > - subreg = gen_reg_rtx (V2DImode); > > - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); > > - src = gen_rtx_XOR (V2DImode, src, subreg); > > + subreg = gen_reg_rtx (vmode); > > + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); > > + src = gen_rtx_XOR (vmode, src, subreg); > > break; > > > > case MEM: > > @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i > > break; > > > > case SUBREG: > > - gcc_assert (GET_MODE (src) == V2DImode); > > + gcc_assert (GET_MODE (src) == vmode); > > break; > > > > case COMPARE: > > src = SUBREG_REG (XEXP (XEXP (src, 0), 0)); > > > > - gcc_assert ((REG_P (src) && GET_MODE (src) == 
DImode) > > - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); > > + gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode)) > > + || (SUBREG_P (src) && GET_MODE (src) == vmode)); > > > > if (REG_P (src)) > > - subreg = gen_rtx_SUBREG (V2DImode, src, 0); > > + subreg = gen_rtx_SUBREG (vmode, src, 0); > > else > > subreg = copy_rtx_if_shared (src); > > emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), > > @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i > > PATTERN (insn) = def_set; > > > > INSN_CODE (insn) = -1; > > - recog_memoized (insn); > > + int patt = recog_memoized (insn); > > + if (patt == -1) > > + fatal_insn_not_found (insn); > > df_insn_rescan (insn); > > } > > > > @@ -1116,7 +1191,7 @@ timode_scalar_chain::convert_insn (rtx_i > > } > > > > void > > -dimode_scalar_chain::convert_registers () > > +general_scalar_chain::convert_registers () > > { > > bitmap_iterator bi; > > unsigned id; > > @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn > > (const_int 0 [0]))) */ > > > > static bool > > -convertible_comparison_p (rtx_insn *insn) > > +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) > > { > > if (!TARGET_SSE4_1) > > return false; > > @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn > > > > if (!SUBREG_P (op1) > > || !SUBREG_P (op2) > > - || GET_MODE (op1) != SImode > > - || GET_MODE (op2) != SImode > > + || GET_MODE (op1) != mode > > + || GET_MODE (op2) != mode > > || ((SUBREG_BYTE (op1) != 0 > > - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) > > + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) > > && (SUBREG_BYTE (op2) != 0 > > - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) > > + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) > > return false; > > > > op1 = SUBREG_REG (op1); > > @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn > > > > if (op1 != op2 > > || !REG_P (op1) > > - || GET_MODE (op1) != DImode) > > + || GET_MODE (op1) != 
GET_MODE_WIDER_MODE (mode).else_blk ()) > > return false; > > > > return true; > > @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn > > /* The DImode version of scalar_to_vector_candidate_p. */ > > > > static bool > > -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) > > +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) > > { > > rtx def_set = single_set (insn); > > > > @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx > > rtx dst = SET_DEST (def_set); > > > > if (GET_CODE (src) == COMPARE) > > - return convertible_comparison_p (insn); > > + return convertible_comparison_p (insn, mode); > > > > /* We are interested in DImode promotion only. */ > > - if ((GET_MODE (src) != DImode > > + if ((GET_MODE (src) != mode > > && !CONST_INT_P (src)) > > - || GET_MODE (dst) != DImode > > + || GET_MODE (dst) != mode) > > return false; > > > > if (!REG_P (dst) && !MEM_P (dst)) > > @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx > > return false; > > break; > > > > + case SMAX: > > + case SMIN: > > + case UMAX: > > + case UMIN: > > + if ((mode == DImode && !TARGET_AVX512VL) Please enable only for TARGET_64BIT for now. > > + || (mode == SImode && !TARGET_SSE4_1)) > > + return false; > > + /* Fallthru.
*/ > > + > > case PLUS: > > case MINUS: > > case IOR: > > @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx > > && !CONST_INT_P (XEXP (src, 1))) > > return false; > > > > - if (GET_MODE (XEXP (src, 1)) != DImode > > + if (GET_MODE (XEXP (src, 1)) != mode > > && !CONST_INT_P (XEXP (src, 1))) > > return false; > > break; > > @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx > > || !REG_P (XEXP (XEXP (src, 0), 0)))) > > return false; > > > > - if (GET_MODE (XEXP (src, 0)) != DImode > > + if (GET_MODE (XEXP (src, 0)) != mode > > && !CONST_INT_P (XEXP (src, 0))) > > return false; > > > > @@ -1383,22 +1467,16 @@ timode_scalar_to_vector_candidate_p (rtx > > return false; > > } > > > > -/* Return 1 if INSN may be converted into vector > > - instruction. */ > > - > > -static bool > > -scalar_to_vector_candidate_p (rtx_insn *insn) > > -{ > > - if (TARGET_64BIT) > > - return timode_scalar_to_vector_candidate_p (insn); > > - else > > - return dimode_scalar_to_vector_candidate_p (insn); > > -} > > +/* For a given bitmap of insn UIDs scans all instruction and > > + remove insn from CANDIDATES in case it has both convertible > > + and not convertible definitions. > > > > -/* The DImode version of remove_non_convertible_regs. */ > > + All insns in a bitmap are conversion candidates according to > > + scalar_to_vector_candidate_p. Currently it implies all insns > > + are single_set. */ > > > > static void > > -dimode_remove_non_convertible_regs (bitmap candidates) > > +general_remove_non_convertible_regs (bitmap candidates) > > { > > bitmap_iterator bi; > > unsigned id; > > @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm > > BITMAP_FREE (regs); > > } > > > > -/* For a given bitmap of insn UIDs scans all instruction and > > - remove insn from CANDIDATES in case it has both convertible > > - and not convertible definitions. > > - > > - All insns in a bitmap are conversion candidates according to > > - scalar_to_vector_candidate_p. 
Currently it implies all insns > > - are single_set. */ > > - > > -static void > > -remove_non_convertible_regs (bitmap candidates) > > -{ > > - if (TARGET_64BIT) > > - timode_remove_non_convertible_regs (candidates); > > - else > > - dimode_remove_non_convertible_regs (candidates); > > -} > > - > > /* Main STV pass function. Find and convert scalar > > instructions into vector mode when profitable. */ > > > > @@ -1577,11 +1638,14 @@ static unsigned int > > convert_scalars_to_vector () > > { > > basic_block bb; > > - bitmap candidates; > > int converted_insns = 0; > > > > bitmap_obstack_initialize (NULL); > > - candidates = BITMAP_ALLOC (NULL); > > + const machine_mode cand_mode[3] = { SImode, DImode, TImode }; > > + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode }; > > + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */ > > + for (unsigned i = 0; i < 3; ++i) > > + bitmap_initialize (&candidates[i], &bitmap_default_obstack); > > > > calculate_dominance_info (CDI_DOMINATORS); > > df_set_flags (DF_DEFER_INSN_RESCAN); > > @@ -1597,51 +1661,73 @@ convert_scalars_to_vector () > > { > > rtx_insn *insn; > > FOR_BB_INSNS (bb, insn) > > - if (scalar_to_vector_candidate_p (insn)) > > + if (TARGET_64BIT > > + && timode_scalar_to_vector_candidate_p (insn)) > > { > > if (dump_file) > > - fprintf (dump_file, " insn %d is marked as a candidate\n", > > + fprintf (dump_file, " insn %d is marked as a TImode candidate\n", > > INSN_UID (insn)); > > > > - bitmap_set_bit (candidates, INSN_UID (insn)); > > + bitmap_set_bit (&candidates[2], INSN_UID (insn)); > > + } > > + else > > + { > > + /* Check {SI,DI}mode. */ > > + for (unsigned i = 0; i <= 1; ++i) > > + if (general_scalar_to_vector_candidate_p (insn, cand_mode[i])) > > + { > > + if (dump_file) > > + fprintf (dump_file, " insn %d is marked as a %s candidate\n", > > + INSN_UID (insn), i == 0 ? 
"SImode" : "DImode"); > > + > > + bitmap_set_bit (&candidates[i], INSN_UID (insn)); > > + break; > > + } > > } > > } > > > > - remove_non_convertible_regs (candidates); > > + if (TARGET_64BIT) > > + timode_remove_non_convertible_regs (&candidates[2]); > > + for (unsigned i = 0; i <= 1; ++i) > > + general_remove_non_convertible_regs (&candidates[i]); > > > > - if (bitmap_empty_p (candidates)) > > - if (dump_file) > > + for (unsigned i = 0; i <= 2; ++i) > > + if (!bitmap_empty_p (&candidates[i])) > > + break; > > + else if (i == 2 && dump_file) > > fprintf (dump_file, "There are no candidates for optimization.\n"); > > > > - while (!bitmap_empty_p (candidates)) > > - { > > - unsigned uid = bitmap_first_set_bit (candidates); > > - scalar_chain *chain; > > + for (unsigned i = 0; i <= 2; ++i) > > + while (!bitmap_empty_p (&candidates[i])) > > + { > > + unsigned uid = bitmap_first_set_bit (&candidates[i]); > > + scalar_chain *chain; > > > > - if (TARGET_64BIT) > > - chain = new timode_scalar_chain; > > - else > > - chain = new dimode_scalar_chain; > > + if (cand_mode[i] == TImode) > > + chain = new timode_scalar_chain; > > + else > > + chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]); > > > > - /* Find instructions chain we want to convert to vector mode. > > - Check all uses and definitions to estimate all required > > - conversions. */ > > - chain->build (candidates, uid); > > + /* Find instructions chain we want to convert to vector mode. > > + Check all uses and definitions to estimate all required > > + conversions. 
*/ > > + chain->build (&candidates[i], uid); > > > > - if (chain->compute_convert_gain () > 0) > > - converted_insns += chain->convert (); > > - else > > - if (dump_file) > > - fprintf (dump_file, "Chain #%d conversion is not profitable\n", > > - chain->chain_id); > > + if (chain->compute_convert_gain () > 0) > > + converted_insns += chain->convert (); > > + else > > + if (dump_file) > > + fprintf (dump_file, "Chain #%d conversion is not profitable\n", > > + chain->chain_id); > > > > - delete chain; > > - } > > + delete chain; > > + } > > > > if (dump_file) > > fprintf (dump_file, "Total insns converted: %d\n", converted_insns); > > > > - BITMAP_FREE (candidates); > > + for (unsigned i = 0; i <= 2; ++i) > > + bitmap_release (&candidates[i]); > > bitmap_obstack_release (NULL); > > df_process_deferred_rescans (); > > > > Index: gcc/config/i386/i386-features.h > > =================================================================== > > --- gcc/config/i386/i386-features.h (revision 274111) > > +++ gcc/config/i386/i386-features.h (working copy) > > @@ -127,11 +127,16 @@ namespace { > > class scalar_chain > > { > > public: > > - scalar_chain (); > > + scalar_chain (enum machine_mode, enum machine_mode); > > virtual ~scalar_chain (); > > > > static unsigned max_id; > > > > + /* Scalar mode. */ > > + enum machine_mode smode; > > + /* Vector mode. */ > > + enum machine_mode vmode; > > + > > /* ID of a chain. */ > > unsigned int chain_id; > > /* A queue of instructions to be included into a chain. 
*/ > > @@ -159,9 +164,11 @@ class scalar_chain > > virtual void convert_registers () = 0; > > }; > > > > -class dimode_scalar_chain : public scalar_chain > > +class general_scalar_chain : public scalar_chain > > { > > public: > > + general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > > + : scalar_chain (smode_, vmode_) {} > > int compute_convert_gain (); > > private: > > void mark_dual_mode_def (df_ref def); > > @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala > > class timode_scalar_chain : public scalar_chain > > { > > public: > > + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {} > > + > > /* Convert from TImode to V1TImode is always faster. */ > > int compute_convert_gain () { return 1; } > > > > Index: gcc/config/i386/i386.md > > =================================================================== > > --- gcc/config/i386/i386.md (revision 274111) > > +++ gcc/config/i386/i386.md (working copy) > > @@ -17721,6 +17721,30 @@ (define_peephole2 > > std::swap (operands[4], operands[5]); > > }) > > > > +;; min/max patterns You should use: (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_64BIT && TARGET_AVX512F"]) in the pattern below. Otherwise, middle-end detects and emits minmax patterns that have no chance of being converted and always split back to integer insns. > > +(define_code_attr maxmin_rel > > + [(smax "ge") (smin "le") (umax "geu") (umin "leu")]) > > +(define_code_attr maxmin_cmpmode > > + [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")]) > > + > > +(define_insn_and_split "<code><mode>3" > > + [(set (match_operand:SWI48 0 "register_operand") > > + (maxmin:SWI48 (match_operand:SWI48 1 "register_operand") > > + (match_operand:SWI48 2 "register_operand"))) > > + (clobber (reg:CC FLAGS_REG))] > > + "TARGET_STV && TARGET_SSE4_1 leave only TARGET_STV if MAXMIN_IMODE will be used. 
> > + && can_create_pseudo_p ()" > > + "#" > > + "&& 1" > > + [(set (reg:<maxmin_cmpmode> FLAGS_REG) > > + (compare:<maxmin_cmpmode> (match_dup 1)(match_dup 2))) > > + (set (match_dup 0) > > + (if_then_else:SWI48 > > + (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG)(const_int 0)) > > + (match_dup 1) > > + (match_dup 2)))]) > > + > > ;; Conditional addition patterns > > (define_expand "add<mode>cc" > > [(match_operand:SWI 0 "register_operand") > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; > GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 61+ messages in thread
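[Editorial note: the patch above targets loops whose bodies consist of SImode loads, adds, and chained signed-max operations, as in the 456.hmmer hot loop. The following is a hypothetical C reduction of that code shape for illustration only; the names and exact loop structure are assumptions, not hmmer's actual source. Each iteration does an add followed by two signed maxima, which the STV pass with this patch may keep in SSE registers as pmaxsd instead of cmp/cmov pairs.]

```c
#include <assert.h>

/* Hypothetical reduction of the hmmer-style hot loop: per iteration,
   one SImode add and two chained signed-max operations.  Written
   branchy here; at -O2 GCC turns each max into cmp + cmov, and with
   the patch the whole chain becomes SSE paddd + pmaxsd.  */
void
chain (const int *a, const int *b, const int *c, int *out, int n)
{
  for (int i = 0; i < n; ++i)
    {
      int sc = a[i] + b[i];
      if (sc < c[i])	/* first smax */
	sc = c[i];
      if (sc < 0)	/* second smax, against zero */
	sc = 0;
      out[i] = sc;
    }
}
```

Under this sketch, each max is a MAX_EXPR at the GIMPLE level; the new smaxsi3 pattern and the SSE alternatives in *add<mode>_1 are what let the register allocator place the whole chain in vector registers.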
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-07 12:04 ` Richard Biener 2019-08-07 12:11 ` Uros Bizjak @ 2019-08-07 12:42 ` Uros Bizjak 2019-08-07 12:58 ` Uros Bizjak 1 sibling, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-07 12:42 UTC (permalink / raw) To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches On Wed, Aug 7, 2019 at 1:51 PM Richard Biener <rguenther@suse.de> wrote: > > On Wed, 7 Aug 2019, Richard Biener wrote: > > > On Mon, 5 Aug 2019, Uros Bizjak wrote: > > > > > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > > > > to force use of %zmmN? > > > > > > > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > > > > > > > case SMAX: > > > > > > case SMIN: > > > > > > case UMAX: > > > > > > case UMIN: > > > > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > > > > || (mode == SImode && !TARGET_SSE4_1)) > > > > > > return false; > > > > > > > > > > > > so there's no way to use AVX512VL for 32bit? > > > > > > > > > > There is a way, but on 32bit targets, we need to split DImode > > > > > operation to a sequence of SImode operations for unconverted pattern. > > > > > This is of course doable, but somehow more complex than simply > > > > > emitting a DImode compare + DImode cmove, which is what current > > > > > splitter does. So, a follow-up task. > > > > > > > > Ah, OK. 
So for the above condition we can elide the !TARGET_64BIT > > > > check we just need to properly split if we enable the scalar minmax > > > > pattern for DImode on 32bits, the STV conversion would go fine. > > > > > > Yes, that is correct. > > > > So I tested the patch below (now with appropriate ChangeLog) on > > x86_64-unknown-linux-gnu. I've thrown it at SPEC CPU 2006 with > > the obvious hmmer improvement, now checking for off-noise results > > with a 3-run on those that may have one (with more than +-1 second > > differences in the 1-run). > > > > As-is the patch likely runs into the splitting issue for DImode > > on i?86 and the patch misses functional testcases. I'll do the > > hmmer loop with both DImode and SImode and testcases to trigger > > all pattern variants with the different ISAs we have. > > > > Some of the patch could be split out (the cost changes that are > > also effective for DImode for example). > > > > AFAICS we could go with only adding SImode avoiding the DImode > > splitting thing and this would solve the hmmer regression. > > I've additionally bootstrapped with --with-arch=nehalem which > reveals > > FAIL: gcc.target/i386/minmax-2.c scan-assembler test > FAIL: gcc.target/i386/minmax-2.c scan-assembler-not cmp > > we emit cmp + cmov here now with -msse4.1 (as soon as the max > pattern is enabled I guess) Actually, we have to split using ix86_expand_int_compare. This will generate optimized CC mode. Uros. > > Otherwise testing is clean, so I suppose this is the net effect > of just doing the SImode chains; I don't have AVX512 HW handily > available to really test the DImode path. > > Would you be fine to simplify the patch down to SImode chain handling? > > Thanks, > Richard. > > > Thanks, > > Richard. > > > > 2019-08-07 Richard Biener <rguenther@suse.de> > > > > PR target/91154 > > * config/i386/i386-features.h (scalar_chain::scalar_chain): Add > > mode arguments. > > (scalar_chain::smode): New member. > > (scalar_chain::vmode): Likewise. 
> > (dimode_scalar_chain): Rename to... > > (general_scalar_chain): ... this. > > (general_scalar_chain::general_scalar_chain): Take mode arguments. > > (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain > > base with TImode and V1TImode. > > * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust. > > (general_scalar_chain::vector_const_cost): Adjust for SImode > > chains. > > (general_scalar_chain::compute_convert_gain): Likewise. Fix > > reg-reg move cost gain, use ix86_cost->sse_op cost and adjust > > scalar costs. Add {S,U}{MIN,MAX} support. Dump per-instruction > > gain if not zero. > > (general_scalar_chain::replace_with_subreg): Use vmode/smode. > > (general_scalar_chain::make_vector_copies): Likewise. Handle > > non-DImode chains appropriately. > > (general_scalar_chain::convert_reg): Likewise. > > (general_scalar_chain::convert_op): Likewise. > > (general_scalar_chain::convert_insn): Likewise. Add > > fatal_insn_not_found if the result is not recognized. > > (convertible_comparison_p): Pass in the scalar mode and use that. > > (general_scalar_to_vector_candidate_p): Likewise. Rename from > > dimode_scalar_to_vector_candidate_p. Add {S,U}{MIN,MAX} support. > > (scalar_to_vector_candidate_p): Remove by inlining into single > > caller. > > (general_remove_non_convertible_regs): Rename from > > dimode_remove_non_convertible_regs. > > (remove_non_convertible_regs): Remove by inlining into single caller. > > (convert_scalars_to_vector): Handle SImode and DImode chains > > in addition to TImode chains. > > * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV. > > > > Index: gcc/config/i386/i386-features.c > > =================================================================== > > --- gcc/config/i386/i386-features.c (revision 274111) > > +++ gcc/config/i386/i386-features.c (working copy) > > @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0; > > > > /* Initialize new chain. 
*/ > > > > -scalar_chain::scalar_chain () > > +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > > { > > + smode = smode_; > > + vmode = vmode_; > > + > > chain_id = ++max_id; > > > > if (dump_file) > > @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins > > conversion. */ > > > > void > > -dimode_scalar_chain::mark_dual_mode_def (df_ref def) > > +general_scalar_chain::mark_dual_mode_def (df_ref def) > > { > > gcc_assert (DF_REF_REG_DEF_P (def)); > > > > @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate > > && !HARD_REGISTER_P (SET_DEST (def_set))) > > bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); > > > > + /* ??? The following is quadratic since analyze_register_chain > > + iterates over all refs to look for dual-mode regs. Instead this > > + should be done separately for all regs mentioned in the chain once. */ > > df_ref ref; > > df_ref def; > > for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) > > @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates, > > instead of using a scalar one. */ > > > > int > > -dimode_scalar_chain::vector_const_cost (rtx exp) > > +general_scalar_chain::vector_const_cost (rtx exp) > > { > > gcc_assert (CONST_INT_P (exp)); > > > > - if (standard_sse_constant_p (exp, V2DImode)) > > - return COSTS_N_INSNS (1); > > - return ix86_cost->sse_load[1]; > > + if (standard_sse_constant_p (exp, vmode)) > > + return ix86_cost->sse_op; > > + /* We have separate costs for SImode and DImode, use SImode costs > > + for smaller modes. */ > > + return ix86_cost->sse_load[smode == DImode ? 1 : 0]; > > } > > > > /* Compute a gain for chain conversion. 
*/ > > > > int > > -dimode_scalar_chain::compute_convert_gain () > > +general_scalar_chain::compute_convert_gain () > > { > > bitmap_iterator bi; > > unsigned insn_uid; > > @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai > > if (dump_file) > > fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); > > > > + /* SSE costs distinguish between SImode and DImode loads/stores, for > > + int costs factor in the number of GPRs involved. When supporting > > + smaller modes than SImode the int load/store costs need to be > > + adjusted as well. */ > > + unsigned sse_cost_idx = smode == DImode ? 1 : 0; > > + unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1; > > + > > EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) > > { > > rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; > > rtx def_set = single_set (insn); > > rtx src = SET_SRC (def_set); > > rtx dst = SET_DEST (def_set); > > + int igain = 0; > > > > if (REG_P (src) && REG_P (dst)) > > - gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move; > > + igain += 2 * m - ix86_cost->xmm_move; > > else if (REG_P (src) && MEM_P (dst)) > > - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > > + igain > > + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; > > else if (MEM_P (src) && REG_P (dst)) > > - gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; > > + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; > > else if (GET_CODE (src) == ASHIFT > > || GET_CODE (src) == ASHIFTRT > > || GET_CODE (src) == LSHIFTRT) > > { > > if (CONST_INT_P (XEXP (src, 0))) > > - gain -= vector_const_cost (XEXP (src, 0)); > > - gain += ix86_cost->shift_const; > > + igain -= vector_const_cost (XEXP (src, 0)); > > + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; > > if (INTVAL (XEXP (src, 1)) >= 32) > > - gain -= COSTS_N_INSNS (1); > > + igain -= COSTS_N_INSNS (1); > > } > > else if (GET_CODE (src) == PLUS > > || GET_CODE (src) == MINUS > > @@ 
-520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai > > || GET_CODE (src) == XOR > > || GET_CODE (src) == AND) > > { > > - gain += ix86_cost->add; > > + igain += m * ix86_cost->add - ix86_cost->sse_op; > > /* Additional gain for andnot for targets without BMI. */ > > if (GET_CODE (XEXP (src, 0)) == NOT > > && !TARGET_BMI) > > - gain += 2 * ix86_cost->add; > > + igain += m * ix86_cost->add; > > > > if (CONST_INT_P (XEXP (src, 0))) > > - gain -= vector_const_cost (XEXP (src, 0)); > > + igain -= vector_const_cost (XEXP (src, 0)); > > if (CONST_INT_P (XEXP (src, 1))) > > - gain -= vector_const_cost (XEXP (src, 1)); > > + igain -= vector_const_cost (XEXP (src, 1)); > > } > > else if (GET_CODE (src) == NEG > > || GET_CODE (src) == NOT) > > - gain += ix86_cost->add - COSTS_N_INSNS (1); > > + igain += m * ix86_cost->add - ix86_cost->sse_op; > > + else if (GET_CODE (src) == SMAX > > + || GET_CODE (src) == SMIN > > + || GET_CODE (src) == UMAX > > + || GET_CODE (src) == UMIN) > > + { > > + /* We do not have any conditional move cost, estimate it as a > > + reg-reg move. Comparisons are costed as adds. */ > > + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); > > + /* Integer SSE ops are all costed the same. */ > > + igain -= ix86_cost->sse_op; > > + } > > else if (GET_CODE (src) == COMPARE) > > { > > /* Assume comparison cost is the same. */ > > @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai > > else if (CONST_INT_P (src)) > > { > > if (REG_P (dst)) > > - gain += COSTS_N_INSNS (2); > > + /* DImode can be immediate for TARGET_64BIT and SImode always. 
*/ > > + igain += COSTS_N_INSNS (m); > > else if (MEM_P (dst)) > > - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > > - gain -= vector_const_cost (src); > > + igain += (m * ix86_cost->int_store[2] > > + - ix86_cost->sse_store[sse_cost_idx]); > > + igain -= vector_const_cost (src); > > } > > else > > gcc_unreachable (); > > + > > + if (igain != 0 && dump_file) > > + { > > + fprintf (dump_file, " Instruction gain %d for ", igain); > > + dump_insn_slim (dump_file, insn); > > + } > > + gain += igain; > > } > > > > if (dump_file) > > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > > > + /* ??? What about integer to SSE? */ > > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > > > @@ -570,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai > > /* Replace REG in X with a V2DI subreg of NEW_REG. */ > > > > rtx > > -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) > > +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) > > { > > if (x == reg) > > - return gen_rtx_SUBREG (V2DImode, new_reg, 0); > > + return gen_rtx_SUBREG (vmode, new_reg, 0); > > > > const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); > > int i, j; > > @@ -593,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg > > /* Replace REG in INSN with a V2DI subreg of NEW_REG. */ > > > > void > > -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, > > +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, > > rtx reg, rtx new_reg) > > { > > replace_with_subreg (single_set (insn), reg, new_reg); > > @@ -624,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx > > and replace its uses in a chain. 
*/ > > > > void > > -dimode_scalar_chain::make_vector_copies (unsigned regno) > > +general_scalar_chain::make_vector_copies (unsigned regno) > > { > > rtx reg = regno_reg_rtx[regno]; > > - rtx vreg = gen_reg_rtx (DImode); > > + rtx vreg = gen_reg_rtx (smode); > > df_ref ref; > > > > for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) > > @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies > > start_sequence (); > > if (!TARGET_INTER_UNIT_MOVES_TO_VEC) > > { > > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > > - emit_move_insn (adjust_address (tmp, SImode, 0), > > - gen_rtx_SUBREG (SImode, reg, 0)); > > - emit_move_insn (adjust_address (tmp, SImode, 4), > > - gen_rtx_SUBREG (SImode, reg, 4)); > > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > > + if (smode == DImode && !TARGET_64BIT) > > + { > > + emit_move_insn (adjust_address (tmp, SImode, 0), > > + gen_rtx_SUBREG (SImode, reg, 0)); > > + emit_move_insn (adjust_address (tmp, SImode, 4), > > + gen_rtx_SUBREG (SImode, reg, 4)); > > + } > > + else > > + emit_move_insn (tmp, reg); > > emit_move_insn (vreg, tmp); > > } > > - else if (TARGET_SSE4_1) > > + else if (!TARGET_64BIT && smode == DImode) > > { > > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > > - CONST0_RTX (V4SImode), > > - gen_rtx_SUBREG (SImode, reg, 0))); > > - emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > > - gen_rtx_SUBREG (V4SImode, vreg, 0), > > - gen_rtx_SUBREG (SImode, reg, 4), > > - GEN_INT (2))); > > + if (TARGET_SSE4_1) > > + { > > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > > + CONST0_RTX (V4SImode), > > + gen_rtx_SUBREG (SImode, reg, 0))); > > + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > > + gen_rtx_SUBREG (V4SImode, vreg, 0), > > + gen_rtx_SUBREG (SImode, reg, 4), > > + GEN_INT (2))); > > + } > > + else > > + { > > + rtx tmp = gen_reg_rtx (DImode); > > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG 
(V4SImode, vreg, 0), > > + CONST0_RTX (V4SImode), > > + gen_rtx_SUBREG (SImode, reg, 0))); > > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > > + CONST0_RTX (V4SImode), > > + gen_rtx_SUBREG (SImode, reg, 4))); > > + emit_insn (gen_vec_interleave_lowv4si > > + (gen_rtx_SUBREG (V4SImode, vreg, 0), > > + gen_rtx_SUBREG (V4SImode, vreg, 0), > > + gen_rtx_SUBREG (V4SImode, tmp, 0))); > > + } > > } > > else > > - { > > - rtx tmp = gen_reg_rtx (DImode); > > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > > - CONST0_RTX (V4SImode), > > - gen_rtx_SUBREG (SImode, reg, 0))); > > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > > - CONST0_RTX (V4SImode), > > - gen_rtx_SUBREG (SImode, reg, 4))); > > - emit_insn (gen_vec_interleave_lowv4si > > - (gen_rtx_SUBREG (V4SImode, vreg, 0), > > - gen_rtx_SUBREG (V4SImode, vreg, 0), > > - gen_rtx_SUBREG (V4SImode, tmp, 0))); > > - } > > + emit_move_insn (gen_lowpart (smode, vreg), reg); > > rtx_insn *seq = get_insns (); > > end_sequence (); > > rtx_insn *insn = DF_REF_INSN (ref); > > @@ -695,7 +743,7 @@ dimode_scalar_chain::make_vector_copies > > in case register is used in not convertible insn. 
*/ > > > > void > > -dimode_scalar_chain::convert_reg (unsigned regno) > > +general_scalar_chain::convert_reg (unsigned regno) > > { > > bool scalar_copy = bitmap_bit_p (defs_conv, regno); > > rtx reg = regno_reg_rtx[regno]; > > @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign > > bitmap_copy (conv, insns); > > > > if (scalar_copy) > > - scopy = gen_reg_rtx (DImode); > > + scopy = gen_reg_rtx (smode); > > > > for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) > > { > > @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign > > start_sequence (); > > if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) > > { > > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > > emit_move_insn (tmp, reg); > > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > > - adjust_address (tmp, SImode, 0)); > > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > > - adjust_address (tmp, SImode, 4)); > > + if (!TARGET_64BIT && smode == DImode) > > + { > > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > > + adjust_address (tmp, SImode, 0)); > > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > > + adjust_address (tmp, SImode, 4)); > > + } > > + else > > + emit_move_insn (scopy, tmp); > > } > > - else if (TARGET_SSE4_1) > > + else if (!TARGET_64BIT && smode == DImode) > > { > > - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); > > - emit_insn > > - (gen_rtx_SET > > - (gen_rtx_SUBREG (SImode, scopy, 0), > > - gen_rtx_VEC_SELECT (SImode, > > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > > - > > - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > > - emit_insn > > - (gen_rtx_SET > > - (gen_rtx_SUBREG (SImode, scopy, 4), > > - gen_rtx_VEC_SELECT (SImode, > > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > > + if (TARGET_SSE4_1) > > + { > > + rtx tmp = gen_rtx_PARALLEL (VOIDmode, > > + gen_rtvec (1, const0_rtx)); > > + emit_insn > > + (gen_rtx_SET > > 
+ (gen_rtx_SUBREG (SImode, scopy, 0), > > + gen_rtx_VEC_SELECT (SImode, > > + gen_rtx_SUBREG (V4SImode, reg, 0), > > + tmp))); > > + > > + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > > + emit_insn > > + (gen_rtx_SET > > + (gen_rtx_SUBREG (SImode, scopy, 4), > > + gen_rtx_VEC_SELECT (SImode, > > + gen_rtx_SUBREG (V4SImode, reg, 0), > > + tmp))); > > + } > > + else > > + { > > + rtx vcopy = gen_reg_rtx (V2DImode); > > + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > > + gen_rtx_SUBREG (SImode, vcopy, 0)); > > + emit_move_insn (vcopy, > > + gen_rtx_LSHIFTRT (V2DImode, > > + vcopy, GEN_INT (32))); > > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > > + gen_rtx_SUBREG (SImode, vcopy, 0)); > > + } > > } > > else > > - { > > - rtx vcopy = gen_reg_rtx (V2DImode); > > - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > > - gen_rtx_SUBREG (SImode, vcopy, 0)); > > - emit_move_insn (vcopy, > > - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); > > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > > - gen_rtx_SUBREG (SImode, vcopy, 0)); > > - } > > + emit_move_insn (scopy, reg); > > + > > rtx_insn *seq = get_insns (); > > end_sequence (); > > emit_conversion_insns (seq, insn); > > @@ -809,21 +872,21 @@ dimode_scalar_chain::convert_reg (unsign > > registers conversion. 
*/ > > > > void > > -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn) > > +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn) > > { > > *op = copy_rtx_if_shared (*op); > > > > if (GET_CODE (*op) == NOT) > > { > > convert_op (&XEXP (*op, 0), insn); > > - PUT_MODE (*op, V2DImode); > > + PUT_MODE (*op, vmode); > > } > > else if (MEM_P (*op)) > > { > > - rtx tmp = gen_reg_rtx (DImode); > > + rtx tmp = gen_reg_rtx (GET_MODE (*op)); > > > > emit_insn_before (gen_move_insn (tmp, *op), insn); > > - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); > > + *op = gen_rtx_SUBREG (vmode, tmp, 0); > > > > if (dump_file) > > fprintf (dump_file, " Preloading operand for insn %d into r%d\n", > > @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op > > gcc_assert (!DF_REF_CHAIN (ref)); > > break; > > } > > - *op = gen_rtx_SUBREG (V2DImode, *op, 0); > > + *op = gen_rtx_SUBREG (vmode, *op, 0); > > } > > else if (CONST_INT_P (*op)) > > { > > rtx vec_cst; > > - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); > > + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); > > > > /* Prefer all ones vector in case of -1. 
*/ > > if (constm1_operand (*op, GET_MODE (*op))) > > - vec_cst = CONSTM1_RTX (V2DImode); > > + vec_cst = CONSTM1_RTX (vmode); > > else > > - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, > > - gen_rtvec (2, *op, const0_rtx)); > > + { > > + unsigned n = GET_MODE_NUNITS (vmode); > > + rtx *v = XALLOCAVEC (rtx, n); > > + v[0] = *op; > > + for (unsigned i = 1; i < n; ++i) > > + v[i] = const0_rtx; > > + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); > > + } > > > > - if (!standard_sse_constant_p (vec_cst, V2DImode)) > > + if (!standard_sse_constant_p (vec_cst, vmode)) > > { > > start_sequence (); > > - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); > > + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); > > rtx_insn *seq = get_insns (); > > end_sequence (); > > emit_insn_before (seq, insn); > > @@ -870,14 +939,14 @@ dimode_scalar_chain::convert_op (rtx *op > > else > > { > > gcc_assert (SUBREG_P (*op)); > > - gcc_assert (GET_MODE (*op) == V2DImode); > > + gcc_assert (GET_MODE (*op) == vmode); > > } > > } > > > > /* Convert INSN to vector mode. */ > > > > void > > -dimode_scalar_chain::convert_insn (rtx_insn *insn) > > +general_scalar_chain::convert_insn (rtx_insn *insn) > > { > > rtx def_set = single_set (insn); > > rtx src = SET_SRC (def_set); > > @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i > > { > > /* There are no scalar integer instructions and therefore > > temporary register usage is required. 
*/ > > - rtx tmp = gen_reg_rtx (DImode); > > + rtx tmp = gen_reg_rtx (GET_MODE (dst)); > > emit_conversion_insns (gen_move_insn (dst, tmp), insn); > > - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); > > + dst = gen_rtx_SUBREG (vmode, tmp, 0); > > } > > > > switch (GET_CODE (src)) > > @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i > > case ASHIFTRT: > > case LSHIFTRT: > > convert_op (&XEXP (src, 0), insn); > > - PUT_MODE (src, V2DImode); > > + PUT_MODE (src, vmode); > > break; > > > > case PLUS: > > @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i > > case IOR: > > case XOR: > > case AND: > > + case SMAX: > > + case SMIN: > > + case UMAX: > > + case UMIN: > > convert_op (&XEXP (src, 0), insn); > > convert_op (&XEXP (src, 1), insn); > > - PUT_MODE (src, V2DImode); > > + PUT_MODE (src, vmode); > > break; > > > > case NEG: > > src = XEXP (src, 0); > > convert_op (&src, insn); > > - subreg = gen_reg_rtx (V2DImode); > > - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); > > - src = gen_rtx_MINUS (V2DImode, subreg, src); > > + subreg = gen_reg_rtx (vmode); > > + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); > > + src = gen_rtx_MINUS (vmode, subreg, src); > > break; > > > > case NOT: > > src = XEXP (src, 0); > > convert_op (&src, insn); > > - subreg = gen_reg_rtx (V2DImode); > > - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); > > - src = gen_rtx_XOR (V2DImode, src, subreg); > > + subreg = gen_reg_rtx (vmode); > > + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); > > + src = gen_rtx_XOR (vmode, src, subreg); > > break; > > > > case MEM: > > @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i > > break; > > > > case SUBREG: > > - gcc_assert (GET_MODE (src) == V2DImode); > > + gcc_assert (GET_MODE (src) == vmode); > > break; > > > > case COMPARE: > > src = SUBREG_REG (XEXP (XEXP (src, 0), 0)); > > > > - gcc_assert ((REG_P (src) && GET_MODE (src) == 
DImode) > > - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); > > + gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode)) > > + || (SUBREG_P (src) && GET_MODE (src) == vmode)); > > > > if (REG_P (src)) > > - subreg = gen_rtx_SUBREG (V2DImode, src, 0); > > + subreg = gen_rtx_SUBREG (vmode, src, 0); > > else > > subreg = copy_rtx_if_shared (src); > > emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), > > @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i > > PATTERN (insn) = def_set; > > > > INSN_CODE (insn) = -1; > > - recog_memoized (insn); > > + int patt = recog_memoized (insn); > > + if (patt == -1) > > + fatal_insn_not_found (insn); > > df_insn_rescan (insn); > > } > > > > @@ -1116,7 +1191,7 @@ timode_scalar_chain::convert_insn (rtx_i > > } > > > > void > > -dimode_scalar_chain::convert_registers () > > +general_scalar_chain::convert_registers () > > { > > bitmap_iterator bi; > > unsigned id; > > @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn > > (const_int 0 [0]))) */ > > > > static bool > > -convertible_comparison_p (rtx_insn *insn) > > +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) > > { > > if (!TARGET_SSE4_1) > > return false; > > @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn > > > > if (!SUBREG_P (op1) > > || !SUBREG_P (op2) > > - || GET_MODE (op1) != SImode > > - || GET_MODE (op2) != SImode > > + || GET_MODE (op1) != mode > > + || GET_MODE (op2) != mode > > || ((SUBREG_BYTE (op1) != 0 > > - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) > > + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) > > && (SUBREG_BYTE (op2) != 0 > > - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) > > + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) > > return false; > > > > op1 = SUBREG_REG (op1); > > @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn > > > > if (op1 != op2 > > || !REG_P (op1) > > - || GET_MODE (op1) != DImode) > > + || GET_MODE (op1) != 
GET_MODE_WIDER_MODE (mode).else_blk ()) > > return false; > > > > return true; > > @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn > > /* The DImode version of scalar_to_vector_candidate_p. */ > > > > static bool > > -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) > > +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) > > { > > rtx def_set = single_set (insn); > > > > @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx > > rtx dst = SET_DEST (def_set); > > > > if (GET_CODE (src) == COMPARE) > > - return convertible_comparison_p (insn); > > + return convertible_comparison_p (insn, mode); > > > > /* We are interested in DImode promotion only. */ > > - if ((GET_MODE (src) != DImode > > + if ((GET_MODE (src) != mode > > && !CONST_INT_P (src)) > > - || GET_MODE (dst) != DImode) > > + || GET_MODE (dst) != mode) > > return false; > > > > if (!REG_P (dst) && !MEM_P (dst)) > > @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx > > return false; > > break; > > > > + case SMAX: > > + case SMIN: > > + case UMAX: > > + case UMIN: > > + if ((mode == DImode && !TARGET_AVX512VL) > > + || (mode == SImode && !TARGET_SSE4_1)) > > + return false; > > + /* Fallthru. 
*/ > > + > > case PLUS: > > case MINUS: > > case IOR: > > @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx > > && !CONST_INT_P (XEXP (src, 1))) > > return false; > > > > - if (GET_MODE (XEXP (src, 1)) != DImode > > + if (GET_MODE (XEXP (src, 1)) != mode > > && !CONST_INT_P (XEXP (src, 1))) > > return false; > > break; > > @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx > > || !REG_P (XEXP (XEXP (src, 0), 0)))) > > return false; > > > > - if (GET_MODE (XEXP (src, 0)) != DImode > > + if (GET_MODE (XEXP (src, 0)) != mode > > && !CONST_INT_P (XEXP (src, 0))) > > return false; > > > > @@ -1383,22 +1467,16 @@ timode_scalar_to_vector_candidate_p (rtx > > return false; > > } > > > > -/* Return 1 if INSN may be converted into vector > > - instruction. */ > > - > > -static bool > > -scalar_to_vector_candidate_p (rtx_insn *insn) > > -{ > > - if (TARGET_64BIT) > > - return timode_scalar_to_vector_candidate_p (insn); > > - else > > - return dimode_scalar_to_vector_candidate_p (insn); > > -} > > +/* For a given bitmap of insn UIDs scans all instruction and > > + remove insn from CANDIDATES in case it has both convertible > > + and not convertible definitions. > > > > -/* The DImode version of remove_non_convertible_regs. */ > > + All insns in a bitmap are conversion candidates according to > > + scalar_to_vector_candidate_p. Currently it implies all insns > > + are single_set. */ > > > > static void > > -dimode_remove_non_convertible_regs (bitmap candidates) > > +general_remove_non_convertible_regs (bitmap candidates) > > { > > bitmap_iterator bi; > > unsigned id; > > @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm > > BITMAP_FREE (regs); > > } > > > > -/* For a given bitmap of insn UIDs scans all instruction and > > - remove insn from CANDIDATES in case it has both convertible > > - and not convertible definitions. > > - > > - All insns in a bitmap are conversion candidates according to > > - scalar_to_vector_candidate_p. 
Currently it implies all insns > > - are single_set. */ > > - > > -static void > > -remove_non_convertible_regs (bitmap candidates) > > -{ > > - if (TARGET_64BIT) > > - timode_remove_non_convertible_regs (candidates); > > - else > > - dimode_remove_non_convertible_regs (candidates); > > -} > > - > > /* Main STV pass function. Find and convert scalar > > instructions into vector mode when profitable. */ > > > > @@ -1577,11 +1638,14 @@ static unsigned int > > convert_scalars_to_vector () > > { > > basic_block bb; > > - bitmap candidates; > > int converted_insns = 0; > > > > bitmap_obstack_initialize (NULL); > > - candidates = BITMAP_ALLOC (NULL); > > + const machine_mode cand_mode[3] = { SImode, DImode, TImode }; > > + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode }; > > + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */ > > + for (unsigned i = 0; i < 3; ++i) > > + bitmap_initialize (&candidates[i], &bitmap_default_obstack); > > > > calculate_dominance_info (CDI_DOMINATORS); > > df_set_flags (DF_DEFER_INSN_RESCAN); > > @@ -1597,51 +1661,73 @@ convert_scalars_to_vector () > > { > > rtx_insn *insn; > > FOR_BB_INSNS (bb, insn) > > - if (scalar_to_vector_candidate_p (insn)) > > + if (TARGET_64BIT > > + && timode_scalar_to_vector_candidate_p (insn)) > > { > > if (dump_file) > > - fprintf (dump_file, " insn %d is marked as a candidate\n", > > + fprintf (dump_file, " insn %d is marked as a TImode candidate\n", > > INSN_UID (insn)); > > > > - bitmap_set_bit (candidates, INSN_UID (insn)); > > + bitmap_set_bit (&candidates[2], INSN_UID (insn)); > > + } > > + else > > + { > > + /* Check {SI,DI}mode. */ > > + for (unsigned i = 0; i <= 1; ++i) > > + if (general_scalar_to_vector_candidate_p (insn, cand_mode[i])) > > + { > > + if (dump_file) > > + fprintf (dump_file, " insn %d is marked as a %s candidate\n", > > + INSN_UID (insn), i == 0 ? 
"SImode" : "DImode"); > > + > > + bitmap_set_bit (&candidates[i], INSN_UID (insn)); > > + break; > > + } > > } > > } > > > > - remove_non_convertible_regs (candidates); > > + if (TARGET_64BIT) > > + timode_remove_non_convertible_regs (&candidates[2]); > > + for (unsigned i = 0; i <= 1; ++i) > > + general_remove_non_convertible_regs (&candidates[i]); > > > > - if (bitmap_empty_p (candidates)) > > - if (dump_file) > > + for (unsigned i = 0; i <= 2; ++i) > > + if (!bitmap_empty_p (&candidates[i])) > > + break; > > + else if (i == 2 && dump_file) > > fprintf (dump_file, "There are no candidates for optimization.\n"); > > > > - while (!bitmap_empty_p (candidates)) > > - { > > - unsigned uid = bitmap_first_set_bit (candidates); > > - scalar_chain *chain; > > + for (unsigned i = 0; i <= 2; ++i) > > + while (!bitmap_empty_p (&candidates[i])) > > + { > > + unsigned uid = bitmap_first_set_bit (&candidates[i]); > > + scalar_chain *chain; > > > > - if (TARGET_64BIT) > > - chain = new timode_scalar_chain; > > - else > > - chain = new dimode_scalar_chain; > > + if (cand_mode[i] == TImode) > > + chain = new timode_scalar_chain; > > + else > > + chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]); > > > > - /* Find instructions chain we want to convert to vector mode. > > - Check all uses and definitions to estimate all required > > - conversions. */ > > - chain->build (candidates, uid); > > + /* Find instructions chain we want to convert to vector mode. > > + Check all uses and definitions to estimate all required > > + conversions. 
*/ > > + chain->build (&candidates[i], uid); > > > > - if (chain->compute_convert_gain () > 0) > > - converted_insns += chain->convert (); > > - else > > - if (dump_file) > > - fprintf (dump_file, "Chain #%d conversion is not profitable\n", > > - chain->chain_id); > > + if (chain->compute_convert_gain () > 0) > > + converted_insns += chain->convert (); > > + else > > + if (dump_file) > > + fprintf (dump_file, "Chain #%d conversion is not profitable\n", > > + chain->chain_id); > > > > - delete chain; > > - } > > + delete chain; > > + } > > > > if (dump_file) > > fprintf (dump_file, "Total insns converted: %d\n", converted_insns); > > > > - BITMAP_FREE (candidates); > > + for (unsigned i = 0; i <= 2; ++i) > > + bitmap_release (&candidates[i]); > > bitmap_obstack_release (NULL); > > df_process_deferred_rescans (); > > > > Index: gcc/config/i386/i386-features.h > > =================================================================== > > --- gcc/config/i386/i386-features.h (revision 274111) > > +++ gcc/config/i386/i386-features.h (working copy) > > @@ -127,11 +127,16 @@ namespace { > > class scalar_chain > > { > > public: > > - scalar_chain (); > > + scalar_chain (enum machine_mode, enum machine_mode); > > virtual ~scalar_chain (); > > > > static unsigned max_id; > > > > + /* Scalar mode. */ > > + enum machine_mode smode; > > + /* Vector mode. */ > > + enum machine_mode vmode; > > + > > /* ID of a chain. */ > > unsigned int chain_id; > > /* A queue of instructions to be included into a chain. 
*/ > > @@ -159,9 +164,11 @@ class scalar_chain > > virtual void convert_registers () = 0; > > }; > > > > -class dimode_scalar_chain : public scalar_chain > > +class general_scalar_chain : public scalar_chain > > { > > public: > > + general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > > + : scalar_chain (smode_, vmode_) {} > > int compute_convert_gain (); > > private: > > void mark_dual_mode_def (df_ref def); > > @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala > > class timode_scalar_chain : public scalar_chain > > { > > public: > > + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {} > > + > > /* Convert from TImode to V1TImode is always faster. */ > > int compute_convert_gain () { return 1; } > > > > Index: gcc/config/i386/i386.md > > =================================================================== > > --- gcc/config/i386/i386.md (revision 274111) > > +++ gcc/config/i386/i386.md (working copy) > > @@ -17721,6 +17721,30 @@ (define_peephole2 > > std::swap (operands[4], operands[5]); > > }) > > > > +;; min/max patterns > > + > > +(define_code_attr maxmin_rel > > + [(smax "ge") (smin "le") (umax "geu") (umin "leu")]) > > +(define_code_attr maxmin_cmpmode > > + [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")]) > > + > > +(define_insn_and_split "<code><mode>3" > > + [(set (match_operand:SWI48 0 "register_operand") > > + (maxmin:SWI48 (match_operand:SWI48 1 "register_operand") > > + (match_operand:SWI48 2 "register_operand"))) > > + (clobber (reg:CC FLAGS_REG))] > > + "TARGET_STV && TARGET_SSE4_1 > > + && can_create_pseudo_p ()" > > + "#" > > + "&& 1" > > + [(set (reg:<maxmin_cmpmode> FLAGS_REG) > > + (compare:<maxmin_cmpmode> (match_dup 1)(match_dup 2))) > > + (set (match_dup 0) > > + (if_then_else:SWI48 > > + (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG)(const_int 0)) > > + (match_dup 1) > > + (match_dup 2)))]) > > + > > ;; Conditional addition patterns > > (define_expand "add<mode>cc" > > [(match_operand:SWI 0 
"register_operand") > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; > GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 61+ messages in thread
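For readers following the thread, the scalar shape being discussed can be sketched in plain C. This is an illustrative sketch, not code from the patch or the GCC testsuite (the function names `smax32` and `max_chain` are mine): the new `smaxsi3` pattern covers an ordinary signed 32-bit max, and the 456.hmmer hot loop is described above as SImode loads, an add, two signed maxes, and a store.

```c
#include <stdint.h>
#include <assert.h>

/* Signed 32-bit max -- the operation the smaxsi3 pattern implements.
   With SSE4.1 the vector alternative emits pmaxsd instead of cmp/cmov.  */
static int32_t
smax32 (int32_t a, int32_t b)
{
  return a > b ? a : b;
}

/* The 456.hmmer-style chain: SImode loads, an add, two signed maxes,
   and a store -- the kind of chain STV can now keep in SSE registers.  */
static void
max_chain (int32_t *dst, const int32_t *a, const int32_t *b, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = smax32 (smax32 (a[i] + b[i], 0), b[i]);
}
```

Compiled at `-O2 -msse4.1` with STV enabled, a chain like `max_chain` is the kind of code behind the roughly 10% 456.hmmer improvement reported in the thread.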
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-07 12:42 ` Uros Bizjak @ 2019-08-07 12:58 ` Uros Bizjak 2019-08-07 13:00 ` Richard Biener 0 siblings, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-07 12:58 UTC (permalink / raw) To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches On Wed, Aug 7, 2019 at 2:20 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Wed, Aug 7, 2019 at 1:51 PM Richard Biener <rguenther@suse.de> wrote: > > > > On Wed, 7 Aug 2019, Richard Biener wrote: > > > > > On Mon, 5 Aug 2019, Uros Bizjak wrote: > > > > > > > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > > > > > to force use of %zmmN? > > > > > > > > > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > > > > > > > > > case SMAX: > > > > > > > case SMIN: > > > > > > > case UMAX: > > > > > > > case UMIN: > > > > > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > > > > > || (mode == SImode && !TARGET_SSE4_1)) > > > > > > > return false; > > > > > > > > > > > > > > so there's no way to use AVX512VL for 32bit? > > > > > > > > > > > > There is a way, but on 32bit targets, we need to split DImode > > > > > > operation to a sequence of SImode operations for unconverted pattern. > > > > > > This is of course doable, but somehow more complex than simply > > > > > > emitting a DImode compare + DImode cmove, which is what current > > > > > > splitter does. So, a follow-up task. 
> > > > > > > > > > Ah, OK. So for the above condition we can elide the !TARGET_64BIT > > > > > check we just need to properly split if we enable the scalar minmax > > > > > pattern for DImode on 32bits, the STV conversion would go fine. > > > > > > > > Yes, that is correct. > > > > > > So I tested the patch below (now with appropriate ChangeLog) on > > > x86_64-unknown-linux-gnu. I've thrown it at SPEC CPU 2006 with > > > the obvious hmmer improvement, now checking for off-noise results > > > with a 3-run on those that may have one (with more than +-1 second > > > differences in the 1-run). > > > > > > As-is the patch likely runs into the splitting issue for DImode > > > on i?86 and the patch misses functional testcases. I'll do the > > > hmmer loop with both DImode and SImode and testcases to trigger > > > all pattern variants with the different ISAs we have. > > > > > > Some of the patch could be split out (the cost changes that are > > > also effective for DImode for example). > > > > > > AFAICS we could go with only adding SImode avoiding the DImode > > > splitting thing and this would solve the hmmer regression. > > > > I've additionally bootstrapped with --with-arch=nehalem which > > reveals > > > > FAIL: gcc.target/i386/minmax-2.c scan-assembler test > > FAIL: gcc.target/i386/minmax-2.c scan-assembler-not cmp > > > > we emit cmp + cmov here now with -msse4.1 (as soon as the max > > pattern is enabled I guess) > > Actually, we have to split using ix86_expand_int_compare. This will > generate optimized CC mode. So, this only matters for comparisons against zero. Currently, the insn_and_split pattern allows only registers, but we can add other types, too. I'd say that this is benign issue. Uros. ^ permalink raw reply [flat|nested] 61+ messages in thread
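The DImode point made above can be made concrete with a hedged sketch (not code from the patch; the function name `smax64_split` is illustrative). A packed signed 64-bit max only exists as the EVEX-encoded `vpmaxsq`, whose 128-bit form requires AVX512VL, and on 32-bit targets the current splitter's DImode compare plus DImode cmove fallback is unavailable, so the proposed _doubleword splitter would have to build the max from SImode halves, roughly:

```c
#include <stdint.h>
#include <assert.h>

/* A 64-bit signed max computed from 32-bit halves: the kind of
   sequence a -m32 doubleword splitter would need to emit when the
   chain is not converted to vpmaxsq.  Two's complement ordering:
   signed compare on the high halves, unsigned on the low halves.  */
static int64_t
smax64_split (int64_t a, int64_t b)
{
  int32_t ahi = (int32_t) (a >> 32);
  int32_t bhi = (int32_t) (b >> 32);
  uint32_t alo = (uint32_t) a;
  uint32_t blo = (uint32_t) b;

  if (ahi != bhi)
    return ahi > bhi ? a : b;
  return alo >= blo ? a : b;
}
```

On 64-bit targets a single DImode `cmp`/`cmov` pair covers this, which is why only the `!TARGET_64BIT` case needs the extra splitter work deferred as a follow-up above.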
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-07 12:58 ` Uros Bizjak @ 2019-08-07 13:00 ` Richard Biener 2019-08-07 13:32 ` Uros Bizjak 0 siblings, 1 reply; 61+ messages in thread From: Richard Biener @ 2019-08-07 13:00 UTC (permalink / raw) To: Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches On Wed, 7 Aug 2019, Uros Bizjak wrote: > On Wed, Aug 7, 2019 at 2:20 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > On Wed, Aug 7, 2019 at 1:51 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > On Wed, 7 Aug 2019, Richard Biener wrote: > > > > > > > On Mon, 5 Aug 2019, Uros Bizjak wrote: > > > > > > > > > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > > > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > > > > > > to force use of %zmmN? > > > > > > > > > > > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > > > > > > > > > > > case SMAX: > > > > > > > > case SMIN: > > > > > > > > case UMAX: > > > > > > > > case UMIN: > > > > > > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > > > > > > || (mode == SImode && !TARGET_SSE4_1)) > > > > > > > > return false; > > > > > > > > > > > > > > > > so there's no way to use AVX512VL for 32bit? > > > > > > > > > > > > > > There is a way, but on 32bit targets, we need to split DImode > > > > > > > operation to a sequence of SImode operations for unconverted pattern. 
> > > > > > > This is of course doable, but somehow more complex than simply > > > > > > > emitting a DImode compare + DImode cmove, which is what current > > > > > > > splitter does. So, a follow-up task. > > > > > > > > > > > > Ah, OK. So for the above condition we can elide the !TARGET_64BIT > > > > > > check we just need to properly split if we enable the scalar minmax > > > > > > pattern for DImode on 32bits, the STV conversion would go fine. > > > > > > > > > > Yes, that is correct. > > > > > > > > So I tested the patch below (now with appropriate ChangeLog) on > > > > x86_64-unknown-linux-gnu. I've thrown it at SPEC CPU 2006 with > > > > the obvious hmmer improvement, now checking for off-noise results > > > > with a 3-run on those that may have one (with more than +-1 second > > > > differences in the 1-run). > > > > > > > > As-is the patch likely runs into the splitting issue for DImode > > > > on i?86 and the patch misses functional testcases. I'll do the > > > > hmmer loop with both DImode and SImode and testcases to trigger > > > > all pattern variants with the different ISAs we have. > > > > > > > > Some of the patch could be split out (the cost changes that are > > > > also effective for DImode for example). > > > > > > > > AFAICS we could go with only adding SImode avoiding the DImode > > > > splitting thing and this would solve the hmmer regression. > > > > > > I've additionally bootstrapped with --with-arch=nehalem which > > > reveals > > > > > > FAIL: gcc.target/i386/minmax-2.c scan-assembler test > > > FAIL: gcc.target/i386/minmax-2.c scan-assembler-not cmp > > > > > > we emit cmp + cmov here now with -msse4.1 (as soon as the max > > > pattern is enabled I guess) > > > > Actually, we have to split using ix86_expand_int_compare. This will > > generate optimized CC mode. > > So, this only matters for comparisons against zero. Currently, the > insn_and_split pattern allows only registers, but we can add other > types, too. 
I'd say that this is benign issue. OK. So this is with your suggestions applied plus testcases as promised. If we remove DImode support minmax-5.c has to be adjusted at least. Currently re-bootstrapping / testing on x86_64-unknown-linux-gnu. I'll followup with the performance assessment (currently only testing on Haswell), but I guess it is easy enough to address issues that pop up with the various auto-testers as followup by adjusting the cost function (and we may get additional testcases then as well). OK if the re-testing shows no issues? Thanks, Richard. 2019-08-07 Richard Biener <rguenther@suse.de> PR target/91154 * config/i386/i386-features.h (scalar_chain::scalar_chain): Add mode arguments. (scalar_chain::smode): New member. (scalar_chain::vmode): Likewise. (dimode_scalar_chain): Rename to... (general_scalar_chain): ... this. (general_scalar_chain::general_scalar_chain): Take mode arguments. (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain base with TImode and V1TImode. * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust. (general_scalar_chain::vector_const_cost): Adjust for SImode chains. (general_scalar_chain::compute_convert_gain): Likewise. Fix reg-reg move cost gain, use ix86_cost->sse_op cost and adjust scalar costs. Add {S,U}{MIN,MAX} support. Dump per-instruction gain if not zero. (general_scalar_chain::replace_with_subreg): Use vmode/smode. (general_scalar_chain::make_vector_copies): Likewise. Handle non-DImode chains appropriately. (general_scalar_chain::convert_reg): Likewise. (general_scalar_chain::convert_op): Likewise. (general_scalar_chain::convert_insn): Likewise. Add fatal_insn_not_found if the result is not recognized. (convertible_comparison_p): Pass in the scalar mode and use that. (general_scalar_to_vector_candidate_p): Likewise. Rename from dimode_scalar_to_vector_candidate_p. Add {S,U}{MIN,MAX} support. (scalar_to_vector_candidate_p): Remove by inlining into single caller. 
(general_remove_non_convertible_regs): Rename from dimode_remove_non_convertible_regs. (remove_non_convertible_regs): Remove by inlining into single caller. (convert_scalars_to_vector): Handle SImode and DImode chains in addition to TImode chains. * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV. * gcc.target/i386/pr91154.c: New testcase. * gcc.target/i386/minmax-3.c: Likewise. * gcc.target/i386/minmax-4.c: Likewise. * gcc.target/i386/minmax-5.c: Likewise. Index: gcc/config/i386/i386-features.c =================================================================== --- gcc/config/i386/i386-features.c (revision 274111) +++ gcc/config/i386/i386-features.c (working copy) @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0; /* Initialize new chain. */ -scalar_chain::scalar_chain () +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) { + smode = smode_; + vmode = vmode_; + chain_id = ++max_id; if (dump_file) @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins conversion. */ void -dimode_scalar_chain::mark_dual_mode_def (df_ref def) +general_scalar_chain::mark_dual_mode_def (df_ref def) { gcc_assert (DF_REF_REG_DEF_P (def)); @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate && !HARD_REGISTER_P (SET_DEST (def_set))) bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); + /* ??? The following is quadratic since analyze_register_chain + iterates over all refs to look for dual-mode regs. Instead this + should be done separately for all regs mentioned in the chain once. */ df_ref ref; df_ref def; for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates, instead of using a scalar one. 
*/ int -dimode_scalar_chain::vector_const_cost (rtx exp) +general_scalar_chain::vector_const_cost (rtx exp) { gcc_assert (CONST_INT_P (exp)); - if (standard_sse_constant_p (exp, V2DImode)) - return COSTS_N_INSNS (1); - return ix86_cost->sse_load[1]; + if (standard_sse_constant_p (exp, vmode)) + return ix86_cost->sse_op; + /* We have separate costs for SImode and DImode, use SImode costs + for smaller modes. */ + return ix86_cost->sse_load[smode == DImode ? 1 : 0]; } /* Compute a gain for chain conversion. */ int -dimode_scalar_chain::compute_convert_gain () +general_scalar_chain::compute_convert_gain () { bitmap_iterator bi; unsigned insn_uid; @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai if (dump_file) fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); + /* SSE costs distinguish between SImode and DImode loads/stores, for + int costs factor in the number of GPRs involved. When supporting + smaller modes than SImode the int load/store costs need to be + adjusted as well. */ + unsigned sse_cost_idx = smode == DImode ? 1 : 0; + unsigned m = smode == DImode ? (TARGET_64BIT ? 
1 : 2) : 1; + EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) { rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; rtx def_set = single_set (insn); rtx src = SET_SRC (def_set); rtx dst = SET_DEST (def_set); + int igain = 0; if (REG_P (src) && REG_P (dst)) - gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move; + igain += 2 * m - ix86_cost->xmm_move; else if (REG_P (src) && MEM_P (dst)) - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; + igain + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; else if (MEM_P (src) && REG_P (dst)) - gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; else if (GET_CODE (src) == ASHIFT || GET_CODE (src) == ASHIFTRT || GET_CODE (src) == LSHIFTRT) { if (CONST_INT_P (XEXP (src, 0))) - gain -= vector_const_cost (XEXP (src, 0)); - gain += ix86_cost->shift_const; + igain -= vector_const_cost (XEXP (src, 0)); + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; if (INTVAL (XEXP (src, 1)) >= 32) - gain -= COSTS_N_INSNS (1); + igain -= COSTS_N_INSNS (1); } else if (GET_CODE (src) == PLUS || GET_CODE (src) == MINUS @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai || GET_CODE (src) == XOR || GET_CODE (src) == AND) { - gain += ix86_cost->add; + igain += m * ix86_cost->add - ix86_cost->sse_op; /* Additional gain for andnot for targets without BMI. 
*/ if (GET_CODE (XEXP (src, 0)) == NOT && !TARGET_BMI) - gain += 2 * ix86_cost->add; + igain += m * ix86_cost->add; if (CONST_INT_P (XEXP (src, 0))) - gain -= vector_const_cost (XEXP (src, 0)); + igain -= vector_const_cost (XEXP (src, 0)); if (CONST_INT_P (XEXP (src, 1))) - gain -= vector_const_cost (XEXP (src, 1)); + igain -= vector_const_cost (XEXP (src, 1)); } else if (GET_CODE (src) == NEG || GET_CODE (src) == NOT) - gain += ix86_cost->add - COSTS_N_INSNS (1); + igain += m * ix86_cost->add - ix86_cost->sse_op; + else if (GET_CODE (src) == SMAX + || GET_CODE (src) == SMIN + || GET_CODE (src) == UMAX + || GET_CODE (src) == UMIN) + { + /* We do not have any conditional move cost, estimate it as a + reg-reg move. Comparisons are costed as adds. */ + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); + /* Integer SSE ops are all costed the same. */ + igain -= ix86_cost->sse_op; + } else if (GET_CODE (src) == COMPARE) { /* Assume comparison cost is the same. */ @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai else if (CONST_INT_P (src)) { if (REG_P (dst)) - gain += COSTS_N_INSNS (2); + /* DImode can be immediate for TARGET_64BIT and SImode always. */ + igain += COSTS_N_INSNS (m); else if (MEM_P (dst)) - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; - gain -= vector_const_cost (src); + igain += (m * ix86_cost->int_store[2] + - ix86_cost->sse_store[sse_cost_idx]); + igain -= vector_const_cost (src); } else gcc_unreachable (); + + if (igain != 0 && dump_file) + { + fprintf (dump_file, " Instruction gain %d for ", igain); + dump_insn_slim (dump_file, insn); + } + gain += igain; } if (dump_file) fprintf (dump_file, " Instruction conversion gain: %d\n", gain); + /* ??? What about integer to SSE? */ EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; @@ -570,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai /* Replace REG in X with a V2DI subreg of NEW_REG. 
*/ rtx -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) { if (x == reg) - return gen_rtx_SUBREG (V2DImode, new_reg, 0); + return gen_rtx_SUBREG (vmode, new_reg, 0); const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); int i, j; @@ -593,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg /* Replace REG in INSN with a V2DI subreg of NEW_REG. */ void -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, rtx reg, rtx new_reg) { replace_with_subreg (single_set (insn), reg, new_reg); @@ -624,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx and replace its uses in a chain. */ void -dimode_scalar_chain::make_vector_copies (unsigned regno) +general_scalar_chain::make_vector_copies (unsigned regno) { rtx reg = regno_reg_rtx[regno]; - rtx vreg = gen_reg_rtx (DImode); + rtx vreg = gen_reg_rtx (smode); df_ref ref; for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies start_sequence (); if (!TARGET_INTER_UNIT_MOVES_TO_VEC) { - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); - emit_move_insn (adjust_address (tmp, SImode, 0), - gen_rtx_SUBREG (SImode, reg, 0)); - emit_move_insn (adjust_address (tmp, SImode, 4), - gen_rtx_SUBREG (SImode, reg, 4)); + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); + if (smode == DImode && !TARGET_64BIT) + { + emit_move_insn (adjust_address (tmp, SImode, 0), + gen_rtx_SUBREG (SImode, reg, 0)); + emit_move_insn (adjust_address (tmp, SImode, 4), + gen_rtx_SUBREG (SImode, reg, 4)); + } + else + emit_move_insn (tmp, reg); emit_move_insn (vreg, tmp); } - else if (TARGET_SSE4_1) + else if (!TARGET_64BIT && smode == DImode) { - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 0))); - emit_insn (gen_sse4_1_pinsrd 
(gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (SImode, reg, 4), - GEN_INT (2))); + if (TARGET_SSE4_1) + { + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 0))); + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (SImode, reg, 4), + GEN_INT (2))); + } + else + { + rtx tmp = gen_reg_rtx (DImode); + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 0))); + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 4))); + emit_insn (gen_vec_interleave_lowv4si + (gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, tmp, 0))); + } } else - { - rtx tmp = gen_reg_rtx (DImode); - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 0))); - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 4))); - emit_insn (gen_vec_interleave_lowv4si - (gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, tmp, 0))); - } + emit_move_insn (gen_lowpart (smode, vreg), reg); rtx_insn *seq = get_insns (); end_sequence (); rtx_insn *insn = DF_REF_INSN (ref); @@ -695,7 +743,7 @@ dimode_scalar_chain::make_vector_copies in case register is used in not convertible insn. 
*/ void -dimode_scalar_chain::convert_reg (unsigned regno) +general_scalar_chain::convert_reg (unsigned regno) { bool scalar_copy = bitmap_bit_p (defs_conv, regno); rtx reg = regno_reg_rtx[regno]; @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign bitmap_copy (conv, insns); if (scalar_copy) - scopy = gen_reg_rtx (DImode); + scopy = gen_reg_rtx (smode); for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) { @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign start_sequence (); if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) { - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); emit_move_insn (tmp, reg); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), - adjust_address (tmp, SImode, 0)); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), - adjust_address (tmp, SImode, 4)); + if (!TARGET_64BIT && smode == DImode) + { + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), + adjust_address (tmp, SImode, 0)); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), + adjust_address (tmp, SImode, 4)); + } + else + emit_move_insn (scopy, tmp); } - else if (TARGET_SSE4_1) + else if (!TARGET_64BIT && smode == DImode) { - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); - emit_insn - (gen_rtx_SET - (gen_rtx_SUBREG (SImode, scopy, 0), - gen_rtx_VEC_SELECT (SImode, - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); - - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); - emit_insn - (gen_rtx_SET - (gen_rtx_SUBREG (SImode, scopy, 4), - gen_rtx_VEC_SELECT (SImode, - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); + if (TARGET_SSE4_1) + { + rtx tmp = gen_rtx_PARALLEL (VOIDmode, + gen_rtvec (1, const0_rtx)); + emit_insn + (gen_rtx_SET + (gen_rtx_SUBREG (SImode, scopy, 0), + gen_rtx_VEC_SELECT (SImode, + gen_rtx_SUBREG (V4SImode, reg, 0), + tmp))); + + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); + emit_insn + (gen_rtx_SET + (gen_rtx_SUBREG (SImode, 
scopy, 4), + gen_rtx_VEC_SELECT (SImode, + gen_rtx_SUBREG (V4SImode, reg, 0), + tmp))); + } + else + { + rtx vcopy = gen_reg_rtx (V2DImode); + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), + gen_rtx_SUBREG (SImode, vcopy, 0)); + emit_move_insn (vcopy, + gen_rtx_LSHIFTRT (V2DImode, + vcopy, GEN_INT (32))); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), + gen_rtx_SUBREG (SImode, vcopy, 0)); + } } else - { - rtx vcopy = gen_reg_rtx (V2DImode); - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), - gen_rtx_SUBREG (SImode, vcopy, 0)); - emit_move_insn (vcopy, - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), - gen_rtx_SUBREG (SImode, vcopy, 0)); - } + emit_move_insn (scopy, reg); + rtx_insn *seq = get_insns (); end_sequence (); emit_conversion_insns (seq, insn); @@ -809,21 +872,21 @@ dimode_scalar_chain::convert_reg (unsign registers conversion. 
*/ void -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn) +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn) { *op = copy_rtx_if_shared (*op); if (GET_CODE (*op) == NOT) { convert_op (&XEXP (*op, 0), insn); - PUT_MODE (*op, V2DImode); + PUT_MODE (*op, vmode); } else if (MEM_P (*op)) { - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (*op)); emit_insn_before (gen_move_insn (tmp, *op), insn); - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); + *op = gen_rtx_SUBREG (vmode, tmp, 0); if (dump_file) fprintf (dump_file, " Preloading operand for insn %d into r%d\n", @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op gcc_assert (!DF_REF_CHAIN (ref)); break; } - *op = gen_rtx_SUBREG (V2DImode, *op, 0); + *op = gen_rtx_SUBREG (vmode, *op, 0); } else if (CONST_INT_P (*op)) { rtx vec_cst; - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); /* Prefer all ones vector in case of -1. */ if (constm1_operand (*op, GET_MODE (*op))) - vec_cst = CONSTM1_RTX (V2DImode); + vec_cst = CONSTM1_RTX (vmode); else - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, - gen_rtvec (2, *op, const0_rtx)); + { + unsigned n = GET_MODE_NUNITS (vmode); + rtx *v = XALLOCAVEC (rtx, n); + v[0] = *op; + for (unsigned i = 1; i < n; ++i) + v[i] = const0_rtx; + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); + } - if (!standard_sse_constant_p (vec_cst, V2DImode)) + if (!standard_sse_constant_p (vec_cst, vmode)) { start_sequence (); - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); rtx_insn *seq = get_insns (); end_sequence (); emit_insn_before (seq, insn); @@ -870,14 +939,14 @@ dimode_scalar_chain::convert_op (rtx *op else { gcc_assert (SUBREG_P (*op)); - gcc_assert (GET_MODE (*op) == V2DImode); + gcc_assert (GET_MODE (*op) == vmode); } } /* Convert INSN to vector mode. 
*/ void -dimode_scalar_chain::convert_insn (rtx_insn *insn) +general_scalar_chain::convert_insn (rtx_insn *insn) { rtx def_set = single_set (insn); rtx src = SET_SRC (def_set); @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i { /* There are no scalar integer instructions and therefore temporary register usage is required. */ - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (dst)); emit_conversion_insns (gen_move_insn (dst, tmp), insn); - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); + dst = gen_rtx_SUBREG (vmode, tmp, 0); } switch (GET_CODE (src)) @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i case ASHIFTRT: case LSHIFTRT: convert_op (&XEXP (src, 0), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case PLUS: @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i case IOR: case XOR: case AND: + case SMAX: + case SMIN: + case UMAX: + case UMIN: convert_op (&XEXP (src, 0), insn); convert_op (&XEXP (src, 1), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case NEG: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); - src = gen_rtx_MINUS (V2DImode, subreg, src); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); + src = gen_rtx_MINUS (vmode, subreg, src); break; case NOT: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); - src = gen_rtx_XOR (V2DImode, src, subreg); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); + src = gen_rtx_XOR (vmode, src, subreg); break; case MEM: @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i break; case SUBREG: - gcc_assert (GET_MODE (src) == V2DImode); + gcc_assert (GET_MODE (src) == vmode); break; case COMPARE: src = SUBREG_REG 
(XEXP (XEXP (src, 0), 0)); - gcc_assert ((REG_P (src) && GET_MODE (src) == DImode) - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); + gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode)) + || (SUBREG_P (src) && GET_MODE (src) == vmode)); if (REG_P (src)) - subreg = gen_rtx_SUBREG (V2DImode, src, 0); + subreg = gen_rtx_SUBREG (vmode, src, 0); else subreg = copy_rtx_if_shared (src); emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i PATTERN (insn) = def_set; INSN_CODE (insn) = -1; - recog_memoized (insn); + int patt = recog_memoized (insn); + if (patt == -1) + fatal_insn_not_found (insn); df_insn_rescan (insn); } @@ -1116,7 +1191,7 @@ timode_scalar_chain::convert_insn (rtx_i } void -dimode_scalar_chain::convert_registers () +general_scalar_chain::convert_registers () { bitmap_iterator bi; unsigned id; @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn (const_int 0 [0]))) */ static bool -convertible_comparison_p (rtx_insn *insn) +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) { if (!TARGET_SSE4_1) return false; @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn if (!SUBREG_P (op1) || !SUBREG_P (op2) - || GET_MODE (op1) != SImode - || GET_MODE (op2) != SImode + || GET_MODE (op1) != mode + || GET_MODE (op2) != mode || ((SUBREG_BYTE (op1) != 0 - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) && (SUBREG_BYTE (op2) != 0 - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) return false; op1 = SUBREG_REG (op1); @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn if (op1 != op2 || !REG_P (op1) - || GET_MODE (op1) != DImode) + || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ()) return false; return true; @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn /* The DImode version of 
scalar_to_vector_candidate_p. */ static bool -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) { rtx def_set = single_set (insn); @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx rtx dst = SET_DEST (def_set); if (GET_CODE (src) == COMPARE) - return convertible_comparison_p (insn); + return convertible_comparison_p (insn, mode); /* We are interested in DImode promotion only. */ - if ((GET_MODE (src) != DImode + if ((GET_MODE (src) != mode && !CONST_INT_P (src)) - || GET_MODE (dst) != DImode) + || GET_MODE (dst) != mode) return false; if (!REG_P (dst) && !MEM_P (dst)) @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx return false; break; + case SMAX: + case SMIN: + case UMAX: + case UMIN: + if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) + || (mode == SImode && !TARGET_SSE4_1)) + return false; + /* Fallthru. */ + case PLUS: case MINUS: case IOR: @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx && !CONST_INT_P (XEXP (src, 1))) return false; - if (GET_MODE (XEXP (src, 1)) != DImode + if (GET_MODE (XEXP (src, 1)) != mode && !CONST_INT_P (XEXP (src, 1))) return false; break; @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx || !REG_P (XEXP (XEXP (src, 0), 0)))) return false; - if (GET_MODE (XEXP (src, 0)) != DImode + if (GET_MODE (XEXP (src, 0)) != mode && !CONST_INT_P (XEXP (src, 0))) return false; @@ -1383,22 +1467,16 @@ timode_scalar_to_vector_candidate_p (rtx return false; } -/* Return 1 if INSN may be converted into vector - instruction. */ - -static bool -scalar_to_vector_candidate_p (rtx_insn *insn) -{ - if (TARGET_64BIT) - return timode_scalar_to_vector_candidate_p (insn); - else - return dimode_scalar_to_vector_candidate_p (insn); -} +/* For a given bitmap of insn UIDs scans all instruction and + remove insn from CANDIDATES in case it has both convertible + and not convertible definitions. 
-/* The DImode version of remove_non_convertible_regs. */ + All insns in a bitmap are conversion candidates according to + scalar_to_vector_candidate_p. Currently it implies all insns + are single_set. */ static void -dimode_remove_non_convertible_regs (bitmap candidates) +general_remove_non_convertible_regs (bitmap candidates) { bitmap_iterator bi; unsigned id; @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm BITMAP_FREE (regs); } -/* For a given bitmap of insn UIDs scans all instruction and - remove insn from CANDIDATES in case it has both convertible - and not convertible definitions. - - All insns in a bitmap are conversion candidates according to - scalar_to_vector_candidate_p. Currently it implies all insns - are single_set. */ - -static void -remove_non_convertible_regs (bitmap candidates) -{ - if (TARGET_64BIT) - timode_remove_non_convertible_regs (candidates); - else - dimode_remove_non_convertible_regs (candidates); -} - /* Main STV pass function. Find and convert scalar instructions into vector mode when profitable. 
*/ @@ -1577,11 +1638,14 @@ static unsigned int convert_scalars_to_vector () { basic_block bb; - bitmap candidates; int converted_insns = 0; bitmap_obstack_initialize (NULL); - candidates = BITMAP_ALLOC (NULL); + const machine_mode cand_mode[3] = { SImode, DImode, TImode }; + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode }; + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */ + for (unsigned i = 0; i < 3; ++i) + bitmap_initialize (&candidates[i], &bitmap_default_obstack); calculate_dominance_info (CDI_DOMINATORS); df_set_flags (DF_DEFER_INSN_RESCAN); @@ -1597,51 +1661,73 @@ convert_scalars_to_vector () { rtx_insn *insn; FOR_BB_INSNS (bb, insn) - if (scalar_to_vector_candidate_p (insn)) + if (TARGET_64BIT + && timode_scalar_to_vector_candidate_p (insn)) { if (dump_file) - fprintf (dump_file, " insn %d is marked as a candidate\n", + fprintf (dump_file, " insn %d is marked as a TImode candidate\n", INSN_UID (insn)); - bitmap_set_bit (candidates, INSN_UID (insn)); + bitmap_set_bit (&candidates[2], INSN_UID (insn)); + } + else + { + /* Check {SI,DI}mode. */ + for (unsigned i = 0; i <= 1; ++i) + if (general_scalar_to_vector_candidate_p (insn, cand_mode[i])) + { + if (dump_file) + fprintf (dump_file, " insn %d is marked as a %s candidate\n", + INSN_UID (insn), i == 0 ? 
"SImode" : "DImode"); + + bitmap_set_bit (&candidates[i], INSN_UID (insn)); + break; + } } } - remove_non_convertible_regs (candidates); + if (TARGET_64BIT) + timode_remove_non_convertible_regs (&candidates[2]); + for (unsigned i = 0; i <= 1; ++i) + general_remove_non_convertible_regs (&candidates[i]); - if (bitmap_empty_p (candidates)) - if (dump_file) + for (unsigned i = 0; i <= 2; ++i) + if (!bitmap_empty_p (&candidates[i])) + break; + else if (i == 2 && dump_file) fprintf (dump_file, "There are no candidates for optimization.\n"); - while (!bitmap_empty_p (candidates)) - { - unsigned uid = bitmap_first_set_bit (candidates); - scalar_chain *chain; + for (unsigned i = 0; i <= 2; ++i) + while (!bitmap_empty_p (&candidates[i])) + { + unsigned uid = bitmap_first_set_bit (&candidates[i]); + scalar_chain *chain; - if (TARGET_64BIT) - chain = new timode_scalar_chain; - else - chain = new dimode_scalar_chain; + if (cand_mode[i] == TImode) + chain = new timode_scalar_chain; + else + chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]); - /* Find instructions chain we want to convert to vector mode. - Check all uses and definitions to estimate all required - conversions. */ - chain->build (candidates, uid); + /* Find instructions chain we want to convert to vector mode. + Check all uses and definitions to estimate all required + conversions. 
*/ + chain->build (&candidates[i], uid); - if (chain->compute_convert_gain () > 0) - converted_insns += chain->convert (); - else - if (dump_file) - fprintf (dump_file, "Chain #%d conversion is not profitable\n", - chain->chain_id); + if (chain->compute_convert_gain () > 0) + converted_insns += chain->convert (); + else + if (dump_file) + fprintf (dump_file, "Chain #%d conversion is not profitable\n", + chain->chain_id); - delete chain; - } + delete chain; + } if (dump_file) fprintf (dump_file, "Total insns converted: %d\n", converted_insns); - BITMAP_FREE (candidates); + for (unsigned i = 0; i <= 2; ++i) + bitmap_release (&candidates[i]); bitmap_obstack_release (NULL); df_process_deferred_rescans (); Index: gcc/config/i386/i386-features.h =================================================================== --- gcc/config/i386/i386-features.h (revision 274111) +++ gcc/config/i386/i386-features.h (working copy) @@ -127,11 +127,16 @@ namespace { class scalar_chain { public: - scalar_chain (); + scalar_chain (enum machine_mode, enum machine_mode); virtual ~scalar_chain (); static unsigned max_id; + /* Scalar mode. */ + enum machine_mode smode; + /* Vector mode. */ + enum machine_mode vmode; + /* ID of a chain. */ unsigned int chain_id; /* A queue of instructions to be included into a chain. */ @@ -159,9 +164,11 @@ class scalar_chain virtual void convert_registers () = 0; }; -class dimode_scalar_chain : public scalar_chain +class general_scalar_chain : public scalar_chain { public: + general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) + : scalar_chain (smode_, vmode_) {} int compute_convert_gain (); private: void mark_dual_mode_def (df_ref def); @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala class timode_scalar_chain : public scalar_chain { public: + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {} + /* Convert from TImode to V1TImode is always faster. 
*/
   int compute_convert_gain () { return 1; }

Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 274111)
+++ gcc/config/i386/i386.md	(working copy)
@@ -17721,6 +17721,31 @@ (define_peephole2
    std::swap (operands[4], operands[5]);
 })
 
+;; min/max patterns
+
+(define_mode_iterator MAXMIN_IMODE
+  [(SI "TARGET_SSE4_1") (DI "TARGET_64BIT && TARGET_AVX512VL")])
+(define_code_attr maxmin_rel
+  [(smax "ge") (smin "le") (umax "geu") (umin "leu")])
+(define_code_attr maxmin_cmpmode
+  [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")])
+
+(define_insn_and_split "<code><mode>3"
+  [(set (match_operand:MAXMIN_IMODE 0 "register_operand")
+	(maxmin:MAXMIN_IMODE (match_operand:MAXMIN_IMODE 1 "register_operand")
+			     (match_operand:MAXMIN_IMODE 2 "register_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  "TARGET_STV && can_create_pseudo_p ()"
+  "#"
+  "&& 1"
+  [(set (reg:<maxmin_cmpmode> FLAGS_REG)
+	(compare:<maxmin_cmpmode> (match_dup 1)(match_dup 2)))
+   (set (match_dup 0)
+	(if_then_else:MAXMIN_IMODE
+	  (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG)(const_int 0))
+	  (match_dup 1)
+	  (match_dup 2)))])
+
 ;; Conditional addition patterns
 (define_expand "add<mode>cc"
   [(match_operand:SWI 0 "register_operand")

Index: gcc/testsuite/gcc.target/i386/pr91154.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr91154.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/pr91154.c	(working copy)
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse4.1 -mstv" } */
+
+void foo (int *dc, int *mc, int *tpdd, int *tpmd, int M)
+{
+  int sc;
+  int k;
+  for (k = 1; k <= M; k++)
+    {
+      dc[k] = dc[k-1] + tpdd[k-1];
+      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
+      if (dc[k] < -987654321) dc[k] = -987654321;
+    }
+}
+
+/* We want to convert the loop to SSE since SSE pmaxsd is faster than
+   compare + conditional move.  */
+/* { dg-final { scan-assembler-not "cmov" } } */
+/* { dg-final { scan-assembler-times "pmaxsd" 2 } } */
+/* { dg-final { scan-assembler-times "paddd" 2 } } */

Index: gcc/testsuite/gcc.target/i386/minmax-3.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-3.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-3.c	(working copy)
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv" } */
+
+#define max(a,b) (((a) > (b))? (a) : (b))
+#define min(a,b) (((a) < (b))? (a) : (b))
+
+int ssi[1024];
+unsigned int usi[1024];
+long long sdi[1024];
+unsigned long long udi[1024];
+
+#define CHECK(FN, VARIANT) \
+void \
+FN ## VARIANT (void) \
+{ \
+  for (int i = 1; i < 1024; ++i) \
+    VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \
+}
+
+CHECK(max, ssi);
+CHECK(min, ssi);
+CHECK(max, usi);
+CHECK(min, usi);
+CHECK(max, sdi);
+CHECK(min, sdi);
+CHECK(max, udi);
+CHECK(min, udi);

Index: gcc/testsuite/gcc.target/i386/minmax-4.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-4.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-4.c	(working copy)
@@ -0,0 +1,9 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv -msse4.1" } */
+
+#include "minmax-3.c"
+
+/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */
+/* { dg-final { scan-assembler-times "pmaxud" 1 } } */
+/* { dg-final { scan-assembler-times "pminsd" 1 } } */
+/* { dg-final { scan-assembler-times "pminud" 1 } } */

Index: gcc/testsuite/gcc.target/i386/minmax-5.c
===================================================================
--- gcc/testsuite/gcc.target/i386/minmax-5.c	(nonexistent)
+++ gcc/testsuite/gcc.target/i386/minmax-5.c	(working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mstv -mavx512vl" } */
+
+#include "minmax-3.c"
+
+/* { dg-final { scan-assembler-times "vpmaxsd" 1 } } */
+/* { dg-final { scan-assembler-times "vpmaxud" 1 } } */
+/* { dg-final { scan-assembler-times "vpminsd" 1 } } */
+/* { dg-final { scan-assembler-times "vpminud" 1 } } */
+/* { dg-final { scan-assembler-times "vpmaxsq" 1 { target lp64 } } } */
+/* { dg-final { scan-assembler-times "vpmaxuq" 1 { target lp64 } } } */
+/* { dg-final { scan-assembler-times "vpminsq" 1 { target lp64 } } } */
+/* { dg-final { scan-assembler-times "vpminuq" 1 { target lp64 } } } */
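[Editorial aside: the pr91154.c testcase above encodes the hot loop of 456.hmmer. Written with explicit max operations, each iteration is one addition plus two signed maxima — which is exactly the shape STV maps onto one paddd and two pmaxsd. A minimal scalar sketch of that equivalence follows; the names `smax` and `foo_ref` are illustrative, not part of the patch.]

```c
#include <assert.h>

/* Signed 32-bit maximum -- the operation the smax:SI pattern computes
   and that pmaxsd performs per vector lane.  */
static int
smax (int a, int b)
{
  return a > b ? a : b;
}

/* Scalar reference for the pr91154.c loop: the two conditional stores
   in the testcase collapse into two smax operations per iteration.  */
static void
foo_ref (int *dc, const int *mc, const int *tpdd, const int *tpmd, int M)
{
  for (int k = 1; k <= M; k++)
    {
      int sum = dc[k - 1] + tpdd[k - 1];          /* maps to paddd */
      sum = smax (sum, mc[k - 1] + tpmd[k - 1]);  /* maps to pmaxsd */
      dc[k] = smax (sum, -987654321);             /* maps to pmaxsd */
    }
}
```

The second smax clamps against the constant -987654321, mirroring the `if (dc[k] < -987654321)` guard in the testcase, so both conditional branches disappear into straight-line max operations.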
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-07 13:00 ` Richard Biener @ 2019-08-07 13:32 ` Uros Bizjak 0 siblings, 0 replies; 61+ messages in thread From: Uros Bizjak @ 2019-08-07 13:32 UTC (permalink / raw) To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches On Wed, Aug 7, 2019 at 2:52 PM Richard Biener <rguenther@suse.de> wrote: > > On Wed, 7 Aug 2019, Uros Bizjak wrote: > > > On Wed, Aug 7, 2019 at 2:20 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > On Wed, Aug 7, 2019 at 1:51 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > > > On Wed, 7 Aug 2019, Richard Biener wrote: > > > > > > > > > On Mon, 5 Aug 2019, Uros Bizjak wrote: > > > > > > > > > > > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > > > > > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > > > > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > > > > > > > to force use of %zmmN? > > > > > > > > > > > > > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > > > > > > > > > > > > > case SMAX: > > > > > > > > > case SMIN: > > > > > > > > > case UMAX: > > > > > > > > > case UMIN: > > > > > > > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > > > > > > > || (mode == SImode && !TARGET_SSE4_1)) > > > > > > > > > return false; > > > > > > > > > > > > > > > > > > so there's no way to use AVX512VL for 32bit? 
> > > > > > > > > > > > > > > > There is a way, but on 32bit targets, we need to split DImode > > > > > > > > operation to a sequence of SImode operations for unconverted pattern. > > > > > > > > This is of course doable, but somehow more complex than simply > > > > > > > > emitting a DImode compare + DImode cmove, which is what current > > > > > > > > splitter does. So, a follow-up task. > > > > > > > > > > > > > > Ah, OK. So for the above condition we can elide the !TARGET_64BIT > > > > > > > check we just need to properly split if we enable the scalar minmax > > > > > > > pattern for DImode on 32bits, the STV conversion would go fine. > > > > > > > > > > > > Yes, that is correct. > > > > > > > > > > So I tested the patch below (now with appropriate ChangeLog) on > > > > > x86_64-unknown-linux-gnu. I've thrown it at SPEC CPU 2006 with > > > > > the obvious hmmer improvement, now checking for off-noise results > > > > > with a 3-run on those that may have one (with more than +-1 second > > > > > differences in the 1-run). > > > > > > > > > > As-is the patch likely runs into the splitting issue for DImode > > > > > on i?86 and the patch misses functional testcases. I'll do the > > > > > hmmer loop with both DImode and SImode and testcases to trigger > > > > > all pattern variants with the different ISAs we have. > > > > > > > > > > Some of the patch could be split out (the cost changes that are > > > > > also effective for DImode for example). > > > > > > > > > > AFAICS we could go with only adding SImode avoiding the DImode > > > > > splitting thing and this would solve the hmmer regression. 
> > > > > > > > I've additionally bootstrapped with --with-arch=nehalem which > > > > reveals > > > > > > > > FAIL: gcc.target/i386/minmax-2.c scan-assembler test > > > > FAIL: gcc.target/i386/minmax-2.c scan-assembler-not cmp > > > > > > > > we emit cmp + cmov here now with -msse4.1 (as soon as the max > > > > pattern is enabled I guess) > > > > > > Actually, we have to split using ix86_expand_int_compare. This will > > > generate optimized CC mode. > > > > So, this only matters for comparisons against zero. Currently, the > > insn_and_split pattern allows only registers, but we can add other > > types, too. I'd say that this is benign issue. > > OK. So this is with your suggestions applied plus testcases as > promised. If we remove DImode support minmax-5.c has to be adjusted > at least. > > Currently re-bootstrapping / testing on x86_64-unknown-linux-gnu. > > I'll followup with the performance assessment (currently only > testing on Haswell), but I guess it is easy enough to address > issues that pop up with the various auto-testers as followup > by adjusting the cost function (and we may get additional testcases > then as well). > > OK if the re-testing shows no issues? > > Thanks, > Richard. > > 2019-08-07 Richard Biener <rguenther@suse.de> > > PR target/91154 > * config/i386/i386-features.h (scalar_chain::scalar_chain): Add > mode arguments. > (scalar_chain::smode): New member. > (scalar_chain::vmode): Likewise. > (dimode_scalar_chain): Rename to... > (general_scalar_chain): ... this. > (general_scalar_chain::general_scalar_chain): Take mode arguments. > (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain > base with TImode and V1TImode. > * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust. > (general_scalar_chain::vector_const_cost): Adjust for SImode > chains. > (general_scalar_chain::compute_convert_gain): Likewise. Fix > reg-reg move cost gain, use ix86_cost->sse_op cost and adjust > scalar costs. 
Add {S,U}{MIN,MAX} support. Dump per-instruction > gain if not zero. > (general_scalar_chain::replace_with_subreg): Use vmode/smode. > (general_scalar_chain::make_vector_copies): Likewise. Handle > non-DImode chains appropriately. > (general_scalar_chain::convert_reg): Likewise. > (general_scalar_chain::convert_op): Likewise. > (general_scalar_chain::convert_insn): Likewise. Add > fatal_insn_not_found if the result is not recognized. > (convertible_comparison_p): Pass in the scalar mode and use that. > (general_scalar_to_vector_candidate_p): Likewise. Rename from > dimode_scalar_to_vector_candidate_p. Add {S,U}{MIN,MAX} support. > (scalar_to_vector_candidate_p): Remove by inlining into single > caller. > (general_remove_non_convertible_regs): Rename from > dimode_remove_non_convertible_regs. > (remove_non_convertible_regs): Remove by inlining into single caller. > (convert_scalars_to_vector): Handle SImode and DImode chains > in addition to TImode chains. > * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV. > > * gcc.target/i386/pr91154.c: New testcase. > * gcc.target/i386/minmax-3.c: Likewise. > * gcc.target/i386/minmax-4.c: Likewise. > * gcc.target/i386/minmax-5.c: Likewise. LGTM, perhaps someone with RTL background should also take a look. (I plan to enhance the new pattern in .md a bit once the patch landing settles.) Uros. > Index: gcc/config/i386/i386-features.c > =================================================================== > --- gcc/config/i386/i386-features.c (revision 274111) > +++ gcc/config/i386/i386-features.c (working copy) > @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0; > > /* Initialize new chain. */ > > -scalar_chain::scalar_chain () > +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > { > + smode = smode_; > + vmode = vmode_; > + > chain_id = ++max_id; > > if (dump_file) > @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins > conversion. 
*/ > > void > -dimode_scalar_chain::mark_dual_mode_def (df_ref def) > +general_scalar_chain::mark_dual_mode_def (df_ref def) > { > gcc_assert (DF_REF_REG_DEF_P (def)); > > @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate > && !HARD_REGISTER_P (SET_DEST (def_set))) > bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); > > + /* ??? The following is quadratic since analyze_register_chain > + iterates over all refs to look for dual-mode regs. Instead this > + should be done separately for all regs mentioned in the chain once. */ > df_ref ref; > df_ref def; > for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) > @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates, > instead of using a scalar one. */ > > int > -dimode_scalar_chain::vector_const_cost (rtx exp) > +general_scalar_chain::vector_const_cost (rtx exp) > { > gcc_assert (CONST_INT_P (exp)); > > - if (standard_sse_constant_p (exp, V2DImode)) > - return COSTS_N_INSNS (1); > - return ix86_cost->sse_load[1]; > + if (standard_sse_constant_p (exp, vmode)) > + return ix86_cost->sse_op; > + /* We have separate costs for SImode and DImode, use SImode costs > + for smaller modes. */ > + return ix86_cost->sse_load[smode == DImode ? 1 : 0]; > } > > /* Compute a gain for chain conversion. */ > > int > -dimode_scalar_chain::compute_convert_gain () > +general_scalar_chain::compute_convert_gain () > { > bitmap_iterator bi; > unsigned insn_uid; > @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai > if (dump_file) > fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); > > + /* SSE costs distinguish between SImode and DImode loads/stores, for > + int costs factor in the number of GPRs involved. When supporting > + smaller modes than SImode the int load/store costs need to be > + adjusted as well. */ > + unsigned sse_cost_idx = smode == DImode ? 1 : 0; > + unsigned m = smode == DImode ? (TARGET_64BIT ? 
1 : 2) : 1; > + > EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) > { > rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; > rtx def_set = single_set (insn); > rtx src = SET_SRC (def_set); > rtx dst = SET_DEST (def_set); > + int igain = 0; > > if (REG_P (src) && REG_P (dst)) > - gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move; > + igain += 2 * m - ix86_cost->xmm_move; > else if (REG_P (src) && MEM_P (dst)) > - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > + igain > + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; > else if (MEM_P (src) && REG_P (dst)) > - gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; > + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; > else if (GET_CODE (src) == ASHIFT > || GET_CODE (src) == ASHIFTRT > || GET_CODE (src) == LSHIFTRT) > { > if (CONST_INT_P (XEXP (src, 0))) > - gain -= vector_const_cost (XEXP (src, 0)); > - gain += ix86_cost->shift_const; > + igain -= vector_const_cost (XEXP (src, 0)); > + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; > if (INTVAL (XEXP (src, 1)) >= 32) > - gain -= COSTS_N_INSNS (1); > + igain -= COSTS_N_INSNS (1); > } > else if (GET_CODE (src) == PLUS > || GET_CODE (src) == MINUS > @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai > || GET_CODE (src) == XOR > || GET_CODE (src) == AND) > { > - gain += ix86_cost->add; > + igain += m * ix86_cost->add - ix86_cost->sse_op; > /* Additional gain for andnot for targets without BMI. 
*/ > if (GET_CODE (XEXP (src, 0)) == NOT > && !TARGET_BMI) > - gain += 2 * ix86_cost->add; > + igain += m * ix86_cost->add; > > if (CONST_INT_P (XEXP (src, 0))) > - gain -= vector_const_cost (XEXP (src, 0)); > + igain -= vector_const_cost (XEXP (src, 0)); > if (CONST_INT_P (XEXP (src, 1))) > - gain -= vector_const_cost (XEXP (src, 1)); > + igain -= vector_const_cost (XEXP (src, 1)); > } > else if (GET_CODE (src) == NEG > || GET_CODE (src) == NOT) > - gain += ix86_cost->add - COSTS_N_INSNS (1); > + igain += m * ix86_cost->add - ix86_cost->sse_op; > + else if (GET_CODE (src) == SMAX > + || GET_CODE (src) == SMIN > + || GET_CODE (src) == UMAX > + || GET_CODE (src) == UMIN) > + { > + /* We do not have any conditional move cost, estimate it as a > + reg-reg move. Comparisons are costed as adds. */ > + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); > + /* Integer SSE ops are all costed the same. */ > + igain -= ix86_cost->sse_op; > + } > else if (GET_CODE (src) == COMPARE) > { > /* Assume comparison cost is the same. */ > @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai > else if (CONST_INT_P (src)) > { > if (REG_P (dst)) > - gain += COSTS_N_INSNS (2); > + /* DImode can be immediate for TARGET_64BIT and SImode always. */ > + igain += COSTS_N_INSNS (m); > else if (MEM_P (dst)) > - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > - gain -= vector_const_cost (src); > + igain += (m * ix86_cost->int_store[2] > + - ix86_cost->sse_store[sse_cost_idx]); > + igain -= vector_const_cost (src); > } > else > gcc_unreachable (); > + > + if (igain != 0 && dump_file) > + { > + fprintf (dump_file, " Instruction gain %d for ", igain); > + dump_insn_slim (dump_file, insn); > + } > + gain += igain; > } > > if (dump_file) > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > + /* ??? What about integer to SSE? 
*/ > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > @@ -570,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai > /* Replace REG in X with a V2DI subreg of NEW_REG. */ > > rtx > -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) > +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) > { > if (x == reg) > - return gen_rtx_SUBREG (V2DImode, new_reg, 0); > + return gen_rtx_SUBREG (vmode, new_reg, 0); > > const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); > int i, j; > @@ -593,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg > /* Replace REG in INSN with a V2DI subreg of NEW_REG. */ > > void > -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, > +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, > rtx reg, rtx new_reg) > { > replace_with_subreg (single_set (insn), reg, new_reg); > @@ -624,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx > and replace its uses in a chain. 
*/ > > void > -dimode_scalar_chain::make_vector_copies (unsigned regno) > +general_scalar_chain::make_vector_copies (unsigned regno) > { > rtx reg = regno_reg_rtx[regno]; > - rtx vreg = gen_reg_rtx (DImode); > + rtx vreg = gen_reg_rtx (smode); > df_ref ref; > > for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) > @@ -636,37 +674,47 @@ dimode_scalar_chain::make_vector_copies > start_sequence (); > if (!TARGET_INTER_UNIT_MOVES_TO_VEC) > { > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > - emit_move_insn (adjust_address (tmp, SImode, 0), > - gen_rtx_SUBREG (SImode, reg, 0)); > - emit_move_insn (adjust_address (tmp, SImode, 4), > - gen_rtx_SUBREG (SImode, reg, 4)); > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > + if (smode == DImode && !TARGET_64BIT) > + { > + emit_move_insn (adjust_address (tmp, SImode, 0), > + gen_rtx_SUBREG (SImode, reg, 0)); > + emit_move_insn (adjust_address (tmp, SImode, 4), > + gen_rtx_SUBREG (SImode, reg, 4)); > + } > + else > + emit_move_insn (tmp, reg); > emit_move_insn (vreg, tmp); > } > - else if (TARGET_SSE4_1) > + else if (!TARGET_64BIT && smode == DImode) > { > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 0))); > - emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (SImode, reg, 4), > - GEN_INT (2))); > + if (TARGET_SSE4_1) > + { > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 0))); > + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (SImode, reg, 4), > + GEN_INT (2))); > + } > + else > + { > + rtx tmp = gen_reg_rtx (DImode); > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 0))); > + emit_insn 
(gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 4))); > + emit_insn (gen_vec_interleave_lowv4si > + (gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, tmp, 0))); > + } > } > else > - { > - rtx tmp = gen_reg_rtx (DImode); > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 0))); > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 4))); > - emit_insn (gen_vec_interleave_lowv4si > - (gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, tmp, 0))); > - } > + emit_move_insn (gen_lowpart (smode, vreg), reg); > rtx_insn *seq = get_insns (); > end_sequence (); > rtx_insn *insn = DF_REF_INSN (ref); > @@ -695,7 +743,7 @@ dimode_scalar_chain::make_vector_copies > in case register is used in not convertible insn. 
*/ > > void > -dimode_scalar_chain::convert_reg (unsigned regno) > +general_scalar_chain::convert_reg (unsigned regno) > { > bool scalar_copy = bitmap_bit_p (defs_conv, regno); > rtx reg = regno_reg_rtx[regno]; > @@ -707,7 +755,7 @@ dimode_scalar_chain::convert_reg (unsign > bitmap_copy (conv, insns); > > if (scalar_copy) > - scopy = gen_reg_rtx (DImode); > + scopy = gen_reg_rtx (smode); > > for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) > { > @@ -727,40 +775,55 @@ dimode_scalar_chain::convert_reg (unsign > start_sequence (); > if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) > { > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > emit_move_insn (tmp, reg); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > - adjust_address (tmp, SImode, 0)); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > - adjust_address (tmp, SImode, 4)); > + if (!TARGET_64BIT && smode == DImode) > + { > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > + adjust_address (tmp, SImode, 0)); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > + adjust_address (tmp, SImode, 4)); > + } > + else > + emit_move_insn (scopy, tmp); > } > - else if (TARGET_SSE4_1) > + else if (!TARGET_64BIT && smode == DImode) > { > - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); > - emit_insn > - (gen_rtx_SET > - (gen_rtx_SUBREG (SImode, scopy, 0), > - gen_rtx_VEC_SELECT (SImode, > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > - > - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > - emit_insn > - (gen_rtx_SET > - (gen_rtx_SUBREG (SImode, scopy, 4), > - gen_rtx_VEC_SELECT (SImode, > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > + if (TARGET_SSE4_1) > + { > + rtx tmp = gen_rtx_PARALLEL (VOIDmode, > + gen_rtvec (1, const0_rtx)); > + emit_insn > + (gen_rtx_SET > + (gen_rtx_SUBREG (SImode, scopy, 0), > + gen_rtx_VEC_SELECT (SImode, > + gen_rtx_SUBREG (V4SImode, reg, 0), > + 
tmp))); > + > + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > + emit_insn > + (gen_rtx_SET > + (gen_rtx_SUBREG (SImode, scopy, 4), > + gen_rtx_VEC_SELECT (SImode, > + gen_rtx_SUBREG (V4SImode, reg, 0), > + tmp))); > + } > + else > + { > + rtx vcopy = gen_reg_rtx (V2DImode); > + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > + gen_rtx_SUBREG (SImode, vcopy, 0)); > + emit_move_insn (vcopy, > + gen_rtx_LSHIFTRT (V2DImode, > + vcopy, GEN_INT (32))); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > + gen_rtx_SUBREG (SImode, vcopy, 0)); > + } > } > else > - { > - rtx vcopy = gen_reg_rtx (V2DImode); > - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > - gen_rtx_SUBREG (SImode, vcopy, 0)); > - emit_move_insn (vcopy, > - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > - gen_rtx_SUBREG (SImode, vcopy, 0)); > - } > + emit_move_insn (scopy, reg); > + > rtx_insn *seq = get_insns (); > end_sequence (); > emit_conversion_insns (seq, insn); > @@ -809,21 +872,21 @@ dimode_scalar_chain::convert_reg (unsign > registers conversion. 
*/ > > void > -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn) > +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn) > { > *op = copy_rtx_if_shared (*op); > > if (GET_CODE (*op) == NOT) > { > convert_op (&XEXP (*op, 0), insn); > - PUT_MODE (*op, V2DImode); > + PUT_MODE (*op, vmode); > } > else if (MEM_P (*op)) > { > - rtx tmp = gen_reg_rtx (DImode); > + rtx tmp = gen_reg_rtx (GET_MODE (*op)); > > emit_insn_before (gen_move_insn (tmp, *op), insn); > - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); > + *op = gen_rtx_SUBREG (vmode, tmp, 0); > > if (dump_file) > fprintf (dump_file, " Preloading operand for insn %d into r%d\n", > @@ -841,24 +904,30 @@ dimode_scalar_chain::convert_op (rtx *op > gcc_assert (!DF_REF_CHAIN (ref)); > break; > } > - *op = gen_rtx_SUBREG (V2DImode, *op, 0); > + *op = gen_rtx_SUBREG (vmode, *op, 0); > } > else if (CONST_INT_P (*op)) > { > rtx vec_cst; > - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); > + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); > > /* Prefer all ones vector in case of -1. 
*/ > if (constm1_operand (*op, GET_MODE (*op))) > - vec_cst = CONSTM1_RTX (V2DImode); > + vec_cst = CONSTM1_RTX (vmode); > else > - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, > - gen_rtvec (2, *op, const0_rtx)); > + { > + unsigned n = GET_MODE_NUNITS (vmode); > + rtx *v = XALLOCAVEC (rtx, n); > + v[0] = *op; > + for (unsigned i = 1; i < n; ++i) > + v[i] = const0_rtx; > + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); > + } > > - if (!standard_sse_constant_p (vec_cst, V2DImode)) > + if (!standard_sse_constant_p (vec_cst, vmode)) > { > start_sequence (); > - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); > + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); > rtx_insn *seq = get_insns (); > end_sequence (); > emit_insn_before (seq, insn); > @@ -870,14 +939,14 @@ dimode_scalar_chain::convert_op (rtx *op > else > { > gcc_assert (SUBREG_P (*op)); > - gcc_assert (GET_MODE (*op) == V2DImode); > + gcc_assert (GET_MODE (*op) == vmode); > } > } > > /* Convert INSN to vector mode. */ > > void > -dimode_scalar_chain::convert_insn (rtx_insn *insn) > +general_scalar_chain::convert_insn (rtx_insn *insn) > { > rtx def_set = single_set (insn); > rtx src = SET_SRC (def_set); > @@ -888,9 +957,9 @@ dimode_scalar_chain::convert_insn (rtx_i > { > /* There are no scalar integer instructions and therefore > temporary register usage is required. 
*/ > - rtx tmp = gen_reg_rtx (DImode); > + rtx tmp = gen_reg_rtx (GET_MODE (dst)); > emit_conversion_insns (gen_move_insn (dst, tmp), insn); > - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); > + dst = gen_rtx_SUBREG (vmode, tmp, 0); > } > > switch (GET_CODE (src)) > @@ -899,7 +968,7 @@ dimode_scalar_chain::convert_insn (rtx_i > case ASHIFTRT: > case LSHIFTRT: > convert_op (&XEXP (src, 0), insn); > - PUT_MODE (src, V2DImode); > + PUT_MODE (src, vmode); > break; > > case PLUS: > @@ -907,25 +976,29 @@ dimode_scalar_chain::convert_insn (rtx_i > case IOR: > case XOR: > case AND: > + case SMAX: > + case SMIN: > + case UMAX: > + case UMIN: > convert_op (&XEXP (src, 0), insn); > convert_op (&XEXP (src, 1), insn); > - PUT_MODE (src, V2DImode); > + PUT_MODE (src, vmode); > break; > > case NEG: > src = XEXP (src, 0); > convert_op (&src, insn); > - subreg = gen_reg_rtx (V2DImode); > - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); > - src = gen_rtx_MINUS (V2DImode, subreg, src); > + subreg = gen_reg_rtx (vmode); > + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); > + src = gen_rtx_MINUS (vmode, subreg, src); > break; > > case NOT: > src = XEXP (src, 0); > convert_op (&src, insn); > - subreg = gen_reg_rtx (V2DImode); > - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); > - src = gen_rtx_XOR (V2DImode, src, subreg); > + subreg = gen_reg_rtx (vmode); > + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); > + src = gen_rtx_XOR (vmode, src, subreg); > break; > > case MEM: > @@ -939,17 +1012,17 @@ dimode_scalar_chain::convert_insn (rtx_i > break; > > case SUBREG: > - gcc_assert (GET_MODE (src) == V2DImode); > + gcc_assert (GET_MODE (src) == vmode); > break; > > case COMPARE: > src = SUBREG_REG (XEXP (XEXP (src, 0), 0)); > > - gcc_assert ((REG_P (src) && GET_MODE (src) == DImode) > - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); > + gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER 
(vmode)) > + || (SUBREG_P (src) && GET_MODE (src) == vmode)); > > if (REG_P (src)) > - subreg = gen_rtx_SUBREG (V2DImode, src, 0); > + subreg = gen_rtx_SUBREG (vmode, src, 0); > else > subreg = copy_rtx_if_shared (src); > emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), > @@ -977,7 +1050,9 @@ dimode_scalar_chain::convert_insn (rtx_i > PATTERN (insn) = def_set; > > INSN_CODE (insn) = -1; > - recog_memoized (insn); > + int patt = recog_memoized (insn); > + if (patt == -1) > + fatal_insn_not_found (insn); > df_insn_rescan (insn); > } > > @@ -1116,7 +1191,7 @@ timode_scalar_chain::convert_insn (rtx_i > } > > void > -dimode_scalar_chain::convert_registers () > +general_scalar_chain::convert_registers () > { > bitmap_iterator bi; > unsigned id; > @@ -1186,7 +1261,7 @@ has_non_address_hard_reg (rtx_insn *insn > (const_int 0 [0]))) */ > > static bool > -convertible_comparison_p (rtx_insn *insn) > +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) > { > if (!TARGET_SSE4_1) > return false; > @@ -1219,12 +1294,12 @@ convertible_comparison_p (rtx_insn *insn > > if (!SUBREG_P (op1) > || !SUBREG_P (op2) > - || GET_MODE (op1) != SImode > - || GET_MODE (op2) != SImode > + || GET_MODE (op1) != mode > + || GET_MODE (op2) != mode > || ((SUBREG_BYTE (op1) != 0 > - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) > + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) > && (SUBREG_BYTE (op2) != 0 > - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) > + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) > return false; > > op1 = SUBREG_REG (op1); > @@ -1232,7 +1307,7 @@ convertible_comparison_p (rtx_insn *insn > > if (op1 != op2 > || !REG_P (op1) > - || GET_MODE (op1) != DImode) > + || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ()) > return false; > > return true; > @@ -1241,7 +1316,7 @@ convertible_comparison_p (rtx_insn *insn > /* The DImode version of scalar_to_vector_candidate_p. 
*/ > > static bool > -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) > +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) > { > rtx def_set = single_set (insn); > > @@ -1255,12 +1330,12 @@ dimode_scalar_to_vector_candidate_p (rtx > rtx dst = SET_DEST (def_set); > > if (GET_CODE (src) == COMPARE) > - return convertible_comparison_p (insn); > + return convertible_comparison_p (insn, mode); > > /* We are interested in DImode promotion only. */ > - if ((GET_MODE (src) != DImode > + if ((GET_MODE (src) != mode > && !CONST_INT_P (src)) > - || GET_MODE (dst) != DImode) > + || GET_MODE (dst) != mode) > return false; > > if (!REG_P (dst) && !MEM_P (dst)) > @@ -1280,6 +1355,15 @@ dimode_scalar_to_vector_candidate_p (rtx > return false; > break; > > + case SMAX: > + case SMIN: > + case UMAX: > + case UMIN: > + if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > + || (mode == SImode && !TARGET_SSE4_1)) > + return false; > + /* Fallthru. */ > + > case PLUS: > case MINUS: > case IOR: > @@ -1290,7 +1374,7 @@ dimode_scalar_to_vector_candidate_p (rtx > && !CONST_INT_P (XEXP (src, 1))) > return false; > > - if (GET_MODE (XEXP (src, 1)) != DImode > + if (GET_MODE (XEXP (src, 1)) != mode > && !CONST_INT_P (XEXP (src, 1))) > return false; > break; > @@ -1319,7 +1403,7 @@ dimode_scalar_to_vector_candidate_p (rtx > || !REG_P (XEXP (XEXP (src, 0), 0)))) > return false; > > - if (GET_MODE (XEXP (src, 0)) != DImode > + if (GET_MODE (XEXP (src, 0)) != mode > && !CONST_INT_P (XEXP (src, 0))) > return false; > > @@ -1383,22 +1467,16 @@ timode_scalar_to_vector_candidate_p (rtx > return false; > } > > -/* Return 1 if INSN may be converted into vector > - instruction. 
*/ > - > -static bool > -scalar_to_vector_candidate_p (rtx_insn *insn) > -{ > - if (TARGET_64BIT) > - return timode_scalar_to_vector_candidate_p (insn); > - else > - return dimode_scalar_to_vector_candidate_p (insn); > -} > +/* For a given bitmap of insn UIDs scans all instruction and > + remove insn from CANDIDATES in case it has both convertible > + and not convertible definitions. > > -/* The DImode version of remove_non_convertible_regs. */ > + All insns in a bitmap are conversion candidates according to > + scalar_to_vector_candidate_p. Currently it implies all insns > + are single_set. */ > > static void > -dimode_remove_non_convertible_regs (bitmap candidates) > +general_remove_non_convertible_regs (bitmap candidates) > { > bitmap_iterator bi; > unsigned id; > @@ -1553,23 +1631,6 @@ timode_remove_non_convertible_regs (bitm > BITMAP_FREE (regs); > } > > -/* For a given bitmap of insn UIDs scans all instruction and > - remove insn from CANDIDATES in case it has both convertible > - and not convertible definitions. > - > - All insns in a bitmap are conversion candidates according to > - scalar_to_vector_candidate_p. Currently it implies all insns > - are single_set. */ > - > -static void > -remove_non_convertible_regs (bitmap candidates) > -{ > - if (TARGET_64BIT) > - timode_remove_non_convertible_regs (candidates); > - else > - dimode_remove_non_convertible_regs (candidates); > -} > - > /* Main STV pass function. Find and convert scalar > instructions into vector mode when profitable. 
*/ > > @@ -1577,11 +1638,14 @@ static unsigned int > convert_scalars_to_vector () > { > basic_block bb; > - bitmap candidates; > int converted_insns = 0; > > bitmap_obstack_initialize (NULL); > - candidates = BITMAP_ALLOC (NULL); > + const machine_mode cand_mode[3] = { SImode, DImode, TImode }; > + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode }; > + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */ > + for (unsigned i = 0; i < 3; ++i) > + bitmap_initialize (&candidates[i], &bitmap_default_obstack); > > calculate_dominance_info (CDI_DOMINATORS); > df_set_flags (DF_DEFER_INSN_RESCAN); > @@ -1597,51 +1661,73 @@ convert_scalars_to_vector () > { > rtx_insn *insn; > FOR_BB_INSNS (bb, insn) > - if (scalar_to_vector_candidate_p (insn)) > + if (TARGET_64BIT > + && timode_scalar_to_vector_candidate_p (insn)) > { > if (dump_file) > - fprintf (dump_file, " insn %d is marked as a candidate\n", > + fprintf (dump_file, " insn %d is marked as a TImode candidate\n", > INSN_UID (insn)); > > - bitmap_set_bit (candidates, INSN_UID (insn)); > + bitmap_set_bit (&candidates[2], INSN_UID (insn)); > + } > + else > + { > + /* Check {SI,DI}mode. */ > + for (unsigned i = 0; i <= 1; ++i) > + if (general_scalar_to_vector_candidate_p (insn, cand_mode[i])) > + { > + if (dump_file) > + fprintf (dump_file, " insn %d is marked as a %s candidate\n", > + INSN_UID (insn), i == 0 ? 
"SImode" : "DImode"); > + > + bitmap_set_bit (&candidates[i], INSN_UID (insn)); > + break; > + } > } > } > > - remove_non_convertible_regs (candidates); > + if (TARGET_64BIT) > + timode_remove_non_convertible_regs (&candidates[2]); > + for (unsigned i = 0; i <= 1; ++i) > + general_remove_non_convertible_regs (&candidates[i]); > > - if (bitmap_empty_p (candidates)) > - if (dump_file) > + for (unsigned i = 0; i <= 2; ++i) > + if (!bitmap_empty_p (&candidates[i])) > + break; > + else if (i == 2 && dump_file) > fprintf (dump_file, "There are no candidates for optimization.\n"); > > - while (!bitmap_empty_p (candidates)) > - { > - unsigned uid = bitmap_first_set_bit (candidates); > - scalar_chain *chain; > + for (unsigned i = 0; i <= 2; ++i) > + while (!bitmap_empty_p (&candidates[i])) > + { > + unsigned uid = bitmap_first_set_bit (&candidates[i]); > + scalar_chain *chain; > > - if (TARGET_64BIT) > - chain = new timode_scalar_chain; > - else > - chain = new dimode_scalar_chain; > + if (cand_mode[i] == TImode) > + chain = new timode_scalar_chain; > + else > + chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]); > > - /* Find instructions chain we want to convert to vector mode. > - Check all uses and definitions to estimate all required > - conversions. */ > - chain->build (candidates, uid); > + /* Find instructions chain we want to convert to vector mode. > + Check all uses and definitions to estimate all required > + conversions. 
*/ > + chain->build (&candidates[i], uid); > > - if (chain->compute_convert_gain () > 0) > - converted_insns += chain->convert (); > - else > - if (dump_file) > - fprintf (dump_file, "Chain #%d conversion is not profitable\n", > - chain->chain_id); > + if (chain->compute_convert_gain () > 0) > + converted_insns += chain->convert (); > + else > + if (dump_file) > + fprintf (dump_file, "Chain #%d conversion is not profitable\n", > + chain->chain_id); > > - delete chain; > - } > + delete chain; > + } > > if (dump_file) > fprintf (dump_file, "Total insns converted: %d\n", converted_insns); > > - BITMAP_FREE (candidates); > + for (unsigned i = 0; i <= 2; ++i) > + bitmap_release (&candidates[i]); > bitmap_obstack_release (NULL); > df_process_deferred_rescans (); > > Index: gcc/config/i386/i386-features.h > =================================================================== > --- gcc/config/i386/i386-features.h (revision 274111) > +++ gcc/config/i386/i386-features.h (working copy) > @@ -127,11 +127,16 @@ namespace { > class scalar_chain > { > public: > - scalar_chain (); > + scalar_chain (enum machine_mode, enum machine_mode); > virtual ~scalar_chain (); > > static unsigned max_id; > > + /* Scalar mode. */ > + enum machine_mode smode; > + /* Vector mode. */ > + enum machine_mode vmode; > + > /* ID of a chain. */ > unsigned int chain_id; > /* A queue of instructions to be included into a chain. 
*/ > @@ -159,9 +164,11 @@ class scalar_chain > virtual void convert_registers () = 0; > }; > > -class dimode_scalar_chain : public scalar_chain > +class general_scalar_chain : public scalar_chain > { > public: > + general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > + : scalar_chain (smode_, vmode_) {} > int compute_convert_gain (); > private: > void mark_dual_mode_def (df_ref def); > @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala > class timode_scalar_chain : public scalar_chain > { > public: > + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {} > + > /* Convert from TImode to V1TImode is always faster. */ > int compute_convert_gain () { return 1; } > > Index: gcc/config/i386/i386.md > =================================================================== > --- gcc/config/i386/i386.md (revision 274111) > +++ gcc/config/i386/i386.md (working copy) > @@ -17721,6 +17721,31 @@ (define_peephole2 > std::swap (operands[4], operands[5]); > }) > > +;; min/max patterns > + > +(define_mode_iterator MAXMIN_IMODE > + [(SI "TARGET_SSE4_1") (DI "TARGET_64BIT && TARGET_AVX512VL")]) > +(define_code_attr maxmin_rel > + [(smax "ge") (smin "le") (umax "geu") (umin "leu")]) > +(define_code_attr maxmin_cmpmode > + [(smax "CCGC") (smin "CCGC") (umax "CC") (umin "CC")]) > + > +(define_insn_and_split "<code><mode>3" > + [(set (match_operand:MAXMIN_IMODE 0 "register_operand") > + (maxmin:MAXMIN_IMODE (match_operand:MAXMIN_IMODE 1 "register_operand") > + (match_operand:MAXMIN_IMODE 2 "register_operand"))) > + (clobber (reg:CC FLAGS_REG))] > + "TARGET_STV && can_create_pseudo_p ()" > + "#" > + "&& 1" > + [(set (reg:<maxmin_cmpmode> FLAGS_REG) > + (compare:<maxmin_cmpmode> (match_dup 1)(match_dup 2))) > + (set (match_dup 0) > + (if_then_else:MAXMIN_IMODE > + (<maxmin_rel> (reg:<maxmin_cmpmode> FLAGS_REG)(const_int 0)) > + (match_dup 1) > + (match_dup 2)))]) > + > ;; Conditional addition patterns > (define_expand "add<mode>cc" > [(match_operand:SWI 0 
"register_operand") > Index: gcc/testsuite/gcc.target/i386/pr91154.c > =================================================================== > --- gcc/testsuite/gcc.target/i386/pr91154.c (nonexistent) > +++ gcc/testsuite/gcc.target/i386/pr91154.c (working copy) > @@ -0,0 +1,20 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -msse4.1 -mstv" } */ > + > +void foo (int *dc, int *mc, int *tpdd, int *tpmd, int M) > +{ > + int sc; > + int k; > + for (k = 1; k <= M; k++) > + { > + dc[k] = dc[k-1] + tpdd[k-1]; > + if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; > + if (dc[k] < -987654321) dc[k] = -987654321; > + } > +} > + > +/* We want to convert the loop to SSE since SSE pmaxsd is faster than > + compare + conditional move. */ > +/* { dg-final { scan-assembler-not "cmov" } } */ > +/* { dg-final { scan-assembler-times "pmaxsd" 2 } } */ > +/* { dg-final { scan-assembler-times "paddd" 2 } } */ > Index: gcc/testsuite/gcc.target/i386/minmax-3.c > =================================================================== > --- gcc/testsuite/gcc.target/i386/minmax-3.c (nonexistent) > +++ gcc/testsuite/gcc.target/i386/minmax-3.c (working copy) > @@ -0,0 +1,27 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mstv" } */ > + > +#define max(a,b) (((a) > (b))? (a) : (b)) > +#define min(a,b) (((a) < (b))? 
(a) : (b)) > + > +int ssi[1024]; > +unsigned int usi[1024]; > +long long sdi[1024]; > +unsigned long long udi[1024]; > + > +#define CHECK(FN, VARIANT) \ > +void \ > +FN ## VARIANT (void) \ > +{ \ > + for (int i = 1; i < 1024; ++i) \ > + VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \ > +} > + > +CHECK(max, ssi); > +CHECK(min, ssi); > +CHECK(max, usi); > +CHECK(min, usi); > +CHECK(max, sdi); > +CHECK(min, sdi); > +CHECK(max, udi); > +CHECK(min, udi); > Index: gcc/testsuite/gcc.target/i386/minmax-4.c > =================================================================== > --- gcc/testsuite/gcc.target/i386/minmax-4.c (nonexistent) > +++ gcc/testsuite/gcc.target/i386/minmax-4.c (working copy) > @@ -0,0 +1,9 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mstv -msse4.1" } */ > + > +#include "minmax-3.c" > + > +/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */ > +/* { dg-final { scan-assembler-times "pmaxud" 1 } } */ > +/* { dg-final { scan-assembler-times "pminsd" 1 } } */ > +/* { dg-final { scan-assembler-times "pminud" 1 } } */ > Index: gcc/testsuite/gcc.target/i386/minmax-5.c > =================================================================== > --- gcc/testsuite/gcc.target/i386/minmax-5.c (nonexistent) > +++ gcc/testsuite/gcc.target/i386/minmax-5.c (working copy) > @@ -0,0 +1,13 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mstv -mavx512vl" } */ > + > +#include "minmax-3.c" > + > +/* { dg-final { scan-assembler-times "vpmaxsd" 1 } } */ > +/* { dg-final { scan-assembler-times "vpmaxud" 1 } } */ > +/* { dg-final { scan-assembler-times "vpminsd" 1 } } */ > +/* { dg-final { scan-assembler-times "vpminud" 1 } } */ > +/* { dg-final { scan-assembler-times "vpmaxsq" 1 { target lp64 } } } */ > +/* { dg-final { scan-assembler-times "vpmaxuq" 1 { target lp64 } } } */ > +/* { dg-final { scan-assembler-times "vpminsq" 1 { target lp64 } } } */ > +/* { dg-final { scan-assembler-times "vpminuq" 1 { target lp64 } } } */ ^ permalink raw reply [flat|nested] 61+ 
messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-07 9:52 ` Richard Biener 2019-08-07 12:04 ` Richard Biener @ 2019-08-07 14:15 ` Richard Biener 1 sibling, 0 replies; 61+ messages in thread From: Richard Biener @ 2019-08-07 14:15 UTC (permalink / raw) To: Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches On Wed, 7 Aug 2019, Richard Biener wrote: > On Mon, 5 Aug 2019, Uros Bizjak wrote: > > > On Mon, Aug 5, 2019 at 3:29 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > > > to force use of %zmmN? > > > > > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > > > > > case SMAX: > > > > > case SMIN: > > > > > case UMAX: > > > > > case UMIN: > > > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > > > || (mode == SImode && !TARGET_SSE4_1)) > > > > > return false; > > > > > > > > > > so there's no way to use AVX512VL for 32bit? > > > > > > > > There is a way, but on 32bit targets, we need to split DImode > > > > operation to a sequence of SImode operations for unconverted pattern. > > > > This is of course doable, but somehow more complex than simply > > > > emitting a DImode compare + DImode cmove, which is what current > > > > splitter does. So, a follow-up task. > > > > > > Ah, OK. So for the above condition we can elide the !TARGET_64BIT > > > check we just need to properly split if we enable the scalar minmax > > > pattern for DImode on 32bits, the STV conversion would go fine. > > > > Yes, that is correct. 
>
> So I tested the patch below (now with appropriate ChangeLog) on
> x86_64-unknown-linux-gnu.  I've thrown it at SPEC CPU 2006 with
> the obvious hmmer improvement, now checking for off-noise results
> with a 3-run on those that may have one (with more than +-1 second
> differences in the 1-run).

Update on this one.  On Haswell I see (besides hmmer and the ones
+-1 second in the 1-run); base is unpatched, peak is patched:

401.bzip2      9650   382   25.3 S    9650   380   25.4 S
401.bzip2      9650   381   25.3 *    9650   377   25.6 *
401.bzip2      9650   381   25.3 S    9650   376   25.7 S
458.sjeng     12100   433   28.0 S   12100   433   28.0 S
458.sjeng     12100   428   28.3 S   12100   424   28.5 *
458.sjeng     12100   432   28.0 *   12100   424   28.6 S
464.h264ref   22130   413   53.6 S   22130   422   52.5 S
464.h264ref   22130   413   53.6 *   22130   421   52.5 S
464.h264ref   22130   413   53.6 S   22130   421   52.5 *
473.astar      7020   328   21.4 S    7020   316   22.2 S
473.astar      7020   322   21.8 S    7020   314   22.4 *
473.astar      7020   322   21.8 *    7020   311   22.6 S
416.gamess    19580   593   33.0 S   19580   601   32.6 S
416.gamess    19580   593   33.0 S   19580   601   32.6 *
416.gamess    19580   593   33.0 *   19580   601   32.6 S

so it's a loss for 464.h264ref and 416.gamess from the above numbers
and a slight win for the others (and a big one for 456.hmmer).  I plan
to have a look at the two as followup only, possibly adding a debug
counter to be able to bisect to a specific chain.

Richard.

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-05 13:09 ` Uros Bizjak 2019-08-05 13:29 ` Richard Biener @ 2019-08-09 7:28 ` Uros Bizjak 2019-08-09 10:13 ` Richard Biener 1 sibling, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-09 7:28 UTC (permalink / raw) To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches [-- Attachment #1: Type: text/plain, Size: 1734 bytes --] On Mon, Aug 5, 2019 at 3:09 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > to force use of %zmmN? > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > case SMAX: > > case SMIN: > > case UMAX: > > case UMIN: > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > || (mode == SImode && !TARGET_SSE4_1)) > > return false; > > > > so there's no way to use AVX512VL for 32bit? > > There is a way, but on 32bit targets, we need to split DImode > operation to a sequence of SImode operations for unconverted pattern. > This is of course doable, but somehow more complex than simply > emitting a DImode compare + DImode cmove, which is what current > splitter does. So, a follow-up task. Please find attached the complete .md part that enables SImode for TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit targets. The patterns also allows for memory operand 2, so STV has chance to create the vector pattern with implicit load. In case STV fails, the memory operand 2 is loaded to the register first; operand 2 is used in compare and cmove instruction, so pre-loading of the operand should be beneficial. 
Also note, that splitting should happen rarely. Due to the cost function, STV should effectively always convert minmax to a vector insn. Uros. [-- Attachment #2: maxmin-md.diff.txt --] [-- Type: text/plain, Size: 3376 bytes --] Index: config/i386/i386.md =================================================================== --- config/i386/i386.md (revision 274210) +++ config/i386/i386.md (working copy) @@ -17719,6 +17719,110 @@ (match_operand:SWI 3 "const_int_operand")] "" "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;") + +;; min/max patterns + +(define_mode_iterator MAXMIN_IMODE + [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")]) +(define_code_attr maxmin_rel + [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")]) + +(define_expand "<code><mode>3" + [(parallel + [(set (match_operand:MAXMIN_IMODE 0 "register_operand") + (maxmin:MAXMIN_IMODE + (match_operand:MAXMIN_IMODE 1 "register_operand") + (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))])] + "TARGET_STV") + +(define_insn_and_split "*<code><mode>3_1" + [(set (match_operand:MAXMIN_IMODE 0 "register_operand") + (maxmin:MAXMIN_IMODE + (match_operand:MAXMIN_IMODE 1 "register_operand") + (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))] + "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV + && can_create_pseudo_p ()" + "#" + "&& 1" + [(set (match_dup 0) + (if_then_else:MAXMIN_IMODE (match_dup 3) + (match_dup 1) + (match_dup 2)))] +{ + machine_mode mode = <MODE>mode; + + if (!register_operand (operands[2], mode)) + operands[2] = force_reg (mode, operands[2]); + + enum rtx_code code = <maxmin_rel>; + machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]); + rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG); + + rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]); + emit_insn (gen_rtx_SET (flags, tmp)); + + operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx); +}) + +(define_insn_and_split 
"*<code>di3_doubleword" + [(set (match_operand:DI 0 "register_operand") + (maxmin:DI (match_operand:DI 1 "register_operand") + (match_operand:DI 2 "nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))] + "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL + && can_create_pseudo_p ()" + "#" + "&& 1" + [(set (match_dup 0) + (if_then_else:SI (match_dup 6) + (match_dup 1) + (match_dup 2))) + (set (match_dup 3) + (if_then_else:SI (match_dup 6) + (match_dup 4) + (match_dup 5)))] +{ + if (!register_operand (operands[2], DImode)) + operands[2] = force_reg (DImode, operands[2]); + + split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]); + + rtx cmplo[2] = { operands[1], operands[2] }; + rtx cmphi[2] = { operands[4], operands[5] }; + + enum rtx_code code = <maxmin_rel>; + + switch (code) + { + case LE: case LEU: + std::swap (cmplo[0], cmplo[1]); + std::swap (cmphi[0], cmphi[1]); + code = swap_condition (code); + /* FALLTHRU */ + + case GE: case GEU: + { + bool uns = (code == GEU); + rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx) + = uns ? gen_sub3_carry_ccc : gen_sub3_carry_ccgz; + + emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1])); + + rtx tmp = gen_rtx_SCRATCH (SImode); + emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1])); + + rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG); + operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx); + + break; + } + + default: + gcc_unreachable (); + } +}) \f ;; Misc patterns (?) ^ permalink raw reply [flat|nested] 61+ messages in thread
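The cmp/sbb trick the doubleword splitter uses is worth spelling out: the 64-bit signed comparison is decomposed into an unsigned compare of the low halves, whose borrow feeds a subtract-with-borrow of the high halves. A rough C sketch of the predicate it computes (illustrative only, not GCC code; the function name is made up):

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative sketch of the comparison built by the doubleword
   splitter: cmp on the low 32-bit halves produces a borrow, sbb on
   the high halves consumes it, and the resulting flags encode the
   signed 64-bit "a >= b" without any 64-bit instruction.  */
static int64_t
smax_di_doubleword (int64_t a, int64_t b)
{
  uint32_t alo = (uint32_t) a, blo = (uint32_t) b;
  int32_t ahi = (int32_t) (a >> 32), bhi = (int32_t) (b >> 32);

  /* a >= b (signed)  <=>  ahi > bhi, or ahi == bhi and alo >= blo
     (unsigned on the low halves) - exactly what cmp + sbb encode.  */
  int a_ge_b = ahi > bhi || (ahi == bhi && alo >= blo);

  /* The split then emits one SImode cmove per half; a single
     conditional stands in for the pair here.  */
  return a_ge_b ? a : b;
}
```

Note that the sketch folds the two per-half cmoves of the split pattern into one C conditional; the flag computation is the part the splitter actually open-codes.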
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-09 7:28 ` Uros Bizjak @ 2019-08-09 10:13 ` Richard Biener 2019-08-09 10:26 ` Jakub Jelinek 2019-08-09 11:06 ` Richard Biener 0 siblings, 2 replies; 61+ messages in thread From: Richard Biener @ 2019-08-09 10:13 UTC (permalink / raw) To: Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches [-- Attachment #1: Type: text/plain, Size: 4339 bytes --] On Fri, 9 Aug 2019, Uros Bizjak wrote: > On Mon, Aug 5, 2019 at 3:09 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > to force use of %zmmN? > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > case SMAX: > > > case SMIN: > > > case UMAX: > > > case UMIN: > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > || (mode == SImode && !TARGET_SSE4_1)) > > > return false; > > > > > > so there's no way to use AVX512VL for 32bit? > > > > There is a way, but on 32bit targets, we need to split DImode > > operation to a sequence of SImode operations for unconverted pattern. > > This is of course doable, but somehow more complex than simply > > emitting a DImode compare + DImode cmove, which is what current > > splitter does. So, a follow-up task. > > Please find attached the complete .md part that enables SImode for > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit > targets. The patterns also allows for memory operand 2, so STV has > chance to create the vector pattern with implicit load. 
In case STV
> fails, the memory operand 2 is loaded to the register first; operand
> 2 is used in compare and cmove instruction, so pre-loading of the
> operand should be beneficial.

Thanks.

> Also note, that splitting should happen rarely. Due to the cost
> function, STV should effectively always convert minmax to a vector
> insn.

I've analyzed the 464.h264ref slowdown on Haswell and it is due to
this kind of "simple" conversion:

  5.50 │1d0:   test   %esi,%esi
  0.07 │       mov    $0x0,%eax
       │       cmovs  %eax,%esi
  5.84 │       imul   %r8d,%esi

to

  0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
  0.32 │       vpmaxsd -0x10(%rsp),%xmm0,%xmm0
 40.45 │       vmovd  %xmm0,%eax
  2.45 │       imul   %r8d,%eax

which looks like a RA artifact in the end.  We spill %esi only
with -mstv here as STV introduces a (subreg:V4SI ...) use
of a pseudo ultimately set from di.  STV creates an additional
pseudo for this (copy-in) but it places that copy next to the
original def rather than next to the start of the chain it
converts, which is probably the issue why we spill.  And this
is because it inserts those at each definition of the pseudo
rather than just at the reaching definition(s) or at the
uses of the pseudo in the chain (that is because there may be
defs of that pseudo in the chain itself).  Note that STV emits
such "conversion" copies as simple reg-reg moves:

(insn 1094 3 4 2 (set (reg:SI 777)
        (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
     (nil))

but those do not prevail very long (this one gets removed by CSE2).

So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
and computes

  r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
    a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618

so I wonder if STV shouldn't instead emit gpr->xmm moves
here (but I guess nothing again prevents RTL optimizers from
combining that with the single-use in the max instruction...).

So this boils down to STV splitting live-ranges but other
passes undoing that and then RA not considering splitting
live-ranges here, arriving at suboptimal allocation.

A testcase showing this issue is (simplified from 464.h264ref
UMVLine16Y_11):

unsigned short
UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
{
  if (y != width)
    {
      y = y < 0 ? 0 : y;
      return Pic[y * width];
    }
  return Pic[y];
}

where the condition and the Pic[y] load mimic the other use of y.
Different, even worse spilling is generated by

unsigned short
UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
{
  y = y < 0 ? 0 : y;
  return Pic[y * width] + y;
}

I guess this all shows that STV's "trick" of simply wrapping
integer mode pseudos in (subreg:vector-mode ...) is bad?

I've added a (failing) testcase to reflect the above.

Richard.

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-09 10:13 ` Richard Biener @ 2019-08-09 10:26 ` Jakub Jelinek 2019-08-09 11:15 ` Richard Biener 1 sibling, 1 reply; 61+ messages in thread From: Jakub Jelinek @ 2019-08-09 10:26 UTC (permalink / raw) To: Richard Biener; +Cc: Uros Bizjak, gcc-patches On Fri, Aug 09, 2019 at 11:25:30AM +0200, Richard Biener wrote:

> 0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
> 0.32 │       vpmaxsd -0x10(%rsp),%xmm0,%xmm0
> 40.45 │      vmovd  %xmm0,%eax
> 2.45 │       imul   %r8d,%eax

Shouldn't we hoist the vpxor before the loop?  Is it STV being done too late
that we don't do that anymore?  Couldn't e.g. STV itself detect that and put
the clearing instruction before the loop instead of right before the minmax?

	Jakub

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-09 10:26 ` Jakub Jelinek @ 2019-08-09 11:15 ` Richard Biener 0 siblings, 0 replies; 61+ messages in thread From: Richard Biener @ 2019-08-09 11:15 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Uros Bizjak, gcc-patches [-- Attachment #1: Type: text/plain, Size: 727 bytes --] On Fri, 9 Aug 2019, Jakub Jelinek wrote:

> On Fri, Aug 09, 2019 at 11:25:30AM +0200, Richard Biener wrote:
> > 0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
> > 0.32 │       vpmaxsd -0x10(%rsp),%xmm0,%xmm0
> > 40.45 │      vmovd  %xmm0,%eax
> > 2.45 │       imul   %r8d,%eax
>
> Shouldn't we hoist the vpxor before the loop?  Is it STV being done too late
> that we don't do that anymore?  Couldn't e.g. STV itself detect that and put
> the clearing instruction before the loop instead of right before the minmax?

This testcase doesn't have a loop.  Since the minmax patterns do not
allow constants we need to deal with this for the GPR case as well.
And we do when you look at the loop testcase.

Richard.

^ permalink raw reply	[flat|nested] 61+ messages in thread
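For reference, a minimal loop of the shape under discussion (a hypothetical distillation, in the spirit of the pr91154.c testcase earlier in the thread): the zero operand of the max has to be materialized in a register because the minmax patterns only accept register or memory operands, and since it is loop-invariant the vpxor (or the scalar mov $0x0) producing it can sit before the loop rather than inside it.

```c
#include <assert.h>

/* A clamp-to-zero loop of the kind STV converts to pmaxsd; the
   all-zero vector operand is loop-invariant, so the instruction
   materializing it belongs before the loop, not in the body.  */
void
clamp_low (int *a, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = a[i] < 0 ? 0 : a[i];
}
```
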
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-09 10:13 ` Richard Biener 2019-08-09 10:26 ` Jakub Jelinek @ 2019-08-09 11:06 ` Richard Biener 2019-08-09 13:13 ` Richard Biener 1 sibling, 1 reply; 61+ messages in thread From: Richard Biener @ 2019-08-09 11:06 UTC (permalink / raw) To: Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches [-- Attachment #1: Type: text/plain, Size: 43211 bytes --] On Fri, 9 Aug 2019, Richard Biener wrote: > On Fri, 9 Aug 2019, Uros Bizjak wrote: > > > On Mon, Aug 5, 2019 at 3:09 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > > to force use of %zmmN? > > > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > > > case SMAX: > > > > case SMIN: > > > > case UMAX: > > > > case UMIN: > > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > > || (mode == SImode && !TARGET_SSE4_1)) > > > > return false; > > > > > > > > so there's no way to use AVX512VL for 32bit? > > > > > > There is a way, but on 32bit targets, we need to split DImode > > > operation to a sequence of SImode operations for unconverted pattern. > > > This is of course doable, but somehow more complex than simply > > > emitting a DImode compare + DImode cmove, which is what current > > > splitter does. So, a follow-up task. > > > > Please find attached the complete .md part that enables SImode for > > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit > > targets. 
The patterns also allows for memory operand 2, so STV has
> > chance to create the vector pattern with implicit load.  In case STV
> > fails, the memory operand 2 is loaded to the register first; operand
> > 2 is used in compare and cmove instruction, so pre-loading of the
> > operand should be beneficial.
>
> Thanks.
>
> > Also note, that splitting should happen rarely. Due to the cost
> > function, STV should effectively always convert minmax to a vector
> > insn.
>
> I've analyzed the 464.h264ref slowdown on Haswell and it is due to
> this kind of "simple" conversion:
>
>   5.50 │1d0:   test   %esi,%esi
>   0.07 │       mov    $0x0,%eax
>        │       cmovs  %eax,%esi
>   5.84 │       imul   %r8d,%esi
>
> to
>
>   0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
>   0.32 │       vpmaxsd -0x10(%rsp),%xmm0,%xmm0
>  40.45 │       vmovd  %xmm0,%eax
>   2.45 │       imul   %r8d,%eax
>
> which looks like a RA artifact in the end.  We spill %esi only
> with -mstv here as STV introduces a (subreg:V4SI ...) use
> of a pseudo ultimately set from di.  STV creates an additional
> pseudo for this (copy-in) but it places that copy next to the
> original def rather than next to the start of the chain it
> converts, which is probably the issue why we spill.  And this
> is because it inserts those at each definition of the pseudo
> rather than just at the reaching definition(s) or at the
> uses of the pseudo in the chain (that is because there may be
> defs of that pseudo in the chain itself).  Note that STV emits
> such "conversion" copies as simple reg-reg moves:
>
> (insn 1094 3 4 2 (set (reg:SI 777)
>         (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
>      (nil))
>
> but those do not prevail very long (this one gets removed by CSE2).
>
> So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
> and computes
>
>   r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
>     a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618
>
> so I wonder if STV shouldn't instead emit gpr->xmm moves
> here (but I guess nothing again prevents RTL optimizers from
> combining that with the single-use in the max instruction...).
>
> So this boils down to STV splitting live-ranges but other
> passes undoing that and then RA not considering splitting
> live-ranges here, arriving at suboptimal allocation.
>
> A testcase showing this issue is (simplified from 464.h264ref
> UMVLine16Y_11):
>
> unsigned short
> UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> {
>   if (y != width)
>     {
>       y = y < 0 ? 0 : y;
>       return Pic[y * width];
>     }
>   return Pic[y];
> }
>
> where the condition and the Pic[y] load mimic the other use of y.
> Different, even worse spilling is generated by
>
> unsigned short
> UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> {
>   y = y < 0 ? 0 : y;
>   return Pic[y * width] + y;
> }
>
> I guess this all shows that STV's "trick" of simply wrapping
> integer mode pseudos in (subreg:vector-mode ...) is bad?
>
> I've added a (failing) testcase to reflect the above.

Experimenting a bit with using V4SImode pseudos just for the
conversion insns, we end up preserving those moves (but I do have
to use a lowpart set; using reg:V4SI = subreg:V4SI SImode-reg
ends up using movv4si_internal which only leaves us with memory
for the SImode operand) _plus_ moving the move next to the actual
use has an effect.  Not necessarily a good one though:

        vpxor   %xmm0, %xmm0, %xmm0
        vmovaps %xmm0, -16(%rsp)
        movl    %esi, -16(%rsp)
        vpmaxsd -16(%rsp), %xmm0, %xmm0
        vmovd   %xmm0, %eax

eh?  I guess the lowpart set is not good (my patch has this as
well, but I got saved by never having vector modes to subset...).
Using

  (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ]))
        (const_vector:V4SI [
            (const_int 0 [0]) repeated x4
          ])
        (const_int 1 [0x1]))) "t3.c":5:10 -1

for the move ends up with

        vpxor   %xmm1, %xmm1, %xmm1
        vpinsrd $0, %esi, %xmm1, %xmm0

eh?  LRA chooses the correct alternative here but somehow postreload
CSE CSEs the zero with the xmm1 clearing, leading to the vpinsrd...
(I guess a general issue, not sure if really worse - definitely a
larger instruction).  Unfortunately postreload-cse doesn't add a
REG_EQUAL note.  This happens only when emitting the reg move before
the use; not doing that emits a vmovd as expected.  At least the
spilling is gone here.

I am re-testing as follows; the main change is that
general_scalar_chain::make_vector_copies now generates a vector
pseudo as destination (and I've fixed up the code to not generate
(subreg:V4SI (reg:V4SI 1234) 0)).  Hope this fixes the observed
slowdowns (it fixes the new testcase).

Richard.

mccas.F:twotff_ for 416.gamess
refbuf.c:UMVLine16Y_11 for 464.h264ref

2019-08-07  Richard Biener  <rguenther@suse.de>

PR target/91154 * config/i386/i386-features.h (scalar_chain::scalar_chain): Add mode arguments. (scalar_chain::smode): New member. (scalar_chain::vmode): Likewise. (dimode_scalar_chain): Rename to... (general_scalar_chain): ... this. (general_scalar_chain::general_scalar_chain): Take mode arguments. (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain base with TImode and V1TImode. * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust. (general_scalar_chain::vector_const_cost): Adjust for SImode chains. (general_scalar_chain::compute_convert_gain): Likewise. Fix reg-reg move cost gain, use ix86_cost->sse_op cost and adjust scalar costs. Add {S,U}{MIN,MAX} support. Dump per-instruction gain if not zero. (general_scalar_chain::replace_with_subreg): Use vmode/smode. Elide the subreg if the reg is already vector. (general_scalar_chain::make_vector_copies): Likewise.
Handle non-DImode chains appropriately. Use a vector-mode pseudo as destination. (general_scalar_chain::convert_reg): Likewise. (general_scalar_chain::convert_op): Likewise. Elide the subreg if the reg is already vector. (general_scalar_chain::convert_insn): Likewise. Add fatal_insn_not_found if the result is not recognized. (convertible_comparison_p): Pass in the scalar mode and use that. (general_scalar_to_vector_candidate_p): Likewise. Rename from dimode_scalar_to_vector_candidate_p. Add {S,U}{MIN,MAX} support. (scalar_to_vector_candidate_p): Remove by inlining into single caller. (general_remove_non_convertible_regs): Rename from dimode_remove_non_convertible_regs. (remove_non_convertible_regs): Remove by inlining into single caller. (convert_scalars_to_vector): Handle SImode and DImode chains in addition to TImode chains. * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV. * gcc.target/i386/pr91154.c: New testcase. * gcc.target/i386/minmax-3.c: Likewise. * gcc.target/i386/minmax-4.c: Likewise. * gcc.target/i386/minmax-5.c: Likewise. * gcc.target/i386/minmax-6.c: Likewise. Index: gcc/config/i386/i386-features.c =================================================================== --- gcc/config/i386/i386-features.c (revision 274111) +++ gcc/config/i386/i386-features.c (working copy) @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0; /* Initialize new chain. */ -scalar_chain::scalar_chain () +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) { + smode = smode_; + vmode = vmode_; + chain_id = ++max_id; if (dump_file) @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins conversion. */ void -dimode_scalar_chain::mark_dual_mode_def (df_ref def) +general_scalar_chain::mark_dual_mode_def (df_ref def) { gcc_assert (DF_REF_REG_DEF_P (def)); @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate && !HARD_REGISTER_P (SET_DEST (def_set))) bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); + /* ??? 
The following is quadratic since analyze_register_chain + iterates over all refs to look for dual-mode regs. Instead this + should be done separately for all regs mentioned in the chain once. */ df_ref ref; df_ref def; for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates, instead of using a scalar one. */ int -dimode_scalar_chain::vector_const_cost (rtx exp) +general_scalar_chain::vector_const_cost (rtx exp) { gcc_assert (CONST_INT_P (exp)); - if (standard_sse_constant_p (exp, V2DImode)) - return COSTS_N_INSNS (1); - return ix86_cost->sse_load[1]; + if (standard_sse_constant_p (exp, vmode)) + return ix86_cost->sse_op; + /* We have separate costs for SImode and DImode, use SImode costs + for smaller modes. */ + return ix86_cost->sse_load[smode == DImode ? 1 : 0]; } /* Compute a gain for chain conversion. */ int -dimode_scalar_chain::compute_convert_gain () +general_scalar_chain::compute_convert_gain () { bitmap_iterator bi; unsigned insn_uid; @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai if (dump_file) fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); + /* SSE costs distinguish between SImode and DImode loads/stores, for + int costs factor in the number of GPRs involved. When supporting + smaller modes than SImode the int load/store costs need to be + adjusted as well. */ + unsigned sse_cost_idx = smode == DImode ? 1 : 0; + unsigned m = smode == DImode ? (TARGET_64BIT ? 
1 : 2) : 1; + EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) { rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; rtx def_set = single_set (insn); rtx src = SET_SRC (def_set); rtx dst = SET_DEST (def_set); + int igain = 0; if (REG_P (src) && REG_P (dst)) - gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move; + igain += 2 * m - ix86_cost->xmm_move; else if (REG_P (src) && MEM_P (dst)) - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; + igain + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; else if (MEM_P (src) && REG_P (dst)) - gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; else if (GET_CODE (src) == ASHIFT || GET_CODE (src) == ASHIFTRT || GET_CODE (src) == LSHIFTRT) { if (CONST_INT_P (XEXP (src, 0))) - gain -= vector_const_cost (XEXP (src, 0)); - gain += ix86_cost->shift_const; + igain -= vector_const_cost (XEXP (src, 0)); + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; if (INTVAL (XEXP (src, 1)) >= 32) - gain -= COSTS_N_INSNS (1); + igain -= COSTS_N_INSNS (1); } else if (GET_CODE (src) == PLUS || GET_CODE (src) == MINUS @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai || GET_CODE (src) == XOR || GET_CODE (src) == AND) { - gain += ix86_cost->add; + igain += m * ix86_cost->add - ix86_cost->sse_op; /* Additional gain for andnot for targets without BMI. 
*/ if (GET_CODE (XEXP (src, 0)) == NOT && !TARGET_BMI) - gain += 2 * ix86_cost->add; + igain += m * ix86_cost->add; if (CONST_INT_P (XEXP (src, 0))) - gain -= vector_const_cost (XEXP (src, 0)); + igain -= vector_const_cost (XEXP (src, 0)); if (CONST_INT_P (XEXP (src, 1))) - gain -= vector_const_cost (XEXP (src, 1)); + igain -= vector_const_cost (XEXP (src, 1)); } else if (GET_CODE (src) == NEG || GET_CODE (src) == NOT) - gain += ix86_cost->add - COSTS_N_INSNS (1); + igain += m * ix86_cost->add - ix86_cost->sse_op; + else if (GET_CODE (src) == SMAX + || GET_CODE (src) == SMIN + || GET_CODE (src) == UMAX + || GET_CODE (src) == UMIN) + { + /* We do not have any conditional move cost, estimate it as a + reg-reg move. Comparisons are costed as adds. */ + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); + /* Integer SSE ops are all costed the same. */ + igain -= ix86_cost->sse_op; + } else if (GET_CODE (src) == COMPARE) { /* Assume comparison cost is the same. */ @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai else if (CONST_INT_P (src)) { if (REG_P (dst)) - gain += COSTS_N_INSNS (2); + /* DImode can be immediate for TARGET_64BIT and SImode always. */ + igain += COSTS_N_INSNS (m); else if (MEM_P (dst)) - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; - gain -= vector_const_cost (src); + igain += (m * ix86_cost->int_store[2] + - ix86_cost->sse_store[sse_cost_idx]); + igain -= vector_const_cost (src); } else gcc_unreachable (); + + if (igain != 0 && dump_file) + { + fprintf (dump_file, " Instruction gain %d for ", igain); + dump_insn_slim (dump_file, insn); + } + gain += igain; } if (dump_file) fprintf (dump_file, " Instruction conversion gain: %d\n", gain); + /* ??? What about integer to SSE? */ EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; @@ -570,10 +608,11 @@ dimode_scalar_chain::compute_convert_gai /* Replace REG in X with a V2DI subreg of NEW_REG. 
*/ rtx -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) { if (x == reg) - return gen_rtx_SUBREG (V2DImode, new_reg, 0); + return (GET_MODE (new_reg) == vmode + ? new_reg : gen_rtx_SUBREG (vmode, new_reg, 0)); const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); int i, j; @@ -593,7 +632,7 @@ dimode_scalar_chain::replace_with_subreg /* Replace REG in INSN with a V2DI subreg of NEW_REG. */ void -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, rtx reg, rtx new_reg) { replace_with_subreg (single_set (insn), reg, new_reg); @@ -624,10 +663,10 @@ scalar_chain::emit_conversion_insns (rtx and replace its uses in a chain. */ void -dimode_scalar_chain::make_vector_copies (unsigned regno) +general_scalar_chain::make_vector_copies (unsigned regno) { rtx reg = regno_reg_rtx[regno]; - rtx vreg = gen_reg_rtx (DImode); + rtx vreg = gen_reg_rtx (vmode); df_ref ref; for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) @@ -636,36 +675,59 @@ dimode_scalar_chain::make_vector_copies start_sequence (); if (!TARGET_INTER_UNIT_MOVES_TO_VEC) { - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); - emit_move_insn (adjust_address (tmp, SImode, 0), - gen_rtx_SUBREG (SImode, reg, 0)); - emit_move_insn (adjust_address (tmp, SImode, 4), - gen_rtx_SUBREG (SImode, reg, 4)); - emit_move_insn (vreg, tmp); + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); + if (smode == DImode && !TARGET_64BIT) + { + emit_move_insn (adjust_address (tmp, SImode, 0), + gen_rtx_SUBREG (SImode, reg, 0)); + emit_move_insn (adjust_address (tmp, SImode, 4), + gen_rtx_SUBREG (SImode, reg, 4)); + } + else + emit_move_insn (tmp, reg); + emit_move_insn (vreg, + gen_rtx_VEC_MERGE (vmode, + gen_rtx_VEC_DUPLICATE (vmode, + tmp), + CONST0_RTX (vmode), + GEN_INT (HOST_WIDE_INT_1U))); + } - else if (TARGET_SSE4_1) + else if 
(!TARGET_64BIT && smode == DImode) { - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 0))); - emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (SImode, reg, 4), - GEN_INT (2))); + if (TARGET_SSE4_1) + { + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 0))); + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (SImode, reg, 4), + GEN_INT (2))); + } + else + { + rtx tmp = gen_reg_rtx (DImode); + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 0))); + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 4))); + emit_insn (gen_vec_interleave_lowv4si + (gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, tmp, 0))); + } } else { - rtx tmp = gen_reg_rtx (DImode); - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 0))); - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 4))); - emit_insn (gen_vec_interleave_lowv4si - (gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, tmp, 0))); + emit_move_insn (vreg, + gen_rtx_VEC_MERGE (vmode, + gen_rtx_VEC_DUPLICATE (vmode, + reg), + CONST0_RTX (vmode), + GEN_INT (HOST_WIDE_INT_1U))); } rtx_insn *seq = get_insns (); end_sequence (); @@ -695,7 +757,7 @@ dimode_scalar_chain::make_vector_copies in case register is used in not convertible insn. 
*/ void -dimode_scalar_chain::convert_reg (unsigned regno) +general_scalar_chain::convert_reg (unsigned regno) { bool scalar_copy = bitmap_bit_p (defs_conv, regno); rtx reg = regno_reg_rtx[regno]; @@ -707,7 +769,7 @@ dimode_scalar_chain::convert_reg (unsign bitmap_copy (conv, insns); if (scalar_copy) - scopy = gen_reg_rtx (DImode); + scopy = gen_reg_rtx (smode); for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) { @@ -727,40 +789,55 @@ dimode_scalar_chain::convert_reg (unsign start_sequence (); if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) { - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); emit_move_insn (tmp, reg); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), - adjust_address (tmp, SImode, 0)); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), - adjust_address (tmp, SImode, 4)); + if (!TARGET_64BIT && smode == DImode) + { + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), + adjust_address (tmp, SImode, 0)); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), + adjust_address (tmp, SImode, 4)); + } + else + emit_move_insn (scopy, tmp); } - else if (TARGET_SSE4_1) + else if (!TARGET_64BIT && smode == DImode) { - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); - emit_insn - (gen_rtx_SET - (gen_rtx_SUBREG (SImode, scopy, 0), - gen_rtx_VEC_SELECT (SImode, - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); - - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); - emit_insn - (gen_rtx_SET - (gen_rtx_SUBREG (SImode, scopy, 4), - gen_rtx_VEC_SELECT (SImode, - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); + if (TARGET_SSE4_1) + { + rtx tmp = gen_rtx_PARALLEL (VOIDmode, + gen_rtvec (1, const0_rtx)); + emit_insn + (gen_rtx_SET + (gen_rtx_SUBREG (SImode, scopy, 0), + gen_rtx_VEC_SELECT (SImode, + gen_rtx_SUBREG (V4SImode, reg, 0), + tmp))); + + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); + emit_insn + (gen_rtx_SET + (gen_rtx_SUBREG (SImode, 
scopy, 4), + gen_rtx_VEC_SELECT (SImode, + gen_rtx_SUBREG (V4SImode, reg, 0), + tmp))); + } + else + { + rtx vcopy = gen_reg_rtx (V2DImode); + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), + gen_rtx_SUBREG (SImode, vcopy, 0)); + emit_move_insn (vcopy, + gen_rtx_LSHIFTRT (V2DImode, + vcopy, GEN_INT (32))); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), + gen_rtx_SUBREG (SImode, vcopy, 0)); + } } else - { - rtx vcopy = gen_reg_rtx (V2DImode); - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), - gen_rtx_SUBREG (SImode, vcopy, 0)); - emit_move_insn (vcopy, - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), - gen_rtx_SUBREG (SImode, vcopy, 0)); - } + emit_move_insn (scopy, reg); + rtx_insn *seq = get_insns (); end_sequence (); emit_conversion_insns (seq, insn); @@ -809,21 +886,21 @@ dimode_scalar_chain::convert_reg (unsign registers conversion. 
*/ void -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn) +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn) { *op = copy_rtx_if_shared (*op); if (GET_CODE (*op) == NOT) { convert_op (&XEXP (*op, 0), insn); - PUT_MODE (*op, V2DImode); + PUT_MODE (*op, vmode); } else if (MEM_P (*op)) { - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (*op)); emit_insn_before (gen_move_insn (tmp, *op), insn); - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); + *op = gen_rtx_SUBREG (vmode, tmp, 0); if (dump_file) fprintf (dump_file, " Preloading operand for insn %d into r%d\n", @@ -841,24 +918,31 @@ dimode_scalar_chain::convert_op (rtx *op gcc_assert (!DF_REF_CHAIN (ref)); break; } - *op = gen_rtx_SUBREG (V2DImode, *op, 0); + if (GET_MODE (*op) != vmode) + *op = gen_rtx_SUBREG (vmode, *op, 0); } else if (CONST_INT_P (*op)) { rtx vec_cst; - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); /* Prefer all ones vector in case of -1. */ if (constm1_operand (*op, GET_MODE (*op))) - vec_cst = CONSTM1_RTX (V2DImode); + vec_cst = CONSTM1_RTX (vmode); else - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, - gen_rtvec (2, *op, const0_rtx)); + { + unsigned n = GET_MODE_NUNITS (vmode); + rtx *v = XALLOCAVEC (rtx, n); + v[0] = *op; + for (unsigned i = 1; i < n; ++i) + v[i] = const0_rtx; + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); + } - if (!standard_sse_constant_p (vec_cst, V2DImode)) + if (!standard_sse_constant_p (vec_cst, vmode)) { start_sequence (); - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); rtx_insn *seq = get_insns (); end_sequence (); emit_insn_before (seq, insn); @@ -870,14 +954,14 @@ dimode_scalar_chain::convert_op (rtx *op else { gcc_assert (SUBREG_P (*op)); - gcc_assert (GET_MODE (*op) == V2DImode); + gcc_assert (GET_MODE (*op) == vmode); } } /* Convert INSN to vector mode. 
*/ void -dimode_scalar_chain::convert_insn (rtx_insn *insn) +general_scalar_chain::convert_insn (rtx_insn *insn) { rtx def_set = single_set (insn); rtx src = SET_SRC (def_set); @@ -888,9 +972,9 @@ dimode_scalar_chain::convert_insn (rtx_i { /* There are no scalar integer instructions and therefore temporary register usage is required. */ - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (dst)); emit_conversion_insns (gen_move_insn (dst, tmp), insn); - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); + dst = gen_rtx_SUBREG (vmode, tmp, 0); } switch (GET_CODE (src)) @@ -899,7 +983,7 @@ dimode_scalar_chain::convert_insn (rtx_i case ASHIFTRT: case LSHIFTRT: convert_op (&XEXP (src, 0), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case PLUS: @@ -907,25 +991,29 @@ dimode_scalar_chain::convert_insn (rtx_i case IOR: case XOR: case AND: + case SMAX: + case SMIN: + case UMAX: + case UMIN: convert_op (&XEXP (src, 0), insn); convert_op (&XEXP (src, 1), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case NEG: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); - src = gen_rtx_MINUS (V2DImode, subreg, src); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); + src = gen_rtx_MINUS (vmode, subreg, src); break; case NOT: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); - src = gen_rtx_XOR (V2DImode, src, subreg); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); + src = gen_rtx_XOR (vmode, src, subreg); break; case MEM: @@ -939,17 +1027,17 @@ dimode_scalar_chain::convert_insn (rtx_i break; case SUBREG: - gcc_assert (GET_MODE (src) == V2DImode); + gcc_assert (GET_MODE (src) == vmode); break; case COMPARE: src = SUBREG_REG 
(XEXP (XEXP (src, 0), 0)); - gcc_assert ((REG_P (src) && GET_MODE (src) == DImode) - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); + gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode)) + || (SUBREG_P (src) && GET_MODE (src) == vmode)); if (REG_P (src)) - subreg = gen_rtx_SUBREG (V2DImode, src, 0); + subreg = gen_rtx_SUBREG (vmode, src, 0); else subreg = copy_rtx_if_shared (src); emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), @@ -977,7 +1065,9 @@ dimode_scalar_chain::convert_insn (rtx_i PATTERN (insn) = def_set; INSN_CODE (insn) = -1; - recog_memoized (insn); + int patt = recog_memoized (insn); + if (patt == -1) + fatal_insn_not_found (insn); df_insn_rescan (insn); } @@ -1116,7 +1206,7 @@ timode_scalar_chain::convert_insn (rtx_i } void -dimode_scalar_chain::convert_registers () +general_scalar_chain::convert_registers () { bitmap_iterator bi; unsigned id; @@ -1186,7 +1276,7 @@ has_non_address_hard_reg (rtx_insn *insn (const_int 0 [0]))) */ static bool -convertible_comparison_p (rtx_insn *insn) +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) { if (!TARGET_SSE4_1) return false; @@ -1219,12 +1309,12 @@ convertible_comparison_p (rtx_insn *insn if (!SUBREG_P (op1) || !SUBREG_P (op2) - || GET_MODE (op1) != SImode - || GET_MODE (op2) != SImode + || GET_MODE (op1) != mode + || GET_MODE (op2) != mode || ((SUBREG_BYTE (op1) != 0 - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) && (SUBREG_BYTE (op2) != 0 - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) return false; op1 = SUBREG_REG (op1); @@ -1232,7 +1322,7 @@ convertible_comparison_p (rtx_insn *insn if (op1 != op2 || !REG_P (op1) - || GET_MODE (op1) != DImode) + || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ()) return false; return true; @@ -1241,7 +1331,7 @@ convertible_comparison_p (rtx_insn *insn /* The DImode version of 
scalar_to_vector_candidate_p. */ static bool -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) { rtx def_set = single_set (insn); @@ -1255,12 +1345,12 @@ dimode_scalar_to_vector_candidate_p (rtx rtx dst = SET_DEST (def_set); if (GET_CODE (src) == COMPARE) - return convertible_comparison_p (insn); + return convertible_comparison_p (insn, mode); /* We are interested in DImode promotion only. */ - if ((GET_MODE (src) != DImode + if ((GET_MODE (src) != mode && !CONST_INT_P (src)) - || GET_MODE (dst) != DImode) + || GET_MODE (dst) != mode) return false; if (!REG_P (dst) && !MEM_P (dst)) @@ -1280,6 +1370,15 @@ dimode_scalar_to_vector_candidate_p (rtx return false; break; + case SMAX: + case SMIN: + case UMAX: + case UMIN: + if ((mode == DImode && !TARGET_AVX512VL) + || (mode == SImode && !TARGET_SSE4_1)) + return false; + /* Fallthru. */ + case PLUS: case MINUS: case IOR: @@ -1290,7 +1389,7 @@ dimode_scalar_to_vector_candidate_p (rtx && !CONST_INT_P (XEXP (src, 1))) return false; - if (GET_MODE (XEXP (src, 1)) != DImode + if (GET_MODE (XEXP (src, 1)) != mode && !CONST_INT_P (XEXP (src, 1))) return false; break; @@ -1319,7 +1418,7 @@ dimode_scalar_to_vector_candidate_p (rtx || !REG_P (XEXP (XEXP (src, 0), 0)))) return false; - if (GET_MODE (XEXP (src, 0)) != DImode + if (GET_MODE (XEXP (src, 0)) != mode && !CONST_INT_P (XEXP (src, 0))) return false; @@ -1383,22 +1482,16 @@ timode_scalar_to_vector_candidate_p (rtx return false; } -/* Return 1 if INSN may be converted into vector - instruction. */ - -static bool -scalar_to_vector_candidate_p (rtx_insn *insn) -{ - if (TARGET_64BIT) - return timode_scalar_to_vector_candidate_p (insn); - else - return dimode_scalar_to_vector_candidate_p (insn); -} +/* For a given bitmap of insn UIDs scans all instruction and + remove insn from CANDIDATES in case it has both convertible + and not convertible definitions. 
-/* The DImode version of remove_non_convertible_regs. */ + All insns in a bitmap are conversion candidates according to + scalar_to_vector_candidate_p. Currently it implies all insns + are single_set. */ static void -dimode_remove_non_convertible_regs (bitmap candidates) +general_remove_non_convertible_regs (bitmap candidates) { bitmap_iterator bi; unsigned id; @@ -1553,23 +1646,6 @@ timode_remove_non_convertible_regs (bitm BITMAP_FREE (regs); } -/* For a given bitmap of insn UIDs scans all instruction and - remove insn from CANDIDATES in case it has both convertible - and not convertible definitions. - - All insns in a bitmap are conversion candidates according to - scalar_to_vector_candidate_p. Currently it implies all insns - are single_set. */ - -static void -remove_non_convertible_regs (bitmap candidates) -{ - if (TARGET_64BIT) - timode_remove_non_convertible_regs (candidates); - else - dimode_remove_non_convertible_regs (candidates); -} - /* Main STV pass function. Find and convert scalar instructions into vector mode when profitable. 
*/ @@ -1577,11 +1653,14 @@ static unsigned int convert_scalars_to_vector () { basic_block bb; - bitmap candidates; int converted_insns = 0; bitmap_obstack_initialize (NULL); - candidates = BITMAP_ALLOC (NULL); + const machine_mode cand_mode[3] = { SImode, DImode, TImode }; + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode }; + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */ + for (unsigned i = 0; i < 3; ++i) + bitmap_initialize (&candidates[i], &bitmap_default_obstack); calculate_dominance_info (CDI_DOMINATORS); df_set_flags (DF_DEFER_INSN_RESCAN); @@ -1597,51 +1676,73 @@ convert_scalars_to_vector () { rtx_insn *insn; FOR_BB_INSNS (bb, insn) - if (scalar_to_vector_candidate_p (insn)) + if (TARGET_64BIT + && timode_scalar_to_vector_candidate_p (insn)) { if (dump_file) - fprintf (dump_file, " insn %d is marked as a candidate\n", + fprintf (dump_file, " insn %d is marked as a TImode candidate\n", INSN_UID (insn)); - bitmap_set_bit (candidates, INSN_UID (insn)); + bitmap_set_bit (&candidates[2], INSN_UID (insn)); + } + else + { + /* Check {SI,DI}mode. */ + for (unsigned i = 0; i <= 1; ++i) + if (general_scalar_to_vector_candidate_p (insn, cand_mode[i])) + { + if (dump_file) + fprintf (dump_file, " insn %d is marked as a %s candidate\n", + INSN_UID (insn), i == 0 ? 
"SImode" : "DImode"); + + bitmap_set_bit (&candidates[i], INSN_UID (insn)); + break; + } } } - remove_non_convertible_regs (candidates); + if (TARGET_64BIT) + timode_remove_non_convertible_regs (&candidates[2]); + for (unsigned i = 0; i <= 1; ++i) + general_remove_non_convertible_regs (&candidates[i]); - if (bitmap_empty_p (candidates)) - if (dump_file) + for (unsigned i = 0; i <= 2; ++i) + if (!bitmap_empty_p (&candidates[i])) + break; + else if (i == 2 && dump_file) fprintf (dump_file, "There are no candidates for optimization.\n"); - while (!bitmap_empty_p (candidates)) - { - unsigned uid = bitmap_first_set_bit (candidates); - scalar_chain *chain; + for (unsigned i = 0; i <= 2; ++i) + while (!bitmap_empty_p (&candidates[i])) + { + unsigned uid = bitmap_first_set_bit (&candidates[i]); + scalar_chain *chain; - if (TARGET_64BIT) - chain = new timode_scalar_chain; - else - chain = new dimode_scalar_chain; + if (cand_mode[i] == TImode) + chain = new timode_scalar_chain; + else + chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]); - /* Find instructions chain we want to convert to vector mode. - Check all uses and definitions to estimate all required - conversions. */ - chain->build (candidates, uid); + /* Find instructions chain we want to convert to vector mode. + Check all uses and definitions to estimate all required + conversions. 
*/ + chain->build (&candidates[i], uid); - if (chain->compute_convert_gain () > 0) - converted_insns += chain->convert (); - else - if (dump_file) - fprintf (dump_file, "Chain #%d conversion is not profitable\n", - chain->chain_id); + if (chain->compute_convert_gain () > 0) + converted_insns += chain->convert (); + else + if (dump_file) + fprintf (dump_file, "Chain #%d conversion is not profitable\n", + chain->chain_id); - delete chain; - } + delete chain; + } if (dump_file) fprintf (dump_file, "Total insns converted: %d\n", converted_insns); - BITMAP_FREE (candidates); + for (unsigned i = 0; i <= 2; ++i) + bitmap_release (&candidates[i]); bitmap_obstack_release (NULL); df_process_deferred_rescans (); Index: gcc/config/i386/i386-features.h =================================================================== --- gcc/config/i386/i386-features.h (revision 274111) +++ gcc/config/i386/i386-features.h (working copy) @@ -127,11 +127,16 @@ namespace { class scalar_chain { public: - scalar_chain (); + scalar_chain (enum machine_mode, enum machine_mode); virtual ~scalar_chain (); static unsigned max_id; + /* Scalar mode. */ + enum machine_mode smode; + /* Vector mode. */ + enum machine_mode vmode; + /* ID of a chain. */ unsigned int chain_id; /* A queue of instructions to be included into a chain. */ @@ -159,9 +164,11 @@ class scalar_chain virtual void convert_registers () = 0; }; -class dimode_scalar_chain : public scalar_chain +class general_scalar_chain : public scalar_chain { public: + general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) + : scalar_chain (smode_, vmode_) {} int compute_convert_gain (); private: void mark_dual_mode_def (df_ref def); @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala class timode_scalar_chain : public scalar_chain { public: + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {} + /* Convert from TImode to V1TImode is always faster. 
*/ int compute_convert_gain () { return 1; } Index: gcc/config/i386/i386.md =================================================================== --- gcc/config/i386/i386.md (revision 274111) +++ gcc/config/i386/i386.md (working copy) @@ -17729,6 +17729,110 @@ (define_expand "add<mode>cc" (match_operand:SWI 3 "const_int_operand")] "" "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;") + +;; min/max patterns + +(define_mode_iterator MAXMIN_IMODE + [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")]) +(define_code_attr maxmin_rel + [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")]) + +(define_expand "<code><mode>3" + [(parallel + [(set (match_operand:MAXMIN_IMODE 0 "register_operand") + (maxmin:MAXMIN_IMODE + (match_operand:MAXMIN_IMODE 1 "register_operand") + (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))])] + "TARGET_STV") + +(define_insn_and_split "*<code><mode>3_1" + [(set (match_operand:MAXMIN_IMODE 0 "register_operand") + (maxmin:MAXMIN_IMODE + (match_operand:MAXMIN_IMODE 1 "register_operand") + (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))] + "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV + && can_create_pseudo_p ()" + "#" + "&& 1" + [(set (match_dup 0) + (if_then_else:MAXMIN_IMODE (match_dup 3) + (match_dup 1) + (match_dup 2)))] +{ + machine_mode mode = <MODE>mode; + + if (!register_operand (operands[2], mode)) + operands[2] = force_reg (mode, operands[2]); + + enum rtx_code code = <maxmin_rel>; + machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]); + rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG); + + rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]); + emit_insn (gen_rtx_SET (flags, tmp)); + + operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx); +}) + +(define_insn_and_split "*<code>di3_doubleword" + [(set (match_operand:DI 0 "register_operand") + (maxmin:DI (match_operand:DI 1 "register_operand") + (match_operand:DI 2 
"nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))] + "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL + && can_create_pseudo_p ()" + "#" + "&& 1" + [(set (match_dup 0) + (if_then_else:SI (match_dup 6) + (match_dup 1) + (match_dup 2))) + (set (match_dup 3) + (if_then_else:SI (match_dup 6) + (match_dup 4) + (match_dup 5)))] +{ + if (!register_operand (operands[2], DImode)) + operands[2] = force_reg (DImode, operands[2]); + + split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]); + + rtx cmplo[2] = { operands[1], operands[2] }; + rtx cmphi[2] = { operands[4], operands[5] }; + + enum rtx_code code = <maxmin_rel>; + + switch (code) + { + case LE: case LEU: + std::swap (cmplo[0], cmplo[1]); + std::swap (cmphi[0], cmphi[1]); + code = swap_condition (code); + /* FALLTHRU */ + + case GE: case GEU: + { + bool uns = (code == GEU); + rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx) + = uns ? gen_sub3_carry_ccc : gen_sub3_carry_ccgz; + + emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1])); + + rtx tmp = gen_rtx_SCRATCH (SImode); + emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1])); + + rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG); + operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx); + + break; + } + + default: + gcc_unreachable (); + } +}) \f ;; Misc patterns (?) Index: gcc/testsuite/gcc.target/i386/minmax-3.c =================================================================== --- gcc/testsuite/gcc.target/i386/minmax-3.c (nonexistent) +++ gcc/testsuite/gcc.target/i386/minmax-3.c (working copy) @@ -0,0 +1,27 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mstv" } */ + +#define max(a,b) (((a) > (b))? (a) : (b)) +#define min(a,b) (((a) < (b))? 
(a) : (b)) + +int ssi[1024]; +unsigned int usi[1024]; +long long sdi[1024]; +unsigned long long udi[1024]; + +#define CHECK(FN, VARIANT) \ +void \ +FN ## VARIANT (void) \ +{ \ + for (int i = 1; i < 1024; ++i) \ + VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \ +} + +CHECK(max, ssi); +CHECK(min, ssi); +CHECK(max, usi); +CHECK(min, usi); +CHECK(max, sdi); +CHECK(min, sdi); +CHECK(max, udi); +CHECK(min, udi); Index: gcc/testsuite/gcc.target/i386/minmax-4.c =================================================================== --- gcc/testsuite/gcc.target/i386/minmax-4.c (nonexistent) +++ gcc/testsuite/gcc.target/i386/minmax-4.c (working copy) @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mstv -msse4.1" } */ + +#include "minmax-3.c" + +/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */ +/* { dg-final { scan-assembler-times "pmaxud" 1 } } */ +/* { dg-final { scan-assembler-times "pminsd" 1 } } */ +/* { dg-final { scan-assembler-times "pminud" 1 } } */ Index: gcc/testsuite/gcc.target/i386/minmax-6.c =================================================================== --- gcc/testsuite/gcc.target/i386/minmax-6.c (nonexistent) +++ gcc/testsuite/gcc.target/i386/minmax-6.c (working copy) @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=haswell" } */ + +unsigned short +UMVLine16Y_11 (short unsigned int * Pic, int y, int width) +{ + if (y != width) + { + y = y < 0 ? 0 : y; + return Pic[y * width]; + } + return Pic[y]; +} + +/* We do not want the RA to spill %esi for it's dual-use but using + pmaxsd is OK. */ +/* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */ +/* { dg-final { scan-assembler "pmaxsd" } } */ ^ permalink raw reply [flat|nested] 61+ messages in thread
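In plain C, the idioms these scan-assembler tests expect to become pmaxsd/pminud look as follows (a hedged sketch for illustration only; the helper names are made up and not part of the patch):

```c
/* Branchless signed max / unsigned min as GIMPLE sees them: the
   conditional expressions become MAX_EXPR/MIN_EXPR, which with
   -mstv -msse4.1 the STV pass can turn into pmaxsd/pminud.  */
static int
smax_si (int a, int b)
{
  return a > b ? a : b;
}

static unsigned int
umin_si (unsigned int a, unsigned int b)
{
  return a < b ? a : b;
}

/* The minmax-3.c loops run such an operation over neighbouring
   array elements, forming a chain of min/max operations.  */
static void
running_smax (int *v, int n)
{
  for (int i = 1; i < n; ++i)
    v[i] = smax_si (v[i - 1], v[i]);
}
```

With `-O2 -mstv -msse4.1` the reduction loop is the kind of candidate the new min/max patterns target; without STV it expands to a compare-plus-cmov sequence.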
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-09 11:06 ` Richard Biener @ 2019-08-09 13:13 ` Richard Biener 2019-08-09 14:39 ` Uros Bizjak 2019-08-13 15:20 ` Jeff Law 0 siblings, 2 replies; 61+ messages in thread From: Richard Biener @ 2019-08-09 13:13 UTC (permalink / raw) To: Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches, law [-- Attachment #1: Type: text/plain, Size: 46211 bytes --] On Fri, 9 Aug 2019, Richard Biener wrote: > On Fri, 9 Aug 2019, Richard Biener wrote: > > > On Fri, 9 Aug 2019, Uros Bizjak wrote: > > > > > On Mon, Aug 5, 2019 at 3:09 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > > > to force use of %zmmN? > > > > > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > > > > > case SMAX: > > > > > case SMIN: > > > > > case UMAX: > > > > > case UMIN: > > > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > > > || (mode == SImode && !TARGET_SSE4_1)) > > > > > return false; > > > > > > > > > > so there's no way to use AVX512VL for 32bit? > > > > > > > > There is a way, but on 32bit targets, we need to split DImode > > > > operation to a sequence of SImode operations for unconverted pattern. > > > > This is of course doable, but somehow more complex than simply > > > > emitting a DImode compare + DImode cmove, which is what current > > > > splitter does. So, a follow-up task. 
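The follow-up doubleword splitter discussed above relies on the classic cmp/sbb idiom for comparing 64-bit values via two 32-bit halves; here is a hedged C model of just the GEU comparison (the function name is invented for illustration and does not appear in the patch):

```c
#include <stdint.h>

/* Model of the borrow chain a doubleword splitter emits on
   !TARGET_64BIT: compare the low parts (cmp), then subtract the high
   parts with borrow (sbb); the final borrow is set iff a < b, so its
   complement is the GEU result.  */
static int
uge64_doubleword (uint32_t alo, uint32_t ahi, uint32_t blo, uint32_t bhi)
{
  unsigned int borrow = alo < blo;               /* cmp on low halves */
  unsigned int final_borrow
    = ahi < bhi || (ahi == bhi && borrow);       /* sbb on high halves */
  return !final_borrow;                          /* GEU: a >= b */
}
```

The signed GE variant differs only in how the high-part subtraction's sign/overflow flags are interpreted, which is why the splitter selects between the CCC and CCGZ flag modes.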
> > > > > > Please find attached the complete .md part that enables SImode for > > > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit > > > targets. The patterns also allow for memory operand 2, so STV has a > > > chance to create the vector pattern with an implicit load. In case STV > > > fails, the memory operand 2 is loaded to the register first; operand > > > 2 is used in compare and cmove instruction, so pre-loading of the > > > operand should be beneficial. > > > > Thanks. > > > > > Also note that splitting should happen rarely. Due to the cost > > > function, STV should effectively always convert minmax to a vector > > > insn. > > > > I've analyzed the 464.h264ref slowdown on Haswell and it is due to > > this kind of "simple" conversion: > > > > 5.50 │1d0: test %esi,%es > > 0.07 │ mov $0x0,%ex > > │ cmovs %eax,%es > > 5.84 │ imul %r8d,%es > > > > to > > > > 0.65 │1e0: vpxor %xmm0,%xmm0,%xmm0 > > 0.32 │ vpmaxs -0x10(%rsp),%xmm0,%xmm0 > > 40.45 │ vmovd %xmm0,%eax > > 2.45 │ imul %r8d,%eax > > > > which looks like an RA artifact in the end. We spill %esi only > > with -mstv here as STV introduces a (subreg:V4SI ...) use > > of a pseudo ultimately set from di. STV creates an additional > > pseudo for this (copy-in) but it places that copy next to the > > original def rather than next to the start of the chain it > > converts, which is probably the issue why we spill. And this > > is because it inserts those at each definition of the pseudo > > rather than just at the reaching definition(s) or at the > > uses of the pseudo in the chain (that is because there may be > > defs of that pseudo in the chain itself). Note that STV emits > > such "conversion" copies as simple reg-reg moves: > > > > (insn 1094 3 4 2 (set (reg:SI 777) > > (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1 > > (nil)) > > > > but those do not prevail very long (this one gets removed by CSE2).
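For reference, both instruction sequences quoted above compute the same value; a hedged C sketch of the hot statement (the function name is hypothetical, distilled from the quoted assembly rather than copied from 464.h264ref):

```c
/* The scalar sequence does the clamp with test + cmovs, the STV
   sequence with vpxor + vpmaxsd; both then feed the result to imul.  */
static int
clamped_index (int y, int width)
{
  int t = y < 0 ? 0 : y;   /* MAX_EXPR (y, 0) */
  return t * width;        /* the trailing imul */
}
```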
> > So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use > > and computes > > > > r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS > > a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618 > > > > so I wonder if STV shouldn't instead emit gpr->xmm moves > > here (but I guess nothing again prevents RTL optimizers from > > combining that with the single use in the max instruction...). > > > > So this boils down to STV splitting live-ranges but other > > passes undoing that and then RA not considering splitting > > live-ranges here, arriving at suboptimal allocation. > > > > A testcase showing this issue is (simplified from 464.h264ref > > UMVLine16Y_11): > > > > unsigned short > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width) > > { > > if (y != width) > > { > > y = y < 0 ? 0 : y; > > return Pic[y * width]; > > } > > return Pic[y]; > > } > > > > where the condition and the Pic[y] load mimic the other use of y. > > Different, even worse spilling is generated by > > > > unsigned short > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width) > > { > > y = y < 0 ? 0 : y; > > return Pic[y * width] + y; > > } > > > > I guess this all shows that STV's "trick" of simply wrapping > > integer mode pseudos in (subreg:vector-mode ...) is bad? > > > > I've added a (failing) testcase to reflect the above. > > Experimenting a bit with using V4SImode pseudos just for the > > conversion insns, we end up preserving those moves (but I > > do have to use a lowpart set; using reg:V4SI = subreg:V4SI SImode-reg > > ends up using movv4si_internal, which only leaves us with > > memory for the SImode operand) _plus_ moving the move next > > to the actual use has an effect. Not necessarily a good one > > though: > > > > vpxor %xmm0, %xmm0, %xmm0 > > vmovaps %xmm0, -16(%rsp) > > movl %esi, -16(%rsp) > > vpmaxsd -16(%rsp), %xmm0, %xmm0 > > vmovd %xmm0, %eax > > > > eh?
I guess the lowpart set is not good (my patch has this > as well, but I got saved by never having vector modes to subset...). > Using > > (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ])) > (const_vector:V4SI [ > (const_int 0 [0]) repeated x4 > ]) > (const_int 1 [0x1]))) "t3.c":5:10 -1 > > for the move ends up with > > vpxor %xmm1, %xmm1, %xmm1 > vpinsrd $0, %esi, %xmm1, %xmm0 > > eh? LRA chooses the correct alternative here but somehow > postreload CSE CSEs the zero with the xmm1 clearing, leading > to the vpinsrd... (I guess a general issue, not sure if really > worse - definitely a larger instruction). Unfortunately > postreload-cse doesn't add a reg-equal note. This happens only > when emitting the reg move before the use, not doing that emits > a vmovd as expected. > > At least the spilling is gone here. > > I am re-testing as follows, the main change is that > general_scalar_chain::make_vector_copies now generates a > vector pseudo as destination (and I've fixed up the code > to not generate (subreg:V4SI (reg:V4SI 1234) 0)). > > Hope this fixes the observed slowdowns (it fixes the new testcase). It fixes the slowdown observed in 416.gamess and 464.h264ref. Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress. CCing Jeff who "knows RTL". OK? Thanks, Richard. > Richard. > > mccas.F:twotff_ for 416.gamess > refbuf.c:UMVLine16Y_11 for 464.h264ref > > 2019-08-07 Richard Biener <rguenther@suse.de> > > PR target/91154 > * config/i386/i386-features.h (scalar_chain::scalar_chain): Add > mode arguments. > (scalar_chain::smode): New member. > (scalar_chain::vmode): Likewise. > (dimode_scalar_chain): Rename to... > (general_scalar_chain): ... this. > (general_scalar_chain::general_scalar_chain): Take mode arguments. > (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain > base with TImode and V1TImode. > * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust. 
> (general_scalar_chain::vector_const_cost): Adjust for SImode > chains. > (general_scalar_chain::compute_convert_gain): Likewise. Fix > reg-reg move cost gain, use ix86_cost->sse_op cost and adjust > scalar costs. Add {S,U}{MIN,MAX} support. Dump per-instruction > gain if not zero. > (general_scalar_chain::replace_with_subreg): Use vmode/smode. > Elide the subreg if the reg is already vector. > (general_scalar_chain::make_vector_copies): Likewise. Handle > non-DImode chains appropriately. Use a vector-mode pseudo as > destination. > (general_scalar_chain::convert_reg): Likewise. > (general_scalar_chain::convert_op): Likewise. Elide the > subreg if the reg is already vector. > (general_scalar_chain::convert_insn): Likewise. Add > fatal_insn_not_found if the result is not recognized. > (convertible_comparison_p): Pass in the scalar mode and use that. > (general_scalar_to_vector_candidate_p): Likewise. Rename from > dimode_scalar_to_vector_candidate_p. Add {S,U}{MIN,MAX} support. > (scalar_to_vector_candidate_p): Remove by inlining into single > caller. > (general_remove_non_convertible_regs): Rename from > dimode_remove_non_convertible_regs. > (remove_non_convertible_regs): Remove by inlining into single caller. > (convert_scalars_to_vector): Handle SImode and DImode chains > in addition to TImode chains. > * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV. > > * gcc.target/i386/pr91154.c: New testcase. > * gcc.target/i386/minmax-3.c: Likewise. > * gcc.target/i386/minmax-4.c: Likewise. > * gcc.target/i386/minmax-5.c: Likewise. > * gcc.target/i386/minmax-6.c: Likewise. > > Index: gcc/config/i386/i386-features.c > =================================================================== > --- gcc/config/i386/i386-features.c (revision 274111) > +++ gcc/config/i386/i386-features.c (working copy) > @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0; > > /* Initialize new chain. 
*/ > > -scalar_chain::scalar_chain () > +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > { > + smode = smode_; > + vmode = vmode_; > + > chain_id = ++max_id; > > if (dump_file) > @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins > conversion. */ > > void > -dimode_scalar_chain::mark_dual_mode_def (df_ref def) > +general_scalar_chain::mark_dual_mode_def (df_ref def) > { > gcc_assert (DF_REF_REG_DEF_P (def)); > > @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate > && !HARD_REGISTER_P (SET_DEST (def_set))) > bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); > > + /* ??? The following is quadratic since analyze_register_chain > + iterates over all refs to look for dual-mode regs. Instead this > + should be done separately for all regs mentioned in the chain once. */ > df_ref ref; > df_ref def; > for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) > @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates, > instead of using a scalar one. */ > > int > -dimode_scalar_chain::vector_const_cost (rtx exp) > +general_scalar_chain::vector_const_cost (rtx exp) > { > gcc_assert (CONST_INT_P (exp)); > > - if (standard_sse_constant_p (exp, V2DImode)) > - return COSTS_N_INSNS (1); > - return ix86_cost->sse_load[1]; > + if (standard_sse_constant_p (exp, vmode)) > + return ix86_cost->sse_op; > + /* We have separate costs for SImode and DImode, use SImode costs > + for smaller modes. */ > + return ix86_cost->sse_load[smode == DImode ? 1 : 0]; > } > > /* Compute a gain for chain conversion. 
*/ > > int > -dimode_scalar_chain::compute_convert_gain () > +general_scalar_chain::compute_convert_gain () > { > bitmap_iterator bi; > unsigned insn_uid; > @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai > if (dump_file) > fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); > > + /* SSE costs distinguish between SImode and DImode loads/stores, for > + int costs factor in the number of GPRs involved. When supporting > + smaller modes than SImode the int load/store costs need to be > + adjusted as well. */ > + unsigned sse_cost_idx = smode == DImode ? 1 : 0; > + unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1; > + > EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) > { > rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; > rtx def_set = single_set (insn); > rtx src = SET_SRC (def_set); > rtx dst = SET_DEST (def_set); > + int igain = 0; > > if (REG_P (src) && REG_P (dst)) > - gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move; > + igain += 2 * m - ix86_cost->xmm_move; > else if (REG_P (src) && MEM_P (dst)) > - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > + igain > + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; > else if (MEM_P (src) && REG_P (dst)) > - gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; > + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; > else if (GET_CODE (src) == ASHIFT > || GET_CODE (src) == ASHIFTRT > || GET_CODE (src) == LSHIFTRT) > { > if (CONST_INT_P (XEXP (src, 0))) > - gain -= vector_const_cost (XEXP (src, 0)); > - gain += ix86_cost->shift_const; > + igain -= vector_const_cost (XEXP (src, 0)); > + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; > if (INTVAL (XEXP (src, 1)) >= 32) > - gain -= COSTS_N_INSNS (1); > + igain -= COSTS_N_INSNS (1); > } > else if (GET_CODE (src) == PLUS > || GET_CODE (src) == MINUS > @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai > || GET_CODE (src) == XOR > || GET_CODE (src) 
== AND) > { > - gain += ix86_cost->add; > + igain += m * ix86_cost->add - ix86_cost->sse_op; > /* Additional gain for andnot for targets without BMI. */ > if (GET_CODE (XEXP (src, 0)) == NOT > && !TARGET_BMI) > - gain += 2 * ix86_cost->add; > + igain += m * ix86_cost->add; > > if (CONST_INT_P (XEXP (src, 0))) > - gain -= vector_const_cost (XEXP (src, 0)); > + igain -= vector_const_cost (XEXP (src, 0)); > if (CONST_INT_P (XEXP (src, 1))) > - gain -= vector_const_cost (XEXP (src, 1)); > + igain -= vector_const_cost (XEXP (src, 1)); > } > else if (GET_CODE (src) == NEG > || GET_CODE (src) == NOT) > - gain += ix86_cost->add - COSTS_N_INSNS (1); > + igain += m * ix86_cost->add - ix86_cost->sse_op; > + else if (GET_CODE (src) == SMAX > + || GET_CODE (src) == SMIN > + || GET_CODE (src) == UMAX > + || GET_CODE (src) == UMIN) > + { > + /* We do not have any conditional move cost, estimate it as a > + reg-reg move. Comparisons are costed as adds. */ > + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); > + /* Integer SSE ops are all costed the same. */ > + igain -= ix86_cost->sse_op; > + } > else if (GET_CODE (src) == COMPARE) > { > /* Assume comparison cost is the same. */ > @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai > else if (CONST_INT_P (src)) > { > if (REG_P (dst)) > - gain += COSTS_N_INSNS (2); > + /* DImode can be immediate for TARGET_64BIT and SImode always. */ > + igain += COSTS_N_INSNS (m); > else if (MEM_P (dst)) > - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > - gain -= vector_const_cost (src); > + igain += (m * ix86_cost->int_store[2] > + - ix86_cost->sse_store[sse_cost_idx]); > + igain -= vector_const_cost (src); > } > else > gcc_unreachable (); > + > + if (igain != 0 && dump_file) > + { > + fprintf (dump_file, " Instruction gain %d for ", igain); > + dump_insn_slim (dump_file, insn); > + } > + gain += igain; > } > > if (dump_file) > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > + /* ??? 
What about integer to SSE? */ > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > @@ -570,10 +608,11 @@ dimode_scalar_chain::compute_convert_gai > /* Replace REG in X with a V2DI subreg of NEW_REG. */ > > rtx > -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) > +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) > { > if (x == reg) > - return gen_rtx_SUBREG (V2DImode, new_reg, 0); > + return (GET_MODE (new_reg) == vmode > + ? new_reg : gen_rtx_SUBREG (vmode, new_reg, 0)); > > const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); > int i, j; > @@ -593,7 +632,7 @@ dimode_scalar_chain::replace_with_subreg > /* Replace REG in INSN with a V2DI subreg of NEW_REG. */ > > void > -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, > +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, > rtx reg, rtx new_reg) > { > replace_with_subreg (single_set (insn), reg, new_reg); > @@ -624,10 +663,10 @@ scalar_chain::emit_conversion_insns (rtx > and replace its uses in a chain. 
*/ > > void > -dimode_scalar_chain::make_vector_copies (unsigned regno) > +general_scalar_chain::make_vector_copies (unsigned regno) > { > rtx reg = regno_reg_rtx[regno]; > - rtx vreg = gen_reg_rtx (DImode); > + rtx vreg = gen_reg_rtx (vmode); > df_ref ref; > > for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) > @@ -636,36 +675,59 @@ dimode_scalar_chain::make_vector_copies > start_sequence (); > if (!TARGET_INTER_UNIT_MOVES_TO_VEC) > { > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > - emit_move_insn (adjust_address (tmp, SImode, 0), > - gen_rtx_SUBREG (SImode, reg, 0)); > - emit_move_insn (adjust_address (tmp, SImode, 4), > - gen_rtx_SUBREG (SImode, reg, 4)); > - emit_move_insn (vreg, tmp); > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > + if (smode == DImode && !TARGET_64BIT) > + { > + emit_move_insn (adjust_address (tmp, SImode, 0), > + gen_rtx_SUBREG (SImode, reg, 0)); > + emit_move_insn (adjust_address (tmp, SImode, 4), > + gen_rtx_SUBREG (SImode, reg, 4)); > + } > + else > + emit_move_insn (tmp, reg); > + emit_move_insn (vreg, > + gen_rtx_VEC_MERGE (vmode, > + gen_rtx_VEC_DUPLICATE (vmode, > + tmp), > + CONST0_RTX (vmode), > + GEN_INT (HOST_WIDE_INT_1U))); > + > } > - else if (TARGET_SSE4_1) > + else if (!TARGET_64BIT && smode == DImode) > { > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 0))); > - emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (SImode, reg, 4), > - GEN_INT (2))); > + if (TARGET_SSE4_1) > + { > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 0))); > + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (SImode, reg, 4), > + GEN_INT (2))); > + } > + else > + { > + rtx tmp = gen_reg_rtx 
(DImode); > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 0))); > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 4))); > + emit_insn (gen_vec_interleave_lowv4si > + (gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, tmp, 0))); > + } > } > else > { > - rtx tmp = gen_reg_rtx (DImode); > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 0))); > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 4))); > - emit_insn (gen_vec_interleave_lowv4si > - (gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, tmp, 0))); > + emit_move_insn (vreg, > + gen_rtx_VEC_MERGE (vmode, > + gen_rtx_VEC_DUPLICATE (vmode, > + reg), > + CONST0_RTX (vmode), > + GEN_INT (HOST_WIDE_INT_1U))); > } > rtx_insn *seq = get_insns (); > end_sequence (); > @@ -695,7 +757,7 @@ dimode_scalar_chain::make_vector_copies > in case register is used in not convertible insn. 
*/ > > void > -dimode_scalar_chain::convert_reg (unsigned regno) > +general_scalar_chain::convert_reg (unsigned regno) > { > bool scalar_copy = bitmap_bit_p (defs_conv, regno); > rtx reg = regno_reg_rtx[regno]; > @@ -707,7 +769,7 @@ dimode_scalar_chain::convert_reg (unsign > bitmap_copy (conv, insns); > > if (scalar_copy) > - scopy = gen_reg_rtx (DImode); > + scopy = gen_reg_rtx (smode); > > for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) > { > @@ -727,40 +789,55 @@ dimode_scalar_chain::convert_reg (unsign > start_sequence (); > if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) > { > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > emit_move_insn (tmp, reg); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > - adjust_address (tmp, SImode, 0)); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > - adjust_address (tmp, SImode, 4)); > + if (!TARGET_64BIT && smode == DImode) > + { > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > + adjust_address (tmp, SImode, 0)); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > + adjust_address (tmp, SImode, 4)); > + } > + else > + emit_move_insn (scopy, tmp); > } > - else if (TARGET_SSE4_1) > + else if (!TARGET_64BIT && smode == DImode) > { > - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); > - emit_insn > - (gen_rtx_SET > - (gen_rtx_SUBREG (SImode, scopy, 0), > - gen_rtx_VEC_SELECT (SImode, > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > - > - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > - emit_insn > - (gen_rtx_SET > - (gen_rtx_SUBREG (SImode, scopy, 4), > - gen_rtx_VEC_SELECT (SImode, > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > + if (TARGET_SSE4_1) > + { > + rtx tmp = gen_rtx_PARALLEL (VOIDmode, > + gen_rtvec (1, const0_rtx)); > + emit_insn > + (gen_rtx_SET > + (gen_rtx_SUBREG (SImode, scopy, 0), > + gen_rtx_VEC_SELECT (SImode, > + gen_rtx_SUBREG (V4SImode, reg, 0), > + 
tmp))); > + > + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > + emit_insn > + (gen_rtx_SET > + (gen_rtx_SUBREG (SImode, scopy, 4), > + gen_rtx_VEC_SELECT (SImode, > + gen_rtx_SUBREG (V4SImode, reg, 0), > + tmp))); > + } > + else > + { > + rtx vcopy = gen_reg_rtx (V2DImode); > + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > + gen_rtx_SUBREG (SImode, vcopy, 0)); > + emit_move_insn (vcopy, > + gen_rtx_LSHIFTRT (V2DImode, > + vcopy, GEN_INT (32))); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > + gen_rtx_SUBREG (SImode, vcopy, 0)); > + } > } > else > - { > - rtx vcopy = gen_reg_rtx (V2DImode); > - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > - gen_rtx_SUBREG (SImode, vcopy, 0)); > - emit_move_insn (vcopy, > - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > - gen_rtx_SUBREG (SImode, vcopy, 0)); > - } > + emit_move_insn (scopy, reg); > + > rtx_insn *seq = get_insns (); > end_sequence (); > emit_conversion_insns (seq, insn); > @@ -809,21 +886,21 @@ dimode_scalar_chain::convert_reg (unsign > registers conversion. 
*/ > > void > -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn) > +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn) > { > *op = copy_rtx_if_shared (*op); > > if (GET_CODE (*op) == NOT) > { > convert_op (&XEXP (*op, 0), insn); > - PUT_MODE (*op, V2DImode); > + PUT_MODE (*op, vmode); > } > else if (MEM_P (*op)) > { > - rtx tmp = gen_reg_rtx (DImode); > + rtx tmp = gen_reg_rtx (GET_MODE (*op)); > > emit_insn_before (gen_move_insn (tmp, *op), insn); > - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); > + *op = gen_rtx_SUBREG (vmode, tmp, 0); > > if (dump_file) > fprintf (dump_file, " Preloading operand for insn %d into r%d\n", > @@ -841,24 +918,31 @@ dimode_scalar_chain::convert_op (rtx *op > gcc_assert (!DF_REF_CHAIN (ref)); > break; > } > - *op = gen_rtx_SUBREG (V2DImode, *op, 0); > + if (GET_MODE (*op) != vmode) > + *op = gen_rtx_SUBREG (vmode, *op, 0); > } > else if (CONST_INT_P (*op)) > { > rtx vec_cst; > - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); > + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); > > /* Prefer all ones vector in case of -1. 
*/ > if (constm1_operand (*op, GET_MODE (*op))) > - vec_cst = CONSTM1_RTX (V2DImode); > + vec_cst = CONSTM1_RTX (vmode); > else > - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, > - gen_rtvec (2, *op, const0_rtx)); > + { > + unsigned n = GET_MODE_NUNITS (vmode); > + rtx *v = XALLOCAVEC (rtx, n); > + v[0] = *op; > + for (unsigned i = 1; i < n; ++i) > + v[i] = const0_rtx; > + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); > + } > > - if (!standard_sse_constant_p (vec_cst, V2DImode)) > + if (!standard_sse_constant_p (vec_cst, vmode)) > { > start_sequence (); > - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); > + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); > rtx_insn *seq = get_insns (); > end_sequence (); > emit_insn_before (seq, insn); > @@ -870,14 +954,14 @@ dimode_scalar_chain::convert_op (rtx *op > else > { > gcc_assert (SUBREG_P (*op)); > - gcc_assert (GET_MODE (*op) == V2DImode); > + gcc_assert (GET_MODE (*op) == vmode); > } > } > > /* Convert INSN to vector mode. */ > > void > -dimode_scalar_chain::convert_insn (rtx_insn *insn) > +general_scalar_chain::convert_insn (rtx_insn *insn) > { > rtx def_set = single_set (insn); > rtx src = SET_SRC (def_set); > @@ -888,9 +972,9 @@ dimode_scalar_chain::convert_insn (rtx_i > { > /* There are no scalar integer instructions and therefore > temporary register usage is required. 
*/ > - rtx tmp = gen_reg_rtx (DImode); > + rtx tmp = gen_reg_rtx (GET_MODE (dst)); > emit_conversion_insns (gen_move_insn (dst, tmp), insn); > - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); > + dst = gen_rtx_SUBREG (vmode, tmp, 0); > } > > switch (GET_CODE (src)) > @@ -899,7 +983,7 @@ dimode_scalar_chain::convert_insn (rtx_i > case ASHIFTRT: > case LSHIFTRT: > convert_op (&XEXP (src, 0), insn); > - PUT_MODE (src, V2DImode); > + PUT_MODE (src, vmode); > break; > > case PLUS: > @@ -907,25 +991,29 @@ dimode_scalar_chain::convert_insn (rtx_i > case IOR: > case XOR: > case AND: > + case SMAX: > + case SMIN: > + case UMAX: > + case UMIN: > convert_op (&XEXP (src, 0), insn); > convert_op (&XEXP (src, 1), insn); > - PUT_MODE (src, V2DImode); > + PUT_MODE (src, vmode); > break; > > case NEG: > src = XEXP (src, 0); > convert_op (&src, insn); > - subreg = gen_reg_rtx (V2DImode); > - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); > - src = gen_rtx_MINUS (V2DImode, subreg, src); > + subreg = gen_reg_rtx (vmode); > + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); > + src = gen_rtx_MINUS (vmode, subreg, src); > break; > > case NOT: > src = XEXP (src, 0); > convert_op (&src, insn); > - subreg = gen_reg_rtx (V2DImode); > - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); > - src = gen_rtx_XOR (V2DImode, src, subreg); > + subreg = gen_reg_rtx (vmode); > + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); > + src = gen_rtx_XOR (vmode, src, subreg); > break; > > case MEM: > @@ -939,17 +1027,17 @@ dimode_scalar_chain::convert_insn (rtx_i > break; > > case SUBREG: > - gcc_assert (GET_MODE (src) == V2DImode); > + gcc_assert (GET_MODE (src) == vmode); > break; > > case COMPARE: > src = SUBREG_REG (XEXP (XEXP (src, 0), 0)); > > - gcc_assert ((REG_P (src) && GET_MODE (src) == DImode) > - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); > + gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER 
(vmode)) > + || (SUBREG_P (src) && GET_MODE (src) == vmode)); > > if (REG_P (src)) > - subreg = gen_rtx_SUBREG (V2DImode, src, 0); > + subreg = gen_rtx_SUBREG (vmode, src, 0); > else > subreg = copy_rtx_if_shared (src); > emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), > @@ -977,7 +1065,9 @@ dimode_scalar_chain::convert_insn (rtx_i > PATTERN (insn) = def_set; > > INSN_CODE (insn) = -1; > - recog_memoized (insn); > + int patt = recog_memoized (insn); > + if (patt == -1) > + fatal_insn_not_found (insn); > df_insn_rescan (insn); > } > > @@ -1116,7 +1206,7 @@ timode_scalar_chain::convert_insn (rtx_i > } > > void > -dimode_scalar_chain::convert_registers () > +general_scalar_chain::convert_registers () > { > bitmap_iterator bi; > unsigned id; > @@ -1186,7 +1276,7 @@ has_non_address_hard_reg (rtx_insn *insn > (const_int 0 [0]))) */ > > static bool > -convertible_comparison_p (rtx_insn *insn) > +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) > { > if (!TARGET_SSE4_1) > return false; > @@ -1219,12 +1309,12 @@ convertible_comparison_p (rtx_insn *insn > > if (!SUBREG_P (op1) > || !SUBREG_P (op2) > - || GET_MODE (op1) != SImode > - || GET_MODE (op2) != SImode > + || GET_MODE (op1) != mode > + || GET_MODE (op2) != mode > || ((SUBREG_BYTE (op1) != 0 > - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) > + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) > && (SUBREG_BYTE (op2) != 0 > - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) > + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) > return false; > > op1 = SUBREG_REG (op1); > @@ -1232,7 +1322,7 @@ convertible_comparison_p (rtx_insn *insn > > if (op1 != op2 > || !REG_P (op1) > - || GET_MODE (op1) != DImode) > + || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ()) > return false; > > return true; > @@ -1241,7 +1331,7 @@ convertible_comparison_p (rtx_insn *insn > /* The DImode version of scalar_to_vector_candidate_p. 
*/ > > static bool > -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) > +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) > { > rtx def_set = single_set (insn); > > @@ -1255,12 +1345,12 @@ dimode_scalar_to_vector_candidate_p (rtx > rtx dst = SET_DEST (def_set); > > if (GET_CODE (src) == COMPARE) > - return convertible_comparison_p (insn); > + return convertible_comparison_p (insn, mode); > > /* We are interested in DImode promotion only. */ > - if ((GET_MODE (src) != DImode > + if ((GET_MODE (src) != mode > && !CONST_INT_P (src)) > - || GET_MODE (dst) != DImode) > + || GET_MODE (dst) != mode) > return false; > > if (!REG_P (dst) && !MEM_P (dst)) > @@ -1280,6 +1370,15 @@ dimode_scalar_to_vector_candidate_p (rtx > return false; > break; > > + case SMAX: > + case SMIN: > + case UMAX: > + case UMIN: > + if ((mode == DImode && !TARGET_AVX512VL) > + || (mode == SImode && !TARGET_SSE4_1)) > + return false; > + /* Fallthru. */ > + > case PLUS: > case MINUS: > case IOR: > @@ -1290,7 +1389,7 @@ dimode_scalar_to_vector_candidate_p (rtx > && !CONST_INT_P (XEXP (src, 1))) > return false; > > - if (GET_MODE (XEXP (src, 1)) != DImode > + if (GET_MODE (XEXP (src, 1)) != mode > && !CONST_INT_P (XEXP (src, 1))) > return false; > break; > @@ -1319,7 +1418,7 @@ dimode_scalar_to_vector_candidate_p (rtx > || !REG_P (XEXP (XEXP (src, 0), 0)))) > return false; > > - if (GET_MODE (XEXP (src, 0)) != DImode > + if (GET_MODE (XEXP (src, 0)) != mode > && !CONST_INT_P (XEXP (src, 0))) > return false; > > @@ -1383,22 +1482,16 @@ timode_scalar_to_vector_candidate_p (rtx > return false; > } > > -/* Return 1 if INSN may be converted into vector > - instruction. 
*/ > - > -static bool > -scalar_to_vector_candidate_p (rtx_insn *insn) > -{ > - if (TARGET_64BIT) > - return timode_scalar_to_vector_candidate_p (insn); > - else > - return dimode_scalar_to_vector_candidate_p (insn); > -} > +/* For a given bitmap of insn UIDs scans all instruction and > + remove insn from CANDIDATES in case it has both convertible > + and not convertible definitions. > > -/* The DImode version of remove_non_convertible_regs. */ > + All insns in a bitmap are conversion candidates according to > + scalar_to_vector_candidate_p. Currently it implies all insns > + are single_set. */ > > static void > -dimode_remove_non_convertible_regs (bitmap candidates) > +general_remove_non_convertible_regs (bitmap candidates) > { > bitmap_iterator bi; > unsigned id; > @@ -1553,23 +1646,6 @@ timode_remove_non_convertible_regs (bitm > BITMAP_FREE (regs); > } > > -/* For a given bitmap of insn UIDs scans all instruction and > - remove insn from CANDIDATES in case it has both convertible > - and not convertible definitions. > - > - All insns in a bitmap are conversion candidates according to > - scalar_to_vector_candidate_p. Currently it implies all insns > - are single_set. */ > - > -static void > -remove_non_convertible_regs (bitmap candidates) > -{ > - if (TARGET_64BIT) > - timode_remove_non_convertible_regs (candidates); > - else > - dimode_remove_non_convertible_regs (candidates); > -} > - > /* Main STV pass function. Find and convert scalar > instructions into vector mode when profitable. 
*/ > > @@ -1577,11 +1653,14 @@ static unsigned int > convert_scalars_to_vector () > { > basic_block bb; > - bitmap candidates; > int converted_insns = 0; > > bitmap_obstack_initialize (NULL); > - candidates = BITMAP_ALLOC (NULL); > + const machine_mode cand_mode[3] = { SImode, DImode, TImode }; > + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode }; > + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */ > + for (unsigned i = 0; i < 3; ++i) > + bitmap_initialize (&candidates[i], &bitmap_default_obstack); > > calculate_dominance_info (CDI_DOMINATORS); > df_set_flags (DF_DEFER_INSN_RESCAN); > @@ -1597,51 +1676,73 @@ convert_scalars_to_vector () > { > rtx_insn *insn; > FOR_BB_INSNS (bb, insn) > - if (scalar_to_vector_candidate_p (insn)) > + if (TARGET_64BIT > + && timode_scalar_to_vector_candidate_p (insn)) > { > if (dump_file) > - fprintf (dump_file, " insn %d is marked as a candidate\n", > + fprintf (dump_file, " insn %d is marked as a TImode candidate\n", > INSN_UID (insn)); > > - bitmap_set_bit (candidates, INSN_UID (insn)); > + bitmap_set_bit (&candidates[2], INSN_UID (insn)); > + } > + else > + { > + /* Check {SI,DI}mode. */ > + for (unsigned i = 0; i <= 1; ++i) > + if (general_scalar_to_vector_candidate_p (insn, cand_mode[i])) > + { > + if (dump_file) > + fprintf (dump_file, " insn %d is marked as a %s candidate\n", > + INSN_UID (insn), i == 0 ? 
"SImode" : "DImode"); > + > + bitmap_set_bit (&candidates[i], INSN_UID (insn)); > + break; > + } > } > } > > - remove_non_convertible_regs (candidates); > + if (TARGET_64BIT) > + timode_remove_non_convertible_regs (&candidates[2]); > + for (unsigned i = 0; i <= 1; ++i) > + general_remove_non_convertible_regs (&candidates[i]); > > - if (bitmap_empty_p (candidates)) > - if (dump_file) > + for (unsigned i = 0; i <= 2; ++i) > + if (!bitmap_empty_p (&candidates[i])) > + break; > + else if (i == 2 && dump_file) > fprintf (dump_file, "There are no candidates for optimization.\n"); > > - while (!bitmap_empty_p (candidates)) > - { > - unsigned uid = bitmap_first_set_bit (candidates); > - scalar_chain *chain; > + for (unsigned i = 0; i <= 2; ++i) > + while (!bitmap_empty_p (&candidates[i])) > + { > + unsigned uid = bitmap_first_set_bit (&candidates[i]); > + scalar_chain *chain; > > - if (TARGET_64BIT) > - chain = new timode_scalar_chain; > - else > - chain = new dimode_scalar_chain; > + if (cand_mode[i] == TImode) > + chain = new timode_scalar_chain; > + else > + chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]); > > - /* Find instructions chain we want to convert to vector mode. > - Check all uses and definitions to estimate all required > - conversions. */ > - chain->build (candidates, uid); > + /* Find instructions chain we want to convert to vector mode. > + Check all uses and definitions to estimate all required > + conversions. 
*/ > + chain->build (&candidates[i], uid); > > - if (chain->compute_convert_gain () > 0) > - converted_insns += chain->convert (); > - else > - if (dump_file) > - fprintf (dump_file, "Chain #%d conversion is not profitable\n", > - chain->chain_id); > + if (chain->compute_convert_gain () > 0) > + converted_insns += chain->convert (); > + else > + if (dump_file) > + fprintf (dump_file, "Chain #%d conversion is not profitable\n", > + chain->chain_id); > > - delete chain; > - } > + delete chain; > + } > > if (dump_file) > fprintf (dump_file, "Total insns converted: %d\n", converted_insns); > > - BITMAP_FREE (candidates); > + for (unsigned i = 0; i <= 2; ++i) > + bitmap_release (&candidates[i]); > bitmap_obstack_release (NULL); > df_process_deferred_rescans (); > > Index: gcc/config/i386/i386-features.h > =================================================================== > --- gcc/config/i386/i386-features.h (revision 274111) > +++ gcc/config/i386/i386-features.h (working copy) > @@ -127,11 +127,16 @@ namespace { > class scalar_chain > { > public: > - scalar_chain (); > + scalar_chain (enum machine_mode, enum machine_mode); > virtual ~scalar_chain (); > > static unsigned max_id; > > + /* Scalar mode. */ > + enum machine_mode smode; > + /* Vector mode. */ > + enum machine_mode vmode; > + > /* ID of a chain. */ > unsigned int chain_id; > /* A queue of instructions to be included into a chain. 
*/ > @@ -159,9 +164,11 @@ class scalar_chain > virtual void convert_registers () = 0; > }; > > -class dimode_scalar_chain : public scalar_chain > +class general_scalar_chain : public scalar_chain > { > public: > + general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > + : scalar_chain (smode_, vmode_) {} > int compute_convert_gain (); > private: > void mark_dual_mode_def (df_ref def); > @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala > class timode_scalar_chain : public scalar_chain > { > public: > + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {} > + > /* Convert from TImode to V1TImode is always faster. */ > int compute_convert_gain () { return 1; } > > Index: gcc/config/i386/i386.md > =================================================================== > --- gcc/config/i386/i386.md (revision 274111) > +++ gcc/config/i386/i386.md (working copy) > @@ -17729,6 +17729,110 @@ (define_expand "add<mode>cc" > (match_operand:SWI 3 "const_int_operand")] > "" > "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;") > + > +;; min/max patterns > + > +(define_mode_iterator MAXMIN_IMODE > + [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")]) > +(define_code_attr maxmin_rel > + [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")]) > + > +(define_expand "<code><mode>3" > + [(parallel > + [(set (match_operand:MAXMIN_IMODE 0 "register_operand") > + (maxmin:MAXMIN_IMODE > + (match_operand:MAXMIN_IMODE 1 "register_operand") > + (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) > + (clobber (reg:CC FLAGS_REG))])] > + "TARGET_STV") > + > +(define_insn_and_split "*<code><mode>3_1" > + [(set (match_operand:MAXMIN_IMODE 0 "register_operand") > + (maxmin:MAXMIN_IMODE > + (match_operand:MAXMIN_IMODE 1 "register_operand") > + (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) > + (clobber (reg:CC FLAGS_REG))] > + "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV > + && can_create_pseudo_p ()" > + "#" > + "&& 1" > + [(set (match_dup 0) 
> + (if_then_else:MAXMIN_IMODE (match_dup 3) > + (match_dup 1) > + (match_dup 2)))] > +{ > + machine_mode mode = <MODE>mode; > + > + if (!register_operand (operands[2], mode)) > + operands[2] = force_reg (mode, operands[2]); > + > + enum rtx_code code = <maxmin_rel>; > + machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]); > + rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG); > + > + rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]); > + emit_insn (gen_rtx_SET (flags, tmp)); > + > + operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx); > +}) > + > +(define_insn_and_split "*<code>di3_doubleword" > + [(set (match_operand:DI 0 "register_operand") > + (maxmin:DI (match_operand:DI 1 "register_operand") > + (match_operand:DI 2 "nonimmediate_operand"))) > + (clobber (reg:CC FLAGS_REG))] > + "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL > + && can_create_pseudo_p ()" > + "#" > + "&& 1" > + [(set (match_dup 0) > + (if_then_else:SI (match_dup 6) > + (match_dup 1) > + (match_dup 2))) > + (set (match_dup 3) > + (if_then_else:SI (match_dup 6) > + (match_dup 4) > + (match_dup 5)))] > +{ > + if (!register_operand (operands[2], DImode)) > + operands[2] = force_reg (DImode, operands[2]); > + > + split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]); > + > + rtx cmplo[2] = { operands[1], operands[2] }; > + rtx cmphi[2] = { operands[4], operands[5] }; > + > + enum rtx_code code = <maxmin_rel>; > + > + switch (code) > + { > + case LE: case LEU: > + std::swap (cmplo[0], cmplo[1]); > + std::swap (cmphi[0], cmphi[1]); > + code = swap_condition (code); > + /* FALLTHRU */ > + > + case GE: case GEU: > + { > + bool uns = (code == GEU); > + rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx) > + = uns ? 
gen_sub3_carry_ccc : gen_sub3_carry_ccgz; > + > + emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1])); > + > + rtx tmp = gen_rtx_SCRATCH (SImode); > + emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1])); > + > + rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG); > + operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx); > + > + break; > + } > + > + default: > + gcc_unreachable (); > + } > +}) > \f > ;; Misc patterns (?) > > Index: gcc/testsuite/gcc.target/i386/minmax-3.c > =================================================================== > --- gcc/testsuite/gcc.target/i386/minmax-3.c (nonexistent) > +++ gcc/testsuite/gcc.target/i386/minmax-3.c (working copy) > @@ -0,0 +1,27 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mstv" } */ > + > +#define max(a,b) (((a) > (b))? (a) : (b)) > +#define min(a,b) (((a) < (b))? (a) : (b)) > + > +int ssi[1024]; > +unsigned int usi[1024]; > +long long sdi[1024]; > +unsigned long long udi[1024]; > + > +#define CHECK(FN, VARIANT) \ > +void \ > +FN ## VARIANT (void) \ > +{ \ > + for (int i = 1; i < 1024; ++i) \ > + VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \ > +} > + > +CHECK(max, ssi); > +CHECK(min, ssi); > +CHECK(max, usi); > +CHECK(min, usi); > +CHECK(max, sdi); > +CHECK(min, sdi); > +CHECK(max, udi); > +CHECK(min, udi); > Index: gcc/testsuite/gcc.target/i386/minmax-4.c > =================================================================== > --- gcc/testsuite/gcc.target/i386/minmax-4.c (nonexistent) > +++ gcc/testsuite/gcc.target/i386/minmax-4.c (working copy) > @@ -0,0 +1,9 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mstv -msse4.1" } */ > + > +#include "minmax-3.c" > + > +/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */ > +/* { dg-final { scan-assembler-times "pmaxud" 1 } } */ > +/* { dg-final { scan-assembler-times "pminsd" 1 } } */ > +/* { dg-final { scan-assembler-times "pminud" 1 } } */ > Index: gcc/testsuite/gcc.target/i386/minmax-6.c > 
=================================================================== > --- gcc/testsuite/gcc.target/i386/minmax-6.c (nonexistent) > +++ gcc/testsuite/gcc.target/i386/minmax-6.c (working copy) > @@ -0,0 +1,18 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -march=haswell" } */ > + > +unsigned short > +UMVLine16Y_11 (short unsigned int * Pic, int y, int width) > +{ > + if (y != width) > + { > + y = y < 0 ? 0 : y; > + return Pic[y * width]; > + } > + return Pic[y]; > +} > + > +/* We do not want the RA to spill %esi for its dual use, but using > + pmaxsd is OK. */ > +/* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */ > +/* { dg-final { scan-assembler "pmaxsd" } } */ -- Richard Biener <rguenther@suse.de> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-09 13:13 ` Richard Biener @ 2019-08-09 14:39 ` Uros Bizjak 2019-08-12 12:57 ` Richard Biener 2019-08-13 15:20 ` Jeff Law 1 sibling, 1 reply; 61+ messages in thread From: Uros Bizjak @ 2019-08-09 14:39 UTC (permalink / raw) To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches, Jeff Law On Fri, Aug 9, 2019 at 3:00 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > > > > to force use of %zmmN? > > > > > > > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > > > > > > > case SMAX: > > > > > > case SMIN: > > > > > > case UMAX: > > > > > > case UMIN: > > > > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > > > > || (mode == SImode && !TARGET_SSE4_1)) > > > > > > return false; > > > > > > > > > > > > so there's no way to use AVX512VL for 32bit? > > > > > > > > > > There is a way, but on 32bit targets, we need to split DImode > > > > > operation to a sequence of SImode operations for unconverted pattern. > > > > > This is of course doable, but somehow more complex than simply > > > > > emitting a DImode compare + DImode cmove, which is what current > > > > > splitter does. So, a follow-up task. > > > > > > > > Please find attached the complete .md part that enables SImode for > > > > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit > > > > targets. 
The patterns also allow a memory operand 2, so STV has > > > > a chance to create the vector pattern with an implicit load. In case STV > > > > fails, the memory operand 2 is loaded into a register first; operand > > > > 2 is used in the compare and cmove instructions, so pre-loading of the > > > > operand should be beneficial. > > > > > > Thanks. > > > > > > > Also note that splitting should happen rarely. Due to the cost > > > > function, STV should effectively always convert minmax to a vector > > > > insn. > > > > > > I've analyzed the 464.h264ref slowdown on Haswell and it is due to > > > this kind of "simple" conversion: > > > > > > 5.50 │1d0: test %esi,%es > > > 0.07 │ mov $0x0,%ex > > > │ cmovs %eax,%es > > > 5.84 │ imul %r8d,%es > > > > > > to > > > > > > 0.65 │1e0: vpxor %xmm0,%xmm0,%xmm0 > > > 0.32 │ vpmaxs -0x10(%rsp),%xmm0,%xmm0 > > > 40.45 │ vmovd %xmm0,%eax > > > 2.45 │ imul %r8d,%eax > > > > > > which looks like an RA artifact in the end. We spill %esi only > > > with -mstv here as STV introduces a (subreg:V4SI ...) use > > > of a pseudo ultimately set from di. STV creates an additional > > > pseudo for this (copy-in) but it places that copy next to the > > > original def rather than next to the start of the chain it > > > converts, which is probably why we spill. And this > > > is because it inserts those at each definition of the pseudo > > > rather than just at the reaching definition(s) or at the > > > uses of the pseudo in the chain (that is because there may be > > > defs of that pseudo in the chain itself). Note that STV emits > > > such "conversion" copies as simple reg-reg moves: > > > > > > (insn 1094 3 4 2 (set (reg:SI 777) > > > (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1 > > > (nil)) > > > > > > but those do not prevail very long (this one gets removed by CSE2). 
> > > So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use > > > and computes > > > > > > r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS > > > a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618 > > > > > > so I wonder if STV shouldn't instead emit gpr->xmm moves > > > here (but I guess nothing again prevents RTL optimizers from > > > combining that with the single-use in the max instruction...). > > > > > > So this boils down to STV splitting live-ranges but other > > > passes undoing that and then RA not considering splitting > > > live-ranges here, arriving at suboptimal allocation. > > > > > > A testcase showing this issue is (simplified from 464.h264ref > > > UMVLine16Y_11): > > > > > > unsigned short > > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width) > > > { > > > if (y != width) > > > { > > > y = y < 0 ? 0 : y; > > > return Pic[y * width]; > > > } > > > return Pic[y]; > > > } > > > > > > where the condition and the Pic[y] load mimic the other use of y. > > > Different, even worse spilling is generated by > > > > > > unsigned short > > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width) > > > { > > > y = y < 0 ? 0 : y; > > > return Pic[y * width] + y; > > > } > > > > > > I guess this all shows that STV's "trick" of simply wrapping > > > integer mode pseudos in (subreg:vector-mode ...) is bad? > > > > > > I've added a (failing) testcase to reflect the above. > > > > Experimenting a bit with using V4SImode pseudos just for the > > conversion insns, we end up preserving those moves (but I > > do have to use a lowpart set, using reg:V4SI = subreg:V4SI SImode-reg > > ends up using movv4si_internal which only leaves us with > > memory for the SImode operand) _plus_ moving the move next > > to the actual use has an effect. Not necessarily a good one > > though: 
I guess the lowpart set is not good (my patch has this > > as well, but I got saved by never having vector modes to subset...). > > Using > > > > (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ])) > > (const_vector:V4SI [ > > (const_int 0 [0]) repeated x4 > > ]) > > (const_int 1 [0x1]))) "t3.c":5:10 -1 > > > > for the move ends up with > > > > vpxor %xmm1, %xmm1, %xmm1 > > vpinsrd $0, %esi, %xmm1, %xmm0 > > > > eh? LRA chooses the correct alternative here but somehow > > postreload CSE CSEs the zero with the xmm1 clearing, leading > > to the vpinsrd... (I guess a general issue, not sure if really > > worse - definitely a larger instruction). Unfortunately > > postreload-cse doesn't add a reg-equal note. This happens only > > when emitting the reg move before the use, not doing that emits > > a vmovd as expected. > > > > At least the spilling is gone here. > > > > I am re-testing as follows, the main change is that > > general_scalar_chain::make_vector_copies now generates a > > vector pseudo as destination (and I've fixed up the code > > to not generate (subreg:V4SI (reg:V4SI 1234) 0)). > > > > Hope this fixes the observed slowdowns (it fixes the new testcase). > > It fixes the slowdown observed in 416.gamess and 464.h264ref. > > Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress. > > CCing Jeff who "knows RTL". > > OK? Please add -mno-stv to gcc.target/i386/minmax-{1,2}.c to avoid spurious test failures on SSE4.1 targets. Uros. > Thanks, > Richard. > > > Richard. > > > > mccas.F:twotff_ for 416.gamess > > refbuf.c:UMVLine16Y_11 for 464.h264ref > > > > 2019-08-07 Richard Biener <rguenther@suse.de> > > > > PR target/91154 > > * config/i386/i386-features.h (scalar_chain::scalar_chain): Add > > mode arguments. > > (scalar_chain::smode): New member. > > (scalar_chain::vmode): Likewise. > > (dimode_scalar_chain): Rename to... > > (general_scalar_chain): ... this. > > (general_scalar_chain::general_scalar_chain): Take mode arguments. 
> > (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain > > base with TImode and V1TImode. > > * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust. > > (general_scalar_chain::vector_const_cost): Adjust for SImode > > chains. > > (general_scalar_chain::compute_convert_gain): Likewise. Fix > > reg-reg move cost gain, use ix86_cost->sse_op cost and adjust > > scalar costs. Add {S,U}{MIN,MAX} support. Dump per-instruction > > gain if not zero. > > (general_scalar_chain::replace_with_subreg): Use vmode/smode. > > Elide the subreg if the reg is already vector. > > (general_scalar_chain::make_vector_copies): Likewise. Handle > > non-DImode chains appropriately. Use a vector-mode pseudo as > > destination. > > (general_scalar_chain::convert_reg): Likewise. > > (general_scalar_chain::convert_op): Likewise. Elide the > > subreg if the reg is already vector. > > (general_scalar_chain::convert_insn): Likewise. Add > > fatal_insn_not_found if the result is not recognized. > > (convertible_comparison_p): Pass in the scalar mode and use that. > > (general_scalar_to_vector_candidate_p): Likewise. Rename from > > dimode_scalar_to_vector_candidate_p. Add {S,U}{MIN,MAX} support. > > (scalar_to_vector_candidate_p): Remove by inlining into single > > caller. > > (general_remove_non_convertible_regs): Rename from > > dimode_remove_non_convertible_regs. > > (remove_non_convertible_regs): Remove by inlining into single caller. > > (convert_scalars_to_vector): Handle SImode and DImode chains > > in addition to TImode chains. > > * config/i386/i386.md (<maxmin><SWI48>3): New insn split after STV. > > > > * gcc.target/i386/pr91154.c: New testcase. > > * gcc.target/i386/minmax-3.c: Likewise. > > * gcc.target/i386/minmax-4.c: Likewise. > > * gcc.target/i386/minmax-5.c: Likewise. > > * gcc.target/i386/minmax-6.c: Likewise. ^ permalink raw reply [flat|nested] 61+ messages in thread
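[The _doubleword splitter Uros mentions as a follow-up has to lower a DImode min/max into SImode pieces on 32-bit targets. A minimal C sketch of that decomposition for unsigned max — illustrative only; GCC's actual splitter works on RTL and uses compare/sbb sequences rather than branches:]

```c
/* 64-bit unsigned max built from 32-bit halves, the shape of the
   doubleword splitting discussed above: the high halves decide,
   with a tie broken by the low halves.  */
static unsigned long long
umax_doubleword (unsigned long long a, unsigned long long b)
{
  unsigned int ahi = (unsigned int) (a >> 32);
  unsigned int bhi = (unsigned int) (b >> 32);
  if (ahi != bhi)
    return ahi > bhi ? a : b;
  /* High halves equal: compare the low halves.  */
  return (unsigned int) a > (unsigned int) b ? a : b;
}
```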
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-09 14:39 ` Uros Bizjak @ 2019-08-12 12:57 ` Richard Biener 2019-08-12 14:48 ` Uros Bizjak 2019-08-13 16:28 ` Jeff Law 0 siblings, 2 replies; 61+ messages in thread From: Richard Biener @ 2019-08-12 12:57 UTC (permalink / raw) To: Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches, Jeff Law, hjl.tools [-- Attachment #1: Type: text/plain, Size: 47997 bytes --] On Fri, 9 Aug 2019, Uros Bizjak wrote: > On Fri, Aug 9, 2019 at 3:00 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > > > > > to force use of %zmmN? > > > > > > > > > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > > > > > > > > > case SMAX: > > > > > > > case SMIN: > > > > > > > case UMAX: > > > > > > > case UMIN: > > > > > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > > > > > || (mode == SImode && !TARGET_SSE4_1)) > > > > > > > return false; > > > > > > > > > > > > > > so there's no way to use AVX512VL for 32bit? > > > > > > > > > > > > There is a way, but on 32bit targets, we need to split DImode > > > > > > operation to a sequence of SImode operations for unconverted pattern. > > > > > > This is of course doable, but somehow more complex than simply > > > > > > emitting a DImode compare + DImode cmove, which is what current > > > > > > splitter does. So, a follow-up task. 
> > > > > > > > > > Please find attached the complete .md part that enables SImode for > > > > > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit > > > > > targets. The patterns also allow a memory operand 2, so STV has > > > > > a chance to create the vector pattern with an implicit load. In case STV > > > > > fails, the memory operand 2 is loaded into a register first; operand > > > > > 2 is used in the compare and cmove instructions, so pre-loading of the > > > > > operand should be beneficial. > > > > > > > > Thanks. > > > > > > > > > Also note that splitting should happen rarely. Due to the cost > > > > > function, STV should effectively always convert minmax to a vector > > > > > insn. > > > > > > > > I've analyzed the 464.h264ref slowdown on Haswell and it is due to > > > > this kind of "simple" conversion: > > > > > > > > 5.50 │1d0: test %esi,%es > > > > 0.07 │ mov $0x0,%ex > > > > │ cmovs %eax,%es > > > > 5.84 │ imul %r8d,%es > > > > > > > > to > > > > > > > > 0.65 │1e0: vpxor %xmm0,%xmm0,%xmm0 > > > > 0.32 │ vpmaxs -0x10(%rsp),%xmm0,%xmm0 > > > > 40.45 │ vmovd %xmm0,%eax > > > > 2.45 │ imul %r8d,%eax > > > > > > > > which looks like an RA artifact in the end. We spill %esi only > > > > with -mstv here as STV introduces a (subreg:V4SI ...) use > > > > of a pseudo ultimately set from di. STV creates an additional > > > > pseudo for this (copy-in) but it places that copy next to the > > > > original def rather than next to the start of the chain it > > > > converts, which is probably why we spill. And this > > > > is because it inserts those at each definition of the pseudo > > > > rather than just at the reaching definition(s) or at the > > > > uses of the pseudo in the chain (that is because there may be > > > > defs of that pseudo in the chain itself). 
Note that STV emits > > > > such "conversion" copies as simple reg-reg moves: > > > > > > > > (insn 1094 3 4 2 (set (reg:SI 777) > > > > (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1 > > > > (nil)) > > > > > > > > but those do not prevail very long (this one gets removed by CSE2). > > > > So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use > > > > and computes > > > > > > > > r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS > > > > a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618 > > > > > > > > so I wonder if STV shouldn't instead emit gpr->xmm moves > > > > here (but I guess nothing again prevents RTL optimizers from > > > > combining that with the single-use in the max instruction...). > > > > > > > > So this boils down to STV splitting live-ranges but other > > > > passes undoing that and then RA not considering splitting > > > > live-ranges here, arriving at suboptimal allocation. > > > > > > > > A testcase showing this issue is (simplified from 464.h264ref > > > > UMVLine16Y_11): > > > > > > > > unsigned short > > > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width) > > > > { > > > > if (y != width) > > > > { > > > > y = y < 0 ? 0 : y; > > > > return Pic[y * width]; > > > > } > > > > return Pic[y]; > > > > } > > > > > > > > where the condition and the Pic[y] load mimic the other use of y. > > > > Different, even worse spilling is generated by > > > > > > > > unsigned short > > > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width) > > > > { > > > > y = y < 0 ? 0 : y; > > > > return Pic[y * width] + y; > > > > } > > > > > > > > I guess this all shows that STV's "trick" of simply wrapping > > > > integer mode pseudos in (subreg:vector-mode ...) is bad? > > > > > > > > I've added a (failing) testcase to reflect the above. 
> > > > > > Experimenting a bit with using V4SImode pseudos just for the > > > conversion insns, we end up preserving those moves (but I > > > do have to use a lowpart set, using reg:V4SI = subreg:V4SI SImode-reg > > > ends up using movv4si_internal which only leaves us with > > > memory for the SImode operand) _plus_ moving the move next > > > to the actual use has an effect. Not necessarily a good one > > > though: > > > > > > vpxor %xmm0, %xmm0, %xmm0 > > > vmovaps %xmm0, -16(%rsp) > > > movl %esi, -16(%rsp) > > > vpmaxsd -16(%rsp), %xmm0, %xmm0 > > > vmovd %xmm0, %eax > > > > > > eh? I guess the lowpart set is not good (my patch has this > > > as well, but I got saved by never having vector modes to subset...). > > > Using > > > > > > (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ])) > > > (const_vector:V4SI [ > > > (const_int 0 [0]) repeated x4 > > > ]) > > > (const_int 1 [0x1]))) "t3.c":5:10 -1 > > > > > > for the move ends up with > > > > > > vpxor %xmm1, %xmm1, %xmm1 > > > vpinsrd $0, %esi, %xmm1, %xmm0 > > > > > > eh? LRA chooses the correct alternative here but somehow > > > postreload CSE CSEs the zero with the xmm1 clearing, leading > > > to the vpinsrd... (I guess a general issue, not sure if really > > > worse - definitely a larger instruction). Unfortunately > > > postreload-cse doesn't add a reg-equal note. This happens only > > > when emitting the reg move before the use, not doing that emits > > > a vmovd as expected. > > > > > > At least the spilling is gone here. > > > > > > I am re-testing as follows, the main change is that > > > general_scalar_chain::make_vector_copies now generates a > > > vector pseudo as destination (and I've fixed up the code > > > to not generate (subreg:V4SI (reg:V4SI 1234) 0)). > > > > > > Hope this fixes the observed slowdowns (it fixes the new testcase). 
> > > > CCing Jeff who "knows RTL". > > > > OK? > > Please add -mno-stv to gcc.target/i386/minmax-{1,2}.c to avoid > spurious test failures on SSE4.1 targets. Done. I've also adjusted the i386.md changelog as follows: * config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander. (*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split. (*<maxmin>di3_doubleword): Likewise. I see FAIL: gcc.target/i386/pr65105-3.c scan-assembler ptest FAIL: gcc.target/i386/pr65105-5.c scan-assembler ptest FAIL: gcc.target/i386/pr78794.c scan-assembler pandn with the latest patch (this is with -m32) where -mstv causes all spills to go away and the cmoves replaced (so clearly better code after the patch) for pr65105-5.c, no obvious improvements for pr65105-3.c where cmov does appear with -mstv. I'd rather not "fix" those by adding -mno-stv but instead have the Intel people fix costing for slm and/or decide what to do. For pr65105-3.c I'm not sure why if-conversion didn't choose to use cmov, so clearly the enabled minmax patterns expose the "failure" here. I've also seen a 32bit ICE for a bogus store we create with the live-range splitting fix fixed in the patch below (convert_insn REG src handling with MEM dst needs to account for a vector-mode src case). Maybe it would help to split out changes unrelated to {DI,SI}mode chain support from the STV costing and also separately install the live-range splitting "fix"? I'm willing to do some more legwork to make review and approval easier here. Anyway, bootstrapped & tested on x86_64-unknown-linux-gnu. I've re-checked SPEC CPU 2006 on Haswell with no changes over the previous results. Thanks, Richard. 2019-08-12 Richard Biener <rguenther@suse.de> PR target/91154 * config/i386/i386-features.h (scalar_chain::scalar_chain): Add mode arguments. (scalar_chain::smode): New member. (scalar_chain::vmode): Likewise. (dimode_scalar_chain): Rename to... (general_scalar_chain): ... this. (general_scalar_chain::general_scalar_chain): Take mode arguments. 
(timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain base with TImode and V1TImode. * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust. (general_scalar_chain::vector_const_cost): Adjust for SImode chains. (general_scalar_chain::compute_convert_gain): Likewise. Fix reg-reg move cost gain, use ix86_cost->sse_op cost and adjust scalar costs. Add {S,U}{MIN,MAX} support. Dump per-instruction gain if not zero. (general_scalar_chain::replace_with_subreg): Use vmode/smode. Elide the subreg if the reg is already vector. (general_scalar_chain::make_vector_copies): Likewise. Handle non-DImode chains appropriately. Use a vector-mode pseudo as destination. (general_scalar_chain::convert_reg): Likewise. (general_scalar_chain::convert_op): Likewise. Elide the subreg if the reg is already vector. (general_scalar_chain::convert_insn): Likewise. Add fatal_insn_not_found if the result is not recognized. (convertible_comparison_p): Pass in the scalar mode and use that. (general_scalar_to_vector_candidate_p): Likewise. Rename from dimode_scalar_to_vector_candidate_p. Add {S,U}{MIN,MAX} support. (scalar_to_vector_candidate_p): Remove by inlining into single caller. (general_remove_non_convertible_regs): Rename from dimode_remove_non_convertible_regs. (remove_non_convertible_regs): Remove by inlining into single caller. (convert_scalars_to_vector): Handle SImode and DImode chains in addition to TImode chains. * config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander. (*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split. (*<maxmin>di3_doubleword): Likewise. * gcc.target/i386/pr91154.c: New testcase. * gcc.target/i386/minmax-3.c: Likewise. * gcc.target/i386/minmax-4.c: Likewise. * gcc.target/i386/minmax-5.c: Likewise. * gcc.target/i386/minmax-6.c: Likewise. * gcc.target/i386/minmax-1.c: Add -mno-stv. * gcc.target/i386/minmax-2.c: Likewise. 
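[The {S,U}{MIN,MAX} gain bookkeeping that the changelog describes for compute_convert_gain can be modeled roughly as follows. This is an illustrative standalone sketch, not GCC code: the cost arguments stand in for ix86_cost entries, and GCC's COSTS_N_INSNS (N) expands to N * 4, so the cmov estimate of COSTS_N_INSNS (2) is the constant 8 below:]

```c
/* Rough model of the per-insn gain STV computes for a scalar min/max
   (cf. general_scalar_chain::compute_convert_gain in the patch).
   m is the number of GPR insns the scalar op needs: 2 for DImode on
   a 32-bit target, otherwise 1.  */
static int
minmax_convert_gain (int m, int add_cost, int sse_op_cost)
{
  /* Scalar side: a compare (costed like an add) plus a cmov,
     estimated as a reg-reg move pair, COSTS_N_INSNS (2) == 8.  */
  int scalar_cost = m * (8 + add_cost);
  /* Vector side: integer SSE min/max ops are all costed the same.  */
  return scalar_cost - sse_op_cost;
}
```

A positive return value means the chain conversion is expected to pay off, which is why, as noted above, STV should effectively always convert minmax to a vector insn.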
Index: gcc/config/i386/i386-features.c =================================================================== --- gcc/config/i386/i386-features.c (revision 274278) +++ gcc/config/i386/i386-features.c (working copy) @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0; /* Initialize new chain. */ -scalar_chain::scalar_chain () +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) { + smode = smode_; + vmode = vmode_; + chain_id = ++max_id; if (dump_file) @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins conversion. */ void -dimode_scalar_chain::mark_dual_mode_def (df_ref def) +general_scalar_chain::mark_dual_mode_def (df_ref def) { gcc_assert (DF_REF_REG_DEF_P (def)); @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate && !HARD_REGISTER_P (SET_DEST (def_set))) bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); + /* ??? The following is quadratic since analyze_register_chain + iterates over all refs to look for dual-mode regs. Instead this + should be done separately for all regs mentioned in the chain once. */ df_ref ref; df_ref def; for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates, instead of using a scalar one. */ int -dimode_scalar_chain::vector_const_cost (rtx exp) +general_scalar_chain::vector_const_cost (rtx exp) { gcc_assert (CONST_INT_P (exp)); - if (standard_sse_constant_p (exp, V2DImode)) - return COSTS_N_INSNS (1); - return ix86_cost->sse_load[1]; + if (standard_sse_constant_p (exp, vmode)) + return ix86_cost->sse_op; + /* We have separate costs for SImode and DImode, use SImode costs + for smaller modes. */ + return ix86_cost->sse_load[smode == DImode ? 1 : 0]; } /* Compute a gain for chain conversion. 
*/ int -dimode_scalar_chain::compute_convert_gain () +general_scalar_chain::compute_convert_gain () { bitmap_iterator bi; unsigned insn_uid; @@ -491,28 +499,37 @@ dimode_scalar_chain::compute_convert_gai if (dump_file) fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); + /* SSE costs distinguish between SImode and DImode loads/stores, for + int costs factor in the number of GPRs involved. When supporting + smaller modes than SImode the int load/store costs need to be + adjusted as well. */ + unsigned sse_cost_idx = smode == DImode ? 1 : 0; + unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1; + EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) { rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; rtx def_set = single_set (insn); rtx src = SET_SRC (def_set); rtx dst = SET_DEST (def_set); + int igain = 0; if (REG_P (src) && REG_P (dst)) - gain += COSTS_N_INSNS (2) - ix86_cost->xmm_move; + igain += 2 * m - ix86_cost->xmm_move; else if (REG_P (src) && MEM_P (dst)) - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; + igain + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; else if (MEM_P (src) && REG_P (dst)) - gain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; else if (GET_CODE (src) == ASHIFT || GET_CODE (src) == ASHIFTRT || GET_CODE (src) == LSHIFTRT) { if (CONST_INT_P (XEXP (src, 0))) - gain -= vector_const_cost (XEXP (src, 0)); - gain += ix86_cost->shift_const; + igain -= vector_const_cost (XEXP (src, 0)); + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; if (INTVAL (XEXP (src, 1)) >= 32) - gain -= COSTS_N_INSNS (1); + igain -= COSTS_N_INSNS (1); } else if (GET_CODE (src) == PLUS || GET_CODE (src) == MINUS @@ -520,20 +537,31 @@ dimode_scalar_chain::compute_convert_gai || GET_CODE (src) == XOR || GET_CODE (src) == AND) { - gain += ix86_cost->add; + igain += m * ix86_cost->add - ix86_cost->sse_op; /* Additional gain 
for andnot for targets without BMI. */ if (GET_CODE (XEXP (src, 0)) == NOT && !TARGET_BMI) - gain += 2 * ix86_cost->add; + igain += m * ix86_cost->add; if (CONST_INT_P (XEXP (src, 0))) - gain -= vector_const_cost (XEXP (src, 0)); + igain -= vector_const_cost (XEXP (src, 0)); if (CONST_INT_P (XEXP (src, 1))) - gain -= vector_const_cost (XEXP (src, 1)); + igain -= vector_const_cost (XEXP (src, 1)); } else if (GET_CODE (src) == NEG || GET_CODE (src) == NOT) - gain += ix86_cost->add - COSTS_N_INSNS (1); + igain += m * ix86_cost->add - ix86_cost->sse_op; + else if (GET_CODE (src) == SMAX + || GET_CODE (src) == SMIN + || GET_CODE (src) == UMAX + || GET_CODE (src) == UMIN) + { + /* We do not have any conditional move cost, estimate it as a + reg-reg move. Comparisons are costed as adds. */ + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); + /* Integer SSE ops are all costed the same. */ + igain -= ix86_cost->sse_op; + } else if (GET_CODE (src) == COMPARE) { /* Assume comparison cost is the same. */ @@ -541,18 +569,28 @@ dimode_scalar_chain::compute_convert_gai else if (CONST_INT_P (src)) { if (REG_P (dst)) - gain += COSTS_N_INSNS (2); + /* DImode can be immediate for TARGET_64BIT and SImode always. */ + igain += COSTS_N_INSNS (m); else if (MEM_P (dst)) - gain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; - gain -= vector_const_cost (src); + igain += (m * ix86_cost->int_store[2] + - ix86_cost->sse_store[sse_cost_idx]); + igain -= vector_const_cost (src); } else gcc_unreachable (); + + if (igain != 0 && dump_file) + { + fprintf (dump_file, " Instruction gain %d for ", igain); + dump_insn_slim (dump_file, insn); + } + gain += igain; } if (dump_file) fprintf (dump_file, " Instruction conversion gain: %d\n", gain); + /* ??? What about integer to SSE? 
*/ EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; @@ -570,10 +608,11 @@ dimode_scalar_chain::compute_convert_gai /* Replace REG in X with a V2DI subreg of NEW_REG. */ rtx -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) { if (x == reg) - return gen_rtx_SUBREG (V2DImode, new_reg, 0); + return (GET_MODE (new_reg) == vmode + ? new_reg : gen_rtx_SUBREG (vmode, new_reg, 0)); const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); int i, j; @@ -593,7 +632,7 @@ dimode_scalar_chain::replace_with_subreg /* Replace REG in INSN with a V2DI subreg of NEW_REG. */ void -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, rtx reg, rtx new_reg) { replace_with_subreg (single_set (insn), reg, new_reg); @@ -624,10 +663,10 @@ scalar_chain::emit_conversion_insns (rtx and replace its uses in a chain. 
*/ void -dimode_scalar_chain::make_vector_copies (unsigned regno) +general_scalar_chain::make_vector_copies (unsigned regno) { rtx reg = regno_reg_rtx[regno]; - rtx vreg = gen_reg_rtx (DImode); + rtx vreg = gen_reg_rtx (vmode); df_ref ref; for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) @@ -636,36 +675,59 @@ dimode_scalar_chain::make_vector_copies start_sequence (); if (!TARGET_INTER_UNIT_MOVES_TO_VEC) { - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); - emit_move_insn (adjust_address (tmp, SImode, 0), - gen_rtx_SUBREG (SImode, reg, 0)); - emit_move_insn (adjust_address (tmp, SImode, 4), - gen_rtx_SUBREG (SImode, reg, 4)); - emit_move_insn (vreg, tmp); + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); + if (smode == DImode && !TARGET_64BIT) + { + emit_move_insn (adjust_address (tmp, SImode, 0), + gen_rtx_SUBREG (SImode, reg, 0)); + emit_move_insn (adjust_address (tmp, SImode, 4), + gen_rtx_SUBREG (SImode, reg, 4)); + } + else + emit_move_insn (tmp, reg); + emit_move_insn (vreg, + gen_rtx_VEC_MERGE (vmode, + gen_rtx_VEC_DUPLICATE (vmode, + tmp), + CONST0_RTX (vmode), + GEN_INT (HOST_WIDE_INT_1U))); + } - else if (TARGET_SSE4_1) + else if (!TARGET_64BIT && smode == DImode) { - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 0))); - emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (SImode, reg, 4), - GEN_INT (2))); + if (TARGET_SSE4_1) + { + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 0))); + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (SImode, reg, 4), + GEN_INT (2))); + } + else + { + rtx tmp = gen_reg_rtx (DImode); + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, 
reg, 0))); + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 4))); + emit_insn (gen_vec_interleave_lowv4si + (gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, tmp, 0))); + } } else { - rtx tmp = gen_reg_rtx (DImode); - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 0))); - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 4))); - emit_insn (gen_vec_interleave_lowv4si - (gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, tmp, 0))); + emit_move_insn (vreg, + gen_rtx_VEC_MERGE (vmode, + gen_rtx_VEC_DUPLICATE (vmode, + reg), + CONST0_RTX (vmode), + GEN_INT (HOST_WIDE_INT_1U))); } rtx_insn *seq = get_insns (); end_sequence (); @@ -695,7 +757,7 @@ dimode_scalar_chain::make_vector_copies in case register is used in not convertible insn. 
*/ void -dimode_scalar_chain::convert_reg (unsigned regno) +general_scalar_chain::convert_reg (unsigned regno) { bool scalar_copy = bitmap_bit_p (defs_conv, regno); rtx reg = regno_reg_rtx[regno]; @@ -707,7 +769,7 @@ dimode_scalar_chain::convert_reg (unsign bitmap_copy (conv, insns); if (scalar_copy) - scopy = gen_reg_rtx (DImode); + scopy = gen_reg_rtx (smode); for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) { @@ -727,40 +789,55 @@ dimode_scalar_chain::convert_reg (unsign start_sequence (); if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) { - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); emit_move_insn (tmp, reg); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), - adjust_address (tmp, SImode, 0)); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), - adjust_address (tmp, SImode, 4)); + if (!TARGET_64BIT && smode == DImode) + { + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), + adjust_address (tmp, SImode, 0)); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), + adjust_address (tmp, SImode, 4)); + } + else + emit_move_insn (scopy, tmp); } - else if (TARGET_SSE4_1) + else if (!TARGET_64BIT && smode == DImode) { - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); - emit_insn - (gen_rtx_SET - (gen_rtx_SUBREG (SImode, scopy, 0), - gen_rtx_VEC_SELECT (SImode, - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); - - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); - emit_insn - (gen_rtx_SET - (gen_rtx_SUBREG (SImode, scopy, 4), - gen_rtx_VEC_SELECT (SImode, - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); + if (TARGET_SSE4_1) + { + rtx tmp = gen_rtx_PARALLEL (VOIDmode, + gen_rtvec (1, const0_rtx)); + emit_insn + (gen_rtx_SET + (gen_rtx_SUBREG (SImode, scopy, 0), + gen_rtx_VEC_SELECT (SImode, + gen_rtx_SUBREG (V4SImode, reg, 0), + tmp))); + + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); + emit_insn + (gen_rtx_SET + (gen_rtx_SUBREG (SImode, 
scopy, 4), + gen_rtx_VEC_SELECT (SImode, + gen_rtx_SUBREG (V4SImode, reg, 0), + tmp))); + } + else + { + rtx vcopy = gen_reg_rtx (V2DImode); + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), + gen_rtx_SUBREG (SImode, vcopy, 0)); + emit_move_insn (vcopy, + gen_rtx_LSHIFTRT (V2DImode, + vcopy, GEN_INT (32))); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), + gen_rtx_SUBREG (SImode, vcopy, 0)); + } } else - { - rtx vcopy = gen_reg_rtx (V2DImode); - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), - gen_rtx_SUBREG (SImode, vcopy, 0)); - emit_move_insn (vcopy, - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), - gen_rtx_SUBREG (SImode, vcopy, 0)); - } + emit_move_insn (scopy, reg); + rtx_insn *seq = get_insns (); end_sequence (); emit_conversion_insns (seq, insn); @@ -809,21 +886,21 @@ dimode_scalar_chain::convert_reg (unsign registers conversion. 
*/ void -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn) +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn) { *op = copy_rtx_if_shared (*op); if (GET_CODE (*op) == NOT) { convert_op (&XEXP (*op, 0), insn); - PUT_MODE (*op, V2DImode); + PUT_MODE (*op, vmode); } else if (MEM_P (*op)) { - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (*op)); emit_insn_before (gen_move_insn (tmp, *op), insn); - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); + *op = gen_rtx_SUBREG (vmode, tmp, 0); if (dump_file) fprintf (dump_file, " Preloading operand for insn %d into r%d\n", @@ -841,24 +918,31 @@ dimode_scalar_chain::convert_op (rtx *op gcc_assert (!DF_REF_CHAIN (ref)); break; } - *op = gen_rtx_SUBREG (V2DImode, *op, 0); + if (GET_MODE (*op) != vmode) + *op = gen_rtx_SUBREG (vmode, *op, 0); } else if (CONST_INT_P (*op)) { rtx vec_cst; - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); /* Prefer all ones vector in case of -1. */ if (constm1_operand (*op, GET_MODE (*op))) - vec_cst = CONSTM1_RTX (V2DImode); + vec_cst = CONSTM1_RTX (vmode); else - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, - gen_rtvec (2, *op, const0_rtx)); + { + unsigned n = GET_MODE_NUNITS (vmode); + rtx *v = XALLOCAVEC (rtx, n); + v[0] = *op; + for (unsigned i = 1; i < n; ++i) + v[i] = const0_rtx; + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); + } - if (!standard_sse_constant_p (vec_cst, V2DImode)) + if (!standard_sse_constant_p (vec_cst, vmode)) { start_sequence (); - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); rtx_insn *seq = get_insns (); end_sequence (); emit_insn_before (seq, insn); @@ -870,14 +954,14 @@ dimode_scalar_chain::convert_op (rtx *op else { gcc_assert (SUBREG_P (*op)); - gcc_assert (GET_MODE (*op) == V2DImode); + gcc_assert (GET_MODE (*op) == vmode); } } /* Convert INSN to vector mode. 
*/ void -dimode_scalar_chain::convert_insn (rtx_insn *insn) +general_scalar_chain::convert_insn (rtx_insn *insn) { rtx def_set = single_set (insn); rtx src = SET_SRC (def_set); @@ -888,9 +972,9 @@ dimode_scalar_chain::convert_insn (rtx_i { /* There are no scalar integer instructions and therefore temporary register usage is required. */ - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (dst)); emit_conversion_insns (gen_move_insn (dst, tmp), insn); - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); + dst = gen_rtx_SUBREG (vmode, tmp, 0); } switch (GET_CODE (src)) @@ -899,7 +983,7 @@ dimode_scalar_chain::convert_insn (rtx_i case ASHIFTRT: case LSHIFTRT: convert_op (&XEXP (src, 0), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case PLUS: @@ -907,25 +991,29 @@ dimode_scalar_chain::convert_insn (rtx_i case IOR: case XOR: case AND: + case SMAX: + case SMIN: + case UMAX: + case UMIN: convert_op (&XEXP (src, 0), insn); convert_op (&XEXP (src, 1), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case NEG: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); - src = gen_rtx_MINUS (V2DImode, subreg, src); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); + src = gen_rtx_MINUS (vmode, subreg, src); break; case NOT: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); - src = gen_rtx_XOR (V2DImode, src, subreg); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); + src = gen_rtx_XOR (vmode, src, subreg); break; case MEM: @@ -936,20 +1024,22 @@ dimode_scalar_chain::convert_insn (rtx_i case REG: if (!MEM_P (dst)) convert_op (&src, insn); + else if (GET_MODE (src) != smode) + src = gen_rtx_SUBREG (smode, src, 0); break; case 
SUBREG: - gcc_assert (GET_MODE (src) == V2DImode); + gcc_assert (GET_MODE (src) == vmode); break; case COMPARE: src = SUBREG_REG (XEXP (XEXP (src, 0), 0)); - gcc_assert ((REG_P (src) && GET_MODE (src) == DImode) - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); + gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode)) + || (SUBREG_P (src) && GET_MODE (src) == vmode)); if (REG_P (src)) - subreg = gen_rtx_SUBREG (V2DImode, src, 0); + subreg = gen_rtx_SUBREG (vmode, src, 0); else subreg = copy_rtx_if_shared (src); emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), @@ -977,7 +1067,9 @@ dimode_scalar_chain::convert_insn (rtx_i PATTERN (insn) = def_set; INSN_CODE (insn) = -1; - recog_memoized (insn); + int patt = recog_memoized (insn); + if (patt == -1) + fatal_insn_not_found (insn); df_insn_rescan (insn); } @@ -1116,7 +1208,7 @@ timode_scalar_chain::convert_insn (rtx_i } void -dimode_scalar_chain::convert_registers () +general_scalar_chain::convert_registers () { bitmap_iterator bi; unsigned id; @@ -1186,7 +1278,7 @@ has_non_address_hard_reg (rtx_insn *insn (const_int 0 [0]))) */ static bool -convertible_comparison_p (rtx_insn *insn) +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) { if (!TARGET_SSE4_1) return false; @@ -1219,12 +1311,12 @@ convertible_comparison_p (rtx_insn *insn if (!SUBREG_P (op1) || !SUBREG_P (op2) - || GET_MODE (op1) != SImode - || GET_MODE (op2) != SImode + || GET_MODE (op1) != mode + || GET_MODE (op2) != mode || ((SUBREG_BYTE (op1) != 0 - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) && (SUBREG_BYTE (op2) != 0 - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) return false; op1 = SUBREG_REG (op1); @@ -1232,7 +1324,7 @@ convertible_comparison_p (rtx_insn *insn if (op1 != op2 || !REG_P (op1) - || GET_MODE (op1) != DImode) + || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ()) return 
false; return true; @@ -1241,7 +1333,7 @@ convertible_comparison_p (rtx_insn *insn /* The DImode version of scalar_to_vector_candidate_p. */ static bool -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) { rtx def_set = single_set (insn); @@ -1255,12 +1347,12 @@ dimode_scalar_to_vector_candidate_p (rtx rtx dst = SET_DEST (def_set); if (GET_CODE (src) == COMPARE) - return convertible_comparison_p (insn); + return convertible_comparison_p (insn, mode); /* We are interested in DImode promotion only. */ - if ((GET_MODE (src) != DImode + if ((GET_MODE (src) != mode && !CONST_INT_P (src)) - || GET_MODE (dst) != DImode) + || GET_MODE (dst) != mode) return false; if (!REG_P (dst) && !MEM_P (dst)) @@ -1280,6 +1372,15 @@ dimode_scalar_to_vector_candidate_p (rtx return false; break; + case SMAX: + case SMIN: + case UMAX: + case UMIN: + if ((mode == DImode && !TARGET_AVX512VL) + || (mode == SImode && !TARGET_SSE4_1)) + return false; + /* Fallthru. */ + case PLUS: case MINUS: case IOR: @@ -1290,7 +1391,7 @@ dimode_scalar_to_vector_candidate_p (rtx && !CONST_INT_P (XEXP (src, 1))) return false; - if (GET_MODE (XEXP (src, 1)) != DImode + if (GET_MODE (XEXP (src, 1)) != mode && !CONST_INT_P (XEXP (src, 1))) return false; break; @@ -1319,7 +1420,7 @@ dimode_scalar_to_vector_candidate_p (rtx || !REG_P (XEXP (XEXP (src, 0), 0)))) return false; - if (GET_MODE (XEXP (src, 0)) != DImode + if (GET_MODE (XEXP (src, 0)) != mode && !CONST_INT_P (XEXP (src, 0))) return false; @@ -1383,22 +1484,16 @@ timode_scalar_to_vector_candidate_p (rtx return false; } -/* Return 1 if INSN may be converted into vector - instruction. 
*/ - -static bool -scalar_to_vector_candidate_p (rtx_insn *insn) -{ - if (TARGET_64BIT) - return timode_scalar_to_vector_candidate_p (insn); - else - return dimode_scalar_to_vector_candidate_p (insn); -} +/* For a given bitmap of insn UIDs scans all instruction and + remove insn from CANDIDATES in case it has both convertible + and not convertible definitions. -/* The DImode version of remove_non_convertible_regs. */ + All insns in a bitmap are conversion candidates according to + scalar_to_vector_candidate_p. Currently it implies all insns + are single_set. */ static void -dimode_remove_non_convertible_regs (bitmap candidates) +general_remove_non_convertible_regs (bitmap candidates) { bitmap_iterator bi; unsigned id; @@ -1553,23 +1648,6 @@ timode_remove_non_convertible_regs (bitm BITMAP_FREE (regs); } -/* For a given bitmap of insn UIDs scans all instruction and - remove insn from CANDIDATES in case it has both convertible - and not convertible definitions. - - All insns in a bitmap are conversion candidates according to - scalar_to_vector_candidate_p. Currently it implies all insns - are single_set. */ - -static void -remove_non_convertible_regs (bitmap candidates) -{ - if (TARGET_64BIT) - timode_remove_non_convertible_regs (candidates); - else - dimode_remove_non_convertible_regs (candidates); -} - /* Main STV pass function. Find and convert scalar instructions into vector mode when profitable. 
*/ @@ -1577,11 +1655,14 @@ static unsigned int convert_scalars_to_vector () { basic_block bb; - bitmap candidates; int converted_insns = 0; bitmap_obstack_initialize (NULL); - candidates = BITMAP_ALLOC (NULL); + const machine_mode cand_mode[3] = { SImode, DImode, TImode }; + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode }; + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */ + for (unsigned i = 0; i < 3; ++i) + bitmap_initialize (&candidates[i], &bitmap_default_obstack); calculate_dominance_info (CDI_DOMINATORS); df_set_flags (DF_DEFER_INSN_RESCAN); @@ -1597,51 +1678,73 @@ convert_scalars_to_vector () { rtx_insn *insn; FOR_BB_INSNS (bb, insn) - if (scalar_to_vector_candidate_p (insn)) + if (TARGET_64BIT + && timode_scalar_to_vector_candidate_p (insn)) { if (dump_file) - fprintf (dump_file, " insn %d is marked as a candidate\n", + fprintf (dump_file, " insn %d is marked as a TImode candidate\n", INSN_UID (insn)); - bitmap_set_bit (candidates, INSN_UID (insn)); + bitmap_set_bit (&candidates[2], INSN_UID (insn)); + } + else + { + /* Check {SI,DI}mode. */ + for (unsigned i = 0; i <= 1; ++i) + if (general_scalar_to_vector_candidate_p (insn, cand_mode[i])) + { + if (dump_file) + fprintf (dump_file, " insn %d is marked as a %s candidate\n", + INSN_UID (insn), i == 0 ? 
"SImode" : "DImode"); + + bitmap_set_bit (&candidates[i], INSN_UID (insn)); + break; + } } } - remove_non_convertible_regs (candidates); + if (TARGET_64BIT) + timode_remove_non_convertible_regs (&candidates[2]); + for (unsigned i = 0; i <= 1; ++i) + general_remove_non_convertible_regs (&candidates[i]); - if (bitmap_empty_p (candidates)) - if (dump_file) + for (unsigned i = 0; i <= 2; ++i) + if (!bitmap_empty_p (&candidates[i])) + break; + else if (i == 2 && dump_file) fprintf (dump_file, "There are no candidates for optimization.\n"); - while (!bitmap_empty_p (candidates)) - { - unsigned uid = bitmap_first_set_bit (candidates); - scalar_chain *chain; + for (unsigned i = 0; i <= 2; ++i) + while (!bitmap_empty_p (&candidates[i])) + { + unsigned uid = bitmap_first_set_bit (&candidates[i]); + scalar_chain *chain; - if (TARGET_64BIT) - chain = new timode_scalar_chain; - else - chain = new dimode_scalar_chain; + if (cand_mode[i] == TImode) + chain = new timode_scalar_chain; + else + chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]); - /* Find instructions chain we want to convert to vector mode. - Check all uses and definitions to estimate all required - conversions. */ - chain->build (candidates, uid); + /* Find instructions chain we want to convert to vector mode. + Check all uses and definitions to estimate all required + conversions. 
*/ + chain->build (&candidates[i], uid); - if (chain->compute_convert_gain () > 0) - converted_insns += chain->convert (); - else - if (dump_file) - fprintf (dump_file, "Chain #%d conversion is not profitable\n", - chain->chain_id); + if (chain->compute_convert_gain () > 0) + converted_insns += chain->convert (); + else + if (dump_file) + fprintf (dump_file, "Chain #%d conversion is not profitable\n", + chain->chain_id); - delete chain; - } + delete chain; + } if (dump_file) fprintf (dump_file, "Total insns converted: %d\n", converted_insns); - BITMAP_FREE (candidates); + for (unsigned i = 0; i <= 2; ++i) + bitmap_release (&candidates[i]); bitmap_obstack_release (NULL); df_process_deferred_rescans (); Index: gcc/config/i386/i386-features.h =================================================================== --- gcc/config/i386/i386-features.h (revision 274278) +++ gcc/config/i386/i386-features.h (working copy) @@ -127,11 +127,16 @@ namespace { class scalar_chain { public: - scalar_chain (); + scalar_chain (enum machine_mode, enum machine_mode); virtual ~scalar_chain (); static unsigned max_id; + /* Scalar mode. */ + enum machine_mode smode; + /* Vector mode. */ + enum machine_mode vmode; + /* ID of a chain. */ unsigned int chain_id; /* A queue of instructions to be included into a chain. */ @@ -159,9 +164,11 @@ class scalar_chain virtual void convert_registers () = 0; }; -class dimode_scalar_chain : public scalar_chain +class general_scalar_chain : public scalar_chain { public: + general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) + : scalar_chain (smode_, vmode_) {} int compute_convert_gain (); private: void mark_dual_mode_def (df_ref def); @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala class timode_scalar_chain : public scalar_chain { public: + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {} + /* Convert from TImode to V1TImode is always faster. 
*/ int compute_convert_gain () { return 1; } Index: gcc/config/i386/i386.md =================================================================== --- gcc/config/i386/i386.md (revision 274278) +++ gcc/config/i386/i386.md (working copy) @@ -17719,6 +17719,110 @@ (define_expand "add<mode>cc" (match_operand:SWI 3 "const_int_operand")] "" "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;") + +;; min/max patterns + +(define_mode_iterator MAXMIN_IMODE + [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")]) +(define_code_attr maxmin_rel + [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")]) + +(define_expand "<code><mode>3" + [(parallel + [(set (match_operand:MAXMIN_IMODE 0 "register_operand") + (maxmin:MAXMIN_IMODE + (match_operand:MAXMIN_IMODE 1 "register_operand") + (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))])] + "TARGET_STV") + +(define_insn_and_split "*<code><mode>3_1" + [(set (match_operand:MAXMIN_IMODE 0 "register_operand") + (maxmin:MAXMIN_IMODE + (match_operand:MAXMIN_IMODE 1 "register_operand") + (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))] + "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV + && can_create_pseudo_p ()" + "#" + "&& 1" + [(set (match_dup 0) + (if_then_else:MAXMIN_IMODE (match_dup 3) + (match_dup 1) + (match_dup 2)))] +{ + machine_mode mode = <MODE>mode; + + if (!register_operand (operands[2], mode)) + operands[2] = force_reg (mode, operands[2]); + + enum rtx_code code = <maxmin_rel>; + machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]); + rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG); + + rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]); + emit_insn (gen_rtx_SET (flags, tmp)); + + operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx); +}) + +(define_insn_and_split "*<code>di3_doubleword" + [(set (match_operand:DI 0 "register_operand") + (maxmin:DI (match_operand:DI 1 "register_operand") + (match_operand:DI 2 
"nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))] + "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL + && can_create_pseudo_p ()" + "#" + "&& 1" + [(set (match_dup 0) + (if_then_else:SI (match_dup 6) + (match_dup 1) + (match_dup 2))) + (set (match_dup 3) + (if_then_else:SI (match_dup 6) + (match_dup 4) + (match_dup 5)))] +{ + if (!register_operand (operands[2], DImode)) + operands[2] = force_reg (DImode, operands[2]); + + split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]); + + rtx cmplo[2] = { operands[1], operands[2] }; + rtx cmphi[2] = { operands[4], operands[5] }; + + enum rtx_code code = <maxmin_rel>; + + switch (code) + { + case LE: case LEU: + std::swap (cmplo[0], cmplo[1]); + std::swap (cmphi[0], cmphi[1]); + code = swap_condition (code); + /* FALLTHRU */ + + case GE: case GEU: + { + bool uns = (code == GEU); + rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx) + = uns ? gen_sub3_carry_ccc : gen_sub3_carry_ccgz; + + emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1])); + + rtx tmp = gen_rtx_SCRATCH (SImode); + emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1])); + + rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG); + operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx); + + break; + } + + default: + gcc_unreachable (); + } +}) \f ;; Misc patterns (?) Index: gcc/testsuite/gcc.target/i386/pr91154.c =================================================================== --- gcc/testsuite/gcc.target/i386/pr91154.c (nonexistent) +++ gcc/testsuite/gcc.target/i386/pr91154.c (working copy) @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -msse4.1 -mstv" } */ + +void foo (int *dc, int *mc, int *tpdd, int *tpmd, int M) +{ + int sc; + int k; + for (k = 1; k <= M; k++) + { + dc[k] = dc[k-1] + tpdd[k-1]; + if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; + if (dc[k] < -987654321) dc[k] = -987654321; + } +} + +/* We want to convert the loop to SSE since SSE pmaxsd is faster than + compare + conditional move. 
*/ +/* { dg-final { scan-assembler-not "cmov" } } */ +/* { dg-final { scan-assembler-times "pmaxsd" 2 } } */ +/* { dg-final { scan-assembler-times "paddd" 2 } } */ Index: gcc/testsuite/gcc.target/i386/minmax-1.c =================================================================== --- gcc/testsuite/gcc.target/i386/minmax-1.c (revision 274278) +++ gcc/testsuite/gcc.target/i386/minmax-1.c (working copy) @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -march=opteron" } */ +/* { dg-options "-O2 -march=opteron -mno-stv" } */ /* { dg-final { scan-assembler "test" } } */ /* { dg-final { scan-assembler-not "cmp" } } */ #define max(a,b) (((a) > (b))? (a) : (b)) Index: gcc/testsuite/gcc.target/i386/minmax-2.c =================================================================== --- gcc/testsuite/gcc.target/i386/minmax-2.c (revision 274278) +++ gcc/testsuite/gcc.target/i386/minmax-2.c (working copy) @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2" } */ +/* { dg-options "-O2 -mno-stv" } */ /* { dg-final { scan-assembler "test" } } */ /* { dg-final { scan-assembler-not "cmp" } } */ #define max(a,b) (((a) > (b))? (a) : (b)) Index: gcc/testsuite/gcc.target/i386/minmax-3.c =================================================================== --- gcc/testsuite/gcc.target/i386/minmax-3.c (nonexistent) +++ gcc/testsuite/gcc.target/i386/minmax-3.c (working copy) @@ -0,0 +1,27 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mstv" } */ + +#define max(a,b) (((a) > (b))? (a) : (b)) +#define min(a,b) (((a) < (b))? 
(a) : (b)) + +int ssi[1024]; +unsigned int usi[1024]; +long long sdi[1024]; +unsigned long long udi[1024]; + +#define CHECK(FN, VARIANT) \ +void \ +FN ## VARIANT (void) \ +{ \ + for (int i = 1; i < 1024; ++i) \ + VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \ +} + +CHECK(max, ssi); +CHECK(min, ssi); +CHECK(max, usi); +CHECK(min, usi); +CHECK(max, sdi); +CHECK(min, sdi); +CHECK(max, udi); +CHECK(min, udi); Index: gcc/testsuite/gcc.target/i386/minmax-4.c =================================================================== --- gcc/testsuite/gcc.target/i386/minmax-4.c (nonexistent) +++ gcc/testsuite/gcc.target/i386/minmax-4.c (working copy) @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mstv -msse4.1" } */ + +#include "minmax-3.c" + +/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */ +/* { dg-final { scan-assembler-times "pmaxud" 1 } } */ +/* { dg-final { scan-assembler-times "pminsd" 1 } } */ +/* { dg-final { scan-assembler-times "pminud" 1 } } */ Index: gcc/testsuite/gcc.target/i386/minmax-6.c =================================================================== --- gcc/testsuite/gcc.target/i386/minmax-6.c (nonexistent) +++ gcc/testsuite/gcc.target/i386/minmax-6.c (working copy) @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=haswell" } */ + +unsigned short +UMVLine16Y_11 (short unsigned int * Pic, int y, int width) +{ + if (y != width) + { + y = y < 0 ? 0 : y; + return Pic[y * width]; + } + return Pic[y]; +} + +/* We do not want the RA to spill %esi for it's dual-use but using + pmaxsd is OK. */ +/* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */ +/* { dg-final { scan-assembler "pmaxsd" } } */ ^ permalink raw reply [flat|nested] 61+ messages in thread
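The pr91154.c loop above relies on GCC first canonicalizing the conditional stores into GIMPLE MAX_EXPRs, which the new smax pattern then lets STV lower to pmaxsd/paddd. A minimal C sketch of that branch-free form (the `imax` helper is ours, not GCC's; this only illustrates the equivalence, not GCC internals):

```c
#include <assert.h>

/* Branch-free rewrite of the pr91154.c loop body: each conditional
   store becomes a max operation (GIMPLE MAX_EXPR), the shape that the
   new SImode smax pattern lets STV map onto pmaxsd.  */
static int imax (int a, int b) { return a > b ? a : b; }

static void foo_maxexpr (int *dc, int *mc, int *tpdd, int *tpmd, int M)
{
  for (int k = 1; k <= M; k++)
    {
      int sc = mc[k-1] + tpmd[k-1];
      dc[k] = imax (dc[k-1] + tpdd[k-1], sc);  /* first MAX_EXPR */
      dc[k] = imax (dc[k], -987654321);        /* clamp: second MAX_EXPR */
    }
}
```

Semantically this matches the original: storing `dc[k-1] + tpdd[k-1]` and conditionally overwriting it with the larger `sc` is exactly a max, as is the lower clamp.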
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-12 12:57 ` Richard Biener @ 2019-08-12 14:48 ` Uros Bizjak 2019-08-13 16:28 ` Jeff Law 1 sibling, 0 replies; 61+ messages in thread From: Uros Bizjak @ 2019-08-12 14:48 UTC (permalink / raw) To: Richard Biener; +Cc: Jakub Jelinek, gcc-patches, Jeff Law, H. J. Lu On Mon, Aug 12, 2019 at 2:27 PM Richard Biener <rguenther@suse.de> wrote: > > On Fri, 9 Aug 2019, Uros Bizjak wrote: > > > On Fri, Aug 9, 2019 at 3:00 PM Richard Biener <rguenther@suse.de> wrote: > > > > > > > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > > > > > > to force use of %zmmN? > > > > > > > > > > > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > > > > > > > > > > > case SMAX: > > > > > > > > case SMIN: > > > > > > > > case UMAX: > > > > > > > > case UMIN: > > > > > > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > > > > > > || (mode == SImode && !TARGET_SSE4_1)) > > > > > > > > return false; > > > > > > > > > > > > > > > > so there's no way to use AVX512VL for 32bit? > > > > > > > > > > > > > > There is a way, but on 32bit targets, we need to split DImode > > > > > > > operation to a sequence of SImode operations for unconverted pattern. > > > > > > > This is of course doable, but somehow more complex than simply > > > > > > > emitting a DImode compare + DImode cmove, which is what current > > > > > > > splitter does. So, a follow-up task. 
> > > > > > > > > > > > Please find attached the complete .md part that enables SImode for > > > > > > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit > > > > > > targets. The patterns also allows for memory operand 2, so STV has > > > > > > chance to create the vector pattern with implicit load. In case STV > > > > > > fails, the memory operand 2 is loaded to the register first; operand > > > > > > 2 is used in compare and cmove instruction, so pre-loading of the > > > > > > operand should be beneficial. > > > > > > > > > > Thanks. > > > > > > > > > > > Also note, that splitting should happen rarely. Due to the cost > > > > > > function, STV should effectively always convert minmax to a vector > > > > > > insn. > > > > > > > > > > I've analyzed the 464.h264ref slowdown on Haswell and it is due to > > > > > this kind of "simple" conversion: > > > > > > > > > > 5.50 │1d0: test %esi,%es > > > > > 0.07 │ mov $0x0,%ex > > > > > │ cmovs %eax,%es > > > > > 5.84 │ imul %r8d,%es > > > > > > > > > > to > > > > > > > > > > 0.65 │1e0: vpxor %xmm0,%xmm0,%xmm0 > > > > > 0.32 │ vpmaxs -0x10(%rsp),%xmm0,%xmm0 > > > > > 40.45 │ vmovd %xmm0,%eax > > > > > 2.45 │ imul %r8d,%eax > > > > > > > > > > which looks like a RA artifact in the end. We spill %esi only > > > > > with -mstv here as STV introduces a (subreg:V4SI ...) use > > > > > of a pseudo ultimatively set from di. STV creates an additional > > > > > pseudo for this (copy-in) but it places that copy next to the > > > > > original def rather than next to the start of the chain it > > > > > converts which is probably the issue why we spill. And this > > > > > is because it inserts those at each definition of the pseudo > > > > > rather than just at the reaching definition(s) or at the > > > > > uses of the pseudo in the chain (that because there may be > > > > > defs of that pseudo in the chain itself). 
Note that STV emits > > > > > such "conversion" copies as simple reg-reg moves: > > > > > > > > > > (insn 1094 3 4 2 (set (reg:SI 777) > > > > > (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1 > > > > > (nil)) > > > > > > > > > > but those do not prevail very long (this one gets removed by CSE2). > > > > > So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use > > > > > and computes > > > > > > > > > > r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS > > > > > a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618 > > > > > > > > > > so I wonder if STV shouldn't instead emit gpr->xmm moves > > > > > here (but I guess nothing again prevents RTL optimizers from > > > > > combining that with the single-use in the max instruction...). > > > > > > > > > > So this boils down to STV splitting live-ranges but other > > > > > passes undoing that and then RA not considering splitting > > > > > live-ranges here, arriving at unoptimal allocation. > > > > > > > > > > A testcase showing this issue is (simplified from 464.h264ref > > > > > UMVLine16Y_11): > > > > > > > > > > unsigned short > > > > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width) > > > > > { > > > > > if (y != width) > > > > > { > > > > > y = y < 0 ? 0 : y; > > > > > return Pic[y * width]; > > > > > } > > > > > return Pic[y]; > > > > > } > > > > > > > > > > where the condition and the Pic[y] load mimics the other use of y. > > > > > Different, even worse spilling is generated by > > > > > > > > > > unsigned short > > > > > UMVLine16Y_11 (short unsigned int * Pic, int y, int width) > > > > > { > > > > > y = y < 0 ? 0 : y; > > > > > return Pic[y * width] + y; > > > > > } > > > > > > > > > > I guess this all shows that STVs "trick" of simply wrapping > > > > > integer mode pseudos in (subreg:vector-mode ...) is bad? > > > > > > > > > > I've added a (failing) testcase to reflect the above. 
> > > > > > > > Experimenting a bit with just for the conversion insns using > > > > V4SImode pseudos we end up preserving those moves (but I > > > > do have to use a lowpart set, using reg:V4SI = subreg:V4SI Simode-reg > > > > ends up using movv4si_internal which only leaves us with > > > > memory for the SImode operand) _plus_ moving the move next > > > > to the actual use has an effect. Not necssarily a good one > > > > though: > > > > > > > > vpxor %xmm0, %xmm0, %xmm0 > > > > vmovaps %xmm0, -16(%rsp) > > > > movl %esi, -16(%rsp) > > > > vpmaxsd -16(%rsp), %xmm0, %xmm0 > > > > vmovd %xmm0, %eax > > > > > > > > eh? I guess the lowpart set is not good (my patch has this > > > > as well, but I got saved by never having vector modes to subset...). > > > > Using > > > > > > > > (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ])) > > > > (const_vector:V4SI [ > > > > (const_int 0 [0]) repeated x4 > > > > ]) > > > > (const_int 1 [0x1]))) "t3.c":5:10 -1 > > > > > > > > for the move ends up with > > > > > > > > vpxor %xmm1, %xmm1, %xmm1 > > > > vpinsrd $0, %esi, %xmm1, %xmm0 > > > > > > > > eh? LRA chooses the correct alternative here but somehow > > > > postreload CSE CSEs the zero with the xmm1 clearing, leading > > > > to the vpinsrd... (I guess a general issue, not sure if really > > > > worse - definitely a larger instruction). Unfortunately > > > > postreload-cse doesn't add a reg-equal note. This happens only > > > > when emitting the reg move before the use, not doing that emits > > > > a vmovd as expected. > > > > > > > > At least the spilling is gone here. > > > > > > > > I am re-testing as follows, the main change is that > > > > general_scalar_chain::make_vector_copies now generates a > > > > vector pseudo as destination (and I've fixed up the code > > > > to not generate (subreg:V4SI (reg:V4SI 1234) 0)). > > > > > > > > Hope this fixes the observed slowdowns (it fixes the new testcase). 
> > > > > > It fixes the slowdown observed in 416.gamess and 464.h264ref. > > > > > > Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress. > > > > > > CCing Jeff who "knows RTL". > > > > > > OK? > > > > Please add -mno-stv to gcc.target/i386/minmax-{1,2}.c to avoid > > spurious test failures on SSE4.1 targets. > > Done. I've also adjusted the i386.md changelog as follows: > > * config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander. > (*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split. > (*<maxmin>di3_doubleword): Likewise. > > I see > > FAIL: gcc.target/i386/pr65105-3.c scan-assembler ptest > FAIL: gcc.target/i386/pr65105-5.c scan-assembler ptest > FAIL: gcc.target/i386/pr78794.c scan-assembler pandn > > with the latest patch (this is with -m32) where -mstv causes > all spills to go away and the cmoves replaced (so clearly > better code after the patch) for pr65105-5.c, no obvious > improvements for pr65105-3.c where cmov does appear with -mstv. > I'd rather not "fix" those by adding -mno-stv but instead have > the Intel people fix costing for slm and/or decide what to do. > For pr65105-3.c I'm not sure why if-conversion didn't choose > to use cmov, so clearly the enabled minmax patterns expose the > "failure" here. > > I've also seen a 32bit ICE for a bogus store we create with the > live-range splitting fix fixed in the patch below (convert_insn > REG src handling with MEM dst needs to account for a vector-mode > src case). > > Maybe it would help to split out changes unrelated to {DI,SI}mode > chain support from the STV costing and also separately install > the live-range splitting "fix"? I'm willing to do some more > legwork to make review and approval easier here. I think this is a good idea. Now we have three sem-related changes here, if these can be split by topic into independent changes, this would also help bisection. 
It looks like the generalization from DImode support to DI/SImode comprises mostly mechanical changes, and costing is only tangential to these changes. Uros. ^ permalink raw reply [flat|nested] 61+ messages in thread
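The *<maxmin>di3_doubleword splitter discussed in this thread decides a signed 64-bit comparison on a 32-bit target with a low-word cmp followed by a high-word sbb. A rough, purely illustrative C model of that flag computation (helper names are ours, not GCC's; it assumes arithmetic right shift of negative values, as on mainstream compilers):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model (not GCC code) of the *<maxmin>di3_doubleword idea:
   signed a >= b decided with 32-bit halves only.  "cmp blo, alo" yields
   the borrow (CF); the sbb on the high words then determines the sign
   of the full-width difference.  */
static int ge_di_split (int64_t a, int64_t b)
{
  uint32_t alo = (uint32_t) a, blo = (uint32_t) b;
  int32_t ahi = (int32_t) (a >> 32), bhi = (int32_t) (b >> 32);
  unsigned borrow = alo < blo;          /* CF of the low-word compare */
  /* Widen only to model the flags; hardware reads SF/OF of the sbb.
     Mathematically a - b == hidiff * 2^32 + nonnegative low result,
     so a >= b exactly when hidiff >= 0.  */
  int64_t hidiff = (int64_t) ahi - bhi - borrow;
  return hidiff >= 0;
}

static int64_t max_di_split (int64_t a, int64_t b)
{
  return ge_di_split (a, b) ? a : b;
}
```

This is why the splitter only needs `gen_cmp_1` on the low halves plus one `sub3_carry` on the high halves, instead of materializing a full DImode compare.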
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-12 12:57 ` Richard Biener 2019-08-12 14:48 ` Uros Bizjak @ 2019-08-13 16:28 ` Jeff Law 2019-08-13 20:07 ` H.J. Lu 1 sibling, 1 reply; 61+ messages in thread From: Jeff Law @ 2019-08-13 16:28 UTC (permalink / raw) To: Richard Biener, Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches, hjl.tools On 8/12/19 6:27 AM, Richard Biener wrote: > On Fri, 9 Aug 2019, Uros Bizjak wrote: > >> On Fri, Aug 9, 2019 at 3:00 PM Richard Biener <rguenther@suse.de> wrote: >> >>>>>>>>>>>> (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) >>>>>>>>>>>> >>>>>>>>>>>> and then we need to split DImode for 32bits, too. >>>>>>>>>>> >>>>>>>>>>> For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode >>>>>>>>>>> condition, I'll provide _doubleword splitter later. >>>>>>>>>> >>>>>>>>>> Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. >>>>>>>>>> to force use of %zmmN? >>>>>>>>> >>>>>>>>> It generates V4SI mode, so - yes, AVX512VL. >>>>>>>> >>>>>>>> case SMAX: >>>>>>>> case SMIN: >>>>>>>> case UMAX: >>>>>>>> case UMIN: >>>>>>>> if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) >>>>>>>> || (mode == SImode && !TARGET_SSE4_1)) >>>>>>>> return false; >>>>>>>> >>>>>>>> so there's no way to use AVX512VL for 32bit? >>>>>>> >>>>>>> There is a way, but on 32bit targets, we need to split DImode >>>>>>> operation to a sequence of SImode operations for unconverted pattern. >>>>>>> This is of course doable, but somehow more complex than simply >>>>>>> emitting a DImode compare + DImode cmove, which is what current >>>>>>> splitter does. So, a follow-up task. >>>>>> >>>>>> Please find attached the complete .md part that enables SImode for >>>>>> TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit >>>>>> targets. The patterns also allows for memory operand 2, so STV has >>>>>> chance to create the vector pattern with implicit load. 
In case STV >>>>>> fails, the memory operand 2 is loaded to the register first; operand >>>>>> 2 is used in compare and cmove instruction, so pre-loading of the >>>>>> operand should be beneficial. >>>>> >>>>> Thanks. >>>>> >>>>>> Also note, that splitting should happen rarely. Due to the cost >>>>>> function, STV should effectively always convert minmax to a vector >>>>>> insn. >>>>> >>>>> I've analyzed the 464.h264ref slowdown on Haswell and it is due to >>>>> this kind of "simple" conversion: >>>>> >>>>> 5.50 │1d0: test %esi,%es >>>>> 0.07 │ mov $0x0,%ex >>>>> │ cmovs %eax,%es >>>>> 5.84 │ imul %r8d,%es >>>>> >>>>> to >>>>> >>>>> 0.65 │1e0: vpxor %xmm0,%xmm0,%xmm0 >>>>> 0.32 │ vpmaxs -0x10(%rsp),%xmm0,%xmm0 >>>>> 40.45 │ vmovd %xmm0,%eax >>>>> 2.45 │ imul %r8d,%eax >>>>> >>>>> which looks like a RA artifact in the end. We spill %esi only >>>>> with -mstv here as STV introduces a (subreg:V4SI ...) use >>>>> of a pseudo ultimately set from di. STV creates an additional >>>>> pseudo for this (copy-in) but it places that copy next to the >>>>> original def rather than next to the start of the chain it >>>>> converts which is probably the issue why we spill. And this >>>>> is because it inserts those at each definition of the pseudo >>>>> rather than just at the reaching definition(s) or at the >>>>> uses of the pseudo in the chain (that because there may be >>>>> defs of that pseudo in the chain itself). Note that STV emits >>>>> such "conversion" copies as simple reg-reg moves: >>>>> >>>>> (insn 1094 3 4 2 (set (reg:SI 777) >>>>> (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1 >>>>> (nil)) >>>>> >>>>> but those do not prevail very long (this one gets removed by CSE2).
>>>>> So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use >>>>> and computes >>>>> >>>>> r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS >>>>> a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618 >>>>> >>>>> so I wonder if STV shouldn't instead emit gpr->xmm moves >>>>> here (but I guess nothing again prevents RTL optimizers from >>>>> combining that with the single-use in the max instruction...). >>>>> >>>>> So this boils down to STV splitting live-ranges but other >>>>> passes undoing that and then RA not considering splitting >>>>> live-ranges here, arriving at unoptimal allocation. >>>>> >>>>> A testcase showing this issue is (simplified from 464.h264ref >>>>> UMVLine16Y_11): >>>>> >>>>> unsigned short >>>>> UMVLine16Y_11 (short unsigned int * Pic, int y, int width) >>>>> { >>>>> if (y != width) >>>>> { >>>>> y = y < 0 ? 0 : y; >>>>> return Pic[y * width]; >>>>> } >>>>> return Pic[y]; >>>>> } >>>>> >>>>> where the condition and the Pic[y] load mimics the other use of y. >>>>> Different, even worse spilling is generated by >>>>> >>>>> unsigned short >>>>> UMVLine16Y_11 (short unsigned int * Pic, int y, int width) >>>>> { >>>>> y = y < 0 ? 0 : y; >>>>> return Pic[y * width] + y; >>>>> } >>>>> >>>>> I guess this all shows that STVs "trick" of simply wrapping >>>>> integer mode pseudos in (subreg:vector-mode ...) is bad? >>>>> >>>>> I've added a (failing) testcase to reflect the above. >>>> >>>> Experimenting a bit with just for the conversion insns using >>>> V4SImode pseudos we end up preserving those moves (but I >>>> do have to use a lowpart set, using reg:V4SI = subreg:V4SI Simode-reg >>>> ends up using movv4si_internal which only leaves us with >>>> memory for the SImode operand) _plus_ moving the move next >>>> to the actual use has an effect. 
Not necessarily a good one >>>> though: >>>> >>>> vpxor %xmm0, %xmm0, %xmm0 >>>> vmovaps %xmm0, -16(%rsp) >>>> movl %esi, -16(%rsp) >>>> vpmaxsd -16(%rsp), %xmm0, %xmm0 >>>> vmovd %xmm0, %eax >>>> >>>> eh? I guess the lowpart set is not good (my patch has this >>>> as well, but I got saved by never having vector modes to subset...). >>>> Using >>>> >>>> (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ])) >>>> (const_vector:V4SI [ >>>> (const_int 0 [0]) repeated x4 >>>> ]) >>>> (const_int 1 [0x1]))) "t3.c":5:10 -1 >>>> >>>> for the move ends up with >>>> >>>> vpxor %xmm1, %xmm1, %xmm1 >>>> vpinsrd $0, %esi, %xmm1, %xmm0 >>>> >>>> eh? LRA chooses the correct alternative here but somehow >>>> postreload CSE CSEs the zero with the xmm1 clearing, leading >>>> to the vpinsrd... (I guess a general issue, not sure if really >>>> worse - definitely a larger instruction). Unfortunately >>>> postreload-cse doesn't add a reg-equal note. This happens only >>>> when emitting the reg move before the use, not doing that emits >>>> a vmovd as expected. >>>> >>>> At least the spilling is gone here. >>>> >>>> I am re-testing as follows, the main change is that >>>> general_scalar_chain::make_vector_copies now generates a >>>> vector pseudo as destination (and I've fixed up the code >>>> to not generate (subreg:V4SI (reg:V4SI 1234) 0)). >>>> >>>> Hope this fixes the observed slowdowns (it fixes the new testcase). >>> >>> It fixes the slowdown observed in 416.gamess and 464.h264ref. >>> >>> Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress. >>> >>> CCing Jeff who "knows RTL". >>> >>> OK? >> >> Please add -mno-stv to gcc.target/i386/minmax-{1,2}.c to avoid >> spurious test failures on SSE4.1 targets. > > Done. I've also adjusted the i386.md changelog as follows: > > * config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander. > (*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split. > (*<maxmin>di3_doubleword): Likewise.
> > I see > > FAIL: gcc.target/i386/pr65105-3.c scan-assembler ptest > FAIL: gcc.target/i386/pr65105-5.c scan-assembler ptest > FAIL: gcc.target/i386/pr78794.c scan-assembler pandn > > with the latest patch (this is with -m32) where -mstv causes > all spills to go away and the cmoves replaced (so clearly > better code after the patch) for pr65105-5.c, no obvious > improvements for pr65105-3.c where cmov does appear with -mstv. > I'd rather not "fix" those by adding -mno-stv but instead have > the Intel people fix costing for slm and/or decide what to do. > For pr65105-3.c I'm not sure why if-conversion didn't choose > to use cmov, so clearly the enabled minmax patterns expose the > "failure" here. I'm not sure how much effort Intel is putting into Silvermont tuning these days. So I'd suggest giving HJ a heads-up and a reasonable period of time to take a looksie, but I wouldn't hold the patch for long due to a Silvermont tuning issue. jeff ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-13 16:28 ` Jeff Law @ 2019-08-13 20:07 ` H.J. Lu 2019-08-15 9:24 ` Uros Bizjak 0 siblings, 1 reply; 61+ messages in thread From: H.J. Lu @ 2019-08-13 20:07 UTC (permalink / raw) To: Jeff Law; +Cc: Richard Biener, Uros Bizjak, Jakub Jelinek, gcc-patches On Tue, Aug 13, 2019 at 8:20 AM Jeff Law <law@redhat.com> wrote: > On 8/12/19 6:27 AM, Richard Biener wrote: > > I see > > > > FAIL: gcc.target/i386/pr65105-3.c scan-assembler ptest > > FAIL: gcc.target/i386/pr65105-5.c scan-assembler ptest > > FAIL: gcc.target/i386/pr78794.c scan-assembler pandn > > > > with the latest patch (this is with -m32) where -mstv causes > > all spills to go away and the cmoves replaced (so clearly > > better code after the patch) for pr65105-5.c, no obvious > > improvements for pr65105-3.c where cmov does appear with -mstv. > > I'd rather not "fix" those by adding -mno-stv but instead have > > the Intel people fix costing for slm and/or decide what to do. > > For pr65105-3.c I'm not sure why if-conversion didn't choose > > to use cmov, so clearly the enabled minmax patterns expose the > > "failure" here. > I'm not sure how much effort Intel is putting into Silvermont tuning > these days. So I'd suggest giving HJ a heads-up and a reasonable period > of time to take a looksie, but I wouldn't hold the patch for long due to > a Silvermont tuning issue. Leave pr65105-3.c to fail for now. We can take a look later. Thanks. -- H.J. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-13 20:07 ` H.J. Lu @ 2019-08-15 9:24 ` Uros Bizjak 0 siblings, 0 replies; 61+ messages in thread From: Uros Bizjak @ 2019-08-15 9:24 UTC (permalink / raw) To: H.J. Lu; +Cc: Jeff Law, Richard Biener, Jakub Jelinek, gcc-patches On Tue, Aug 13, 2019 at 9:54 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > > with the latest patch (this is with -m32) where -mstv causes > > > all spills to go away and the cmoves replaced (so clearly > > > better code after the patch) for pr65105-5.c, no obvious > > > improvements for pr65105-3.c where cmov does appear with -mstv. > > > I'd rather not "fix" those by adding -mno-stv but instead have > > > the Intel people fix costing for slm and/or decide what to do. > > > For pr65105-3.c I'm not sure why if-conversion didn't choose > > > to use cmov, so clearly the enabled minmax patterns expose the > > > "failure" here. > > I'm not sure how much effort Intel is putting into Silvermont tuning > > these days. So I'd suggest giving HJ a heads-up and a reasonable period > > of time to take a looksie, but I wouldn't hold the patch for long due to > > a Silvermont tuning issue. > > Leave pr65105-3.c to fail for now. We can take a look later. I have a patch for this. The problem is with conversion of COMPARE, which gets assigned to SImode chain, while in fact we expect very specific form of DImode compare. Uros. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-09 13:13 ` Richard Biener 2019-08-09 14:39 ` Uros Bizjak @ 2019-08-13 15:20 ` Jeff Law 2019-08-14 9:15 ` Richard Biener 1 sibling, 1 reply; 61+ messages in thread From: Jeff Law @ 2019-08-13 15:20 UTC (permalink / raw) To: Richard Biener, Uros Bizjak; +Cc: Jakub Jelinek, gcc-patches On 8/9/19 7:00 AM, Richard Biener wrote: > It fixes the slowdown observed in 416.gamess and 464.h264ref. > > Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress. > > CCing Jeff who "knows RTL". What specifically do you want me to look at? I'm not really familiar with the STV stuff, but can certainly take a peek. Jeff ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs 2019-08-13 15:20 ` Jeff Law @ 2019-08-14 9:15 ` Richard Biener 2019-08-14 9:36 ` Uros Bizjak 0 siblings, 1 reply; 61+ messages in thread From: Richard Biener @ 2019-08-14 9:15 UTC (permalink / raw) To: Jeff Law; +Cc: Uros Bizjak, Jakub Jelinek, gcc-patches On Tue, 13 Aug 2019, Jeff Law wrote: > On 8/9/19 7:00 AM, Richard Biener wrote: > > > > It fixes the slowdown observed in 416.gamess and 464.h264ref. > > > > Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress. > > > > CCing Jeff who "knows RTL". > What specifically do you want me to look at? I'm not really familiar > with the STV stuff, but can certainly take a peek. Below is the updated patch with the already approved and committed parts taken out. It is now mostly mechanical apart from the make_vector_copies and convert_reg changes, which move existing "patterns" under appropriate conditionals and add handling of the case where the scalar mode fits in a single GPR (previously it was -m32 DImode only, now it handles -m32/-m64 SImode and DImode). I'm redoing bootstrap / regtest on x86_64-unknown-linux-gnu now just to be safe. OK? I do expect we need to work on the compile-time issue I placed ??? comments on and more generally try to avoid using DF so much. Thanks, Richard. 2019-08-13 Richard Biener <rguenther@suse.de> PR target/91154 * config/i386/i386-features.h (scalar_chain::scalar_chain): Add mode arguments. (scalar_chain::smode): New member. (scalar_chain::vmode): Likewise. (dimode_scalar_chain): Rename to... (general_scalar_chain): ... this. (general_scalar_chain::general_scalar_chain): Take mode arguments. (timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain base with TImode and V1TImode. * config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust. (general_scalar_chain::vector_const_cost): Adjust for SImode chains. (general_scalar_chain::compute_convert_gain): Likewise.
Add {S,U}{MIN,MAX} support. (general_scalar_chain::replace_with_subreg): Use vmode/smode. (general_scalar_chain::make_vector_copies): Likewise. Handle non-DImode chains appropriately. (general_scalar_chain::convert_reg): Likewise. (general_scalar_chain::convert_op): Likewise. (general_scalar_chain::convert_insn): Likewise. Add fatal_insn_not_found if the result is not recognized. (convertible_comparison_p): Pass in the scalar mode and use that. (general_scalar_to_vector_candidate_p): Likewise. Rename from dimode_scalar_to_vector_candidate_p. Add {S,U}{MIN,MAX} support. (scalar_to_vector_candidate_p): Remove by inlining into single caller. (general_remove_non_convertible_regs): Rename from dimode_remove_non_convertible_regs. (remove_non_convertible_regs): Remove by inlining into single caller. (convert_scalars_to_vector): Handle SImode and DImode chains in addition to TImode chains. * config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander. (*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split. (*<maxmin>di3_doubleword): Likewise. * gcc.target/i386/pr91154.c: New testcase. * gcc.target/i386/minmax-3.c: Likewise. * gcc.target/i386/minmax-4.c: Likewise. * gcc.target/i386/minmax-5.c: Likewise. * gcc.target/i386/minmax-6.c: Likewise. * gcc.target/i386/minmax-1.c: Add -mno-stv. * gcc.target/i386/minmax-2.c: Likewise. Index: gcc/config/i386/i386-features.c =================================================================== --- gcc/config/i386/i386-features.c (revision 274422) +++ gcc/config/i386/i386-features.c (working copy) @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0; /* Initialize new chain. */ -scalar_chain::scalar_chain () +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) { + smode = smode_; + vmode = vmode_; + chain_id = ++max_id; if (dump_file) @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins conversion. 
*/ void -dimode_scalar_chain::mark_dual_mode_def (df_ref def) +general_scalar_chain::mark_dual_mode_def (df_ref def) { gcc_assert (DF_REF_REG_DEF_P (def)); @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate && !HARD_REGISTER_P (SET_DEST (def_set))) bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); + /* ??? The following is quadratic since analyze_register_chain + iterates over all refs to look for dual-mode regs. Instead this + should be done separately for all regs mentioned in the chain once. */ df_ref ref; df_ref def; for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates, instead of using a scalar one. */ int -dimode_scalar_chain::vector_const_cost (rtx exp) +general_scalar_chain::vector_const_cost (rtx exp) { gcc_assert (CONST_INT_P (exp)); - if (standard_sse_constant_p (exp, V2DImode)) - return COSTS_N_INSNS (1); - return ix86_cost->sse_load[1]; + if (standard_sse_constant_p (exp, vmode)) + return ix86_cost->sse_op; + /* We have separate costs for SImode and DImode, use SImode costs + for smaller modes. */ + return ix86_cost->sse_load[smode == DImode ? 1 : 0]; } /* Compute a gain for chain conversion. */ int -dimode_scalar_chain::compute_convert_gain () +general_scalar_chain::compute_convert_gain () { bitmap_iterator bi; unsigned insn_uid; @@ -491,6 +499,13 @@ dimode_scalar_chain::compute_convert_gai if (dump_file) fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); + /* SSE costs distinguish between SImode and DImode loads/stores, for + int costs factor in the number of GPRs involved. When supporting + smaller modes than SImode the int load/store costs need to be + adjusted as well. */ + unsigned sse_cost_idx = smode == DImode ? 1 : 0; + unsigned m = smode == DImode ? (TARGET_64BIT ? 
1 : 2) : 1; + EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) { rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; @@ -500,18 +515,19 @@ dimode_scalar_chain::compute_convert_gai int igain = 0; if (REG_P (src) && REG_P (dst)) - igain += 2 - ix86_cost->xmm_move; + igain += 2 * m - ix86_cost->xmm_move; else if (REG_P (src) && MEM_P (dst)) - igain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; + igain + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; else if (MEM_P (src) && REG_P (dst)) - igain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; else if (GET_CODE (src) == ASHIFT || GET_CODE (src) == ASHIFTRT || GET_CODE (src) == LSHIFTRT) { if (CONST_INT_P (XEXP (src, 0))) igain -= vector_const_cost (XEXP (src, 0)); - igain += 2 * ix86_cost->shift_const - ix86_cost->sse_op; + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; if (INTVAL (XEXP (src, 1)) >= 32) igain -= COSTS_N_INSNS (1); } @@ -521,11 +537,11 @@ dimode_scalar_chain::compute_convert_gai || GET_CODE (src) == XOR || GET_CODE (src) == AND) { - igain += 2 * ix86_cost->add - ix86_cost->sse_op; + igain += m * ix86_cost->add - ix86_cost->sse_op; /* Additional gain for andnot for targets without BMI. */ if (GET_CODE (XEXP (src, 0)) == NOT && !TARGET_BMI) - igain += 2 * ix86_cost->add; + igain += m * ix86_cost->add; if (CONST_INT_P (XEXP (src, 0))) igain -= vector_const_cost (XEXP (src, 0)); @@ -534,7 +550,18 @@ dimode_scalar_chain::compute_convert_gai } else if (GET_CODE (src) == NEG || GET_CODE (src) == NOT) - igain += 2 * ix86_cost->add - ix86_cost->sse_op - COSTS_N_INSNS (1); + igain += m * ix86_cost->add - ix86_cost->sse_op - COSTS_N_INSNS (1); + else if (GET_CODE (src) == SMAX + || GET_CODE (src) == SMIN + || GET_CODE (src) == UMAX + || GET_CODE (src) == UMIN) + { + /* We do not have any conditional move cost, estimate it as a + reg-reg move. Comparisons are costed as adds. 
*/ + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); + /* Integer SSE ops are all costed the same. */ + igain -= ix86_cost->sse_op; + } else if (GET_CODE (src) == COMPARE) { /* Assume comparison cost is the same. */ @@ -542,9 +569,11 @@ dimode_scalar_chain::compute_convert_gai else if (CONST_INT_P (src)) { if (REG_P (dst)) - igain += 2 * COSTS_N_INSNS (1); + /* DImode can be immediate for TARGET_64BIT and SImode always. */ + igain += m * COSTS_N_INSNS (1); else if (MEM_P (dst)) - igain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; + igain += (m * ix86_cost->int_store[2] + - ix86_cost->sse_store[sse_cost_idx]); igain -= vector_const_cost (src); } else @@ -561,6 +590,7 @@ dimode_scalar_chain::compute_convert_gai if (dump_file) fprintf (dump_file, " Instruction conversion gain: %d\n", gain); + /* ??? What about integer to SSE? */ EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; @@ -578,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai /* Replace REG in X with a V2DI subreg of NEW_REG. */ rtx -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) { if (x == reg) - return gen_rtx_SUBREG (V2DImode, new_reg, 0); + return gen_rtx_SUBREG (vmode, new_reg, 0); const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); int i, j; @@ -601,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg /* Replace REG in INSN with a V2DI subreg of NEW_REG. */ void -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, rtx reg, rtx new_reg) { replace_with_subreg (single_set (insn), reg, new_reg); @@ -632,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx and replace its uses in a chain. 
*/ void -dimode_scalar_chain::make_vector_copies (unsigned regno) +general_scalar_chain::make_vector_copies (unsigned regno) { rtx reg = regno_reg_rtx[regno]; - rtx vreg = gen_reg_rtx (DImode); + rtx vreg = gen_reg_rtx (smode); df_ref ref; for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) @@ -644,37 +674,59 @@ dimode_scalar_chain::make_vector_copies start_sequence (); if (!TARGET_INTER_UNIT_MOVES_TO_VEC) { - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); - emit_move_insn (adjust_address (tmp, SImode, 0), - gen_rtx_SUBREG (SImode, reg, 0)); - emit_move_insn (adjust_address (tmp, SImode, 4), - gen_rtx_SUBREG (SImode, reg, 4)); - emit_move_insn (vreg, tmp); + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); + if (smode == DImode && !TARGET_64BIT) + { + emit_move_insn (adjust_address (tmp, SImode, 0), + gen_rtx_SUBREG (SImode, reg, 0)); + emit_move_insn (adjust_address (tmp, SImode, 4), + gen_rtx_SUBREG (SImode, reg, 4)); + } + else + emit_move_insn (tmp, reg); + emit_insn (gen_rtx_SET + (gen_rtx_SUBREG (vmode, vreg, 0), + gen_rtx_VEC_MERGE (vmode, + gen_rtx_VEC_DUPLICATE (vmode, + tmp), + CONST0_RTX (vmode), + GEN_INT (HOST_WIDE_INT_1U)))); } - else if (TARGET_SSE4_1) + else if (!TARGET_64BIT && smode == DImode) { - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 0))); - emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (SImode, reg, 4), - GEN_INT (2))); + if (TARGET_SSE4_1) + { + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 0))); + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (SImode, reg, 4), + GEN_INT (2))); + } + else + { + rtx tmp = gen_reg_rtx (DImode); + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), + CONST0_RTX 
(V4SImode), + gen_rtx_SUBREG (SImode, reg, 0))); + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), + CONST0_RTX (V4SImode), + gen_rtx_SUBREG (SImode, reg, 4))); + emit_insn (gen_vec_interleave_lowv4si + (gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, vreg, 0), + gen_rtx_SUBREG (V4SImode, tmp, 0))); + } } else - { - rtx tmp = gen_reg_rtx (DImode); - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 0))); - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), - CONST0_RTX (V4SImode), - gen_rtx_SUBREG (SImode, reg, 4))); - emit_insn (gen_vec_interleave_lowv4si - (gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, vreg, 0), - gen_rtx_SUBREG (V4SImode, tmp, 0))); - } + emit_insn (gen_rtx_SET + (gen_rtx_SUBREG (vmode, vreg, 0), + gen_rtx_VEC_MERGE (vmode, + gen_rtx_VEC_DUPLICATE (vmode, + reg), + CONST0_RTX (vmode), + GEN_INT (HOST_WIDE_INT_1U)))); rtx_insn *seq = get_insns (); end_sequence (); rtx_insn *insn = DF_REF_INSN (ref); @@ -703,7 +755,7 @@ dimode_scalar_chain::make_vector_copies in case register is used in not convertible insn. 
*/ void -dimode_scalar_chain::convert_reg (unsigned regno) +general_scalar_chain::convert_reg (unsigned regno) { bool scalar_copy = bitmap_bit_p (defs_conv, regno); rtx reg = regno_reg_rtx[regno]; @@ -715,7 +767,7 @@ dimode_scalar_chain::convert_reg (unsign bitmap_copy (conv, insns); if (scalar_copy) - scopy = gen_reg_rtx (DImode); + scopy = gen_reg_rtx (smode); for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) { @@ -735,40 +787,55 @@ dimode_scalar_chain::convert_reg (unsign start_sequence (); if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) { - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); emit_move_insn (tmp, reg); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), - adjust_address (tmp, SImode, 0)); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), - adjust_address (tmp, SImode, 4)); + if (!TARGET_64BIT && smode == DImode) + { + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), + adjust_address (tmp, SImode, 0)); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), + adjust_address (tmp, SImode, 4)); + } + else + emit_move_insn (scopy, tmp); } - else if (TARGET_SSE4_1) + else if (!TARGET_64BIT && smode == DImode) { - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); - emit_insn - (gen_rtx_SET - (gen_rtx_SUBREG (SImode, scopy, 0), - gen_rtx_VEC_SELECT (SImode, - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); - - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); - emit_insn - (gen_rtx_SET - (gen_rtx_SUBREG (SImode, scopy, 4), - gen_rtx_VEC_SELECT (SImode, - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); + if (TARGET_SSE4_1) + { + rtx tmp = gen_rtx_PARALLEL (VOIDmode, + gen_rtvec (1, const0_rtx)); + emit_insn + (gen_rtx_SET + (gen_rtx_SUBREG (SImode, scopy, 0), + gen_rtx_VEC_SELECT (SImode, + gen_rtx_SUBREG (V4SImode, reg, 0), + tmp))); + + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); + emit_insn + (gen_rtx_SET + (gen_rtx_SUBREG (SImode, 
scopy, 4), + gen_rtx_VEC_SELECT (SImode, + gen_rtx_SUBREG (V4SImode, reg, 0), + tmp))); + } + else + { + rtx vcopy = gen_reg_rtx (V2DImode); + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), + gen_rtx_SUBREG (SImode, vcopy, 0)); + emit_move_insn (vcopy, + gen_rtx_LSHIFTRT (V2DImode, + vcopy, GEN_INT (32))); + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), + gen_rtx_SUBREG (SImode, vcopy, 0)); + } } else - { - rtx vcopy = gen_reg_rtx (V2DImode); - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), - gen_rtx_SUBREG (SImode, vcopy, 0)); - emit_move_insn (vcopy, - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), - gen_rtx_SUBREG (SImode, vcopy, 0)); - } + emit_move_insn (scopy, reg); + rtx_insn *seq = get_insns (); end_sequence (); emit_conversion_insns (seq, insn); @@ -817,21 +884,21 @@ dimode_scalar_chain::convert_reg (unsign registers conversion. 
*/ void -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn) +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn) { *op = copy_rtx_if_shared (*op); if (GET_CODE (*op) == NOT) { convert_op (&XEXP (*op, 0), insn); - PUT_MODE (*op, V2DImode); + PUT_MODE (*op, vmode); } else if (MEM_P (*op)) { - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (GET_MODE (*op)); emit_insn_before (gen_move_insn (tmp, *op), insn); - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); + *op = gen_rtx_SUBREG (vmode, tmp, 0); if (dump_file) fprintf (dump_file, " Preloading operand for insn %d into r%d\n", @@ -849,24 +916,30 @@ dimode_scalar_chain::convert_op (rtx *op gcc_assert (!DF_REF_CHAIN (ref)); break; } - *op = gen_rtx_SUBREG (V2DImode, *op, 0); + *op = gen_rtx_SUBREG (vmode, *op, 0); } else if (CONST_INT_P (*op)) { rtx vec_cst; - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); /* Prefer all ones vector in case of -1. */ if (constm1_operand (*op, GET_MODE (*op))) - vec_cst = CONSTM1_RTX (V2DImode); + vec_cst = CONSTM1_RTX (vmode); else - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, - gen_rtvec (2, *op, const0_rtx)); + { + unsigned n = GET_MODE_NUNITS (vmode); + rtx *v = XALLOCAVEC (rtx, n); + v[0] = *op; + for (unsigned i = 1; i < n; ++i) + v[i] = const0_rtx; + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); + } - if (!standard_sse_constant_p (vec_cst, V2DImode)) + if (!standard_sse_constant_p (vec_cst, vmode)) { start_sequence (); - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); rtx_insn *seq = get_insns (); end_sequence (); emit_insn_before (seq, insn); @@ -878,14 +951,14 @@ dimode_scalar_chain::convert_op (rtx *op else { gcc_assert (SUBREG_P (*op)); - gcc_assert (GET_MODE (*op) == V2DImode); + gcc_assert (GET_MODE (*op) == vmode); } } /* Convert INSN to vector mode. 
*/ void -dimode_scalar_chain::convert_insn (rtx_insn *insn) +general_scalar_chain::convert_insn (rtx_insn *insn) { rtx def_set = single_set (insn); rtx src = SET_SRC (def_set); @@ -896,9 +969,9 @@ dimode_scalar_chain::convert_insn (rtx_i { /* There are no scalar integer instructions and therefore temporary register usage is required. */ - rtx tmp = gen_reg_rtx (DImode); + rtx tmp = gen_reg_rtx (smode); emit_conversion_insns (gen_move_insn (dst, tmp), insn); - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); + dst = gen_rtx_SUBREG (vmode, tmp, 0); } switch (GET_CODE (src)) @@ -907,7 +980,7 @@ dimode_scalar_chain::convert_insn (rtx_i case ASHIFTRT: case LSHIFTRT: convert_op (&XEXP (src, 0), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case PLUS: @@ -915,25 +988,29 @@ dimode_scalar_chain::convert_insn (rtx_i case IOR: case XOR: case AND: + case SMAX: + case SMIN: + case UMAX: + case UMIN: convert_op (&XEXP (src, 0), insn); convert_op (&XEXP (src, 1), insn); - PUT_MODE (src, V2DImode); + PUT_MODE (src, vmode); break; case NEG: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); - src = gen_rtx_MINUS (V2DImode, subreg, src); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); + src = gen_rtx_MINUS (vmode, subreg, src); break; case NOT: src = XEXP (src, 0); convert_op (&src, insn); - subreg = gen_reg_rtx (V2DImode); - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); - src = gen_rtx_XOR (V2DImode, src, subreg); + subreg = gen_reg_rtx (vmode); + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); + src = gen_rtx_XOR (vmode, src, subreg); break; case MEM: @@ -947,17 +1024,17 @@ dimode_scalar_chain::convert_insn (rtx_i break; case SUBREG: - gcc_assert (GET_MODE (src) == V2DImode); + gcc_assert (GET_MODE (src) == vmode); break; case COMPARE: src = SUBREG_REG (XEXP (XEXP 
(src, 0), 0)); - gcc_assert ((REG_P (src) && GET_MODE (src) == DImode) - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); + gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode)) + || (SUBREG_P (src) && GET_MODE (src) == vmode)); if (REG_P (src)) - subreg = gen_rtx_SUBREG (V2DImode, src, 0); + subreg = gen_rtx_SUBREG (vmode, src, 0); else subreg = copy_rtx_if_shared (src); emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), @@ -985,7 +1062,9 @@ dimode_scalar_chain::convert_insn (rtx_i PATTERN (insn) = def_set; INSN_CODE (insn) = -1; - recog_memoized (insn); + int patt = recog_memoized (insn); + if (patt == -1) + fatal_insn_not_found (insn); df_insn_rescan (insn); } @@ -1124,7 +1203,7 @@ timode_scalar_chain::convert_insn (rtx_i } void -dimode_scalar_chain::convert_registers () +general_scalar_chain::convert_registers () { bitmap_iterator bi; unsigned id; @@ -1194,7 +1273,7 @@ has_non_address_hard_reg (rtx_insn *insn (const_int 0 [0]))) */ static bool -convertible_comparison_p (rtx_insn *insn) +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) { if (!TARGET_SSE4_1) return false; @@ -1227,12 +1306,12 @@ convertible_comparison_p (rtx_insn *insn if (!SUBREG_P (op1) || !SUBREG_P (op2) - || GET_MODE (op1) != SImode - || GET_MODE (op2) != SImode + || GET_MODE (op1) != mode + || GET_MODE (op2) != mode || ((SUBREG_BYTE (op1) != 0 - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) && (SUBREG_BYTE (op2) != 0 - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) return false; op1 = SUBREG_REG (op1); @@ -1240,7 +1319,7 @@ convertible_comparison_p (rtx_insn *insn if (op1 != op2 || !REG_P (op1) - || GET_MODE (op1) != DImode) + || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ()) return false; return true; @@ -1249,7 +1328,7 @@ convertible_comparison_p (rtx_insn *insn /* The DImode version of scalar_to_vector_candidate_p. 
*/ static bool -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) { rtx def_set = single_set (insn); @@ -1263,12 +1342,12 @@ dimode_scalar_to_vector_candidate_p (rtx rtx dst = SET_DEST (def_set); if (GET_CODE (src) == COMPARE) - return convertible_comparison_p (insn); + return convertible_comparison_p (insn, mode); /* We are interested in DImode promotion only. */ - if ((GET_MODE (src) != DImode + if ((GET_MODE (src) != mode && !CONST_INT_P (src)) - || GET_MODE (dst) != DImode) + || GET_MODE (dst) != mode) return false; if (!REG_P (dst) && !MEM_P (dst)) @@ -1288,6 +1367,15 @@ dimode_scalar_to_vector_candidate_p (rtx return false; break; + case SMAX: + case SMIN: + case UMAX: + case UMIN: + if ((mode == DImode && !TARGET_AVX512VL) + || (mode == SImode && !TARGET_SSE4_1)) + return false; + /* Fallthru. */ + case PLUS: case MINUS: case IOR: @@ -1298,7 +1386,7 @@ dimode_scalar_to_vector_candidate_p (rtx && !CONST_INT_P (XEXP (src, 1))) return false; - if (GET_MODE (XEXP (src, 1)) != DImode + if (GET_MODE (XEXP (src, 1)) != mode && !CONST_INT_P (XEXP (src, 1))) return false; break; @@ -1327,7 +1415,7 @@ dimode_scalar_to_vector_candidate_p (rtx || !REG_P (XEXP (XEXP (src, 0), 0)))) return false; - if (GET_MODE (XEXP (src, 0)) != DImode + if (GET_MODE (XEXP (src, 0)) != mode && !CONST_INT_P (XEXP (src, 0))) return false; @@ -1391,22 +1479,16 @@ timode_scalar_to_vector_candidate_p (rtx return false; } -/* Return 1 if INSN may be converted into vector - instruction. */ - -static bool -scalar_to_vector_candidate_p (rtx_insn *insn) -{ - if (TARGET_64BIT) - return timode_scalar_to_vector_candidate_p (insn); - else - return dimode_scalar_to_vector_candidate_p (insn); -} +/* For a given bitmap of insn UIDs scans all instruction and + remove insn from CANDIDATES in case it has both convertible + and not convertible definitions. -/* The DImode version of remove_non_convertible_regs. 
*/ + All insns in a bitmap are conversion candidates according to + scalar_to_vector_candidate_p. Currently it implies all insns + are single_set. */ static void -dimode_remove_non_convertible_regs (bitmap candidates) +general_remove_non_convertible_regs (bitmap candidates) { bitmap_iterator bi; unsigned id; @@ -1561,23 +1643,6 @@ timode_remove_non_convertible_regs (bitm BITMAP_FREE (regs); } -/* For a given bitmap of insn UIDs scans all instruction and - remove insn from CANDIDATES in case it has both convertible - and not convertible definitions. - - All insns in a bitmap are conversion candidates according to - scalar_to_vector_candidate_p. Currently it implies all insns - are single_set. */ - -static void -remove_non_convertible_regs (bitmap candidates) -{ - if (TARGET_64BIT) - timode_remove_non_convertible_regs (candidates); - else - dimode_remove_non_convertible_regs (candidates); -} - /* Main STV pass function. Find and convert scalar instructions into vector mode when profitable. 
*/ @@ -1585,11 +1650,14 @@ static unsigned int convert_scalars_to_vector () { basic_block bb; - bitmap candidates; int converted_insns = 0; bitmap_obstack_initialize (NULL); - candidates = BITMAP_ALLOC (NULL); + const machine_mode cand_mode[3] = { SImode, DImode, TImode }; + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode }; + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */ + for (unsigned i = 0; i < 3; ++i) + bitmap_initialize (&candidates[i], &bitmap_default_obstack); calculate_dominance_info (CDI_DOMINATORS); df_set_flags (DF_DEFER_INSN_RESCAN); @@ -1605,51 +1673,73 @@ convert_scalars_to_vector () { rtx_insn *insn; FOR_BB_INSNS (bb, insn) - if (scalar_to_vector_candidate_p (insn)) + if (TARGET_64BIT + && timode_scalar_to_vector_candidate_p (insn)) { if (dump_file) - fprintf (dump_file, " insn %d is marked as a candidate\n", + fprintf (dump_file, " insn %d is marked as a TImode candidate\n", INSN_UID (insn)); - bitmap_set_bit (candidates, INSN_UID (insn)); + bitmap_set_bit (&candidates[2], INSN_UID (insn)); + } + else + { + /* Check {SI,DI}mode. */ + for (unsigned i = 0; i <= 1; ++i) + if (general_scalar_to_vector_candidate_p (insn, cand_mode[i])) + { + if (dump_file) + fprintf (dump_file, " insn %d is marked as a %s candidate\n", + INSN_UID (insn), i == 0 ? 
"SImode" : "DImode"); + + bitmap_set_bit (&candidates[i], INSN_UID (insn)); + break; + } } } - remove_non_convertible_regs (candidates); + if (TARGET_64BIT) + timode_remove_non_convertible_regs (&candidates[2]); + for (unsigned i = 0; i <= 1; ++i) + general_remove_non_convertible_regs (&candidates[i]); - if (bitmap_empty_p (candidates)) - if (dump_file) + for (unsigned i = 0; i <= 2; ++i) + if (!bitmap_empty_p (&candidates[i])) + break; + else if (i == 2 && dump_file) fprintf (dump_file, "There are no candidates for optimization.\n"); - while (!bitmap_empty_p (candidates)) - { - unsigned uid = bitmap_first_set_bit (candidates); - scalar_chain *chain; + for (unsigned i = 0; i <= 2; ++i) + while (!bitmap_empty_p (&candidates[i])) + { + unsigned uid = bitmap_first_set_bit (&candidates[i]); + scalar_chain *chain; - if (TARGET_64BIT) - chain = new timode_scalar_chain; - else - chain = new dimode_scalar_chain; + if (cand_mode[i] == TImode) + chain = new timode_scalar_chain; + else + chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]); - /* Find instructions chain we want to convert to vector mode. - Check all uses and definitions to estimate all required - conversions. */ - chain->build (candidates, uid); + /* Find instructions chain we want to convert to vector mode. + Check all uses and definitions to estimate all required + conversions. 
*/ + chain->build (&candidates[i], uid); - if (chain->compute_convert_gain () > 0) - converted_insns += chain->convert (); - else - if (dump_file) - fprintf (dump_file, "Chain #%d conversion is not profitable\n", - chain->chain_id); + if (chain->compute_convert_gain () > 0) + converted_insns += chain->convert (); + else + if (dump_file) + fprintf (dump_file, "Chain #%d conversion is not profitable\n", + chain->chain_id); - delete chain; - } + delete chain; + } if (dump_file) fprintf (dump_file, "Total insns converted: %d\n", converted_insns); - BITMAP_FREE (candidates); + for (unsigned i = 0; i <= 2; ++i) + bitmap_release (&candidates[i]); bitmap_obstack_release (NULL); df_process_deferred_rescans (); Index: gcc/config/i386/i386-features.h =================================================================== --- gcc/config/i386/i386-features.h (revision 274422) +++ gcc/config/i386/i386-features.h (working copy) @@ -127,11 +127,16 @@ namespace { class scalar_chain { public: - scalar_chain (); + scalar_chain (enum machine_mode, enum machine_mode); virtual ~scalar_chain (); static unsigned max_id; + /* Scalar mode. */ + enum machine_mode smode; + /* Vector mode. */ + enum machine_mode vmode; + /* ID of a chain. */ unsigned int chain_id; /* A queue of instructions to be included into a chain. */ @@ -159,9 +164,11 @@ class scalar_chain virtual void convert_registers () = 0; }; -class dimode_scalar_chain : public scalar_chain +class general_scalar_chain : public scalar_chain { public: + general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) + : scalar_chain (smode_, vmode_) {} int compute_convert_gain (); private: void mark_dual_mode_def (df_ref def); @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala class timode_scalar_chain : public scalar_chain { public: + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {} + /* Convert from TImode to V1TImode is always faster. 
*/ int compute_convert_gain () { return 1; } Index: gcc/config/i386/i386.md =================================================================== --- gcc/config/i386/i386.md (revision 274422) +++ gcc/config/i386/i386.md (working copy) @@ -17719,6 +17719,110 @@ (define_expand "add<mode>cc" (match_operand:SWI 3 "const_int_operand")] "" "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;") + +;; min/max patterns + +(define_mode_iterator MAXMIN_IMODE + [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")]) +(define_code_attr maxmin_rel + [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")]) + +(define_expand "<code><mode>3" + [(parallel + [(set (match_operand:MAXMIN_IMODE 0 "register_operand") + (maxmin:MAXMIN_IMODE + (match_operand:MAXMIN_IMODE 1 "register_operand") + (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))])] + "TARGET_STV") + +(define_insn_and_split "*<code><mode>3_1" + [(set (match_operand:MAXMIN_IMODE 0 "register_operand") + (maxmin:MAXMIN_IMODE + (match_operand:MAXMIN_IMODE 1 "register_operand") + (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))] + "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV + && can_create_pseudo_p ()" + "#" + "&& 1" + [(set (match_dup 0) + (if_then_else:MAXMIN_IMODE (match_dup 3) + (match_dup 1) + (match_dup 2)))] +{ + machine_mode mode = <MODE>mode; + + if (!register_operand (operands[2], mode)) + operands[2] = force_reg (mode, operands[2]); + + enum rtx_code code = <maxmin_rel>; + machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]); + rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG); + + rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]); + emit_insn (gen_rtx_SET (flags, tmp)); + + operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx); +}) + +(define_insn_and_split "*<code>di3_doubleword" + [(set (match_operand:DI 0 "register_operand") + (maxmin:DI (match_operand:DI 1 "register_operand") + (match_operand:DI 2 
"nonimmediate_operand"))) + (clobber (reg:CC FLAGS_REG))] + "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL + && can_create_pseudo_p ()" + "#" + "&& 1" + [(set (match_dup 0) + (if_then_else:SI (match_dup 6) + (match_dup 1) + (match_dup 2))) + (set (match_dup 3) + (if_then_else:SI (match_dup 6) + (match_dup 4) + (match_dup 5)))] +{ + if (!register_operand (operands[2], DImode)) + operands[2] = force_reg (DImode, operands[2]); + + split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]); + + rtx cmplo[2] = { operands[1], operands[2] }; + rtx cmphi[2] = { operands[4], operands[5] }; + + enum rtx_code code = <maxmin_rel>; + + switch (code) + { + case LE: case LEU: + std::swap (cmplo[0], cmplo[1]); + std::swap (cmphi[0], cmphi[1]); + code = swap_condition (code); + /* FALLTHRU */ + + case GE: case GEU: + { + bool uns = (code == GEU); + rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx) + = uns ? gen_sub3_carry_ccc : gen_sub3_carry_ccgz; + + emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1])); + + rtx tmp = gen_rtx_SCRATCH (SImode); + emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1])); + + rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG); + operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx); + + break; + } + + default: + gcc_unreachable (); + } +}) \f ;; Misc patterns (?) Index: gcc/testsuite/gcc.target/i386/minmax-1.c =================================================================== --- gcc/testsuite/gcc.target/i386/minmax-1.c (revision 274422) +++ gcc/testsuite/gcc.target/i386/minmax-1.c (working copy) @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -march=opteron" } */ +/* { dg-options "-O2 -march=opteron -mno-stv" } */ /* { dg-final { scan-assembler "test" } } */ /* { dg-final { scan-assembler-not "cmp" } } */ #define max(a,b) (((a) > (b))? 
(a) : (b)) Index: gcc/testsuite/gcc.target/i386/minmax-2.c =================================================================== --- gcc/testsuite/gcc.target/i386/minmax-2.c (revision 274422) +++ gcc/testsuite/gcc.target/i386/minmax-2.c (working copy) @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2" } */ +/* { dg-options "-O2 -mno-stv" } */ /* { dg-final { scan-assembler "test" } } */ /* { dg-final { scan-assembler-not "cmp" } } */ #define max(a,b) (((a) > (b))? (a) : (b)) Index: gcc/testsuite/gcc.target/i386/minmax-3.c =================================================================== --- gcc/testsuite/gcc.target/i386/minmax-3.c (nonexistent) +++ gcc/testsuite/gcc.target/i386/minmax-3.c (working copy) @@ -0,0 +1,27 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mstv" } */ + +#define max(a,b) (((a) > (b))? (a) : (b)) +#define min(a,b) (((a) < (b))? (a) : (b)) + +int ssi[1024]; +unsigned int usi[1024]; +long long sdi[1024]; +unsigned long long udi[1024]; + +#define CHECK(FN, VARIANT) \ +void \ +FN ## VARIANT (void) \ +{ \ + for (int i = 1; i < 1024; ++i) \ + VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \ +} + +CHECK(max, ssi); +CHECK(min, ssi); +CHECK(max, usi); +CHECK(min, usi); +CHECK(max, sdi); +CHECK(min, sdi); +CHECK(max, udi); +CHECK(min, udi); Index: gcc/testsuite/gcc.target/i386/minmax-4.c =================================================================== --- gcc/testsuite/gcc.target/i386/minmax-4.c (nonexistent) +++ gcc/testsuite/gcc.target/i386/minmax-4.c (working copy) @@ -0,0 +1,9 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mstv -msse4.1" } */ + +#include "minmax-3.c" + +/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */ +/* { dg-final { scan-assembler-times "pmaxud" 1 } } */ +/* { dg-final { scan-assembler-times "pminsd" 1 } } */ +/* { dg-final { scan-assembler-times "pminud" 1 } } */ Index: gcc/testsuite/gcc.target/i386/minmax-5.c =================================================================== --- 
gcc/testsuite/gcc.target/i386/minmax-5.c (nonexistent) +++ gcc/testsuite/gcc.target/i386/minmax-5.c (working copy) @@ -0,0 +1,13 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mstv -mavx512vl" } */ + +#include "minmax-3.c" + +/* { dg-final { scan-assembler-times "vpmaxsd" 1 } } */ +/* { dg-final { scan-assembler-times "vpmaxud" 1 } } */ +/* { dg-final { scan-assembler-times "vpminsd" 1 } } */ +/* { dg-final { scan-assembler-times "vpminud" 1 } } */ +/* { dg-final { scan-assembler-times "vpmaxsq" 1 { target lp64 } } } */ +/* { dg-final { scan-assembler-times "vpmaxuq" 1 { target lp64 } } } */ +/* { dg-final { scan-assembler-times "vpminsq" 1 { target lp64 } } } */ +/* { dg-final { scan-assembler-times "vpminuq" 1 { target lp64 } } } */ Index: gcc/testsuite/gcc.target/i386/minmax-6.c =================================================================== --- gcc/testsuite/gcc.target/i386/minmax-6.c (nonexistent) +++ gcc/testsuite/gcc.target/i386/minmax-6.c (working copy) @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -march=haswell" } */ + +unsigned short +UMVLine16Y_11 (short unsigned int * Pic, int y, int width) +{ + if (y != width) + { + y = y < 0 ? 0 : y; + return Pic[y * width]; + } + return Pic[y]; +} + +/* We do not want the RA to spill %esi for its dual-use but using + pmaxsd is OK. */ +/* { dg-final { scan-assembler-not "rsp" { target { !
{ ia32 } } } } } */ +/* { dg-final { scan-assembler "pmaxsd" } } */ Index: gcc/testsuite/gcc.target/i386/pr91154.c =================================================================== --- gcc/testsuite/gcc.target/i386/pr91154.c (nonexistent) +++ gcc/testsuite/gcc.target/i386/pr91154.c (working copy) @@ -0,0 +1,20 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -msse4.1 -mstv" } */ + +void foo (int *dc, int *mc, int *tpdd, int *tpmd, int M) +{ + int sc; + int k; + for (k = 1; k <= M; k++) + { + dc[k] = dc[k-1] + tpdd[k-1]; + if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc; + if (dc[k] < -987654321) dc[k] = -987654321; + } +} + +/* We want to convert the loop to SSE since SSE pmaxsd is faster than + compare + conditional move. */ +/* { dg-final { scan-assembler-not "cmov" } } */ +/* { dg-final { scan-assembler-times "pmaxsd" 2 } } */ +/* { dg-final { scan-assembler-times "paddd" 2 } } */ ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
  2019-08-14 9:15 ` Richard Biener
@ 2019-08-14 9:36 ` Uros Bizjak
  0 siblings, 0 replies; 61+ messages in thread
From: Uros Bizjak @ 2019-08-14 9:36 UTC (permalink / raw)
To: Richard Biener; +Cc: Jeff Law, Jakub Jelinek, gcc-patches

On Wed, Aug 14, 2019 at 11:08 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Tue, 13 Aug 2019, Jeff Law wrote:
>
> > On 8/9/19 7:00 AM, Richard Biener wrote:
> > >
> > > It fixes the slowdown observed in 416.gamess and 464.h264ref.
> > >
> > > Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress.
> > >
> > > CCing Jeff who "knows RTL".
> > What specifically do you want me to look at?  I'm not really familiar
> > with the STV stuff, but can certainly take a peek.
>
> Below is the updated patch with the already approved and committed
> parts taken out.  It is now mostly mechanical apart from the
> make_vector_copies and convert_reg changes which move existing
> "patterns" under appropriate conditionals and add handling of the
> case where the scalar mode fits in a single GPR (previously it
> was -m32 DImode only, now it handles -m32/-m64 SImode and DImode).
>
> I'm redoing bootstrap / regtest on x86_64-unknown-linux-gnu now just
> to be safe.
>
> OK?
>
> I do expect we need to work on the compile-time issue I placed ???
> comments on and more generally try to avoid using DF so much.
>
> Thanks,
> Richard.
>
> 2019-08-13  Richard Biener  <rguenther@suse.de>
>
> 	PR target/91154
> 	* config/i386/i386-features.h (scalar_chain::scalar_chain): Add
> 	mode arguments.
> 	(scalar_chain::smode): New member.
> 	(scalar_chain::vmode): Likewise.
> 	(dimode_scalar_chain): Rename to...
> 	(general_scalar_chain): ... this.
> 	(general_scalar_chain::general_scalar_chain): Take mode arguments.
> 	(timode_scalar_chain::timode_scalar_chain): Initialize scalar_chain
> 	base with TImode and V1TImode.
> 	* config/i386/i386-features.c (scalar_chain::scalar_chain): Adjust.
> (general_scalar_chain::vector_const_cost): Adjust for SImode > chains. > (general_scalar_chain::compute_convert_gain): Likewise. Add > {S,U}{MIN,MAX} support. > (general_scalar_chain::replace_with_subreg): Use vmode/smode. > (general_scalar_chain::make_vector_copies): Likewise. Handle > non-DImode chains appropriately. > (general_scalar_chain::convert_reg): Likewise. > (general_scalar_chain::convert_op): Likewise. > (general_scalar_chain::convert_insn): Likewise. Add > fatal_insn_not_found if the result is not recognized. > (convertible_comparison_p): Pass in the scalar mode and use that. > (general_scalar_to_vector_candidate_p): Likewise. Rename from > dimode_scalar_to_vector_candidate_p. Add {S,U}{MIN,MAX} support. > (scalar_to_vector_candidate_p): Remove by inlining into single > caller. > (general_remove_non_convertible_regs): Rename from > dimode_remove_non_convertible_regs. > (remove_non_convertible_regs): Remove by inlining into single caller. > (convert_scalars_to_vector): Handle SImode and DImode chains > in addition to TImode chains. > * config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander. > (*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split. > (*<maxmin>di3_doubleword): Likewise. > > * gcc.target/i386/pr91154.c: New testcase. > * gcc.target/i386/minmax-3.c: Likewise. > * gcc.target/i386/minmax-4.c: Likewise. > * gcc.target/i386/minmax-5.c: Likewise. > * gcc.target/i386/minmax-6.c: Likewise. > * gcc.target/i386/minmax-1.c: Add -mno-stv. > * gcc.target/i386/minmax-2.c: Likewise. OK. Thanks, Uros. > Index: gcc/config/i386/i386-features.c > =================================================================== > --- gcc/config/i386/i386-features.c (revision 274422) > +++ gcc/config/i386/i386-features.c (working copy) > @@ -276,8 +276,11 @@ unsigned scalar_chain::max_id = 0; > > /* Initialize new chain. 
*/ > > -scalar_chain::scalar_chain () > +scalar_chain::scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > { > + smode = smode_; > + vmode = vmode_; > + > chain_id = ++max_id; > > if (dump_file) > @@ -319,7 +322,7 @@ scalar_chain::add_to_queue (unsigned ins > conversion. */ > > void > -dimode_scalar_chain::mark_dual_mode_def (df_ref def) > +general_scalar_chain::mark_dual_mode_def (df_ref def) > { > gcc_assert (DF_REF_REG_DEF_P (def)); > > @@ -409,6 +412,9 @@ scalar_chain::add_insn (bitmap candidate > && !HARD_REGISTER_P (SET_DEST (def_set))) > bitmap_set_bit (defs, REGNO (SET_DEST (def_set))); > > + /* ??? The following is quadratic since analyze_register_chain > + iterates over all refs to look for dual-mode regs. Instead this > + should be done separately for all regs mentioned in the chain once. */ > df_ref ref; > df_ref def; > for (ref = DF_INSN_UID_DEFS (insn_uid); ref; ref = DF_REF_NEXT_LOC (ref)) > @@ -469,19 +475,21 @@ scalar_chain::build (bitmap candidates, > instead of using a scalar one. */ > > int > -dimode_scalar_chain::vector_const_cost (rtx exp) > +general_scalar_chain::vector_const_cost (rtx exp) > { > gcc_assert (CONST_INT_P (exp)); > > - if (standard_sse_constant_p (exp, V2DImode)) > - return COSTS_N_INSNS (1); > - return ix86_cost->sse_load[1]; > + if (standard_sse_constant_p (exp, vmode)) > + return ix86_cost->sse_op; > + /* We have separate costs for SImode and DImode, use SImode costs > + for smaller modes. */ > + return ix86_cost->sse_load[smode == DImode ? 1 : 0]; > } > > /* Compute a gain for chain conversion. 
*/ > > int > -dimode_scalar_chain::compute_convert_gain () > +general_scalar_chain::compute_convert_gain () > { > bitmap_iterator bi; > unsigned insn_uid; > @@ -491,6 +499,13 @@ dimode_scalar_chain::compute_convert_gai > if (dump_file) > fprintf (dump_file, "Computing gain for chain #%d...\n", chain_id); > > + /* SSE costs distinguish between SImode and DImode loads/stores, for > + int costs factor in the number of GPRs involved. When supporting > + smaller modes than SImode the int load/store costs need to be > + adjusted as well. */ > + unsigned sse_cost_idx = smode == DImode ? 1 : 0; > + unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1; > + > EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi) > { > rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn; > @@ -500,18 +515,19 @@ dimode_scalar_chain::compute_convert_gai > int igain = 0; > > if (REG_P (src) && REG_P (dst)) > - igain += 2 - ix86_cost->xmm_move; > + igain += 2 * m - ix86_cost->xmm_move; > else if (REG_P (src) && MEM_P (dst)) > - igain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > + igain > + += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx]; > else if (MEM_P (src) && REG_P (dst)) > - igain += 2 * ix86_cost->int_load[2] - ix86_cost->sse_load[1]; > + igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx]; > else if (GET_CODE (src) == ASHIFT > || GET_CODE (src) == ASHIFTRT > || GET_CODE (src) == LSHIFTRT) > { > if (CONST_INT_P (XEXP (src, 0))) > igain -= vector_const_cost (XEXP (src, 0)); > - igain += 2 * ix86_cost->shift_const - ix86_cost->sse_op; > + igain += m * ix86_cost->shift_const - ix86_cost->sse_op; > if (INTVAL (XEXP (src, 1)) >= 32) > igain -= COSTS_N_INSNS (1); > } > @@ -521,11 +537,11 @@ dimode_scalar_chain::compute_convert_gai > || GET_CODE (src) == XOR > || GET_CODE (src) == AND) > { > - igain += 2 * ix86_cost->add - ix86_cost->sse_op; > + igain += m * ix86_cost->add - ix86_cost->sse_op; > /* Additional gain for andnot for targets without 
BMI. */ > if (GET_CODE (XEXP (src, 0)) == NOT > && !TARGET_BMI) > - igain += 2 * ix86_cost->add; > + igain += m * ix86_cost->add; > > if (CONST_INT_P (XEXP (src, 0))) > igain -= vector_const_cost (XEXP (src, 0)); > @@ -534,7 +550,18 @@ dimode_scalar_chain::compute_convert_gai > } > else if (GET_CODE (src) == NEG > || GET_CODE (src) == NOT) > - igain += 2 * ix86_cost->add - ix86_cost->sse_op - COSTS_N_INSNS (1); > + igain += m * ix86_cost->add - ix86_cost->sse_op - COSTS_N_INSNS (1); > + else if (GET_CODE (src) == SMAX > + || GET_CODE (src) == SMIN > + || GET_CODE (src) == UMAX > + || GET_CODE (src) == UMIN) > + { > + /* We do not have any conditional move cost, estimate it as a > + reg-reg move. Comparisons are costed as adds. */ > + igain += m * (COSTS_N_INSNS (2) + ix86_cost->add); > + /* Integer SSE ops are all costed the same. */ > + igain -= ix86_cost->sse_op; > + } > else if (GET_CODE (src) == COMPARE) > { > /* Assume comparison cost is the same. */ > @@ -542,9 +569,11 @@ dimode_scalar_chain::compute_convert_gai > else if (CONST_INT_P (src)) > { > if (REG_P (dst)) > - igain += 2 * COSTS_N_INSNS (1); > + /* DImode can be immediate for TARGET_64BIT and SImode always. */ > + igain += m * COSTS_N_INSNS (1); > else if (MEM_P (dst)) > - igain += 2 * ix86_cost->int_store[2] - ix86_cost->sse_store[1]; > + igain += (m * ix86_cost->int_store[2] > + - ix86_cost->sse_store[sse_cost_idx]); > igain -= vector_const_cost (src); > } > else > @@ -561,6 +590,7 @@ dimode_scalar_chain::compute_convert_gai > if (dump_file) > fprintf (dump_file, " Instruction conversion gain: %d\n", gain); > > + /* ??? What about integer to SSE? */ > EXECUTE_IF_SET_IN_BITMAP (defs_conv, 0, insn_uid, bi) > cost += DF_REG_DEF_COUNT (insn_uid) * ix86_cost->sse_to_integer; > > @@ -578,10 +608,10 @@ dimode_scalar_chain::compute_convert_gai > /* Replace REG in X with a V2DI subreg of NEW_REG. 
*/ > > rtx > -dimode_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) > +general_scalar_chain::replace_with_subreg (rtx x, rtx reg, rtx new_reg) > { > if (x == reg) > - return gen_rtx_SUBREG (V2DImode, new_reg, 0); > + return gen_rtx_SUBREG (vmode, new_reg, 0); > > const char *fmt = GET_RTX_FORMAT (GET_CODE (x)); > int i, j; > @@ -601,7 +631,7 @@ dimode_scalar_chain::replace_with_subreg > /* Replace REG in INSN with a V2DI subreg of NEW_REG. */ > > void > -dimode_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, > +general_scalar_chain::replace_with_subreg_in_insn (rtx_insn *insn, > rtx reg, rtx new_reg) > { > replace_with_subreg (single_set (insn), reg, new_reg); > @@ -632,10 +662,10 @@ scalar_chain::emit_conversion_insns (rtx > and replace its uses in a chain. */ > > void > -dimode_scalar_chain::make_vector_copies (unsigned regno) > +general_scalar_chain::make_vector_copies (unsigned regno) > { > rtx reg = regno_reg_rtx[regno]; > - rtx vreg = gen_reg_rtx (DImode); > + rtx vreg = gen_reg_rtx (smode); > df_ref ref; > > for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) > @@ -644,37 +674,59 @@ dimode_scalar_chain::make_vector_copies > start_sequence (); > if (!TARGET_INTER_UNIT_MOVES_TO_VEC) > { > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > - emit_move_insn (adjust_address (tmp, SImode, 0), > - gen_rtx_SUBREG (SImode, reg, 0)); > - emit_move_insn (adjust_address (tmp, SImode, 4), > - gen_rtx_SUBREG (SImode, reg, 4)); > - emit_move_insn (vreg, tmp); > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > + if (smode == DImode && !TARGET_64BIT) > + { > + emit_move_insn (adjust_address (tmp, SImode, 0), > + gen_rtx_SUBREG (SImode, reg, 0)); > + emit_move_insn (adjust_address (tmp, SImode, 4), > + gen_rtx_SUBREG (SImode, reg, 4)); > + } > + else > + emit_move_insn (tmp, reg); > + emit_insn (gen_rtx_SET > + (gen_rtx_SUBREG (vmode, vreg, 0), > + gen_rtx_VEC_MERGE (vmode, > + gen_rtx_VEC_DUPLICATE (vmode, 
> + tmp), > + CONST0_RTX (vmode), > + GEN_INT (HOST_WIDE_INT_1U)))); > } > - else if (TARGET_SSE4_1) > + else if (!TARGET_64BIT && smode == DImode) > { > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 0))); > - emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (SImode, reg, 4), > - GEN_INT (2))); > + if (TARGET_SSE4_1) > + { > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 0))); > + emit_insn (gen_sse4_1_pinsrd (gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (SImode, reg, 4), > + GEN_INT (2))); > + } > + else > + { > + rtx tmp = gen_reg_rtx (DImode); > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 0))); > + emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > + CONST0_RTX (V4SImode), > + gen_rtx_SUBREG (SImode, reg, 4))); > + emit_insn (gen_vec_interleave_lowv4si > + (gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, vreg, 0), > + gen_rtx_SUBREG (V4SImode, tmp, 0))); > + } > } > else > - { > - rtx tmp = gen_reg_rtx (DImode); > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, vreg, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 0))); > - emit_insn (gen_sse2_loadld (gen_rtx_SUBREG (V4SImode, tmp, 0), > - CONST0_RTX (V4SImode), > - gen_rtx_SUBREG (SImode, reg, 4))); > - emit_insn (gen_vec_interleave_lowv4si > - (gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, vreg, 0), > - gen_rtx_SUBREG (V4SImode, tmp, 0))); > - } > + emit_insn (gen_rtx_SET > + (gen_rtx_SUBREG (vmode, vreg, 0), > + gen_rtx_VEC_MERGE (vmode, > + gen_rtx_VEC_DUPLICATE (vmode, > + reg), > + CONST0_RTX (vmode), > + GEN_INT (HOST_WIDE_INT_1U)))); > rtx_insn *seq = get_insns (); > 
end_sequence (); > rtx_insn *insn = DF_REF_INSN (ref); > @@ -703,7 +755,7 @@ dimode_scalar_chain::make_vector_copies > in case register is used in not convertible insn. */ > > void > -dimode_scalar_chain::convert_reg (unsigned regno) > +general_scalar_chain::convert_reg (unsigned regno) > { > bool scalar_copy = bitmap_bit_p (defs_conv, regno); > rtx reg = regno_reg_rtx[regno]; > @@ -715,7 +767,7 @@ dimode_scalar_chain::convert_reg (unsign > bitmap_copy (conv, insns); > > if (scalar_copy) > - scopy = gen_reg_rtx (DImode); > + scopy = gen_reg_rtx (smode); > > for (ref = DF_REG_DEF_CHAIN (regno); ref; ref = DF_REF_NEXT_REG (ref)) > { > @@ -735,40 +787,55 @@ dimode_scalar_chain::convert_reg (unsign > start_sequence (); > if (!TARGET_INTER_UNIT_MOVES_FROM_VEC) > { > - rtx tmp = assign_386_stack_local (DImode, SLOT_STV_TEMP); > + rtx tmp = assign_386_stack_local (smode, SLOT_STV_TEMP); > emit_move_insn (tmp, reg); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > - adjust_address (tmp, SImode, 0)); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > - adjust_address (tmp, SImode, 4)); > + if (!TARGET_64BIT && smode == DImode) > + { > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > + adjust_address (tmp, SImode, 0)); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > + adjust_address (tmp, SImode, 4)); > + } > + else > + emit_move_insn (scopy, tmp); > } > - else if (TARGET_SSE4_1) > + else if (!TARGET_64BIT && smode == DImode) > { > - rtx tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const0_rtx)); > - emit_insn > - (gen_rtx_SET > - (gen_rtx_SUBREG (SImode, scopy, 0), > - gen_rtx_VEC_SELECT (SImode, > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > - > - tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > - emit_insn > - (gen_rtx_SET > - (gen_rtx_SUBREG (SImode, scopy, 4), > - gen_rtx_VEC_SELECT (SImode, > - gen_rtx_SUBREG (V4SImode, reg, 0), tmp))); > + if (TARGET_SSE4_1) > + { > + rtx tmp = gen_rtx_PARALLEL (VOIDmode, > + gen_rtvec 
(1, const0_rtx)); > + emit_insn > + (gen_rtx_SET > + (gen_rtx_SUBREG (SImode, scopy, 0), > + gen_rtx_VEC_SELECT (SImode, > + gen_rtx_SUBREG (V4SImode, reg, 0), > + tmp))); > + > + tmp = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, const1_rtx)); > + emit_insn > + (gen_rtx_SET > + (gen_rtx_SUBREG (SImode, scopy, 4), > + gen_rtx_VEC_SELECT (SImode, > + gen_rtx_SUBREG (V4SImode, reg, 0), > + tmp))); > + } > + else > + { > + rtx vcopy = gen_reg_rtx (V2DImode); > + emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > + gen_rtx_SUBREG (SImode, vcopy, 0)); > + emit_move_insn (vcopy, > + gen_rtx_LSHIFTRT (V2DImode, > + vcopy, GEN_INT (32))); > + emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > + gen_rtx_SUBREG (SImode, vcopy, 0)); > + } > } > else > - { > - rtx vcopy = gen_reg_rtx (V2DImode); > - emit_move_insn (vcopy, gen_rtx_SUBREG (V2DImode, reg, 0)); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 0), > - gen_rtx_SUBREG (SImode, vcopy, 0)); > - emit_move_insn (vcopy, > - gen_rtx_LSHIFTRT (V2DImode, vcopy, GEN_INT (32))); > - emit_move_insn (gen_rtx_SUBREG (SImode, scopy, 4), > - gen_rtx_SUBREG (SImode, vcopy, 0)); > - } > + emit_move_insn (scopy, reg); > + > rtx_insn *seq = get_insns (); > end_sequence (); > emit_conversion_insns (seq, insn); > @@ -817,21 +884,21 @@ dimode_scalar_chain::convert_reg (unsign > registers conversion. 
*/ > > void > -dimode_scalar_chain::convert_op (rtx *op, rtx_insn *insn) > +general_scalar_chain::convert_op (rtx *op, rtx_insn *insn) > { > *op = copy_rtx_if_shared (*op); > > if (GET_CODE (*op) == NOT) > { > convert_op (&XEXP (*op, 0), insn); > - PUT_MODE (*op, V2DImode); > + PUT_MODE (*op, vmode); > } > else if (MEM_P (*op)) > { > - rtx tmp = gen_reg_rtx (DImode); > + rtx tmp = gen_reg_rtx (GET_MODE (*op)); > > emit_insn_before (gen_move_insn (tmp, *op), insn); > - *op = gen_rtx_SUBREG (V2DImode, tmp, 0); > + *op = gen_rtx_SUBREG (vmode, tmp, 0); > > if (dump_file) > fprintf (dump_file, " Preloading operand for insn %d into r%d\n", > @@ -849,24 +916,30 @@ dimode_scalar_chain::convert_op (rtx *op > gcc_assert (!DF_REF_CHAIN (ref)); > break; > } > - *op = gen_rtx_SUBREG (V2DImode, *op, 0); > + *op = gen_rtx_SUBREG (vmode, *op, 0); > } > else if (CONST_INT_P (*op)) > { > rtx vec_cst; > - rtx tmp = gen_rtx_SUBREG (V2DImode, gen_reg_rtx (DImode), 0); > + rtx tmp = gen_rtx_SUBREG (vmode, gen_reg_rtx (smode), 0); > > /* Prefer all ones vector in case of -1. 
*/ > if (constm1_operand (*op, GET_MODE (*op))) > - vec_cst = CONSTM1_RTX (V2DImode); > + vec_cst = CONSTM1_RTX (vmode); > else > - vec_cst = gen_rtx_CONST_VECTOR (V2DImode, > - gen_rtvec (2, *op, const0_rtx)); > + { > + unsigned n = GET_MODE_NUNITS (vmode); > + rtx *v = XALLOCAVEC (rtx, n); > + v[0] = *op; > + for (unsigned i = 1; i < n; ++i) > + v[i] = const0_rtx; > + vec_cst = gen_rtx_CONST_VECTOR (vmode, gen_rtvec_v (n, v)); > + } > > - if (!standard_sse_constant_p (vec_cst, V2DImode)) > + if (!standard_sse_constant_p (vec_cst, vmode)) > { > start_sequence (); > - vec_cst = validize_mem (force_const_mem (V2DImode, vec_cst)); > + vec_cst = validize_mem (force_const_mem (vmode, vec_cst)); > rtx_insn *seq = get_insns (); > end_sequence (); > emit_insn_before (seq, insn); > @@ -878,14 +951,14 @@ dimode_scalar_chain::convert_op (rtx *op > else > { > gcc_assert (SUBREG_P (*op)); > - gcc_assert (GET_MODE (*op) == V2DImode); > + gcc_assert (GET_MODE (*op) == vmode); > } > } > > /* Convert INSN to vector mode. */ > > void > -dimode_scalar_chain::convert_insn (rtx_insn *insn) > +general_scalar_chain::convert_insn (rtx_insn *insn) > { > rtx def_set = single_set (insn); > rtx src = SET_SRC (def_set); > @@ -896,9 +969,9 @@ dimode_scalar_chain::convert_insn (rtx_i > { > /* There are no scalar integer instructions and therefore > temporary register usage is required. 
*/ > - rtx tmp = gen_reg_rtx (DImode); > + rtx tmp = gen_reg_rtx (smode); > emit_conversion_insns (gen_move_insn (dst, tmp), insn); > - dst = gen_rtx_SUBREG (V2DImode, tmp, 0); > + dst = gen_rtx_SUBREG (vmode, tmp, 0); > } > > switch (GET_CODE (src)) > @@ -907,7 +980,7 @@ dimode_scalar_chain::convert_insn (rtx_i > case ASHIFTRT: > case LSHIFTRT: > convert_op (&XEXP (src, 0), insn); > - PUT_MODE (src, V2DImode); > + PUT_MODE (src, vmode); > break; > > case PLUS: > @@ -915,25 +988,29 @@ dimode_scalar_chain::convert_insn (rtx_i > case IOR: > case XOR: > case AND: > + case SMAX: > + case SMIN: > + case UMAX: > + case UMIN: > convert_op (&XEXP (src, 0), insn); > convert_op (&XEXP (src, 1), insn); > - PUT_MODE (src, V2DImode); > + PUT_MODE (src, vmode); > break; > > case NEG: > src = XEXP (src, 0); > convert_op (&src, insn); > - subreg = gen_reg_rtx (V2DImode); > - emit_insn_before (gen_move_insn (subreg, CONST0_RTX (V2DImode)), insn); > - src = gen_rtx_MINUS (V2DImode, subreg, src); > + subreg = gen_reg_rtx (vmode); > + emit_insn_before (gen_move_insn (subreg, CONST0_RTX (vmode)), insn); > + src = gen_rtx_MINUS (vmode, subreg, src); > break; > > case NOT: > src = XEXP (src, 0); > convert_op (&src, insn); > - subreg = gen_reg_rtx (V2DImode); > - emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (V2DImode)), insn); > - src = gen_rtx_XOR (V2DImode, src, subreg); > + subreg = gen_reg_rtx (vmode); > + emit_insn_before (gen_move_insn (subreg, CONSTM1_RTX (vmode)), insn); > + src = gen_rtx_XOR (vmode, src, subreg); > break; > > case MEM: > @@ -947,17 +1024,17 @@ dimode_scalar_chain::convert_insn (rtx_i > break; > > case SUBREG: > - gcc_assert (GET_MODE (src) == V2DImode); > + gcc_assert (GET_MODE (src) == vmode); > break; > > case COMPARE: > src = SUBREG_REG (XEXP (XEXP (src, 0), 0)); > > - gcc_assert ((REG_P (src) && GET_MODE (src) == DImode) > - || (SUBREG_P (src) && GET_MODE (src) == V2DImode)); > + gcc_assert ((REG_P (src) && GET_MODE (src) == GET_MODE_INNER (vmode)) > 
+ || (SUBREG_P (src) && GET_MODE (src) == vmode)); > > if (REG_P (src)) > - subreg = gen_rtx_SUBREG (V2DImode, src, 0); > + subreg = gen_rtx_SUBREG (vmode, src, 0); > else > subreg = copy_rtx_if_shared (src); > emit_insn_before (gen_vec_interleave_lowv2di (copy_rtx_if_shared (subreg), > @@ -985,7 +1062,9 @@ dimode_scalar_chain::convert_insn (rtx_i > PATTERN (insn) = def_set; > > INSN_CODE (insn) = -1; > - recog_memoized (insn); > + int patt = recog_memoized (insn); > + if (patt == -1) > + fatal_insn_not_found (insn); > df_insn_rescan (insn); > } > > @@ -1124,7 +1203,7 @@ timode_scalar_chain::convert_insn (rtx_i > } > > void > -dimode_scalar_chain::convert_registers () > +general_scalar_chain::convert_registers () > { > bitmap_iterator bi; > unsigned id; > @@ -1194,7 +1273,7 @@ has_non_address_hard_reg (rtx_insn *insn > (const_int 0 [0]))) */ > > static bool > -convertible_comparison_p (rtx_insn *insn) > +convertible_comparison_p (rtx_insn *insn, enum machine_mode mode) > { > if (!TARGET_SSE4_1) > return false; > @@ -1227,12 +1306,12 @@ convertible_comparison_p (rtx_insn *insn > > if (!SUBREG_P (op1) > || !SUBREG_P (op2) > - || GET_MODE (op1) != SImode > - || GET_MODE (op2) != SImode > + || GET_MODE (op1) != mode > + || GET_MODE (op2) != mode > || ((SUBREG_BYTE (op1) != 0 > - || SUBREG_BYTE (op2) != GET_MODE_SIZE (SImode)) > + || SUBREG_BYTE (op2) != GET_MODE_SIZE (mode)) > && (SUBREG_BYTE (op2) != 0 > - || SUBREG_BYTE (op1) != GET_MODE_SIZE (SImode)))) > + || SUBREG_BYTE (op1) != GET_MODE_SIZE (mode)))) > return false; > > op1 = SUBREG_REG (op1); > @@ -1240,7 +1319,7 @@ convertible_comparison_p (rtx_insn *insn > > if (op1 != op2 > || !REG_P (op1) > - || GET_MODE (op1) != DImode) > + || GET_MODE (op1) != GET_MODE_WIDER_MODE (mode).else_blk ()) > return false; > > return true; > @@ -1249,7 +1328,7 @@ convertible_comparison_p (rtx_insn *insn > /* The DImode version of scalar_to_vector_candidate_p. 
*/ > > static bool > -dimode_scalar_to_vector_candidate_p (rtx_insn *insn) > +general_scalar_to_vector_candidate_p (rtx_insn *insn, enum machine_mode mode) > { > rtx def_set = single_set (insn); > > @@ -1263,12 +1342,12 @@ dimode_scalar_to_vector_candidate_p (rtx > rtx dst = SET_DEST (def_set); > > if (GET_CODE (src) == COMPARE) > - return convertible_comparison_p (insn); > + return convertible_comparison_p (insn, mode); > > /* We are interested in DImode promotion only. */ > - if ((GET_MODE (src) != DImode > + if ((GET_MODE (src) != mode > && !CONST_INT_P (src)) > - || GET_MODE (dst) != DImode) > + || GET_MODE (dst) != mode) > return false; > > if (!REG_P (dst) && !MEM_P (dst)) > @@ -1288,6 +1367,15 @@ dimode_scalar_to_vector_candidate_p (rtx > return false; > break; > > + case SMAX: > + case SMIN: > + case UMAX: > + case UMIN: > + if ((mode == DImode && !TARGET_AVX512VL) > + || (mode == SImode && !TARGET_SSE4_1)) > + return false; > + /* Fallthru. */ > + > case PLUS: > case MINUS: > case IOR: > @@ -1298,7 +1386,7 @@ dimode_scalar_to_vector_candidate_p (rtx > && !CONST_INT_P (XEXP (src, 1))) > return false; > > - if (GET_MODE (XEXP (src, 1)) != DImode > + if (GET_MODE (XEXP (src, 1)) != mode > && !CONST_INT_P (XEXP (src, 1))) > return false; > break; > @@ -1327,7 +1415,7 @@ dimode_scalar_to_vector_candidate_p (rtx > || !REG_P (XEXP (XEXP (src, 0), 0)))) > return false; > > - if (GET_MODE (XEXP (src, 0)) != DImode > + if (GET_MODE (XEXP (src, 0)) != mode > && !CONST_INT_P (XEXP (src, 0))) > return false; > > @@ -1391,22 +1479,16 @@ timode_scalar_to_vector_candidate_p (rtx > return false; > } > > -/* Return 1 if INSN may be converted into vector > - instruction. 
*/ > - > -static bool > -scalar_to_vector_candidate_p (rtx_insn *insn) > -{ > - if (TARGET_64BIT) > - return timode_scalar_to_vector_candidate_p (insn); > - else > - return dimode_scalar_to_vector_candidate_p (insn); > -} > +/* For a given bitmap of insn UIDs scans all instruction and > + remove insn from CANDIDATES in case it has both convertible > + and not convertible definitions. > > -/* The DImode version of remove_non_convertible_regs. */ > + All insns in a bitmap are conversion candidates according to > + scalar_to_vector_candidate_p. Currently it implies all insns > + are single_set. */ > > static void > -dimode_remove_non_convertible_regs (bitmap candidates) > +general_remove_non_convertible_regs (bitmap candidates) > { > bitmap_iterator bi; > unsigned id; > @@ -1561,23 +1643,6 @@ timode_remove_non_convertible_regs (bitm > BITMAP_FREE (regs); > } > > -/* For a given bitmap of insn UIDs scans all instruction and > - remove insn from CANDIDATES in case it has both convertible > - and not convertible definitions. > - > - All insns in a bitmap are conversion candidates according to > - scalar_to_vector_candidate_p. Currently it implies all insns > - are single_set. */ > - > -static void > -remove_non_convertible_regs (bitmap candidates) > -{ > - if (TARGET_64BIT) > - timode_remove_non_convertible_regs (candidates); > - else > - dimode_remove_non_convertible_regs (candidates); > -} > - > /* Main STV pass function. Find and convert scalar > instructions into vector mode when profitable. 
*/ > > @@ -1585,11 +1650,14 @@ static unsigned int > convert_scalars_to_vector () > { > basic_block bb; > - bitmap candidates; > int converted_insns = 0; > > bitmap_obstack_initialize (NULL); > - candidates = BITMAP_ALLOC (NULL); > + const machine_mode cand_mode[3] = { SImode, DImode, TImode }; > + const machine_mode cand_vmode[3] = { V4SImode, V2DImode, V1TImode }; > + bitmap_head candidates[3]; /* { SImode, DImode, TImode } */ > + for (unsigned i = 0; i < 3; ++i) > + bitmap_initialize (&candidates[i], &bitmap_default_obstack); > > calculate_dominance_info (CDI_DOMINATORS); > df_set_flags (DF_DEFER_INSN_RESCAN); > @@ -1605,51 +1673,73 @@ convert_scalars_to_vector () > { > rtx_insn *insn; > FOR_BB_INSNS (bb, insn) > - if (scalar_to_vector_candidate_p (insn)) > + if (TARGET_64BIT > + && timode_scalar_to_vector_candidate_p (insn)) > { > if (dump_file) > - fprintf (dump_file, " insn %d is marked as a candidate\n", > + fprintf (dump_file, " insn %d is marked as a TImode candidate\n", > INSN_UID (insn)); > > - bitmap_set_bit (candidates, INSN_UID (insn)); > + bitmap_set_bit (&candidates[2], INSN_UID (insn)); > + } > + else > + { > + /* Check {SI,DI}mode. */ > + for (unsigned i = 0; i <= 1; ++i) > + if (general_scalar_to_vector_candidate_p (insn, cand_mode[i])) > + { > + if (dump_file) > + fprintf (dump_file, " insn %d is marked as a %s candidate\n", > + INSN_UID (insn), i == 0 ? 
"SImode" : "DImode"); > + > + bitmap_set_bit (&candidates[i], INSN_UID (insn)); > + break; > + } > } > } > > - remove_non_convertible_regs (candidates); > + if (TARGET_64BIT) > + timode_remove_non_convertible_regs (&candidates[2]); > + for (unsigned i = 0; i <= 1; ++i) > + general_remove_non_convertible_regs (&candidates[i]); > > - if (bitmap_empty_p (candidates)) > - if (dump_file) > + for (unsigned i = 0; i <= 2; ++i) > + if (!bitmap_empty_p (&candidates[i])) > + break; > + else if (i == 2 && dump_file) > fprintf (dump_file, "There are no candidates for optimization.\n"); > > - while (!bitmap_empty_p (candidates)) > - { > - unsigned uid = bitmap_first_set_bit (candidates); > - scalar_chain *chain; > + for (unsigned i = 0; i <= 2; ++i) > + while (!bitmap_empty_p (&candidates[i])) > + { > + unsigned uid = bitmap_first_set_bit (&candidates[i]); > + scalar_chain *chain; > > - if (TARGET_64BIT) > - chain = new timode_scalar_chain; > - else > - chain = new dimode_scalar_chain; > + if (cand_mode[i] == TImode) > + chain = new timode_scalar_chain; > + else > + chain = new general_scalar_chain (cand_mode[i], cand_vmode[i]); > > - /* Find instructions chain we want to convert to vector mode. > - Check all uses and definitions to estimate all required > - conversions. */ > - chain->build (candidates, uid); > + /* Find instructions chain we want to convert to vector mode. > + Check all uses and definitions to estimate all required > + conversions. 
*/ > + chain->build (&candidates[i], uid); > > - if (chain->compute_convert_gain () > 0) > - converted_insns += chain->convert (); > - else > - if (dump_file) > - fprintf (dump_file, "Chain #%d conversion is not profitable\n", > - chain->chain_id); > + if (chain->compute_convert_gain () > 0) > + converted_insns += chain->convert (); > + else > + if (dump_file) > + fprintf (dump_file, "Chain #%d conversion is not profitable\n", > + chain->chain_id); > > - delete chain; > - } > + delete chain; > + } > > if (dump_file) > fprintf (dump_file, "Total insns converted: %d\n", converted_insns); > > - BITMAP_FREE (candidates); > + for (unsigned i = 0; i <= 2; ++i) > + bitmap_release (&candidates[i]); > bitmap_obstack_release (NULL); > df_process_deferred_rescans (); > > Index: gcc/config/i386/i386-features.h > =================================================================== > --- gcc/config/i386/i386-features.h (revision 274422) > +++ gcc/config/i386/i386-features.h (working copy) > @@ -127,11 +127,16 @@ namespace { > class scalar_chain > { > public: > - scalar_chain (); > + scalar_chain (enum machine_mode, enum machine_mode); > virtual ~scalar_chain (); > > static unsigned max_id; > > + /* Scalar mode. */ > + enum machine_mode smode; > + /* Vector mode. */ > + enum machine_mode vmode; > + > /* ID of a chain. */ > unsigned int chain_id; > /* A queue of instructions to be included into a chain. 
*/ > @@ -159,9 +164,11 @@ class scalar_chain > virtual void convert_registers () = 0; > }; > > -class dimode_scalar_chain : public scalar_chain > +class general_scalar_chain : public scalar_chain > { > public: > + general_scalar_chain (enum machine_mode smode_, enum machine_mode vmode_) > + : scalar_chain (smode_, vmode_) {} > int compute_convert_gain (); > private: > void mark_dual_mode_def (df_ref def); > @@ -178,6 +185,8 @@ class dimode_scalar_chain : public scala > class timode_scalar_chain : public scalar_chain > { > public: > + timode_scalar_chain () : scalar_chain (TImode, V1TImode) {} > + > /* Convert from TImode to V1TImode is always faster. */ > int compute_convert_gain () { return 1; } > > Index: gcc/config/i386/i386.md > =================================================================== > --- gcc/config/i386/i386.md (revision 274422) > +++ gcc/config/i386/i386.md (working copy) > @@ -17719,6 +17719,110 @@ (define_expand "add<mode>cc" > (match_operand:SWI 3 "const_int_operand")] > "" > "if (ix86_expand_int_addcc (operands)) DONE; else FAIL;") > + > +;; min/max patterns > + > +(define_mode_iterator MAXMIN_IMODE > + [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")]) > +(define_code_attr maxmin_rel > + [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")]) > + > +(define_expand "<code><mode>3" > + [(parallel > + [(set (match_operand:MAXMIN_IMODE 0 "register_operand") > + (maxmin:MAXMIN_IMODE > + (match_operand:MAXMIN_IMODE 1 "register_operand") > + (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) > + (clobber (reg:CC FLAGS_REG))])] > + "TARGET_STV") > + > +(define_insn_and_split "*<code><mode>3_1" > + [(set (match_operand:MAXMIN_IMODE 0 "register_operand") > + (maxmin:MAXMIN_IMODE > + (match_operand:MAXMIN_IMODE 1 "register_operand") > + (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) > + (clobber (reg:CC FLAGS_REG))] > + "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV > + && can_create_pseudo_p ()" > + "#" > + "&& 1" > + [(set (match_dup 0) 
> + (if_then_else:MAXMIN_IMODE (match_dup 3) > + (match_dup 1) > + (match_dup 2)))] > +{ > + machine_mode mode = <MODE>mode; > + > + if (!register_operand (operands[2], mode)) > + operands[2] = force_reg (mode, operands[2]); > + > + enum rtx_code code = <maxmin_rel>; > + machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]); > + rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG); > + > + rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]); > + emit_insn (gen_rtx_SET (flags, tmp)); > + > + operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx); > +}) > + > +(define_insn_and_split "*<code>di3_doubleword" > + [(set (match_operand:DI 0 "register_operand") > + (maxmin:DI (match_operand:DI 1 "register_operand") > + (match_operand:DI 2 "nonimmediate_operand"))) > + (clobber (reg:CC FLAGS_REG))] > + "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL > + && can_create_pseudo_p ()" > + "#" > + "&& 1" > + [(set (match_dup 0) > + (if_then_else:SI (match_dup 6) > + (match_dup 1) > + (match_dup 2))) > + (set (match_dup 3) > + (if_then_else:SI (match_dup 6) > + (match_dup 4) > + (match_dup 5)))] > +{ > + if (!register_operand (operands[2], DImode)) > + operands[2] = force_reg (DImode, operands[2]); > + > + split_double_mode (DImode, &operands[0], 3, &operands[0], &operands[3]); > + > + rtx cmplo[2] = { operands[1], operands[2] }; > + rtx cmphi[2] = { operands[4], operands[5] }; > + > + enum rtx_code code = <maxmin_rel>; > + > + switch (code) > + { > + case LE: case LEU: > + std::swap (cmplo[0], cmplo[1]); > + std::swap (cmphi[0], cmphi[1]); > + code = swap_condition (code); > + /* FALLTHRU */ > + > + case GE: case GEU: > + { > + bool uns = (code == GEU); > + rtx (*sbb_insn) (machine_mode, rtx, rtx, rtx) > + = uns ? 
gen_sub3_carry_ccc : gen_sub3_carry_ccgz;
> +
> +	emit_insn (gen_cmp_1 (SImode, cmplo[0], cmplo[1]));
> +
> +	rtx tmp = gen_rtx_SCRATCH (SImode);
> +	emit_insn (sbb_insn (SImode, tmp, cmphi[0], cmphi[1]));
> +
> +	rtx flags = gen_rtx_REG (uns ? CCCmode : CCGZmode, FLAGS_REG);
> +	operands[6] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx);
> +
> +	break;
> +      }
> +
> +    default:
> +      gcc_unreachable ();
> +    }
> +})
> 
> ;; Misc patterns (?)
> 
> Index: gcc/testsuite/gcc.target/i386/minmax-1.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-1.c	(revision 274422)
> +++ gcc/testsuite/gcc.target/i386/minmax-1.c	(working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -march=opteron" } */
> +/* { dg-options "-O2 -march=opteron -mno-stv" } */
>  /* { dg-final { scan-assembler "test" } } */
>  /* { dg-final { scan-assembler-not "cmp" } } */
>  #define max(a,b) (((a) > (b))? (a) : (b))
> Index: gcc/testsuite/gcc.target/i386/minmax-2.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-2.c	(revision 274422)
> +++ gcc/testsuite/gcc.target/i386/minmax-2.c	(working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2" } */
> +/* { dg-options "-O2 -mno-stv" } */
>  /* { dg-final { scan-assembler "test" } } */
>  /* { dg-final { scan-assembler-not "cmp" } } */
>  #define max(a,b) (((a) > (b))? (a) : (b))
> Index: gcc/testsuite/gcc.target/i386/minmax-3.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-3.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/i386/minmax-3.c	(working copy)
> @@ -0,0 +1,27 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstv" } */
> +
> +#define max(a,b) (((a) > (b))? (a) : (b))
> +#define min(a,b) (((a) < (b))? (a) : (b))
> +
> +int ssi[1024];
> +unsigned int usi[1024];
> +long long sdi[1024];
> +unsigned long long udi[1024];
> +
> +#define CHECK(FN, VARIANT) \
> +void \
> +FN ## VARIANT (void) \
> +{ \
> +  for (int i = 1; i < 1024; ++i) \
> +    VARIANT[i] = FN(VARIANT[i-1], VARIANT[i]); \
> +}
> +
> +CHECK(max, ssi);
> +CHECK(min, ssi);
> +CHECK(max, usi);
> +CHECK(min, usi);
> +CHECK(max, sdi);
> +CHECK(min, sdi);
> +CHECK(max, udi);
> +CHECK(min, udi);
> Index: gcc/testsuite/gcc.target/i386/minmax-4.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-4.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/i386/minmax-4.c	(working copy)
> @@ -0,0 +1,9 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstv -msse4.1" } */
> +
> +#include "minmax-3.c"
> +
> +/* { dg-final { scan-assembler-times "pmaxsd" 1 } } */
> +/* { dg-final { scan-assembler-times "pmaxud" 1 } } */
> +/* { dg-final { scan-assembler-times "pminsd" 1 } } */
> +/* { dg-final { scan-assembler-times "pminud" 1 } } */
> Index: gcc/testsuite/gcc.target/i386/minmax-5.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-5.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/i386/minmax-5.c	(working copy)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mstv -mavx512vl" } */
> +
> +#include "minmax-3.c"
> +
> +/* { dg-final { scan-assembler-times "vpmaxsd" 1 } } */
> +/* { dg-final { scan-assembler-times "vpmaxud" 1 } } */
> +/* { dg-final { scan-assembler-times "vpminsd" 1 } } */
> +/* { dg-final { scan-assembler-times "vpminud" 1 } } */
> +/* { dg-final { scan-assembler-times "vpmaxsq" 1 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times "vpmaxuq" 1 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times "vpminsq" 1 { target lp64 } } } */
> +/* { dg-final { scan-assembler-times "vpminuq" 1 { target lp64 } } } */
> Index: gcc/testsuite/gcc.target/i386/minmax-6.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/minmax-6.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/i386/minmax-6.c	(working copy)
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -march=haswell" } */
> +
> +unsigned short
> +UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
> +{
> +  if (y != width)
> +    {
> +      y = y < 0 ? 0 : y;
> +      return Pic[y * width];
> +    }
> +  return Pic[y];
> +}
> +
> +/* We do not want the RA to spill %esi for its dual use, but using
> +   pmaxsd is OK.  */
> +/* { dg-final { scan-assembler-not "rsp" { target { ! { ia32 } } } } } */
> +/* { dg-final { scan-assembler "pmaxsd" } } */
> Index: gcc/testsuite/gcc.target/i386/pr91154.c
> ===================================================================
> --- gcc/testsuite/gcc.target/i386/pr91154.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/i386/pr91154.c	(working copy)
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -msse4.1 -mstv" } */
> +
> +void foo (int *dc, int *mc, int *tpdd, int *tpmd, int M)
> +{
> +  int sc;
> +  int k;
> +  for (k = 1; k <= M; k++)
> +    {
> +      dc[k] = dc[k-1] + tpdd[k-1];
> +      if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
> +      if (dc[k] < -987654321) dc[k] = -987654321;
> +    }
> +}
> +
> +/* We want to convert the loop to SSE since SSE pmaxsd is faster than
> +   compare + conditional move.  */
> +/* { dg-final { scan-assembler-not "cmov" } } */
> +/* { dg-final { scan-assembler-times "pmaxsd" 2 } } */
> +/* { dg-final { scan-assembler-times "paddd" 2 } } */
end of thread, other threads: [~2019-08-15 9:00 UTC | newest]

Thread overview: 61+ messages:
2019-07-23 14:03 [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs Richard Biener
2019-07-24  9:14 ` Richard Biener
2019-07-24 11:30 ` Richard Biener
2019-07-24 15:12 ` Jeff Law
2019-07-27 10:07 ` Uros Bizjak
2019-08-09 22:15 ` Jeff Law
2019-07-25  9:15 ` Martin Jambor
2019-07-25 12:57 ` Richard Biener
2019-07-27 11:14 ` Uros Bizjak
2019-07-27 18:23 ` Uros Bizjak
2019-07-31 12:01 ` Richard Biener
2019-08-01  8:54 ` Uros Bizjak
2019-08-01  9:28 ` Richard Biener
2019-08-01  9:38 ` Uros Bizjak
2019-08-03 17:26 ` Richard Biener
2019-08-04 17:11 ` Uros Bizjak
2019-08-04 17:23 ` Jakub Jelinek
2019-08-04 17:36 ` Uros Bizjak
2019-08-05  8:47 ` Richard Biener
2019-08-05  9:13 ` Richard Sandiford
2019-08-05 10:08 ` Uros Bizjak
2019-08-05 10:12 ` Richard Sandiford
2019-08-05 10:24 ` Uros Bizjak
2019-08-05 10:39 ` Richard Sandiford
2019-08-05 11:50 ` Richard Biener
2019-08-05 11:59 ` Uros Bizjak
2019-08-05 12:16 ` Richard Biener
2019-08-05 12:23 ` Uros Bizjak
2019-08-05 12:33 ` Uros Bizjak
2019-08-08 16:23 ` Jeff Law
2019-08-05 12:44 ` Uros Bizjak
2019-08-05 12:51 ` Uros Bizjak
2019-08-05 12:54 ` Jakub Jelinek
2019-08-05 12:57 ` Uros Bizjak
2019-08-05 13:04 ` Richard Biener
2019-08-05 13:09 ` Uros Bizjak
2019-08-05 13:29 ` Richard Biener
2019-08-05 19:35 ` Uros Bizjak
2019-08-07  9:52 ` Richard Biener
2019-08-07 12:04 ` Richard Biener
2019-08-07 12:11 ` Uros Bizjak
2019-08-07 12:42 ` Uros Bizjak
2019-08-07 12:58 ` Uros Bizjak
2019-08-07 13:00 ` Richard Biener
2019-08-07 13:32 ` Uros Bizjak
2019-08-07 14:15 ` Richard Biener
2019-08-09  7:28 ` Uros Bizjak
2019-08-09 10:13 ` Richard Biener
2019-08-09 10:26 ` Jakub Jelinek
2019-08-09 11:15 ` Richard Biener
2019-08-09 11:06 ` Richard Biener
2019-08-09 13:13 ` Richard Biener
2019-08-09 14:39 ` Uros Bizjak
2019-08-12 12:57 ` Richard Biener
2019-08-12 14:48 ` Uros Bizjak
2019-08-13 16:28 ` Jeff Law
2019-08-13 20:07 ` H.J. Lu
2019-08-15  9:24 ` Uros Bizjak
2019-08-13 15:20 ` Jeff Law
2019-08-14  9:15 ` Richard Biener
2019-08-14  9:36 ` Uros Bizjak