On Sat, Jul 27, 2019 at 12:07 PM Uros Bizjak wrote: > > How would one write smaxsi3 as a splitter to be split after > > reload in the case LRA assigned the GPR alternative? Is it > > even worth doing? Even the SSE reg alternative can be split > > to remove the not needed CC clobber. > > > > Finally I'm unsure about the add where I needed to place > > the SSE alternative before the 2nd op memory one since it > > otherwise gets the same cost and wins. > > > > So - how to go forward with this? > > Sorry to come a bit late to the discussion. > > We are aware of CMOV issue for quite some time, but the issue is not > understood yet in detail (I was hoping for Intel people to look at > this). However, you demonstrated that using PMAX and PMIN instead of > scalar CMOV can bring us big gains, and this thread now deals on how > to best implement PMAX/PMIN for scalar code. > > I think that the way to go forward is with STV infrastructure. > Currently, the implementation only deals with DImode on SSE2 32bit > targets, but I see no issues on using STV pass also for SImode (on > 32bit and 64bit targets). There are actually two STV passes, the first > one (currently run on 64bit targets) is run before cse2, and the > second (which currently runs on 32bit SSE2 only) is run after combine > and before split1 pass. The second pass is interesting to us. > > The base idea of the second STV pass (for 32bit targets!) is that we > introduce a DImode _doubleword instructons that otherwise do not exist > with integer registers. Now, the passes up to and including combine > pass can use these instructions to simplify and optimize the insn > flow. Later, based on cost analysis, STV pass either converts the > _doubleword instructions to a real vector ones (e.g. V2DImode > patterns) or leaves them intact, and a follow-up split pass splits > them into scalar SImode instruction pairs. STV pass also takes care to > move and preload values from their scalar form to a vector > representation (using SUBREGs). Please note that all this happens on > pseudos, and register allocator will later simply use scalar (integer) > registers in scalar patterns and vector registers with vector insn > patterns. > > Your approach to amend existing scalar SImode patterns with vector > registers will introduce no end of problems. Register allocator will > do funny things during register pressure, where values will take a > trip to a vector register before being stored to memory (and vice > versa, you already found some of them). Current RA simply can't > distinguish clearly between two register sets. > > So, my advice would be to use STV pass also for SImode values, on > 64bit and 32bit targets. On both targets, we will be able to use > instructions that operate on vector register set, and for 32bit > targets (and to some extent on 64bit targets), we would perhaps be > able to relax register pressure in a kind of controlled way. > > So, to demonstrate the benefits of existing STV pass, it should be > relatively easy to introduce 64bit max/min pattern on 32bit target to > handle 64bit values. For 32bit values, the pass should be re-run to > convert SImode scalar operations to vector operations in a controlled > way, based on various cost functions. Please find attached patch to see STV in action. The compilation will crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2 dump, you will be able to see chain building, cost calculation and conversion insertion. The testcase: --cut here-- long long test (long long a, long long b) { return (a > b) ? a : b; } --cut here-- gcc -O2 -m32 -msse2 (-mstv): _.268r.stv2 dump: Searching for mode conversion candidates... insn 2 is marked as a candidate insn 3 is marked as a candidate insn 7 is marked as a candidate Created a new instruction chain #1 Building chain #1... Adding insn 2 to chain #1 Adding insn 7 into chain's #1 queue Adding insn 7 to chain #1 r85 use in insn 12 isn't convertible Mark r85 def in insn 7 as requiring both modes in chain #1 Adding insn 3 into chain's #1 queue Adding insn 3 to chain #1 Collected chain #1... insns: 2, 3, 7 defs to convert: r85 Computing gain for chain #1... Instruction conversion gain: 24 Registers conversion cost: 6 Total gain: 18 Converting chain #1... ... (insn 2 5 3 2 (set (reg/v:DI 83 [ a ]) (mem/c:DI (reg/f:SI 16 argp) [1 a+0 S8 A32])) "max.c":2:1 66 {*movdi_internal} (nil)) (insn 3 2 4 2 (set (reg/v:DI 84 [ b ]) (mem/c:DI (plus:SI (reg/f:SI 16 argp) (const_int 8 [0x8])) [1 b+0 S8 A32])) "max.c":2:1 66 {*movdi_internal} (nil)) (note 4 3 7 2 NOTE_INSN_FUNCTION_BEG) (insn 7 4 15 2 (set (subreg:V2DI (reg:DI 85) 0) (smax:V2DI (subreg:V2DI (reg/v:DI 84 [ b ]) 0) (subreg:V2DI (reg/v:DI 83 [ a ]) 0))) "max.c":3:22 -1 (expr_list:REG_DEAD (reg/v:DI 84 [ b ]) (expr_list:REG_DEAD (reg/v:DI 83 [ a ]) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil))))) (insn 15 7 16 2 (set (reg:V2DI 87) (subreg:V2DI (reg:DI 85) 0)) "max.c":3:22 -1 (nil)) (insn 16 15 17 2 (set (subreg:SI (reg:DI 86) 0) (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1 (nil)) (insn 17 16 18 2 (set (reg:V2DI 87) (lshiftrt:V2DI (reg:V2DI 87) (const_int 32 [0x20]))) "max.c":3:22 -1 (nil)) (insn 18 17 12 2 (set (subreg:SI (reg:DI 86) 4) (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1 (nil)) (insn 12 18 13 2 (set (reg/i:DI 0 ax) (reg:DI 86)) "max.c":4:1 66 {*movdi_internal} (expr_list:REG_DEAD (reg:DI 86) (nil))) (insn 13 12 0 2 (use (reg/i:DI 0 ax)) "max.c":4:1 -1 (nil)) Uros.