On Sat, Jul 27, 2019 at 12:07 PM Uros Bizjak <ubizjak@gmail.com> wrote:

> > How would one write smaxsi3 as a splitter to be split after
> > reload in the case LRA assigned the GPR alternative?  Is it
> > even worth doing?  Even the SSE reg alternative can be split
> > to remove the not needed CC clobber.
> >
> > Finally I'm unsure about the add where I needed to place
> > the SSE alternative before the 2nd op memory one since it
> > otherwise gets the same cost and wins.
> >
> > So - how to go forward with this?
>
> Sorry to come a bit late to the discussion.
>
> We are aware of CMOV issue for quite some time, but the issue is not
> understood yet in detail (I was hoping for Intel people to look at
> this). However, you demonstrated that using PMAX and PMIN  instead of
> scalar CMOV can bring us big gains, and this thread now deals on how
> to best implement PMAX/PMIN for scalar code.
>
> I think that the way to go forward is with STV infrastructure.
> Currently, the implementation only deals with DImode on SSE2 32bit
> targets, but I see no issues on using STV pass also for SImode (on
> 32bit and 64bit targets). There are actually two STV passes, the first
> one (currently run on 64bit targets) is run before cse2, and the
> second (which currently runs on 32bit SSE2 only) is run after combine
> and before split1 pass. The second pass is interesting to us.
>
> The base idea of the second STV pass (for 32bit targets!) is that we
> introduce a DImode _doubleword instructons that otherwise do not exist
> with integer registers. Now, the passes up to and including combine
> pass can use these instructions to simplify and optimize the insn
> flow. Later, based on cost analysis, STV pass either converts the
> _doubleword instructions to a real vector ones (e.g. V2DImode
> patterns) or leaves them intact, and a follow-up split pass splits
> them into scalar SImode instruction pairs. STV pass also takes care to
> move and preload values from their scalar form to a vector
> representation (using SUBREGs). Please note that all this happens on
> pseudos, and register allocator will later simply use scalar (integer)
> registers in scalar patterns and vector registers with vector insn
> patterns.
>
> Your approach to amend existing scalar SImode patterns with vector
> registers will introduce no end of problems. Register allocator will
> do funny things during register pressure, where values will take a
> trip to a vector register before being stored to memory (and vice
> versa, you already found some of them). Current RA simply can't
> distinguish clearly between two register sets.
>
> So, my advice would be to use STV pass also for SImode values, on
> 64bit and 32bit targets. On both targets, we will be able to use
> instructions that operate on vector register set, and for 32bit
> targets (and to some extent on 64bit targets), we would perhaps be
> able to relax register pressure in a kind of controlled way.
>
> So, to demonstrate the benefits of existing STV pass, it should be
> relatively easy to introduce 64bit max/min pattern on 32bit target to
> handle 64bit values. For 32bit values, the pass should be re-run to
> convert SImode scalar operations to vector operations in a controlled
> way, based on various cost functions.

Please find attached patch to see STV in action. The compilation will
crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2
dump, you will be able to see chain building, cost calculation and
conversion insertion.

The testcase:

--cut here--
long long test (long long a, long long b)
{
  return (a > b) ? a : b;
}
--cut here--

gcc -O2 -m32 -msse2 (-mstv):

_.268r.stv2 dump:

Searching for mode conversion candidates...
  insn 2 is marked as a candidate
  insn 3 is marked as a candidate
  insn 7 is marked as a candidate
Created a new instruction chain #1
Building chain #1...
  Adding insn 2 to chain #1
  Adding insn 7 into chain's #1 queue
  Adding insn 7 to chain #1
  r85 use in insn 12 isn't convertible
  Mark r85 def in insn 7 as requiring both modes in chain #1
  Adding insn 3 into chain's #1 queue
  Adding insn 3 to chain #1
Collected chain #1...
  insns: 2, 3, 7
  defs to convert: r85
Computing gain for chain #1...
  Instruction conversion gain: 24
  Registers conversion cost: 6
  Total gain: 18
Converting chain #1...

...

(insn 2 5 3 2 (set (reg/v:DI 83 [ a ])
        (mem/c:DI (reg/f:SI 16 argp) [1 a+0 S8 A32])) "max.c":2:1 66
{*movdi_internal}
     (nil))
(insn 3 2 4 2 (set (reg/v:DI 84 [ b ])
        (mem/c:DI (plus:SI (reg/f:SI 16 argp)
                (const_int 8 [0x8])) [1 b+0 S8 A32])) "max.c":2:1 66
{*movdi_internal}
     (nil))
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 15 2 (set (subreg:V2DI (reg:DI 85) 0)
        (smax:V2DI (subreg:V2DI (reg/v:DI 84 [ b ]) 0)
            (subreg:V2DI (reg/v:DI 83 [ a ]) 0))) "max.c":3:22 -1
     (expr_list:REG_DEAD (reg/v:DI 84 [ b ])
        (expr_list:REG_DEAD (reg/v:DI 83 [ a ])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))
(insn 15 7 16 2 (set (reg:V2DI 87)
        (subreg:V2DI (reg:DI 85) 0)) "max.c":3:22 -1
     (nil))
(insn 16 15 17 2 (set (subreg:SI (reg:DI 86) 0)
        (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
     (nil))
(insn 17 16 18 2 (set (reg:V2DI 87)
        (lshiftrt:V2DI (reg:V2DI 87)
            (const_int 32 [0x20]))) "max.c":3:22 -1
     (nil))
(insn 18 17 12 2 (set (subreg:SI (reg:DI 86) 4)
        (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
     (nil))
(insn 12 18 13 2 (set (reg/i:DI 0 ax)
        (reg:DI 86)) "max.c":4:1 66 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 86)
        (nil)))
(insn 13 12 0 2 (use (reg/i:DI 0 ax)) "max.c":4:1 -1
     (nil))

Uros.