From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-505963-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 110178 invoked by alias); 1 Aug 2019 08:54:18 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 110170 invoked by uid 89); 1 Aug 2019 08:54:18 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-2.6 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.1 spammy=ATM, H*f:sk:CAFULd4
X-HELO: mail-io1-f65.google.com
Received: from mail-io1-f65.google.com (HELO mail-io1-f65.google.com) (209.85.166.65) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 01 Aug 2019 08:54:16 +0000
Received: by mail-io1-f65.google.com with SMTP id j6so22813692ioa.5        for <gcc-patches@gcc.gnu.org>; Thu, 01 Aug 2019 01:54:15 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;        d=gmail.com; s=20161025;        h=mime-version:references:in-reply-to:from:date:message-id:subject:to         :cc:content-transfer-encoding;        bh=srTZxzC4o4TyksdsATO3pxB25APY81QEBFk2BGj1wcg=;        b=PK2dWXSKldh2yB2ML/e01Z914j+OAM5Lhit0eNmtkq58o/iN4EBBrMi1D5D+YJ1k6z         zzcHQZ0nfa4+6ERl8uIHmdKjjId09NpRWOrDo0HR1EvoValnIkG+2kpzT0xXUNvAiBUb         r6wcMiSSy9sVYvU3TOCzJFiHSGI8VVDxbCzBuGVuL8yVBZJr3JMsvyavBcultyhBj7xE         XeG/MgGrBSroLCxalnCveV6E8yxr5P5u0ZpHnKpBwg+JzFt1vgSMFy6Mdx+TOWfOPoHF         uEuD5wQNf40ezAuKWFNmwjgqE8HBP8wMC4FqVos4eSoh8SopP374g05m0kKqapUjmF0g         zF5A==
MIME-Version: 1.0
References: <alpine.LSU.2.20.1907231548290.30921@zhemvz.fhfr.qr> <ri6o91i32a4.fsf@suse.cz> <alpine.LSU.2.20.1907251405590.4958@zhemvz.fhfr.qr> <CAFULd4ZEAOsRSGZ5SD7kOz2NLmkvkfd6X6v=vkTm99=NgxXZiw@mail.gmail.com> <CAFULd4b9LXOZyqLSSS5mz1r7xe71LiwXFAgm5=VZFRLE57E2cw@mail.gmail.com> <alpine.LSU.2.20.1907311309240.19626@zhemvz.fhfr.qr>
In-Reply-To: <alpine.LSU.2.20.1907311309240.19626@zhemvz.fhfr.qr>
From: Uros Bizjak <ubizjak@gmail.com>
Date: Thu, 01 Aug 2019 08:54:00 -0000
Message-ID: <CAFULd4YHaEZsYn8Ym_F61QZ52hFyG4bRwJPU6y3qtSj39xRBuw@mail.gmail.com>
Subject: Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
To: Richard Biener <rguenther@suse.de>
Cc: Martin Jambor <mjambor@suse.cz>, "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, 	Jakub Jelinek <jakub@redhat.com>, Vladimir Makarov <vmakarov@redhat.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2019-08/txt/msg00008.txt.bz2

On Wed, Jul 31, 2019 at 1:21 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Sat, 27 Jul 2019, Uros Bizjak wrote:
>
> > On Sat, Jul 27, 2019 at 12:07 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> >
> > > > How would one write smaxsi3 as a splitter to be split after
> > > > reload in the case LRA assigned the GPR alternative?  Is it
> > > > even worth doing?  Even the SSE reg alternative can be split
> > > > to remove the not needed CC clobber.
> > > >
> > > > Finally I'm unsure about the add where I needed to place
> > > > the SSE alternative before the 2nd op memory one since it
> > > > otherwise gets the same cost and wins.
> > > >
> > > > So - how to go forward with this?
> > >
> > > Sorry to come a bit late to the discussion.
> > >
> > > We are aware of CMOV issue for quite some time, but the issue is not
> > > understood yet in detail (I was hoping for Intel people to look at
> > > this). However, you demonstrated that using PMAX and PMIN  instead of
> > > scalar CMOV can bring us big gains, and this thread now deals on how
> > > to best implement PMAX/PMIN for scalar code.
> > >
> > > I think that the way to go forward is with STV infrastructure.
> > > Currently, the implementation only deals with DImode on SSE2 32bit
> > > targets, but I see no issues on using STV pass also for SImode (on
> > > 32bit and 64bit targets). There are actually two STV passes, the first
> > > one (currently run on 64bit targets) is run before cse2, and the
> > > second (which currently runs on 32bit SSE2 only) is run after combine
> > > and before split1 pass. The second pass is interesting to us.
> > >
> > > The base idea of the second STV pass (for 32bit targets!) is that we
> > > introduce a DImode _doubleword instructons that otherwise do not exist
> > > with integer registers. Now, the passes up to and including combine
> > > pass can use these instructions to simplify and optimize the insn
> > > flow. Later, based on cost analysis, STV pass either converts the
> > > _doubleword instructions to a real vector ones (e.g. V2DImode
> > > patterns) or leaves them intact, and a follow-up split pass splits
> > > them into scalar SImode instruction pairs. STV pass also takes care to
> > > move and preload values from their scalar form to a vector
> > > representation (using SUBREGs). Please note that all this happens on
> > > pseudos, and register allocator will later simply use scalar (integer)
> > > registers in scalar patterns and vector registers with vector insn
> > > patterns.
> > >
> > > Your approach to amend existing scalar SImode patterns with vector
> > > registers will introduce no end of problems. Register allocator will
> > > do funny things during register pressure, where values will take a
> > > trip to a vector register before being stored to memory (and vice
> > > versa, you already found some of them). Current RA simply can't
> > > distinguish clearly between two register sets.
> > >
> > > So, my advice would be to use STV pass also for SImode values, on
> > > 64bit and 32bit targets. On both targets, we will be able to use
> > > instructions that operate on vector register set, and for 32bit
> > > targets (and to some extent on 64bit targets), we would perhaps be
> > > able to relax register pressure in a kind of controlled way.
> > >
> > > So, to demonstrate the benefits of existing STV pass, it should be
> > > relatively easy to introduce 64bit max/min pattern on 32bit target to
> > > handle 64bit values. For 32bit values, the pass should be re-run to
> > > convert SImode scalar operations to vector operations in a controlled
> > > way, based on various cost functions.
>
> I've looked at STV before trying to use RA to solve the issue but
> quickly stepped away because of its structure which seems to be
> tied to particular modes, duplicating things for TImode and DImode
> so it looked like I have to write up everything again for SImode...

ATM, DImode is used exclusively for x86_32 while TImode is used
exclusively for x86_64. Also, TImode is used for different purpose
before combine, while DImode is used after combine. I don't remember
the details, but IIRC it made sense for the intended purpose.
>
> It really should be possible to run the pass once, handling a set
> of modes rather than re-running it for the SImode case I am after.
> See also a recent PR about STV slowness and tendency to hog memory
> because it seems to enable every DF problem that is around...

Huh, I was not aware of implementation details...

> > Please find attached patch to see STV in action. The compilation will
> > crash due to non-existing V2DImode SMAX insn, but in the _.268r.stv2
> > dump, you will be able to see chain building, cost calculation and
> > conversion insertion.
>
> So you unconditionally add a smaxdi3 pattern - indeed this looks
> necessary even when going the STV route.  The actual regression
> for the testcase could also be solved by turing the smaxsi3
> back into a compare and jump rather than a conditional move sequence.
> So I wonder how you'd do that given that there's pass_if_after_reload
> after pass_split_after_reload and I'm not sure we can split
> as late as pass_split_before_sched2 (there's also a split _after_
> sched2 on x86 it seems).
>
> So how would you go implement {s,u}{min,max}{si,di}3 for the
> case STV doesn't end up doing any transform?

If STV doesn't transform the insn, then a pre-reload splitter splits
the insn back to compare+cmove. However, considering the SImode move
from/to int/xmm register is relatively cheap, the cost function should
be tuned so that STV always converts smaxsi3 pattern. (As said before,
the fix of the slowdown with consecutive cmov insns is a side effect
of the transformation to smax insn that helps in this particular case,
I think that this issue should be fixed in a general way, there are
already a couple of PRs reported).

> You could save me some guesswork here if you can come up with
> a reasonably complete final set of patterns (ok, I only care
> about smaxsi3) so I can have a look at the STV approach again
> (you may remember I simply "split" at assembler emission time).

I think that the cost function should always enable smaxsi3
generation. To further optimize STV chain (to avoid unnecessary
xmm<->int transitions) we could add all integer logic, arithmetic and
constant shifts to the candidates (the ones that DImode STV converts).

Uros.

> Thanks,
> Richard.
>
> > The testcase:
> >
> > --cut here--
> > long long test (long long a, long long b)
> > {
> >   return (a > b) ? a : b;
> > }
> > --cut here--
> >
> > gcc -O2 -m32 -msse2 (-mstv):
> >
> > _.268r.stv2 dump:
> >
> > Searching for mode conversion candidates...
> >   insn 2 is marked as a candidate
> >   insn 3 is marked as a candidate
> >   insn 7 is marked as a candidate
> > Created a new instruction chain #1
> > Building chain #1...
> >   Adding insn 2 to chain #1
> >   Adding insn 7 into chain's #1 queue
> >   Adding insn 7 to chain #1
> >   r85 use in insn 12 isn't convertible
> >   Mark r85 def in insn 7 as requiring both modes in chain #1
> >   Adding insn 3 into chain's #1 queue
> >   Adding insn 3 to chain #1
> > Collected chain #1...
> >   insns: 2, 3, 7
> >   defs to convert: r85
> > Computing gain for chain #1...
> >   Instruction conversion gain: 24
> >   Registers conversion cost: 6
> >   Total gain: 18
> > Converting chain #1...
> >
> > ...
> >
> > (insn 2 5 3 2 (set (reg/v:DI 83 [ a ])
> >         (mem/c:DI (reg/f:SI 16 argp) [1 a+0 S8 A32])) "max.c":2:1 66
> > {*movdi_internal}
> >      (nil))
> > (insn 3 2 4 2 (set (reg/v:DI 84 [ b ])
> >         (mem/c:DI (plus:SI (reg/f:SI 16 argp)
> >                 (const_int 8 [0x8])) [1 b+0 S8 A32])) "max.c":2:1 66
> > {*movdi_internal}
> >      (nil))
> > (note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
> > (insn 7 4 15 2 (set (subreg:V2DI (reg:DI 85) 0)
> >         (smax:V2DI (subreg:V2DI (reg/v:DI 84 [ b ]) 0)
> >             (subreg:V2DI (reg/v:DI 83 [ a ]) 0))) "max.c":3:22 -1
> >      (expr_list:REG_DEAD (reg/v:DI 84 [ b ])
> >         (expr_list:REG_DEAD (reg/v:DI 83 [ a ])
> >             (expr_list:REG_UNUSED (reg:CC 17 flags)
> >                 (nil)))))
> > (insn 15 7 16 2 (set (reg:V2DI 87)
> >         (subreg:V2DI (reg:DI 85) 0)) "max.c":3:22 -1
> >      (nil))
> > (insn 16 15 17 2 (set (subreg:SI (reg:DI 86) 0)
> >         (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
> >      (nil))
> > (insn 17 16 18 2 (set (reg:V2DI 87)
> >         (lshiftrt:V2DI (reg:V2DI 87)
> >             (const_int 32 [0x20]))) "max.c":3:22 -1
> >      (nil))
> > (insn 18 17 12 2 (set (subreg:SI (reg:DI 86) 4)
> >         (subreg:SI (reg:V2DI 87) 0)) "max.c":3:22 -1
> >      (nil))
> > (insn 12 18 13 2 (set (reg/i:DI 0 ax)
> >         (reg:DI 86)) "max.c":4:1 66 {*movdi_internal}
> >      (expr_list:REG_DEAD (reg:DI 86)
> >         (nil)))
> > (insn 13 12 0 2 (use (reg/i:DI 0 ax)) "max.c":4:1 -1
> >      (nil))
> >
> > Uros.
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
> GF: Felix Imend=C3=B6rffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG N=C3=
=BCrnberg)