Date: Wed, 24 Jul 2019 11:30:00 -0000
From: Richard Biener
To: gcc-patches@gcc.gnu.org
Cc: Jan Hubicka, ubizjak@gmail.com, kirill.yukhin@gmail.com, vmakarov@redhat.com, hjl.tools@gmail.com, Martin Jambor
Subject: Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs

On Wed, 24 Jul 2019, Richard Biener wrote:

> On Tue, 23 Jul 2019, Richard Biener wrote:
>
> > The following fixes the runtime regression of 456.hmmer caused
> > by matching ICC in code generation and using cmov more aggressively
> > (through GIMPLE-level MAX_EXPR usage).  Apparently (discovered
> > by manual assembler editing) using the SSE unit for performing
> > SImode loads, adds and then two signed max operations plus stores
> > is quite a bit faster than cmovs - even faster than the original
> > single cmov plus branchy second max.  Even more so for AMD CPUs
> > than for Intel CPUs.
> > Instead of hacking up some pattern-recognition pass to transform
> > integer-mode memory-to-memory computation chains involving
> > conditional moves into "vector" code (similar to what STV does
> > for TImode ops on x86_64), the following simply allows SImode
> > into SSE registers (support for this is already there in some
> > important places like the move patterns!).  For the particular
> > case of 456.hmmer the required support is loads/stores
> > (already implemented), SImode adds and SImode smax.
> >
> > So the patch adds an smax pattern for SImode (we don't have any
> > for scalar modes but currently expand via a conditional-move
> > sequence), emitting either an SSE vector max or cmp/cmov
> > depending on the alternative.
> >
> > And it amends the *add<mode>_1 pattern with SSE alternatives
> > (which have to come before the memory alternative, as IRA otherwise
> > doesn't consider reloading a memory operand to a register).
> >
> > With this in place the runtime of 456.hmmer improves by 10%
> > on Haswell, which is back to the pre-regression speed but not
> > to the same level as seen when manually editing just the single
> > important loop.
> >
> > I'm currently benchmarking all of SPEC CPU 2006 on Haswell.  More
> > interesting is probably Zen, where moves crossing the
> > integer - vector domain are excessively expensive (they get
> > done via the stack).
> >
> > Clearly this approach will run into register-allocation issues,
> > but it looks cleaner than writing yet another STV-like pass
> > (STV itself is quite awkwardly structured so I refrain from
> > touching it...).
> >
> > Anyway - comments?  It seems to me that MMX-in-SSE does
> > something very similar.
> >
> > Bootstrapped on x86_64-unknown-linux-gnu; previous testing
> > revealed some issue.  I forgot that *add<mode>_1 also handles
> > DImode..., fixed below, re-testing in progress.
>
> Bootstrapped/tested on x86_64-unknown-linux-gnu.
A 3-run of
> SPEC CPU 2006 on a Haswell machine completed and the results
> are in the noise besides the 456.hmmer improvement:
>
>   456.hmmer    9330   184   50.7 S    9330   162   57.4 S
>   456.hmmer    9330   182   51.2 *    9330   162   57.7 *
>   456.hmmer    9330   182   51.2 S    9330   162   57.7 S
>
> The peak binaries (patched) are all slightly bigger; the
> smaxsi3 pattern triggers 6840 times, every time using SSE
> registers and never expanding to the cmov variant.  The
> *add<mode>_1 pattern ends up using SSE regs 264 times
> (out of undoubtedly many more, uncounted, times).
>
> I do see cases where the RA ends up moving sources of
> the max from a GPR to an XMM register when the destination is
> stored to memory and used in other ops with SSE, even though it
> could have used XMM regs for the sources as well:
>
>         movl    -208(%rbp), %r8d
>         addl    (%r9,%rax), %r8d
>         vmovd   %r8d, %xmm2
>         movq    -120(%rbp), %r8
> # MAX WITH SSE
>         vpmaxsd %xmm4, %xmm2, %xmm2
>
> Amending *add<mode>_1 was of course the trickiest part,
> mostly because the GPR case has memory alternatives while
> the SSE part does not (since we have to use a whole-vector
> add we can't use a memory operand, which would be wider
> than SImode - AVX512 might come to the rescue with
> using {splat} from scalar/immediate or masking,
> but that might come at a runtime cost as well).  Allowing
> memory and splitting after reload, adding a match-scratch,
> might work as well.  But I'm not sure that wouldn't
> make using SSE regs too obvious if it's not all in the
> same alternative.  While the above code isn't too bad
> on Core, both Bulldozer and Zen take a big hit.
>
> Another case from 400.perlbench:
>
>         vmovd   .LC45(%rip), %xmm7
>         vmovd   %ebp, %xmm5
> # MAX WITH SSE
>         vpmaxsd %xmm7, %xmm5, %xmm4
>         vmovd   %xmm4, %ecx
>
> eh?  I can't see why the RA would ever choose the second
> alternative.  It looks like it prefers SSE_REGS for the
> operand set from a constant.  A testcase like
>
> int foo (int a)
> {
>   return a > 5 ?
a : 5; > } > > produces the above with -mavx2, possibly IRA thinks > the missing matching constraint for the 2nd alternative > makes it win? The dumps aren't too verbose here just > showing the costs, not how we arrive at them. Eh, this is to my use of the "cpu" attribute for smaxsi3 which makes it only enable this alternative for -mavx. Removing that we fail to consider SSE regs for the original and this testcase :/ Oh well. RA needs some more pixie dust it seems ... Richard.