Date: Wed, 24 Jul 2019 11:30:00 -0000
From: Richard Biener
To: gcc-patches@gcc.gnu.org
Cc: Jan Hubicka, ubizjak@gmail.com, kirill.yukhin@gmail.com, vmakarov@redhat.com, hjl.tools@gmail.com, Martin Jambor
Subject: Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs

On Wed, 24 Jul 2019, Richard Biener wrote:

> On Tue, 23 Jul 2019, Richard Biener wrote:
>
> > The following fixes the runtime regression of 456.hmmer caused
> > by matching ICC in code generation and using cmov more aggressively
> > (through GIMPLE-level MAX_EXPR usage).  Apparently (discovered
> > by manual assembler editing) using the SSE unit for performing
> > SImode loads, adds and then two signed max operations plus stores
> > is quite a bit faster than cmovs - even faster than the original
> > single cmov plus branchy second max.  Even more so for AMD CPUs
> > than for Intel CPUs.
> > Instead of hacking up some pattern-recognition pass to transform
> > integer-mode memory-to-memory computation chains involving
> > conditional moves into "vector" code (similar to what STV does
> > for TImode ops on x86_64), the following simply allows SImode
> > into SSE registers (support for this is already there in some
> > important places like the move patterns!).  For the particular
> > case of 456.hmmer the required support is loads/stores
> > (already implemented), SImode adds and SImode smax.
> >
> > So the patch adds an smax pattern for SImode (we don't have any
> > for scalar modes but currently expand via a conditional-move
> > sequence), emitting either an SSE vector max or cmp/cmov
> > depending on the alternative.
> >
> > And it amends the *add<mode>_1 pattern with SSE alternatives
> > (which have to come before the memory alternative, as IRA otherwise
> > doesn't consider reloading a memory operand to a register).
> >
> > With this in place the runtime of 456.hmmer improves by 10%
> > on Haswell, which is back to the pre-regression speed but not
> > to the same level as seen when manually editing just the single
> > important loop.
> >
> > I'm currently benchmarking all of SPEC CPU 2006 on Haswell.  More
> > interesting is probably Zen, where moves crossing the
> > integer - vector domain are excessively expensive (they get
> > done via the stack).
> >
> > Clearly this approach will run into register-allocation issues,
> > but it looks cleaner than writing yet another STV-like pass
> > (STV itself is quite awkwardly structured so I refrain from
> > touching it...).
> >
> > Anyway - comments?  It seems to me that MMX-in-SSE does
> > something very similar.
> >
> > Bootstrapped on x86_64-unknown-linux-gnu; previous testing
> > revealed some issue.  I forgot that *add<mode>_1 also handles
> > DImode..., fixed below, re-testing in progress.
>
> Bootstrapped/tested on x86_64-unknown-linux-gnu.
A 3-run of
> SPEC CPU 2006 on a Haswell machine completed and the results
> are in the noise besides the 456.hmmer improvement:
>
>   456.hmmer    9330   184   50.7 S    9330   162   57.4 S
>   456.hmmer    9330   182   51.2 *    9330   162   57.7 *
>   456.hmmer    9330   182   51.2 S    9330   162   57.7 S
>
> The peak binaries (patched) are all slightly bigger; the
> smaxsi3 pattern triggers 6840 times, every time using SSE
> registers and never expanding to the cmov variant.  The
> *add<mode>_1 pattern ends up using SSE regs 264 times
> (out of undoubtedly many more, uncounted, times).
>
> I do see cases where the RA ends up moving sources of
> the max from a GPR to an XMM register when the destination is
> stored to memory and used in other ops with SSE, even though it
> could have used XMM regs for the sources as well:
>
>         movl    -208(%rbp), %r8d
>         addl    (%r9,%rax), %r8d
>         vmovd   %r8d, %xmm2
>         movq    -120(%rbp), %r8
> # MAX WITH SSE
>         vpmaxsd %xmm4, %xmm2, %xmm2
>
> Amending *add<mode>_1 was of course the trickiest part,
> mostly because the GPR case has memory alternatives while
> the SSE part does not (since we have to use a whole-vector
> add we can't use a memory operand, which would be wider
> than SImode - AVX512 might come to the rescue with
> using {splat} from scalar/immediate or masking,
> but that might come at a runtime cost as well).  Allowing
> memory and splitting after reload, adding a match-scratch,
> might work as well.  But I'm not sure that wouldn't
> make using SSE regs too obvious if it's not all in the
> same alternative.  While the above code isn't too bad
> on Core, both Bulldozer and Zen take a big hit.
>
> Another case from 400.perlbench:
>
>         vmovd   .LC45(%rip), %xmm7
>         vmovd   %ebp, %xmm5
> # MAX WITH SSE
>         vpmaxsd %xmm7, %xmm5, %xmm4
>         vmovd   %xmm4, %ecx
>
> eh?  I can't see why the RA would ever choose the second
> alternative.  It looks like it prefers SSE_REGS for the
> operand set from a constant.  A testcase like
>
> int foo (int a)
> {
>   return a > 5 ?
a : 5; > } > > produces the above with -mavx2, possibly IRA thinks > the missing matching constraint for the 2nd alternative > makes it win? The dumps aren't too verbose here just > showing the costs, not how we arrive at them. Eh, this is to my use of the "cpu" attribute for smaxsi3 which makes it only enable this alternative for -mavx. Removing that we fail to consider SSE regs for the original and this testcase :/ Oh well. RA needs some more pixie dust it seems ... Richard.