From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 67405 invoked by alias); 9 Aug 2019 09:25:34 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 67397 invoked by uid 89); 9 Aug 2019 09:25:34 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-5.8 required=5.0 tests=AWL,BAYES_00,SPF_PASS autolearn=ham version=3.3.1 spammy=spilling X-HELO: mx1.suse.de Received: from mx2.suse.de (HELO mx1.suse.de) (195.135.220.15) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 09 Aug 2019 09:25:33 +0000 Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id C1D2EAF11; Fri, 9 Aug 2019 09:25:30 +0000 (UTC) Date: Fri, 09 Aug 2019 10:13:00 -0000 From: Richard Biener To: Uros Bizjak cc: Jakub Jelinek , "gcc-patches@gcc.gnu.org" Subject: Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs In-Reply-To: Message-ID: References: <20190805125358.GR2726@tucnak> User-Agent: Alpine 2.20 (LSU 67 2015-01-07) MIME-Version: 1.0 Content-Type: multipart/mixed; BOUNDARY="-1609908220-2058439771-1565342730=:11741" X-SW-Source: 2019-08/txt/msg00614.txt.bz2 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1609908220-2058439771-1565342730=:11741 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8BIT Content-length: 4315 On Fri, 9 Aug 2019, Uros Bizjak wrote: > On Mon, Aug 5, 2019 at 3:09 PM Uros Bizjak wrote: > > > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"]) > > > > > > > > > > > > > > and then we need to split DImode for 32bits, too. > > > > > > > > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode > > > > > > condition, I'll provide _doubleword splitter later. > > > > > > > > > > Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc. > > > > > to force use of %zmmN? > > > > > > > > It generates V4SI mode, so - yes, AVX512VL. > > > > > > case SMAX: > > > case SMIN: > > > case UMAX: > > > case UMIN: > > > if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL)) > > > || (mode == SImode && !TARGET_SSE4_1)) > > > return false; > > > > > > so there's no way to use AVX512VL for 32bit? > > > > There is a way, but on 32bit targets, we need to split DImode > > operation to a sequence of SImode operations for unconverted pattern. > > This is of course doable, but somehow more complex than simply > > emitting a DImode compare + DImode cmove, which is what current > > splitter does. So, a follow-up task. > > Please find attached the complete .md part that enables SImode for > TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit > targets. The patterns also allows for memory operand 2, so STV has > chance to create the vector pattern with implicit load. In case STV > fails, the memory operand 2 is loaded to the register first; operand > 2 is used in compare and cmove instruction, so pre-loading of the > operand should be beneficial. Thanks. > Also note, that splitting should happen rarely. Due to the cost > function, STV should effectively always convert minmax to a vector > insn. I've analyzed the 464.h264ref slowdown on Haswell and it is due to this kind of "simple" conversion: 5.50 │1d0: test %esi,%es 0.07 │ mov $0x0,%ex │ cmovs %eax,%es 5.84 │ imul %r8d,%es to 0.65 │1e0: vpxor %xmm0,%xmm0,%xmm0 0.32 │ vpmaxs -0x10(%rsp),%xmm0,%xmm0 40.45 │ vmovd %xmm0,%eax 2.45 │ imul %r8d,%eax which looks like a RA artifact in the end. We spill %esi only with -mstv here as STV introduces a (subreg:V4SI ...) use of a pseudo ultimatively set from di. STV creates an additional pseudo for this (copy-in) but it places that copy next to the original def rather than next to the start of the chain it converts which is probably the issue why we spill. And this is because it inserts those at each definition of the pseudo rather than just at the reaching definition(s) or at the uses of the pseudo in the chain (that because there may be defs of that pseudo in the chain itself). Note that STV emits such "conversion" copies as simple reg-reg moves: (insn 1094 3 4 2 (set (reg:SI 777) (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1 (nil)) but those do not prevail very long (this one gets removed by CSE2). So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use and computes r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618 so I wonder if STV shouldn't instead emit gpr->xmm moves here (but I guess nothing again prevents RTL optimizers from combining that with the single-use in the max instruction...). So this boils down to STV splitting live-ranges but other passes undoing that and then RA not considering splitting live-ranges here, arriving at unoptimal allocation. A testcase showing this issue is (simplified from 464.h264ref UMVLine16Y_11): unsigned short UMVLine16Y_11 (short unsigned int * Pic, int y, int width) { if (y != width) { y = y < 0 ? 0 : y; return Pic[y * width]; } return Pic[y]; } where the condition and the Pic[y] load mimics the other use of y. Different, even worse spilling is generated by unsigned short UMVLine16Y_11 (short unsigned int * Pic, int y, int width) { y = y < 0 ? 0 : y; return Pic[y * width] + y; } I guess this all shows that STVs "trick" of simply wrapping integer mode pseudos in (subreg:vector-mode ...) is bad? I've added a (failing) testcase to reflect the above. Richard. ---1609908220-2058439771-1565342730=:11741--