From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-506519-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 14121 invoked by alias); 8 Aug 2019 15:24:43 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 2324 invoked by uid 89); 8 Aug 2019 15:24:36 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-2.0 required=5.0 tests=AWL,BAYES_00,SPF_HELO_PASS autolearn=ham version=3.3.1 spammy=barebone, honza, Honza
X-HELO: mx1.redhat.com
Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 08 Aug 2019 15:24:34 +0000
Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16])	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))	(No client certificate requested)	by mx1.redhat.com (Postfix) with ESMTPS id E8F3B65401;	Thu,  8 Aug 2019 15:24:32 +0000 (UTC)
Received: from localhost.localdomain (ovpn-112-43.rdu2.redhat.com [10.10.112.43])	by smtp.corp.redhat.com (Postfix) with ESMTP id 161B85C219;	Thu,  8 Aug 2019 15:24:30 +0000 (UTC)
Subject: Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
To: Uros Bizjak <ubizjak@gmail.com>, Richard Biener <rguenther@suse.de>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, Jakub Jelinek <jakub@redhat.com>, "H. J. Lu" <hjl.tools@gmail.com>, Jan Hubicka <hubicka@ucw.cz>
References: <alpine.LSU.2.20.1907231548290.30921@zhemvz.fhfr.qr> <ri6o91i32a4.fsf@suse.cz> <alpine.LSU.2.20.1907251405590.4958@zhemvz.fhfr.qr> <CAFULd4ZEAOsRSGZ5SD7kOz2NLmkvkfd6X6v=vkTm99=NgxXZiw@mail.gmail.com> <CAFULd4b9LXOZyqLSSS5mz1r7xe71LiwXFAgm5=VZFRLE57E2cw@mail.gmail.com> <alpine.LSU.2.20.1907311309240.19626@zhemvz.fhfr.qr> <CAFULd4YHaEZsYn8Ym_F61QZ52hFyG4bRwJPU6y3qtSj39xRBuw@mail.gmail.com> <alpine.LSU.2.20.1908011122550.19626@zhemvz.fhfr.qr> <CAFULd4YmJ1W1is5F3oJCMdk0uKUsLUr6wei8K7TsVrPygBBwgA@mail.gmail.com> <alpine.DEB.2.21.1908031923130.5283@rguenther-XPS-13-9380> <CAFULd4ams5g7J0PwqDpn-cPZcVa9niTqw5_BWVzGg6MtcDi8sQ@mail.gmail.com> <alpine.LSU.2.20.1908051316260.19626@zhemvz.fhfr.qr> <CAFULd4aDDZCw0LkZbo7J1c=eEBqjeh-Q9NLWv94HLBDvNhd+WQ@mail.gmail.com>
From: Jeff Law <law@redhat.com>
Openpgp: preference=signencrypt
Message-ID: <5fa725f8-d821-cfb2-c5dc-b0cbec8801d4@redhat.com>
Date: Thu, 08 Aug 2019 16:23:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0
MIME-Version: 1.0
In-Reply-To: <CAFULd4aDDZCw0LkZbo7J1c=eEBqjeh-Q9NLWv94HLBDvNhd+WQ@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-IsSubscribed: yes
X-SW-Source: 2019-08/txt/msg00564.txt.bz2

On 8/5/19 6:32 AM, Uros Bizjak wrote:
> On Mon, Aug 5, 2019 at 1:50 PM Richard Biener <rguenther@suse.de> wrote:
>>
>> On Sun, 4 Aug 2019, Uros Bizjak wrote:
>>
>>> On Sat, Aug 3, 2019 at 7:26 PM Richard Biener <rguenther@suse.de> wrote:
>>>>
>>>> On Thu, 1 Aug 2019, Uros Bizjak wrote:
>>>>
>>>>> On Thu, Aug 1, 2019 at 11:28 AM Richard Biener <rguenther@suse.de> wrote:
>>>>>
>>>>>>>> So you unconditionally add a smaxdi3 pattern - indeed this looks
>>>>>>>> necessary even when going the STV route.  The actual regression
>>>>>>>> for the testcase could also be solved by turing the smaxsi3
>>>>>>>> back into a compare and jump rather than a conditional move sequence.
>>>>>>>> So I wonder how you'd do that given that there's pass_if_after_reload
>>>>>>>> after pass_split_after_reload and I'm not sure we can split
>>>>>>>> as late as pass_split_before_sched2 (there's also a split _after_
>>>>>>>> sched2 on x86 it seems).
>>>>>>>>
>>>>>>>> So how would you go implement {s,u}{min,max}{si,di}3 for the
>>>>>>>> case STV doesn't end up doing any transform?
>>>>>>>
>>>>>>> If STV doesn't transform the insn, then a pre-reload splitter splits
>>>>>>> the insn back to compare+cmove.
>>>>>>
>>>>>> OK, that would work.  But there's no way to force a jumpy sequence then
>>>>>> which we know is faster than compare+cmove because later RTL
>>>>>> if-conversion passes happily re-discover the smax (or conditional move)
>>>>>> sequence.
>>>>>>
>>>>>>> However, considering the SImode move
>>>>>>> from/to int/xmm register is relatively cheap, the cost function should
>>>>>>> be tuned so that STV always converts smaxsi3 pattern.
>>>>>>
>>>>>> Note that on both Zen and even more so bdverN the int/xmm transition
>>>>>> makes it no longer profitable but a _lot_ slower than the cmp/cmov
>>>>>> sequence... (for the loop in hmmer which is the only one I see
>>>>>> any effect of any of my patches).  So identifying chains that
>>>>>> start/end in memory is important for cost reasons.
>>>>>
>>>>> Please note that the cost function also considers the cost of move
>>>>> from/to xmm. So, the cost of the whole chain would disable the
>>>>> transformation.
>>>>>
>>>>>> So I think the splitting has to happen after the last if-conversion
>>>>>> pass (and thus we may need to allocate a scratch register for this
>>>>>> purpose?)
>>>>>
>>>>> I really hope that the underlying issue will be solved by a machine
>>>>> dependant pass inserted somewhere after the pre-reload split. This
>>>>> way, we can split unconverted smax to the cmove, and this later pass
>>>>> would handle jcc and cmove instructions. Until then... yes your
>>>>> proposed approach is one of the ways to avoid unwanted if-conversion,
>>>>> although sometimes we would like to split to cmove instead.
>>>>
>>>> So the following makes STV also consider SImode chains, re-using the
>>>> DImode chain code.  I've kept a simple incomplete smaxsi3 pattern
>>>> and also did not alter the {SI,DI}mode chain cost function - it's
>>>> quite off for TARGET_64BIT.  With this I get the expected conversion
>>>> for the testcase derived from hmmer.
>>>>
>>>> No further testing sofar.
>>>>
>>>> Is it OK to re-use the DImode chain code this way?  I'll clean things
>>>> up some more of course.
>>>
>>> Yes, the approach looks OK to me. It makes chain building mode
>>> agnostic, and the chain building can be used for
>>> a) DImode x86_32 (as is now), but maybe 64bit minmax operation can be added.
>>> b) SImode x86_32 and x86_64 (this will be mainly used for SImode
>>> minmax and surrounding SImode operations)
>>> c) DImode x86_64 (also, mainly used for DImode minmax and surrounding
>>> DImode operations)
>>>
>>>> Still need help with the actual patterns for minmax and how the splitters
>>>> should look like.
>>>
>>> Please look at the attached patch. Maybe we can add memory_operand as
>>> operand 1 and operand 2 predicate, but let's keep things simple for
>>> now.
>>
>> Thanks.  The attached patch makes the patch cleaner and it survives
>> "some" barebone testing.  It also touches the cost function to
>> avoid being too overly trigger-happy.  I've also ended up using
>> ix86_cost->sse_op instead of COSTS_N_INSN-based magic.  In
>> particular we estimated GPR reg-reg move as COST_N_INSNS(2) while
>> move costs shouldn't be wrapped in COST_N_INSNS.
>> IMHO we should probably disregard any reg-reg moves for costing pre-RA.
>> At least with the current code every reg-reg move biases in favor of
>> SSE...
> 
> This is currently a bit mixed-up area in x86 target support. HJ is
> looking into this [1] and I hope Honza can review the patch.
Yea, Honza's input on that would be greatly appreciated.

Jeff
>