Subject: Re: [PATCH][v3] GIMPLE store merging pass
From: Jeff Law
To: Richard Biener, Jakub Jelinek
Cc: Kyrill Tkachov, Uros Bizjak, GCC Patches
Date: Wed, 07 Sep 2016 12:43:00 -0000
Message-ID: <6571269d-c92c-6fa6-7878-7a456886d807@redhat.com>
References: <57CEDD67.6010801@foss.arm.com>
 <20160906153250.GK14857@tucnak.redhat.com>
 <57CEE7DB.8070604@foss.arm.com>
 <20160906162133.GL14857@tucnak.redhat.com>

On 09/07/2016 02:19 AM, Richard Biener wrote:
> On Tue, 6 Sep 2016, Jakub Jelinek wrote:
>
>> On Tue, Sep 06, 2016 at 04:59:23PM +0100, Kyrill Tkachov wrote:
>>> On 06/09/16 16:32, Jakub Jelinek wrote:
>>>> On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:
>>>>> The v3 of this patch addresses the feedback I received on the
>>>>> version posted at [1].  The merged store buffer is now represented
>>>>> as a char array that we splat values onto with native_encode_expr
>>>>> and native_interpret_expr.  This allows us to merge anything that
>>>>> native_encode_expr accepts, including floating-point values and
>>>>> short vectors, so this version extends the functionality of the
>>>>> previous one in that it handles floating-point values as well.
>>>>>
>>>>> The first phase of the algorithm, which detects the contiguous
>>>>> stores, is also slightly refactored according to feedback to read
>>>>> more fluently.
>>>>>
>>>>> Richi, I experimented with merging up to MOVE_MAX bytes rather
>>>>> than word size, but I got worse results on aarch64.  MOVE_MAX
>>>>> there is 16 (because it has load/store register-pair
>>>>> instructions), but the 128-bit immediates that we ended up
>>>>> synthesising were too complex.  Perhaps the TImode immediate store
>>>>> RTL expansions could be improved, but for now I've left the
>>>>> maximum merge size at BITS_PER_WORD.
>>>>
>>>> At least from playing with this kind of thing in the RTL PR22141
>>>> patch, I remember that storing 64-bit constants on x86_64, compared
>>>> to storing two 32-bit constants, usually isn't a win (not just for
>>>> speed-optimized blocks but also for -Os).  For a 64-bit store, if
>>>> the constant doesn't fit in a signed or unsigned 32-bit immediate,
>>>> you need a movabsq into some temporary register, which if I
>>>> remember correctly has about 3 times the latency of a normal store,
>>>> and then a store of that register.
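
For reference, the kind of input the pass targets is a run of adjacent
constant stores.  A minimal sketch (the struct and field names are made
up for illustration):

  struct s { char a; char b; short c; int d; };

  void
  init (struct s *p)
  {
    p->a = 1;   /* 1-byte store at offset 0 */
    p->b = 2;   /* 1-byte store at offset 1 */
    p->c = 3;   /* 2-byte store at offset 2 */
    p->d = 4;   /* 4-byte store at offset 4 */
  }

On a little-endian 64-bit target the pass can merge these into a single
8-byte store of the constant 0x0000000400030201, which is exactly the
situation where the cost of materializing the wide immediate, as Jakub
describes above, starts to matter.
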
>>>
>>> We could restrict the maximum width of the stores generated to
>>> 32 bits on x86_64.  I think this would need another parameter or
>>> target macro for the target to set.  Alternatively, could x86 be a
>>> bit smarter in its DImode mov-immediate expansion and, for example,
>>> break the 64-bit movabsq immediate up into two SImode immediates?
>>
>> If you want a 64-bit store, you'd need to merge the two, and that
>> would be even more expensive.  It is a matter of, say:
>>   movl $0x12345678, (%rsp)
>>   movl $0x09abcdef, 4(%rsp)
>> vs.
>>   movabsq $0x09abcdef12345678, %rax
>>   movq %rax, (%rsp)
>> vs.
>>   movl $0x09abcdef, %eax
>>   salq $32, %rax
>>   orq $0x12345678, %rax
>>   movq %rax, (%rsp)
>
> vs.
>
>   movq $LC0, (%rsp)
>
> ?
>
>> etc.  I guess it needs to be benchmarked on contemporary CPUs, but
>> I'm pretty sure the last sequence is the worst one.
>
> I think the important point is that it should be straightforward for
> a target / the expander to split a large store of an immediate into
> any of the above, but very hard to do the opposite.  Thus, from the
> GIMPLE side, "canonicalizing" to large stores (that are actually
> supported and well-aligned) seems best to me.
Agreed.

>>> I'm aware of that.  The patch already has logic to avoid emitting
>>> unaligned accesses for SLOW_UNALIGNED_ACCESS targets.  It also
>>> introduces the parameter PARAM_STORE_MERGING_ALLOW_UNALIGNED, which
>>> the user or target can use to forbid generation of unaligned stores
>>> by the pass altogether.  Beyond that I'm not sure how to behave more
>>> intelligently here.  Any ideas?
>>
>> Dunno, the heuristics were the main problem with my patch.
>> Generally, I'd say there is a difference between cold and hot blocks:
>> in cold ones unaligned stores are perhaps more appropriate (if
>> supported at all and not way too slow), while in hot ones they are
>> less desirable.
>
> Note that I have repeatedly argued that if we can canonicalize
> something to a "larger" form then, even if it is unaligned, the
> expander should be able to produce optimal code again (it might not
> do so today, of course).
And agreed.  Furthermore, it's in line with our guiding principles WRT
separation of the tree/SSA optimizers from target dependencies.

So let's push those decisions into the expanders/backend/target and
canonicalize to the larger stores.

jeff
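
P.S. To make the "split at expansion time" direction concrete, here is
the same trade-off expressed in C rather than RTL.  This is only a
sketch, assuming a little-endian target and reusing the constant from
Jakub's example; it is not how the expander itself is written:

  /* The canonical wide form the GIMPLE pass would produce.  */
  void
  store_wide (unsigned long long *p)
  {
    *p = 0x09abcdef12345678ULL;
  }

  /* What a target that considers movabsq too costly could emit
     instead: two 32-bit immediate stores covering the same 8 bytes.  */
  void
  store_split (unsigned long long *p)
  {
    unsigned int lo = 0x12345678U;
    unsigned int hi = 0x09abcdefU;
    __builtin_memcpy ((char *) p, &lo, 4);
    __builtin_memcpy ((char *) p + 4, &hi, 4);
  }

Going from store_wide to store_split is mechanical; reconstructing
store_wide from store_split after the fact is the hard direction, which
is why canonicalizing to the larger store in GIMPLE is the right call.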