Subject: Re: [PATCH][v3] GIMPLE store merging pass
From: Jeff Law
To: Richard Biener, Jakub Jelinek
Cc: Kyrill Tkachov, Uros Bizjak, GCC Patches
Date: Wed, 07 Sep 2016 12:43:00 -0000
Message-ID: <6571269d-c92c-6fa6-7878-7a456886d807@redhat.com>
References: <57CEDD67.6010801@foss.arm.com>
 <20160906153250.GK14857@tucnak.redhat.com>
 <57CEE7DB.8070604@foss.arm.com>
 <20160906162133.GL14857@tucnak.redhat.com>

On 09/07/2016 02:19 AM, Richard Biener wrote:
> On Tue, 6 Sep 2016, Jakub Jelinek wrote:
>
>> On Tue, Sep 06, 2016 at 04:59:23PM +0100, Kyrill Tkachov wrote:
>>> On 06/09/16 16:32, Jakub Jelinek wrote:
>>>> On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:
>>>>> The v3 of this patch addresses the feedback I received on the
>>>>> version posted at [1].  The merged store buffer is now represented
>>>>> as a char array that we splat values onto with native_encode_expr
>>>>> and native_interpret_expr.  This allows us to merge anything that
>>>>> native_encode_expr accepts, including floating-point values and
>>>>> short vectors, so this version extends the functionality of the
>>>>> previous one in that it handles floating-point values as well.
>>>>>
>>>>> The first phase of the algorithm, which detects the contiguous
>>>>> stores, is also slightly refactored according to feedback to read
>>>>> more fluently.
>>>>>
>>>>> Richi, I experimented with merging up to MOVE_MAX bytes rather
>>>>> than word size, but I got worse results on aarch64.  MOVE_MAX
>>>>> there is 16 (because it has load/store register-pair
>>>>> instructions), but the 128-bit immediates that we ended up
>>>>> synthesising were too complex.  Perhaps the TImode immediate store
>>>>> RTL expansions could be improved, but for now I've left the
>>>>> maximum merge size at BITS_PER_WORD.
>>>>
>>>> At least from playing with this kind of thing in the RTL PR22141
>>>> patch, I remember that storing 64-bit constants on x86_64, compared
>>>> to storing two 32-bit constants, usually isn't a win (not just for
>>>> speed-optimized blocks but also for -Os).  For a 64-bit store, if
>>>> the constant doesn't fit in a signed or unsigned 32-bit immediate,
>>>> you need a movabsq into some temporary register, which if I
>>>> remember correctly has about 3 times the latency of a normal store,
>>>> and then a store of that register.
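
For reference, the kind of input the pass targets is a run of adjacent
constant stores.  A minimal sketch (the struct and field names are made
up for illustration):

  struct s { char a; char b; short c; int d; };

  void
  init (struct s *p)
  {
    p->a = 1;   /* 1-byte store at offset 0 */
    p->b = 2;   /* 1-byte store at offset 1 */
    p->c = 3;   /* 2-byte store at offset 2 */
    p->d = 4;   /* 4-byte store at offset 4 */
  }

On a little-endian 64-bit target the pass can merge these into a single
8-byte store of the constant 0x0000000400030201, which is exactly the
situation where the cost of materializing the wide immediate, as Jakub
describes above, starts to matter.
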
>>>
>>> We could restrict the maximum width of the stores generated to
>>> 32 bits on x86_64.  I think this would need another parameter or
>>> target macro for the target to set.  Alternatively, could x86 be a
>>> bit smarter in its DImode mov-immediate expansion and, for example,
>>> break the 64-bit movabsq immediate up into two SImode immediates?
>>
>> If you want a 64-bit store, you'd need to merge the two, and that
>> would be even more expensive.  It is a matter of, say:
>>   movl $0x12345678, (%rsp)
>>   movl $0x09abcdef, 4(%rsp)
>> vs.
>>   movabsq $0x09abcdef12345678, %rax
>>   movq %rax, (%rsp)
>> vs.
>>   movl $0x09abcdef, %eax
>>   salq $32, %rax
>>   orq $0x12345678, %rax
>>   movq %rax, (%rsp)
>
> vs.
>
>   movq $LC0, (%rsp)
>
> ?
>
>> etc.  I guess it needs to be benchmarked on contemporary CPUs, but
>> I'm pretty sure the last sequence is the worst one.
>
> I think the important point is that it should be straightforward for
> a target / the expander to split a large store of an immediate into
> any of the above, but very hard to do the opposite.  Thus, from the
> GIMPLE side, "canonicalizing" to large stores (that are actually
> supported and well-aligned) seems best to me.
Agreed.

>>> I'm aware of that.  The patch already has logic to avoid emitting
>>> unaligned accesses for SLOW_UNALIGNED_ACCESS targets.  It also
>>> introduces the parameter PARAM_STORE_MERGING_ALLOW_UNALIGNED, which
>>> the user or target can use to forbid generation of unaligned stores
>>> by the pass altogether.  Beyond that I'm not sure how to behave more
>>> intelligently here.  Any ideas?
>>
>> Dunno, the heuristics were the main problem with my patch.
>> Generally, I'd say there is a difference between cold and hot blocks:
>> in cold ones unaligned stores are perhaps more appropriate (if
>> supported at all and not way too slow), while in hot ones they are
>> less desirable.
>
> Note that I have repeatedly argued that if we can canonicalize
> something to a "larger" form then, even if it is unaligned, the
> expander should be able to produce optimal code again (it might not
> do so today, of course).
And agreed.  Furthermore, it's in line with our guiding principles WRT
separation of the tree/SSA optimizers from target dependencies.

So let's push those decisions into the expanders/backend/target and
canonicalize to the larger stores.

jeff
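
P.S. To make the "split at expansion time" direction concrete, here is
the same trade-off expressed in C rather than RTL.  This is only a
sketch, assuming a little-endian target and reusing the constant from
Jakub's example; it is not how the expander itself is written:

  /* The canonical wide form the GIMPLE pass would produce.  */
  void
  store_wide (unsigned long long *p)
  {
    *p = 0x09abcdef12345678ULL;
  }

  /* What a target that considers movabsq too costly could emit
     instead: two 32-bit immediate stores covering the same 8 bytes.  */
  void
  store_split (unsigned long long *p)
  {
    unsigned int lo = 0x12345678U;
    unsigned int hi = 0x09abcdefU;
    __builtin_memcpy ((char *) p, &lo, 4);
    __builtin_memcpy ((char *) p + 4, &hi, 4);
  }

Going from store_wide to store_split is mechanical; reconstructing
store_wide from store_split after the fact is the hard direction, which
is why canonicalizing to the larger store in GIMPLE is the right call.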