Date: Wed, 07 Sep 2016 09:11:00 -0000
From: Richard Biener
To: Jakub Jelinek
Cc: Kyrill Tkachov, Uros Bizjak, GCC Patches
Subject: Re: [PATCH][v3] GIMPLE store merging pass
In-Reply-To: <20160906162133.GL14857@tucnak.redhat.com>
References: <57CEDD67.6010801@foss.arm.com> <20160906153250.GK14857@tucnak.redhat.com>
 <57CEE7DB.8070604@foss.arm.com> <20160906162133.GL14857@tucnak.redhat.com>

On Tue, 6 Sep 2016, Jakub Jelinek wrote:

> On Tue, Sep 06, 2016 at 04:59:23PM +0100, Kyrill Tkachov wrote:
> > On 06/09/16 16:32, Jakub Jelinek wrote:
> > > On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:
> > > > The v3 of this patch addresses feedback I received on the version
> > > > posted at [1].  The merged store buffer is now represented as a
> > > > char array that we splat values onto with native_encode_expr and
> > > > native_interpret_expr.  This allows us to merge anything that
> > > > native_encode_expr accepts, including floating-point values and
> > > > short vectors, so this version extends the functionality of the
> > > > previous one in that it handles floating-point values as well.
> > > >
> > > > The first phase of the algorithm, which detects the contiguous
> > > > stores, is also slightly refactored according to feedback, to read
> > > > more fluently.
> > > >
> > > > Richi, I experimented with merging up to MOVE_MAX bytes rather
> > > > than word size, but I got worse results on aarch64.  MOVE_MAX
> > > > there is 16 (because it has load/store register-pair
> > > > instructions), but the 128-bit immediates that we ended up
> > > > synthesising were too complex.  Perhaps the TImode immediate
> > > > store RTL expansions could be improved, but for now I've left the
> > > > maximum merge size at BITS_PER_WORD.
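(For illustration, the merged-buffer scheme described above amounts to
roughly the following.  This is a hypothetical sketch only: plain memcpy
stands in for native_encode_expr / native_interpret_expr, and it ignores
endianness and non-byte-aligned bitfields, which the real pass must
handle.)

  #include <inttypes.h>
  #include <stdio.h>
  #include <string.h>

  int
  main (void)
  {
    /* The merge buffer: wide enough for the stores being combined.  */
    unsigned char buf[8] = { 0 };
    uint32_t lo = 0x12345678, hi = 0x09abcdef;

    /* "Splat" each narrow constant into the buffer at its offset.  */
    memcpy (buf, &lo, sizeof lo);
    memcpy (buf + 4, &hi, sizeof hi);

    /* Reinterpret the filled bytes as one wide value for the single
       merged store.  */
    uint64_t merged;
    memcpy (&merged, buf, sizeof merged);
    printf ("0x%016" PRIx64 "\n", merged);  /* 0x09abcdef12345678 on a
                                               little-endian host.  */
    return 0;
  }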
> At least from playing with this kind of thing in the RTL PR22141 patch,
> I remember that storing 64-bit constants on x86_64, compared to storing
> two 32-bit constants, usually isn't a win (not just for speed-optimized
> blocks but also for -Os).  For a 64-bit store, if the constant isn't
> signed 32-bit or unsigned 32-bit, you need a movabsq into some temporary
> register (which, if I remember well, has about 3 times worse latency
> than a normal store) and then store it.

> > We could restrict the maximum width of the stores generated to 32 bits
> > on x86_64.  I think this would need another parameter or target macro
> > for the target to set.  Alternatively, is it a possibility for x86 to
> > be a bit smarter in its DImode mov-immediate expansion?  For example,
> > break up the 64-bit movabsq immediate into two SImode immediates?
>
> If you want a 64-bit store, you'd need to merge the two, and that would
> be even more expensive.  It is a matter of, say:
>
>   movl $0x12345678, (%rsp)
>   movl $0x09abcdef, 4(%rsp)
>
> vs.
>
>   movabsq $0x09abcdef12345678, %rax
>   movq %rax, (%rsp)
>
> vs.
>
>   movl $0x09abcdef, %eax
>   salq $32, %rax
>   orq $0x12345678, %rax
>   movq %rax, (%rsp)

vs.

  movq $LC0, (%rsp)

?

> etc.  Guess it needs to be benchmarked on contemporary CPUs; I'm pretty
> sure the last sequence is the worst one.

I think the important part to notice is that it should be straightforward
for a target / the expander to split a large store from an immediate into
any of the above, but very hard to do the opposite.  Thus, from the
GIMPLE side, "canonicalizing" to large stores (that are eventually
supported and well-aligned) seems best to me.

> > > What alias set is used for the accesses if there are different
> > > alias sets involved in between the merged stores?
> >
> > As per https://gcc.gnu.org/ml/gcc/2016-06/msg00162.html the type used
> > in those cases would be ptr_type_node.  See the
> > get_type_for_merged_store function in the patch.
>
> Richi knows this best.  I just wonder whether, if e.g. all the stores
> go into fields of the same structure, it wouldn't be better to use that
> structure as the type when you need to punt, rather than alias set 0.

Well, yes - if the IL always accesses a common handled component, you can
use the alias set of that component.  But it's some work to do this
correctly, as you can have

  MEM[&a + 4] = 1;

which stores an 'int' to a struct with two floats.  So you do have to be
careful, also when merging overlapping stores (in which case you'll
certainly have an irregular situation).  Which is why I suggested the
"easy" approach above, which should handle a lot of cases well already.
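(To make that hazard concrete, a hypothetical C analogue of the
MEM[&a + 4] = 1 store above could look like this; the names are
illustrative, not from the patch.  The bytes written belong to a float
field even though the store itself has type 'int', so the merged access
cannot blindly take its alias set from the structure.)

  struct two_floats { float x; float y; } a;

  void
  f (void)
  {
    /* Deliberately type-punned, as GIMPLE can represent directly:
       an int-typed store overwriting the bytes of 'a.y'.  */
    *(int *) ((char *) &a + 4) = 1;
  }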
> > I'm aware of that.  The patch already has logic to avoid emitting
> > unaligned accesses for SLOW_UNALIGNED_ACCESS targets.  In addition,
> > the patch introduces the parameter PARAM_STORE_MERGING_ALLOW_UNALIGNED
> > that can be used by the user or target to forbid generation of
> > unaligned stores by the pass altogether.  Beyond that I'm not sure
> > how to behave more intelligently here.  Any ideas?
>
> Dunno, the heuristics were the main problem with my patch.  Generally,
> I'd say there is a difference between cold and hot blocks; in cold ones
> perhaps unaligned stores are more appropriate (if supported at all and
> not way too slow), while in hot ones they are less desirable.

Note that I repeatedly argue that if we can canonicalize something to
"larger" then, even if unaligned, the expander should be able to produce
optimal code again (it might not do so, of course).

Hope to look at the patch in detail soon.

Richard.

> > > And, do you have some SPEC2k and/or SPEC2k6 numbers, for e.g.
> > > x86_64/i686/arm/aarch64/powerpc64le?
> >
> > I did some benchmarking on aarch64 in the initial submission at
> > https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00942.html
> > aarch64 showed some improvement, and there were no regressions on
> > x86_64.  I'll be rerunning the numbers on aarch64/x86_64/arm as the
> > patch has expanded in scope since then (handling more bitfields,
> > floating-point constants).  I just wanted to get this version out
> > before the Cauldron for comments.
> >
> > > The RTL PR22141 changes weren't added mainly because they slowed
> > > down SPEC2k* on powerpc.
> >
> > Unfortunately I don't have access to SPEC on powerpc.  Any help with
> > testing/benchmarking there would be very much appreciated.
>
> I don't have SPEC set up on any arch; perhaps you can ask some of the
> IBM folks to benchmark it for you (that is what I did back when writing
> the patch)?
>
> BTW, do you handle bitfield stores too?  Once you don't support just
> constant stores, then not just copying but e.g. also bitwise operations
> (&, |, ^, ~) on adjacent memory locations might be useful, for
> bitfields as well as other adjacent operations; or, for bitfields, even
> other operations (e.g. addition, if there are enough bits in between to
> eat possible overflow bits: by reading, masking, adding, masking, and
> oring the other bits).
>
> Jakub

-- 
Richard Biener
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nuernberg)
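(For illustration only: a hypothetical before/after sketch of the
adjacent-bitwise-operation merging Jakub suggests above.  None of this is
from the posted patch; it assumes a little-endian target and that
'struct flags' is 4-byte aligned.)

  #include <stdint.h>
  #include <string.h>

  struct flags { uint8_t a, b, c, d; };

  /* Before: four adjacent byte-sized read-modify-write updates.  */
  void
  set_bits_narrow (struct flags *f)
  {
    f->a |= 0x01;
    f->b |= 0x02;
    f->c |= 0x04;
    f->d |= 0x08;
  }

  /* After: one 32-bit load, one OR with the merged mask, one store.
     0x08040201 is the little-endian layout of the four byte masks.  */
  void
  set_bits_wide (struct flags *f)
  {
    uint32_t v;
    memcpy (&v, f, sizeof v);
    v |= 0x08040201u;
    memcpy (f, &v, sizeof v);
  }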