From: Kyrill Tkachov
To: Jakub Jelinek
CC: GCC Patches, Richard Biener
Subject: Re: [PATCH][v3] GIMPLE store merging pass
Date: Tue, 06 Sep 2016 16:21:00 -0000
Message-ID: <57CEE7DB.8070604@foss.arm.com>
References: <57CEDD67.6010801@foss.arm.com> <20160906153250.GK14857@tucnak.redhat.com>
In-Reply-To: <20160906153250.GK14857@tucnak.redhat.com>

Hi Jakub,

On 06/09/16 16:32, Jakub Jelinek wrote:
> On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:
>> The v3 of this patch addresses feedback I received on the version posted at [1].
>> The merged store buffer is now represented as a char array that we splat values onto with
>> native_encode_expr and native_interpret_expr. This allows us to merge anything that
>> native_encode_expr accepts, including floating-point values and short vectors. So this version
>> extends the functionality of the previous one in that it handles floating-point values as well.
>>
>> The first phase of the algorithm, which detects the contiguous stores, is also slightly
>> refactored according to feedback so that it reads more fluently.
>>
>> Richi, I experimented with merging up to MOVE_MAX bytes rather than word size but I got worse
>> results on aarch64. MOVE_MAX there is 16 (because it has load/store register pair instructions)
>> but the 128-bit immediates that we ended up synthesising were too complex. Perhaps the TImode
>> immediate store RTL expansions could be improved, but for now I've left the maximum merge size
>> at BITS_PER_WORD.
> At least from playing with this kind of things in the RTL PR22141 patch,
> I remember storing 64-bit constants on x86_64 compared to storing 2 32-bit
> constants usually isn't a win (not just for speed optimized blocks but also for
> -Os). For 64-bit store if the constant isn't signed 32-bit or unsigned
> 32-bit you need movabsq into some temporary register which has like 3 times worse
> latency than normal store if I remember well, and then store it.

We could restrict the maximum width of the stores generated to 32 bits on x86_64.
I think this would need another parameter or target macro for the target to set.
Alternatively, is it a possibility for x86 to be a bit smarter in its DImode
mov-immediate expansion? For example, break up the 64-bit movabsq immediate into
two SImode immediates?
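To make the tradeoff concrete, here is a toy example (not taken from the patch,
and the assembly is only the rough shape I'd expect rather than actual compiler
output):

/* Two adjacent constant stores that the pass could merge into a
   single 64-bit store.  */
struct S { int a; int b; };

void
foo (struct S *s)
{
  /* Unmerged, these are two cheap SImode immediate stores on x86_64:
       movl $0x12345678, (%rdi)
       movl $0x0badbeef, 4(%rdi)
     Merged into one DImode store, the combined constant
     0x0badbeef12345678 no longer fits a sign-extended 32-bit
     immediate, so we would need something like:
       movabsq $0x0badbeef12345678, %rax
       movq %rax, (%rdi)
     which is exactly the latency concern above.  */
  s->a = 0x12345678;
  s->b = 0x0badbeef;
}

Capping the merged width at 32 bits on x86_64 would keep the cheaper form in
cases like this.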
> If it can
> be CSEd and the same constant used multiple times in adjacent code perhaps.

Perhaps. From glancing at the code generated for SPEC2006 on aarch64 I didn't
spot too many opportunities for that, though.

> Various other targets have different costs for different constants,
> so it would be nice if the pass considered that (computed RTX costs of those
> constants and used that in some heuristics).

Could do. That could help avoid creating immediates that are too expensive.

> What alias set is used for the accesses if there are different alias sets
> involved in between the merged stores?

As per https://gcc.gnu.org/ml/gcc/2016-06/msg00162.html the type used in those
cases would be ptr_type_node. See the get_type_for_merged_store function in the
patch.

> Also alignment can matter, even on non-strict alignment targets (speed vs.
> -Os for that).

I'm aware of that. The patch already has logic to avoid emitting unaligned
accesses on SLOW_UNALIGNED_ACCESS targets. On top of that, it introduces the
parameter PARAM_STORE_MERGING_ALLOW_UNALIGNED that the user or a target can use
to forbid generation of unaligned stores by the pass altogether. Beyond that I'm
not sure how to behave more intelligently here. Any ideas?

> And, do you have some SPEC2k and/or SPEC2k6 numbers, for
> e.g. x86_64/i686/arm/aarch64/powerpc64le?

I did some benchmarking on aarch64 in the initial submission at
https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00942.html
aarch64 showed some improvement, and there were no regressions on x86_64.
I'll be rerunning the numbers on aarch64/x86_64/arm as the patch has expanded in
scope since then (handling more bitfields and floating-point constants). I just
wanted to get this version out before the Cauldron for comments.

> The RTL PR22141 changes weren't added mainly because it slowed down SPEC2k*
> on powerpc.

Unfortunately I don't have access to SPEC on powerpc. Any help with
testing/benchmarking there would be very much appreciated.

> Also, do you only handle constants or also the case where there is partial
> or complete copying from some other memory, where it could be turned into
> larger chunk loads + stores or __builtin_memcpy?

At the moment it handles just constants. I hope to extend it in the future to
perform more tricks involving contiguous stores.

>> I've disabled the pass for PDP-endian targets as the merging code proved to be quite fiddly to
>> get right for different endiannesses and I didn't feel comfortable writing logic for
>> BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN targets without serious testing capabilities. I hope that's
>> ok (I note the bswap pass also doesn't try to do anything on such targets).
> I think that is fine, it isn't the only pass that punts in this case.

Thanks for the comments and ideas,
Kyrill

> Jakub
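P.S. Since the byte-buffer representation quoted at the top may be easier to see
in code, here is a rough sketch (illustrative only; simplified, and not code from
the patch) of splatting two adjacent 4-byte constants onto a buffer with
native_encode_expr and reading the image back as one wide constant with
native_interpret_expr:

/* Illustrative sketch only, not from the patch.  Build the byte image
   of a merged 8-byte store from two adjacent 4-byte constants and
   reinterpret it as a single wide constant.  */
#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "tree.h"
#include "fold-const.h"

static tree
merge_two_si_constants (tree first, tree second, tree wide_type)
{
  unsigned char buf[8] = { 0 };

  /* Encode each constant at its byte offset within the merged store,
     so the buffer mirrors the target memory image.  */
  if (native_encode_expr (first, buf, 4) != 4
      || native_encode_expr (second, buf + 4, 4) != 4)
    return NULL_TREE;

  /* Read the 8-byte image back as one constant of WIDE_TYPE.  */
  return native_interpret_expr (wide_type, buf, 8);
}

The real code of course deals with arbitrary widths and offsets, but the
encode/interpret pair is what lets the buffer handle integer, floating-point
and short vector values uniformly.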