From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-441825-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 104614 invoked by alias); 17 Nov 2016 14:55:27 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 103776 invoked by uid 89); 17 Nov 2016 14:55:26 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-3.8 required=5.0 tests=BAYES_00,KAM_LAZY_DOMAIN_SECURITY,RP_MATCHES_RCVD autolearn=ham version=3.3.2 spammy=Hx-languages-length:3171, wrapping
X-HELO: foss.arm.com
Received: from foss.arm.com (HELO foss.arm.com) (217.140.101.70) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 17 Nov 2016 14:55:25 +0000
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249])	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 3A6E215AD;	Thu, 17 Nov 2016 06:55:23 -0800 (PST)
Received: from [10.2.207.77] (e100706-lin.cambridge.arm.com [10.2.207.77])	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 379433F24D;	Thu, 17 Nov 2016 06:55:22 -0800 (PST)
Message-ID: <582DC4D8.9060701@foss.arm.com>
Date: Thu, 17 Nov 2016 14:55:00 -0000
From: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0
MIME-Version: 1.0
To: Segher Boessenkool <segher@kernel.crashing.org>
CC: Andrew Pinski <pinskia@gmail.com>,  GCC Patches <gcc-patches@gcc.gnu.org>, Marcus Shawcroft <marcus.shawcroft@arm.com>,  Richard Earnshaw <Richard.Earnshaw@arm.com>, James Greenhalgh <james.greenhalgh@arm.com>
Subject: Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
References: <5824836B.5030302@foss.arm.com> <CA+=Sn1mZgmeU0VVfTi7HqWOWrZtoN=_43BJqgR9iTfs0Eq34zw@mail.gmail.com> <20161110233943.GC17570@gate.crashing.org> <58259AD6.4040203@foss.arm.com> <5825E454.6070302@foss.arm.com> <5829C958.2010601@foss.arm.com> <582DBD10.8010200@foss.arm.com> <20161117144424.GA3732@gate.crashing.org>
In-Reply-To: <20161117144424.GA3732@gate.crashing.org>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-SW-Source: 2016-11/txt/msg01811.txt.bz2


On 17/11/16 14:44, Segher Boessenkool wrote:
> Hi Kyrill,
>
> On Thu, Nov 17, 2016 at 02:22:08PM +0000, Kyrill Tkachov wrote:
>>>>>>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there
>>>>>>>> were
>>>>>>>> some interesting swings.
>>>>>>>> 458.sjeng     +1.45%
>>>>>>>> 471.omnetpp   +2.19%
>>>>>>>> 445.gobmk     -2.01%
>>>>>>>>
>>>>>>>> On SPECFP:
>>>>>>>> 453.povray    +7.00%
>> After looking at the gobmk performance with performance counters it looks
>> like more icache pressure.
>> I see an increase in misses.
>> This looks to me like an effect of code size increase, though it is not
>> that large an increase (0.4% with SWS).
> Right.  I don't see how to improve on this (but see below); ideas welcome :-)
>
>> Branch mispredicts also go up a bit but not as much as icache misses.
> I don't see that happening -- for some testcases we get unlucky and have
> more branch predictor aliasing, and for some we have less, it's pretty
> random.  Some testcases are really sensitive to this.

Right, I don't think branch prediction is at fault in this case,
but rather the icache pressure mentioned above.

>
>> I don't think there's anything we can do here, or at least that this patch
>> can do about it.
>> Overall, there's a slight improvement in SPECINT, even with the gobmk
>> regression and a slightly larger improvement
>> on SPECFP due to povray.
> And that is for only the "normal" GPRs, not LR or FP yet, right?

This patch does implement wrapping of FP registers as well, but not LR.
Though I remember seeing the improvement even when only GPRs were wrapped
in an earlier version of the patch.

>> Segher, one curious artifact I spotted while looking at codegen differences
>> in gobmk was a case where we fail
>> to emit load-pairs as effectively in the epilogue and its preceding basic
>> block.
>> So before we had this epilogue:
>> .L43:
>>      ldp    x21, x22, [sp, 16]
>>      ldp    x23, x24, [sp, 32]
>>      ldp    x25, x26, [sp, 48]
>>      ldp    x27, x28, [sp, 64]
>>      ldr    x30, [sp, 80]
>>      ldp    x19, x20, [sp], 112
>>      ret
>>
>> and I see this becoming (among numerous other changes in the function):
>>
>> .L69:
>>      ldp    x21, x22, [sp, 16]
>>      ldr    x24, [sp, 40]
>> .L43:
>>      ldp    x25, x26, [sp, 48]
>>      ldp    x27, x28, [sp, 64]
>>      ldr    x23, [sp, 32]
>>      ldr    x30, [sp, 80]
>>      ldp    x19, x20, [sp], 112
>>      ret
>>
>> So this is better in the cases where we jump straight into .L43 because we
>> load fewer registers
>> but worse when we jump to or fallthrough to .L69 because x23 and x24 are
>> now restored using two loads
>> rather than a single load-pair. This hunk isn't critical to performance in
>> gobmk though.
> Is loading/storing a pair as cheap as loading/storing a single register?
> In that case you could shrink-wrap per pair of registers instead.

I suppose it can vary by microarchitecture. For the purposes of codegen I'd say
a load/store pair is more expensive than loading/storing a single register
(more memory bandwidth is required, after all) but cheaper than two separate
loads/stores (alignment quirks notwithstanding).
Interesting idea. That could help with code size too. I'll try it out.
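
To make the per-pair idea concrete, here is a rough sketch of the rounding
step. It is not code from the patch: the helper name and the bit layout are
made up for illustration. I assume bit i of the mask stands for callee-saved
register x(19 + i), so that bits 2k and 2k+1 are the two halves of one
ldp/stp pair, matching the epilogue above (x19/x20, x21/x22, ...).

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only (not part of the patch).
   Bit i of MASK stands for callee-saved register x(19 + i), so bits 2k
   and 2k+1 are the two registers of one ldp/stp pair.  If either
   register of a pair needs saving, mark both, so that the pair is
   always wrapped as a single component.  */
static uint32_t
widen_to_pairs (uint32_t mask)
{
  const uint32_t even_bits = 0x55555555u;

  /* Fold each odd bit down onto its even partner, keep one bit per pair.  */
  uint32_t per_pair = (mask | (mask >> 1)) & even_bits;

  /* Expand each surviving bit back out to cover both registers.  */
  return per_pair | (per_pair << 1);
}
```

With that layout, a block that only needs x24 (bit 5) would end up
saving/restoring x23 and x24 together, which is the .L69 case above where we
currently get two separate ldr instructions instead of one ldp.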
Thanks,
Kyrill

>
> Segher