From: Evandro Menezes
To: "Richard Earnshaw (lists)", James Greenhalgh
Cc: 'gcc-patches', 'Marcus Shawcroft', 'Kyrill Tkachov', Andrew Pinski,
 ramana.radhakrishnan@arm.com, richard.guenther@gmail.com
Subject: Re: [PATCH 2/4][AArch64] Increase the loop peeling limit
Date: Fri, 08 Jan 2016 22:55:00 -0000

On 12/16/2015 02:11 PM, Evandro Menezes wrote:
> On 12/16/2015 05:24 AM, Richard Earnshaw (lists) wrote:
>> On 15/12/15 23:34, Evandro Menezes wrote:
>>> On 12/14/2015 05:26 AM, James Greenhalgh wrote:
>>>> On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
>>>>> On 11/20/2015 05:53 AM, James Greenhalgh wrote:
>>>>>> On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
>>>>>>> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>>>>>>>> 2015-11-05  Evandro Menezes
>>>>>>>>
>>>>>>>> gcc/
>>>>>>>>
>>>>>>>>    * config/aarch64/aarch64.c (aarch64_override_options_internal):
>>>>>>>>    Increase loop peeling limit.
>>>>>>>>
>>>>>>>> This patch increases the limit for the number of peeled insns.
>>>>>>>> With this change, I noticed no major regression in either
>>>>>>>> Geekbench v3 or SPEC CPU2000, while some benchmarks, typically FP
>>>>>>>> ones, improved significantly.
>>>>>>>>
>>>>>>>> I tested this tuning on Exynos M1 and on A57.  ThunderX seems to
>>>>>>>> benefit from this tuning too.  However, I'd appreciate comments
>>>>>>>> from other stakeholders.
>>>>>>>
>>>>>>> Ping.
>>>>>>
>>>>>> I'd like to leave this for a call from the port maintainers.  I can
>>>>>> see why this leads to more opportunities for vectorization, but I'm
>>>>>> concerned about the wider impact on code size.  Certainly I wouldn't
>>>>>> expect this to be our default at -O2 and below.
>>>>>>
>>>>>> My gut feeling is that this doesn't really belong in the back end
>>>>>> (there are presumably good reasons why the default for this
>>>>>> parameter across GCC has fluctuated from 400 to 100 to 200 over
>>>>>> recent years), but as I say, I'd like Marcus or Richard to make the
>>>>>> call as to whether or not we take this patch.
>>>>>
>>>>> Please correct me if I'm wrong, but loop peeling is enabled only
>>>>> with loop unrolling (and with PGO).  If so, then extra code size is
>>>>> not a concern, for this heuristic is only active when unrolling
>>>>> loops, when code size is already of secondary importance.
>>>>
>>>> My understanding was that loop peeling is enabled from -O2 upwards,
>>>> and is also used to partially peel unaligned loops for vectorization
>>>> (allowing the vector code to be well aligned), or to completely peel
>>>> inner loops which may then become amenable to SLP vectorization.
>>>>
>>>> If I'm wrong then I take back these objections.  But I was sure this
>>>> parameter was used in a number of situations outside of just
>>>> -funroll-loops/-funroll-all-loops.  Certainly I remember seeing
>>>> performance sensitivities to this parameter at -O3 in some internal
>>>> workloads I was analysing.
>>>
>>> Vectorization, including SLP, is only enabled at -O3, isn't it?  It
>>> seems to me that peeling is only used by optimizations which already
>>> lead to a potential increase in code size.
>>>
>>> For instance, with "-Ofast -funroll-all-loops", the total text size
>>> for the SPEC CPU2000 suite is 26.9MB with this proposed change and
>>> 26.8MB without it; with just "-O2", it is the same at 23.1MB
>>> regardless of this setting.
>>>
>>> So it seems to me that this proposal should be neutral for up to -O2.
>>>
>>> Thank you,
>>>
>> My preference would be to not diverge from the global parameter
>> settings.  I haven't looked in detail at this parameter, but it seems
>> to me there are two possible paths:
>>
>> 1) We could get agreement globally that the parameter should be
>> increased.
>> 2) We could agree that this specific use of the parameter is distinct
>> from some other uses and deserves a new param in its own right with a
>> higher value.
>>
>
> Here's what I have observed, not only on AArch64: architectures
> benefit differently from certain loop optimizations, especially those
> dealing with vectorization, be it because some have plenty of
> registers for more aggressive loop unrolling or because some have
> lower costs to vectorize.  With this, I'm trying to say that there may
> be a case for adjusting this parameter to better suit loop
> optimizations on specific targets.  While it is not the only parameter
> related to loop optimizations, it seems to be the one with the desired
> effects, as exemplified by PPC, S390 and x86 (AOSP).  Though there is
> the possibility that these are actually side effects, as Richard
> Biener perhaps implied in another reply.
>

Gents,

Any new thoughts on this proposal?

Thank you,

-- 
Evandro Menezes
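
P.S.: For readers without the original patch at hand, the change under
discussion is a back-end override of the loop peeling limit in
aarch64_override_options_internal.  Below is a minimal sketch of what
such an override could look like; it assumes the knob involved is the
max-completely-peeled-insns param (PARAM_MAX_COMPLETELY_PEELED_INSNS)
and uses an illustrative value, so it is not necessarily the exact hunk
from the patch:

   /* Somewhere in aarch64_override_options_internal (struct gcc_options
      *opts) in gcc/config/aarch64/aarch64.c: raise the limit on the
      number of insns a loop may contain and still be completely peeled.
      The value 400 is purely illustrative, not the one from the patch.  */
   maybe_set_param_value (PARAM_MAX_COMPLETELY_PEELED_INSNS, 400,
                          opts->x_param_values,
                          global_options_set.x_param_values);

Since maybe_set_param_value only changes the default, users can still
override it explicitly with --param max-completely-peeled-insns=N.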