From: Evandro Menezes
To: "Richard Earnshaw (lists)", James Greenhalgh
Cc: 'gcc-patches', 'Marcus Shawcroft', 'Kyrill Tkachov', Andrew Pinski,
 ramana.radhakrishnan@arm.com, richard.guenther@gmail.com
Subject: Re: [PATCH 2/4][AArch64] Increase the loop peeling limit
Date: Fri, 08 Jan 2016 22:55:00 -0000

On 12/16/2015 02:11 PM, Evandro Menezes wrote:
> On 12/16/2015 05:24 AM, Richard Earnshaw (lists) wrote:
>> On 15/12/15 23:34, Evandro Menezes wrote:
>>> On 12/14/2015 05:26 AM, James Greenhalgh wrote:
>>>> On Thu, Dec 03, 2015 at 03:07:43PM -0600, Evandro Menezes wrote:
>>>>> On 11/20/2015 05:53 AM, James Greenhalgh wrote:
>>>>>> On Thu, Nov 19, 2015 at 04:04:41PM -0600, Evandro Menezes wrote:
>>>>>>> On 11/05/2015 02:51 PM, Evandro Menezes wrote:
>>>>>>>> 2015-11-05  Evandro Menezes
>>>>>>>>
>>>>>>>> gcc/
>>>>>>>>
>>>>>>>>    * config/aarch64/aarch64.c (aarch64_override_options_internal):
>>>>>>>>    Increase loop peeling limit.
>>>>>>>>
>>>>>>>> This patch increases the limit for the number of peeled insns.
>>>>>>>> With this change, I noticed no major regression in either
>>>>>>>> Geekbench v3 or SPEC CPU2000, while some benchmarks, typically FP
>>>>>>>> ones, improved significantly.
>>>>>>>>
>>>>>>>> I tested this tuning on Exynos M1 and on A57.  ThunderX seems to
>>>>>>>> benefit from this tuning too.  However, I'd appreciate comments
>>>>>>>> from other stakeholders.
>>>>>>>
>>>>>>> Ping.
>>>>>>
>>>>>> I'd like to leave this for a call from the port maintainers.  I can
>>>>>> see why this leads to more opportunities for vectorization, but I'm
>>>>>> concerned about the wider impact on code size.  Certainly I wouldn't
>>>>>> expect this to be our default at -O2 and below.
>>>>>>
>>>>>> My gut feeling is that this doesn't really belong in the back end
>>>>>> (there are presumably good reasons why the default for this
>>>>>> parameter across GCC has fluctuated from 400 to 100 to 200 over
>>>>>> recent years), but as I say, I'd like Marcus or Richard to make the
>>>>>> call as to whether or not we take this patch.
>>>>>
>>>>> Please correct me if I'm wrong, but loop peeling is enabled only
>>>>> with loop unrolling (and with PGO).  If so, then extra code size is
>>>>> not a concern, for this heuristic is only active when unrolling
>>>>> loops, when code size is already of secondary importance.
>>>>
>>>> My understanding was that loop peeling is enabled from -O2 upwards,
>>>> and is also used to partially peel unaligned loops for vectorization
>>>> (allowing the vector code to be well aligned), or to completely peel
>>>> inner loops which may then become amenable to SLP vectorization.
>>>>
>>>> If I'm wrong then I take back these objections.  But I was sure this
>>>> parameter was used in a number of situations outside of just
>>>> -funroll-loops/-funroll-all-loops.  Certainly I remember seeing
>>>> performance sensitivities to this parameter at -O3 in some internal
>>>> workloads I was analysing.
>>>
>>> Vectorization, including SLP, is only enabled at -O3, isn't it?  It
>>> seems to me that peeling is only used by optimizations which already
>>> lead to a potential increase in code size.
>>>
>>> For instance, with "-Ofast -funroll-all-loops", the total text size
>>> for the SPEC CPU2000 suite is 26.9MB with this proposed change and
>>> 26.8MB without it; with just "-O2", it is the same at 23.1MB
>>> regardless of this setting.
>>>
>>> So it seems to me that this proposal should be neutral for up to -O2.
>>>
>>> Thank you,
>>>
>> My preference would be to not diverge from the global parameter
>> settings.  I haven't looked in detail at this parameter, but it seems
>> to me there are two possible paths:
>>
>> 1) We could get agreement globally that the parameter should be
>> increased.
>> 2) We could agree that this specific use of the parameter is distinct
>> from some other uses and deserves a new param in its own right with a
>> higher value.
>>
>
> Here's what I have observed, not only on AArch64: architectures
> benefit differently from certain loop optimizations, especially those
> dealing with vectorization, be it because some have plenty of
> registers for more aggressive loop unrolling or because some have
> lower costs to vectorize.  With this, I'm trying to say that there may
> be a case for adjusting this parameter to better suit loop
> optimizations on specific targets.  While it is not the only parameter
> related to loop optimizations, it seems to be the one with the desired
> effects, as exemplified by PPC, S390 and x86 (AOSP).  Though there is
> the possibility that these are actually side effects, as Richard
> Biener perhaps implied in another reply.
>

Gents,

Any new thoughts on this proposal?

Thank you,

-- 
Evandro Menezes
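
P.S.: For readers without the original patch at hand, the change under
discussion is a back-end override of the loop peeling limit in
aarch64_override_options_internal.  Below is a minimal sketch of what
such an override could look like; it assumes the knob involved is the
max-completely-peeled-insns param (PARAM_MAX_COMPLETELY_PEELED_INSNS)
and uses an illustrative value, so it is not necessarily the exact hunk
from the patch:

   /* Somewhere in aarch64_override_options_internal (struct gcc_options
      *opts) in gcc/config/aarch64/aarch64.c: raise the limit on the
      number of insns a loop may contain and still be completely peeled.
      The value 400 is purely illustrative, not the one from the patch.  */
   maybe_set_param_value (PARAM_MAX_COMPLETELY_PEELED_INSNS, 400,
                          opts->x_param_values,
                          global_options_set.x_param_values);

Since maybe_set_param_value only changes the default, users can still
override it explicitly with --param max-completely-peeled-insns=N.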