From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 70492 invoked by alias); 13 Nov 2018 09:49:01 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 70249 invoked by uid 89); 13 Nov 2018 09:48:47 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-10.4 required=5.0 tests=BAYES_00,GIT_PATCH_2,GIT_PATCH_3,KAM_LAZY_DOMAIN_SECURITY,KAM_NUMSUBJECT autolearn=ham version=3.3.2 spammy= X-HELO: foss.arm.com Received: from usa-sjc-mx-foss1.foss.arm.com (HELO foss.arm.com) (217.140.101.70) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 13 Nov 2018 09:48:44 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id F0AA7A78; Tue, 13 Nov 2018 01:48:42 -0800 (PST) Received: from [10.2.207.77] (e100706-lin.cambridge.arm.com [10.2.207.77]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id D96F93F5CF; Tue, 13 Nov 2018 01:48:41 -0800 (PST) Message-ID: <5BEA9DF8.2030004@foss.arm.com> Date: Tue, 13 Nov 2018 09:49:00 -0000 From: Kyrill Tkachov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Richard Biener CC: GCC Patches , Marcus Shawcroft , Richard Earnshaw , James Greenhalgh , Richard Sandiford Subject: Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64 References: <5BE565CE.5000709@foss.arm.com> <5BE5CA6D.9060408@foss.arm.com> <5BE9C452.3080709@foss.arm.com> <5BEA9631.1070500@foss.arm.com> In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-SW-Source: 2018-11/txt/msg01058.txt.bz2 On 13/11/18 09:28, Richard Biener wrote: > On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov > wrote: >> Hi Richard, >> >> On 13/11/18 08:24, Richard Biener wrote: >>> On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov >>> wrote: >>>> On 12/11/18 14:10, Richard Biener wrote: >>>>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov >>>>> wrote: >>>>>> On 09/11/18 12:18, Richard Biener wrote: >>>>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov >>>>>>> wrote: >>>>>>>> Hi all, >>>>>>>> >>>>>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling: >>>>>>>> >>>>>>>> fully_peel_me: >>>>>>>> mov x1, 5 >>>>>>>> ptrue p1.d, all >>>>>>>> whilelo p0.d, xzr, x1 >>>>>>>> ld1d z0.d, p0/z, [x0] >>>>>>>> fadd z0.d, z0.d, z0.d >>>>>>>> st1d z0.d, p0, [x0] >>>>>>>> cntd x2 >>>>>>>> addvl x3, x0, #1 >>>>>>>> whilelo p0.d, x2, x1 >>>>>>>> beq .L1 >>>>>>>> ld1d z0.d, p0/z, [x0, #1, mul vl] >>>>>>>> fadd z0.d, z0.d, z0.d >>>>>>>> st1d z0.d, p0, [x3] >>>>>>>> cntw x2 >>>>>>>> incb x0, all, mul #2 >>>>>>>> whilelo p0.d, x2, x1 >>>>>>>> beq .L1 >>>>>>>> ld1d z0.d, p0/z, [x0] >>>>>>>> fadd z0.d, z0.d, z0.d >>>>>>>> st1d z0.d, p0, [x0] >>>>>>>> .L1: >>>>>>>> ret >>>>>>>> >>>>>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count. >>>>>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size >>>>>>>> and hurts icache performance. >>>>>>>> >>>>>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration >>>>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some >>>>>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate >>>>>>>> the branches. >>>>>>>> >>>>>>>> So for the above testcase we generate now: >>>>>>>> fully_peel_me: >>>>>>>> mov x2, 5 >>>>>>>> mov x3, x2 >>>>>>>> mov x1, 0 >>>>>>>> whilelo p0.d, xzr, x2 >>>>>>>> ptrue p1.d, all >>>>>>>> .L2: >>>>>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3] >>>>>>>> fadd z0.d, z0.d, z0.d >>>>>>>> st1d z0.d, p0, [x0, x1, lsl 3] >>>>>>>> incd x1 >>>>>>>> whilelo p0.d, x1, x3 >>>>>>>> bne .L2 >>>>>>>> ret >>>>>>>> >>>>>>>> Not perfect still, but it's preferable to the original code. >>>>>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged >>>>>>>> (until other target people experiment with it and set it, if appropriate). >>>>>>>> >>>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu. >>>>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance. >>>>>>>> >>>>>>>> Ok for trunk? >>>>>>> Hum. Why introduce a new --param and not simply key on >>>>>>> flag_peel_loops instead? That is >>>>>>> enabled by default at -O3 and with FDO but you of course can control >>>>>>> that in your targets >>>>>>> post-option-processing hook. >>>>>> You mean like this? >>>>>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :) >>>>>> But I suppose it's a reasonable change. >>>>> No, that change is backward. What I said is that peeling is already >>>>> conditional on >>>>> flag_peel_loops and that is enabled by -O3. So you want to disable >>>>> flag_peel_loops for >>>>> SVE instead in the target. >>>> Sorry, I got confused by the similarly named functions. >>>> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass >>>> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path. >>> Well, peeling gets disabled. From your patch I see you want to >>> disable "unrolling" when >>> the number of loop iteration is not constant. That is called peeling >>> where we need to >>> emit the loop exit test N times. >>> >>> Did you check your testcases with -fno-peel-loops? >> -fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely) >> can be called through two paths, only one of which is gated on flag_peel_loops. > I don't see the obvious here so I have to either sit down with a > non-SVE specific testcase > showing this, or I am misunderstanding the actual transform that you > want to avoid. > allow_peel is false when called from canonicalize_induction_variables. > There's the slight > chance that UL_NO_GROWTH lets through cases - is your case one of > that? That is, > does the following help? > > Index: gcc/tree-ssa-loop-ivcanon.c > =================================================================== > --- gcc/tree-ssa-loop-ivcanon.c (revision 266056) > +++ gcc/tree-ssa-loop-ivcanon.c (working copy) > @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop > exit = NULL; > > /* See if we can improve our estimate by using recorded loop bounds. */ > - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH) > + if ((allow_peel || maxiter == 0) > && maxiter >= 0 > && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll)) > { > > IIRC I allowed that case when adding allow_peel simply because it avoided some > testsuite regressions. This means you eventually want to work on the > size estimate > of SVE style loops? This doesn't help. Sorry if we're talking over each other here, I'm not very familiar with this area :( For this loop: for (int i = 0; i < 5; i++) x[i] = x[i] * 2; For normal vectorisation (e.g. AArch64 NEON) we know the exact number of executions of the loop latch. This gets fully unrolled as: ldp q2, q1, [x0] ldr d0, [x0, 32] fadd v2.2d, v2.2d, v2.2d fadd v1.2d, v1.2d, v1.2d fadd d0, d0, d0 stp q2, q1, [x0] str d0, [x0, 32] For vector length-agnostic SVE vectorisation we don't as we don't know the number of elements we process with each loop iteration. So the NEON unrolling becomes SVE peeling I guess. Note that the number of iterations in SVE is still "constant", just not known at compile-time. In this case peeling doesn't eliminate any branches and only serves to bloat code size. Kyrill > Richard. > >> Thanks, >> Kyrill >> >> >>>> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops. >>>> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable? >>>> >>>> Thanks, >>>> Kyrill >>>> >>>>>>> It might also make sense to have more fine-grained control for this >>>>>>> and allow a target >>>>>>> to say whether it wants to peel a specific loop or not when the >>>>>>> middle-end thinks that >>>>>>> would be profitable. >>>>>> Can be worth looking at as a follow-up. Do you envisage the target analysing >>>>>> the gimple statements of the loop to figure out its cost? >>>>> Kind-of. Sth like >>>>> >>>>> bool targetm.peel_loop (struct loop *); >>>>> >>>>> I have no idea whether you can easily detect a SVE vectorized loop though. >>>>> Maybe there's always a special IV or so (the mask?) >>>>> >>>>> Richard. >>>>> >>>>>> Thanks, >>>>>> Kyrill >>>>>> >>>>>> >>>>>> 2018-11-09 Kyrylo Tkachov >>>>>> >>>>>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll >>>>>> loop when number of iterations is not known and flag_peel_loops is in >>>>>> effect. >>>>>> >>>>>> 2018-11-09 Kyrylo Tkachov >>>>>> >>>>>> * gcc.target/aarch64/sve/unroll-1.c: New test. >>>>>>