From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 60020 invoked by alias); 13 Nov 2018 14:34:08 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 60001 invoked by uid 89); 13 Nov 2018 14:34:06 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-10.5 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,GIT_PATCH_2,GIT_PATCH_3,KAM_ASCII_DIVIDERS,KAM_NUMSUBJECT,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.2 spammy= X-HELO: mail-lf1-f44.google.com Received: from mail-lf1-f44.google.com (HELO mail-lf1-f44.google.com) (209.85.167.44) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 13 Nov 2018 14:34:04 +0000 Received: by mail-lf1-f44.google.com with SMTP id h192so8983693lfg.3 for ; Tue, 13 Nov 2018 06:34:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=ncZpEuPmmHsUe/djn9d3m+ZD/g1/dRigwE0jprhVEFY=; b=QAEEtEyIuym4rVDyI9HNCaDSFY/5Xib72ob1eJlG4fECd9iV/jLYchD78Nt7uAXgJW R1FZ8MkvTRbbWP9BjcaIfTGSPyGlKbw44qy5d5CJd/BJb9HepkhNb9QaE6cYt6hJlQwx Tyh3V6SReevfA6YBp+Ylxl8dY9WydWjHG81cq/pILcBNDaogQSjXv8bsWKuuSo39w9NA Ql1PH9P8qWLJoUgaoaTUjpTJfIbWRqLFhPsnyUQ42KMdzRHjVee7tFK6uzsOlMKAzekp FrxdB5wpkO/yxuFKA/jspwY6UTatc4ZLwVzf8b0HFpuhuhhgs86y9Rd335C0mDvAmCRR 7gEQ== MIME-Version: 1.0 References: <5BE565CE.5000709@foss.arm.com> <5BE5CA6D.9060408@foss.arm.com> <5BE9C452.3080709@foss.arm.com> <5BEA9631.1070500@foss.arm.com> <5BEA9DF8.2030004@foss.arm.com> In-Reply-To: <5BEA9DF8.2030004@foss.arm.com> From: Richard Biener Date: Tue, 13 Nov 2018 14:34:00 -0000 Message-ID: Subject: Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64 To: kyrylo.tkachov@foss.arm.com Cc: GCC Patches , Marcus Shawcroft , Richard Earnshaw , James Greenhalgh , Richard Sandiford Content-Type: text/plain; charset="UTF-8" X-IsSubscribed: yes X-SW-Source: 2018-11/txt/msg01098.txt.bz2 On Tue, Nov 13, 2018 at 10:48 AM Kyrill Tkachov wrote: > > > On 13/11/18 09:28, Richard Biener wrote: > > On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov > > wrote: > >> Hi Richard, > >> > >> On 13/11/18 08:24, Richard Biener wrote: > >>> On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov > >>> wrote: > >>>> On 12/11/18 14:10, Richard Biener wrote: > >>>>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov > >>>>> wrote: > >>>>>> On 09/11/18 12:18, Richard Biener wrote: > >>>>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov > >>>>>>> wrote: > >>>>>>>> Hi all, > >>>>>>>> > >>>>>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling: > >>>>>>>> > >>>>>>>> fully_peel_me: > >>>>>>>> mov x1, 5 > >>>>>>>> ptrue p1.d, all > >>>>>>>> whilelo p0.d, xzr, x1 > >>>>>>>> ld1d z0.d, p0/z, [x0] > >>>>>>>> fadd z0.d, z0.d, z0.d > >>>>>>>> st1d z0.d, p0, [x0] > >>>>>>>> cntd x2 > >>>>>>>> addvl x3, x0, #1 > >>>>>>>> whilelo p0.d, x2, x1 > >>>>>>>> beq .L1 > >>>>>>>> ld1d z0.d, p0/z, [x0, #1, mul vl] > >>>>>>>> fadd z0.d, z0.d, z0.d > >>>>>>>> st1d z0.d, p0, [x3] > >>>>>>>> cntw x2 > >>>>>>>> incb x0, all, mul #2 > >>>>>>>> whilelo p0.d, x2, x1 > >>>>>>>> beq .L1 > >>>>>>>> ld1d z0.d, p0/z, [x0] > >>>>>>>> fadd z0.d, z0.d, z0.d > >>>>>>>> st1d z0.d, p0, [x0] > >>>>>>>> .L1: > >>>>>>>> ret > >>>>>>>> > >>>>>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count. > >>>>>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size > >>>>>>>> and hurts icache performance. > >>>>>>>> > >>>>>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration > >>>>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some > >>>>>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate > >>>>>>>> the branches. > >>>>>>>> > >>>>>>>> So for the above testcase we generate now: > >>>>>>>> fully_peel_me: > >>>>>>>> mov x2, 5 > >>>>>>>> mov x3, x2 > >>>>>>>> mov x1, 0 > >>>>>>>> whilelo p0.d, xzr, x2 > >>>>>>>> ptrue p1.d, all > >>>>>>>> .L2: > >>>>>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3] > >>>>>>>> fadd z0.d, z0.d, z0.d > >>>>>>>> st1d z0.d, p0, [x0, x1, lsl 3] > >>>>>>>> incd x1 > >>>>>>>> whilelo p0.d, x1, x3 > >>>>>>>> bne .L2 > >>>>>>>> ret > >>>>>>>> > >>>>>>>> Not perfect still, but it's preferable to the original code. > >>>>>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged > >>>>>>>> (until other target people experiment with it and set it, if appropriate). > >>>>>>>> > >>>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu. > >>>>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance. > >>>>>>>> > >>>>>>>> Ok for trunk? > >>>>>>> Hum. Why introduce a new --param and not simply key on > >>>>>>> flag_peel_loops instead? That is > >>>>>>> enabled by default at -O3 and with FDO but you of course can control > >>>>>>> that in your targets > >>>>>>> post-option-processing hook. > >>>>>> You mean like this? > >>>>>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :) > >>>>>> But I suppose it's a reasonable change. > >>>>> No, that change is backward. What I said is that peeling is already > >>>>> conditional on > >>>>> flag_peel_loops and that is enabled by -O3. So you want to disable > >>>>> flag_peel_loops for > >>>>> SVE instead in the target. > >>>> Sorry, I got confused by the similarly named functions. > >>>> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass > >>>> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path. > >>> Well, peeling gets disabled. From your patch I see you want to > >>> disable "unrolling" when > >>> the number of loop iteration is not constant. That is called peeling > >>> where we need to > >>> emit the loop exit test N times. > >>> > >>> Did you check your testcases with -fno-peel-loops? > >> -fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely) > >> can be called through two paths, only one of which is gated on flag_peel_loops. > > I don't see the obvious here so I have to either sit down with a > > non-SVE specific testcase > > showing this, or I am misunderstanding the actual transform that you > > want to avoid. > > allow_peel is false when called from canonicalize_induction_variables. > > There's the slight > > chance that UL_NO_GROWTH lets through cases - is your case one of > > that? That is, > > does the following help? > > > > Index: gcc/tree-ssa-loop-ivcanon.c > > =================================================================== > > --- gcc/tree-ssa-loop-ivcanon.c (revision 266056) > > +++ gcc/tree-ssa-loop-ivcanon.c (working copy) > > @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop > > exit = NULL; > > > > /* See if we can improve our estimate by using recorded loop bounds. */ > > - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH) > > + if ((allow_peel || maxiter == 0) > > && maxiter >= 0 > > && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll)) > > { > > > > IIRC I allowed that case when adding allow_peel simply because it avoided some > > testsuite regressions. This means you eventually want to work on the > > size estimate > > of SVE style loops? > > This doesn't help. > > Sorry if we're talking over each other here, I'm not very familiar with this area :( > For this loop: > for (int i = 0; i < 5; i++) > x[i] = x[i] * 2; > > For normal vectorisation (e.g. AArch64 NEON) we know the exact number of executions of the loop latch. > This gets fully unrolled as: > ldp q2, q1, [x0] > ldr d0, [x0, 32] > fadd v2.2d, v2.2d, v2.2d > fadd v1.2d, v1.2d, v1.2d > fadd d0, d0, d0 > stp q2, q1, [x0] > str d0, [x0, 32] > > For vector length-agnostic SVE vectorisation we don't as we don't know the number of elements we process > with each loop iteration. So the NEON unrolling becomes SVE peeling I guess. > Note that the number of iterations in SVE is still "constant", just not known at compile-time. > In this case peeling doesn't eliminate any branches and only serves to bloat code size. So I sat down with a cross and indeed for the late unrolling pass we simply pass in allow_peel == true given that try_unroll_loop_completely also performs peeling (and not just try_peel_loops which guards itself with flag_peel_loops). That means instead of the above the below fixes this (with -fno-peel-loops): Index: gcc/tree-ssa-loop-ivcanon.c =================================================================== --- gcc/tree-ssa-loop-ivcanon.c (revision 266072) +++ gcc/tree-ssa-loop-ivcanon.c (working copy) @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop exit = NULL; /* See if we can improve our estimate by using recorded loop bounds. */ - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH) + if (((allow_peel && flag_peel_loops) || maxiter == 0 || ul == UL_NO_GROWTH) && maxiter >= 0 && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll)) { there might be quite some testsuite fallout since flag_peel_loops is only enabled at -O3+, but one has to double-check. As said, a per-loop target control whether loop peeling is desirable would be an improvement I guess (apart from generally disabling peeling for aarch64). I suppose you can benchmark that together with the above fix. Richard. > Kyrill > > > Richard. > > > >> Thanks, > >> Kyrill > >> > >> > >>>> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops. > >>>> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable? > >>>> > >>>> Thanks, > >>>> Kyrill > >>>> > >>>>>>> It might also make sense to have more fine-grained control for this > >>>>>>> and allow a target > >>>>>>> to say whether it wants to peel a specific loop or not when the > >>>>>>> middle-end thinks that > >>>>>>> would be profitable. > >>>>>> Can be worth looking at as a follow-up. Do you envisage the target analysing > >>>>>> the gimple statements of the loop to figure out its cost? > >>>>> Kind-of. Sth like > >>>>> > >>>>> bool targetm.peel_loop (struct loop *); > >>>>> > >>>>> I have no idea whether you can easily detect a SVE vectorized loop though. > >>>>> Maybe there's always a special IV or so (the mask?) > >>>>> > >>>>> Richard. > >>>>> > >>>>>> Thanks, > >>>>>> Kyrill > >>>>>> > >>>>>> > >>>>>> 2018-11-09 Kyrylo Tkachov > >>>>>> > >>>>>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll > >>>>>> loop when number of iterations is not known and flag_peel_loops is in > >>>>>> effect. > >>>>>> > >>>>>> 2018-11-09 Kyrylo Tkachov > >>>>>> > >>>>>> * gcc.target/aarch64/sve/unroll-1.c: New test. > >>>>>> >