From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-489896-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 60020 invoked by alias); 13 Nov 2018 14:34:08 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 60001 invoked by uid 89); 13 Nov 2018 14:34:06 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-10.5 required=5.0 tests=AWL,BAYES_00,FREEMAIL_FROM,GIT_PATCH_2,GIT_PATCH_3,KAM_ASCII_DIVIDERS,KAM_NUMSUBJECT,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.2 spammy=
X-HELO: mail-lf1-f44.google.com
Received: from mail-lf1-f44.google.com (HELO mail-lf1-f44.google.com) (209.85.167.44) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 13 Nov 2018 14:34:04 +0000
Received: by mail-lf1-f44.google.com with SMTP id h192so8983693lfg.3        for <gcc-patches@gcc.gnu.org>; Tue, 13 Nov 2018 06:34:04 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;        d=gmail.com; s=20161025;        h=mime-version:references:in-reply-to:from:date:message-id:subject:to         :cc;        bh=ncZpEuPmmHsUe/djn9d3m+ZD/g1/dRigwE0jprhVEFY=;        b=QAEEtEyIuym4rVDyI9HNCaDSFY/5Xib72ob1eJlG4fECd9iV/jLYchD78Nt7uAXgJW         R1FZ8MkvTRbbWP9BjcaIfTGSPyGlKbw44qy5d5CJd/BJb9HepkhNb9QaE6cYt6hJlQwx         Tyh3V6SReevfA6YBp+Ylxl8dY9WydWjHG81cq/pILcBNDaogQSjXv8bsWKuuSo39w9NA         Ql1PH9P8qWLJoUgaoaTUjpTJfIbWRqLFhPsnyUQ42KMdzRHjVee7tFK6uzsOlMKAzekp         FrxdB5wpkO/yxuFKA/jspwY6UTatc4ZLwVzf8b0HFpuhuhhgs86y9Rd335C0mDvAmCRR         7gEQ==
MIME-Version: 1.0
References: <5BE565CE.5000709@foss.arm.com> <CAFiYyc2fZ4WTdguC9RjLWfYTiUK1t=Gd7W7mVmA2qvA1t2vjuA@mail.gmail.com> <5BE5CA6D.9060408@foss.arm.com> <CAFiYyc1ojB9VFk63u7PG6O-fo4O64ufOCfEm__LGu_MrzEYYAg@mail.gmail.com> <5BE9C452.3080709@foss.arm.com> <CAFiYyc1VTDC3==EYRNK=AbO-fe_TMu-BgvjE5LOsrFhLaxb9jA@mail.gmail.com> <5BEA9631.1070500@foss.arm.com> <CAFiYyc1nHvBf+TvKjOgaHYy3rJD2e9X7hbDy=eQOBOkZdX7KxQ@mail.gmail.com> <5BEA9DF8.2030004@foss.arm.com>
In-Reply-To: <5BEA9DF8.2030004@foss.arm.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Tue, 13 Nov 2018 14:34:00 -0000
Message-ID: <CAFiYyc22sdyiUGbVzboJ977QU9mS-oj1hToxkxUUY3y6cjdj4w@mail.gmail.com>
Subject: Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64
To: kyrylo.tkachov@foss.arm.com
Cc: GCC Patches <gcc-patches@gcc.gnu.org>, Marcus Shawcroft <marcus.shawcroft@arm.com>, 	Richard Earnshaw <richard.earnshaw@arm.com>, James Greenhalgh <james.greenhalgh@arm.com>, 	Richard Sandiford <richard.sandiford@arm.com>
Content-Type: text/plain; charset="UTF-8"
X-IsSubscribed: yes
X-SW-Source: 2018-11/txt/msg01098.txt.bz2

On Tue, Nov 13, 2018 at 10:48 AM Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
>
>
> On 13/11/18 09:28, Richard Biener wrote:
> > On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov
> > <kyrylo.tkachov@foss.arm.com> wrote:
> >> Hi Richard,
> >>
> >> On 13/11/18 08:24, Richard Biener wrote:
> >>> On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov
> >>> <kyrylo.tkachov@foss.arm.com> wrote:
> >>>> On 12/11/18 14:10, Richard Biener wrote:
> >>>>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov
> >>>>> <kyrylo.tkachov@foss.arm.com> wrote:
> >>>>>> On 09/11/18 12:18, Richard Biener wrote:
> >>>>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
> >>>>>>> <kyrylo.tkachov@foss.arm.com> wrote:
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling:
> >>>>>>>>
> >>>>>>>> fully_peel_me:
> >>>>>>>>             mov     x1, 5
> >>>>>>>>             ptrue   p1.d, all
> >>>>>>>>             whilelo p0.d, xzr, x1
> >>>>>>>>             ld1d    z0.d, p0/z, [x0]
> >>>>>>>>             fadd    z0.d, z0.d, z0.d
> >>>>>>>>             st1d    z0.d, p0, [x0]
> >>>>>>>>             cntd    x2
> >>>>>>>>             addvl   x3, x0, #1
> >>>>>>>>             whilelo p0.d, x2, x1
> >>>>>>>>             beq     .L1
> >>>>>>>>             ld1d    z0.d, p0/z, [x0, #1, mul vl]
> >>>>>>>>             fadd    z0.d, z0.d, z0.d
> >>>>>>>>             st1d    z0.d, p0, [x3]
> >>>>>>>>             cntw    x2
> >>>>>>>>             incb    x0, all, mul #2
> >>>>>>>>             whilelo p0.d, x2, x1
> >>>>>>>>             beq     .L1
> >>>>>>>>             ld1d    z0.d, p0/z, [x0]
> >>>>>>>>             fadd    z0.d, z0.d, z0.d
> >>>>>>>>             st1d    z0.d, p0, [x0]
> >>>>>>>> .L1:
> >>>>>>>>             ret
> >>>>>>>>
> >>>>>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count.
> >>>>>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size
> >>>>>>>> and hurts icache performance.
> >>>>>>>>
> >>>>>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration
> >>>>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some
> >>>>>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate
> >>>>>>>> the branches.
> >>>>>>>>
> >>>>>>>> So for the above testcase we generate now:
> >>>>>>>> fully_peel_me:
> >>>>>>>>             mov     x2, 5
> >>>>>>>>             mov     x3, x2
> >>>>>>>>             mov     x1, 0
> >>>>>>>>             whilelo p0.d, xzr, x2
> >>>>>>>>             ptrue   p1.d, all
> >>>>>>>> .L2:
> >>>>>>>>             ld1d    z0.d, p0/z, [x0, x1, lsl 3]
> >>>>>>>>             fadd    z0.d, z0.d, z0.d
> >>>>>>>>             st1d    z0.d, p0, [x0, x1, lsl 3]
> >>>>>>>>             incd    x1
> >>>>>>>>             whilelo p0.d, x1, x3
> >>>>>>>>             bne     .L2
> >>>>>>>>             ret
> >>>>>>>>
> >>>>>>>> Not perfect still, but it's preferable to the original code.
> >>>>>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged
> >>>>>>>> (until other target people experiment with it and set it, if appropriate).
> >>>>>>>>
> >>>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
> >>>>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance.
> >>>>>>>>
> >>>>>>>> Ok for trunk?
> >>>>>>> Hum.  Why introduce a new --param and not simply key on
> >>>>>>> flag_peel_loops instead?  That is
> >>>>>>> enabled by default at -O3 and with FDO but you of course can control
> >>>>>>> that in your targets
> >>>>>>> post-option-processing hook.
> >>>>>> You mean like this?
> >>>>>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :)
> >>>>>> But I suppose it's a reasonable change.
> >>>>> No, that change is backward.  What I said is that peeling is already
> >>>>> conditional on
> >>>>> flag_peel_loops and that is enabled by -O3.  So you want to disable
> >>>>> flag_peel_loops for
> >>>>> SVE instead in the target.
> >>>> Sorry, I got confused by the similarly named functions.
> >>>> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass
> >>>> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path.
> >>> Well, peeling gets disabled.  From your patch I see you want to
> >>> disable "unrolling" when
> >>> the number of loop iteration is not constant.  That is called peeling
> >>> where we need to
> >>> emit the loop exit test N times.
> >>>
> >>> Did you check your testcases with -fno-peel-loops?
> >> -fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely)
> >> can be called through two paths, only one of which is gated on flag_peel_loops.
> > I don't see the obvious here so I have to either sit down with a
> > non-SVE specific testcase
> > showing this, or I am misunderstanding the actual transform that you
> > want to avoid.
> > allow_peel is false when called from canonicalize_induction_variables.
> > There's the slight
> > chance that UL_NO_GROWTH lets through cases - is your case one of
> > that?  That is,
> > does the following help?
> >
> > Index: gcc/tree-ssa-loop-ivcanon.c
> > ===================================================================
> > --- gcc/tree-ssa-loop-ivcanon.c (revision 266056)
> > +++ gcc/tree-ssa-loop-ivcanon.c (working copy)
> > @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop
> >       exit = NULL;
> >
> >     /* See if we can improve our estimate by using recorded loop bounds.  */
> > -  if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH)
> > +  if ((allow_peel || maxiter == 0)
> >         && maxiter >= 0
> >         && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
> >       {
> >
> > IIRC I allowed that case when adding allow_peel simply because it avoided some
> > testsuite regressions.  This means you eventually want to work on the
> > size estimate
> > of SVE style loops?
>
> This doesn't help.
>
> Sorry if we're talking over each other here, I'm not very familiar with this area :(
> For this loop:
>    for (int i = 0; i < 5; i++)
>      x[i] = x[i] * 2;
>
> For normal vectorisation (e.g. AArch64 NEON) we know the exact number of executions of the loop latch.
> This gets fully unrolled as:
>          ldp     q2, q1, [x0]
>          ldr     d0, [x0, 32]
>          fadd    v2.2d, v2.2d, v2.2d
>          fadd    v1.2d, v1.2d, v1.2d
>          fadd    d0, d0, d0
>          stp     q2, q1, [x0]
>          str     d0, [x0, 32]
>
> For vector length-agnostic SVE vectorisation we don't as we don't know the number of elements we process
> with each loop iteration. So the NEON unrolling becomes SVE peeling I guess.
> Note that the number of iterations in SVE is still "constant", just not known at compile-time.
> In this case peeling doesn't eliminate any branches and only serves to bloat code size.

So I sat down with a cross and indeed for the late unrolling pass we
simply pass in allow_peel == true
given that try_unroll_loop_completely also performs peeling (and not
just try_peel_loops which guards
itself with flag_peel_loops).  That means instead of the above the
below fixes this (with -fno-peel-loops):

Index: gcc/tree-ssa-loop-ivcanon.c
===================================================================
--- gcc/tree-ssa-loop-ivcanon.c (revision 266072)
+++ gcc/tree-ssa-loop-ivcanon.c (working copy)
@@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop
     exit = NULL;

   /* See if we can improve our estimate by using recorded loop bounds.  */
-  if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH)
+  if (((allow_peel && flag_peel_loops) || maxiter == 0 || ul == UL_NO_GROWTH)
       && maxiter >= 0
       && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
     {

there might be quite some testsuite fallout since flag_peel_loops is
only enabled at -O3+,
but one has to double-check.  As said, a per-loop target control
whether loop peeling is
desirable would be an improvement I guess (apart from generally
disabling peeling for aarch64).
I suppose you can benchmark that together with the above fix.

Richard.

> Kyrill
>
> > Richard.
> >
> >> Thanks,
> >> Kyrill
> >>
> >>
> >>>> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops.
> >>>> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable?
> >>>>
> >>>> Thanks,
> >>>> Kyrill
> >>>>
> >>>>>>> It might also make sense to have more fine-grained control for this
> >>>>>>> and allow a target
> >>>>>>> to say whether it wants to peel a specific loop or not when the
> >>>>>>> middle-end thinks that
> >>>>>>> would be profitable.
> >>>>>> Can be worth looking at as a follow-up. Do you envisage the target analysing
> >>>>>> the gimple statements of the loop to figure out its cost?
> >>>>> Kind-of.  Sth like
> >>>>>
> >>>>>      bool targetm.peel_loop (struct loop *);
> >>>>>
> >>>>> I have no idea whether you can easily detect a SVE vectorized loop though.
> >>>>> Maybe there's always a special IV or so (the mask?)
> >>>>>
> >>>>> Richard.
> >>>>>
> >>>>>> Thanks,
> >>>>>> Kyrill
> >>>>>>
> >>>>>>
> >>>>>> 2018-11-09  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> >>>>>>
> >>>>>>            * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll
> >>>>>>            loop when number of iterations is not known and flag_peel_loops is in
> >>>>>>            effect.
> >>>>>>
> >>>>>> 2018-11-09  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> >>>>>>
> >>>>>>            * gcc.target/aarch64/sve/unroll-1.c: New test.
> >>>>>>
>