From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-489856-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 70492 invoked by alias); 13 Nov 2018 09:49:01 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 70249 invoked by uid 89); 13 Nov 2018 09:48:47 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-10.4 required=5.0 tests=BAYES_00,GIT_PATCH_2,GIT_PATCH_3,KAM_LAZY_DOMAIN_SECURITY,KAM_NUMSUBJECT autolearn=ham version=3.3.2 spammy=
X-HELO: foss.arm.com
Received: from usa-sjc-mx-foss1.foss.arm.com (HELO foss.arm.com) (217.140.101.70) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 13 Nov 2018 09:48:44 +0000
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249])	by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id F0AA7A78;	Tue, 13 Nov 2018 01:48:42 -0800 (PST)
Received: from [10.2.207.77] (e100706-lin.cambridge.arm.com [10.2.207.77])	by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id D96F93F5CF;	Tue, 13 Nov 2018 01:48:41 -0800 (PST)
Message-ID: <5BEA9DF8.2030004@foss.arm.com>
Date: Tue, 13 Nov 2018 09:49:00 -0000
From: Kyrill  Tkachov <kyrylo.tkachov@foss.arm.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0
MIME-Version: 1.0
To: Richard Biener <richard.guenther@gmail.com>
CC: GCC Patches <gcc-patches@gcc.gnu.org>,  Marcus Shawcroft <marcus.shawcroft@arm.com>, Richard Earnshaw <richard.earnshaw@arm.com>,  James Greenhalgh <james.greenhalgh@arm.com>, Richard Sandiford <richard.sandiford@arm.com>
Subject: Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64
References: <5BE565CE.5000709@foss.arm.com> <CAFiYyc2fZ4WTdguC9RjLWfYTiUK1t=Gd7W7mVmA2qvA1t2vjuA@mail.gmail.com> <5BE5CA6D.9060408@foss.arm.com> <CAFiYyc1ojB9VFk63u7PG6O-fo4O64ufOCfEm__LGu_MrzEYYAg@mail.gmail.com> <5BE9C452.3080709@foss.arm.com> <CAFiYyc1VTDC3==EYRNK=AbO-fe_TMu-BgvjE5LOsrFhLaxb9jA@mail.gmail.com> <5BEA9631.1070500@foss.arm.com> <CAFiYyc1nHvBf+TvKjOgaHYy3rJD2e9X7hbDy=eQOBOkZdX7KxQ@mail.gmail.com>
In-Reply-To: <CAFiYyc1nHvBf+TvKjOgaHYy3rJD2e9X7hbDy=eQOBOkZdX7KxQ@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-SW-Source: 2018-11/txt/msg01058.txt.bz2


On 13/11/18 09:28, Richard Biener wrote:
> On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>> Hi Richard,
>>
>> On 13/11/18 08:24, Richard Biener wrote:
>>> On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov
>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>> On 12/11/18 14:10, Richard Biener wrote:
>>>>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov
>>>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>>>> On 09/11/18 12:18, Richard Biener wrote:
>>>>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
>>>>>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling:
>>>>>>>>
>>>>>>>> fully_peel_me:
>>>>>>>>             mov     x1, 5
>>>>>>>>             ptrue   p1.d, all
>>>>>>>>             whilelo p0.d, xzr, x1
>>>>>>>>             ld1d    z0.d, p0/z, [x0]
>>>>>>>>             fadd    z0.d, z0.d, z0.d
>>>>>>>>             st1d    z0.d, p0, [x0]
>>>>>>>>             cntd    x2
>>>>>>>>             addvl   x3, x0, #1
>>>>>>>>             whilelo p0.d, x2, x1
>>>>>>>>             beq     .L1
>>>>>>>>             ld1d    z0.d, p0/z, [x0, #1, mul vl]
>>>>>>>>             fadd    z0.d, z0.d, z0.d
>>>>>>>>             st1d    z0.d, p0, [x3]
>>>>>>>>             cntw    x2
>>>>>>>>             incb    x0, all, mul #2
>>>>>>>>             whilelo p0.d, x2, x1
>>>>>>>>             beq     .L1
>>>>>>>>             ld1d    z0.d, p0/z, [x0]
>>>>>>>>             fadd    z0.d, z0.d, z0.d
>>>>>>>>             st1d    z0.d, p0, [x0]
>>>>>>>> .L1:
>>>>>>>>             ret
>>>>>>>>
>>>>>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count.
>>>>>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size
>>>>>>>> and hurts icache performance.
>>>>>>>>
>>>>>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration
>>>>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some
>>>>>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate
>>>>>>>> the branches.
>>>>>>>>
>>>>>>>> So for the above testcase we generate now:
>>>>>>>> fully_peel_me:
>>>>>>>>             mov     x2, 5
>>>>>>>>             mov     x3, x2
>>>>>>>>             mov     x1, 0
>>>>>>>>             whilelo p0.d, xzr, x2
>>>>>>>>             ptrue   p1.d, all
>>>>>>>> .L2:
>>>>>>>>             ld1d    z0.d, p0/z, [x0, x1, lsl 3]
>>>>>>>>             fadd    z0.d, z0.d, z0.d
>>>>>>>>             st1d    z0.d, p0, [x0, x1, lsl 3]
>>>>>>>>             incd    x1
>>>>>>>>             whilelo p0.d, x1, x3
>>>>>>>>             bne     .L2
>>>>>>>>             ret
>>>>>>>>
>>>>>>>> Not perfect still, but it's preferable to the original code.
>>>>>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged
>>>>>>>> (until other target people experiment with it and set it, if appropriate).
>>>>>>>>
>>>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance.
>>>>>>>>
>>>>>>>> Ok for trunk?
>>>>>>> Hum.  Why introduce a new --param and not simply key on
>>>>>>> flag_peel_loops instead?  That is
>>>>>>> enabled by default at -O3 and with FDO but you of course can control
>>>>>>> that in your targets
>>>>>>> post-option-processing hook.
>>>>>> You mean like this?
>>>>>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :)
>>>>>> But I suppose it's a reasonable change.
>>>>> No, that change is backward.  What I said is that peeling is already
>>>>> conditional on
>>>>> flag_peel_loops and that is enabled by -O3.  So you want to disable
>>>>> flag_peel_loops for
>>>>> SVE instead in the target.
>>>> Sorry, I got confused by the similarly named functions.
>>>> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass
>>>> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path.
>>> Well, peeling gets disabled.  From your patch I see you want to
>>> disable "unrolling" when
>>> the number of loop iteration is not constant.  That is called peeling
>>> where we need to
>>> emit the loop exit test N times.
>>>
>>> Did you check your testcases with -fno-peel-loops?
>> -fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely)
>> can be called through two paths, only one of which is gated on flag_peel_loops.
> I don't see the obvious here so I have to either sit down with a
> non-SVE specific testcase
> showing this, or I am misunderstanding the actual transform that you
> want to avoid.
> allow_peel is false when called from canonicalize_induction_variables.
> There's the slight
> chance that UL_NO_GROWTH lets through cases - is your case one of
> that?  That is,
> does the following help?
>
> Index: gcc/tree-ssa-loop-ivcanon.c
> ===================================================================
> --- gcc/tree-ssa-loop-ivcanon.c (revision 266056)
> +++ gcc/tree-ssa-loop-ivcanon.c (working copy)
> @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop
>       exit = NULL;
>
>     /* See if we can improve our estimate by using recorded loop bounds.  */
> -  if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH)
> +  if ((allow_peel || maxiter == 0)
>         && maxiter >= 0
>         && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
>       {
>
> IIRC I allowed that case when adding allow_peel simply because it avoided some
> testsuite regressions.  This means you eventually want to work on the
> size estimate
> of SVE style loops?

This doesn't help.

Sorry if we're talking over each other here, I'm not very familiar with this area :(
For this loop:
   for (int i = 0; i < 5; i++)
     x[i] = x[i] * 2;

For normal vectorisation (e.g. AArch64 NEON) we know the exact number of executions of the loop latch.
This gets fully unrolled as:
         ldp     q2, q1, [x0]
         ldr     d0, [x0, 32]
         fadd    v2.2d, v2.2d, v2.2d
         fadd    v1.2d, v1.2d, v1.2d
         fadd    d0, d0, d0
         stp     q2, q1, [x0]
         str     d0, [x0, 32]

For vector length-agnostic SVE vectorisation we don't as we don't know the number of elements we process
with each loop iteration. So the NEON unrolling becomes SVE peeling I guess.
Note that the number of iterations in SVE is still "constant", just not known at compile-time.
In this case peeling doesn't eliminate any branches and only serves to bloat code size.

Kyrill

> Richard.
>
>> Thanks,
>> Kyrill
>>
>>
>>>> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops.
>>>> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable?
>>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>>>>> It might also make sense to have more fine-grained control for this
>>>>>>> and allow a target
>>>>>>> to say whether it wants to peel a specific loop or not when the
>>>>>>> middle-end thinks that
>>>>>>> would be profitable.
>>>>>> Can be worth looking at as a follow-up. Do you envisage the target analysing
>>>>>> the gimple statements of the loop to figure out its cost?
>>>>> Kind-of.  Sth like
>>>>>
>>>>>      bool targetm.peel_loop (struct loop *);
>>>>>
>>>>> I have no idea whether you can easily detect a SVE vectorized loop though.
>>>>> Maybe there's always a special IV or so (the mask?)
>>>>>
>>>>> Richard.
>>>>>
>>>>>> Thanks,
>>>>>> Kyrill
>>>>>>
>>>>>>
>>>>>> 2018-11-09  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>>>
>>>>>>            * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll
>>>>>>            loop when number of iterations is not known and flag_peel_loops is in
>>>>>>            effect.
>>>>>>
>>>>>> 2018-11-09  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>>>>>>
>>>>>>            * gcc.target/aarch64/sve/unroll-1.c: New test.
>>>>>>