* [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64
@ 2018-11-09 10:47 Kyrill Tkachov
2018-11-09 12:19 ` Richard Biener
0 siblings, 1 reply; 13+ messages in thread
From: Kyrill Tkachov @ 2018-11-09 10:47 UTC (permalink / raw)
To: gcc-patches
Cc: Marcus Shawcroft, Richard Earnshaw (lists),
James Greenhalgh, Richard Sandiford
[-- Attachment #1: Type: text/plain, Size: 2678 bytes --]
Hi all,
In this testcase the codegen for VLA SVE is worse than it could be due to unrolling:
fully_peel_me:
mov x1, 5
ptrue p1.d, all
whilelo p0.d, xzr, x1
ld1d z0.d, p0/z, [x0]
fadd z0.d, z0.d, z0.d
st1d z0.d, p0, [x0]
cntd x2
addvl x3, x0, #1
whilelo p0.d, x2, x1
beq .L1
ld1d z0.d, p0/z, [x0, #1, mul vl]
fadd z0.d, z0.d, z0.d
st1d z0.d, p0, [x3]
cntw x2
incb x0, all, mul #2
whilelo p0.d, x2, x1
beq .L1
ld1d z0.d, p0/z, [x0]
fadd z0.d, z0.d, z0.d
st1d z0.d, p0, [x0]
.L1:
ret
In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count.
For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size
and hurts icache performance.
This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration
count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some
Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate
the branches.
So for the above testcase we generate now:
fully_peel_me:
mov x2, 5
mov x3, x2
mov x1, 0
whilelo p0.d, xzr, x2
ptrue p1.d, all
.L2:
ld1d z0.d, p0/z, [x0, x1, lsl 3]
fadd z0.d, z0.d, z0.d
st1d z0.d, p0, [x0, x1, lsl 3]
incd x1
whilelo p0.d, x1, x3
bne .L2
ret
Not perfect still, but it's preferable to the original code.
The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged
(until other target people experiment with it and set it, if appropriate).
Bootstrapped and tested on aarch64-none-linux-gnu.
Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance.
Ok for trunk?
Thanks,
Kyrill
2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* params.def (PARAM_UNROLL_KNOWN_LOOP_ITERATIONS_ONLY): Define.
* tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Use above to
disable unrolling on unknown iteration count.
* config/aarch64/aarch64.c (aarch64_override_options_internal): Set
PARAM_UNROLL_KNOWN_LOOP_ITERATIONS_ONLY to 1.
* doc/invoke.texi (--param unroll-known-loop-iterations-only):
Document.
2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* gcc.target/aarch64/sve/unroll-1.c: New test.
[-- Attachment #2: unroll.patch --]
[-- Type: text/x-patch, Size: 3505 bytes --]
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index f0e54eda80656829528c018357dde2e1e87f6ebd..34d08a075221fd4c098e9b5e8fabd8fe3948d285 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -10993,6 +10993,12 @@ aarch64_override_options_internal (struct gcc_options *opts)
opts->x_param_values,
global_options_set.x_param_values);
+ /* Don't unroll loops where the exact iteration count is not known at
+ compile-time. */
+ maybe_set_param_value (PARAM_UNROLL_KNOWN_LOOP_ITERATIONS_ONLY, 1,
+ opts->x_param_values,
+ global_options_set.x_param_values);
+
/* If the user hasn't changed it via configure then set the default to 64 KB
for the backend. */
maybe_set_param_value (PARAM_STACK_CLASH_PROTECTION_GUARD_SIZE,
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 802cc642453aef2d2c516bcbda22246252ec87c1..74e2aeda27d718264188761cf522d6c9f8025e07 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -10732,6 +10732,9 @@ The maximum number of branches on the hot path through the peeled sequence.
@item max-completely-peeled-insns
The maximum number of insns of a completely peeled loop.
+@item unroll-known-loop-iterations-only
+Only completely unroll loops where the iteration count is known.
+
@item max-completely-peel-times
The maximum number of iterations of a loop to be suitable for complete peeling.
diff --git a/gcc/params.def b/gcc/params.def
index 4a5f2042dac72bb457488ac8bc35d09df94c929c..07946552232058cee41303e81ed694f7f0bb615e 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -344,6 +344,11 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS,
"The maximum depth of a loop nest we completely peel.",
8, 0, 0)
+DEFPARAM(PARAM_UNROLL_KNOWN_LOOP_ITERATIONS_ONLY,
+ "unroll-known-loop-iterations-only",
+ "Only completely unroll loops where the iteration count is known",
+ 0, 0, 1)
+
/* The maximum number of insns of an unswitched loop. */
DEFPARAM(PARAM_MAX_UNSWITCH_INSNS,
"max-unswitch-insns",
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c b/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..7f53d20cbf8e18a4389b86c037f56f024bac22a5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+/* Check that simple loop is not fully unrolled. */
+
+void
+fully_peel_me (double *x)
+{
+ for (int i = 0; i < 5; i++)
+ x[i] = x[i] * 2;
+}
+
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[.+]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[.+\]\n} 1 } } */
diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
index eeae2a8c54af14e58970d1797c92ecc86ac0523c..a67800fe8807ba003c05c3d8bdd820cc8df93e57 100644
--- a/gcc/tree-ssa-loop-ivcanon.c
+++ b/gcc/tree-ssa-loop-ivcanon.c
@@ -883,6 +883,17 @@ try_unroll_loop_completely (struct loop *loop,
loop->num);
return false;
}
+ else if (TREE_CODE (niter) == SCEV_NOT_KNOWN
+ && PARAM_VALUE (PARAM_UNROLL_KNOWN_LOOP_ITERATIONS_ONLY)
+ == 1)
+ {
+ if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "Not unrolling loop %d: "
+ "exact number of iterations not known "
+ "(--param unroll-known-loop-iterations-only).\n",
+ loop->num);
+ return false;
+ }
}
initialize_original_copy_tables ();
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64
2018-11-09 10:47 [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64 Kyrill Tkachov
@ 2018-11-09 12:19 ` Richard Biener
2018-11-09 17:57 ` Kyrill Tkachov
0 siblings, 1 reply; 13+ messages in thread
From: Richard Biener @ 2018-11-09 12:19 UTC (permalink / raw)
To: kyrylo.tkachov
Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw,
James Greenhalgh, Richard Sandiford
On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
>
> Hi all,
>
> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling:
>
> fully_peel_me:
> mov x1, 5
> ptrue p1.d, all
> whilelo p0.d, xzr, x1
> ld1d z0.d, p0/z, [x0]
> fadd z0.d, z0.d, z0.d
> st1d z0.d, p0, [x0]
> cntd x2
> addvl x3, x0, #1
> whilelo p0.d, x2, x1
> beq .L1
> ld1d z0.d, p0/z, [x0, #1, mul vl]
> fadd z0.d, z0.d, z0.d
> st1d z0.d, p0, [x3]
> cntw x2
> incb x0, all, mul #2
> whilelo p0.d, x2, x1
> beq .L1
> ld1d z0.d, p0/z, [x0]
> fadd z0.d, z0.d, z0.d
> st1d z0.d, p0, [x0]
> .L1:
> ret
>
> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count.
> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size
> and hurts icache performance.
>
> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration
> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some
> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate
> the branches.
>
> So for the above testcase we generate now:
> fully_peel_me:
> mov x2, 5
> mov x3, x2
> mov x1, 0
> whilelo p0.d, xzr, x2
> ptrue p1.d, all
> .L2:
> ld1d z0.d, p0/z, [x0, x1, lsl 3]
> fadd z0.d, z0.d, z0.d
> st1d z0.d, p0, [x0, x1, lsl 3]
> incd x1
> whilelo p0.d, x1, x3
> bne .L2
> ret
>
> Not perfect still, but it's preferable to the original code.
> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged
> (until other target people experiment with it and set it, if appropriate).
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance.
>
> Ok for trunk?
Hum. Why introduce a new --param and not simply key on
flag_peel_loops instead? That is
enabled by default at -O3 and with FDO but you of course can control
that in your targets
post-option-processing hook.
It might also make sense to have more fine-grained control for this
and allow a target
to say whether it wants to peel a specific loop or not when the
middle-end thinks that
would be profitable.
Richard.
> Thanks,
> Kyrill
>
>
> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>
> * params.def (PARAM_UNROLL_KNOWN_LOOP_ITERATIONS_ONLY): Define.
> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Use above to
> disable unrolling on unknown iteration count.
> * config/aarch64/aarch64.c (aarch64_override_options_internal): Set
> PARAM_UNROLL_KNOWN_LOOP_ITERATIONS_ONLY to 1.
> * doc/invoke.texi (--param unroll-known-loop-iterations-only):
> Document.
>
> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>
> * gcc.target/aarch64/sve/unroll-1.c: New test.
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64
2018-11-09 12:19 ` Richard Biener
@ 2018-11-09 17:57 ` Kyrill Tkachov
2018-11-12 14:10 ` Richard Biener
0 siblings, 1 reply; 13+ messages in thread
From: Kyrill Tkachov @ 2018-11-09 17:57 UTC (permalink / raw)
To: Richard Biener
Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw,
James Greenhalgh, Richard Sandiford
[-- Attachment #1: Type: text/plain, Size: 3493 bytes --]
On 09/11/18 12:18, Richard Biener wrote:
> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>>
>> Hi all,
>>
>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling:
>>
>> fully_peel_me:
>> mov x1, 5
>> ptrue p1.d, all
>> whilelo p0.d, xzr, x1
>> ld1d z0.d, p0/z, [x0]
>> fadd z0.d, z0.d, z0.d
>> st1d z0.d, p0, [x0]
>> cntd x2
>> addvl x3, x0, #1
>> whilelo p0.d, x2, x1
>> beq .L1
>> ld1d z0.d, p0/z, [x0, #1, mul vl]
>> fadd z0.d, z0.d, z0.d
>> st1d z0.d, p0, [x3]
>> cntw x2
>> incb x0, all, mul #2
>> whilelo p0.d, x2, x1
>> beq .L1
>> ld1d z0.d, p0/z, [x0]
>> fadd z0.d, z0.d, z0.d
>> st1d z0.d, p0, [x0]
>> .L1:
>> ret
>>
>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count.
>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size
>> and hurts icache performance.
>>
>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration
>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some
>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate
>> the branches.
>>
>> So for the above testcase we generate now:
>> fully_peel_me:
>> mov x2, 5
>> mov x3, x2
>> mov x1, 0
>> whilelo p0.d, xzr, x2
>> ptrue p1.d, all
>> .L2:
>> ld1d z0.d, p0/z, [x0, x1, lsl 3]
>> fadd z0.d, z0.d, z0.d
>> st1d z0.d, p0, [x0, x1, lsl 3]
>> incd x1
>> whilelo p0.d, x1, x3
>> bne .L2
>> ret
>>
>> Not perfect still, but it's preferable to the original code.
>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged
>> (until other target people experiment with it and set it, if appropriate).
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance.
>>
>> Ok for trunk?
>
> Hum. Why introduce a new --param and not simply key on
> flag_peel_loops instead? That is
> enabled by default at -O3 and with FDO but you of course can control
> that in your targets
> post-option-processing hook.
You mean like this?
It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :)
But I suppose it's a reasonable change.
>
> It might also make sense to have more fine-grained control for this
> and allow a target
> to say whether it wants to peel a specific loop or not when the
> middle-end thinks that
> would be profitable.
Can be worth looking at as a follow-up. Do you envisage the target analysing
the gimple statements of the loop to figure out its cost?
Thanks,
Kyrill
2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll
loop when number of iterations is not known and flag_peel_loops is in
effect.
2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* gcc.target/aarch64/sve/unroll-1.c: New test.
[-- Attachment #2: unroll2.patch --]
[-- Type: text/x-patch, Size: 1379 bytes --]
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c b/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..7f53d20cbf8e18a4389b86c037f56f024bac22a5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+/* Check that simple loop is not fully unrolled. */
+
+void
+fully_peel_me (double *x)
+{
+ for (int i = 0; i < 5; i++)
+ x[i] = x[i] * 2;
+}
+
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[.+]\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[.+\]\n} 1 } } */
diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
index c2953059fb9218d4bc4cf12fe9277a552b4a04bd..daeddb384254775d6482ed5580e8262e4b26a87f 100644
--- a/gcc/tree-ssa-loop-ivcanon.c
+++ b/gcc/tree-ssa-loop-ivcanon.c
@@ -883,6 +883,16 @@ try_unroll_loop_completely (struct loop *loop,
loop->num);
return false;
}
+ else if (TREE_CODE (niter) == SCEV_NOT_KNOWN
+ && flag_peel_loops)
+ {
+ if (dump_enabled_p ())
+ dump_printf (MSG_NOTE, "Not unrolling loop %d: "
+ "exact number of iterations not known "
+ "(-fpeel-loops).\n",
+ loop->num);
+ return false;
+ }
}
initialize_original_copy_tables ();
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64
2018-11-09 17:57 ` Kyrill Tkachov
@ 2018-11-12 14:10 ` Richard Biener
2018-11-12 18:20 ` Kyrill Tkachov
0 siblings, 1 reply; 13+ messages in thread
From: Richard Biener @ 2018-11-12 14:10 UTC (permalink / raw)
To: kyrylo.tkachov
Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw,
James Greenhalgh, Richard Sandiford
On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
>
> On 09/11/18 12:18, Richard Biener wrote:
> > On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
> > <kyrylo.tkachov@foss.arm.com> wrote:
> >>
> >> Hi all,
> >>
> >> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling:
> >>
> >> fully_peel_me:
> >> mov x1, 5
> >> ptrue p1.d, all
> >> whilelo p0.d, xzr, x1
> >> ld1d z0.d, p0/z, [x0]
> >> fadd z0.d, z0.d, z0.d
> >> st1d z0.d, p0, [x0]
> >> cntd x2
> >> addvl x3, x0, #1
> >> whilelo p0.d, x2, x1
> >> beq .L1
> >> ld1d z0.d, p0/z, [x0, #1, mul vl]
> >> fadd z0.d, z0.d, z0.d
> >> st1d z0.d, p0, [x3]
> >> cntw x2
> >> incb x0, all, mul #2
> >> whilelo p0.d, x2, x1
> >> beq .L1
> >> ld1d z0.d, p0/z, [x0]
> >> fadd z0.d, z0.d, z0.d
> >> st1d z0.d, p0, [x0]
> >> .L1:
> >> ret
> >>
> >> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count.
> >> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size
> >> and hurts icache performance.
> >>
> >> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration
> >> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some
> >> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate
> >> the branches.
> >>
> >> So for the above testcase we generate now:
> >> fully_peel_me:
> >> mov x2, 5
> >> mov x3, x2
> >> mov x1, 0
> >> whilelo p0.d, xzr, x2
> >> ptrue p1.d, all
> >> .L2:
> >> ld1d z0.d, p0/z, [x0, x1, lsl 3]
> >> fadd z0.d, z0.d, z0.d
> >> st1d z0.d, p0, [x0, x1, lsl 3]
> >> incd x1
> >> whilelo p0.d, x1, x3
> >> bne .L2
> >> ret
> >>
> >> Not perfect still, but it's preferable to the original code.
> >> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged
> >> (until other target people experiment with it and set it, if appropriate).
> >>
> >> Bootstrapped and tested on aarch64-none-linux-gnu.
> >> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance.
> >>
> >> Ok for trunk?
> >
> > Hum. Why introduce a new --param and not simply key on
> > flag_peel_loops instead? That is
> > enabled by default at -O3 and with FDO but you of course can control
> > that in your targets
> > post-option-processing hook.
>
> You mean like this?
> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :)
> But I suppose it's a reasonable change.
No, that change is backward. What I said is that peeling is already
conditional on
flag_peel_loops and that is enabled by -O3. So you want to disable
flag_peel_loops for
SVE instead in the target.
> >
> > It might also make sense to have more fine-grained control for this
> > and allow a target
> > to say whether it wants to peel a specific loop or not when the
> > middle-end thinks that
> > would be profitable.
>
> Can be worth looking at as a follow-up. Do you envisage the target analysing
> the gimple statements of the loop to figure out its cost?
Kind-of. Sth like
bool targetm.peel_loop (struct loop *);
I have no idea whether you can easily detect a SVE vectorized loop though.
Maybe there's always a special IV or so (the mask?)
Richard.
> Thanks,
> Kyrill
>
>
> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>
> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll
> loop when number of iterations is not known and flag_peel_loops is in
> effect.
>
> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>
> * gcc.target/aarch64/sve/unroll-1.c: New test.
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64
2018-11-12 14:10 ` Richard Biener
@ 2018-11-12 18:20 ` Kyrill Tkachov
2018-11-13 8:25 ` Richard Biener
0 siblings, 1 reply; 13+ messages in thread
From: Kyrill Tkachov @ 2018-11-12 18:20 UTC (permalink / raw)
To: Richard Biener
Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw,
James Greenhalgh, Richard Sandiford
On 12/11/18 14:10, Richard Biener wrote:
> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>> On 09/11/18 12:18, Richard Biener wrote:
>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>> Hi all,
>>>>
>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling:
>>>>
>>>> fully_peel_me:
>>>> mov x1, 5
>>>> ptrue p1.d, all
>>>> whilelo p0.d, xzr, x1
>>>> ld1d z0.d, p0/z, [x0]
>>>> fadd z0.d, z0.d, z0.d
>>>> st1d z0.d, p0, [x0]
>>>> cntd x2
>>>> addvl x3, x0, #1
>>>> whilelo p0.d, x2, x1
>>>> beq .L1
>>>> ld1d z0.d, p0/z, [x0, #1, mul vl]
>>>> fadd z0.d, z0.d, z0.d
>>>> st1d z0.d, p0, [x3]
>>>> cntw x2
>>>> incb x0, all, mul #2
>>>> whilelo p0.d, x2, x1
>>>> beq .L1
>>>> ld1d z0.d, p0/z, [x0]
>>>> fadd z0.d, z0.d, z0.d
>>>> st1d z0.d, p0, [x0]
>>>> .L1:
>>>> ret
>>>>
>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count.
>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size
>>>> and hurts icache performance.
>>>>
>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration
>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some
>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate
>>>> the branches.
>>>>
>>>> So for the above testcase we generate now:
>>>> fully_peel_me:
>>>> mov x2, 5
>>>> mov x3, x2
>>>> mov x1, 0
>>>> whilelo p0.d, xzr, x2
>>>> ptrue p1.d, all
>>>> .L2:
>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3]
>>>> fadd z0.d, z0.d, z0.d
>>>> st1d z0.d, p0, [x0, x1, lsl 3]
>>>> incd x1
>>>> whilelo p0.d, x1, x3
>>>> bne .L2
>>>> ret
>>>>
>>>> Not perfect still, but it's preferable to the original code.
>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged
>>>> (until other target people experiment with it and set it, if appropriate).
>>>>
>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance.
>>>>
>>>> Ok for trunk?
>>> Hum. Why introduce a new --param and not simply key on
>>> flag_peel_loops instead? That is
>>> enabled by default at -O3 and with FDO but you of course can control
>>> that in your targets
>>> post-option-processing hook.
>> You mean like this?
>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :)
>> But I suppose it's a reasonable change.
> No, that change is backward. What I said is that peeling is already
> conditional on
> flag_peel_loops and that is enabled by -O3. So you want to disable
> flag_peel_loops for
> SVE instead in the target.
Sorry, I got confused by the similarly named functions.
I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass
(sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path.
try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops.
Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable?
Thanks,
Kyrill
>>> It might also make sense to have more fine-grained control for this
>>> and allow a target
>>> to say whether it wants to peel a specific loop or not when the
>>> middle-end thinks that
>>> would be profitable.
>> Can be worth looking at as a follow-up. Do you envisage the target analysing
>> the gimple statements of the loop to figure out its cost?
> Kind-of. Sth like
>
> bool targetm.peel_loop (struct loop *);
>
> I have no idea whether you can easily detect a SVE vectorized loop though.
> Maybe there's always a special IV or so (the mask?)
>
> Richard.
>
>> Thanks,
>> Kyrill
>>
>>
>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>>
>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll
>> loop when number of iterations is not known and flag_peel_loops is in
>> effect.
>>
>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>>
>> * gcc.target/aarch64/sve/unroll-1.c: New test.
>>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64
2018-11-12 18:20 ` Kyrill Tkachov
@ 2018-11-13 8:25 ` Richard Biener
2018-11-13 9:15 ` Kyrill Tkachov
0 siblings, 1 reply; 13+ messages in thread
From: Richard Biener @ 2018-11-13 8:25 UTC (permalink / raw)
To: kyrylo.tkachov
Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw,
James Greenhalgh, Richard Sandiford
On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
>
>
> On 12/11/18 14:10, Richard Biener wrote:
> > On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov
> > <kyrylo.tkachov@foss.arm.com> wrote:
> >> On 09/11/18 12:18, Richard Biener wrote:
> >>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
> >>> <kyrylo.tkachov@foss.arm.com> wrote:
> >>>> Hi all,
> >>>>
> >>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling:
> >>>>
> >>>> fully_peel_me:
> >>>> mov x1, 5
> >>>> ptrue p1.d, all
> >>>> whilelo p0.d, xzr, x1
> >>>> ld1d z0.d, p0/z, [x0]
> >>>> fadd z0.d, z0.d, z0.d
> >>>> st1d z0.d, p0, [x0]
> >>>> cntd x2
> >>>> addvl x3, x0, #1
> >>>> whilelo p0.d, x2, x1
> >>>> beq .L1
> >>>> ld1d z0.d, p0/z, [x0, #1, mul vl]
> >>>> fadd z0.d, z0.d, z0.d
> >>>> st1d z0.d, p0, [x3]
> >>>> cntw x2
> >>>> incb x0, all, mul #2
> >>>> whilelo p0.d, x2, x1
> >>>> beq .L1
> >>>> ld1d z0.d, p0/z, [x0]
> >>>> fadd z0.d, z0.d, z0.d
> >>>> st1d z0.d, p0, [x0]
> >>>> .L1:
> >>>> ret
> >>>>
> >>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count.
> >>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size
> >>>> and hurts icache performance.
> >>>>
> >>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration
> >>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some
> >>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate
> >>>> the branches.
> >>>>
> >>>> So for the above testcase we generate now:
> >>>> fully_peel_me:
> >>>> mov x2, 5
> >>>> mov x3, x2
> >>>> mov x1, 0
> >>>> whilelo p0.d, xzr, x2
> >>>> ptrue p1.d, all
> >>>> .L2:
> >>>> ld1d z0.d, p0/z, [x0, x1, lsl 3]
> >>>> fadd z0.d, z0.d, z0.d
> >>>> st1d z0.d, p0, [x0, x1, lsl 3]
> >>>> incd x1
> >>>> whilelo p0.d, x1, x3
> >>>> bne .L2
> >>>> ret
> >>>>
> >>>> Not perfect still, but it's preferable to the original code.
> >>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged
> >>>> (until other target people experiment with it and set it, if appropriate).
> >>>>
> >>>> Bootstrapped and tested on aarch64-none-linux-gnu.
> >>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance.
> >>>>
> >>>> Ok for trunk?
> >>> Hum. Why introduce a new --param and not simply key on
> >>> flag_peel_loops instead? That is
> >>> enabled by default at -O3 and with FDO but you of course can control
> >>> that in your targets
> >>> post-option-processing hook.
> >> You mean like this?
> >> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :)
> >> But I suppose it's a reasonable change.
> > No, that change is backward. What I said is that peeling is already
> > conditional on
> > flag_peel_loops and that is enabled by -O3. So you want to disable
> > flag_peel_loops for
> > SVE instead in the target.
>
> Sorry, I got confused by the similarly named functions.
> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass
> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path.
Well, peeling gets disabled. From your patch I see you want to
disable "unrolling" when
the number of loop iteration is not constant. That is called peeling
where we need to
emit the loop exit test N times.
Did you check your testcases with -fno-peel-loops?
> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops.
> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable?
>
> Thanks,
> Kyrill
>
> >>> It might also make sense to have more fine-grained control for this
> >>> and allow a target
> >>> to say whether it wants to peel a specific loop or not when the
> >>> middle-end thinks that
> >>> would be profitable.
> >> Can be worth looking at as a follow-up. Do you envisage the target analysing
> >> the gimple statements of the loop to figure out its cost?
> > Kind-of. Sth like
> >
> > bool targetm.peel_loop (struct loop *);
> >
> > I have no idea whether you can easily detect a SVE vectorized loop though.
> > Maybe there's always a special IV or so (the mask?)
> >
> > Richard.
> >
> >> Thanks,
> >> Kyrill
> >>
> >>
> >> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
> >>
> >> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll
> >> loop when number of iterations is not known and flag_peel_loops is in
> >> effect.
> >>
> >> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
> >>
> >> * gcc.target/aarch64/sve/unroll-1.c: New test.
> >>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64
2018-11-13 8:25 ` Richard Biener
@ 2018-11-13 9:15 ` Kyrill Tkachov
2018-11-13 9:28 ` Richard Biener
0 siblings, 1 reply; 13+ messages in thread
From: Kyrill Tkachov @ 2018-11-13 9:15 UTC (permalink / raw)
To: Richard Biener
Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw,
James Greenhalgh, Richard Sandiford
Hi Richard,
On 13/11/18 08:24, Richard Biener wrote:
> On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>>
>> On 12/11/18 14:10, Richard Biener wrote:
>>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov
>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>> On 09/11/18 12:18, Richard Biener wrote:
>>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
>>>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling:
>>>>>>
>>>>>> fully_peel_me:
>>>>>> mov x1, 5
>>>>>> ptrue p1.d, all
>>>>>> whilelo p0.d, xzr, x1
>>>>>> ld1d z0.d, p0/z, [x0]
>>>>>> fadd z0.d, z0.d, z0.d
>>>>>> st1d z0.d, p0, [x0]
>>>>>> cntd x2
>>>>>> addvl x3, x0, #1
>>>>>> whilelo p0.d, x2, x1
>>>>>> beq .L1
>>>>>> ld1d z0.d, p0/z, [x0, #1, mul vl]
>>>>>> fadd z0.d, z0.d, z0.d
>>>>>> st1d z0.d, p0, [x3]
>>>>>> cntw x2
>>>>>> incb x0, all, mul #2
>>>>>> whilelo p0.d, x2, x1
>>>>>> beq .L1
>>>>>> ld1d z0.d, p0/z, [x0]
>>>>>> fadd z0.d, z0.d, z0.d
>>>>>> st1d z0.d, p0, [x0]
>>>>>> .L1:
>>>>>> ret
>>>>>>
>>>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count.
>>>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size
>>>>>> and hurts icache performance.
>>>>>>
>>>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration
>>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some
>>>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate
>>>>>> the branches.
>>>>>>
>>>>>> So for the above testcase we generate now:
>>>>>> fully_peel_me:
>>>>>> mov x2, 5
>>>>>> mov x3, x2
>>>>>> mov x1, 0
>>>>>> whilelo p0.d, xzr, x2
>>>>>> ptrue p1.d, all
>>>>>> .L2:
>>>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3]
>>>>>> fadd z0.d, z0.d, z0.d
>>>>>> st1d z0.d, p0, [x0, x1, lsl 3]
>>>>>> incd x1
>>>>>> whilelo p0.d, x1, x3
>>>>>> bne .L2
>>>>>> ret
>>>>>>
>>>>>> Not perfect still, but it's preferable to the original code.
>>>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged
>>>>>> (until other target people experiment with it and set it, if appropriate).
>>>>>>
>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance.
>>>>>>
>>>>>> Ok for trunk?
>>>>> Hum. Why introduce a new --param and not simply key on
>>>>> flag_peel_loops instead? That is
>>>>> enabled by default at -O3 and with FDO but you of course can control
>>>>> that in your targets
>>>>> post-option-processing hook.
>>>> You mean like this?
>>>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :)
>>>> But I suppose it's a reasonable change.
>>> No, that change is backward. What I said is that peeling is already
>>> conditional on
>>> flag_peel_loops and that is enabled by -O3. So you want to disable
>>> flag_peel_loops for
>>> SVE instead in the target.
>> Sorry, I got confused by the similarly named functions.
>> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass
>> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path.
> Well, peeling gets disabled. From your patch I see you want to
> disable "unrolling" when
> the number of loop iteration is not constant. That is called peeling
> where we need to
> emit the loop exit test N times.
>
> Did you check your testcases with -fno-peel-loops?
-fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely)
can be called through two paths, only one of which is gated on flag_peel_loops.
Thanks,
Kyrill
>> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops.
>> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable?
>>
>> Thanks,
>> Kyrill
>>
>>>>> It might also make sense to have more fine-grained control for this
>>>>> and allow a target
>>>>> to say whether it wants to peel a specific loop or not when the
>>>>> middle-end thinks that
>>>>> would be profitable.
>>>> Can be worth looking at as a follow-up. Do you envisage the target analysing
>>>> the gimple statements of the loop to figure out its cost?
>>> Kind-of. Sth like
>>>
>>> bool targetm.peel_loop (struct loop *);
>>>
>>> I have no idea whether you can easily detect a SVE vectorized loop though.
>>> Maybe there's always a special IV or so (the mask?)
>>>
>>> Richard.
>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>>
>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>>>>
>>>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll
>>>> loop when number of iterations is not known and flag_peel_loops is in
>>>> effect.
>>>>
>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>>>>
>>>> * gcc.target/aarch64/sve/unroll-1.c: New test.
>>>>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64
2018-11-13 9:15 ` Kyrill Tkachov
@ 2018-11-13 9:28 ` Richard Biener
2018-11-13 9:49 ` Kyrill Tkachov
0 siblings, 1 reply; 13+ messages in thread
From: Richard Biener @ 2018-11-13 9:28 UTC (permalink / raw)
To: kyrylo.tkachov
Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw,
James Greenhalgh, Richard Sandiford
On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
>
> Hi Richard,
>
> On 13/11/18 08:24, Richard Biener wrote:
> > On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov
> > <kyrylo.tkachov@foss.arm.com> wrote:
> >>
> >> On 12/11/18 14:10, Richard Biener wrote:
> >>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov
> >>> <kyrylo.tkachov@foss.arm.com> wrote:
> >>>> On 09/11/18 12:18, Richard Biener wrote:
> >>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
> >>>>> <kyrylo.tkachov@foss.arm.com> wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling:
> >>>>>>
> >>>>>> fully_peel_me:
> >>>>>> mov x1, 5
> >>>>>> ptrue p1.d, all
> >>>>>> whilelo p0.d, xzr, x1
> >>>>>> ld1d z0.d, p0/z, [x0]
> >>>>>> fadd z0.d, z0.d, z0.d
> >>>>>> st1d z0.d, p0, [x0]
> >>>>>> cntd x2
> >>>>>> addvl x3, x0, #1
> >>>>>> whilelo p0.d, x2, x1
> >>>>>> beq .L1
> >>>>>> ld1d z0.d, p0/z, [x0, #1, mul vl]
> >>>>>> fadd z0.d, z0.d, z0.d
> >>>>>> st1d z0.d, p0, [x3]
> >>>>>> cntw x2
> >>>>>> incb x0, all, mul #2
> >>>>>> whilelo p0.d, x2, x1
> >>>>>> beq .L1
> >>>>>> ld1d z0.d, p0/z, [x0]
> >>>>>> fadd z0.d, z0.d, z0.d
> >>>>>> st1d z0.d, p0, [x0]
> >>>>>> .L1:
> >>>>>> ret
> >>>>>>
> >>>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count.
> >>>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size
> >>>>>> and hurts icache performance.
> >>>>>>
> >>>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration
> >>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some
> >>>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate
> >>>>>> the branches.
> >>>>>>
> >>>>>> So for the above testcase we generate now:
> >>>>>> fully_peel_me:
> >>>>>> mov x2, 5
> >>>>>> mov x3, x2
> >>>>>> mov x1, 0
> >>>>>> whilelo p0.d, xzr, x2
> >>>>>> ptrue p1.d, all
> >>>>>> .L2:
> >>>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3]
> >>>>>> fadd z0.d, z0.d, z0.d
> >>>>>> st1d z0.d, p0, [x0, x1, lsl 3]
> >>>>>> incd x1
> >>>>>> whilelo p0.d, x1, x3
> >>>>>> bne .L2
> >>>>>> ret
> >>>>>>
> >>>>>> Not perfect still, but it's preferable to the original code.
> >>>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged
> >>>>>> (until other target people experiment with it and set it, if appropriate).
> >>>>>>
> >>>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
> >>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance.
> >>>>>>
> >>>>>> Ok for trunk?
> >>>>> Hum. Why introduce a new --param and not simply key on
> >>>>> flag_peel_loops instead? That is
> >>>>> enabled by default at -O3 and with FDO but you of course can control
> >>>>> that in your targets
> >>>>> post-option-processing hook.
> >>>> You mean like this?
> >>>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :)
> >>>> But I suppose it's a reasonable change.
> >>> No, that change is backward. What I said is that peeling is already
> >>> conditional on
> >>> flag_peel_loops and that is enabled by -O3. So you want to disable
> >>> flag_peel_loops for
> >>> SVE instead in the target.
> >> Sorry, I got confused by the similarly named functions.
> >> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass
> >> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path.
> > Well, peeling gets disabled. From your patch I see you want to
> > disable "unrolling" when
> > the number of loop iteration is not constant. That is called peeling
> > where we need to
> > emit the loop exit test N times.
> >
> > Did you check your testcases with -fno-peel-loops?
>
> -fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely)
> can be called through two paths, only one of which is gated on flag_peel_loops.
I don't see the obvious here so I have to either sit down with a
non-SVE specific testcase
showing this, or I am misunderstanding the actual transform that you
want to avoid.
allow_peel is false when called from canonicalize_induction_variables.
There's the slight
chance that UL_NO_GROWTH lets through cases - is your case one of
that? That is,
does the following help?
Index: gcc/tree-ssa-loop-ivcanon.c
===================================================================
--- gcc/tree-ssa-loop-ivcanon.c (revision 266056)
+++ gcc/tree-ssa-loop-ivcanon.c (working copy)
@@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop
exit = NULL;
/* See if we can improve our estimate by using recorded loop bounds. */
- if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH)
+ if ((allow_peel || maxiter == 0)
&& maxiter >= 0
&& (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
{
IIRC I allowed that case when adding allow_peel simply because it avoided some
testsuite regressions. This means you eventually want to work on the
size estimate
of SVE style loops?
Richard.
> Thanks,
> Kyrill
>
>
> >> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops.
> >> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable?
> >>
> >> Thanks,
> >> Kyrill
> >>
> >>>>> It might also make sense to have more fine-grained control for this
> >>>>> and allow a target
> >>>>> to say whether it wants to peel a specific loop or not when the
> >>>>> middle-end thinks that
> >>>>> would be profitable.
> >>>> Can be worth looking at as a follow-up. Do you envisage the target analysing
> >>>> the gimple statements of the loop to figure out its cost?
> >>> Kind-of. Sth like
> >>>
> >>> bool targetm.peel_loop (struct loop *);
> >>>
> >>> I have no idea whether you can easily detect a SVE vectorized loop though.
> >>> Maybe there's always a special IV or so (the mask?)
> >>>
> >>> Richard.
> >>>
> >>>> Thanks,
> >>>> Kyrill
> >>>>
> >>>>
> >>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
> >>>>
> >>>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll
> >>>> loop when number of iterations is not known and flag_peel_loops is in
> >>>> effect.
> >>>>
> >>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
> >>>>
> >>>> * gcc.target/aarch64/sve/unroll-1.c: New test.
> >>>>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64
2018-11-13 9:28 ` Richard Biener
@ 2018-11-13 9:49 ` Kyrill Tkachov
2018-11-13 14:34 ` Richard Biener
0 siblings, 1 reply; 13+ messages in thread
From: Kyrill Tkachov @ 2018-11-13 9:49 UTC (permalink / raw)
To: Richard Biener
Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw,
James Greenhalgh, Richard Sandiford
On 13/11/18 09:28, Richard Biener wrote:
> On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>> Hi Richard,
>>
>> On 13/11/18 08:24, Richard Biener wrote:
>>> On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov
>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>> On 12/11/18 14:10, Richard Biener wrote:
>>>>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov
>>>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>>>> On 09/11/18 12:18, Richard Biener wrote:
>>>>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
>>>>>>> <kyrylo.tkachov@foss.arm.com> wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling:
>>>>>>>>
>>>>>>>> fully_peel_me:
>>>>>>>> mov x1, 5
>>>>>>>> ptrue p1.d, all
>>>>>>>> whilelo p0.d, xzr, x1
>>>>>>>> ld1d z0.d, p0/z, [x0]
>>>>>>>> fadd z0.d, z0.d, z0.d
>>>>>>>> st1d z0.d, p0, [x0]
>>>>>>>> cntd x2
>>>>>>>> addvl x3, x0, #1
>>>>>>>> whilelo p0.d, x2, x1
>>>>>>>> beq .L1
>>>>>>>> ld1d z0.d, p0/z, [x0, #1, mul vl]
>>>>>>>> fadd z0.d, z0.d, z0.d
>>>>>>>> st1d z0.d, p0, [x3]
>>>>>>>> cntw x2
>>>>>>>> incb x0, all, mul #2
>>>>>>>> whilelo p0.d, x2, x1
>>>>>>>> beq .L1
>>>>>>>> ld1d z0.d, p0/z, [x0]
>>>>>>>> fadd z0.d, z0.d, z0.d
>>>>>>>> st1d z0.d, p0, [x0]
>>>>>>>> .L1:
>>>>>>>> ret
>>>>>>>>
>>>>>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count.
>>>>>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size
>>>>>>>> and hurts icache performance.
>>>>>>>>
>>>>>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration
>>>>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some
>>>>>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate
>>>>>>>> the branches.
>>>>>>>>
>>>>>>>> So for the above testcase we generate now:
>>>>>>>> fully_peel_me:
>>>>>>>> mov x2, 5
>>>>>>>> mov x3, x2
>>>>>>>> mov x1, 0
>>>>>>>> whilelo p0.d, xzr, x2
>>>>>>>> ptrue p1.d, all
>>>>>>>> .L2:
>>>>>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3]
>>>>>>>> fadd z0.d, z0.d, z0.d
>>>>>>>> st1d z0.d, p0, [x0, x1, lsl 3]
>>>>>>>> incd x1
>>>>>>>> whilelo p0.d, x1, x3
>>>>>>>> bne .L2
>>>>>>>> ret
>>>>>>>>
>>>>>>>> Not perfect still, but it's preferable to the original code.
>>>>>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged
>>>>>>>> (until other target people experiment with it and set it, if appropriate).
>>>>>>>>
>>>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance.
>>>>>>>>
>>>>>>>> Ok for trunk?
>>>>>>> Hum. Why introduce a new --param and not simply key on
>>>>>>> flag_peel_loops instead? That is
>>>>>>> enabled by default at -O3 and with FDO but you of course can control
>>>>>>> that in your targets
>>>>>>> post-option-processing hook.
>>>>>> You mean like this?
>>>>>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :)
>>>>>> But I suppose it's a reasonable change.
>>>>> No, that change is backward. What I said is that peeling is already
>>>>> conditional on
>>>>> flag_peel_loops and that is enabled by -O3. So you want to disable
>>>>> flag_peel_loops for
>>>>> SVE instead in the target.
>>>> Sorry, I got confused by the similarly named functions.
>>>> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass
>>>> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path.
>>> Well, peeling gets disabled. From your patch I see you want to
>>> disable "unrolling" when
>>> the number of loop iteration is not constant. That is called peeling
>>> where we need to
>>> emit the loop exit test N times.
>>>
>>> Did you check your testcases with -fno-peel-loops?
>> -fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely)
>> can be called through two paths, only one of which is gated on flag_peel_loops.
> I don't see the obvious here so I have to either sit down with a
> non-SVE specific testcase
> showing this, or I am misunderstanding the actual transform that you
> want to avoid.
> allow_peel is false when called from canonicalize_induction_variables.
> There's the slight
> chance that UL_NO_GROWTH lets through cases - is your case one of
> that? That is,
> does the following help?
>
> Index: gcc/tree-ssa-loop-ivcanon.c
> ===================================================================
> --- gcc/tree-ssa-loop-ivcanon.c (revision 266056)
> +++ gcc/tree-ssa-loop-ivcanon.c (working copy)
> @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop
> exit = NULL;
>
> /* See if we can improve our estimate by using recorded loop bounds. */
> - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH)
> + if ((allow_peel || maxiter == 0)
> && maxiter >= 0
> && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
> {
>
> IIRC I allowed that case when adding allow_peel simply because it avoided some
> testsuite regressions. This means you eventually want to work on the
> size estimate
> of SVE style loops?
This doesn't help.
Sorry if we're talking over each other here, I'm not very familiar with this area :(
For this loop:
for (int i = 0; i < 5; i++)
x[i] = x[i] * 2;
For normal vectorisation (e.g. AArch64 NEON) we know the exact number of executions of the loop latch.
This gets fully unrolled as:
ldp q2, q1, [x0]
ldr d0, [x0, 32]
fadd v2.2d, v2.2d, v2.2d
fadd v1.2d, v1.2d, v1.2d
fadd d0, d0, d0
stp q2, q1, [x0]
str d0, [x0, 32]
For vector length-agnostic SVE vectorisation we don't as we don't know the number of elements we process
with each loop iteration. So the NEON unrolling becomes SVE peeling I guess.
Note that the number of iterations in SVE is still "constant", just not known at compile-time.
In this case peeling doesn't eliminate any branches and only serves to bloat code size.
Kyrill
> Richard.
>
>> Thanks,
>> Kyrill
>>
>>
>>>> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops.
>>>> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable?
>>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>>>>> It might also make sense to have more fine-grained control for this
>>>>>>> and allow a target
>>>>>>> to say whether it wants to peel a specific loop or not when the
>>>>>>> middle-end thinks that
>>>>>>> would be profitable.
>>>>>> Can be worth looking at as a follow-up. Do you envisage the target analysing
>>>>>> the gimple statements of the loop to figure out its cost?
>>>>> Kind-of. Sth like
>>>>>
>>>>> bool targetm.peel_loop (struct loop *);
>>>>>
>>>>> I have no idea whether you can easily detect a SVE vectorized loop though.
>>>>> Maybe there's always a special IV or so (the mask?)
>>>>>
>>>>> Richard.
>>>>>
>>>>>> Thanks,
>>>>>> Kyrill
>>>>>>
>>>>>>
>>>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>>>>>>
>>>>>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll
>>>>>> loop when number of iterations is not known and flag_peel_loops is in
>>>>>> effect.
>>>>>>
>>>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>>>>>>
>>>>>> * gcc.target/aarch64/sve/unroll-1.c: New test.
>>>>>>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64
2018-11-13 9:49 ` Kyrill Tkachov
@ 2018-11-13 14:34 ` Richard Biener
2018-11-14 15:01 ` Richard Biener
0 siblings, 1 reply; 13+ messages in thread
From: Richard Biener @ 2018-11-13 14:34 UTC (permalink / raw)
To: kyrylo.tkachov
Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw,
James Greenhalgh, Richard Sandiford
On Tue, Nov 13, 2018 at 10:48 AM Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
>
>
> On 13/11/18 09:28, Richard Biener wrote:
> > On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov
> > <kyrylo.tkachov@foss.arm.com> wrote:
> >> Hi Richard,
> >>
> >> On 13/11/18 08:24, Richard Biener wrote:
> >>> On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov
> >>> <kyrylo.tkachov@foss.arm.com> wrote:
> >>>> On 12/11/18 14:10, Richard Biener wrote:
> >>>>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov
> >>>>> <kyrylo.tkachov@foss.arm.com> wrote:
> >>>>>> On 09/11/18 12:18, Richard Biener wrote:
> >>>>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
> >>>>>>> <kyrylo.tkachov@foss.arm.com> wrote:
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling:
> >>>>>>>>
> >>>>>>>> fully_peel_me:
> >>>>>>>> mov x1, 5
> >>>>>>>> ptrue p1.d, all
> >>>>>>>> whilelo p0.d, xzr, x1
> >>>>>>>> ld1d z0.d, p0/z, [x0]
> >>>>>>>> fadd z0.d, z0.d, z0.d
> >>>>>>>> st1d z0.d, p0, [x0]
> >>>>>>>> cntd x2
> >>>>>>>> addvl x3, x0, #1
> >>>>>>>> whilelo p0.d, x2, x1
> >>>>>>>> beq .L1
> >>>>>>>> ld1d z0.d, p0/z, [x0, #1, mul vl]
> >>>>>>>> fadd z0.d, z0.d, z0.d
> >>>>>>>> st1d z0.d, p0, [x3]
> >>>>>>>> cntw x2
> >>>>>>>> incb x0, all, mul #2
> >>>>>>>> whilelo p0.d, x2, x1
> >>>>>>>> beq .L1
> >>>>>>>> ld1d z0.d, p0/z, [x0]
> >>>>>>>> fadd z0.d, z0.d, z0.d
> >>>>>>>> st1d z0.d, p0, [x0]
> >>>>>>>> .L1:
> >>>>>>>> ret
> >>>>>>>>
> >>>>>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count.
> >>>>>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size
> >>>>>>>> and hurts icache performance.
> >>>>>>>>
> >>>>>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration
> >>>>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some
> >>>>>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate
> >>>>>>>> the branches.
> >>>>>>>>
> >>>>>>>> So for the above testcase we generate now:
> >>>>>>>> fully_peel_me:
> >>>>>>>> mov x2, 5
> >>>>>>>> mov x3, x2
> >>>>>>>> mov x1, 0
> >>>>>>>> whilelo p0.d, xzr, x2
> >>>>>>>> ptrue p1.d, all
> >>>>>>>> .L2:
> >>>>>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3]
> >>>>>>>> fadd z0.d, z0.d, z0.d
> >>>>>>>> st1d z0.d, p0, [x0, x1, lsl 3]
> >>>>>>>> incd x1
> >>>>>>>> whilelo p0.d, x1, x3
> >>>>>>>> bne .L2
> >>>>>>>> ret
> >>>>>>>>
> >>>>>>>> Not perfect still, but it's preferable to the original code.
> >>>>>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged
> >>>>>>>> (until other target people experiment with it and set it, if appropriate).
> >>>>>>>>
> >>>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
> >>>>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance.
> >>>>>>>>
> >>>>>>>> Ok for trunk?
> >>>>>>> Hum. Why introduce a new --param and not simply key on
> >>>>>>> flag_peel_loops instead? That is
> >>>>>>> enabled by default at -O3 and with FDO but you of course can control
> >>>>>>> that in your targets
> >>>>>>> post-option-processing hook.
> >>>>>> You mean like this?
> >>>>>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :)
> >>>>>> But I suppose it's a reasonable change.
> >>>>> No, that change is backward. What I said is that peeling is already
> >>>>> conditional on
> >>>>> flag_peel_loops and that is enabled by -O3. So you want to disable
> >>>>> flag_peel_loops for
> >>>>> SVE instead in the target.
> >>>> Sorry, I got confused by the similarly named functions.
> >>>> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass
> >>>> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path.
> >>> Well, peeling gets disabled. From your patch I see you want to
> >>> disable "unrolling" when
> >>> the number of loop iteration is not constant. That is called peeling
> >>> where we need to
> >>> emit the loop exit test N times.
> >>>
> >>> Did you check your testcases with -fno-peel-loops?
> >> -fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely)
> >> can be called through two paths, only one of which is gated on flag_peel_loops.
> > I don't see the obvious here so I have to either sit down with a
> > non-SVE specific testcase
> > showing this, or I am misunderstanding the actual transform that you
> > want to avoid.
> > allow_peel is false when called from canonicalize_induction_variables.
> > There's the slight
> > chance that UL_NO_GROWTH lets through cases - is your case one of
> > that? That is,
> > does the following help?
> >
> > Index: gcc/tree-ssa-loop-ivcanon.c
> > ===================================================================
> > --- gcc/tree-ssa-loop-ivcanon.c (revision 266056)
> > +++ gcc/tree-ssa-loop-ivcanon.c (working copy)
> > @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop
> > exit = NULL;
> >
> > /* See if we can improve our estimate by using recorded loop bounds. */
> > - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH)
> > + if ((allow_peel || maxiter == 0)
> > && maxiter >= 0
> > && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
> > {
> >
> > IIRC I allowed that case when adding allow_peel simply because it avoided some
> > testsuite regressions. This means you eventually want to work on the
> > size estimate
> > of SVE style loops?
>
> This doesn't help.
>
> Sorry if we're talking over each other here, I'm not very familiar with this area :(
> For this loop:
> for (int i = 0; i < 5; i++)
> x[i] = x[i] * 2;
>
> For normal vectorisation (e.g. AArch64 NEON) we know the exact number of executions of the loop latch.
> This gets fully unrolled as:
> ldp q2, q1, [x0]
> ldr d0, [x0, 32]
> fadd v2.2d, v2.2d, v2.2d
> fadd v1.2d, v1.2d, v1.2d
> fadd d0, d0, d0
> stp q2, q1, [x0]
> str d0, [x0, 32]
>
> For vector length-agnostic SVE vectorisation we don't as we don't know the number of elements we process
> with each loop iteration. So the NEON unrolling becomes SVE peeling I guess.
> Note that the number of iterations in SVE is still "constant", just not known at compile-time.
> In this case peeling doesn't eliminate any branches and only serves to bloat code size.
So I sat down with a cross and indeed for the late unrolling pass we
simply pass in allow_peel == true
given that try_unroll_loop_completely also performs peeling (and not
just try_peel_loops which guards
itself with flag_peel_loops). That means instead of the above the
below fixes this (with -fno-peel-loops):
Index: gcc/tree-ssa-loop-ivcanon.c
===================================================================
--- gcc/tree-ssa-loop-ivcanon.c (revision 266072)
+++ gcc/tree-ssa-loop-ivcanon.c (working copy)
@@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop
exit = NULL;
/* See if we can improve our estimate by using recorded loop bounds. */
- if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH)
+ if (((allow_peel && flag_peel_loops) || maxiter == 0 || ul == UL_NO_GROWTH)
&& maxiter >= 0
&& (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
{
there might be quite some testsuite fallout since flag_peel_loops is
only enabled at -O3+,
but one has to double-check. As said, a per-loop target control
whether loop peeling is
desirable would be an improvement I guess (apart from generally
disabling peeling for aarch64).
I suppose you can benchmark that together with the above fix.
Richard.
> Kyrill
>
> > Richard.
> >
> >> Thanks,
> >> Kyrill
> >>
> >>
> >>>> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops.
> >>>> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable?
> >>>>
> >>>> Thanks,
> >>>> Kyrill
> >>>>
> >>>>>>> It might also make sense to have more fine-grained control for this
> >>>>>>> and allow a target
> >>>>>>> to say whether it wants to peel a specific loop or not when the
> >>>>>>> middle-end thinks that
> >>>>>>> would be profitable.
> >>>>>> Can be worth looking at as a follow-up. Do you envisage the target analysing
> >>>>>> the gimple statements of the loop to figure out its cost?
> >>>>> Kind-of. Sth like
> >>>>>
> >>>>> bool targetm.peel_loop (struct loop *);
> >>>>>
> >>>>> I have no idea whether you can easily detect a SVE vectorized loop though.
> >>>>> Maybe there's always a special IV or so (the mask?)
> >>>>>
> >>>>> Richard.
> >>>>>
> >>>>>> Thanks,
> >>>>>> Kyrill
> >>>>>>
> >>>>>>
> >>>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
> >>>>>>
> >>>>>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll
> >>>>>> loop when number of iterations is not known and flag_peel_loops is in
> >>>>>> effect.
> >>>>>>
> >>>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
> >>>>>>
> >>>>>> * gcc.target/aarch64/sve/unroll-1.c: New test.
> >>>>>>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64
2018-11-13 14:34 ` Richard Biener
@ 2018-11-14 15:01 ` Richard Biener
2018-11-16 18:02 ` [PATCH] Disable unrolling for loops vectorised with non-constant VF (was: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64) Kyrill Tkachov
0 siblings, 1 reply; 13+ messages in thread
From: Richard Biener @ 2018-11-14 15:01 UTC (permalink / raw)
To: kyrylo.tkachov
Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw,
James Greenhalgh, Richard Sandiford
On Tue, Nov 13, 2018 at 3:33 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Tue, Nov 13, 2018 at 10:48 AM Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
> >
> >
> > On 13/11/18 09:28, Richard Biener wrote:
> > > On Tue, Nov 13, 2018 at 10:15 AM Kyrill Tkachov
> > > <kyrylo.tkachov@foss.arm.com> wrote:
> > >> Hi Richard,
> > >>
> > >> On 13/11/18 08:24, Richard Biener wrote:
> > >>> On Mon, Nov 12, 2018 at 7:20 PM Kyrill Tkachov
> > >>> <kyrylo.tkachov@foss.arm.com> wrote:
> > >>>> On 12/11/18 14:10, Richard Biener wrote:
> > >>>>> On Fri, Nov 9, 2018 at 6:57 PM Kyrill Tkachov
> > >>>>> <kyrylo.tkachov@foss.arm.com> wrote:
> > >>>>>> On 09/11/18 12:18, Richard Biener wrote:
> > >>>>>>> On Fri, Nov 9, 2018 at 11:47 AM Kyrill Tkachov
> > >>>>>>> <kyrylo.tkachov@foss.arm.com> wrote:
> > >>>>>>>> Hi all,
> > >>>>>>>>
> > >>>>>>>> In this testcase the codegen for VLA SVE is worse than it could be due to unrolling:
> > >>>>>>>>
> > >>>>>>>> fully_peel_me:
> > >>>>>>>> mov x1, 5
> > >>>>>>>> ptrue p1.d, all
> > >>>>>>>> whilelo p0.d, xzr, x1
> > >>>>>>>> ld1d z0.d, p0/z, [x0]
> > >>>>>>>> fadd z0.d, z0.d, z0.d
> > >>>>>>>> st1d z0.d, p0, [x0]
> > >>>>>>>> cntd x2
> > >>>>>>>> addvl x3, x0, #1
> > >>>>>>>> whilelo p0.d, x2, x1
> > >>>>>>>> beq .L1
> > >>>>>>>> ld1d z0.d, p0/z, [x0, #1, mul vl]
> > >>>>>>>> fadd z0.d, z0.d, z0.d
> > >>>>>>>> st1d z0.d, p0, [x3]
> > >>>>>>>> cntw x2
> > >>>>>>>> incb x0, all, mul #2
> > >>>>>>>> whilelo p0.d, x2, x1
> > >>>>>>>> beq .L1
> > >>>>>>>> ld1d z0.d, p0/z, [x0]
> > >>>>>>>> fadd z0.d, z0.d, z0.d
> > >>>>>>>> st1d z0.d, p0, [x0]
> > >>>>>>>> .L1:
> > >>>>>>>> ret
> > >>>>>>>>
> > >>>>>>>> In this case, due to the vector-length-agnostic nature of SVE the compiler doesn't know the loop iteration count.
> > >>>>>>>> For such loops we don't want to unroll if we don't end up eliminating branches as this just bloats code size
> > >>>>>>>> and hurts icache performance.
> > >>>>>>>>
> > >>>>>>>> This patch introduces a new unroll-known-loop-iterations-only param that disables cunroll when the loop iteration
> > >>>>>>>> count is unknown (SCEV_NOT_KNOWN). This case occurs much more often for SVE VLA code, but it does help some
> > >>>>>>>> Advanced SIMD cases as well where loops with an unknown iteration count are not unrolled when it doesn't eliminate
> > >>>>>>>> the branches.
> > >>>>>>>>
> > >>>>>>>> So for the above testcase we generate now:
> > >>>>>>>> fully_peel_me:
> > >>>>>>>> mov x2, 5
> > >>>>>>>> mov x3, x2
> > >>>>>>>> mov x1, 0
> > >>>>>>>> whilelo p0.d, xzr, x2
> > >>>>>>>> ptrue p1.d, all
> > >>>>>>>> .L2:
> > >>>>>>>> ld1d z0.d, p0/z, [x0, x1, lsl 3]
> > >>>>>>>> fadd z0.d, z0.d, z0.d
> > >>>>>>>> st1d z0.d, p0, [x0, x1, lsl 3]
> > >>>>>>>> incd x1
> > >>>>>>>> whilelo p0.d, x1, x3
> > >>>>>>>> bne .L2
> > >>>>>>>> ret
> > >>>>>>>>
> > >>>>>>>> Not perfect still, but it's preferable to the original code.
> > >>>>>>>> The new param is enabled by default on aarch64 but disabled for other targets, leaving their behaviour unchanged
> > >>>>>>>> (until other target people experiment with it and set it, if appropriate).
> > >>>>>>>>
> > >>>>>>>> Bootstrapped and tested on aarch64-none-linux-gnu.
> > >>>>>>>> Benchmarked on SPEC2017 on a Cortex-A57 and there are no differences in performance.
> > >>>>>>>>
> > >>>>>>>> Ok for trunk?
> > >>>>>>> Hum. Why introduce a new --param and not simply key on
> > >>>>>>> flag_peel_loops instead? That is
> > >>>>>>> enabled by default at -O3 and with FDO but you of course can control
> > >>>>>>> that in your targets
> > >>>>>>> post-option-processing hook.
> > >>>>>> You mean like this?
> > >>>>>> It's certainly a simpler patch, but I was just a bit hesitant of making this change for all targets :)
> > >>>>>> But I suppose it's a reasonable change.
> > >>>>> No, that change is backward. What I said is that peeling is already
> > >>>>> conditional on
> > >>>>> flag_peel_loops and that is enabled by -O3. So you want to disable
> > >>>>> flag_peel_loops for
> > >>>>> SVE instead in the target.
> > >>>> Sorry, I got confused by the similarly named functions.
> > >>>> I'm talking about try_unroll_loop_completely when run as part of canonicalize_induction_variables i.e. the "ivcanon" pass
> > >>>> (sorry about blaming cunroll here). This doesn't get called through the try_unroll_loops_completely path.
> > >>> Well, peeling gets disabled. From your patch I see you want to
> > >>> disable "unrolling" when
> > >>> the number of loop iteration is not constant. That is called peeling
> > >>> where we need to
> > >>> emit the loop exit test N times.
> > >>>
> > >>> Did you check your testcases with -fno-peel-loops?
> > >> -fno-peel-loops doesn't help in the testcases. The code that does this peeling (try_unroll_loop_completely)
> > >> can be called through two paths, only one of which is gated on flag_peel_loops.
> > > I don't see the obvious here so I have to either sit down with a
> > > non-SVE specific testcase
> > > showing this, or I am misunderstanding the actual transform that you
> > > want to avoid.
> > > allow_peel is false when called from canonicalize_induction_variables.
> > > There's the slight
> > > chance that UL_NO_GROWTH lets through cases - is your case one of
> > > that? That is,
> > > does the following help?
> > >
> > > Index: gcc/tree-ssa-loop-ivcanon.c
> > > ===================================================================
> > > --- gcc/tree-ssa-loop-ivcanon.c (revision 266056)
> > > +++ gcc/tree-ssa-loop-ivcanon.c (working copy)
> > > @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop
> > > exit = NULL;
> > >
> > > /* See if we can improve our estimate by using recorded loop bounds. */
> > > - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH)
> > > + if ((allow_peel || maxiter == 0)
> > > && maxiter >= 0
> > > && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
> > > {
> > >
> > > IIRC I allowed that case when adding allow_peel simply because it avoided some
> > > testsuite regressions. This means you eventually want to work on the
> > > size estimate
> > > of SVE style loops?
> >
> > This doesn't help.
> >
> > Sorry if we're talking over each other here, I'm not very familiar with this area :(
> > For this loop:
> > for (int i = 0; i < 5; i++)
> > x[i] = x[i] * 2;
> >
> > For normal vectorisation (e.g. AArch64 NEON) we know the exact number of executions of the loop latch.
> > This gets fully unrolled as:
> > ldp q2, q1, [x0]
> > ldr d0, [x0, 32]
> > fadd v2.2d, v2.2d, v2.2d
> > fadd v1.2d, v1.2d, v1.2d
> > fadd d0, d0, d0
> > stp q2, q1, [x0]
> > str d0, [x0, 32]
> >
> > For vector length-agnostic SVE vectorisation we don't as we don't know the number of elements we process
> > with each loop iteration. So the NEON unrolling becomes SVE peeling I guess.
> > Note that the number of iterations in SVE is still "constant", just not known at compile-time.
> > In this case peeling doesn't eliminate any branches and only serves to bloat code size.
>
> So I sat down with a cross and indeed for the late unrolling pass we
> simply pass in allow_peel == true
> given that try_unroll_loop_completely also performs peeling (and not
> just try_peel_loops which guards
> itself with flag_peel_loops). That means instead of the above the
> below fixes this (with -fno-peel-loops):
>
> Index: gcc/tree-ssa-loop-ivcanon.c
> ===================================================================
> --- gcc/tree-ssa-loop-ivcanon.c (revision 266072)
> +++ gcc/tree-ssa-loop-ivcanon.c (working copy)
> @@ -724,7 +724,7 @@ try_unroll_loop_completely (struct loop
> exit = NULL;
>
> /* See if we can improve our estimate by using recorded loop bounds. */
> - if ((allow_peel || maxiter == 0 || ul == UL_NO_GROWTH)
> + if (((allow_peel && flag_peel_loops) || maxiter == 0 || ul == UL_NO_GROWTH)
> && maxiter >= 0
> && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
> {
>
> there might be quite some testsuite fallout since flag_peel_loops is
> only enabled at -O3+,
> but one has to double-check. As said, a per-loop target control
> whether loop peeling is
> desirable would be an improvement I guess (apart from generally
> disabling peeling for aarch64).
> I suppose you can benchmark that together with the above fix.
Oh - and I completely forgot about loop->unroll which the vectorizer
could set (for SVE loops) to 1 which disables any unrolling.
Richard.
> Richard.
>
> > Kyrill
> >
> > > Richard.
> > >
> > >> Thanks,
> > >> Kyrill
> > >>
> > >>
> > >>>> try_unroll_loop_completely doesn't get disabled with -fno-peel-loops or -fno-unroll-loops.
> > >>>> Maybe disabling peeling inside try_unroll_loop_completely itself when !flag_peel_loops is viable?
> > >>>>
> > >>>> Thanks,
> > >>>> Kyrill
> > >>>>
> > >>>>>>> It might also make sense to have more fine-grained control for this
> > >>>>>>> and allow a target
> > >>>>>>> to say whether it wants to peel a specific loop or not when the
> > >>>>>>> middle-end thinks that
> > >>>>>>> would be profitable.
> > >>>>>> Can be worth looking at as a follow-up. Do you envisage the target analysing
> > >>>>>> the gimple statements of the loop to figure out its cost?
> > >>>>> Kind-of. Sth like
> > >>>>>
> > >>>>> bool targetm.peel_loop (struct loop *);
> > >>>>>
> > >>>>> I have no idea whether you can easily detect a SVE vectorized loop though.
> > >>>>> Maybe there's always a special IV or so (the mask?)
> > >>>>>
> > >>>>> Richard.
> > >>>>>
> > >>>>>> Thanks,
> > >>>>>> Kyrill
> > >>>>>>
> > >>>>>>
> > >>>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
> > >>>>>>
> > >>>>>> * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not unroll
> > >>>>>> loop when number of iterations is not known and flag_peel_loops is in
> > >>>>>> effect.
> > >>>>>>
> > >>>>>> 2018-11-09 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
> > >>>>>>
> > >>>>>> * gcc.target/aarch64/sve/unroll-1.c: New test.
> > >>>>>>
> >
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH] Disable unrolling for loops vectorised with non-constant VF (was: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64)
2018-11-14 15:01 ` Richard Biener
@ 2018-11-16 18:02 ` Kyrill Tkachov
2018-11-19 14:43 ` Richard Biener
0 siblings, 1 reply; 13+ messages in thread
From: Kyrill Tkachov @ 2018-11-16 18:02 UTC (permalink / raw)
To: Richard Biener
Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw,
James Greenhalgh, Richard Sandiford
[-- Attachment #1: Type: text/plain, Size: 772 bytes --]
Hi all,
This is an alternative to https://gcc.gnu.org/ml/gcc-patches/2018-11/msg00694.html
As richi suggested, this disables unrolling of loops vectorised with variable-length SVE
in the vectoriser itself through the loop->unroll member.
It took me a few tries to get it right, as it needs to be set to '1' to disable unrolling,
the rationale for that mechanism is described in the comment in cfgloop.h.
Bootstrapped and tested on aarch64-none-linux-gnu.
Is this ok for trunk?
Thanks,
Kyrill
2018-11-15 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* tree-vect-loop.c (vect_transform_loop): Disable further unrolling
of the loop if vf is non-constant.
2018-11-15 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
* gcc.target/aarch64/sve/unroll-1.c: New test.
[-- Attachment #2: unroll-vect.patch --]
[-- Type: text/x-patch, Size: 1294 bytes --]
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c b/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..d4353009e2145ec59b3ac74a8fc0a4a16e441581
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/unroll-1.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+/* Check that simple loop is not fully unrolled. */
+
+void
+fully_peel_me (double *x)
+{
+ for (int i = 0; i < 5; i++)
+ x[i] = x[i] * 2;
+}
+
+/* { dg-final { scan-assembler-times {b..\t\.L.\n} 1 } } */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index f2d9d8ac2bc44398f955650591eea20dc7fca8a5..40d9584a00ba8d0b3fda58b3ee8df17f24432d5e 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -8515,6 +8515,15 @@ vect_transform_loop (loop_vec_info loop_vinfo)
}
}
+ /* Loops vectorized with a variable factor won't benefit from
+ unrolling/peeling. */
+ if (!vf.is_constant ())
+ {
+ loop->unroll = 1;
+ if (dump_enabled_p ())
+ dump_printf_loc (MSG_NOTE, vect_location, "Disabling unrolling due to"
+ " variable-length vectorization factor\n");
+ }
/* Free SLP instances here because otherwise stmt reference counting
won't work. */
slp_instance instance;
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] Disable unrolling for loops vectorised with non-constant VF (was: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64)
2018-11-16 18:02 ` [PATCH] Disable unrolling for loops vectorised with non-constant VF (was: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64) Kyrill Tkachov
@ 2018-11-19 14:43 ` Richard Biener
0 siblings, 0 replies; 13+ messages in thread
From: Richard Biener @ 2018-11-19 14:43 UTC (permalink / raw)
To: kyrylo.tkachov
Cc: GCC Patches, Marcus Shawcroft, Richard Earnshaw,
James Greenhalgh, Richard Sandiford
On Fri, Nov 16, 2018 at 7:02 PM Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
>
> Hi all,
>
> This is an alternative to https://gcc.gnu.org/ml/gcc-patches/2018-11/msg00694.html
> As richi suggested, this disables unrolling of loops vectorised with variable-length SVE
> in the vectoriser itself through the loop->unroll member.
>
> It took me a few tries to get it right, as it needs to be set to '1' to disable unrolling,
> the rationale for that mechanism is described in the comment in cfgloop.h.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Is this ok for trunk?
OK.
Richard.
> Thanks,
> Kyrill
>
> 2018-11-15 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>
> * tree-vect-loop.c (vect_transform_loop): Disable further unrolling
> of the loop if vf is non-constant.
>
> 2018-11-15 Kyrylo Tkachov <kyrylo.tkachov@arm.com>
>
> * gcc.target/aarch64/sve/unroll-1.c: New test.
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2018-11-19 14:43 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-09 10:47 [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64 Kyrill Tkachov
2018-11-09 12:19 ` Richard Biener
2018-11-09 17:57 ` Kyrill Tkachov
2018-11-12 14:10 ` Richard Biener
2018-11-12 18:20 ` Kyrill Tkachov
2018-11-13 8:25 ` Richard Biener
2018-11-13 9:15 ` Kyrill Tkachov
2018-11-13 9:28 ` Richard Biener
2018-11-13 9:49 ` Kyrill Tkachov
2018-11-13 14:34 ` Richard Biener
2018-11-14 15:01 ` Richard Biener
2018-11-16 18:02 ` [PATCH] Disable unrolling for loops vectorised with non-constant VF (was: [PATCH][cunroll] Add unroll-known-loop-iterations-only param and use it in aarch64) Kyrill Tkachov
2018-11-19 14:43 ` Richard Biener
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).