* [PATCH] RFC: Optimise SLP permutes of non-consecutive loads
@ 2022-06-23 12:00 Richard Sandiford
2022-06-24 7:25 ` Richard Biener
0 siblings, 1 reply; 4+ messages in thread
From: Richard Sandiford @ 2022-06-23 12:00 UTC (permalink / raw)
To: gcc-patches; +Cc: rguenther
In a reduction pair like:
typedef float T;
void
f1 (T *x)
{
T res1 = 0;
T res2 = 0;
for (int i = 0; i < 100; ++i)
{
res1 += x[i * 2];
res2 += x[i * 2 + 1];
}
x[0] = res1;
x[1] = res2;
}
it isn't easy to predict whether the initial reduction group will
be { res1, res2 } or { res2, res1 }. If the initial group ends up
being { res2, res1 }, vect_optimize_slp will try to swap it around
in order to remove an unncessary permute on the load. This gives:
f1:
.LFB0:
.cfi_startproc
movi v0.4s, 0
mov x1, x0
add x2, x0, 800
.p2align 3,,7
.L2:
ldr q1, [x1], 16
fadd v0.4s, v0.4s, v1.4s
cmp x2, x1
bne .L2
dup d1, v0.d[1]
fadd v0.2s, v0.2s, v1.2s
str d0, [x0]
ret
But there are no specific ordering constraints for non-consecutive
loads. For example, in:
void
f2 (T *x)
{
T res1 = 0;
T res2 = 0;
for (int i = 0; i < 100; ++i)
{
res1 += x[i * 4];
res2 += x[i * 4 + 2];
}
x[0] = res1;
x[1] = res2;
}
the reduction group happens to be { res2, res1 }, and so we get
a load permutation of { 2, 0, 6, 4 }. On aarch64 this gives:
f2:
.LFB1:
.cfi_startproc
adrp x2, .LC0
mov x1, x0
movi v2.4s, 0
ldr q3, [x2, #:lo12:.LC0]
add x2, x0, 1568
.p2align 3,,7
.L6:
ldp q0, q1, [x1]
add x1, x1, 32
tbl v0.16b, {v0.16b - v1.16b}, v3.16b
fadd v2.4s, v2.4s, v0.4s
cmp x1, x2
bne .L6
...untidy code, 18 insns...
ret
where we have to load the permute selector from memory and use
the general TBL instruction. If the order happened to be { res1, res2 }
instead we would generate the much tidier:
f2:
.LFB1:
.cfi_startproc
movi v1.4s, 0
mov x1, x0
add x2, x0, 1568
.p2align 3,,7
.L6:
ldp q0, q2, [x1]
add x1, x1, 32
uzp1 v0.4s, v0.4s, v2.4s
fadd v1.4s, v1.4s, v0.4s
cmp x1, x2
bne .L6
ldr d0, [x0, 1568]
ldr d5, [x0, 1576]
ldr s2, [x0, 1584]
dup d3, v1.d[1]
ldr s4, [x0, 1592]
zip1 v0.2s, v0.2s, v5.2s
ins v2.s[1], v4.s[0]
fadd v0.2s, v0.2s, v2.2s
fadd v0.2s, v0.2s, v1.2s
fadd v0.2s, v0.2s, v3.2s
str d0, [x0]
ret
This WIP patch makes vect_optimize_slp try to put load permutations
into index order. However, this new transform might be a loss if it
forces permutations elsewhere. For example, given:
void
f3 (T *restrict x, T *restrict y, T *restrict z)
{
for (int i = 0; i < 100; ++i)
{
x[i * 2] = y[i * 4 + 2] + z[i * 4 + 2];
x[i * 2 + 1] = y[i * 4] + z[i * 4];
}
}
we would generate:
.L9:
ldr q0, [x1, x3]
ldr q3, [x6, x3]
ldr q1, [x2, x3]
ldr q2, [x5, x3]
add x3, x3, 32
uzp1 v0.4s, v0.4s, v3.4s
uzp1 v1.4s, v1.4s, v2.4s
fadd v0.4s, v0.4s, v1.4s
rev64 v0.4s, v0.4s
str q0, [x4], 16
cmp x3, 1568
bne .L9
(3 permutes) rather than:
.L9:
ldr q0, [x1, x3]
ldr q1, [x6, x3]
ldr q2, [x2, x3]
ldr q3, [x5, x3]
tbl v0.16b, {v0.16b - v1.16b}, v4.16b
add x3, x3, 32
tbl v2.16b, {v2.16b - v3.16b}, v4.16b
fadd v0.4s, v0.4s, v2.4s
str q0, [x4], 16
cmp x3, 1568
bne .L9
(2 permutes).
One simple(ish) fix would be to conditionalise the new case on
whether all "roots" of the load are reduction groups. Alternatively,
we could treat the new case as a soft preference and back out if it
would require any new VEC_PERM_EXPR nodes to be created. This would
require a propagation back from uses to definitions.
WDYT? Does this look like a feasible direction? If so, any thoughts
on when the new case should be enabled?
The patch keeps the bijection requirement, since relaxing that seems
like separate work.
Tested on aarch64-linux-gnu, no regressions.
Thanks,
Richard
---
gcc/tree-vect-slp.cc | 41 ++++++++++++++---------------------------
1 file changed, 14 insertions(+), 27 deletions(-)
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index dab5daddcc5..fde35d8c954 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -20,6 +20,7 @@ along with GCC; see the file COPYING3. If not see
<http://www.gnu.org/licenses/>. */
#include "config.h"
+#define INCLUDE_ALGORITHM
#include "system.h"
#include "coretypes.h"
#include "backend.h"
@@ -3698,43 +3699,29 @@ vect_optimize_slp (vec_info *vinfo)
if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
continue;
dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
- unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
- bool any_permute = false;
- for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
- {
- unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
- imin = MIN (imin, idx);
- imax = MAX (imax, idx);
- if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
- any_permute = true;
- }
- /* If there's no permute no need to split one out. */
- if (!any_permute)
- continue;
- /* If the span doesn't match we'd disrupt VF computation, avoid
- that for now. */
- if (imax - imin + 1 != SLP_TREE_LANES (node))
+
+ auto_vec<unsigned> sorted;
+ sorted.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
+ std::sort (sorted.begin (), sorted.end ());
+ if (std::equal (sorted.begin (), sorted.end (),
+ SLP_TREE_LOAD_PERMUTATION (node).begin ()))
continue;
/* For now only handle true permutes, like
vect_attempt_slp_rearrange_stmts did. This allows us to be lazy
when permuting constants and invariants keeping the permute
bijective. */
- auto_sbitmap load_index (SLP_TREE_LANES (node));
- bitmap_clear (load_index);
- for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
- bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
- unsigned j;
- for (j = 0; j < SLP_TREE_LANES (node); ++j)
- if (!bitmap_bit_p (load_index, j))
- break;
- if (j != SLP_TREE_LANES (node))
+ if (std::adjacent_find (sorted.begin (), sorted.end ()) != sorted.end ())
continue;
vec<unsigned> perm = vNULL;
perm.safe_grow (SLP_TREE_LANES (node), true);
for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
- perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
+ {
+ auto it = std::lower_bound (sorted.begin (), sorted.end (),
+ SLP_TREE_LOAD_PERMUTATION (node)[j]);
+ perm[j] = it - sorted.begin ();
+ }
perms.safe_push (perm);
vertices[idx].perm_in = perms.length () - 1;
vertices[idx].perm_out = perms.length () - 1;
@@ -6647,7 +6634,7 @@ vect_transform_slp_perm_load (vec_info *vinfo,
{
stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
int vec_index = 0;
- tree vectype = STMT_VINFO_VECTYPE (stmt_info);
+ tree vectype = SLP_TREE_VECTYPE (node);
unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
unsigned int mask_element;
machine_mode mode;
--
2.25.1
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] RFC: Optimise SLP permutes of non-consecutive loads
2022-06-23 12:00 [PATCH] RFC: Optimise SLP permutes of non-consecutive loads Richard Sandiford
@ 2022-06-24 7:25 ` Richard Biener
2022-06-24 9:07 ` Richard Sandiford
0 siblings, 1 reply; 4+ messages in thread
From: Richard Biener @ 2022-06-24 7:25 UTC (permalink / raw)
To: Richard Sandiford; +Cc: gcc-patches
On Thu, 23 Jun 2022, Richard Sandiford wrote:
> In a reduction pair like:
>
> typedef float T;
>
> void
> f1 (T *x)
> {
> T res1 = 0;
> T res2 = 0;
> for (int i = 0; i < 100; ++i)
> {
> res1 += x[i * 2];
> res2 += x[i * 2 + 1];
> }
> x[0] = res1;
> x[1] = res2;
> }
>
> it isn't easy to predict whether the initial reduction group will
> be { res1, res2 } or { res2, res1 }. If the initial group ends up
> being { res2, res1 }, vect_optimize_slp will try to swap it around
> in order to remove an unncessary permute on the load. This gives:
>
> f1:
> .LFB0:
> .cfi_startproc
> movi v0.4s, 0
> mov x1, x0
> add x2, x0, 800
> .p2align 3,,7
> .L2:
> ldr q1, [x1], 16
> fadd v0.4s, v0.4s, v1.4s
> cmp x2, x1
> bne .L2
> dup d1, v0.d[1]
> fadd v0.2s, v0.2s, v1.2s
> str d0, [x0]
> ret
>
> But there are no specific ordering constraints for non-consecutive
> loads. For example, in:
>
> void
> f2 (T *x)
> {
> T res1 = 0;
> T res2 = 0;
> for (int i = 0; i < 100; ++i)
> {
> res1 += x[i * 4];
> res2 += x[i * 4 + 2];
> }
> x[0] = res1;
> x[1] = res2;
> }
>
> the reduction group happens to be { res2, res1 }, and so we get
> a load permutation of { 2, 0, 6, 4 }. On aarch64 this gives:
>
> f2:
> .LFB1:
> .cfi_startproc
> adrp x2, .LC0
> mov x1, x0
> movi v2.4s, 0
> ldr q3, [x2, #:lo12:.LC0]
> add x2, x0, 1568
> .p2align 3,,7
> .L6:
> ldp q0, q1, [x1]
> add x1, x1, 32
> tbl v0.16b, {v0.16b - v1.16b}, v3.16b
> fadd v2.4s, v2.4s, v0.4s
> cmp x1, x2
> bne .L6
> ...untidy code, 18 insns...
> ret
>
> where we have to load the permute selector from memory and use
> the general TBL instruction. If the order happened to be { res1, res2 }
> instead we would generate the much tidier:
>
> f2:
> .LFB1:
> .cfi_startproc
> movi v1.4s, 0
> mov x1, x0
> add x2, x0, 1568
> .p2align 3,,7
> .L6:
> ldp q0, q2, [x1]
> add x1, x1, 32
> uzp1 v0.4s, v0.4s, v2.4s
> fadd v1.4s, v1.4s, v0.4s
> cmp x1, x2
> bne .L6
> ldr d0, [x0, 1568]
> ldr d5, [x0, 1576]
> ldr s2, [x0, 1584]
> dup d3, v1.d[1]
> ldr s4, [x0, 1592]
> zip1 v0.2s, v0.2s, v5.2s
> ins v2.s[1], v4.s[0]
> fadd v0.2s, v0.2s, v2.2s
> fadd v0.2s, v0.2s, v1.2s
> fadd v0.2s, v0.2s, v3.2s
> str d0, [x0]
> ret
>
> This WIP patch makes vect_optimize_slp try to put load permutations
> into index order. However, this new transform might be a loss if it
> forces permutations elsewhere. For example, given:
>
> void
> f3 (T *restrict x, T *restrict y, T *restrict z)
> {
> for (int i = 0; i < 100; ++i)
> {
> x[i * 2] = y[i * 4 + 2] + z[i * 4 + 2];
> x[i * 2 + 1] = y[i * 4] + z[i * 4];
> }
> }
>
> we would generate:
>
> .L9:
> ldr q0, [x1, x3]
> ldr q3, [x6, x3]
> ldr q1, [x2, x3]
> ldr q2, [x5, x3]
> add x3, x3, 32
> uzp1 v0.4s, v0.4s, v3.4s
> uzp1 v1.4s, v1.4s, v2.4s
> fadd v0.4s, v0.4s, v1.4s
> rev64 v0.4s, v0.4s
> str q0, [x4], 16
> cmp x3, 1568
> bne .L9
>
> (3 permutes) rather than:
>
> .L9:
> ldr q0, [x1, x3]
> ldr q1, [x6, x3]
> ldr q2, [x2, x3]
> ldr q3, [x5, x3]
> tbl v0.16b, {v0.16b - v1.16b}, v4.16b
> add x3, x3, 32
> tbl v2.16b, {v2.16b - v3.16b}, v4.16b
> fadd v0.4s, v0.4s, v2.4s
> str q0, [x4], 16
> cmp x3, 1568
> bne .L9
>
> (2 permutes).
>
> One simple(ish) fix would be to conditionalise the new case on
> whether all "roots" of the load are reduction groups. Alternatively,
> we could treat the new case as a soft preference and back out if it
> would require any new VEC_PERM_EXPR nodes to be created. This would
> require a propagation back from uses to definitions.
>
> WDYT? Does this look like a feasible direction? If so, any thoughts
> on when the new case should be enabled?
>
> The patch keeps the bijection requirement, since relaxing that seems
> like separate work.
>
> Tested on aarch64-linux-gnu, no regressions.
>
> Thanks,
> Richard
>
> ---
> gcc/tree-vect-slp.cc | 41 ++++++++++++++---------------------------
> 1 file changed, 14 insertions(+), 27 deletions(-)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index dab5daddcc5..fde35d8c954 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -20,6 +20,7 @@ along with GCC; see the file COPYING3. If not see
> <http://www.gnu.org/licenses/>. */
>
> #include "config.h"
> +#define INCLUDE_ALGORITHM
> #include "system.h"
> #include "coretypes.h"
> #include "backend.h"
> @@ -3698,43 +3699,29 @@ vect_optimize_slp (vec_info *vinfo)
> if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> continue;
> dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> - unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
> - bool any_permute = false;
> - for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> - {
> - unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
> - imin = MIN (imin, idx);
> - imax = MAX (imax, idx);
> - if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
> - any_permute = true;
> - }
> - /* If there's no permute no need to split one out. */
> - if (!any_permute)
> - continue;
> - /* If the span doesn't match we'd disrupt VF computation, avoid
> - that for now. */
> - if (imax - imin + 1 != SLP_TREE_LANES (node))
> +
> + auto_vec<unsigned> sorted;
> + sorted.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
> + std::sort (sorted.begin (), sorted.end ());
> + if (std::equal (sorted.begin (), sorted.end (),
> + SLP_TREE_LOAD_PERMUTATION (node).begin ()))
> continue;
>
> /* For now only handle true permutes, like
> vect_attempt_slp_rearrange_stmts did. This allows us to be lazy
> when permuting constants and invariants keeping the permute
> bijective. */
> - auto_sbitmap load_index (SLP_TREE_LANES (node));
> - bitmap_clear (load_index);
> - for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> - bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
> - unsigned j;
> - for (j = 0; j < SLP_TREE_LANES (node); ++j)
> - if (!bitmap_bit_p (load_index, j))
> - break;
> - if (j != SLP_TREE_LANES (node))
> + if (std::adjacent_find (sorted.begin (), sorted.end ()) != sorted.end ())
> continue;
So without the std:: rewrite this is the only change, where
previously we rejected { 0, 2, ... } because '1' was missing
but now we only reject duplicates? As the comment says this
is what vect_attempt_slp_rearrange_stmts did so we would need
to adjust the comment as well.
(I'm unsure about the std:: stuff but mostly because I'm not
familiar with it and its semantics)
The current "propagation" isn't really designed to find
optimal solutions, it rather relies on the initial
permute "anticipation" we set up and simply propagates
those up as far as possible. I think at some point we need
to come up with something more elaborate and rewrite this all.
It might also be that a target handles { 2, 0 } but not
{ 0, 2 } while we know it handles the unpermuted case always.
Can you think of a way to express this as dataflow problem
capturing finding the optimal solution as well?
> vec<unsigned> perm = vNULL;
> perm.safe_grow (SLP_TREE_LANES (node), true);
> for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> - perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> + {
> + auto it = std::lower_bound (sorted.begin (), sorted.end (),
> + SLP_TREE_LOAD_PERMUTATION (node)[j]);
> + perm[j] = it - sorted.begin ();
> + }
> perms.safe_push (perm);
> vertices[idx].perm_in = perms.length () - 1;
> vertices[idx].perm_out = perms.length () - 1;
> @@ -6647,7 +6634,7 @@ vect_transform_slp_perm_load (vec_info *vinfo,
> {
> stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
> int vec_index = 0;
> - tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> + tree vectype = SLP_TREE_VECTYPE (node);
that looks like an obvious fix.
Richard.
> unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
> unsigned int mask_element;
> machine_mode mode;
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstra
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] RFC: Optimise SLP permutes of non-consecutive loads
2022-06-24 7:25 ` Richard Biener
@ 2022-06-24 9:07 ` Richard Sandiford
2022-06-24 12:18 ` Richard Biener
0 siblings, 1 reply; 4+ messages in thread
From: Richard Sandiford @ 2022-06-24 9:07 UTC (permalink / raw)
To: Richard Biener; +Cc: gcc-patches
Richard Biener <rguenther@suse.de> writes:
> On Thu, 23 Jun 2022, Richard Sandiford wrote:
>> In a reduction pair like:
>>
>> typedef float T;
>>
>> void
>> f1 (T *x)
>> {
>> T res1 = 0;
>> T res2 = 0;
>> for (int i = 0; i < 100; ++i)
>> {
>> res1 += x[i * 2];
>> res2 += x[i * 2 + 1];
>> }
>> x[0] = res1;
>> x[1] = res2;
>> }
>>
>> it isn't easy to predict whether the initial reduction group will
>> be { res1, res2 } or { res2, res1 }. If the initial group ends up
>> being { res2, res1 }, vect_optimize_slp will try to swap it around
>> in order to remove an unncessary permute on the load. This gives:
>>
>> f1:
>> .LFB0:
>> .cfi_startproc
>> movi v0.4s, 0
>> mov x1, x0
>> add x2, x0, 800
>> .p2align 3,,7
>> .L2:
>> ldr q1, [x1], 16
>> fadd v0.4s, v0.4s, v1.4s
>> cmp x2, x1
>> bne .L2
>> dup d1, v0.d[1]
>> fadd v0.2s, v0.2s, v1.2s
>> str d0, [x0]
>> ret
>>
>> But there are no specific ordering constraints for non-consecutive
>> loads. For example, in:
>>
>> void
>> f2 (T *x)
>> {
>> T res1 = 0;
>> T res2 = 0;
>> for (int i = 0; i < 100; ++i)
>> {
>> res1 += x[i * 4];
>> res2 += x[i * 4 + 2];
>> }
>> x[0] = res1;
>> x[1] = res2;
>> }
>>
>> the reduction group happens to be { res2, res1 }, and so we get
>> a load permutation of { 2, 0, 6, 4 }. On aarch64 this gives:
>>
>> f2:
>> .LFB1:
>> .cfi_startproc
>> adrp x2, .LC0
>> mov x1, x0
>> movi v2.4s, 0
>> ldr q3, [x2, #:lo12:.LC0]
>> add x2, x0, 1568
>> .p2align 3,,7
>> .L6:
>> ldp q0, q1, [x1]
>> add x1, x1, 32
>> tbl v0.16b, {v0.16b - v1.16b}, v3.16b
>> fadd v2.4s, v2.4s, v0.4s
>> cmp x1, x2
>> bne .L6
>> ...untidy code, 18 insns...
>> ret
>>
>> where we have to load the permute selector from memory and use
>> the general TBL instruction. If the order happened to be { res1, res2 }
>> instead we would generate the much tidier:
>>
>> f2:
>> .LFB1:
>> .cfi_startproc
>> movi v1.4s, 0
>> mov x1, x0
>> add x2, x0, 1568
>> .p2align 3,,7
>> .L6:
>> ldp q0, q2, [x1]
>> add x1, x1, 32
>> uzp1 v0.4s, v0.4s, v2.4s
>> fadd v1.4s, v1.4s, v0.4s
>> cmp x1, x2
>> bne .L6
>> ldr d0, [x0, 1568]
>> ldr d5, [x0, 1576]
>> ldr s2, [x0, 1584]
>> dup d3, v1.d[1]
>> ldr s4, [x0, 1592]
>> zip1 v0.2s, v0.2s, v5.2s
>> ins v2.s[1], v4.s[0]
>> fadd v0.2s, v0.2s, v2.2s
>> fadd v0.2s, v0.2s, v1.2s
>> fadd v0.2s, v0.2s, v3.2s
>> str d0, [x0]
>> ret
>>
>> This WIP patch makes vect_optimize_slp try to put load permutations
>> into index order. However, this new transform might be a loss if it
>> forces permutations elsewhere. For example, given:
>>
>> void
>> f3 (T *restrict x, T *restrict y, T *restrict z)
>> {
>> for (int i = 0; i < 100; ++i)
>> {
>> x[i * 2] = y[i * 4 + 2] + z[i * 4 + 2];
>> x[i * 2 + 1] = y[i * 4] + z[i * 4];
>> }
>> }
>>
>> we would generate:
>>
>> .L9:
>> ldr q0, [x1, x3]
>> ldr q3, [x6, x3]
>> ldr q1, [x2, x3]
>> ldr q2, [x5, x3]
>> add x3, x3, 32
>> uzp1 v0.4s, v0.4s, v3.4s
>> uzp1 v1.4s, v1.4s, v2.4s
>> fadd v0.4s, v0.4s, v1.4s
>> rev64 v0.4s, v0.4s
>> str q0, [x4], 16
>> cmp x3, 1568
>> bne .L9
>>
>> (3 permutes) rather than:
>>
>> .L9:
>> ldr q0, [x1, x3]
>> ldr q1, [x6, x3]
>> ldr q2, [x2, x3]
>> ldr q3, [x5, x3]
>> tbl v0.16b, {v0.16b - v1.16b}, v4.16b
>> add x3, x3, 32
>> tbl v2.16b, {v2.16b - v3.16b}, v4.16b
>> fadd v0.4s, v0.4s, v2.4s
>> str q0, [x4], 16
>> cmp x3, 1568
>> bne .L9
>>
>> (2 permutes).
>>
>> One simple(ish) fix would be to conditionalise the new case on
>> whether all "roots" of the load are reduction groups. Alternatively,
>> we could treat the new case as a soft preference and back out if it
>> would require any new VEC_PERM_EXPR nodes to be created. This would
>> require a propagation back from uses to definitions.
>>
>> WDYT? Does this look like a feasible direction? If so, any thoughts
>> on when the new case should be enabled?
>>
>> The patch keeps the bijection requirement, since relaxing that seems
>> like separate work.
>>
>> Tested on aarch64-linux-gnu, no regressions.
>>
>> Thanks,
>> Richard
>>
>> ---
>> gcc/tree-vect-slp.cc | 41 ++++++++++++++---------------------------
>> 1 file changed, 14 insertions(+), 27 deletions(-)
>>
>> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>> index dab5daddcc5..fde35d8c954 100644
>> --- a/gcc/tree-vect-slp.cc
>> +++ b/gcc/tree-vect-slp.cc
>> @@ -20,6 +20,7 @@ along with GCC; see the file COPYING3. If not see
>> <http://www.gnu.org/licenses/>. */
>>
>> #include "config.h"
>> +#define INCLUDE_ALGORITHM
>> #include "system.h"
>> #include "coretypes.h"
>> #include "backend.h"
>> @@ -3698,43 +3699,29 @@ vect_optimize_slp (vec_info *vinfo)
>> if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
>> continue;
>> dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
>> - unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
>> - bool any_permute = false;
>> - for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
>> - {
>> - unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
>> - imin = MIN (imin, idx);
>> - imax = MAX (imax, idx);
>> - if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
>> - any_permute = true;
>> - }
>> - /* If there's no permute no need to split one out. */
>> - if (!any_permute)
>> - continue;
>> - /* If the span doesn't match we'd disrupt VF computation, avoid
>> - that for now. */
>> - if (imax - imin + 1 != SLP_TREE_LANES (node))
>> +
>> + auto_vec<unsigned> sorted;
>> + sorted.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
>> + std::sort (sorted.begin (), sorted.end ());
>> + if (std::equal (sorted.begin (), sorted.end (),
>> + SLP_TREE_LOAD_PERMUTATION (node).begin ()))
>> continue;
>>
>> /* For now only handle true permutes, like
>> vect_attempt_slp_rearrange_stmts did. This allows us to be lazy
>> when permuting constants and invariants keeping the permute
>> bijective. */
>> - auto_sbitmap load_index (SLP_TREE_LANES (node));
>> - bitmap_clear (load_index);
>> - for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
>> - bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
>> - unsigned j;
>> - for (j = 0; j < SLP_TREE_LANES (node); ++j)
>> - if (!bitmap_bit_p (load_index, j))
>> - break;
>> - if (j != SLP_TREE_LANES (node))
>> + if (std::adjacent_find (sorted.begin (), sorted.end ()) != sorted.end ())
>> continue;
>
> So without the std:: rewrite this is the only change, where
> previously we rejected { 0, 2, ... } because '1' was missing
> but now we only reject duplicates?
All three changes work together really. I thought the std:: stuff
made things simpler, but the things that use std:: would still need
to be rewritten if the patch didn't use std::.
For a load permute (Lo) like { 2, 0, 6, 4 } we want the lane permute (La)
to be { 1, 0, 3, 2 }, so that the new load permute (Lo') is { 0, 2, 4, 6 }.
So the “SLP_TREE_LOAD_PERMUTATION (node)[j] - imin” below is no longer
enough when computing La, since subtracting the minimum doesn't have
the necessary compressive effect. We instead need to set La[i] to the
index of Lo[i] in Lo'. And the easiest way of doing that seemed to be
to compute Lo' (by sorting) and then doing a binary search for each
element. Having the sorted array also makes it easy to see whether
the permute is a no-op and whether the same source element appears
twice.
> As the comment says this
> is what vect_attempt_slp_rearrange_stmts did so we would need
> to adjust the comment as well.
I thought the comment meant the that permute entered into the perm
array is a bijection on [0, SLP_TREE_LANES (node) - 1]. That's still
true with the patch, and so for example we still don't need to worry
about:
/* ??? But for constants once we want to handle
non-bijective permutes we have to verify the permute,
when unifying lanes, will not unify different constants.
For example see gcc.dg/vect/bb-slp-14.c for a case
that would break. */
(I hope).
> (I'm unsure about the std:: stuff but mostly because I'm not
> familiar with it and its semantics)
>
> The current "propagation" isn't really designed to find
> optimal solutions, it rather relies on the initial
> permute "anticipation" we set up and simply propagates
> those up as far as possible. I think at some point we need
> to come up with something more elaborate and rewrite this all.
>
> It might also be that a target handles { 2, 0 } but not
> { 0, 2 } while we know it handles the unpermuted case always.
OK. I guess we should reject new permutes if vect_transform_slp_perm_load
is false for them and true for the original.
> Can you think of a way to express this as dataflow problem
> capturing finding the optimal solution as well?
I think it's inevitably going to be a two-way thing: propagate what's
best for producers as far as possible, then propagate back what's
best for consumers as far as possible, or the other way around.
Thanks,
Richard
>
>> vec<unsigned> perm = vNULL;
>> perm.safe_grow (SLP_TREE_LANES (node), true);
>> for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
>> - perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
>> + {
>> + auto it = std::lower_bound (sorted.begin (), sorted.end (),
>> + SLP_TREE_LOAD_PERMUTATION (node)[j]);
>> + perm[j] = it - sorted.begin ();
>> + }
>> perms.safe_push (perm);
>> vertices[idx].perm_in = perms.length () - 1;
>> vertices[idx].perm_out = perms.length () - 1;
>> @@ -6647,7 +6634,7 @@ vect_transform_slp_perm_load (vec_info *vinfo,
>> {
>> stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
>> int vec_index = 0;
>> - tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>> + tree vectype = SLP_TREE_VECTYPE (node);
>
> that looks like an obvious fix.
>
> Richard.
>
>> unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
>> unsigned int mask_element;
>> machine_mode mode;
>>
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] RFC: Optimise SLP permutes of non-consecutive loads
2022-06-24 9:07 ` Richard Sandiford
@ 2022-06-24 12:18 ` Richard Biener
0 siblings, 0 replies; 4+ messages in thread
From: Richard Biener @ 2022-06-24 12:18 UTC (permalink / raw)
To: Richard Sandiford; +Cc: gcc-patches
On Fri, 24 Jun 2022, Richard Sandiford wrote:
> Richard Biener <rguenther@suse.de> writes:
> > On Thu, 23 Jun 2022, Richard Sandiford wrote:
> >> In a reduction pair like:
> >>
> >> typedef float T;
> >>
> >> void
> >> f1 (T *x)
> >> {
> >> T res1 = 0;
> >> T res2 = 0;
> >> for (int i = 0; i < 100; ++i)
> >> {
> >> res1 += x[i * 2];
> >> res2 += x[i * 2 + 1];
> >> }
> >> x[0] = res1;
> >> x[1] = res2;
> >> }
> >>
> >> it isn't easy to predict whether the initial reduction group will
> >> be { res1, res2 } or { res2, res1 }. If the initial group ends up
> >> being { res2, res1 }, vect_optimize_slp will try to swap it around
> >> in order to remove an unncessary permute on the load. This gives:
> >>
> >> f1:
> >> .LFB0:
> >> .cfi_startproc
> >> movi v0.4s, 0
> >> mov x1, x0
> >> add x2, x0, 800
> >> .p2align 3,,7
> >> .L2:
> >> ldr q1, [x1], 16
> >> fadd v0.4s, v0.4s, v1.4s
> >> cmp x2, x1
> >> bne .L2
> >> dup d1, v0.d[1]
> >> fadd v0.2s, v0.2s, v1.2s
> >> str d0, [x0]
> >> ret
> >>
> >> But there are no specific ordering constraints for non-consecutive
> >> loads. For example, in:
> >>
> >> void
> >> f2 (T *x)
> >> {
> >> T res1 = 0;
> >> T res2 = 0;
> >> for (int i = 0; i < 100; ++i)
> >> {
> >> res1 += x[i * 4];
> >> res2 += x[i * 4 + 2];
> >> }
> >> x[0] = res1;
> >> x[1] = res2;
> >> }
> >>
> >> the reduction group happens to be { res2, res1 }, and so we get
> >> a load permutation of { 2, 0, 6, 4 }. On aarch64 this gives:
> >>
> >> f2:
> >> .LFB1:
> >> .cfi_startproc
> >> adrp x2, .LC0
> >> mov x1, x0
> >> movi v2.4s, 0
> >> ldr q3, [x2, #:lo12:.LC0]
> >> add x2, x0, 1568
> >> .p2align 3,,7
> >> .L6:
> >> ldp q0, q1, [x1]
> >> add x1, x1, 32
> >> tbl v0.16b, {v0.16b - v1.16b}, v3.16b
> >> fadd v2.4s, v2.4s, v0.4s
> >> cmp x1, x2
> >> bne .L6
> >> ...untidy code, 18 insns...
> >> ret
> >>
> >> where we have to load the permute selector from memory and use
> >> the general TBL instruction. If the order happened to be { res1, res2 }
> >> instead we would generate the much tidier:
> >>
> >> f2:
> >> .LFB1:
> >> .cfi_startproc
> >> movi v1.4s, 0
> >> mov x1, x0
> >> add x2, x0, 1568
> >> .p2align 3,,7
> >> .L6:
> >> ldp q0, q2, [x1]
> >> add x1, x1, 32
> >> uzp1 v0.4s, v0.4s, v2.4s
> >> fadd v1.4s, v1.4s, v0.4s
> >> cmp x1, x2
> >> bne .L6
> >> ldr d0, [x0, 1568]
> >> ldr d5, [x0, 1576]
> >> ldr s2, [x0, 1584]
> >> dup d3, v1.d[1]
> >> ldr s4, [x0, 1592]
> >> zip1 v0.2s, v0.2s, v5.2s
> >> ins v2.s[1], v4.s[0]
> >> fadd v0.2s, v0.2s, v2.2s
> >> fadd v0.2s, v0.2s, v1.2s
> >> fadd v0.2s, v0.2s, v3.2s
> >> str d0, [x0]
> >> ret
> >>
> >> This WIP patch makes vect_optimize_slp try to put load permutations
> >> into index order. However, this new transform might be a loss if it
> >> forces permutations elsewhere. For example, given:
> >>
> >> void
> >> f3 (T *restrict x, T *restrict y, T *restrict z)
> >> {
> >> for (int i = 0; i < 100; ++i)
> >> {
> >> x[i * 2] = y[i * 4 + 2] + z[i * 4 + 2];
> >> x[i * 2 + 1] = y[i * 4] + z[i * 4];
> >> }
> >> }
> >>
> >> we would generate:
> >>
> >> .L9:
> >> ldr q0, [x1, x3]
> >> ldr q3, [x6, x3]
> >> ldr q1, [x2, x3]
> >> ldr q2, [x5, x3]
> >> add x3, x3, 32
> >> uzp1 v0.4s, v0.4s, v3.4s
> >> uzp1 v1.4s, v1.4s, v2.4s
> >> fadd v0.4s, v0.4s, v1.4s
> >> rev64 v0.4s, v0.4s
> >> str q0, [x4], 16
> >> cmp x3, 1568
> >> bne .L9
> >>
> >> (3 permutes) rather than:
> >>
> >> .L9:
> >> ldr q0, [x1, x3]
> >> ldr q1, [x6, x3]
> >> ldr q2, [x2, x3]
> >> ldr q3, [x5, x3]
> >> tbl v0.16b, {v0.16b - v1.16b}, v4.16b
> >> add x3, x3, 32
> >> tbl v2.16b, {v2.16b - v3.16b}, v4.16b
> >> fadd v0.4s, v0.4s, v2.4s
> >> str q0, [x4], 16
> >> cmp x3, 1568
> >> bne .L9
> >>
> >> (2 permutes).
> >>
> >> One simple(ish) fix would be to conditionalise the new case on
> >> whether all "roots" of the load are reduction groups. Alternatively,
> >> we could treat the new case as a soft preference and back out if it
> >> would require any new VEC_PERM_EXPR nodes to be created. This would
> >> require a propagation back from uses to definitions.
> >>
> >> WDYT? Does this look like a feasible direction? If so, any thoughts
> >> on when the new case should be enabled?
> >>
> >> The patch keeps the bijection requirement, since relaxing that seems
> >> like separate work.
> >>
> >> Tested on aarch64-linux-gnu, no regressions.
> >>
> >> Thanks,
> >> Richard
> >>
> >> ---
> >> gcc/tree-vect-slp.cc | 41 ++++++++++++++---------------------------
> >> 1 file changed, 14 insertions(+), 27 deletions(-)
> >>
> >> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> >> index dab5daddcc5..fde35d8c954 100644
> >> --- a/gcc/tree-vect-slp.cc
> >> +++ b/gcc/tree-vect-slp.cc
> >> @@ -20,6 +20,7 @@ along with GCC; see the file COPYING3. If not see
> >> <http://www.gnu.org/licenses/>. */
> >>
> >> #include "config.h"
> >> +#define INCLUDE_ALGORITHM
> >> #include "system.h"
> >> #include "coretypes.h"
> >> #include "backend.h"
> >> @@ -3698,43 +3699,29 @@ vect_optimize_slp (vec_info *vinfo)
> >> if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> >> continue;
> >> dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> >> - unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
> >> - bool any_permute = false;
> >> - for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> >> - {
> >> - unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
> >> - imin = MIN (imin, idx);
> >> - imax = MAX (imax, idx);
> >> - if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
> >> - any_permute = true;
> >> - }
> >> - /* If there's no permute no need to split one out. */
> >> - if (!any_permute)
> >> - continue;
> >> - /* If the span doesn't match we'd disrupt VF computation, avoid
> >> - that for now. */
> >> - if (imax - imin + 1 != SLP_TREE_LANES (node))
> >> +
> >> + auto_vec<unsigned> sorted;
> >> + sorted.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
> >> + std::sort (sorted.begin (), sorted.end ());
> >> + if (std::equal (sorted.begin (), sorted.end (),
> >> + SLP_TREE_LOAD_PERMUTATION (node).begin ()))
> >> continue;
> >>
> >> /* For now only handle true permutes, like
> >> vect_attempt_slp_rearrange_stmts did. This allows us to be lazy
> >> when permuting constants and invariants keeping the permute
> >> bijective. */
> >> - auto_sbitmap load_index (SLP_TREE_LANES (node));
> >> - bitmap_clear (load_index);
> >> - for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> >> - bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
> >> - unsigned j;
> >> - for (j = 0; j < SLP_TREE_LANES (node); ++j)
> >> - if (!bitmap_bit_p (load_index, j))
> >> - break;
> >> - if (j != SLP_TREE_LANES (node))
> >> + if (std::adjacent_find (sorted.begin (), sorted.end ()) != sorted.end ())
> >> continue;
> >
> > So without the std:: rewrite this is the only change, where
> > previously we rejected { 0, 2, ... } because '1' was missing
> > but now we only reject duplicates?
>
> All three changes work together really. I thought the std:: stuff
> made things simpler, but the things that use std:: would still need
> to be rewritten if the patch didn't use std::.
>
> For a load permute (Lo) like { 2, 0, 6, 4 } we want the lane permute (La)
> to be { 1, 0, 3, 2 }, so that the new load permute (Lo') is { 0, 2, 4, 6 }.
>
> So the ?SLP_TREE_LOAD_PERMUTATION (node)[j] - imin? below is no longer
> enough when computing La, since subtracting the minimum doesn't have
> the necessary compressive effect. We instead need to set La[i] to the
> index of Lo[i] in Lo'. And the easiest way of doing that seemed to be
> to compute Lo' (by sorting) and then doing a binary search for each
> element. Having the sorted array also makes it easy to see whether
> the permute is a no-op and whether the same source element appears
> twice.
>
> > As the comment says this
> > is what vect_attempt_slp_rearrange_stmts did so we would need
> > to adjust the comment as well.
>
> I thought the comment meant the that permute entered into the perm
> array is a bijection on [0, SLP_TREE_LANES (node) - 1]. That's still
> true with the patch, and so for example we still don't need to worry
> about:
>
> /* ??? But for constants once we want to handle
> non-bijective permutes we have to verify the permute,
> when unifying lanes, will not unify different constants.
> For example see gcc.dg/vect/bb-slp-14.c for a case
> that would break. */
>
> (I hope).
Yes. I think the comment meant that permutes that only select a part
of the vector are not supported, but I'd have to go back to the original
vect_attempt_slp_rearrange_stmts (which was quite limited) to check.
The idea there was that the "unpermuted" vector load would be always
a natively supported operation and thus always better than the
permuted one.
I think extending on that is OK. Ideally targets would expose
multi-step permutes here so we can better assess cost.
> > (I'm unsure about the std:: stuff but mostly because I'm not
> > familiar with it and its semantics)
> >
> > The current "propagation" isn't really designed to find
> > optimal solutions, it rather relies on the initial
> > permute "anticipation" we set up and simply propagates
> > those up as far as possible. I think at some point we need
> > to come up with something more elaborate and rewrite this all.
> >
> > It might also be that a target handles { 2, 0 } but not
> > { 0, 2 } while we know it handles the unpermuted case always.
>
> OK. I guess we should reject new permutes if vect_transform_slp_perm_load
> is false for them and true for the original.
Yes, for example. Or somehow "compute" a permute that makes the
permute supported (since when it ends up not supported we disqualify
the vectorization later).
> > Can you think of a way to express this as dataflow problem
> > capturing finding the optimal solution as well?
>
> I think it's inevitably going to be a two-way thing: propagate what's
> best for producers as far as possible, then propagate back what's
> best for consumers as far as possible, or the other way around.
Yeah, with the additional constraint what the target handles.
Richard.
> Thanks,
> Richard
>
> >
> >> vec<unsigned> perm = vNULL;
> >> perm.safe_grow (SLP_TREE_LANES (node), true);
> >> for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> >> - perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> >> + {
> >> + auto it = std::lower_bound (sorted.begin (), sorted.end (),
> >> + SLP_TREE_LOAD_PERMUTATION (node)[j]);
> >> + perm[j] = it - sorted.begin ();
> >> + }
> >> perms.safe_push (perm);
> >> vertices[idx].perm_in = perms.length () - 1;
> >> vertices[idx].perm_out = perms.length () - 1;
> >> @@ -6647,7 +6634,7 @@ vect_transform_slp_perm_load (vec_info *vinfo,
> >> {
> >> stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
> >> int vec_index = 0;
> >> - tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> >> + tree vectype = SLP_TREE_VECTYPE (node);
> >
> > that looks like an obvious fix.
> >
> > Richard.
> >
> >> unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
> >> unsigned int mask_element;
> >> machine_mode mode;
> >>
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstra
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2022-06-24 12:18 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-23 12:00 [PATCH] RFC: Optimise SLP permutes of non-consecutive loads Richard Sandiford
2022-06-24 7:25 ` Richard Biener
2022-06-24 9:07 ` Richard Sandiford
2022-06-24 12:18 ` Richard Biener
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).