* [PATCH] vect: Try to remove single-vector permutes from SLP graph
@ 2022-09-01 10:29 Richard Sandiford
2022-09-01 10:49 ` Richard Biener
0 siblings, 1 reply; 2+ messages in thread
From: Richard Sandiford @ 2022-09-01 10:29 UTC (permalink / raw)
To: gcc-patches; +Cc: rguenther
This patch extends the SLP layout optimisation pass so that it
tries to remove layout changes that are brought about by permutes
of existing vectors. This fixes the bb-slp-pr54400.c regression on
x86_64 and also means that we can remove the permutes in cases like:
typedef float v4sf __attribute__((vector_size(sizeof(float)*4)));
float __attribute__((noipa))
f(v4sf v0, v4sf v1)
{
return v0[0]*v1[0]+v0[1]*v1[1]+v0[2]*v1[2]+v0[3]*v1[3];
}
The new test is a simple adaption of bb-slp-pr54400.c, with the
same style of markup.
Tested on aarch64-linux-gnu and x86_64-linux-gnu. OK to install?
Richard
PS. Sorry for missing the regression during testing. The initial
version of the new bb-slp-layout* tests had markup that led
to a lot of FAILs on x86_64. I fixed the markup to avoid those,
but didn't notice this extra (unrelated) failure at the end.
gcc/
* tree-vect-slp.cc (vect_build_slp_tree_2): When building a
VEC_PERM_EXPR of an existing vector, set the SLP_TREE_LANES
to the number of vector elements, if that's a known constant.
(vect_optimize_slp_pass::is_compatible_layout): Remove associated
comment about zero SLP_TREE_LANES.
(vect_optimize_slp_pass::start_choosing_layouts): Iterate over
all partition members when looking for potential layouts.
Handle existing permutes of fixed-length vectors.
gcc/testsuite/
* gcc.dg/vect/bb-slp-pr54400.c: Extend to aarch64.
* gcc.dg/vect/bb-slp-layout-18.c: New test.
---
gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c | 15 +++++
gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c | 4 +-
gcc/tree-vect-slp.cc | 67 +++++++++++++-------
3 files changed, 61 insertions(+), 25 deletions(-)
create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
new file mode 100644
index 00000000000..ff462722507
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_float} */
+/* { dg-additional-options "-w -Wno-psabi -ffast-math" } */
+
+typedef float v4sf __attribute__((vector_size(sizeof(float)*4)));
+
+float __attribute__((noipa))
+f(v4sf v0, v4sf v1)
+{
+ return v0[0]*v1[0]+v0[1]*v1[1]+v0[2]*v1[2]+v0[3]*v1[3];
+}
+
+/* We are lacking an effective target for .REDUC_PLUS support. */
+/* { dg-final { scan-tree-dump-times "basic block part vectorized" 1 "slp2" { target x86_64-*-* aarch64*-*-* } } } */
+/* { dg-final { scan-tree-dump-not " = VEC_PERM_EXPR" "slp2" { target x86_64-*-* aarch64*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
index 6b427aac774..8aec2092f4d 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
@@ -39,5 +39,5 @@ main ()
}
/* We are lacking an effective target for .REDUC_PLUS support. */
-/* { dg-final { scan-tree-dump-times "basic block part vectorized" 3 "slp2" { target x86_64-*-* } } } */
-/* { dg-final { scan-tree-dump-not " = VEC_PERM_EXPR" "slp2" { target x86_64-*-* } } } */
+/* { dg-final { scan-tree-dump-times "basic block part vectorized" 3 "slp2" { target x86_64-*-* aarch64*-*-* } } } */
+/* { dg-final { scan-tree-dump-not " = VEC_PERM_EXPR" "slp2" { target x86_64-*-* aarch64*-*-* } } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 226550635cc..cc04ef350a6 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -1840,6 +1840,10 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
TREE_TYPE (TREE_TYPE (vec))));
SLP_TREE_VECTYPE (vnode) = TREE_TYPE (vec);
}
+ auto nunits = TYPE_VECTOR_SUBPARTS (SLP_TREE_VECTYPE (vnode));
+ unsigned HOST_WIDE_INT const_nunits;
+ if (nunits.is_constant (&const_nunits))
+ SLP_TREE_LANES (vnode) = const_nunits;
SLP_TREE_VEC_DEFS (vnode).safe_push (vec);
/* We are always building a permutation node even if it is an identity
permute to shield the rest of the vectorizer from the odd node
@@ -4325,8 +4329,6 @@ vect_optimize_slp_pass::is_compatible_layout (slp_tree node,
if (layout_i == 0)
return true;
- /* SLP_TREE_LANES is zero for existing vectors, but those only support
- layout 0 anyway. */
if (SLP_TREE_LANES (node) != m_perms[layout_i].length ())
return false;
@@ -4521,38 +4523,57 @@ vect_optimize_slp_pass::start_choosing_layouts ()
m_perms.safe_push (vNULL);
/* Create layouts from existing permutations. */
- for (unsigned int node_i : m_leafs)
+ auto_load_permutation_t tmp_perm;
+ for (unsigned int node_i : m_partitioned_nodes)
{
- auto &vertex = m_vertices[node_i];
- if (vertex.partition < 0)
- continue;
-
/* Leafs also double as entries to the reverse graph. Allow the
layout of those to be changed. */
+ auto &vertex = m_vertices[node_i];
auto &partition = m_partitions[vertex.partition];
if (!m_slpg->vertices[node_i].succ)
partition.layout = 0;
- /* Loads are the only thing generating permutes. */
+ /* Loads and VEC_PERM_EXPRs are the only things generating permutes. */
slp_tree node = vertex.node;
- if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
- continue;
-
- /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the node
- unpermuted, record a layout that reverses this permutation. */
- gcc_assert (partition.layout == 0);
stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
- if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
- continue;
- dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
- unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
+ slp_tree child;
+ unsigned HOST_WIDE_INT imin, imax = 0;
bool any_permute = false;
+ tmp_perm.truncate (0);
+ if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
+ {
+ /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the node
+ unpermuted, record a layout that reverses this permutation. */
+ gcc_assert (partition.layout == 0);
+ if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
+ continue;
+ dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
+ imin = DR_GROUP_SIZE (dr_stmt) + 1;
+ tmp_perm.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
+ }
+ else if (SLP_TREE_CODE (node) == VEC_PERM_EXPR
+ && SLP_TREE_CHILDREN (node).length () == 1
+ && (child = SLP_TREE_CHILDREN (node)[0])
+ && (TYPE_VECTOR_SUBPARTS (SLP_TREE_VECTYPE (child))
+ .is_constant (&imin)))
+ {
+ /* If the child has the same vector size as this node,
+ reversing the permutation can make the permutation a no-op.
+ In other cases it can change a true permutation into a
+ full-vector extract. */
+ tmp_perm.reserve (SLP_TREE_LANES (node));
+ for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
+ tmp_perm.quick_push (SLP_TREE_LANE_PERMUTATION (node)[j].second);
+ }
+ else
+ continue;
+
for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
{
- unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
+ unsigned idx = tmp_perm[j];
imin = MIN (imin, idx);
imax = MAX (imax, idx);
- if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
+ if (idx - tmp_perm[0] != j)
any_permute = true;
}
/* If the span doesn't match we'd disrupt VF computation, avoid
@@ -4560,7 +4581,7 @@ vect_optimize_slp_pass::start_choosing_layouts ()
if (imax - imin + 1 != SLP_TREE_LANES (node))
continue;
/* If there's no permute no need to split one out. In this case
- we can consider turning the load into a permuted load, if that
+ we can consider turning a load into a permuted load, if that
turns out to be cheaper than alternatives. */
if (!any_permute)
{
@@ -4575,7 +4596,7 @@ vect_optimize_slp_pass::start_choosing_layouts ()
auto_sbitmap load_index (SLP_TREE_LANES (node));
bitmap_clear (load_index);
for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
- bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
+ bitmap_set_bit (load_index, tmp_perm[j] - imin);
unsigned j;
for (j = 0; j < SLP_TREE_LANES (node); ++j)
if (!bitmap_bit_p (load_index, j))
@@ -4586,7 +4607,7 @@ vect_optimize_slp_pass::start_choosing_layouts ()
vec<unsigned> perm = vNULL;
perm.safe_grow (SLP_TREE_LANES (node), true);
for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
- perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
+ perm[j] = tmp_perm[j] - imin;
if (int (m_perms.length ()) >= param_vect_max_layout_candidates)
{
--
2.25.1
^ permalink raw reply [flat|nested] 2+ messages in thread
* Re: [PATCH] vect: Try to remove single-vector permutes from SLP graph
2022-09-01 10:29 [PATCH] vect: Try to remove single-vector permutes from SLP graph Richard Sandiford
@ 2022-09-01 10:49 ` Richard Biener
0 siblings, 0 replies; 2+ messages in thread
From: Richard Biener @ 2022-09-01 10:49 UTC (permalink / raw)
To: Richard Sandiford; +Cc: gcc-patches
On Thu, 1 Sep 2022, Richard Sandiford wrote:
> This patch extends the SLP layout optimisation pass so that it
> tries to remove layout changes that are brought about by permutes
> of existing vectors. This fixes the bb-slp-pr54400.c regression on
> x86_64 and also means that we can remove the permutes in cases like:
>
> typedef float v4sf __attribute__((vector_size(sizeof(float)*4)));
>
> float __attribute__((noipa))
> f(v4sf v0, v4sf v1)
> {
> return v0[0]*v1[0]+v0[1]*v1[1]+v0[2]*v1[2]+v0[3]*v1[3];
> }
>
> The new test is a simple adaption of bb-slp-pr54400.c, with the
> same style of markup.
>
> Tested on aarch64-linux-gnu and x86_64-linux-gnu. OK to install?
OK.
Thanks,
Richard.
>
> Richard
>
> PS. Sorry for missing the regression during testing. The initial
> version of the new bb-slp-layout* tests had markup that led
> to a lot of FAILs on x86_64. I fixed the markup to avoid those,
> but didn't notice this extra (unrelated) failure at the end.
>
>
> gcc/
> * tree-vect-slp.cc (vect_build_slp_tree_2): When building a
> VEC_PERM_EXPR of an existing vector, set the SLP_TREE_LANES
> to the number of vector elements, if that's a known constant.
> (vect_optimize_slp_pass::is_compatible_layout): Remove associated
> comment about zero SLP_TREE_LANES.
> (vect_optimize_slp_pass::start_choosing_layouts): Iterate over
> all partition members when looking for potential layouts.
> Handle existing permutes of fixed-length vectors.
>
> gcc/testsuite/
> * gcc.dg/vect/bb-slp-pr54400.c: Extend to aarch64.
> * gcc.dg/vect/bb-slp-layout-18.c: New test.
> ---
> gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c | 15 +++++
> gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c | 4 +-
> gcc/tree-vect-slp.cc | 67 +++++++++++++-------
> 3 files changed, 61 insertions(+), 25 deletions(-)
> create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
> new file mode 100644
> index 00000000000..ff462722507
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-layout-18.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target vect_float} */
> +/* { dg-additional-options "-w -Wno-psabi -ffast-math" } */
> +
> +typedef float v4sf __attribute__((vector_size(sizeof(float)*4)));
> +
> +float __attribute__((noipa))
> +f(v4sf v0, v4sf v1)
> +{
> + return v0[0]*v1[0]+v0[1]*v1[1]+v0[2]*v1[2]+v0[3]*v1[3];
> +}
> +
> +/* We are lacking an effective target for .REDUC_PLUS support. */
> +/* { dg-final { scan-tree-dump-times "basic block part vectorized" 1 "slp2" { target x86_64-*-* aarch64*-*-* } } } */
> +/* { dg-final { scan-tree-dump-not " = VEC_PERM_EXPR" "slp2" { target x86_64-*-* aarch64*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c b/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
> index 6b427aac774..8aec2092f4d 100644
> --- a/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
> +++ b/gcc/testsuite/gcc.dg/vect/bb-slp-pr54400.c
> @@ -39,5 +39,5 @@ main ()
> }
>
> /* We are lacking an effective target for .REDUC_PLUS support. */
> -/* { dg-final { scan-tree-dump-times "basic block part vectorized" 3 "slp2" { target x86_64-*-* } } } */
> -/* { dg-final { scan-tree-dump-not " = VEC_PERM_EXPR" "slp2" { target x86_64-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "basic block part vectorized" 3 "slp2" { target x86_64-*-* aarch64*-*-* } } } */
> +/* { dg-final { scan-tree-dump-not " = VEC_PERM_EXPR" "slp2" { target x86_64-*-* aarch64*-*-* } } } */
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 226550635cc..cc04ef350a6 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -1840,6 +1840,10 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
> TREE_TYPE (TREE_TYPE (vec))));
> SLP_TREE_VECTYPE (vnode) = TREE_TYPE (vec);
> }
> + auto nunits = TYPE_VECTOR_SUBPARTS (SLP_TREE_VECTYPE (vnode));
> + unsigned HOST_WIDE_INT const_nunits;
> + if (nunits.is_constant (&const_nunits))
> + SLP_TREE_LANES (vnode) = const_nunits;
> SLP_TREE_VEC_DEFS (vnode).safe_push (vec);
> /* We are always building a permutation node even if it is an identity
> permute to shield the rest of the vectorizer from the odd node
> @@ -4325,8 +4329,6 @@ vect_optimize_slp_pass::is_compatible_layout (slp_tree node,
> if (layout_i == 0)
> return true;
>
> - /* SLP_TREE_LANES is zero for existing vectors, but those only support
> - layout 0 anyway. */
> if (SLP_TREE_LANES (node) != m_perms[layout_i].length ())
> return false;
>
> @@ -4521,38 +4523,57 @@ vect_optimize_slp_pass::start_choosing_layouts ()
> m_perms.safe_push (vNULL);
>
> /* Create layouts from existing permutations. */
> - for (unsigned int node_i : m_leafs)
> + auto_load_permutation_t tmp_perm;
> + for (unsigned int node_i : m_partitioned_nodes)
> {
> - auto &vertex = m_vertices[node_i];
> - if (vertex.partition < 0)
> - continue;
> -
> /* Leafs also double as entries to the reverse graph. Allow the
> layout of those to be changed. */
> + auto &vertex = m_vertices[node_i];
> auto &partition = m_partitions[vertex.partition];
> if (!m_slpg->vertices[node_i].succ)
> partition.layout = 0;
>
> - /* Loads are the only thing generating permutes. */
> + /* Loads and VEC_PERM_EXPRs are the only things generating permutes. */
> slp_tree node = vertex.node;
> - if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
> - continue;
> -
> - /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the node
> - unpermuted, record a layout that reverses this permutation. */
> - gcc_assert (partition.layout == 0);
> stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
> - if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> - continue;
> - dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> - unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
> + slp_tree child;
> + unsigned HOST_WIDE_INT imin, imax = 0;
> bool any_permute = false;
> + tmp_perm.truncate (0);
> + if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> + {
> + /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the node
> + unpermuted, record a layout that reverses this permutation. */
> + gcc_assert (partition.layout == 0);
> + if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> + continue;
> + dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> + imin = DR_GROUP_SIZE (dr_stmt) + 1;
> + tmp_perm.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
> + }
> + else if (SLP_TREE_CODE (node) == VEC_PERM_EXPR
> + && SLP_TREE_CHILDREN (node).length () == 1
> + && (child = SLP_TREE_CHILDREN (node)[0])
> + && (TYPE_VECTOR_SUBPARTS (SLP_TREE_VECTYPE (child))
> + .is_constant (&imin)))
> + {
> + /* If the child has the same vector size as this node,
> + reversing the permutation can make the permutation a no-op.
> + In other cases it can change a true permutation into a
> + full-vector extract. */
> + tmp_perm.reserve (SLP_TREE_LANES (node));
> + for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> + tmp_perm.quick_push (SLP_TREE_LANE_PERMUTATION (node)[j].second);
> + }
> + else
> + continue;
> +
> for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> {
> - unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
> + unsigned idx = tmp_perm[j];
> imin = MIN (imin, idx);
> imax = MAX (imax, idx);
> - if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
> + if (idx - tmp_perm[0] != j)
> any_permute = true;
> }
> /* If the span doesn't match we'd disrupt VF computation, avoid
> @@ -4560,7 +4581,7 @@ vect_optimize_slp_pass::start_choosing_layouts ()
> if (imax - imin + 1 != SLP_TREE_LANES (node))
> continue;
> /* If there's no permute no need to split one out. In this case
> - we can consider turning the load into a permuted load, if that
> + we can consider turning a load into a permuted load, if that
> turns out to be cheaper than alternatives. */
> if (!any_permute)
> {
> @@ -4575,7 +4596,7 @@ vect_optimize_slp_pass::start_choosing_layouts ()
> auto_sbitmap load_index (SLP_TREE_LANES (node));
> bitmap_clear (load_index);
> for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> - bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
> + bitmap_set_bit (load_index, tmp_perm[j] - imin);
> unsigned j;
> for (j = 0; j < SLP_TREE_LANES (node); ++j)
> if (!bitmap_bit_p (load_index, j))
> @@ -4586,7 +4607,7 @@ vect_optimize_slp_pass::start_choosing_layouts ()
> vec<unsigned> perm = vNULL;
> perm.safe_grow (SLP_TREE_LANES (node), true);
> for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> - perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> + perm[j] = tmp_perm[j] - imin;
>
> if (int (m_perms.length ()) >= param_vect_max_layout_candidates)
> {
>
--
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
HRB 36809 (AG Nuernberg)
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2022-09-01 10:49 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-01 10:29 [PATCH] vect: Try to remove single-vector permutes from SLP graph Richard Sandiford
2022-09-01 10:49 ` Richard Biener
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).