[PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
@ 2022-11-04  0:04 Hongyu Wang
  2022-11-04  6:43 ` Prathamesh Kulkarni
  2022-11-16 19:21 ` Marc Glisse
  0 siblings, 2 replies; 16+ messages in thread
From: Hongyu Wang @ 2022-11-04  0:04 UTC (permalink / raw)
  To: gcc-patches; +Cc: richard.guenther, hongtao.liu

Hi,

This is a follow-up patch for PR98167

The sequence
     c1 = VEC_PERM_EXPR (a, a, mask)
     c2 = VEC_PERM_EXPR (b, b, mask)
     c3 = c1 op c2
can be optimized to
     c = a op b
     c3 = VEC_PERM_EXPR (c, c, mask)
for all integer vector operation, and float operation with
full permutation.

Bootstrapped & regrtested on x86_64-pc-linux-gnu.

Ok for trunk?

gcc/ChangeLog:

	PR target/98167
	* match.pd: New perm + vector op patterns for int and fp vector.

gcc/testsuite/ChangeLog:

	PR target/98167
	* gcc.target/i386/pr98167.c: New test.
---
 gcc/match.pd                            | 49 +++++++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
 2 files changed, 93 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 194ba8f5188..b85ad34f609 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -8189,3 +8189,52 @@ and,
  (bit_and (negate @0) integer_onep@1)
  (if (!TYPE_OVERFLOW_SANITIZED (type))
   (bit_and @0 @1)))
+
+/* Optimize
+   c1 = VEC_PERM_EXPR (a, a, mask)
+   c2 = VEC_PERM_EXPR (b, b, mask)
+   c3 = c1 op c2
+   -->
+   c = a op b
+   c3 = VEC_PERM_EXPR (c, c, mask)
+   For all integer non-div operations.  */
+(for op (plus minus mult bit_and bit_ior bit_xor
+	 lshift rshift)
+ (simplify
+  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
+    (if (VECTOR_INTEGER_TYPE_P (type))
+     (vec_perm (op @0 @1) (op @0 @1) @2))))
+
+/* Similar for float arithmetic when permutation constant covers
+   all vector elements.  */
+(for op (plus minus mult)
+ (simplify
+  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
+    (if (VECTOR_FLOAT_TYPE_P (type))
+     (with
+      {
+	tree perm_cst = @2;
+	vec_perm_builder builder;
+	bool full_perm_p = false;
+	if (tree_to_vec_perm_builder (&builder, perm_cst))
+	  {
+	    /* Create a vec_perm_indices for the integer vector.  */
+	    int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
+	    vec_perm_indices sel (builder, 1, nelts);
+
+	    /* Check if perm indices covers all vector elements.  */
+	    int count = 0, i, j;
+	    for (i = 0; i < nelts; i++)
+	      for (j = 0; j < nelts; j++)
+		{
+		  if (sel[j].to_constant () == i)
+		    {
+		      count++;
+		      break;
+		    }
+		}
+	    full_perm_p = count == nelts;
+	  }
+       }
+       (if (full_perm_p)
+	(vec_perm (op @0 @1) (op @0 @1) @2))))))
diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
new file mode 100644
index 00000000000..40e0ac11332
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr98167.c
@@ -0,0 +1,44 @@
+/* PR target/98167 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx2" } */
+
+/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
+/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
+
+#define VEC_PERM_4 \
+  2, 3, 1, 0
+#define VEC_PERM_8 \
+  4, 5, 6, 7, 3, 2, 1, 0
+#define VEC_PERM_16 \
+  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
+
+#define TYPE_PERM_OP(type, size, op, name) \
+  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
+  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
+					      v##size##s##type b) \
+  { \
+    v##size##s##type a1 = __builtin_shufflevector (a, a, \
+						   VEC_PERM_##size); \
+    v##size##s##type b1 = __builtin_shufflevector (b, b, \
+						   VEC_PERM_##size); \
+    return a1 op b1; \
+  }
+
+#define INT_PERMS(op, name) \
+  TYPE_PERM_OP (int, 4, op, name) \
+
+#define FP_PERMS(op, name) \
+  TYPE_PERM_OP (float, 4, op, name) \
+
+INT_PERMS (+, add)
+INT_PERMS (-, sub)
+INT_PERMS (*, mul)
+INT_PERMS (|, ior)
+INT_PERMS (^, xor)
+INT_PERMS (&, and)
+INT_PERMS (<<, shl)
+INT_PERMS (>>, shr)
+FP_PERMS (+, add)
+FP_PERMS (-, sub)
+FP_PERMS (*, mul)
+
-- 
2.18.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-04  0:04 [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167] Hongyu Wang
@ 2022-11-04  6:43 ` Prathamesh Kulkarni
  2022-11-08 14:37   ` Richard Biener
  2022-11-16 19:21 ` Marc Glisse
  1 sibling, 1 reply; 16+ messages in thread
From: Prathamesh Kulkarni @ 2022-11-04  6:43 UTC (permalink / raw)
  To: Hongyu Wang; +Cc: gcc-patches, hongtao.liu

On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi,
>
> This is a follow-up patch for PR98167
>
> The sequence
>      c1 = VEC_PERM_EXPR (a, a, mask)
>      c2 = VEC_PERM_EXPR (b, b, mask)
>      c3 = c1 op c2
> can be optimized to
>      c = a op b
>      c3 = VEC_PERM_EXPR (c, c, mask)
> for all integer vector operation, and float operation with
> full permutation.
>
> Bootstrapped & regrtested on x86_64-pc-linux-gnu.
>
> Ok for trunk?
>
> gcc/ChangeLog:
>
>         PR target/98167
>         * match.pd: New perm + vector op patterns for int and fp vector.
>
> gcc/testsuite/ChangeLog:
>
>         PR target/98167
>         * gcc.target/i386/pr98167.c: New test.
> ---
>  gcc/match.pd                            | 49 +++++++++++++++++++++++++
>  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
>  2 files changed, 93 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
>
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 194ba8f5188..b85ad34f609 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -8189,3 +8189,52 @@ and,
>   (bit_and (negate @0) integer_onep@1)
>   (if (!TYPE_OVERFLOW_SANITIZED (type))
>    (bit_and @0 @1)))
> +
> +/* Optimize
> +   c1 = VEC_PERM_EXPR (a, a, mask)
> +   c2 = VEC_PERM_EXPR (b, b, mask)
> +   c3 = c1 op c2
> +   -->
> +   c = a op b
> +   c3 = VEC_PERM_EXPR (c, c, mask)
> +   For all integer non-div operations.  */
> +(for op (plus minus mult bit_and bit_ior bit_xor
> +        lshift rshift)
> + (simplify
> +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> +    (if (VECTOR_INTEGER_TYPE_P (type))
> +     (vec_perm (op @0 @1) (op @0 @1) @2))))
Just wondering, why should mask be CST here ?
I guess the transform should work as long as mask is same for both
vectors even if it's
not constant ?
> +
> +/* Similar for float arithmetic when permutation constant covers
> +   all vector elements.  */
> +(for op (plus minus mult)
> + (simplify
> +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> +    (if (VECTOR_FLOAT_TYPE_P (type))
> +     (with
> +      {
> +       tree perm_cst = @2;
> +       vec_perm_builder builder;
> +       bool full_perm_p = false;
> +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> +         {
> +           /* Create a vec_perm_indices for the integer vector.  */
> +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
If this transform is meant only for VLS vectors, I guess you should
bail out if TYPE_VECTOR_SUBPARTS is not constant,
otherwise it will crash for VLA vectors.

Thanks,
Prathamesh
> +           vec_perm_indices sel (builder, 1, nelts);
> +
> +           /* Check if perm indices covers all vector elements.  */
> +           int count = 0, i, j;
> +           for (i = 0; i < nelts; i++)
> +             for (j = 0; j < nelts; j++)
> +               {
> +                 if (sel[j].to_constant () == i)
> +                   {
> +                     count++;
> +                     break;
> +                   }
> +               }
> +           full_perm_p = count == nelts;
> +         }
> +       }
> +       (if (full_perm_p)
> +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> new file mode 100644
> index 00000000000..40e0ac11332
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> @@ -0,0 +1,44 @@
> +/* PR target/98167 */
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mavx2" } */
> +
> +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> +
> +#define VEC_PERM_4 \
> +  2, 3, 1, 0
> +#define VEC_PERM_8 \
> +  4, 5, 6, 7, 3, 2, 1, 0
> +#define VEC_PERM_16 \
> +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> +
> +#define TYPE_PERM_OP(type, size, op, name) \
> +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> +                                             v##size##s##type b) \
> +  { \
> +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> +                                                  VEC_PERM_##size); \
> +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> +                                                  VEC_PERM_##size); \
> +    return a1 op b1; \
> +  }
> +
> +#define INT_PERMS(op, name) \
> +  TYPE_PERM_OP (int, 4, op, name) \
> +
> +#define FP_PERMS(op, name) \
> +  TYPE_PERM_OP (float, 4, op, name) \
> +
> +INT_PERMS (+, add)
> +INT_PERMS (-, sub)
> +INT_PERMS (*, mul)
> +INT_PERMS (|, ior)
> +INT_PERMS (^, xor)
> +INT_PERMS (&, and)
> +INT_PERMS (<<, shl)
> +INT_PERMS (>>, shr)
> +FP_PERMS (+, add)
> +FP_PERMS (-, sub)
> +FP_PERMS (*, mul)
> +
> --
> 2.18.1
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-04  6:43 ` Prathamesh Kulkarni
@ 2022-11-08 14:37   ` Richard Biener
  2022-11-10  2:22     ` Hongyu Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2022-11-08 14:37 UTC (permalink / raw)
  To: Prathamesh Kulkarni, Richard Sandiford
  Cc: Hongyu Wang, hongtao.liu, gcc-patches

On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > Hi,
> >
> > This is a follow-up patch for PR98167
> >
> > The sequence
> >      c1 = VEC_PERM_EXPR (a, a, mask)
> >      c2 = VEC_PERM_EXPR (b, b, mask)
> >      c3 = c1 op c2
> > can be optimized to
> >      c = a op b
> >      c3 = VEC_PERM_EXPR (c, c, mask)
> > for all integer vector operation, and float operation with
> > full permutation.
> >
> > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> >
> > Ok for trunk?
> >
> > gcc/ChangeLog:
> >
> >         PR target/98167
> >         * match.pd: New perm + vector op patterns for int and fp vector.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         PR target/98167
> >         * gcc.target/i386/pr98167.c: New test.
> > ---
> >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> >  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
> >  2 files changed, 93 insertions(+)
> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
> >
> > diff --git a/gcc/match.pd b/gcc/match.pd
> > index 194ba8f5188..b85ad34f609 100644
> > --- a/gcc/match.pd
> > +++ b/gcc/match.pd
> > @@ -8189,3 +8189,52 @@ and,
> >   (bit_and (negate @0) integer_onep@1)
> >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> >    (bit_and @0 @1)))
> > +
> > +/* Optimize
> > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > +   c3 = c1 op c2
> > +   -->
> > +   c = a op b
> > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > +   For all integer non-div operations.  */
> > +(for op (plus minus mult bit_and bit_ior bit_xor
> > +        lshift rshift)
> > + (simplify
> > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> Just wondering, why should mask be CST here ?
> I guess the transform should work as long as mask is same for both
> vectors even if it's
> not constant ?

Yes, please change accordingly (and maybe push separately).

> > +
> > +/* Similar for float arithmetic when permutation constant covers
> > +   all vector elements.  */
> > +(for op (plus minus mult)
> > + (simplify
> > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > +     (with
> > +      {
> > +       tree perm_cst = @2;
> > +       vec_perm_builder builder;
> > +       bool full_perm_p = false;
> > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > +         {
> > +           /* Create a vec_perm_indices for the integer vector.  */
> > +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> If this transform is meant only for VLS vectors, I guess you should
> bail out if TYPE_VECTOR_SUBPARTS is not constant,
> otherwise it will crash for VLA vectors.

I suppose it's difficult to create a VLA permute that covers all elements
and that is not trivial though.  But indeed add ().is_constant to the
VECTOR_FLOAT_TYPE_P guard.

>
> Thanks,
> Prathamesh
> > +           vec_perm_indices sel (builder, 1, nelts);
> > +
> > +           /* Check if perm indices covers all vector elements.  */
> > +           int count = 0, i, j;
> > +           for (i = 0; i < nelts; i++)
> > +             for (j = 0; j < nelts; j++)

Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
(as said I can't think of a non-full encoding that isn't trivial
but covers all elements) and then simply .qsort () the vector_builder
(it derives
from vec<>) so the scan is O(n log n).

Maybe Richard has a better idea here though.

Otherwise looks OK, though with these kind of (* (op ..) (op ..)) patterns it's
always that they explode the match decision tree, we'd ideally have a way to
match those with (op ..) (op ..) first to be able to share more of the matching
code.  That said, match.pd is a less than ideal place for these (but mostly
because of the way we code generate *-match.cc)

Richard.

> > +               {
> > +                 if (sel[j].to_constant () == i)
> > +                   {
> > +                     count++;
> > +                     break;
> > +                   }
> > +               }
> > +           full_perm_p = count == nelts;
> > +         }
> > +       }
> > +       (if (full_perm_p)
> > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> > new file mode 100644
> > index 00000000000..40e0ac11332
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > @@ -0,0 +1,44 @@
> > +/* PR target/98167 */
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -mavx2" } */
> > +
> > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > +
> > +#define VEC_PERM_4 \
> > +  2, 3, 1, 0
> > +#define VEC_PERM_8 \
> > +  4, 5, 6, 7, 3, 2, 1, 0
> > +#define VEC_PERM_16 \
> > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > +
> > +#define TYPE_PERM_OP(type, size, op, name) \
> > +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> > +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> > +                                             v##size##s##type b) \
> > +  { \
> > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > +                                                  VEC_PERM_##size); \
> > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > +                                                  VEC_PERM_##size); \
> > +    return a1 op b1; \
> > +  }
> > +
> > +#define INT_PERMS(op, name) \
> > +  TYPE_PERM_OP (int, 4, op, name) \
> > +
> > +#define FP_PERMS(op, name) \
> > +  TYPE_PERM_OP (float, 4, op, name) \
> > +
> > +INT_PERMS (+, add)
> > +INT_PERMS (-, sub)
> > +INT_PERMS (*, mul)
> > +INT_PERMS (|, ior)
> > +INT_PERMS (^, xor)
> > +INT_PERMS (&, and)
> > +INT_PERMS (<<, shl)
> > +INT_PERMS (>>, shr)
> > +FP_PERMS (+, add)
> > +FP_PERMS (-, sub)
> > +FP_PERMS (*, mul)
> > +
> > --
> > 2.18.1
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-08 14:37   ` Richard Biener
@ 2022-11-10  2:22     ` Hongyu Wang
  2022-11-10  8:56       ` Richard Biener
  0 siblings, 1 reply; 16+ messages in thread
From: Hongyu Wang @ 2022-11-10  2:22 UTC (permalink / raw)
  To: Richard Biener
  Cc: Prathamesh Kulkarni, Richard Sandiford, Hongyu Wang, hongtao.liu,
	gcc-patches

[-- Attachment #1: Type: text/plain, Size: 8260 bytes --]

Hi Prathamesh and Richard,

Thanks for the review and nice suggestions!

> > I guess the transform should work as long as mask is same for both
> > vectors even if it's
> > not constant ?
>
> Yes, please change accordingly (and maybe push separately).
>

Removed VECTOR_CST for integer ops.

> > If this transform is meant only for VLS vectors, I guess you should
> > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > otherwise it will crash for VLA vectors.
>
> I suppose it's difficult to create a VLA permute that covers all elements
> and that is not trivial though.  But indeed add ().is_constant to the
> VECTOR_FLOAT_TYPE_P guard.

Added.

> Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> (as said I can't think of a non-full encoding that isn't trivial
> but covers all elements) and then simply .qsort () the vector_builder
> (it derives
> from vec<>) so the scan is O(n log n).

The .qsort () approach requires an extra cmp_func that IMO would not
be feasible to be implemented in match.pd (I suppose lambda function
would not be a good idea either).
Another solution would be using hash_set but it does not work here for
int64_t or poly_int64 type.
So I kept current O(n^2) simple code here, and I suppose usually the
permutation indices would be a small number even for O(n^2)
complexity.

Attached updated patch.

Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> 于2022年11月8日周二 22:38写道：


>
> On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> >
> > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > Hi,
> > >
> > > This is a follow-up patch for PR98167
> > >
> > > The sequence
> > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > >      c3 = c1 op c2
> > > can be optimized to
> > >      c = a op b
> > >      c3 = VEC_PERM_EXPR (c, c, mask)
> > > for all integer vector operation, and float operation with
> > > full permutation.
> > >
> > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > >
> > > Ok for trunk?
> > >
> > > gcc/ChangeLog:
> > >
> > >         PR target/98167
> > >         * match.pd: New perm + vector op patterns for int and fp vector.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         PR target/98167
> > >         * gcc.target/i386/pr98167.c: New test.
> > > ---
> > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
> > >  2 files changed, 93 insertions(+)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
> > >
> > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > index 194ba8f5188..b85ad34f609 100644
> > > --- a/gcc/match.pd
> > > +++ b/gcc/match.pd
> > > @@ -8189,3 +8189,52 @@ and,
> > >   (bit_and (negate @0) integer_onep@1)
> > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > >    (bit_and @0 @1)))
> > > +
> > > +/* Optimize
> > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > +   c3 = c1 op c2
> > > +   -->
> > > +   c = a op b
> > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > +   For all integer non-div operations.  */
> > > +(for op (plus minus mult bit_and bit_ior bit_xor
> > > +        lshift rshift)
> > > + (simplify
> > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > Just wondering, why should mask be CST here ?
> > I guess the transform should work as long as mask is same for both
> > vectors even if it's
> > not constant ?
>
> Yes, please change accordingly (and maybe push separately).
>
> > > +
> > > +/* Similar for float arithmetic when permutation constant covers
> > > +   all vector elements.  */
> > > +(for op (plus minus mult)
> > > + (simplify
> > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > +     (with
> > > +      {
> > > +       tree perm_cst = @2;
> > > +       vec_perm_builder builder;
> > > +       bool full_perm_p = false;
> > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > +         {
> > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > If this transform is meant only for VLS vectors, I guess you should
> > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > otherwise it will crash for VLA vectors.
>
> I suppose it's difficult to create a VLA permute that covers all elements
> and that is not trivial though.  But indeed add ().is_constant to the
> VECTOR_FLOAT_TYPE_P guard.
>
> >
> > Thanks,
> > Prathamesh
> > > +           vec_perm_indices sel (builder, 1, nelts);
> > > +
> > > +           /* Check if perm indices covers all vector elements.  */
> > > +           int count = 0, i, j;
> > > +           for (i = 0; i < nelts; i++)
> > > +             for (j = 0; j < nelts; j++)
>
> Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> (as said I can't think of a non-full encoding that isn't trivial
> but covers all elements) and then simply .qsort () the vector_builder
> (it derives
> from vec<>) so the scan is O(n log n).
>
> Maybe Richard has a better idea here though.
>
> Otherwise looks OK, though with these kind of (* (op ..) (op ..)) patterns it's
> always that they explode the match decision tree, we'd ideally have a way to
> match those with (op ..) (op ..) first to be able to share more of the matching
> code.  That said, match.pd is a less than ideal place for these (but mostly
> because of the way we code generate *-match.cc)
>
> Richard.
>
> > > +               {
> > > +                 if (sel[j].to_constant () == i)
> > > +                   {
> > > +                     count++;
> > > +                     break;
> > > +                   }
> > > +               }
> > > +           full_perm_p = count == nelts;
> > > +         }
> > > +       }
> > > +       (if (full_perm_p)
> > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > new file mode 100644
> > > index 00000000000..40e0ac11332
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > @@ -0,0 +1,44 @@
> > > +/* PR target/98167 */
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O2 -mavx2" } */
> > > +
> > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > +
> > > +#define VEC_PERM_4 \
> > > +  2, 3, 1, 0
> > > +#define VEC_PERM_8 \
> > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > +#define VEC_PERM_16 \
> > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > +
> > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> > > +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> > > +                                             v##size##s##type b) \
> > > +  { \
> > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > +                                                  VEC_PERM_##size); \
> > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > +                                                  VEC_PERM_##size); \
> > > +    return a1 op b1; \
> > > +  }
> > > +
> > > +#define INT_PERMS(op, name) \
> > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > +
> > > +#define FP_PERMS(op, name) \
> > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > +
> > > +INT_PERMS (+, add)
> > > +INT_PERMS (-, sub)
> > > +INT_PERMS (*, mul)
> > > +INT_PERMS (|, ior)
> > > +INT_PERMS (^, xor)
> > > +INT_PERMS (&, and)
> > > +INT_PERMS (<<, shl)
> > > +INT_PERMS (>>, shr)
> > > +FP_PERMS (+, add)
> > > +FP_PERMS (-, sub)
> > > +FP_PERMS (*, mul)
> > > +
> > > --
> > > 2.18.1
> > >

[-- Attachment #2: 0001-Optimize-VEC_PERM_EXPR-with-same-permutation-index-a.patch --]
[-- Type: text/x-patch, Size: 4042 bytes --]

From 2d0014e3b0f9fedcd75fe31cffd4f998db6db543 Mon Sep 17 00:00:00 2001
From: Hongyu Wang <hongyu.wang@intel.com>
Date: Mon, 17 Jan 2022 13:01:51 +0800
Subject: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and
 operation

The sequence
     c1 = VEC_PERM_EXPR (a, a, mask)
     c2 = VEC_PERM_EXPR (b, b, mask)
     c3 = c1 op c2
can be optimized to
     c = a op b
     c3 = VEC_PERM_EXPR (c, c, mask)
for all integer vector operation, and float operation with
full permutation.

gcc/ChangeLog:

	PR target/98167
	* match.pd: New perm + vector op patterns for int and fp vector.

gcc/testsuite/ChangeLog:

	PR target/98167
	* gcc.target/i386/pr98167.c: New test.
---
 gcc/match.pd                            | 50 +++++++++++++++++++++++++
 gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
 2 files changed, 94 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c

diff --git a/gcc/match.pd b/gcc/match.pd
index 194ba8f5188..a394d664226 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -8189,3 +8189,53 @@ and,
  (bit_and (negate @0) integer_onep@1)
  (if (!TYPE_OVERFLOW_SANITIZED (type))
   (bit_and @0 @1)))
+
+/* Optimize
+   c1 = VEC_PERM_EXPR (a, a, mask)
+   c2 = VEC_PERM_EXPR (b, b, mask)
+   c3 = c1 op c2
+   -->
+   c = a op b
+   c3 = VEC_PERM_EXPR (c, c, mask)
+   For all integer non-div operations.  */
+(for op (plus minus mult bit_and bit_ior bit_xor
+	 lshift rshift)
+ (simplify
+  (op (vec_perm @0 @0 @2) (vec_perm @1 @1 @2))
+   (if (VECTOR_INTEGER_TYPE_P (type))
+    (vec_perm (op @0 @1) (op @0 @1) @2))))
+
+/* Similar for float arithmetic when permutation constant covers
+   all vector elements.  */
+(for op (plus minus mult)
+ (simplify
+  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
+   (if (VECTOR_FLOAT_TYPE_P (type)
+	&& TYPE_VECTOR_SUBPARTS (type).is_constant())
+    (with
+     {
+       tree perm_cst = @2;
+       vec_perm_builder builder;
+       bool full_perm_p = false;
+       if (tree_to_vec_perm_builder (&builder, perm_cst))
+	 {
+	   unsigned HOST_WIDE_INT nelts
+	     = TYPE_VECTOR_SUBPARTS (type).to_constant ();
+	   /* Create a vec_perm_indices for the VECTOR_CST.  */
+	   vec_perm_indices sel (builder, 1, nelts);
+
+	   /* Check if perm indices covers all vector elements.  */
+	   unsigned HOST_WIDE_INT i, j, count = 0;
+
+	   for (i = 0; i < nelts; i++)
+	     for (j = 0; j < nelts; j++)
+		if (known_eq (poly_uint64 (sel[j]), i))
+		  {
+		    count++;
+		    break;
+		  }
+	   full_perm_p = known_eq (count, nelts);
+	 }
+      }
+      (if (full_perm_p)
+	(vec_perm (op @0 @1) (op @0 @1) @2))))))
diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
new file mode 100644
index 00000000000..40e0ac11332
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr98167.c
@@ -0,0 +1,44 @@
+/* PR target/98167 */
+/* { dg-do compile } */
+/* { dg-options "-O2 -mavx2" } */
+
+/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
+/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
+
+#define VEC_PERM_4 \
+  2, 3, 1, 0
+#define VEC_PERM_8 \
+  4, 5, 6, 7, 3, 2, 1, 0
+#define VEC_PERM_16 \
+  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
+
+#define TYPE_PERM_OP(type, size, op, name) \
+  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
+  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
+					      v##size##s##type b) \
+  { \
+    v##size##s##type a1 = __builtin_shufflevector (a, a, \
+						   VEC_PERM_##size); \
+    v##size##s##type b1 = __builtin_shufflevector (b, b, \
+						   VEC_PERM_##size); \
+    return a1 op b1; \
+  }
+
+#define INT_PERMS(op, name) \
+  TYPE_PERM_OP (int, 4, op, name) \
+
+#define FP_PERMS(op, name) \
+  TYPE_PERM_OP (float, 4, op, name) \
+
+INT_PERMS (+, add)
+INT_PERMS (-, sub)
+INT_PERMS (*, mul)
+INT_PERMS (|, ior)
+INT_PERMS (^, xor)
+INT_PERMS (&, and)
+INT_PERMS (<<, shl)
+INT_PERMS (>>, shr)
+FP_PERMS (+, add)
+FP_PERMS (-, sub)
+FP_PERMS (*, mul)
+
-- 
2.18.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-10  2:22     ` Hongyu Wang
@ 2022-11-10  8:56       ` Richard Biener
  2022-11-10 14:27         ` Hongyu Wang
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2022-11-10  8:56 UTC (permalink / raw)
  To: Hongyu Wang
  Cc: Prathamesh Kulkarni, Richard Sandiford, Hongyu Wang, hongtao.liu,
	gcc-patches

On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
>
> Hi Prathamesh and Richard,
>
> Thanks for the review and nice suggestions!
>
> > > I guess the transform should work as long as mask is same for both
> > > vectors even if it's
> > > not constant ?
> >
> > Yes, please change accordingly (and maybe push separately).
> >
>
> Removed VECTOR_CST for integer ops.
>
> > > If this transform is meant only for VLS vectors, I guess you should
> > > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > otherwise it will crash for VLA vectors.
> >
> > I suppose it's difficult to create a VLA permute that covers all elements
> > and that is not trivial though.  But indeed add ().is_constant to the
> > VECTOR_FLOAT_TYPE_P guard.
>
> Added.
>
> > Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> > (as said I can't think of a non-full encoding that isn't trivial
> > but covers all elements) and then simply .qsort () the vector_builder
> > (it derives
> > from vec<>) so the scan is O(n log n).
>
> The .qsort () approach requires an extra cmp_func that IMO would not
> be feasible to be implemented in match.pd (I suppose lambda function
> would not be a good idea either).
> Another solution would be using hash_set but it does not work here for
> int64_t or poly_int64 type.
> So I kept current O(n^2) simple code here, and I suppose usually the
> permutation indices would be a small number even for O(n^2)
> complexity.

Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I think
a lambda function is fine to use.  The alternative (used by the vectorizer
in some places) is to use sth like

 auto_sbitmap seen (nelts);
 for (i = 0; i < nelts; i++)
   {
     if (!bitmap_set_bit (seen, i))
       break;
     count++;
   }
 full_perm_p = count == nelts;

I'll note that you should still check .encoding ().encoded_full_vector_p ()
and only bother to check that case, that's a very simple check.

>
> Attached updated patch.
>
> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> 于2022年11月8日周二 22:38写道：
>
>
> >
> > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > >
> > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > This is a follow-up patch for PR98167
> > > >
> > > > The sequence
> > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > >      c3 = c1 op c2
> > > > can be optimized to
> > > >      c = a op b
> > > >      c3 = VEC_PERM_EXPR (c, c, mask)
> > > > for all integer vector operation, and float operation with
> > > > full permutation.
> > > >
> > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > >
> > > > Ok for trunk?
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         PR target/98167
> > > >         * match.pd: New perm + vector op patterns for int and fp vector.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         PR target/98167
> > > >         * gcc.target/i386/pr98167.c: New test.
> > > > ---
> > > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
> > > >  2 files changed, 93 insertions(+)
> > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
> > > >
> > > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > > index 194ba8f5188..b85ad34f609 100644
> > > > --- a/gcc/match.pd
> > > > +++ b/gcc/match.pd
> > > > @@ -8189,3 +8189,52 @@ and,
> > > >   (bit_and (negate @0) integer_onep@1)
> > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > >    (bit_and @0 @1)))
> > > > +
> > > > +/* Optimize
> > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > +   c3 = c1 op c2
> > > > +   -->
> > > > +   c = a op b
> > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > +   For all integer non-div operations.  */
> > > > +(for op (plus minus mult bit_and bit_ior bit_xor
> > > > +        lshift rshift)
> > > > + (simplify
> > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > Just wondering, why should mask be CST here ?
> > > I guess the transform should work as long as mask is same for both
> > > vectors even if it's
> > > not constant ?
> >
> > Yes, please change accordingly (and maybe push separately).
> >
> > > > +
> > > > +/* Similar for float arithmetic when permutation constant covers
> > > > +   all vector elements.  */
> > > > +(for op (plus minus mult)
> > > > + (simplify
> > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > +     (with
> > > > +      {
> > > > +       tree perm_cst = @2;
> > > > +       vec_perm_builder builder;
> > > > +       bool full_perm_p = false;
> > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > +         {
> > > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > > +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > > If this transform is meant only for VLS vectors, I guess you should
> > > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > otherwise it will crash for VLA vectors.
> >
> > I suppose it's difficult to create a VLA permute that covers all elements
> > and that is not trivial though.  But indeed add ().is_constant to the
> > VECTOR_FLOAT_TYPE_P guard.
> >
> > >
> > > Thanks,
> > > Prathamesh
> > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > +
> > > > +           /* Check if perm indices covers all vector elements.  */
> > > > +           int count = 0, i, j;
> > > > +           for (i = 0; i < nelts; i++)
> > > > +             for (j = 0; j < nelts; j++)
> >
> > Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> > (as said I can't think of a non-full encoding that isn't trivial
> > but covers all elements) and then simply .qsort () the vector_builder
> > (it derives
> > from vec<>) so the scan is O(n log n).
> >
> > Maybe Richard has a better idea here though.
> >
> > Otherwise looks OK, though with these kind of (* (op ..) (op ..)) patterns it's
> > always that they explode the match decision tree, we'd ideally have a way to
> > match those with (op ..) (op ..) first to be able to share more of the matching
> > code.  That said, match.pd is a less than ideal place for these (but mostly
> > because of the way we code generate *-match.cc)
> >
> > Richard.
> >
> > > > +               {
> > > > +                 if (sel[j].to_constant () == i)
> > > > +                   {
> > > > +                     count++;
> > > > +                     break;
> > > > +                   }
> > > > +               }
> > > > +           full_perm_p = count == nelts;
> > > > +         }
> > > > +       }
> > > > +       (if (full_perm_p)
> > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > new file mode 100644
> > > > index 00000000000..40e0ac11332
> > > > --- /dev/null
> > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > @@ -0,0 +1,44 @@
> > > > +/* PR target/98167 */
> > > > +/* { dg-do compile } */
> > > > +/* { dg-options "-O2 -mavx2" } */
> > > > +
> > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > +
> > > > +#define VEC_PERM_4 \
> > > > +  2, 3, 1, 0
> > > > +#define VEC_PERM_8 \
> > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > +#define VEC_PERM_16 \
> > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > +
> > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> > > > +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> > > > +                                             v##size##s##type b) \
> > > > +  { \
> > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > +                                                  VEC_PERM_##size); \
> > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > +                                                  VEC_PERM_##size); \
> > > > +    return a1 op b1; \
> > > > +  }
> > > > +
> > > > +#define INT_PERMS(op, name) \
> > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > +
> > > > +#define FP_PERMS(op, name) \
> > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > +
> > > > +INT_PERMS (+, add)
> > > > +INT_PERMS (-, sub)
> > > > +INT_PERMS (*, mul)
> > > > +INT_PERMS (|, ior)
> > > > +INT_PERMS (^, xor)
> > > > +INT_PERMS (&, and)
> > > > +INT_PERMS (<<, shl)
> > > > +INT_PERMS (>>, shr)
> > > > +FP_PERMS (+, add)
> > > > +FP_PERMS (-, sub)
> > > > +FP_PERMS (*, mul)
> > > > +
> > > > --
> > > > 2.18.1
> > > >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-10  8:56       ` Richard Biener
@ 2022-11-10 14:27         ` Hongyu Wang
  2022-11-14 14:53           ` Richard Biener
  0 siblings, 1 reply; 16+ messages in thread
From: Hongyu Wang @ 2022-11-10 14:27 UTC (permalink / raw)
  To: Richard Biener
  Cc: Prathamesh Kulkarni, Richard Sandiford, Hongyu Wang, hongtao.liu,
	gcc-patches

[-- Attachment #1: Type: text/plain, Size: 10558 bytes --]

> Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I think
> a lambda function is fine to use.  The alternative (used by the vectorizer
> in some places) is to use sth like
>
>  auto_sbitmap seen (nelts);
>  for (i = 0; i < nelts; i++)
>    {
>      if (!bitmap_set_bit (seen, i))
>        break;
>      count++;
>    }
>  full_perm_p = count == nelts;
>
> I'll note that you should still check .encoding ().encoded_full_vector_p ()
> and only bother to check that case, that's a very simple check.

Thanks for the good example! We also tried using wide_int as a bitmask
but your code looks more simple and reasonable.

Updated the patch accordingly.

Richard Biener <richard.guenther@gmail.com> 于2022年11月10日周四 16:56写道：


>
> On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
> >
> > Hi Prathamesh and Richard,
> >
> > Thanks for the review and nice suggestions!
> >
> > > > I guess the transform should work as long as mask is same for both
> > > > vectors even if it's
> > > > not constant ?
> > >
> > > Yes, please change accordingly (and maybe push separately).
> > >
> >
> > Removed VECTOR_CST for integer ops.
> >
> > > > If this transform is meant only for VLS vectors, I guess you should
> > > > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > otherwise it will crash for VLA vectors.
> > >
> > > I suppose it's difficult to create a VLA permute that covers all elements
> > > and that is not trivial though.  But indeed add ().is_constant to the
> > > VECTOR_FLOAT_TYPE_P guard.
> >
> > Added.
> >
> > > Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> > > (as said I can't think of a non-full encoding that isn't trivial
> > > but covers all elements) and then simply .qsort () the vector_builder
> > > (it derives
> > > from vec<>) so the scan is O(n log n).
> >
> > The .qsort () approach requires an extra cmp_func that IMO would not
> > be feasible to be implemented in match.pd (I suppose lambda function
> > would not be a good idea either).
> > Another solution would be using hash_set but it does not work here for
> > int64_t or poly_int64 type.
> > So I kept current O(n^2) simple code here, and I suppose usually the
> > permutation indices would be a small number even for O(n^2)
> > complexity.
>
> Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I think
> a lambda function is fine to use.  The alternative (used by the vectorizer
> in some places) is to use sth like
>
>  auto_sbitmap seen (nelts);
>  for (i = 0; i < nelts; i++)
>    {
>      if (!bitmap_set_bit (seen, i))
>        break;
>      count++;
>    }
>  full_perm_p = count == nelts;
>
> I'll note that you should still check .encoding ().encoded_full_vector_p ()
> and only bother to check that case, that's a very simple check.
>
> >
> > Attached updated patch.
> >
> > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> 于2022年11月8日周二 22:38写道：
> >
> >
> > >
> > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > This is a follow-up patch for PR98167
> > > > >
> > > > > The sequence
> > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > >      c3 = c1 op c2
> > > > > can be optimized to
> > > > >      c = a op b
> > > > >      c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > for all integer vector operation, and float operation with
> > > > > full permutation.
> > > > >
> > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > >
> > > > > Ok for trunk?
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         PR target/98167
> > > > >         * match.pd: New perm + vector op patterns for int and fp vector.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >         PR target/98167
> > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > ---
> > > > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
> > > > >  2 files changed, 93 insertions(+)
> > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
> > > > >
> > > > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > > > index 194ba8f5188..b85ad34f609 100644
> > > > > --- a/gcc/match.pd
> > > > > +++ b/gcc/match.pd
> > > > > @@ -8189,3 +8189,52 @@ and,
> > > > >   (bit_and (negate @0) integer_onep@1)
> > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > >    (bit_and @0 @1)))
> > > > > +
> > > > > +/* Optimize
> > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > +   c3 = c1 op c2
> > > > > +   -->
> > > > > +   c = a op b
> > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > +   For all integer non-div operations.  */
> > > > > +(for op (plus minus mult bit_and bit_ior bit_xor
> > > > > +        lshift rshift)
> > > > > + (simplify
> > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > Just wondering, why should mask be CST here ?
> > > > I guess the transform should work as long as mask is same for both
> > > > vectors even if it's
> > > > not constant ?
> > >
> > > Yes, please change accordingly (and maybe push separately).
> > >
> > > > > +
> > > > > +/* Similar for float arithmetic when permutation constant covers
> > > > > +   all vector elements.  */
> > > > > +(for op (plus minus mult)
> > > > > + (simplify
> > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > +     (with
> > > > > +      {
> > > > > +       tree perm_cst = @2;
> > > > > +       vec_perm_builder builder;
> > > > > +       bool full_perm_p = false;
> > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > +         {
> > > > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > > > +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > > > If this transform is meant only for VLS vectors, I guess you should
> > > > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > otherwise it will crash for VLA vectors.
> > >
> > > I suppose it's difficult to create a VLA permute that covers all elements
> > > and that is not trivial though.  But indeed add ().is_constant to the
> > > VECTOR_FLOAT_TYPE_P guard.
> > >
> > > >
> > > > Thanks,
> > > > Prathamesh
> > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > +
> > > > > +           /* Check if perm indices covers all vector elements.  */
> > > > > +           int count = 0, i, j;
> > > > > +           for (i = 0; i < nelts; i++)
> > > > > +             for (j = 0; j < nelts; j++)
> > >
> > > Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> > > (as said I can't think of a non-full encoding that isn't trivial
> > > but covers all elements) and then simply .qsort () the vector_builder
> > > (it derives
> > > from vec<>) so the scan is O(n log n).
> > >
> > > Maybe Richard has a better idea here though.
> > >
> > > Otherwise looks OK, though with these kind of (* (op ..) (op ..)) patterns it's
> > > always that they explode the match decision tree, we'd ideally have a way to
> > > match those with (op ..) (op ..) first to be able to share more of the matching
> > > code.  That said, match.pd is a less than ideal place for these (but mostly
> > > because of the way we code generate *-match.cc)
> > >
> > > Richard.
> > >
> > > > > +               {
> > > > > +                 if (sel[j].to_constant () == i)
> > > > > +                   {
> > > > > +                     count++;
> > > > > +                     break;
> > > > > +                   }
> > > > > +               }
> > > > > +           full_perm_p = count == nelts;
> > > > > +         }
> > > > > +       }
> > > > > +       (if (full_perm_p)
> > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > new file mode 100644
> > > > > index 00000000000..40e0ac11332
> > > > > --- /dev/null
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > @@ -0,0 +1,44 @@
> > > > > +/* PR target/98167 */
> > > > > +/* { dg-do compile } */
> > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > +
> > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > > +
> > > > > +#define VEC_PERM_4 \
> > > > > +  2, 3, 1, 0
> > > > > +#define VEC_PERM_8 \
> > > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > > +#define VEC_PERM_16 \
> > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > > +
> > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> > > > > +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> > > > > +                                             v##size##s##type b) \
> > > > > +  { \
> > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > +                                                  VEC_PERM_##size); \
> > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > +                                                  VEC_PERM_##size); \
> > > > > +    return a1 op b1; \
> > > > > +  }
> > > > > +
> > > > > +#define INT_PERMS(op, name) \
> > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > +
> > > > > +#define FP_PERMS(op, name) \
> > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > +
> > > > > +INT_PERMS (+, add)
> > > > > +INT_PERMS (-, sub)
> > > > > +INT_PERMS (*, mul)
> > > > > +INT_PERMS (|, ior)
> > > > > +INT_PERMS (^, xor)
> > > > > +INT_PERMS (&, and)
> > > > > +INT_PERMS (<<, shl)
> > > > > +INT_PERMS (>>, shr)
> > > > > +FP_PERMS (+, add)
> > > > > +FP_PERMS (-, sub)
> > > > > +FP_PERMS (*, mul)
> > > > > +
> > > > > --
> > > > > 2.18.1
> > > > >

[-- Attachment #2: 0001-Optimize-VEC_PERM_EXPR-with-same-permutation-index-a.patch --]
[-- Type: application/x-patch, Size: 4129 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-10 14:27         ` Hongyu Wang
@ 2022-11-14 14:53           ` Richard Biener
  2022-11-16 15:25             ` Tamar Christina
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2022-11-14 14:53 UTC (permalink / raw)
  To: Hongyu Wang
  Cc: Prathamesh Kulkarni, Richard Sandiford, Hongyu Wang, hongtao.liu,
	gcc-patches

On Thu, Nov 10, 2022 at 3:27 PM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
>
> > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I think
> > a lambda function is fine to use.  The alternative (used by the vectorizer
> > in some places) is to use sth like
> >
> >  auto_sbitmap seen (nelts);
> >  for (i = 0; i < nelts; i++)
> >    {
> >      if (!bitmap_set_bit (seen, i))
> >        break;
> >      count++;
> >    }
> >  full_perm_p = count == nelts;
> >
> > I'll note that you should still check .encoding ().encoded_full_vector_p ()
> > and only bother to check that case, that's a very simple check.
>
> Thanks for the good example! We also tried using wide_int as a bitmask
> but your code looks more simple and reasonable.
>
> Updated the patch accordingly.

OK.

Thanks,
Richard.

> Richard Biener <richard.guenther@gmail.com> 于2022年11月10日周四 16:56写道：
>
>
> >
> > On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
> > >
> > > Hi Prathamesh and Richard,
> > >
> > > Thanks for the review and nice suggestions!
> > >
> > > > > I guess the transform should work as long as mask is same for both
> > > > > vectors even if it's
> > > > > not constant ?
> > > >
> > > > Yes, please change accordingly (and maybe push separately).
> > > >
> > >
> > > Removed VECTOR_CST for integer ops.
> > >
> > > > > If this transform is meant only for VLS vectors, I guess you should
> > > > > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > otherwise it will crash for VLA vectors.
> > > >
> > > > I suppose it's difficult to create a VLA permute that covers all elements
> > > > and that is not trivial though.  But indeed add ().is_constant to the
> > > > VECTOR_FLOAT_TYPE_P guard.
> > >
> > > Added.
> > >
> > > > Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> > > > (as said I can't think of a non-full encoding that isn't trivial
> > > > but covers all elements) and then simply .qsort () the vector_builder
> > > > (it derives
> > > > from vec<>) so the scan is O(n log n).
> > >
> > > The .qsort () approach requires an extra cmp_func that IMO would not
> > > be feasible to be implemented in match.pd (I suppose lambda function
> > > would not be a good idea either).
> > > Another solution would be using hash_set but it does not work here for
> > > int64_t or poly_int64 type.
> > > So I kept current O(n^2) simple code here, and I suppose usually the
> > > permutation indices would be a small number even for O(n^2)
> > > complexity.
> >
> > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I think
> > a lambda function is fine to use.  The alternative (used by the vectorizer
> > in some places) is to use sth like
> >
> >  auto_sbitmap seen (nelts);
> >  for (i = 0; i < nelts; i++)
> >    {
> >      if (!bitmap_set_bit (seen, i))
> >        break;
> >      count++;
> >    }
> >  full_perm_p = count == nelts;
> >
> > I'll note that you should still check .encoding ().encoded_full_vector_p ()
> > and only bother to check that case, that's a very simple check.
> >
> > >
> > > Attached updated patch.
> > >
> > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> 于2022年11月8日周二 22:38写道：
> > >
> > >
> > > >
> > > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > This is a follow-up patch for PR98167
> > > > > >
> > > > > > The sequence
> > > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > >      c3 = c1 op c2
> > > > > > can be optimized to
> > > > > >      c = a op b
> > > > > >      c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > for all integer vector operation, and float operation with
> > > > > > full permutation.
> > > > > >
> > > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > > >
> > > > > > Ok for trunk?
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >         PR target/98167
> > > > > >         * match.pd: New perm + vector op patterns for int and fp vector.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > >         PR target/98167
> > > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > > ---
> > > > > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
> > > > > >  2 files changed, 93 insertions(+)
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > >
> > > > > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > > > > index 194ba8f5188..b85ad34f609 100644
> > > > > > --- a/gcc/match.pd
> > > > > > +++ b/gcc/match.pd
> > > > > > @@ -8189,3 +8189,52 @@ and,
> > > > > >   (bit_and (negate @0) integer_onep@1)
> > > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > > >    (bit_and @0 @1)))
> > > > > > +
> > > > > > +/* Optimize
> > > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > +   c3 = c1 op c2
> > > > > > +   -->
> > > > > > +   c = a op b
> > > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > +   For all integer non-div operations.  */
> > > > > > +(for op (plus minus mult bit_and bit_ior bit_xor
> > > > > > +        lshift rshift)
> > > > > > + (simplify
> > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > > Just wondering, why should mask be CST here ?
> > > > > I guess the transform should work as long as mask is same for both
> > > > > vectors even if it's
> > > > > not constant ?
> > > >
> > > > Yes, please change accordingly (and maybe push separately).
> > > >
> > > > > > +
> > > > > > +/* Similar for float arithmetic when permutation constant covers
> > > > > > +   all vector elements.  */
> > > > > > +(for op (plus minus mult)
> > > > > > + (simplify
> > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > > +     (with
> > > > > > +      {
> > > > > > +       tree perm_cst = @2;
> > > > > > +       vec_perm_builder builder;
> > > > > > +       bool full_perm_p = false;
> > > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > > +         {
> > > > > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > > > > +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > > > > If this transform is meant only for VLS vectors, I guess you should
> > > > > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > otherwise it will crash for VLA vectors.
> > > >
> > > > I suppose it's difficult to create a VLA permute that covers all elements
> > > > and that is not trivial though.  But indeed add ().is_constant to the
> > > > VECTOR_FLOAT_TYPE_P guard.
> > > >
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > > +
> > > > > > +           /* Check if perm indices covers all vector elements.  */
> > > > > > +           int count = 0, i, j;
> > > > > > +           for (i = 0; i < nelts; i++)
> > > > > > +             for (j = 0; j < nelts; j++)
> > > >
> > > > Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> > > > (as said I can't think of a non-full encoding that isn't trivial
> > > > but covers all elements) and then simply .qsort () the vector_builder
> > > > (it derives
> > > > from vec<>) so the scan is O(n log n).
> > > >
> > > > Maybe Richard has a better idea here though.
> > > >
> > > > Otherwise looks OK, though with these kind of (* (op ..) (op ..)) patterns it's
> > > > always that they explode the match decision tree, we'd ideally have a way to
> > > > match those with (op ..) (op ..) first to be able to share more of the matching
> > > > code.  That said, match.pd is a less than ideal place for these (but mostly
> > > > because of the way we code generate *-match.cc)
> > > >
> > > > Richard.
> > > >
> > > > > > +               {
> > > > > > +                 if (sel[j].to_constant () == i)
> > > > > > +                   {
> > > > > > +                     count++;
> > > > > > +                     break;
> > > > > > +                   }
> > > > > > +               }
> > > > > > +           full_perm_p = count == nelts;
> > > > > > +         }
> > > > > > +       }
> > > > > > +       (if (full_perm_p)
> > > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > new file mode 100644
> > > > > > index 00000000000..40e0ac11332
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > @@ -0,0 +1,44 @@
> > > > > > +/* PR target/98167 */
> > > > > > +/* { dg-do compile } */
> > > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > > +
> > > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > > > +
> > > > > > +#define VEC_PERM_4 \
> > > > > > +  2, 3, 1, 0
> > > > > > +#define VEC_PERM_8 \
> > > > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > > > +#define VEC_PERM_16 \
> > > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > > > +
> > > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > > +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> > > > > > +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> > > > > > +                                             v##size##s##type b) \
> > > > > > +  { \
> > > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > > +                                                  VEC_PERM_##size); \
> > > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > > +                                                  VEC_PERM_##size); \
> > > > > > +    return a1 op b1; \
> > > > > > +  }
> > > > > > +
> > > > > > +#define INT_PERMS(op, name) \
> > > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > > +
> > > > > > +#define FP_PERMS(op, name) \
> > > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > > +
> > > > > > +INT_PERMS (+, add)
> > > > > > +INT_PERMS (-, sub)
> > > > > > +INT_PERMS (*, mul)
> > > > > > +INT_PERMS (|, ior)
> > > > > > +INT_PERMS (^, xor)
> > > > > > +INT_PERMS (&, and)
> > > > > > +INT_PERMS (<<, shl)
> > > > > > +INT_PERMS (>>, shr)
> > > > > > +FP_PERMS (+, add)
> > > > > > +FP_PERMS (-, sub)
> > > > > > +FP_PERMS (*, mul)
> > > > > > +
> > > > > > --
> > > > > > 2.18.1
> > > > > >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-14 14:53           ` Richard Biener
@ 2022-11-16 15:25             ` Tamar Christina
  2022-11-16 15:29               ` Richard Biener
  0 siblings, 1 reply; 16+ messages in thread
From: Tamar Christina @ 2022-11-16 15:25 UTC (permalink / raw)
  To: Richard Biener, Hongyu Wang
  Cc: Prathamesh Kulkarni, Richard Sandiford, Hongyu Wang, hongtao.liu,
	gcc-patches

Hi,

This patch is causing several ICEs because it changes the permutes from a single register
permute to a multi register due to the lowering of the expressions to different SSA names.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717

I have a prototype fix which adds a new rule to CSE this back to a single register permute,
but would this be the right solution? It seems hard to later on during expand realize that
the two operands are the same.

It's probably also ok to just block this from happening after vec_lower, however I'm worried that
If it wasn't CSE'd before vec_lower it'll lower it so something much less efficient.

Thanks,
Tamar

> -----Original Message-----
> From: Gcc-patches <gcc-patches-
> bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard
> Biener via Gcc-patches
> Sent: Monday, November 14, 2022 2:53 PM
> To: Hongyu Wang <wwwhhhyyy333@gmail.com>
> Cc: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>; Richard
> Sandiford <Richard.Sandiford@arm.com>; Hongyu Wang
> <hongyu.wang@intel.com>; hongtao.liu@intel.com; gcc-
> patches@gcc.gnu.org
> Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation
> index and operation [PR98167]
> 
> On Thu, Nov 10, 2022 at 3:27 PM Hongyu Wang
> <wwwhhhyyy333@gmail.com> wrote:
> >
> > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > think a lambda function is fine to use.  The alternative (used by
> > > the vectorizer in some places) is to use sth like
> > >
> > >  auto_sbitmap seen (nelts);
> > >  for (i = 0; i < nelts; i++)
> > >    {
> > >      if (!bitmap_set_bit (seen, i))
> > >        break;
> > >      count++;
> > >    }
> > >  full_perm_p = count == nelts;
> > >
> > > I'll note that you should still check .encoding
> > > ().encoded_full_vector_p () and only bother to check that case, that's a
> very simple check.
> >
> > Thanks for the good example! We also tried using wide_int as a bitmask
> > but your code looks more simple and reasonable.
> >
> > Updated the patch accordingly.
> 
> OK.
> 
> Thanks,
> Richard.
> 
> > Richard Biener <richard.guenther@gmail.com> 于2022年11月10日周四
> 16:56写道：
> >
> >
> > >
> > > On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang
> <wwwhhhyyy333@gmail.com> wrote:
> > > >
> > > > Hi Prathamesh and Richard,
> > > >
> > > > Thanks for the review and nice suggestions!
> > > >
> > > > > > I guess the transform should work as long as mask is same for
> > > > > > both vectors even if it's not constant ?
> > > > >
> > > > > Yes, please change accordingly (and maybe push separately).
> > > > >
> > > >
> > > > Removed VECTOR_CST for integer ops.
> > > >
> > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > otherwise it will crash for VLA vectors.
> > > > >
> > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > elements and that is not trivial though.  But indeed add
> > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > >
> > > > Added.
> > > >
> > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > encoding that isn't trivial but covers all elements) and then
> > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > the scan is O(n log n).
> > > >
> > > > The .qsort () approach requires an extra cmp_func that IMO would
> > > > not be feasible to be implemented in match.pd (I suppose lambda
> > > > function would not be a good idea either).
> > > > Another solution would be using hash_set but it does not work here
> > > > for int64_t or poly_int64 type.
> > > > So I kept current O(n^2) simple code here, and I suppose usually
> > > > the permutation indices would be a small number even for O(n^2)
> > > > complexity.
> > >
> > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > think a lambda function is fine to use.  The alternative (used by
> > > the vectorizer in some places) is to use sth like
> > >
> > >  auto_sbitmap seen (nelts);
> > >  for (i = 0; i < nelts; i++)
> > >    {
> > >      if (!bitmap_set_bit (seen, i))
> > >        break;
> > >      count++;
> > >    }
> > >  full_perm_p = count == nelts;
> > >
> > > I'll note that you should still check .encoding
> > > ().encoded_full_vector_p () and only bother to check that case, that's a
> very simple check.
> > >
> > > >
> > > > Attached updated patch.
> > > >
> > > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>
> > > > 于2022年11月8日周二 22:38写道：
> > > >
> > > >
> > > > >
> > > > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via
> > > > > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > > > > >
> > > > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > This is a follow-up patch for PR98167
> > > > > > >
> > > > > > > The sequence
> > > > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > >      c3 = c1 op c2
> > > > > > > can be optimized to
> > > > > > >      c = a op b
> > > > > > >      c3 = VEC_PERM_EXPR (c, c, mask) for all integer vector
> > > > > > > operation, and float operation with full permutation.
> > > > > > >
> > > > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > > > >
> > > > > > > Ok for trunk?
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > >         PR target/98167
> > > > > > >         * match.pd: New perm + vector op patterns for int and fp
> vector.
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > >         PR target/98167
> > > > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > > > ---
> > > > > > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > > > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44
> > > > > > > ++++++++++++++++++++++
> > > > > > >  2 files changed, 93 insertions(+)  create mode 100644
> > > > > > > gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > >
> > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > > > > 194ba8f5188..b85ad34f609 100644
> > > > > > > --- a/gcc/match.pd
> > > > > > > +++ b/gcc/match.pd
> > > > > > > @@ -8189,3 +8189,52 @@ and,
> > > > > > >   (bit_and (negate @0) integer_onep@1)
> > > > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > > > >    (bit_and @0 @1)))
> > > > > > > +
> > > > > > > +/* Optimize
> > > > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > +   c3 = c1 op c2
> > > > > > > +   -->
> > > > > > > +   c = a op b
> > > > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > > +   For all integer non-div operations.  */ (for op (plus
> > > > > > > +minus mult bit_and bit_ior bit_xor
> > > > > > > +        lshift rshift)
> > > > > > > + (simplify
> > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> VECTOR_CST@2))
> > > > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > > > Just wondering, why should mask be CST here ?
> > > > > > I guess the transform should work as long as mask is same for
> > > > > > both vectors even if it's not constant ?
> > > > >
> > > > > Yes, please change accordingly (and maybe push separately).
> > > > >
> > > > > > > +
> > > > > > > +/* Similar for float arithmetic when permutation constant covers
> > > > > > > +   all vector elements.  */ (for op (plus minus mult)
> > > > > > > +(simplify
> > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> VECTOR_CST@2))
> > > > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > > > +     (with
> > > > > > > +      {
> > > > > > > +       tree perm_cst = @2;
> > > > > > > +       vec_perm_builder builder;
> > > > > > > +       bool full_perm_p = false;
> > > > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > > > +         {
> > > > > > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > > > > > +           int nelts = TYPE_VECTOR_SUBPARTS
> > > > > > > +(type).to_constant ();
> > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > otherwise it will crash for VLA vectors.
> > > > >
> > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > elements and that is not trivial though.  But indeed add
> > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > > > +
> > > > > > > +           /* Check if perm indices covers all vector elements.  */
> > > > > > > +           int count = 0, i, j;
> > > > > > > +           for (i = 0; i < nelts; i++)
> > > > > > > +             for (j = 0; j < nelts; j++)
> > > > >
> > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > encoding that isn't trivial but covers all elements) and then
> > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > the scan is O(n log n).
> > > > >
> > > > > Maybe Richard has a better idea here though.
> > > > >
> > > > > Otherwise looks OK, though with these kind of (* (op ..) (op
> > > > > ..)) patterns it's always that they explode the match decision
> > > > > tree, we'd ideally have a way to match those with (op ..) (op
> > > > > ..) first to be able to share more of the matching code.  That
> > > > > said, match.pd is a less than ideal place for these (but mostly
> > > > > because of the way we code generate *-match.cc)
> > > > >
> > > > > Richard.
> > > > >
> > > > > > > +               {
> > > > > > > +                 if (sel[j].to_constant () == i)
> > > > > > > +                   {
> > > > > > > +                     count++;
> > > > > > > +                     break;
> > > > > > > +                   }
> > > > > > > +               }
> > > > > > > +           full_perm_p = count == nelts;
> > > > > > > +         }
> > > > > > > +       }
> > > > > > > +       (if (full_perm_p)
> > > > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > new file mode 100644
> > > > > > > index 00000000000..40e0ac11332
> > > > > > > --- /dev/null
> > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > @@ -0,0 +1,44 @@
> > > > > > > +/* PR target/98167 */
> > > > > > > +/* { dg-do compile } */
> > > > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > > > +
> > > > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > > > > +
> > > > > > > +#define VEC_PERM_4 \
> > > > > > > +  2, 3, 1, 0
> > > > > > > +#define VEC_PERM_8 \
> > > > > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > > > > +#define VEC_PERM_16 \
> > > > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > > > > +
> > > > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > > > +  typedef type v##size##s##type __attribute__
> > > > > > > +((vector_size(4*size))); \
> > > > > > > +  v##size##s##type type##foo##size##i_##name
> (v##size##s##type a, \
> > > > > > > +
> > > > > > > +v##size##s##type b) \
> > > > > > > +  { \
> > > > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > +    return a1 op b1; \
> > > > > > > +  }
> > > > > > > +
> > > > > > > +#define INT_PERMS(op, name) \
> > > > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > > > +
> > > > > > > +#define FP_PERMS(op, name) \
> > > > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > > > +
> > > > > > > +INT_PERMS (+, add)
> > > > > > > +INT_PERMS (-, sub)
> > > > > > > +INT_PERMS (*, mul)
> > > > > > > +INT_PERMS (|, ior)
> > > > > > > +INT_PERMS (^, xor)
> > > > > > > +INT_PERMS (&, and)
> > > > > > > +INT_PERMS (<<, shl)
> > > > > > > +INT_PERMS (>>, shr)
> > > > > > > +FP_PERMS (+, add)
> > > > > > > +FP_PERMS (-, sub)
> > > > > > > +FP_PERMS (*, mul)
> > > > > > > +
> > > > > > > --
> > > > > > > 2.18.1
> > > > > > >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-16 15:25             ` Tamar Christina
@ 2022-11-16 15:29               ` Richard Biener
  2022-11-16 15:30                 ` Richard Biener
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Biener @ 2022-11-16 15:29 UTC (permalink / raw)
  To: Tamar Christina
  Cc: Hongyu Wang, Prathamesh Kulkarni, Richard Sandiford, Hongyu Wang,
	hongtao.liu, gcc-patches

On Wed, Nov 16, 2022 at 4:25 PM Tamar Christina <Tamar.Christina@arm.com> wrote:
>
> Hi,
>
> This patch is causing several ICEs because it changes the permutes from a single register
> permute to a multi register due to the lowering of the expressions to different SSA names.
>
> See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717
>
> I have a prototype fix which adds a new rule to CSE this back to a single register permute,
> but would this be the right solution? It seems hard to later on during expand realize that
> the two operands are the same.
>
> It's probably also ok to just block this from happening after vec_lower, however I'm worried that
> If it wasn't CSE'd before vec_lower it'll lower it so something much less efficient.

You can use

 (vec_perm (op@7 @0 @1) @3)

to avoid this issue.

> Thanks,
> Tamar
>
> > -----Original Message-----
> > From: Gcc-patches <gcc-patches-
> > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard
> > Biener via Gcc-patches
> > Sent: Monday, November 14, 2022 2:53 PM
> > To: Hongyu Wang <wwwhhhyyy333@gmail.com>
> > Cc: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>; Richard
> > Sandiford <Richard.Sandiford@arm.com>; Hongyu Wang
> > <hongyu.wang@intel.com>; hongtao.liu@intel.com; gcc-
> > patches@gcc.gnu.org
> > Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation
> > index and operation [PR98167]
> >
> > On Thu, Nov 10, 2022 at 3:27 PM Hongyu Wang
> > <wwwhhhyyy333@gmail.com> wrote:
> > >
> > > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > > think a lambda function is fine to use.  The alternative (used by
> > > > the vectorizer in some places) is to use sth like
> > > >
> > > >  auto_sbitmap seen (nelts);
> > > >  for (i = 0; i < nelts; i++)
> > > >    {
> > > >      if (!bitmap_set_bit (seen, i))
> > > >        break;
> > > >      count++;
> > > >    }
> > > >  full_perm_p = count == nelts;
> > > >
> > > > I'll note that you should still check .encoding
> > > > ().encoded_full_vector_p () and only bother to check that case, that's a
> > very simple check.
> > >
> > > Thanks for the good example! We also tried using wide_int as a bitmask
> > > but your code looks more simple and reasonable.
> > >
> > > Updated the patch accordingly.
> >
> > OK.
> >
> > Thanks,
> > Richard.
> >
> > > Richard Biener <richard.guenther@gmail.com> 于2022年11月10日周四
> > 16:56写道：
> > >
> > >
> > > >
> > > > On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang
> > <wwwhhhyyy333@gmail.com> wrote:
> > > > >
> > > > > Hi Prathamesh and Richard,
> > > > >
> > > > > Thanks for the review and nice suggestions!
> > > > >
> > > > > > > I guess the transform should work as long as mask is same for
> > > > > > > both vectors even if it's not constant ?
> > > > > >
> > > > > > Yes, please change accordingly (and maybe push separately).
> > > > > >
> > > > >
> > > > > Removed VECTOR_CST for integer ops.
> > > > >
> > > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > > otherwise it will crash for VLA vectors.
> > > > > >
> > > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > > elements and that is not trivial though.  But indeed add
> > > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > >
> > > > > Added.
> > > > >
> > > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > > encoding that isn't trivial but covers all elements) and then
> > > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > > the scan is O(n log n).
> > > > >
> > > > > The .qsort () approach requires an extra cmp_func that IMO would
> > > > > not be feasible to be implemented in match.pd (I suppose lambda
> > > > > function would not be a good idea either).
> > > > > Another solution would be using hash_set but it does not work here
> > > > > for int64_t or poly_int64 type.
> > > > > So I kept current O(n^2) simple code here, and I suppose usually
> > > > > the permutation indices would be a small number even for O(n^2)
> > > > > complexity.
> > > >
> > > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > > think a lambda function is fine to use.  The alternative (used by
> > > > the vectorizer in some places) is to use sth like
> > > >
> > > >  auto_sbitmap seen (nelts);
> > > >  for (i = 0; i < nelts; i++)
> > > >    {
> > > >      if (!bitmap_set_bit (seen, i))
> > > >        break;
> > > >      count++;
> > > >    }
> > > >  full_perm_p = count == nelts;
> > > >
> > > > I'll note that you should still check .encoding
> > > > ().encoded_full_vector_p () and only bother to check that case, that's a
> > very simple check.
> > > >
> > > > >
> > > > > Attached updated patch.
> > > > >
> > > > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>
> > > > > 于2022年11月8日周二 22:38写道：
> > > > >
> > > > >
> > > > > >
> > > > > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via
> > > > > > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > > > > > >
> > > > > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > This is a follow-up patch for PR98167
> > > > > > > >
> > > > > > > > The sequence
> > > > > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > >      c3 = c1 op c2
> > > > > > > > can be optimized to
> > > > > > > >      c = a op b
> > > > > > > >      c3 = VEC_PERM_EXPR (c, c, mask) for all integer vector
> > > > > > > > operation, and float operation with full permutation.
> > > > > > > >
> > > > > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > > > > >
> > > > > > > > Ok for trunk?
> > > > > > > >
> > > > > > > > gcc/ChangeLog:
> > > > > > > >
> > > > > > > >         PR target/98167
> > > > > > > >         * match.pd: New perm + vector op patterns for int and fp
> > vector.
> > > > > > > >
> > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > >
> > > > > > > >         PR target/98167
> > > > > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > > > > ---
> > > > > > > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > > > > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44
> > > > > > > > ++++++++++++++++++++++
> > > > > > > >  2 files changed, 93 insertions(+)  create mode 100644
> > > > > > > > gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > >
> > > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > > > > > 194ba8f5188..b85ad34f609 100644
> > > > > > > > --- a/gcc/match.pd
> > > > > > > > +++ b/gcc/match.pd
> > > > > > > > @@ -8189,3 +8189,52 @@ and,
> > > > > > > >   (bit_and (negate @0) integer_onep@1)
> > > > > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > > > > >    (bit_and @0 @1)))
> > > > > > > > +
> > > > > > > > +/* Optimize
> > > > > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > > +   c3 = c1 op c2
> > > > > > > > +   -->
> > > > > > > > +   c = a op b
> > > > > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > > > +   For all integer non-div operations.  */ (for op (plus
> > > > > > > > +minus mult bit_and bit_ior bit_xor
> > > > > > > > +        lshift rshift)
> > > > > > > > + (simplify
> > > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> > VECTOR_CST@2))
> > > > > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > > > > Just wondering, why should mask be CST here ?
> > > > > > > I guess the transform should work as long as mask is same for
> > > > > > > both vectors even if it's not constant ?
> > > > > >
> > > > > > Yes, please change accordingly (and maybe push separately).
> > > > > >
> > > > > > > > +
> > > > > > > > +/* Similar for float arithmetic when permutation constant covers
> > > > > > > > +   all vector elements.  */ (for op (plus minus mult)
> > > > > > > > +(simplify
> > > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> > VECTOR_CST@2))
> > > > > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > > > > +     (with
> > > > > > > > +      {
> > > > > > > > +       tree perm_cst = @2;
> > > > > > > > +       vec_perm_builder builder;
> > > > > > > > +       bool full_perm_p = false;
> > > > > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > > > > +         {
> > > > > > > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > > > > > > +           int nelts = TYPE_VECTOR_SUBPARTS
> > > > > > > > +(type).to_constant ();
> > > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > > otherwise it will crash for VLA vectors.
> > > > > >
> > > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > > elements and that is not trivial though.  But indeed add
> > > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Prathamesh
> > > > > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > > > > +
> > > > > > > > +           /* Check if perm indices covers all vector elements.  */
> > > > > > > > +           int count = 0, i, j;
> > > > > > > > +           for (i = 0; i < nelts; i++)
> > > > > > > > +             for (j = 0; j < nelts; j++)
> > > > > >
> > > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > > encoding that isn't trivial but covers all elements) and then
> > > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > > the scan is O(n log n).
> > > > > >
> > > > > > Maybe Richard has a better idea here though.
> > > > > >
> > > > > > Otherwise looks OK, though with these kind of (* (op ..) (op
> > > > > > ..)) patterns it's always that they explode the match decision
> > > > > > tree, we'd ideally have a way to match those with (op ..) (op
> > > > > > ..) first to be able to share more of the matching code.  That
> > > > > > said, match.pd is a less than ideal place for these (but mostly
> > > > > > because of the way we code generate *-match.cc)
> > > > > >
> > > > > > Richard.
> > > > > >
> > > > > > > > +               {
> > > > > > > > +                 if (sel[j].to_constant () == i)
> > > > > > > > +                   {
> > > > > > > > +                     count++;
> > > > > > > > +                     break;
> > > > > > > > +                   }
> > > > > > > > +               }
> > > > > > > > +           full_perm_p = count == nelts;
> > > > > > > > +         }
> > > > > > > > +       }
> > > > > > > > +       (if (full_perm_p)
> > > > > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > new file mode 100644
> > > > > > > > index 00000000000..40e0ac11332
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > @@ -0,0 +1,44 @@
> > > > > > > > +/* PR target/98167 */
> > > > > > > > +/* { dg-do compile } */
> > > > > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > > > > +
> > > > > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > > > > > +
> > > > > > > > +#define VEC_PERM_4 \
> > > > > > > > +  2, 3, 1, 0
> > > > > > > > +#define VEC_PERM_8 \
> > > > > > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > > > > > +#define VEC_PERM_16 \
> > > > > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > > > > > +
> > > > > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > > > > +  typedef type v##size##s##type __attribute__
> > > > > > > > +((vector_size(4*size))); \
> > > > > > > > +  v##size##s##type type##foo##size##i_##name
> > (v##size##s##type a, \
> > > > > > > > +
> > > > > > > > +v##size##s##type b) \
> > > > > > > > +  { \
> > > > > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > > +    return a1 op b1; \
> > > > > > > > +  }
> > > > > > > > +
> > > > > > > > +#define INT_PERMS(op, name) \
> > > > > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > > > > +
> > > > > > > > +#define FP_PERMS(op, name) \
> > > > > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > > > > +
> > > > > > > > +INT_PERMS (+, add)
> > > > > > > > +INT_PERMS (-, sub)
> > > > > > > > +INT_PERMS (*, mul)
> > > > > > > > +INT_PERMS (|, ior)
> > > > > > > > +INT_PERMS (^, xor)
> > > > > > > > +INT_PERMS (&, and)
> > > > > > > > +INT_PERMS (<<, shl)
> > > > > > > > +INT_PERMS (>>, shr)
> > > > > > > > +FP_PERMS (+, add)
> > > > > > > > +FP_PERMS (-, sub)
> > > > > > > > +FP_PERMS (*, mul)
> > > > > > > > +
> > > > > > > > --
> > > > > > > > 2.18.1
> > > > > > > >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-16 15:29               ` Richard Biener
@ 2022-11-16 15:30                 ` Richard Biener
  2022-11-16 15:34                   ` Tamar Christina
  2022-11-16 15:37                   ` Jakub Jelinek
  0 siblings, 2 replies; 16+ messages in thread
From: Richard Biener @ 2022-11-16 15:30 UTC (permalink / raw)
  To: Tamar Christina
  Cc: Hongyu Wang, Prathamesh Kulkarni, Richard Sandiford, Hongyu Wang,
	hongtao.liu, gcc-patches

On Wed, Nov 16, 2022 at 4:29 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 4:25 PM Tamar Christina <Tamar.Christina@arm.com> wrote:
> >
> > Hi,
> >
> > This patch is causing several ICEs because it changes the permutes from a single register
> > permute to a multi register due to the lowering of the expressions to different SSA names.
> >
> > See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717
> >
> > I have a prototype fix which adds a new rule to CSE this back to a single register permute,
> > but would this be the right solution? It seems hard to later on during expand realize that
> > the two operands are the same.
> >
> > It's probably also ok to just block this from happening after vec_lower, however I'm worried that
> > If it wasn't CSE'd before vec_lower it'll lower it so something much less efficient.
>
> You can use
>
>  (vec_perm (op@7 @0 @1) @3)

Err, (vec_perm (op@7 @0 @1) @7) obviously.

> to avoid this issue.
>
> > Thanks,
> > Tamar
> >
> > > -----Original Message-----
> > > From: Gcc-patches <gcc-patches-
> > > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard
> > > Biener via Gcc-patches
> > > Sent: Monday, November 14, 2022 2:53 PM
> > > To: Hongyu Wang <wwwhhhyyy333@gmail.com>
> > > Cc: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>; Richard
> > > Sandiford <Richard.Sandiford@arm.com>; Hongyu Wang
> > > <hongyu.wang@intel.com>; hongtao.liu@intel.com; gcc-
> > > patches@gcc.gnu.org
> > > Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation
> > > index and operation [PR98167]
> > >
> > > On Thu, Nov 10, 2022 at 3:27 PM Hongyu Wang
> > > <wwwhhhyyy333@gmail.com> wrote:
> > > >
> > > > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > > > think a lambda function is fine to use.  The alternative (used by
> > > > > the vectorizer in some places) is to use sth like
> > > > >
> > > > >  auto_sbitmap seen (nelts);
> > > > >  for (i = 0; i < nelts; i++)
> > > > >    {
> > > > >      if (!bitmap_set_bit (seen, i))
> > > > >        break;
> > > > >      count++;
> > > > >    }
> > > > >  full_perm_p = count == nelts;
> > > > >
> > > > > I'll note that you should still check .encoding
> > > > > ().encoded_full_vector_p () and only bother to check that case, that's a
> > > very simple check.
> > > >
> > > > Thanks for the good example! We also tried using wide_int as a bitmask
> > > > but your code looks more simple and reasonable.
> > > >
> > > > Updated the patch accordingly.
> > >
> > > OK.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > Richard Biener <richard.guenther@gmail.com> 于2022年11月10日周四
> > > 16:56写道：
> > > >
> > > >
> > > > >
> > > > > On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang
> > > <wwwhhhyyy333@gmail.com> wrote:
> > > > > >
> > > > > > Hi Prathamesh and Richard,
> > > > > >
> > > > > > Thanks for the review and nice suggestions!
> > > > > >
> > > > > > > > I guess the transform should work as long as mask is same for
> > > > > > > > both vectors even if it's not constant ?
> > > > > > >
> > > > > > > Yes, please change accordingly (and maybe push separately).
> > > > > > >
> > > > > >
> > > > > > Removed VECTOR_CST for integer ops.
> > > > > >
> > > > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > > > otherwise it will crash for VLA vectors.
> > > > > > >
> > > > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > > > elements and that is not trivial though.  But indeed add
> > > > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > > >
> > > > > > Added.
> > > > > >
> > > > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > > > encoding that isn't trivial but covers all elements) and then
> > > > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > > > the scan is O(n log n).
> > > > > >
> > > > > > The .qsort () approach requires an extra cmp_func that IMO would
> > > > > > not be feasible to be implemented in match.pd (I suppose lambda
> > > > > > function would not be a good idea either).
> > > > > > Another solution would be using hash_set but it does not work here
> > > > > > for int64_t or poly_int64 type.
> > > > > > So I kept current O(n^2) simple code here, and I suppose usually
> > > > > > the permutation indices would be a small number even for O(n^2)
> > > > > > complexity.
> > > > >
> > > > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > > > think a lambda function is fine to use.  The alternative (used by
> > > > > the vectorizer in some places) is to use sth like
> > > > >
> > > > >  auto_sbitmap seen (nelts);
> > > > >  for (i = 0; i < nelts; i++)
> > > > >    {
> > > > >      if (!bitmap_set_bit (seen, i))
> > > > >        break;
> > > > >      count++;
> > > > >    }
> > > > >  full_perm_p = count == nelts;
> > > > >
> > > > > I'll note that you should still check .encoding
> > > > > ().encoded_full_vector_p () and only bother to check that case, that's a
> > > very simple check.
> > > > >
> > > > > >
> > > > > > Attached updated patch.
> > > > > >
> > > > > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>
> > > > > > 于2022年11月8日周二 22:38写道：
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via
> > > > > > > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > > > > > > >
> > > > > > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > This is a follow-up patch for PR98167
> > > > > > > > >
> > > > > > > > > The sequence
> > > > > > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > > >      c3 = c1 op c2
> > > > > > > > > can be optimized to
> > > > > > > > >      c = a op b
> > > > > > > > >      c3 = VEC_PERM_EXPR (c, c, mask) for all integer vector
> > > > > > > > > operation, and float operation with full permutation.
> > > > > > > > >
> > > > > > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > > > > > >
> > > > > > > > > Ok for trunk?
> > > > > > > > >
> > > > > > > > > gcc/ChangeLog:
> > > > > > > > >
> > > > > > > > >         PR target/98167
> > > > > > > > >         * match.pd: New perm + vector op patterns for int and fp
> > > vector.
> > > > > > > > >
> > > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > > >
> > > > > > > > >         PR target/98167
> > > > > > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > > > > > ---
> > > > > > > > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > > > > > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44
> > > > > > > > > ++++++++++++++++++++++
> > > > > > > > >  2 files changed, 93 insertions(+)  create mode 100644
> > > > > > > > > gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > >
> > > > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > > > > > > 194ba8f5188..b85ad34f609 100644
> > > > > > > > > --- a/gcc/match.pd
> > > > > > > > > +++ b/gcc/match.pd
> > > > > > > > > @@ -8189,3 +8189,52 @@ and,
> > > > > > > > >   (bit_and (negate @0) integer_onep@1)
> > > > > > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > > > > > >    (bit_and @0 @1)))
> > > > > > > > > +
> > > > > > > > > +/* Optimize
> > > > > > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > > > +   c3 = c1 op c2
> > > > > > > > > +   -->
> > > > > > > > > +   c = a op b
> > > > > > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > > > > +   For all integer non-div operations.  */ (for op (plus
> > > > > > > > > +minus mult bit_and bit_ior bit_xor
> > > > > > > > > +        lshift rshift)
> > > > > > > > > + (simplify
> > > > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> > > VECTOR_CST@2))
> > > > > > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > > > > > Just wondering, why should mask be CST here ?
> > > > > > > > I guess the transform should work as long as mask is same for
> > > > > > > > both vectors even if it's not constant ?
> > > > > > >
> > > > > > > Yes, please change accordingly (and maybe push separately).
> > > > > > >
> > > > > > > > > +
> > > > > > > > > +/* Similar for float arithmetic when permutation constant covers
> > > > > > > > > +   all vector elements.  */ (for op (plus minus mult)
> > > > > > > > > +(simplify
> > > > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> > > VECTOR_CST@2))
> > > > > > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > > > > > +     (with
> > > > > > > > > +      {
> > > > > > > > > +       tree perm_cst = @2;
> > > > > > > > > +       vec_perm_builder builder;
> > > > > > > > > +       bool full_perm_p = false;
> > > > > > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > > > > > +         {
> > > > > > > > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > > > > > > > +           int nelts = TYPE_VECTOR_SUBPARTS
> > > > > > > > > +(type).to_constant ();
> > > > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > > > otherwise it will crash for VLA vectors.
> > > > > > >
> > > > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > > > elements and that is not trivial though.  But indeed add
> > > > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Prathamesh
> > > > > > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > > > > > +
> > > > > > > > > +           /* Check if perm indices covers all vector elements.  */
> > > > > > > > > +           int count = 0, i, j;
> > > > > > > > > +           for (i = 0; i < nelts; i++)
> > > > > > > > > +             for (j = 0; j < nelts; j++)
> > > > > > >
> > > > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > > > encoding that isn't trivial but covers all elements) and then
> > > > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > > > the scan is O(n log n).
> > > > > > >
> > > > > > > Maybe Richard has a better idea here though.
> > > > > > >
> > > > > > > Otherwise looks OK, though with these kind of (* (op ..) (op
> > > > > > > ..)) patterns it's always that they explode the match decision
> > > > > > > tree, we'd ideally have a way to match those with (op ..) (op
> > > > > > > ..) first to be able to share more of the matching code.  That
> > > > > > > said, match.pd is a less than ideal place for these (but mostly
> > > > > > > because of the way we code generate *-match.cc)
> > > > > > >
> > > > > > > Richard.
> > > > > > >
> > > > > > > > > +               {
> > > > > > > > > +                 if (sel[j].to_constant () == i)
> > > > > > > > > +                   {
> > > > > > > > > +                     count++;
> > > > > > > > > +                     break;
> > > > > > > > > +                   }
> > > > > > > > > +               }
> > > > > > > > > +           full_perm_p = count == nelts;
> > > > > > > > > +         }
> > > > > > > > > +       }
> > > > > > > > > +       (if (full_perm_p)
> > > > > > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > > b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > > new file mode 100644
> > > > > > > > > index 00000000000..40e0ac11332
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > > @@ -0,0 +1,44 @@
> > > > > > > > > +/* PR target/98167 */
> > > > > > > > > +/* { dg-do compile } */
> > > > > > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > > > > > +
> > > > > > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > > > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > > > > > > +
> > > > > > > > > +#define VEC_PERM_4 \
> > > > > > > > > +  2, 3, 1, 0
> > > > > > > > > +#define VEC_PERM_8 \
> > > > > > > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > > > > > > +#define VEC_PERM_16 \
> > > > > > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > > > > > > +
> > > > > > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > > > > > +  typedef type v##size##s##type __attribute__
> > > > > > > > > +((vector_size(4*size))); \
> > > > > > > > > +  v##size##s##type type##foo##size##i_##name
> > > (v##size##s##type a, \
> > > > > > > > > +
> > > > > > > > > +v##size##s##type b) \
> > > > > > > > > +  { \
> > > > > > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > > > +    return a1 op b1; \
> > > > > > > > > +  }
> > > > > > > > > +
> > > > > > > > > +#define INT_PERMS(op, name) \
> > > > > > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > > > > > +
> > > > > > > > > +#define FP_PERMS(op, name) \
> > > > > > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > > > > > +
> > > > > > > > > +INT_PERMS (+, add)
> > > > > > > > > +INT_PERMS (-, sub)
> > > > > > > > > +INT_PERMS (*, mul)
> > > > > > > > > +INT_PERMS (|, ior)
> > > > > > > > > +INT_PERMS (^, xor)
> > > > > > > > > +INT_PERMS (&, and)
> > > > > > > > > +INT_PERMS (<<, shl)
> > > > > > > > > +INT_PERMS (>>, shr)
> > > > > > > > > +FP_PERMS (+, add)
> > > > > > > > > +FP_PERMS (-, sub)
> > > > > > > > > +FP_PERMS (*, mul)
> > > > > > > > > +
> > > > > > > > > --
> > > > > > > > > 2.18.1
> > > > > > > > >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-16 15:30                 ` Richard Biener
@ 2022-11-16 15:34                   ` Tamar Christina
  2022-11-16 15:37                   ` Jakub Jelinek
  1 sibling, 0 replies; 16+ messages in thread
From: Tamar Christina @ 2022-11-16 15:34 UTC (permalink / raw)
  To: Richard Biener
  Cc: Hongyu Wang, Prathamesh Kulkarni, Richard Sandiford, Hongyu Wang,
	hongtao.liu, gcc-patches

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Wednesday, November 16, 2022 3:30 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Hongyu Wang <wwwhhhyyy333@gmail.com>; Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org>; Richard Sandiford
> <Richard.Sandiford@arm.com>; Hongyu Wang <hongyu.wang@intel.com>;
> hongtao.liu@intel.com; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation
> index and operation [PR98167]
> 
> On Wed, Nov 16, 2022 at 4:29 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 4:25 PM Tamar Christina
> <Tamar.Christina@arm.com> wrote:
> > >
> > > Hi,
> > >
> > > This patch is causing several ICEs because it changes the permutes
> > > from a single register permute to a multi register due to the lowering of
> the expressions to different SSA names.
> > >
> > > See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717
> > >
> > > I have a prototype fix which adds a new rule to CSE this back to a
> > > single register permute, but would this be the right solution? It
> > > seems hard to later on during expand realize that the two operands are
> the same.
> > >
> > > It's probably also ok to just block this from happening after
> > > vec_lower, however I'm worried that If it wasn't CSE'd before vec_lower
> it'll lower it so something much less efficient.
> >
> > You can use
> >
> >  (vec_perm (op@7 @0 @1) @3)
> 
> Err, (vec_perm (op@7 @0 @1) @7) obviously.

Oh wow, I had no idea you could capture during rewrite!

That's nifty and good to know.

I'll regtest and submit patch.

Thanks,
Tamar

> 
> > to avoid this issue.
> >
> > > Thanks,
> > > Tamar
> > >
> > > > -----Original Message-----
> > > > From: Gcc-patches <gcc-patches-
> > > > bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard
> > > > Biener via Gcc-patches
> > > > Sent: Monday, November 14, 2022 2:53 PM
> > > > To: Hongyu Wang <wwwhhhyyy333@gmail.com>
> > > > Cc: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>; Richard
> > > > Sandiford <Richard.Sandiford@arm.com>; Hongyu Wang
> > > > <hongyu.wang@intel.com>; hongtao.liu@intel.com; gcc-
> > > > patches@gcc.gnu.org
> > > > Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same
> permutation
> > > > index and operation [PR98167]
> > > >
> > > > On Thu, Nov 10, 2022 at 3:27 PM Hongyu Wang
> > > > <wwwhhhyyy333@gmail.com> wrote:
> > > > >
> > > > > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.
> > > > > > I think a lambda function is fine to use.  The alternative
> > > > > > (used by the vectorizer in some places) is to use sth like
> > > > > >
> > > > > >  auto_sbitmap seen (nelts);
> > > > > >  for (i = 0; i < nelts; i++)
> > > > > >    {
> > > > > >      if (!bitmap_set_bit (seen, i))
> > > > > >        break;
> > > > > >      count++;
> > > > > >    }
> > > > > >  full_perm_p = count == nelts;
> > > > > >
> > > > > > I'll note that you should still check .encoding
> > > > > > ().encoded_full_vector_p () and only bother to check that
> > > > > > case, that's a
> > > > very simple check.
> > > > >
> > > > > Thanks for the good example! We also tried using wide_int as a
> > > > > bitmask but your code looks more simple and reasonable.
> > > > >
> > > > > Updated the patch accordingly.
> > > >
> > > > OK.
> > > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > > > Richard Biener <richard.guenther@gmail.com> 于2022年11月10日周
> 四
> > > > 16:56写道：
> > > > >
> > > > >
> > > > > >
> > > > > > On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang
> > > > <wwwhhhyyy333@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi Prathamesh and Richard,
> > > > > > >
> > > > > > > Thanks for the review and nice suggestions!
> > > > > > >
> > > > > > > > > I guess the transform should work as long as mask is
> > > > > > > > > same for both vectors even if it's not constant ?
> > > > > > > >
> > > > > > > > Yes, please change accordingly (and maybe push separately).
> > > > > > > >
> > > > > > >
> > > > > > > Removed VECTOR_CST for integer ops.
> > > > > > >
> > > > > > > > > If this transform is meant only for VLS vectors, I guess
> > > > > > > > > you should bail out if TYPE_VECTOR_SUBPARTS is not
> > > > > > > > > constant, otherwise it will crash for VLA vectors.
> > > > > > > >
> > > > > > > > I suppose it's difficult to create a VLA permute that
> > > > > > > > covers all elements and that is not trivial though.  But
> > > > > > > > indeed add ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > > > >
> > > > > > > Added.
> > > > > > >
> > > > > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > > > > ().encoded_full_vector_p () (as said I can't think of a
> > > > > > > > non-full encoding that isn't trivial but covers all
> > > > > > > > elements) and then simply .qsort () the vector_builder (it
> > > > > > > > derives from vec<>) so the scan is O(n log n).
> > > > > > >
> > > > > > > The .qsort () approach requires an extra cmp_func that IMO
> > > > > > > would not be feasible to be implemented in match.pd (I
> > > > > > > suppose lambda function would not be a good idea either).
> > > > > > > Another solution would be using hash_set but it does not
> > > > > > > work here for int64_t or poly_int64 type.
> > > > > > > So I kept current O(n^2) simple code here, and I suppose
> > > > > > > usually the permutation indices would be a small number even
> > > > > > > for O(n^2) complexity.
> > > > > >
> > > > > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.
> > > > > > I think a lambda function is fine to use.  The alternative
> > > > > > (used by the vectorizer in some places) is to use sth like
> > > > > >
> > > > > >  auto_sbitmap seen (nelts);
> > > > > >  for (i = 0; i < nelts; i++)
> > > > > >    {
> > > > > >      if (!bitmap_set_bit (seen, i))
> > > > > >        break;
> > > > > >      count++;
> > > > > >    }
> > > > > >  full_perm_p = count == nelts;
> > > > > >
> > > > > > I'll note that you should still check .encoding
> > > > > > ().encoded_full_vector_p () and only bother to check that
> > > > > > case, that's a
> > > > very simple check.
> > > > > >
> > > > > > >
> > > > > > > Attached updated patch.
> > > > > > >
> > > > > > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>
> > > > > > > 于2022年11月8日周二 22:38写道：
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via
> > > > > > > > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > This is a follow-up patch for PR98167
> > > > > > > > > >
> > > > > > > > > > The sequence
> > > > > > > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > > > >      c3 = c1 op c2
> > > > > > > > > > can be optimized to
> > > > > > > > > >      c = a op b
> > > > > > > > > >      c3 = VEC_PERM_EXPR (c, c, mask) for all integer
> > > > > > > > > > vector operation, and float operation with full permutation.
> > > > > > > > > >
> > > > > > > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > > > > > > >
> > > > > > > > > > Ok for trunk?
> > > > > > > > > >
> > > > > > > > > > gcc/ChangeLog:
> > > > > > > > > >
> > > > > > > > > >         PR target/98167
> > > > > > > > > >         * match.pd: New perm + vector op patterns for
> > > > > > > > > > int and fp
> > > > vector.
> > > > > > > > > >
> > > > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > > > >
> > > > > > > > > >         PR target/98167
> > > > > > > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > > > > > > ---
> > > > > > > > > >  gcc/match.pd                            | 49
> +++++++++++++++++++++++++
> > > > > > > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44
> > > > > > > > > > ++++++++++++++++++++++
> > > > > > > > > >  2 files changed, 93 insertions(+)  create mode 100644
> > > > > > > > > > gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > > >
> > > > > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > > > > > > > 194ba8f5188..b85ad34f609 100644
> > > > > > > > > > --- a/gcc/match.pd
> > > > > > > > > > +++ b/gcc/match.pd
> > > > > > > > > > @@ -8189,3 +8189,52 @@ and,
> > > > > > > > > >   (bit_and (negate @0) integer_onep@1)
> > > > > > > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > > > > > > >    (bit_and @0 @1)))
> > > > > > > > > > +
> > > > > > > > > > +/* Optimize
> > > > > > > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > > > > +   c3 = c1 op c2
> > > > > > > > > > +   -->
> > > > > > > > > > +   c = a op b
> > > > > > > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > > > > > +   For all integer non-div operations.  */ (for op
> > > > > > > > > > +(plus minus mult bit_and bit_ior bit_xor
> > > > > > > > > > +        lshift rshift)  (simplify
> > > > > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1
> @1
> > > > VECTOR_CST@2))
> > > > > > > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > > > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > > > > > > Just wondering, why should mask be CST here ?
> > > > > > > > > I guess the transform should work as long as mask is
> > > > > > > > > same for both vectors even if it's not constant ?
> > > > > > > >
> > > > > > > > Yes, please change accordingly (and maybe push separately).
> > > > > > > >
> > > > > > > > > > +
> > > > > > > > > > +/* Similar for float arithmetic when permutation constant
> covers
> > > > > > > > > > +   all vector elements.  */ (for op (plus minus mult)
> > > > > > > > > > +(simplify
> > > > > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1
> @1
> > > > VECTOR_CST@2))
> > > > > > > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > > > > > > +     (with
> > > > > > > > > > +      {
> > > > > > > > > > +       tree perm_cst = @2;
> > > > > > > > > > +       vec_perm_builder builder;
> > > > > > > > > > +       bool full_perm_p = false;
> > > > > > > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > > > > > > +         {
> > > > > > > > > > +           /* Create a vec_perm_indices for the integer vector.
> */
> > > > > > > > > > +           int nelts = TYPE_VECTOR_SUBPARTS
> > > > > > > > > > +(type).to_constant ();
> > > > > > > > > If this transform is meant only for VLS vectors, I guess
> > > > > > > > > you should bail out if TYPE_VECTOR_SUBPARTS is not
> > > > > > > > > constant, otherwise it will crash for VLA vectors.
> > > > > > > >
> > > > > > > > I suppose it's difficult to create a VLA permute that
> > > > > > > > covers all elements and that is not trivial though.  But
> > > > > > > > indeed add ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Prathamesh
> > > > > > > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > > > > > > +
> > > > > > > > > > +           /* Check if perm indices covers all vector elements.
> */
> > > > > > > > > > +           int count = 0, i, j;
> > > > > > > > > > +           for (i = 0; i < nelts; i++)
> > > > > > > > > > +             for (j = 0; j < nelts; j++)
> > > > > > > >
> > > > > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > > > > ().encoded_full_vector_p () (as said I can't think of a
> > > > > > > > non-full encoding that isn't trivial but covers all
> > > > > > > > elements) and then simply .qsort () the vector_builder (it
> > > > > > > > derives from vec<>) so the scan is O(n log n).
> > > > > > > >
> > > > > > > > Maybe Richard has a better idea here though.
> > > > > > > >
> > > > > > > > Otherwise looks OK, though with these kind of (* (op ..)
> > > > > > > > (op
> > > > > > > > ..)) patterns it's always that they explode the match
> > > > > > > > decision tree, we'd ideally have a way to match those with
> > > > > > > > (op ..) (op
> > > > > > > > ..) first to be able to share more of the matching code.
> > > > > > > > That said, match.pd is a less than ideal place for these
> > > > > > > > (but mostly because of the way we code generate
> > > > > > > > *-match.cc)
> > > > > > > >
> > > > > > > > Richard.
> > > > > > > >
> > > > > > > > > > +               {
> > > > > > > > > > +                 if (sel[j].to_constant () == i)
> > > > > > > > > > +                   {
> > > > > > > > > > +                     count++;
> > > > > > > > > > +                     break;
> > > > > > > > > > +                   }
> > > > > > > > > > +               }
> > > > > > > > > > +           full_perm_p = count == nelts;
> > > > > > > > > > +         }
> > > > > > > > > > +       }
> > > > > > > > > > +       (if (full_perm_p)
> > > > > > > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > > > b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > > > new file mode 100644
> > > > > > > > > > index 00000000000..40e0ac11332
> > > > > > > > > > --- /dev/null
> > > > > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > > > > @@ -0,0 +1,44 @@
> > > > > > > > > > +/* PR target/98167 */
> > > > > > > > > > +/* { dg-do compile } */
> > > > > > > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > > > > > > +
> > > > > > > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 }
> > > > > > > > > > +} */
> > > > > > > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3
> > > > > > > > > > +} } */
> > > > > > > > > > +
> > > > > > > > > > +#define VEC_PERM_4 \
> > > > > > > > > > +  2, 3, 1, 0
> > > > > > > > > > +#define VEC_PERM_8 \
> > > > > > > > > > +  4, 5, 6, 7, 3, 2, 1, 0 #define VEC_PERM_16 \
> > > > > > > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1,
> > > > > > > > > > +0
> > > > > > > > > > +
> > > > > > > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > > > > > > +  typedef type v##size##s##type __attribute__
> > > > > > > > > > +((vector_size(4*size))); \
> > > > > > > > > > +  v##size##s##type type##foo##size##i_##name
> > > > (v##size##s##type a, \
> > > > > > > > > > +
> > > > > > > > > > +v##size##s##type b) \
> > > > > > > > > > +  { \
> > > > > > > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > > > > +    return a1 op b1; \
> > > > > > > > > > +  }
> > > > > > > > > > +
> > > > > > > > > > +#define INT_PERMS(op, name) \
> > > > > > > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > > > > > > +
> > > > > > > > > > +#define FP_PERMS(op, name) \
> > > > > > > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > > > > > > +
> > > > > > > > > > +INT_PERMS (+, add)
> > > > > > > > > > +INT_PERMS (-, sub)
> > > > > > > > > > +INT_PERMS (*, mul)
> > > > > > > > > > +INT_PERMS (|, ior)
> > > > > > > > > > +INT_PERMS (^, xor)
> > > > > > > > > > +INT_PERMS (&, and)
> > > > > > > > > > +INT_PERMS (<<, shl)
> > > > > > > > > > +INT_PERMS (>>, shr)
> > > > > > > > > > +FP_PERMS (+, add)
> > > > > > > > > > +FP_PERMS (-, sub)
> > > > > > > > > > +FP_PERMS (*, mul)
> > > > > > > > > > +
> > > > > > > > > > --
> > > > > > > > > > 2.18.1
> > > > > > > > > >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-16 15:30                 ` Richard Biener
  2022-11-16 15:34                   ` Tamar Christina
@ 2022-11-16 15:37                   ` Jakub Jelinek
  2022-11-16 15:40                     ` Tamar Christina
  1 sibling, 1 reply; 16+ messages in thread
From: Jakub Jelinek @ 2022-11-16 15:37 UTC (permalink / raw)
  To: Richard Biener
  Cc: Tamar Christina, Hongyu Wang, Prathamesh Kulkarni,
	Richard Sandiford, Hongyu Wang, hongtao.liu, gcc-patches

On Wed, Nov 16, 2022 at 04:30:06PM +0100, Richard Biener via Gcc-patches wrote:
> On Wed, Nov 16, 2022 at 4:29 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Nov 16, 2022 at 4:25 PM Tamar Christina <Tamar.Christina@arm.com> wrote:
> > >
> > > Hi,
> > >
> > > This patch is causing several ICEs because it changes the permutes from a single register
> > > permute to a multi register due to the lowering of the expressions to different SSA names.
> > >
> > > See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717
> > >
> > > I have a prototype fix which adds a new rule to CSE this back to a single register permute,
> > > but would this be the right solution? It seems hard to later on during expand realize that
> > > the two operands are the same.
> > >
> > > It's probably also ok to just block this from happening after vec_lower, however I'm worried that
> > > If it wasn't CSE'd before vec_lower it'll lower it so something much less efficient.
> >
> > You can use
> >
> >  (vec_perm (op@7 @0 @1) @3)
> 
> Err, (vec_perm (op@7 @0 @1) @7) obviously.

Even:

--- gcc/match.pd.jj	2022-11-15 07:56:05.240348804 +0100
+++ gcc/match.pd	2022-11-16 16:35:34.854080956 +0100
@@ -8259,7 +8259,7 @@ and,
  (simplify
   (op (vec_perm @0 @0 @2) (vec_perm @1 @1 @2))
    (if (VECTOR_INTEGER_TYPE_P (type))
-    (vec_perm (op @0 @1) (op @0 @1) @2))))
+    (vec_perm (op@3 @0 @1) @3 @2))))
 
 /* Similar for float arithmetic when permutation constant covers
    all vector elements.  */
@@ -8298,4 +8298,4 @@ and,
 	 }
       }
       (if (full_perm_p)
-	(vec_perm (op @0 @1) (op @0 @1) @2))))))
+	(vec_perm (op@3 @0 @1) @3 @2))))))

From quick look at the dumps, it seems to work fine.

	Jakub


^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-16 15:37                   ` Jakub Jelinek
@ 2022-11-16 15:40                     ` Tamar Christina
  2022-11-16 15:41                       ` Jakub Jelinek
  0 siblings, 1 reply; 16+ messages in thread
From: Tamar Christina @ 2022-11-16 15:40 UTC (permalink / raw)
  To: Jakub Jelinek, Richard Biener
  Cc: Hongyu Wang, Prathamesh Kulkarni, Richard Sandiford, Hongyu Wang,
	hongtao.liu, gcc-patches

> -----Original Message-----
> From: Jakub Jelinek <jakub@redhat.com>
> Sent: Wednesday, November 16, 2022 3:38 PM
> To: Richard Biener <richard.guenther@gmail.com>
> Cc: Tamar Christina <Tamar.Christina@arm.com>; Hongyu Wang
> <wwwhhhyyy333@gmail.com>; Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org>; Richard Sandiford
> <Richard.Sandiford@arm.com>; Hongyu Wang <hongyu.wang@intel.com>;
> hongtao.liu@intel.com; gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation
> index and operation [PR98167]
> 
> On Wed, Nov 16, 2022 at 04:30:06PM +0100, Richard Biener via Gcc-patches
> wrote:
> > On Wed, Nov 16, 2022 at 4:29 PM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Wed, Nov 16, 2022 at 4:25 PM Tamar Christina
> <Tamar.Christina@arm.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > This patch is causing several ICEs because it changes the permutes
> > > > from a single register permute to a multi register due to the lowering of
> the expressions to different SSA names.
> > > >
> > > > See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717
> > > >
> > > > I have a prototype fix which adds a new rule to CSE this back to a
> > > > single register permute, but would this be the right solution? It
> > > > seems hard to later on during expand realize that the two operands are
> the same.
> > > >
> > > > It's probably also ok to just block this from happening after
> > > > vec_lower, however I'm worried that If it wasn't CSE'd before
> vec_lower it'll lower it so something much less efficient.
> > >
> > > You can use
> > >
> > >  (vec_perm (op@7 @0 @1) @3)
> >
> > Err, (vec_perm (op@7 @0 @1) @7) obviously.
> 
> Even:
> 
> --- gcc/match.pd.jj	2022-11-15 07:56:05.240348804 +0100
> +++ gcc/match.pd	2022-11-16 16:35:34.854080956 +0100
> @@ -8259,7 +8259,7 @@ and,
>   (simplify
>    (op (vec_perm @0 @0 @2) (vec_perm @1 @1 @2))
>     (if (VECTOR_INTEGER_TYPE_P (type))
> -    (vec_perm (op @0 @1) (op @0 @1) @2))))
> +    (vec_perm (op@3 @0 @1) @3 @2))))
> 
>  /* Similar for float arithmetic when permutation constant covers
>     all vector elements.  */
> @@ -8298,4 +8298,4 @@ and,
>  	 }
>        }
>        (if (full_perm_p)
> -	(vec_perm (op @0 @1) (op @0 @1) @2))))))
> +	(vec_perm (op@3 @0 @1) @3 @2))))))
> 
> From quick look at the dumps, it seems to work fine.
> 
> 	Jakub

Yes that's the patch I'm currently reg-testing 😊

Cheers,
Tamar


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-16 15:40                     ` Tamar Christina
@ 2022-11-16 15:41                       ` Jakub Jelinek
  0 siblings, 0 replies; 16+ messages in thread
From: Jakub Jelinek @ 2022-11-16 15:41 UTC (permalink / raw)
  To: Tamar Christina
  Cc: Richard Biener, Hongyu Wang, Prathamesh Kulkarni,
	Richard Sandiford, Hongyu Wang, hongtao.liu, gcc-patches

On Wed, Nov 16, 2022 at 03:40:02PM +0000, Tamar Christina wrote:
> > Even:
> > 
> > --- gcc/match.pd.jj	2022-11-15 07:56:05.240348804 +0100
> > +++ gcc/match.pd	2022-11-16 16:35:34.854080956 +0100
> > @@ -8259,7 +8259,7 @@ and,
> >   (simplify
> >    (op (vec_perm @0 @0 @2) (vec_perm @1 @1 @2))
> >     (if (VECTOR_INTEGER_TYPE_P (type))
> > -    (vec_perm (op @0 @1) (op @0 @1) @2))))
> > +    (vec_perm (op@3 @0 @1) @3 @2))))
> > 
> >  /* Similar for float arithmetic when permutation constant covers
> >     all vector elements.  */
> > @@ -8298,4 +8298,4 @@ and,
> >  	 }
> >        }
> >        (if (full_perm_p)
> > -	(vec_perm (op @0 @1) (op @0 @1) @2))))))
> > +	(vec_perm (op@3 @0 @1) @3 @2))))))
> > 
> > From quick look at the dumps, it seems to work fine.
> > 
> > 	Jakub
> 
> Yes that's the patch I'm currently reg-testing 😊

Great ;)

	Jakub


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-04  0:04 [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167] Hongyu Wang
  2022-11-04  6:43 ` Prathamesh Kulkarni
@ 2022-11-16 19:21 ` Marc Glisse
  2022-11-17  6:05   ` Hongyu Wang
  1 sibling, 1 reply; 16+ messages in thread
From: Marc Glisse @ 2022-11-16 19:21 UTC (permalink / raw)
  To: Hongyu Wang; +Cc: gcc-patches, hongtao.liu

On Fri, 4 Nov 2022, Hongyu Wang via Gcc-patches wrote:

> This is a follow-up patch for PR98167
>
> The sequence
>     c1 = VEC_PERM_EXPR (a, a, mask)
>     c2 = VEC_PERM_EXPR (b, b, mask)
>     c3 = c1 op c2
> can be optimized to
>     c = a op b
>     c3 = VEC_PERM_EXPR (c, c, mask)
> for all integer vector operation, and float operation with
> full permutation.

Hello,

I assume the "full permutation" condition is to avoid performing some 
extra operations that would raise exception flags. If so, are there 
conditions (-fno-trapping-math?) where the transformation would be safe 
with arbitrary shuffles?

-- 
Marc Glisse

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
  2022-11-16 19:21 ` Marc Glisse
@ 2022-11-17  6:05   ` Hongyu Wang
  0 siblings, 0 replies; 16+ messages in thread
From: Hongyu Wang @ 2022-11-17  6:05 UTC (permalink / raw)
  To: Marc Glisse; +Cc: Hongyu Wang, gcc-patches, hongtao.liu

> I assume the "full permutation" condition is to avoid performing some
> extra operations that would raise exception flags. If so, are there
> conditions (-fno-trapping-math?) where the transformation would be safe
> with arbitrary shuffles?

Yes, that could be an alternative choice with -fno-trapping-math on
arbitrary shuffles. I may have a follow-up patch about this after the
ICE is fixed. Thanks.


> --
> Marc Glisse

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-11-17  6:10 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-04  0:04 [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167] Hongyu Wang
2022-11-04  6:43 ` Prathamesh Kulkarni
2022-11-08 14:37   ` Richard Biener
2022-11-10  2:22     ` Hongyu Wang
2022-11-10  8:56       ` Richard Biener
2022-11-10 14:27         ` Hongyu Wang
2022-11-14 14:53           ` Richard Biener
2022-11-16 15:25             ` Tamar Christina
2022-11-16 15:29               ` Richard Biener
2022-11-16 15:30                 ` Richard Biener
2022-11-16 15:34                   ` Tamar Christina
2022-11-16 15:37                   ` Jakub Jelinek
2022-11-16 15:40                     ` Tamar Christina
2022-11-16 15:41                       ` Jakub Jelinek
2022-11-16 19:21 ` Marc Glisse
2022-11-17  6:05   ` Hongyu Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).