RE: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Tamar Christina <Tamar.Christina@arm.com>
To: Richard Biener <richard.guenther@gmail.com>,
	Hongyu Wang <wwwhhhyyy333@gmail.com>
Cc: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>,
	Richard Sandiford <Richard.Sandiford@arm.com>,
	Hongyu Wang <hongyu.wang@intel.com>,
	"hongtao.liu@intel.com" <hongtao.liu@intel.com>,
	"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: RE: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]
Date: Wed, 16 Nov 2022 15:25:08 +0000	[thread overview]
Message-ID: <VI1PR08MB5325FE3ECD61848E8277B26FFF079@VI1PR08MB5325.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <CAFiYyc3FkN7F5uJNo4zM=vmJRm_KfT5Jr6K7GT6u5P7zCUtHjQ@mail.gmail.com>

Hi,

This patch is causing several ICEs because it changes the permutes from a single register
permute to a multi register due to the lowering of the expressions to different SSA names.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717

I have a prototype fix which adds a new rule to CSE this back to a single register permute,
but would this be the right solution? It seems hard to later on during expand realize that
the two operands are the same.

It's probably also ok to just block this from happening after vec_lower, however I'm worried that
If it wasn't CSE'd before vec_lower it'll lower it so something much less efficient.

Thanks,
Tamar

> -----Original Message-----
> From: Gcc-patches <gcc-patches-
> bounces+tamar.christina=arm.com@gcc.gnu.org> On Behalf Of Richard
> Biener via Gcc-patches
> Sent: Monday, November 14, 2022 2:53 PM
> To: Hongyu Wang <wwwhhhyyy333@gmail.com>
> Cc: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>; Richard
> Sandiford <Richard.Sandiford@arm.com>; Hongyu Wang
> <hongyu.wang@intel.com>; hongtao.liu@intel.com; gcc-
> patches@gcc.gnu.org
> Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation
> index and operation [PR98167]
> 
> On Thu, Nov 10, 2022 at 3:27 PM Hongyu Wang
> <wwwhhhyyy333@gmail.com> wrote:
> >
> > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > think a lambda function is fine to use.  The alternative (used by
> > > the vectorizer in some places) is to use sth like
> > >
> > >  auto_sbitmap seen (nelts);
> > >  for (i = 0; i < nelts; i++)
> > >    {
> > >      if (!bitmap_set_bit (seen, i))
> > >        break;
> > >      count++;
> > >    }
> > >  full_perm_p = count == nelts;
> > >
> > > I'll note that you should still check .encoding
> > > ().encoded_full_vector_p () and only bother to check that case, that's a
> very simple check.
> >
> > Thanks for the good example! We also tried using wide_int as a bitmask
> > but your code looks more simple and reasonable.
> >
> > Updated the patch accordingly.
> 
> OK.
> 
> Thanks,
> Richard.
> 
> > Richard Biener <richard.guenther@gmail.com> 于2022年11月10日周四
> 16:56写道：
> >
> >
> > >
> > > On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang
> <wwwhhhyyy333@gmail.com> wrote:
> > > >
> > > > Hi Prathamesh and Richard,
> > > >
> > > > Thanks for the review and nice suggestions!
> > > >
> > > > > > I guess the transform should work as long as mask is same for
> > > > > > both vectors even if it's not constant ?
> > > > >
> > > > > Yes, please change accordingly (and maybe push separately).
> > > > >
> > > >
> > > > Removed VECTOR_CST for integer ops.
> > > >
> > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > otherwise it will crash for VLA vectors.
> > > > >
> > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > elements and that is not trivial though.  But indeed add
> > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > >
> > > > Added.
> > > >
> > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > encoding that isn't trivial but covers all elements) and then
> > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > the scan is O(n log n).
> > > >
> > > > The .qsort () approach requires an extra cmp_func that IMO would
> > > > not be feasible to be implemented in match.pd (I suppose lambda
> > > > function would not be a good idea either).
> > > > Another solution would be using hash_set but it does not work here
> > > > for int64_t or poly_int64 type.
> > > > So I kept current O(n^2) simple code here, and I suppose usually
> > > > the permutation indices would be a small number even for O(n^2)
> > > > complexity.
> > >
> > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > think a lambda function is fine to use.  The alternative (used by
> > > the vectorizer in some places) is to use sth like
> > >
> > >  auto_sbitmap seen (nelts);
> > >  for (i = 0; i < nelts; i++)
> > >    {
> > >      if (!bitmap_set_bit (seen, i))
> > >        break;
> > >      count++;
> > >    }
> > >  full_perm_p = count == nelts;
> > >
> > > I'll note that you should still check .encoding
> > > ().encoded_full_vector_p () and only bother to check that case, that's a
> very simple check.
> > >
> > > >
> > > > Attached updated patch.
> > > >
> > > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org>
> > > > 于2022年11月8日周二 22:38写道：
> > > >
> > > >
> > > > >
> > > > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via
> > > > > Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> > > > > >
> > > > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > This is a follow-up patch for PR98167
> > > > > > >
> > > > > > > The sequence
> > > > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > >      c3 = c1 op c2
> > > > > > > can be optimized to
> > > > > > >      c = a op b
> > > > > > >      c3 = VEC_PERM_EXPR (c, c, mask) for all integer vector
> > > > > > > operation, and float operation with full permutation.
> > > > > > >
> > > > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > > > >
> > > > > > > Ok for trunk?
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > >         PR target/98167
> > > > > > >         * match.pd: New perm + vector op patterns for int and fp
> vector.
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > >         PR target/98167
> > > > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > > > ---
> > > > > > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > > > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44
> > > > > > > ++++++++++++++++++++++
> > > > > > >  2 files changed, 93 insertions(+)  create mode 100644
> > > > > > > gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > >
> > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > > > > 194ba8f5188..b85ad34f609 100644
> > > > > > > --- a/gcc/match.pd
> > > > > > > +++ b/gcc/match.pd
> > > > > > > @@ -8189,3 +8189,52 @@ and,
> > > > > > >   (bit_and (negate @0) integer_onep@1)
> > > > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > > > >    (bit_and @0 @1)))
> > > > > > > +
> > > > > > > +/* Optimize
> > > > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > +   c3 = c1 op c2
> > > > > > > +   -->
> > > > > > > +   c = a op b
> > > > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > > +   For all integer non-div operations.  */ (for op (plus
> > > > > > > +minus mult bit_and bit_ior bit_xor
> > > > > > > +        lshift rshift)
> > > > > > > + (simplify
> > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> VECTOR_CST@2))
> > > > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > > > Just wondering, why should mask be CST here ?
> > > > > > I guess the transform should work as long as mask is same for
> > > > > > both vectors even if it's not constant ?
> > > > >
> > > > > Yes, please change accordingly (and maybe push separately).
> > > > >
> > > > > > > +
> > > > > > > +/* Similar for float arithmetic when permutation constant covers
> > > > > > > +   all vector elements.  */ (for op (plus minus mult)
> > > > > > > +(simplify
> > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> VECTOR_CST@2))
> > > > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > > > +     (with
> > > > > > > +      {
> > > > > > > +       tree perm_cst = @2;
> > > > > > > +       vec_perm_builder builder;
> > > > > > > +       bool full_perm_p = false;
> > > > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > > > +         {
> > > > > > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > > > > > +           int nelts = TYPE_VECTOR_SUBPARTS
> > > > > > > +(type).to_constant ();
> > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > otherwise it will crash for VLA vectors.
> > > > >
> > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > elements and that is not trivial though.  But indeed add
> > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > > > +
> > > > > > > +           /* Check if perm indices covers all vector elements.  */
> > > > > > > +           int count = 0, i, j;
> > > > > > > +           for (i = 0; i < nelts; i++)
> > > > > > > +             for (j = 0; j < nelts; j++)
> > > > >
> > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > encoding that isn't trivial but covers all elements) and then
> > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > the scan is O(n log n).
> > > > >
> > > > > Maybe Richard has a better idea here though.
> > > > >
> > > > > Otherwise looks OK, though with these kind of (* (op ..) (op
> > > > > ..)) patterns it's always that they explode the match decision
> > > > > tree, we'd ideally have a way to match those with (op ..) (op
> > > > > ..) first to be able to share more of the matching code.  That
> > > > > said, match.pd is a less than ideal place for these (but mostly
> > > > > because of the way we code generate *-match.cc)
> > > > >
> > > > > Richard.
> > > > >
> > > > > > > +               {
> > > > > > > +                 if (sel[j].to_constant () == i)
> > > > > > > +                   {
> > > > > > > +                     count++;
> > > > > > > +                     break;
> > > > > > > +                   }
> > > > > > > +               }
> > > > > > > +           full_perm_p = count == nelts;
> > > > > > > +         }
> > > > > > > +       }
> > > > > > > +       (if (full_perm_p)
> > > > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > new file mode 100644
> > > > > > > index 00000000000..40e0ac11332
> > > > > > > --- /dev/null
> > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > @@ -0,0 +1,44 @@
> > > > > > > +/* PR target/98167 */
> > > > > > > +/* { dg-do compile } */
> > > > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > > > +
> > > > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > > > > +
> > > > > > > +#define VEC_PERM_4 \
> > > > > > > +  2, 3, 1, 0
> > > > > > > +#define VEC_PERM_8 \
> > > > > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > > > > +#define VEC_PERM_16 \
> > > > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > > > > +
> > > > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > > > +  typedef type v##size##s##type __attribute__
> > > > > > > +((vector_size(4*size))); \
> > > > > > > +  v##size##s##type type##foo##size##i_##name
> (v##size##s##type a, \
> > > > > > > +
> > > > > > > +v##size##s##type b) \
> > > > > > > +  { \
> > > > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > > > +                                                  VEC_PERM_##size); \
> > > > > > > +    return a1 op b1; \
> > > > > > > +  }
> > > > > > > +
> > > > > > > +#define INT_PERMS(op, name) \
> > > > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > > > +
> > > > > > > +#define FP_PERMS(op, name) \
> > > > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > > > +
> > > > > > > +INT_PERMS (+, add)
> > > > > > > +INT_PERMS (-, sub)
> > > > > > > +INT_PERMS (*, mul)
> > > > > > > +INT_PERMS (|, ior)
> > > > > > > +INT_PERMS (^, xor)
> > > > > > > +INT_PERMS (&, and)
> > > > > > > +INT_PERMS (<<, shl)
> > > > > > > +INT_PERMS (>>, shr)
> > > > > > > +FP_PERMS (+, add)
> > > > > > > +FP_PERMS (-, sub)
> > > > > > > +FP_PERMS (*, mul)
> > > > > > > +
> > > > > > > --
> > > > > > > 2.18.1
> > > > > > >

next prev parent reply	other threads:[~2022-11-16 15:25 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-04  0:04 Hongyu Wang
2022-11-04  6:43 ` Prathamesh Kulkarni
2022-11-08 14:37   ` Richard Biener
2022-11-10  2:22     ` Hongyu Wang
2022-11-10  8:56       ` Richard Biener
2022-11-10 14:27         ` Hongyu Wang
2022-11-14 14:53           ` Richard Biener
2022-11-16 15:25             ` Tamar Christina [this message]
2022-11-16 15:29               ` Richard Biener
2022-11-16 15:30                 ` Richard Biener
2022-11-16 15:34                   ` Tamar Christina
2022-11-16 15:37                   ` Jakub Jelinek
2022-11-16 15:40                     ` Tamar Christina
2022-11-16 15:41                       ` Jakub Jelinek
2022-11-16 19:21 ` Marc Glisse
2022-11-17  6:05   ` Hongyu Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=VI1PR08MB5325FE3ECD61848E8277B26FFF079@VI1PR08MB5325.eurprd08.prod.outlook.com \
    --to=tamar.christina@arm.com \
    --cc=Richard.Sandiford@arm.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=hongtao.liu@intel.com \
    --cc=hongyu.wang@intel.com \
    --cc=prathamesh.kulkarni@linaro.org \
    --cc=richard.guenther@gmail.com \
    --cc=wwwhhhyyy333@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).