From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20221104000432.15254-1-hongyu.wang@intel.com> <CAAgBjM=uZ_+057TkSd9wHXfs3D1770GfRsbeR7QfCWpfo3o8FQ@mail.gmail.com>
 <CAFiYyc3cckQfGUtnwua2s=XQnz-fovLc=vuHOFjweV1uUS_dow@mail.gmail.com>
 <CA+OydWnyt4qSxXRHzmqVXTBss_1CMwvHY9nt+tVStEsuM=M87g@mail.gmail.com>
 <CAFiYyc1F6bwVcjgT2tprNWuabVOSPRoVMvta49GHouyESWoUcg@mail.gmail.com> <CA+OydWmS7sVFcQEYFfgmt8uz_5b_AunNSPKn=chp1MB6NO_E6Q@mail.gmail.com>
In-Reply-To: <CA+OydWmS7sVFcQEYFfgmt8uz_5b_AunNSPKn=chp1MB6NO_E6Q@mail.gmail.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Mon, 14 Nov 2022 15:53:25 +0100
Message-ID: <CAFiYyc3FkN7F5uJNo4zM=vmJRm_KfT5Jr6K7GT6u5P7zCUtHjQ@mail.gmail.com>
Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and
 operation [PR98167]
To: Hongyu Wang <wwwhhhyyy333@gmail.com>
Cc: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>, 
	Richard Sandiford <richard.sandiford@arm.com>, Hongyu Wang <hongyu.wang@intel.com>, hongtao.liu@intel.com, 
	gcc-patches@gcc.gnu.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
List-Id: <gcc-patches.gcc.gnu.org>

On Thu, Nov 10, 2022 at 3:27 PM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
>
> > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I think
> > a lambda function is fine to use.  The alternative (used by the vectorizer
> > in some places) is to use sth like
> >
> >  auto_sbitmap seen (nelts);
> >  for (i = 0; i < nelts; i++)
> >    {
> >      if (!bitmap_set_bit (seen, i))
> >        break;
> >      count++;
> >    }
> >  full_perm_p = count == nelts;
> >
> > I'll note that you should still check .encoding ().encoded_full_vector_p ()
> > and only bother to check that case, that's a very simple check.
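
For readers outside the GCC tree, the seen-bitmap idea quoted above can be sketched standalone roughly as follows.  This is an illustration under stated assumptions, not the patch code: std::vector<bool> stands in for auto_sbitmap, the helper name is_full_permutation is hypothetical, the input is a plain list of lane numbers rather than a vec_perm_indices, and the bit that gets set is the permutation index itself, which I assume is what the quoted loop is shorthand for.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of GCC: returns true when SEL names every
// lane 0..nelts-1 exactly once.  std::vector<bool> stands in for auto_sbitmap.
bool is_full_permutation (const std::vector<std::size_t> &sel)
{
  const std::size_t nelts = sel.size ();
  std::vector<bool> seen (nelts, false);
  for (std::size_t i = 0; i < nelts; i++)
    {
      const std::size_t idx = sel[i];
      // Assumption of this sketch: an out-of-range index is rejected.
      // A repeated index means some other lane is never selected.
      if (idx >= nelts || seen[idx])
        return false;
      seen[idx] = true;
    }
  return true;
}
```

Each index is visited once, so the check is linear in the number of lanes, against 64*64 probes for the nested loop on a v64qi selector.
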
>
> Thanks for the good example! We also tried using wide_int as a bitmask
> but your code looks simpler and more reasonable.
>
> Updated the patch accordingly.

OK.

Thanks,
Richard.

> Richard Biener <richard.guenther@gmail.com> wrote on Thu, Nov 10, 2022 at 16:56:
>
>
> >
> > On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang <wwwhhhyyy333@gmail.com> wrote:
> > >
> > > Hi Prathamesh and Richard,
> > >
> > > Thanks for the review and nice suggestions!
> > >
> > > > > I guess the transform should work as long as mask is same for both
> > > > > vectors even if it's
> > > > > not constant ?
> > > >
> > > > Yes, please change accordingly (and maybe push separately).
> > > >
> > >
> > > Removed VECTOR_CST for integer ops.
> > >
> > > > > If this transform is meant only for VLS vectors, I guess you should
> > > > > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > otherwise it will crash for VLA vectors.
> > > >
> > > > I suppose it's difficult to create a VLA permute that covers all elements
> > > > and that is not trivial though.  But indeed add ().is_constant to the
> > > > VECTOR_FLOAT_TYPE_P guard.
> > >
> > > Added.
> > >
> > > > Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> > > > (as said I can't think of a non-full encoding that isn't trivial
> > > > but covers all elements) and then simply .qsort () the vector_builder
> > > > (it derives from vec<>) so the scan is O(n log n).
> > >
> > > The .qsort () approach requires an extra cmp_func that IMO would not
> > > be feasible to implement in match.pd (I suppose a lambda function
> > > would not be a good idea either).
> > > Another solution would be using hash_set but it does not work here for
> > > int64_t or poly_int64 type.
> > > So I kept the current simple O(n^2) code here, and I suppose the number
> > > of permutation indices would usually be small enough for O(n^2)
> > > complexity.
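
To make the sort-based alternative discussed here concrete, a rough standalone sketch follows.  It is not GCC code: std::sort stands in for vec<>::qsort, the helper name covers_all_lanes_sorted is hypothetical, and plain int64_t values stand in for the poly_int64 selector elements.  The lambda plays the role of the comparison callback whose feasibility inside match.pd is in question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not GCC code: sort a copy of the indices, then check
// that position i holds the value i.  The sort dominates, so the cost is
// O(n log n).
bool covers_all_lanes_sorted (std::vector<std::int64_t> sel)
{
  // Comparison callback written as a lambda.
  std::sort (sel.begin (), sel.end (),
             [] (std::int64_t a, std::int64_t b) { return a < b; });
  for (std::size_t i = 0; i < sel.size (); i++)
    if (sel[i] != static_cast<std::int64_t> (i))
      return false;
  return true;
}
```

After sorting, any duplicate or out-of-range index shows up as a position whose value differs from its own index, so a single linear scan is enough.
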
> >
> > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I think
> > a lambda function is fine to use.  The alternative (used by the vectorizer
> > in some places) is to use sth like
> >
> >  auto_sbitmap seen (nelts);
> >  for (i = 0; i < nelts; i++)
> >    {
> >      if (!bitmap_set_bit (seen, i))
> >        break;
> >      count++;
> >    }
> >  full_perm_p = count == nelts;
> >
> > I'll note that you should still check .encoding ().encoded_full_vector_p ()
> > and only bother to check that case, that's a very simple check.
> >
> > >
> > > Attached updated patch.
> > >
> > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> wrote on Tue, Nov 8, 2022 at 22:38:
> > >
> > >
> > > >
> > > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via Gcc-patches
> > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > >
> > > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > > <gcc-patches@gcc.gnu.org> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > This is a follow-up patch for PR98167
> > > > > >
> > > > > > The sequence
> > > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > >      c3 = c1 op c2
> > > > > > can be optimized to
> > > > > >      c = a op b
> > > > > >      c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > for all integer vector operations, and for float operations with
> > > > > > a full permutation.
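
The identity behind this transform can be checked with a small standalone model.  This is a sketch only: the V4 type and the permute and add helpers are hypothetical stand-ins for a four-lane integer vector, a single-input VEC_PERM_EXPR and a lane-wise "op", none of them taken from GCC.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using V4 = std::array<int, 4>;
using Mask4 = std::array<std::size_t, 4>;

// Models VEC_PERM_EXPR (v, v, mask) for a single input: lane i of the result
// is v[mask[i]].
V4 permute (const V4 &v, const Mask4 &mask)
{
  V4 r{};
  for (std::size_t i = 0; i < 4; i++)
    r[i] = v[mask[i]];
  return r;
}

// Lane-wise addition, standing in for "op".
V4 add (const V4 &a, const V4 &b)
{
  V4 r{};
  for (std::size_t i = 0; i < 4; i++)
    r[i] = a[i] + b[i];
  return r;
}
```

With a = {1, 2, 3, 4}, b = {10, 20, 30, 40} and mask = {2, 3, 1, 0}, both add (permute (a, mask), permute (b, mask)) and permute (add (a, b), mask) give {33, 44, 22, 11}.  Lane i of either result only depends on lane mask[i] of the inputs, which is why the order of the two steps does not change the values.
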
> > > > > >
> > > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > > >
> > > > > > Ok for trunk?
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >         PR target/98167
> > > > > >         * match.pd: New perm + vector op patterns for int and fp vector.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > >         PR target/98167
> > > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > > ---
> > > > > >  gcc/match.pd                            | 49 +++++++++++++++++++++++++
> > > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44 ++++++++++++++++++++++
> > > > > >  2 files changed, 93 insertions(+)
> > > > > >  create mode 100644 gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > >
> > > > > > diff --git a/gcc/match.pd b/gcc/match.pd
> > > > > > index 194ba8f5188..b85ad34f609 100644
> > > > > > --- a/gcc/match.pd
> > > > > > +++ b/gcc/match.pd
> > > > > > @@ -8189,3 +8189,52 @@ and,
> > > > > >   (bit_and (negate @0) integer_onep@1)
> > > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > > >    (bit_and @0 @1)))
> > > > > > +
> > > > > > +/* Optimize
> > > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > +   c3 = c1 op c2
> > > > > > +   -->
> > > > > > +   c = a op b
> > > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > +   For all integer non-div operations.  */
> > > > > > +(for op (plus minus mult bit_and bit_ior bit_xor
> > > > > > +        lshift rshift)
> > > > > > + (simplify
> > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > > Just wondering, why should mask be CST here ?
> > > > > I guess the transform should work as long as mask is same for both
> > > > > vectors even if it's
> > > > > not constant ?
> > > >
> > > > Yes, please change accordingly (and maybe push separately).
> > > >
> > > > > > +
> > > > > > +/* Similar for float arithmetic when permutation constant covers
> > > > > > +   all vector elements.  */
> > > > > > +(for op (plus minus mult)
> > > > > > + (simplify
> > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1 VECTOR_CST@2))
> > > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > > +     (with
> > > > > > +      {
> > > > > > +       tree perm_cst = @2;
> > > > > > +       vec_perm_builder builder;
> > > > > > +       bool full_perm_p = false;
> > > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > > +         {
> > > > > > +           /* Create a vec_perm_indices for the integer vector.  */
> > > > > > +           int nelts = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> > > > > If this transform is meant only for VLS vectors, I guess you should
> > > > > bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > otherwise it will crash for VLA vectors.
> > > >
> > > > I suppose it's difficult to create a VLA permute that covers all elements
> > > > and that is not trivial though.  But indeed add ().is_constant to the
> > > > VECTOR_FLOAT_TYPE_P guard.
> > > >
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > > +
> > > > > > +           /* Check if perm indices covers all vector elements.  */
> > > > > > +           int count = 0, i, j;
> > > > > > +           for (i = 0; i < nelts; i++)
> > > > > > +             for (j = 0; j < nelts; j++)
> > > >
> > > > Meh, that's quadratic!  I suggest to check .encoding ().encoded_full_vector_p ()
> > > > (as said I can't think of a non-full encoding that isn't trivial
> > > > but covers all elements) and then simply .qsort () the vector_builder
> > > > (it derives from vec<>) so the scan is O(n log n).
> > > >
> > > > Maybe Richard has a better idea here though.
> > > >
> > > > Otherwise looks OK, though with these kind of (* (op ..) (op ..)) patterns it's
> > > > always that they explode the match decision tree, we'd ideally have a way to
> > > > match those with (op ..) (op ..) first to be able to share more of the matching
> > > > code.  That said, match.pd is a less than ideal place for these (but mostly
> > > > because of the way we code generate *-match.cc)
> > > >
> > > > Richard.
> > > >
> > > > > > +               {
> > > > > > +                 if (sel[j].to_constant () == i)
> > > > > > +                   {
> > > > > > +                     count++;
> > > > > > +                     break;
> > > > > > +                   }
> > > > > > +               }
> > > > > > +           full_perm_p = count == nelts;
> > > > > > +         }
> > > > > > +       }
> > > > > > +       (if (full_perm_p)
> > > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > new file mode 100644
> > > > > > index 00000000000..40e0ac11332
> > > > > > --- /dev/null
> > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > @@ -0,0 +1,44 @@
> > > > > > +/* PR target/98167 */
> > > > > > +/* { dg-do compile } */
> > > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > > +
> > > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > > > +
> > > > > > +#define VEC_PERM_4 \
> > > > > > +  2, 3, 1, 0
> > > > > > +#define VEC_PERM_8 \
> > > > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > > > +#define VEC_PERM_16 \
> > > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > > > +
> > > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > > +  typedef type v##size##s##type __attribute__ ((vector_size(4*size))); \
> > > > > > +  v##size##s##type type##foo##size##i_##name (v##size##s##type a, \
> > > > > > +                                             v##size##s##type b) \
> > > > > > +  { \
> > > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > > +                                                  VEC_PERM_##size); \
> > > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > > +                                                  VEC_PERM_##size); \
> > > > > > +    return a1 op b1; \
> > > > > > +  }
> > > > > > +
> > > > > > +#define INT_PERMS(op, name) \
> > > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > > +
> > > > > > +#define FP_PERMS(op, name) \
> > > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > > +
> > > > > > +INT_PERMS (+, add)
> > > > > > +INT_PERMS (-, sub)
> > > > > > +INT_PERMS (*, mul)
> > > > > > +INT_PERMS (|, ior)
> > > > > > +INT_PERMS (^, xor)
> > > > > > +INT_PERMS (&, and)
> > > > > > +INT_PERMS (<<, shl)
> > > > > > +INT_PERMS (>>, shr)
> > > > > > +FP_PERMS (+, add)
> > > > > > +FP_PERMS (-, sub)
> > > > > > +FP_PERMS (*, mul)
> > > > > > +
> > > > > > --
> > > > > > 2.18.1
> > > > > >